Insert data into Kudu using Impala
Using Impala, you can insert one or more rows at a time, or you can insert data in bulk. Before inserting, review the syntax and the advantages and disadvantages of the three bulk insert methods.
Inserting a row
The following syntax shows how to insert one or more rows using Impala:
INSERT INTO my_first_table VALUES (99, "sarah");
INSERT INTO my_first_table VALUES (1, "john"), (2, "jane"), (3, "jim");
The primary key must not be null.
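The statements above assume a table with an integer key followed by a string column. The table definition is not shown here, so the following is only a sketch: the column names id and name, the hash partitioning, and the partition count are assumptions, and the syntax assumes an Impala release that supports STORED AS KUDU.

```sql
-- Hypothetical definition of my_first_table; column names and
-- partitioning are assumptions made for illustration.
CREATE TABLE my_first_table (
  id BIGINT,
  name STRING,
  PRIMARY KEY (id)            -- key columns cannot hold NULL
)
PARTITION BY HASH (id) PARTITIONS 16
STORED AS KUDU;

-- Matches the two-column shape used by the INSERT examples.
INSERT INTO my_first_table VALUES (99, "sarah");
```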
Multiple single INSERT statements
Issuing multiple single-row INSERT statements is one way to insert data in bulk.
This approach is easy to understand and implement. However, it is likely to be inefficient because Impala has a high query start-up cost compared to Kudu's insertion performance, which leads to relatively high latency and poor throughput.
Single INSERT statement with multiple VALUES subclauses
Using a single INSERT statement with multiple VALUES subclauses is another way to insert data in bulk.
When the statement contains more VALUES subclauses than fit in one batch, Impala batches them into groups of 1024 (or the value of batch_size) before sending the requests to Kudu. This approach may perform slightly better than multiple sequential INSERT statements by amortizing the query start-up penalties on the Impala side. To set the batch size for the current Impala Shell session, use the following syntax:
set batch_size=10000;
Batch insert
Batch insert is the third way to insert data in bulk.
The approach that usually performs best, from the standpoint of both Impala and Kudu, is to import the data using a SELECT FROM subclause in Impala.
- If your data is not already in Impala, one strategy is to import it from a text file, such as a TSV or CSV file.
- Create the Kudu table, being mindful that the columns designated as primary keys cannot have null values.
- Insert values into the Kudu table by querying the table containing the original data, as in the following example:
INSERT INTO my_kudu_table SELECT * FROM legacy_data_import_table;
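The three steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not a prescribed layout: the column names, the HDFS path /user/impala/legacy_data, the tab delimiter, and the partition count are all hypothetical.

```sql
-- Step 1 (assumed layout): expose tab-separated files that already
-- sit in HDFS as an external text table.
CREATE EXTERNAL TABLE legacy_data_import_table (
  id BIGINT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/impala/legacy_data';

-- Step 2: create the Kudu table; the primary key column cannot be NULL.
CREATE TABLE my_kudu_table (
  id BIGINT,
  name STRING,
  PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU;

-- Step 3: copy the rows in one statement, skipping any source rows
-- whose key is missing.
INSERT INTO my_kudu_table
SELECT id, name
FROM legacy_data_import_table
WHERE id IS NOT NULL;
```

Naming the columns in the SELECT list, rather than using SELECT *, keeps the statement valid if the staging table later gains extra columns.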
Ingest using the C++ or Java API
In many cases, the appropriate ingest path is to use the C++ or Java API to insert directly into Kudu tables. Unlike other Impala tables, data inserted into Kudu tables using the API becomes available for query in Impala without the need for INVALIDATE METADATA or the other statements required by other Impala storage types.