Inserting in bulk
When inserting in bulk, there are four common choices. Each has advantages and disadvantages, depending on your data and circumstances.
Multiple single INSERT statements
This approach has the advantage of being easy to understand and implement. However, it is likely to be inefficient because Impala has a high query start-up cost compared to Kudu's insertion performance, leading to relatively high latency and poor throughput.
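As a minimal illustration, assuming a hypothetical two-column schema (id, name) for the my_kudu_table used later in this section, this pattern pays the query start-up cost once per row:

INSERT INTO my_kudu_table VALUES (1, 'a');
INSERT INTO my_kudu_table VALUES (2, 'b');
INSERT INTO my_kudu_table VALUES (3, 'c');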
Single INSERT statement with multiple VALUES subclauses
If you include more than 1024 VALUES statements, Impala batches them into groups of 1024 (or the value of batch_size) before sending the requests to Kudu. This approach may perform slightly better than multiple sequential INSERT statements by amortizing the query start-up penalties on the Impala side. To set the batch size for the current Impala Shell session, use the following syntax:
set batch_size=10000;
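For example, the following single statement (again using the hypothetical my_kudu_table schema) sends three rows in one batch rather than paying the start-up cost three times:

INSERT INTO my_kudu_table VALUES (1, 'a'), (2, 'b'), (3, 'c');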
Batch insert
The approach that usually performs best, from the standpoint of both Impala and Kudu, is to import the data using a SELECT FROM subclause in Impala.
- If your data is not already in Impala, one strategy is to import it from a text file, such as a TSV or CSV file.
- Create the Kudu table, being mindful that the columns designated as primary keys cannot have null values. Both this step and the previous one are sketched after the example below.
- Insert values into the Kudu table by querying the table
containing the original data, as in the following example:
INSERT INTO my_kudu_table SELECT * FROM legacy_data_import_table;
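The following sketch illustrates the first two steps under the same hypothetical schema, assuming an Impala version with native Kudu DDL syntax (CREATE TABLE ... STORED AS KUDU); the path, partitioning scheme, and column names are placeholders, not prescriptions:

-- Step 1: expose the existing TSV data to Impala as an external text table.
CREATE EXTERNAL TABLE legacy_data_import_table (id BIGINT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/path/to/tsv/data';

-- Step 2: create the Kudu table; primary key columns must be non-null.
CREATE TABLE my_kudu_table (
  id BIGINT,
  name STRING,
  PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 16
STORED AS KUDU;

The INSERT ... SELECT statement above then completes the import as a single batched operation.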
Ingest using the C++ or Java API
In many cases, the appropriate ingest path is to use the C++ or Java API to insert directly into Kudu tables. Unlike other Impala tables, data inserted into Kudu tables using the API becomes available for query in Impala without the need for any INVALIDATE METADATA statements or other statements needed for other Impala storage types.
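As a minimal sketch of this path using the Kudu Java client, assuming a hypothetical master address and the same hypothetical table and schema as above:

import org.apache.kudu.client.*;

public class BulkIngestExample {
    public static void main(String[] args) throws KuduException {
        // Connect to the Kudu cluster (hypothetical master address).
        KuduClient client =
            new KuduClient.KuduClientBuilder("kudu-master.example.com:7051").build();
        try {
            KuduTable table = client.openTable("my_kudu_table");
            KuduSession session = client.newSession();
            // Buffer writes client-side and flush in the background for throughput.
            session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
            for (int i = 0; i < 1000; i++) {
                Insert insert = table.newInsert();
                PartialRow row = insert.getRow();
                row.addLong("id", i);
                row.addString("name", "row-" + i);
                session.apply(insert);
            }
            // Ensure all buffered operations have reached the tablet servers.
            session.flush();
            session.close();
        } finally {
            client.shutdown();
        }
    }
}

Rows written this way are immediately visible to Impala queries against my_kudu_table.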