Insert data into Kudu using Impala

Using Impala you can index one or more rows, or you can insert in bulk. Before inserting, learn about the syntax, and the advantages and disadvantages of the three inserting in bulk methods.

Inserting a row

The following syntax shows how to insert one or more rows using Impala:

INSERT INTO my_first_table VALUES (99, "sarah");
INSERT INTO my_first_table VALUES (1, "john"), (2, "jane"), (3, "jim");

The primary key must not be null.

Multiple single INSERT statements

Using multiple single INSERT statements is an insert in bulk method.

This approach has the advantage of being easy to understand and implement. This approach is likely to be inefficient because Impala has a high query start-up cost compared to Kudu's insertion performance. This will lead to relatively high latency and poor throughput.

Single INSERT statement with multiple VALUES subclauses

Using a single INSERT statements with multiple VALUES subclauses is an insert in bulk method.

If you include more than 1024 VALUES statements, Impala batches them into groups of 1024 (or the value of batch_size) before sending the requests to Kudu. This approach may perform slightly better than multiple sequential INSERT statements by amortizing the query start-up penalties on the Impala side. To set the batch size for the current Impala Shell session, use the following syntax:
set batch_size=10000;

Batch insert

Batch insert is an insert in bulk method.

The approach that usually performs best, from the standpoint of both Impala and Kudu, is usually to import the data using a SELECT FROM subclause in Impala.

  1. If your data is not already in Impala, one strategy is to import it from a text file, such as a TSV or CSV file.
  2. Create the Kudu table, being mindful that the columns designated as primary keys cannot have null values.
  3. Insert values into the Kudu table by querying the table containing the original data, as in the following example:
    INSERT INTO my_kudu_table
    SELECT * FROM legacy_data_import_table;

Ingest using the C++ or Java API

In many cases, the appropriate ingest path is to use the C++ or Java API to insert directly into Kudu tables. Unlike other Impala tables, data inserted into Kudu tables using the API becomes available for query in Impala without the need for any INVALIDATE METADATA statements or other statements needed for other Impala storage types.