Kudu client API for data operations

The Kudu client APIs allow you to insert and manage arrays in tables using various programming languages.

The Kudu client supports the following languages:

  • C++
  • Java
  • Scala and Spark bindings
  • Python

Working with array column data in C++

The following sections demonstrate how to set and retrieve array column data using the Kudu C++ client API.

Setting array column data in C++

The Kudu C++ client API allows you to set, retrieve, and manage array column values.

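  • Setting array column values: The following example shows how to set an INT64 array column named arr_int64 to an array of three elements: two non-null elements with one null element in the middle, that is, [6, NULL, 7]. The first vector holds the element values and the second holds the validity flags; the value 0 at the null position is only a placeholder: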
    // KuduPartialRow* row ...;
    KUDU_RETURN_NOT_OK(row->SetArrayInt64("arr_int64", { 6, 0, 7 }, { true, false, true }));
    
  • Setting a column to NULL: To set the entire array column to a null value:
    KUDU_RETURN_NOT_OK(row->SetNull("arr_int64"));
  • Setting an empty array: To set the column to an empty array, for example []:
    KUDU_RETURN_NOT_OK(row->SetArrayInt64("arr_int64", {}, {}));
  • Setting an array with a single null element: To set an array containing one null element, for example [NULL]:
    KUDU_RETURN_NOT_OK(row->SetArrayInt64("arr_int64", { 0 }, { false }));
  • Setting an array with all non-null elements: By convention, an empty validity vector means that all array elements are non-null. The same convention applies when retrieving data from array type columns. The following example sets an INT32 array column named arr_int32 to [1, 2, 3]:
    KUDU_RETURN_NOT_OK(row->SetArrayInt32("arr_int32", { 1, 2, 3 }, {}));
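The values-plus-validity convention used by the setters above can be illustrated with a small, language-neutral sketch. It is written in plain Python and does not use any Kudu client; the helper name to_logical is hypothetical, and the length check is an assumption about reasonable behavior rather than documented Kudu behavior.

```python
from typing import List, Optional, Sequence


def to_logical(values: Sequence[int], validity: Sequence[bool]) -> List[Optional[int]]:
    """Combine a values vector and a validity vector into one list, using None for nulls.

    Hypothetical helper for illustration only; not part of any Kudu client API.
    """
    if not validity:
        # Empty validity vector: by convention, every element is non-null.
        return list(values)
    if len(values) != len(validity):
        # Assumption: mismatched lengths are treated as an error in this sketch.
        raise ValueError("values and validity must have the same length")
    # The value stored at a null position is only a placeholder and is ignored.
    return [v if ok else None for v, ok in zip(values, validity)]


print(to_logical([6, 0, 7], [True, False, True]))  # [6, None, 7]
print(to_logical([], []))                          # []
print(to_logical([0], [False]))                    # [None]
print(to_logical([1, 2, 3], []))                   # [1, 2, 3]
```

Each call mirrors one of the C++ examples in the list above: a null in the middle, an empty array, a single null element, and an all-non-null array with an empty validity vector.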
Retrieving array column data in C++
The following example shows how to access the value of an INT64 array column in a row scan projection. This example uses the column index instead of the column name, and assumes that an INT64 array column is at index 3 of the row scan projection’s schema:
// KuduScanBatch::const_iterator it = ...;
KuduScanBatch::RowPtr row(*it);
std::vector<int64_t> arr;
std::vector<bool> arr_notnull;
KUDU_RETURN_NOT_OK(row.GetArrayInt64(3, &arr, &arr_notnull));
Accessing raw array cell data:
In the Kudu C++ client API, it’s also possible to access raw data in an array cell. The following example shows how to access raw array cell data using the KuduArrayCellView facade.
const void* cell_ptr = row.cell(3);
KuduArrayCellView view(cell_ptr);
KUDU_RETURN_NOT_OK(view.Init());
KUDU_CHECK(view.elem_num() == arr.size());
const uint8_t* arr_raw_notnull_bitmap = view.not_null_bitmap();
KUDU_CHECK(arr_raw_notnull_bitmap);
const int64_t* arr_raw = reinterpret_cast<const int64_t*>(view.data(KuduColumnSchema::INT64));
KUDU_CHECK(arr_raw);
for (size_t i = 0; i < arr_notnull.size(); ++i) {
  if (arr_notnull[i]) {
    // Non-null elements match the data returned by RowPtr::GetArrayInt64().
    KUDU_CHECK(arr[i] == *(arr_raw + i));
  }
  // The validity bit corresponds directly to the boolean value at the same
  // index of the 'arr_notnull' validity array.
  bool arr_raw_elem_validity = *(arr_raw_notnull_bitmap + (i >> 3)) & (1 << (i & 7));
  KUDU_CHECK(arr_notnull[i] == arr_raw_elem_validity);
}
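
The bit addressing in the loop above (byte i >> 3, bit i & 7) implies that validity bits are stored least significant bit first within each byte. The following plain-Python sketch illustrates that layout without any Kudu client; pack_validity and is_valid are hypothetical helper names, and the packing function is an assumption that is only meant to be consistent with the read expression shown in the C++ example.

```python
from typing import Sequence


def pack_validity(validity: Sequence[bool]) -> bytes:
    """Pack booleans into a bitmap, least significant bit first within each byte."""
    bitmap = bytearray((len(validity) + 7) // 8)
    for i, ok in enumerate(validity):
        if ok:
            bitmap[i >> 3] |= 1 << (i & 7)
    return bytes(bitmap)


def is_valid(bitmap: bytes, i: int) -> bool:
    """Same test as in the C++ loop: select byte i >> 3, then bit i & 7."""
    return bool(bitmap[i >> 3] & (1 << (i & 7)))


# Nine elements, so the bitmap spans two bytes.
validity = [True, False, True, True, False, False, False, False, True]
bitmap = pack_validity(validity)
print(bitmap.hex())  # 0d01
print([is_valid(bitmap, i) for i in range(len(validity))] == validity)  # True
```

Elements 0, 2, and 3 set bits 1, 4, and 8 of the first byte (0x0d), and element 8 sets the lowest bit of the second byte (0x01).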

Working with array column data in Java

The Kudu Java client allows you to set and manage array column values using primitive arrays (long[]) and validity arrays (boolean[]).

Setting array column data in Java

  • Setting array column values: To set an INT64 array column named arr_int64 to [1L, NULL, 3L], pass a long[] of values together with a boolean[] validity array that marks the second element as null:
    import org.apache.kudu.client.PartialRow;
    // PartialRow row ...;
    long[] vals = { 1L, 0L, 3L };  // the value at the null position is a placeholder
    boolean[] validity = { true, false, true };
    row.addArrayInt64("arr_int64", vals, validity);
  • Setting a column to NULL: To set the entire column (the array itself) to a null value:
    row.setNull("arr_int64");
  • Setting an empty array: To set the column to an empty array (e.g., []):
    long[] values = {};
    row.addArrayInt64("arr_int64", values);
  • Setting an array with a single null element: To set an array containing exactly one null element (e.g., [NULL]):
    long[] values = { 0L };
    boolean[] validity = { false };
    row.addArrayInt64("arr_int64", values, validity);
    
  • Setting an array with all non-null elements: If the array contains no null values, you can use a convenience overload that omits the validity vector entirely. Kudu will automatically treat all elements in the provided array as non-null.
    long[] values = { 10L, 20L, 30L };
    row.addArrayInt64("arr_int64", values);
Retrieving array column data in Java
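The following example shows how to access an INT64 array column by column name (the column index can be used instead):
import java.util.ArrayList;
import java.util.List;
import org.apache.kudu.client.RowResult;
import org.apache.kudu.client.RowResultIterator;
// KuduScanner scanner ...;
List<RowResult> rows = new ArrayList<>();
while (scanner.hasMoreRows()) {
  RowResultIterator it = scanner.nextRows();
  while (it.hasNext()) {
    rows.add(it.next());
  }
}
RowResult r1 = rows.get(0);
Long[] int64Vals = (Long[]) r1.getArrayData("arr_int64");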
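
The Java setters above take a primitive values array plus a validity array, while the retrieval example casts the result to a boxed Long[]. Assuming that null elements appear as null entries in that boxed array, converting back to the values-plus-validity form is a simple split. The sketch below shows the idea in plain Python, with None standing in for a Java null; split_nullable is a hypothetical helper and is not part of any Kudu client.

```python
from typing import List, Optional, Sequence, Tuple


def split_nullable(
    elems: Sequence[Optional[int]], placeholder: int = 0
) -> Tuple[List[int], List[bool]]:
    """Split a list with None entries into a values list and a validity list.

    Hypothetical helper for illustration only. Null positions receive the
    placeholder value, which carries no meaning.
    """
    values = [placeholder if e is None else e for e in elems]
    validity = [e is not None for e in elems]
    return values, validity


values, validity = split_nullable([1, None, 3])
print(values)    # [1, 0, 3]
print(validity)  # [True, False, True]
```

The resulting pair has the same shape as the vals and validity arguments passed to addArrayInt64 in the setter examples.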

Working with array column data in Spark bindings

The Kudu Spark bindings treat Kudu arrays as standard Spark ArrayType columns (mapped to Seq in Scala). The following examples show how to create and access a DataFrame with array columns.

Creating DataFrame with array columns

This example demonstrates how to create a DataFrame in Apache Spark with array columns, including handling null elements within arrays.


import org.apache.spark.sql.functions._
import spark.implicits._
// Sample data with arrays (including null elements).
// Seq[Integer] is used so that the integer arrays can hold null elements.
val df = Seq(
  (1, Seq[Integer](1, null, 3), Seq("a", null, "c")),
  (2, Seq[Integer](4, 5, 6), Seq("x", "y", "z"))
).toDF("key", "int_arr", "str_arr")

Accessing array column data

After collecting rows, you can read each array column as a Scala Seq using getAs. Null elements are preserved.


val rows = df.select("key", "int_arr", "str_arr").collect()
val first = rows(0)
val ints = first.getAs[Seq[Integer]]("int_arr")
val strs = first.getAs[Seq[String]]("str_arr")
println(ints) // elements: 1, null, 3
println(strs) // elements: a, null, c

Using Spark SQL to access array column data

This section demonstrates how to use Spark SQL to access and manipulate array column data in a DataFrame.


df.createOrReplaceTempView("t")
spark.sql("""
  SELECT key, int_arr[0] AS first_elem
  FROM t
  ORDER BY key
""").show()

+---+----------+
|key|first_elem|
+---+----------+
|  1|         1|
|  2|         4|
+---+----------+

Working with array column data in Python

Learn how to work with array column data using the Kudu Python client.

The Kudu Python client maps Kudu arrays directly to Python lists, where None represents a null element.

Setting array column values

When defining a schema or inserting rows, use kudu.array_type and standard Python lists.

Adding an array column to a schema:

builder = kudu.schema_builder()
builder.add_column('arr_int64').nested_type(kudu.array_type(kudu.int64))
Inserting a row with nulls [6, NULL, 7]:

op = table.new_insert()
op['arr_int64'] = [6, None, 7]
session.apply(op)
Setting an empty array [] or a NULL column:
  • Empty Array: op['arr_int64'] = []
  • NULL Column: op['arr_int64'] = None

Retrieving array column data

When selecting data, array columns are returned as Python lists.


scanner = client.new_scanner('my_table')
scanner.open()
while scanner.has_more_rows():
    batch = scanner.next_batch()
    for row in batch:
        print(row['arr_int64']) # Output: [6, None, 7]