Kudu client API for data operations
The Kudu client APIs allow you to insert, read, and manage array column data in tables using various programming languages.
The Kudu client supports the following languages:
- C++
- Java
- Scala and Spark bindings
- Python
Working with array column data in C++
The following code demonstrates how to manage array column data using the Kudu C++ client API.
Setting array column data in C++
The Kudu C++ client API allows you to set, retrieve, and manage array column values.
- Retrieving array column data in C++
- The following example shows how to access the value of an INT64 array column in a row scan projection. It uses the column index instead of the column name, assuming that an INT64 array column is at index 3 of the row scan projection’s schema:
  // KuduScanBatch::const_iterator it = ...;
  KuduScanBatch::RowPtr row(*it);
  std::vector<int64_t> arr;
  std::vector<bool> arr_notnull;
  KUDU_RETURN_NOT_OK(row.GetArrayInt64(3, &arr, &arr_notnull));
- Accessing raw array cell data:
- In the Kudu C++ client API, it’s also possible to access raw data in an array cell. The following example shows how to access raw array cell data using the KuduArrayCellView facade:
  const void* cell_ptr = row.cell(3);
  KuduArrayCellView view(cell_ptr);
  KUDU_RETURN_NOT_OK(view.Init());
  KUDU_CHECK(view.elem_num() == arr.size());
  const uint8_t* arr_raw_notnull_bitmap = view.not_null_bitmap();
  KUDU_CHECK(arr_raw_notnull_bitmap);
  const int64_t* arr_raw = reinterpret_cast<const int64_t*>(view.data(KuduColumnSchema::INT64));
  KUDU_CHECK(arr_raw);
  for (size_t i = 0; i < arr_notnull.size(); ++i) {
    if (arr_notnull[i]) {
      // That's the same data for non-null array elements as when accessing it
      // via RowPtr::GetArrayInt64().
      KUDU_CHECK(arr[i] == *(arr_raw + i));
    }
    // The validity bit directly corresponds to the boolean value at the
    // corresponding index of the 'arr_notnull' validity array.
    bool arr_raw_elem_validity = *(arr_raw_notnull_bitmap + (i >> 3)) & (1 << (i & 7));
    KUDU_CHECK(arr_notnull[i] == arr_raw_elem_validity);
  }
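The bit arithmetic in the raw-access loop above, (i >> 3) to pick the byte and (1 << (i & 7)) to pick the bit, can be tried out in isolation. The following is a minimal, language-neutral sketch in Python. It does not use the Kudu client; the function name decode_validity and the sample bitmap are illustrative assumptions.

```python
from typing import List


def decode_validity(bitmap: bytes, num_elems: int) -> List[bool]:
    """Expand a validity bitmap into one boolean per array element.

    Mirrors the C++ expression
        *(bitmap + (i >> 3)) & (1 << (i & 7))
    where element i lives in byte i // 8, at bit position i % 8
    (least significant bit first).
    """
    return [bool(bitmap[i >> 3] & (1 << (i & 7))) for i in range(num_elems)]


# Hypothetical bitmap for the 3-element array [1, NULL, 3]:
# bits 0 and 2 are set, bit 1 is clear.
sample_bitmap = bytes([0b00000101])
print(decode_validity(sample_bitmap, 3))  # [True, False, True]
```

A set bit means that the element is non-null, which matches the meaning of the arr_notnull vector in the C++ example.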
Working with array column data in Java
The Kudu Java client allows you to set and manage array column values using primitive
arrays (long[]) and validity arrays (boolean[]).
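The pairing of a values array with a validity array can be illustrated independently of the Java API. The following Python sketch is only an illustration of that data layout, not Kudu client code: the helper names and the placeholder value 0 stored in null slots are assumptions made for this example.

```python
from typing import List, Optional, Sequence, Tuple


def split_values_and_validity(
    elems: Sequence[Optional[int]], placeholder: int = 0
) -> Tuple[List[int], List[bool]]:
    """Split a list with None entries into (values, validity).

    values[i] holds the element, or the placeholder when the element is null.
    validity[i] is True when element i is non-null.
    """
    values = [placeholder if e is None else e for e in elems]
    validity = [e is not None for e in elems]
    return values, validity


def merge_values_and_validity(
    values: Sequence[int], validity: Sequence[bool]
) -> List[Optional[int]]:
    """Rebuild the nullable list from (values, validity)."""
    return [v if ok else None for v, ok in zip(values, validity)]


values, validity = split_values_and_validity([1, None, 3])
print(values)    # [1, 0, 3]
print(validity)  # [True, False, True]
print(merge_values_and_validity(values, validity))  # [1, None, 3]
```

The value stored in a null slot carries no meaning; only the validity flag determines whether the element is null.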
Setting array column data in Java
- Setting array column values
- To set an INT64 array column named arr_int64 to [1L, NULL, 3L], provide the values as a List<Long> containing a null entry.
- Retrieving array column data in Java
- The following example shows how to access an INT64 array column using either the column name or the column index:
  while (scanner.hasMoreRows()) {
    RowResultIterator it = scanner.nextRows();
    while (it.hasNext()) {
      rows.add(it.next());
    }
  }
  RowResult r1 = rows.get(0);
  Long[] int64Vals = (Long[]) r1.getArrayData("arr_int64");
Working with array column data in Spark bindings
Kudu Spark bindings treat Kudu arrays as standard Spark ArrayType columns (mapped to Seq in Scala). The examples in this section create and access DataFrames with array columns.
Creating DataFrame with array columns
This example demonstrates how to create a DataFrame in Apache Spark with array columns, including handling null elements within arrays.
import org.apache.spark.sql.functions._
import spark.implicits._
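// Sample data with arrays (including null elements).
// The integer arrays are declared as Seq[Integer] (boxed) so that they can hold null.
val df = Seq(
  (1, Seq[Integer](1, null, 3), Seq("a", null, "c")),
  (2, Seq[Integer](4, 5, 6), Seq("x", "y", "z"))
).toDF("key", "int_arr", "str_arr")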
Accessing array column data
After collecting the rows, you can read each array column as a Scala Seq. Null elements are returned as null.
val rows = df.select("key", "int_arr", "str_arr").collect()
val first = rows(0)
val ints = first.getAs[Seq[Integer]]("int_arr")
val strs = first.getAs[Seq[String]]("str_arr")
println(ints) // Seq(1, null, 3)
println(strs) // Seq("a", null, "c")
Using Spark SQL to access array column data
This section demonstrates how to use Spark SQL to access and manipulate array column data in a DataFrame.
df.createOrReplaceTempView("t")
spark.sql("""
SELECT key, int_arr[0] AS first_elem
FROM t
ORDER BY key
""").show()
+---+----------+
|key|first_elem|
+---+----------+
|  1|         1|
|  2|         4|
+---+----------+
Working with array column data in Python
Learn how to work with array column data using the Kudu Python client.
The Kudu Python client maps Kudu arrays directly to Python lists, where None represents a null element.
Setting array column values
When defining a schema or inserting rows, use kudu.array_type and standard
Python lists.
- Adding an array column to a schema:
  builder = kudu.schema_builder()
  builder.add_column('arr_int64').nested_type(kudu.array_type(kudu.int64))
- Inserting a row with nulls [6, NULL, 7]:
  op = table.new_insert()
  op['arr_int64'] = [6, None, 7]
  session.apply(op)
- Setting an empty array [] or a NULL column:
  - Empty array:
    op['arr_int64'] = []
  - NULL column:
    op['arr_int64'] = None
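An empty array, a NULL column, and an array that contains null elements are three different values, and code that consumes them usually has to tell them apart. The following sketch uses plain Python values only, without a Kudu connection; the function name describe_array_cell is a hypothetical helper introduced for this example.

```python
from typing import List, Optional


def describe_array_cell(cell: Optional[List[Optional[int]]]) -> str:
    """Classify an array cell value as represented in Python.

    None       -> the whole column value is NULL
    []         -> an array with zero elements
    [6, None]  -> an array whose individual elements may be null
    """
    if cell is None:
        return "NULL column"
    if len(cell) == 0:
        return "empty array"
    null_count = sum(1 for elem in cell if elem is None)
    return f"array of {len(cell)} elements, {null_count} null"


print(describe_array_cell(None))          # NULL column
print(describe_array_cell([]))            # empty array
print(describe_array_cell([6, None, 7]))  # array of 3 elements, 1 null
```

Checking "is None" before calling len() matters here: a truthiness test such as "if not cell" would treat the empty array and the NULL column as the same case.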
Retrieving array column data
When selecting data, array columns are returned as Python lists.
scanner = client.new_scanner('my_table')
scanner.open()
while scanner.has_more_rows():
    batch = scanner.next_batch()
    for row in batch:
        print(row['arr_int64'])  # Output: [6, None, 7]
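Because array columns arrive as ordinary Python lists, post-processing needs no Kudu-specific code. The sketch below flattens the non-null elements of one array column across rows. It runs on stand-in dictionaries instead of real scan results, on the assumption that rows can be indexed by column name as in the loop above; the function name non_null_elements is hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Optional

Row = Dict[str, Optional[List[Optional[int]]]]


def non_null_elements(rows: Iterable[Row], column: str) -> Iterator[int]:
    """Yield every non-null element of an array column, row by row.

    Rows where the whole column is NULL (None) are skipped, and so are
    individual null elements inside an array.
    """
    for row in rows:
        cell = row[column]
        if cell is None:
            continue
        for elem in cell:
            if elem is not None:
                yield elem


# Stand-in rows; a real scan would supply these from scanner batches.
sample_rows = [
    {"arr_int64": [6, None, 7]},
    {"arr_int64": None},
    {"arr_int64": []},
]
print(list(non_null_elements(sample_rows, "arr_int64")))  # [6, 7]
```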
