Developing Applications With Apache Kudu
Apache Kudu provides C++ and Java client APIs, as well as reference examples that illustrate their use. A Python API is also included, but it is currently considered experimental and unstable, and is subject to change at any time.
Viewing the API Documentation
C++ API Documentation
The documentation for the C++ client APIs is included in the header files in /usr/include/kudu/ if you installed Kudu using packages, or in subdirectories of src/kudu/client/ if you built Kudu from source. If you installed Kudu using parcels, no headers are included in your installation, and you will need to build Kudu from source in order to have access to the headers and shared libraries.
$ find /usr/include/kudu -type f -name "*.h"
Kudu Example Applications
- java-example
- A simple Java application which connects to a Kudu instance, creates a table, writes data to it, then drops the table. (A condensed sketch of these steps follows this list.)
- java/collectl
- A simple Java application which listens on a TCP socket for time series data corresponding to the Collectl wire protocol. The commonly-available collectl tool can be used to send example data to the server.
- java/insert-loadgen
- A Java application that generates random insert load.
- python/dstat-kudu
- An example program that shows how to use the Kudu Python API to load data generated by an external program, dstat in this case, into a new or existing Kudu table.
- python/graphite-kudu
- An experimental plugin for using graphite-web with Kudu as a backend.
- demo-vm-setup
- Scripts to download and run a VirtualBox virtual machine with Kudu already installed. For more information see the Kudu Quickstart documentation.
These examples should serve as helpful starting points for your own Kudu applications and integrations.
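As a starting point, the following is a condensed sketch of the steps java-example performs, written against the Kudu Java client API. The master address, table name, and schema are placeholders for illustration, not the exact contents of the example application:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;

public class JavaExampleSketch {
  public static void main(String[] args) throws Exception {
    // "kudu.master" is a placeholder; substitute your master address
    KuduClient client = new KuduClient.KuduClientBuilder("kudu.master:7051").build();
    try {
      // Define a schema: an INT32 primary key and a STRING value column
      List<ColumnSchema> columns = new ArrayList<>();
      columns.add(new ColumnSchema.ColumnSchemaBuilder("key", Type.INT32).key(true).build());
      columns.add(new ColumnSchema.ColumnSchemaBuilder("value", Type.STRING).build());
      Schema schema = new Schema(columns);

      // Create the table, hash-partitioned on the key column
      client.createTable("java_example", schema,
          new CreateTableOptions().addHashPartitions(Collections.singletonList("key"), 3));

      // Write a few rows
      KuduTable table = client.openTable("java_example");
      KuduSession session = client.newSession();
      for (int i = 0; i < 3; i++) {
        Insert insert = table.newInsert();
        PartialRow row = insert.getRow();
        row.addInt("key", i);
        row.addString("value", "value " + i);
        session.apply(insert);
      }
      session.close(); // flushes any pending operations

      // Drop the table
      client.deleteTable("java_example");
    } finally {
      client.shutdown();
    }
  }
}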
Maven Artifacts
The following Maven <dependency> element is valid for the Apache Kudu GA release:
<dependency>
  <groupId>org.apache.kudu</groupId>
  <artifactId>kudu-client</artifactId>
  <version>1.1.0</version>
</dependency>
Convenience binary artifacts for the Java client and various Java integrations (e.g. Spark, Flume) are also now available via the ASF Maven repository and the Central Maven repository.
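With the kudu-client artifact on the classpath, a minimal smoke test such as the following can verify that the dependency resolves and that a client can reach the cluster. The master address is a placeholder:

import org.apache.kudu.client.KuduClient;

public class ListTables {
  public static void main(String[] args) throws Exception {
    // "kudu.master" is a placeholder; substitute your master address
    KuduClient client = new KuduClient.KuduClientBuilder("kudu.master:7051").build();
    try {
      // Print the names of all tables visible to this client
      for (String name : client.getTablesList().getTablesList()) {
        System.out.println(name);
      }
    } finally {
      client.shutdown();
    }
  }
}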
Kudu Python Client
The Kudu Python client provides a Python-friendly interface to the C++ client API. The sample below demonstrates the use of part of the Python client API.
import kudu
from kudu.client import Partitioning
from datetime import datetime

# Connect to Kudu master server
client = kudu.connect(host='kudu.master', port=7051)

# Define a schema for a new table
builder = kudu.schema_builder()
builder.add_column('key').type(kudu.int64).nullable(False).primary_key()
builder.add_column('ts_val', type_=kudu.unixtime_micros, nullable=False, compression='lz4')
schema = builder.build()

# Define partitioning schema
partitioning = Partitioning().add_hash_partitions(column_names=['key'], num_buckets=3)

# Create new table
client.create_table('python-example', schema, partitioning)

# Open a table
table = client.table('python-example')

# Create a new session so that we can apply write operations
session = client.new_session()

# Insert a row
op = table.new_insert({'key': 1, 'ts_val': datetime.utcnow()})
session.apply(op)

# Upsert a row
op = table.new_upsert({'key': 2, 'ts_val': "2016-01-01T00:00:00.000000"})
session.apply(op)

# Update a row
op = table.new_update({'key': 1, 'ts_val': ("2017-01-01", "%Y-%m-%d")})
session.apply(op)

# Delete a row
op = table.new_delete({'key': 2})
session.apply(op)

# Flush write operations; if failures occur, capture and print them
try:
    session.flush()
except kudu.KuduBadStatus as e:
    print(session.get_pending_errors())

# Create a scanner and add a predicate
scanner = table.scanner()
scanner.add_predicate(table['ts_val'] == datetime(2017, 1, 1))

# Open scanner and read all tuples
# Note: This doesn't scale for large scans
result = scanner.open().read_all_tuples()
Example Apache Impala Commands With Kudu
See Using Apache Impala (incubating) with Kudu for guidance on installing and using Impala with Kudu, including several impala-shell examples.
Kudu Integration with Spark
Kudu integrates with Spark through the Data Source API as of version 1.0.0. Include the kudu-spark dependency using the --packages option:
Use the kudu-spark_2.10 artifact if using Spark with Scala 2.10
spark-shell --packages org.apache.kudu:kudu-spark_2.10:1.1.0
Use the kudu-spark2_2.11 artifact if using Spark 2 with Scala 2.11
spark-shell --packages org.apache.kudu:kudu-spark2_2.11:1.1.0
Then import kudu-spark and create a DataFrame:
import org.apache.kudu.client.CreateTableOptions
import org.apache.kudu.spark.kudu._

// Read a table from Kudu
val df = sqlContext.read.options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table")).kudu

// Query using the Spark API...
df.select("id").filter("id >= 5").show()

// ...or register a temporary table and use SQL
df.registerTempTable("kudu_table")
val filteredDF = sqlContext.sql("select id from kudu_table where id >= 5")
filteredDF.show()

// Use KuduContext to create, delete, or write to Kudu tables
val kuduContext = new KuduContext("kudu.master:7051")

// Create a new Kudu table from a dataframe schema
// NB: No rows from the dataframe are inserted into the table
kuduContext.createTable("test_table", df.schema, Seq("key"), new CreateTableOptions().setNumReplicas(1))

// Insert data
kuduContext.insertRows(df, "test_table")

// Delete data
kuduContext.deleteRows(filteredDF, "test_table")

// Upsert data
kuduContext.upsertRows(df, "test_table")

// Update data
val alteredDF = df.select($"id", $"count" + 1)
kuduContext.updateRows(alteredDF, "test_table")

// Data can also be inserted into the Kudu table using the data source, though the methods on
// KuduContext are preferred
// NB: The default is to upsert rows; to perform standard inserts instead, set operation = insert
// in the options map
// NB: Only mode Append is supported
df.write.options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "test_table")).mode("append").kudu

// Check for the existence of a Kudu table
kuduContext.tableExists("another_table")

// Delete a Kudu table
kuduContext.deleteTable("unwanted_table")
Spark Integration Known Issues and Limitations
- Kudu tables with a name containing upper case or non-ASCII characters must be assigned an alternate name when registered as a temporary table.
- Kudu tables with a column name containing upper case or non-ASCII characters may not be used with SparkSQL. Non-primary key columns may be renamed in Kudu to work around this issue (see the sketch after this list).
- NULL, NOT NULL, <>, OR, LIKE, and IN predicates are not pushed to Kudu, and will instead be evaluated by the Spark task.
- Kudu does not support all types supported by Spark SQL, such as Date, Decimal, and complex types.
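As a workaround for the column-name limitation above, a non-primary-key column can be renamed through the Java client's alter-table API. The following is a minimal sketch; the master address, table name, and column names are placeholders:

import org.apache.kudu.client.AlterTableOptions;
import org.apache.kudu.client.KuduClient;

public class RenameColumn {
  public static void main(String[] args) throws Exception {
    // "kudu.master", "my_table", and the column names are placeholders
    KuduClient client = new KuduClient.KuduClientBuilder("kudu.master:7051").build();
    try {
      // Rename a non-primary-key column whose name SparkSQL cannot handle
      client.alterTable("my_table",
          new AlterTableOptions().renameColumn("MixedCaseValue", "mixed_case_value"));
    } finally {
      client.shutdown();
    }
  }
}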
Integration with MapReduce, YARN, and Other Frameworks
Kudu was designed to integrate with MapReduce, YARN, Spark, and other frameworks in the Hadoop ecosystem. See RowCounter.java and ImportCsv.java for examples on which you can model your own integrations.