Apache Kudu architecture in a CDP public cloud deployment
In a CDP public cloud deployment, Kudu is available as one of the many Cloudera Runtime services within the Real-time Data Mart template. To use Kudu, create a Data Hub cluster from the Management Console, and select the Real-time Data Mart template from the Cluster Definition dropdown menu.
Each Data Hub cluster that is deployed using the Real-time Data Mart template has an instance of Apache Kudu, Impala, Spark, and Knox. It also contains YARN, which is used to run Spark; Hue, which can be used to issue queries via Impala; and HDFS, which is present because the cluster is managed using Cloudera Manager. These components use the shared resources present within the Data Lake.
Impala manages hierarchical storage between Kudu and object storage. You can use SQL statements to age out a range partition's worth of data from Kudu into Parquet partitions on object storage, for example, on Amazon S3. You can then use an Impala UNION ALL query, or a VIEW based on one, to query the Kudu and Parquet data together.
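The aging-out workflow described above can be sketched in Impala SQL. The table and column names here (events_kudu, events_parquet, event_time) are hypothetical; adapt them to your own schema:

```sql
-- 1. Copy the oldest range of data from the Kudu table into a
--    Parquet table whose data lives on object storage (e.g. S3).
INSERT INTO events_parquet PARTITION (year=2023, month=1)
SELECT id, event_time, payload
FROM events_kudu
WHERE event_time >= '2023-01-01' AND event_time < '2023-02-01';

-- 2. Drop the aged-out range partition from the Kudu table.
ALTER TABLE events_kudu
  DROP RANGE PARTITION '2023-01-01' <= VALUES < '2023-02-01';

-- 3. Create a view so queries see recent (Kudu) and historical
--    (Parquet) data as a single table.
CREATE VIEW events_all AS
  SELECT id, event_time, payload FROM events_kudu
  UNION ALL
  SELECT id, event_time, payload FROM events_parquet;
```

Queries against events_all transparently combine low-latency recent data in Kudu with cheaper, immutable historical data in Parquet.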
Besides ingesting data, Spark is used for backup and disaster recovery (BDR) operations. The Kudu BDR feature is built as a Spark application that can take either full or incremental table backups, and a restore operation applies a stack of backups to recreate a table.
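Because the BDR feature is a Spark application, a backup is launched with spark-submit. The sketch below is illustrative only: the jar file name, master address, table name, and bucket path are assumptions to adapt for your environment:

```shell
# Back up a Kudu table by running the Kudu backup Spark job.
# Jar version, master hostname, bucket, and table name are hypothetical.
spark-submit \
  --class org.apache.kudu.backup.KuduBackup \
  kudu-backup2_2.11-1.12.0.jar \
  --kuduMasterAddresses master-1.example.com \
  --rootPath s3a://my-bucket/kudu-backups \
  my_table

# A restore runs the companion job against the same backup root path:
spark-submit \
  --class org.apache.kudu.backup.KuduRestore \
  kudu-backup2_2.11-1.12.0.jar \
  --kuduMasterAddresses master-1.example.com \
  --rootPath s3a://my-bucket/kudu-backups \
  my_table
```

The first run of KuduBackup takes a full backup; subsequent runs against the same root path take incremental backups, and KuduRestore replays the full backup plus any incrementals.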
The following diagram shows Apache Kudu and its dependencies as a part of the Data Hub cluster, and the shared resources they use which are present in the Data Lake: