Using Apache Iceberg in Cloudera Data Engineering (Technical Preview)
Cloudera Data Engineering (CDE) supports Apache Iceberg, which provides a table format
for huge analytic datasets in the cloud. Iceberg enables you to work with large tables,
especially on object stores, and supports concurrent reads and writes on all storage media. You
can use Cloudera Data Engineering virtual clusters running Spark 3 to interact with Apache
Iceberg tables.
Prerequisites
Learn about the supported versions of Spark and the CDP Private Cloud Base cluster to use with Apache Iceberg in CDE.

Creating a Virtual Cluster with Spark 3
Create a virtual cluster with Spark 3 as the Spark version.

Creating and running Spark 3.2.1 Iceberg jobs
Create and run a Spark job that uses Iceberg tables.

Creating a new Iceberg table from Spark 3
In Cloudera Data Engineering (CDE), you can create a Spark job that creates a new Iceberg table or imports an existing Hive table. Once created, the table can be used for subsequent operations.

Configuring Hive Metastore for Iceberg column changes
To make schema changes to an existing column of an Iceberg table, you must configure the Hive Metastore of the Data Lake.

Importing and migrating Iceberg tables in Spark 3
Importing and migrating are supported only for existing external Hive tables. When you import a table to Iceberg, the source and destination remain intact and independent. When you migrate a table, the existing Hive table is converted into an Iceberg table. You can use Spark SQL to import or migrate a Hive table to Iceberg.

Configuring the Catalog
When using Spark SQL to query an Iceberg table from Spark, you refer to the table using dot notation: catalog.database.table.

Loading data into an unpartitioned table
You can insert data into an unpartitioned table using Spark SQL.

Querying data in an Iceberg table
You can use Spark SQL to query Iceberg tables.

Updating Iceberg table data
Iceberg table data can be updated using copy-on-write or merge-on-read. The table format version you are using determines which update modes are available.

Iceberg library dependencies for Spark applications
If your Spark application only uses Spark SQL to create, read, or write Iceberg tables, and does not use any Iceberg APIs, you do not need to build it against any Iceberg dependencies.
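The table operations above can be sketched in Spark SQL. This is a minimal, illustrative sequence, not a complete job: the catalog name (`spark_catalog`), database (`default`), and table names are assumptions, and the catalog configuration depends on how your CDE virtual cluster and Data Lake are set up.

```sql
-- Create a new Iceberg table (catalog, database, and table names are illustrative)
CREATE TABLE spark_catalog.default.customers (
  id BIGINT,
  name STRING,
  created_at TIMESTAMP
) USING iceberg;

-- Load data into the unpartitioned table
INSERT INTO spark_catalog.default.customers
VALUES (1, 'Alice', current_timestamp());

-- Query the table with Spark SQL
SELECT id, name FROM spark_catalog.default.customers WHERE id = 1;

-- Update table data (copy-on-write is the default; merge-on-read
-- requires a format version 2 table)
UPDATE spark_catalog.default.customers SET name = 'Alicia' WHERE id = 1;

-- Import an existing external Hive table into a new Iceberg table,
-- leaving the source table intact and independent
CALL spark_catalog.system.snapshot('default.hive_customers',
                                   'default.customers_iceberg');

-- Migrate an existing external Hive table in place, converting it to Iceberg
CALL spark_catalog.system.migrate('default.hive_customers');
```

The `snapshot` and `migrate` calls are the Iceberg Spark procedures for import and migration respectively; running them, and the `UPDATE` statement, assumes the Iceberg Spark SQL extensions are enabled for your Spark 3 session.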
The runtime dependencies needed for Spark to use Iceberg are in the CDE Spark classpath by default. If your code uses Iceberg APIs, then you need to build it against Iceberg dependencies.
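For code that does call Iceberg APIs, the build dependency can be declared as compile-time only, since the runtime jars are already on the CDE Spark classpath. A hedged Maven sketch follows; the artifact name assumes Spark 3.2 with Scala 2.12, and the version placeholder must be replaced with the Iceberg version matching your CDE runtime.

```xml
<!-- Assumed coordinates for a Spark 3.2 / Scala 2.12 build;
     verify the artifact and version against your CDE runtime. -->
<dependency>
  <groupId>org.apache.iceberg</groupId>
  <artifactId>iceberg-spark-runtime-3.2_2.12</artifactId>
  <version>${iceberg.version}</version>
  <scope>provided</scope>
</dependency>
```

The `provided` scope keeps the Iceberg classes available at compile time without bundling them into your application jar, avoiding conflicts with the versions CDE supplies at runtime.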