Using Apache Iceberg in Cloudera Data Engineering (Technical Preview)
Cloudera Data Engineering (CDE) now supports Apache Iceberg, a new table format for huge analytic datasets in the
cloud. Iceberg enables you to work with large tables, especially on object stores, and supports
concurrent reads and writes on all storage media. You can use Cloudera Data Engineering virtual
clusters running Spark 3 to interact with Apache Iceberg tables.
Prerequisites
The following are the prerequisites for using Apache Iceberg in Cloudera Data Engineering.

Creating Virtual Cluster with Spark 3
Create a virtual cluster with Spark 3 as the Spark version.

Creating a new Iceberg table from Spark 3
In CDE, you can create a Spark job that creates a new Iceberg table or imports an existing Hive table. Once created, the table can be used for subsequent operations.

Importing and migrating Iceberg table in Spark 3
Importing or migrating tables is supported only for existing external Hive tables. When you import a table to Iceberg, the source and destination tables remain intact and independent. When you migrate a table, the existing Hive table is converted into an Iceberg table.

Configuring Catalog
When using Spark SQL to query an Iceberg table from Spark, you refer to a table using the following dot notation:

Loading data into an unpartitioned table
You can insert data into an unpartitioned table. The syntax to load data into an Iceberg table:

Querying data in an Iceberg table
To read an Iceberg table, you can query it with Spark SQL.

Iceberg library dependencies for Spark applications
If your Spark application only uses Spark SQL to create, read, or write Iceberg tables, and does not use any Iceberg APIs, you do not need to build it against any Iceberg dependencies. The runtime dependencies needed for Spark to use Iceberg are in the CDE Spark classpath by default. If your code uses Iceberg APIs, then you need to build it against Iceberg dependencies.
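As a rough illustration of the create, load, query, import, and migrate steps summarized above, the following Python sketch builds the Spark SQL statements such a job might run. It is not taken from the CDE documentation. The database name demo_db, the table names, the two-column schema, and the helper function names are all hypothetical. The catalog name spark_catalog is Spark's default session catalog, and it is an assumption that your virtual cluster maps it to Iceberg. The snapshot and migrate calls are Apache Iceberg's Spark procedures; whether they match the exact import and migrate syntax used in CDE is also an assumption.

```python
# Minimal sketch: build Spark SQL statements for a basic Iceberg workflow.
# All database, table, and column names below are hypothetical placeholders.


def qualified_name(database, table, catalog="spark_catalog"):
    """Return a table name in catalog.database.table dot notation."""
    return f"{catalog}.{database}.{table}"


def basic_workflow(database="demo_db", table="events", catalog="spark_catalog"):
    """Statements that create, load, and query an unpartitioned Iceberg table."""
    name = qualified_name(database, table, catalog)
    return [
        # USING iceberg asks Spark to create the table in the Iceberg format.
        f"CREATE TABLE IF NOT EXISTS {name} (id BIGINT, data STRING) USING iceberg",
        f"INSERT INTO {name} VALUES (1, 'a'), (2, 'b')",
        f"SELECT id, data FROM {name} ORDER BY id",
    ]


def import_statement(source, dest, catalog="spark_catalog"):
    """Iceberg 'snapshot' procedure: the source Hive table is left untouched."""
    return f"CALL {catalog}.system.snapshot('{source}', '{dest}')"


def migrate_statement(source, catalog="spark_catalog"):
    """Iceberg 'migrate' procedure: converts the Hive table in place."""
    return f"CALL {catalog}.system.migrate('{source}')"


def run_all(spark, statements):
    """Run each statement through an existing SparkSession (spark.sql)."""
    return [spark.sql(statement) for statement in statements]


if __name__ == "__main__":
    # Print the statements instead of executing them, so no cluster is needed.
    for statement in basic_workflow():
        print(statement)
    print(import_statement("demo_db.hive_src", "demo_db.iceberg_copy"))
    print(migrate_statement("demo_db.hive_src"))
```

Inside a Spark 3 job, you could pass the active SparkSession to run_all(spark, basic_workflow()). Because the sketch only issues Spark SQL and calls no Iceberg APIs, it would not need to be built against Iceberg dependencies.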