Using Apache Iceberg in Cloudera Data Engineering
Cloudera Data Engineering (CDE) supports Apache Iceberg, which provides a table format
for huge analytic datasets in the cloud. Iceberg enables you to work with large tables,
especially on object stores, and supports concurrent reads and writes on all storage media. You
can use Cloudera Data Engineering virtual clusters running Spark 3 to interact with Apache
Iceberg tables.
Prerequisites and limitations for using Iceberg
To use Apache Iceberg in CDE, you'll need the following prerequisites:

Accessing Iceberg tables
CDP uses Apache Ranger to provide centralized security administration and management. The
Ranger Admin UI is the central interface for security administration. You can use Ranger to
create two policies that allow users to query Iceberg tables.

Creating Virtual Cluster with Spark 3
Create a virtual cluster with Spark 3 as the Spark version.

Creating and running Spark 3.2.1 Iceberg jobs
Create and run a Spark job that uses Iceberg tables.

Creating a new Iceberg table from Spark 3
You can create an Iceberg table using Spark SQL.

Configuring Hive Metastore for Iceberg column changes
To make schema changes to an existing column of an Iceberg table, you must configure the Hive
Metastore of the Data Lake.

Importing and migrating Iceberg table in Spark 3
Importing or migrating tables is supported only on existing external Hive tables. When you
import a table to Iceberg, the source and destination remain intact and independent. When you
migrate a table, the existing Hive table is converted into an Iceberg table. You can use Spark
SQL to import or migrate a Hive table to Iceberg.

Importing and migrating Iceberg table format v2
Importing or migrating Hive tables to Iceberg table format v2 is supported only on existing
external Hive tables. When you import a table to Iceberg, the source and destination remain
intact and independent. When you migrate a table, the existing Hive table is converted into an
Iceberg table. You can use Spark SQL to import or migrate a Hive table to Iceberg.

Configuring Catalog
When using Spark SQL to query an Iceberg table from Spark, you refer to a table using the
following dot notation:

Loading data into an unpartitioned table
You can insert data into an unpartitioned table. The syntax to load data into an Iceberg
table:

Querying data in an Iceberg table
To read an Iceberg table, you can query it using Spark SQL.

Updating Iceberg table data
Iceberg table data can be updated using copy-on-write or merge-on-read. The table version you
are using determines how you can update the table data.

Iceberg library dependencies for Spark applications
If your Spark application only uses Spark SQL to create, read, or write Iceberg tables, and
does not use any Iceberg APIs, you do not need to build it against any Iceberg dependencies.
The runtime dependencies needed for Spark to use Iceberg are in the Spark classpath by default.
If your code uses Iceberg APIs, then you need to build it against Iceberg dependencies.
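
As a rough illustration of the create, load, and query topics listed above, the following Python sketch builds the Spark SQL statements for an unpartitioned Iceberg table and optionally runs them on an existing Spark session. It is a minimal sketch under stated assumptions, not the exact syntax from the linked topics: the database and table names (demo_db, sample) and the helper functions are hypothetical, the three-part catalog.database.table naming follows the general Spark convention, and spark_catalog is assumed to be the catalog configured for Iceberg in your virtual cluster.

```python
from typing import List


def qualified_name(catalog: str, database: str, table: str) -> str:
    """Join the three parts into catalog.database.table dot notation."""
    return ".".join([catalog, database, table])


def iceberg_statements(name: str) -> List[str]:
    """Return Spark SQL to create, load, and query an unpartitioned Iceberg table."""
    return [
        # USING iceberg selects the Iceberg table format for the new table.
        f"CREATE TABLE IF NOT EXISTS {name} (id BIGINT, data STRING) USING iceberg",
        # Plain INSERT INTO works because the table has no partition spec.
        f"INSERT INTO {name} VALUES (1, 'a'), (2, 'b')",
        f"SELECT id, data FROM {name} ORDER BY id",
    ]


def run_all(spark, statements):
    """Execute each statement via spark.sql and return the last result."""
    result = None
    for statement in statements:
        result = spark.sql(statement)
    return result


# Hypothetical names; spark_catalog is the name of Spark's built-in session
# catalog, assumed here to be configured for Iceberg.
TABLE = qualified_name("spark_catalog", "demo_db", "sample")
STATEMENTS = iceberg_statements(TABLE)
```

Inside a Spark 3 job, calling run_all(spark, STATEMENTS) with the job's SparkSession would return a DataFrame for the final SELECT. Keeping statement construction separate from execution lets you inspect the generated SQL without a running cluster.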