Create an Iceberg data connection

Cloudera AI supports data connections to Iceberg data lakes.

You can connect automatically, using the Cloudera AI-provided data connection. If necessary, you can also set up a manual connection using the snippet provided below. To connect with Iceberg, you must use Spark 3.

In your project:

  1. In Project Settings, select the Data Connections tab. A data connection to a Spark Data Lake should be listed.
  2. Start a New Session.
  3. Select Enable Spark, and choose Spark 3 from the dropdown.
  4. Select Start Session.
  5. In the Connection Code Snippet UI, select the Spark Data Lake connection.
  6. In the code window, select Copy Code, then Close.
  7. Select File > New File, and paste the Cloudera AI-provided code snippet into the file.
  8. Select Run.

You see a list of available databases in the data lake.

Instead of using the Cloudera AI-provided data connection, you can connect to a Spark Data Lake manually, as shown in the following snippet.

Make sure to set the following parameters:

  • DATALAKE_DIRECTORY
  • A valid database name and table name in the describe formatted SQL command.

from pyspark.sql import SparkSession

# Change to the appropriate data lake directory location
DATALAKE_DIRECTORY = "s3a://your-aws-demo/"

spark = (
  SparkSession.builder.appName("MyApp")
  # Run Hive Warehouse Connector queries in Spark execution mode
  .config("spark.sql.hive.hwc.execution.mode", "spark")
  # Enable the Hive ACID and Iceberg SQL extensions
  .config("spark.sql.extensions", "com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  # Use the Iceberg session catalog, backed by the Hive Metastore
  .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
  .config("spark.sql.catalog.spark_catalog.type", "hive")
  # Grant Spark access to the data lake file system
  .config("spark.yarn.access.hadoopFileSystems", DATALAKE_DIRECTORY)
  .config("spark.hadoop.iceberg.engine.hive.enabled", "true")
  # Iceberg runtime JARs shipped with the Spark distribution
  .config("spark.jars", "/opt/spark/optional-lib/iceberg-spark-runtime.jar,/opt/spark/optional-lib/iceberg-hive-runtime.jar")
  .getOrCreate()
)

spark.sql("show databases").show()
spark.sql("describe formatted <database_name>.<table_name>").show()
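Once the session is created, you can run Iceberg DDL and DML through spark.sql. The sketch below is illustrative only and assumes a database named default exists in your data lake; the customers table and its columns are hypothetical names, not part of your environment.

```python
# Hypothetical example: create, populate, and inspect an Iceberg table.
# The database ("default") and the table/column names are placeholders.
spark.sql("""
  CREATE TABLE IF NOT EXISTS default.customers (
    id BIGINT,
    name STRING
  ) USING iceberg
""")

spark.sql("INSERT INTO default.customers VALUES (1, 'Alice'), (2, 'Bob')")

# Iceberg exposes metadata tables alongside the data; the snapshots
# table shows when each table version was committed.
spark.sql("SELECT committed_at, snapshot_id FROM default.customers.snapshots").show()
```

Because the table is created with USING iceberg, each write produces a new snapshot, which you can later query with Spark's time-travel syntax.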