Create an Iceberg data connection

Cloudera AI supports data connections to Iceberg data lakes.

You can connect automatically, using the Cloudera AI-provided data connection. If necessary, you can also set up a manual connection using the snippet provided below. To connect with Iceberg, you must use Spark 3.

In your project:

In Project Settings, view the Data Connections tab. There should be an available data connection to a Spark Data Lake.
Start a New Session.
Select Enable Spark, and choose Spark 3 from the dropdown.
Select Start Session.
In the Connection Code Snippet UI, select the Spark Data Lake connection.
In the code window, select Copy Code, then Close.
Select File > New File, and paste the Cloudera AI-provided code snippet into the file.
Select Run.

You see a list of available databases in the data lake.

Instead of using the Cloudera AI-provided data connection, you can also manually connect to a Spark Data Lake using the Spark command as shown in the snippet below.

Make sure to set the following parameters:

DATALAKE_DIRECTORY
Valid database and table name in the describe formatted SQL command.

from pyspark.sql import SparkSession
# Change to the appropriate Datalake directory location
DATALAKE_DIRECTORY = "s3a://your-aws-demo/"

spark = (
SparkSession.builder.appName("MyApp")
.config("spark.sql.hive.hwc.execution.mode", "spark")
.config("spark.sql.extensions","com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
.config("spark.sql.catalog.spark_catalog.type", "hive")
.config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkCatalog")
.config("spark.kerberos.access.hadoopFileSystems","hdfs://nn1.com:8032,hdfs://nn2.com:8032,webhdfs://nn3.com:50070")
.config("spark.hadoop.iceberg.engine.hive.enabled", "true")
.config("spark.executorEnv.HADOOP_CONF_DIR", "/home/cdsw/hadoop_config_dir")
.config("spark.sql.iceberg.handle-timestamp-without-timezone", "true")
.config("spark.jars","/opt/spark/optional-11b/iceberg-spark-runtime.jar,/opt/spark/optional-11b/iceberg-hive-runtime.jar")
.getOrCreate()
)

spark.sql("show databases").show() 
spark.sql("describe formatted <database_name>.<table_name>").show()