Create an Iceberg data connection
Cloudera AI supports data connections to Iceberg data lakes.
You can connect automatically, using the Cloudera AI-provided data connection. If necessary, you can also set up a manual connection using the snippet provided below. To connect with Iceberg, you must use Spark 3.
In your project:
- In Project Settings, view the Data Connections tab. There should be an available data connection to a Spark Data Lake.
- Start a New Session.
- Select Enable Spark, and choose Spark 3 from the dropdown.
- Select Start Session.
- In the Connection Code Snippet UI, select the Spark Data Lake connection.
- In the code window, select Copy Code, then Close.
- Select Cloudera AI-provided code snippet into the file. , and paste the
- Select Run.
You see a list of available databases in the data lake.
Instead of using the Cloudera AI-provided data connection, you can also manually connect to a Spark Data Lake using the Spark command as shown in the snippet below.
Make sure to set the following parameters:
DATALAKE_DIRECTORY
- Valid database and table name in the
describe formatted
SQL command.
from pyspark.sql import SparkSession
# Change to the appropriate Datalake directory location
DATALAKE_DIRECTORY = "s3a://your-aws-demo/"
spark = (
SparkSession.builder.appName("MyApp")
.config("spark.sql.hive.hwc.execution.mode", "spark")
.config("spark.sql.extensions", "com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension, org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
.config("spark.sql.catalog.spark_catalog.type", "hive")
.config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
.config("spark.yarn.access.hadoopFileSystems", DATALAKE_DIRECTORY)
.config("spark.hadoop.iceberg.engine.hive.enabled", "true")
.config("spark.jars", "/opt/spark/optional-lib/iceberg-spark-runtime.jar, /opt/spark/optional-lib/iceberg-hive-runtime.jar")
.getOrCreate()
)
spark.sql("show databases").show()
spark.sql("describe formatted <database_name>.<table_name>").show()