Create an Iceberg data connection
CML supports data connections to Iceberg data lakes.
You can connect automatically, using the CML-provided data connection. If necessary, you can also set up a manual connection using the snippet provided below. To connect with Iceberg, you must use Spark 3.
In your project:
- In Project Settings, view the Data Connections tab. There should be an available data connection to a Spark Data Lake.
- Start a New Session.
- Select Enable Spark, and choose Spark 3 from the dropdown.
- Select Start Session.
- In the Connection Code Snippet UI, select the Spark Data Lake connection.
- In the code window, select Copy Code, then Close.
- Select , and paste the CML-provided code snippet into the file.
- Select Run.
You see a list of available databases in the data lake.
Instead of using the CML-provided data connection, you can also manually connect to a Spark Data Lake using the Spark command as shown in the snippet below.
Make sure to set the following parameters:
DATALAKE_DIRECTORY
- Valid database and table name in the
describe formatted
SQL command.
from pyspark.sql import SparkSession
# Change to the appropriate Datalake directory location
DATALAKE_DIRECTORY = "s3a://your-aws-demo/"
spark = (
SparkSession.builder.appName("MyApp")
.config("spark.jars", "/opt/spark/optional-lib/iceberg-spark-runtime.jar")
.config("spark.sql.hive.hwc.execution.mode", "spark")
.config( "spark.sql.extensions", "com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension, org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
.config("spark.sql.catalog.spark_catalog.type", "hive")
.config( "spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
.config("spark.yarn.access.hadoopFileSystems", DATALAKE_DIRECTORY)
.getOrCreate()
)
spark.sql("show databases").show()
spark.sql("describe formatted <database_name>.<table_name>").show()