Example: Deploying the Apache Iceberg Sink Connector for Kafka Connect
If you need to make your Kafka streaming data accessible for analytics, you can deploy the Apache Iceberg Sink Connector to write data into Iceberg table format. This example shows you how to configure the connector with a Nessie catalog and S3 storage, enabling you to query your real-time Kafka data using analytics engines such as Spark.
This example demonstrates how to set up an end-to-end Apache Iceberg Sink Connector deployment.
In this example setup, the connector reads records from a Kafka topic and writes them to S3 storage as Parquet files with enhanced metadata including schema and partitioning information. The connector also updates a Nessie data catalog with references to the current metadata and schema.
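To make the target of the setup concrete, the following is a minimal sketch of what the connector instance might look like when defined as a KafkaConnector resource. The Kafka Connect cluster name (my-connect), topic name (events), table name (db.events), Nessie service address, and connector class are illustrative assumptions; the exact property names depend on the connector version you deploy.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: iceberg-sink
  namespace: [***KAFKA CONNECT NAMESPACE***]
  labels:
    # Must match the name of the KafkaConnect resource (assumed: my-connect)
    strimzi.io/cluster: my-connect
spec:
  # Connector class name may differ between connector releases
  class: org.apache.iceberg.connect.IcebergSinkConnector
  tasksMax: 1
  config:
    # Kafka topic to read records from (assumed name)
    topics: events
    # Iceberg table to write to (assumed name)
    iceberg.tables: db.events
    # Nessie catalog configuration; URI and warehouse path are assumptions
    iceberg.catalog.catalog-impl: org.apache.iceberg.nessie.NessieCatalog
    iceberg.catalog.uri: http://nessie:19120/api/v1
    iceberg.catalog.ref: main
    iceberg.catalog.warehouse: s3://[***S3 BUCKET***]/warehouse
    iceberg.catalog.io-impl: org.apache.iceberg.aws.s3.S3FileIO
```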
After the connector is deployed and data from the Kafka topic is written to S3, you retrieve the data by querying the table with Apache Spark.
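The final query step might look like the following spark-sql invocation, which registers the Nessie catalog with Spark and reads from the table. This is a sketch only: the catalog name (nessie), the Nessie URI, the artifact versions in --packages, and the db.events table name are assumptions that must match your actual deployment.

```shell
# Register an Iceberg catalog named "nessie" backed by the Nessie service,
# then query the table written by the connector. Versions and endpoints
# below are assumptions; align them with your Spark and Iceberg versions.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 \
  --conf spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog \
  --conf spark.sql.catalog.nessie.uri=http://nessie:19120/api/v1 \
  --conf spark.sql.catalog.nessie.ref=main \
  --conf spark.sql.catalog.nessie.warehouse=s3://[***S3 BUCKET***]/warehouse \
  --conf spark.sql.catalog.nessie.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  -e "SELECT * FROM nessie.db.events LIMIT 10"
```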
About connector dependencies
The Apache Iceberg Sink Connector is part of a large ecosystem with many storage and catalog options. The required dependencies vary depending on your specific use case: the target storage system (such as S3, ADLS, or HDFS), the data catalog implementation (such as Nessie, AWS Glue, Hive, or JDBC), and the data serialization format (such as Parquet, ORC, or Avro). Each combination requires a different set of artifacts to function correctly.
For this example, which uses Nessie as the data catalog, S3 as the storage backend, and Parquet as the data format, the following artifacts are required:
- iceberg-kafka-connect – The connector plugin itself
- hadoop-common – Core Hadoop libraries
- iceberg-parquet – Parquet file format support
- iceberg-nessie – Nessie catalog client libraries
- iceberg-aws-bundle – AWS SDK bundle for S3 integration
- iceberg-aws – Iceberg AWS integration libraries
These artifacts are specified in your KafkaConnect resource and are downloaded from Maven when the connector image is built.
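A sketch of how these artifacts can be declared in the build section of a KafkaConnect resource follows. The resource name (my-connect), push secret name, image tag, bootstrap address format, and all artifact versions are assumptions for illustration; pin them to versions that match your environment. Only two of the six artifacts are shown, with the rest following the same pattern.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: my-connect   # assumed name
  namespace: [***KAFKA CONNECT NAMESPACE***]
spec:
  replicas: 1
  # Bootstrap address format is an assumption; use your cluster's listener
  bootstrapServers: [***KAFKA CLUSTER NAME***]-kafka-bootstrap.[***KAFKA NAMESPACE***].svc:9092
  build:
    output:
      type: docker
      image: [***YOUR REGISTRY***]/my-connect-image:latest
      pushSecret: my-registry-credentials   # assumed Secret with registry login
    plugins:
      - name: iceberg-sink
        artifacts:
          # Maven-type artifacts are resolved and downloaded at image build time;
          # versions below are illustrative assumptions
          - type: maven
            group: org.apache.iceberg
            artifact: iceberg-kafka-connect
            version: 1.5.2
          - type: maven
            group: org.apache.hadoop
            artifact: hadoop-common
            version: 3.3.6
          # ...remaining artifacts (iceberg-parquet, iceberg-nessie,
          # iceberg-aws-bundle, iceberg-aws) declared the same way
```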
- Ensure that the Strimzi Cluster Operator is installed and running. See Installation.
- Ensure that you have a Kafka cluster deployed and running on Kubernetes. If not, deploy one. See Deploying Kafka.
  This example assumes a Kafka cluster deployed with Cloudera Streams Messaging Operator for Kubernetes. The name of the cluster is referred to as [***KAFKA CLUSTER NAME***]. The namespace is referred to as [***KAFKA NAMESPACE***].
- Ensure that a namespace is available where you can deploy your Kafka Connect cluster. If not, create one:
  kubectl create namespace [***KAFKA CONNECT NAMESPACE***]
- Ensure that you have access to a container registry where you can upload a Kafka Connect container image.
  The registry is required because you will build a custom Kafka Connect image that includes the Apache Iceberg Sink Connector and its dependencies. You can use your own private registry or a public registry such as Quay.io or Docker Hub. The registry is referred to as [***YOUR REGISTRY***] in this example.
- Ensure that you have an S3 bucket available for storing Iceberg table data.
  You will need the name and AWS region of the bucket, as well as access credentials (access key ID and secret access key). These are referred to as [***S3 BUCKET***], [***S3 REGION***], [***AWS ACCESS KEY ID***], and [***AWS SECRET ACCESS KEY***] in this example.
- Download the kafka_shell.sh tool from the Cloudera Archive.
  You will use this tool to create topics and to produce data for testing.
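The S3 credentials from the prerequisites should not be written into connector configuration in plain text; one common approach is to store them in a Kubernetes Secret in the Kafka Connect namespace and reference the Secret from the deployment. The following is a sketch; the Secret name (aws-credentials) and key names are assumptions, and stringData lets you supply the values without base64-encoding them yourself.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: aws-credentials   # assumed name; referenced later from the connector setup
  namespace: [***KAFKA CONNECT NAMESPACE***]
type: Opaque
stringData:
  # Plain-text values; Kubernetes base64-encodes stringData on creation
  awsAccessKeyId: [***AWS ACCESS KEY ID***]
  awsSecretAccessKey: [***AWS SECRET ACCESS KEY***]
```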
