Creating Iceberg tables

Apache Iceberg is an open, high-performance table format for organizing datasets that can contain petabytes of data. Iceberg can be used to add tables to computing engines, such as Apache Hive and Apache Flink, from which the data can be queried using SQL.

As Iceberg is integrated as a connector to Flink, you can use the table format the same way for SQL Stream Builder (SSB). Both the V1 and V2 version specifics are supported by the Flink connector.For more information, about the version changes, see the Apache Iceberg documentation.

Table 1. SQL feature support for the Iceberg and SSB integration
Feature SSB
Create catalog Supported
Create database Supported
Create table Supported
Alter table Supported
Drop table Supported
Select Supported
Insert into Supported
Insert overwrite Supported
Metadata tables Supported
Rewrite files action Supported
Upsert Technical preview1
Equality delete Not supported
You can use a Hive service that is located on the same cluster as SSB, and you can use a Hive service that is available in a Cloudera Data Warehouse (CDW) cluster.

Using Hive for Iceberg integration

When using the Hive service located on your cluster, you can add it as a catalog on Streaming SQL Console. Before creating the Iceberg table, ensure that you have added Hive as a catalog using the steps described in Adding catalogs documentation.

After setting up Hive for SSB, you can define Iceberg as a connector in the CREATE TABLE statement as the example shows below:
CREATE TABLE ‘ssb’.’ssb_default’.’iceberg_hive’ (
‘column_int’ INT,
‘column_str’ STRING,
) WITH (
‘connector’ = ‘iceberg’,
‘catalog-database’ = ‘test_db’,
‘catalog-type’ = ‘hive’,
‘catalog-name’ = ‘iceberg_hive_catalog’,
‘catalog-table’ = ‘iceberg_hive_table’,
‘ssb-hive-catalog’ = ‘ssb_hive_catalog’
‘engine.hive.enabled’ = ‘true’
)
The following properties are mandatory when using the Iceberg connector:
Property Example Description
catalog-database test_db The Iceberg database name in the backend catalog, uses the current Flink database name by default. It will be created automatically if it does not exist when writing records into the Flink table
catalog-type hive Type of the catalog
catalog-name iceberg_hive_catalog User-specified catalog name. It is required as the connector does not have any default value.
catalog-table iceberg_hive_table Name of the Iceberg table in the backend catalog.
ssb-hive-catalog ssb_hive_catalog The name of the Hive catalog that you have registered in SSB. The configuration can be used when creating Iceberg catalogs as well.

Using the Hive service from Cloudera Data Warehouse (CDW)

Before you begin
  • Ensure that the ssb user has access to the all - database, table policy under the Hadoop SQL service.
  • Create ID Broker mapping for the workload user.
  • Create an Iceberg table using Hue in the Hive virtual warehouse on the CDW cluster.
Steps
  1. Access Cloudera Manager of the CDW environment.
  2. Click Cluster > View Client Configuration URLs > Hive Metastore.
  3. Download the hive-conf tarball.

    The downloaded tarball needs to be uploaded to the manager node of the Flink cluster.

  4. Copy the hive-conf to the manager node of the Flink cluster.
    scp <location>/hive-conf <workload_username>@<manager_node_FQDN>:.
    Password:<your_workload_password>
    
  5. Extract the file under the /tmp directory that is accessible by the ssb user using the following commands:
    mkdir /opt/hive-conf
    cd /opt/hive-conf
    tar -xvf /path/to/hive-conf.tar.gz
    chmod a+x .
    chmod -R a+r .
  6. Access the Streaming SQL Console and create the table using the following example or select Iceberg from the Template drop-down and fill out the required parameters:
    CREATE TABLE my_ssb_table (
        id   INT
    ) WITH (
        'connector' = 'iceberg',
        'catalog-name' = 'internal-use',
        'catalog-database' = 'default',
        'catalog-table' = 'my_iceberg_table',
        'hive-conf-dir' = '/opt/hive-conf'
    );
1 The upsert feature of the Iceberg and SSB integration is in Technical Preview and not ready for production deployment. Cloudera encourages you to explore these features in non-production environments and provide feedback on your experiences through the Cloudera Community Forums.