Migrating a Hive table to Iceberg

You learn how to use a simple ALTER TABLE statement from Hive or Impala to migrate an external Hive table to an Iceberg table, and how to configure table input and output by setting table properties.

In this task, from a Data Hub cluster, you open Hue and use Hive or Impala to migrate a table to Iceberg. In Impala, you can set the NUM_THREADS_FOR_TABLE_MIGRATION query option to tune the performance of the table migration. The option sets the maximum number of threads used for the migration, although the effective maximum can also be limited by the number of CPUs. If set to zero, the number of available CPUs on the coordinator node is used as the maximum. Parallelism operates on the data files within a partition: one partition is processed at a time, with multiple threads processing the files inside it. If each partition contains only one file, execution is sequential.
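For example, to cap the migration at eight threads for the current Impala session (the value is illustrative; choose one suited to your coordinator node), set the query option before running the conversion:
    SET NUM_THREADS_FOR_TABLE_MIGRATION=8;
Because NUM_THREADS_FOR_TABLE_MIGRATION is a query option, the setting applies only to your current session.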
You must meet the prerequisites to query Iceberg tables, including obtaining Ranger access permissions.
  1. Log into CDP, and click Data Hub.
  2. Click Hue.
  3. Select a database.
  4. Enter a query to use a database.
    For example:
    USE mydb;
  5. Enter the query or queries to migrate an existing external Hive table to an Iceberg v2 table.
    Hive example:
    ALTER TABLE tbl
    SET TBLPROPERTIES ('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler',
                       'format-version'='2');
    Impala example, which requires two queries:
    ALTER TABLE table_name CONVERT TO ICEBERG;
    ALTER TABLE table_name SET TBLPROPERTIES ('format-version'='2');
    The first ALTER statement converts the Hive table to an Iceberg v1 table; the second upgrades the table to v2.
  6. Click the run icon to run the queries.
    An Iceberg v2 table is created, replacing the Hive table.
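To verify the migration, you can inspect the table properties and run a query against the migrated table (an optional check, not part of the procedure above). For example:
    DESCRIBE FORMATTED table_name;
    SELECT COUNT(*) FROM table_name;
In the DESCRIBE FORMATTED output, the table properties should include 'format-version'='2' and, depending on the engine you used, the Iceberg storage handler or 'table_type'='ICEBERG'.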