Importing and migrating Iceberg table in Spark 3
Importing or migrating tables are supported only on existing external Hive tables. When you import a table to Iceberg, the source and destination remain intact and independent. When you migrate a table, the existing Hive table is converted into an Iceberg table. You can use Spark SQL to import or migrate a Hive table to Iceberg.
Importing
spark.sql("CALL <catalog>.system.snapshot('<src>', '<dest>')")
-
<src>
is the qualified name of the Hive table -
<dest>
is the qualified name of the Iceberg table to be created -
<catalog>
is the name of the catalog, which you pass in a configuration file. For more information, see Configuring Catalog linked below.
For example:
spark.sql("CALL spark_catalog.system.snapshot('hive_db.hive_tbl',
'iceberg_db.iceberg_tbl')")
Migrating
When you migrate a Hive table to Iceberg, a backup of the table, named
<table_name>backup
, is created.
Ensure that the TRANSLATED_TO_EXTERNAL property, that is located in TBLPROPERTIES, is set to false before migrating the table. This ensures that a table backup is created by renaming the table in Hive metastore (HMS) instead of moving the physical location of the table. Moving the physical location of the table would entail copying files in Amazon s3.
We recommend that you refrain from dropping the backup table, as doing so will invalidate the newly migrated table.
If you want to delete the backup table, set the following:
'external.table.purge'='FALSE'
Deleting the backup table in the manner above will prevent underlying data from being deleted, therefore, only the table will be deleted from the metastore.
To undo the migration, drop the migrated table and restore the Hive table from the backup table by renaming it.
spark.sql("CALL <catalog>.system.migrate('<src>')")
-
<src>
is the qualified name of the Hive table -
<catalog>
is the name of the catalog, which you pass in a configuration file. For more information, see Configuring Catalog linked below.
For example:
spark.sql(“CALL
spark_catalog.system.migrate(‘hive_db.hive_tbl’)”)