Create table feature
You use CREATE TABLE from Impala or CREATE EXTERNAL TABLE from Hive to create an external table in Iceberg. You learn the subtle differences in these features for creating Iceberg tables from Hive and Impala. You also learn about partitioning.
Hive and Impala handle external table creation a little differently, and that extends to creating tables in Iceberg. By default, the Iceberg tables you create are v1. To create an Iceberg v2 table from Hive or Impala, set the following table property: 'format-version' = '2'.
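For example, a minimal sketch of creating a v2 table from Hive; the table name ice_v2_demo is an illustrative placeholder, and from Impala you would use CREATE TABLE instead:
-- Creates an Iceberg v2 table by setting the format-version table property
CREATE EXTERNAL TABLE ice_v2_demo (i INT)
STORED BY ICEBERG
TBLPROPERTIES ('format-version' = '2');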
Enable HDFS HA before you create Iceberg tables
Learn how Apache Iceberg stores absolute paths in its metadata manifest files, and how HDFS HA ensures that the table remains accessible regardless of which NameNode is active.
Background
Apache Iceberg stores absolute paths in its metadata manifest files for Iceberg tables. In contrast, Apache Hive writes the relative path of the table into the metadata and builds the absolute path from the HDFS configuration.
The format of these paths depends on your HDFS configuration at the time of table creation. Without HDFS HA, Iceberg writes the paths using the specific hostname and port of the active NameNode. With HDFS HA, Iceberg writes the paths using the logical nameservice. Using the HDFS nameservice ensures that the paths remain valid regardless of which NameNode is active.
To avoid possible data loss when switching from one HDFS configuration to another, Cloudera highly recommends enabling HDFS HA in the environment.
Explanation
Writing a table in Hadoop, in either Hive or Iceberg format, involves writing both the table data and the table metadata.
While Cloudera stores the metadata in the Hive Metastore (HMS) for both Hive and Iceberg tables, the way the table location is written in the metadata differs:
- Hive uses a relative path: the absolute table location is built by concatenating the relative path stored in the metadata with the HDFS definition of the NameNode or nameservice that is active.
- Iceberg uses an absolute path in the table metadata to describe the location of the table at the time of its creation. If HDFS runs in non-HA mode, the cluster uses either the Active NameNode or the Standby NameNode, and the one in use is written to the table information in the following format: hdfs://node4.cloudera.com:8020/warehouse/tablespace/external/hive/test.db/ice_table/
That path remains accessible as long as that NameNode (node4.cloudera.com) is active. However, if the node becomes unavailable and the standby node takes over, the table becomes inaccessible, because the node that was used to write the table is no longer active. This applies both to the table information itself and to the associated manifest JSON file, which contains the absolute path to the table location.
When you enable HDFS HA, the service removes the individual NameNode references and uses the nameservice as the only reference. The nameservice internally manages its NameNodes, similar to a load balancer, exposing only the nameservice as the access point.
Tables created after HDFS HA is enabled use the following format (assuming the nameservice is named nameservice01): hdfs://nameservice01/warehouse/tablespace/external/hive/test.db/ice_table/
This format ensures that the table is accessible regardless of which NameNode is active, because the nameservice resolves the path and thereby supports failover.
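A quick way to check which form a table uses is to inspect its location. This is a sketch assuming a table named ice_table in the test database, matching the path examples above:
-- Shows the table location; with HDFS HA it should start with hdfs://<nameservice>/
DESCRIBE FORMATTED test.ice_table;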
Iceberg table creation from Hive
From Hive, CREATE EXTERNAL TABLE is recommended to create an Iceberg table in Cloudera.
When you use the EXTERNAL keyword to create the Iceberg table, by default only the schema is dropped when you drop the table; the actual data is not purged. Conversely, if you do not use EXTERNAL, by default both the schema and the actual data are purged. You can override the default behavior. For more information, see the Drop table feature.
CREATE EXTERNAL TABLE ice_fm_hive (i int) STORED BY ICEBERG TBLPROPERTIES ('metadata_location'='<object store or file system path>')
See examples below.
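The following sketch illustrates the default drop behavior described above; the table names are placeholders, and the external.table.purge property is shown only as an assumed way to override the default, as covered in the Drop table feature:
-- Default: dropping an external Iceberg table removes only the schema, not the data
CREATE EXTERNAL TABLE ice_ext_demo (i INT) STORED BY ICEBERG;
DROP TABLE ice_ext_demo;
-- Assumed override (see the Drop table feature): purge the data as well on drop
CREATE EXTERNAL TABLE ice_ext_purge_demo (i INT) STORED BY ICEBERG
TBLPROPERTIES ('external.table.purge'='true');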
Iceberg table creation from Impala
From Impala, CREATE TABLE is recommended to create an Iceberg table in Cloudera. Impala creates the Iceberg table metadata in the metastore and also initializes the actual Iceberg table data in the object store.
The difference between Hive and Impala with regard to creating an Iceberg table is related to Impala compatibility with Kudu, HBase, and other tables. For more information, see the Apache documentation, "Using Impala with Iceberg Tables".
Metadata storage of Iceberg tables
When you create an Iceberg table using CREATE EXTERNAL TABLE in Hive or using CREATE TABLE in Impala, HiveCatalog creates an HMS table and also stores some metadata about the table on your object store, such as S3. Creating an Iceberg table generates a metadata.json file, but not a snapshot. In the metadata.json, the snapshot-id of a new table is -1. Inserting, deleting, or updating table data generates a snapshot. The Iceberg metadata files and data files are stored in the table directory under the warehouse folder. Any optional partition data is converted into Iceberg partitions instead of creating partitions in the Hive Metastore, which avoids the Hive Metastore partition bottleneck.
The clause that designates the table as an Iceberg table differs slightly between Hive and Impala:
- Hive: STORED BY ICEBERG
- Impala: STORED AS ICEBERG or STORED BY ICEBERG
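The snapshot behavior described above can be observed from Impala, for example. This is a sketch using the hypothetical table ice_snap_demo, assuming the DESCRIBE HISTORY statement is available for Iceberg tables in your Impala version:
-- Creating the table writes a metadata.json but no snapshot (snapshot-id is -1)
CREATE TABLE ice_snap_demo (i INT) STORED BY ICEBERG;
-- Inserting data generates the first snapshot
INSERT INTO ice_snap_demo VALUES (1);
-- Lists the table snapshots; one entry appears after the insert
DESCRIBE HISTORY ice_snap_demo;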
Supported file formats
You can write Iceberg tables in the following formats:
- From Hive: Parquet (default), Avro, ORC
- From Impala: Parquet
Impala supports writing Iceberg tables only in Parquet format. Impala does not support defining both the file format and the storage engine. For example, CREATE TABLE tbl ... STORED AS PARQUET STORED BY ICEBERG works from Hive, but not from Impala.
You can read Iceberg tables in the following formats:
- From Hive: Parquet, Avro, ORC
- From Impala: Parquet, Avro, ORC
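For instance, a sketch of selecting a non-default write format from Hive; the table name ice_avro_demo is a placeholder:
-- Hive can write Iceberg data files in Avro instead of the default Parquet
CREATE EXTERNAL TABLE ice_avro_demo (i INT) STORED BY ICEBERG STORED AS AVRO;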
Hive syntax
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type, ... )]
[PARTITIONED BY [SPEC]([col_name][, spec(value)][, spec(value)]...)]
STORED BY ICEBERG
[STORED AS file_format]
[TBLPROPERTIES ('key'='value', 'key'='value', ...)]
Impala syntax
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type, ... )]
[PARTITIONED BY [SPEC]([col_name][, spec(value)][, spec(value)]...)]
STORED {AS | BY} ICEBERG
[TBLPROPERTIES (property_name=property_value, ...)]
Hive examples
CREATE EXTERNAL TABLE ice_1 (i INT, t TIMESTAMP, j BIGINT) STORED BY ICEBERG;
CREATE EXTERNAL TABLE ice_2 (i INT, t TIMESTAMP) PARTITIONED BY (j BIGINT) STORED BY ICEBERG;
CREATE EXTERNAL TABLE ice_4 (i int) STORED BY ICEBERG STORED AS ORC;
CREATE EXTERNAL TABLE ice_5 (i int) STORED BY ICEBERG TBLPROPERTIES ('metadata_location'='s3a://bucketName/ice_table/metadata/v1.metadata.json');
CREATE EXTERNAL TABLE ice_6 (i int) STORED BY ICEBERG STORED AS ORC TBLPROPERTIES ('format-version' = '2');
Impala examples
CREATE TABLE ice_7 (i INT, t TIMESTAMP, j BIGINT) STORED BY ICEBERG; -- creates only the schema
CREATE TABLE ice_8 (i INT, t TIMESTAMP) PARTITIONED BY (j BIGINT) STORED BY ICEBERG; -- creates schema and initializes data
CREATE TABLE ice_v2 (i INT, t TIMESTAMP) PARTITIONED BY (j BIGINT) STORED BY ICEBERG TBLPROPERTIES ('format-version' = '2'); -- creates a v2 table
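The syntax above also allows Iceberg partition transforms through the SPEC clause. The following sketch is an assumed illustration using the month and bucket transforms; the table name ice_spec_demo and its columns are placeholders:
-- Partitions by month of the timestamp column and by 16 hash buckets of the id column
CREATE TABLE ice_spec_demo (id INT, t TIMESTAMP)
PARTITIONED BY SPEC (month(t), bucket(16, id))
STORED BY ICEBERG;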
