Populating Partition-Related Information

When working with data stored in cloud object stores, the steps for populating partition-related information are the same as when working with data in HDFS.

Creating a table definition does not by itself populate partition-related information in the metastore. When a dataset in Amazon S3 is already partitioned, you must run the MSCK REPAIR TABLE command to populate the partition-related information in the metastore.

For example, consider the following statement:

CREATE EXTERNAL TABLE `inventory`(
  `inv_item_sk` int,
  `inv_warehouse_sk` int,
  `inv_quantity_on_hand` int)
PARTITIONED BY (
  `inv_date_sk` int) STORED AS ORC
LOCATION
  's3a://BUCKET_NAME/tpcds_bin_partitioned_orc_200.db/inventory';

This statement creates a table definition in the metastore, but does not populate the partition-related information.

To populate the partition-related information, you need to run MSCK REPAIR TABLE inventory.
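As a minimal sketch, the repair step for the inventory table defined above could look like the following. The SHOW PARTITIONS statement is an optional check added here for illustration; it lists the partitions the metastore knows about after the repair.

```sql
-- Scan the table location and register any partition directories
-- that are not yet recorded in the metastore.
MSCK REPAIR TABLE inventory;

-- Optional check: list the partitions now registered for the table.
SHOW PARTITIONS inventory;
```

Before the repair, SHOW PARTITIONS on a newly created external table is expected to return no rows, even though partition directories already exist under the table location.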

You can increase the value of the hive.metastore.fshandler.threads parameter to increase the number of threads used for scanning the partitions in the MSCK phase (default is 15). This can speed up partition discovery if your hardware has spare capacity.
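The following sketch shows one way to raise the thread count before running the repair. It assumes that your deployment allows this parameter to be changed at the session level; if it does not, set the parameter in the Hive configuration instead. The value 30 is only illustrative and should be chosen according to the capacity available.

```sql
-- Assumption: the parameter can be set per session in this environment.
-- 30 is an illustrative value, not a recommendation.
SET hive.metastore.fshandler.threads=30;

-- Run the repair with the increased number of scanning threads.
MSCK REPAIR TABLE inventory;
```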
