Populating Partition-Related Information
When working with data stored in cloud object stores, the steps for populating partition-related information are the same as when working with data in HDFS.
Creating a table definition does not by itself populate partition-related information in the metastore. When a dataset in Amazon S3 is already partitioned, you must run the MSCK REPAIR TABLE command to populate the partition-related information in the metastore.
For example, consider the following statement:
CREATE EXTERNAL TABLE `inventory`(
  `inv_item_sk` int,
  `inv_warehouse_sk` int,
  `inv_quantity_on_hand` int)
PARTITIONED BY (
  `inv_date_sk` int)
STORED AS ORC
LOCATION 's3a://BUCKET_NAME/tpcds_bin_partitioned_orc_200.db/inventory';
This statement creates a table definition in the metastore, but does not populate the partition-related information.
To populate the partition-related information, you need to run MSCK REPAIR TABLE inventory.
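As a minimal sketch, the repair step for the inventory table defined above could look like the following in a Hive session. The SHOW PARTITIONS statement is not part of the original procedure; it is included only as one way to confirm that the partitions were registered.

```sql
-- Scan the table location and register any partition directories
-- (for example inv_date_sk=<value>) that are missing from the metastore.
MSCK REPAIR TABLE inventory;

-- Optional check (an addition for illustration): list the partitions
-- the metastore now knows about.
SHOW PARTITIONS inventory;
```

If SHOW PARTITIONS returns no rows after the repair, check that the directories under the table location follow the inv_date_sk=<value> naming pattern.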
You can increase the value of the hive.metastore.fshandler.threads parameter to increase the number of threads used for scanning the partitions in the MSCK phase (default is 15). This can speed up the operation if you have spare hardware capacity.
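The following sketch shows one way to raise the thread count before running the repair. It assumes that your deployment allows this property to be changed at the session level; if it does not, set the property in hive-site.xml instead. The value 30 is purely illustrative and should be sized to the hardware you have available.

```sql
-- Assumption: the property can be overridden per session in this deployment.
-- 30 is an example value, not a recommendation (the default is 15).
SET hive.metastore.fshandler.threads=30;

-- Run the partition scan with the larger thread pool.
MSCK REPAIR TABLE inventory;
```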