Partition refresh and configuration
You can discover partition changes and synchronize Hive metadata automatically. Performing synchronization automatically, instead of manually, can save substantial time, especially when partitioned data, such as logs, changes frequently. You can also configure how long to retain partition data and metadata.
After creating a partitioned table, Hive does not update metadata about corresponding objects or directories on the file system that you add or drop. The partition metadata in the Hive metastore becomes stale after corresponding objects/directories are added or deleted. You need to synchronize the metastore and the file system.
- Manually
You run the MSCK (metastore consistency check) Hive command:
MSCK REPAIR TABLE <table_name> ADD/DROP/SYNC PARTITIONS
every time you need to synchronize a partition with the file system.This command triggers a full repair which might take a longer time if the number of partitions are more. Instead, you can use operators in the command, such as EQUAL, NOTEQUAL, LESSTHANOREQUALTO, LESSTHAN, GREATERTHANOREQUALTO, GREATERTHAN, or LIKE in the partition column so that a larger subset of partitions can be recovered (added/removed) without triggering a full repair.
Syntax:
MSCK REPAIR TABLE <table_name> ADD/DROP/SYNC PARTITIONS (<pcol1><operator><value>,<pcol2><operator><value>);
- Automatically
You set up partition discovery to occur periodically.
discover.partitions
table property is automatically
created and enabled for external partitioned tables. When
discover.partitions
is enabled for a table, Hive performs an
automatic refresh as follows: - Adds corresponding partitions that are in the file system, but not in the metastore, to the metastore.
- Removes partition schema information from metastore if you removed the corresponding partitions from the file system.
Partition retention
You can configure how long to keep partition metadata and data, and remove it after the retention period elapses.Limitations
Generally, partition discovery and retention is not recommended for use on managed tables. The Hive metastore acquires an exclusive lock on a table that enables partition discovery that can slow down other queries.