Manage partitions automatically
You can discover file system changes related to partitions and synchronize Hive metadata automatically. Performing synchronization automatically as opposed to manually can save substantial time, especially when partitioned data, such as logs, changes frequently. You can also configure how long to retain partition data and metadata.
After you create a partitioned table, Hive does not update metadata about corresponding directories that you later add to or drop from the file system or object store. The partition metadata in the Hive metastore therefore becomes stale, and you need to synchronize the metastore with the file system in one of the following ways:
- Manually
  You run the MSCK (metastore consistency check) Hive command:
  MSCK REPAIR TABLE table_name SYNC PARTITIONS
  every time you need to synchronize a partition with your file system.
- Automatically
  You set up partition discovery to occur periodically. The discover.partitions table property is automatically created and enabled for external partitioned tables. When discover.partitions is enabled for a table, Hive performs an automatic refresh as follows:
  - Adds corresponding partitions that are in the file system, but not in metastore, to the metastore.
  - Removes partition schema information from metastore if you removed the corresponding partitions from the file system.
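The two approaches above can be sketched in HiveQL as follows. This is an illustration only: the table name web_logs, its columns, and the location /data/web_logs are hypothetical, and the explicit ALTER TABLE at the end should only be needed if the property is not already enabled on your table.

```sql
-- Hypothetical external table partitioned by date (name, columns, and path are assumptions).
CREATE EXTERNAL TABLE web_logs (
  request_path STRING,
  status_code  INT
)
PARTITIONED BY (log_date STRING)
STORED AS ORC
LOCATION '/data/web_logs';

-- Manual approach: after partition directories are added to or removed from
-- /data/web_logs, reconcile the metastore with the file system.
MSCK REPAIR TABLE web_logs SYNC PARTITIONS;

-- List the partitions that the metastore knows about after the repair.
SHOW PARTITIONS web_logs;

-- Automatic approach: inspect the discovery property on the table...
SHOW TBLPROPERTIES web_logs ('discover.partitions');

-- ...and enable it explicitly if it is not already set to true.
ALTER TABLE web_logs SET TBLPROPERTIES ('discover.partitions' = 'true');
```

With discovery enabled, the manual MSCK step should no longer be necessary for routine changes, although a newly added directory appears in SHOW PARTITIONS only after the next periodic refresh runs.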
Partition retention
You can configure how long to keep partition metadata and data and remove it after the retention period elapses.
Limitations
- Generally, partition discovery and retention are not recommended for use on managed tables.
- You must deploy a remote Hive metastore for your cluster by installing and configuring a supported database on the cluster.
The Hive metastore acquires an exclusive lock on a table that has partition discovery enabled, which can slow down other queries. If you use the default metastore, which is embedded in the HiveServer process and installed by Ambari, you cannot manage partitions automatically.
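As an illustration of the partition retention setting described earlier, the following sketch sets a retention period on a table. The text above does not name the property; partition.retention.period is assumed here to be the table property that Hive uses for this purpose, so verify it against the documentation for your Hive version. The table name web_logs and the value of seven days are hypothetical.

```sql
-- Assumption: web_logs is an existing external partitioned table (hypothetical name).
-- Retention is applied as part of automatic partition management, so discovery is enabled first.
ALTER TABLE web_logs SET TBLPROPERTIES ('discover.partitions' = 'true');

-- Assumed property name and value format: keep partitions for 7 days.
ALTER TABLE web_logs SET TBLPROPERTIES ('partition.retention.period' = '7d');

-- Review the resulting table properties.
SHOW TBLPROPERTIES web_logs;
```

Because retention removes data as well as metadata, it may be worth trying a setting like this on a non-critical table before applying it more widely.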