Using Apache Hive

Manage partitions automatically

You can discover file system changes related to partitions and synchronize Hive metadata automatically. Automatic synchronization can save substantial time compared with manual synchronization, especially when partitioned data, such as logs, changes frequently. You can also configure how long to retain partition data and metadata.

After you create a partitioned table, Hive does not track the corresponding directories that you later add to or drop from the file system or object store. As a result, the partition metadata in the Hive metastore becomes stale whenever such directories are added or deleted, and you need to synchronize the metastore with the file system.

You can refresh Hive metastore partition information manually or automatically.
  • Manually

    You run the MSCK (metastore consistency check) Hive command: MSCK REPAIR TABLE table_name SYNC PARTITIONS every time you need to synchronize a partition with your file system.

  • Automatically

    You set up partition discovery to occur periodically.
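The manual option can be sketched as follows. The table name web_logs is hypothetical and stands in for any external partitioned table whose directories have changed on the file system.

```sql
-- Hypothetical table name; replace web_logs with your own partitioned table.
-- SYNC PARTITIONS adds partitions found on the file system but missing from
-- the metastore, and drops metastore entries whose directories are gone.
MSCK REPAIR TABLE web_logs SYNC PARTITIONS;

-- List the partitions the metastore now knows about to confirm the result.
SHOW PARTITIONS web_logs;
```

You need to rerun the statement after each batch of directory changes, which is the overhead that automatic discovery is intended to remove.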

The discover.partitions table property is automatically created and enabled for external partitioned tables. When discover.partitions is enabled for a table, Hive performs an automatic refresh as follows:
  • Adds partitions that exist in the file system, but not in the metastore, to the metastore.
  • Removes partition schema information from the metastore if you removed the corresponding partitions from the file system.
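As a minimal sketch, you can inspect or set the discover.partitions property with standard table-property statements. The table name web_logs is hypothetical; the property is normally already set on external partitioned tables, so the ALTER statement is only needed where it is missing or was disabled.

```sql
-- Hypothetical table name; replace web_logs with your own table.
-- Check the current value of the property.
SHOW TBLPROPERTIES web_logs('discover.partitions');

-- Enable automatic partition discovery for the table.
ALTER TABLE web_logs SET TBLPROPERTIES ('discover.partitions' = 'true');
```

Setting the value to 'false' in the same way should turn discovery off for that table.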

Partition retention

You can configure how long to keep partition metadata and data and remove it after the retention period elapses.
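A retention period is typically expressed as a table property. The sketch below assumes the property name partition.retention.period and a duration value with a unit suffix (here, days), and it assumes that partition discovery must also be enabled on the table. Verify both assumptions against the documentation for your release. The table name web_logs is hypothetical.

```sql
-- Hypothetical table name; property name and value format are assumptions
-- to verify for your Hive version.
-- Keep partitions for seven days; older partitions become eligible for removal.
ALTER TABLE web_logs SET TBLPROPERTIES ('partition.retention.period' = '7d');

-- Confirm the setting.
SHOW TBLPROPERTIES web_logs('partition.retention.period');
```

Because expired partition data is removed along with its metadata, choose the period conservatively and test on a non-critical table first.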

Limitations

  • Generally, partition discovery and retention are not recommended for use on managed tables.
  • You must deploy a remote Hive metastore for your cluster by installing and configuring a supported database on the cluster.

    The Hive metastore acquires an exclusive lock on any table that has partition discovery enabled, which can slow down other queries. You cannot manage partitions automatically with the default metastore, which is embedded in the HiveServer process and installed by Ambari.