Partition refresh and configuration

You can discover partition changes and synchronize Hive metadata automatically. Performing synchronization automatically as opposed to manually can save substantial time, especially when partitioned data, such as logs, changes frequently. You can also configure how long to retain partition data and metadata.

After creating a partitioned table, Hive does not update metadata about corresponding objects or directories on the file system that you add or drop. The partition metadata in the Hive metastore becomes stale after corresponding objects/directories are added or deleted. You need to synchronize the metastore and the file system.

You can refresh Hive metastore partition information manually or automatically. The time it takes to refresh the partition information is proportional to the number of partitions involved.
  • Manually

    You run the MSCK (metastore consistency check) Hive command: MSCK REPAIR TABLE table_name SYNC PARTITIONS every time you need to synchronize a partition with the file system.

  • Automatically

    You set up partition discovery to occur periodically.

The discover.partitions table property is automatically created and enabled for external partitioned tables. When discover.partitions is enabled for a table, Hive performs an automatic refresh as follows:
  • Adds corresponding partitions that are in the file system, but not in metastore, to the metastore.
  • Removes partition schema information from metastore if you removed the corresponding partitions from the file system.

Partition retention

You can configure how long to keep partition metadata and data, and remove it after the retention period elapses.

Limitations

Generally, partition discovery and retention is not recommended for use on managed tables. The Hive metastore acquires an exclusive lock on a table that enables partition discovery that can slow down other queries.