Automate partition discovery and repair

Automated partition discovery and repair is useful for processing log data, and other data, in Spark and Hive catalogs. You learn how to set the partition discovery parameter to suit your use case. An aggressive partition discovery and repair configuration can delay the upgrade process.

Apache Hive can automatically and periodically discover discrepancies in partition metadata in the Hive metastore and in corresponding directories, or objects, on the file system. After discovering discrepancies, Hive performs synchronization.

The discover.partitions table property enables and disables synchronization of the file system with partitions. In external partitioned tables, this property is enabled (true) by default when you create the table. To a legacy external table (created using a version of Hive that does not support this feature), you need to add discover.partitions to the table properties to enable partition discovery.

By default, the discovery and synchronization of partitions occurs every 5 minutes. This is too often if you are upgrading and can result in the Hive DB being queried every few milliseconds, leading to performance degradation. During upgrading the high frequency of batch routines dictates running discovery and synchronization infrequently, perhaps hourly or even daily. You can configure the frequency as shown in this task.

  1. Assuming you have an external table created using a version of Hive that does not support partition discovery, enable partition discovery for the table.
    ALTER TABLE exttbl SET TBLPROPERTIES ('discover.partitions' = 'true');
  2. In Cloudera Manager, click Clusters > Hive > Configuration, search for Hive Server Advanced Configuration Snippet (Safety Valve) for hive-site.xml.
  3. Add the following property and value to hive-site.xml: Property: metastore.partition.management.task.frequency Value: 600.
    This action sets synchronization of partitions to occur every 10 minutes expressed in seconds. If you are upgrading, consider running discovery and synchonization once every 24 hours by setting the value to 86,400 seconds.