Automated partition discovery and repair is useful for processing log data, and other data, in
Spark and Hive catalogs. You learn how to set the partition discovery parameter to suit your
use case. An aggressive partition discovery and repair configuration can delay the upgrade process.
Apache Hive can automatically and periodically discover discrepancies between
partition metadata in the Hive metastore and the corresponding directories, or objects, on the
file system. After discovering discrepancies, Hive performs synchronization.
The discover.partitions table property enables and disables
synchronization of the file system with partitions. In external partitioned tables,
this property is disabled (false) by default when you create the
table. For a legacy external table (created using a version of Hive that does not
support this feature), you need to add discover.partitions to the
table properties to enable partition discovery.
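The reconciliation described above can also be run by hand, which may help when checking what the scheduled task would change for a table. The following is a minimal sketch, assuming Hive 3.0 or later (where MSCK accepts the ADD, DROP, and SYNC PARTITIONS options) and using exttbl as a placeholder table name:

```sql
-- Register partition directories that exist on the file system
-- but are missing from the metastore.
MSCK REPAIR TABLE exttbl ADD PARTITIONS;

-- Remove metastore entries whose directories no longer exist
-- on the file system.
MSCK REPAIR TABLE exttbl DROP PARTITIONS;

-- Do both in a single pass.
MSCK REPAIR TABLE exttbl SYNC PARTITIONS;
```

Running the statement manually affects only the named table and does not depend on the discover.partitions setting.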
By default, discovery and synchronization of partitions occur every 5 minutes. If you
are upgrading, running these routines this often can degrade performance and delay the
upgrade. Adjust the frequency of these batch routines to hourly or daily, depending on
your use case.
For external partitioned tables and for legacy external tables that are created
using a version of Hive that does not support partition discovery, enable
partition discovery for the table.
ALTER TABLE exttbl SET TBLPROPERTIES ('discover.partitions' = 'true');
In Cloudera Manager, click Clusters > Hive > Configuration, and search for Hive Server Advanced Configuration Snippet
(Safety Valve) for hive-site.xml.
Add the following property and value to hive-site.xml:
Property: metastore.partition.management.task.frequency
Value: 600
The default value of
metastore.partition.management.task.frequency is
300 seconds (5 minutes). Changing this to 600 sets synchronization of
partitions to occur every 10 minutes; the value is expressed in seconds. If you are upgrading
to a cloud environment (AWS, Azure, GCP), consider running discovery and
synchronization once every 24 hours by setting the value to 86400 (seconds).
Decreasing the value of
metastore.partition.management.task.frequency runs discovery and
synchronization more often, which can incur higher cloud costs.
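To confirm that the ALTER TABLE statement in this procedure took effect, the table properties can be inspected. This is a short sketch that reuses the example table name exttbl; substitute your own table name:

```sql
-- Show only the discover.partitions property for the table.
SHOW TBLPROPERTIES exttbl('discover.partitions');

-- Or list every table property and look for discover.partitions
-- in the output.
SHOW TBLPROPERTIES exttbl;
```

If the property is set, the first statement should return the value assigned in the ALTER TABLE statement.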