Metadata Management

This topic describes various knobs you can use to control how Impala manages its metadata in order to improve performance and scalability.

Startup Options for Automatic Invalidation of Metadata

To keep the size of metadata bounded, catalogd periodically scans all the tables and invalidates those not recently used. There are two types of configurations in catalogd.

  • Time-based invalidation with the ‑‑invalidate_tables_timeout_s flag: Catalogd invalidates tables that are not recently used in the specified time period (in seconds). This flag needs to be applied to catalogd only.
  • Memory-based invalidation with the ‑‑invalidate_tables_on_memory_pressure flag: When the memory pressure reaches 60% of JVM heap size after a Java garbage collection in catalogd, Impala invalidates 10% of the least recently used tables. This flag needs to be applied to catalogd only.

Automatic invalidation of metadata provides more stability with lower chances of running out of memory, but the feature could potentially cause performance issues and may require tuning.

Loading Incremental Statistics from Catalog Server

Starting in Impala 3.1, a new configuration setting, ‑‑pull_incremental_statistics, was added and set to true by default. When you start Impala catalogd and impalad coordinators with this setting enabled:

  • Newly created incremental stats will be smaller in size thus reducing memory pressure on the catalogd daemon. Your users can keep more tables and partitions in the same catalog and have lower chances of crashing catalogd due to out-of-memory issues.
  • Incremental stats will not be replicated to impalad and will be accessed on demand from catalogd, resulting in a reduced memory footprint of impalad.

We do not recommend you change the default setting of ‑‑pull_incremental_statistics.

Automatic Metadata Sync using Hive Metastore Notification Events

When this feature is enabled, catalogd polls Hive Metastore (HMS) notifications events at a configurable interval and processes the following changes:

  • Invalidates the tables when it receives the ALTER TABLE events or the ALTER, ADD, or DROP their partitions.
  • Adds the tables or databases when it receives the CREATE TABLE or CREATE DATABASE events.
  • Removes the tables from catalogd when it receives the DROP TABLE or DROP DATABASE events.

This feature is controlled by the ‑‑hms_event_polling_interval_s flag. Start the catalogd with the ‑‑hms_event_polling_interval_s flag set to a non-zero value to enable the feature and set the polling frequency in seconds. We recommend the value to be less than 5 seconds.

The following use cases are not supported:

  • The operations that do not generate events in HMS, such as adding new data to existing tables/partitions from Spark, are not supported.
  • Adding data from one Impala cluster to existing tables/partitions will not synced to another Impala cluster.

    Only new tables and partitions are synced.

  • The ALTER DATABASE events are not supported and currently ignored.

This feature is turned off by default with the ‑‑hms_event_polling_interval_s flag set to 0.

Configure HMS for Event Based Automatic Metadata Sync

As the first step to use the HMS event based metadata sync, enable and configure HMS notifications in Cloudera Manager.

  1. In the Hive cluster, navigate to Configuration > Filters > SCOPE > Hive Metastore Server.
  2. Select Enable Stored Notifications in Database.
  3. In Hive Metastore Server Advanced Configuration Snippet (Safety Valve) for hive-site.xml, click + to expand and enter the following:
    • Name: hive.metastore.notifications.add.thrift.objects
    • Value: true
    • Name: hive.metastore.alter.notifications.basic
    • Value: false
  4. Click Save Changes.
  5. Restart Hive.

Disable Event Based Automatic Metadata Sync

When the ‑‑hms_event_polling_interval_s flag is set to a non-zero value for your catalogd, the event-based automatic invalidation is enabled for all databases and tables. If you wish to have the fine-grained control on which tables or databases need to be synced using events, you can use the impala.disableHmsSync property to disable the event processing at the table or database level.

When you add the DBPROPERTIES or TBLPROPERTIES with the impala.disableHmsSync key, the HMS event based sync is turned on or off. The value of the impala.disableHmsSync property determines if the event processing needs to be disabled for a particular table or database.

  • If 'impala.disableHmsSync'='true', the events for that table or database are ignored and not synced with HMS.
  • If 'impala.disableHmsSync'='false' or if impala.disableHmsSync is not set, the automatic sync with HMS is enabled if the ‑‑hms_event_polling_interval_s global flag is set to non-zero.
  • To disable the event based HMS sync for a new database, set the impala.disableHmsSync database properties in Hive as currently, Impala does not support setting database properties:
    CREATE DATABASE <name> WITH DBPROPERTIES ('impala.disableHmsSync'='true');
  • To enable or disable the event based HMS sync for a table:
    CREATE TABLE <name> WITH TBLPROPERTIES ('impala.disableHmsSync'='true' | 'false');
  • To change the event based HMS sync at the table level:
    ALTER TABLE <name> WITH TBLPROPERTIES ('impala.disableHmsSync'='true' | 'false');

When both table and database level properties are set, the table level property takes precedence. If the table level property is not set, then the database level property is used to evaluate if the event needs to be processed or not.

If the property is changed from true (meaning events are skipped) to false (meaning events are not skipped), you need to issue a manual INVALIDATE METADATA command to reset event processor because it doesn't know how many events have been skipped in the past and cannot know if the object in the event is the latest. In such a case, the status of the event processor changes to NEEDS_INVALIDATE.

Metrics for Event Based Automatic Metadata Sync

You can use the web UI of the catalogd to check the state of the automatic invalidate event processor.

By default, the debug web UI of catalogd is at http://impala-server-hostname:25020 (non-secure cluster) or https://impala-server-hostname:25020 (secure cluster).

Under the web UI, there are two pages that presents the metrics for HMS event processor that is responsible for the event based automatic metadata sync.
  • /metrics#events
  • /events

    This provides a detailed view of the metrics of the event processor, including min, max, mean, median, of the durations and rate metrics for all the counters listed on the /metrics#events page.

/metrics#events Page

The /metrics#events page provides the following metrics about the HMS event processor.

Name Description
events-processor.avg-events-fetch-duration Average duration to fetch a batch of events and process it.
events-processor.avg-events-process-duration Average time taken to process a batch of events received from metastore.
events-processor.events-received Total number of metastore events received.
events-processor.events-received-15min-rate Exponentially weighted moving average (EWMA) of number of events received in last 15 min.

This rate of events can be used to determine if there are spikes in event processor activity during certain hours of the day.

events-processor.events-received-1min-rate Exponentially weighted moving average (EWMA) of number of events received in last 1 min.

This rate of events can be used to determine if there are spikes in event processor activity during certain hours of the day.

events-processor.events-received-5min-rate Exponentially weighted moving average (EWMA) of number of events received in last 5 min.

This rate of events can be used to determine if there are spikes in event processor activity during certain hours of the day.

events-processor.events-skipped Total number of metastore events skipped.
Events can be skipped based on certain flags are table and database level. You can use this metric to make decisions, such as:
  • If most of the events are being skipped, see if you might just turn off the event processing.
  • If most of the events are not skipped, see if you need to add flags on certain databases.
events-processor.status Metastore event processor status to see if there are events being received or not. Possible states are:
  • ACTIVE

    The event processor is scheduled at a given frequency.

  • ERROR

    The event processor is in error state and event processing has stopped.

  • NEEDS_INVALIDATE

    The event processor could not resolve certain events and needs a manual INVALIDATE command to reset the state.

  • STOPPED

    The event processor is not configured to run.