Automatic Invalidation/Refresh of Metadata
In this release, you can invalidate or refresh metadata automatically after changes to databases, tables or partitions render metadata stale. You control the synching of tables or database metadata by basing the process on events. You learn how to access metrics and state information about the invalidate event processor.
When tools such as Hive and Spark are used to process the raw data
ingested into Hive tables, new HMS metadata (database, tables, partitions)
and filesystem metadata (new files in existing partitions/tables) are
generated. In previous versions of Impala, in order to pick up this new
information, Impala users needed to manually issue an
INVALIDATE or REFRESH commands.
When automatic invalidate/refresh of metadata is enabled,, the Catalog Server polls Hive Metastore (HMS) notification events at a configurable interval and automatically applies the changes to Impala catalog.
Impala Catalog Server polls and processes the following changes.
-
Invalidates the tables when it receives the
ALTER TABLEevent. -
Refreshes the partition when it receives the
ALTER,ADD, orDROPpartitions. -
Adds the tables or databases when it receives the
CREATE TABLEorCREATE DATABASEevents. -
Removes the tables from
catalogdwhen it receives theDROP TABLEorDROP DATABASEevents. -
Refreshes the table and partitions when it receives the
INSERTevents.If the table is not loaded at the time of processing the
INSERTevent, the event processor does not need to refresh the table and skips it. -
Changes the database and updates
catalogdwhen it receives theALTER DATABASEevents. The following changes are supported. This event does not invalidate the tables in the database.- Change the database properties
- Change the comment on the database
- Change the owner of the database
-
Change the default location of the database
Changing the default location of the database does not move the tables of that database to the new location. Only the new tables which are created subsequently use the default location of the database in case it is not provided in the create table statement.
This feature is controlled by the ‑‑hms_event_polling_interval_s
flag. Start the catalogd with the
‑‑hms_event_polling_interval_s flag set to a positive integer to
enable the feature and set the polling frequency in seconds. We recommend the value to be
less than 5 seconds.
The following use cases are not supported:
-
When you bypass HMS and add or remove data into table by adding files directly on the
filesystem, HMS does not generate the
INSERTevent, and the event processor will not invalidate the corresponding table or refresh the corresponding partition.It is recommended that you use the
LOAD DATAcommand to do the data load in such cases, so that event processor can act on the events generated by theLOADcommand. -
The Spark API that saves data to a specified location does not generate events in HMS,
thus is not supported. For example:
Seq((1, 2)).toDF("i", "j").write.save("/user/hive/warehouse/spark_etl.db/customers/date=01012019")
Disable Event Based Automatic Metadata Sync
When the ‑‑hms_event_polling_interval_s flag is set to a non-zero
value for your catalogd, the event-based automatic invalidation is
enabled for all databases and tables. If you wish to have the fine-grained control on
which tables or databases need to be synced using events, you can use the
impala.disableHmsSync property to disable the event processing at the
table or database level.
This feature can be turned off by setting the
‑‑hms_event_polling_interval_s flag set to 0.
When you add the DBPROPERTIES or TBLPROPERTIES with
the impala.disableHmsSync key, the HMS event based sync is turned on or
off. The value of the impala.disableHmsSync property determines if the
event processing needs to be disabled for a particular table or database.
-
If
'impala.disableHmsSync'='true', the events for that table or database are ignored and not synced with HMS. -
If
'impala.disableHmsSync'='false'or ifimpala.disableHmsSyncis not set, the automatic sync with HMS is enabled if the‑‑hms_event_polling_interval_sglobal flag is set to non-zero.
-
To disable the event based HMS sync for a new database, set the
impala.disableHmsSyncdatabase properties in Hive as currently, Impala does not support setting database properties:CREATE DATABASE <name> WITH DBPROPERTIES ('impala.disableHmsSync'='true'); -
To enable or disable the event based HMS sync for a table:
CREATE TABLE <name> ... TBLPROPERTIES ('impala.disableHmsSync'='true' | 'false'); -
To change the event based HMS sync at the table level:
ALTER TABLE <name> SET TBLPROPERTIES ('impala.disableHmsSync'='true' | 'false');
When both table and database level properties are set, the table level property takes precedence. If the table level property is not set, then the database level property is used to evaluate if the event needs to be processed or not.
If the property is changed from true (meaning events are skipped) to
false (meaning events are not skipped), you need to issue a manual
INVALIDATE METADATA command to reset event processor because it doesn't
know how many events have been skipped in the past and cannot know if the object in the
event is the latest. In such a case, the status of the event processor changes to
NEEDS_INVALIDATE.
Metrics for Event Based Automatic Metadata Sync
You can use the web UI of the catalogd to check the state of the
automatic invalidate event processor.
By default, the debug web UI of catalogd is at
http://impala-server-hostname:25020 (non-secure
cluster) or https://impala-server-hostname:25020
(secure cluster).
- /metrics#events
-
/events
This provides a detailed view of the metrics of the event processor, including min, max, mean, median, of the durations and rate metrics for all the counters listed on the /metrics#events page.
The /metrics#events page provides the following metrics about the HMS event processor.
| Name | Description |
|---|---|
| events-processor.avg-events-fetch-duration | Average duration to fetch a batch of events and process it. |
| events-processor.avg-events-process-duration | Average time taken to process a batch of events received from the Metastore. |
| events-processor.events-received | Total number of the Metastore events received. |
| events-processor.events-received-15min-rate |
Exponentially weighted moving average (EWMA) of number of events received in
last 15 min.
This rate of events can be used to determine if there are spikes in event processor activity during certain hours of the day. |
| events-processor.events-received-1min-rate |
Exponentially weighted moving average (EWMA) of number of events received in
last 1 min.
This rate of events can be used to determine if there are spikes in event processor activity during certain hours of the day. |
| events-processor.events-received-5min-rate |
Exponentially weighted moving average (EWMA) of number of events received in
last 5 min.
This rate of events can be used to determine if there are spikes in event processor activity during certain hours of the day. |
| events-processor.events-skipped |
Total number of the Metastore events skipped.
Events can be skipped based on certain flags are table and database level. You
can use this metric to make decisions, such as:
|
| events-processor.status |
Metastore event processor status to see if there are events being received or
not. Possible states are:
|
