Ignore or Prune pattern to filter Hive metadata entities
Atlas supports metadata and lineage updates from services like HBase, Hive, Impala, and Spark.
These updates are in the form of messages that are posted by these services. The messages contain Atlas entities specific to the service. The notification processing module within Atlas processes these messages.
Typically, most of the metadata is tracked. Sometimes, a part of the schema changes more often than not and tracking these frequent changes creates metadata that is insignificant. The Atlas notification processing system gets overloaded with the frequently changing schema updates. The resultant outcome might be that the low-value messages are processed at the expense of messages that contain critical schema updates.
To overcome such a pattern within a data processing pipeline, you can employ a couple of options:
- Ignore schema updates.
- Preserve an abbreviated form of the entity.
The Ignore and Prune feature within Atlas addresses this scenario for Hive Metastore and Hive Server2 (HS2) hooks. This feature is a mechanism to specify which Hive tables should be ignored and which ones should be pruned. This feature helps regulate data that is posted to Atlas. The user is able to choose data that is important for metadata management and lineage capture.
Tables whose lifecycle is of no consequence are targeted for being ignored. Tables whose lifecycle need not be tracked closely or for garnering minute details are targeted for pruning.
Use case
As a part of the Extract/Transform/Load (ETL) data pipeline, services such as Hive use a number of temporary and/or staging tables that are short-lived. These temporary and/or staging tables are generally employed during the extract or transform phase before the data is loaded. Once the processing is complete, these tables are not used anymore and are deleted.
With Atlas Hive Hook enabled, Atlas captures metadata events, lifecycle, and lineage of all the Hive entities.
Temporary tables that are created only to aid the development process are safe to be ignored. Metadata for these tables are not generated or reported into Atlas.
For staging tables, tracking details like columns and column-lineage in Atlas may not be useful. By not tracking the information in Atlas, it can significantly reduce the time it takes to process notification and can help the overall performance of Atlas.
You can ignore temporary tables completely. Just the minimum details of these staging tables can be stored in Atlas, to capture data lineage from source to target table through all the intermediate staging tables.
Setting Ignore and Prune Properties
The ignore and prune configuration properties can be set both at Atlas server-side and Hive hooks configuration.
Setting it at Hive Hook side prevents Atlas’ metadata from being generated.
If the metadata for ignored and pruned elements is generated and posted on Atlas’ Kafka topic, setting this property on Atlas’ server side handles these elements before they get stored within Atlas.
Both these properties accept Java regex expressions. For more information, see documentation.