Using Ignore and Prune patterns
You can configure both Ignore and Prune patterns to manage your data.
Using the Ignore pattern
Atlas ignores temporary managed tables by default. But an external temporary table is captured because the table uses the HDFS path for storage and Atlas creates a lineage in between.
To disregard the temporary table and avoid Atlas processing it, you can set up appropriate configurations in Hive and Atlas and later restart the services.
For example, if all tables in the ‘sales’ database and the tables that contain ‘_tmp’ in the ‘finance’ database should be ignored, the property can be set as follows in your Cloudera Manager instance.
Hive Metastore Server and Hive settings:
atlas.hook.hive.hive_table.ignore.pattern=finance\..*_tmp.*,sales\..*
is set in Cloudera Manager Hive Service Advanced Configuration Snippet
(Safety Valve) for atlas-application properties in Hive(HMS) and Hiveserver2.
Atlas server
atlas.notification.consumer.preprocess.hive_table.ignore.pattern=finance\..*_tmp.*
With the above configurations, tables having _tmp
in
their names, in the finance
database are ignored.
Using the Prune pattern
Staging tables are created to hold data temporarily during a query execution and are manually dropped once the processing is completed. It might be insignificant to track the details of the staging tables.
For example, in the below images, the finance table contains 333 columns (column names blurred) and the staging tables are created frequently by running an “INSERT OVERWRITE TABLE” query on the finance table. Processing is executed on the data in staging tables and later the staging tables are deleted as observed in the table level lineage diagram.
Every time you run a query to create lineage between tables, column-level lineage is also created along with table-level lineage.