Apache Atlas metadata collection overview
Actions performed in cluster services create metadata in Atlas.
Atlas provides add-ons for many Hadoop cluster services that collect metadata when the service performs certain operations. The Atlas add-on, or “hook”, assembles a predefined set of information and sends it to the Atlas server. The Atlas server reads through the metadata and creates entities to represent the data sets and processes described by the metadata. Atlas may create one or many entities for each event it processes. For example, when a user creates a namespace in HBase, Atlas creates a single entity to represent the new HBase namespace. When a user runs a query in HiveServer, Atlas may create many entities, including entities that describe the query itself, each table involved in the query, each column of each of those tables, and so on.
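The one-event-to-one-or-many-entities behavior described above can be illustrated with a small toy model. This is only a sketch, not Atlas code: `HookMessage`, `entities_for`, the `"query"` action name, and the in-memory queue standing in for the channel between hook and server are all hypothetical names chosen for the example. Only the entity type names (`hbase_namespace`, `hive_process`, `hive_table`, `hive_column`) come from this page.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from queue import Queue


@dataclass
class HookMessage:
    """Hypothetical stand-in for the predefined information a hook assembles."""
    source: str
    action: str
    details: dict = field(default_factory=dict)


def entities_for(message: HookMessage) -> list[tuple[str, str]]:
    """Return (entity_type, name) pairs a server might create for one message."""
    d = message.details
    if message.source == "HBase" and message.action == "create_namespace":
        # One event, one entity.
        return [("hbase_namespace", d["namespace"])]
    if message.source == "HiveServer" and message.action == "query":
        # One event, many entities: the query, its tables, and their columns.
        entities = [("hive_process", d["query"])]
        for table, columns in d["tables"].items():
            entities.append(("hive_table", table))
            entities.extend(("hive_column", f"{table}.{col}") for col in columns)
        return entities
    # Events this toy model does not know about produce no entities.
    return []


# The queue is an assumed placeholder for the hook-to-server channel.
channel: Queue = Queue()
channel.put(HookMessage("HBase", "create_namespace", {"namespace": "sales"}))
channel.put(HookMessage("HiveServer", "query", {
    "query": "INSERT INTO t2 SELECT a, b FROM t1",
    "tables": {"t1": ["a", "b"], "t2": ["a", "b"]},
}))

while not channel.empty():
    msg = channel.get()
    created = entities_for(msg)
    print(f"{msg.source} {msg.action}: {len(created)} entities created")
```

In this model the HBase message yields a single entity, while the Hive message yields one process entity plus one entity per table and per column.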
The following table lists the services that are integrated with Atlas by default. For each service, the table lists the events produced by the service that Atlas acknowledges and the entities Atlas produces in response to each event. Note that there isn’t always a one-to-one relationship between the event and an entity: the entities produced from a single event depend on the event itself.
Source | Actions Acknowledged | Entities Created/Updated |
---|---|---|
HiveServer | ALTER DATABASE, DROP DATABASE | hive_db, hive_db_ddl |
HiveServer | ALTER TABLE | hive_process, hive_process_execution, hive_table, hive_table_ddl, hive_column, hive_column_lineage, hive_storagedesc, hdfs_path |
HiveServer | ALTER VIEW | hive_process, hive_process_execution, hive_table, hive_column, hive_column_lineage, hive_table_ddl |
HiveServer | INSERT INTO (SELECT) | hive_process, hive_process_execution |
HBase | alter_async | hbase_namespace, hbase_table, hbase_column_family |
HBase | create_namespace, alter_namespace, drop_namespace | hbase_namespace |
HBase | create table | hbase_table, hbase_column_family |
HBase | alter table (create column family), alter table (alter column family), alter table (delete column family) | hbase_table, hbase_column_family |
Impala* | CREATETABLE_AS_SELECT | impala_process, impala_process_execution, impala_column_lineage, hive_db, hive_table_ddl |
Impala* | CREATEVIEW | impala_process, impala_process_execution, impala_column_lineage, hive_table_ddl |
Impala* | ALTERVIEW_AS_SELECT | impala_process, impala_process_execution, impala_column_lineage, hive_table_ddl |
Impala* | INSERT INTO | impala_process |
Spark* | CREATE TABLE USING | spark_process |
Spark* | CREATE VIEW AS SELECT | spark_process |
Spark* | INSERT INTO (SELECT) | spark_process |
*For these sources, Atlas collects the corresponding asset metadata from HMS (the Hive Metastore). Atlas reconciles the entity metadata it receives in Kafka messages from each source.
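For readers who want to work with this mapping programmatically, the table can be transcribed into a simple lookup. The sketch below is an illustration only: `ROWS`, `LOOKUP`, `HMS_BACKED`, and `entity_types` are hypothetical names, not part of any Atlas API, and the data is copied from the table as listed on this page rather than read from an Atlas server.

```python
# Each row mirrors one table row: (source, actions, entity types).
ROWS = [
    ("HiveServer", ["ALTER DATABASE", "DROP DATABASE"], ["hive_db", "hive_db_ddl"]),
    ("HiveServer", ["ALTER TABLE"],
     ["hive_process", "hive_process_execution", "hive_table", "hive_table_ddl",
      "hive_column", "hive_column_lineage", "hive_storagedesc", "hdfs_path"]),
    ("HiveServer", ["ALTER VIEW"],
     ["hive_process", "hive_process_execution", "hive_table", "hive_column",
      "hive_column_lineage", "hive_table_ddl"]),
    ("HiveServer", ["INSERT INTO (SELECT)"], ["hive_process", "hive_process_execution"]),
    ("HBase", ["alter_async"], ["hbase_namespace", "hbase_table", "hbase_column_family"]),
    ("HBase", ["create_namespace", "alter_namespace", "drop_namespace"],
     ["hbase_namespace"]),
    ("HBase", ["create table"], ["hbase_table", "hbase_column_family"]),
    ("HBase", ["alter table (create column family)",
               "alter table (alter column family)",
               "alter table (delete column family)"],
     ["hbase_table", "hbase_column_family"]),
    ("Impala", ["CREATETABLE_AS_SELECT"],
     ["impala_process", "impala_process_execution", "impala_column_lineage",
      "hive_db", "hive_table_ddl"]),
    ("Impala", ["CREATEVIEW"],
     ["impala_process", "impala_process_execution", "impala_column_lineage",
      "hive_table_ddl"]),
    ("Impala", ["ALTERVIEW_AS_SELECT"],
     ["impala_process", "impala_process_execution", "impala_column_lineage",
      "hive_table_ddl"]),
    ("Impala", ["INSERT INTO"], ["impala_process"]),  # as listed in the table
    ("Spark", ["CREATE TABLE USING"], ["spark_process"]),
    ("Spark", ["CREATE VIEW AS SELECT"], ["spark_process"]),
    ("Spark", ["INSERT INTO (SELECT)"], ["spark_process"]),
]

# Sources marked with an asterisk: asset metadata also comes from HMS.
HMS_BACKED = {"Impala", "Spark"}

# Flatten the rows so that every (source, action) pair has its own key.
LOOKUP = {
    (source, action): entities
    for source, actions, entities in ROWS
    for action in actions
}


def entity_types(source: str, action: str) -> list:
    """Return the entity types listed for an action, or [] if it is not listed."""
    return LOOKUP.get((source, action), [])


print(entity_types("HBase", "create_namespace"))
print(entity_types("Spark", "CREATE TABLE USING"), "Spark" in HMS_BACKED)
```

Keeping the rows grouped the same way as the table makes it easy to compare the two by eye; the flattened `LOOKUP` is derived from them so the data is entered only once.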