Apache Atlas metadata collection overview

Actions performed in cluster services create metadata in Atlas.

Atlas provides addons to many Hadoop cluster services to collect metadata when the service performs certain operations. The Atlas addon or “hook” assembles a predefined set of information and sends it to the Atlas server. The Atlas server reads through the metadata and creates entities to represent the data sets and processes described by the metadata. Atlas may create one or many entities for each event it processes. For example, when a user creates a namespace in HBase, Atlas creates a single entity to represent the new HBase namespace. When a user runs a query in HiveServer, Atlas may create many entities, including entities to describe the query itself, any tables involved in the query, entities for each column for each table involved in the query, and so on.

The following table lists the services that are integrated with Atlas by default. For each service, the table lists the events produced by the service that Atlas acknowledges and the entities Atlas produces in response to each event. Note that there isn’t always a one-to-one relationship between the event and an entity: the entities produced from a single event depend on the event itself.

Source Actions Acknowledged Entities Created/Updated
HiveServer

ALTER DATABASE
CREATE DATABASE

DROP DATABASE

hive_db, hive_db_ddl

ALTER TABLE
CREATE TABLE
CREATE TABLE AS SELECT
DROP TABLE

hive_process, hive_process_execution, hive_table, hive_table_ddl, hive_column, hive_column_lineage, hive_storagedesc, hdfs_path

ALTER VIEW
ALTERVIEW_AS_SELECT
CREATE VIEW
CREATE VIEW AS SELECT
DROP VIEW

hive_process, hive_process_execution, hive_table, hive_column, hive_column_lineage, hive_table_ddl

INSERT INTO (SELECT)
INSERT OVERWRITE

hive_process, hive_process_execution
HBase alter_async hbase_namespace, hbase_table, hbase_column_family
create_namespace alter_namespace drop_namespace

hbase_namespace

create table
alter table
drop table
drop_all tables

hbase_table, hbase_column_family
alter table (create column family) alter table (alter column family) alter table (delete column family) hbase_table, hbase_column_family

Impala*

CREATETABLE_AS_SELECT impala_process, impala_process_execution, impala_column_lineage, hive_db hive_table_ddl
CREATEVIEW impala_process, impala_process_execution, impala_column_lineage, hive_table_ddl
ALTERVIEW_AS_SELECT impala_process, impala_process_execution, impala_column_lineage, hive_table_ddl

INSERT INTO
INSERT
OVERWRITE

impala_process,
impala_process_execution

Spark*

CREATE TABLE USING
CREATE TABLE AS SELECT,
CREATE TABLE USING ... AS SELECT

spark_process
CREATE VIEW AS SELECT, spark_process

INSERT INTO (SELECT),
LOAD DATA [LOCAL] INPATH

spark_process

*For these sources, Atlas collects the corresponding asset metadata from HMS. Atlas reconciles the entity metadata received from Kafka messages from each source.