Lineage overview

Atlas lineage helps you understand where data comes from and what a change to it affects, over time and across all your data.

Lineage information helps you understand the origin of data and the transformations it may have gone through before arriving in a file or table. In Atlas, if transformations occurred in services that provide process metadata, a lineage graph shows how data in a given column was generated. When a column appears in the output of a process, Atlas reads the content of the process and links the input column or columns to the output asset. This relationship is stored as a vertex in Atlas’s graph database and is displayed as a lineage graph in the details of each entity.
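As an illustration of how a process links input columns to an output asset, the following Python sketch walks a small lineage graph upstream from one column. The payload is hypothetical: its field names (guidEntityMap, relations, fromEntityId, toEntityId) are assumed to follow the shape of an Atlas lineage response, and the GUIDs, type names, and the upstream helper are invented for this example. It shows the idea only and is not how Atlas itself stores or resolves lineage.

```python
from collections import deque

# Hypothetical lineage payload. Field names are assumed to mirror an Atlas
# lineage response; the GUIDs and entities are made up for illustration.
lineage = {
    "baseEntityGuid": "col-out",
    "guidEntityMap": {
        "col-a": {"typeName": "hive_column", "displayText": "orders.price"},
        "col-b": {"typeName": "hive_column", "displayText": "orders.qty"},
        "proc-1": {"typeName": "hive_column_lineage", "displayText": "build sales"},
        "col-out": {"typeName": "hive_column", "displayText": "sales.total"},
    },
    # Each relation is a directed link: input -> process, or process -> output.
    "relations": [
        {"fromEntityId": "col-a", "toEntityId": "proc-1"},
        {"fromEntityId": "col-b", "toEntityId": "proc-1"},
        {"fromEntityId": "proc-1", "toEntityId": "col-out"},
    ],
}


def upstream(lineage, guid):
    """Return the GUIDs of every entity that feeds into `guid`, nearest first."""
    # Index the relations by their target so each step can look up its inputs.
    parents = {}
    for rel in lineage["relations"]:
        parents.setdefault(rel["toEntityId"], []).append(rel["fromEntityId"])

    # Breadth-first walk against the direction of the data flow.
    seen, order, queue = {guid}, [], deque([guid])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# List the process and the input columns behind the output column.
for entity_guid in upstream(lineage, lineage["baseEntityGuid"]):
    entity = lineage["guidEntityMap"][entity_guid]
    print(entity["typeName"], entity["displayText"])
```

Reversing the index (keying on fromEntityId instead of toEntityId) would give the downstream, or impact, view of the same graph.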

By default, Atlas can collect lineage information from the following sources:

  • HiveServer2
  • Impala
  • Spark

The lineage metadata produced by these sources may refer to additional asset entities. For example, when a Hive operation moves data from a Hive table to an HBase table or an HDFS file, Atlas includes an entity to represent the HBase table or HDFS file, and the entity is connected to the Hive table through lineage. Assets from the following services may appear in lineage graphs when they are referenced:

  • HBase
  • HDFS
  • S3

Data flow lineage from Cloudera Flow Management (NiFi) can also be included by configuring the appropriate reporting task.