HDFS lineage data extraction in Atlas

Atlas supports the HDFS lineage data extraction mechanism. Learn how it works, which HDFS shell script commands are supported, how to use these commands, and what the supported HDFS entities and hierarchies are.

Previously, HDFS data extraction captured only partial data. Data stewards and platform owners track the lineage of their ingest and ETL processes, which depend on custom programs that copy and convert data when moving it from one directory in HDFS to another.

The data lands in a staging directory in HDFS and is later moved, periodically, to an input directory in HDFS. Hive external tables are built on top of the input directory, and Atlas does not support capturing HDFS locations unless a Hive external table has been created on top of the HDFS location(s). Typically, the move from the staging directory to the input directory is captured in Atlas lineage. However, the connectivity between the staging and input directories does not generate lineage automatically.

With the enhancement, lineage is automatically captured and connected for the data ingest processes. This allows the data architect and data steward to understand and govern data from ingest, through ETL, to consumption.

How HDFS lineage extraction works

The enhanced implementation consists of a wrapper around the HDFS commands that creates new lineage for filesystem operations.

The lineage captures the directory or file specified in the scripting tool. It also creates a parent-child relationship between directories or files and their contents, up to the NameNode.
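The parent-child hierarchy described above can be pictured with a small Python sketch. It is only an illustration, not Atlas code: the function names `hierarchy_chain` and `parent_child_pairs` are hypothetical, and the filesystem root `/` stands in for the NameNode level as an assumption.

```python
from pathlib import PurePosixPath


def hierarchy_chain(hdfs_path):
    """Return the path followed by each ancestor, child first, ending at the root."""
    path = PurePosixPath(hdfs_path)
    return [str(path)] + [str(parent) for parent in path.parents]


def parent_child_pairs(hdfs_path):
    """Return (parent, child) pairs for every level of the hierarchy."""
    chain = hierarchy_chain(hdfs_path)
    # Pair each entry with the entry one level above it.
    return list(zip(chain[1:], chain[:-1]))


if __name__ == "__main__":
    # Hypothetical example path, used only for illustration.
    for parent, child in parent_child_pairs("/data/staging/file.csv"):
        print(f"{parent} -> {child}")
```

Each pair corresponds to one parent-child link that a lineage tool could record between a directory and its contents.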

A filtering mechanism is also provided to exclude or include paths and directories.
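As a rough illustration of include and exclude filtering, the following Python sketch decides whether a path would be tracked. It does not reflect how Atlas configures or evaluates its filters: the function `is_tracked`, the glob-style patterns, and the rule that exclude patterns take precedence are all assumptions made for the example.

```python
from fnmatch import fnmatchcase


def is_tracked(path, include=("*",), exclude=()):
    """Return True if the path matches an include pattern and no exclude pattern.

    Assumption for this sketch: exclude patterns win over include patterns.
    """
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    return any(fnmatchcase(path, pattern) for pattern in include)


if __name__ == "__main__":
    # Hypothetical patterns and paths, used only for illustration.
    include = ("/data/*",)
    exclude = ("*/_tmp/*",)
    for candidate in ("/data/input/a.csv", "/data/staging/_tmp/part-0", "/user/x"):
        print(candidate, is_tracked(candidate, include, exclude))
```

Giving exclusion priority keeps temporary or scratch directories out of the lineage even when they sit under an included parent directory.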

Currently, the following HDFS shell script commands are supported:

  • -put and -copyFromLocal - Copies the contents from the local machine into HDFS.
  • -cp - Copies the contents from one path to another HDFS path.
  • -mv - Moves the contents from one path to another HDFS path.
  • -rm - Removes the contents from HDFS.
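To make the roles of source and target in these commands concrete, here is a minimal Python sketch that splits a command's arguments into an operation, its sources, and its target. It is a hypothetical helper, not part of Atlas or HDFS: the name `describe_command` is an assumption, and command options such as `-f` or `-r` are deliberately not handled.

```python
# Supported commands and whether they take a trailing destination argument.
SUPPORTED = {
    "-put": True,
    "-copyFromLocal": True,
    "-cp": True,
    "-mv": True,
    "-rm": False,
}


def describe_command(args):
    """Split e.g. ['-cp', src, dst] into operation, sources, and target.

    Simplification for this sketch: flags such as -f or -r are not parsed.
    """
    if not args or args[0] not in SUPPORTED:
        raise ValueError(f"unsupported command: {args[:1]}")
    command, *paths = args
    if not SUPPORTED[command]:
        # -rm has only paths to remove and no destination.
        if not paths:
            raise ValueError(f"{command} needs at least one path")
        return {"operation": command, "sources": paths, "target": None}
    if len(paths) < 2:
        raise ValueError(f"{command} needs a source and a destination")
    *sources, target = paths
    return {"operation": command, "sources": sources, "target": target}


if __name__ == "__main__":
    # Hypothetical paths, used only for illustration.
    print(describe_command(["-mv", "/data/staging/a.csv", "/data/input"]))
```

For -put and -copyFromLocal the sources are local paths, while for -cp and -mv both sides are HDFS paths, so a lineage record built from this split would link the sources to the target accordingly.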