Inclusion and exclusion operation for HDFS files

The content or the files available in HDFS are either captured or ignored as lineage in Atlas. This scenario is achieved by including and excluding files and folders that Atlas checks for creating a lineage.

The hdfs.conf file contains the configuration for inclusion and exclusion options as described in Prerequisites for HDFS lineage extraction.

Some use cases where the including and excluding of HDFS files can be achieved are as follows

  • When a simple file is created in the “whitelist” folder, a lineage is created in Atlas.

./hdfs-lineage.sh -put ~/sample.txt /tmp/whitelist/sample2.txt

  • Similarly, when you try to create a file in the “blacklist” folder, Atlas does not create a lineage.

./hdfs-lineage.sh -put ~/sample.txt /tmp/blacklist/sample3.txt

  • If a folder is not listed for any specific operation (Whitelist or Blacklist), Atlas does not capture the lineage.

./hdfs-lineage.sh -put ~/sample.txt /tmp/non/folder1/folder1file.txt

For example:

./hdfs-lineage.sh -put ~/sample.txt /tmp/whitelist/blacklist/blacklistfile.txt

  • Atlas captures the lineage of the sub-folder if the parent directory or folder is included.

./hdfs-lineage.sh -put ~/sample.txt /tmp/whitelist/whitesub/whitelistsubfile.txt

  • Atlas does not create a lineage to an excluded folder but a lineage is created to the same excluded folder if there is an included path available.

./hdfs-lineage.sh -put ~/sample.txt /tmp/blacklist/blToWl.txt

./hdfs-lineage.sh -cp /tmp/blacklist/blToWl.txt /tmp/whitelist/blToWl.txt

Atlas captures the lineage when folder(s) are included and later the lineage is captured from the excluded entity.

The included folder takes precedence over the excluded folder, always.

./hdfs-lineage.sh -put ~/sample.txt /tmp/whitelist/wlToBl.txt

./hdfs-lineage.sh -cp /tmp/whitelist/wlToBl.txt /tmp/blacklist/wlToBl.txt