Extracting S3 Metadata using Atlas

Defining what assets to extract metadata for

By default, the Atlas S3 extractor collects metadata for all buckets that are visible through the active IAM policy.

You can narrow this scope with two configuration properties: an include list (whitelist.paths) and an exclude list (blacklist.paths). Each list can name buckets, directory trees, and specific objects. When you include a path in the whitelist, Atlas extracts metadata for that directory object and all objects underneath it.

The include and exclude path lists are comma-separated lists of paths in one or more buckets. Each item in a list can be as general as a bucket name or a high-level directory path, or as specific as a low-level directory or a single object. The paths are formatted as S3 URIs, so they can include repeated slash characters that represent virtual objects in S3.
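As an illustration of the list format described above, the following Python sketch splits a property value into (bucket, key) pairs. It is not part of Atlas: the function name parse_path_list is hypothetical, and trimming whitespace around each item is an assumption of this sketch, not documented Atlas behavior.

```python
S3A_PREFIX = "s3a://"


def parse_path_list(value):
    """Split a comma-separated path list into (bucket, key) pairs.

    Hypothetical helper for illustration only; Atlas may parse the
    property differently.
    """
    entries = []
    for item in value.split(","):
        item = item.strip()  # assumption: surrounding whitespace is ignored
        if not item:
            continue
        if item.startswith(S3A_PREFIX):
            item = item[len(S3A_PREFIX):]
        # Everything before the first slash is the bucket. The rest is the
        # key, kept verbatim so that repeated slashes survive.
        bucket, _, key = item.partition("/")
        entries.append((bucket, key))
    return entries


if __name__ == "__main__":
    value = "s3a://demo-bucket/demo.txt,s3a://acme-finance-set4/"
    for bucket, key in parse_path_list(value):
        print(bucket, repr(key))
```

An empty key means that the item names a whole bucket.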

If the same path appears on both lists, the exclude list takes precedence.
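The precedence rule can be modeled with a short Python sketch. This is a simplified model, not the Atlas implementation. The names covers and is_extracted are hypothetical. The sketch assumes that a list entry covers the path itself and everything underneath it, that an empty include list means "all visible buckets", and that an exclude entry also overrides a more specific include entry beneath it. The source confirms only the case where the same path is on both lists.

```python
def covers(entry, path):
    """Return True if path equals entry or lies underneath it."""
    entry = entry.rstrip("/")
    return path == entry or path.startswith(entry + "/")


def is_extracted(path, include, exclude):
    """Decide whether metadata for path would be collected in this model."""
    # The exclude list is checked first, so it wins on any conflict.
    if any(covers(entry, path) for entry in exclude):
        return False
    # Assumption: with no include list, everything visible is in scope.
    if not include:
        return True
    return any(covers(entry, path) for entry in include)


if __name__ == "__main__":
    include = ["s3a://acme-finance-set4/year-end"]
    exclude = ["s3a://acme-finance-set4/year-end/admin/tools-archive.zip"]
    for path in (
        "s3a://acme-finance-set4/year-end/2019/data/report.csv",
        "s3a://acme-finance-set4/year-end/admin/tools-archive.zip",
    ):
        print(path, is_extracted(path, include, exclude))
```

Checking the exclude list before the include list is what makes the exclude list win when a path is on both.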

The following examples show how to include objects in or exclude objects from metadata collection.

This example includes a single file from the 2018 directory and all files in the data subdirectory for 2019 and 2020:

atlas.s3.extraction.whitelist.paths=s3a://acme-finance-set4/year-end/2018/data/storage/only-this-file.json,s3a://acme-finance-set4/year-end/2019/data,s3a://acme-finance-set4/year-end/2020/data

This example includes a single file from one bucket and all contents of a second bucket:

atlas.s3.extraction.whitelist.paths=s3a://demo-bucket/demo.txt,s3a://acme-finance-set4/

This example excludes two specific files. If no include paths are given, Atlas extracts metadata from all available buckets except for these two files:

atlas.s3.extraction.blacklist.paths=s3a://acme-finance-set4/year-end/admin/tools-archive.zip,s3a://acme-finance-set4/year-end/reporting/template.xls