Running bulk extraction

Bulk extraction takes a snapshot of the metadata for the current version of all objects defined by the allowlist as of the initial start time of the extraction.

If the bulk extraction process runs for a few hours, there may be additions or modifications to AWS S3 content after the start of the extraction. The changes after the extraction start time are not included in the extraction.

The first time bulk extraction runs, it creates entities in Atlas for each S3 bucket and object. If you run bulk extraction again, it will:

  • Update Atlas entities with new information it finds for buckets or objects whose qualifiedNames match existing entities.

  • Create new entities for any new buckets or objects that are not found in Atlas.

The second extractions will NOT:

  • Mark entities deleted in Atlas if the object from the first extraction does not exist at the time of the second extraction.

  • Collect version information for objects other than the current version.

The following command line example runs the bulk extraction. Assuming the mandatory properties are set in the default configuration file, no parameters are required:

/opt/cloudera/parcels/CDH/lib/atlas/extractors/bin/aws-s3-extractor.sh