Extracting S3 Metadata using Atlas

Extractor configuration properties

The following are some of the extractor configuration properties.

Configuration Parameter (prefixed with atlas.s3.extraction) Purpose
.aws.region Required. AWS S3 region for extraction. CDP Public Cloud: provided by IDBroker integration. CDP Data Center: the region must be provided.

.access.key, .secret.key, .session.token AWS S3 credentials. CDP Public Cloud: provided by IDBroker integration. CDP Data Center:

.whitelist.paths Set of data assets to include in metadata extraction. If no paths are included, the extractor extracts metadata for all buckets inside the region that are accessible by the credential used to run the extractor. For public cloud, this is defined in aws-cdp-bucket-access-policy. The list can include buckets or paths; separate multiple values with commas. See Defining what assets to extract metadata for.

.blacklist.paths Set of data assets to exclude from metadata extraction. The list can include buckets or paths; separate multiple values with commas. See Defining what assets to extract metadata for.
.max.object.per.call Number of objects returned in one call to AWS S3 in bulk extraction. Defaults to 1000.
.resume.from.progress.file Enable bulk extraction resume. Disabled by default.
.progress.file Name of the progress file that tracks bulk extraction; it is used if the extraction has to be resumed midway after a failure. This file is created in the Atlas log directory. Defaults to extractor_progress_file.log.
.max.reconnect.count Number of times the extractor tries to reconnect to AWS if credentials expire or AWS S3 returns a retryable exception. Defaults to 2.
.fs.scheme File system scheme used for AWS S3. At this time, S3a is the only option.
.incremental.queueName Required for incremental mode. Name of the AWS SQS queue configured to receive the AWS S3 bucket events.
.incremental.messagesPerRequest The number of messages the extractor will try to get from the SQS queue in one call. Defaults to 10.
.incremental.requestWaitTime How long the extractor waits to collect messages as long as the limit set in .incremental.messagesPerRequest has not been met. Defaults to 20 seconds.
.incremental.max.retry Maximum retry count if the SQS queue doesn't respond during an extractor request. Defaults to 20.
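
To illustrate how the atlas.s3.extraction prefix combines with the parameter names above, the following is a minimal sketch of a properties fragment. The region, bucket names, and queue name are hypothetical placeholders, and the numeric values simply restate the defaults listed above. The exact value syntax for paths and the unit of the wait time are assumptions, not confirmed here.

```properties
# Hypothetical values; each key is the atlas.s3.extraction prefix plus a parameter listed above.
atlas.s3.extraction.aws.region=us-west-2

# Comma-separated buckets to include and exclude (bucket names are placeholders).
atlas.s3.extraction.whitelist.paths=example-bucket-a,example-bucket-b
atlas.s3.extraction.blacklist.paths=example-bucket-c

# Bulk extraction tuning, shown at the documented defaults.
atlas.s3.extraction.max.object.per.call=1000
atlas.s3.extraction.max.reconnect.count=2

# Incremental extraction; the queue name is a placeholder.
atlas.s3.extraction.incremental.queueName=example-atlas-s3-events
atlas.s3.extraction.incremental.messagesPerRequest=10
# Assumed to be expressed in seconds, matching the documented default of 20 seconds.
atlas.s3.extraction.incremental.requestWaitTime=20
atlas.s3.extraction.incremental.max.retry=20
```

Credentials are left out of the sketch because, on CDP Public Cloud, they are provided by IDBroker integration.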