Extraction Configuration

Some of the configurations that must be set-up before you perform the metadata extraction.

The configuration file is located at: /opt/cloudera/parcels/CDH/lib/atlas/extractor/adls.conf

Configuration Parameter Purpoose Default Value
atlas.adls.extraction.account.name ADLS Gen2 storage account which was created as part of Extraction Prerequisites. Mandatory.
atlas.adls.extraction.account.key

ADLS account key if IDBroker is not configured.

To be specified if Knox IDBroker is not configured at CDP.
atlas.adls.extraction.access.token Access token for token based authentication. If Knox IDBroker is not configured at CDP, token based authentication is required. It must be configured.
atlas.adls.extraction.allowlist.paths=abfs://<containername>@<accountname>.dfs.core.windows.net/<path>

Comma separated ABFS paths or patterns from which ADLS metadata (directory, blob) needs to be extracted. Multiple values can be configured by ',' separated.

Example: abfs://testcontainer@teststorageaccount.dfs.core.windows.net/testdir1/

atlas.adls.extraction.denylist.paths=abfs://<containername>@<accountname>.dfs.core.windows.net/<path> Comma separated ABFS paths or patterns from which ADLS metadata should be excluded from extraction. Multiple values can be configured by ',' separated.

Example: abfs://testcontainer@teststorageaccount.dfs.core.windows.net/testdir2/

atlas.adls.extraction.max.blob.per.call Number of blob storage to be fetched in one call to Azure ADLS by bulk extraction. 1000
atlas.adls.extraction.timeout.per.call.in.sec The timeout (seconds), used for each ADLS SDK call wherever it is required. 30
atlas.adls.extraction.resume.from.progress.file Resume from the last run in case of failure feature.

Set to false by default.

Set it to true if resuming an extract.

atlas.adls.extraction.progress.file Progress file used for extraction in case the user wants to resume. adls_extractor_progress_file.props
atlas.adls.extraction.max.reconnect.count

Specify the maximum number of retries to:

  • IDBroker in case of credentials expiry.

  • Retryable exception for Azure ADLS.

atlas.adls.extraction.fs.system File System used in Azure ADLS. Default set to: abfs
atlas.adls.extraction.incremental.queueNames

Azure list of Account:QueueName which is configured as part of Configuring ADLS Gen2 Storage Queue to get the blob/directory create, delete events.

Example: teststorageaccount:testqueue
atlas.adls.extraction.incremental.messagesPerRequest The number of messages Incremental Extractor tries to fetch from ADLS Queue in a single call. Default is 10. It ranges from 1 to 32.
atlas.adls.extraction.incremental.requestWaitTime The wait time in seconds in a single call to ADLS Queue to fetch atlas.adls.extraction.incremental.messagesPerRequest messages. 20
atlas.adls.extraction.incremental.max.retry Maximum retry count in case of Idle while reading Queue Messages in Incremental Extraction. 20
atlas.adls.extraction.incremental.delete.needed.for.rename Does an entity need deletion if it has been renamed to something which should not be created at Atlas due to allow and deny list. false
atlas.notification.hook.asynchronous This setting should be set to "true" only when extracting a large number of adls metadata (directory, blob) where there is a possibility of a lag when publishing messages to ATLAS_HOOK Kafka topic.

Defaults to asynchronous sending of events: true

To set synchronous sending of events: false

(Synchronous)