Extraction Configuration
The following configurations must be set up before you perform the metadata extraction.
The configuration file is located at: /opt/cloudera/parcels/CDH/lib/atlas/extractor/adls.conf
Configuration Parameter | Purpose | Default Value |
---|---|---|
atlas.adls.extraction.account.name | ADLS Gen2 storage account that was created as part of Extraction Prerequisites. | Mandatory. |
atlas.adls.extraction.account.key | ADLS account key, used if IDBroker is not configured. | Specify only if Knox IDBroker is not configured in CDP. |
atlas.adls.extraction.access.token | Access token for token-based authentication. | If Knox IDBroker is not configured in CDP, token-based authentication is required and this parameter must be configured. |
atlas.adls.extraction.allowlist.paths=abfs://<containername>@<accountname>.dfs.core.windows.net/<path> | Comma-separated ABFS paths or patterns from which ADLS metadata (directory, blob) is extracted. Separate multiple values with commas. Example: abfs://testcontainer@teststorageaccount.dfs.core.windows.net/testdir1/ | |
atlas.adls.extraction.denylist.paths=abfs://<containername>@<accountname>.dfs.core.windows.net/<path> | Comma-separated ABFS paths or patterns whose ADLS metadata is excluded from extraction. Separate multiple values with commas. Example: abfs://testcontainer@teststorageaccount.dfs.core.windows.net/testdir2/ | |
atlas.adls.extraction.max.blob.per.call | Number of blobs fetched in one call to Azure ADLS during bulk extraction. | 1000 |
atlas.adls.extraction.timeout.per.call.in.sec | Timeout, in seconds, used for each ADLS SDK call where a timeout is required. | 30 |
atlas.adls.extraction.resume.from.progress.file | Whether to resume from the last run after a failure. | false. Set to true when resuming an extraction. |
atlas.adls.extraction.progress.file | Progress file used by the extraction when the user wants to resume. | adls_extractor_progress_file.props |
atlas.adls.extraction.max.reconnect.count | Specify the maximum number of retries. | |
atlas.adls.extraction.fs.system | File system used in Azure ADLS. | abfs |
atlas.adls.extraction.incremental.queueNames | List of Account:QueueName entries configured as part of Configuring ADLS Gen2 Storage Queue to receive blob and directory create and delete events. Example: teststorageaccount:testqueue | |
atlas.adls.extraction.incremental.messagesPerRequest | Number of messages the incremental extractor tries to fetch from the ADLS queue in a single call. | 10. Valid range: 1 to 32. |
atlas.adls.extraction.incremental.requestWaitTime | Wait time, in seconds, for a single call to the ADLS queue to fetch atlas.adls.extraction.incremental.messagesPerRequest messages. | 20 |
atlas.adls.extraction.incremental.max.retry | Maximum retry count when the extractor is idle while reading queue messages in incremental extraction. | 20 |
atlas.adls.extraction.incremental.delete.needed.for.rename | Whether an entity must be deleted when it is renamed to a path that should not be created in Atlas because of the allowlist and denylist. | false |
atlas.notification.hook.asynchronous | Set to "true" only when extracting a large number of ADLS metadata objects (directory, blob), where publishing messages to the ATLAS_HOOK Kafka topic may lag. | true (asynchronous sending of events). Set to false for synchronous sending. |
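
As an illustration, a minimal adls.conf might look like the sketch below. It assumes the file uses properties-style key=value lines (as the allowlist entry in the table suggests) and accepts '#' comments, which is not confirmed here. The account, container, queue, and path names reuse the example values from the table, and <account-key> is a hypothetical placeholder to replace with your own value. Parameters without a documented default, such as atlas.adls.extraction.max.reconnect.count, are left out.

```properties
# Sketch only: assumes key=value syntax and '#' comments are accepted.

# Storage account created as part of Extraction Prerequisites (example name).
atlas.adls.extraction.account.name=teststorageaccount

# Only needed when Knox IDBroker is not configured; placeholder value.
atlas.adls.extraction.account.key=<account-key>

# Paths to include in and exclude from extraction (example values from the table).
atlas.adls.extraction.allowlist.paths=abfs://testcontainer@teststorageaccount.dfs.core.windows.net/testdir1/
atlas.adls.extraction.denylist.paths=abfs://testcontainer@teststorageaccount.dfs.core.windows.net/testdir2/

# Bulk extraction settings, shown with their documented defaults.
atlas.adls.extraction.max.blob.per.call=1000
atlas.adls.extraction.timeout.per.call.in.sec=30
atlas.adls.extraction.resume.from.progress.file=false
atlas.adls.extraction.progress.file=adls_extractor_progress_file.props
atlas.adls.extraction.fs.system=abfs

# Incremental extraction settings (queue name is the example from the table).
atlas.adls.extraction.incremental.queueNames=teststorageaccount:testqueue
atlas.adls.extraction.incremental.messagesPerRequest=10
atlas.adls.extraction.incremental.requestWaitTime=20
atlas.adls.extraction.incremental.max.retry=20
atlas.adls.extraction.incremental.delete.needed.for.rename=false
```

To resume a failed run, the same file would be reused with atlas.adls.extraction.resume.from.progress.file set to true, so that the extractor picks up from the progress file named above.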