S3 Data Extraction for Navigator
Cloudera Navigator can extract S3 data in several ways, depending on your requirements. This topic describes each extraction method, the extraction settings you can set using Advanced Configuration Snippets in Cloudera Manager, and the IAM policy documents that you must create and attach to the Navigator user in AWS.
Creating IAM Policy Documents
To enable S3 data extraction for Cloudera Navigator, you must create a policy document in AWS and attach it to the AWS user associated with the Cloudera Navigator instance. Each extraction method described in this topic includes the policy document that you create to enable that method.
- Access Policy Language Overview describes the basic elements used in policies.
- Creating a New Policy describes how you can create an access policy document. Use the instructions in Edit a policy using the policy editor to create a policy by copying the policy text defined in each extraction type section and pasting it in the policy editor.
- For additional information and walkthroughs, see the various topics in Managing Access Permissions to Your Amazon S3 Resources.
Bulk and Incremental Extraction
By default, Navigator uses combined bulk and incremental extraction. The first extraction is a bulk extraction; all subsequent extractions are incremental.
AWS resources required:
- SQS queue in each region in which you have buckets
- S3 event notification for each bucket
Advantages:
- Highest performance of all extraction types.
- Least expensive in terms of API cost.
- Easiest to use; no additional Navigator or Cloudera Manager setup is required.
Disadvantages:
- You cannot use this method for any buckets that have existing S3 event notification.
- You must obtain additional permissions on AWS.
- Navigator changes your AWS environment. SQS queues are created in all regions that have buckets, and event notification is updated on all buckets.
Policy Document - Bulk and Incremental Extraction
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1481678612000",
      "Effect": "Allow",
      "Action": [
        "sqs:CreateQueue",
        "sqs:DeleteMessage",
        "sqs:DeleteMessageBatch",
        "sqs:GetQueueAttributes",
        "sqs:GetQueueUrl",
        "sqs:ReceiveMessage",
        "sqs:SetQueueAttributes"
      ],
      "Resource": "*"
    },
    {
      "Sid": "Stmt1481678744000",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListAllMyBuckets",
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetObjectAcl",
        "s3:GetBucketNotification",
        "s3:PutBucketNotification"
      ],
      "Resource": [
        "arn:aws:s3:::*"
      ]
    }
  ]
}
Bulk Extraction Only
Advantages:
- No additional AWS resources are required.
- Minimum permissions are required to read from S3.
Disadvantages:
- Slow; extraction can take 10 times as long as it does with Bulk and Incremental Extraction.
- Expensive in terms of API cost.
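Setup
Add the following to the Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties in Cloudera Manager, and then restart Navigator:
nav.s3.extractor.incremental.enable=false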
Policy Document - Bulk Extraction
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1481676614000",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListAllMyBuckets",
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::*"
      ]
    }
  ]
}
Event Notification for an External Queue
If you have existing S3 event notification configured for any S3 buckets, you must configure extraction to use an external queue. This method requires you to set up the queues and configure event notification yourself. "Bring your own queue" extraction is recommended for production environments.
AWS resources required:
- SQS queue for each region in which you have buckets.
- S3 event notification for each bucket to send change events to the Navigator queue.
Advantages:
- Full control over your AWS environment.
- Performance level is high.
Disadvantages:
- Requires significant manual setup and configuration.
Setup
- Stop Cloudera Navigator.
- Open the Amazon SQS console and create a queue with the following settings:
- Default Visibility Timeout: 10 minutes
- Message Retention Period: 14 days
- Delivery Delay: 0 seconds
- Receive Message Wait Time: 0 seconds
- Select the queue you created, click the Permissions tab, click Add a Permission, and configure the following in the Add a Permission to... dialog box:
- Effect: Allow
- Principal: Everybody
- Actions: SendMessage
In the Conditions (optional) area, set the following values:
- Qualifier: None
- Condition: ArnLike
- Key: aws:SourceArn
- Value: arn:aws:s3::*:*
When finished, click Add Condition, and then click Add Permission.
- Set up a queue in each region in which you have buckets.
- Configure event notification for every bucket:
- Name: nav-send-metadata-on-change
- Events: ObjectCreated(All) and ObjectRemoved(All)
- Send to: SQS queue
- SQS queue: The name of your queue
- Configure SNS fanout if you have existing S3 event notification. For more information about SNS fanout, see Common SNS Scenarios.
- In Cloudera Manager, add the following to the Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties:
nav.s3.extractor.incremental.enable=true
nav.s3.extractor.incremental.auto_setup.enable=false
nav.s3.extractor.incremental.queues=queue_json
queue_json must have the following JSON format, with no spaces:
[{"region":"us-west-1"\\,"queueUrl":"https://sqs.us-west-1.amazonaws.com/account_num/queue_name"}\\,{queue_2}\\,...{queue_n}]
- Restart Navigator.
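Two pieces of the setup above can be generated rather than typed by hand: the per-bucket event notification settings and the queue_json value. The following Python sketch is illustrative only. The function names, the account number 123456789012, and the queue name nav-queue are hypothetical; the notification dictionary follows the shape that the S3 notification configuration API accepts, and the comma escaping copies the two-backslash sequence shown in the format line above.

```python
import json


def notification_config(queue_arn):
    """Build notification settings for one bucket, mirroring the Setup steps.

    The result has the shape accepted by the S3 bucket notification API
    (for example, boto3's put_bucket_notification_configuration), but this
    sketch only builds the dictionary and does not call AWS.
    """
    return {
        "QueueConfigurations": [
            {
                "Id": "nav-send-metadata-on-change",
                "QueueArn": queue_arn,
                # ObjectCreated(All) and ObjectRemoved(All)
                "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
            }
        ]
    }


def queue_json(queues):
    """Render (region, queue_url) pairs as a queue_json value.

    Produces compact JSON with no spaces, then escapes every comma with
    the two-backslash sequence used in the documented format.
    """
    compact = json.dumps(
        [{"region": region, "queueUrl": url} for region, url in queues],
        separators=(",", ":"),
    )
    return compact.replace(",", "\\\\,")


if __name__ == "__main__":
    # Hypothetical queues, one per region that contains buckets.
    queues = [
        ("us-west-1", "https://sqs.us-west-1.amazonaws.com/123456789012/nav-queue"),
        ("us-east-1", "https://sqs.us-east-1.amazonaws.com/123456789012/nav-queue"),
    ]
    print("nav.s3.extractor.incremental.queues=" + queue_json(queues))
```

Because every comma in the rendered JSON is escaped, a comma inside a queue URL would be escaped too; verify the generated line against your own environment before pasting it into the safety valve.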
Policy Document - External Queue
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1481678612000",
      "Effect": "Allow",
      "Action": [
        "sqs:DeleteMessage",
        "sqs:DeleteMessageBatch",
        "sqs:GetQueueAttributes",
        "sqs:ReceiveMessage"
      ],
      "Resource": "*"
    },
    {
      "Sid": "Stmt1481678744000",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListAllMyBuckets",
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetObjectAcl",
        "s3:GetBucketNotification",
        "s3:PutBucketNotification"
      ],
      "Resource": [
        "arn:aws:s3:::*"
      ]
    }
  ]
}
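Before attaching a policy, it can help to confirm that it grants every action listed for the extraction method you plan to use. The following Python sketch compares a policy document against the action lists from the three policy documents in this topic. The function names and method keys are hypothetical, and the check is deliberately simple: it matches action names literally, so it does not expand wildcards such as s3:* and does not examine Resource or Condition elements.

```python
import json

S3_READ = {
    "s3:GetBucketLocation", "s3:ListAllMyBuckets", "s3:ListBucket",
    "s3:GetObject", "s3:GetObjectAcl",
}
S3_NOTIFY = {"s3:GetBucketNotification", "s3:PutBucketNotification"}
SQS_CONSUME = {
    "sqs:DeleteMessage", "sqs:DeleteMessageBatch",
    "sqs:GetQueueAttributes", "sqs:ReceiveMessage",
}
SQS_MANAGE = {"sqs:CreateQueue", "sqs:GetQueueUrl", "sqs:SetQueueAttributes"}

# Actions taken from the policy documents shown for each extraction method.
REQUIRED_ACTIONS = {
    "bulk": S3_READ,
    "bulk_and_incremental": S3_READ | S3_NOTIFY | SQS_CONSUME | SQS_MANAGE,
    "external_queue": S3_READ | S3_NOTIFY | SQS_CONSUME,
}


def allowed_actions(policy):
    """Collect the literal action names from all Allow statements."""
    actions = set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        action = statement.get("Action", [])
        if isinstance(action, str):  # IAM allows a single string here
            action = [action]
        actions.update(action)
    return actions


def missing_actions(policy, method):
    """Return the actions the method needs that the policy does not list."""
    return sorted(REQUIRED_ACTIONS[method] - allowed_actions(policy))


if __name__ == "__main__":
    bulk_policy = json.loads("""
    {"Version": "2012-10-17", "Statement": [{"Effect": "Allow",
      "Action": ["s3:GetBucketLocation", "s3:ListAllMyBuckets",
                 "s3:ListBucket", "s3:GetObject", "s3:GetObjectAcl"],
      "Resource": ["arn:aws:s3:::*"]}]}
    """)
    print(missing_actions(bulk_policy, "bulk"))
    print(missing_actions(bulk_policy, "external_queue"))
```

An empty list means that the policy names every action that the corresponding policy document in this topic lists for that method.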
Navigator S3 Extraction Options
Option | Description | Default Value |
---|---|---|
nav.s3.extractor.max_threads | Number of extraction workers to run in parallel. | 3 |
nav.s3.extractor.enable | Enable or disable S3 extraction. | true (if a key is provided to Navigator through Cloudera Manager) |
nav.s3.extractor.incremental.enable | Enable or disable incremental extraction. If set to false, bulk extraction is run. You can enable bulk extraction by setting this value to false and restarting Navigator. | true |
nav.s3.extractor.incremental.batch_size | Number of messages to keep in memory at a time. | 1000 |
nav.s3.extractor.incremental.auto_setup.enable | Autoconfigure queues and configure S3 event notification. Set to false to use “bring your own queue”. | true |
nav.s3.extractor.incremental.queues | List of queues to use in external queue use case. | N/A |
nav.aws.api.limit | Maximum number of API calls that Navigator can make per month. | 5,000,000,000 |
nav.sqs.max_receive_count | Number of retries for SQS messages that might be inconsistent due to eventual consistency. | 10 |
nav.s3.implicit.batch_size | Number of Solr documents to keep in memory when updating the state of implicit folders. | 1000 |
nav.s3.home_region | The closest AWS region. Using the closest region can reduce API request time. | us-west-1 |
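
The two incremental options in the table together select one of the three extraction methods described in this topic. The following Python sketch applies a safety valve snippet on top of the defaults listed above and reports which method results. The function names are hypothetical, the parser handles only simple key=value lines rather than the full Java properties syntax, and nav.s3.extractor.enable is omitted because its default depends on whether a key is provided.

```python
# Defaults copied from the options table; values are kept as strings.
DEFAULTS = {
    "nav.s3.extractor.max_threads": "3",
    "nav.s3.extractor.incremental.enable": "true",
    "nav.s3.extractor.incremental.batch_size": "1000",
    "nav.s3.extractor.incremental.auto_setup.enable": "true",
    "nav.aws.api.limit": "5000000000",
    "nav.sqs.max_receive_count": "10",
    "nav.s3.implicit.batch_size": "1000",
    "nav.s3.home_region": "us-west-1",
}


def effective_settings(snippet):
    """Overlay key=value lines from a safety valve snippet on the defaults."""
    settings = dict(DEFAULTS)
    for line in snippet.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def extraction_mode(settings):
    """Name the extraction method that the two incremental options select."""
    if settings["nav.s3.extractor.incremental.enable"] != "true":
        return "bulk only"
    if settings["nav.s3.extractor.incremental.auto_setup.enable"] == "true":
        return "bulk and incremental"
    return "external queue"


if __name__ == "__main__":
    print(extraction_mode(effective_settings("")))
    print(extraction_mode(effective_settings(
        "nav.s3.extractor.incremental.enable=false")))
    print(extraction_mode(effective_settings(
        "nav.s3.extractor.incremental.enable=true\n"
        "nav.s3.extractor.incremental.auto_setup.enable=false")))
```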