Configuring and Managing S3Guard
Minimum Required Role: User Administrator (also provided by Full Administrator)
Data written to Amazon S3 buckets is subject to the "eventual consistency" guarantee provided by Amazon Web Services (AWS), which means that data written to S3 may not be immediately available for queries and listing operations. This can cause failures in multi-step ETL workflows, where data from a previous step is not available to the next step. The S3Guard feature guarantees a consistent view of data stored in Amazon S3 by storing additional metadata in a table residing in an Amazon DynamoDB instance. Depending on the workload, this additional metadata store may also improve performance for Hive, Spark, and Impala jobs.
All processes that modify an S3 bucket for which S3Guard is enabled must use S3Guard. Because S3Guard works by logging metadata changes to an external database, modifying the bucket outside of S3Guard causes the S3 data and the S3Guard database to go out of sync. S3A and S3Guard can then report that files are present or absent when the bucket actually contains different data.
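The out-of-sync condition described above can be illustrated with a toy model. This is only a sketch: the class `GuardedBucket` and its methods are hypothetical names invented for illustration, and plain Python containers stand in for the S3 bucket and the DynamoDB table. It does not use any S3Guard or AWS API.

```python
# Toy model only: a dict stands in for the S3 bucket and a set stands in
# for the S3Guard metadata table. All names here are hypothetical.
class GuardedBucket:
    def __init__(self):
        self.objects = {}      # stands in for the S3 bucket contents
        self.metadata = set()  # stands in for the S3Guard table in DynamoDB

    def guarded_put(self, key, data):
        """Write through the S3Guard-aware path: bucket and metadata both change."""
        self.objects[key] = data
        self.metadata.add(key)

    def unguarded_delete(self, key):
        """Delete from a process that bypasses S3Guard: only the bucket changes."""
        self.objects.pop(key, None)

    def out_of_sync(self):
        """Return the keys on which the bucket and the metadata disagree."""
        return sorted(self.metadata ^ set(self.objects))


bucket = GuardedBucket()
bucket.guarded_put("etl/step1/part-0000", b"rows")
bucket.guarded_put("etl/step1/part-0001", b"rows")
bucket.unguarded_delete("etl/step1/part-0001")
print(bucket.out_of_sync())  # ['etl/step1/part-0001']
```

In this model the metadata still lists `part-0001` after the unguarded delete, which corresponds to a client believing a file is present when the bucket no longer holds it.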
To enable S3Guard, you set up an Amazon DynamoDB database from Amazon Web Services. Amazon charges an hourly rate for this service based on the capacity you provision. See Amazon DynamoDB Pricing.
Once the data stored in S3 becomes consistent (usually within 24 hours), the S3Guard metadata is no longer required, and you can periodically prune the S3Guard metadata stored in the DynamoDB table to clear older entries. Pruning can also reduce the costs associated with DynamoDB.
To configure S3Guard, you need the following:
- Credentials for the Amazon S3 bucket.
- An instance of Amazon DynamoDB database provisioned from Amazon Web Services.
- The configured region for the DynamoDB database.
- A CDH cluster managed by Cloudera Manager.
Configuring S3Guard for Cluster Access to S3
- Specify the AWS credentials for the Amazon S3 instance where you want to enable S3Guard. You can:
  - Add a new AWS credential.
    After adding the credential, the Edit S3Guard dialog box displays.
  - Use an existing AWS credential:
    - Go to .
    - Locate the credential you want to use and click .
      The Edit S3Guard dialog box displays.
- Select Enable S3Guard.
- Edit the following S3Guard configuration properties:
S3Guard Configuration Properties
Automatically Create S3Guard Metadata Table (fs.s3a.s3guard.ddb.table.create)
API Name: s3guard_table_auto_create
When Yes is selected, the DynamoDB table that stores the S3Guard metadata is automatically created if it does not exist.
When No is selected and the table does not exist, running the Prune command, queries, or other jobs on S3 will fail.
S3Guard Metadata Table Name (fs.s3a.s3guard.ddb.table)
API Name: s3guard_table_name
The name of the DynamoDB table that stores the S3Guard metadata.
By default, the table is named s3guard-metadata.
S3Guard Metadata Region Name (fs.s3a.s3guard.ddb.region)
API Name: s3guard_region
The DynamoDB region to connect to for access to the S3Guard metadata. Set this property to a valid region. See DynamoDB regions.
Expand the Advanced section to configure the following properties:
S3Guard Metadata Pruning Age (fs.s3a.s3guard.cli.prune.age)
API Name: s3guard_cache_prune_age_ms
Maximum age for S3Guard metadata. Whenever the Prune command runs, entries in the S3Guard metadata cache older than this age will be deleted.
You can enter this value in milliseconds, seconds, minutes, hours, or days.
S3Guard Metadata Table Read Capacity (fs.s3a.s3guard.ddb.table.capacity.read)
API Name: s3guard_table_capacity_read
Provisioned throughput requirements, in capacity units, for read operations from the DynamoDB table used for the S3Guard metadata. This value is only used when creating a new DynamoDB table. After the table is created, you can monitor the throughput and adjust the read capacity using the DynamoDB AWS Management Console. See Provisioned Throughput.
S3Guard Metadata Table Write Capacity (fs.s3a.s3guard.ddb.table.capacity.write)
API Name: s3guard_table_capacity_write
Provisioned throughput requirements, in capacity units, for write operations to the DynamoDB table used for the S3Guard metadata. This value is only used when creating a new DynamoDB table. After the table is created, you can monitor the throughput and adjust the write capacity as needed using the DynamoDB AWS Management Console. See Provisioned Throughput.
- Click Save.
The Connect to Amazon Web Services dialog box displays.
- To enable cluster access to S3 using the S3 Connector Service, click the Enable for Cluster Name link in the Cluster Access to S3 section.
Follow the prompts to add the S3 Connector Service. See Adding the S3 Connector Service for details.
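The properties in the procedure above each carry a Hadoop configuration name in parentheses. As a rough sketch of how those names look in Hadoop's `<property>` XML form, the snippet below renders them with the Python standard library. Every value is a placeholder chosen for illustration (including the region, the capacities, and the mapping of Yes to `true`), and the helper `to_hadoop_xml` is a hypothetical name. In a cluster managed by Cloudera Manager you would normally set these values through the Edit S3Guard dialog box rather than by hand.

```python
import xml.etree.ElementTree as ET

# Property names are taken from the configuration table above.
# All values are illustrative placeholders, not recommendations.
S3GUARD_PROPERTIES = {
    "fs.s3a.s3guard.ddb.table.create": "true",         # assumed mapping of "Yes"
    "fs.s3a.s3guard.ddb.table": "s3guard-metadata",    # default name per the table
    "fs.s3a.s3guard.ddb.region": "us-west-2",          # placeholder region
    "fs.s3a.s3guard.cli.prune.age": "86400000",        # placeholder: 24 hours in ms
    "fs.s3a.s3guard.ddb.table.capacity.read": "500",   # placeholder capacity units
    "fs.s3a.s3guard.ddb.table.capacity.write": "100",  # placeholder capacity units
}


def to_hadoop_xml(properties):
    """Render a dict of name/value pairs as a Hadoop-style <configuration> element."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


print(to_hadoop_xml(S3GUARD_PROPERTIES))
```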
Editing the S3Guard Configuration
- Click
- Locate the credential associated with the S3Guard configuration and click .
The Edit S3Guard dialog box displays.
- Edit the S3Guard configuration. (To disable S3Guard for this credential, uncheck Enable S3Guard.)
- Click Save.
Pruning the S3Guard Metadata
Amazon charges for the amount of data stored in DynamoDB and for the bandwidth used for reads and writes to the database. To optimize costs and improve performance, you can remove stale metadata from the DynamoDB table by running the Prune command. Generally, data written to S3 becomes consistent within 24 hours, so you only need to maintain metadata in DynamoDB for about one day. You can monitor DynamoDB usage with AWS tools to determine how often and when to prune the table.
Running the Prune command removes all metadata that is older than the age you specify with the S3Guard Metadata Pruning Age property in the S3Guard configuration. You can run this command from the Cloudera Manager Admin Console, or you can create a script to run the Prune command automatically using the Cloudera Manager API. Cloudera recommends that you run that script using a Linux cron job or other scheduling mechanism to regularly prune the metadata.
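The age-based rule described above can be sketched in a few lines. This is a toy illustration, not the S3Guard implementation: the function `prune`, the entry layout (path mapped to a timestamp), and the sample paths are all assumptions made for the example. It also shows how a 24-hour age converts to milliseconds, the unit suggested by the API name s3guard_cache_prune_age_ms.

```python
from datetime import datetime, timedelta, timezone


def prune(entries, max_age, now):
    """Keep only entries whose timestamp is within max_age of now (toy model)."""
    cutoff = now - max_age
    return {path: stamp for path, stamp in entries.items() if stamp >= cutoff}


# Hypothetical metadata entries: path -> time the entry was recorded.
now = datetime(2017, 3, 20, 12, 0, tzinfo=timezone.utc)
entries = {
    "etl/step1/part-0000": now - timedelta(hours=30),  # older than the pruning age
    "etl/step2/part-0000": now - timedelta(hours=2),   # still within the pruning age
}

max_age = timedelta(hours=24)
kept = prune(entries, max_age=max_age, now=now)
print(sorted(kept))                         # ['etl/step2/part-0000']
print(int(max_age.total_seconds() * 1000))  # 86400000
```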
Running the Prune Command Using Cloudera Manager Admin Console
Minimum Required Role: Cluster Administrator (also provided by Full Administrator)
- Go to .
- Locate the credential associated with the S3 data and click .
Running the Prune Command Using the Cloudera Manager API
Cloudera recommends that you automate running the Prune command by creating a script that uses the Cloudera Manager API to run the command. You can run the command using a REST request, a Python script, or a Java class. Schedule the script with cron or another scheduling mechanism so that it runs regularly.
REST
See the REST API Documentation.
You can run the Prune command by issuing the following REST request:
curl -X POST -u username:password 'Cloudera_Manager_server_URL:port_number/api/vAPI_version_number/externalAccounts/account/Credential_Name/commands/S3GuardPrune'
For example:
curl -X POST -u admin:admin 'http://clusterhost-1.gce.mycompany.com:7180/api/v16/externalAccounts/account/johnsmith/commands/S3GuardPrune'
{
  "id" : 322,
  "name" : "S3GuardPrune",
  "startTime" : "2017-03-20T23:35:55.453Z",
  "active" : true,
  "children" : {
    "items" : [ {
      "id" : 323,
      "name" : "HostS3GuardPrune",
      "startTime" : "2017-03-20T23:35:55.777Z",
      "active" : true,
      "hostRef" : {
        "hostId" : "ff988a15-3749-4178-b167-a60b15f91653"
      }
Python
You can also use a Python script to run the Prune command. See aws.py for the code and usage instructions.
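As a minimal sketch (this is not the aws.py script, whose contents are not reproduced here), the REST request shown earlier could be built with the Python standard library as follows. The function name `build_prune_request` is hypothetical, and the host, API version, credential name, and login are the sample values from the REST example. The network call is left commented out so the sketch can run without a Cloudera Manager server.

```python
import base64
import urllib.parse
import urllib.request


def build_prune_request(base_url, api_version, credential_name, username, password):
    """Build (but do not send) the POST request for the S3GuardPrune command."""
    url = "{}/api/v{}/externalAccounts/account/{}/commands/S3GuardPrune".format(
        base_url.rstrip("/"),
        api_version,
        urllib.parse.quote(credential_name, safe=""),  # escape the name for use in a URL path
    )
    # HTTP Basic authentication, equivalent to curl's -u username:password.
    token = base64.b64encode("{}:{}".format(username, password).encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url, method="POST", headers={"Authorization": "Basic " + token}
    )


# Sample values copied from the REST example above.
request = build_prune_request(
    "http://clusterhost-1.gce.mycompany.com:7180", 16, "johnsmith", "admin", "admin"
)
print(request.get_method(), request.full_url)

# To submit the request against a reachable Cloudera Manager server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

A script like this could then be run on a schedule, for example from a cron job, as recommended above.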
Java
See the Javadoc.