Configuring and Managing S3Guard

Minimum Required Role: User Administrator (also provided by Full Administrator)

Data written to Amazon S3 buckets is subject to the "eventual consistency" guarantee provided by Amazon Web Services (AWS), which means that data written to S3 may not be immediately available for queries and listing operations. This can cause failures in multi-step ETL workflows, where data from a previous step is not available to the next step. The S3Guard feature guarantees a consistent view of data stored in Amazon S3 by storing additional metadata in a table residing in an Amazon DynamoDB instance. Depending on the workload, this additional metadata store may also improve performance for Hive, Spark, and Impala jobs.

All processes that modify an S3 bucket for which S3Guard is enabled must use S3Guard. Because S3Guard works by logging metadata changes to an external database, modifying the bucket outside of S3Guard causes the S3 data and the S3Guard database to go out of sync. As a result, S3A/S3Guard may report files as present or absent when the actual contents of the bucket differ.

To enable S3Guard, you set up an Amazon DynamoDB database from Amazon Web Services. Amazon charges an hourly rate for this service based on the capacity you provision. See Amazon DynamoDB Pricing.

When the data stored in S3 eventually becomes consistent (usually within 24 hours), the S3Guard metadata is no longer required, and you can periodically prune the S3Guard metadata stored in the DynamoDB table to clear older entries. Pruning can also reduce the costs associated with DynamoDB.

To configure S3Guard in your cluster, you must provide the following:

Credentials for the Amazon S3 bucket.
An instance of Amazon DynamoDB database provisioned from Amazon Web Services.
The configured region for the DynamoDB database.
A CDH 5.11 or higher cluster managed by Cloudera Manager 5.11 or higher.

Continue reading:

Configuring S3Guard for Cluster Access to S3
Editing the S3Guard Configuration
Pruning the S3Guard Metadata
- Running the Prune Command Using Cloudera Manager Admin Console
- Running the Prune Command Using the Cloudera Manager API

Configuring S3Guard for Cluster Access to S3

Specify the AWS credentials for the Amazon S3 instance where you want to enable S3Guard. You can:
- Add a new AWS credential.
  After adding the credential, the Edit S3Guard dialog box displays.
- Use an existing AWS credential:
  1. Go to Administration > AWS Credentials.
  2. Locate the credential you want to use and click Actions > Edit S3Guard.
     The Edit S3Guard dialog box displays.
Select Enable S3Guard.

Edit the following S3Guard configuration properties:

S3Guard Configuration Properties
Property	Description
Automatically Create S3Guard Metadata Table (`fs.s3a.s3guard.ddb.table.create`) API Name: `s3guard_table_auto_create`	When Yes is selected, the DynamoDB table that stores the S3Guard metadata is automatically created if it does not exist. When No is selected and the table does not exist, running the Prune command, queries, or other jobs on S3 will fail.
S3Guard Metadata Table Name (`fs.s3a.s3guard.ddb.table`) API Name: `s3guard_table_name`	The name of the DynamoDB table that stores the S3Guard metadata. By default, the table is named `s3guard-metadata`.
S3Guard Metadata Region Name (`fs.s3a.s3guard.ddb.region`) API Name: `s3guard_region`	The DynamoDB region to connect to for access to the S3Guard metadata. Set this property to a valid region. See DynamoDB regions.
Expand the Advanced section to configure the following properties:
S3Guard Metadata Pruning Age (`fs.s3a.s3guard.cli.prune.age`) API Name: `s3guard_cache_prune_age_ms`	Maximum age for S3Guard metadata. Whenever the Prune command runs, entries in the S3Guard metadata cache older than this age will be deleted. You can enter this value in milliseconds, seconds, minutes, hours, or days.
S3Guard Metadata Table Read Capacity (`fs.s3a.s3guard.ddb.table.capacity.read`) API Name: `s3guard_table_capacity_read`	Provisioned throughput requirements, in capacity units, for read operations from the DynamoDB table used for the S3Guard metadata. This value is only used when creating a new DynamoDB table. After the table is created, you can monitor the throughput and adjust the read capacity using the DynamoDB AWS Management Console. See Provisioned Throughput.
S3Guard Metadata Table Write Capacity (`fs.s3a.s3guard.ddb.table.capacity.write`) API Name: `s3guard_table_capacity_write`	Provisioned throughput requirements, in capacity units, for write operations to the DynamoDB table used for the S3Guard metadata. This value is only used when creating a new DynamoDB table. After the table is created, you can monitor the throughput and adjust the write capacity as needed using the DynamoDB AWS Management Console. See Provisioned Throughput.

Click Save.
The Connect to Amazon Web Services dialog box displays.
To enable cluster access to S3 using the S3 Connector Service, click the Enable for Cluster Name link in the Cluster Access to S3 section.
Follow the prompts to add the S3 Connector Service. See Adding the S3 Connector Service for details.

Note: S3Guard is not supported for Cloud Backup and Restore and Cloudera Navigator Access to S3.
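
The S3Guard Metadata Pruning Age property accepts several units, while its API name, `s3guard_cache_prune_age_ms`, suggests that the value is ultimately expressed in milliseconds. The following is a minimal sketch of that conversion, which may be useful when preparing a value for a script. The helper name `to_milliseconds` and the unit labels are hypothetical and are not part of Cloudera Manager.

```python
# Hypothetical helper: convert a pruning age given in a human-friendly unit
# into milliseconds. The unit labels are assumptions chosen for this sketch.
MS_PER_UNIT = {
    "milliseconds": 1,
    "seconds": 1000,
    "minutes": 60 * 1000,
    "hours": 60 * 60 * 1000,
    "days": 24 * 60 * 60 * 1000,
}


def to_milliseconds(value, unit):
    """Return the pruning age in milliseconds."""
    if unit not in MS_PER_UNIT:
        raise ValueError("unsupported unit: %s" % unit)
    if value < 0:
        raise ValueError("pruning age must not be negative")
    return int(value * MS_PER_UNIT[unit])


if __name__ == "__main__":
    # A pruning age of one day.
    print(to_milliseconds(1, "days"))  # 86400000
```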

Editing the S3Guard Configuration

To edit or disable the S3Guard configuration:

Click Administration > AWS Credentials.
Locate the credential associated with the S3Guard configuration and click Actions > Edit S3Guard.
The Edit S3Guard dialog box displays.
Edit the S3Guard configuration. (To disable S3Guard for this credential, uncheck Enable S3Guard.)
Click Save.

Pruning the S3Guard Metadata

Amazon charges for the amount of data stored in DynamoDB and for the bandwidth used for reads and writes to the database. To optimize costs and improve performance, you can remove stale metadata from the DynamoDB table by running the Prune command. Generally, data written to S3 becomes consistent within 24 hours, so you only need to maintain metadata in DynamoDB for about one day. You can monitor DynamoDB usage with AWS tools to determine how often and when to prune the table.

Running the Prune command removes all metadata that is older than the age you specify with the S3Guard Metadata Pruning Age property in the S3Guard configuration. You can run this command from the Cloudera Manager Admin Console, or you can create a script to run the Prune command automatically using the Cloudera Manager API. Cloudera recommends that you run that script using a Linux cron job or other scheduling mechanism to regularly prune the metadata.
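
To make the effect of the pruning age concrete, the sketch below models it on an in-memory list: entries last modified before the cutoff are dropped, and newer entries are kept. It illustrates only the age comparison. The entry layout (`path`, `mod_time_ms`) and the function name `prune_entries` are hypothetical and do not reflect the actual DynamoDB table schema or how the Prune command is implemented.

```python
def prune_entries(entries, now_ms, prune_age_ms):
    """Keep only entries modified within the last prune_age_ms milliseconds."""
    cutoff_ms = now_ms - prune_age_ms
    return [e for e in entries if e["mod_time_ms"] >= cutoff_ms]


if __name__ == "__main__":
    day_ms = 24 * 60 * 60 * 1000
    now_ms = 10 * day_ms  # fixed "current time" so the example is repeatable
    entries = [
        {"path": "/data/old.parquet", "mod_time_ms": now_ms - 3 * day_ms},
        {"path": "/data/new.parquet", "mod_time_ms": now_ms - 60 * 1000},
    ]
    kept = prune_entries(entries, now_ms, day_ms)
    print([e["path"] for e in kept])  # ['/data/new.parquet']
```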

Running the Prune Command Using Cloudera Manager Admin Console

Minimum Required Role: Cluster Administrator (also provided by Full Administrator)

To prune the S3Guard metadata in the DynamoDB table using the Cloudera Manager Admin Console:

Go to Administration > AWS Credentials.
Locate the credential associated with the S3 data and click Actions > Run S3 Guard Prune Command.

Running the Prune Command Using the Cloudera Manager API

Cloudera recommends that you automate running the Prune command by creating a script that uses the Cloudera Manager API to run the command. You can run the command using a REST request, a Python script, or a Java class. Schedule the script with a Linux cron job or another scheduling mechanism so that it runs regularly.

REST

See the REST API documentation.

You can run the Prune command by issuing the following REST request:

curl -X POST -u username:password \
 'Cloudera_Manager_server_URL:port_number/api/vAPI_version_number/externalAccounts/account/Credential_Name/commands/S3GuardPrune'

For example, the following request runs the S3Guard prune command on the data associated with the johnsmith credential. The response from Cloudera Manager is also displayed (within the curly brackets):

curl -X POST -u admin:admin 'http://clusterhost-1.gce.mycompany.com:7180/api/v16/externalAccounts/account/johnsmith/commands/S3GuardPrune'
{
  "id" : 322,
  "name" : "S3GuardPrune",
  "startTime" : "2017-03-20T23:35:55.453Z",
  "active" : true,
  "children" : {
    "items" : [ {
      "id" : 323,
      "name" : "HostS3GuardPrune",
      "startTime" : "2017-03-20T23:35:55.777Z",
      "active" : true,
      "hostRef" : {
        "hostId" : "ff988a15-3749-4178-b167-a60b15f91653"
      }
    } ]
  }
}
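
A script that issues this request may want to check the status of the returned command. The sketch below parses a response of the shape shown above with Python's standard `json` module. The sample string is an abbreviated response that uses only fields appearing in the example output, and the helper name `summarize_command` is hypothetical.

```python
import json

# Abbreviated sample response; it uses only fields shown in the example output.
SAMPLE_RESPONSE = """
{
  "id": 322,
  "name": "S3GuardPrune",
  "active": true,
  "children": {
    "items": [
      {"id": 323, "name": "HostS3GuardPrune", "active": true,
       "hostRef": {"hostId": "ff988a15-3749-4178-b167-a60b15f91653"}}
    ]
  }
}
"""


def summarize_command(response_text):
    """Return (command id, active flag, list of child command ids)."""
    command = json.loads(response_text)
    children = command.get("children", {}).get("items", [])
    return command["id"], command["active"], [c["id"] for c in children]


if __name__ == "__main__":
    print(summarize_command(SAMPLE_RESPONSE))  # (322, True, [323])
```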

Python

You can also use a Python script to run the Prune command. See aws.py for the code and usage instructions.
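
As a rough illustration of what such a script does, the sketch below issues the same POST request as the REST example using only the Python standard library. It is not the `aws.py` script referenced above. The function names `prune_url` and `run_prune` are hypothetical, and the URL layout is copied from the REST example.

```python
import base64
import urllib.parse
import urllib.request


def prune_url(base_url, api_version, credential_name):
    """Build the S3GuardPrune endpoint URL in the form used by the REST example."""
    name = urllib.parse.quote(credential_name, safe="")
    return "%s/api/v%d/externalAccounts/account/%s/commands/S3GuardPrune" % (
        base_url.rstrip("/"), api_version, name)


def run_prune(base_url, api_version, credential_name, user, password):
    """POST to the prune endpoint with HTTP basic auth; return the response body."""
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
    request = urllib.request.Request(
        prune_url(base_url, api_version, credential_name),
        method="POST",
        headers={"Authorization": "Basic " + token.decode("ascii")},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the values from the REST example, a call would look like `run_prune("http://clusterhost-1.gce.mycompany.com:7180", 16, "johnsmith", user, password)`. A script built around this call can then be scheduled with cron or another scheduler; reading the password from a protected file or environment variable rather than hard-coding it is a reasonable precaution.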

Java

See the Javadoc.
