Configuring Access to S3 on CDP Private Cloud Base

For Apache Hadoop applications to interact with Amazon S3, they must know the AWS access key and the secret key. These can be supplied in three ways: through configuration properties, environment variables, or instance metadata. The first two options can be used when accessing S3 from a cluster running in your own data center. If your cluster is running on EC2, use IAM roles, which rely on instance metadata, to control access to AWS resources.

Table 1. Authentication Options for Different Deployment Scenarios
Deployment Scenario | Authentication Options
Cluster runs on EC2 | Use IAM roles to control access to your AWS resources. If you configure role-based access, instance metadata will automatically be used to authenticate.
Cluster runs in your own data center | Use the configuration properties to authenticate. You can set the configuration properties globally or per-bucket.
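
The difference between global and per-bucket settings can be illustrated with a minimal Python sketch. It assumes the S3A per-bucket naming pattern fs.s3a.bucket.<bucket>.<option>, which this page does not spell out, so confirm it against the documentation for your release. The helper name resolve_s3a_option, the bucket names, and the key values are hypothetical placeholders, and the dictionary only stands in for the Hadoop configuration.

```python
def resolve_s3a_option(conf, bucket, option):
    """Return the effective value of an fs.s3a option for one bucket.

    Assumption: a per-bucket key (fs.s3a.bucket.<bucket>.<option>)
    takes precedence over the global key (fs.s3a.<option>).
    """
    per_bucket_key = f"fs.s3a.bucket.{bucket}.{option}"
    global_key = f"fs.s3a.{option}"
    if per_bucket_key in conf:
        return conf[per_bucket_key]
    return conf.get(global_key)


# Hypothetical configuration: placeholder values, not real credentials.
conf = {
    "fs.s3a.access.key": "GLOBAL_ACCESS_KEY",
    "fs.s3a.secret.key": "GLOBAL_SECRET_KEY",
    "fs.s3a.bucket.reports.access.key": "REPORTS_ACCESS_KEY",
    "fs.s3a.bucket.reports.secret.key": "REPORTS_SECRET_KEY",
}

# The "reports" bucket has its own keys; "logs" falls back to the global ones.
print(resolve_s3a_option(conf, "reports", "access.key"))  # REPORTS_ACCESS_KEY
print(resolve_s3a_option(conf, "logs", "access.key"))     # GLOBAL_ACCESS_KEY
```

In a real deployment these values would live in the cluster's Hadoop configuration rather than in a Python dictionary; the sketch only shows the lookup order.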

Temporary security credentials, also known as session credentials, can also be issued. These consist of a secret key with a limited lifespan and a session token, another secret that must be supplied alongside the access key. The secret key is never passed to AWS services directly; instead, it is used to sign the URL and headers of the HTTP request.
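
To illustrate why the secret key never travels with the request, here is a minimal Python sketch of HMAC-based signing. It assumes the key-derivation steps of AWS Signature Version 4, which this page does not name, and it skips building the canonical request, so the string being signed is a hypothetical placeholder. The secret shown is the example value used in AWS documentation, not a real key.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, msg: str) -> bytes:
    """Return the raw HMAC-SHA256 digest of msg under key."""
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key, date_stamp, region, service):
    """Derive a scoped signing key from the secret key.

    Assumption: this follows the Signature Version 4 derivation chain
    (date, then region, then service, then the "aws4_request" terminator).
    """
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def sign(secret_key, date_stamp, region, service, string_to_sign):
    """Return the hex signature that would accompany the request."""
    signing_key = derive_signing_key(secret_key, date_stamp, region, service)
    return hmac.new(
        signing_key, string_to_sign.encode("utf-8"), hashlib.sha256
    ).hexdigest()


# Placeholder inputs for illustration only.
example_secret = "wJalrXUtnFEMI/K7MDENG+bPxRfiCYEXAMPLEKEY"
signature = sign(
    example_secret, "20240101", "us-east-1", "s3", "placeholder-string-to-sign"
)

# Only the signature is sent; the secret key stays on the client.
print(signature)
```

The service can recompute the same signature from its own copy of the secret, which is how it verifies the caller without the secret being transmitted.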

By default, the S3A filesystem client uses the following authentication chain:

  1. The Hadoop configuration properties are searched for fs.s3a.access.key and fs.s3a.secret.key.

  2. The AWS environment variables are checked next.

  3. An attempt is made to query the Amazon EC2 Instance Metadata Service to retrieve credentials published to EC2 VMs.
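
The order of this chain can be sketched in Python as a simple fallback lookup. This is an illustration of the precedence only, not the S3A client's actual implementation. The function resolve_credentials and the metadata_lookup callback are hypothetical, and the environment variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the standard AWS names, assumed here because this page does not list them. Credentials from instance metadata also carry a session token, which the sketch omits for brevity.

```python
import os


def resolve_credentials(conf, environ=None, metadata_lookup=None):
    """Return (source, access_key, secret_key) from the first provider
    that supplies both keys, or None if no provider does."""
    environ = os.environ if environ is None else environ

    # 1. Hadoop configuration properties.
    access = conf.get("fs.s3a.access.key")
    secret = conf.get("fs.s3a.secret.key")
    if access and secret:
        return ("configuration", access, secret)

    # 2. AWS environment variables (assumed standard names).
    access = environ.get("AWS_ACCESS_KEY_ID")
    secret = environ.get("AWS_SECRET_ACCESS_KEY")
    if access and secret:
        return ("environment", access, secret)

    # 3. EC2 instance metadata, represented by a caller-supplied stub
    #    so the sketch makes no network calls.
    if metadata_lookup is not None:
        creds = metadata_lookup()
        if creds:
            access, secret = creds
            return ("instance-metadata", access, secret)

    return None


# Placeholder values for illustration only.
from_conf = resolve_credentials(
    {"fs.s3a.access.key": "CONF_KEY", "fs.s3a.secret.key": "CONF_SECRET"},
    environ={"AWS_ACCESS_KEY_ID": "ENV_KEY", "AWS_SECRET_ACCESS_KEY": "ENV_SECRET"},
)
from_env = resolve_credentials(
    {},
    environ={"AWS_ACCESS_KEY_ID": "ENV_KEY", "AWS_SECRET_ACCESS_KEY": "ENV_SECRET"},
)
from_metadata = resolve_credentials(
    {}, environ={}, metadata_lookup=lambda: ("ROLE_KEY", "ROLE_SECRET")
)

print(from_conf[0])      # configuration
print(from_env[0])       # environment
print(from_metadata[0])  # instance-metadata
```

Because configuration properties are checked first, keys set in the Hadoop configuration take precedence over environment variables and over any IAM role attached to the instance.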