Configuring Access to S3
For Apache Hadoop applications to be able to interact with Amazon S3, they must know the AWS access key and the secret key. This can be achieved in three different ways: through configuration properties, environment variables, or instance metadata. While the first two options can be used when accessing S3 from a cluster running in your own data center. IAM roles, which use instance metadata should be used to control access to AWS resources if your cluster is running on EC2.
Table 3.1. Authentication Options for Different Deployment Scenarios
Deployment Scenario | Authentication Options |
---|---|
Cluster runs on EC2 | Use IAM roles to control access to your AWS resources. If you configure role-based access, instance metadata will automatically be used to authenticate. |
Cluster runs in your own data center | Use configuration properties to authenticate. You can set the configuration properties globally or per-bucket. |
Temporary security credentials, also known as "session credentials", can be issued. These consist of a secret key with a limited lifespan, along with a session token, another secret which must be known and used alongside the access key. The secret key is never passed to AWS services directly. Instead it is used to sign the URL and headers of the HTTP request.
By default, the S3A filesystem client follows the following authentication chain:
If login details were provided in the filesystem URI, a warning is printed and then the username and password are extracted for the AWS key and secret respectively. However, authenticating via embedding credentials in the URL is dangerous and deprecated. Instead, you may authenticate using per-bucket authentication credentials.
The
fs.s3a.access.key
andfs.s3a.secret.key
are looked for in the Hadoop configuration properties.The AWS environment variables are then looked for.
An attempt is made to query the Amazon EC2 Instance Metadata Service to retrieve credentials published to EC2 VMs.