Specifying Impala Credentials to Access Data in S3 with Cloudera Manager

Cloudera recommends that you use Cloudera Manager to specify Impala credentials to access data in Amazon S3. If you are not using Cloudera Manager, see Specifying Impala Credentials to Access Data in S3 from the command line.

To configure access to data stored in S3 for Impala with Cloudera Manager, use one of the following authentication types:

  • IAM Role-based Authentication

    Uses AWS Identity and Access Management (IAM) roles. You must first set up IAM role-based authentication in Amazon; see the Amazon documentation. This authentication method is best suited to environments with a single user, or where all cluster users can have the same privileges to data in S3. For information about using IAM role-based authentication with Cloudera Manager, see How to Configure AWS Credentials.

  • Access Key Authentication

    For environments with multiple users or multi-tenancy, use an AWS access key and an AWS secret key that you obtain from Amazon; see the Amazon documentation. In this scenario, you must enable the Sentry service and Kerberos in order to use the S3 Connector service. Cloudera Manager stores your AWS credentials securely and does not store them in world-readable locations. If you can use the Sentry service and Kerberos, see the following sections to add your AWS credentials to Cloudera Manager and to manage them:

Specifying Impala Credentials on Clusters Not Secured by Sentry or Kerberos

If you cannot use the Sentry service or Kerberos in your environment, specify Impala credentials in the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml. For example:

<property>
   <name>fs.s3a.access.key</name>
   <value>your_access_key</value>
</property>
<property>
   <name>fs.s3a.secret.key</name>
   <value>your_secret_key</value>
</property>

Specifying your credentials in this safety valve does not require Kerberos or the Sentry service, but it is less secure than having Cloudera Manager store the credentials as described above. After specifying the credentials, restart both the Impala and Hive services. The Hive restart is required because operations such as Impala queries and CREATE TABLE statements go through the Hive metastore.
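
Before pasting a snippet into the safety valve, it can help to confirm that it is well-formed XML and names both properties. The following is a minimal sketch in Python, not a Cloudera Manager feature. The helper names parse_snippet and check_snippet are hypothetical, and the sketch assumes the snippet is a bare sequence of <property> elements, as in the example above.

```python
import xml.etree.ElementTree as ET

# Property names taken from the core-site.xml example above.
REQUIRED = ("fs.s3a.access.key", "fs.s3a.secret.key")


def parse_snippet(snippet):
    """Parse a bare sequence of <property> elements into a {name: value} dict.

    Raises xml.etree.ElementTree.ParseError if the snippet is not well-formed.
    """
    # The snippet has no single root element, so wrap it before parsing.
    root = ET.fromstring("<configuration>" + snippet + "</configuration>")
    props = {}
    for prop in root.findall("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        props[name] = value
    return props


def check_snippet(snippet):
    """Return a list of problems; an empty list means both keys are set."""
    props = parse_snippet(snippet)
    problems = []
    for key in REQUIRED:
        if key not in props:
            problems.append("missing property: " + key)
        elif not props[key]:
            problems.append("empty value for: " + key)
    return problems


if __name__ == "__main__":
    example = """
    <property>
       <name>fs.s3a.access.key</name>
       <value>your_access_key</value>
    </property>
    <property>
       <name>fs.s3a.secret.key</name>
       <value>your_secret_key</value>
    </property>
    """
    for problem in check_snippet(example):
        print(problem)
```

The check covers only the structure of the snippet. It does not contact AWS, so it cannot tell whether the keys themselves are valid or have access to a given bucket.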