Using Amazon S3 Object Storage

For clusters running on AWS, Amazon S3 (Simple Storage Service) provides an efficient and cost-effective cloud storage option. For information on the uses of Amazon S3 in a CDH cluster, and how to configure Amazon S3 using Cloudera Manager, see How to Configure AWS Credentials and Configuring the Amazon S3 Connector in the Cloudera Enterprise documentation. For links to more topics focused on Amazon S3 from the core Cloudera Enterprise documentation library, see Get Started with Amazon S3.

Configuring Amazon S3 with Cloudera Director

You can configure cluster access to Amazon S3 storage through Cloudera Director by launching your cluster with a configuration file and the bootstrap-remote CLI command. Cloudera Director makes the necessary API calls and passes your AWS access key information or IAM role information to Cloudera Manager, so that S3 access is set up according to your configuration settings. The aws.reference.conf configuration file contains sample content for the sections needed to configure Amazon S3 access, but this content is commented out by default. To provide your cluster instances with access to Amazon S3, configure the following sections of the configuration file:
  1. First, create an external account with AWS access in the External Accounts section of your configuration file. As described in the configuration file comments, there are two choices for authentication: AWS access key authentication or IAM role authentication.
    • To use AWS access key authentication, uncomment the appropriate section shown below and provide an AWS access key and an AWS secret key.
    • To use IAM role authentication, uncomment the appropriate section shown below. Choose or create an IAM policy that includes Amazon S3 access (such as the AWS-managed policy AmazonS3FullAccess), and attach this policy to the IAM role that you assign to your cluster instances. IAM roles for instances that will use S3Guard should also include a policy that gives access to DynamoDB (such as the AWS-managed policy AmazonDynamoDBFullAccess). Specify the IAM role for the instances with the iamProfileName property in the common-instance-properties section of the configuration file.
      #
      # External accounts
      #
    
      # # Any external accounts that should be set up within Cloudera Manager. These will allow some cluster
      # # services to utilize cloud functionality, such as object stores.
      #
      # # Note: CM/CDH 5.10 is required for this feature. At the moment, only AWS external accounts are supported.
      # externalAccounts {
      #
      #     # External account that uses AWS Access Key Authentication. This type of authentication
      #     # will also require the AWS_S3 service.
      #     AWSAccount1 {
      #         type: AWS_ACCESS_KEY_AUTH
      #         configs {
      #             aws_access_key: REPLACE-ME
      #             aws_secret_key: REPLACE-ME
      #
      #             #
      #             # S3 Guard (added in CM/CDH 5.11) can be enabled to guarantee a consistent view of data stored
      #             # in Amazon S3 by storing additional metadata in a table residing in an Amazon DynamoDB instances.
      #             # See https://docs.cloudera.com/documentation/enterprise/latest/topics/cm_s3guard.html for more
      #             # details and additional S3 Guard configuration properties.
      #             #
      #
      #             # s3guard_enable: false
      #             # s3guard_region: REPLACE-ME
      #             # s3guard_table_name: s3guard-metadata
      #             # s3guard_table_auto_create: false
      #         }
      #     }
      #
      #     # External account that uses IAM Role Authentication.
      #     AWSAccount2 {
      #         type: AWS_IAM_ROLES_AUTH
      #     }
    Optionally, to use S3Guard with IAM role authentication, copy the S3Guard configurations from the access key authentication configs block to the IAM role authentication section and configure them.
      # s3guard_enable: false
      # s3guard_region: REPLACE-ME
      # s3guard_table_name: s3guard-metadata
      # s3guard_table_auto_create: false
    For descriptions of the S3Guard configuration properties, see the table in Configuring S3Guard in the Cloudera Enterprise documentation. Use the API names given in that table when adding properties to the configs block of the Cloudera Director configuration file. For more information about the differences between AWS access key authentication and IAM role authentication, including the characteristics and use cases of each, see the corresponding sections in How to Configure AWS Credentials in the Cloudera Enterprise documentation.
  2. Next, if you are using access key authentication, add (or uncomment) the Cloudera S3 Connector service, AWS_S3, in the list of cluster services in the Cluster description section of the configuration file. If you are using IAM role authentication, add the AWS_S3 service only if you are enabling S3Guard; without S3Guard, IAM role authentication does not require the AWS_S3 service.
     services: [
                    HDFS,
                    YARN,
                    ZOOKEEPER,
                    HBASE,
                    HIVE,
                    HUE,
                    OOZIE,
                    SPARK_ON_YARN,
                    KAFKA,
                    SOLR,
                    FLUME,
                    IMPALA,
                    SQOOP,
                    ACCUMULO16,
                    KS_INDEXER,
                    # SENTRY,    # Sentry requires Kerberos to be enabled
                    SPARK2_ON_YARN,
                    KUDU,
                    # AWS_S3     # Requires Sentry and Kerberos (on default configurations)
                  ]
  3. Finally, in the custom service configurations section, point the AWS_S3 service to the external account you created in step 1:
        #
        # Optional custom service configurations
        # Configuration keys containing special characters (e.g., '.', ':') must be enclosed in double quotes.
        #
        # Configuration properties for CDH roles and services are documented at
        # https://docs.cloudera.com/documentation/enterprise/properties/5-11-x/topics/cm_props_cdh5110.html
        #
    
        #
        # configs {
        #     AWS_S3 {
        #       cloud_account: AWSAccount1
        #     }
        #
        #     HDFS {
        #       dfs_block_size: 134217728
        #     }
        #
        #     MAPREDUCE {
        #       mapred_system_dir: /user/home
        #       mr_user_to_impersonate: mapred1
        #     }
        #
        #     KAFKA {
        #       "num.partitions": 3
        #     }
        # }