Adding Cloudera Data Warehouse cluster access to S3 buckets in the same AWS account

In certain scenarios, you might need to interact with data that resides outside of the Data Lake S3 buckets. You can add a bucket to S3, and then, using Hue, define external tables based on the data you put into the bucket, such as a CSV file.

In this task, you see how to add access to an S3 bucket and create an external table based on CSV data using Hue. You must add read/write access to the external S3 bucket before you create the table. From the command line of your cluster, you can run HDFS CLI commands against the S3 bucket. You can also upload a UDF JAR to the S3 bucket, register the UDF, and then include it in queries from your cluster.
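For example, after you complete the steps in this task, you could register a UDF from a JAR stored in the external bucket and call it in a query. The following HiveQL is a minimal sketch only; the bucket name more-sales-data, the JAR path, the Java class name, and the customers table are illustrative assumptions.
    -- Register a UDF from a JAR uploaded to the external bucket
    -- (bucket, JAR path, and Java class name are hypothetical).
    CREATE FUNCTION mask_value AS 'com.example.udf.MaskValue'
      USING JAR 's3a://more-sales-data/udfs/mask-udf.jar';

    -- Call the registered UDF in a query (the customers table is assumed to exist).
    SELECT mask_value(credit_card) FROM customers;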
  • The S3 bucket you add to hold the data outside your Data Lake must be in the same AWS account as your Cloudera Data Warehouse (CDW) service cluster.

  • Required role: DWAdmin
In this task, first you get the cluster ID from the Environments tile in the CDW service UI.
Next, you edit the AWS instance profile.
  1. In the AWS Management Console, navigate to CloudFormation, locate the stack that corresponds to the cluster ID you obtained from the CDW Environments tile, and click its name. The CloudFormation stack name has the form <cluster-ID>-dwx-stack. For example, if the cluster ID is env-6cwwgg, the CloudFormation stack name for the cluster is env-6cwwgg-dwx-stack.
  2. On the CloudFormation stack details page, click the Resources tab, locate NodeInstanceRole in the Logical ID column, and then click the adjacent hyperlink in the Physical ID column.

    This launches the Identity and Access Management (IAM) console.

  3. In the IAM console, locate the s3-read-write-own-buckets policy. Click Show…More if you do not see it.
  4. Expand the s3-read-write-own-buckets policy, and then click Edit.
  5. On the Edit s3-read-write-own-buckets page, click the JSON tab and append the bucket you added to the "Resource" section of the policy JSON. For example, to add access to the more-sales-data bucket, append it at the end of the "Resource" section, as shown in the last two resource names:
    "Resource":[
                "arn:aws:s3:::cdw-sales-hj9s-dwx-managed",
                "arn:aws:s3:::cdw-sales-hj9s-dwx-managed/*",
                "arn:aws:s3:::cdw-sales-hj9s-dwx-external",
                "arn:aws:s3:::cdw-sales-hj9s-dwx-external/*",
               "arn:aws:s3:::more-sales-data",
                "arn:aws:s3:::more-sales-data/*"
         ],
  6. Click Review policy in the lower right corner of the page, and then click Save changes.
    You can now access the bucket outside your Data Lake from Hue in your CDW service cluster. For example, you can create external Hive tables that point to the newly added bucket and join those external tables with tables already in your Data Lake. Furthermore, you can govern CDW user access to this external S3 bucket through Ranger Hadoop SQL policies.
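    The following HiveQL is a minimal sketch of what you might run in Hue after completing this task; the bucket path s3a://more-sales-data/sales/, the table names more_sales and sales_summary, and the column names are illustrative assumptions, so substitute your own schema.
        -- Create an external table over CSV files in the newly added bucket
        -- (bucket path, table name, and columns are hypothetical).
        CREATE EXTERNAL TABLE more_sales (
          sale_id INT,
          region  STRING,
          amount  DOUBLE
        )
        ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
        LOCATION 's3a://more-sales-data/sales/';

        -- Join the external table with a table assumed to already exist in your Data Lake.
        SELECT m.region, SUM(m.amount) AS total_amount
        FROM more_sales m
        JOIN sales_summary s ON m.region = s.region
        GROUP BY m.region;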