Accessing Amazon S3 data using Cloudera Data Engineering

Some additional configuration is required to access Amazon S3 buckets from a Cloudera Data Engineering (CDE) job.

To access data stored in Amazon S3 from Spark applications, use Hadoop file APIs (SparkContext.hadoopFile, JavaHadoopRDD.saveAsHadoopFile, SparkContext.newAPIHadoopRDD, and JavaHadoopRDD.saveAsNewAPIHadoopFile) for reading and writing RDDs, providing URLs of the form s3a://bucket_name/path/to/file.
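
For example, a minimal PySpark sketch that reads a text file from one bucket through the new Hadoop API and writes the result to another. The bucket and path names are placeholders, and the sketch assumes the bucket access configuration described below is already in place:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-access-example").getOrCreate()
sc = spark.sparkContext

# Read a text file from S3 through the new Hadoop API, using an s3a:// URL.
# The bucket and path names are placeholders.
lines = sc.newAPIHadoopFile(
    "s3a://bucket1/path/to/input",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
).values()

# Transform the records and write them back to S3 as text files.
lines.map(lambda line: line.upper()).saveAsTextFile("s3a://bucket2/path/to/output")

spark.stop()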

Make sure that your user or group is mapped to an AWS IAM role with access to the buckets you need. For more information, see Onboarding CDP users and groups for cloud storage. In particular, make sure that the IAM role you are mapped to has the s3:GetBucketLocation permission on the buckets you need access to.

You must also configure the spark.kerberos.access.hadoopFileSystems (for Spark 3 jobs) or spark.yarn.access.hadoopFileSystems (for Spark 2 jobs) parameter to include the buckets you need to access. You can do this in the CDE job configuration, or at runtime as a command-line parameter to cde spark submit.

For example:

Job configuration
  1. In the Cloudera Data Platform (CDP) management console, click the Data Engineering tile, and then click Overview.
  2. In the CDE Services column, select the environment containing the virtual cluster with the job you want to configure.
  3. In the Virtual Clusters column on the right, click the View Jobs icon for the virtual cluster containing the job you want to configure.
  4. In the left-hand menu, click Jobs.
  5. Click the job you want to configure.
  6. Go to the Configuration tab.
  7. Click Edit.
  8. Click the Add Configuration icon.
  9. Enter spark.kerberos.access.hadoopFileSystems (for Spark 3 jobs) or spark.yarn.access.hadoopFileSystems (for Spark 2 jobs) in the config_key field.
  10. In the config_value field, add a comma-separated list of buckets that the job requires access to, with no spaces after the commas. For example:
    s3a://bucket1,s3a://bucket2
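
After you save the configuration, the job picks up the property the next time it runs. If you want to confirm the value from inside the application, a minimal PySpark check (shown with the Spark 3 property name) might look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Print the configured filesystems; falls back to "not set" if the property was not applied.
# For Spark 2 jobs, check spark.yarn.access.hadoopFileSystems instead.
print(spark.sparkContext.getConf().get("spark.kerberos.access.hadoopFileSystems", "not set"))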
cde spark submit
If you are submitting the job using the CDE CLI, pass the configuration with the --conf flag. For example:
Spark 3:
cde spark submit --conf "spark.kerberos.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" ...
Spark 2:
cde spark submit --conf "spark.yarn.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" ...
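
With the filesystems configured by either method, the application can refer to the buckets directly by their s3a:// URLs. A minimal PySpark sketch using the DataFrame API, again with placeholder bucket and path names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-dataframe-example").getOrCreate()

# Read a CSV dataset from the first configured bucket (placeholder path).
df = spark.read.option("header", "true").csv("s3a://bucket1/path/to/input.csv")

# Write the result to the second configured bucket as Parquet (placeholder path).
df.write.mode("overwrite").parquet("s3a://bucket2/path/to/output")

spark.stop()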