Accessing Amazon S3 data using Cloudera Data Engineering

Some additional configuration is required to access Amazon S3 buckets from a Cloudera Data Engineering (CDE) job.

To access data stored in Amazon S3 from Spark applications, use Hadoop file APIs (SparkContext.hadoopFile, JavaHadoopRDD.saveAsHadoopFile, SparkContext.newAPIHadoopRDD, and JavaHadoopRDD.saveAsNewAPIHadoopFile) for reading and writing RDDs, providing URLs of the form s3a://bucket_name/path/to/file.
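
For example, a minimal PySpark sketch that reads a text file from one bucket through the new Hadoop API and writes the result to another. The bucket and path names are placeholders, and the sketch assumes the bucket access configuration described below is already in place:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-access-example").getOrCreate()
sc = spark.sparkContext

# Read a text file from S3 through the new Hadoop API, using an s3a:// URL.
# The bucket and path names are placeholders.
lines = sc.newAPIHadoopFile(
    "s3a://bucket1/path/to/input",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
).values()

# Transform the records and write them back to S3 as text files.
lines.map(lambda line: line.upper()).saveAsTextFile("s3a://bucket2/path/to/output")

spark.stop()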

Make sure that your user or group is mapped to an AWS IAM role with access to the buckets you need. For more information, see Onboarding CDP users and groups for cloud storage. In particular, make sure that the IAM role you are mapped to has the s3:GetBucketLocation permission on the buckets you need access to.

You must also configure the spark.kerberos.access.hadoopFileSystems (for Spark 3 jobs) or spark.yarn.access.hadoopFileSystems (for Spark 2 jobs) parameter to include the buckets you need to access. You can do this in the CDE job configuration, or at runtime as a command-line parameter to cde spark submit.

For example:

Job configuration
  1. In the Cloudera Data Platform (CDP) management console, click the Data Engineering tile, and then click Overview.
  2. In the CDE Services column, select the environment containing the virtual cluster with the job you want to configure.
  3. In the Virtual Clusters column on the right, click the View Jobs icon for the virtual cluster containing the job you want to configure.
  4. In the left-hand menu, click Jobs.
  5. Click the job you want to configure.
  6. Go to the Configuration tab.
  7. Click Edit.
  8. Click the Add Configuration icon.
  9. Enter spark.kerberos.access.hadoopFileSystems (for Spark 3 jobs) or spark.yarn.access.hadoopFileSystems (for Spark 2 jobs) in the config_key field.
  10. In the config_value field, add a comma-separated list of buckets that the job requires access to, with no spaces after the commas. For example:
    s3a://bucket1,s3a://bucket2
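
After you save the configuration, the job picks up the property the next time it runs. If you want to confirm the value from inside the application, a minimal PySpark check (shown with the Spark 3 property name) might look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Print the configured filesystems; falls back to "not set" if the property was not applied.
# For Spark 2 jobs, check spark.yarn.access.hadoopFileSystems instead.
print(spark.sparkContext.getConf().get("spark.kerberos.access.hadoopFileSystems", "not set"))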
cde spark submit
If you are submitting the job using the CDE CLI, pass the configuration with the --conf flag. For example:
Spark 3:
cde spark submit --conf "spark.kerberos.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" ...
Spark 2:
cde spark submit --conf "spark.yarn.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" ...
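
With the filesystems configured by either method, the application can refer to the buckets directly by their s3a:// URLs. A minimal PySpark sketch using the DataFrame API, again with placeholder bucket and path names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-dataframe-example").getOrCreate()

# Read a CSV dataset from the first configured bucket (placeholder path).
df = spark.read.option("header", "true").csv("s3a://bucket1/path/to/input.csv")

# Write the result to the second configured bucket as Parquet (placeholder path).
df.write.mode("overwrite").parquet("s3a://bucket2/path/to/output")

spark.stop()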