Accessing Amazon S3 data using Cloudera Data Engineering

Some additional configuration is required to access Amazon S3 buckets from a Cloudera Data Engineering (CDE) job.

To access data stored in Amazon S3 from Spark applications, use the Hadoop file APIs (SparkContext.hadoopFile, JavaHadoopRDD.saveAsHadoopFile, SparkContext.newAPIHadoopRDD, and JavaHadoopRDD.saveAsNewAPIHadoopFile) to read and write RDDs, providing URLs of the form s3a://bucket_name/path/to/file.
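As an illustration, a minimal PySpark job might read from and write to an S3 bucket as follows. The bucket name and paths are placeholders, and the sketch uses the RDD text-file convenience methods (which wrap the Hadoop file APIs listed above); it assumes spark.yarn.access.hadoopFileSystems has already been configured to include the bucket:

```python
from pyspark.sql import SparkSession

# Placeholder application name, bucket, and paths; substitute your own.
# Assumes spark.yarn.access.hadoopFileSystems already includes this bucket.
spark = SparkSession.builder.appName("s3a-example").getOrCreate()

# Read a text file from S3 into an RDD using the s3a:// scheme.
lines = spark.sparkContext.textFile("s3a://bucket_name/path/to/input.txt")

# A trivial transformation: keep only non-empty lines.
non_empty = lines.filter(lambda line: line.strip())

# Write the result back to S3 as a Hadoop text file.
non_empty.saveAsTextFile("s3a://bucket_name/path/to/output")

spark.stop()
```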

Make sure that your environment is configured to allow access to the buckets you need. You must also configure the spark.yarn.access.hadoopFileSystems parameter to include those buckets. You can set this parameter in the CDE job configuration, or at runtime as a command-line option.

For example:

Job configuration
  1. Navigate to the Cloudera Data Engineering Overview page by clicking the Data Engineering tile in the Cloudera Data Platform (CDP) management console.
  2. In the Environments column, select the environment containing the virtual cluster with the job you want to configure.
  3. In the Virtual Clusters column on the right, click the View Jobs icon on the virtual cluster with the job you want to configure.
  5. In the left-hand menu, click Jobs.
  6. Click the job you want to configure.
  6. Go to the Configuration tab.
  7. Click Edit.
  8. Click the Add Configuration icon.
  9. Enter spark.yarn.access.hadoopFileSystems in the config_key field.
  10. In the config_value field, add a comma-separated list of the buckets that the job requires access to, with no spaces after the commas. For example: s3a://bucket1,s3a://bucket2
cde spark submit
If you are submitting a job using the CDE CLI, use the --conf command flag. For example:
cde spark submit --conf "spark.yarn.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" ...
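For instance, a complete submission might look like the following; the bucket names and the application file name (wordcount.py) are placeholders for your own values:

```shell
# Submit a Spark application to CDE, granting it access to two S3 buckets.
cde spark submit \
  --conf "spark.yarn.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" \
  wordcount.py
```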