Accessing Amazon S3 data using Cloudera Data Engineering
Some additional configuration is required to access Amazon S3 buckets from a Cloudera Data Engineering (CDE) job.
To access data stored in Amazon S3 from Spark applications, use Hadoop file APIs (such as JavaHadoopRDD.saveAsNewAPIHadoopFile) for reading and writing RDDs, providing URLs that use the s3a:// scheme (for example, s3a://bucket1).
Make sure that your user or group is mapped to an AWS IAM role with access to the buckets you need. For more information, see Onboarding CDP users and groups for cloud storage. In particular, make sure that the IAM role you are mapped to has the getBucketLocation permission on the buckets you need access to.
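Once the role mapping is in place, reading and writing through the Hadoop file APIs might look like the following minimal PySpark sketch. It assumes that sc is an existing SparkContext and uses the PySpark counterparts of the Hadoop file APIs (newAPIHadoopFile and saveAsNewAPIHadoopFile). The helper names s3a_url and copy_text_records, and any object paths passed to them, are hypothetical and not part of CDE.

```python
def s3a_url(bucket: str, key: str) -> str:
    """Build an s3a:// URL from a bucket name and an object key (hypothetical helper)."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


def copy_text_records(sc, src_url: str, dst_url: str) -> None:
    """Read text records from src_url and write them to dst_url.

    `sc` is assumed to be a pyspark.SparkContext created by the job.
    """
    text_input = "org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
    text_output = "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat"
    long_writable = "org.apache.hadoop.io.LongWritable"
    text = "org.apache.hadoop.io.Text"

    # Each input record is a (byte offset, line) pair.
    records = sc.newAPIHadoopFile(src_url, text_input, long_writable, text)

    # Key and value classes are left for Spark to infer from the data.
    records.saveAsNewAPIHadoopFile(dst_url, text_output)
```

If the source and destination are in different buckets, both buckets need to be accessible to the job.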
You must also configure the spark.kerberos.access.hadoopFileSystems (for Spark 3 jobs) or spark.yarn.access.hadoopFileSystems (for Spark 2 jobs) parameter to include the buckets you need to access. You can do this using the CDE job configuration, or at runtime as a command line parameter.
- Job configuration
- Navigate to the Cloudera Data Engineering Overview page by clicking the Data Engineering tile in the Cloudera Data Platform (CDP) management console.
- In the Environments column, select the environment containing the virtual cluster with the job you want to configure.
- In the Virtual Clusters column on the right, click the View Jobs icon on the virtual cluster with the job you want to configure.
- In the left-hand menu, click Jobs.
- Click on the job you want to configure.
- Go to the Configuration tab.
- Click Edit.
- Click the Add Configuration icon.
- Enter spark.kerberos.access.hadoopFileSystems (for Spark 3 jobs) or spark.yarn.access.hadoopFileSystems (for Spark 2 jobs) in the config_key field.
- In the config_value field, add a comma-separated list of buckets that the job requires access to, with no spaces after the commas. For example: s3a://bucket1,s3a://bucket2
- cde spark submit
- If you are submitting a job using the CDE CLI, use the --conf command flag. For example:
cde spark submit --conf "spark.kerberos.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" ...
cde spark submit --conf "spark.yarn.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" ...
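The value format described above (comma-separated s3a:// URLs with no spaces, under one key for Spark 3 jobs and another for Spark 2 jobs) is easy to mistype by hand, so a small helper can assemble it. The following Python sketch is only an illustration: the names CONF_KEYS, filesystem_conf, and submit_prefix are hypothetical, and it builds only the leading arguments shown in the examples, leaving the rest of the cde spark submit command to the caller.

```python
# Configuration keys named in this section, indexed by Spark major version.
CONF_KEYS = {
    3: "spark.kerberos.access.hadoopFileSystems",
    2: "spark.yarn.access.hadoopFileSystems",
}


def filesystem_conf(buckets, spark_major_version=3):
    """Return the 'key=value' string to pass to --conf (hypothetical helper)."""
    if spark_major_version not in CONF_KEYS:
        raise ValueError(f"unsupported Spark major version: {spark_major_version}")
    urls = []
    for bucket in buckets:
        name = bucket.strip()
        if not name or "," in name or " " in name:
            raise ValueError(f"invalid bucket entry: {bucket!r}")
        # Accept either a bare bucket name or a full s3a:// URL.
        urls.append(name if name.startswith("s3a://") else f"s3a://{name}")
    if not urls:
        raise ValueError("at least one bucket is required")
    # Join with commas only: the value must not contain spaces after the commas.
    return f"{CONF_KEYS[spark_major_version]}={','.join(urls)}"


def submit_prefix(buckets, spark_major_version=3):
    """Return the leading argv for 'cde spark submit' (job arguments not included)."""
    return ["cde", "spark", "submit", "--conf",
            filesystem_conf(buckets, spark_major_version)]
```

A caller would append the job-specific arguments to the list returned by submit_prefix(["bucket1", "bucket2"]) before running the command. Passing the arguments as a list rather than a single string avoids shell quoting issues around the commas.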