Accessing data stored in Amazon S3 through Spark

To access data stored in Amazon S3 from Spark applications, use the Hadoop file APIs (SparkContext.hadoopFile, JavaHadoopRDD.saveAsHadoopFile, SparkContext.newAPIHadoopRDD, and JavaHadoopRDD.saveAsNewAPIHadoopFile) to read and write RDDs, providing URLs of the form s3a://bucket_name/path/to/file. You can also read and write Spark SQL DataFrames using the Data Source API.
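For example, the following is a minimal Scala sketch; the bucket and object names (my-bucket, logs/input.txt, logs/events.json, output/events.parquet) are hypothetical placeholders. It reads an RDD of text lines from S3, then reads and writes a DataFrame through the Data Source API:

import org.apache.spark.sql.SparkSession

object S3Example {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("S3Example").getOrCreate()

    // Read an RDD of lines from an object in S3 using the s3a:// scheme.
    // my-bucket and the paths below are placeholders for your own bucket and objects.
    val lines = spark.sparkContext.textFile("s3a://my-bucket/logs/input.txt")
    println(s"Line count: ${lines.count()}")

    // Read and write Spark SQL DataFrames using the Data Source API.
    val events = spark.read.json("s3a://my-bucket/logs/events.json")
    events.write.parquet("s3a://my-bucket/output/events.parquet")

    spark.stop()
  }
}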

Make sure that your environment is configured to allow access to the buckets you need. You must also configure the spark.yarn.access.hadoopFileSystems parameter to include the buckets you need to access. You can do this in the Spark client configuration, or at runtime as a command-line parameter.

For example:

In the Spark client configuration (/etc/spark/conf/spark-defaults.conf):
    spark.yarn.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2

On the spark-shell command line:
    spark-shell --conf "spark.yarn.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" ...

On the spark-submit command line:
    spark-submit --conf "spark.yarn.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" ...
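As a quick check after setting the parameter, you can read a file from one of the listed buckets in spark-shell. A hypothetical session, where bucket1 and data.csv are placeholders for your own bucket and object:

    spark-shell --conf "spark.yarn.access.hadoopFileSystems=s3a://bucket1"

    scala> val df = spark.read.option("header", "true").csv("s3a://bucket1/data.csv")
    scala> df.show(5)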