Accessing data stored in Amazon S3 through Spark

To access data stored in Amazon S3 from Spark applications, use Hadoop file APIs (SparkContext.hadoopFile, JavaHadoopRDD.saveAsHadoopFile, SparkContext.newAPIHadoopRDD, and JavaHadoopRDD.saveAsNewAPIHadoopFile) to read and write RDDs, providing URLs of the form s3a://[***BUCKET_NAME***]/path/to/file. You can also read and write Spark SQL DataFrames using the Data Source API.
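
For instance, the following sketch shows both access paths from a Spark shell. The bucket name and paths are placeholders, and it assumes the SparkSession (spark) that the shell provides and a bucket your cluster is already authorized to access:

// RDD read/write through the Hadoop file APIs (textFile and saveAsTextFile use them underneath).
val lines = spark.sparkContext.textFile("s3a://my-bucket/path/to/input.txt")
lines.saveAsTextFile("s3a://my-bucket/path/to/rdd-output")

// DataFrame read/write through the Data Source API.
val df = spark.read.option("header", "true").csv("s3a://my-bucket/path/to/data.csv")
df.write.parquet("s3a://my-bucket/path/to/parquet-output")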

Make sure that your environment is configured to allow access to the buckets you need. You must also configure the following parameter to include the buckets you need to access:
  • for Spark 2, spark.yarn.access.hadoopFileSystems
  • for Spark 3, spark.kerberos.access.hadoopFileSystems
You can set this parameter in the Spark client configuration, or at runtime as a command-line parameter.

For example:

Spark 2

  • Client configuration (/etc/spark/conf/spark-defaults.conf):
    spark.yarn.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2
  • spark-shell:
    spark-shell --conf "spark.yarn.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" ...
  • spark-submit:
    spark-submit --conf "spark.yarn.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" ...

Spark 3

  • Client configuration (/etc/spark3/conf/spark-defaults.conf):
    spark.kerberos.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2
  • spark3-shell:
    spark3-shell --conf "spark.kerberos.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" ...
  • spark3-submit:
    spark3-submit --conf "spark.kerberos.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" ...
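
Putting the two pieces together, a rough end-to-end sketch (the bucket names and file path below are placeholders, not values from your environment) launches a Spark 3 shell with the parameter set and then reads from one of the listed buckets:

spark3-shell --conf "spark.kerberos.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2"

// Inside the shell: read a CSV file from an authorized bucket and inspect it.
val df = spark.read.option("header", "true").csv("s3a://bucket1/path/to/data.csv")
df.printSchema()
println(df.count())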