Accessing data stored in Amazon S3 through Spark
To access data stored in Amazon S3 from Spark applications, use Hadoop file APIs (SparkContext.hadoopFile, JavaHadoopRDD.saveAsHadoopFile, SparkContext.newAPIHadoopRDD, and JavaHadoopRDD.saveAsNewAPIHadoopFile) for reading and writing RDDs, providing URLs of the form s3a://[***BUCKET_NAME***]/path/to/file.
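For example, a minimal sketch of reading and writing an RDD through the Hadoop file API, assuming a spark-shell session where sc is the SparkContext; the bucket name and paths are placeholders:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Read text data from S3 as an RDD of (offset, line) pairs.
val lines = sc.hadoopFile[LongWritable, Text, TextInputFormat](
  "s3a://bucket1/path/to/input")

// Extract the line text and write the result back to S3.
lines.map { case (_, text) => text.toString }
  .saveAsTextFile("s3a://bucket1/path/to/output")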
You can read and write Spark SQL DataFrames using the Data Source API.
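For example, a minimal sketch of reading a CSV file from one bucket and writing it out as Parquet to another, assuming spark is the SparkSession; bucket names and paths are placeholders:

// Read a CSV file from S3 into a DataFrame.
val df = spark.read
  .option("header", "true")
  .csv("s3a://bucket1/path/to/input.csv")

// Write the DataFrame back to S3 in Parquet format.
df.write
  .mode("overwrite")
  .parquet("s3a://bucket2/path/to/output/")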
Make sure that your environment is configured to allow access to the buckets you need. You must also configure:
- for Spark 2, the spark.yarn.access.hadoopFileSystems parameter
- for Spark 3, the spark.kerberos.access.hadoopFileSystems parameter
For example:
Spark 2
- Client configuration (/etc/spark/conf/spark-defaults.conf):
  spark.yarn.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2
- spark-shell:
  spark-shell --conf "spark.yarn.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" ...
- spark-submit:
  spark-submit --conf "spark.yarn.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" ...
Spark 3
- Client configuration (/etc/spark3/conf/spark-defaults.conf):
  spark.kerberos.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2
- spark3-shell:
  spark3-shell --conf "spark.kerberos.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" ...
- spark3-submit:
  spark3-submit --conf "spark.kerberos.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" ...
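Once the shell or application is started with the buckets listed in this parameter, you can read from and write to them as usual. For example, a quick check from spark3-shell; the object path is a placeholder:

// Read a text file from one of the configured buckets and show a few lines.
spark.read.text("s3a://bucket1/path/to/file").show(5)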