Accessing data stored in Amazon S3 through Spark
To access data stored in Amazon S3 from Spark applications, use Hadoop file APIs (`SparkContext.hadoopFile`, `JavaHadoopRDD.saveAsHadoopFile`, `SparkContext.newAPIHadoopRDD`, and `JavaHadoopRDD.saveAsNewAPIHadoopFile`) for reading and writing RDDs, providing URLs of the form `s3a://bucket_name/path/to/file`.
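For example, the following Scala sketch reads text data from S3 through the new Hadoop API and writes the results back; the bucket name and paths are placeholders.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object S3RddExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("S3RddExample"))

    // Read text data from S3 using the new Hadoop API
    // (key = byte offset, value = line contents)
    val lines = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
      "s3a://bucket1/path/to/input")

    // Keep only the line contents and write them back to S3
    lines.map { case (_, text) => text.toString }
      .saveAsTextFile("s3a://bucket1/path/to/output")

    sc.stop()
  }
}
```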
You can read and write Spark SQL DataFrames using the Data Source API.
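For instance, a minimal sketch of reading and writing a DataFrame might look like the following; the bucket, file paths, and formats are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("S3DataFrameExample").getOrCreate()

// Read a CSV file from S3 into a DataFrame through the Data Source API
val df = spark.read
  .option("header", "true")
  .csv("s3a://bucket1/path/to/input.csv")

// Write the DataFrame back to S3 in Parquet format
df.write.parquet("s3a://bucket1/path/to/output.parquet")
```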
Make sure that your environment is configured to allow access to the buckets you need. You must also configure the `spark.yarn.access.hadoopFileSystems` parameter to include the buckets you need to access. You can do this using the Spark client configuration, or at runtime as a command-line parameter. For example:
- Client configuration (`/etc/spark/conf/spark-defaults.conf`): `spark.yarn.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2`
- spark-shell: `spark-shell --conf "spark.yarn.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" ...`
- spark-submit: `spark-submit --conf "spark.yarn.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" ...` (a fuller invocation is sketched after this list)
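Putting the pieces together, a complete spark-submit invocation might look like the following sketch; the main class, application jar, and bucket names are hypothetical placeholders.

```bash
spark-submit \
  --class com.example.S3RddExample \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.yarn.access.hadoopFileSystems=s3a://bucket1,s3a://bucket2" \
  s3-example.jar
```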