Managing Dependencies for Spark 2 Jobs
As with any Spark job, you can add external packages to the
executor on startup. To add external dependencies to Spark jobs, specify the
libraries you want added by using the appropriate configuration parameter in
a spark-defaults.conf
file.
The following table lists the most commonly used configuration parameters for adding dependencies and how they can be used:
Property | Description |
---|---|
spark.files
|
Comma-separated list of files to be placed in the working directory of each Spark executor. |
spark.submit.pyFiles
|
Comma-separated list of |
spark.jars
|
Comma-separated list of local jars to include on the Spark driver and Spark executor classpaths. |
spark.jars.packages
|
Comma-separated list of Maven coordinates of jars to include
on the Spark driver and Spark executor classpaths. When
configured, Spark will search the local Maven repo, and then
Maven central and any additional remote repositories
configured by |
spark.jars.ivy
|
Comma-separated list of additional remote repositories to
search for the coordinates given with
|
Example spark-defaults.conf
spark-defaults.conf
file that uses
some of the Spark configuration parameters discussed in the previous
section to add external packages on startup.
spark.jars.packages org.scalaj:scalaj-http_2.11:2.3.0
spark.jars my_sample.jar
spark.files data/test_data_1.csv,data/test_data_2.csv
-
spark.jars.packages
-
The
scalaj
package will be downloaded from Maven central and included on the Spark driver and executor classpaths. -
spark.jars
-
The pre-existing jar,
my_sample.jar
, residing in the root of this project will be included on the Spark driver and executor classpaths. -
spark.files
-
The two sample data sets,
test_data_1.csv
andtest_data_2.csv
, from the/data
directory of this project will be distributed to the working directory of each Spark executor.
For more advanced configuration options, visit the Apache 2 reference documentation.