Managing Dependencies for Spark 2 Jobs

As with any Spark job, you can add external packages to the executors on startup.

To add external dependencies to Spark jobs, specify the libraries you want included by setting the appropriate configuration parameters in a spark-defaults.conf file. The following properties are the ones most commonly used for adding dependencies:

spark.files

Comma-separated list of files to be placed in the working directory of each Spark executor.

spark.submit.pyFiles

Comma-separated list of .zip, .egg, or .py files to place on PYTHONPATH for Python applications.

spark.jars

Comma-separated list of local jars to include on the Spark driver and Spark executor classpaths.

spark.jars.packages

Comma-separated list of Maven coordinates of jars to include on the Spark driver and Spark executor classpaths. When configured, Spark searches the local Maven repository, then Maven Central, and finally any additional remote repositories configured by spark.jars.ivy. The format for the coordinates is groupId:artifactId:version.

spark.jars.ivy

Comma-separated list of additional remote repositories to search for the coordinates given with spark.jars.packages.
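
These properties are normally set in spark-defaults.conf or passed to spark-submit, but the short PySpark sketch below shows the same kinds of properties set programmatically when the session is built. This is only an illustration: the application name is made up, the values match the sample configuration in the next section, and dependency properties such as spark.jars.packages and spark.jars must be supplied before the driver JVM starts, so this pattern is dependable mainly when a script is launched directly with python rather than through spark-submit.

from pyspark.sql import SparkSession

# Minimal sketch (illustrative values): set the dependency properties on the
# builder before the session, and therefore the driver JVM, is created.
spark = (
    SparkSession.builder
    .appName("dependency-demo")  # illustrative name
    .config("spark.jars.packages", "org.scalaj:scalaj-http_2.11:2.3.0")
    .config("spark.jars", "my_sample.jar")
    .config("spark.files", "data/test_data_1.csv,data/test_data_2.csv")
    .getOrCreate()
)

# Confirm the property reached the Spark configuration.
print(spark.sparkContext.getConf().get("spark.jars.packages"))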

Example spark-defaults.conf

Here is a sample spark-defaults.conf file that uses some of the Spark configuration parameters discussed in the previous section to add external packages on startup.

spark.jars.packages org.scalaj:scalaj-http_2.11:2.3.0
spark.jars my_sample.jar
spark.files data/test_data_1.csv,data/test_data_2.csv

spark.jars

The pre-existing jar my_sample.jar, residing in the root of this project, will be included on the Spark driver and executor classpaths.

spark.files

The two sample data sets, test_data_1.csv and test_data_2.csv, from the data/ directory of this project will be distributed to the working directory of each Spark executor.
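
As a sketch of how files shipped this way are typically read, the PySpark snippet below assumes an application submitted with the spark.files setting above and uses SparkFiles.get to resolve the file name to its local path in each executor's working directory. The application name, partition count, and line-counting logic are illustrative only.

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-files-demo").getOrCreate()  # illustrative name
sc = spark.sparkContext

def count_lines(partition):
    # On each executor, SparkFiles.get resolves a file distributed via
    # spark.files to its path in that executor's working directory.
    with open(SparkFiles.get("test_data_1.csv")) as f:
        yield sum(1 for _ in f)

# Run the function as two tasks; each task opens its local copy of the file.
print(sc.parallelize(range(2), 2).mapPartitions(count_lines).collect())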

For more advanced configuration options, visit the Apache Spark 2 reference documentation.