Configuring CDS 2.x Powered by Apache Spark 2

This topic describes how to set Spark 2 environment variables, manage package dependencies for Spark 2 jobs, and configure logging.

Spark Configuration Files

Cloudera Data Science Workbench supports configuring Spark 2 properties on a per-project basis with the spark-defaults.conf file. If there is a file called spark-defaults.conf in your project root, it will automatically be added to the global Spark defaults. To specify an alternate file location, set the environment variable SPARK_CONFIG to the path of the file relative to your project. If you are accustomed to submitting a Spark job with key-value pairs following a --conf flag, these can also be set in a spark-defaults.conf file instead. For a list of valid key-value pairs, refer to the Spark configuration reference documentation.
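For example, a job that would otherwise be submitted with a flag such as --conf spark.executor.memory=2g can carry the same settings in the project's spark-defaults.conf, one property per line (the values below are purely illustrative):

spark.executor.memory 2g
spark.executor.instances 2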

Administrators can set environment variable paths in the /etc/spark2/conf/spark-env.sh file.
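Entries in spark-env.sh are ordinary shell variable assignments. For example, a line like the following (the path shown is only illustrative) points Spark at a particular Hadoop configuration directory:

export HADOOP_CONF_DIR=/etc/hadoop/conf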

You can also use Cloudera Manager to configure spark-defaults.conf and spark-env.sh globally for all Spark applications as follows.

Configuring Global Properties Using Cloudera Manager

Configure client configuration properties for all Spark applications in spark-defaults.conf as follows:

  1. Go to the Cloudera Manager Admin Console.
  2. Navigate to the Spark service.
  3. Click the Configuration tab.
  4. Search for the Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf property.
  5. Specify properties described in Application Properties (a sample snippet is shown after these steps). If more than one role group applies to this configuration, edit the value for the appropriate role group.
  6. Click Save Changes to commit the changes.
  7. Deploy the client configuration.
  8. Restart Cloudera Data Science Workbench.
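The value entered in the safety valve field uses the same one-property-per-line format as spark-defaults.conf. For example (the properties and values below are illustrative, not recommendations):

spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.shuffle.partitions 200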

For more information on using a spark-defaults.conf file for Spark jobs, visit the Apache Spark 2 reference documentation.

Configuring Spark Environment Variables Using Cloudera Manager

Configure service-wide environment variables for all Spark applications in spark-env.sh as follows:

  1. Go to the Cloudera Manager Admin Console.
  2. Navigate to the Spark 2 service.
  3. Click the Configuration tab.
  4. Search for the Spark Service Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh property and add the paths for the environment variables you want to configure.
  5. Click Save Changes to commit the changes.
  6. Restart the service.
  7. Deploy the client configuration.
  8. Restart Cloudera Data Science Workbench.

Managing Memory Available for Spark Drivers

By default, the memory allocated to the Spark driver process is set to 80 percent (a fraction of 0.8) of the total memory allocated for the engine container. If you want to allocate more or less memory to the Spark driver process, you can override this default by setting the spark.driver.memory property in spark-defaults.conf (as described above).
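For example, a session running in an 8 GB engine would give the driver roughly 6.4 GB by default. A single line in the project's spark-defaults.conf (the 4g value below is only an illustration) overrides that:

spark.driver.memory 4g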

Managing Dependencies for Spark 2 Jobs

As with any Spark job, you can add external packages to the executors on startup. To add external dependencies to Spark jobs, specify the libraries you want added by using the appropriate configuration parameter in a spark-defaults.conf file. The following list describes the most commonly used configuration parameters for adding dependencies and how they can be used:

spark.files
  Comma-separated list of files to be placed in the working directory of each Spark executor.

spark.submit.pyFiles
  Comma-separated list of .zip, .egg, or .py files to place on PYTHONPATH for Python applications.

spark.jars
  Comma-separated list of local jars to include on the Spark driver and Spark executor classpaths.

spark.jars.packages
  Comma-separated list of Maven coordinates of jars to include on the Spark driver and Spark executor classpaths. When configured, Spark will search the local Maven repository, then Maven Central, and then any additional remote repositories configured by spark.jars.ivy. The format for the coordinates is groupId:artifactId:version.

spark.jars.ivy
  Comma-separated list of additional remote repositories to search for the coordinates given with spark.jars.packages.

Example spark-defaults.conf

Here is a sample spark-defaults.conf file that uses some of the Spark configuration parameters discussed in the previous section to add external packages on startup.
spark.jars.packages org.scalaj:scalaj-http_2.11:2.3.0
spark.jars my_sample.jar
spark.files data/test_data_1.csv,data/test_data_2.csv

In this example:

spark.jars.packages
  The scalaj package will be downloaded from Maven Central and included on the Spark driver and executor classpaths.

spark.jars
  The pre-existing jar, my_sample.jar, residing in the root of this project will be included on the Spark driver and executor classpaths.

spark.files
  The two sample data sets, test_data_1.csv and test_data_2.csv, from the data directory of this project will be distributed to the working directory of each Spark executor.
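As a sketch of how a session might use these dependencies once the configuration above is in place (the application name and the line-counting logic below are purely illustrative), a file shipped through spark.files can be located on each executor with SparkFiles:

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dependency-example").getOrCreate()

def count_lines(_):
    # SparkFiles.get() resolves the executor-local copy of a file that was
    # distributed through spark.files.
    with open(SparkFiles.get("test_data_1.csv")) as f:
        yield sum(1 for _ in f)

# Run the function on two partitions; each task reads its local copy of the file.
print(spark.sparkContext.parallelize(range(2), 2).mapPartitions(count_lines).collect())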

For more advanced configuration options, visit the Apache Spark 2 reference documentation.

Spark Logging Configuration

Cloudera Data Science Workbench allows you to update Spark's internal logging configuration on a per-project basis. Spark 2 uses Apache Log4j, which can be configured through a properties file. By default, a log4j.properties file found in the root of your project will be appended to the existing Spark logging properties for every session and job. To specify a custom location, set the environment variable LOG4J_CONFIG to the file location relative to your project.

The Log4j documentation has more details on logging options.

Increasing the log level or pushing logs to an alternate location for troublesome jobs can be very helpful for debugging. For example, this is a log4j.properties file in the root of a project that sets the logging level to INFO for Spark jobs.
shell.log.level=INFO
PySpark logging levels should be set as follows:
log4j.logger.org.apache.spark.api.python.PythonGatewayServer=<LOG_LEVEL>
And Scala logging levels should be set as:
log4j.logger.org.apache.spark.repl.Main=<LOG_LEVEL>
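Putting these pieces together, a project-level log4j.properties that raises the verbosity of both PySpark and Scala sessions might look like the following (DEBUG is used purely as an example level):

shell.log.level=DEBUG
log4j.logger.org.apache.spark.api.python.PythonGatewayServer=DEBUG
log4j.logger.org.apache.spark.repl.Main=DEBUG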