Running Spark with Yarn on the Cloudera base cluster

The primary supported way to run Spark workloads on Cloudera Machine Learning uses Spark on Kubernetes. This is different from Cloudera Data Science Workbench, which uses Spark on Yarn to run Spark workloads.

For users who are migrating projects from Cloudera Data Science Workbench to Cloudera Machine Learning, or who have existing Yarn workloads, Cloudera Machine Learning Private Cloud offers a way to run those Spark on Yarn workloads on the Cloudera base cluster. This is sometimes called Spark pushdown. It allows the Spark workloads to run without having to modify them for Kubernetes.

The Cloudera Machine Learning Administrator must enable this mode for a Cloudera Machine Learning Workspace, and each Cloudera Machine Learning project must enable this mode for its workloads to run Spark applications on the attached Cloudera base cluster.

When this mode is enabled, each newly launched Cloudera Machine Learning workload has port forwarding rules set up in Kubernetes. Additionally, Spark configurations are set in the Cloudera Machine Learning session so that Spark applications launched from it run in client mode, with executors in Yarn on the attached base cluster.
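For example, in a session of a pushdown-enabled project, a Spark application can be started with a plain SparkSession builder; this minimal sketch (the application name is illustrative) relies on the injected configuration to reach Yarn:

    from pyspark.sql import SparkSession

    # In a pushdown-enabled session the injected configuration already points
    # Spark at Yarn in client mode, so no cluster settings are needed here.
    spark = SparkSession.builder.appName("pushdown-smoke-test").getOrCreate()

    # With pushdown active, the master reports "yarn".
    print(spark.sparkContext.master)

    spark.stop()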

Prerequisites

In Cloudera Machine Learning, Spark on Yarn pushdown workloads are supported only with ML Runtimes.

General requirements:

  • Spark pushdown functionality only works with DEX 1.19.1 Spark Addons.
  • The Yarn service must be configured and running in your Cloudera base cluster.
  • The Spark On Yarn service must be configured and running in your Cloudera base cluster.
  • The Cloudera base cluster must have access to the Spark drivers that run on Data Service Hosts running Cloudera Machine Learning workloads. These are launched on randomized ports in the range 30000-32768.

PySpark requirements:

  • Python must be installed on all Cloudera base cluster Yarn Node Manager nodes, and its version must match the Python version of the selected ML Runtime (that is, 3.7 or 3.8). A verification sketch follows this list.
  • The Python binary available on Yarn Node Manager nodes must be specified in the PYSPARK_PYTHON environment variable.
    • As an example for Python 3.7, you can set the environment variable for a Cloudera Machine Learning project with Spark pushdown enabled as follows:
      "PYSPARK_PYTHON": "/usr/local/bin/python3.7"
    • PYSPARK_PYTHON - Defines the location of the Python binary used by executors running on Yarn nodes.
    • PYSPARK_DRIVER_PYTHON - Defines the location of the Python binary used by the driver running in a Cloudera Machine Learning session.
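As a quick sanity check, the following sketch (the application name is illustrative) compares the driver's Python version with the version the executors use on the Yarn nodes; a mismatch usually means PYSPARK_PYTHON points at the wrong binary:

    import sys
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("python-version-check").getOrCreate()

    # Driver Python: controlled by PYSPARK_DRIVER_PYTHON in the session.
    driver_version = ".".join(map(str, sys.version_info[:2]))

    def remote_version(_):
        # Executor Python: controlled by PYSPARK_PYTHON on the Yarn nodes.
        import sys
        return ".".join(map(str, sys.version_info[:2]))

    executor_versions = (
        spark.sparkContext.parallelize(range(4), 2)
        .map(remote_version)
        .distinct()
        .collect()
    )

    print("driver:", driver_version, "executors:", executor_versions)
    spark.stop()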

Enabling Spark on the base cluster

Spark can be enabled on the base cluster both site-wide and at the project level.

  1. Select Site Administration > Settings.
  2. Select Allow users to enable Spark Pushdown Configuration for Projects.
  3. Select Project Settings > Settings > Enable Spark Pushdown.

    This is a project-specific setting to enable Spark pushdown for all newly launched workloads in the project. Each project that intends to use the Cloudera base cluster Yarn for Spark workloads must enable this setting.

Spark application dependencies

Because Spark on Yarn runs in a unique mode in Cloudera Machine Learning (the driver runs in the Cloudera Machine Learning session while the executors run on the base cluster), dependencies must be handled differently than when jobs run entirely on the base cluster.

To determine which dependencies are required on the cluster, consider that Spark application code runs in Spark executor processes distributed throughout the cluster. If the Python code you are running uses any third-party libraries, the executors require access to those libraries when they run on remote hosts.

Refer to the following Spark configurations to determine how dependencies can be made available to executors.

Jars:

  • spark.yarn.jars
    • By default, this is not set in a Cloudera Machine Learning Spark pushdown project, to ensure that all Spark jars loaded from the Cloudera Machine Learning Spark Runtime Addon are made available to Yarn executors.
    • This configuration must not be overridden within your Cloudera Machine Learning projects. Consider using spark.yarn.dist.jars to indicate external references to jars (see the sketch after these lists).
    • This results in some minor transfer time of Spark jars when starting Spark applications.
  • spark.yarn.dist.jars
    • This is not configured by Cloudera Machine Learning.

Python:
  • spark.submit.pyFiles
    • By default, this is set to /opt/spark/python/lib/*.zip to ensure that the pyspark and py4j .zip files included in Cloudera Machine Learning Spark Runtime Addons are available to executors.
    • The configuration can be overridden, but you must keep the original /opt/spark/python/lib/*.zip in the new custom list (see the sketch after these lists).

Extra files:
  • spark.yarn.dist.archives - This is not configured by Cloudera Machine Learning.
  • spark.yarn.dist.files - This is not configured by Cloudera Machine Learning.
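The following sketch pulls these settings together. The external jar and helper zip paths are hypothetical placeholders; note that the spark.submit.pyFiles override keeps the original /opt/spark/python/lib/*.zip entry, as required above:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("pushdown-deps-example")
        # Ship an external jar to the Yarn executors (path is a placeholder).
        .config("spark.yarn.dist.jars", "/home/cdsw/libs/my-udfs.jar")
        # Override spark.submit.pyFiles, keeping the original zips in the list.
        .config(
            "spark.submit.pyFiles",
            "/opt/spark/python/lib/*.zip,/home/cdsw/libs/helpers.zip",
        )
        .getOrCreate()
    )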

User-specified Spark application configurations

spark-defaults.conf

Multiple Spark configuration sources are appended to a single file for Spark pushdown in Cloudera Machine Learning Private Cloud. The sources are appended in the following order, and because the contents of /etc/spark/conf/spark-defaults.conf are loaded top-down, entries that appear lower in the file take precedence:

  1. base cluster Spark spark-defaults.conf Defaults and Safety valves
  2. Cloudera Machine Learning system-specific configurations injection
  3. Cloudera Machine Learning Project spark-defaults.conf

Check the content of /etc/spark/conf/spark-defaults.conf inside the Cloudera Machine Learning session for the final configuration used by the Spark driver.
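For example, both the merged file and the configuration the running driver actually resolved can be inspected from a session; a minimal sketch:

    from pyspark.sql import SparkSession

    # Print the merged spark-defaults.conf as seen inside the session.
    with open("/etc/spark/conf/spark-defaults.conf") as conf_file:
        print(conf_file.read())

    # Alternatively, inspect the configuration the running driver resolved.
    spark = SparkSession.builder.getOrCreate()
    for key, value in sorted(spark.sparkContext.getConf().getAll()):
        print(f"{key}={value}")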

Cloudera Machine Learning-injected Spark application configurations

Cloudera Machine Learning applies certain Spark configurations to enable or simplify Spark on base cluster Yarn workloads.

Spark environment variables

Multiple environment variable sources are considered when setting up the Cloudera Machine Learning session that runs the interactive Spark driver.

For spark-env.sh:

  • base cluster Spark spark-env.sh Defaults and Safety valves
  • Cloudera Machine Learning system-specific Spark environment overrides

For the Cloudera Machine Learning session environment (see the inspection sketch after this list):

  • Content of constructed spark-env.sh
  • Workspace environment variables
  • Project environment variables
  • User environment variables
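To confirm which values won out after all of these sources are merged, the session environment can be inspected directly. A minimal sketch; any variable name can be checked the same way:

    import os

    # PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are the variables described in
    # the prerequisites above; substitute any project or workspace variable.
    for name in ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"):
        print(name, "=", os.environ.get(name, "<not set>"))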