Running Spark with Yarn on the Cloudera base cluster

The primary supported way to run Spark workloads on Cloudera Machine Learning uses Spark on Kubernetes. This is different from Cloudera Data Science Workbench, which uses Spark on Yarn to run Spark workloads.

For users who are migrating projects from Cloudera Data Science Workbench to Cloudera Machine Learning, or who have existing Yarn workloads, Cloudera Machine Learning Private Cloud offers a way to run those Spark on Yarn workloads on the Cloudera base cluster. This is sometimes called Spark pushdown. It allows the Spark workloads to run without having to modify them for Kubernetes.

The Cloudera Machine Learning Administrator must enable this mode for a Cloudera Machine Learning Workspace, and each Cloudera Machine Learning project must enable this mode for its workloads to run Spark applications on the attached Cloudera base cluster.

When this mode is enabled, each newly launched Cloudera Machine Learning workload has port forwarding rules set up in Kubernetes. Additionally, Spark configurations are set in the Cloudera Machine Learning session so that Spark applications launched from it run in client mode, with executors in Yarn on the attached base cluster.
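For example, in a session of a pushdown-enabled project, a Spark application can be started with a plain SparkSession builder; this minimal sketch (the application name is illustrative) relies on the injected configuration to reach Yarn:

    from pyspark.sql import SparkSession

    # In a pushdown-enabled session the injected configuration already points
    # Spark at Yarn in client mode, so no cluster settings are needed here.
    spark = SparkSession.builder.appName("pushdown-smoke-test").getOrCreate()

    # With pushdown active, the master reports "yarn".
    print(spark.sparkContext.master)

    spark.stop()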

Prerequisites

In Cloudera Machine Learning, Spark on Yarn pushdown workloads are supported only with ML Runtimes.

General requirements:

  • Spark pushdown functionality only works with DEX 1.19.1 Spark Addons.
  • The Yarn service must be configured and running in your Cloudera base cluster.
  • The Spark On Yarn service must be configured and running in your Cloudera base cluster.
  • The Cloudera base cluster must have access to the Spark drivers that run on Data Service Hosts running Cloudera Machine Learning workloads. These are launched on randomized ports in the range 30000-32768.

PySpark requirements:

  • Python must be installed on all Cloudera base cluster Yarn Node Manager nodes, and its version must match the Python version of the selected ML Runtime (that is, 3.7 or 3.8). A verification sketch follows this list.
  • The Python binary available on Yarn Node Manager nodes must be specified in the PYSPARK_PYTHON environment variable.
    • As an example for Python 3.7, you can set the environment variable for a Cloudera Machine Learning project with Spark pushdown enabled as follows:
      "PYSPARK_PYTHON": "/usr/local/bin/python3.7"
    • PYSPARK_PYTHON - Defines the location of the Python binary used by executors running on Yarn nodes.
    • PYSPARK_DRIVER_PYTHON - Defines the location of the Python binary used by the driver running in a Cloudera Machine Learning session.
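As a quick sanity check, the following sketch (the application name is illustrative) compares the driver's Python version with the version the executors use on the Yarn nodes; a mismatch usually means PYSPARK_PYTHON points at the wrong binary:

    import sys
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("python-version-check").getOrCreate()

    # Driver Python: controlled by PYSPARK_DRIVER_PYTHON in the session.
    driver_version = ".".join(map(str, sys.version_info[:2]))

    def remote_version(_):
        # Executor Python: controlled by PYSPARK_PYTHON on the Yarn nodes.
        import sys
        return ".".join(map(str, sys.version_info[:2]))

    executor_versions = (
        spark.sparkContext.parallelize(range(4), 2)
        .map(remote_version)
        .distinct()
        .collect()
    )

    print("driver:", driver_version, "executors:", executor_versions)
    spark.stop()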

Enabling Spark on the base cluster

Spark can be enabled on the base cluster both site-wide and at the project level.

  1. Select Site Administration > Settings.
  2. Select Allow users to enable Spark Pushdown Configuration for Projects.
  3. Select Project Settings > Settings > Enable Spark Pushdown.

    This is a project-specific setting to enable Spark pushdown for all newly launched workloads in the project. Each project that intends to use the Cloudera base cluster Yarn for Spark workloads must enable this setting.

Spark application dependencies

Because Spark on Yarn runs in a unique mode in Cloudera Machine Learning (the driver runs in the Cloudera Machine Learning session while the executors run on the base cluster), dependencies must be handled differently than when jobs run entirely on the base cluster.

To determine which dependencies are required on the cluster, consider that Spark application code runs in Spark executor processes distributed throughout the cluster. If the Python code you are running uses any third-party libraries, the executors require access to those libraries when they run on remote hosts.

Refer to the following Spark configurations to determine how dependencies can be made available to executors.

Jars:

  • spark.yarn.jars
    • By default, this is not set in a Cloudera Machine Learning Spark pushdown project, to ensure that all Spark jars loaded from the Cloudera Machine Learning Spark Runtime Addon are made available to Yarn executors.
    • This configuration must not be overridden within your Cloudera Machine Learning projects. Consider using spark.yarn.dist.jars to indicate external references to jars (see the sketch after these lists).
    • This results in some minor transfer time of Spark jars when starting Spark applications.
  • spark.yarn.dist.jars
    • This is not configured by Cloudera Machine Learning.

Python:
  • spark.submit.pyFiles
    • By default, this is set to /opt/spark/python/lib/*.zip to ensure that the pyspark and py4j .zip files included in Cloudera Machine Learning Spark Runtime Addons are available to executors.
    • The configuration can be overridden, but you must keep the original /opt/spark/python/lib/*.zip in the new custom list (see the sketch after these lists).

Extra files:
  • spark.yarn.dist.archives - This is not configured by Cloudera Machine Learning.
  • spark.yarn.dist.files - This is not configured by Cloudera Machine Learning.
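The following sketch pulls these settings together. The external jar and helper zip paths are hypothetical placeholders; note that the spark.submit.pyFiles override keeps the original /opt/spark/python/lib/*.zip entry, as required above:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("pushdown-deps-example")
        # Ship an external jar to the Yarn executors (path is a placeholder).
        .config("spark.yarn.dist.jars", "/home/cdsw/libs/my-udfs.jar")
        # Override spark.submit.pyFiles, keeping the original zips in the list.
        .config(
            "spark.submit.pyFiles",
            "/opt/spark/python/lib/*.zip,/home/cdsw/libs/helpers.zip",
        )
        .getOrCreate()
    )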

User-specified Spark application configurations

spark-defaults.conf

Multiple Spark configuration sources are appended to a single file for Spark pushdown in Cloudera Machine Learning Private Cloud. The sources are appended in the following order, and because the contents of /etc/spark/conf/spark-defaults.conf are loaded top-down, entries that appear lower in the file take precedence:

  1. base cluster Spark spark-defaults.conf Defaults and Safety valves
  2. Cloudera Machine Learning system-specific configurations injection
  3. Cloudera Machine Learning Project spark-defaults.conf

Check the content of /etc/spark/conf/spark-defaults.conf inside the Cloudera Machine Learning session for the final configuration used by the Spark driver.
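For example, both the merged file and the configuration the running driver actually resolved can be inspected from a session; a minimal sketch:

    from pyspark.sql import SparkSession

    # Print the merged spark-defaults.conf as seen inside the session.
    with open("/etc/spark/conf/spark-defaults.conf") as conf_file:
        print(conf_file.read())

    # Alternatively, inspect the configuration the running driver resolved.
    spark = SparkSession.builder.getOrCreate()
    for key, value in sorted(spark.sparkContext.getConf().getAll()):
        print(f"{key}={value}")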

Cloudera Machine Learning-injected Spark application configurations

Cloudera Machine Learning applies certain Spark configurations to enable or simplify Spark on base cluster Yarn workloads.

Spark environment variables

Multiple environment variable sources are considered when setting up the Cloudera Machine Learning session that runs the interactive Spark driver.

For spark-env.sh:

  • base cluster Spark spark-env.sh Defaults and Safety valves
  • Cloudera Machine Learning system-specific Spark environment overrides

For the Cloudera Machine Learning session environment (see the inspection sketch after this list):

  • Content of constructed spark-env.sh
  • Workspace environment variables
  • Project environment variables
  • User environment variables
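To confirm which values won out after all of these sources are merged, the session environment can be inspected directly. A minimal sketch; any variable name can be checked the same way:

    import os

    # PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are the variables described in
    # the prerequisites above; substitute any project or workspace variable.
    for name in ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"):
        print(name, "=", os.environ.get(name, "<not set>"))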