Running Spark Python applications

Learn how to configure and maintain your environment to access Spark with Python.

Accessing Spark with Java and Scala offers many advantages: platform independence by running inside the JVM, self-contained packaging of code and its dependencies into JAR files, and higher performance because Spark itself runs in the JVM. You lose these advantages when using the Spark Python API.

Managing dependencies and making them available for Python jobs on a cluster can be difficult. To determine which dependencies are required on the cluster, you must understand that Spark application code runs in Spark executor processes distributed throughout the cluster. If the Python transformations you define use any third-party libraries, such as NumPy or nltk, the Spark executors require access to those libraries on the hosts where they run.

Configuring the Python executable for Spark

Spark uses the version of Python found in the default path on the cluster nodes (typically /usr/bin/python). To use a specific version of Python, such as Python 3.11, you must configure the environment variables and Spark properties in Cloudera Manager.

For Spark 2:

Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)

  1. In the Cloudera Manager Admin Console, go to the Spark 2 service.
  2. Click the Configuration tab.
  3. Search for Spark 2 Service Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh and add the following:
    export PYSPARK_PYTHON=[***PATH_TO_PYTHON***]
    export PYSPARK_DRIVER_PYTHON=[***PATH_TO_PYTHON***]
  4. Search for Spark 2 Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf and add the following:
    spark.yarn.appMasterEnv.PYSPARK_PYTHON=[***PATH_TO_PYTHON***]
    spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=[***PATH_TO_PYTHON***]
    spark.pyspark.python=[***PATH_TO_PYTHON***]
    spark.pyspark.driver.python=[***PATH_TO_PYTHON***]
  5. Enter a Reason for change, and then click Save Changes to commit the changes.
  6. Restart the affected services and redeploy client configurations.

For Spark 3:

Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)

  1. In the Cloudera Manager Admin Console, go to the Spark 3 service.
  2. Click the Configuration tab.
  3. Search for Spark 3 Service Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh and add the following:
    export PYSPARK_PYTHON=/usr/bin/python3.11
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.11
  4. Search for Spark 3 Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf and add the following:
    spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3.11
    spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=/usr/bin/python3.11
    spark.pyspark.python=/usr/bin/python3.11
    spark.pyspark.driver.python=/usr/bin/python3.11
  5. Enter a Reason for change, and then click Save Changes to commit the changes.
  6. Restart the affected services and redeploy client configurations.
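
After the client configuration is redeployed, one way to confirm which interpreter is actually in use is to report sys.executable from the driver and from the executors. The following is a minimal sketch, not an official verification procedure. The helper name describe_python is hypothetical, and the commented-out lines assume a PySpark session in which a SparkContext named sc already exists.

```python
import platform
import sys


def describe_python(_=None):
    """Return the interpreter path and version of the process running this function."""
    return (sys.executable, platform.python_version())


# Driver side: report the interpreter that runs this script.
driver_python = describe_python()
print("driver:", driver_python)

# Executor side (assumption: an existing SparkContext named sc, as in the pyspark shell).
# Each task reports its own interpreter; distinct() collapses identical answers.
# executor_pythons = sc.parallelize(range(8), 8).map(describe_python).distinct().collect()
# print("executors:", executor_pythons)
```

If the driver and executor results differ from the path you configured, the client configuration has probably not been redeployed or the services have not been restarted.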

Self-Contained Dependencies

In a common situation, a custom Python package contains functionality you want to apply to each element of an RDD. You can use a map() function call to make sure that each Spark executor imports the required package, before calling any of the functions inside that package. The following shows a simple example:

def import_my_special_package(x):
  import my.special.package
  return x

int_rdd = sc.parallelize([1, 2, 3, 4])
# map() is lazy and returns a new RDD, so call collect() on its result to run the import.
int_rdd.map(lambda x: import_my_special_package(x)).collect()

You create a simple RDD of four elements and call it int_rdd. Then you apply the function import_my_special_package to every element of the int_rdd. This function imports my.special.package and then returns the original argument passed to it. Calling this function as part of a map() operation ensures that each Spark executor imports my.special.package when needed.

If you only need a single file inside my.special.package, you can direct Spark to make this available to all executors by using the --py-files option in your spark-submit command and specifying the local path to the file. You can also specify this programmatically by using the sc.addPyFile() function. If you use functionality from a package that spans multiple files, you can make an egg for the package, because the --py-files flag also accepts a path to an egg file.

If you have a self-contained dependency, you can make the required Python dependency available to your executors in two ways:

  • If you depend on only a single file, you can use the --py-files command-line option, or programmatically add the file to the SparkContext with sc.addPyFile(path) and specify the local path to that Python file.
  • If you have a dependency on a self-contained module (a module with no other dependencies), you can create an egg or zip file of that module and use either the --py-files command-line option or programmatically add the module to the SparkContext with sc.addPyFile(path) and specify the local path to that egg or zip file.
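
The zip option can be sketched with the Python standard library alone. The example below creates a throwaway package (the name my_special_package and its double function are hypothetical, used only for illustration), zips it so that the package directory sits at the root of the archive, and then imports it from the archive locally as a sanity check. It assumes the package is pure Python.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Hypothetical package created only for this illustration.
work_dir = tempfile.mkdtemp()
pkg_dir = os.path.join(work_dir, "my_special_package")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("def double(x):\n    return 2 * x\n")

# Zip the package so that the package directory sits at the archive root.
zip_path = os.path.join(work_dir, "my_special_package.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    for root, _dirs, files in os.walk(pkg_dir):
        for name in files:
            full_path = os.path.join(root, name)
            zf.write(full_path, os.path.relpath(full_path, work_dir))

# Local sanity check: put the archive on sys.path and import from it.
sys.path.insert(0, zip_path)
my_special_package = importlib.import_module("my_special_package")
result = my_special_package.double(21)
print(result)
```

An archive built this way could then be passed to spark-submit with --py-files, or added from the driver with sc.addPyFile(zip_path).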

Complex Dependencies

Some operations rely on complex packages that also have many dependencies. Although such a package is too complex to distribute as a *.py file, you can create an egg for it and all of its dependencies, and send the egg file to executors using the --py-files option.

Limitations of Distributing Egg Files on Heterogeneous Clusters

If you are running a heterogeneous cluster, with machines of different CPU architectures, sending egg files is impractical because packages that contain native code must be compiled for a single specific CPU architecture. Therefore, distributing an egg for complex, compiled packages like NumPy, SciPy, and pandas often fails. Instead of distributing egg files, install the required Python packages on each host of the cluster and specify the path to the Python binaries for the worker hosts to use.
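
Before distributing an archive, it can help to check whether it contains compiled extensions at all. The following is a rough sketch that relies on file suffixes only, so it is a heuristic rather than a definitive test. The function name native_members and the demonstration archive are hypothetical, and the sketch assumes the egg is a zipped archive rather than an unpacked directory.

```python
import os
import tempfile
import zipfile

# File suffixes that typically indicate compiled, architecture-specific code.
NATIVE_SUFFIXES = (".so", ".pyd", ".dll", ".dylib")


def native_members(archive_path):
    """Return the names of archive members that look like compiled extensions."""
    with zipfile.ZipFile(archive_path) as zf:
        return [n for n in zf.namelist() if n.lower().endswith(NATIVE_SUFFIXES)]


# Build a small demonstration archive with one pure Python file and one fake binary.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.egg")
with zipfile.ZipFile(demo_path, "w") as zf:
    zf.writestr("demo/__init__.py", "")
    zf.writestr("demo/_speedups.so", b"not a real shared object")

found = native_members(demo_path)
print(found)
```

An empty result suggests the archive is pure Python and can be shipped with --py-files. A non-empty result suggests installing the package on each host instead.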

Installing and Maintaining Python Environments

Installing and maintaining Python environments can be complex but allows you to use the full Python package ecosystem. Ideally, a sysadmin installs the Anaconda distribution or sets up a virtual environment on every host of your cluster with your required dependencies.

If you are using Cloudera Manager, you can deploy the Anaconda distribution as a parcel as follows:

Minimum Required Role: Cluster Administrator (also provided by Full Administrator)

  1. Add the following URL https://repo.anaconda.com/pkgs/misc/parcels/ to the Remote Parcel Repository URLs as described in "Parcel Configuration Settings."
  2. Download, distribute, and activate the parcel as described in "Managing Parcels."

Anaconda is installed in parcel directory/Anaconda, where parcel directory is /opt/cloudera/parcels by default, but can be changed in parcel configuration settings. The Anaconda parcel is supported by Continuum Analytics.

If you are not using Cloudera Manager, you can set up a virtual environment on your cluster by running commands on each host using Cluster SSH, Parallel SSH, or Fabric. Assuming each host has Python and pip installed, use the following commands to set up the standard data stack (NumPy, SciPy, scikit-learn, and pandas) in a virtual environment on a RHEL 6-compatible system:

# Install python-devel:
yum install python-devel

# Install non-Python dependencies required by SciPy that are not installed by default:
yum install atlas atlas-devel lapack-devel blas-devel

# Install virtualenv:
pip install virtualenv

# Create a new virtualenv:
virtualenv mynewenv

# Activate the virtualenv:
source mynewenv/bin/activate

# Install packages in mynewenv:
pip install numpy
pip install scipy
pip install scikit-learn
pip install pandas
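
Once the packages are installed, a short check run with the environment's interpreter on each host can confirm that they are importable. This is a minimal sketch. The helper name missing_modules is hypothetical, and the list uses import names, which is why scikit-learn appears as sklearn.

```python
import importlib.util

# Import names of the packages installed above (scikit-learn is imported as sklearn).
REQUIRED_MODULES = ["numpy", "scipy", "sklearn", "pandas"]


def missing_modules(module_names):
    """Return the names that the current interpreter cannot find."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Running the check with mynewenv/bin/python rather than the system interpreter verifies the virtual environment itself.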