Setting Up a PySpark Project
The default Cloudera Data Science Workbench engine currently includes Python 2.7.18 and Python 3.6.10.
PySpark Environment Variables
To use PySpark with lambda functions that run within the CDH cluster, the Spark executors must have access to a matching version of Python. For many common operating systems, the default system Python will not match the minor release of Python included in Data Science Workbench.
To ensure that the Python versions match, Python can either be installed on every CDH host or made available per job run using Spark’s ability to distribute dependencies. Given the size of a typical isolated Python environment and the desire to avoid repeated uploads from gateway hosts, Cloudera recommends installing Python 2.7 and 3.6 on the cluster if you are using PySpark with lambda functions.
You can install Python 2.7 and 3.6 on the cluster using any method and set the corresponding PYSPARK_PYTHON environment variable in your project. Cloudera Data Science Workbench 1.3 (and higher) include a separate environment variable for Python 3 sessions called PYSPARK3_PYTHON. Python 2 sessions continue to use the default PYSPARK_PYTHON variable. This will allow you to run Python 2 and Python 3 sessions in parallel without either variable being overridden by the other.
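For example, assuming the interpreters are installed at the following hypothetical paths on every cluster host, the project environment variables might be set as:
- "PYSPARK_PYTHON": "/usr/local/bin/python2.7"
- "PYSPARK3_PYTHON": "/usr/local/bin/python3.6"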
Creating and Running a PySpark Project
To get started quickly, use the PySpark template project to create a new project. For instructions, see Creating a Project with Legacy Engine Variants or Creating a Project with ML Runtimes Variants.
To run a PySpark project, navigate to the project's overview page, open the workbench console, and launch a Python session.
Testing a PySpark Project in Spark Local Mode
Spark's local mode is often useful for testing and debugging purposes. Use the following sample code snippet to start a PySpark session in local mode.
from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession with a local master for testing
spark = SparkSession \
    .builder \
    .appName("LocalSparkSession") \
    .master("local") \
    .getOrCreate()
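Once the session starts, a quick computation can confirm that local mode is working (a minimal sketch):

# Verify the local session with a trivial job
print(spark.range(100).count())  # expected output: 100

# Stop the session when finished
spark.stop()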
For more details, refer to the Spark documentation: Running Spark Applications.
Spark on ML Runtimes
- Python must be installed on the CDH cluster YARN NodeManager nodes, and its version must match the Python version of the selected ML Runtime (for example, 3.6 or 3.7)
- This Python version must be specified by its path for Spark using the PYSPARK_PYTHON environment variable
- As an example for Python 3.7, one could set the environment variable like this for the CDSW project:
- "PYSPARK_PYTHON": "/usr/local/bin/python3.7"