Using Spark 2 from Python

Cloudera Data Science Workbench supports using Spark 2 from Python via PySpark.

Setting Up a PySpark Project
The default Cloudera Data Science Workbench engine currently includes Python 2.7.18 and Python 3.6.10.
Spark on ML Runtimes
Spark is supported for ML Runtimes with Python 3.6 and Python 3.7 kernels, provided that the following workaround is applied on the cluster:
Example: Monte Carlo Estimation
Within the template PySpark project, pi.py is a classic example that estimates Pi using the Monte Carlo method.
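The idea behind a Monte Carlo estimate of Pi can be sketched as follows. This is not the pi.py file from the template project; it is a minimal illustration that assumes PySpark is installed, and the function names (inside_unit_circle, estimate_pi, estimate_pi_with_spark) and the default sample and partition counts are hypothetical. Random points are drawn in a square, and the fraction that lands inside the inscribed circle approximates Pi / 4.

```python
import random
from operator import add


def inside_unit_circle(_):
    """Draw one random point in the square [-1, 1] x [-1, 1].

    Returns 1 if the point falls inside the unit circle, otherwise 0.
    The argument is ignored; it exists so the function can be used with map().
    """
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0


def estimate_pi(hits, samples):
    """Convert a hit count into an estimate of Pi (circle area / square area = Pi / 4)."""
    return 4.0 * hits / samples


def estimate_pi_with_spark(samples=100000, partitions=2):
    """Distribute the sampling across Spark executors (hypothetical helper)."""
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("PythonPi").getOrCreate()
    try:
        hits = (
            spark.sparkContext
            .parallelize(range(samples), partitions)  # split the work into partitions
            .map(inside_unit_circle)                   # runs on the executors
            .reduce(add)                               # sum the hits on the driver
        )
        return estimate_pi(hits, samples)
    finally:
        spark.stop()
```

Calling estimate_pi_with_spark() in a session with Spark available returns an approximation that improves as the number of samples grows. The estimate is random, so the exact value differs from run to run.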
Example: Locating and Adding JARs to Spark 2 Configuration
This example shows how to locate the JAR files installed with Spark 2 and add them to the Spark 2 configuration.
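One way to approach this from Python is sketched below. It is an illustration under stated assumptions rather than the documented procedure: the helper names (find_jars, spark_jars_value, build_session_with_jars) are hypothetical, and the directory to search depends on how Spark 2 is installed on your cluster. The only Spark-specific pieces are the spark.jars property, which takes a comma-separated list of JAR paths, and SparkSession.builder.config.

```python
import glob
import os


def find_jars(root, pattern="*.jar"):
    """Return a sorted list of JAR files found under root, searching recursively."""
    return sorted(glob.glob(os.path.join(root, "**", pattern), recursive=True))


def spark_jars_value(paths):
    """Join JAR paths into the comma-separated form expected by spark.jars."""
    return ",".join(paths)


def build_session_with_jars(jar_paths, app_name="JarExample"):
    """Create a SparkSession with the given JARs on its classpath (hypothetical helper)."""
    # Imported here so the path helpers can be used without Spark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars", spark_jars_value(jar_paths))
        .getOrCreate()
    )
```

For example, if the SPARK_HOME environment variable points at the Spark 2 installation (an assumption about your environment), find_jars(os.environ["SPARK_HOME"]) lists the candidate JARs, and a filtered subset of that list can be passed to build_session_with_jars.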
Example: Distributing Dependencies on a PySpark Cluster
Although Python is a popular choice for data scientists, it is not straightforward to make a Python library available on a distributed PySpark cluster. To determine which dependencies are required on the cluster, you must understand that Spark applications run in Spark executor processes distributed throughout the cluster. If the Python code you are running uses any third-party libraries, the Spark executors need access to those libraries on the remote hosts where they run.
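For a small pure-Python package, one common approach is to zip the package and ship it to the executors with SparkContext.addPyFile. The sketch below illustrates that approach only; it is not necessarily the method this example goes on to describe, it does not cover libraries with compiled extensions, and the helper names (zip_package, ship_package) are hypothetical.

```python
import os
import zipfile


def zip_package(package_dir, zip_path):
    """Zip the .py files of a package so the archive can be imported on executors.

    Paths inside the archive are relative to the parent of package_dir, so a
    package directory named 'mylib' is stored as 'mylib/...'.
    """
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _, files in os.walk(package_dir):
            for name in sorted(files):
                if name.endswith(".py"):
                    full_path = os.path.join(folder, name)
                    archive.write(full_path, os.path.relpath(full_path, parent))
    return zip_path


def ship_package(spark_context, package_dir, zip_path):
    """Build the archive and register it with a running SparkContext."""
    # addPyFile distributes the archive to executors for tasks run afterwards.
    spark_context.addPyFile(zip_package(package_dir, zip_path))
```

After ship_package(spark.sparkContext, "mylib", "/tmp/mylib.zip") has run, where mylib and the /tmp path are placeholder names, functions passed to map() or similar operations can import mylib inside the executor processes.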