Python Supported Versions

Cloudera Data Science Workbench supports the following Python versions.

The default Cloudera Data Science Workbench (CDSW) engine includes Python 2.7.11 and Python 3.6.10. CDSW supports the Python versions that come bundled with the base engine image. To use PySpark within the HDP cluster, the Spark executors must have access to a matching version of Python. On many common operating systems, the default system Python does not match the minor release of Python included in Cloudera Data Science Workbench.
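The matching requirement above can be illustrated with a minimal sketch. It assumes that "matching" means the same major and minor release (for example, 3.6.x on both sides). The helper names minor_release and same_minor_release are hypothetical and are not part of Cloudera Data Science Workbench or Spark.

```python
import sys


def minor_release(version):
    """Return (major, minor) from a version string such as '3.6.10'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def same_minor_release(driver_version, executor_version):
    """True when both versions share the same major and minor release.

    Assumption: the patch level (the third number) may differ.
    """
    return minor_release(driver_version) == minor_release(executor_version)


# Version of the interpreter running the current session (the driver side).
driver_version = "{}.{}.{}".format(*sys.version_info[:3])

# Example values only: compare the bundled Python 3.6.10 with a host Python.
print(same_minor_release("3.6.10", "3.6.8"))
print(same_minor_release("3.6.10", "3.7.3"))
```

In a running PySpark session, the executor-side version could be gathered by mapping a function that returns platform.python_version() over a small RDD and passing the result to same_minor_release together with driver_version.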

To ensure that the Python versions match, you can either install Python on every HDP host or make it available for each job run by using Spark’s ability to distribute dependencies. Given the size of a typical isolated Python environment and the desire to avoid repeated uploads from gateway hosts, Cloudera recommends installing Python 2.7 and 3.6 on the cluster if you are using PySpark with lambda functions.

You can install Python 2.7 and 3.6 on the cluster using any method, and then set the corresponding PYSPARK_PYTHON environment variable in your project. Cloudera Data Science Workbench includes a separate environment variable for Python 3 sessions called PYSPARK3_PYTHON. Python 2 sessions continue to use the default PYSPARK_PYTHON variable. Because the two variables are separate, you can run Python 2 and Python 3 sessions in parallel without either variable being overridden by the other.
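The variable selection described above can be sketched as follows. This is an illustration only: the interpreter paths are hypothetical placeholders, and the functions pyspark_python_variable and interpreter_for_session are invented for this example. They model the behavior described in the text and are not a Cloudera Data Science Workbench API.

```python
def pyspark_python_variable(session_major):
    """Name of the project environment variable a session reads.

    Python 3 sessions use PYSPARK3_PYTHON; Python 2 sessions keep
    the default PYSPARK_PYTHON.
    """
    return "PYSPARK3_PYTHON" if session_major == 3 else "PYSPARK_PYTHON"


def interpreter_for_session(project_env, session_major):
    """Look up the interpreter path that applies to the given session."""
    return project_env.get(pyspark_python_variable(session_major))


# Hypothetical project environment variables. The paths are assumptions;
# use the locations where Python is actually installed on the cluster hosts.
project_env = {
    "PYSPARK_PYTHON": "/opt/python2.7/bin/python",
    "PYSPARK3_PYTHON": "/opt/python3.6/bin/python3",
}

print(interpreter_for_session(project_env, 2))
print(interpreter_for_session(project_env, 3))
```

Because each session type reads its own variable, both entries can be set in the same project at once.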

Anaconda

Continuum Analytics and Cloudera have partnered to create an Anaconda parcel for CDH to enable simple distribution, installation, and management of popular Python packages and their dependencies. Note that this parcel is not directly supported by Cloudera.