Python Supported Versions
Cloudera Data Science Workbench supports the following Python versions.
The default Cloudera Data Science Workbench engine includes Python 2.7.11 and Python 3.6.10. Cloudera Data Science Workbench supports the Python versions that come bundled with the base engine image. To use PySpark with the HDP cluster, the Spark executors must have access to a matching version of Python. On many common operating systems, the default system Python does not match the minor release of Python included in Cloudera Data Science Workbench.
To ensure that the Python versions match, Python can either be installed on every HDP host or made available per job run using Spark’s ability to distribute dependencies. Given the size of a typical isolated Python environment and the desire to avoid repeated uploads from gateway hosts, Cloudera recommends installing Python 2.7 and 3.6 on the cluster if you are using PySpark with lambda functions.
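As a rough illustration of what "matching" means here, the following sketch compares two version strings at the major.minor level. The helper names (minor_release, versions_match) are hypothetical and are not part of Cloudera Data Science Workbench or Spark; the executor version string would have to be collected separately from a cluster host.

```python
import sys


def minor_release(version):
    """Return (major, minor) from a version string such as '3.6.10'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_match(driver_version, executor_version):
    """True when both versions share the same major.minor release."""
    return minor_release(driver_version) == minor_release(executor_version)


# Version of the interpreter running this session (the driver side).
driver_version = "{}.{}.{}".format(*sys.version_info[:3])

# Hypothetical executor versions, for illustration only.
print(versions_match("3.6.10", "3.6.8"))   # same minor release
print(versions_match("3.6.10", "3.7.4"))   # different minor release
```

Under this check, patch-level differences (3.6.10 versus 3.6.8) are treated as compatible, while a different minor release (3.6 versus 3.7) is not.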
You can install Python 2.7 and 3.6 on the cluster using any method and set the corresponding PYSPARK_PYTHON environment variable in your project. Cloudera Data Science Workbench includes a separate environment variable for Python 3 sessions called PYSPARK3_PYTHON. Python 2 sessions continue to use the default PYSPARK_PYTHON variable. This allows you to run Python 2 and Python 3 sessions in parallel without either variable being overridden by the other.
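The split between the two variables can be pictured with a minimal sketch. This only illustrates the lookup described above and is not the actual Cloudera Data Science Workbench implementation; the function names and the interpreter paths are assumptions made for the example.

```python
def pyspark_python_variable(major_version):
    """Name of the variable a session of the given Python major version reads."""
    return "PYSPARK3_PYTHON" if major_version == 3 else "PYSPARK_PYTHON"


def resolve_interpreter(environ, major_version):
    """Look up the interpreter path for a session, or None if it is unset."""
    return environ.get(pyspark_python_variable(major_version))


# Hypothetical project environment; the paths are placeholders.
example_env = {
    "PYSPARK_PYTHON": "/opt/python27/bin/python",
    "PYSPARK3_PYTHON": "/opt/python36/bin/python3",
}

print(resolve_interpreter(example_env, 2))
print(resolve_interpreter(example_env, 3))
```

Because each session type reads its own variable, setting one does not change the interpreter used by the other.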
Anaconda
- Install the Anaconda package on all cluster hosts. For installation instructions, refer to the Anaconda installation documentation.
- Set the ANACONDA_DIR property in the Cloudera Data Science Workbench configuration file, cdsw.conf. This can be done when you first configure cdsw.conf during the installation or later.
- Restart Cloudera Data Science Workbench for this change to take effect.
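Before restarting, it can help to confirm that the property is actually set. The sketch below assumes cdsw.conf uses simple KEY="value" lines; the read_conf_value helper, the sample text, and the /opt/anaconda path are hypothetical and shown only for illustration.

```python
def read_conf_value(conf_text, key):
    """Return the value assigned to key in KEY="value" style text, or None."""
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            # Drop surrounding whitespace and optional quotes.
            return value.strip().strip('"').strip("'")
    return None


# Hypothetical excerpt of a configuration file (assumed format).
sample_conf = '''
# Anaconda installation directory on the cluster hosts
ANACONDA_DIR="/opt/anaconda"
'''

print(read_conf_value(sample_conf, "ANACONDA_DIR"))
```

An empty string or None from this check would suggest that the property still needs to be set.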