Spark on ML Runtimes

Spark is supported for ML Runtimes with Python 3.6 and Python 3.7 kernels, provided that the following workaround is applied on the cluster:

  • Python must be installed on the CDH master node, and its version must match the Python version of the selected ML Runtime (i.e. 3.6 or 3.7)
  • The path to this Python installation must be specified for Spark using the PYSPARK_PYTHON environment variable
  • For example, for Python 3.7 the environment variable could be set like this for the CDSW project:
    • "PYSPARK_PYTHON": "/usr/local/bin/python3.7"
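
The setting above can be sanity-checked from inside a session. The following is a minimal sketch, not part of the product: the helper names (kernel_version, version_from_path, check_pyspark_python) are hypothetical, and the check only guesses the interpreter version from a file name ending in something like python3.7. It does not open the path, since the path refers to the cluster host rather than the session, so a result of None means the check was inconclusive.

```python
import os
import re
import sys


def kernel_version(version_info=sys.version_info):
    """Return the running kernel's version as a 'major.minor' string."""
    return "%d.%d" % (version_info[0], version_info[1])


def version_from_path(python_path):
    """Guess 'major.minor' from a name like python3.7; None if absent."""
    # Heuristic (assumption): the interpreter file name carries its version.
    match = re.search(r"python(\d+\.\d+)$", os.path.basename(python_path))
    return match.group(1) if match else None


def check_pyspark_python(env=None, kernel=None):
    """Return (ok, message); ok is None when the check is inconclusive."""
    env = os.environ if env is None else env
    kernel = kernel_version() if kernel is None else kernel
    path = env.get("PYSPARK_PYTHON")
    if not path:
        return False, "PYSPARK_PYTHON is not set"
    guessed = version_from_path(path)
    if guessed is None:
        return None, "cannot infer a version from %s" % path
    if guessed != kernel:
        return False, "%s looks like Python %s, kernel is %s" % (path, guessed, kernel)
    return True, "%s matches kernel Python %s" % (path, kernel)


if __name__ == "__main__":
    # Print only the human-readable message for the current session.
    print(check_pyspark_python()[1])
```

Running this in a session started with the example value above and a Python 3.7 kernel should report a match; with a Python 3.6 kernel it should flag the mismatch before any Spark job is submitted.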