Spark on ML Runtimes

Only certain Spark versions are supported on ML Runtimes.

Spark is supported for ML Runtimes with Python 3.7 and above. Python 3.6 is no longer supported because it has reached end-of-life. Different ML Runtimes support different versions of Python. Review the list of Pre-Installed Packages in ML Runtimes to determine which Runtime supports which specific Python kernel.

To use Python with Spark, ensure that:
  • The Python version installed on the CDH cluster Node Manager nodes matches the Python version of the selected ML Runtime (for example, 3.7 or above).
  • The /usr/bin/python symlink on all Node Manager nodes points to the path where Python is installed.
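
One way to check the symlink requirement on a node is a small script such as the following. This is a minimal sketch, not part of the product: the helper name describe_python_link is hypothetical, and it only inspects the node it runs on, so it would need to be run on each Node Manager node.

```python
import os


def describe_python_link(link_path="/usr/bin/python"):
    """Return (is_symlink, resolved_target) for link_path.

    is_symlink is True only if link_path itself is a symbolic link.
    resolved_target is the final path after following all links.
    """
    return os.path.islink(link_path), os.path.realpath(link_path)


if __name__ == "__main__":
    is_link, target = describe_python_link()
    print("symlink:", is_link, "resolves to:", target)
```

If the resolved path differs from the location where the intended Python is installed, the symlink needs to be updated on that node.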
To use multiple versions of Python 3:
  1. Install each version of Python at the same path on every Node Manager node.
  2. Point the /usr/bin/python symlink on all nodes to the Python version that will be used by default.
  3. To use a non-default version of Python, go to Project Settings > Advanced > Environment Variables and set PYSPARK_PYTHON to the path of that version. For example:
    • PYSPARK_PYTHON=/usr/bin/python3.9
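
Because PySpark refuses to run when the driver and worker Python minor versions differ, it can help to confirm that the interpreter named in PYSPARK_PYTHON reports the same major.minor version as the session's own Python. The following is a minimal sketch under that assumption; the helper names interpreter_version and matches_runtime are hypothetical, and the check only covers the machine it runs on.

```python
import subprocess
import sys


def interpreter_version(python_path):
    """Return (major, minor) as reported by the interpreter at python_path."""
    result = subprocess.run(
        [python_path, "-c", "import sys; print('%d %d' % sys.version_info[:2])"],
        check=True,           # raise if the interpreter exits with an error
        capture_output=True,  # collect stdout instead of printing it
        text=True,
    )
    major, minor = result.stdout.split()
    return int(major), int(minor)


def matches_runtime(python_path):
    """True if python_path has the same major.minor version as this process."""
    return interpreter_version(python_path) == tuple(sys.version_info[:2])
```

For example, calling matches_runtime(os.environ["PYSPARK_PYTHON"]) from a session (after importing os) would indicate whether the configured interpreter lines up with the Runtime's Python on that machine.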