ML Runtimes Known Issues and Limitations
You might run into some known issues while using ML Runtimes.
TensorFlow 2.16.1 fails to work with GPUs
Because of a known TensorFlow issue (https://github.com/tensorflow/tensorflow/issues/63362), TensorFlow 2.16.1 does not work with GPUs.
Workaround: Set the LD_LIBRARY_PATH environment variable to the following value:
/home/cdsw/.local/lib/python3.X/site-packages/nvidia/cudnn/lib/:$LD_LIBRARY_PATH
Here, python3.X is the Python version you use, for example python3.9.
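The following Python sketch shows one way to assemble that value for the interpreter it runs under. The function name `cudnn_ld_library_path` and its parameters are hypothetical; only the path layout comes from the workaround. The dynamic loader reads LD_LIBRARY_PATH when a process starts, so the printed value is meant to be copied into the environment variable settings rather than assigned from inside a running session.

```python
import os
import sys


def cudnn_ld_library_path(existing="", home="/home/cdsw", version=None):
    """Return an LD_LIBRARY_PATH value with the pip-installed cuDNN directory first.

    `home` and `version` are illustrative parameters. By default the version
    is taken from the running interpreter (for example "3.9").
    """
    if version is None:
        version = f"{sys.version_info.major}.{sys.version_info.minor}"
    cudnn_dir = f"{home}/.local/lib/python{version}/site-packages/nvidia/cudnn/lib/"
    # Keep whatever was already on the path after the cuDNN directory.
    return f"{cudnn_dir}:{existing}" if existing else cudnn_dir


if __name__ == "__main__":
    # Print the value to copy into the environment variable settings.
    print(cudnn_ld_library_path(os.environ.get("LD_LIBRARY_PATH", "")))
```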
Packages may not load in a session
When you install R or Python packages in a Session, the kernel might not be able to load the package in that same Session if a previous version of the package, or of its newly installed dependencies, has already been loaded there.
Workaround: Start a new session, then import and use the newly installed package there.
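For Python, a rough way to tell whether a session is affected is to compare the version of the module already loaded in the kernel with the version now installed on disk. The helper below is a heuristic sketch, not part of ML Runtimes: the name `stale_in_session` is hypothetical, it relies on the package exposing a `__version__` attribute (not all do), and it does not inspect dependencies.

```python
import sys
from importlib import metadata


def stale_in_session(module_name, dist_name=None):
    """Return True when `module_name` is already imported in this session and
    its loaded version differs from the version now installed on disk."""
    module = sys.modules.get(module_name)
    if module is None:
        return False  # not loaded yet, so a plain import picks up the new install
    loaded = getattr(module, "__version__", None)
    try:
        installed = metadata.version(dist_name or module_name)
    except metadata.PackageNotFoundError:
        return False  # nothing installed under that distribution name
    return loaded is not None and loaded != installed
```

If the helper returns True for a package you just installed, starting a new session as described in the workaround is the safe option.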
Earlier releases of Python Runtimes in CML fail to import the setuptools Python library and can fail installing some Python packages
Python workloads using ML Runtimes from Runtime releases up to the 2023.05 release fail to import the setuptools Python library when version 60.0.0 or higher of this library is present on the Runtime image or is installed into a CDSW project. This can lead to failure in installing some Python packages into a CDSW project.
Workaround: Use ML Runtimes from the 2023.08 or later releases. When using earlier Runtime versions, set the environment variable SETUPTOOLS_USE_DISTUTILS=stdlib either on a project level under Project Settings -> Advanced or on a workspace level under Site Administration -> Runtime -> Environment variables.
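The condition described above can be sketched as a small check. The function name `needs_distutils_workaround` is hypothetical, and the check only makes sense on Runtimes up to the 2023.05 release; it simply encodes "setuptools 60.0.0 or higher and the variable not set to stdlib".

```python
import os


def needs_distutils_workaround(setuptools_version, env=None):
    """True when setuptools is 60.0.0 or higher and SETUPTOOLS_USE_DISTUTILS
    is not set to 'stdlib' in the given environment mapping."""
    env = os.environ if env is None else env
    major = int(setuptools_version.split(".")[0])
    return major >= 60 and env.get("SETUPTOOLS_USE_DISTUTILS") != "stdlib"
```

In a session, the installed version can be read with `importlib.metadata.version("setuptools")` and passed to the function.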
DSE-2804 PBJ CUDA Runtimes have wrong metadata version
PBJ NVIDIA GPU Runtimes released in version 2023.05.1 had the wrong runtime metadata version. Releases of CML Public Cloud from July 2023 onward might not be compatible with these Runtimes.
Workaround: Follow the instructions in TSB-678.
ML Runtimes installation might pull wrong metadata
- Open Runtime Catalog.
- Check if newer runtime variants have been downloaded, e.g. ‘JupyterLab - Conda - Tech Preview’.
- Open the version list for this Runtime.
- Check whether the Details page of any version shows ‘container.repository.cloudera.com...’ under the ‘Runtime Image’ section.
- If it does, complete the Workaround steps.
Workaround: This workaround applies to CML Public Cloud 2022.06.1 (CDSW-2.0.32) or CDSW 1.10.1 or higher versions.
An Administrator should perform the following steps:
- Go to the Runtime Catalog page.
- On the right-hand side, open the version list of each Runtime variant.
- Open the Details page for the 2023.05.1-b4 versions.
- Use the three-dot menu and use the Set to Disabled function for those versions which have container.repository.cloudera.com... as Runtime Image.
- Recheck within 24 hours that the runtimes with docker.repository.cloudera.com have been added.
A regular user should perform the following steps:
- If the project does not have ML Runtimes 2023.05 after the Administrator has taken the previous actions, no action is required.
- If the project does have ML Runtime 2023.05, it should display as ‘Disabled’ after the Administrator disables the wrong one.
- When the runtimes with the right metadata have been added automatically to the installation, use the Add Latest button to add the proper runtime(s).
- Remove the disabled runtimes from the project.
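The check that decides which versions to disable comes down to the repository prefix of the Runtime Image string. A minimal sketch, assuming you have copied the Runtime Image values from the Details pages by hand; the function name and the sample image strings in the usage note are made up for illustration and do not correspond to real images.

```python
WRONG_REPOSITORY = "container.repository.cloudera.com"
RIGHT_REPOSITORY = "docker.repository.cloudera.com"


def should_disable(runtime_image):
    """True when the Runtime Image points at the repository named in this issue."""
    return runtime_image.strip().startswith(WRONG_REPOSITORY)
```

For example, `should_disable("container.repository.cloudera.com/example/image:tag")` is True, while the same hypothetical image under `docker.repository.cloudera.com` is left alone.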
DSE-9818 JupyterLab Conda Tech Preview Runtime
- Sessions
- When starting a Notebook or a Console for a specific environment, the installed packages will be available and the interpreter used to evaluate the contents of the Notebook or Console will be the one installed in the environment. However, the Conda environment is not "activated" in these sessions, so commands like !which python return the base Python 3.10 interpreter on the Runtime. The recommended ways to modify a Conda environment or install packages are the following:
  - conda commands must be used with the -n or --name argument to specify the environment, for example: conda install -n myenv pandas
  - When installing packages with pip, use the %pip magic to install packages in the active kernel’s environment, for example: %pip install pandas
- Applications and Jobs
- To start an Application or Job, first create a launcher Python script containing the following line:
!source activate <conda_env_name> && python <job / application script.py>
- Models
- Models are currently not supported for the Conda Runtime.
- Spark
- Spark is not supported in JupyterLab Notebooks and Consoles.
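The Session and launcher recommendations above can be sketched as two small string builders. Both function names are hypothetical, and the `--yes` flag is an addition not mentioned above; it is included only so that conda does not wait for confirmation when run from a notebook cell.

```python
import shlex


def conda_install_command(env_name, *packages):
    """Build a conda install command that names the target environment
    explicitly, because the environment is not activated in the session."""
    # --yes is an assumption added here to skip the interactive prompt.
    return shlex.join(["conda", "install", "-n", env_name, "--yes", *packages])


def launcher_line(env_name, script):
    """Build the single line a launcher script contains for an Application or Job."""
    return f"!source activate {shlex.quote(env_name)} && python {shlex.quote(script)}"
```

In a notebook cell, the output of `conda_install_command("myenv", "pandas")` would be run with a leading `!`.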
Adding new ML Runtimes when using a custom root certificate might generate error messages
When trying to add new ML Runtimes, a number of error messages might appear in various places when using a custom root certificate. For example, you might see: "Could not fetch the image metadata" or "certificate signed by unknown authority". This is caused by the runtime-puller pods not having access to the custom root certificate that is in use.
Workaround:
- Create a directory at any location on the master node:
For example:
mkdir -p /certs/
- Copy the full server certificate chain into this folder. It is usually easier to create a single file with all of your certificates (server, intermediate(s), root):
# copy all certificates into a single file:
cat server-cert.pem intermediate.pem root.pem > /certs/cert-chain.crt
- (Optional) If you are using a custom docker registry that has its own certificate, you need to copy this certificate chain into this same file:
cat docker-registry-cert.pem >> /certs/cert-chain.crt
- Append the global CA certificates to this new file:
cat /etc/ssl/certs/ca-bundle.crt >> /certs/cert-chain.crt
- Edit your deployment of runtime manager and add the new mount.
Do not delete any existing objects.
kubectl edit deployment runtime-manager
- Under VolumeMounts, add the following lines.
Note that the text is white-space sensitive - use spaces and not tabs.
    - mountPath: /etc/ssl/certs/ca-certificates.crt
      name: mycert
      subPath: cert-chain.crt # this should match the new file name created in step 4
Under Volumes add the following text in the same edit:
    - hostPath:
        path: /certs/ # this needs to match the folder created in step 1
        type: ""
      name: mycert
- Save your changes:
wq!
Once saved, you will receive the message "deployment.apps/runtime-manager edited" and the pod will be restarted with your new changes.
- To persist these changes across cluster restarts, use the following Knowledge Base article to create a Kubernetes patch file for the runtime-manager deployment: https://community.cloudera.com/t5/Customer/Patching-CDSW-Kubernetes-deployments/ta-p/90241
Cloudera Bug: DSE-20530
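The concatenation steps of this workaround can also be scripted, which makes it easy to confirm how many certificates ended up in the chain file. This is a sketch under assumptions: the function name `build_chain` is hypothetical, the file names are the example names used in the steps, and the count is a plain text match on the PEM header rather than a certificate validation.

```python
from pathlib import Path

PEM_HEADER = "-----BEGIN CERTIFICATE-----"


def build_chain(sources, target):
    """Concatenate PEM files in the given order into `target` and return
    the number of certificates found in the result."""
    parts = []
    for source in sources:
        text = Path(source).read_text()
        # Make sure each file ends with a newline so PEM blocks do not run together.
        parts.append(text if text.endswith("\n") else text + "\n")
    chain = "".join(parts)
    Path(target).write_text(chain)
    return chain.count(PEM_HEADER)
```

For example, `build_chain(["server-cert.pem", "intermediate.pem", "root.pem", "/etc/ssl/certs/ca-bundle.crt"], "/certs/cert-chain.crt")` would write the same file as the cat commands in the steps.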
Spark Runtime Add-on required for Spark 2 integration with Scala Runtimes
Scala Runtimes on CML require the Spark Runtime Add-on to enable Spark 2 integration. Spark 3 is not supported with the Scala Runtime.
DSE-17981 - Disable Scala runtimes in models, experiments and applications runtime selection
Scala Runtimes should not appear as an option for Models, Experiments, and Applications in the user interface. Currently, Scala Runtimes only support Sessions and Jobs.