Known Issues and Limitations

There are some known issues you might run into while using Cloudera Machine Learning.

CML workspace installation fails

CML workspace installation with Azure NetApp Files on NFS v4.1 fails. The workaround is to use NFS v3.

No module named cmlapi

Problem: When trying to import cmlapi inside a session, this error occurs:

ImportError: No module named cmlapi

This error indicates the cmlapi module was not copied to an internal pod.

Solution: To resolve this problem, perform the following steps.
  1. Identify the pod name:
    kubectl get pods -n mlx
  2. Exec into the ds-vfs-<podname> pod:
    kubectl exec -it ds-vfs-<podname> -n mlx -- bash
  3. Remove the following directory:
    rm -rf /projects/addons/cmladdon-python<version>
  4. Delete the api pod and wait for it to start again.
    kubectl delete pod api-<podname> -n mlx
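The four steps above can be collected into a small shell helper. This is only a sketch: the pod-name suffixes and the Python add-on version differ per cluster, so they are taken as arguments rather than hard-coded.

```shell
#!/usr/bin/env bash
# Sketch of the cmlapi repair steps. The pod-name suffixes and the
# add-on Python version vary per cluster, so they are parameters here.
fix_cmlapi() {
  local vfs_suffix="$1" api_suffix="$2" addon_version="$3"

  # Step 1: confirm the pod names first.
  kubectl get pods -n mlx

  # Steps 2-3: remove the stale add-on directory inside the ds-vfs pod.
  kubectl exec -it "ds-vfs-${vfs_suffix}" -n mlx -- \
    rm -rf "/projects/addons/cmladdon-python${addon_version}"

  # Step 4: delete the api pod; its controller recreates it automatically.
  kubectl delete pod "api-${api_suffix}" -n mlx
}
# Example (placeholder suffixes): fix_cmlapi abc123 def456 3.8
```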

Spark executors fail due to insufficient disk space

Generally, the administrator should estimate the shuffle data set size before provisioning the workspace, and then specify a root volume size for the compute nodes that is appropriate for that estimate.

Hive SSO not supported (DWX-8031)

Hive SSO is not supported for ML Discovery & Exploration.

Spark 2.4.7 incompatible runtime (DSE-19073)

When starting a session where you want to use Spark 2.4.7, you must use the correct runtime version. The Spark 2.4.7 version with CDE 1.13 does not work. Use the version with CDE 1.11 instead.

The username field length maximum is 50 bytes (DSE-18016)

In CML, the username field is limited to 50 bytes in length, which is less than the corresponding field length in Microsoft Active Directory. This can cause errors when onboarding a user who has a long name.

Data access on RAZ-enabled data lake fails (DSE-18290)

If you attempt to access storage in a data lake of version 7.2.11 or higher, and RAZ is enabled:
  • Legacy engines will fail.
  • ML runtimes with CDP version 7.2.10 and below will fail.
  • ML runtimes with CDP version 7.2.11 and above will work.

Failed to install workspace (CDPSDX-3207)

Installation may fail with the error: Installation Failed. Installation timed out. This is an intermittent error.

Solution: Try installing the workspace again.

Grafana (DSE-18499)

Users may not see any pre-populated Grafana dashboards (Cluster, Containers, Node, Models) inside the ML workspace.

Data lineage is not reported to Atlas (DSE-16706)

Registering training data lineage using a linking file is not working.

Runtime Addon fails to load (DSE-16200)

A Spark runtime add-on may fail when upgrading a workspace.

Solution: To resolve this problem, try to reload the add-on. In Site Administration > Runtime/Engine, in the option menu next to the failed add-on, select Reload.

CML workspace provisioning times out

When provisioning a CML workspace, the process may time out with an error similar to Warning FailedMount or Failed to sync secret cache: timed out waiting for the condition. This can happen on AWS or Azure.

Solution: Delete the workspace and retry provisioning.

CML endpoint connectivity from DataHub and Cloudera Data Engineering (DSE-14882)

When CDP services connect to CML services, if the ML workspace is provisioned on a public subnet, traffic is routed out of the VPC first, and then routed back in. On Private Cloud CML, traffic is not routed externally.

Chrome browser warning when accessing ML workspace (DSE-14652)

Some browsers (Chrome 86 and higher) may display the following message when a user attempts to access a workspace that was configured without TLS:
The information you're about to submit is not secure.

Workaround: Accept and bypass the browser warning.

Explanation: Chrome 86 and higher displays warnings when forms submit or redirect to http://, which is the case when connecting to a workspace configured without TLS using SSO. The workspace is still functional in all respects if you accept and bypass the browser warning. It is not possible to enable TLS on a workspace that was created without TLS.

Ranger and RAZ enabled environments (OPSAPS-59476)

When using Ranger and RAZ enabled environments in public cloud CML, run the following commands on the session terminal or inline in the user code before doing any other operations:
sed -i "s/http:/https:/g" /etc/hadoop/conf/core-site.xml
sed -i "s/http:/https:/g"

Orphan EBS block volumes after deletion of ML workspace (DSE-14606)

If a CML workspace on AWS is deleted using the February 3, 2021 release (1.15.0-b72) of the CML Control Plane, orphan EBS block volumes may be left behind. The state of any orphan volumes appears as Available in the EC2 console (you can see the list of volumes by navigating to Service > EC2 > Volumes). The volumes have names similar to kubernetes-dynamic-pvc-<unique ID>, and are tagged "mlx". These orphan EBS volumes must be deleted to prevent cloud resource leaks.
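One way to locate the orphans is with the AWS CLI. The sketch below assumes the volumes still carry Name tags beginning with kubernetes-dynamic-pvc-; it only lists candidates, leaving deletion as an explicit, per-volume second step.

```shell
# List unattached EBS volumes whose Name tag matches the orphan pattern.
# Review the output carefully before deleting anything.
list_orphan_volumes() {
  aws ec2 describe-volumes \
    --filters Name=status,Values=available \
              "Name=tag:Name,Values=kubernetes-dynamic-pvc-*" \
    --query 'Volumes[].VolumeId' --output text
}

# Delete a single volume only after confirming it is an orphan.
delete_orphan_volume() {
  aws ec2 delete-volume --volume-id "$1"
}
```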

Upgrade not supported on NFS v4.x (DSE-14519, DSE-14077)

Upgrading ML workspaces on Azure configured with external NFS services using the NFS v4.x protocol is currently not supported.

Applications may not start after Kubernetes upgrade (DSE-14355)

In some cases, applications that were running before a Kubernetes upgrade may fail to start after Kubernetes is upgraded. Users with access to such applications must restart them manually.

Transparent proxy supported only on AWS (DSE-13937)

Cloudera Machine Learning, when used on AWS public cloud, supports transparent proxies. Transparent proxy enables CML to proxy web requests without requiring any particular browser setup. In normal operation, CML requires the ability to reach several external domains. For more information, see: Outbound network access destinations.

Cannot restrict application access (DSE-13928, DSE-6651)

The authorization used by Applications might not be up to date. For example, if a user is removed from a project in CDSW or CML (and no longer has read access to the project and its applications), that user might continue to have access to an application if they accessed it before their access was revoked.

Workaround: When updating permissions of a project that has applications, restart applications to ensure that applications use up-to-date authorization.

Jupyter Notebook sessions do not time out (DSE-13741)

Jupyter Notebook sessions in legacy engine:8 through engine:13 do not exit after IDLE_MAXIMUM_MINUTES of inactivity. They will run until SESSION_MAXIMUM_MINUTES (which is seven days by default).


You can change the configuration of your cluster to apply the fix for this issue. Change the editor command for Jupyter Notebook in every engine that uses it to the following:

        /usr/local/bin/jupyter notebook --no-browser --ip= --port=${CDSW_APP_PORT} \
          --NotebookApp.token= --NotebookApp.allow_remote_access=True --NotebookApp.quit_button=False \
          --log-level=ERROR --NotebookApp.shutdown_no_activity_timeout=300 \
          --MappingKernelManager.cull_interval=60 --TerminalManager.cull_interval=60
This does the following:
  • Kills each running notebook after IDLE_MAXIMUM_MINUTES of inactivity
  • Kills the CDSW/CML session in which Jupyter is running after 5 minutes with no notebooks

Play button missing in CML sessions with ML Runtimes (DSE-13629)

For ML Runtimes sessions, the Play button might not display.


You can still run the session code by selecting Run > Run All or Run > Run Lines when the Play button is not shown in the UI.

Scheduled Job is not running after switching over to Runtimes, Application can't be restarted (DSE-13573)

ML Runtimes is a new feature in the current release. Although you can now change your existing projects from Engine to ML Runtimes, we do not currently recommend migrating existing projects.

Applications and Jobs created with Engines might be affected once their project is changed to use ML Runtimes:
  • You will be forced to change to ML Runtimes if you try to update the related Editor/Kernel settings of Jobs, Models, Experiments, or Applications.
  • Applications cannot be restarted from the UI in a migrated project unless ML Runtime settings are updated for that application.

NFS performance issues on AWS EFS (DSE-12404)

CML uses NFS as the filesystem for storing application and user data. NFS performance may be much slower than expected in situations where a data scientist writes a very large number (typically in the thousands) of small files. Example tasks include: using git clone to clone a very large source repository (such as TensorFlow), or using pip to install a Python package that includes JavaScript code (such as plotly). Reduced performance is particularly common with CML on AWS (which uses EFS), but it may be seen in other environments.
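A mitigation that may help in these situations, sketched below, is to do the many-small-file operation on pod-local storage first and then copy the result to the NFS-backed project directory in a single streaming pass. The paths are assumptions: /tmp is taken to be pod-local scratch space and /home/cdsw the NFS-mounted project root.

```shell
# Clone a large repository on local disk first, then move it into the
# NFS-backed home directory with one streaming tar copy.
# Assumes /tmp is pod-local storage and /home/cdsw is the NFS mount.
clone_via_local_disk() {
  local repo_url="$1" name="$2"
  git clone --depth 1 "${repo_url}" "/tmp/${name}"
  tar -C /tmp -cf - "${name}" | tar -C /home/cdsw -xf -
  rm -rf "/tmp/${name:?}"
}
# Example: clone_via_local_disk https://github.com/tensorflow/tensorflow tensorflow
```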

Disable file upload and download (DSE-12065)

You cannot disable file upload and download when using the Jupyter Notebook.

Remove Workspace operation fails (DSE-8834)

Remove Workspace operation fails if workspace creation is still in progress.

CML does not support modifying CPU/GPU scaling limits on provisioned ML workspaces (DSE-8407)

When provisioning a workspace, CML currently supports a maximum of 30 nodes of each type: CPUs and GPUs. Currently, CML does not provide a way to increase this limit for existing workspaces. As a workaround on AWS, you can raise the maximum capacity of the workspace's auto-scaling group directly in the EC2 console:

  1. Log in to the CDP web interface using your corporate credentials or any other credentials that you received from your CDP administrator.
  2. Click ML Workspaces.
  3. Select the workspace whose limits you want to modify and go to its Details page.
  4. Copy the Liftie Cluster ID of the workspace. It should be of the format, liftie-abcdefgh.
  5. Login to the AWS EC2 console, and click Auto Scaling Groups.
  6. Paste the Liftie Cluster ID in the search filter box and press enter.
  7. Click on the auto-scaling group that has a name like: liftie-abcdefgh-ml-pqrstuv-xyz-cpu-workers-0-NodeGroup. Especially note the 'cpu-workers' in the middle of the string.
  8. On the Details page of this auto-scaling group, click Edit.
  9. Set Max capacity to the desired value and click Save.

Note that CML does not support lowering the maximum instances of an auto scaling group due to certain limitations in AWS.
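The console steps above can also be scripted with the AWS CLI. This is a sketch: the auto-scaling group name is a placeholder that you would copy from the console search described in step 7.

```shell
# Raise the maximum size of the workspace's CPU worker auto-scaling
# group. The group name is a placeholder; copy the real name
# (containing 'cpu-workers') from the EC2 console as described above.
raise_asg_max() {
  local asg_name="$1" new_max="$2"
  aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name "${asg_name}" \
    --max-size "${new_max}"
}
# Example: raise_asg_max liftie-abcdefgh-ml-pqrstuv-xyz-cpu-workers-0-NodeGroup 40
```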

SSO does not work if the first user to access an ML workspace is not a Site Admin

Problem: If the first user to access an ML workspace is assigned the MLUser role rather than MLAdmin, the web application will display an error.

Workaround: Any user assigned the MLAdmin role must always be the first user to access an ML workspace.

API does not enforce a maximum number of nodes for ML workspaces

Problem: When the API is used to provision new ML workspaces, it does not enforce an upper limit on the autoscale range.

Downscaling ML workspace nodes does not work as expected (MLX-637, MLX-638)

Problem: Downscaling nodes does not work as seamlessly as expected due to a lack of bin packing on the Spark default scheduler, and because dynamic allocation is not currently enabled. As a result, infrastructure pods, Spark driver/executor pods, and session pods are currently tagged as non-evictable using the "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation.