Known Issues and Limitations

Known Issues

This topic lists some of the known issues you might run into while using Cloudera Machine Learning.

DSE-8834: Remove Workspace operation fails

The Remove Workspace operation fails if workspace creation is still in progress.

DSE-8407: CML does not support modifying CPU/GPU scaling limits on provisioned ML workspaces

When provisioning a workspace, CML currently supports a maximum of 30 nodes of each type (CPU and GPU). CML does not provide a way to increase this limit for existing workspaces.

Workaround:
  1. Log in to the CDP web interface at https://console.us-west-1.cdp.cloudera.com using your corporate credentials or any other credentials that you received from your CDP administrator.
  2. Click ML Workspaces.
  3. Select the workspace whose limits you want to modify and go to its Details page.
  4. Copy the Liftie Cluster ID of the workspace. It should be of the form liftie-abcdefgh.
  5. Log in to the AWS EC2 console, and click Auto Scaling Groups.
  6. Paste the Liftie Cluster ID into the search filter box and press Enter.
  7. Click the Auto Scaling group with a name like liftie-abcdefgh-ml-pqrstuv-xyz-cpu-workers-0-NodeGroup. In particular, note the 'cpu-workers' segment in the middle of the name.
  8. On the Details page of this Auto Scaling group, click Edit.
  9. Set Max capacity to the desired value and click Save.

Note that CML does not support lowering the maximum number of instances in an Auto Scaling group due to certain limitations in AWS.
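If you prefer to script steps 5 through 9, a sketch along the following lines (using boto3, and assuming AWS credentials with permission to modify Auto Scaling groups) makes the same change. The Liftie Cluster ID and new maximum shown here are placeholder values, and this is not an officially supported CML workflow.

    import boto3

    # Placeholder values: substitute the Liftie Cluster ID copied from the
    # workspace Details page and the new maximum you want to allow.
    LIFTIE_CLUSTER_ID = "liftie-abcdefgh"
    NEW_MAX_CAPACITY = 40

    autoscaling = boto3.client("autoscaling")

    # Find the CPU worker Auto Scaling groups that belong to this workspace.
    paginator = autoscaling.get_paginator("describe_auto_scaling_groups")
    cpu_worker_groups = [
        group["AutoScalingGroupName"]
        for page in paginator.paginate()
        for group in page["AutoScalingGroups"]
        if LIFTIE_CLUSTER_ID in group["AutoScalingGroupName"]
        and "cpu-workers" in group["AutoScalingGroupName"]
    ]

    for name in cpu_worker_groups:
        # Equivalent to editing Max capacity in the EC2 console (steps 8 and 9).
        autoscaling.update_auto_scaling_group(
            AutoScalingGroupName=name,
            MaxSize=NEW_MAX_CAPACITY,
        )
        print(f"Updated {name}: MaxSize={NEW_MAX_CAPACITY}")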

SSO does not work if the first user to access an ML workspace is not a Site Admin

Problem: If the first user to access an ML workspace is assigned only the MLUser role, the web application displays an error.

Workaround: Any user assigned the MLAdmin role must always be the first user to access an ML workspace.

API does not enforce a maximum number of nodes for ML workspaces

Problem: When the API is used to provision new ML workspaces, it does not enforce an upper limit on the autoscale range.
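Until the API enforces this limit, callers can validate the requested autoscale range themselves before provisioning. The following is a minimal, hypothetical client-side guard; the 30-node cap mirrors the workspace provisioning limit described above, and the function name and parameters are illustrative only, not part of the CML API.

    # Hypothetical guard: CML_MAX_WORKER_NODES mirrors the documented
    # 30-node provisioning limit; none of this is part of the CML API itself.
    CML_MAX_WORKER_NODES = 30

    def validate_autoscale_range(min_instances: int, max_instances: int) -> None:
        """Raise ValueError if the requested autoscale range is unsupported."""
        if min_instances < 0 or max_instances < min_instances:
            raise ValueError("Autoscale range must satisfy 0 <= min <= max")
        if max_instances > CML_MAX_WORKER_NODES:
            raise ValueError(
                f"Requested maximum of {max_instances} nodes exceeds the "
                f"supported limit of {CML_MAX_WORKER_NODES} per workspace"
            )

    # Example: run the check before submitting the provisioning request.
    validate_autoscale_range(min_instances=1, max_instances=25)   # passes
    # validate_autoscale_range(min_instances=1, max_instances=64) # raises ValueError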

MLX-637, MLX-638: Downscaling ML workspace nodes does not work as expected

Problem: Downscaling nodes does not work as seamlessly as expected due to a lack of bin packing on the default Spark scheduler, and because dynamic allocation is not currently enabled. As a result, infrastructure pods, Spark driver/executor pods, and session pods are currently tagged as non-evictable using the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation.
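To see which pods in a workspace namespace carry this annotation, a sketch along the following lines can help. It assumes the kubernetes Python client and a kubeconfig with access to the workspace's EKS cluster; the namespace name is a placeholder.

    from kubernetes import client, config

    # Assumes a kubeconfig with access to the workspace cluster;
    # "mlx" is a placeholder namespace, not necessarily the one CML uses.
    config.load_kube_config()
    v1 = client.CoreV1Api()

    ANNOTATION = "cluster-autoscaler.kubernetes.io/safe-to-evict"

    for pod in v1.list_namespaced_pod("mlx").items:
        annotations = pod.metadata.annotations or {}
        if annotations.get(ANNOTATION) == "false":
            # These pods prevent the cluster autoscaler from draining their node.
            print(pod.metadata.name)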

MLX-698: ML Workspaces do not load on Firefox

Problem: CML uses Let's Encrypt for TLS certificates, which enable OCSP stapling by default. However, by default, Firefox refuses to load HTTPS sites that use OCSP-enabled certificates, which includes ML workspaces.

Workaround: You can override this behavior in Firefox by setting the security.ssl.enable_ocsp_must_staple preference to false. Alternatively, you can use Google Chrome, which does not exhibit this issue.

Limitations (CML on AWS)

This section lists some resource limits that CML and AWS impose on workloads running in ML workspaces.

  • Certificate creation (for TLS) uses Let's Encrypt, which is limited to 2000 certificates per week. As such, a single tenant in CDP can create a maximum of 2000 ML workspaces per week.

  • CML imposes a limit of 50 on the number of pods a user can create at any given time within a specific workspace. This limit is not configurable.

  • CML allows you to provision a maximum of 30 compute nodes per ML workspace. This does not include any additional infrastructure nodes CML might need to provision to run the service.

  • Amazon EKS imposes a limit on the number of pods you can run simultaneously on a node. This limit varies depending on your instance type. For details, see ENI Max Pods.
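The per-node pod ceiling on EKS follows a standard formula based on the instance's ENI capacity: pods = ENIs * (IPv4 addresses per ENI - 1) + 2. As a rough illustration (the instance figures below are examples; consult ENI Max Pods for authoritative values):

    def eks_max_pods(num_enis: int, ipv4_per_eni: int) -> int:
        """Standard EKS ceiling: one IP per ENI is reserved, plus 2 host-network pods."""
        return num_enis * (ipv4_per_eni - 1) + 2

    # Example: an m5.xlarge supports 4 ENIs with 15 IPv4 addresses each,
    # so it can run at most 4 * (15 - 1) + 2 = 58 pods.
    print(eks_max_pods(num_enis=4, ipv4_per_eni=15))  # 58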