Known Issues and Limitations

Known Issues

This topic lists some of the known issues you might run into while using Cloudera Machine Learning.

DSE-8285: Problems with the quota feature

If you reach the memory quota, the error message incorrectly reads CPU Limit reached instead of Memory Limit reached.

If you specify floating-point values for CPU, memory, or GPU units in engine profiles, the profiles do not work and erroneously trigger Out of Quota error messages.
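Until this is fixed, it may help to check engine profile values before entering them. The following is a minimal sketch, not part of CML: the function name and the profile field names (cpu, memory_gb, gpu) are hypothetical, and it simply flags any value that is not a plain integer.

```python
def non_integer_fields(profile):
    """Return the names of profile fields whose values are not plain integers.

    The field names are hypothetical; CML does not define this structure.
    Any float (even 1.0) is flagged, since the issue concerns floating-point input.
    """
    flagged = []
    for name, value in profile.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            flagged.append(name)
    return flagged


# Hypothetical engine profile with one fractional value.
profile = {"cpu": 2, "memory_gb": 4.5, "gpu": 0}
print(non_integer_fields(profile))
```

An empty list means every value is a whole number and should not trip the quota check described above.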

DSE-8070: We should not allow minimum nodes to be set to 0 for CPU node group during CML Workspace creation

Problem: When you are creating a workspace, setting both the minimum and maximum autoscale range to zero for the CPU or the GPU will cause your workspace creation to fail.

Workaround: Choose different values for the minimum and maximum autoscale range for the CPU and the GPU. For example, choose zero for the minimum and five for the maximum autoscale range.
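The rule above can be sketched as a small pre-flight check. This is an illustration only, assuming you collect the autoscale values yourself before creating the workspace; the function name is hypothetical and nothing here calls a CML API.

```python
def check_autoscale_range(minimum, maximum):
    """Return a list of problems with an autoscale range (empty if none found)."""
    problems = []
    if minimum < 0 or maximum < 0:
        problems.append("values must not be negative")
    if minimum > maximum:
        problems.append("minimum must not exceed maximum")
    if minimum == 0 and maximum == 0:
        # This is the combination that makes workspace creation fail.
        problems.append("minimum and maximum must not both be zero")
    return problems


print(check_autoscale_range(0, 0))  # the failing configuration
print(check_autoscale_range(0, 5))  # the example from the workaround
```

Run the check once for the CPU range and once for the GPU range.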

DSE-8136: Issue with GPU set up in CML

Problem: If a CML workspace is configured with the default GPU autoscaling range of 0 - 1 instances, no GPU node will be configured in the cluster. This means that the CML admin user will not be able to edit the Maximum GPUs per Session/Job setting to enable creation of sessions/jobs that require GPUs.

Workaround: To work around this issue, you must manually add a GPU node to the cluster using the following steps:
  1. Navigate to the ML workspaces list page on the CDP control plane UI.
  2. Click the desired workspace to display the workspace Details page.
  3. Copy the Cluster Name, which looks something like liftie-t5zpbfr9.
  4. Log in to your AWS/EC2 console as an Admin.
  5. Click the Autoscaling Groups link on the lower left navigation pane.
  6. Paste the Cluster Name from step 3 into the Filter field and press Enter.
  7. Out of the three autoscaling groups, click on the group that contains gpu in the name.
  8. Select the Details tab and click Edit.
  9. Enter 1 in the Desired Capacity field and click Save.
  10. Click the Instances tab.

    Wait and refresh until a new instance displays and is “Healthy”.

  11. From the CDP control plane, open the workspace UI and log in as Admin user.
  12. Click Admin in the lower left navigation pane.
  13. Click Engines.
  14. Locate Maximum GPUs per Session/Job and select the desired number.

    The setting is automatically saved.

    Users should now be able to launch workbenches that request GPUs.
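Steps 4 through 9 can also be scripted. The sketch below is an assumption-laden illustration rather than a supported procedure: it expects a boto3 Auto Scaling client to be passed in, and it assumes the group names contain both the cluster name and the string gpu, as the console steps imply. The helper names find_gpu_group and add_gpu_node are hypothetical.

```python
def find_gpu_group(client, cluster_name):
    """Return the name of the single autoscaling group for this cluster with 'gpu' in its name."""
    paginator = client.get_paginator("describe_auto_scaling_groups")
    matches = []
    for page in paginator.paginate():
        for group in page["AutoScalingGroups"]:
            name = group["AutoScalingGroupName"]
            if cluster_name in name and "gpu" in name.lower():
                matches.append(name)
    if len(matches) != 1:
        # Refuse to guess if zero or several groups match.
        raise LookupError(f"expected one GPU group, found {len(matches)}: {matches}")
    return matches[0]


def add_gpu_node(client, cluster_name, desired=1):
    """Set the desired capacity of the cluster's GPU autoscaling group."""
    name = find_gpu_group(client, cluster_name)
    client.set_desired_capacity(AutoScalingGroupName=name, DesiredCapacity=desired)
    return name


# Usage (requires boto3 and AWS admin credentials; not executed here):
#   import boto3
#   client = boto3.client("autoscaling")
#   add_gpu_node(client, "liftie-t5zpbfr9")
```

After the call returns, continue with step 10: wait for the new instance to show as healthy before changing the GPU setting in the workspace UI.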

DSE-7521: cdswctl login does not work with SAML SSO enabled

Problem: The cdswctl login command creates a $HOME/.cdsw/config.yaml file, which contains the metadata required for communicating with a CML workspace from a user's local machine. However, this does not work with workspaces configured to use SAML SSO (which is the recommended method of authentication).

Fix: Users must now manually create the $HOME/.cdsw/config.yaml file with the required metadata. The updated steps are described in Initialize an SSH Endpoint.

MLX-923: ML Workspace installation may fail intermittently while provisioning TLS certificates

Problem: This failure might occur if the workspace installer is unable to get the TLS certificate and key in time.

Workaround: Force delete the workspace that failed to provision and try again.

SSO does not work if the first user to access an ML workspace is not a Site Admin

Problem: If a user assigned the MLUser role is the first user, the web application will display an error.

Workaround: Make sure that the first user to access an ML workspace is a user assigned the MLAdmin role.

CDPCP-534: SSO does not work if not logged in to the CDP web interface

Problem: You cannot log in to an ML workspace with SSO credentials if you are not already logged in to the CDP web interface.

Workaround: To access your ML workspaces, first log in to CDP. Then navigate to the list of ML Workspaces and click the workspace you want to access.

API does not enforce a maximum number of nodes for ML workspaces

Problem: When the API is used to provision new ML workspaces, it does not enforce an upper limit on the autoscale range.

MLX-637, MLX-638: Downscaling ML workspace nodes does not work as expected

Problem: Downscaling nodes does not work as seamlessly as expected because the default Spark scheduler lacks bin packing and dynamic allocation is not currently enabled. As a result, infrastructure pods, Spark driver/executor pods, and session pods are currently tagged as non-evictable using the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation.
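To see which pods carry this annotation, you can inspect pod metadata. The sketch below assumes pod objects are available as plain dictionaries (for example, parsed from kubectl JSON output); the function name and the sample pods are hypothetical.

```python
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def blocks_scale_down(pod):
    """Return True if the pod is annotated as not safe to evict."""
    annotations = pod.get("metadata", {}).get("annotations") or {}
    # Annotation values are strings, so compare against "false", not False.
    return annotations.get(SAFE_TO_EVICT) == "false"


# Hypothetical pod objects for illustration.
session_pod = {"metadata": {"name": "session-1", "annotations": {SAFE_TO_EVICT: "false"}}}
plain_pod = {"metadata": {"name": "other", "annotations": None}}

print(blocks_scale_down(session_pod))
print(blocks_scale_down(plain_pod))
```

A node that hosts at least one pod for which this returns True is not removed by the cluster autoscaler, which is why downscaling can lag behind actual usage.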

MLX-698: ML Workspaces do not load on Firefox

Problem: CML uses LetsEncrypt TLS certificates, which enable OCSP stapling by default. However, by default, Firefox refuses to load HTTPS sites that use OCSP-enabled certificates, including ML workspaces.

Workaround: Override this behavior in Firefox by setting security.ssl.enable_ocsp_must_staple to false. Alternatively, use Google Chrome, which is not affected by this issue.

Limitations (CML on AWS)

This section lists some resource limits that CML and AWS impose on workloads running in ML workspaces.

  • Certificate creation (for TLS) uses LetsEncrypt, which is limited to 2000 certificates per week. As such, a single tenant in CDP can create a maximum of 2000 ML workspaces per week.

  • CML imposes a limit of 50 on the number of pods a user can create at any point within a specific workspace. This limit is not configurable.

  • CML allows you to provision a maximum of 30 compute nodes per ML workspace. This does not include any additional infrastructure nodes CML might need to provision to run the service.

  • Amazon EKS imposes a limit on the number of pods you can run simultaneously on a node. This limit varies depending on your instance type. For details, see ENI Max Pods.
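The limits above can be combined into a rough capacity check. This is a sketch under stated assumptions: the per-node pod limit uses the formula from AWS's ENI Max Pods list for the default VPC CNI (ENIs × (IPv4 addresses per ENI − 1) + 2), the ENI figures must be looked up for your instance type, and the function and constant names are hypothetical.

```python
MAX_PODS_PER_USER = 50   # CML limit per user within a workspace
MAX_COMPUTE_NODES = 30   # CML limit per ML workspace


def eni_max_pods(enis, ipv4_per_eni):
    """Per-node pod limit for the default AWS VPC CNI (assumed formula)."""
    return enis * (ipv4_per_eni - 1) + 2


def check_plan(compute_nodes, pods_per_user):
    """Return a list of CML limits that a planned workload would exceed."""
    exceeded = []
    if compute_nodes > MAX_COMPUTE_NODES:
        exceeded.append("compute nodes")
    if pods_per_user > MAX_PODS_PER_USER:
        exceeded.append("pods per user")
    return exceeded


# m5.large has 3 ENIs with 10 IPv4 addresses each.
print(eni_max_pods(3, 10))
print(check_plan(compute_nodes=35, pods_per_user=20))
```

For an m5.large the formula gives 29 pods per node, which matches the published ENI Max Pods value for that instance type.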