Known Issues

You might run into some known issues while using Cloudera Machine Learning on Private Cloud.

Using dollar character in environment variables in Cloudera Machine Learning

Cloudera Machine Learning (CML) does not correctly parse environment variables that contain the dollar ($) character. For example, if you set PASSWORD="pass$123" in the project environment variables and then read it with the echo command, the output is: pass23

Workaround: Use one of the following commands to print the $ sign:
echo 24 | xxd -r -p
or
echo JAo= | base64 -d
Insert the $ character into the value of the environment variable by wrapping one of these commands in command substitution, using either $() or backticks (``). For example, if you want to set the environment variable to ABC$123, specify:
ABC$(echo 24 | xxd -r -p)123
or
ABC`echo 24 | xxd -r -p`123
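
The truncated value described above is consistent with shell-style expansion, in which $1 is read as an empty positional parameter. The following sketch reproduces that behavior in a plain POSIX shell and then applies the base64 variant of the workaround. It assumes the value is expanded the way a shell would expand it, which the description implies but does not state, and the variable names are illustrative only.

```shell
#!/bin/sh
# Illustration only: reproduces the truncation in an ordinary shell.
set --                               # clear positional parameters so $1 is empty

broken="pass$123"                    # "$1" expands to nothing, leaving "pass23"

# Decode a literal dollar sign; command substitution strips the trailing newline.
dollar=$(echo JAo= | base64 -d)
fixed="pass${dollar}123"

echo "$broken"                       # prints pass23
printf '%s\n' "$fixed"               # prints pass$123
```

Because the decoded character is inserted after expansion has already happened, it is not interpreted again as the start of a variable reference.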

DSE-36718: Disable auto synchronization feature for users and teams

The automated team and user synchronization feature is disabled. Newly installed or upgraded workspaces do not have the automatic synchronization option in the Cloudera Machine Learning UI.

Workaround: None.

DSE-36759: AMPs and Feature Announcement sections do not work in NTP setups

Cloudera Machine Learning (CML) Private Cloud setups that use a non-transparent proxy (NTP) do not function properly, which affects Accelerators for ML Projects (AMPs) and Feature Announcements. The home page freezes, the feature announcement displays an error message, and the AMPs do not load.

Workaround:

To avoid the home page freeze, copy the following environment variables from the web deployment and add them to the environment section of the API deployments:
  • HTTP_PROXY
  • HTTPS_PROXY
  • NO_PROXY
  • http_proxy
  • https_proxy
  • no_proxy
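
One possible way to copy these variables is with kubectl, sketched below. The deployment names and the namespace are placeholders, since the actual names are not given here. The helper functions are hypothetical, and the sketch assumes the variables are set as plain values on the web deployment rather than through a secret or config map.

```shell
#!/bin/sh
# Keep only the six proxy variables from KEY=value lines on stdin.
filter_proxy_vars() {
  grep -E '^(HTTP_PROXY|HTTPS_PROXY|NO_PROXY|http_proxy|https_proxy|no_proxy)='
}

# Copy the proxy variables from one deployment to another.
# Arguments (placeholders for your cluster): namespace, source, target.
copy_proxy_env() {
  ns=$1; src=$2; dst=$3
  kubectl set env "deployment/$src" -n "$ns" --list \
    | filter_proxy_vars \
    | kubectl set env "deployment/$dst" -n "$ns" -e -
}

# Example call (not executed here; all three names are placeholders):
#   copy_proxy_env my-workspace-namespace web api
```

Review the output of the --list step before applying it, because a deployment with several containers lists the variables of each container.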

DSE-32943: Enabling Service Accounts

Teams in the CML workspace can run workloads within team projects with the Run as option for service accounts only if the service accounts were previously added manually as collaborators to the team.

DSE-35013: First CML workspace creation fails

On RHEL 8.8, during the first CML workspace installation on GPU with ECS external registry, pods might get stuck in the init or CrashLoop state.

The first workspace installation is expected to fail. Treat it as a test workspace, and apply the following manual workaround to create subsequent workspaces:
  1. Restart or delete the pods that are in the init or CrashLoop state in the test workspace.
  2. Once all pods are in the running state, create new workspaces as needed.
  3. Delete the test workspace from the CML UI if no longer needed.
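
Step 1 can be sketched with kubectl as follows. This is an assumption-laden illustration: the namespace is a placeholder for the test workspace's namespace, the helper names are hypothetical, and the filter relies on the default column layout of kubectl get pods. The -r option of xargs, which skips the delete when nothing matches, is a GNU extension.

```shell
#!/bin/sh
# Print the names of pods whose STATUS column shows an init or CrashLoop state.
# Expects `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE).
stuck_pods() {
  awk '$3 ~ /^Init|CrashLoopBackOff/ { print $1 }'
}

# Delete the stuck pods so that their controllers recreate them.
# The namespace argument is a placeholder for the test workspace's namespace.
restart_stuck_pods() {
  ns=$1
  kubectl get pods -n "$ns" --no-headers \
    | stuck_pods \
    | xargs -r kubectl delete pod -n "$ns"
}

# Example call (not executed here):
#   restart_stuck_pods my-test-workspace-namespace
```

Running kubectl get pods again afterward shows whether all pods have reached the running state, which is the precondition for step 2.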

OPSX-4603: BuildKit in ECS in CML Private Cloud

Issue: BuildKit was introduced in ECS to build images of models and experiments. It replaces Docker, which was previously used to build images of CML models and experiments in ECS. BuildKit is available only on the RHEL 8.x and CentOS 8.x operating systems.

BuildKit in CML Private Cloud 1.5.2 is a Technical Preview feature. Docker must therefore still be installed on the nodes/hosts for models and experiments to work smoothly. An upcoming release will completely remove the Docker dependency from the nodes.

Workaround: None.

DSE-32285: Migration: Migrated models are failing due to image pull errors

Issue: After a CDSW to CML migration (on-premises) using the full-fledged migration tool, migrated models on the CML Private Cloud workspace fail on initial deployment. This is because the initial model deployment tries to pull images from the on-premises registry.

Workaround: Redeploy the migrated model. Because redeployment involves the build and deploy process, the image is built, pushed to the registry configured for the CML Private Cloud workspace, and then consumed from there for further use.

DSE-28768: Spark Pushdown is not working with Scala 2.11 runtime

Issue: Scala and R are not supported for Spark Pushdown.

Workaround: None.

DSE-32304: On CML Private Cloud ECS terminal and SSH connections can terminate

Issue: In Private Cloud ECS, CML Terminal and SSH connections can terminate after an unpredictable amount of time, usually after 4-10 minutes. This issue affects the use of local IDEs with CML, as well as any customer application that uses a websocket connection.

Workaround: None.

DSE-35251: Web pod crashes if forking a project takes more than 60 minutes

The web pod crashes if forking a project takes more than 60 minutes. This is because the timeout, which is controlled by the grpc_git_clone_timeout_minutes property, is set to 60 minutes. The following error is displayed after the web pod crashes:
2024-04-23 22:52:36.384   1737    ERROR      AppServer.VFS.grpc                    crossCopy grpc error    data = [{"error":"1"},{"code":4,"details":"2","metadata":"3"},"Deadline exceeded",{}]
          ["Error: 4 DEADLINE_EXCEEDED: Deadline exceeded\n    at callErrorFromStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/call.js:31:19)\n    at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client.js:192:76)\n    at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)\n    at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)\n    at /home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/resolving-call.js:94:78\n    at process.processTicksAndRejections (node:internal/process/task_queues:77:11)\nfor call at\n    at ServiceClientImpl.makeUnaryRequest (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client.js:160:34)\n    at ServiceClientImpl.crossCopy (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)\n    at /home/cdswint/services/web/server-dist/grpc/vfs-client.js:235:19\n    at new Promise (<anonymous>)\n    at Object.crossCopy (/home/cdswint/services/web/server-dist/grpc/vfs-client.js:234:12)\n    at Object.crossCopy (/home/cdswint/services/web/server-dist/models/vfs.js:280:38)\n    at projectForkAsyncWrapper (/home/cdswint/services/web/server-dist/models/projects/projects-create.js:229:19)"]
          node:internal/process/promises:288
          triggerUncaughtException(err, true /* fromPromise */);
          ^
          Error: 4 DEADLINE_EXCEEDED: Deadline exceeded
          at callErrorFromStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/call.js:31:19)
          at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client.js:192:76)
          at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)
          at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)
          at /home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/resolving-call.js:94:78
          at process.processTicksAndRejections (node:internal/process/task_queues:77:11)
          for call at
          at ServiceClientImpl.makeUnaryRequest (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client.js:160:34)
          at ServiceClientImpl.crossCopy (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)
          at /home/cdswint/services/web/server-dist/grpc/vfs-client.js:235:19
          at new Promise (<anonymous>)
          at Object.crossCopy (/home/cdswint/services/web/server-dist/grpc/vfs-client.js:234:12)
          at Object.crossCopy (/home/cdswint/services/web/server-dist/models/vfs.js:280:38)
          at projectForkAsyncWrapper (/home/cdswint/services/web/server-dist/models/projects/projects-create.js:229:19) {
          code: 4,
          details: 'Deadline exceeded',
          metadata: Metadata { internalRepr: Map(0) {}, options: {} }
          }  

Workaround: Increase the timeout limit, for example to 120 minutes, by updating the grpc_git_clone_timeout_minutes property:
UPDATE site_config SET grpc_git_clone_timeout_minutes = <new value>;
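
As a concrete instance of the statement above, the following sketch prints it with the 120-minute example value filled in. The helper name is hypothetical. How the statement is executed against the workspace database depends on your environment and is not covered here.

```shell
#!/bin/sh
# Build the UPDATE statement for a given timeout in minutes.
# Rejects anything that is not a plain positive integer.
build_timeout_sql() {
  case "$1" in
    ''|*[!0-9]*|0*) echo "timeout must be a positive integer" >&2; return 1 ;;
  esac
  printf 'UPDATE site_config SET grpc_git_clone_timeout_minutes = %s;\n' "$1"
}

build_timeout_sql 120
# prints: UPDATE site_config SET grpc_git_clone_timeout_minutes = 120;
```

Validating the value first keeps a typo or stray character from ending up in the statement.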