Known Issues
You might run into some known issues while using Cloudera Machine Learning on Private Cloud.
- Using the dollar character in environment variables in Cloudera Machine Learning
Environment variables that contain the dollar ($) character are not parsed correctly by Cloudera Machine Learning (CML). For example, if you set
PASSWORD="pass$123"
in the project environment variables and then read it using the echo command, the following output is displayed:
pass23
Workaround: Use one of the following commands to print the $ sign:
echo 24 | xxd -r -p
or
echo JAo= | base64 -d
Insert the $ sign into the value of the environment variable by wrapping one of these commands in a command substitution, using either $() or ``. For example, if you want to set the environment variable to
ABC$123
specify:
ABC$(echo 24 | xxd -r -p)123
or
ABC`echo 24 | xxd -r -p`123
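The truncated output above matches ordinary shell parameter expansion, which the following sketch reproduces in a plain POSIX shell. It assumes that CML evaluates the value much like a double-quoted shell string; that is an assumption, not something the text above confirms. The variable names broken, dollar, and fixed are illustrative only.

```shell
#!/bin/sh
# Clear the positional parameters so that $1 is guaranteed to be empty.
set --

# Inside double quotes, "$1" is expanded (to nothing here) and "23" remains.
broken="pass$123"
echo "$broken"    # prints: pass23

# Produce a literal dollar sign without typing it inside the value.
# base64 -d decodes "JAo=" to "$" plus a newline; $() strips the newline.
dollar=$(echo JAo= | base64 -d)

# The substituted text is not expanded a second time, so the $ survives.
fixed="ABC${dollar}123"
echo "$fixed"     # prints: ABC$123
```

The xxd variant from the workaround (echo 24 | xxd -r -p) can be swapped in for the base64 call wherever xxd is installed.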
- DSE-37827: Jupyter's RTC extension throws an error and notebooks become unusable
In certain cases, Jupyter’s RTC (Real Time Collaboration) extension may cause errors claiming either that other sessions are active or that other processes have accessed the notebook files. After these errors appear, the notebook becomes unusable and the CML session needs to be restarted.
Workaround:
Disable the Jupyter RTC extension by performing the following tasks:
- Create a Session.
- Open the terminal.
- Enter nano /home/cdsw/.jupyter/labconfig/page_config.json.
- Add the following lines to the file:
{
  "disabledExtensions": {
    "@jupyter/collaboration-extension": true
  },
  "lockedExtensions": {
    "@jupyter/collaboration-extension": true
  }
}
- Save and close the file.
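As an alternative to editing the file in nano, the steps above can be sketched as a short script run from the session terminal. This is a minimal sketch that assumes $HOME is /home/cdsw inside the session, so the default path matches the one given above. CONFIG_DIR and CONFIG_FILE are illustrative variable names. The script overwrites page_config.json after saving a .bak copy, so merge by hand if the file already holds other settings.

```shell
#!/bin/sh
# Assumption: inside a CML session $HOME is /home/cdsw, so this resolves to
# /home/cdsw/.jupyter/labconfig. Set CONFIG_DIR beforehand to override it.
CONFIG_DIR="${CONFIG_DIR:-$HOME/.jupyter/labconfig}"
CONFIG_FILE="$CONFIG_DIR/page_config.json"

mkdir -p "$CONFIG_DIR"

# Keep a copy of any existing settings, because the file is overwritten below.
if [ -f "$CONFIG_FILE" ]; then
  cp "$CONFIG_FILE" "$CONFIG_FILE.bak"
fi

# Write the same JSON that the manual steps add to the file.
cat > "$CONFIG_FILE" <<'EOF'
{
  "disabledExtensions": {
    "@jupyter/collaboration-extension": true
  },
  "lockedExtensions": {
    "@jupyter/collaboration-extension": true
  }
}
EOF

echo "Wrote $CONFIG_FILE"
```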
- DSE-36718: Disable auto synchronization feature for users and teams
The automated team and user synchronization feature is disabled. Newly installed or upgraded workspaces do not have the automatic synchronization option in the Cloudera Machine Learning UI.
Workaround: none
- DSE-36759: AMPs and Feature Announcement sections do not work in NTP setups
Cloudera Machine Learning (CML) Private Cloud setups with a non-transparent proxy (NTP) do not function properly, which affects Accelerators for ML Projects (AMPs) and Feature Announcements. The home page freezes, the feature announcement displays an error message, and the AMPs do not load.
Workaround:
To avoid the home page freeze, copy the following environment variables from the web deployment and add them to the environment section of the API deployments:
- HTTP_PROXY
- HTTPS_PROXY
- NO_PROXY
- http_proxy
- https_proxy
- no_proxy
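If you have kubectl access to the cluster, the copy step can be sketched as follows. This is a hedged sketch, not a documented procedure: the namespace and the deployment names are placeholders that you must look up in your own workspace, and filter_proxy_vars and copy_proxy_vars are hypothetical helper names. It relies only on kubectl set env, using --list to read variables and KEY=value arguments to set them.

```shell
#!/bin/sh
# Keep only the six proxy variables from KEY=value lines read on stdin.
filter_proxy_vars() {
  grep -E '^(HTTP_PROXY|HTTPS_PROXY|NO_PROXY|http_proxy|https_proxy|no_proxy)='
}

# Usage: copy_proxy_vars <namespace> <web-deployment> <api-deployment>
# All three arguments are placeholders; use the names from your workspace.
copy_proxy_vars() {
  ns=$1 web=$2 api=$3

  # "kubectl set env --list" prints the variables defined on the deployment.
  pairs=$(kubectl -n "$ns" set env "deployment/$web" --list | filter_proxy_vars)
  if [ -z "$pairs" ]; then
    echo "No proxy variables found on deployment/$web" >&2
    return 1
  fi

  # Collect one KEY=value argument per line, then apply them in a single call.
  set --
  while IFS= read -r pair; do
    set -- "$@" "$pair"
  done <<EOF
$pairs
EOF
  kubectl -n "$ns" set env "deployment/$api" "$@"
}
```

Run copy_proxy_vars once for each API deployment. Changing the environment of a deployment updates its pod template, so the affected pods are restarted.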
- DSE-32943: Enabling Service Accounts
Teams in the CML workspace can run workloads within team projects using the Run as option for service accounts only if the service accounts were previously added manually as collaborators to the team.
- DSE-35013: First CML workspace creation fails
On RHEL 8.8, during the first CML workspace installation on GPU with an ECS external registry, pods might get stuck in the init or CrashLoop state.
The first workspace installation is expected to fail. Treat it as a test workspace, and apply the following manual workaround before creating subsequent workspaces:
- Restart or delete the pods that are in the init or CrashLoop state in the test workspace.
- Once all pods are in the running state, create new workspaces as needed.
- Delete the test workspace from the CML UI if no longer needed.
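The first step of this workaround can be sketched with kubectl, assuming you have cluster access. The namespace argument is a placeholder for the namespace of the test workspace, and stuck_pods and restart_stuck_pods are hypothetical helper names. The filter assumes the default column layout of kubectl get pods, in which STATUS is the third column.

```shell
#!/bin/sh
# Read "kubectl get pods --no-headers" output on stdin and print the names of
# pods whose STATUS column (field 3) shows an Init or CrashLoopBackOff state.
stuck_pods() {
  awk '$3 ~ /^Init/ || $3 ~ /CrashLoopBackOff/ { print $1 }'
}

# Usage: restart_stuck_pods <namespace>
# The namespace is a placeholder for the namespace of the test workspace.
restart_stuck_pods() {
  ns=$1
  kubectl -n "$ns" get pods --no-headers | stuck_pods |
  while IFS= read -r pod; do
    # A deleted pod is recreated by the controller that owns it, if any.
    kubectl -n "$ns" delete pod "$pod"
  done
}
```

After running it, check with kubectl get pods that every pod reaches the Running state before you create further workspaces.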
- OPSX-4603: BuildKit in ECS in CML Private Cloud
Issue: BuildKit was introduced in ECS for building images of models and experiments. BuildKit replaces Docker, which was previously used to build images of CML's models and experiments in ECS. BuildKit is available only on the RHEL 8.x and CentOS 8.x operating systems.
BuildKit in CML Private Cloud 1.5.2 is a Technical Preview feature. Hence, Docker must still be installed on the nodes/hosts for models and experiments to work smoothly. An upcoming release will completely eliminate the dependency on Docker on the nodes.
Workaround: None.
- DSE-32285: Migration: Migrated models are failing due to image pull errors
Issue: After a CDSW to CML migration (on-premises) using the full-fledged migration tool, migrated models on the CML Private Cloud workspace fail on initial deployment. This is because the initial model deployment tries to pull images from the on-premises registry.
Workaround: Redeploy the migrated model. Because redeployment involves the build and deploy process, the image is built and pushed to the registry configured for the CML Private Cloud workspace, and that image is used from then on.
- DSE-28768: Spark Pushdown is not working with Scala 2.11 runtime
Issue: Scala and R are not supported for Spark Pushdown.
Workaround: None.
- DSE-32304: On CML Private Cloud ECS, terminal and SSH connections can terminate
Issue: In Private Cloud ECS, CML Terminal and SSH connections can terminate after an unpredictable amount of time, usually after 4-10 minutes. This issue affects the use of local IDEs with CML, as well as any customer application that uses a websocket connection.
Workaround: None.
- DSE-35251: Web pod crashes if project forking takes more than 60 minutes
The web pod crashes if forking a project takes more than 60 minutes. This is because the timeout is set to 60 minutes by the grpc_git_clone_timeout_minutes property. The following error is displayed after the web pod crashes:
2024-04-23 22:52:36.384 1737 ERROR AppServer.VFS.grpc crossCopy grpc error data = [{"error":"1"},{"code":4,"details":"2","metadata":"3"},"Deadline exceeded",{}] ["Error: 4 DEADLINE_EXCEEDED: Deadline exceeded\n at callErrorFromStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/call.js:31:19)\n at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client.js:192:76)\n at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)\n at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)\n at /home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/resolving-call.js:94:78\n at process.processTicksAndRejections (node:internal/process/task_queues:77:11)\nfor call at\n at ServiceClientImpl.makeUnaryRequest (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client.js:160:34)\n at ServiceClientImpl.crossCopy (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)\n at /home/cdswint/services/web/server-dist/grpc/vfs-client.js:235:19\n at new Promise (<anonymous>)\n at Object.crossCopy (/home/cdswint/services/web/server-dist/grpc/vfs-client.js:234:12)\n at Object.crossCopy (/home/cdswint/services/web/server-dist/models/vfs.js:280:38)\n at projectForkAsyncWrapper (/home/cdswint/services/web/server-dist/models/projects/projects-create.js:229:19)"]
node:internal/process/promises:288
triggerUncaughtException(err, true /* fromPromise */);
^
Error: 4 DEADLINE_EXCEEDED: Deadline exceeded
    at callErrorFromStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/call.js:31:19)
    at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client.js:192:76)
    at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)
    at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)
    at /home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/resolving-call.js:94:78
    at process.processTicksAndRejections (node:internal/process/task_queues:77:11)
for call at
    at ServiceClientImpl.makeUnaryRequest (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client.js:160:34)
    at ServiceClientImpl.crossCopy (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)
    at /home/cdswint/services/web/server-dist/grpc/vfs-client.js:235:19
    at new Promise (<anonymous>)
    at Object.crossCopy (/home/cdswint/services/web/server-dist/grpc/vfs-client.js:234:12)
    at Object.crossCopy (/home/cdswint/services/web/server-dist/models/vfs.js:280:38)
    at projectForkAsyncWrapper (/home/cdswint/services/web/server-dist/models/projects/projects-create.js:229:19) {
  code: 4,
  details: 'Deadline exceeded',
  metadata: Metadata { internalRepr: Map(0) {}, options: {} }
}
Workaround: Increase the timeout limit, for example, to 120 minutes, using the grpc_git_clone_timeout_minutes property:
UPDATE site_config SET grpc_git_clone_timeout_minutes = <new value>;