Known Issues and Limitations
There are some known issues you might run into while using Cloudera AI.
Unable to deploy ONNX optimization profiles for embedding and ranking NIMs on GPUs with optimization profiles (DSE-40509)
Deploying ONNX profiles for embedding and ranking NIMs on GPUs where compatible GPU profiles exist will lead to deployment failure. Before deploying an ONNX optimization profile for embedding or ranking NIMs from the Model Hub, ensure that the NIM does not have a supported profile for the target GPU.
Cloudera AI automatic JWT authorization to Cloudera Data Warehouse is failing due to a wrong KNOX URL (DSE-39855)
Due to an issue, the Data Lake name parsed by CML version 2.0.43-b208 or later does not match the actual Data Lake name in the environment, which results in a wrong KNOX URL.
- Obtain the correct Data Lake version by running the following command using CDP CLI:
cdp datalake describe-datalake
- Override the KNOX URL in the environment variable by performing the following:
- Run the following command to save the deployment status to a file:
kubectl get deployment ds-cdh-client -o json -n mlx > /tmp/rs.json
- Edit the /tmp/rs.json file and add the below object for ds-cdh-client environment under the spec.template.spec.containers.env section.
{ "name": "FIXED_KNOX_URL", "value": "https://[***ENVIRONMENT-VARIABLE***]/value" }
- Apply the configuration.
kubectl apply -f /tmp/rs.json
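The manual edit of /tmp/rs.json can also be scripted. The following is a minimal Python sketch, not an official tool: the function name add_fixed_knox_url is hypothetical, and it assumes the container inside the deployment is also named ds-cdh-client, which the text above does not state explicitly. The URL passed in is whatever value you would otherwise type by hand.

```python
import json
from copy import deepcopy


def add_fixed_knox_url(deployment: dict, knox_url: str,
                       container_name: str = "ds-cdh-client") -> dict:
    """Return a copy of the deployment with FIXED_KNOX_URL set on one container.

    container_name is an assumption; adjust it if the container in your
    deployment is named differently.
    """
    patched = deepcopy(deployment)
    for container in patched["spec"]["template"]["spec"]["containers"]:
        if container.get("name") != container_name:
            continue
        # Remove any earlier FIXED_KNOX_URL entry so the variable appears once.
        env = [e for e in (container.get("env") or [])
               if e.get("name") != "FIXED_KNOX_URL"]
        env.append({"name": "FIXED_KNOX_URL", "value": knox_url})
        container["env"] = env
    return patched


def patch_file(path: str, knox_url: str) -> None:
    """Rewrite the saved deployment JSON in place with the extra variable."""
    with open(path) as handle:
        deployment = json.load(handle)
    with open(path, "w") as handle:
        json.dump(add_fixed_knox_url(deployment, knox_url), handle, indent=2)
```

After calling patch_file("/tmp/rs.json", ...) with your KNOX URL, the file can be applied with the kubectl apply command shown above.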
The modify-ml-serving-app command fails on Cloudera AI Inference service Azure cluster in the us-west-2 workload region (DSE-39826)
When you run the modify-ml-serving-app API on a Cloudera AI Inference service Azure cluster, the status of the application is displayed as modify:failed.
Workaround: You must first delete the instance group you want to modify using the delete-instance-group-ml-serving-app API. Then, recreate the instance group, modify the configuration based on your requirements, and add the instance group using the add-instance-groups-ml-serving-app API.
Automatic synchronization of teams and users feature is disabled in the UI (DSE-36718)
Due to some issues with the automatic synchronization of teams and users, this feature is disabled in the UI in the September 26, 2024 release for version 2.0.46-b200. New installations and upgraded workbenches do not have the auto synchronization option in the UI.
Workaround: You can manually synchronize teams and synchronize users.
Web pod crashes if a project forking takes more than 60 minutes (DSE-35251)
2024-04-23 22:52:36.384 1737 ERROR AppServer.VFS.grpc crossCopy grpc error data = [{"error":"1"},{"code":4,"details":"2","metadata":"3"},"Deadline exceeded",{}]
["Error: 4 DEADLINE_EXCEEDED: Deadline exceeded\n at callErrorFromStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/call.js:31:19)\n at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client.js:192:76)\n at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)\n at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)\n at /home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/resolving-call.js:94:78\n at process.processTicksAndRejections (node:internal/process/task_queues:77:11)\nfor call at\n at ServiceClientImpl.makeUnaryRequest (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client.js:160:34)\n at ServiceClientImpl.crossCopy (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)\n at /home/cdswint/services/web/server-dist/grpc/vfs-client.js:235:19\n at new Promise (<anonymous>)\n at Object.crossCopy (/home/cdswint/services/web/server-dist/grpc/vfs-client.js:234:12)\n at Object.crossCopy (/home/cdswint/services/web/server-dist/models/vfs.js:280:38)\n at projectForkAsyncWrapper (/home/cdswint/services/web/server-dist/models/projects/projects-create.js:229:19)"]
node:internal/process/promises:288
triggerUncaughtException(err, true /* fromPromise */);
^
Error: 4 DEADLINE_EXCEEDED: Deadline exceeded
at callErrorFromStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/call.js:31:19)
at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client.js:192:76)
at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)
at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)
at /home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/resolving-call.js:94:78
at process.processTicksAndRejections (node:internal/process/task_queues:77:11)
for call at
at ServiceClientImpl.makeUnaryRequest (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/client.js:160:34)
at ServiceClientImpl.crossCopy (/home/cdswint/services/web/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)
at /home/cdswint/services/web/server-dist/grpc/vfs-client.js:235:19
at new Promise (<anonymous>)
at Object.crossCopy (/home/cdswint/services/web/server-dist/grpc/vfs-client.js:234:12)
at Object.crossCopy (/home/cdswint/services/web/server-dist/models/vfs.js:280:38)
at projectForkAsyncWrapper (/home/cdswint/services/web/server-dist/models/projects/projects-create.js:229:19) {
code: 4,
details: 'Deadline exceeded',
metadata: Metadata { internalRepr: Map(0) {}, options: {} }
}
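Workaround: Increase the timeout by updating the grpc_git_clone_timeout_minutes value in the site_config table to a value that is sufficient for the forking operation:
UPDATE site_config SET grpc_git_clone_timeout_minutes = <new value>;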
Enabling Service Accounts (DSE-32943)
Teams in the Cloudera AI Workbench can run workloads within team projects with the Run as option for service accounts only if the service accounts were previously added manually as collaborators to the team.
Working with files larger than 1 MB in Jupyter causes error (OPSAPS-61524)
While working on files or saving files of size larger than 1 MB, Jupyter Notebook may display an error message such as 413 Request Entity Too Large.
Workaround:
Clean up the notebook cell results often to keep the notebook below 1 MB. Use the kubectl CLI to add the following annotation to the ingress corresponding to the session.
annotations:
  nginx.ingress.kubernetes.io/proxy-body-size: "0"
- Get the session ID (the alphanumeric suffix in the URL) from the web UI.
- Get the corresponding namespace:
kubectl get pods -A | grep <session ID>
- List the ingress in the namespace:
kubectl get ingress -n <user-namespace> | grep <session ID>
- In the metadata, add the annotation:
kubectl edit ingress <ingress corresponding to the session> -n <user-namespace>
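For illustration, the annotation step can be expressed as a small transformation on an ingress manifest exported as JSON. This is a hedged sketch rather than a supported procedure: the function name allow_unlimited_body is hypothetical, and it only adds the annotation named above to a dictionary; applying the result to the cluster is still done with kubectl.

```python
from copy import deepcopy

# Annotation key taken from the workaround above.
PROXY_BODY_SIZE = "nginx.ingress.kubernetes.io/proxy-body-size"


def allow_unlimited_body(ingress: dict) -> dict:
    """Return a copy of the ingress manifest with proxy-body-size set to "0"."""
    patched = deepcopy(ingress)
    metadata = patched.setdefault("metadata", {})
    # Keep existing annotations and add or overwrite only the one we need.
    annotations = dict(metadata.get("annotations") or {})
    annotations[PROXY_BODY_SIZE] = "0"
    metadata["annotations"] = annotations
    return patched
```

The value is a string ("0"), not a number, because Kubernetes annotation values must be strings.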
Terminal does not stop after time-out (DSE-12064)
After a web session times out, the terminal should stop running, but it remains functional.
Cloudera AI Workbench upgrades disabled with NTP
Upgrades are disabled for Cloudera AI Workbench configured with non-transparent proxy (NTP). This issue is anticipated to be fixed in a subsequent hotfix release.
Using dollar character in environment variables in Cloudera AI
Environment variables with the dollar ($) character are not parsed correctly by Cloudera AI. For example, if you set PASSWORD="pass$123" in the project environment variables, and then try to read it using the echo command, the following output will be displayed: pass23
Workaround: Use one of the following commands to print the $ sign:
echo 24 | xxd -r -p
or
echo JAo= | base64 -d
Insert the command into the value of the environment variable using command substitution with $() or ``. For example, if you want to set the environment variable to ABC$123, specify:
ABC$(echo 24 | xxd -r -p)123
or
```
ABC`echo 24 | xxd -r -p`123
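To see why these two commands work, note that 24 is the hexadecimal ASCII code of the dollar sign and JAo= is the Base64 encoding of a dollar sign followed by a newline. The short Python sketch below (function names are illustrative only) decodes both values.

```python
import base64


def dollar_from_hex() -> str:
    """Decode the hex byte 24, as `echo 24 | xxd -r -p` does."""
    return bytes.fromhex("24").decode("ascii")


def dollar_from_base64() -> str:
    """Decode JAo=, as `echo JAo= | base64 -d` does (includes a newline)."""
    return base64.b64decode("JAo=").decode("ascii")
```

The Base64 variant yields a trailing newline; shell command substitution strips trailing newlines, so both forms produce a bare $ inside $() or backticks.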
Models: Some API endpoints not fully supported
In the create_run API, run_name is not supported. Also, search_experiments only supports pagination.
When a team is added as a collaborator, it does not appear in the UI. (DSE-31570)
Run Job as displays even if the job is enabled on a service account. (DSE-31573)
If the job is enabled on a service account, the Run Job as option should not display. Even if me is selected at this point, the job still runs in the service account.
AMP archive upload fails if Project does not contain metadata YAML file
- Download the AMP zip file from GitHub
- Unzip it to a temp directory
- From the command line navigate to the root directory of the zip
- Run this command to create the new zip file:
zip -r amp.zip .
Make sure you see the .project-metadata.yaml file in the root of the zip file.
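The repackaging and the final check can be sketched in Python with the standard zipfile module. This is an illustrative sketch under one assumption: the metadata file is named .project-metadata.yaml. The function names build_amp_zip and has_root_metadata are hypothetical.

```python
import zipfile
from pathlib import Path

# Assumed name of the AMP metadata file expected at the archive root.
METADATA_NAME = ".project-metadata.yaml"


def build_amp_zip(amp_root, zip_path) -> None:
    """Zip the contents of amp_root so its files sit at the archive root."""
    amp_root, zip_path = Path(amp_root), Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(amp_root.rglob("*")):
            # Skip directories and the archive itself if it lives in amp_root.
            if path.is_file() and path.resolve() != zip_path.resolve():
                archive.write(path, path.relative_to(amp_root).as_posix())


def has_root_metadata(zip_path) -> bool:
    """Return True if the metadata file is at the root of the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return METADATA_NAME in archive.namelist()
```

Calling has_root_metadata on the new archive before uploading mirrors the manual check in the last step.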
Cloning from Git using SSH is not supported via HTTP proxy
Workaround: Cloudera AI Projects support HTTPS for cloning Git projects. Use HTTPS instead of SSH.
Model deployments requiring outbound access via proxy do not honor HTTP_PROXY, HTTPS_PROXY environment variables
Workaround: Add the HTTP_PROXY, HTTPS_PROXY, http_proxy and https_proxy environment variables to the cdsw-build.sh file of the Project Repository.
Application does not restart after upgrade or migration
An application may fail to automatically restart after a workbench upgrade or migration. In this case, manually restart the application.
Do not use backtick characters in environment variable names
Avoid using backtick characters (`) in environment variable names, as this will cause sessions to fail with exit code 2.
AI Registry is not supported on R models
AI Registry is not supported on R models.
The mlflow.log_model registered model files might not be available on NFS Server (DSE-27709)
When using mlflow.log_model, registered model files might not be available on the NFS server due to NFS server settings or network connections. This could cause the model to remain in the registering status.
- Re-register the model. It will register as an additional version, but it should correct the problem.
- Add the ARTIFACT_SYNC_PERIOD environment variable to the hdfscli-server Kubernetes deployment and set it to an integer value. The model registry retries for twice the number of seconds specified by this value. If ARTIFACT_SYNC_PERIOD is set to 30, the model registry retries for 60 seconds. The default value is 10, so the model registry retries for 20 seconds. For example: - name: ARTIFACT_SYNC_PERIOD value: "30".
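The relationship between the setting and the retry window is simple arithmetic. The sketch below restates it in Python; the function name registry_retry_seconds is illustrative and is not part of any Cloudera API.

```python
# Default value of ARTIFACT_SYNC_PERIOD as stated above.
DEFAULT_ARTIFACT_SYNC_PERIOD = 10


def registry_retry_seconds(
        artifact_sync_period: int = DEFAULT_ARTIFACT_SYNC_PERIOD) -> int:
    """Return the retry window in seconds: twice the sync period."""
    if artifact_sync_period <= 0:
        raise ValueError("ARTIFACT_SYNC_PERIOD must be a positive integer")
    return 2 * artifact_sync_period
```

This can help when choosing a value: pick a period equal to half of the time you expect the files to need to appear on the NFS server.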
Applications appear in failed state after upgrade (DSE-23330)
After upgrading Cloudera AI from version 1.29.0 on AWS, some applications may be in a Failed state. The workaround is to restart the application.
Cannot use hashtag character in JDBC connection string
The special character # (hashtag) cannot be used in a password that is then used in a JDBC connection string. Avoid using this special character, or use '%23' instead.
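As a minimal illustration of the substitution, the following Python sketch replaces each # in a password with %23 before the password is placed in a connection string. The function name is hypothetical, and the sketch deliberately changes only the # character, since that is the only substitution described here.

```python
def escape_hash_for_jdbc(password: str) -> str:
    """Replace every '#' with its percent-encoded form '%23'."""
    return password.replace("#", "%23")
```

Whether other special characters also need encoding depends on the JDBC driver and is outside the scope of this sketch.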
Cloudera AI Workbench installation fails
Cloudera AI Workbench installation with Azure NetApp Files on NFS v4.1 fails. The workaround is to use NFS v3.
Spark executors fail due to insufficient disk space
Generally, the administrator should estimate the shuffle data set size before provisioning the workbench, and then specify the root volume size of the compute node that is appropriate given that estimate. For more specific guidelines, see the following resources.
Runtime Addon fails to load (DSE-16200)
A Spark runtime add-on may fail when upgrading a workbench.
Solution: To resolve this problem, try to reload the add-on. In the option menu next to the failed add-on, select Reload.
Cloudera AI Workbench provisioning times out
When provisioning a Cloudera AI Workbench, the process may time out with an error similar to Warning FailedMount or Failed to sync secret cache:timed out waiting for the condition. This can happen on AWS or Azure.
Solution: Delete the workbench and retry provisioning.
Cloudera AI endpoint connectivity from Cloudera Data Hub and Cloudera Data Engineering (DSE-14882)
When Cloudera services connect to Cloudera AI services, if the Cloudera AI Workbench is provisioned on a public subnet, traffic is routed out of the VPC first, and then routed back in. On Cloudera Private Cloud Cloudera AI, traffic is not routed externally.
NFS performance issues on AWS EFS (DSE-12404)
Cloudera AI uses NFS as the filesystem for storing application and user data. NFS performance may be much slower than expected in situations where a data scientist writes a very large number (typically in the thousands) of small files. Example tasks include: using git clone to clone a very large source repository (such as TensorFlow), or using pip to install a Python package that includes JavaScript code (such as plotly). Reduced performance is particularly common with Cloudera AI on AWS (which uses EFS), but it may be seen in other environments.
Disable file upload and download (DSE-12065)
You cannot disable file upload and download when using the Jupyter Notebook.
Remove Workbench operation fails (DSE-8834)
Remove Workbench operation fails if workbench creation is still in progress.
API does not enforce a maximum number of nodes for Cloudera AI Workbench
Problem: When the API is used to provision a new Cloudera AI Workbench, it does not enforce an upper limit on the autoscale range.
Downscaling Cloudera AI Workbench nodes does not work as expected (MLX-637, MLX-638)
Problem: Downscaling nodes does not work as seamlessly as expected due to a lack of Bin Packing on the Spark default scheduler, and because dynamic allocation is not currently enabled. As a result, currently infrastructure pods, Spark driver/executor pods, and session pods are tagged as non-evictable using the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation.
Limitations
- Cloudera AI Inference service and AI Registry are not supported on Azure East Asia and Qatar Central regions
Cloudera AI Inference service and AI Registry are not supported on Microsoft Azure East Asia and Qatar Central regions due to lack of support for Workload Identity by Microsoft Azure.
Technical Service Bulletins
- TSB 2024-761: Orphan EBS Volumes in Cloudera AI Workbench
Cloudera AI provisions Elastic Block Store (EBS) volumes during provisioning of a workbench. Due to missing labels on Cloudera AI Workbench, delete operations on previously restored Cloudera AI Workbench didn’t clean up a subset of the provisioned block volumes.
- Knowledge article
- For the latest update on this issue see the corresponding Knowledge article: TSB 2024-761: Orphan EBS Volumes in Cloudera AI Workbench
- TSB 2023-628: Sensitive user data getting collected in Cloudera AI Workbench or CDSW workbench diagnostic bundles
When using Cloudera Data Science Workbench (CDSW), Cloudera recommends users to store sensitive information, such as passwords or access keys, in environment variables rather than in the code. See Engine Environment Variables in the official Cloudera documentation for details. Cloudera recently learned that all session environment variables in the affected releases of CDSW and Cloudera AI are logged in web pod logs, which may be included in support diagnostic bundles sent to Cloudera as part of support tickets.
- Knowledge article
- For the latest update on this issue see the corresponding Knowledge article: TSB-2023-628: Sensitive user data getting collected in Cloudera AI Workbench or in CDSW workbench diagnostic bundles
- TSB 2022-588: Kubeconfig and new version of aws-iam-authenticator
Regenerate Kubeconfig and in conjunction use a newer version of aws-iam-authenticator on AWS. Kubeconfig in Cloudera Public Cloud Data Services needs to be regenerated because the Kubeconfig generated before June 15, 2022 uses an old APIVersion (client.authentication.k8s.io/v1alpha1) which is no longer supported. This causes compatibility issues with aws-iam-authenticator starting from v0.5.7. To be able to use the new aws-iam-authenticator, the Kubeconfig needs to be regenerated.
- Knowledge article
- For the latest update on this issue see the corresponding Knowledge article: TSB-2022-588: Kubeconfig and new version of aws-iam-authenticator