Known Issues and Limitations
There are some known issues you might run into while using Cloudera Machine Learning.
Do not use backtick characters in environment variable names
Avoid using backtick characters ( `
) in environment variable names, as
this will cause sessions to fail with exit code 2.
Model Registry is not supported on R models
Model Registry is not supported on R models.
Model Registry model name cannot exceed 19 characters
The Model Registry model name cannot exceed 19 characters.
The mlflow.log_model registered model files might not be available on NFS Server
When using mlflow.log_model, registered model files might not be available on the NFS server due to NFS server settings or network connections. This could cause the model to remain in the registering status.
- Re-register the model. It will register as an additional version, but it should correct the problem.
- Add the ARTIFACT_SYNC_PERIOD environment variable to hdfscli-server Kubernetes deployment and set it to an integer value. This will set the model registry retry operation to twice the number of seconds specified by the artifact sync period integer value. If the ARTIFACT_SYNC_PERIOD is set to 30 seconds then model registry will retry for 60 seconds. The default value is 10 and model registry retries for 20 seconds. For example: -name: ARTIFACT_SYNC_PERIOD value: “30”.
Applications appear in failed state after upgrade (DSE-23330)
After upgrading CML from version 1.29.0 on AWS, some applications may be in a Failed state. The workaround is to restart the application.
Cannot use hashtag character in JDBC connection string
The special character #
(hashtag) cannot be used in a password that is
then used in a JDBC connection string. Avoid using this special character, or use '%23'
instead.
CML workspace installation fails
CML workspace installation with Azure NetApp Files on NFS v4.1 fails. The workaround is to use NFS v3.
Spark executors fail due to insufficient disk space
Generally, the administrator should estimate the shuffle data set size before provisioning the workspace, and then specify the root volume size of the compute node that is appropriate given that estimate. For more specific guidelines, see the following resources.
Runtime Addon fails to load (DSE-16200)
A Spark runtime add-on may fail when upgrading a workspace.
Solution: To resolve this problem, try to reload the add-on. In Reload.
, in the option menu next to the failed add-on, selectCML workspace provisioning times out
When provisioning a CML workspace, the process may time out with an error similar to
Warning FailedMount
or Failed to sync secret cache:timed out
waiting for the condition.
This can happen on AWS or Azure.
Solution: Delete the workspace and retry provisioning.
CML endpoint connectivity from DataHub and Cloudera Data Engineering (DSE-14882)
When CDP services connect to CML services, if the ML workspace is provisioned on a public subnet, traffic is routed out of the VPC first, and then routed back in. On Private Cloud CML, traffic is not routed externally.
Jupyter Notebook sessions do not time out (DSE-13741)
Jupyter Notebook sessions in legacy engine:8 through
engine:13 do not exit after IDLE_MAXIMUM_MINUTES
of
inactivity. They will run until SESSION_MAXIMUM_MINUTES
(which is seven
days by default).
Workaround
You can change the configuration of your cluster to apply the fix for this issue. Change the editor command for Jupyter Notebook in every engine that uses it to the following:
NOTEBOOK_TIMEOUT_SECONDS=$(python3 -c “print(${IDLE_MAXIMUM_MINUTES}*60)“)
/usr/local/bin/jupyter notebook --no-browser --ip=127.0.0.1 --port=${CDSW_APP_PORT}
--NotebookApp.token= --NotebookApp.allow_remote_access=True --NotebookApp.quit_button=False
--log-level=ERROR --NotebookApp.shutdown_no_activity_timeout=300
--MappingKernelManager.cull_idle_timeout=${NOTEBOOK_TIMEOUT_SECONDS}
--TerminalManager.cull_inactive_timeout=${NOTEBOOK_TIMEOUT_SECONDS}
--MappingKernelManager.cull_interval=60 --TerminalManager.cull_interval=60
--MappingKernelManager.cull_connected=True
- Kills each running notebook after IDLE_MAXIMUM_MINUTES of inactivity
- Kills the CDSW/CML session in which Jupyter is running after 5 minutes with no notebooks
NFS performance issues on AWS EFS (DSE-12404)
CML uses NFS as the filesystem for storing application and user data. NFS performance may be much slower than expected in situations where a data scientist writes a very large number (typically in the thousands) of small files. Example tasks include: using git clone to clone a very large source repository (such as TensorFlow), or using pip to install a Python package that includes JavaScript code (such as plotly). Reduced performance is particularly common with CML on AWS (which uses EFS), but it may be seen in other environments.
Disable file upload and download (DSE-12065)
You cannot disable file upload and download when using the Jupyter Notebook.
Remove Workspace operation fails (DSE-8834)
Remove Workspace operation fails if workspace creation is still in progress.
API does not enforce a maximum number of nodes for ML workspaces
Problem: When the API is used to provision new ML workspaces, it does not enforce an upper limit on the autoscale range.
Downscaling ML workspace nodes does not work as expected (MLX-637, MLX-638)
Problem: Downscaling nodes does not work as seamlessly as expected due to a lack of
Bin Packing on the Spark default scheduler, and because dynamic allocation is not currently
enabled. As a result, currently infrastructure pods, Spark driver/executor pods, and session
pods are tagged as non-evictable using the
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
annotation.