Known issues
There are some known issues you might run into while using Cloudera AI Inference service.
- The following compute instance types are not supported by Cloudera AI Inference service:
  - Azure: NVadsA10_v5 series
  - AWS: p4d.24xlarge
- Unclean deletion of Cloudera AI Inference service version 1.2.0 and older. If you delete Cloudera AI Inference service version 1.2.0 or older, some Kubernetes resources are left in the cluster, which causes a subsequent creation of Cloudera AI Inference service on the same cluster to fail. It is recommended that you delete the Compute cluster and recreate it before deploying Cloudera AI Inference service on it again.
- Graceful deletion of Cloudera AI Inference service versions older than 1.3.0-b111 fails. A new feature introduced in version 1.3.0-b111 caused a regression where graceful deletion of an existing Cloudera AI Inference service version 1.2.0 fails. Workaround: Use CDP CLI version 0.9.131 or higher to forcefully delete Cloudera AI Inference service. Cloudera recommends that after a forceful deletion of Cloudera AI Inference service, you also delete the underlying compute cluster to ensure that all resources are cleaned up properly.
  cdp ml delete-ml-serving-app --app-crn [***APP_CRN***] --force
- Updating a model's description after the model has been added to a model endpoint leads to a mismatch in the model builder UI between the models the model builder lists and the models that are deployed.
- When you create a model endpoint from the Create Endpoint page, the instance type selection appears to be optional, but endpoint creation fails if no instance type is selected.
- DSE-39626: If no worker node can be found within 60 minutes to schedule a model endpoint that is either newly created or scaling up from zero, the system gives up trying to create and schedule the replica. A common reason for this behavior is insufficient cloud quota or capacity constraints on the cloud service provider's side. You can either request an increased quota or use an instance type that is more readily available.
- To bring up an endpoint after a revision has failed, the endpoint configuration must be updated. Currently, this can be done by changing the autoscale range or the resource requirements.
- When updating a model endpoint with a specific GPU requirement, the instance type must be explicitly set again even if it has not changed.
- Embedding models function in two modes: query or passage. The mode must be specified when interacting with the models. There are two ways to do this, as sketched below:
  - suffix the model ID in the payload with either -query or -passage, or
  - specify the input_type parameter in the request payload.
  For more information, see the NVIDIA documentation.
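  The following is a minimal sketch of both options, assuming an OpenAI-compatible /v1/embeddings endpoint; the URL, model ID, and token below are placeholders, not values from this documentation.

  ```python
  import requests

  URL = "https://<inference-endpoint>/v1/embeddings"   # placeholder endpoint URL
  HEADERS = {"Authorization": "Bearer <CDP_TOKEN>"}    # placeholder auth token

  # Option 1: suffix the model ID with -query (or -passage).
  resp = requests.post(URL, headers=HEADERS, json={
      "model": "nvidia/nv-embedqa-e5-v5-query",        # hypothetical model ID
      "input": ["What is Cloudera AI Inference service?"],
  })

  # Option 2: keep the base model ID and pass input_type in the payload.
  resp = requests.post(URL, headers=HEADERS, json={
      "model": "nvidia/nv-embedqa-e5-v5",              # hypothetical model ID
      "input": ["What is Cloudera AI Inference service?"],
      "input_type": "query",                           # or "passage"
  })
  print(resp.json())
  ```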
- Embedding models only accept strings as input. Token stream input is currently not supported.
- Llama 3.2 Vision models are not supported on AWS on A10G and L40S GPUs.
- The Llama 3.1 70B Instruct model L40S profile needs 8 GPUs to deploy successfully, while NVIDIA documentation lists this model profile as needing only 4 L40S GPUs.
- Mistral 7B models for NIM version 1.1.2 require the max_tokens parameter in the request payload. This API regression is known to affect the Test Model UI functionality for this specific NIM version.
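  As a hedged illustration of the workaround, a chat completion request against this NIM version must set max_tokens explicitly; the endpoint URL, model ID, and token below are placeholders.

  ```python
  import requests

  # Placeholder URL, model ID, and token; max_tokens is the relevant field here.
  resp = requests.post(
      "https://<inference-endpoint>/v1/chat/completions",
      headers={"Authorization": "Bearer <CDP_TOKEN>"},
      json={
          "model": "mistralai/mistral-7b-instruct-v0.3",  # hypothetical model ID
          "messages": [{"role": "user", "content": "Hello"}],
          "max_tokens": 256,  # required on NIM 1.1.2; requests without it fail
      },
  )
  print(resp.json())
  ```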
- NIM endpoints reply with a 307 temporary redirect if the URL ends with a trailing slash (/). Make sure there is no trailing slash character at the end of NIM endpoint URLs; a minimal sketch of normalizing the URL follows this item.
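  One way to guard against this in client code, shown with a placeholder URL:

  ```python
  # Strip any trailing slash so the request is not answered with a 307 redirect.
  base = "https://<inference-endpoint>/v1/chat/completions/"  # placeholder URL
  url = base.rstrip("/")
  ```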
- Model Runtimes have been changed in a non-backward-compatible way between Cloudera AI Inference service versions 1.2 and 1.3. Therefore, NIM model endpoints deployed in version 1.2 need to be redeployed by downloading their profiles again through Model Hub and creating a new endpoint from the most recent version of the model in the Cloudera AI Registry.
- You cannot upgrade from Cloudera AI Inference service version 1.3.0-b111 to a higher version. You must first delete the service and recreate it to deploy version 1.3.0-b113 or higher.
- Hugging Face model deployment fails in Cloudera AI Inference service 1.3.0-b114.
- Specifying subnets for the load balancer from the UI when creating Cloudera AI Inference service does not work. The specified subnets are accepted by the UI, but these settings are not actually applied to the load balancer service created in the cluster. Workaround: Use CDP CLI to specify subnets for the load balancer.