Cloudera AI Inference service Concepts
Learn the following Cloudera AI Inference service concepts before setting up the Cloudera AI Inference service.
Platform
Cloudera AI Inference service is a Kubernetes-native model inference platform built on the Cloud Native Computing Foundation (CNCF)-hosted KServe model orchestration system. It provides enterprise-grade scalability, security, and performance through integrations with NVIDIA NIM, NVIDIA Triton, and Cloudera Data Platform, and is designed to serve trained models for production-level use cases.
Runtime
Runtimes are the basic building blocks responsible for loading trained model artifacts into memory and providing APIs that client applications can invoke to run inference requests. Runtimes also expose metrics for monitoring the performance of the models. The Runtimes supported in this release include various NVIDIA NIM versions for text-generation and embedding tasks, and NVIDIA Triton for deep-learning models using the ONNX backend.
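As a minimal sketch of what invoking such an API can look like, the following example sends a request to a deployed text-generation endpoint, assuming an OpenAI-compatible chat-completions interface (as NVIDIA NIM text-generation runtimes expose). The endpoint URL, model name, and access token are placeholders; copy the real values from your model endpoint's details.

```python
import requests

# Placeholder values: replace with the endpoint URL and model name shown in your
# model endpoint's details, and a valid access token for authentication.
ENDPOINT_URL = "https://<your-model-endpoint-url>/v1/chat/completions"
MODEL_NAME = "<your-model-name>"
ACCESS_TOKEN = "<your-access-token>"

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": "Summarize KServe in one sentence."}],
        "max_tokens": 128,
    },
    timeout=60,
)
response.raise_for_status()
# Assumes an OpenAI-compatible response body with a "choices" list.
print(response.json()["choices"][0]["message"]["content"])
```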
Autoscaling
Cloudera AI Inference service provides Cluster Node autoscaling and Model Endpoint autoscaling.
- Cluster Node Autoscaling
If a Cloudera AI Inference service runs on an autoscaling Kubernetes cluster, it can be configured with multiple autoscaling worker node groups. Two default node groups run certain core services, and an arbitrary number of worker node groups host user workloads. You can add worker node groups to the cluster or delete them, and you can change the autoscaling range of an existing node group.
- Model Endpoint Autoscaling
In addition to worker node autoscaling, Cloudera AI Inference service provides autoscaling at the model endpoint level by increasing or decreasing the number of replicas based on predefined, customizable scaling criteria. Scaling to and from zero replicas is also supported. The supported scaling criteria (or metrics) are requests per second (RPS) and concurrency per replica of the model endpoint. For instance, you can configure your model endpoint to scale up when the number of concurrent requests per replica exceeds 100, so that request latencies are maintained at an acceptable level (a sketch of this replica arithmetic follows this list).
Model endpoint replicas are terminated when they have not received any requests for a certain amount of time. For large language models, this idle timeout is set to 1 hour; for other models it is set to 10 minutes.
- Autoscaling Latency
Be aware of the time it takes to scale up a model replica, because this latency can adversely affect user experience. The approximate scale-up latency can be calculated with the following expression (a worked numeric example follows this list):
T = Tn + Tc + Td + Tm
where:
- Tn: The time needed for a newly scaled-up worker node to become ready. This can be zero if the new model replica is scheduled on a node that is already in the cluster, or very large (even unbounded) depending on the availability of the requested instance type. In general, larger instance types take longer to reach the ready state.
- Tc: The time needed to pull the container images of the model replica pod. This is negligible if the node already has the images.
- Td: The time taken to download the model artifacts from the AI Registry storage to the instance volume. This varies widely with the size of the model, from a few seconds to hours for the largest models. For large language models such as Llama 3.1 70B or larger, this term is the dominant one.
- Tm: The time required to load the model objects from the instance volume into the memory of one or more GPUs for models that require GPUs, or into system memory for CPU-only models.
Based on this equation, scaling a large language model deployment from 0 to 1 replica can take anywhere from a few minutes to an hour or more. Keep this in mind when planning your model deployments and their expected SLAs.
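To make the expression above concrete, here is a minimal sketch that plugs illustrative numbers into T = Tn + Tc + Td + Tm. The node readiness time, image pull time, model size, and bandwidth figures are assumptions chosen for illustration, not measured values from the service.

```python
# Minimal sketch: estimate model endpoint scale-up latency T = Tn + Tc + Td + Tm.
# All input values below are illustrative assumptions, not measured figures.

GIB = 1024 ** 3

def scale_up_latency_s(
    node_ready_s: float,      # Tn: time for a new worker node to become ready (0 if node exists)
    image_pull_s: float,      # Tc: time to pull container images (0 if already cached)
    model_size_bytes: float,  # size of the model artifacts in the AI Registry
    download_bps: float,      # effective download bandwidth to the instance volume
    load_bps: float,          # effective load bandwidth from volume into GPU/system memory
) -> float:
    t_d = model_size_bytes / download_bps  # Td: artifact download time
    t_m = model_size_bytes / load_bps      # Tm: model load time
    return node_ready_s + image_pull_s + t_d + t_m

# Example: a ~140 GiB large language model, fresh node, cold image cache.
t = scale_up_latency_s(
    node_ready_s=300,            # 5 minutes for a GPU node to join and become ready
    image_pull_s=120,            # 2 minutes to pull runtime images
    model_size_bytes=140 * GIB,
    download_bps=0.5 * GIB,      # ~0.5 GiB/s from registry storage
    load_bps=2 * GIB,            # ~2 GiB/s from volume into GPU memory
)
print(f"Estimated scale-up latency: {t / 60:.1f} minutes")  # prints about 12.8 minutes here
```

In this scenario the artifact download term (Td) is the largest single contributor, which matches the observation above that Td dominates for large language models.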
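Separately, the concurrency-based criterion described under Model Endpoint Autoscaling can be sketched as simple arithmetic. The target of 100 concurrent requests per replica and the replica bounds are assumed example values; the actual scaling decision is made by the platform, not by client code.

```python
import math

# Illustrative sketch of concurrency-based endpoint autoscaling.
# Target and bounds are example assumptions; the platform performs the real scaling.

def desired_replicas(observed_concurrency: float,
                     target_concurrency_per_replica: float,
                     min_replicas: int = 0,
                     max_replicas: int = 4) -> int:
    """Return the replica count needed to keep per-replica concurrency at or below target."""
    if observed_concurrency <= 0:
        needed = 0  # scale to zero when the endpoint is idle
    else:
        needed = math.ceil(observed_concurrency / target_concurrency_per_replica)
    return max(min_replicas, min(needed, max_replicas))

# 250 in-flight requests with a target of 100 concurrent requests per replica -> 3 replicas.
print(desired_replicas(observed_concurrency=250, target_concurrency_per_replica=100))
```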
Canary Rollout and Rollback
Cloudera AI Inference service lets you roll out a new version of a model without bringing down the currently running version, using the canary rollout feature. When a new model endpoint is created with version A of a model, the system routes 100% of the traffic to the deployed version. You can then roll out another version, say B, under the same model endpoint URL and give it a percentage of the traffic, say 10%. If the new version rolls out successfully, the system sends 90% of the traffic to version A and 10% to version B. If version B does not come up successfully for some reason, the system continues to send 100% of the traffic to the known-good version, A.
Note that each version of the model that is rolled out consumes the resources allocated to each model endpoint replica. Therefore, if each replica is configured to use 2 CPUs and 5Gi of memory, the resource footprint doubles while versions A and B are both running. If version B turns out to be good, you can send 100% of the traffic to it and scale version A down to 0 replicas to reduce the resource consumption back to that of a single version. If, after sending 100% of the traffic to version B, you decide that version A is preferred, you can roll back to the previous version: set the traffic for version B to 0%, and the traffic is automatically sent back to version A. Setting the traffic to 0% rolls back to the last version that was ready with 100% traffic.
Example:
Suppose you begin with version A receiving 100% of the traffic. A new model, version B, is introduced and allocated 25% of the traffic. Upon evaluation, version B underperforms compared to version A, based on the collected metrics. While version B is still live, you proceed to deploy version C with 100% traffic. However, version C also underperforms relative to version A. Instead of redeploying version A manually, you can set version C's traffic to 0%, and all traffic reverts to version A. Note that version B, being an intermediary version that never reached 100% traffic, is not eligible for rollback.
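The rollback semantics in this example can be summarized with a small conceptual sketch. This is a simplified model of the behavior described above, assuming every rollout becomes ready and ignoring resource accounting; it is not the service's implementation.

```python
# Conceptual sketch of the canary traffic-split and rollback behavior described above.
# Simplified model for illustration only; it assumes every rolled-out version becomes ready.

class ModelEndpoint:
    def __init__(self, first_version: str):
        self.previous = first_version   # last version that was ready with 100% traffic
        self.latest = first_version     # most recently rolled-out version
        self.latest_percent = 100

    def rollout(self, version: str, percent: int) -> None:
        """Deploy a new version under the same endpoint URL with `percent` of the traffic."""
        if self.latest_percent == 100:
            # The outgoing version had reached 100% traffic, so it becomes the rollback target.
            self.previous = self.latest
        # A version that never reached 100% (like B below) is not eligible for rollback.
        self.latest, self.latest_percent = version, percent

    def set_latest_traffic(self, percent: int) -> None:
        """Setting the latest version's traffic to 0% rolls back to the previous stable version."""
        if percent == 0:
            self.latest, self.latest_percent = self.previous, 100
        else:
            self.latest_percent = percent

    def traffic(self) -> dict:
        split = {self.latest: self.latest_percent}
        if self.latest_percent < 100:
            split[self.previous] = 100 - self.latest_percent
        return split

# Walk through the example: A at 100%, B at 25%, C at 100%, then roll C back to 0%.
ep = ModelEndpoint("A")
ep.rollout("B", 25)
print(ep.traffic())        # {'B': 25, 'A': 75}
ep.rollout("C", 100)
print(ep.traffic())        # {'C': 100}
ep.set_latest_traffic(0)
print(ep.traffic())        # {'A': 100} -- traffic reverts to A; B is not eligible
```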