Cloudera AI Inference service Concepts

Learn the following Cloudera AI Inference service concepts before setting up the Cloudera AI Inference service.

Platform

Cloudera AI Inference service is a Kubernetes-native model inference platform built using the Cloud Native Computing Foundation (CNCF)-hosted KServe model orchestration system. It delivers enterprise-grade scalability, security, and performance through integrations with NVIDIA NIM, NVIDIA Triton, and the Cloudera platform, and is designed to serve trained models for production-level use cases.

Runtime

Runtimes are the basic building blocks responsible for loading trained model artifacts into memory and providing APIs that client applications invoke to run inference requests. Runtimes also expose metrics for monitoring the performance of the models. The supported Runtimes in this release include various NVIDIA NIM versions and Hugging Face transformers for text-generation and embedding tasks, and NVIDIA Triton for deep learning models using the ONNX backend.
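As an example of how a client application can invoke a Runtime, the following is a minimal sketch that assumes a text-generation endpoint exposing an OpenAI-compatible chat completions API, as is common for NVIDIA NIM-based runtimes. The endpoint URL, model name, and API token are placeholders for values from your own deployment:

    # Minimal sketch of a client calling a text-generation model endpoint.
    # The endpoint URL, API token, and model name below are placeholders;
    # replace them with the values for your deployed endpoint.
    import requests

    ENDPOINT_URL = "https://<your-inference-endpoint>/v1/chat/completions"  # placeholder
    API_TOKEN = "<your-api-token>"  # placeholder

    response = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={
            "model": "<model-name>",  # placeholder: the model served by the endpoint
            "messages": [{"role": "user", "content": "Summarize KServe in one sentence."}],
            "max_tokens": 64,
        },
        timeout=60,
    )
    response.raise_for_status()
    print(response.json()["choices"][0]["message"]["content"])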

Autoscaling

Cloudera AI Inference service provides Model Endpoint autoscaling.

Model Endpoint Autoscaling

Cloudera AI Inference service provides autoscaling at the model endpoint level by increasing or decreasing the number of replicas based on predefined, customizable scaling criteria. Scaling to and from zero replicas is also supported. The supported scaling criteria (or metrics) are Requests Per Second (RPS) and concurrency per replica of the model endpoint. For instance, you can configure your model endpoint to scale up when the RPS per replica exceeds 100, so that request latencies remain at an acceptable level.

Model endpoint replicas are terminated when they have not received any request for a certain amount of time. For large language models, this idle timeout is set to 1 hour, whereas for other models it is set to 10 minutes.
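The following is an illustrative sketch of how such a scaling policy can be represented and evaluated. The field names (name, min_replicas, max_replicas, metric, target) and the helper function are hypothetical, shown only to make the scaling criteria concrete; they are not the Cloudera API schema:

    # Illustrative only: the field names below are hypothetical and show the
    # shape of an autoscaling policy, not the exact Cloudera API schema.
    endpoint_spec = {
        "name": "llama-chat",          # hypothetical endpoint name
        "autoscaling": {
            "min_replicas": 0,         # 0 enables scale-to-zero when idle
            "max_replicas": 4,
            "metric": "rps",           # supported criteria: requests per second or concurrency
            "target": 100,             # scale up when RPS per replica exceeds 100
        },
    }

    def needs_scale_up(observed_rps_per_replica: float, spec: dict) -> bool:
        """Return True when the observed load per replica exceeds the configured target."""
        policy = spec["autoscaling"]
        return policy["metric"] == "rps" and observed_rps_per_replica > policy["target"]

    print(needs_scale_up(120.0, endpoint_spec))  # True: 120 RPS exceeds the 100 RPS target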

Autoscaling Latency

Be aware of the time it takes to scale up a model replica, because this latency can adversely affect user experience. You can use the following expression to calculate the approximate scale-up latency:

T = Tn + Tc + Td + Tm

where:

  • Tn: The time needed for a newly scaled-up worker node to become ready. This can be zero if the new model replica is scheduled on a node that is already in the cluster, or it can be very large, and potentially unbounded, depending on the availability of the requested instance type. In general, larger instance types take longer to reach the ready status.

  • Tc: The time needed to pull the container images for the model replica pod. This is negligible if the node already has the images.

  • Td: The time taken to download the model artifacts from the Cloudera AI Registry storage to the instance volume. This varies widely with the size of the model, from a few seconds to hours for the largest models. For large language models, such as Llama 3.1 70B or bigger, this term is the dominant one.

  • Tm: The time required to load the model objects from the instance volume into the memory of one or more GPUs for models that require GPUs, or into system memory for CPU-only models.

Based on this equation, scaling a large language model deployment from 0 to 1 replicas can take anywhere from a few minutes to an hour or more, as illustrated in the sketch below. Keep this in mind when planning your model deployments and their expected SLAs.
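
As a back-of-the-envelope illustration of the expression above, the following sketch sums assumed component times for a large language model whose new replica triggers a new GPU worker node. The numbers are illustrative only; measure your own environment for realistic values:

    # Back-of-the-envelope estimate of scale-up latency using T = Tn + Tc + Td + Tm.
    # The component values are illustrative assumptions, not measured figures.
    def scale_up_latency(t_node: float, t_container: float,
                         t_download: float, t_load: float) -> float:
        """Return the total scale-up latency in seconds."""
        return t_node + t_container + t_download + t_load

    estimate = scale_up_latency(
        t_node=300,       # Tn: new GPU worker node reaches ready state (~5 minutes)
        t_container=120,  # Tc: pull the runtime container images (~2 minutes)
        t_download=1800,  # Td: download model artifacts from the registry (~30 minutes)
        t_load=180,       # Tm: load the model into GPU memory (~3 minutes)
    )
    print(f"Estimated scale-up latency: {estimate / 60:.0f} minutes")  # ~40 minutes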