Tuning auto-scaling sensitivity using the API
To customize the auto-scaling sensitivity and requirements, set the target and metric fields in the autoscaling.autoscalingconfig parameter.
The metric field specifies the data that is monitored to make
auto-scaling decisions. You can set it to either concurrency or RPS
(requests per second):
- Concurrency is the number of requests that each replica of the model aims to handle at once.
- RPS is the number of requests per second handled, calculated over the polling period.
- Target is the value of the metric that you aim to maintain. If the calculated metric value exceeds the Target value, new replicas are spun up; when it falls back below the Target value, replicas are spun down. The Target value therefore acts as the upper bound for the metric.
Set these values as follows:
# cat ./examples/mlflow/model-spec-cml-registry.json
{
  "namespace": "serving-default",
  "name": "mlflow-wine-test-from-registry-onnx",
  "source": {
    "registry_source": {
      "version": 1,
      "model_id": "yf0o-hrxq-l0xj-8tk9"
    }
  },
  "autoscaling": {
    "min_replicas": "0",
    "max_replicas": "4",
    "autoscalingconfig": {
      "metric": "concurrency",
      "target": "25"
    }
  }
}
The above example scales between zero and four replicas, adding or removing replicas so that each replica handles at most 25 concurrent requests.
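To get a feel for how the target interacts with the replica limits, the following Python sketch estimates the replica count for a given load. It is an illustration only: it assumes the autoscaler divides the total observed metric value by the target, rounds up, and clamps the result to the configured minimum and maximum. This rule is common in request-based autoscalers but is not confirmed here as the service's exact algorithm, and the function name estimate_replicas is hypothetical.

```python
import math


def estimate_replicas(observed_total, target, min_replicas=0, max_replicas=4):
    """Estimate the replica count for a given total load.

    Hypothetical helper, not part of the service API. Assumes the
    autoscaler uses ceil(observed_total / target), clamped to the
    [min_replicas, max_replicas] range.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    desired = math.ceil(observed_total / target)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Values from the example spec: target 25, between 0 and 4 replicas.
    for load in (0, 10, 25, 60, 500):
        print(load, "concurrent requests ->", estimate_replicas(load, target=25), "replicas")
```

Under these assumptions, 60 concurrent requests need three replicas (60 / 25 rounded up), while 500 concurrent requests are capped at four replicas by max_replicas. At that point each replica handles more than the target.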