Supported vLLM command line arguments for Cloudera AI Inference service 1.5.5 SP1 using vLLM 0.8.4
Cloudera AI on premises 1.5.5 SP1 uses vLLM 0.8.4. The command line arguments available in vLLM 0.8.4 are detailed in vLLM arguments.
The following command line arguments are supported with vLLM 0.8.4:
- --block-size
- --calculate-kv-scales
- --cpu-offload-gb
- --disable-cascade-attn
- --disable-chunked-mm-input
- --disable-sliding-window
- --dtype
- --enable-auto-tool-choice
- --enable-chunked-prefill
- --enable-prefix-caching
- --enforce-eager
- --gpu-memory-utilization
- --kv-cache-dtype
- --load-format
- --logprobs-mode
- --long-prefill-token-threshold
- --max-logprobs
- --max-long-partial-prefills
- --max-model-len
- --max-num-batched-tokens
- --max-num-partial-prefills
- --max-num-seqs
- --max-seq-len-to-capture
- --multi-step-stream-outputs
- --no-enable-prefix-caching
- --num-lookahead-slots
- --num-scheduler-steps
- --pipeline-parallel-size, -pp
- --preemption-mode
- --prefix-caching-hash-algo
- --quantization
- --rope-scaling
- --rope-theta
- --scheduling-policy
- --seed
- --tensor-parallel-size, -tp
- --tool-call-parser
- --trust-remote-code
For details on the command line arguments listed above, see vLLM arguments.
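As an illustration of how the list above might be used, the following Python sketch checks a set of vLLM arguments against it before they are passed to a deployment. It is not part of Cloudera AI or vLLM: the names SUPPORTED_ARGS, ALIASES, and unsupported_flags are hypothetical helpers, and the flag --example-unlisted-flag is invented purely to show a rejection. The sketch assumes arguments arrive as a list of tokens and that a flag may carry its value either as a separate token or in --flag=value form.

```python
# Hypothetical pre-flight check: not a Cloudera AI or vLLM API.
# The set below is copied from the supported-arguments list on this page.
SUPPORTED_ARGS = {
    "--block-size", "--calculate-kv-scales", "--cpu-offload-gb",
    "--disable-cascade-attn", "--disable-chunked-mm-input",
    "--disable-sliding-window", "--dtype", "--enable-auto-tool-choice",
    "--enable-chunked-prefill", "--enable-prefix-caching", "--enforce-eager",
    "--gpu-memory-utilization", "--kv-cache-dtype", "--load-format",
    "--logprobs-mode", "--long-prefill-token-threshold", "--max-logprobs",
    "--max-long-partial-prefills", "--max-model-len",
    "--max-num-batched-tokens", "--max-num-partial-prefills",
    "--max-num-seqs", "--max-seq-len-to-capture",
    "--multi-step-stream-outputs", "--no-enable-prefix-caching",
    "--num-lookahead-slots", "--num-scheduler-steps",
    "--pipeline-parallel-size", "--prefix-caching-hash-algo",
    "--preemption-mode", "--quantization", "--rope-scaling", "--rope-theta",
    "--scheduling-policy", "--seed", "--tensor-parallel-size",
    "--tool-call-parser", "--trust-remote-code",
}

# Short forms shown in the list, mapped to their long forms.
ALIASES = {
    "-pp": "--pipeline-parallel-size",
    "-tp": "--tensor-parallel-size",
}


def _is_number(token):
    """Return True for tokens such as '-1' or '0.9' that are values, not flags."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def unsupported_flags(argv):
    """Return the flags in argv that do not appear in SUPPORTED_ARGS."""
    rejected = []
    for token in argv:
        # Tokens that do not start with '-' (or are negative numbers) are values.
        if not token.startswith("-") or _is_number(token):
            continue
        # Accept both "--flag value" and "--flag=value" forms.
        flag = token.split("=", 1)[0]
        flag = ALIASES.get(flag, flag)
        if flag not in SUPPORTED_ARGS:
            rejected.append(flag)
    return rejected


if __name__ == "__main__":
    args = ["--max-model-len", "8192", "-tp", "2", "--example-unlisted-flag"]
    print(unsupported_flags(args))
```

The check only compares flag names against the list. It does not validate the values, which are described in vLLM arguments.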
