Using Hugging Face models

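Deploying a Hugging Face model might require additional configuration, such as vLLM runtime parameters and a suitable GPU. Table 1 lists the vLLM parameters, the GPU, and a test payload for a Hugging Face model.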

Table 1. Using Hugging Face models
Model name: google/gemma-4-31B-it
vLLM parameters:
  --tensor-parallel-size 2
  --quantization fp8
  --max-model-len 8192
  --max-num-batched-tokens 4096
GPU: 2 x NVIDIA L40S
Test payload:
{
  "model": "google/gemma-4-31B-it",
  "messages": [
    {
      "role": "user",
      "content": "What is Generative AI?"
    }
  ],
  "stream": true
}
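
The configuration and test payload in Table 1 can be tried out with a short script. The following Python sketch is illustrative only. It assumes that the model is served with the `vllm serve` command and that the server exposes the OpenAI-compatible `/v1/chat/completions` endpoint at `http://localhost:8000`. The base URL, the endpoint path, and the helper names `serve_command`, `parse_sse_line`, and `stream_chat` are assumptions made for this example and might differ in your deployment.

```python
import json
import shlex
import urllib.request

MODEL = "google/gemma-4-31B-it"

# vLLM parameters copied from Table 1.
VLLM_FLAGS = [
    "--tensor-parallel-size", "2",
    "--quantization", "fp8",
    "--max-model-len", "8192",
    "--max-num-batched-tokens", "4096",
]

# Test payload copied from Table 1.
PAYLOAD = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "What is Generative AI?"}],
    "stream": True,
}


def serve_command(model=MODEL, flags=VLLM_FLAGS):
    """Return the argument list for a `vllm serve` launch (hypothetical helper)."""
    return ["vllm", "serve", model, *flags]


def parse_sse_line(line):
    """Return the text fragment carried by one streamed line, or None."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank keep-alive lines and comments carry no data
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None  # end-of-stream sentinel
    choices = json.loads(data).get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


def stream_chat(base_url="http://localhost:8000", payload=PAYLOAD):
    """Yield text fragments from a streaming chat completion.

    The base URL is an assumption; replace it with your deployment's address.
    """
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        for raw in response:
            text = parse_sse_line(raw.decode("utf-8"))
            if text:
                yield text


if __name__ == "__main__":
    # Print the launch command that corresponds to Table 1.
    print(shlex.join(serve_command()))
    # After the server is running, uncomment to stream the answer:
    # for fragment in stream_chat():
    #     print(fragment, end="", flush=True)
```

Because the payload sets `"stream": true`, the response arrives as a sequence of `data:` lines rather than a single JSON document, which is why the sketch parses each line separately and stops contributing output at the `[DONE]` sentinel.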