Deploying Agent Studio-supported LLMs using Cloudera AI Inference service

Configure and validate Model Hub Large Language Models (LLMs) so that they are compatible with Agent Studio and Agentic workflows in the Cloudera AI Inference service.

The Cloudera AI Inference service supports the following two models for use with Agent Studio:

  • nemotron-3-super-120b-a12b (Agentic workflow tagged)
  • llama-3.3-nemotron-super-49b (Agentic workflow tagged)

The models use helper script plugins to support tool invocation (function calling) and advanced reasoning. Because the standard model cards reference these plugins through local relative paths, which are not resolved in containerized environments, you must supply the required runtime overrides through the NIM_PASSTHROUGH_ARGS parameter.

When deploying either of these two models using Cloudera AI, you must configure an additional environment variable during the final deployment step.

  1. In the Cloudera console, click the Cloudera AI tile.
    The Cloudera AI Workbenches page is displayed.
  2. Click Model Endpoints in the left navigation pane.
    The Endpoint Details page is displayed.
  3. Proceed through the Endpoint Details > Model Builder > Resource Profile > Advanced Options pages, entering the required details on each page.
    Figure 1. Configuring Advanced Options and Environment Variables for Model Endpoint Creation
  4. Click + Add.

  5. Select the NIM_PASSTHROUGH_ARGS environment variable from the Environment variables drop-down list.
  6. Enter the corresponding Value based on your chosen model and precision profile:
    • Nemotron Super 120B (nvidia/nemotron-3-super-120b-a12b)

      The plugin directory is precision-specific. Select the argument configuration based on the precision profile you are launching:

      • NVFP4 precision profile:

        --reasoning-parser-plugin 
        /mnt/serving/ngc/hub/models--nim--nvidia--nemotron-3-super-120b-a12b/snapshots/rl-030326-nvfp4/super_v3_reasoning_parser.py 
        --reasoning-parser super_v3 
        --enable-auto-tool-choice 
        --tool-call-parser qwen3_coder
      • FP8 precision profile:

        --reasoning-parser-plugin 
        /mnt/serving/ngc/hub/models--nim--nvidia--nemotron-3-super-120b-a12b/snapshots/rl-030326-fp8/super_v3_reasoning_parser.py 
        --reasoning-parser super_v3 
        --enable-auto-tool-choice 
        --tool-call-parser qwen3_coder
      • BF16 precision profile:

        --reasoning-parser-plugin 
        /mnt/serving/ngc/hub/models--nim--nvidia--nemotron-3-super-120b-a12b/snapshots/rl-030326-bf16/super_v3_reasoning_parser.py 
        --reasoning-parser super_v3 
        --enable-auto-tool-choice
        --tool-call-parser qwen3_coder
    • Nemotron Super 49B (llama-3.3-nemotron-super-49b)

      Due to runtime compatibility differences, use the JSON fallback configuration below for all precision profiles (NVFP4, FP8, BF16):

      --enable-auto-tool-choice --tool-call-parser llama3_json
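
The values listed above differ only by model and, for the 120B model, by the snapshot directory of the precision profile. As a convenience, the following minimal Python sketch assembles the value to paste into the Value field. The function name, dictionary names, and the lowercase precision keys are hypothetical helpers introduced for illustration and are not part of Cloudera AI or NIM; the paths and flags are copied from the list above. The sketch also assumes that the value is entered as a single space-separated line.

```python
import shlex

# Snapshot root and directory names copied from the documented values above.
# The constant and function names are illustrative, not a Cloudera or NIM API.
SNAPSHOT_ROOT = (
    "/mnt/serving/ngc/hub/"
    "models--nim--nvidia--nemotron-3-super-120b-a12b/snapshots"
)
SNAPSHOT_BY_PRECISION = {
    "nvfp4": "rl-030326-nvfp4",
    "fp8": "rl-030326-fp8",
    "bf16": "rl-030326-bf16",
}


def nim_passthrough_args(model: str, precision: str) -> str:
    """Return the NIM_PASSTHROUGH_ARGS value for a model and precision profile."""
    key = precision.lower()
    if key not in SNAPSHOT_BY_PRECISION:
        raise ValueError(f"Unknown precision profile: {precision!r}")

    if model == "nemotron-3-super-120b-a12b":
        # The reasoning parser plugin lives in a precision-specific directory.
        plugin = (
            f"{SNAPSHOT_ROOT}/{SNAPSHOT_BY_PRECISION[key]}"
            "/super_v3_reasoning_parser.py"
        )
        parts = [
            "--reasoning-parser-plugin", plugin,
            "--reasoning-parser", "super_v3",
            "--enable-auto-tool-choice",
            "--tool-call-parser", "qwen3_coder",
        ]
    elif model == "llama-3.3-nemotron-super-49b":
        # The same JSON fallback configuration applies to every precision profile.
        parts = ["--enable-auto-tool-choice", "--tool-call-parser", "llama3_json"]
    else:
        raise ValueError(f"Unsupported model: {model!r}")

    # Assumption: the value is supplied as one space-separated line.
    return " ".join(parts)


if __name__ == "__main__":
    value = nim_passthrough_args("nemotron-3-super-120b-a12b", "FP8")
    print(value)
    # Splitting the value shows the individual arguments it contains.
    print(shlex.split(value))
```

Printing the value before pasting it makes it easier to confirm that the plugin path matches the precision profile selected for the endpoint.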