Hosting an LLM as a Cloudera AI Workbench model

Cloudera AI Workbench models give you the flexibility to host, expose, and monitor a variety of AI and machine learning models and functions.

Consider the following example of how to use an open-source model and the Hugging Face Transformers library to expose and perform Large Language Model (LLM) inference.

Prerequisites for hosting an LLM model

Before hosting an LLM solution, review which LLM you want to deploy and which GPU model is available within your Cloudera AI Workbench.

If the LLM you want to deploy is too large to fit within the GPU memory of the available GPU model, consider a smaller LLM, change the GPU model available for this Cloudera AI Workbench, or use quantization techniques in your launch_model deployment script, as sketched below.
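
If you choose quantization, the following is a minimal sketch, assuming the bitsandbytes package is added to requirements.txt and a CUDA-capable GPU is available. It loads the example model used later in launch_model.py in 4-bit precision through the Transformers BitsAndBytesConfig integration; the exact configuration options depend on the model and library versions you use.
import torch
from transformers import BitsAndBytesConfig, pipeline

# Assumption: bitsandbytes is installed (add it to requirements.txt) and a CUDA GPU is available
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_id = "meta-llama/Llama-3.2-3B-Instruct"
model_inference = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"quantization_config": quantization_config},
    device_map="auto",
    return_full_text=False
)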

Requirements for hosting an LLM model

Consider the following file requirements:

requirements.txt

This file lists all the pip dependencies required for loading the LLM and performing inference.
transformers==4.44.0
torch==2.3.1
accelerate==0.33.0

cdsw-build.sh

This file contains the prerequisite steps that are executed during the model build. In this case, it is used primarily to install dependencies.
pip install --no-cache-dir -r requirements.txt

launch_model.py

This file contains the Python script loaded during the deployment of the Cloudera AI Workbench Model. It also contains the function executed whenever the Model Endpoint is called; this function is specified in the Cloudera AI Workbench Model configuration.
import cml.models_v1 as models
    
import torch
from transformers import pipeline

# The LLM from Hugging Face of your choice
# Note that the parameters for the pipeline instantiation can differ depending on the LLM chosen
model_id = "meta-llama/Llama-3.2-3B-Instruct"
model_inference = pipeline(
    "text-generation", 
    model=model_id, 
    torch_dtype=torch.bfloat16, 
    device_map="auto",
    return_full_text=False
)

@models.cml_model
def api_wrapper(args):
  # Pick up args from the Cloudera AI Workbench API. This can be modified to fit the needs of the clients calling the API.
  prompt = args["prompt"]

  # Pick up some additional args for inference; see the Transformers documentation for additional inference parameters that can be passed to a pipeline
  try:
    max_length = int(args["max_length"])
  except (ValueError, KeyError):
    max_length = 512

  try:
    temperature = float(args["temperature"])
  except (ValueError, KeyError):
    temperature = 0.7
  
  # Note: by default, Hugging Face pipeline inference includes the prompt text in the generated output;
  # return_full_text=False in the pipeline instantiation above excludes it
  # do_sample=True is required for the temperature setting to take effect
  output = model_inference(prompt, max_length=max_length, do_sample=True, temperature=temperature)
  return output[0]["generated_text"]
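
Once deployed, the Model Endpoint is called with a JSON payload whose request object maps to the args dictionary used by api_wrapper. The sketch below is illustrative only: copy the actual endpoint URL, access key, and sample payload from the model's overview page in the Cloudera AI Workbench, and note that your deployment may also require an API key in the request headers.
import requests

# Placeholder values: take the real endpoint URL and access key from the model's
# overview page in the Cloudera AI Workbench
MODEL_ENDPOINT = "https://modelservice.<your-workbench-domain>/model"
ACCESS_KEY = "<your-model-access-key>"

payload = {
    "accessKey": ACCESS_KEY,
    "request": {
        "prompt": "Summarize what a Cloudera AI Workbench model is.",
        "max_length": "256",
        "temperature": "0.7"
    }
}

response = requests.post(MODEL_ENDPOINT, json=payload)
print(response.json())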