Hosting an LLM as a Cloudera AI Workbench model
Cloudera AI Workbench models give you the flexibility to host, expose, and monitor a variety of AI and machine learning models and functions.
The following example shows how to use an open-source model and the Hugging Face Transformers library to expose and perform Large Language Model (LLM) inference.
Prerequisites for hosting an LLM model
Before hosting an LLM solution, review which LLM you want to deploy and which GPU model is available within your Cloudera AI Workbench.
If the LLM you want to deploy is too large to fit within the memory of the available GPU model, consider a smaller LLM, changing the GPU model available to this Cloudera AI Workbench, or using quantization techniques in your launch_model.py deployment script.
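As an illustration, the pipeline instantiation used later in launch_model.py could be replaced with a quantized model load. The following is a minimal sketch, assuming the bitsandbytes package is added to requirements.txt and that 4-bit quantization is appropriate for the chosen LLM; adjust the configuration to your model and GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

model_id = "meta-llama/Llama-3.2-3B-Instruct"

# Load the model weights in 4-bit precision to reduce GPU memory usage
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model_inference = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
)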
Requirements for hosting an LLM model
Consider the following file requirements:
requirements.txt
transformers==4.44.0
torch==2.3.1
accelerate==0.33.0
cdsw-build.sh
pip install --no-cache-dir -r requirements.txt
launch_model.py
import cml.models_v1 as models
import torch
from transformers import pipeline
# The LLM of your choice from Hugging Face
# Note that the parameters for the pipeline instantiation can differ depending on the LLM chosen
model_id = "meta-llama/Llama-3.2-3B-Instruct"

model_inference = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    return_full_text=False
)

@models.cml_model
def api_wrapper(args):
    # Pick up args from the Cloudera AI Workbench API; this can be modified
    # to match the needs of the clients calling the Cloudera AI Workbench API.
    prompt = args["prompt"]

    # Pick up some additional args for inference; see the transformers documentation
    # for additional inference parameters that can be passed to a pipeline.
    try:
        max_length = int(args["max_length"])
    except (ValueError, KeyError):
        max_length = 512
    try:
        temperature = float(args["temperature"])
    except (ValueError, KeyError):
        temperature = 0.7

    # Note: by default, Hugging Face pipeline inference includes the prompt text in the
    # generated output; return_full_text=False above returns only the newly generated text.
    # do_sample=True is required for the temperature setting to take effect.
    output = model_inference(
        prompt,
        max_length=max_length,
        temperature=temperature,
        do_sample=True,
    )
    return output[0]["generated_text"]
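After the model is deployed, clients can call it over REST. The following is a minimal sketch of such a call; the endpoint URL, access key, and API key are placeholders, and the exact request format should be taken from the sample shown on the model's page in Cloudera AI Workbench after deployment.
import requests

# Placeholder values: copy the real endpoint, access key, and API key from the
# deployed model's page in Cloudera AI Workbench.
MODEL_ENDPOINT = "https://modelservice.<your-workbench-domain>/model"
ACCESS_KEY = "<model-access-key>"
API_KEY = "<workbench-api-key>"

payload = {
    "accessKey": ACCESS_KEY,
    # The "request" object is passed to api_wrapper() as the args dictionary.
    "request": {
        "prompt": "Write a short poem about data pipelines.",
        "max_length": 256,
        "temperature": 0.7,
    },
}

response = requests.post(
    MODEL_ENDPOINT,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
)
print(response.json())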