Evaluating LLMs with MLflow
Cloudera AI’s experiment tracking features allow you to use MLflow APIs for LLM evaluation. MLflow provides the mlflow.evaluate() API to help evaluate your LLMs. LLMs can generate text for various tasks, such as question answering, translation, and text summarization.
MLflow’s LLM evaluation functionality consists of three main components:
- A model to evaluate: it can be an MLflow pyfunc model, a URI pointing to a registered MLflow model, or any Python callable that represents your model, for example, a HuggingFace text summarization pipeline.
- Metrics: MLflow provides two types of LLM evaluation metrics, Heuristic-based metrics and LLM-as-a-Judge metrics. Both are discussed in more detail later in this topic.
- Evaluation data: the data your model is evaluated on; it can be a pandas DataFrame, a Python list, a NumPy array, or an mlflow.data.dataset.Dataset instance.
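Putting the three components together, the following is a minimal sketch of calling mlflow.evaluate() on a question-answering task. The data values and the my_model callable are placeholders, and some of the default metrics for this model type pull in additional packages such as evaluate and transformers.

```python
import mlflow
import pandas as pd

# Hypothetical evaluation data: questions paired with reference answers.
eval_data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?", "What is Apache Spark?"],
        "ground_truth": [
            "MLflow is an open source platform for managing the machine learning lifecycle.",
            "Apache Spark is a distributed engine for large-scale data processing.",
        ],
    }
)

# Placeholder callable standing in for the model under evaluation; it receives
# a DataFrame of inputs and must return one prediction per row.
def my_model(inputs):
    return ["This is a generated answer." for _ in range(len(inputs))]

with mlflow.start_run():
    results = mlflow.evaluate(
        model=my_model,                   # callable, pyfunc model, or registered model URI
        data=eval_data,                   # evaluation data
        targets="ground_truth",           # column holding the reference answers
        model_type="question-answering",  # selects the default metrics for this task
    )

print(results.metrics)                       # aggregated metric values
print(results.tables["eval_results_table"])  # per-row results
```

The returned result object exposes aggregate scores through results.metrics and per-row scores through results.tables.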
LLM Evaluation Metrics
- Heuristic-based metrics: These metrics calculate a score for each data record (each row of a pandas or Spark DataFrame) based on certain functions, such as ROUGE (rougeL()), Flesch-Kincaid (flesch_kincaid_grade_level()), or Bilingual Evaluation Understudy (BLEU) (bleu()). These metrics are similar to traditional continuous-value metrics. For the list of built-in heuristic metrics and how to define a custom metric with your own function definition, see the Heuristic-based Metrics section.
For more information on using heuristic-based metrics and an example of how to use MLflow to evaluate an LLM with heuristic-based metrics, see Using Heuristic-based metrics.
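As an illustration, the following sketch evaluates a static set of generated summaries with built-in heuristic metrics and one custom metric defined with make_metric(). The data values and the word_count metric are hypothetical, and metrics such as ROUGE and BLEU require additional packages (for example evaluate, nltk, and rouge_score).

```python
import mlflow
import pandas as pd
from mlflow.metrics import MetricValue, bleu, flesch_kincaid_grade_level, make_metric, rougeL

# Hypothetical, already-generated summaries evaluated as a static dataset
# (no model object is needed when a predictions column is supplied).
eval_data = pd.DataFrame(
    {
        "inputs": ["<article one text>", "<article two text>"],
        "ground_truth": ["Reference summary one.", "Reference summary two."],
        "outputs": ["Generated summary one.", "Generated summary two."],
    }
)

# Custom heuristic metric: score each row by the word count of its prediction.
def word_count_eval(predictions, targets, metrics):
    scores = [len(str(p).split()) for p in predictions]
    return MetricValue(
        scores=scores,
        aggregate_results={"mean": sum(scores) / len(scores)},
    )

word_count = make_metric(eval_fn=word_count_eval, greater_is_better=False, name="word_count")

with mlflow.start_run():
    results = mlflow.evaluate(
        data=eval_data,
        predictions="outputs",       # column holding the model's generated text
        targets="ground_truth",      # column holding the reference summaries
        extra_metrics=[rougeL(), flesch_kincaid_grade_level(), bleu(), word_count],
    )

print(results.metrics)                       # aggregate values per metric
print(results.tables["eval_results_table"])  # per-row scores
```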
- LLM-as-a-Judge metrics: LLM-as-a-Judge is a new type of metric that uses LLMs to score the quality of model outputs. It overcomes the limitations of heuristic-based metrics, which often miss nuances such as context and semantic accuracy. LLM-as-a-Judge metrics provide a more human-like evaluation for complex language tasks while being more scalable and cost-effective than human evaluation. MLflow provides various built-in LLM-as-a-Judge metrics and supports creating custom metrics with your own prompt, grading criteria, and reference examples. See the LLM-as-a-Judge Metrics section for more details.
For more information on using LLM-as-a-Judge metrics and an example of how to use MLflow to evaluate an LLM using LLM-as-a-Judge metrics, see Using LLM-as-a-Judge Metrics.
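As an illustration, the following sketch scores static question-answering outputs with the built-in answer_similarity() judge metric and a custom judge metric created with make_genai_metric(). The data values, the conciseness metric, and the judge model URI are assumptions; using an OpenAI judge such as openai:/gpt-4 requires the corresponding API credentials (for example, OPENAI_API_KEY) to be configured.

```python
import mlflow
import pandas as pd
from mlflow.metrics.genai import EvaluationExample, answer_similarity, make_genai_metric

# Hypothetical question-answering outputs to be scored by a judge LLM.
eval_data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?"],
        "ground_truth": [
            "MLflow is an open source platform for managing the machine learning lifecycle."
        ],
        "outputs": ["MLflow is an open source MLOps platform."],
    }
)

# Built-in judge metric; the judge model URI and its credentials (for example,
# OPENAI_API_KEY) are assumptions of this sketch.
similarity = answer_similarity(model="openai:/gpt-4")

# Custom judge metric defined by a prompt, grading criteria, and a reference example.
example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open source platform for the machine learning lifecycle.",
    score=5,
    justification="The answer is brief and directly addresses the question.",
)
conciseness = make_genai_metric(
    name="conciseness",
    definition="Measures how brief and to the point the answer is.",
    grading_prompt=(
        "Assign a score from 1 to 5, where 5 means the answer is as short as "
        "possible while remaining correct and complete."
    ),
    examples=[example],
    model="openai:/gpt-4",
    greater_is_better=True,
)

with mlflow.start_run():
    results = mlflow.evaluate(
        data=eval_data,
        predictions="outputs",       # column holding the model's generated answers
        targets="ground_truth",      # reference answers used as grading context
        extra_metrics=[similarity, conciseness],
    )

print(results.tables["eval_results_table"])  # per-row scores and judge justifications
```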