Using LLM-as-a-Judge metrics
LLM-as-a-Judge is a new type of metric that uses LLMs to score the quality of model outputs, providing a more human-like evaluation for complex language tasks while being more scalable and cost-effective than human evaluation.
MLflow supports several built-in LLM-as-a-Judge metrics and also lets you create your own LLM-as-a-Judge metrics with custom configurations and prompts.
Built-in LLM-as-a-Judge metrics
To use built-in LLM-as-a-Judge metrics in MLflow, pass a list of metric definitions to the extra_metrics argument of the mlflow.evaluate() function.
The following example uses the built-in answer correctness metric, together with the heuristic latency metric. First, define the answer correctness metric and test it directly on a single example:
```python
import mlflow
import os

os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"

answer_correctness = mlflow.metrics.genai.answer_correctness(model="openai:/gpt-4o")

# Test the metric definition
answer_correctness(
    inputs="What is MLflow?",
    predictions="MLflow is an innovative full self-driving airship.",
    targets="MLflow is an open-source platform for managing the end-to-end ML lifecycle.",
)
```
The output will look similar to the following:
```
MetricValue(scores=[1], justifications=['The output is completely incorrect as it describes MLflow as a "full self-driving airship," which is entirely different from the provided target that states MLflow is an open-source platform for managing the end-to-end ML lifecycle. There is no semantic similarity or factual correctness in the output compared to the target.'], aggregate_results={'mean': 1.0, 'variance': 0.0, 'p90': 1.0})
```
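To use the metric in a full evaluation run, pass it to mlflow.evaluate() through the extra_metrics argument. Below is a minimal sketch of that flow; the qa_model placeholder function, the example data, and the column names are illustrative assumptions rather than part of the MLflow API:

```python
import mlflow
import pandas as pd

# Illustrative evaluation data; the column names "inputs" and "ground_truth"
# are assumptions for this sketch.
eval_data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?", "What is Apache Spark?"],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end ML lifecycle.",
            "Apache Spark is an open-source, distributed computing system for big data processing.",
        ],
    }
)

answer_correctness = mlflow.metrics.genai.answer_correctness(model="openai:/gpt-4o")


def qa_model(inputs: pd.DataFrame) -> list[str]:
    # Placeholder model function; replace with a call to your own model or chain.
    return ["MLflow is an open-source ML lifecycle platform."] * len(inputs)


with mlflow.start_run():
    results = mlflow.evaluate(
        qa_model,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[
            answer_correctness,
            mlflow.metrics.latency(),  # heuristic metric that measures prediction time
        ],
    )
    print(results.metrics)
```

The per-row scores and justifications are recorded in the evaluation results table, which you can retrieve with results.tables["eval_results_table"].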
Here is the list of built-in LLM-as-a-Judge metrics. Click on the link to see the
full documentation for each metric:
- answer_similarity(): Evaluate how similar a model’s generated output is to a set of reference (ground truth) data.
- answer_correctness(): Evaluate how factually correct a model’s generated output is based on the information in the ground truth data.
- answer_relevance(): Evaluate how relevant the model-generated output is to the input (the context is ignored).
- relevance(): Evaluate how relevant the model-generated output is with respect to both the input and the context.
- faithfulness(): Evaluate how faithful the model-generated output is to the provided context (see the sketch after this list).
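The relevance and faithfulness metrics require a context column in addition to the inputs and predictions. The following is a minimal sketch of evaluating a static dataset with the faithfulness metric; the data and the column names ("questions", "retrieved_context", "outputs") are illustrative assumptions, and col_mapping in evaluator_config points the metric at them:

```python
import mlflow
import pandas as pd

# Illustrative RAG outputs evaluated as a static dataset.
eval_data = pd.DataFrame(
    {
        "questions": ["What is MLflow?"],
        "retrieved_context": [
            "MLflow is an open-source platform for managing the end-to-end ML lifecycle."
        ],
        "outputs": ["MLflow is an open-source platform for the ML lifecycle."],
    }
)

faithfulness = mlflow.metrics.genai.faithfulness(model="openai:/gpt-4o")

with mlflow.start_run():
    results = mlflow.evaluate(
        data=eval_data,
        predictions="outputs",
        extra_metrics=[faithfulness],
        evaluators="default",
        # Map the metric's expected "inputs" and "context" fields to our columns.
        evaluator_config={
            "col_mapping": {
                "inputs": "questions",
                "context": "retrieved_context",
            }
        },
    )
    print(results.metrics)
```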