
Using LLM-as-a-Judge metrics

LLM-as-a-Judge is a new type of metric that uses LLMs to score the quality of model outputs, providing a more human-like evaluation for complex language tasks while being more scalable and cost-effective than human evaluation.

MLflow ships with several built-in LLM-as-a-Judge metrics, and also lets you create your own LLM-as-a-Judge metrics with custom configurations and prompts.

Built-in LLM-as-a-Judge metrics

To use built-in LLM-as-a-Judge metrics in MLflow, pass a list of metric definitions to the extra_metrics argument of the mlflow.evaluate() function.

The following example uses the built-in answer correctness metric for evaluation, together with the heuristic latency metric. First, define the answer correctness metric and test it directly:
import mlflow
import os

os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"

# Define the answer correctness metric, using GPT-4o as the judge model
answer_correctness = mlflow.metrics.genai.answer_correctness(model="openai:/gpt-4o")

# Test the metric definition
answer_correctness(
    inputs="What is MLflow?",
    predictions="MLflow is an innovative full self-driving airship.",
    targets="MLflow is an open-source platform for managing the end-to-end ML lifecycle.",
)
The output will look similar to the following:
MetricValue(scores=[1],
justifications=['The output is completely incorrect as it describes MLflow as a "full self-driving airship," which is entirely different from the provided target that states MLflow is an open-source platform for managing the end-to-end ML lifecycle. There is no semantic similarity or factual correctness in the output compared to the target.'],
aggregate_results={'mean': 1.0, 'variance': 0.0, 'p90': 1.0})
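Once the metric definition behaves as expected, it can be passed to mlflow.evaluate() through the extra_metrics argument, together with the heuristic latency metric mentioned above. The following is a minimal sketch rather than the official example: the evaluation DataFrame, its column names, and the placeholder dummy_model function are illustrative assumptions:
import mlflow
import pandas as pd

# Illustrative evaluation data; the "inputs" column holds the questions.
eval_data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?", "What is Apache Spark?"],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end ML lifecycle.",
            "Apache Spark is an open-source, distributed computing engine for large-scale data processing.",
        ],
    }
)


def dummy_model(inputs: pd.DataFrame) -> pd.Series:
    # Placeholder for a real model or pipeline; returns one answer per input row.
    return pd.Series(["MLflow is an innovative full self-driving airship."] * len(inputs))


with mlflow.start_run():
    results = mlflow.evaluate(
        dummy_model,
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[
            answer_correctness,  # the LLM-as-a-Judge metric defined above
            mlflow.metrics.latency(),  # heuristic latency metric
        ],
    )
    print(results.metrics)

Per-row scores and justifications are logged to the active MLflow run along with the aggregate metric values.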
Here is the list of built-in LLM-as-a-Judge metrics; see each metric's API documentation for full details:
  • answer_similarity(): Evaluate how similar the model's generated output is to a set of reference (ground truth) data.
  • answer_correctness(): Evaluate how factually correct the model's generated output is based on the information in the ground truth data.
  • answer_relevance(): Evaluate how relevant the model's generated output is to the input (context is ignored).
  • relevance(): Evaluate how relevant the model's generated output is with respect to both the input and the context.
  • faithfulness(): Evaluate how faithful the model's generated output is to the provided context (see the sketch after this list for supplying context).
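The relevance() and faithfulness() metrics require a context for grading. When evaluating with mlflow.evaluate(), one way to supply it is to map the metric's expected columns to your data columns through the default evaluator's col_mapping option. The sketch below is illustrative and assumes a static evaluation DataFrame; the column names and example data are not from the official documentation:
import mlflow
import pandas as pd

# Illustrative static evaluation data with a retrieved-context column.
eval_data = pd.DataFrame(
    {
        "questions": ["What is MLflow?"],
        "retrieved_context": [
            "MLflow is an open-source platform for managing the end-to-end ML lifecycle."
        ],
        "outputs": ["MLflow is an open-source platform for the ML lifecycle."],
    }
)

faithfulness = mlflow.metrics.genai.faithfulness(model="openai:/gpt-4o")

with mlflow.start_run():
    results = mlflow.evaluate(
        data=eval_data,
        predictions="outputs",
        extra_metrics=[faithfulness],
        evaluators="default",
        evaluator_config={
            # Map the metric's expected "inputs" and "context" arguments to
            # the corresponding columns in the evaluation data.
            "col_mapping": {
                "inputs": "questions",
                "context": "retrieved_context",
            }
        },
    )
    print(results.metrics)

Because the outputs column already contains the predictions, no model argument is needed here; mlflow.evaluate() scores the static dataset directly.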
For more information about MLflow Evaluation, see the MLflow documentation.
