Using LLM-as-a-Judge metrics

LLM-as-a-Judge is a type of metric that uses an LLM to score the quality of model outputs. It provides human-like evaluation for complex language tasks while remaining more scalable and cost-effective than human evaluation.

MLflow provides several built-in LLM-as-a-Judge metrics and also lets you define your own judge metrics with custom configurations and prompts.
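For example, a custom judge metric can be built with mlflow.metrics.genai.make_genai_metric(), which takes a metric definition, a grading prompt, and optional few-shot examples. A minimal sketch (the metric name, definition, grading prompt, and example below are illustrative):
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

# Illustrative custom metric; the name, definition, and grading prompt are examples only.
professionalism = make_genai_metric(
    name="professionalism",
    definition="Professionalism measures how formal and respectful the tone of the output is.",
    grading_prompt=(
        "Score from 1 to 5: 1 means casual or unprofessional language, "
        "5 means consistently formal and respectful language."
    ),
    examples=[
        EvaluationExample(
            input="What is MLflow?",
            output="MLflow is this super cool tool, you should totally try it!",
            score=2,
            justification="The tone is casual and uses informal expressions.",
        )
    ],
    model="openai:/gpt-4o",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)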

Built-in LLM-as-a-Judge metrics

To use built-in LLM-as-a-Judge metrics in MLflow, pass a list of metric definitions to the extra_metrics argument of the mlflow.evaluate() function.

The following example creates the built-in answer correctness metric and calls it directly to test the metric definition; a sketch that passes it to mlflow.evaluate() alongside the heuristic latency metric follows the output below:
import mlflow
import os

# The judge model needs API credentials; here the OpenAI API key is set via an environment variable.
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"

# Create the built-in answer correctness metric, using GPT-4o as the judge model
answer_correctness = mlflow.metrics.genai.answer_correctness(model="openai:/gpt-4o")

# Test the metric definition
answer_correctness(
    inputs="What is MLflow?",
    predictions="MLflow is an innovative full self-driving airship.",
    targets="MLflow is an open-source platform for managing the end-to-end ML lifecycle.",
)
The output would be something similar to the following:
MetricValue(scores=[1],
justifications=['The output is completely incorrect as it describes MLflow as a "full self-driving airship," which is entirely different from the provided target that states MLflow is an open-source platform for managing the end-to-end ML lifecycle. There is no semantic similarity or factual correctness in the output compared to the target.'],
aggregate_results={'mean': 1.0, 'variance': 0.0, 'p90': 1.0})
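Once the metric definition behaves as expected, pass it to the extra_metrics argument of mlflow.evaluate(). A minimal sketch, assuming a hypothetical qa_model prediction function and a small evaluation DataFrame (the heuristic latency metric is added alongside answer correctness):
import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?"],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end ML lifecycle."
        ],
    }
)

def qa_model(inputs: pd.DataFrame) -> list[str]:
    # Hypothetical prediction function; replace with your own model or a logged model URI.
    return ["MLflow is an innovative full self-driving airship."] * len(inputs)

with mlflow.start_run():
    results = mlflow.evaluate(
        model=qa_model,
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[answer_correctness, mlflow.metrics.latency()],
    )

print(results.metrics)  # aggregated metric values
print(results.tables["eval_results_table"])  # per-row scores and justifications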
The following built-in LLM-as-a-Judge metrics are available; see the full documentation of each metric for details:
  • answer_similarity(): Evaluates how similar the model’s generated output is to the reference (ground truth) data.
  • answer_correctness(): Evaluates how factually correct the model’s generated output is with respect to the information in the ground truth data.
  • answer_relevance(): Evaluates how relevant the model’s generated output is to the input (the context is ignored).
  • relevance(): Evaluates how relevant the model’s generated output is with respect to both the input and the context.
  • faithfulness(): Evaluates how faithful the model’s generated output is to the provided context (see the sketch after this list).
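The relevance() and faithfulness() metrics grade the output against a context rather than against ground-truth targets, so they require a context value. A minimal sketch of a direct test call, mirroring the answer_correctness call above and assuming the metric accepts a context keyword when invoked directly (the strings are illustrative):
faithfulness = mlflow.metrics.genai.faithfulness(model="openai:/gpt-4o")

# Test the metric definition against a context passage the answer should be grounded in
faithfulness(
    inputs="What is MLflow?",
    predictions="MLflow is an open-source platform for managing the ML lifecycle.",
    context=(
        "MLflow is an open-source platform, purpose-built to assist machine learning "
        "practitioners and teams in handling the complexities of the machine learning process."
    ),
)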
For more information about MLflow evaluation, see the MLflow documentation.