Experiments

Using Heuristic-based metrics

Heuristic-based metrics evaluate text or data with well-known measures such as ROUGE, Flesch-Kincaid grade level, and BLEU. Below is a simple example of how MLflow LLM evaluation works.

You need to install the following dependencies before running the sample code:
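
At a minimum, the example needs the packages it imports. The built-in readability and token-count metrics also rely on additional libraries; exactly which ones depends on your MLflow version, so treat the following command as a starting point rather than an exhaustive list:

pip install mlflow openai pandas textstat tiktoken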

If you are prompted for other missing dependencies, install them as well. For information about download locations, see the MLflow documentation.

The following example builds a simple question-answering model by wrapping openai/gpt-4 with a custom prompt. Before you run it, set your OpenAI API key as an environment variable:

export OPENAI_API_KEY='your-api-key-here'

Then run the following Python code:

import json

import mlflow
import openai
import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) "
            "lifecycle. It was developed by Databricks, a company that specializes in big data and "
            "machine learning solutions. MLflow is designed to address the challenges that data "
            "scientists and machine learning engineers face when developing, training, and deploying "
            "machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data "
            "processing and analytics. It was developed in response to limitations of the Hadoop "
            "MapReduce computing model, offering improvements in speed and ease of use. Spark "
            "provides libraries for various tasks such as data ingestion, processing, and analysis "
            "through its components like Spark SQL for structured data, Spark Streaming for "
            "real-time data processing, and MLlib for machine learning tasks",
        ],
    }
)

with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"
    # Wrap "gpt-4" as an MLflow model.
    logged_model_info = mlflow.openai.log_model(
        model="gpt-4",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )

    # Use predefined question-answering metrics to evaluate our model.
    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )
    metrics_json = json.dumps(results.metrics, indent=4)
    print(f"See aggregated evaluation results below: \n{metrics_json}")

    # Evaluation result for each data record is available in `results.tables`.
    eval_table = results.tables["eval_results_table"]
    print(f"See evaluation table below: \n{eval_table}")
The output is similar to the following:
See aggregated evaluation results below: 
{
    "flesch_kincaid_grade_level/v1/mean": 14.3,
    "flesch_kincaid_grade_level/v1/variance": 0.010000000000000106,
    "flesch_kincaid_grade_level/v1/p90": 14.38,
    "ari_grade_level/v1/mean": 17.95,
    "ari_grade_level/v1/variance": 0.5625,
    "ari_grade_level/v1/p90": 18.55,
    "exact_match/v1": 0.0
}


See evaluation table below: 
            inputs                                       ground_truth  \
0  What is MLflow?  MLflow is an open-source platform for managing...   
1   What is Spark?  Apache Spark is an open-source, distributed co...   

                                             outputs  token_count  \
0  MLflow is an open-source platform developed by...           42   
1  Spark is an open-source distributed general-pu...           31   

   flesch_kincaid_grade_level/v1/score  ari_grade_level/v1/score  
0                                 12.1                      17.2  
1                                 13.7                      18.5  
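
Because mlflow.evaluate logs the per-record table to the run as an artifact (eval_results_table.json by default), you can also reload it after the run completes. A minimal sketch, assuming you still hold the run object from the example above:

import mlflow

# Reload the per-record evaluation results from the logged artifact.
eval_table = mlflow.load_table(
    "eval_results_table.json", run_ids=[run.info.run_id]
)
print(eval_table)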
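
You can also layer additional heuristic metrics, such as the ROUGE family, on top of the defaults. The following sketch assumes an MLflow version that provides mlflow.metrics.rougeL (which in turn needs the rouge_score package); check the API reference for your version:

# Re-run the evaluation with ROUGE-L added to the default
# question-answering metrics.
results = mlflow.evaluate(
    logged_model_info.model_uri,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
    extra_metrics=[mlflow.metrics.rougeL()],
)
print(results.metrics)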
