Using Heuristic-based metrics
Heuristic-based metrics score generated text with deterministic functions such as ROUGE, Flesch-Kincaid, and BLEU. Below is a simple example of how MLflow LLM evaluation works with these metrics.
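Each heuristic metric is also exposed as a builder in the mlflow.metrics module, so you can request specific scores instead of relying only on the defaults for a model type. The following is a minimal sketch, not part of the example below; it assumes MLflow 2.8 or later, where mlflow.metrics.rouge1, mlflow.metrics.flesch_kincaid_grade_level, and static-dataset evaluation through the predictions argument are available, and the static_data DataFrame and its column names are illustrative.

import mlflow
import pandas as pd

# Illustrative static dataset: "outputs" holds answers that were generated earlier.
static_data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?"],
        "outputs": ["MLflow is an open-source platform for managing the ML lifecycle."],
        "ground_truth": ["MLflow is an open-source platform for managing the end-to-end ML lifecycle."],
    }
)

with mlflow.start_run():
    # Score the existing outputs directly; no model is wrapped or called here.
    results = mlflow.evaluate(
        data=static_data,
        predictions="outputs",
        targets="ground_truth",
        extra_metrics=[
            mlflow.metrics.rouge1(),                      # n-gram overlap with the ground truth
            mlflow.metrics.flesch_kincaid_grade_level(),  # readability of the generated text
        ],
    )
    print(results.metrics)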
If you are prompted for other missing dependencies, install them. For information about the download location, see the MLflow documentation.
The following example builds a simple question-answering model by wrapping openai/gpt-4 with a custom prompt. Set your OpenAI API key in the shell first:
export OPENAI_API_KEY='your-api-key-here'

Then run the Python code:

import json
import os

import mlflow
import openai
import pandas as pd

# Evaluation dataset with the questions and the reference answers.
eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) "
            "lifecycle. It was developed by Databricks, a company that specializes in big data and "
            "machine learning solutions. MLflow is designed to address the challenges that data "
            "scientists and machine learning engineers face when developing, training, and deploying "
            "machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data "
            "processing and analytics. It was developed in response to limitations of the Hadoop "
            "MapReduce computing model, offering improvements in speed and ease of use. Spark "
            "provides libraries for various tasks such as data ingestion, processing, and analysis "
            "through its components like Spark SQL for structured data, Spark Streaming for "
            "real-time data processing, and MLlib for machine learning tasks.",
        ],
    }
)

with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"

    # Wrap "gpt-4" as an MLflow model.
    logged_model_info = mlflow.openai.log_model(
        model="gpt-4",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )

    # Use predefined question-answering metrics to evaluate our model.
    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )
    metrics_json = json.dumps(results.metrics, indent=4)
    print(f"See aggregated evaluation results below: \n{metrics_json}")

    # Evaluation result for each data record is available in `results.tables`.
    eval_table = results.tables["eval_results_table"]
    print(f"See evaluation table below: \n{eval_table}")
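Because the default evaluator also logs the per-row results as a run artifact, you can reload the table later without keeping the results object in memory. The following is a minimal sketch; it assumes MLflow 2.3 or later, where mlflow.load_table is available, and that the table was logged under the default artifact name eval_results_table.json.

import mlflow

# Reload the per-row evaluation table that was logged to the run shown above.
# `run` is the run object from the previous example.
eval_table = mlflow.load_table(
    "eval_results_table.json",
    run_ids=[run.info.run_id],
)
print(eval_table.head())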
The output is similar to the following:
See aggregated evaluation results below: 
{
    "flesch_kincaid_grade_level/v1/mean": 14.3,
    "flesch_kincaid_grade_level/v1/variance": 0.010000000000000106,
    "flesch_kincaid_grade_level/v1/p90": 14.38,
    "ari_grade_level/v1/mean": 17.95,
    "ari_grade_level/v1/variance": 0.5625,
    "ari_grade_level/v1/p90": 18.55,
    "exact_match/v1": 0.0
}
Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 502.97it/s]
See evaluation table below: 
            inputs                                       ground_truth  \
0  What is MLflow?  MLflow is an open-source platform for managing...
1   What is Spark?  Apache Spark is an open-source, distributed co...

                                             outputs  token_count  \
0  MLflow is an open-source platform developed by...           42
1  Spark is an open-source distributed general-pu...           31

   flesch_kincaid_grade_level/v1/score  ari_grade_level/v1/score
0                                 12.1                      17.2
1                                 13.7                      18.5
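The aggregated scores in results.metrics are plain Python numbers keyed by metric name and version, so you can read them programmatically, for example to gate a pipeline on readability. The following is a minimal sketch; the threshold of 15 is an arbitrary illustration, and results is the object returned by mlflow.evaluate in the example above.

# Read a single aggregate score from the evaluation results.
mean_grade_level = results.metrics["flesch_kincaid_grade_level/v1/mean"]
print(f"Mean Flesch-Kincaid grade level: {mean_grade_level}")

# Illustrative gate: warn when answers read above an arbitrary grade-level threshold.
if mean_grade_level > 15:
    print("Generated answers may be harder to read than intended.")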