Deep-Dive Analysis Tools

The dashboard supports granular troubleshooting that goes beyond surface-level scores.

Understanding the Score

Most metrics provide a score from 0.00 to 1.00, where 1.00 represents perfect performance or success.

Click the caret (>) next to any metric to expand it. The expanded view shows why the score was given:

Explanation Block: A natural language description of the judge's reasoning (e.g., if a "Faithfulness" check fails, the judge explains why the output did not match the context).
Standardized Status: Evaluation results use consistent PASS/FAIL labels, so outcomes read the same way across different evaluators.
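
If results are copied out of the dashboard for offline review, the score, status, and explanation can be triaged together. The following is a minimal sketch, not the dashboard's own API: the MetricResult class, its field names, and the sample values are all hypothetical, and it assumes every metric in the list uses the 0.00 to 1.00 scale.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    # All field names here are hypothetical, chosen for illustration.
    name: str
    score: float        # assumed range: 0.00 to 1.00
    status: str         # "PASS" or "FAIL"
    explanation: str    # the judge's natural language explanation


def failing_metrics(results: list[MetricResult]) -> list[MetricResult]:
    """Return the FAIL results, lowest score first."""
    for r in results:
        # Reject values that do not fit the assumed score scale and labels.
        if not 0.0 <= r.score <= 1.0:
            raise ValueError(f"{r.name}: score {r.score} is outside 0.00-1.00")
        if r.status not in ("PASS", "FAIL"):
            raise ValueError(f"{r.name}: unexpected status {r.status!r}")
    failed = [r for r in results if r.status == "FAIL"]
    return sorted(failed, key=lambda r: r.score)


# Made-up sample data for illustration only.
results = [
    MetricResult("Faithfulness", 0.40, "FAIL", "Output cites a fact absent from the context."),
    MetricResult("Relevance", 0.95, "PASS", "Output addresses the question."),
    MetricResult("Completeness", 0.10, "FAIL", "Output omits two requested items."),
]

for r in failing_metrics(results):
    print(f"{r.name}: {r.score:.2f} {r.status} - {r.explanation}")
```

Sorting failures by score puts the weakest results first, which is usually where reading the explanation pays off most.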

Expanding a metric also reveals raw metadata in JSON format. This includes:

Trace ID/Task ID: Unique identifiers for the specific execution step.
Input Prompts: The exact text sent to the LLM that resulted in the scored output.
Detailed Explanations: The judge's full reasoning for the PASS or FAIL status.
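
Because the metadata is plain JSON, it can be copied out and inspected with any JSON parser. The sketch below uses Python's standard json module. The key names (trace_id, task_id, input_prompt, status, explanation) and the sample payload are assumptions made for illustration; the actual keys in the dashboard's metadata may differ.

```python
import json

# Made-up sample payload; the key names are assumptions, not the dashboard's schema.
RAW_METADATA = """
{
  "trace_id": "trace-001",
  "task_id": "task-001",
  "input_prompt": "Summarize the attached context in one sentence.",
  "status": "FAIL",
  "explanation": "The summary states a date that does not appear in the context."
}
"""


def summarize_metadata(raw: str) -> dict:
    """Parse a metadata JSON string and keep the fields useful for debugging."""
    data = json.loads(raw)
    wanted = ("trace_id", "task_id", "input_prompt", "explanation")
    # .get() returns None for missing keys, so a partial payload does not raise.
    return {key: data.get(key) for key in wanted}


summary = summarize_metadata(RAW_METADATA)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Keeping the identifiers next to the prompt and explanation makes it easier to match a failed score back to the execution step that produced it.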