Deep-dive analysis tools
The Evaluation dashboard supports granular troubleshooting at three levels of detail: surface-level scores, natural language explanations, and raw metadata for in-depth performance analysis.
Most metrics report a score from 0.00 to 1.00, where 1.00 represents perfect performance or success.
Clicking the caret (>) next to any metric reveals the following details behind the score:
- Explanation block – Natural language description of why a judge assigned a specific score. For example, if a Faithfulness check fails, the judge explains exactly how the output diverged from the context.
- Standardized status – Consistent PASS/FAIL labels to ensure clarity across different evaluators.
- Raw metadata – JSON-formatted data that includes the following:
  - Trace ID or task ID – Unique identifier for the specific execution step.
  - Input prompts – The exact text sent to the LLM that resulted in the scored output.
  - Detailed explanations – The judge's reasoning for the PASS or FAIL status.
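
As a rough illustration of how the raw metadata might be used outside the dashboard, the sketch below parses one JSON record and condenses it into a single triage line. The payload and every field name in it (`trace_id`, `metric`, `score`, `status`, `input_prompt`, `explanation`) are assumptions made for this example, not the dashboard's documented schema; adjust them to match the JSON you actually see.

```python
import json

# Hypothetical metadata record. The field names and values are illustrative
# assumptions, not the dashboard's documented schema.
RAW_METADATA = """
{
  "trace_id": "trace-001",
  "metric": "faithfulness",
  "score": 0.25,
  "status": "FAIL",
  "input_prompt": "Summarize the attached policy document.",
  "explanation": "The output states a 30-day limit that is absent from the context."
}
"""


def summarize_result(raw: str) -> str:
    """Condense one metadata record into a single line for quick triage."""
    record = json.loads(raw)
    score = float(record["score"])
    return (
        f"[{record['status']}] {record['metric']}={score:.2f} "
        f"(trace {record['trace_id']}): {record['explanation']}"
    )


if __name__ == "__main__":
    print(summarize_result(RAW_METADATA))
```

Keeping the trace ID in the summary line makes it easier to jump from a failed check back to the exact execution step that produced it.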
