Deep-dive analysis tools

The Evaluation dashboard supports granular troubleshooting at three levels of detail: surface-level scores, natural language explanations, and raw metadata for in-depth performance analysis.

Most metrics report a score from 0.00 to 1.00, where 1.00 represents perfect performance or success.

Clicking the caret (>) next to any metric reveals the following details behind the score:

  • Explanation block – Natural language description of why a judge assigned a specific score. For example, if a Faithfulness check fails, the judge explains exactly why the output did not match the context.
  • Standardized status – Consistent PASS/FAIL labels to ensure clarity across different evaluators.
  • Raw metadata – JSON-formatted data that includes the following:
    • Trace ID or task ID – Unique identifier for the specific execution step.
    • Input prompts – The exact text sent to the LLM that resulted in the scored output.
    • Detailed explanations – The judge's reasoning for the PASS or FAIL status.
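
If you copy the raw metadata out of the dashboard, a short script can condense it into a one-line summary for triage. The following is a minimal sketch only: the field names (trace_id, metric, score, status, input_prompt, explanation) and the sample values are assumptions made for illustration, not the dashboard's documented schema, so adjust them to match the JSON you actually see.

```python
import json

# Hypothetical payload: these field names and values are illustrative
# assumptions, not the dashboard's documented schema.
RAW_METADATA = """
{
  "trace_id": "trace-0001",
  "metric": "Faithfulness",
  "score": 0.25,
  "status": "FAIL",
  "input_prompt": "Summarize the attached policy document.",
  "explanation": "The output cites a refund window that does not appear in the context."
}
"""


def summarize_metric(raw: str) -> str:
    """Parse one metadata record and return a one-line triage summary."""
    record = json.loads(raw)
    score = float(record["score"])
    # Most metrics are scored from 0.00 to 1.00; flag anything outside that range.
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score {score} is outside the 0.00 to 1.00 range")
    return (
        f"[{record['status']}] {record['metric']} = {score:.2f} "
        f"({record['trace_id']}): {record['explanation']}"
    )


summary = summarize_metric(RAW_METADATA)
print(summary)
```

Keeping the trace ID in the summary makes it easier to find the same execution step again in the dashboard.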