Metrics reference glossary
The metrics reference tables define the automatic metrics and LLM judge evaluators used to measure performance, cost, and qualitative accuracy in agentic workflows.
| Metric category | Metric name | Detailed definition |
|---|---|---|
| Automatic | Latency | Total time in seconds from workflow start to finish. Useful for identifying performance bottlenecks. |
| Automatic | Token Usage | Total prompt tokens and completion tokens used across all execution spans, reported side by side for cost analysis. |
| Automatic | Error Rate | The percentage of spans that returned an error status rather than Success. |
| Automatic | Loop Detection | Identification of repeated or looping behavior, used to catch malfunctioning agents during testing. |
| Automatic | Task Completion | Binary status (Success or Fail) based strictly on the execution status code of the final task. |
| Automatic | Tool Call Count | Sum total of all tool or sub-agent calls made during the workflow trace. |
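
The automatic metrics above can be derived directly from the spans of a single workflow trace. The following is a minimal sketch, not the product's implementation: the `Span` record, its field names, the `kind` values, and the `loop_threshold` parameter are all hypothetical, and loop detection is approximated here as a run of consecutive spans with the same name.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical span record; the field names are assumptions, not a real schema."""
    name: str
    kind: str                  # assumed values: "llm", "tool", "sub_agent"
    start: float               # seconds
    end: float                 # seconds
    status: str                # assumed values: "success" or "error"
    prompt_tokens: int = 0
    completion_tokens: int = 0


def automatic_metrics(spans: List[Span], loop_threshold: int = 3) -> dict:
    """Compute the automatic metrics for one workflow trace."""
    if not spans:
        raise ValueError("trace has no spans")
    ordered = sorted(spans, key=lambda s: s.start)

    # Latency: seconds from the earliest span start to the latest span end.
    latency = max(s.end for s in ordered) - ordered[0].start

    # Token Usage: prompt and completion tokens summed over all spans.
    prompt = sum(s.prompt_tokens for s in ordered)
    completion = sum(s.completion_tokens for s in ordered)

    # Error Rate: percentage of spans whose status is "error".
    errors = sum(1 for s in ordered if s.status == "error")
    error_rate = 100.0 * errors / len(ordered)

    # Tool Call Count: every tool or sub-agent span in the trace.
    tool_calls = sum(1 for s in ordered if s.kind in ("tool", "sub_agent"))

    # Loop Detection (assumed heuristic): longest run of consecutive
    # spans sharing a name, flagged once it reaches loop_threshold.
    longest_run = run = 1
    for prev, cur in zip(ordered, ordered[1:]):
        run = run + 1 if cur.name == prev.name else 1
        longest_run = max(longest_run, run)

    # Task Completion: status of the final task only; this sketch assumes
    # the final task is the span that starts last.
    completed = ordered[-1].status == "success"

    return {
        "latency_s": latency,
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "error_rate_pct": error_rate,
        "tool_call_count": tool_calls,
        "loop_detected": longest_run >= loop_threshold,
        "task_completion": "Success" if completed else "Fail",
    }
```

Because Task Completion looks only at the final task, a trace can report `Success` while still showing a non-zero Error Rate, for example when an early tool call fails and is retried.
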

| Metric category | Metric name | Detailed definition |
|---|---|---|
| LLM Judge | Faithfulness | Evaluation of whether the answer is factually grounded in the provided context, used to detect hallucinations. |
| LLM Judge | Reasoning Quality | Assessment of the agent's logical chain of thought and specialist routing efficiency. |
| LLM Judge | Toxicity | Safety check for harmful, offensive, hateful, or inappropriate language in the agent response. |
| LLM Judge | Manager Delegation | Measurement of the Manager agent's ability to select the correct specialist agent based on the query. |
| LLM Judge | Tool Calling Accuracy | Measurement of tool selection accuracy and the validity of the generated parameters. |
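
An LLM judge evaluator typically wraps a grading prompt around the agent output and parses a structured verdict from the judge model. The sketch below illustrates this for Faithfulness only. It is an assumption-laden example rather than the product's evaluator: the prompt wording, the 0/1 score scale, the JSON verdict format, and the `call_judge` callable (which stands in for whatever model client is used) are all hypothetical.

```python
import json
from typing import Callable

# Hypothetical grading prompt; the wording and the 0/1 scale are assumptions.
FAITHFULNESS_PROMPT = """You are an evaluator. Decide whether the ANSWER is fully supported by the CONTEXT.
Reply with JSON only: {{"score": <0 or 1>, "reason": "<one sentence>"}}

CONTEXT:
{context}

ANSWER:
{answer}
"""


def judge_faithfulness(answer: str, context: str,
                       call_judge: Callable[[str], str]) -> dict:
    """Ask a judge model whether `answer` is grounded in `context`.

    `call_judge` is a placeholder: any function that takes a prompt string
    and returns the judge model's raw text reply.
    """
    prompt = FAITHFULNESS_PROMPT.format(context=context, answer=answer)
    raw = call_judge(prompt)

    # Judge output is untrusted text, so parse defensively and return
    # score=None instead of raising when the reply is malformed.
    try:
        verdict = json.loads(raw)
        score = int(verdict["score"])
        reason = str(verdict.get("reason", ""))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return {"score": None, "reason": "unparseable judge output"}

    if score not in (0, 1):
        return {"score": None, "reason": "score out of range"}
    return {"score": score, "reason": reason}
```

Passing the model call in as a function keeps the evaluator testable with a stub, and the same pattern could be reused for the other LLM judge metrics by swapping the prompt template.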
