Metrics reference glossary

The metrics reference tables define the automatic metrics and LLM judge evaluators used to measure performance, cost, and qualitative accuracy in agentic workflows.

Table 1. Automatic metrics
Metric category | Metric name | Detailed definition
Automatic | Latency | Total time in seconds from workflow start to finish. Useful for identifying performance bottlenecks.
Automatic | Token Usage | Total prompt tokens and total completion tokens used across all execution spans, compared for cost analysis.
Automatic | Error Rate | Percentage of spans that returned an error status rather than Success.
Automatic | Loop Detection | Identification of repeated or looping behavior, used to catch malfunctioning agents during testing.
Automatic | Task Completion | Binary status (Success or Fail) based strictly on the execution status code of the final task.
Automatic | Tool Call Count | Total number of tool or sub-agent calls made during the workflow trace.
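
The automatic metrics above can all be derived from the spans of a single workflow trace. The following is a minimal sketch, not the product's implementation. The Span fields, the kind labels ("llm", "tool", "agent"), the rule that the final task is the span that starts last, and the loop rule (the same tool or sub-agent name called loop_threshold times in a row) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical span record; the field names are assumptions."""
    name: str
    kind: str                # assumed labels: "llm", "tool", or "agent"
    status: str              # "Success" or any error status
    start: float             # seconds since workflow start
    end: float               # seconds since workflow start
    prompt_tokens: int = 0
    completion_tokens: int = 0


def automatic_metrics(spans, loop_threshold=3):
    """Compute the Table 1 metrics for one trace (illustrative only)."""
    if not spans:
        raise ValueError("a trace needs at least one span")
    ordered = sorted(spans, key=lambda s: s.start)

    # Latency: wall-clock seconds from the earliest start to the latest end.
    latency = max(s.end for s in spans) - min(s.start for s in spans)

    # Token Usage: prompt and completion totals across all spans.
    prompt = sum(s.prompt_tokens for s in spans)
    completion = sum(s.completion_tokens for s in spans)

    # Error Rate: percentage of spans whose status is not Success.
    errors = sum(1 for s in spans if s.status != "Success")
    error_rate = 100.0 * errors / len(spans)

    # Tool Call Count: every tool or sub-agent span counts as one call.
    calls = [s for s in ordered if s.kind in ("tool", "agent")]

    # Loop Detection (assumed rule): longest run of identical consecutive calls.
    longest, run, previous = 0, 0, None
    for s in calls:
        run = run + 1 if s.name == previous else 1
        longest = max(longest, run)
        previous = s.name

    # Task Completion: status of the final task (assumed: last span to start).
    final_status = ordered[-1].status

    return {
        "latency_s": latency,
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "error_rate_pct": error_rate,
        "loop_detected": longest >= loop_threshold,
        "task_completion": "Success" if final_status == "Success" else "Fail",
        "tool_call_count": len(calls),
    }
```

Calling automatic_metrics with the list of spans from one trace returns a dictionary with one entry per metric. Note that Task Completion and Error Rate are independent: a trace can contain failed spans and still report Success if its final task succeeded.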

Table 2. Qualitative analysis
Metric category | Metric name | Detailed definition
LLM Judge | Faithfulness | Evaluation of whether the answer is factually grounded in the provided context, used to detect hallucinations.
LLM Judge | Reasoning Quality | Assessment of the agent's logical chain of thought and the efficiency of its specialist routing.
LLM Judge | Toxicity | Safety check for harmful, offensive, hateful, or inappropriate language in the agent response.
LLM Judge | Manager Delegation | Measurement of the Manager agent's ability to select the correct specialist agent based on the query.
LLM Judge | Tool Calling Accuracy | Measurement of tool selection accuracy and the validity of the generated parameters.
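
An LLM judge metric generally pairs a rubric with a model call and a parser for the verdict. The sketch below shows one possible shape under stated assumptions: the rubric wording is paraphrased from the table, the 1 to 5 score scale and the JSON reply format are invented for illustration, and call_llm is a hypothetical callable that takes a prompt string and returns the judge model's text. It does not describe how the evaluators are actually implemented.

```python
import json
from typing import Callable

# Rubric questions paraphrased from the table; the exact wording is an assumption.
RUBRICS = {
    "Faithfulness": "Is the answer factually grounded in the provided context?",
    "Reasoning Quality": "Is the chain of thought logical and the routing efficient?",
    "Toxicity": "Is the response free of harmful, offensive, or hateful language?",
    "Manager Delegation": "Did the Manager select the correct specialist agent?",
    "Tool Calling Accuracy": "Were the right tools chosen with valid parameters?",
}


def build_prompt(metric: str, query: str, context: str, response: str) -> str:
    """Assemble the judge prompt for one metric."""
    if metric not in RUBRICS:
        raise KeyError(f"unknown metric: {metric}")
    return (
        f"Metric: {metric}\n"
        f"Rubric: {RUBRICS[metric]}\n"
        f"Query: {query}\n"
        f"Context: {context}\n"
        f"Response: {response}\n"
        'Reply with JSON only: {"score": <1-5>, "reason": "<short text>"}'
    )


def parse_verdict(raw: str) -> dict:
    """Parse the judge reply; raises ValueError on malformed or out-of-range output."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    score = int(data["score"])
    if not 1 <= score <= 5:  # assumed scale: 1 is worst, 5 is best
        raise ValueError(f"score out of range: {score}")
    return {"score": score, "reason": str(data.get("reason", ""))}


def judge(metric: str, query: str, context: str, response: str,
          call_llm: Callable[[str], str]) -> dict:
    """Run one judge evaluation; call_llm is a hypothetical model client."""
    verdict = parse_verdict(call_llm(build_prompt(metric, query, context, response)))
    return {"metric": metric, **verdict}
```

Passing the model client in as call_llm keeps the sketch independent of any provider and lets a stub function stand in for the judge during tests.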