Metrics reference glossary

The metrics reference tables define the automatic metrics and LLM judge evaluators used to measure performance, cost, and qualitative accuracy in agentic workflows.

Table 1. Automatic metrics
Metric category	Metric name	Detailed definition
Automatic	Latency	Total time in seconds from workflow start to finish. Useful for identifying performance bottlenecks.
Automatic	Token Usage	Total prompt tokens and completion tokens consumed across all execution spans, reported separately to support cost analysis.
Automatic	Error Rate	The percentage of spans that returned an error status rather than `Success`.
Automatic	Loop Detection	Identification of repeated or looping behavior in a trace, used to catch malfunctioning agents during testing.
Automatic	Task Completion	Binary status (Success or Fail) based strictly on the execution status code of the final task.
Automatic	Tool Call Count	Total number of tool or sub-agent calls made during the workflow trace.
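
The automatic metrics in Table 1 can be derived from span data alone. The following is a minimal sketch, not the product's implementation: the `Span` record and its field names are hypothetical, and the loop heuristic (the same span name occurring `loop_threshold` or more times in a row) is an assumption chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical span record; the field names are assumptions, not a documented schema."""
    name: str
    kind: str                 # assumed categories: "llm", "tool", "sub_agent"
    status: str               # "Success" or any error status
    start_s: float            # start time in seconds
    end_s: float              # end time in seconds
    prompt_tokens: int = 0
    completion_tokens: int = 0


def automatic_metrics(spans: List[Span], loop_threshold: int = 3) -> dict:
    """Compute the Table 1 metrics for one workflow trace."""
    if not spans:
        raise ValueError("trace has no spans")

    ordered = sorted(spans, key=lambda s: s.start_s)

    # Latency: wall-clock time from the earliest start to the latest end.
    latency_s = max(s.end_s for s in spans) - min(s.start_s for s in spans)

    # Error Rate: percentage of spans whose status is not "Success".
    errors = sum(1 for s in spans if s.status != "Success")
    error_rate_pct = 100.0 * errors / len(spans)

    # Loop Detection (assumed heuristic): longest run of identical span names.
    longest_run = run = 1
    for prev, cur in zip(ordered, ordered[1:]):
        run = run + 1 if cur.name == prev.name else 1
        longest_run = max(longest_run, run)

    # Task Completion: status of the span that finishes last.
    final_span = max(spans, key=lambda s: s.end_s)

    return {
        "latency_s": latency_s,
        "prompt_tokens": sum(s.prompt_tokens for s in spans),
        "completion_tokens": sum(s.completion_tokens for s in spans),
        "error_rate_pct": error_rate_pct,
        "loop_detected": longest_run >= loop_threshold,
        "task_completion": "Success" if final_span.status == "Success" else "Fail",
        "tool_call_count": sum(1 for s in spans if s.kind in ("tool", "sub_agent")),
    }
```

Keeping prompt and completion tokens as separate totals matters for cost analysis, because the two are typically priced differently.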

Table 2. Qualitative analysis
Metric category	Metric name	Detailed definition
LLM Judge	Faithfulness	Evaluation of whether the answer is factually grounded in the provided context, used to detect hallucinations.
LLM Judge	Reasoning Quality	Assessment of the agent's logical chain of thought and specialist routing efficiency.
LLM Judge	Toxicity	Safety check for harmful, offensive, hateful, or inappropriate language in the agent response.
LLM Judge	Manager Delegation	Measurement of the Manager agent's ability to select the correct specialist agent based on the query.
LLM Judge	Tool Calling Accuracy	Measurement of tool selection accuracy and the validity of the generated parameters.
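
An LLM judge metric such as Faithfulness generally follows one pattern: build an evaluation prompt, send it to a judge model, and parse a structured verdict. The sketch below illustrates that pattern under stated assumptions: the prompt wording, the 0 to 1 score scale, the JSON reply format, and the `call_llm` callable are all hypothetical and do not describe any specific evaluator's actual behavior.

```python
import json
from typing import Callable

# Hypothetical prompt template; doubled braces are literal braces for str.format.
FAITHFULNESS_PROMPT = (
    "You are an evaluator. Decide whether every claim in ANSWER is supported "
    "by CONTEXT.\n"
    'Reply with JSON only: {{"score": <number from 0 to 1>, "reason": "<short text>"}}\n\n'
    "CONTEXT:\n{context}\n\nANSWER:\n{answer}\n"
)


def judge_faithfulness(answer: str, context: str,
                       call_llm: Callable[[str], str]) -> dict:
    """Score how well `answer` is grounded in `context`.

    `call_llm` is an assumed callable that takes a prompt string and returns
    the judge model's raw text reply.
    """
    prompt = FAITHFULNESS_PROMPT.format(context=context, answer=answer)
    raw = call_llm(prompt)
    try:
        verdict = json.loads(raw)
        score = float(verdict["score"])
    except (ValueError, KeyError, TypeError):
        # The judge replied with something other than the requested JSON.
        return {"score": None, "reason": "unparseable judge output"}
    # Clamp to the assumed 0..1 scale in case the judge goes out of range.
    score = min(1.0, max(0.0, score))
    return {"score": score, "reason": str(verdict.get("reason", ""))}
```

Passing the judge in as a callable keeps the sketch independent of any particular model client and makes it easy to substitute a stub when testing the parsing logic.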