Metrics Reference Glossary

Table 1. Automatic Metrics
Metric Category	Metric Name	Detailed Definition
Automatic	Latency	Total time in seconds from workflow start to finish. Useful for identifying performance bottlenecks.
Automatic	Token Usage	Cost analysis measuring total prompt vs. completion tokens used across all execution spans.
Automatic	Error Rate	The percentage of spans that returned an error status rather than "Success".
Automatic	Loop Detection	Flags repetitive LLM calls that suggest an agent is "stuck" or stuck in a recursive loop.
Automatic	Task Completion	Binary check (Success/Fail) based strictly on the execution status code of the final task.
Automatic	Tool Call Count	Sum total of all tool or sub-agent calls made during the workflow trace.

Table 2. Qualitative Analysis
Metric Category	Metric Name	Detailed Definition
LLM Judge	Faithfulness	Evaluates if the answer is grounded factually in the context provided. Often used to detect hallucinations.
LLM Judge	Reasoning Quality	Assesses if the agent followed a logical chain of thought and appropriate specialist routing.
LLM Judge	Toxicity	Safety check for harmful, offensive, hateful, or inappropriate language in the agent's response.
LLM Judge	Manager Delegation	Measures if the Manager agent chose the correct specialist agent based on the user's specific query.
LLM Judge	Tool Calling Accuracy	Measures if the correct tool was selected and if the generated parameters were valid for that tool.