Metrics Reference Glossary

Table 1. Automatic Metrics
Metric Category | Metric Name | Detailed Definition
Automatic | Latency | Total time in seconds from workflow start to finish. Useful for identifying performance bottlenecks.
Automatic | Token Usage | Total prompt and completion tokens consumed across all execution spans, reported separately for cost analysis.
Automatic | Error Rate | The percentage of spans that returned an error status rather than "Success".
Automatic | Loop Detection | Flags repeated LLM calls that suggest the agent is stuck in a recursive loop.
Automatic | Task Completion | Binary check (Success/Fail) based strictly on the execution status code of the final task.
Automatic | Tool Call Count | Total number of tool or sub-agent calls made during the workflow trace.
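
The automatic metrics above can be sketched as simple aggregations over a trace. The following is a minimal illustration, not a reference implementation: the `Span` fields, the `kind` labels ("llm", "tool", "agent"), the `signature` field used to compare LLM calls, and the `loop_threshold` default are all assumptions made for this example. It also assumes the last span in the list represents the final task.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical span record; field names are assumptions for this sketch."""
    kind: str                   # "llm", "tool", or "agent" (sub-agent call)
    status: str                 # "Success" or any error status
    start: float                # start time in seconds
    end: float                  # end time in seconds
    prompt_tokens: int = 0
    completion_tokens: int = 0
    signature: str = ""         # e.g. a hash of the LLM input, for loop detection


def automatic_metrics(spans, loop_threshold=3):
    """Compute the Table 1 metrics for one trace (spans in execution order)."""
    if not spans:
        raise ValueError("trace has no spans")

    # Loop detection: longest run of consecutive LLM calls with the same signature.
    longest = run = 0
    prev = None
    for sig in (s.signature for s in spans if s.kind == "llm"):
        run = run + 1 if sig == prev else 1
        longest = max(longest, run)
        prev = sig

    errors = sum(1 for s in spans if s.status != "Success")
    return {
        "latency_s": max(s.end for s in spans) - min(s.start for s in spans),
        "prompt_tokens": sum(s.prompt_tokens for s in spans),
        "completion_tokens": sum(s.completion_tokens for s in spans),
        "error_rate_pct": 100.0 * errors / len(spans),
        "loop_detected": longest >= loop_threshold,
        # Assumption: the last span is the final task.
        "task_completed": spans[-1].status == "Success",
        "tool_call_count": sum(1 for s in spans if s.kind in ("tool", "agent")),
    }
```

Keeping prompt and completion tokens as separate totals preserves the split that cost analysis relies on, since the two are typically priced differently.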

Table 2. Qualitative Analysis
Metric Category | Metric Name | Detailed Definition
LLM Judge | Faithfulness | Evaluates whether the answer is factually grounded in the provided context. Often used to detect hallucinations.
LLM Judge | Reasoning Quality | Assesses whether the agent followed a logical chain of thought and routed to the appropriate specialist.
LLM Judge | Toxicity | Safety check for harmful, offensive, hateful, or inappropriate language in the agent's response.
LLM Judge | Manager Delegation | Measures whether the Manager agent chose the correct specialist agent for the user's specific query.
LLM Judge | Tool Calling Accuracy | Measures whether the correct tool was selected and whether the generated parameters were valid for that tool.
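
As an illustration of how an LLM Judge metric might be wired up, here is a rough sketch of a Faithfulness check. Everything in it is an assumption made for the example: the prompt wording, the binary 0/1 score, the JSON reply format, and the `call_llm` parameter, which stands in for whatever function sends a prompt to the judge model and returns its text reply.

```python
import json
from typing import Callable

# Hypothetical judge prompt; the wording and reply format are assumptions.
JUDGE_TEMPLATE = """You are grading an agent's answer for faithfulness.
Context:
{context}

Answer:
{answer}

Reply with JSON only: {{"score": <0 or 1>, "reason": "<one sentence>"}}"""


def judge_faithfulness(context: str, answer: str,
                       call_llm: Callable[[str], str]) -> dict:
    """Ask a judge model whether `answer` is grounded in `context`."""
    prompt = JUDGE_TEMPLATE.format(context=context, answer=answer)
    raw = call_llm(prompt)
    try:
        verdict = json.loads(raw)
        score = int(verdict["score"])
    except (ValueError, KeyError, TypeError):
        # The judge did not return the expected JSON shape.
        return {"score": None, "reason": "unparseable judge output"}
    if score not in (0, 1):
        return {"score": None, "reason": "score out of range"}
    return {"score": score, "reason": str(verdict.get("reason", ""))}
```

Returning a `None` score for malformed judge output, rather than guessing, keeps unparseable replies from being counted as either a pass or a fail.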