Guardian Agents Benchmark

Introduction:

The rise of agentic AI platforms marks a fundamental shift in how we evaluate model performance. These systems do more than generate text. They plan, select and use tools, manage multi-step workflows, and autonomously execute tasks on behalf of the user. Once an AI system begins interacting with external tools and maintaining evolving state across steps, traditional LLM benchmarks no longer capture what matters. Agents require an evaluation paradigm that reflects decision quality, action sequencing, correctness of tool usage, and the reliability of the final response. This shift requires evaluation methods beyond traditional text-generation metrics, and it has motivated several new efforts to benchmark agents more directly.

As the industry shifts toward evaluating full agent systems rather than single-turn prompts, several benchmarks have emerged to measure agent behavior. However, even these newer efforts do not evaluate agents running inside real agentic platforms. Broadly, these benchmarks fall into two categories:

Tool-use prediction benchmarks: Benchmarks such as BFCL evaluate whether a model can select the correct tool and produce arguments that match a predefined schema. These evaluations occur outside of any agentic framework, so they do not capture evolving runtime state, dependencies on tool outputs, or consistency across multi-step execution. They measure isolated prediction ability, not real agent performance.
Simulated agent behavior benchmarks: Frameworks such as TauBench and SafetyBench evaluate multi-step agent interactions, but they run inside their own simulated Python environments rather than on real agentic platforms like LangChain, Vectara, or LlamaIndex. They capture multi-turn behavior but not real platform execution, so they cannot measure the robustness of the agentic platform itself.

As a result, the field lacks a benchmark that evaluates agents in their actual operating environment, where platform behavior, tool APIs, LLM reasoning patterns, and error propagation all interact in complex ways.

To address this gap, we designed a benchmark that evaluates agents inside real agentic platforms rather than relying on simulations or isolated tool-prediction tasks. The objective of this benchmark is to surface real-world agent failures and use those findings to guide the design of Guardian Agents, a pre-execution safety layer we are developing. The benchmark is platform-agnostic and can run across a range of agentic frameworks, allowing failures to be observed under realistic runtime conditions. Instead of testing models in synthetic sandboxes, the benchmark measures how the entire agent stack behaves when executing real workflows end to end—capturing planning quality, tool selection, argument construction, sequencing, and the correctness of the final response.

In this blog, we present a platform-agnostic benchmark that evaluates agent robustness across diverse domains, a scenario-generation pipeline for creating realistic test cases, and initial guardrail strategies that help agents operate more reliably.

Approach: Scenario generation and evaluation

To evaluate agents in a structured and repeatable way, we developed an LLM-driven pipeline that generates test scenarios across multiple domains. Each scenario is designed to reflect a realistic workflow an agent would encounter inside a platform, along with the tool responses and constraints required to complete the task.

A Scenario Engine That Builds, Breaks, and Quality-Checks Itself

Figure 1: Scenario generation pipeline with optional adversarial modifications.

The scenario engine is designed to transform an agent's configuration - its system prompt, tools, and domain description - into comprehensive test scenarios. It does this by generating diverse user intents and creating both complete and incomplete prompts. A complete prompt provides all the information the agent needs to perform the task, while an incomplete prompt reflects real user behavior where key details are missing, implied, or only partially stated.

The engine then simulates realistic tool responses, constructs ground truth execution traces, and defines clear response evaluation criteria. These steps produce what we refer to as a happy path scenario. A happy path scenario describes the ideal workflow where the task can be completed without unexpected obstacles. It serves as the expected plan an agent should follow when everything goes right.

The engine can also generate adversarial variations from any happy path scenario by altering one or more simulated tool responses. These changes introduce realistic challenges such as errors, extremely large payloads, ambiguous outputs, content that resembles prompt injection, or partial information. After any modification, the engine updates the ground truth trace and evaluation criteria so the scenario remains consistent and safe to test.

At each stage of generation, the pipeline performs both rule based and LLM based validation to ensure structural correctness, internal consistency, and realism. If a step fails validation, it is regenerated with targeted feedback until it meets the required standard. This validation and retry loop ensures that all scenarios, whether normal or adversarial, reach the quality needed for reliable agent evaluation.

Evaluating Agent Responses and Actions:

To measure agent reliability, we evaluate both what the agent says and what the agent does. The first dimension, Response Correctness, checks whether the agent’s final answer accurately reflects the information returned by tools. The judge model compares the agent’s response against tool outputs and the scenario’s evaluation criteria, ensuring the answer is complete, grounded, free of hallucinations, and consistent with the task requirements.

The second dimension, Action Trace Correctness, evaluates the agent’s behavior throughout the workflow: the tools it selected, the parameters it provided, and the order in which those tools were executed. Using a graph based scoring rubric, the judge verifies that the agent followed the expected sequence, respected dependencies, matched arguments appropriately, and avoided unnecessary or unsafe tool calls.

An agent is considered correct only when it passes both dimensions. This dual perspective allows us to separate models that simply provide fluent answers from those that reliably execute workflows, follow constraints, and use tools safely.

To further understand why agents fail, we classify every incorrect run into a failure taxonomy. The main categories include:

Incorrect Tool Selection: The agent inappropriately chooses a tool for the task, such as using check_free_slots when check_conflicts is the required function.
Invalid or missing parameters: The agent uses incorrect values or fails to include required fields. A common example is formatting a date as mm/dd/yyyy when the tool expects mm-dd-yyyy.
Missing Required Tool Call: The agent fails to invoke a necessary tool call to complete the workflow. For example, it might call create_event without first calling check_conflicts.
Repeated Tool Calls: The agent repeatedly calls an inappropriate tool, such as find_free_slots, multiple times, expecting a different outcome rather than modifying its plan.
Incorrect Tool Sequencing: The agent fails to execute tools in the correct order, resulting in actions that are either invalid or premature. A common example is attempting to cancel_order before first using check_eligibility to confirm the order is eligible for cancellation.

Benchmark Evaluation and Findings

To understand how agents behave in realistic workflows, we evaluated them across multiple domains, each with its own tools, constraints, and expected execution patterns. Every scenario runs the full agent loop inside an actual agentic platform, allowing us to observe not only the final response but the entire chain of decisions and tool interactions.

Domains Covered:

The benchmark spans six operational domains that reflect common enterprise agent workflows:

Email: Managing inbox operations such as searching, filtering, reading, organizing messages, and marking or moving emails.
Calendar: Scheduling meetings, checking availability, resolving conflicts, handling attendees, and updating events.
Financial Analysis: Computing valuation metrics, comparing company fundamentals, interpreting ratios, and summarizing investment insights.
Customer Service: Handling order tracking, returns and cancellations, account issues, and general support workflows.
Internal Knowledge Retrieval: Searching HR policies, IT procedures, compliance guidelines, training documents, and company handbooks across multiple knowledge bases.
Business Intelligence: Executing text to SQL queries, analyzing structured datasets, identifying trends, and summarizing analytical results.

These domains allow us to evaluate agents in a wide range of realistic, multi step tasks that require tool use, reasoning, and correct sequencing.

Evaluation Setting:

We tested agents across different platform and model configurations. Each configuration executes every scenario end to end using the tool simulation defined in the scenario file. We evaluated multiple combinations of platforms and models:

Platforms: LLamaIndex and LangChain
Models: GPT-5 and Claude Sonet 4.5
Metrics:
- R - Response Correct
- A - Action Trace Correct
- O - Overall Correct, which requires both R and A to be correct

The table below summarizes performance across platforms and models:

Table 1: Domain wise evaluation results across platforms and models.

R represents Response Correct, A represents Action Trace Correct, and O represents Overall Correct (both R and A must be correct). Across all 907 scenarios, Overall Correct ranges from roughly 5–59%, even when Response Correct is often above 50%.

Across all domains, we observe a consistent pattern: agents often generate reasonably good final responses (R), but struggle significantly with correct tool usage and step-by-step execution (A). As a result, Overall Correct scores (O) remain low across platforms. These results highlight the difficulty agents face in following optimized tool plans and the importance of evaluating both response and actions.

Failure Taxonomy Breakdown:

Beyond accuracy numbers, we analyze every failed run using the taxonomy described earlier.

This gives us visibility into which failure types occur most frequently and why.

Figure 2: Distribution of failure types across model and framework combinations.

Across configurations, most failures stem from missing required tool calls and incorrect tool selection, which together make up the majority of errors. Agents also get stuck in repeated tool calls when they fail to adjust their plan, and invalid parameters appear regularly in schema heavy domains. Incorrect tool order is relatively uncommon. Overall, these patterns show that agents struggle more with choosing and completing the right tools than with sequencing alone.

Improving Reliability with Platform Level Guardrails

The evaluation results highlight a clear pattern: agents often produce fluent answers but struggle with choosing the right tools, constructing correct arguments, and following the required sequence of actions. To address this, we plan to introduce an early stage validation layer that we refer to as Guardian Agents.

Guardian Agents operate before tool execution. Instead of immediately running a predicted action plan, the system first collects the agent’s proposed tool calls and evaluates them across three checks:

Unnecessary Tools: The guardian identifies tools that should not be part of the plan. This prevents the agent from taking irrelevant or unsafe actions.
Missing Required Tools: The guardian checks whether the plan includes all tools needed to satisfy the user request. This addresses one of the most common failure types seen in our taxonomy.
Argument Validation: The guardian inspects the predicted arguments for correctness, presence, and structure. It ensures the agent provides the right values in the right places before the tool is ever executed.

Each check returns a simple pass or fail decision along with natural language feedback explaining the issue. Feedback from all checks is aggregated and sent back to the agent, allowing it to revise the plan. The agent is then run again with the corrected plan, with a cap on the number of retries to avoid infinite loops.

For the business, this means fewer expensive mistakes - like incorrect calendar updates, faulty financial analysis, or misrouted customer requests - because bad tool calls are intercepted before they execute. It also reduces the risk of unsafe or non compliant behavior by blocking actions that violate required checks or skip necessary approvals. And by guiding agents back onto a valid path instead of allowing silent failures, Guardian Agents increase the completion rate of complex, multi step workflows. Because these guardrails are domain agnostic, they can be applied across a wide range of use cases without rebuilding the safety layer for each one. This marks a meaningful step toward making agentic AI not only more capable, but reliably safe for real world deployment.

What's next?

Our next step is to integrate Guardian Agents directly into the Vectara platform as a pre-execution safety layer. This will be a first-class platform feature rather than a separate agent developers must orchestrate themselves. When enabled, the platform will automatically validate an agent’s proposed tool calls before execution, return feedback to the agent when issues are detected, and only proceed with safe, complete, and well-formed actions.

We will evaluate the impact of this feature by comparing baseline runs with guarded runs, focusing on improvements in tool selection, argument correctness, and overall workflow completion. This will help us identify which failure types benefit the most from early validation and quantify how platform-level guardrails strengthen agent reliability in real workloads.

Learn more about vectara agents here: https://www.vectara.com/blog/announcing-vectara-agents-enterprise-ai-that-works

Guardian Agents Benchmark

Introduction:

Approach: Scenario generation and evaluation

A Scenario Engine That Builds, Breaks, and Quality-Checks Itself

Evaluating Agent Responses and Actions:

Benchmark Evaluation and Findings

Domains Covered:

Evaluation Setting:

Failure Taxonomy Breakdown:

Improving Reliability with Platform Level Guardrails

What's next?

Connect with
our Community!

Discord.

Github.

X / Twitter.

LinkedIn.

Discuss.

E-mail.

A Scenario Engine That Builds, Breaks, and Quality-Checks Itself

Evaluating Agent Responses and Actions:

Connect with our Community!

Discord.

Github.

X / Twitter.

LinkedIn.

Discuss.

E-mail.

Connect with
our Community!