[Preview] Agent Evaluation: Overview

Important Note: This feature is available in Preview for select customers. During the Preview phase, AI Agent Evaluation is accessed inside the AI Agent Platform. At General Availability (GA), the experience will move to CXA Operations Center (formerly AI Trainer).

 

Agent Evaluation is a feature that lets you test an AI Agent Orchestration against a set of predefined scenarios and measure how the AI Agent behaves. It is designed to help AI Agent Ops and Admins answer three questions before deploying or promoting a version of an AI Agent to production:

  • Did the Agent achieve the goals the user was asking for?

  • Did it call the right tools, in the right order, with the right arguments?

  • Did it stay within its allowed scope and follow its instructions?

This article provides a high-level walkthrough of Agent Evaluation. For step-by-step instructions on building datasets and running evaluations, see the linked articles at the bottom.

Core Concepts

Agent Evaluation is built around three concepts: Dataset, Metric, and Evaluation. Each is described below.

 

Dataset

A Dataset is a collection of test cases that represent the scenarios you want to validate against your AI Agent Orchestration. Each test case captures the inputs to send to the Agent (or to the simulated user), the expected behaviour, and the references used to score the outcome.

There are two dataset types:

  • Simulated: The user side of the conversation is played by an LLM driven by a Persona, a Goal, and behavioural Instructions. Useful for multi-turn flows: bookings, eligibility checks, troubleshooting resolutions.

  • Scripted: The user side is a literal, predefined sequence of messages that is sent to the Agent exactly as written. Useful for deterministic regression tests: FAQ responses, adversarial probes, single-turn Q&A.
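To make the two dataset types concrete, the sketch below models one test case of each kind as plain Python data. The class and field names (`SimulatedCase`, `ScriptedCase`, `persona`, `goal`, `instructions`, `messages`) are illustrative assumptions that mirror the descriptions above. They are not the product's actual schema or export format.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedCase:
    """Hypothetical shape: an LLM plays the user, driven by these fields."""
    persona: str
    goal: str
    instructions: str


@dataclass
class ScriptedCase:
    """Hypothetical shape: user messages are sent exactly as written."""
    messages: list[str] = field(default_factory=list)

    @property
    def is_single_turn(self) -> bool:
        # A scripted case with exactly one user message is single-turn.
        return len(self.messages) == 1


# A multi-turn booking flow, where the simulated user pursues a goal.
booking = SimulatedCase(
    persona="Busy traveller who answers briefly",
    goal="Book a table for two on Friday evening",
    instructions="Only reveal the party size when asked",
)

# A deterministic single-turn FAQ check.
faq = ScriptedCase(messages=["What are your opening hours?"])
```

The split reflects the trade-off described above: a Simulated case describes intent and lets the conversation vary, while a Scripted case fixes the exact input so that repeated runs are comparable.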

Metric

A Metric is a check applied to each test case when the evaluation runs. Every metric produces a per-case score between 0 and 1, plus a reasoning explanation. The product supports the following metrics:

  • Goal Accuracy: Did the AI Agent achieve the goal?

  • Answer Accuracy: Does the AI Agent's reply match the expected answer?

  • Tool Call Accuracy: Did the AI Agent call the right tools, in the right order, with the right arguments?

  • Guardrails: Did the AI Agent identify and block out-of-scope requests or prompt injection attempts?

  • Application Output Accuracy: Does the end-of-automation output match the expected routing label?

  • Instruction Adherence: Did the AI Agent follow the behavioural rules in its prompt?
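This article does not describe how the product computes any of these scores. Purely as an illustration of how "right tools, right order, right arguments" could be expressed as a score between 0 and 1, the sketch below uses a strict position-by-position comparison. The `ToolCall` type, the `call` helper, the tool names, and the scoring rule are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: tuple  # sorted (key, value) pairs, so calls compare by value


def call(name: str, **arguments) -> ToolCall:
    """Hypothetical helper that builds a comparable ToolCall."""
    return ToolCall(name, tuple(sorted(arguments.items())))


def tool_call_score(expected: list[ToolCall], actual: list[ToolCall]) -> float:
    """Fraction of positions where tool name and arguments match exactly.

    Dividing by the longer list means missing or extra calls lower the score.
    """
    if not expected and not actual:
        return 1.0
    matches = sum(1 for e, a in zip(expected, actual) if e == a)
    return matches / max(len(expected), len(actual))


expected = [
    call("check_availability", date="2025-01-10"),
    call("create_booking", party_size=2),
]
actual = [
    call("check_availability", date="2025-01-10"),
    call("create_booking", party_size=3),  # wrong argument value
]
score = tool_call_score(expected, actual)  # 0.5: one of two calls matches
```

A strict positional rule is easy to reason about but harsh: a single extra call at the start shifts every later position. A real scorer might align calls more leniently.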

Evaluation

An Evaluation is a single run. It ties together:

  • The AI Agent Orchestration and the version to test.

  • The Dataset to run against.

  • The subset of Metrics to compute.

When the evaluation finishes, results are displayed in summary cards (one per metric) and a per-test-case table.
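The three ingredients of an Evaluation can be pictured as a small configuration record. The sketch below is a hypothetical model, not a product API: `EvaluationRun`, its field names, and the example values are assumptions, and the rule that at least one metric must be selected is also assumed. Only the metric names come from the list in this article.

```python
from dataclasses import dataclass

# Metric names as listed in this article.
SUPPORTED_METRICS = {
    "Goal Accuracy",
    "Answer Accuracy",
    "Tool Call Accuracy",
    "Guardrails",
    "Application Output Accuracy",
    "Instruction Adherence",
}


@dataclass(frozen=True)
class EvaluationRun:
    """Hypothetical record of what a single evaluation run ties together."""
    orchestration: str
    version: str
    dataset: str
    metrics: frozenset

    def __post_init__(self):
        unknown = set(self.metrics) - SUPPORTED_METRICS
        if unknown:
            raise ValueError(f"Unsupported metrics: {sorted(unknown)}")
        if not self.metrics:
            # Assumption: a run with no metrics has nothing to compute.
            raise ValueError("Select at least one metric")


run = EvaluationRun(
    orchestration="Support Orchestration",
    version="v3",
    dataset="Booking regression",
    metrics=frozenset({"Goal Accuracy", "Tool Call Accuracy"}),
)
```

Making the record immutable mirrors the idea that an Evaluation is a single run: testing a new version means creating a new run rather than editing an old one.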

 

The Agent Evaluation Workflow

  1. Create a Dataset. Define the scenarios, select a dataset type (Simulated or Scripted), and add test cases with the appropriate reference fields for the metrics you want to validate.

  2. Create an Evaluation. Select the Orchestration, the version, the dataset, and the metrics to compute. Run the evaluation.

  3. Read the results. Review the per-metric KPI %, drill into individual test cases, and investigate reasoning strings when a metric scores below expectation.

  4. Iterate. Adjust AI Agent instructions, prompts, skills, or Orchestration structure. Re-run the same evaluation against the new version to confirm the regression is resolved.
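The iterate step amounts to comparing per-test-case scores for the same metric across two runs. As a minimal sketch, assuming you have copied those scores out of the results page into dictionaries keyed by a test case identifier, the function below lists the cases that got worse. The `regressions` function, the `tolerance` parameter, and the identifiers are hypothetical.

```python
def regressions(
    before: dict[str, float],
    after: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """Return test cases whose score dropped by more than `tolerance`.

    Only cases present in both runs are compared.
    """
    return sorted(
        case
        for case, old_score in before.items()
        if case in after and after[case] < old_score - tolerance
    )


# Per-case Goal Accuracy scores for two versions (illustrative values).
v1_scores = {"tc-1": 1.0, "tc-2": 0.5, "tc-3": 1.0}
v2_scores = {"tc-1": 1.0, "tc-2": 1.0, "tc-3": 0.0}

worse = regressions(v1_scores, v2_scores)  # ["tc-3"]
```

Here `tc-2` improved and `tc-3` regressed, so `tc-3` is the case whose reasoning string is worth reading first.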

 

Dataset × Metric Compatibility

Not every metric works with every dataset type. The main rules: 

Metric                      | Dataset and turn support           | Reference data
----------------------------|------------------------------------|------------------------------------------
Goal Accuracy               | Works with single- and multi-turn  | Needs reference data for evaluation
Tool Call Accuracy          | Works with single- and multi-turn  | Needs reference data for evaluation
Answer Accuracy             | Scripted only, single-turn only    | Needs reference data for evaluation
Application Output Accuracy | Works with single- and multi-turn  | Needs reference data for evaluation
Instruction Adherence       | Works with single- and multi-turn  | Reference-free (no reference data needed)
Guardrails                  | Works with single- and multi-turn  | Reference-free (no reference data needed)

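When you select Compatible Metrics while creating a dataset, the UI only shows the combinations that are valid for that dataset type. Metrics that need reference data read it from the reference fields of the dataset's test cases, so those fields must be filled in for each metric you want to compute.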
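The compatibility rules above can be written down as a small lookup, which may help when planning datasets outside the UI. This is a sketch under stated assumptions: the function names are hypothetical, and every metric other than Answer Accuracy is treated as valid for both dataset types, which is an inference from the "single- and multi-turn" notes rather than something the article states explicitly.

```python
ALL_METRICS = {
    "Goal Accuracy",
    "Tool Call Accuracy",
    "Answer Accuracy",
    "Application Output Accuracy",
    "Instruction Adherence",
    "Guardrails",
}
SCRIPTED_SINGLE_TURN_ONLY = {"Answer Accuracy"}
REFERENCE_FREE = {"Instruction Adherence", "Guardrails"}
DATASET_TYPES = {"Simulated", "Scripted"}


def is_compatible(metric: str, dataset_type: str, turns: int) -> bool:
    """Check a metric against a dataset type and a number of user turns."""
    if metric not in ALL_METRICS:
        raise ValueError(f"Unknown metric: {metric}")
    if dataset_type not in DATASET_TYPES:
        raise ValueError(f"Unknown dataset type: {dataset_type}")
    if metric in SCRIPTED_SINGLE_TURN_ONLY:
        return dataset_type == "Scripted" and turns == 1
    # Assumption: all other metrics work with both types and any turn count.
    return True


def needs_reference(metric: str) -> bool:
    """True if the metric is scored against reference data in the dataset."""
    if metric not in ALL_METRICS:
        raise ValueError(f"Unknown metric: {metric}")
    return metric not in REFERENCE_FREE
```

For example, `is_compatible("Answer Accuracy", "Simulated", 1)` is `False`, while `needs_reference("Guardrails")` is `False` because Guardrails is reference-free.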

 

Reading the Results

Once the evaluation completes, click View on the evaluation row to open the results page. The results page shows:

  • Summary cards: One per metric, with the aggregate KPI % and a pass/fail badge.
  • Inputs table: Every test case with its per-metric verdict.
  • Test case details: Click View on any row to drill into the transcript, chain-of-thought, skill details, and per-metric reasoning.
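The article does not define how the aggregate KPI % or the pass/fail badge is derived. One plausible reading, shown here only as an assumption, is that the KPI is the mean of the per-case scores expressed as a percentage, and that the badge compares it with a threshold. The `summarize` function and the 80% threshold are hypothetical.

```python
from statistics import mean


def summarize(
    results: dict[str, list[float]],
    pass_threshold: float = 0.8,  # assumed threshold, not a product default
) -> dict[str, tuple[float, str]]:
    """Map each metric to (KPI %, badge) from its per-case scores in [0, 1]."""
    summary = {}
    for metric, scores in results.items():
        kpi = round(mean(scores) * 100, 1)
        badge = "pass" if kpi >= pass_threshold * 100 else "fail"
        summary[metric] = (kpi, badge)
    return summary


# Illustrative per-case scores for two metrics.
results = {
    "Goal Accuracy": [1.0, 1.0, 0.5, 1.0],
    "Guardrails": [1.0, 0.0, 0.0, 1.0],
}
summary = summarize(results)
# summary["Goal Accuracy"] == (87.5, "pass")
# summary["Guardrails"] == (50.0, "fail")
```

Under this reading, a low KPI points you to the inputs table, where the individual cases scoring below 1 explain the gap.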
