[Preview] Agent Evaluation: Overview

Important Note: This feature is available in Preview for select customers. During the Preview phase, AI Agent Evaluation is accessed inside the AI Agent Platform. At General Availability (GA), the experience will move to CXA Operations Center (formerly AI Trainer).

Core Concepts
Dataset
Metric
Evaluation
The Agent Evaluation Workflow
Dataset × Metric Compatibility
Reading the Results

Agent Evaluation is a feature that lets you test an AI Agent Orchestration against a set of predefined scenarios and measure how the AI Agent behaves. It is designed to help AI Agent Ops and Admins answer three questions before deploying or promoting a version of an AI Agent to production:

Did the Agent achieve the goals the user was asking for?
Did it call the right tools, in the right order, with the right arguments?
Did it stay within its allowed scope and follow its instructions?

This article provides a high-level walkthrough of Agent Evaluation. For step-by-step instructions on building datasets and running evaluations, see the linked articles at the bottom.

Core Concepts

Agent Evaluation is built around three concepts: Dataset, Metric, and Evaluation. Each is described below.

Dataset

A Dataset is a collection of test cases that represent the scenarios you want to validate against your AI Agent Orchestration. Each test case captures the inputs to send to the Agent (or to the simulated user), the expected behaviour, and the references used to score the outcome. A dataset can contain up to 500 scenarios.

There are two dataset types:

Simulated: The user side of the conversation is played by an LLM driven by a Persona, a Goal, and behavioural Instructions. Useful for multi-turn flows such as bookings, eligibility checks, and troubleshooting resolutions. Expected behaviour for a Simulated test case is captured as one or more Reference Goals (up to 10 per test case), each describing a distinct outcome the AI Agent should achieve. All Reference Goals for a test case are evaluated together as the ground truth for scoring.
Scripted: The user side is a literal, predefined sequence of messages that is sent to the Agent exactly as written. Useful for deterministic regression tests: FAQ responses, adversarial probes, single-turn Q&A.
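To make the two dataset types concrete, here is a minimal sketch of how test cases and the limits above could be modelled locally, for example when preparing cases before entering them in the UI. The class names, field names, and `validate_dataset` helper are hypothetical and are not part of the product; only the limits (500 scenarios per dataset, 10 Reference Goals per Simulated test case) come from this article.

```python
from dataclasses import dataclass, field

MAX_SCENARIOS_PER_DATASET = 500  # limit stated in this article
MAX_REFERENCE_GOALS = 10         # per Simulated test case, stated in this article


@dataclass
class SimulatedTestCase:
    """Hypothetical shape: the user side is played by an LLM."""
    persona: str
    goal: str
    instructions: str
    reference_goals: list[str] = field(default_factory=list)


@dataclass
class ScriptedTestCase:
    """Hypothetical shape: messages are sent to the Agent exactly as written."""
    messages: list[str]


def validate_dataset(dataset_type: str, test_cases: list) -> list[str]:
    """Return a list of problems; an empty list means the dataset looks valid."""
    if dataset_type not in ("simulated", "scripted"):
        return [f"unknown dataset type: {dataset_type!r}"]
    problems = []
    if len(test_cases) > MAX_SCENARIOS_PER_DATASET:
        problems.append(
            f"{len(test_cases)} scenarios exceeds the limit of {MAX_SCENARIOS_PER_DATASET}"
        )
    expected = SimulatedTestCase if dataset_type == "simulated" else ScriptedTestCase
    for index, case in enumerate(test_cases):
        if not isinstance(case, expected):
            problems.append(f"case {index}: not a {expected.__name__}")
        elif isinstance(case, SimulatedTestCase):
            # Only the upper bound is checked; whether zero goals is allowed
            # depends on the selected metrics and is not assumed here.
            if len(case.reference_goals) > MAX_REFERENCE_GOALS:
                problems.append(
                    f"case {index}: more than {MAX_REFERENCE_GOALS} Reference Goals"
                )
        elif not case.messages:
            problems.append(f"case {index}: scripted case has no messages")
    return problems
```

A dataset mixing the two case types is reported as invalid, reflecting that a dataset has a single type.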

Metric

A Metric is a check applied to each test case when the evaluation runs. Every metric produces a per-case score between 0 and 1, plus an explanation of the reasoning behind that score. The product supports the following metrics:

Goal Accuracy: Did the AI Agent achieve the goal?
Answer Accuracy: Does the AI Agent's reply match the expected answer?
Tool Call Accuracy: Did the AI Agent call the right tools, in the right order, with the right arguments?
Guardrails: Did the AI Agent identify and block out-of-scope requests or prompt injection attempts?
Application Output Accuracy: Does the end-of-automation output match the expected routing label?
Instruction Adherence: Did the AI Agent follow the behavioural rules in its prompt?

Evaluation

An Evaluation is a single run. It ties together:

The AI Agent Orchestration and the version to test.
The Dataset to run against.
The subset of Metrics to compute.
The Number of Runs — how many times to repeat the full dataset within this evaluation session.

Because AI Agent behaviour is non-deterministic, running the same dataset more than once helps confirm results are consistent before you trust a pass/fail verdict. Set Number of Runs greater than 1 to have the evaluation repeat automatically; results are aggregated per test case across all runs.
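As an illustration of why repeated runs matter, the sketch below aggregates per-test-case scores across several runs and reports how much they vary. This article does not specify how the product aggregates results, so the use of the mean and the `spread` figure are assumptions for illustration, and `aggregate_runs` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def aggregate_runs(runs):
    """Aggregate scores per test case across runs.

    runs: list of dicts, one per run, mapping test case id -> score in [0, 1].
    Assumption: the aggregate is the mean; the product may use another rule.
    """
    per_case = defaultdict(list)
    for run in runs:
        for case_id, score in run.items():
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"score for {case_id!r} is outside [0, 1]: {score}")
            per_case[case_id].append(score)
    return {
        case_id: {
            "mean": mean(scores),
            "spread": max(scores) - min(scores),  # 0 means identical across runs
            "runs": len(scores),
        }
        for case_id, scores in per_case.items()
    }


# Two runs of the same two-case dataset (made-up scores).
summary = aggregate_runs([
    {"booking-01": 1.0, "refund-02": 0.5},
    {"booking-01": 1.0, "refund-02": 1.0},
])
```

A test case with a large spread is one whose verdict should not be trusted from a single run.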

When the evaluation finishes, results are displayed in summary cards (one per metric) and a per-test-case table.

The Agent Evaluation Workflow

1. Create a Dataset. Select a dataset type (Simulated or Scripted), define the scenarios (up to 500 per dataset), and add test cases with the reference fields required by the metrics you want to validate, including one or more Reference Goals for Simulated test cases.
2. Create an Evaluation. Select the Orchestration, the version, the dataset, and the metrics to compute. Run the evaluation.
3. Read the results. Review the per-metric KPI %, drill into individual test cases, and investigate reasoning strings when a metric scores below expectation.
4. Iterate. Adjust AI Agent instructions, prompts, skills, or Orchestration structure. Re-run the same evaluation against the new version to confirm the regression is resolved.
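The iterate step amounts to comparing the per-metric KPI % of two evaluations of the same dataset. A minimal sketch of that comparison follows, assuming you have copied the KPI percentages from the results page by hand; `find_regressions`, its `tolerance` parameter, and the sample numbers are hypothetical, and no export API is implied.

```python
def find_regressions(baseline, candidate, tolerance=0.0):
    """Return metrics whose KPI % dropped by more than `tolerance` points.

    baseline, candidate: dicts mapping metric name -> KPI percentage (0-100).
    Metrics missing from `candidate` are skipped rather than flagged.
    """
    regressions = {}
    for metric, old in baseline.items():
        new = candidate.get(metric)
        if new is None:
            continue
        if new < old - tolerance:
            regressions[metric] = (old, new)
    return regressions


# Made-up KPI values for two versions of the same Orchestration.
previous_version = {"Goal Accuracy": 90.0, "Guardrails": 100.0}
new_version = {"Goal Accuracy": 85.0, "Guardrails": 100.0}

dropped = find_regressions(previous_version, new_version)
```

Because Agent behaviour is non-deterministic, a small non-zero `tolerance` avoids flagging run-to-run noise as a regression.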

Dataset × Metric Compatibility

Not every metric works with every dataset type. The main rules:

Metric	Simulated	Scripted	Notes
Goal Accuracy	✅	✅	Works with single- and multi-turn conversations. Needs reference data. Supports multiple goals per scenario.
Tool Call Accuracy	✅	✅	Works with single- and multi-turn conversations. Needs reference data.
Answer Accuracy	❌	✅	Scripted only, single-turn only. Needs reference data.
Application Output Accuracy	✅	✅	Works with single- and multi-turn conversations. Needs reference data.
Instruction Adherence	✅	✅	Works with single- and multi-turn conversations. Reference-free: no reference data needed.
Guardrails	✅	✅	Works with single- and multi-turn conversations. Reference-free: no reference data needed.

When you select Compatible Metrics while creating a dataset, the UI only shows the combinations that are valid for that dataset type. Metrics that need reference data take it from the reference fields inside the dataset.
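The compatibility rules in the table can be expressed as a small lookup, which can be handy when planning which reference fields a dataset needs. The dictionary below transcribes the table; the function names and the lowercase dataset-type keys are illustrative choices, not product identifiers.

```python
# Transcribed from the compatibility table in this article.
# Answer Accuracy is additionally limited to single-turn test cases.
COMPATIBILITY = {
    "Goal Accuracy": {"simulated": True, "scripted": True, "needs_reference": True},
    "Tool Call Accuracy": {"simulated": True, "scripted": True, "needs_reference": True},
    "Answer Accuracy": {"simulated": False, "scripted": True, "needs_reference": True},
    "Application Output Accuracy": {"simulated": True, "scripted": True, "needs_reference": True},
    "Instruction Adherence": {"simulated": True, "scripted": True, "needs_reference": False},
    "Guardrails": {"simulated": True, "scripted": True, "needs_reference": False},
}


def compatible_metrics(dataset_type):
    """Return the metrics that can run against the given dataset type."""
    if dataset_type not in ("simulated", "scripted"):
        raise ValueError(f"unknown dataset type: {dataset_type!r}")
    return [name for name, rule in COMPATIBILITY.items() if rule[dataset_type]]


def metrics_needing_reference(dataset_type):
    """Return the compatible metrics that need reference data in the dataset."""
    return [
        name
        for name in compatible_metrics(dataset_type)
        if COMPATIBILITY[name]["needs_reference"]
    ]
```

For example, `compatible_metrics("simulated")` omits Answer Accuracy, and `metrics_needing_reference("simulated")` lists the three metrics whose reference fields must be filled in.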

Reading the Results

Once the evaluation completes, click View on the evaluation row to open the results page. The results page shows:

Summary cards: One per metric, with the aggregate KPI % and a pass/fail badge.
Inputs table: Every test case with its per-metric verdict.
Click View on any row to drill into the transcript, chain-of-thought, skill details, and per-metric reasoning.

Published May 07, 2026 14:45 • Last Updated July 21, 2026 16:41


Related articles