Agent Evaluation

Agent evaluation tests how well an agent's orchestration works and whether it delivers the capabilities and performance you expect. You import a test dataset, run the agent against it in batch, and analyze the outputs to get objective quality metrics, so you can keep tuning the workflow as you debug.

Features

Batch data testing: Import scenario-based datasets to simulate user dialogues, run them in batch, and collect outputs for a full assessment of response quality.
AI model-assisted analysis: Let an AI model evaluate the results automatically, returning quality judgments and performance scores to speed up analysis.
Multi-dimensional comparison: Compare and annotate results online, benchmark across versions, and trace knowledge-base retrieval for detailed diagnostics.

Batch testing

Go to the My Agent list, select an agent, then click … > Batch Testing in the Operation column.

Alternatively, click Manage in the Operation column to open the agent details page, then click Batch Testing in the top right corner.

Agent details page with Batch Testing button

Note: Only agents with an officially released version can be batch tested.

AI evaluation configuration

The platform can use an AI evaluation model to analyze test results automatically. When you create an evaluation task with the AI Evaluation debug type, the system invokes the model to analyze the agent's outputs and generate a report.

Click AI Evaluation Configuration in the top right corner. This configuration applies only to the current agent. Every AI evaluation task you create for this agent reuses it.

AI Evaluation Configuration entry
Configure both the Evaluation Model and the Evaluation Prompt.

Evaluation model and prompt fields

Enter the prompt manually, or click Prompt Template in the lower-left corner to preview a template, then click Use to apply it. Templates are available in Chinese and English. Click Switch to English or Switch to Chinese to change the template language.

Prompt template selection

When done, click OK.
Each save creates a historical version of the configuration. In the History section on the right, click Details to view a version, or click Restore this version to roll back to it.

Configuration history list

Historical version details

Create a task

On the batch testing task list, click Create Task in the top right corner. You can evaluate any released version of the agent.

Field	Description
Data Region	The data center where the agent is deployed. Task data is also saved here.
Agent	The name of the agent.
Select Version	The released version of the current agent to evaluate.
Test Task Name	The name of the evaluation task.
Debug Type	Agent Execution runs the test data and outputs results. AI Evaluation runs the agent, then has the AI evaluation model analyze the outputs and return results.
Import Data	Import test data from a spreadsheet, one file at a time. Click Download Test Set Template and format your data to match the template to avoid parsing failures.

When done, click Save and Execute Immediately to run the task.
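
Preparing the import file can be scripted. The following is a minimal sketch, assuming the template has two columns named Input and Expected Output and that a CSV-style layout is acceptable. Both are assumptions for illustration: the real column names and file format come from the file you get through Download Test Set Template. The helper `build_test_set` is hypothetical.

```python
import csv
import io

# Hypothetical column names; replace them with the headers in the downloaded template.
COLUMNS = ["Input", "Expected Output"]

def build_test_set(cases):
    """Return CSV text for (input, expected_output) pairs, skipping blank inputs."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    for user_input, expected in cases:
        if not user_input.strip():
            continue  # a blank input is not a usable test case
        writer.writerow([user_input.strip(), expected.strip()])
    return buffer.getvalue()

cases = [
    ("How do I reset my password?", "Explain the password reset steps."),
    ("   ", "skipped because the input is blank"),
    ("What are your support hours?", "State the support hours."),
]
csv_text = build_test_set(cases)
print(csv_text)
```

Generating the file from code keeps the header row and quoting consistent, which may help avoid the parsing failures mentioned above. Save the text to a file and convert it to the template's spreadsheet format if the platform requires it.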

Evaluation result

After the task finishes, click Details to view results online, or click Download to download the result file.

If the agent is linked to a knowledge base, click View Retrieval in the results to see the retrieval details.

Field	Description
Input	The test case data you uploaded.
Expected Output	The response you expect for the input.
Actual Output	The result the agent produced.
Evaluation Result	A manual annotation. Mark each result pass or fail and add comments.
Evaluation Description	For AI evaluation, the model's opinion. For agent execution, empty until you annotate it.
Additional Information	Add remarks as needed.
Knowledge Retrieval	If the agent is linked to a knowledge base, the retrieval details for this input. Retrieval details cannot be exported.
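
A downloaded result file can be summarized offline. The sketch below assumes the file has been saved as CSV, that its headers match the field names in the table above, and that annotations are stored as the words pass and fail. None of this is confirmed by the platform, so adjust the names to your actual export. The function `summarize` is hypothetical.

```python
import csv
import io
from collections import Counter

def summarize(result_csv_text):
    """Count labels in the 'Evaluation Result' column and compute the pass rate."""
    reader = csv.DictReader(io.StringIO(result_csv_text))
    counts = Counter()
    for row in reader:
        label = (row.get("Evaluation Result") or "").strip().lower()
        counts[label or "unannotated"] += 1
    judged = counts["pass"] + counts["fail"]
    # The pass rate is undefined when nothing has been annotated yet.
    pass_rate = counts["pass"] / judged if judged else None
    return counts, pass_rate

# Inline sample data standing in for a downloaded result file.
sample = """Input,Expected Output,Actual Output,Evaluation Result
q1,a1,a1,pass
q2,a2,wrong,fail
q3,a3,a3,Pass
q4,a4,a4,
"""
counts, pass_rate = summarize(sample)
print(dict(counts), pass_rate)
```

Rows without an annotation are counted separately and excluded from the pass rate, so a partly annotated task does not look worse than it is.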

Result comparison

You can compare any two tasks of an agent online. On the batch testing task list, click Result Comparison, select two historical tasks, then click Result Comparison again to see the details.

You can also download the result files and compare them in detail locally.
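
One way to do the local comparison is to match rows from two result files by input and list the cases whose annotation changed. This is a sketch under assumptions: both files are CSV, both contain Input and Evaluation Result columns, and each input appears once per file. The functions `load_results` and `diff_results` are hypothetical.

```python
import csv
import io

def load_results(csv_text):
    """Map each Input to its normalized Evaluation Result label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["Input"]: (row.get("Evaluation Result") or "").strip().lower()
        for row in reader
    }

def diff_results(old_text, new_text):
    """Return (input, old_label, new_label) for inputs whose label changed."""
    old, new = load_results(old_text), load_results(new_text)
    changes = []
    # Only inputs present in both tasks can be compared.
    for key in sorted(old.keys() & new.keys()):
        if old[key] != new[key]:
            changes.append((key, old[key], new[key]))
    return changes

# Inline sample data standing in for two downloaded result files.
task_a = "Input,Actual Output,Evaluation Result\nq1,x,pass\nq2,y,fail\nq3,z,pass\n"
task_b = "Input,Actual Output,Evaluation Result\nq1,x,pass\nq2,y2,pass\nq3,z2,fail\n"
changes = diff_results(task_a, task_b)
for item in changes:
    print(item)
```

A change from fail to pass suggests a fix in the newer version, while pass to fail flags a possible regression worth inspecting in the online comparison view.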

Billing

The agent evaluation feature itself is free. However, the tokens consumed by running an evaluation task are billed at standard rates.

Go to the Resource Consume page for detailed token consumption. You can also select an agent, open the Batch Testing list, and check the Token Consumption column for each task.
