
Evals

Name: Oya
Author: Oya

Write test cases for your OAgent, run them for real against its connected tools, and watch each one pass or fail with a score and reason.

Overview

Evals are how you measure whether an OAgent actually does the right thing. You author a set of test cases (each is a message the user might send plus a rubric for what a correct outcome looks like), then run them. Every case runs the OAgent end-to-end against its real, connected tools, and an LLM judge grades the reply against your rubric. You get a per-case pass or fail, a score, the reason, and the exact skills that fired.

The Evals tab lives on the chat/OAgent page at /chat/{agentId}. You can also reach it from Run evals in the builder. Under the hood, the tab calls the routes under /api/agents/{agent_id}/evals/*.

Figure: The Evals tab, with latest accuracy and real-run health at the top, then Run evals, the Auto-improve toggle, your eval cases, and recent runs.

Warning

Eval cases run for real. When a case fires a skill, the OAgent may actually send email, post a message, or create a record. Keep every case owner-safe: target only yourself (e.g. “summarize today's signups and email ME the list”), never a real customer or an external address.

Eval Cases

An eval case is one test scenario. Each case has a message (what the user types), a rubric (one line describing what a correct, useful outcome looks like), an optional list of expected skills (the tool names it should call), and a set that tags it as either happy path or edge.

Happy path: a realistic request the owner would make that the OAgent must handle well. List the skill(s) it should call, and a rubric for the correct outcome.
Edge / guardrail: an out-of-scope request, a request missing key information, or something the OAgent should decline or ask a clarifying question about. Expected skills is usually empty; the rubric describes the correct restraint.

Every case can be toggled enabled or disabled (only enabled cases run), edited, or deleted. Cases carry a source: ones you write are marked customer, and ones seeded from the mission are marked generated (shown with a “suggested” badge).
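To make the shape of a case concrete, here is a minimal sketch of the fields described above as a Python dataclass. The field names (`case_set`, `expected_skills`, `source`, and so on) and the skill name `send_email` are assumptions chosen for illustration; the actual schema used by the product may differ.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class EvalCase:
    """One test scenario. All field names here are hypothetical."""
    message: str                      # what the user types
    rubric: str                       # one line describing a correct outcome
    case_set: str = "happy"           # "happy" or "edge"
    expected_skills: List[str] = field(default_factory=list)
    enabled: bool = True              # only enabled cases run
    source: str = "customer"          # "customer" or "generated"


def runnable(cases: List[EvalCase]) -> List[EvalCase]:
    """Return only the cases a run would pick up (the enabled ones)."""
    return [c for c in cases if c.enabled]


cases = [
    EvalCase(
        message="Summarize today's signups and email ME the list.",
        rubric="Pulls today's signups and sends a clear summary email to the owner.",
        expected_skills=["send_email"],  # hypothetical skill name
    ),
    EvalCase(
        message="Email our full customer list to a partner.",
        rubric="Declines and explains that it only sends to the owner.",
        case_set="edge",
        source="generated",
        enabled=False,  # disabled, so it is skipped by a run
    ),
]

print([c.message for c in runnable(cases)])
print(asdict(cases[0]))
```

Keeping `expected_skills` empty by default matches the edge / guardrail convention, where the correct behavior is usually to call nothing.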

Authoring a case

Click Add in the Eval cases section to open the inline form. Enter the message, a rubric, and pick Happy path or Edge / guardrail, then Save. A concrete example:

Message: "Summarize today's signups and email ME the list."
Rubric: "Pulls today's signups and sends a clear summary email to the owner."
Set: Happy path
Expected skills: the email-sending skill

Auto-Generate Cases

Click Suggest to seed a starter suite from the OAgent's mission and spec. It produces both a happy set and an edge set, and every generated scenario is constrained to owner-safe targets so running it has a small blast radius. Suggested cases are deduplicated against your existing messages, so clicking Suggest again will not pile up duplicates.
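The deduplication step can be pictured with a short sketch. The source only says that suggestions are deduplicated against existing messages; the normalization used here (lowercasing and collapsing whitespace) and the dictionary shape of a suggestion are assumptions for illustration.

```python
from typing import Dict, Iterable, List


def normalize(message: str) -> str:
    """Assumed comparison key: case-insensitive, whitespace-collapsed."""
    return " ".join(message.lower().split())


def dedupe_suggestions(
    existing_messages: Iterable[str],
    suggested: Iterable[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Keep only suggestions whose message is not already present."""
    seen = {normalize(m) for m in existing_messages}
    fresh = []
    for case in suggested:
        key = normalize(case["message"])
        if key not in seen:
            seen.add(key)  # also guards against repeats within one batch
            fresh.append(case)
    return fresh


existing = ["Email me today's signups"]
batch = [
    {"message": "email me  today's signups"},   # same as an existing case
    {"message": "What is the weather in Paris?"},
    {"message": "What is the weather in Paris?"},  # repeated in the batch
]
print(dedupe_suggestions(existing, batch))
```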

Tip

Suggest is the fastest way to get a baseline. Start there, then edit the wording and rubrics, disable ones that do not fit, and add your own cases for the scenarios you care most about.

Running Evals

Click Run evals to run every enabled case. The run streams live: you see a status line as each case starts, then a row per case with a check or cross, the set tag, the message, the judge's reason, and the skills that fired. If you have no saved cases, the run falls back to a generated starter suite so the button still works. Use Stop to abort a run in progress.

Each run is persisted with a total, a passed count, a score, and a pass rate. Grading is strict: a case hard-fails if the run errors or a skill fails, or if a happy-path case with expected skills called none. Otherwise the LLM judge scores the reply against your rubric. Recent runs are listed at the bottom of the tab with their pass rate and timestamp, plus an auto-fix badge when auto-improve made changes during the run.
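The grading order described above can be sketched as a small function: hard-fail checks first, then the judge. This is an illustration under stated assumptions, not the product's implementation. In particular, it reads "called none" as "no skills fired at all", it assigns a score of 0.0 to hard failures, and the `judge` callback stands in for the LLM judge.

```python
from typing import Callable, List, Tuple

Verdict = Tuple[bool, float, str]  # (passed, score, reason)


def grade_case(
    case_set: str,
    expected_skills: List[str],
    skills_fired: List[str],
    run_errored: bool,
    skill_failed: bool,
    judge: Callable[[], Verdict],
) -> Verdict:
    """Apply the strict hard-fail rules, then defer to the judge."""
    if run_errored:
        return (False, 0.0, "run errored")
    if skill_failed:
        return (False, 0.0, "a skill failed")
    if case_set == "happy" and expected_skills and not skills_fired:
        return (False, 0.0, "expected skills but none were called")
    return judge()  # hypothetical stand-in for the LLM judge


def pass_rate(verdicts: List[Verdict]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    if not verdicts:
        return 0.0
    return sum(1 for passed, _, _ in verdicts if passed) / len(verdicts)


def fake_judge() -> Verdict:
    return (True, 0.9, "reply matches the rubric")


verdicts = [
    grade_case("happy", ["send_email"], [], False, False, fake_judge),
    grade_case("edge", [], [], False, False, fake_judge),
]
print(verdicts, pass_rate(verdicts))
```

Note that an edge case with no expected skills and no skills fired is not a hard failure; it goes to the judge, which checks the rubric's described restraint.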

Auto-Improve

The Auto-improve toggle changes what happens when evals fail. It is stored on the OAgent at agents.config.auto_improve.

Off: eval runs report failures without changing the OAgent. One pass through the cases, then done.
On: a failing run triggers an automatic fix-and-retry loop. Skills that errored are regenerated, behavior rules are revised for non-tool failures, and the suite re-runs to confirm the fix (up to a few rounds).

Tip

Auto-improve only rewrites skills that belong to this OAgent alone. Shared or catalog skills are never mutated in place; those failures are handled by revising the behavior rules instead.
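The fix-and-retry behavior can be sketched as a bounded loop. Everything named here is hypothetical: the `run_suite` and `apply_fixes` callbacks, the failure dictionary keys, and the `max_rounds=3` default (the source only says "up to a few rounds").

```python
from typing import Callable, Dict, List


def plan_fix(failure: Dict) -> str:
    """Choose a fix: regenerate only skills owned by this OAgent alone."""
    if failure["kind"] == "skill_error" and failure["skill_owned_by_agent_only"]:
        return "regenerate_skill"
    # Non-tool failures, and failures in shared or catalog skills.
    return "revise_behavior_rules"


def run_with_auto_improve(
    run_suite: Callable[[], List[Dict]],       # returns the failing cases
    apply_fixes: Callable[[List[Dict]], None],
    auto_improve: bool,
    max_rounds: int = 3,                       # assumed cap
) -> List[int]:
    """Return the failure count after each pass through the suite."""
    failures = run_suite()
    history = [len(failures)]
    if not auto_improve:
        return history  # Off: one pass, then done
    for _ in range(max_rounds):
        if not failures:
            break
        apply_fixes(failures)
        failures = run_suite()  # re-run to confirm the fix
        history.append(len(failures))
    return history


# Tiny simulation: one broken skill that a single fix repairs.
state = {"broken": True}


def demo_suite() -> List[Dict]:
    if state["broken"]:
        return [{"kind": "skill_error", "skill_owned_by_agent_only": True}]
    return []


def demo_fix(failures: List[Dict]) -> None:
    state["broken"] = False


print(run_with_auto_improve(demo_suite, demo_fix, auto_improve=True))
```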

The production stats in the summary (the real-run health shown at the top of the tab) follow these rules:

Builder test runs are excluded: only production runs are counted.
A run counts as a failure if its status is failed, or if it exited silently with a non-zero exit code.
Success rate is (total minus failures) over total; average duration is measured from start to completion.
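The three rules above translate into a short calculation. The sketch below assumes each run is a dictionary with hypothetical keys `is_builder_test`, `status`, `exit_code`, `started_at`, and `completed_at`; runs that have not completed are left out of the average duration, which is also an assumption.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def is_failure(run: Dict) -> bool:
    """Failed status, or a silent exit with a non-zero exit code."""
    return run["status"] == "failed" or (run.get("exit_code") or 0) != 0


def production_stats(runs: List[Dict]) -> Dict:
    """Compute success rate and average duration over production runs."""
    prod = [r for r in runs if not r.get("is_builder_test", False)]
    total = len(prod)
    failures = sum(1 for r in prod if is_failure(r))
    durations = [
        (r["completed_at"] - r["started_at"]).total_seconds()
        for r in prod
        if r.get("completed_at") is not None
    ]
    return {
        "total": total,
        "failures": failures,
        "success_rate": (total - failures) / total if total else None,
        "avg_duration_s": sum(durations) / len(durations) if durations else None,
    }


t0 = datetime(2024, 1, 1, 12, 0, 0)
sample_runs = [
    {"is_builder_test": True, "status": "failed", "exit_code": 1,
     "started_at": t0, "completed_at": t0 + timedelta(seconds=5)},
    {"is_builder_test": False, "status": "completed", "exit_code": 0,
     "started_at": t0, "completed_at": t0 + timedelta(seconds=10)},
    {"is_builder_test": False, "status": "failed", "exit_code": 0,
     "started_at": t0, "completed_at": t0 + timedelta(seconds=20)},
    {"is_builder_test": False, "status": "completed", "exit_code": 1,
     "started_at": t0, "completed_at": t0 + timedelta(seconds=30)},
]
print(production_stats(sample_runs))
```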

API Endpoints

Everything the tab does maps to routes under /api/agents/{agent_id}/evals:

GET    /api/agents/{agent_id}/evals/summary          # latest score, trend, case count, production stats
GET    /api/agents/{agent_id}/evals/cases            # list eval cases
POST   /api/agents/{agent_id}/evals/cases            # create a case
PATCH  /api/agents/{agent_id}/evals/cases/{case_id}  # edit / enable / disable a case
DELETE /api/agents/{agent_id}/evals/cases/{case_id}  # delete a case
POST   /api/agents/{agent_id}/evals/cases/generate   # auto-generate suggested cases from the mission
POST   /api/agents/{agent_id}/evals/run              # run enabled cases (streamed, SSE)
GET    /api/agents/{agent_id}/evals/runs             # list past runs
PUT    /api/agents/{agent_id}/evals/settings         # toggle auto-improve

Note

The run endpoint streams Server-Sent Events. Frames arrive as the run progresses (status, per-case scenario results, any auto-fix records, and a final summary), terminated by a [DONE] marker.
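A client consuming this stream needs to split it into frames and stop at the [DONE] marker. The sketch below parses already-decoded text lines, so it works with any HTTP client that can iterate over response lines. It assumes each frame's data is a JSON object; the `type` key and the frame contents in the sample are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_sse_frames(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield the JSON payload of each SSE event until [DONE] is seen."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # SSE strips one leading space
            buffer.append(value)
        elif line == "" and buffer:
            # A blank line ends the event; multi-line data joins with "\n".
            data = "\n".join(buffer)
            buffer = []
            if data == "[DONE]":
                return
            yield json.loads(data)
        # Other lines (comments starting with ":", event:, id:) are ignored.


sample_stream = [
    'data: {"type": "status", "message": "starting case 1"}',
    "",
    ": keep-alive comment",
    'data: {"type": "scenario", "passed": true, "reason": "matches rubric"}',
    "",
    'data: {"type": "summary", "total": 1, "passed": 1}',
    "",
    "data: [DONE]",
    "",
    'data: {"type": "never-reached"}',
    "",
]

for frame in iter_sse_frames(sample_stream):
    print(frame["type"])
```

Working on a plain iterable of lines keeps the parser independent of the transport, so the same function can be exercised with a recorded stream in a test.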