Eval Dataset Schema and Grading Contract¶
This page continues three nearby topics:
- Chapter 13. Offline Evals, Online Evals, and Regression Gates
- Trace Schema and Event Catalog
- Evidence Spine: From Request to Rollout Judgment
and connects them to the runnable reference package, agent_runtime_ref.
If the trace schema page answers “how do we describe what happened inside a run?”, this page answers “how do we describe what we expect from the system as an eval artifact?”
Why an explicit eval dataset schema matters¶
Many teams say they “have evals”, but in practice that often means:
- a spreadsheet with a few manual examples;
- a set of unrelated prompt cases;
- JSON without a stable structure;
- a mix of ground truth, expectations, and reviewer comments in one field.
That is a problem for three reasons:
- comparisons between versions become blurry;
- regression gates are hard to automate;
- trace grading and dataset grading live in separate worlds.
That is why it helps to treat an eval dataset as a contract.
Minimal eval artifact shape¶
For agent systems, it is useful for each dataset item to contain at least:
- scenario_id
- labels
- user_inputs
- expected_outcomes
- risk_class
A minimal example looks like this:
{
"scenario_id": "support_ticket",
"labels": ["write_path", "approval_required", "ticketing"],
"user_inputs": [
"Please create a ticket for this onboarding issue."
],
"expected_outcomes": {
"latest_status": "success",
"approval_wait_runs": 1,
"required_output_substrings": [
"waiting for human approval"
]
},
"risk_class": "high"
}
That is already much more useful than “here is an example prompt”.
Why labels are not enough without expected outcomes¶
Labels help you group scenarios:
- retrieval
- approval
- memory
- safety
- multi-turn
But labels alone do not tell you what successful behavior actually means.
That is why an eval dataset should usually separate:
- labels as the scenario class;
- expected_outcomes as the desired result;
- grading_rules as the check logic;
- verifier_outputs as the structured grading result, including verifier identity and contract version.
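To make that separation checkable, a small helper can reject items where structured fields have collapsed into free text. The helper below is only a sketch and is not part of agent_runtime_ref; the field names follow the list above.

# Hypothetical helper, not part of agent_runtime_ref: checks that structured
# fields stay structured instead of collapsing into reviewer prose.
REQUIRED_FIELDS = {"scenario_id", "labels", "expected_outcomes"}
STRUCTURED_FIELDS = ("labels", "expected_outcomes", "grading_rules", "verifier_outputs")

def check_separation(item: dict) -> list[str]:
    """Return a list of problems; an empty list means the item keeps concerns separate."""
    problems = []
    missing = REQUIRED_FIELDS - set(item)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if not isinstance(item.get("labels", []), list):
        problems.append("labels should be a list of scenario classes")
    if not isinstance(item.get("expected_outcomes", {}), dict):
        problems.append("expected_outcomes should be a structured mapping")
    for field in STRUCTURED_FIELDS:
        if isinstance(item.get(field), str):
            problems.append(f"{field} is free text; expected structured data")
    return problems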
What a grading contract is¶
A grading contract exists to remove ambiguity between “an example” and “a pass criterion”.
In practice, that means a scenario should explicitly define:
- which fields are evaluated;
- which check types apply;
- what counts as pass/fail;
- what may be treated as a warning versus a blocking failure.
A good grading contract answers:
“Would a different reviewer or pipeline reach the same conclusion on the same scenario tomorrow?”
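One way to make that answer reproducible is to keep each check as data rather than prose. The dataclasses below are only an illustrative shape, using the rule fields (type, expected, blocking) that appear in the examples on this page.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class GradingRule:
    """One check inside a grading contract (illustrative shape, not a fixed API)."""
    type: str              # e.g. "status_equals" or "contains_substring"
    expected: object       # the value or condition the run must satisfy
    blocking: bool = True  # blocking failure vs. warning-only

@dataclass
class GradingContract:
    """All checks for one scenario, so two reviewers evaluate the same thing."""
    scenario_id: str
    rules: list[GradingRule] = field(default_factory=list)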
Useful grading rule types¶
For reference-grade agent evals, it helps to distinguish at least these rules:
- status_equals
- contains_substring
- max_tool_calls
- approval_required
- policy_violation_absent
- memory_write_absent
- process_score_present
- outcome_score_present
- failure_attribution_valid
- failed_run_traceable
- sandbox_profile_review
failed_run_traceable becomes important once release review expects failed-run drills. It checks that a degraded path did not merely fail, but failed in a way the team can still inspect through status, a concrete failure reason such as failure_reason, trace linkage, and governed release identity.
sandbox_profile_review matters for sandbox-backed paths: it checks that workspace materialization, shell/filesystem permissions, network/secrets posture, and snapshot/resume policy were explicitly represented as reviewable evidence instead of remaining implicit runtime settings.
That means the grading contract should cover not only the final answer text but also system behavior.
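A few of these rule types can be sketched as a single dispatch function over a simplified run record. The run-record keys used here (status, output_text, tool_calls, trace_id, failure_reason) are assumptions for illustration, not the reference runtime's exact trace schema.

def evaluate_rule(rule: dict, run: dict) -> bool:
    """Evaluate one grading rule against a simplified run record (illustrative only)."""
    kind, expected = rule["type"], rule.get("expected")
    if kind == "status_equals":
        return run.get("status") == expected
    if kind == "contains_substring":
        return expected in run.get("output_text", "")
    if kind == "max_tool_calls":
        return len(run.get("tool_calls", [])) <= expected
    if kind == "failed_run_traceable":
        # A failed run counts as traceable only if both trace linkage and a
        # concrete failure reason survived into the run record.
        traceable = bool(run.get("trace_id")) and bool(run.get("failure_reason"))
        return traceable == expected
    raise ValueError(f"unknown grading rule type: {kind}")

# Example: a failed run that is still traceable passes the failed_run_traceable rule.
run = {"status": "failed", "failure_reason": "tool_timeout", "trace_id": "trace_123",
       "output_text": "stopped: tool_timeout", "tool_calls": ["create_ticket"]}
assert evaluate_rule({"type": "failed_run_traceable", "expected": True}, run)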
How this connects to traces¶
A useful practical model looks like this:
- the trace schema describes actual run behavior;
- the eval dataset schema describes expected behavior;
- the grading contract maps one to the other.
This is the point where observability stops being only a way to look backward and becomes part of release decisions.
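In code, that mapping step can be as small as comparing expected_outcomes against a trace summary. The sketch below assumes a hypothetical trace summary dict with a final_output field; only required_output_substrings needs special handling.

def compare_to_expectations(expected_outcomes: dict, trace_summary: dict) -> dict:
    """Map expected_outcomes onto observed trace fields (hypothetical field names)."""
    results = {}
    for key, expected in expected_outcomes.items():
        if key == "required_output_substrings":
            text = trace_summary.get("final_output", "")
            results[key] = all(s in text for s in expected)
        else:
            results[key] = trace_summary.get(key) == expected
    return results

observed = {"latest_status": "success", "approval_wait_runs": 1,
            "final_output": "Ticket drafted, waiting for human approval."}
expected = {"latest_status": "success", "approval_wait_runs": 1,
            "required_output_substrings": ["waiting for human approval"]}
print(compare_to_expectations(expected, observed))  # every value should be True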
What the reference runtime already supports¶
In agent_runtime_ref, the bundled eval export command already produces a small structured artifact with:
- multiple session scenarios;
- labels;
- expected_outcomes;
- a failed-run drill scenario that preserves failed status and failure_reason in session export and eval expectations.
The export contract is intentionally concrete:
- the default dataset_name is agent-runtime-ref-eval-seed;
- the top-level summary includes session_count, session_ids, run_count, failed_runs, traceable_failed_runs, trace_ids, failed_trace_ids, idempotency_keys, approval_ids, approval_capability_names, pending_approval_ids, pending_approval_capability_names, approval_status_counts, and latest_failure_reason;
- approval-backed scenarios also carry approval_status_counts in expected_outcomes;
- the built-in scenarios include:
    - failed_run_timeout, with a duplicate_ticket_eval_passed label, max_ticket_side_effects: 1, and a blocking duplicate_ticket_guard grading rule;
    - profile_memory, with memory_read, profile_lookup, and grounded_answer labels;
    - mixed_session, with multi_run, approval_then_memory, and session_evals labels plus required_run_count as an expected outcome;
    - support_ticket, with a sandbox_profile_review label, a sandbox_profile_reviewed expected outcome, and a blocking sandbox_profile_review grading rule.
Session eval config validation also keeps malformed eval specs separate from failed eval results, rejecting them with explicit messages: Session eval specs must be a mapping, Session eval spec must be a mapping, Session eval spec key must be a string, Session eval spec key must not be empty, and Session eval spec keys must be unique.
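Because the summary field names are stable, a regression gate can read them directly. The gate policy below is a sketch of one possible check, not logic that ships with the reference export.

def gate_on_export_summary(summary: dict) -> list[str]:
    """Return blocking findings from an exported eval summary (illustrative policy)."""
    findings = []
    # Every failed run must stay traceable, otherwise failure drills lose their evidence.
    if summary.get("failed_runs", 0) != summary.get("traceable_failed_runs", 0):
        findings.append("some failed runs are not traceable")
    # Pending approvals should be accounted for before rollout review.
    if summary.get("pending_approval_ids"):
        findings.append(f"pending approvals: {summary['pending_approval_ids']}")
    return findings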
Eval gate for the duplicate-ticket thread
For the running support-triage case, a dedicated eval should reproduce a timeout after create_ticket, require preserved trace_id and idempotency_key, expect exactly one ticket side effect or a side_effect_unknown stop, and block rollout if a new prompt/model/adapter version blindly retries and creates a second ticket.
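That gate can be expressed as one check over the failed-run record. The field names ticket_side_effects and stop_reason are illustrative; trace_id, idempotency_key, and side_effect_unknown follow the thread above.

def duplicate_ticket_gate(run: dict) -> bool:
    """Block rollout unless the timeout path keeps side effects and evidence under control."""
    evidence_ok = bool(run.get("trace_id")) and bool(run.get("idempotency_key"))
    side_effects = run.get("ticket_side_effects", 0)
    # Either exactly one ticket was created, or the run stopped because the
    # side effect could not be confirmed, instead of blindly retrying.
    safe_stop = side_effects == 1 or run.get("stop_reason") == "side_effect_unknown"
    return evidence_ok and safe_stop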
It is not yet a full industrial eval framework, but it is already a reasonable seed for:
- regression grading;
- scenario comparison;
- rollout review;
- manual expansion of the eval set.
What a production dataset schema should add¶
Once the system becomes more serious, it is useful to extend the dataset schema with:
- dataset_version
- scenario_owner
- source_trace_ids
- grader_type
- blocking
- notes_for_review
- verifier_outputs
- failure_attribution
- verifier_id
- verifier_contract_version
- verifier_evidence_refs
- sandbox_profile_contract
- workspace_manifest_ref
- snapshot_policy
That is when the eval artifact starts behaving like part of release discipline, not just temporary JSON.
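Concretely, an extended item might look like the sketch below. The values are placeholders and the field set simply mirrors the list above; none of this is a fixed schema.

# Sketch of an extended dataset item; values are placeholders, not a fixed schema.
extended_item = {
    "dataset_version": "2025.01",
    "scenario_id": "support_ticket",
    "scenario_owner": "support-platform-team",
    "source_trace_ids": ["trace_123"],
    "labels": ["write_path", "approval_required", "ticketing"],
    "expected_outcomes": {"latest_status": "success"},
    "grading_rules": [
        {"type": "approval_required", "expected": True,
         "grader_type": "deterministic", "blocking": True},
    ],
    "verifier_outputs": {
        "verifier_id": "fara-process-review",
        "verifier_contract_version": "verifier-v2",
        "failure_attribution": None,
        "verifier_evidence_refs": ["trace:trace_123"],
    },
    "sandbox_profile_contract": "sandbox-profile-v1",
    "workspace_manifest_ref": "workspace/manifest.json",  # placeholder path
    "snapshot_policy": "required_on_completion",
    "notes_for_review": "Seed scenario; expand with incident-derived cases.",
}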
Example grading contract¶
Here is a workable skeleton for a failed-run drill scenario:
scenario_id: failed_run_timeout
labels:
  - failed_run
  - tool_timeout
  - failure_drill
grading_rules:
  - type: status_equals
    expected: failed
    blocking: true
  - type: contains_substring
    expected: tool_timeout
    blocking: true
  - type: failed_run_traceable
    expected: true
    blocking: true
  - type: sandbox_profile_review
    expected:
      sandbox_profile_contract: sandbox-profile-v1
      workspace_entries_reviewed: true
      permissions_profile: restricted-shell-network-denied
      network_secrets_posture: network:denied,secrets:none
      snapshot_policy: required_on_completion
    blocking: true
verifier_outputs:
  verifier_id: fara-process-review
  verifier_contract_version: verifier-v2
  process_score: 0.92
  outcome_score: 0.35
  failure_attribution: uncontrollable_environment
  verifier_evidence_refs:
    - trace:trace_123
    - screenshot:step_7
The point is that the contract evaluates not only the final text, but also the correct operational shape of behavior, including whether the concrete failed condition remains visible enough for later review.
This becomes especially important for long-horizon agents, where a binary pass/fail verdict often hides the difference between correct behavior with a blocked outcome and unsafe behavior that happened to end in nominal success.
Why multi-run sessions matter¶
For agent systems, an eval item often needs to describe not one request, but a short related sequence.
For example:
- the user asks to create a ticket;
- then asks what the agent remembers about preferences;
- then asks for the next step.
If the dataset cannot describe such a sequence, you can test single-turn behavior fairly well, but session behavior much less well.
That is why session exports and eval dataset exports should be designed together.
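A session-level dataset item can carry that sequence directly. In the sketch below, user_inputs and expected_outcomes follow the minimal shape from earlier, while per_run_expectations is a hypothetical extension for per-turn checks.

multi_run_item = {
    "scenario_id": "mixed_session",
    "labels": ["multi_run", "approval_then_memory", "session_evals"],
    "user_inputs": [
        "Please create a ticket for this onboarding issue.",
        "What do you remember about my preferences?",
        "What is the next step?",
    ],
    "expected_outcomes": {"required_run_count": 3, "latest_status": "success"},
    # Hypothetical extension: per-run expectations keyed by turn index.
    "per_run_expectations": {
        0: {"approval_required": True},
        1: {"memory_read": True},
        2: {"required_output_substrings": ["next step"]},
    },
}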
What not to do¶
Several mistakes are very common:
- mixing scenario metadata and grading logic in one text field;
- keeping only happy-path cases;
- not declaring expected outcomes explicitly;
- grading only the final answer and ignoring policy or tool behavior;
- not versioning the dataset;
- not linking dataset items to trace evidence or incident history;
- collapsing verifier output into a single weak verdict with no process/outcome split or failure attribution;
- requiring sandbox_profile_review in rollout, but having no grading rule that checks workspace, permissions, and snapshot/resume evidence.
That makes eval culture fragile.
What to do right away¶
Start with this short list and mark every "no" explicitly:
- Does every scenario have a stable scenario_id?
- Are labels separate from expected outcomes?
- Do you have grading rules, not just reviewer prose?
- Can you evaluate behavior, not only text?
- Can the verifier output separate process_score, outcome_score, and failure_attribution?
- Can you tell which verifier identity and contract version produced that grading output?
- Is there a dedicated rule for sandbox-backed paths that checks sandbox profile contract, workspace entries, permissions, and snapshot/resume evidence?
- Do you support multi-run sessions?
- Do you have dataset versioning and ownership?
If several answers are “no,” you probably have examples, but not yet a proper eval dataset schema.