Trace Schema and Event Catalog¶
This page exists to bridge one practical gap: moving from a high-level observability discussion to an event structure that can actually be exported, inspected, and reused in eval workflows.
It connects these parts of the book:
- Chapter 11. Traces, Spans, and Structured Events
- Chapter 13. Offline Evals, Online Evals, and Regression Gates
- Evidence Spine: From Request to Rollout Judgment
And the runnable package: `agent_runtime_ref`.
Why an explicit trace schema matters¶
Without an explicit trace schema, teams usually end up in one of two bad states:
- events exist, but they are just ad hoc JSON blobs;
- events help with debugging, but they are weak inputs for grading, audit, or incident review.
That is why it is useful to separate, even in a small runtime:
- the trace envelope;
- the event catalog;
- payload contracts;
- verifier contract identity;
- verifier evidence linkage.
Minimal trace envelope¶
`agent_runtime_ref` currently uses an intentionally compact envelope:

```json
{
  "event_type": "run_start",
  "trace_id": "trace-demo-001",
  "payload": {
    "agent_id": "support-triage-ref",
    "tenant_id": "tenant-acme",
    "principal_id": "user-42",
    "session_id": "session-demo-001",
    "user_input": "Please create a ticket for this onboarding issue."
  }
}
```
The minimum useful field set is:
- `event_type`
- `trace_id`
- `payload`
In production, this usually needs to grow to include:
- `session_id`
- `agent_id`
- `tenant_id`
- `principal_id`
- `event_ts`
- `span_id`
- `parent_span_id`
In the reference runtime, some of those fields still live inside `payload` to keep the structure small and easy to inspect. At the same time, serialized events now carry `schema_version` and `redacted_fields`, and the export path supports redaction for selected fields. The event loader validates this shape explicitly and rejects malformed input with messages such as:

- Telemetry path must be a string or path-like object
- Telemetry event line is not valid JSON: {line_number}
- Telemetry event must be a mapping
- Telemetry event is missing required field: {required_field}
- Telemetry event field must be a string: {field}
- Telemetry event field must not be empty: {field}
- Telemetry schema version is not supported: {schema_version}
- Telemetry event payload must be a mapping
- Telemetry event payload key must be a string
- Telemetry event payload key must not be empty
- Telemetry event payload keys must be unique
- Telemetry event payload value must be a string: {payload_key}
- Telemetry event redacted_fields must be a tuple
- Telemetry event redacted_fields must be a list
- Telemetry event redacted_fields entries must be strings
- Telemetry redact field must not be empty
- Telemetry redact field is not present in events: {missing}
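The validation style above is easy to reproduce. Here is a minimal sketch of a JSONL event-line loader that enforces the same kind of rules; it mirrors the error messages listed above but is not the reference implementation, and the required-field set is an assumption based on the minimal envelope:

```python
import json

# Assumed minimal required string fields, per the envelope description above.
REQUIRED_FIELDS = ("event_type", "trace_id")


def load_event_line(line: str, line_number: int) -> dict:
    """Parse and validate one JSONL telemetry event line."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        raise ValueError(f"Telemetry event line is not valid JSON: {line_number}")
    if not isinstance(event, dict):
        raise ValueError("Telemetry event must be a mapping")
    for field in REQUIRED_FIELDS:
        if field not in event:
            raise ValueError(f"Telemetry event is missing required field: {field}")
        if not isinstance(event[field], str):
            raise ValueError(f"Telemetry event field must be a string: {field}")
        if not event[field]:
            raise ValueError(f"Telemetry event field must not be empty: {field}")
    payload = event.get("payload", {})
    if not isinstance(payload, dict):
        raise ValueError("Telemetry event payload must be a mapping")
    for key, value in payload.items():
        if not isinstance(value, str):
            raise ValueError(f"Telemetry event payload value must be a string: {key}")
    return event
```

The point is not the specific checks but that every rule produces a named, greppable error rather than a silent skip.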
How trace and session relate¶
For agent systems, one trace is usually not enough. You almost always need a longer context:
- one `trace_id` describes one run;
- one `session_id` links multiple runs together;
- a session-level summary can already support evals, rollout review, and postmortems.
That is why the package includes:
- `inspect-trace`
- `inspect-session`
- `session-eval-summary`
- `export-session`
- `export-eval-dataset`
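Grouping runs into sessions is mechanical once the envelope is stable. A minimal sketch, assuming `session_id` lives inside the `run_start` payload as in the reference envelope above (the function name is illustrative, not part of the package):

```python
from collections import defaultdict


def group_runs_by_session(events: list[dict]) -> dict[str, list[str]]:
    """Map each session_id to the ordered list of trace_ids it contains."""
    sessions: dict[str, list[str]] = defaultdict(list)
    for event in events:
        if event.get("event_type") != "run_start":
            continue  # only run_start carries the run-level identity here
        session_id = event.get("payload", {}).get("session_id", "unknown")
        sessions[session_id].append(event["trace_id"])
    return dict(sessions)
```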
`dump-events`, `export-events`, and `inspect-trace` also keep the command response usable as a triage surface rather than just a raw JSONL dump: they expose `session_id`, `tenant_id`, `principal_id`, `agent_id`, `authorization_mode`, `delegated_principal_id`, `delegated_scope`, `status`, `result`, `output_preview`, `failure_reason`, `approval_ids`, `approval_capability_names`, `pending_approval_ids`, `pending_approval_capability_names`, `approval_status_counts`, and non-empty `idempotency_keys` before an operator has to manually scan every `approval_requested` or `tool_execution` payload. `replay-run` then reports `source_idempotency_keys` and `replay_idempotency_keys` separately, making it explicit that replay is a new run with its own duplicate-write key rather than a silent reuse of the original write.
Trace replay validates this evidence before it can seed a new run; invalid input fails with messages such as:

- Trace ID request must be a string
- Trace ID not found in event file: {requested_trace_id}
- Trace file does not contain any trace IDs
- Trace file contains multiple trace IDs; pass --trace-id explicitly
- Trace file does not contain a run_start event
- Trace file contains multiple run_start events
- Trace run_start event is missing replay fields: {missing_keys}
- Trace run_start event has redacted replay fields: {redacted_keys}
- Trace run_start replay field must be a string: {field}
- Trace run_start replay field must not be empty: {field}
Reference runtime event catalog¶
Below is the current minimal event catalog.
| Event type | When it appears | Why it matters |
|---|---|---|
| `run_start` | at the beginning of a run | captures input and actor identity |
| `policy_precheck` | immediately after run admission | records the policy precheck action, reason, and policy ID |
| `retrieval` | when memory context is fetched | records the source and number of retrieved records |
| `context_layers_built` | after context assembly | shows which context layers actually entered the run; internally `RunContext` keeps `retrieved_context` and `retrieved_records` before any `tool_request` is handled |
| `tool_policy_decision` | before tool execution | records the policy gate and allow/deny/approval reason |
| `tool_execution` | after a capability call or approval handoff | records capability status and tool-principal context |
| `approval_requested` | on a high-risk write path | shows that execution moved into human review |
| `sandbox_profile_reviewed` | when a sandbox-backed path is reviewed | records workspace, permissions, and snapshot/resume evidence review |
| `memory_write_decision` | before background memory persistence | records whether a candidate memory write was allowed or denied |
| `memory_persisted` | after a background write | records provenance and revision of a memory record |
| `background_compaction` | after background memory maintenance | records tenant-level compaction results |
| `background_update_scheduled` | after background work is queued or completed | records background update status for the run |
| `run_failed` | when a tool failure becomes the run outcome | preserves explicit failed-run traceability |
| `run_complete` | at the end of a run | closes the run-level outcome |
| `span` | around individual calls | provides simple latency and status telemetry |
This is not meant to be a universal, perfect catalog. It is a compact operational vocabulary that is already enough to support:
- trace inspection;
- regression seeds;
- session summaries;
- incident review.

A stronger production vocabulary should also make room for verifier-aware evidence, so traces can later explain not only what the runtime did, but also what evidence a verifier used to judge process quality, outcome quality, or failure attribution.
Why payload contracts matter¶
The problem is not that events are plain. The problem is that without contracts, payloads quickly turn into garbage.
For each event type, decide up front:
- which fields are required;
- which fields are stable;
- which fields can be added without breaking downstream tooling;
- which fields matter for grading;
- which fields matter for audit.
For example, `tool_policy_decision` should usually include at least:
- `capability_name`
- `decision`
- `reason`
- `risk_tier`
- `tool_principal`
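A payload contract can stay this simple and still be enforceable. A minimal sketch, using the required-field sets named in this section (the contract table and function are hypothetical, not part of the package):

```python
# Hypothetical per-event-type required-field contracts, taken from this section.
PAYLOAD_CONTRACTS: dict[str, set[str]] = {
    "tool_policy_decision": {
        "capability_name", "decision", "reason", "risk_tier", "tool_principal",
    },
    "memory_persisted": {"memory_class", "kind", "provenance", "revision"},
}


def check_payload_contract(event: dict) -> list[str]:
    """Return required payload fields that are missing for this event type."""
    required = PAYLOAD_CONTRACTS.get(event.get("event_type"), set())
    payload = event.get("payload", {})
    return sorted(required - payload.keys())
```

Running this check at export time, not just at write time, catches contract drift before a trace reaches grading or audit.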
Trace for the duplicate-ticket thread
In the support-triage case, `tool_policy_decision`, `approval_requested`, `tool_execution`, and the final outcome should be tied together by one `trace_id`, `session_id`, `approval_id`, `tool_principal`, and `idempotency_key`. If `create_ticket` times out and the side-effect status is unknown, the trace should show `side_effect_unknown` instead of masking the run as successful or repeating the write without reconciliation.
For a sandbox-backed run, reserve fields that link the trace to the execution boundary:
- `sandbox_session_id`
- `sandbox_manifest_version`
- `sandbox_permissions_profile`
- `snapshot_id`
- `workspace_manifest_ref`
If rollout or eval requires sandbox_profile_review, the trace should also be able to point to review evidence, not only state fields:
- `sandbox_profile_contract`
- `workspace_entries_reviewed`
- `permissions_profile`
- `network_secrets_posture`
- `snapshot_policy`
- `reviewed_by`
- `review_evidence_refs`
If the system relies on verifier-aware evals, it is also useful to define an event or linked payload contract for verifier evidence, for example:
- `verifier_id`
- `verifier_contract_version`
- `process_score`
- `outcome_score`
- `failure_attribution`
- `evidence_refs`
And memory_persisted should usually include:
- `memory_class`
- `kind`
- `provenance`
- `revision`
The current reference payloads also use operational metadata fields such as `runtime_principal`, `authorization_mode`, `delegated_principal_id`, `delegated_scope`, `policy_id`, `static_items`, `session_items`, `retrieved_items`, `tool_items`, `approval_id`, `reviewer`, `capability_session_id`, `capability_session_status`, `tool_status`, `output_preview`, `memory_id`, `revision_mode`, `compacted_records`, `persisted_records`, `tool_results`, `span_name`, and `duration_ms`.

Tool request/result model validation is part of the same trace boundary. Malformed tool calls fail with:

- Tool request capability name must be a string
- Tool request capability name must not be empty
- Tool request arguments must be a mapping
- Tool request argument key must be a string
- Tool request argument key must not be empty
- Tool request argument keys must be unique
- Tool request argument value must be a string: {argument_key}

Malformed tool results fail with:

- Tool result status must be a string
- Tool result status must not be empty
- Tool result payload must be a mapping
- Tool result payload key must be a string
- Tool result payload key must not be empty
- Tool result payload keys must be unique
- Tool result payload value must be a string: {payload_key}
What the package already supports¶
You can inspect this directly:
```bash
.venv/bin/python -m agent_runtime_ref dump-events
.venv/bin/python -m agent_runtime_ref export-events --output artifacts/trace-demo.jsonl
.venv/bin/python -m agent_runtime_ref export-events --output artifacts/trace-demo.jsonl --redact-field user_input
.venv/bin/python -m agent_runtime_ref inspect-trace --input artifacts/trace-demo.jsonl
.venv/bin/python -m agent_runtime_ref export-session --output artifacts/session-demo-001.json
.venv/bin/python -m agent_runtime_ref export-eval-dataset --output artifacts/eval-dataset.json
```
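Export-time redaction like the `--redact-field user_input` flag above can be sketched in a few lines. This is an assumption about the general shape, not the package's exact behavior: the field value is replaced rather than dropped, and the field name is recorded in `redacted_fields` so downstream tools know what is missing.

```python
def redact_event(event: dict, fields: set[str]) -> dict:
    """Return a copy of the event with selected payload fields redacted.

    Redacted field names are recorded on the event so graders and auditors
    can tell the difference between "absent" and "removed on purpose".
    """
    payload = dict(event.get("payload", {}))
    redacted = []
    for field in sorted(fields):
        if field in payload:
            payload[field] = "[REDACTED]"
            redacted.append(field)
    out = dict(event)
    out["payload"] = payload
    out["redacted_fields"] = redacted
    return out
```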
That matters because the same trace vocabulary now lives in three places at once:
- in the runtime;
- in the book;
- in eval-ready artifacts.
What a production schema should add¶
The reference runtime is intentionally small, so a more mature system should quickly add:
- a timestamp on every event;
- explicit `span_id` and `parent_span_id`;
- a separate stable `run_id`;
- version fields for the schema;
- a split between display payload and machine payload;
- redaction rules for sensitive fields;
- an explicit way to link traces to verifier evidence, screenshots, or grading artifacts;
- a stable way to record which verifier contract version produced the grading output;
- sandbox state fields for runs that materialize a workspace, use shell/filesystem capabilities, or continue from a snapshot;
- an event or linked payload for `sandbox_profile_reviewed`, so rollout/eval evidence for workspace, permissions, and snapshot/resume policy is traceable.
That is what turns an event stream from debug output into a real platform artifact.
What to Do Right Away¶
Start with this short list and mark every "no" explicitly:
- Do you have a stable event catalog?
- Do you clearly separate `trace_id` and `session_id`?
- Is it clear which fields are required for each event type?
- Can you reconstruct the policy decision and tool path from a trace?
- Can you build an eval dataset from a session export?
- Can you link a trace to verifier evidence used in grading or rollout review?
- If rollout requires `sandbox_profile_review`, is there trace evidence for workspace entries, permissions, and snapshot/resume policy?
- Can you tell which verifier contract version produced that grading output?
- Do you have a plan for redaction and schema versioning?
If several answers are “no,” you probably have logging, but not yet a real trace schema.