
Trace Schema and Event Catalog

This page exists to bridge one practical gap: moving from a high-level observability discussion to an event structure that can actually be exported, inspected, and reused in eval workflows.

It connects the observability discussion elsewhere in the book with the runnable agent_runtime_ref package.

Why an explicit trace schema matters

Without an explicit trace schema, teams usually end up in one of two bad states:

  • events exist, but they are just ad hoc JSON blobs;
  • events help with debugging, but they are weak inputs for grading, audit, or incident review.

That is why it is useful to separate:

  • trace envelope
  • event catalog
  • payload contracts
  • verifier contract identity
  • verifier evidence linkage

These distinctions matter even in a small runtime.

Minimal trace envelope

agent_runtime_ref currently uses an intentionally compact envelope:

{
  "event_type": "run_start",
  "trace_id": "trace-demo-001",
  "payload": {
    "agent_id": "support-triage-ref",
    "tenant_id": "tenant-acme",
    "principal_id": "user-42",
    "session_id": "session-demo-001",
    "user_input": "Please create a ticket for this onboarding issue."
  }
}

The minimum useful field set is:

  • event_type
  • trace_id
  • payload
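The minimum field set above can be enforced with a short loader sketch. This is illustrative only, not the package's actual implementation; parse_envelope is a hypothetical name, and the error strings mirror the messages the reference loader reports.

```python
import json

REQUIRED_FIELDS = ("event_type", "trace_id", "payload")

def parse_envelope(line: str) -> dict:
    """Parse one JSONL line and enforce the minimum useful field set."""
    event = json.loads(line)
    if not isinstance(event, dict):
        raise ValueError("Telemetry event must be a mapping")
    for field in REQUIRED_FIELDS:
        if field not in event:
            raise ValueError(f"Telemetry event is missing required field: {field}")
    for field in ("event_type", "trace_id"):
        if not isinstance(event[field], str):
            raise ValueError(f"Telemetry event field must be a string: {field}")
        if not event[field]:
            raise ValueError(f"Telemetry event field must not be empty: {field}")
    if not isinstance(event["payload"], dict):
        raise ValueError("Telemetry event payload must be a mapping")
    return event

event = parse_envelope(
    '{"event_type": "run_start", "trace_id": "trace-demo-001", "payload": {}}'
)
print(event["trace_id"])  # trace-demo-001
```

The point is not the specific checks but that every consumer downstream can rely on the envelope shape without re-validating it.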

In production, this usually needs to grow to include:

  • session_id
  • agent_id
  • tenant_id
  • principal_id
  • event_ts
  • span_id
  • parent_span_id

In the reference runtime, some of those fields still live inside payload to keep the structure small and easy to inspect. At the same time, serialized events now carry schema_version and redacted_fields, and the export path supports redaction for selected fields. The event loader validates this shape explicitly and rejects malformed input with messages such as:

  • Telemetry path must be a string or path-like object
  • Telemetry event line is not valid JSON: {line_number}
  • Telemetry event must be a mapping
  • Telemetry event is missing required field: {required_field}
  • Telemetry event field must be a string: {field}
  • Telemetry event field must not be empty: {field}
  • Telemetry schema version is not supported: {schema_version}
  • Telemetry event payload must be a mapping
  • Telemetry event payload key must be a string
  • Telemetry event payload key must not be empty
  • Telemetry event payload keys must be unique
  • Telemetry event payload value must be a string: {payload_key}
  • Telemetry event redacted_fields must be a tuple
  • Telemetry event redacted_fields must be a list
  • Telemetry event redacted_fields entries must be strings
  • Telemetry redact field must not be empty
  • Telemetry redact field is not present in events: {missing}
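The redaction path can be sketched as follows. This is an assumption about one reasonable shape, not the package's actual behavior; redact_event and the "[redacted]" placeholder are illustrative.

```python
def redact_event(event: dict, redact_fields: set[str]) -> dict:
    """Return a copy of an event with selected payload values masked,
    recording which fields were redacted so consumers can tell."""
    payload = dict(event["payload"])
    hit = sorted(f for f in redact_fields if f in payload)
    for field in hit:
        payload[field] = "[redacted]"
    out = dict(event)
    out["payload"] = payload
    out["schema_version"] = event.get("schema_version", "1")
    out["redacted_fields"] = sorted(set(event.get("redacted_fields", ())) | set(hit))
    return out
```

This mirrors what a flag like --redact-field user_input has to do on export: mask the value, but keep an explicit record that masking happened.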

How trace and session relate

For agent systems, one trace is usually not enough. You almost always need a longer context:

  • one trace_id describes one run;
  • one session_id links multiple runs together;
  • a session-level summary can already support evals, rollout review, and postmortems.
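The trace-to-session relationship is simple to express in code. A minimal sketch, assuming session_id lives in the payload as it does in the reference envelope; runs_by_session is a hypothetical helper name.

```python
from collections import defaultdict

def runs_by_session(events: list[dict]) -> dict[str, list[str]]:
    """Group run-level trace IDs under their session for summary views."""
    sessions: dict[str, set[str]] = defaultdict(set)
    for event in events:
        session_id = event.get("payload", {}).get("session_id")
        if session_id:
            sessions[session_id].add(event["trace_id"])
    return {sid: sorted(tids) for sid, tids in sessions.items()}
```

A session-level eval summary starts from exactly this index: one session, the ordered list of runs inside it.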

That is why the package includes:

  • inspect-trace
  • inspect-session
  • session-eval-summary
  • export-session
  • export-eval-dataset

dump-events, export-events, and inspect-trace also keep the command response usable as a triage surface rather than just a raw JSONL dump. Before an operator has to manually scan every approval_requested or tool_execution payload, they expose:

  • session_id, tenant_id, principal_id, agent_id
  • authorization_mode, delegated_principal_id, delegated_scope
  • status, result, output_preview, failure_reason
  • approval_ids, approval_capability_names
  • pending_approval_ids, pending_approval_capability_names
  • approval_status_counts
  • non-empty idempotency_keys

replay-run then reports source_idempotency_keys and replay_idempotency_keys separately, making it explicit that replay is a new run with its own duplicate-write key rather than a silent reuse of the original write.

Trace replay validates this evidence before it can seed a new run, failing with messages such as:

  • Trace ID request must be a string
  • Trace ID not found in event file: {requested_trace_id}
  • Trace file does not contain any trace IDs
  • Trace file contains multiple trace IDs; pass --trace-id explicitly
  • Trace file does not contain a run_start event
  • Trace file contains multiple run_start events
  • Trace run_start event is missing replay fields: {missing_keys}
  • Trace run_start event has redacted replay fields: {redacted_keys}
  • Trace run_start replay field must be a string: {field}
  • Trace run_start replay field must not be empty: {field}

Reference runtime event catalog

Below is the current minimal event catalog.

Each entry lists the event type, when it appears, and why it matters:

  • run_start (at the beginning of a run): captures input and actor identity
  • policy_precheck (immediately after run admission): records the policy precheck action, reason, and policy ID
  • retrieval (when memory context is fetched): records the source and number of retrieved records
  • context_layers_built (after context assembly): shows which context layers actually entered the run; internally RunContext keeps retrieved_context and retrieved_records before any tool_request is handled
  • tool_policy_decision (before tool execution): records the policy gate and allow/deny/approval reason
  • tool_execution (after a capability call or approval handoff): records capability status and tool-principal context
  • approval_requested (on a high-risk write path): shows that execution moved into human review
  • sandbox_profile_reviewed (when a sandbox-backed path is reviewed): records workspace, permissions, and snapshot/resume evidence review
  • memory_write_decision (before background memory persistence): records whether a candidate memory write was allowed or denied
  • memory_persisted (after a background write): records provenance and revision of a memory record
  • background_compaction (after background memory maintenance): records tenant-level compaction results
  • background_update_scheduled (after background work is queued or completed): records background update status for the run
  • run_failed (when a tool failure becomes the run outcome): preserves explicit failed-run traceability
  • run_complete (at the end of a run): closes the run-level outcome
  • span (around individual calls): provides simple latency and status telemetry
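A catalog this small is worth pinning down as a constant, so a typo'd event type fails loudly instead of silently creating a new category. A minimal sketch; the names come from the catalog above, the helper is hypothetical.

```python
EVENT_CATALOG = frozenset({
    "run_start", "policy_precheck", "retrieval", "context_layers_built",
    "tool_policy_decision", "tool_execution", "approval_requested",
    "sandbox_profile_reviewed", "memory_write_decision", "memory_persisted",
    "background_compaction", "background_update_scheduled",
    "run_failed", "run_complete", "span",
})

def require_known_event_type(event_type: str) -> None:
    """Fail fast instead of silently accepting an unknown event type."""
    if event_type not in EVENT_CATALOG:
        raise ValueError(f"Unknown event type: {event_type}")
```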

This is not meant to be a universal perfect catalog. It is a compact operational vocabulary that is already enough to support:

  • trace inspection;
  • regression seeds;
  • session summaries;
  • incident review.

A stronger production vocabulary should also make room for verifier-aware evidence, so traces can later explain not only what the runtime did, but also what evidence a verifier used to judge process quality, outcome quality, or failure attribution.

Why payload contracts matter

The problem is not that events are plain. The problem is that without contracts, payloads quickly turn into garbage.

For each event type, decide up front:

  • which fields are required;
  • which fields are stable;
  • which fields can be added without breaking downstream tooling;
  • which fields matter for grading;
  • which fields matter for audit.

For example, tool_policy_decision should usually include at least:

  • capability_name
  • decision
  • reason
  • risk_tier
  • tool_principal
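A payload contract of this kind is cheap to check mechanically. A minimal sketch, assuming contracts are kept as a table keyed by event type; the table shown covers only tool_policy_decision, and the helper name is illustrative.

```python
REQUIRED_PAYLOAD_FIELDS = {
    "tool_policy_decision": frozenset(
        {"capability_name", "decision", "reason", "risk_tier", "tool_principal"}
    ),
}

def missing_payload_fields(event: dict) -> list[str]:
    """List required payload fields that an event is missing.

    Event types without a registered contract pass trivially."""
    required = REQUIRED_PAYLOAD_FIELDS.get(event["event_type"], frozenset())
    return sorted(required - event.get("payload", {}).keys())
```

Running this in CI against exported traces turns "payloads quickly turn into garbage" from a risk into a failing test.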

Trace for the duplicate-ticket thread

In the support-triage case, tool_policy_decision, approval_requested, tool_execution, and the final outcome should be tied by one trace_id, session_id, approval_id, tool_principal, and idempotency_key. If create_ticket times out and the side-effect status is unknown, the trace should show side_effect_unknown instead of masking the run as successful or repeating the write without reconciliation.
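The "tied by one set of identifiers" property can also be checked mechanically. A sketch under stated assumptions: the correlating fields live in the payload, and every listed event type in one thread should agree on each field it carries; linkage_conflicts is a hypothetical helper.

```python
LINK_FIELDS = ("session_id", "approval_id", "tool_principal", "idempotency_key")
LINKED_EVENTS = {"tool_policy_decision", "approval_requested",
                 "tool_execution", "run_complete", "run_failed"}

def linkage_conflicts(events: list[dict]) -> list[str]:
    """Report correlating fields that disagree across one write thread."""
    seen: dict[str, str] = {}
    conflicts = []
    for event in events:
        if event["event_type"] not in LINKED_EVENTS:
            continue
        values = {"trace_id": event["trace_id"]}
        values.update({f: event["payload"][f]
                       for f in LINK_FIELDS if f in event["payload"]})
        for field, value in values.items():
            if field in seen and seen[field] != value:
                conflicts.append(field)
            seen.setdefault(field, value)
    return sorted(set(conflicts))
```

A conflicting idempotency_key inside one thread is exactly the "repeating the write without reconciliation" smell this check is meant to surface.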

For a sandbox-backed run, reserve fields that link the trace to the execution boundary:

  • sandbox_session_id
  • sandbox_manifest_version
  • sandbox_permissions_profile
  • snapshot_id
  • workspace_manifest_ref

If rollout or eval requires sandbox_profile_review, the trace should also be able to point to review evidence, not only state fields:

  • sandbox_profile_contract
  • workspace_entries_reviewed
  • permissions_profile
  • network_secrets_posture
  • snapshot_policy
  • reviewed_by
  • review_evidence_refs

If the system relies on verifier-aware evals, it is also useful to define an event or linked payload contract for verifier evidence, for example:

  • verifier_id
  • verifier_contract_version
  • process_score
  • outcome_score
  • failure_attribution
  • evidence_refs

And memory_persisted should usually include:

  • memory_class
  • kind
  • provenance
  • revision

The current reference payloads also use operational metadata fields such as runtime_principal, authorization_mode, delegated_principal_id, delegated_scope, policy_id, static_items, session_items, retrieved_items, tool_items, approval_id, reviewer, capability_session_id, capability_session_status, tool_status, output_preview, memory_id, revision_mode, compacted_records, persisted_records, tool_results, span_name, and duration_ms.

Tool request/result model validation is part of the same trace boundary. Malformed tool calls fail with:

  • Tool request capability name must be a string
  • Tool request capability name must not be empty
  • Tool request arguments must be a mapping
  • Tool request argument key must be a string
  • Tool request argument key must not be empty
  • Tool request argument keys must be unique
  • Tool request argument value must be a string: {argument_key}

Malformed tool results fail with:

  • Tool result status must be a string
  • Tool result status must not be empty
  • Tool result payload must be a mapping
  • Tool result payload key must be a string
  • Tool result payload key must not be empty
  • Tool result payload keys must be unique
  • Tool result payload value must be a string: {payload_key}

What the package already supports

You can inspect this directly:

.venv/bin/python -m agent_runtime_ref dump-events
.venv/bin/python -m agent_runtime_ref export-events --output artifacts/trace-demo.jsonl
.venv/bin/python -m agent_runtime_ref export-events --output artifacts/trace-demo.jsonl --redact-field user_input
.venv/bin/python -m agent_runtime_ref inspect-trace --input artifacts/trace-demo.jsonl
.venv/bin/python -m agent_runtime_ref export-session --output artifacts/session-demo-001.json
.venv/bin/python -m agent_runtime_ref export-eval-dataset --output artifacts/eval-dataset.json

That matters because the same trace vocabulary now lives in three places at once:

  • in the runtime;
  • in the book;
  • in eval-ready artifacts.

What a production schema should add

The reference runtime is intentionally small, so a more mature system should quickly add:

  • a timestamp on every event;
  • explicit span_id and parent_span_id;
  • a separate stable run_id;
  • version fields for the schema;
  • a split between display payload and machine payload;
  • redaction rules for sensitive fields;
  • an explicit way to link traces to verifier evidence, screenshots, or grading artifacts;
  • a stable way to record which verifier contract version produced the grading output;
  • sandbox state fields for runs that materialize a workspace, use shell/filesystem capabilities, or continue from a snapshot;
  • an event or linked payload for sandbox_profile_reviewed, so rollout/eval evidence for workspace, permissions, and snapshot/resume policy is traceable.

That is what turns an event stream from debug output into a real platform artifact.

What to Do Right Away

Start with this short list and mark every "no" explicitly:

  • Do you have a stable event catalog?
  • Do you clearly separate trace_id and session_id?
  • Is it clear which fields are required for each event type?
  • Can you reconstruct the policy decision and tool path from a trace?
  • Can you build an eval dataset from a session export?
  • Can you link a trace to verifier evidence used in grading or rollout review?
  • If rollout requires sandbox_profile_review, is there trace evidence for workspace entries, permissions, and snapshot/resume policy?
  • Can you tell which verifier contract version produced that grading output?
  • Do you have a plan for redaction and schema versioning?

If several answers are “no,” you probably have logging, but not yet a real trace schema.

What to Do Next