Chapter 9. Sandbox Execution and MCP as an Integration Contract

1. Why an Execution Layer Without a Sandbox Quickly Becomes Too Trusting

Once an agent has access to tools, the next danger is almost always the same: system boundaries start to blur.

The agent can now:

  • read data;
  • run operations;
  • call external services;
  • receive responses from unpredictable environments.

If all of that executes "as is", without isolation and contracts, the platform quickly accumulates problems:

  • a tool returns untrusted payloads in unexpected formats;
  • an integration hangs or exceeds a resource budget;
  • a side effect happens outside the expected policy path;
  • one badly designed adapter drags the whole runtime down.

That is why the execution layer is not just a router. It is also a sandbox boundary.

2. A Sandbox Is Not Necessarily a Container, It Is First a Set of Limits

When people say "sandbox", many immediately think of Docker, a VM, or a separate process. Those are possible implementations, but architecturally the important thing is different: a sandbox defines the limits of what a capability is allowed to do.

A good sandbox usually limits:

  • network access;
  • file system access;
  • access to secrets;
  • CPU and memory budgets;
  • allowed syscalls or execution mode;
  • operation lifetime.

In other words, the sandbox answers: "What happens if a tool or adapter behaves worse than we expected?"

This is not only security. It is also blast-radius control.
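
To make the limits concrete, here is a minimal, POSIX-only sketch: a hypothetical `run_with_limits` helper that enforces an operation lifetime, a CPU budget, and a memory ceiling around a subprocess using only the standard library. It is an illustration of "a sandbox is first a set of limits", not a hardened isolation mechanism.

```python
import resource
import subprocess


def run_with_limits(cmd: list[str], timeout_s: int = 5,
                    cpu_s: int = 2, mem_bytes: int = 256 * 1024 * 1024) -> dict:
    """Run a command under basic resource limits (hypothetical helper)."""

    def apply_limits() -> None:
        # CPU-seconds and address-space ceilings for the child process.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_s, cpu_s))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True,
            timeout=timeout_s, preexec_fn=apply_limits,
        )
        return {"status": "completed", "exit_code": proc.returncode,
                "stdout": proc.stdout}
    except subprocess.TimeoutExpired:
        # Operation lifetime is enforced by the platform, not by the tool.
        return {"status": "timeout", "exit_code": None, "stdout": ""}
```

A real sandbox would add network and filesystem restrictions on top of this, but even this much already answers the question above: a runaway tool is killed by the platform, not debugged after the fact.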

3. You Cannot Treat an External Integration Like a Simple Function

A common mistake looks like this: an external service is wrapped in a function, and the agent sees it as just another call.

But a real integration is almost always:

  • less stable than local code;
  • less cleanly typed;
  • dependent on permissions and environment;
  • capable of returning partial or unsafe results;
  • subject to its own latency and rate limits.

That is why it is more useful to treat integrations as capability endpoints with a contract, not as convenient helper methods.
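
A sketch of what "capability endpoint with a contract" can mean in code: a hypothetical `call_endpoint` wrapper that caps latency and payload size and normalizes every outcome, including failure, into an explicit result envelope instead of letting exceptions leak into agent logic.

```python
import socket
import urllib.request


def call_endpoint(url: str, timeout_s: float = 5.0) -> dict:
    """Call an external endpoint and normalize the outcome (hypothetical wrapper).

    Every result, including failure, becomes a status the execution
    layer can reason about, rather than an exception in the agent loop.
    """
    max_bytes = 64 * 1024  # defensive cap on payload size
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            body = resp.read(max_bytes)
            return {"status": "success",
                    "partial": len(body) == max_bytes,  # may be truncated
                    "payload": body}
    except socket.timeout:
        return {"status": "timeout", "partial": False, "payload": None}
    except Exception as exc:  # normalize at the boundary
        return {"status": "error", "partial": False, "payload": None,
                "reason": type(exc).__name__}
```

The point is the shape of the return value: instability, partial results, and latency from the list above are all represented explicitly, instead of being hidden behind a function that either "works" or throws.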

4. MCP Is Useful Precisely as a Contract Layer

MCP is not useful because it is fashionable. It is useful because it gives you a clear contract boundary between the agent and an external capability.

In a good design, MCP gives you several benefits:

  • a standardized way to describe tools and resources;
  • a separate server boundary;
  • a clearer lifecycle for connected capabilities;
  • the ability to keep adapters outside the core runtime;
  • a natural point for policy checks, logging, and isolation.

That becomes especially valuable once you have not one runtime and one integration, but a set of capabilities you want to connect systematically rather than chaotically.

The diagram below shows MCP as a contract layer between the runtime and external capabilities:

flowchart LR
    A["Agent runtime"] --> B["Execution layer"]
    B --> C["Policy and validation"]
    C --> D["MCP client"]
    D --> E["MCP server"]
    E --> F["Typed adapter"]
    F --> G["External API / system"]
    G --> F
    F --> E
    E --> D
    D --> B

5. Why Move Adapters Out of the Core Runtime

Moving adapters out of the core runtime gives you several immediate benefits:

  • failures in one integration affect the central runtime less;
  • network, secrets, and filesystem can be constrained per capability;
  • it is easier to swap or upgrade one adapter without rewriting orchestration;
  • contracts become clearer;
  • capabilities are easier to test independently of the agent logic.

That matters especially when some tools are read-only, some write into external systems, and some execute code or shell commands.

6. Not Every Capability Needs the Same Isolation Level

It is useful to split integrations into at least three classes:

  • low-risk read capabilities;
  • medium-risk business actions;
  • high-risk execution capabilities.

Examples:

  • read_kb or search_docs can run with softer controls;
  • create_ticket or update_crm_record need stricter policy and audit;
  • run_shell, exec_sql, or deploy_job need the strongest sandbox and approval.

If every tool runs under the same soft execution profile, the platform becomes both unsafe and incident-prone.
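
One way to make the tiering mechanical is to map each risk class to an execution profile. The class names, profile fields, and capability-to-class table below are illustrative assumptions, not a fixed taxonomy:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionProfile:
    network: str
    approval: str
    timeout_seconds: int


# Hypothetical tiering: each risk class gets a progressively stricter profile.
RISK_PROFILES = {
    "read": ExecutionProfile(network="internal_only", approval="none",
                             timeout_seconds=8),
    "write": ExecutionProfile(network="internal_only", approval="on_high_impact",
                              timeout_seconds=15),
    "exec": ExecutionProfile(network="denied", approval="always",
                             timeout_seconds=10),
}


def profile_for(capability: str) -> ExecutionProfile:
    """Look up a capability's risk class and return its execution profile."""
    tiers = {"read_kb": "read", "search_docs": "read",
             "create_ticket": "write", "update_crm_record": "write",
             "run_shell": "exec", "exec_sql": "exec"}
    return RISK_PROFILES[tiers[capability]]
```

With a table like this, a new capability cannot be wired in without someone deciding which class it belongs to, which is exactly the decision that gets skipped when every tool shares one soft profile.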

7. A Capability Contract Must Include More Than Input/Output

Many teams do a decent job describing input schema, but the operational contract is missing. In practice, that part is often more important.

It helps to define explicitly:

  • read or write nature;
  • network policy;
  • secret scope;
  • allowed environments;
  • timeout budget;
  • retry policy;
  • approval requirement;
  • logging and redaction rules.

A contract along these lines might look like this:

capabilities:
  search_docs:
    transport: mcp
    mode: read
    network: internal_only
    secrets: none
    timeout_seconds: 8
    approval: none
  create_ticket:
    transport: mcp
    mode: write
    network: internal_only
    secrets: service_account_helpdesk
    timeout_seconds: 15
    approval: manager_for_high_priority
  run_shell:
    transport: sandboxed_exec
    mode: high_risk
    network: denied
    filesystem: workspace_only
    secrets: none
    timeout_seconds: 10
    approval: always

This is no longer just a function description. It is a behavioral contract for a capability.
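
A contract like this is only useful if the platform refuses to run capabilities that violate it. A minimal validation sketch, assuming the required keys and the high-risk rule shown above:

```python
# Keys every capability contract must declare (assumed set, per the
# example contract above).
REQUIRED_KEYS = {"transport", "mode", "network", "secrets",
                 "timeout_seconds", "approval"}


def validate_contract(name: str, contract: dict) -> list[str]:
    """Return a list of contract violations; empty means the contract is usable."""
    problems = [f"{name}: missing '{key}'"
                for key in sorted(REQUIRED_KEYS - contract.keys())]
    if contract.get("timeout_seconds", 0) <= 0:
        problems.append(f"{name}: timeout_seconds must be positive")
    # High-risk capabilities must never run without an approval step.
    if contract.get("mode") == "high_risk" and contract.get("approval") != "always":
        problems.append(f"{name}: high_risk capabilities require approval: always")
    return problems
```

Run at registration time rather than call time, a check like this turns the contract from documentation into an enforced boundary.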

8. Sandbox Execution Should Return Execution Facts, Not Only Output

If the sandbox returns only stdout or a payload, you lose half the value of the isolation layer.

For investigations and control, it is useful to return:

  • exit status;
  • timeout flag;
  • resource usage summary;
  • side effect uncertainty;
  • redacted logs;
  • policy decision id.

Then the execution layer can explain not just "the command failed", but something far more useful: "the operation was terminated by timeout after 8 seconds, network was denied, and the side effect is not confirmed".
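
A sketch of such an envelope follows. The field names are hypothetical, but the idea is the one above: the sandbox reports facts, and the execution layer turns them into an explanation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecutionFacts:
    """Facts the sandbox reports alongside the business payload (sketch)."""
    exit_status: Optional[int]
    timed_out: bool
    cpu_seconds: float
    side_effect_confirmed: Optional[bool]  # None = uncertain
    policy_decision_id: str


def explain(facts: ExecutionFacts) -> str:
    """Turn raw execution facts into the kind of sentence an operator needs."""
    if facts.timed_out:
        effect = ("side effect is not confirmed"
                  if facts.side_effect_confirmed is None
                  else "side effect state is known")
        return (f"terminated by timeout after {facts.cpu_seconds:.0f}s; "
                f"{effect} (decision {facts.policy_decision_id})")
    return f"exited with status {facts.exit_status} (decision {facts.policy_decision_id})"
```

Note that `side_effect_confirmed` is deliberately tri-state: for a write capability killed mid-flight, "we do not know" is the honest answer, and modeling it explicitly is what makes the later retry and idempotency work possible.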

9. A Simple Capability Dispatch Example

This small skeleton shows the core idea: transport and execution profile are chosen from the capability contract, not invented by the model on the fly.

from dataclasses import dataclass


@dataclass
class CapabilitySpec:
    """Execution-relevant fields loaded from the capability contract."""
    name: str
    transport: str
    mode: str
    timeout_seconds: int


def dispatch_capability(spec: CapabilitySpec, args: dict) -> dict:
    """Route a call based on the contract, not on what the model asked for."""
    # (Argument validation against the contract's input schema is omitted here.)
    if spec.transport == "mcp":
        # Normal path: hand off to the MCP client under the contract's limits.
        return {"status": "success", "transport": "mcp", "capability": spec.name}
    if spec.transport == "sandboxed_exec" and spec.mode == "high_risk":
        # High-risk execution never runs without an explicit approval step.
        return {"status": "approval_required", "capability": spec.name}
    return {"status": "validation_failure", "reason": "unsupported capability profile"}

It is intentionally simple, but it locks in the right idea: the way execution happens is determined by the platform, not improvised by the model every time.

10. What Usually Breaks in the Sandbox and Capability Layer

The same problems repeat over and over:

  • a capability gets more network access than it needs;
  • secrets are visible to too many adapters;
  • tool results drag raw external payloads into prompts;
  • timeouts exist, but side effect uncertainty is not modeled;
  • an MCP server was added, but policy and audit never reached it;
  • a sandbox exists on paper but does not restrict anything important.

That is why sandboxing cannot be a checkbox feature. It has to be part of execution design.
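
Some of these failures can be blocked mechanically. The "raw external payloads in prompts" problem, for example, invites a quarantine step between tool output and prompt assembly; the redaction rules below are purely illustrative:

```python
import re

MAX_PROMPT_CHARS = 2000  # illustrative cap on how much payload reaches a prompt


def sanitize_tool_result(payload: str) -> str:
    """Strip obvious injection markers and token-like strings, then cap size,
    before a tool payload is allowed anywhere near a prompt (sketch rules)."""
    cleaned = re.sub(r"(?i)ignore (all )?previous instructions",
                     "[redacted]", payload)
    # Hypothetical secret pattern: anything that looks like an API token.
    cleaned = re.sub(r"\b(sk-[A-Za-z0-9]{8,})\b", "[secret]", cleaned)
    return cleaned[:MAX_PROMPT_CHARS]
```

Pattern-based redaction is a weak defense on its own, but placing the step in the execution layer, where it cannot be skipped, is the part that matters architecturally.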

11. Practical Checklist

If you want to quickly review the capability layer, ask:

  • Are adapters separated from the core runtime?
  • Is there a per-capability execution profile?
  • Are network, filesystem, and secrets constrained?
  • Is transport explicit: direct, MCP, sandboxed exec?
  • Does the system distinguish trustworthy from only partially trusted results?
  • Do you store execution facts beyond business payload?
  • Can you explain why a capability was allowed in this specific run?

If those answers are vague, the capability layer is still a pile of useful integrations, not a managed platform.

The next natural topic in this part is idempotency, retries, rate limits, and rollback boundaries. After sandboxing and capability contracts, that is what turns the execution model into a production-grade layer.