<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: lelandfy</title>
    <description>The latest articles on Forem by lelandfy (@leland_fy).</description>
    <link>https://forem.com/leland_fy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3818658%2Fa706e8a0-66cb-43d9-9195-2e25efba6d2c.jpg</url>
      <title>Forem: lelandfy</title>
      <link>https://forem.com/leland_fy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/leland_fy"/>
    <language>en</language>
    <item>
      <title>Stop Writing Docker Wrappers for Your AI Agent's Code Execution</title>
      <dc:creator>lelandfy</dc:creator>
      <pubDate>Mon, 16 Mar 2026 21:37:05 +0000</pubDate>
      <link>https://forem.com/leland_fy/stop-writing-docker-wrappers-for-your-ai-agents-code-execution-1c5b</link>
      <guid>https://forem.com/leland_fy/stop-writing-docker-wrappers-for-your-ai-agents-code-execution-1c5b</guid>
      <description>&lt;p&gt;Every AI agent that executes code needs a sandbox. And teams building one often end up writing the same thing: a Python wrapper around &lt;code&gt;subprocess.run(["docker", "run", ...])&lt;/code&gt; with a growing list of security flags they keep forgetting to set.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Here's what a typical "sandbox" looks like in most agent codebases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--rm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--network=none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--memory=512m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--cpus=1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--read-only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--security-opt=no-new-privileges&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--pids-limit=64&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python:3.12-slim&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;print(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works. Until it doesn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Someone forgets &lt;code&gt;--network=none&lt;/code&gt; and your agent starts making HTTP requests.&lt;/li&gt;
&lt;li&gt;Timeout handling becomes a mess when Docker itself hangs.&lt;/li&gt;
&lt;li&gt;Parsing stdout/stderr gets fragile fast.&lt;/li&gt;
&lt;li&gt;Cleanup on crash? Good luck.&lt;/li&gt;
&lt;li&gt;Want to swap Docker for Firecracker? Rewrite everything.&lt;/li&gt;
&lt;/ul&gt;
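
&lt;p&gt;If you do keep a hand-rolled wrapper for a while, the least you can do is centralize the flag list so no call site can forget one. A minimal sketch — the helper name and flag set below are illustrative, not part of any library:&lt;/p&gt;

```python
# Hypothetical helper: keep every security flag in ONE place so no call
# site can forget --network=none or --read-only.
HARDENING_FLAGS = [
    "--rm", "--network=none", "--memory=512m", "--cpus=1",
    "--read-only", "--security-opt=no-new-privileges", "--pids-limit=64",
]

def build_sandbox_cmd(image, argv):
    # Every invocation inherits the full flag set; loosening a restriction
    # has to be an explicit, visible change at the call site.
    return ["docker", "run", *HARDENING_FLAGS, image, *argv]

cmd = build_sandbox_cmd("python:3.12-slim", ["python3", "-c", "print('hello')"])
print(cmd)
```

&lt;p&gt;This buys you consistency, but none of the lifecycle handling below.&lt;/p&gt;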

&lt;h2&gt;
  
  
  What We Built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/substratum-labs/roche" rel="noopener noreferrer"&gt;Roche&lt;/a&gt; is a sandbox orchestrator that replaces all of that with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;roche_sandbox&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Roche&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;Roche&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python:3.12-slim&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;print(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The sandbox is created with secure defaults, the command runs, and the sandbox is destroyed when the context manager exits, even if your code throws an exception.&lt;/p&gt;
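
&lt;p&gt;The destroy-on-exception guarantee comes from ordinary context-manager semantics: teardown runs in a &lt;code&gt;finally&lt;/code&gt; block, crash or no crash. A self-contained sketch with a stand-in sandbox (not the Roche API):&lt;/p&gt;

```python
from contextlib import contextmanager

destroyed = []

@contextmanager
def fake_sandbox():
    sandbox = {"id": "sb-1"}
    try:
        yield sandbox
    finally:
        destroyed.append(sandbox["id"])  # always runs, crash or not

try:
    with fake_sandbox() as sb:
        raise RuntimeError("agent code blew up")
except RuntimeError:
    pass

print(destroyed)  # the sandbox was still torn down
```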

&lt;h2&gt;
  
  
  What "Secure Defaults" Actually Means
&lt;/h2&gt;

&lt;p&gt;When you call &lt;code&gt;Roche().create()&lt;/code&gt; with no arguments, you get:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Network&lt;/td&gt;
&lt;td&gt;Disabled&lt;/td&gt;
&lt;td&gt;LLM-generated code should not make HTTP calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Filesystem&lt;/td&gt;
&lt;td&gt;Read-only&lt;/td&gt;
&lt;td&gt;No persistent writes, no dropping payloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Timeout&lt;/td&gt;
&lt;td&gt;300 seconds&lt;/td&gt;
&lt;td&gt;No infinite loops eating your CPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PID limit&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;No fork bombs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privileges&lt;/td&gt;
&lt;td&gt;no-new-privileges&lt;/td&gt;
&lt;td&gt;No privilege escalation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every one of these can be overridden when you need to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sandbox&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;roche&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python:3.12-slim&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# enable network
&lt;/span&gt;    &lt;span class="n"&gt;writable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# writable filesystem
&lt;/span&gt;    &lt;span class="n"&gt;timeout_secs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# longer timeout
&lt;/span&gt;    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1g&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# memory limit
&lt;/span&gt;    &lt;span class="n"&gt;cpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# CPU limit
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But you have to opt in. Dangerous capabilities are never on by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  Async Support
&lt;/h2&gt;

&lt;p&gt;If you're building an async agent (most are), there's &lt;code&gt;AsyncRoche&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;roche_sandbox&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncRoche&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;roche&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AsyncRoche&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;with &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;roche&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
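
&lt;p&gt;The async variant composes naturally with &lt;code&gt;asyncio.gather&lt;/code&gt; for fan-out. A runnable sketch with a stand-in coroutine in place of a real &lt;code&gt;AsyncRoche&lt;/code&gt; call (&lt;code&gt;run_in_sandbox&lt;/code&gt; here is illustrative, not the library API):&lt;/p&gt;

```python
import asyncio

async def run_in_sandbox(code):
    # Stand-in for "create sandbox, exec, destroy" so the sketch runs anywhere.
    await asyncio.sleep(0.01)
    return f"ran: {code}"

async def main():
    snippets = ["print(1)", "print(2)", "print(3)"]
    # Each snippet gets its own sandbox; all three run concurrently.
    return await asyncio.gather(*[run_in_sandbox(s) for s in snippets])

results = asyncio.run(main())
print(results)
```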



&lt;h2&gt;
  
  
  Using It With Agent Frameworks
&lt;/h2&gt;

&lt;p&gt;Roche doesn't care what framework you use. Here's a quick example with OpenAI Agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;function_tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;roche_sandbox&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Roche&lt;/span&gt;

&lt;span class="n"&gt;roche&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Roche&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@function_tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_python&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Execute Python code in a secure sandbox.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;roche&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exit_code&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Coder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You can run Python code using execute_python.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;execute_python&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same pattern works with LangChain, CrewAI, Anthropic tool use, AutoGen, etc. The sandbox logic stays the same regardless of the framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Swapping Providers
&lt;/h2&gt;

&lt;p&gt;The whole point of Roche is that provider choice is a config change, not a rewrite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Docker (default)
&lt;/span&gt;&lt;span class="n"&gt;roche&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Roche&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Firecracker microVMs (stronger isolation)
&lt;/span&gt;&lt;span class="n"&gt;roche&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Roche&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;firecracker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# WebAssembly (lightweight, fast)
&lt;/span&gt;&lt;span class="n"&gt;roche&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Roche&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wasm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your &lt;code&gt;create / exec / destroy&lt;/code&gt; calls don't change. The security defaults adjust per provider but stay safe.&lt;/p&gt;
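
&lt;p&gt;In practice "a config change" can literally mean an environment variable. A small sketch — the &lt;code&gt;SANDBOX_PROVIDER&lt;/code&gt; variable name is an assumption for illustration, not part of Roche:&lt;/p&gt;

```python
import os

VALID = {"docker", "firecracker", "wasm"}

def provider_from_env(default="docker"):
    # Fall back to the most widely available backend; fail loudly on typos.
    p = os.environ.get("SANDBOX_PROVIDER", default)
    if p not in VALID:
        raise ValueError(f"unknown provider: {p}")
    return p

print(provider_from_env())
```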

&lt;h2&gt;
  
  
  Architecture (For the Curious)
&lt;/h2&gt;

&lt;p&gt;The core is a Rust library (&lt;code&gt;roche-core&lt;/code&gt;) with a &lt;code&gt;SandboxProvider&lt;/code&gt; trait:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your Code (Python/TS/Go)
    |
    v
SDK (roche-sandbox on PyPI)
    |
    v
CLI subprocess or gRPC daemon (roched)
    |
    v
roche-core (Rust)
    |
    v
Docker / Firecracker / WASM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SDKs communicate with the Rust core either by shelling out to the &lt;code&gt;roche&lt;/code&gt; CLI (zero setup) or through a gRPC daemon (&lt;code&gt;roched&lt;/code&gt;) that adds sandbox pooling for faster acquisition.&lt;/p&gt;

&lt;p&gt;You don't need to install Rust. &lt;code&gt;pip install roche-sandbox&lt;/code&gt; is enough if you have Docker on your machine.&lt;/p&gt;
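
&lt;p&gt;The CLI transport is easy to picture: the SDK spawns the binary and parses structured JSON from stdout. A sketch of that pattern, with &lt;code&gt;python -c&lt;/code&gt; standing in for the real &lt;code&gt;roche&lt;/code&gt; binary so it runs anywhere (the payload shape is illustrative):&lt;/p&gt;

```python
import json
import subprocess
import sys

# Stand-in CLI that emits one JSON object, like a sandbox result would be.
fake_cli = [
    sys.executable, "-c",
    "import json; print(json.dumps({'exit_code': 0, 'stdout': 'hello'}))",
]
proc = subprocess.run(fake_cli, capture_output=True, text=True, check=True)
payload = json.loads(proc.stdout)
print(payload["stdout"])
```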

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;roche-sandbox
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;roche_sandbox&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Roche&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;Roche&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;import sys; print(sys.version)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requirements: Python 3.10+ and Docker.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/substratum-labs/roche" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://substratum-labs.github.io/roche-docs/" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/roche-sandbox/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.npmjs.com/package/roche-sandbox" rel="noopener noreferrer"&gt;npm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The whole thing is Apache-2.0 licensed. Contributions welcome.&lt;/p&gt;

</description>
      <category>python</category>
      <category>rust</category>
      <category>docker</category>
      <category>ai</category>
    </item>
    <item>
      <title>Do LLM Agents Need an OS?</title>
      <dc:creator>lelandfy</dc:creator>
      <pubDate>Wed, 11 Mar 2026 15:29:39 +0000</pubDate>
      <link>https://forem.com/leland_fy/do-llm-agents-need-an-os-15i4</link>
      <guid>https://forem.com/leland_fy/do-llm-agents-need-an-os-15i4</guid>
      <description>&lt;p&gt;LLM agents are getting smarter every month. But the way we run them hasn't changed: a prompt, a while loop, and a prayer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# no isolation
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This architecture has three gaps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No interception point.&lt;/strong&gt; If the agent decides to call &lt;code&gt;delete_database()&lt;/code&gt;, the damage is done before you see it in the logs. There is no gate between the LLM's decision and the side effect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No budget.&lt;/strong&gt; The agent can make unlimited API calls, send unlimited emails, or consume unlimited compute. The only limit is when the context window fills up or the while loop hits max iterations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No crash recovery.&lt;/strong&gt; If the process dies, all state is lost. The agent starts from scratch — re-executing every tool call, re-spending every API dollar.&lt;/p&gt;

&lt;p&gt;We solved analogous problems decades ago. In the 1960s, computers ran one program at a time with full hardware access. Then operating systems arrived — process isolation, resource quotas, preemptive scheduling — and everything changed.&lt;/p&gt;

&lt;p&gt;What if we applied the same ideas to LLM agents?&lt;/p&gt;

&lt;h2&gt;
  
  
  The OS Analogy
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;OS Concept&lt;/th&gt;
&lt;th&gt;Agent Equivalent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System calls&lt;/td&gt;
&lt;td&gt;All tool calls go through a validated gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Process checkpoints&lt;/td&gt;
&lt;td&gt;Agent state is a replay log, not a serialized coroutine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resource quotas&lt;/td&gt;
&lt;td&gt;Finite, depletable budgets per resource type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardware interrupts&lt;/td&gt;
&lt;td&gt;Destructive ops auto-suspend for human review&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: &lt;strong&gt;the agent function is "user space," and the kernel controls all side effects.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a Microkernel, Not a Framework
&lt;/h2&gt;

&lt;p&gt;If agents need OS-like controls, the obvious move is to bake them into the agent framework itself. LangChain adds guardrails. CrewAI adds role-based access. Every framework reinvents its own safety layer.&lt;/p&gt;

&lt;p&gt;This is the monolithic kernel approach — and it has the same problems it had in the 1980s:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tight coupling.&lt;/strong&gt; Safety logic is entangled with orchestration logic. You can't use LangChain's budget system with AutoGen's agents. Each framework is a walled garden.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All or nothing.&lt;/strong&gt; Want checkpoint/replay? You need to adopt the entire framework. Want HITL approval? Same deal. There's no way to add just the control layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Growing attack surface.&lt;/strong&gt; Every new framework feature is another place where an unchecked tool call can slip through. The more the framework does, the harder it is to audit.&lt;/p&gt;

&lt;p&gt;The microkernel approach inverts this. The kernel does exactly four things: validate tool calls, enforce budgets, gate destructive operations, and manage checkpoints. Everything else — orchestration, prompting, LLM selection, agent logic — stays in user space. The kernel is small enough to audit, framework-agnostic, and composable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Monolithic:   [Agent logic + tools + safety + budgets + HITL + replay]
              &amp;lt;- one framework, tightly coupled

Microkernel:  [Agent logic + tools]  &amp;lt;-  user space (any framework)
              ------------------------------------------------
              [validation | budgets | HITL | replay]  &amp;lt;-  kernel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is why the kernel should exist as a standalone component, not a plugin for an existing framework. Any framework can integrate with it. Any agent can run on top of it.&lt;/p&gt;

&lt;p&gt;To test this idea, I wrote &lt;a href="https://github.com/substratum-labs/mini-castor" rel="noopener noreferrer"&gt;Mini-Castor&lt;/a&gt; — the entire microkernel in one Python file. No dependencies. No frameworks. Just &lt;code&gt;asyncio&lt;/code&gt;, &lt;code&gt;dataclasses&lt;/code&gt;, and &lt;code&gt;contextvars&lt;/code&gt;. ~500 lines.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Syscall Proxy
&lt;/h2&gt;

&lt;p&gt;Every agent action goes through a single gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;syscall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_emails&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;older than 30 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The proxy implements three paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Replay path&lt;/strong&gt; — serving cached responses from a previous run (instant)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast path&lt;/strong&gt; — budget check -&amp;gt; execute -&amp;gt; log (normal execution)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow path&lt;/strong&gt; — destructive tool -&amp;gt; suspend for human review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent never calls tools directly. It doesn't know which path it's on.&lt;/p&gt;
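
&lt;p&gt;A toy version of that dispatch makes the three paths concrete. Class and method shapes here are illustrative, not Mini-Castor's actual API:&lt;/p&gt;

```python
import asyncio

class Suspend(Exception):
    """Raised when a destructive call must wait for human review."""

class Proxy:
    def __init__(self, budget, replay_log=None):
        self.budget = budget
        self.replay_log = list(replay_log or [])
        self.log = []

    async def syscall(self, tool, args, destructive=False):
        if self.replay_log:               # replay path: serve the cached answer
            return self.replay_log.pop(0)
        if destructive:                   # slow path: suspend for approval
            raise Suspend(f"{tool} needs approval")
        if self.budget == 0:              # fast path starts with a budget check
            raise RuntimeError("budget exhausted")
        self.budget -= 1
        result = f"{tool}({args})"        # stand-in for real tool execution
        self.log.append(result)           # ...then log
        return result

async def demo():
    p = Proxy(budget=2)
    out = await p.syscall("search_emails", {"q": "old"})
    try:
        await p.syscall("delete_emails", {"q": "old"}, destructive=True)
    except Suspend:
        out = out + " | suspended"
    return out

print(asyncio.run(demo()))
```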

&lt;h2&gt;
  
  
  The Core Trick: Non-Serialized Coroutines
&lt;/h2&gt;

&lt;p&gt;This is the most important design choice.&lt;/p&gt;

&lt;p&gt;Python coroutines can't be serialized. You can't pickle an &lt;code&gt;async def&lt;/code&gt; that's halfway through — it holds C-level stack frames, event loop references, and closure state.&lt;/p&gt;

&lt;p&gt;Mini-Castor's solution: &lt;strong&gt;don't serialize the coroutine at all.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Record every syscall and its response in a log&lt;/li&gt;
&lt;li&gt;To "suspend": raise an exception that unwinds the entire call stack&lt;/li&gt;
&lt;li&gt;To "resume": re-run the function from the top, serve cached responses&lt;/li&gt;
&lt;li&gt;The agent fast-forwards through cached syscalls, then continues live
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Resume after suspension:
  syscall #0: search_emails  -&amp;gt; cached (instant)
  syscall #1: analyze        -&amp;gt; cached (instant)
  syscall #2: delete_emails  -&amp;gt; LIVE (human approved)
  syscall #3: send_summary   -&amp;gt; LIVE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent doesn't know it was ever "killed." From its perspective, &lt;code&gt;syscall()&lt;/code&gt; just returned a value. This also gives you crash recovery for free — save the checkpoint to disk, resume from any point.&lt;/p&gt;
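&lt;p&gt;The whole suspend/resume cycle can be sketched in a few lines. Everything here is a hypothetical illustration — &lt;code&gt;Kernel&lt;/code&gt;, &lt;code&gt;Suspend&lt;/code&gt;, and the executor-callback shape are assumptions, not the real implementation:&lt;/p&gt;

```python
import asyncio

# Sketch of replay-based resumption (illustrative names). Instead of
# serializing a paused coroutine, re-run the agent from the top and serve
# recorded responses until execution reaches the live frontier.
class Suspend(Exception):
    """Unwinds the entire call stack; the checkpoint is just the log."""

class Kernel:
    def __init__(self):
        self.log = []          # recorded (name, response) pairs
        self.approved = False  # flipped by a (simulated) human review

    async def syscall(self, name, executor):
        idx = self._cursor
        self._cursor += 1
        if idx < len(self.log):               # fast-forward: cached response
            return self.log[idx][1]
        if name == "delete_emails" and not self.approved:
            raise Suspend()                   # suspend: unwind the stack
        result = await executor()             # live execution
        self.log.append((name, result))
        return result

    async def run(self, agent):
        self._cursor = 0                      # every (re)run starts at the top
        try:
            return await agent(self)
        except Suspend:
            return None                       # checkpoint persists in self.log
```

&lt;p&gt;On the second &lt;code&gt;run()&lt;/code&gt; the agent replays &lt;code&gt;search_emails&lt;/code&gt; from the log and only &lt;code&gt;delete_emails&lt;/code&gt; executes live — exactly the fast-forward shown above.&lt;/p&gt;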

&lt;h2&gt;
  
  
  Capability Budgets
&lt;/h2&gt;

&lt;p&gt;Every tool declares its cost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;destructive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;delete_emails&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The kernel deducts before execution and refunds on failure. When the budget hits zero, the agent is stopped. No runaway API bills.&lt;/p&gt;

&lt;p&gt;A subtle detail: deduction happens &lt;em&gt;before&lt;/em&gt; execution, which makes the refund essential. If the tool raises an exception, the cost has already been deducted but no result is ever logged. Without the refund, replay would re-execute that syscall and deduct again — a permanent budget leak.&lt;/p&gt;
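&lt;p&gt;The deduct-then-refund pattern fits in a few lines. This is a hedged sketch with made-up names (&lt;code&gt;Budget&lt;/code&gt;, &lt;code&gt;charge&lt;/code&gt;), not the kernel's real interface:&lt;/p&gt;

```python
# Sketch: deduct BEFORE execution, refund on failure (illustrative names).
class BudgetExceeded(Exception):
    pass

class Budget:
    def __init__(self, limit):
        self.remaining = limit

    def charge(self, cost, fn):
        if self.remaining < cost:
            raise BudgetExceeded(f"need {cost}, have {self.remaining}")
        self.remaining -= cost        # deduct up front
        try:
            return fn()
        except Exception:
            self.remaining += cost    # refund on failure, then re-raise
            raise
```

&lt;p&gt;Because a failed call is fully refunded and never logged, replaying it is idempotent: re-execution charges the budget exactly once.&lt;/p&gt;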

&lt;h2&gt;
  
  
  Why Modify Doesn't Mutate the Request
&lt;/h2&gt;

&lt;p&gt;Destructive tools auto-suspend. The human gets three choices: approve, reject, or modify.&lt;/p&gt;

&lt;p&gt;"Modify" is the subtlest design decision. When a human says "only delete files older than 90 days," we do NOT edit the pending request. That would break replay:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;On replay, the agent emits:  delete(scope="all")       &amp;lt;- original
But the log would contain:   delete(scope="90+ days")  &amp;lt;- mutated
                              MISMATCH -&amp;gt; ReplayDivergenceError
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead, we log the original request with the human's feedback as the response. On replay, the agent sees &lt;code&gt;{"status": "MODIFIED", "feedback": "only 90+ days"}&lt;/code&gt; and the LLM re-plans with revised arguments. The revised call becomes a new syscall entry. Full audit trail. Replay integrity preserved. The human writes natural language, not JSON.&lt;/p&gt;
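&lt;p&gt;The invariant is easy to state in code: the request in the log is always byte-for-byte what the agent emitted, and the human's verdict lives only in the response. A minimal sketch (the &lt;code&gt;review&lt;/code&gt; helper and status strings are hypothetical):&lt;/p&gt;

```python
# Sketch: "modify" records the human's feedback as the RESPONSE to the
# original request; the request itself is never mutated (names illustrative).
def review(log, request, decision, feedback=None):
    if decision == "approve":
        response = {"status": "APPROVED"}
    elif decision == "reject":
        response = {"status": "REJECTED"}
    else:  # modify: leave the pending request untouched
        response = {"status": "MODIFIED", "feedback": feedback}
    log.append((request, response))   # original request, verbatim
    return response
```

&lt;p&gt;On replay the agent re-emits the original request, the log matches, and the &lt;code&gt;MODIFIED&lt;/code&gt; response drives the LLM to re-plan — no divergence.&lt;/p&gt;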

&lt;h2&gt;
  
  
  The ContextVar Bridge
&lt;/h2&gt;

&lt;p&gt;The proxy is powerful, but it couples agent code to the kernel. Every agent must accept a &lt;code&gt;proxy&lt;/code&gt; parameter. We can do better:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Agent code — zero kernel imports, zero kernel coupling
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_agent&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;code&gt;ContextVar&lt;/code&gt; (Python's async-aware thread-local) holds the current proxy. The kernel sets it before running the agent; free functions read it implicitly. This creates a clean separation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operator&lt;/strong&gt;: sets up kernel, registers tools, manages budgets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent developer&lt;/strong&gt;: writes pure logic using &lt;code&gt;call_tool()&lt;/code&gt; / &lt;code&gt;budget()&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This mirrors how libc hides raw syscall numbers behind &lt;code&gt;printf()&lt;/code&gt; and &lt;code&gt;malloc()&lt;/code&gt;. The kernel auto-detects the agent's signature — &lt;code&gt;agent(proxy)&lt;/code&gt; gets the proxy explicitly, &lt;code&gt;agent()&lt;/code&gt; uses the ContextVar implicitly. Both work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/substratum-labs/mini-castor.git
&lt;span class="nb"&gt;cd &lt;/span&gt;mini-castor
python demo.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The demo runs a research assistant that searches, analyzes, and tries to delete emails. The kernel suspends at the destructive call. You choose: approve, reject, or modify. Watch the replay.&lt;/p&gt;

&lt;p&gt;No API keys. No dependencies. 30 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Leaves Out
&lt;/h2&gt;

&lt;p&gt;Mini-Castor is a teaching tool, not a production system. It deliberately skips:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema validation&lt;/strong&gt; — real tools need Pydantic validation with LLM-readable errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sub-agent spawning&lt;/strong&gt; — parent agents delegating budget to children&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context window management&lt;/strong&gt; — evicting old messages when the window fills&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistence&lt;/strong&gt; — saving checkpoints to SQLite for real crash recovery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming + preemption&lt;/strong&gt; — canceling an LLM mid-generation and capturing partial output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are hard problems. We're working on a production kernel that tackles them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question
&lt;/h2&gt;

&lt;p&gt;The agent ecosystem is building increasingly complex orchestration on top of the "prompt + while loop." But maybe what we need isn't another framework on top — it's a layer underneath. A kernel that any framework can integrate with. A syscall boundary that makes every tool call auditable, budgeted, and interruptible.&lt;/p&gt;

&lt;p&gt;Do LLM agents need an OS? &lt;a href="https://github.com/substratum-labs/mini-castor/blob/main/mini_castor.py" rel="noopener noreferrer"&gt;Read the code&lt;/a&gt; and decide for yourself.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Mini-Castor is open source under Apache 2.0. The entire kernel is &lt;a href="https://github.com/substratum-labs/mini-castor/blob/main/mini_castor.py" rel="noopener noreferrer"&gt;one file&lt;/a&gt;, ~500 lines, heavily commented. Designed to be read top to bottom in one sitting.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>python</category>
    </item>
  </channel>
</rss>
