<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: lelandfy</title>
    <description>The latest articles on Forem by lelandfy (@leland_fy).</description>
    <link>https://forem.com/leland_fy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3818658%2Fa706e8a0-66cb-43d9-9195-2e25efba6d2c.jpg</url>
      <title>Forem: lelandfy</title>
      <link>https://forem.com/leland_fy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/leland_fy"/>
    <language>en</language>
    <item>
      <title>Stop Writing Docker Wrappers for Your AI Agent's Code Execution</title>
      <dc:creator>lelandfy</dc:creator>
      <pubDate>Mon, 16 Mar 2026 21:37:05 +0000</pubDate>
      <link>https://forem.com/leland_fy/stop-writing-docker-wrappers-for-your-ai-agents-code-execution-1c5b</link>
      <guid>https://forem.com/leland_fy/stop-writing-docker-wrappers-for-your-ai-agents-code-execution-1c5b</guid>
      <description>&lt;p&gt;Every AI agent that executes code needs a sandbox. And teams building one often end up writing the same thing: a Python wrapper around &lt;code&gt;subprocess.run(["docker", "run", ...])&lt;/code&gt; with a growing list of security flags they keep forgetting to set.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Here's what a typical "sandbox" looks like in most agent codebases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--rm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--network=none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--memory=512m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--cpus=1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--read-only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--security-opt=no-new-privileges&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--pids-limit=64&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python:3.12-slim&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;print(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works. Until it doesn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Someone forgets &lt;code&gt;--network=none&lt;/code&gt; and your agent starts making HTTP requests.&lt;/li&gt;
&lt;li&gt;Timeout handling becomes a mess when Docker itself hangs.&lt;/li&gt;
&lt;li&gt;Parsing stdout/stderr gets fragile fast.&lt;/li&gt;
&lt;li&gt;Cleanup on crash? Good luck.&lt;/li&gt;
&lt;li&gt;Want to swap Docker for Firecracker? Rewrite everything.&lt;/li&gt;
&lt;/ul&gt;
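
&lt;p&gt;If you do keep a hand-rolled wrapper for a while, the least you can do is centralize the flag list so no call site can forget one. A minimal sketch — the helper name and flag set below are illustrative, not part of any library:&lt;/p&gt;

```python
# Hypothetical helper: keep every security flag in ONE place so no call
# site can forget --network=none or --read-only.
HARDENING_FLAGS = [
    "--rm", "--network=none", "--memory=512m", "--cpus=1",
    "--read-only", "--security-opt=no-new-privileges", "--pids-limit=64",
]

def build_sandbox_cmd(image, argv):
    # Every invocation inherits the full flag set; loosening a restriction
    # has to be an explicit, visible change at the call site.
    return ["docker", "run", *HARDENING_FLAGS, image, *argv]

cmd = build_sandbox_cmd("python:3.12-slim", ["python3", "-c", "print('hello')"])
print(cmd)
```

&lt;p&gt;This buys you consistency, but none of the lifecycle handling below.&lt;/p&gt;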

&lt;h2&gt;
  
  
  What We Built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/substratum-labs/roche" rel="noopener noreferrer"&gt;Roche&lt;/a&gt; is a sandbox orchestrator that replaces all of that with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;roche_sandbox&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Roche&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;Roche&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python:3.12-slim&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;print(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The sandbox is created with secure defaults, the command runs, and the sandbox is destroyed when the context manager exits, even if your code throws an exception.&lt;/p&gt;
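
&lt;p&gt;The destroy-on-exception guarantee comes from ordinary context-manager semantics: teardown runs in a &lt;code&gt;finally&lt;/code&gt; block, crash or no crash. A self-contained sketch with a stand-in sandbox (not the Roche API):&lt;/p&gt;

```python
from contextlib import contextmanager

destroyed = []

@contextmanager
def fake_sandbox():
    sandbox = {"id": "sb-1"}
    try:
        yield sandbox
    finally:
        destroyed.append(sandbox["id"])  # always runs, crash or not

try:
    with fake_sandbox() as sb:
        raise RuntimeError("agent code blew up")
except RuntimeError:
    pass

print(destroyed)  # the sandbox was still torn down
```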

&lt;h2&gt;
  
  
  What "Secure Defaults" Actually Means
&lt;/h2&gt;

&lt;p&gt;When you call &lt;code&gt;Roche().create()&lt;/code&gt; with no arguments, you get:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Network&lt;/td&gt;
&lt;td&gt;Disabled&lt;/td&gt;
&lt;td&gt;LLM-generated code should not make HTTP calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Filesystem&lt;/td&gt;
&lt;td&gt;Read-only&lt;/td&gt;
&lt;td&gt;No persistent writes, no dropping payloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Timeout&lt;/td&gt;
&lt;td&gt;300 seconds&lt;/td&gt;
&lt;td&gt;No infinite loops eating your CPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PID limit&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;No fork bombs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privileges&lt;/td&gt;
&lt;td&gt;no-new-privileges&lt;/td&gt;
&lt;td&gt;No privilege escalation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every one of these can be overridden when you need to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sandbox&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;roche&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python:3.12-slim&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# enable network
&lt;/span&gt;    &lt;span class="n"&gt;writable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# writable filesystem
&lt;/span&gt;    &lt;span class="n"&gt;timeout_secs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# longer timeout
&lt;/span&gt;    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1g&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# memory limit
&lt;/span&gt;    &lt;span class="n"&gt;cpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# CPU limit
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But you have to opt in. Dangerous capabilities are never on by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  Async Support
&lt;/h2&gt;

&lt;p&gt;If you're building an async agent (most are), there's &lt;code&gt;AsyncRoche&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;roche_sandbox&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncRoche&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;roche&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AsyncRoche&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;with &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;roche&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
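
&lt;p&gt;The async variant composes naturally with &lt;code&gt;asyncio.gather&lt;/code&gt; for fan-out. A runnable sketch with a stand-in coroutine in place of a real &lt;code&gt;AsyncRoche&lt;/code&gt; call (&lt;code&gt;run_in_sandbox&lt;/code&gt; here is illustrative, not the library API):&lt;/p&gt;

```python
import asyncio

async def run_in_sandbox(code):
    # Stand-in for "create sandbox, exec, destroy" so the sketch runs anywhere.
    await asyncio.sleep(0.01)
    return f"ran: {code}"

async def main():
    snippets = ["print(1)", "print(2)", "print(3)"]
    # Each snippet gets its own sandbox; all three run concurrently.
    return await asyncio.gather(*[run_in_sandbox(s) for s in snippets])

results = asyncio.run(main())
print(results)
```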



&lt;h2&gt;
  
  
  Using It With Agent Frameworks
&lt;/h2&gt;

&lt;p&gt;Roche doesn't care what framework you use. Here's a quick example with OpenAI Agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;function_tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;roche_sandbox&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Roche&lt;/span&gt;

&lt;span class="n"&gt;roche&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Roche&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@function_tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_python&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Execute Python code in a secure sandbox.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;roche&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exit_code&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Coder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You can run Python code using execute_python.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;execute_python&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same pattern works with LangChain, CrewAI, Anthropic tool use, AutoGen, etc. The sandbox logic stays the same regardless of the framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Swapping Providers
&lt;/h2&gt;

&lt;p&gt;The whole point of Roche is that provider choice is a config change, not a rewrite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Docker (default)
&lt;/span&gt;&lt;span class="n"&gt;roche&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Roche&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Firecracker microVMs (stronger isolation)
&lt;/span&gt;&lt;span class="n"&gt;roche&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Roche&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;firecracker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# WebAssembly (lightweight, fast)
&lt;/span&gt;&lt;span class="n"&gt;roche&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Roche&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wasm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your &lt;code&gt;create / exec / destroy&lt;/code&gt; calls don't change. The security defaults adjust per provider but stay safe.&lt;/p&gt;
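
&lt;p&gt;In practice "a config change" can literally mean an environment variable. A small sketch — the &lt;code&gt;SANDBOX_PROVIDER&lt;/code&gt; variable name is an assumption for illustration, not part of Roche:&lt;/p&gt;

```python
import os

VALID = {"docker", "firecracker", "wasm"}

def provider_from_env(default="docker"):
    # Fall back to the most widely available backend; fail loudly on typos.
    p = os.environ.get("SANDBOX_PROVIDER", default)
    if p not in VALID:
        raise ValueError(f"unknown provider: {p}")
    return p

print(provider_from_env())
```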

&lt;h2&gt;
  
  
  Architecture (For the Curious)
&lt;/h2&gt;

&lt;p&gt;The core is a Rust library (&lt;code&gt;roche-core&lt;/code&gt;) with a &lt;code&gt;SandboxProvider&lt;/code&gt; trait:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your Code (Python/TS/Go)
    |
    v
SDK (roche-sandbox on PyPI)
    |
    v
CLI subprocess or gRPC daemon (roched)
    |
    v
roche-core (Rust)
    |
    v
Docker / Firecracker / WASM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SDKs communicate with the Rust core either by shelling out to the &lt;code&gt;roche&lt;/code&gt; CLI (zero setup) or through a gRPC daemon (&lt;code&gt;roched&lt;/code&gt;) that adds sandbox pooling for faster acquisition.&lt;/p&gt;

&lt;p&gt;You don't need to install Rust. &lt;code&gt;pip install roche-sandbox&lt;/code&gt; is enough if you have Docker on your machine.&lt;/p&gt;
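
&lt;p&gt;The CLI transport is easy to picture: the SDK spawns the binary and parses structured JSON from stdout. A sketch of that pattern, with &lt;code&gt;python -c&lt;/code&gt; standing in for the real &lt;code&gt;roche&lt;/code&gt; binary so it runs anywhere (the payload shape is illustrative):&lt;/p&gt;

```python
import json
import subprocess
import sys

# Stand-in CLI that emits one JSON object, like a sandbox result would be.
fake_cli = [
    sys.executable, "-c",
    "import json; print(json.dumps({'exit_code': 0, 'stdout': 'hello'}))",
]
proc = subprocess.run(fake_cli, capture_output=True, text=True, check=True)
payload = json.loads(proc.stdout)
print(payload["stdout"])
```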

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;roche-sandbox
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;roche_sandbox&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Roche&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;Roche&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;import sys; print(sys.version)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requirements: Python 3.10+ and Docker.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/substratum-labs/roche" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://substratum-labs.github.io/roche-docs/" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/roche-sandbox/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.npmjs.com/package/roche-sandbox" rel="noopener noreferrer"&gt;npm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The whole thing is Apache-2.0 licensed. Contributions welcome.&lt;/p&gt;

</description>
      <category>python</category>
      <category>rust</category>
      <category>docker</category>
      <category>ai</category>
    </item>
    <item>
      <title>Do LLM Agents Need an OS?</title>
      <dc:creator>lelandfy</dc:creator>
      <pubDate>Wed, 11 Mar 2026 15:29:39 +0000</pubDate>
      <link>https://forem.com/leland_fy/do-llm-agents-need-an-os-15i4</link>
      <guid>https://forem.com/leland_fy/do-llm-agents-need-an-os-15i4</guid>
      <description>&lt;p&gt;LLM agents are getting smarter every month. But the way we run them hasn't changed: a prompt, a while loop, and a prayer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# no isolation
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This architecture has three gaps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No interception point.&lt;/strong&gt; If the agent decides to call &lt;code&gt;delete_database()&lt;/code&gt;, the damage is done before you see it in the logs. There is no gate between the LLM's decision and the side effect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No budget.&lt;/strong&gt; The agent can make unlimited API calls, send unlimited emails, or consume unlimited compute. The only limit is when the context window fills up or the while loop hits max iterations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No crash recovery.&lt;/strong&gt; If the process dies, all state is lost. The agent starts from scratch — re-executing every tool call, re-spending every API dollar.&lt;/p&gt;

&lt;p&gt;We solved analogous problems decades ago. In the 1960s, computers ran one program at a time with full hardware access. Then operating systems arrived — process isolation, resource quotas, preemptive scheduling — and everything changed.&lt;/p&gt;

&lt;p&gt;What if we applied the same ideas to LLM agents?&lt;/p&gt;

&lt;h2&gt;
  
  
  The OS Analogy
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;OS Concept&lt;/th&gt;
&lt;th&gt;Agent Equivalent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System calls&lt;/td&gt;
&lt;td&gt;All tool calls go through a validated gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Process checkpoints&lt;/td&gt;
&lt;td&gt;Agent state is a replay log, not a serialized coroutine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resource quotas&lt;/td&gt;
&lt;td&gt;Finite, depletable budgets per resource type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardware interrupts&lt;/td&gt;
&lt;td&gt;Destructive ops auto-suspend for human review&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: &lt;strong&gt;the agent function is "user space," and the kernel controls all side effects.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a Microkernel, Not a Framework
&lt;/h2&gt;

&lt;p&gt;If agents need OS-like controls, the obvious move is to bake them into the agent framework itself. LangChain adds guardrails. CrewAI adds role-based access. Every framework reinvents its own safety layer.&lt;/p&gt;

&lt;p&gt;This is the monolithic kernel approach — and it has the same problems it had in the 1980s:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tight coupling.&lt;/strong&gt; Safety logic is entangled with orchestration logic. You can't use LangChain's budget system with AutoGen's agents. Each framework is a walled garden.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All or nothing.&lt;/strong&gt; Want checkpoint/replay? You need to adopt the entire framework. Want HITL approval? Same deal. There's no way to add just the control layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Growing attack surface.&lt;/strong&gt; Every new framework feature is another place where an unchecked tool call can slip through. The more the framework does, the harder it is to audit.&lt;/p&gt;

&lt;p&gt;The microkernel approach inverts this. The kernel does exactly four things: validate tool calls, enforce budgets, gate destructive operations, and manage checkpoints. Everything else — orchestration, prompting, LLM selection, agent logic — stays in user space. The kernel is small enough to audit, framework-agnostic, and composable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Monolithic:   [Agent logic + tools + safety + budgets + HITL + replay]
              &amp;lt;- one framework, tightly coupled

Microkernel:  [Agent logic + tools]  &amp;lt;-  user space (any framework)
              ------------------------------------------------
              [validation | budgets | HITL | replay]  &amp;lt;-  kernel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is why the kernel should exist as a standalone component, not a plugin for an existing framework. Any framework can integrate with it. Any agent can run on top of it.&lt;/p&gt;

&lt;p&gt;To test this idea, I wrote &lt;a href="https://github.com/substratum-labs/mini-castor" rel="noopener noreferrer"&gt;Mini-Castor&lt;/a&gt; — the entire microkernel in one Python file. No dependencies. No frameworks. Just &lt;code&gt;asyncio&lt;/code&gt;, &lt;code&gt;dataclasses&lt;/code&gt;, and &lt;code&gt;contextvars&lt;/code&gt;. ~500 lines.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Syscall Proxy
&lt;/h2&gt;

&lt;p&gt;Every agent action goes through a single gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;syscall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_emails&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;older than 30 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The proxy implements three paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Replay path&lt;/strong&gt; — serving cached responses from a previous run (instant)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast path&lt;/strong&gt; — budget check -&amp;gt; execute -&amp;gt; log (normal execution)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow path&lt;/strong&gt; — destructive tool -&amp;gt; suspend for human review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent never calls tools directly. It doesn't know which path it's on.&lt;/p&gt;
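
&lt;p&gt;A toy version of that dispatch makes the three paths concrete. Class and method shapes here are illustrative, not Mini-Castor's actual API:&lt;/p&gt;

```python
import asyncio

class Suspend(Exception):
    """Raised when a destructive call must wait for human review."""

class Proxy:
    def __init__(self, budget, replay_log=None):
        self.budget = budget
        self.replay_log = list(replay_log or [])
        self.log = []

    async def syscall(self, tool, args, destructive=False):
        if self.replay_log:               # replay path: serve the cached answer
            return self.replay_log.pop(0)
        if destructive:                   # slow path: suspend for approval
            raise Suspend(f"{tool} needs approval")
        if self.budget == 0:              # fast path starts with a budget check
            raise RuntimeError("budget exhausted")
        self.budget -= 1
        result = f"{tool}({args})"        # stand-in for real tool execution
        self.log.append(result)           # ...then log
        return result

async def demo():
    p = Proxy(budget=2)
    out = await p.syscall("search_emails", {"q": "old"})
    try:
        await p.syscall("delete_emails", {"q": "old"}, destructive=True)
    except Suspend:
        out = out + " | suspended"
    return out

print(asyncio.run(demo()))
```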

&lt;h2&gt;
  
  
  The Core Trick: Non-Serialized Coroutines
&lt;/h2&gt;

&lt;p&gt;This is the most important design choice.&lt;/p&gt;

&lt;p&gt;Python coroutines can't be serialized. You can't pickle an &lt;code&gt;async def&lt;/code&gt; that's halfway through — it holds C-level stack frames, event loop references, and closure state.&lt;/p&gt;

&lt;p&gt;Mini-Castor's solution: &lt;strong&gt;don't serialize the coroutine at all.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Record every syscall and its response in a log&lt;/li&gt;
&lt;li&gt;To "suspend": raise an exception that unwinds the entire call stack&lt;/li&gt;
&lt;li&gt;To "resume": re-run the function from the top, serve cached responses&lt;/li&gt;
&lt;li&gt;The agent fast-forwards through cached syscalls, then continues live
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Resume after suspension:
  syscall #0: search_emails  -&amp;gt; cached (instant)
  syscall #1: analyze        -&amp;gt; cached (instant)
  syscall #2: delete_emails  -&amp;gt; LIVE (human approved)
  syscall #3: send_summary   -&amp;gt; LIVE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent doesn't know it was ever "killed." From its perspective, &lt;code&gt;syscall()&lt;/code&gt; just returned a value. This also gives you crash recovery for free — save the checkpoint to disk, resume from any point.&lt;/p&gt;
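&lt;p&gt;The whole suspend/resume cycle can be sketched in a few lines. Everything here is a hypothetical illustration — &lt;code&gt;Kernel&lt;/code&gt;, &lt;code&gt;Suspend&lt;/code&gt;, and the executor-callback shape are assumptions, not the real implementation:&lt;/p&gt;

```python
import asyncio

# Sketch of replay-based resumption (illustrative names). Instead of
# serializing a paused coroutine, re-run the agent from the top and serve
# recorded responses until execution reaches the live frontier.
class Suspend(Exception):
    """Unwinds the entire call stack; the checkpoint is just the log."""

class Kernel:
    def __init__(self):
        self.log = []          # recorded (name, response) pairs
        self.approved = False  # flipped by a (simulated) human review

    async def syscall(self, name, executor):
        idx = self._cursor
        self._cursor += 1
        if idx < len(self.log):               # fast-forward: cached response
            return self.log[idx][1]
        if name == "delete_emails" and not self.approved:
            raise Suspend()                   # suspend: unwind the stack
        result = await executor()             # live execution
        self.log.append((name, result))
        return result

    async def run(self, agent):
        self._cursor = 0                      # every (re)run starts at the top
        try:
            return await agent(self)
        except Suspend:
            return None                       # checkpoint persists in self.log
```

&lt;p&gt;On the second &lt;code&gt;run()&lt;/code&gt; the agent replays &lt;code&gt;search_emails&lt;/code&gt; from the log and only &lt;code&gt;delete_emails&lt;/code&gt; executes live — exactly the fast-forward shown above.&lt;/p&gt;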

&lt;h2&gt;
  
  
  Capability Budgets
&lt;/h2&gt;

&lt;p&gt;Every tool declares its cost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;destructive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;delete_emails&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The kernel deducts before execution and refunds on failure. When the budget hits zero, the agent is stopped. No runaway API bills.&lt;/p&gt;

&lt;p&gt;A subtle detail: deduction happens &lt;em&gt;before&lt;/em&gt; execution, which makes the refund essential. If the tool raises an exception, the cost has already been deducted but no result is ever logged. Without the refund, replay would re-execute that syscall and deduct again — a permanent budget leak.&lt;/p&gt;
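&lt;p&gt;The deduct-then-refund pattern fits in a few lines. This is a hedged sketch with made-up names (&lt;code&gt;Budget&lt;/code&gt;, &lt;code&gt;charge&lt;/code&gt;), not the kernel's real interface:&lt;/p&gt;

```python
# Sketch: deduct BEFORE execution, refund on failure (illustrative names).
class BudgetExceeded(Exception):
    pass

class Budget:
    def __init__(self, limit):
        self.remaining = limit

    def charge(self, cost, fn):
        if self.remaining < cost:
            raise BudgetExceeded(f"need {cost}, have {self.remaining}")
        self.remaining -= cost        # deduct up front
        try:
            return fn()
        except Exception:
            self.remaining += cost    # refund on failure, then re-raise
            raise
```

&lt;p&gt;Because a failed call is fully refunded and never logged, replaying it is idempotent: re-execution charges the budget exactly once.&lt;/p&gt;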

&lt;h2&gt;
  
  
  Why Modify Doesn't Mutate the Request
&lt;/h2&gt;

&lt;p&gt;Destructive tools auto-suspend. The human gets three choices: approve, reject, or modify.&lt;/p&gt;

&lt;p&gt;"Modify" is the subtlest design decision. When a human says "only delete files older than 90 days," we do NOT edit the pending request. That would break replay:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;On replay, the agent emits:  delete(scope="all")       &amp;lt;- original
But the log would contain:   delete(scope="90+ days")  &amp;lt;- mutated
                              MISMATCH -&amp;gt; ReplayDivergenceError
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead, we log the original request with the human's feedback as the response. On replay, the agent sees &lt;code&gt;{"status": "MODIFIED", "feedback": "only 90+ days"}&lt;/code&gt; and the LLM re-plans with revised arguments. The revised call becomes a new syscall entry. Full audit trail. Replay integrity preserved. The human writes natural language, not JSON.&lt;/p&gt;
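&lt;p&gt;The invariant is easy to state in code: the request in the log is always byte-for-byte what the agent emitted, and the human's verdict lives only in the response. A minimal sketch (the &lt;code&gt;review&lt;/code&gt; helper and status strings are hypothetical):&lt;/p&gt;

```python
# Sketch: "modify" records the human's feedback as the RESPONSE to the
# original request; the request itself is never mutated (names illustrative).
def review(log, request, decision, feedback=None):
    if decision == "approve":
        response = {"status": "APPROVED"}
    elif decision == "reject":
        response = {"status": "REJECTED"}
    else:  # modify: leave the pending request untouched
        response = {"status": "MODIFIED", "feedback": feedback}
    log.append((request, response))   # original request, verbatim
    return response
```

&lt;p&gt;On replay the agent re-emits the original request, the log matches, and the &lt;code&gt;MODIFIED&lt;/code&gt; response drives the LLM to re-plan — no divergence.&lt;/p&gt;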

&lt;h2&gt;
  
  
  The ContextVar Bridge
&lt;/h2&gt;

&lt;p&gt;The proxy is powerful, but it couples agent code to the kernel. Every agent must accept a &lt;code&gt;proxy&lt;/code&gt; parameter. We can do better:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Agent code — zero kernel imports, zero kernel coupling
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_agent&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;code&gt;ContextVar&lt;/code&gt; (Python's async-aware thread-local) holds the current proxy. The kernel sets it before running the agent; free functions read it implicitly. This creates a clean separation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operator&lt;/strong&gt;: sets up kernel, registers tools, manages budgets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent developer&lt;/strong&gt;: writes pure logic using &lt;code&gt;call_tool()&lt;/code&gt; / &lt;code&gt;budget()&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This mirrors how libc hides raw syscall numbers behind &lt;code&gt;printf()&lt;/code&gt; and &lt;code&gt;malloc()&lt;/code&gt;. The kernel auto-detects the agent's signature — &lt;code&gt;agent(proxy)&lt;/code&gt; gets the proxy explicitly, &lt;code&gt;agent()&lt;/code&gt; uses the ContextVar implicitly. Both work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/substratum-labs/mini-castor.git
&lt;span class="nb"&gt;cd &lt;/span&gt;mini-castor
python demo.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The demo runs a research assistant that searches, analyzes, and tries to delete emails. The kernel suspends at the destructive call. You choose: approve, reject, or modify. Watch the replay.&lt;/p&gt;

&lt;p&gt;No API keys. No dependencies. 30 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Leaves Out
&lt;/h2&gt;

&lt;p&gt;Mini-Castor is a teaching tool, not a production system. It deliberately skips:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema validation&lt;/strong&gt; — real tools need Pydantic validation with LLM-readable errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sub-agent spawning&lt;/strong&gt; — parent agents delegating budget to children&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context window management&lt;/strong&gt; — evicting old messages when the window fills&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistence&lt;/strong&gt; — saving checkpoints to SQLite for real crash recovery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming + preemption&lt;/strong&gt; — canceling an LLM mid-generation and capturing partial output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are hard problems. We're working on a production kernel that tackles them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question
&lt;/h2&gt;

&lt;p&gt;The agent ecosystem is building increasingly complex orchestration on top of the "prompt + while loop." But maybe what we need isn't another framework on top — it's a layer underneath. A kernel that any framework can integrate with. A syscall boundary that makes every tool call auditable, budgeted, and interruptible.&lt;/p&gt;

&lt;p&gt;Do LLM agents need an OS? &lt;a href="https://github.com/substratum-labs/mini-castor/blob/main/mini_castor.py" rel="noopener noreferrer"&gt;Read the code&lt;/a&gt; and decide for yourself.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Mini-Castor is open source under Apache 2.0. The entire kernel is &lt;a href="https://github.com/substratum-labs/mini-castor/blob/main/mini_castor.py" rel="noopener noreferrer"&gt;one file&lt;/a&gt;, ~500 lines, heavily commented. Designed to be read top to bottom in one sitting.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>python</category>
    </item>
  </channel>
</rss>
