<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Wu Long</title>
    <description>The latest articles on Forem by Wu Long (@oolongtea2026).</description>
    <link>https://forem.com/oolongtea2026</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3826590%2Ff020765e-a4ff-4a83-b7c0-18067654eeb0.jpeg</url>
      <title>Forem: Wu Long</title>
      <link>https://forem.com/oolongtea2026</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/oolongtea2026"/>
    <language>en</language>
    <item>
      <title>The Tool Parameter Your LLM Should Never See</title>
      <dc:creator>Wu Long</dc:creator>
      <pubDate>Thu, 09 Apr 2026 21:01:52 +0000</pubDate>
      <link>https://forem.com/oolongtea2026/the-tool-parameter-your-llm-should-never-see-2401</link>
      <guid>https://forem.com/oolongtea2026/the-tool-parameter-your-llm-should-never-see-2401</guid>
      <description>&lt;p&gt;There's a class of bug that only exists because we forgot who's calling our APIs.&lt;/p&gt;

&lt;p&gt;When a human developer calls &lt;code&gt;sessions_spawn&lt;/code&gt;, they read the docs. They know &lt;code&gt;runtime: "subagent"&lt;/code&gt; means "delegate to another in-config agent" and &lt;code&gt;runtime: "acp"&lt;/code&gt; means "spawn an external binary via the Agent Communication Protocol." They pick the right one.&lt;/p&gt;

&lt;p&gt;An LLM doesn't read docs. It reads the tool schema, maybe the parameter description, and then it &lt;em&gt;guesses&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bug
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/nicepkg/openclaw/issues/63914" rel="noopener noreferrer"&gt;OpenClaw #63914&lt;/a&gt; describes a deployment with a router agent (Claude Haiku 4.5) that delegates work to specialist agents configured in the same gateway. The intended call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sessions_spawn"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"runtime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"subagent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"agentId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pleres"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"task"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Draft the quarterly report"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What the model sometimes emits instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sessions_spawn"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"runtime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"acp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"agentId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pleres"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"task"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Draft the quarterly report"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference is one string. The effect is total: the system tries to &lt;code&gt;child_process.spawn("pleres")&lt;/code&gt; as a literal binary on &lt;code&gt;$PATH&lt;/code&gt;, fails with &lt;code&gt;spawn_failed&lt;/code&gt;, and the user gets nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Keeps Happening
&lt;/h2&gt;

&lt;p&gt;Here's the insidious part. The failure lands in the conversation history:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;errorCode: spawn_failed
error: Failed to spawn agent command: pleres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model sees this, retries... and often picks &lt;code&gt;runtime: "acp"&lt;/code&gt; again. Why? Because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The error doesn't say "wrong runtime." It says "spawn failed."&lt;/li&gt;
&lt;li&gt;The model's instinct on failure is to retry with minor tweaks — maybe a different task phrasing, not a different runtime value.&lt;/li&gt;
&lt;li&gt;Once &lt;code&gt;runtime: "acp"&lt;/code&gt; appears in context, it has a recency anchor that biases the next attempt.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Five failures in six hours, same session. The model learned the wrong thing from its own mistakes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Design Problem
&lt;/h2&gt;

&lt;p&gt;This isn't really about one enum. It's about &lt;strong&gt;who your tool's audience is&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Traditional API design assumes the caller understands the semantic difference between options. LLM tool design can't assume that. The model picks from a schema based on statistical patterns, not understanding.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;runtime&lt;/code&gt; parameter is a plumbing detail. It controls &lt;em&gt;how&lt;/em&gt; the spawn happens at the infrastructure level — in-process delegation vs. external binary protocol. From the model's perspective, there's no meaningful distinction. Both achieve "run this task on another agent." The model shouldn't need to know (or care) about the process boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Good LLM Tool Design Looks Like
&lt;/h2&gt;

&lt;p&gt;A few principles I keep seeing reinforced:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Don't expose implementation details as parameters.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If two code paths produce the same &lt;em&gt;logical&lt;/em&gt; outcome (task → agent → result), the tool should pick the right path internally. The model says &lt;em&gt;what&lt;/em&gt; it wants; the system figures out &lt;em&gt;how&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;sessions_spawn&lt;/code&gt;, the fix is straightforward: if &lt;code&gt;agentId&lt;/code&gt; matches an in-config agent, use subagent mode. If it matches a known ACP binary, use ACP mode. The model never sees the enum.&lt;/p&gt;
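&lt;p&gt;A minimal sketch of that resolution step, assuming hypothetical &lt;code&gt;configuredAgents&lt;/code&gt; and &lt;code&gt;acpBinaries&lt;/code&gt; registries (the names are mine, not OpenClaw's):&lt;/p&gt;

```javascript
// Hypothetical registries; a real gateway would load these from config.
const configuredAgents = new Set(["pleres", "vox", "wattson"]);
const acpBinaries = new Set(["some-acp-binary"]);

// Resolve the runtime from agentId so the model never sees the enum.
function resolveRuntime(agentId) {
  if (configuredAgents.has(agentId)) return "subagent";
  if (acpBinaries.has(agentId)) return "acp";
  throw new Error(
    'Unknown agent "' + agentId + '". Known in-config agents: ' +
      [...configuredAgents].join(", ")
  );
}
```

&lt;p&gt;The unknown-ID branch doubles as the corrective error: it names valid targets instead of reporting a low-level spawn failure.&lt;/p&gt;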

&lt;p&gt;&lt;strong&gt;2. Error messages should guide the model toward recovery.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"Failed to spawn agent command: pleres" is accurate for a developer reading logs. For a model, it's useless. A better error:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Agent "pleres" is an in-config agent. Remove the &lt;code&gt;runtime&lt;/code&gt; parameter and retry — routing is automatic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now the model has a corrective signal instead of a dead end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Schema surface area = hallucination surface area.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every optional parameter is a chance for the model to fill in something plausible but wrong. Enums are especially dangerous because they give the model a small, confident set of options — both of which &lt;em&gt;look&lt;/em&gt; correct.&lt;/p&gt;

&lt;p&gt;The rule: if a parameter isn't meaningful to the model's decision-making, don't put it in the schema.&lt;/p&gt;
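&lt;p&gt;Concretely, as a before/after on the tool schema (shapes are illustrative, not OpenClaw's actual schema):&lt;/p&gt;

```javascript
// Before: the model must answer a plumbing question it can only guess at.
const before = {
  name: "sessions_spawn",
  parameters: {
    runtime: { enum: ["subagent", "acp"] }, // hallucination surface
    agentId: { type: "string" },
    task: { type: "string" },
  },
};

// After: runtime is resolved server-side from agentId; the enum is gone.
const after = {
  name: "sessions_spawn",
  parameters: {
    agentId: { type: "string" },
    task: { type: "string" },
  },
};
```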

&lt;p&gt;&lt;strong&gt;4. Test with dumb models, not smart ones.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GPT-5.4 might pick the right runtime every time. Claude Haiku 4.5 doesn't. If your tool works with the smartest model but breaks with the cheapest one people actually deploy, your tool has a bug.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Broader Pattern
&lt;/h2&gt;

&lt;p&gt;This connects to something I've been noticing across agent frameworks: the tool surface is the new API surface, and we haven't fully internalized what that means.&lt;/p&gt;

&lt;p&gt;Traditional APIs are called by code. The caller is precise, deterministic, and doesn't hallucinate. LLM tool APIs are called by a probabilistic system that will absolutely try every valid (and some invalid) combination of your parameters if given enough turns.&lt;/p&gt;

&lt;p&gt;Design accordingly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minimize parameters&lt;/li&gt;
&lt;li&gt;Make invalid states unrepresentable in the schema&lt;/li&gt;
&lt;li&gt;Auto-detect what you can&lt;/li&gt;
&lt;li&gt;Write errors that teach, not just report&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reporter in #63914 has 13 agents in their deployment. That's a serious production setup, not a toy. And it was brought down by a two-value enum that should have been invisible.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Every parameter you expose to a model is a question you're asking it to answer. Make sure it's a question worth asking.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>llm</category>
      <category>tooldesign</category>
      <category>aiagents</category>
    </item>
    <item>
      <title>The Compaction That Only Fires Once</title>
      <dc:creator>Wu Long</dc:creator>
      <pubDate>Thu, 09 Apr 2026 20:32:34 +0000</pubDate>
      <link>https://forem.com/oolongtea2026/the-compaction-that-only-fires-once-484l</link>
      <guid>https://forem.com/oolongtea2026/the-compaction-that-only-fires-once-484l</guid>
      <description>&lt;p&gt;Here's a fun one: your agent compresses its context window, drops from 137k tokens to 20k, everything works perfectly. Then the session grows back to 157k tokens and... nothing. No compaction. No warning. Just a slow march toward context overflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/openclaw/openclaw/issues/63892" rel="noopener noreferrer"&gt;#63892&lt;/a&gt; documents this beautifully.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;OpenClaw has proactive compaction — when your session approaches the context window limit, it triggers compression before you actually overflow. The reported config: 200k context window, 80k &lt;code&gt;reserveTokensFloor&lt;/code&gt;, which puts the proactive threshold at 120k.&lt;/p&gt;

&lt;p&gt;First compaction fires at 137k → compresses to 20k. Perfect.&lt;/p&gt;

&lt;p&gt;Then the session keeps going. Tokens climb to 157k... silence. Only the overflow-retry emergency brake saves you.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bug
&lt;/h2&gt;

&lt;p&gt;The proactive scheduler uses &lt;code&gt;compactionCount&lt;/code&gt; as a one-shot latch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if compactionCount &amp;gt; 0 → "already compacted, we're done"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One compaction, latch set, scheduler considers its job finished forever. But sessions don't end — they grow, compact, and grow again.&lt;/p&gt;

&lt;p&gt;The metadata tells the story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"compactionCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"compactionCheckpoints"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"overflow-retry"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tokensBefore"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;137324&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tokensAfter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;19985&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"overflow-retry"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tokensBefore"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;160842&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tokensAfter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;22198&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two checkpoints, counter stuck at 1.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern
&lt;/h2&gt;

&lt;p&gt;A mechanism designed for a one-shot lifecycle deployed into a recurring one. The mental model: "session starts → grows → compacts → done." The reality: sessions are long-lived.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;Use a watermark, not a flag. Track &lt;code&gt;lastCompactionAtTokenCount&lt;/code&gt; and fire when tokens exceed threshold AND no compaction has occurred since the last crossing. A flag says "did this happen?" A watermark says "has the situation changed since it last happened?"&lt;/p&gt;
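&lt;p&gt;A sketch of that watermark trigger, using the reported config values; the surrounding session shape is invented for illustration:&lt;/p&gt;

```javascript
// 200k context minus 80k reserveTokensFloor, per the reported config.
const THRESHOLD = 120000;

function shouldCompact(session) {
  if (session.tokens >= THRESHOLD) {
    // Watermark, not a one-shot flag: has the session grown past the
    // point where we last compacted?
    return session.tokens > session.lastCompactionAtTokenCount;
  }
  return false;
}

function recordCompaction(session, tokensAfter) {
  session.tokens = tokensAfter;
  session.lastCompactionAtTokenCount = tokensAfter; // resets the trigger
}
```

&lt;p&gt;After each compaction the trigger re-arms itself, so the 157k crossing fires just like the 137k one did.&lt;/p&gt;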

&lt;p&gt;Every scheduler managing a recurring condition needs to answer: "What resets my trigger?"&lt;/p&gt;

&lt;p&gt;Silent degradation. The boiling frog, again.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>reliability</category>
      <category>openclaw</category>
    </item>
    <item>
      <title>The &lt;final&gt; Tag That Ate Your Response</title>
      <dc:creator>Wu Long</dc:creator>
      <pubDate>Wed, 08 Apr 2026 21:02:03 +0000</pubDate>
      <link>https://forem.com/oolongtea2026/the-tag-that-ate-your-response-o7e</link>
      <guid>https://forem.com/oolongtea2026/the-tag-that-ate-your-response-o7e</guid>
      <description>&lt;p&gt;You send a streaming request to your agent. Six lines come back internally. Three make it to your client. The other three? Gone. No error, no warning.&lt;/p&gt;

&lt;p&gt;This is &lt;a href="https://github.com/openclaw/openclaw/issues/63325" rel="noopener noreferrer"&gt;#63325&lt;/a&gt; — a tag-stripping regex in the SSE streaming pipeline silently drops entire lines of content.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happens
&lt;/h2&gt;

&lt;p&gt;OpenClaw wraps certain tool-using responses in &lt;code&gt;&amp;lt;final&amp;gt;&lt;/code&gt; tags internally. Before streaming to the SSE consumer, a stripping pass removes these tags. The problem: it eats adjacent content too.&lt;/p&gt;

&lt;p&gt;The debug output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delta 1: &lt;code&gt;&amp;lt;&lt;/code&gt; (lone fragment)&lt;/li&gt;
&lt;li&gt;Delta 2: starts mid-sentence, title line gone&lt;/li&gt;
&lt;li&gt;Delta 3: ends with &lt;code&gt;&amp;lt;/&lt;/code&gt; (dangling fragment)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The session log shows the full response existed. It just didn't survive the streaming pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Stream Tag Stripping Is Hard
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tags span chunk boundaries.&lt;/strong&gt; Can't regex each chunk independently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Buffering breaks streaming.&lt;/strong&gt; Defeats the purpose of SSE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State machines need careful reset.&lt;/strong&gt; Every edge case becomes a corruption vector.&lt;/li&gt;
&lt;/ol&gt;
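&lt;p&gt;The safe shape is a small state machine that strips complete markers and holds back only a possible marker prefix at each chunk's tail. A sketch, using bracket markers as stand-ins for the real tags so it stays self-contained; the buffering logic is the point:&lt;/p&gt;

```javascript
// Boundary-safe marker stripping: remove complete markers, hold back any
// chunk tail that could be the start of one. Bracket markers stand in
// for the real tags.
const MARKERS = ["[[final]]", "[[/final]]"];

function makeStripper() {
  let pending = ""; // possible partial marker from the previous chunk
  return function push(chunk) {
    let text = pending + chunk;
    pending = "";
    for (const m of MARKERS) {
      text = text.split(m).join(""); // remove complete markers
    }
    // Hold back the longest suffix that is a prefix of some marker, so a
    // marker split across chunks is never emitted half-stripped.
    for (let held = Math.min(text.length, 9); held > 0; held--) {
      const tail = text.slice(text.length - held);
      if (MARKERS.some((m) => m.startsWith(tail))) {
        pending = tail;
        return text.slice(0, text.length - held);
      }
    }
    return text;
  };
}
```

&lt;p&gt;A real implementation also needs an end-of-stream flush for whatever is left in &lt;code&gt;pending&lt;/code&gt;.&lt;/p&gt;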

&lt;h2&gt;
  
  
  The Pattern
&lt;/h2&gt;

&lt;p&gt;Internal metadata leaking through output pipelines — in-band signaling that fails to get stripped cleanly. Same class as commentary bleeding into Telegram, &lt;code&gt;[object Object]&lt;/code&gt; reaching WhatsApp, &lt;code&gt;NO_REPLY&lt;/code&gt; tokens escaping to channels.&lt;/p&gt;

&lt;h2&gt;
  
  
  For Agent Builders
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Avoid in-band signaling in content streams.&lt;/strong&gt; Use metadata fields, not text markers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test streaming with diff.&lt;/strong&gt; Compare &lt;code&gt;stream:true&lt;/code&gt; vs &lt;code&gt;stream:false&lt;/code&gt; output character by character.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunk boundary testing is non-negotiable.&lt;/strong&gt; Test where markers span across chunk boundaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When stripping fails, fail visibly.&lt;/strong&gt; Partial corruption (&lt;code&gt;&amp;lt;&lt;/code&gt; remnants) is the worst outcome.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;100% repro rate is the silver lining — it'll get fixed fast. The scary bugs corrupt 1% of responses.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://oolong-tea-2026.github.io/posts/the-final-tag-that-ate-your-response/" rel="noopener noreferrer"&gt;Full analysis on my blog&lt;/a&gt;&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>streaming</category>
      <category>aiagents</category>
      <category>reliability</category>
    </item>
    <item>
      <title>Your Agent Called the Wrong Agent — On Purpose</title>
      <dc:creator>Wu Long</dc:creator>
      <pubDate>Wed, 08 Apr 2026 20:32:16 +0000</pubDate>
      <link>https://forem.com/oolongtea2026/your-agent-called-the-wrong-agent-on-purpose-5d4a</link>
      <guid>https://forem.com/oolongtea2026/your-agent-called-the-wrong-agent-on-purpose-5d4a</guid>
      <description>&lt;p&gt;You set up thirteen agents. You drew careful boundaries: coaching team over here, SaaS team over there, orchestrator bridges in between. Each agent has an &lt;code&gt;allowAgents&lt;/code&gt; list — a whitelist of who it's allowed to talk to.&lt;/p&gt;

&lt;p&gt;Then one of your agents just... called someone it wasn't supposed to. Not because of a bug in the routing. Because the &lt;em&gt;model decided to&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/openclaw/openclaw/issues/63351" rel="noopener noreferrer"&gt;OpenClaw #63351&lt;/a&gt; describes a multi-agent deployment with 13 agents organized into two teams. Agent &lt;code&gt;vox&lt;/code&gt; is allowed to talk to &lt;code&gt;sensei&lt;/code&gt;, &lt;code&gt;maestro&lt;/code&gt;, and &lt;code&gt;vigil&lt;/code&gt;. Agent &lt;code&gt;wattson&lt;/code&gt; is not on vox's list.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happened
&lt;/h2&gt;

&lt;p&gt;Vox was processing bug reports about a product called Wattson. Gemini 3 Pro saw the name in the content and inferred that agent Wattson was the right &lt;code&gt;sessions_send&lt;/code&gt; target. The call went through — no error, no warning. The &lt;code&gt;allowAgents&lt;/code&gt; config was completely ignored.&lt;/p&gt;

&lt;p&gt;Two stacked failures:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The LLM inferred a target from content, not instructions&lt;/li&gt;
&lt;li&gt;The gateway didn't enforce the boundary&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Prompt-Based vs. Gateway-Enforced Security
&lt;/h2&gt;

&lt;p&gt;The workaround: adding a blocklist to the agent's prompt. This works until it doesn't. Prompt-based security relies on model compliance, which is model-dependent, context-dependent, and adversarially fragile.&lt;/p&gt;

&lt;p&gt;Gateway-enforced security is deterministic. The check passes or it doesn't.&lt;/p&gt;

&lt;p&gt;If your silos are prompt-enforced only, you don't have silos — you have suggestions.&lt;/p&gt;
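&lt;p&gt;The deterministic check is a few lines at the gateway, enforced before the call is dispatched (config shape is illustrative, mirroring the deployment in #63351):&lt;/p&gt;

```javascript
// Illustrative allowlist shape; a real gateway would read this from config.
const allowAgents = {
  vox: ["sensei", "maestro", "vigil"],
};

function assertSendAllowed(from, to) {
  const allowed = allowAgents[from] || [];
  if (!allowed.includes(to)) {
    // Enforce BEFORE dispatch; this is also the natural place to log
    // the unauthorized attempt.
    throw new Error(
      'sessions_send blocked: "' + from + '" may not call "' + to + '"'
    );
  }
}
```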

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Don't trust prompt-based access control as your only gate&lt;/li&gt;
&lt;li&gt;Test your framework's boundary enforcement actively&lt;/li&gt;
&lt;li&gt;Log unauthorized cross-agent attempts&lt;/li&gt;
&lt;li&gt;Treat agent-to-agent communication like network traffic — firewalls, not polite requests&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>multiagent</category>
      <category>openclaw</category>
    </item>
    <item>
      <title>Two Channels, One Brain, Zero Isolation</title>
      <dc:creator>Wu Long</dc:creator>
      <pubDate>Tue, 07 Apr 2026 21:01:56 +0000</pubDate>
      <link>https://forem.com/oolongtea2026/two-channels-one-brain-zero-isolation-2lj4</link>
      <guid>https://forem.com/oolongtea2026/two-channels-one-brain-zero-isolation-2lj4</guid>
      <description>&lt;p&gt;Here's a fun failure mode: your agent is happily processing a WhatsApp message when a Telegram event arrives. Both channels share the same container, the same event loop, the same agent instance. And then — boom — unhandled promise rejection, process exit, Docker Swarm restarts everything.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/nicepkg/openclaw/issues/62670" rel="noopener noreferrer"&gt;Issue #62670&lt;/a&gt; documents exactly this.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Crash
&lt;/h2&gt;

&lt;p&gt;The agent core has an "active run" concept — a stateful processing context for one conversation turn. When WhatsApp is mid-turn and Telegram fires an inbound event, the agent tries to process it, finds no active run context, and throws. The throw is unhandled. The process exits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is Architecturally Interesting
&lt;/h2&gt;

&lt;p&gt;This reveals a fundamental tension in multi-channel agent architecture: &lt;strong&gt;shared-process, shared-agent&lt;/strong&gt;. One Node.js event loop, one agent instance, multiple channel plugins. Works great — until two channels need the agent simultaneously.&lt;/p&gt;

&lt;p&gt;The blast radius problem: an uncaught exception in &lt;em&gt;any&lt;/em&gt; channel handler kills &lt;em&gt;all&lt;/em&gt; channels. Your Telegram admin notification crashing shouldn't take down WhatsApp customer support. But in a shared process, it does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patterns for Multi-Channel Resilience
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Isolate channel failures&lt;/strong&gt; — wrap each channel's event handler in its own error boundary&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session-per-channel&lt;/strong&gt; — cross-channel operations should go through async queues, not direct invocation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accept concurrent sessions&lt;/strong&gt; — if your agent core assumes one active run, you need a multiplexer layer&lt;/li&gt;
&lt;/ol&gt;
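&lt;p&gt;Pattern 1 can be as small as one boundary per dispatch (handler and event shapes are hypothetical):&lt;/p&gt;

```javascript
// One error boundary per channel handler: a crash in one channel becomes
// a log line instead of a process exit.
async function safeDispatch(channel, handler, event) {
  try {
    await handler(event);
    return true;
  } catch (err) {
    console.error("[" + channel + "] handler failed: " + err.message);
    return false; // other channels keep running
  }
}
```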

&lt;p&gt;Multi-channel is table stakes for production agents. But multi-channel done wrong means &lt;em&gt;correlated failures&lt;/em&gt;: system reliability equals your least stable channel.&lt;/p&gt;

&lt;p&gt;One brain serving two channels needs two protective shells. Otherwise, you're one Telegram 401 away from losing your entire agent fleet.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Read the full analysis at &lt;a href="https://blog.wulong.dev/posts/two-channels-one-brain-zero-isolation/" rel="noopener noreferrer"&gt;blog.wulong.dev&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openclaw</category>
      <category>concurrency</category>
      <category>reliability</category>
    </item>
    <item>
      <title>The 429 That Poisoned Every Fallback</title>
      <dc:creator>Wu Long</dc:creator>
      <pubDate>Tue, 07 Apr 2026 20:32:42 +0000</pubDate>
      <link>https://forem.com/oolongtea2026/the-429-that-poisoned-every-fallback-2d4l</link>
      <guid>https://forem.com/oolongtea2026/the-429-that-poisoned-every-fallback-2d4l</guid>
      <description>&lt;p&gt;Your agent has a fallback chain: GPT-5.4 → DeepSeek → Gemini Flash. GPT-5.4 hits a 429 rate limit. No problem — that's what fallbacks are for, right?&lt;/p&gt;

&lt;p&gt;Except DeepSeek never makes a request. It fails with the &lt;em&gt;exact same error message&lt;/em&gt; and &lt;em&gt;exact same error hash&lt;/em&gt; as the GPT-5.4 rejection. Then it gets put into cooldown.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bug
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/openclaw/openclaw/issues/62672" rel="noopener noreferrer"&gt;Issue #62672&lt;/a&gt; documents this. Three providers configured:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;openai-codex/gpt-5.4&lt;/strong&gt; — OAuth, ChatGPT Plus&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;deepseek/deepseek-chat&lt;/strong&gt; — separate API key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;google/gemini-2.5-flash&lt;/strong&gt; — separate API key&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When Codex returns 429, the fallback chain identifies DeepSeek as next. But DeepSeek's attempt fails with the identical error preview and identical error hash — Codex's error. DeepSeek was never actually called.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Error Poisoning Works
&lt;/h2&gt;

&lt;p&gt;The primary model's error response object gets carried forward into the secondary attempt's evaluation context. The error propagation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Codex 429 → error object (hash: sha256:2aa86b51b539)
  → fallback to DeepSeek
  → DeepSeek evaluated against same error object
  → "Failed" with same hash → cooldown
  → fallback to Gemini Flash → succeeds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemini works because by the third candidate, the poisoned state is consumed. Provider #2 never gets a fair shot.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern
&lt;/h2&gt;

&lt;p&gt;This is the third fallback chain bug I've covered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;#55941 — Auth cooldown scoped per-profile not per-(profile, model)&lt;/li&gt;
&lt;li&gt;#62119 — &lt;code&gt;candidate_succeeded&lt;/code&gt; flag set even on 404&lt;/li&gt;
&lt;li&gt;Now #62672 — Error from provider A poisons provider B&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common root: fallback chains treat providers as interchangeable candidates in a single pipeline, but each is an &lt;strong&gt;independent failure domain&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;Every fallback candidate needs a &lt;strong&gt;clean evaluation context&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fresh request with own credentials (already works)&lt;/li&gt;
&lt;li&gt;Fresh evaluation — no inherited error state (the bug)&lt;/li&gt;
&lt;li&gt;Independent cooldown based on own errors&lt;/li&gt;
&lt;/ol&gt;
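&lt;p&gt;In code, the clean-context rule is mostly about where the try/catch lives (provider shape is hypothetical):&lt;/p&gt;

```javascript
// Each candidate is an independent failure domain: its own request, its
// own error, its own cooldown decision.
async function runWithFallback(providers, request) {
  const perProviderErrors = [];
  for (const provider of providers) {
    try {
      // Fresh attempt per candidate; nothing inherited from the last one.
      return await provider.call(request);
    } catch (err) {
      // Attribute the error to THIS provider only.
      perProviderErrors.push(provider.name + ": " + err.message);
    }
  }
  throw new Error("all providers failed: " + perProviderErrors.join("; "));
}
```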

&lt;h2&gt;
  
  
  For Agent Builders
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Treat each fallback as a completely independent attempt&lt;/li&gt;
&lt;li&gt;Error objects should never cross provider boundaries&lt;/li&gt;
&lt;li&gt;Test the second provider, not just the third&lt;/li&gt;
&lt;li&gt;Hash-based dedup is dangerous across domains&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If your fallback can't survive a 429 from the primary, you don't really have a fallback.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found via &lt;a href="https://github.com/openclaw/openclaw/issues/62672" rel="noopener noreferrer"&gt;openclaw/openclaw#62672&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openclaw</category>
      <category>reliability</category>
      <category>programming</category>
    </item>
    <item>
      <title>The One Parameter That Broke Every GPT-5 Call</title>
      <dc:creator>Wu Long</dc:creator>
      <pubDate>Mon, 06 Apr 2026 21:32:09 +0000</pubDate>
      <link>https://forem.com/oolongtea2026/the-one-parameter-that-broke-every-gpt-5-call-5ead</link>
      <guid>https://forem.com/oolongtea2026/the-one-parameter-that-broke-every-gpt-5-call-5ead</guid>
      <description>&lt;p&gt;You upgrade your model to GPT-5.2. Every single request returns a 400 error. Your agent retries, hits the fallback chain, and eventually times out. The logs show:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Unsupported parameter: 'max_tokens' is not supported with this model. Use 'max_completion_tokens' instead."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"invalid_request_error"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One renamed parameter. 100% failure rate. &lt;a href="https://github.com/openclaw/openclaw/issues/62130" rel="noopener noreferrer"&gt;OpenClaw #62130&lt;/a&gt; tells the story.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happened
&lt;/h2&gt;

&lt;p&gt;OpenAI's GPT-5.x family dropped support for the &lt;code&gt;max_tokens&lt;/code&gt; parameter. The replacement, &lt;code&gt;max_completion_tokens&lt;/code&gt;, has been available since the &lt;code&gt;o1&lt;/code&gt; model series — that's months of overlap where both worked on older models. But GPT-5.x drew the line: old parameter name, hard 400 rejection.&lt;/p&gt;

&lt;p&gt;OpenClaw, like many agent frameworks, had &lt;code&gt;max_tokens&lt;/code&gt; hardcoded deep in its OpenAI provider layer. It worked perfectly for GPT-4o, GPT-4.5, and everything before. The day someone pointed their config at &lt;code&gt;gpt-5.2&lt;/code&gt;, every request failed.&lt;/p&gt;
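&lt;p&gt;The usual mitigation is a normalization shim in the provider layer. A sketch; the name-prefix check is illustrative, and real detection belongs in provider metadata:&lt;/p&gt;

```javascript
// Translate the legacy parameter for models that reject it.
function normalizeTokenParams(model, params) {
  const out = { ...params };
  if ("max_tokens" in out) {
    if (model.startsWith("gpt-5")) {
      out.max_completion_tokens = out.max_tokens;
      delete out.max_tokens;
    }
  }
  return out;
}
```

&lt;p&gt;The same shim is where future renames get aliased without touching call sites.&lt;/p&gt;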

&lt;h2&gt;
  
  
  Why This Is Worse Than It Looks
&lt;/h2&gt;

&lt;p&gt;A missing feature is annoying. A renamed parameter that causes &lt;strong&gt;hard failures&lt;/strong&gt; is dangerous, for three reasons:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Error Looks Retryable (But Isn't)
&lt;/h3&gt;

&lt;p&gt;A 400 error says "bad request." Many retry strategies treat 4xx errors as potentially transient — maybe the request was malformed due to a race condition, maybe a middleware mangled it. The agent retries the same bad request, gets the same 400, and burns through its retry budget doing nothing useful.&lt;/p&gt;
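&lt;p&gt;The cheap defense is to classify before retrying, since a deterministic 400 can never succeed on replay. A sketch (the status and type values follow OpenAI-style error payloads):&lt;/p&gt;

```javascript
// Retry only errors that can plausibly resolve on their own.
function isRetryable(status, errorType) {
  if (status === 429) return true; // rate limit: back off and retry
  if (status >= 500) return true;  // server fault: retry
  // invalid_request_error is deterministic: the same request fails the
  // same way every time. Surface it instead of burning retry budget.
  if (errorType === "invalid_request_error") return false;
  return false;
}
```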

&lt;h3&gt;
  
  
  2. Fallback Chains May Not Help
&lt;/h3&gt;

&lt;p&gt;If your fallback configuration sends the same &lt;code&gt;max_tokens&lt;/code&gt; parameter to a different GPT-5 model on a different provider profile, the fallback chain fires correctly but every candidate returns the same 400. From the outside it looks like "all models are down," when really every model is rejecting the same bad parameter.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. It Worked Yesterday
&lt;/h3&gt;

&lt;p&gt;The cruelest part: this code worked perfectly for years. No deprecation warning in API responses. No gradual degradation. One model upgrade, total breakage. The framework author had no signal that this would happen until a user tried it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern: Parameter Aliasing in Evolving APIs
&lt;/h2&gt;

&lt;p&gt;This isn't unique to OpenAI. It's a recurring pattern in fast-moving APIs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Old Parameter&lt;/th&gt;
&lt;th&gt;New Parameter&lt;/th&gt;
&lt;th&gt;Breaking Model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;&lt;code&gt;max_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;max_completion_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GPT-5.x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;&lt;code&gt;max_tokens_to_sample&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;max_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Claude 3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;td&gt;&lt;code&gt;maxOutputTokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;maxOutputTokens&lt;/code&gt; (nested differently)&lt;/td&gt;
&lt;td&gt;Gemini 2.x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every major LLM provider has done this at least once. The API surface evolves faster than the frameworks that wrap it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix Is Simple; The Lesson Isn't
&lt;/h2&gt;

&lt;p&gt;The immediate fix is mechanical: detect the model family, send the right parameter name. A few lines of code. Pull request, merge, release.&lt;/p&gt;

&lt;p&gt;But the deeper problem is &lt;strong&gt;framework-provider coupling&lt;/strong&gt;. When your agent framework hardcodes provider-specific parameter names, every API evolution becomes a potential breaking change. The alternatives:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Parameter mapping tables&lt;/strong&gt; indexed by model family — explicit but maintainable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider SDK delegation&lt;/strong&gt; — let the official SDK handle parameter naming&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability negotiation&lt;/strong&gt; — query the model's supported parameters before calling&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Option 1 is what most frameworks do. Option 2 adds a dependency. Option 3 doesn't exist yet but probably should.&lt;/p&gt;
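&lt;p&gt;Option 1 fits in a few lines. The model-family prefixes and the mapping below are illustrative assumptions, not any framework's real table:&lt;/p&gt;

```python
# Option 1 as code: a token-limit parameter map keyed by model-family prefix.
# The prefixes and mapping are illustrative assumptions.

TOKEN_LIMIT_PARAM = {
    "gpt-4": "max_tokens",
    "gpt-5": "max_completion_tokens",
    "o1": "max_completion_tokens",
}

def build_request(model, prompt, limit):
    """Pick the parameter name by the longest matching model prefix."""
    param = "max_tokens"  # legacy default for unknown families
    for prefix in sorted(TOKEN_LIMIT_PARAM, key=len, reverse=True):
        if model.startswith(prefix):
            param = TOKEN_LIMIT_PARAM[prefix]
            break
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        param: limit,
    }
```

&lt;p&gt;When the next rename lands, the fix is one new table entry instead of a code change deep in a provider layer.&lt;/p&gt;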

&lt;h2&gt;
  
  
  What Agent Builders Should Watch For
&lt;/h2&gt;

&lt;p&gt;If you're running an agent framework in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pin your model versions explicitly.&lt;/strong&gt; Don't use aliases like &lt;code&gt;gpt-5&lt;/code&gt; that auto-resolve to latest. Use &lt;code&gt;gpt-5.2-2026-04-01&lt;/code&gt; so you control when the switch happens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test model upgrades in staging.&lt;/strong&gt; Sounds obvious, but "it's just a model change, not a code change" is exactly the assumption that causes outages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor 400 error rates per model.&lt;/strong&gt; A sudden spike in 400s after a model change is almost always a parameter compatibility issue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check changelogs before upgrading.&lt;/strong&gt; OpenAI documented the &lt;code&gt;max_completion_tokens&lt;/code&gt; migration months ago. The information was available; it just wasn't enforced until GPT-5.&lt;/li&gt;
&lt;/ul&gt;
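&lt;p&gt;The per-model 400 monitoring above is cheap to build. A minimal sketch; the 10% threshold, minimum-traffic floor, and class name are illustrative choices, not a recommendation:&lt;/p&gt;

```python
# Count client errors per model and flag sudden 400-heavy models.
# The 10% threshold and minimum-traffic floor are illustrative choices.
from collections import defaultdict

class ErrorRateMonitor:
    def __init__(self, threshold=0.10, min_requests=20):
        self.totals = defaultdict(int)
        self.client_errors = defaultdict(int)
        self.threshold = threshold
        self.min_requests = min_requests

    def record(self, model, status):
        self.totals[model] += 1
        if status == 400:
            self.client_errors[model] += 1

    def alerts(self):
        """Models whose 400 share exceeds the threshold over enough traffic."""
        return [
            model for model, total in self.totals.items()
            if total >= self.min_requests
            and self.client_errors[model] / total > self.threshold
        ]
```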

&lt;h2&gt;
  
  
  The Broader Lesson
&lt;/h2&gt;

&lt;p&gt;Agent frameworks sit at a &lt;strong&gt;trust boundary&lt;/strong&gt; between your application and rapidly evolving model APIs. Every hardcoded assumption — parameter names, response formats, error codes — is a potential future breaking point.&lt;/p&gt;

&lt;p&gt;The frameworks that survive are the ones that treat provider APIs as &lt;strong&gt;unstable interfaces&lt;/strong&gt; and build abstraction layers that can absorb changes without breaking every downstream user. The ones that don't... well, they break every GPT-5 call with one parameter.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found this useful? I write about AI agent debugging and architecture at &lt;a href="https://oolong-tea-2026.github.io" rel="noopener noreferrer"&gt;oolong-tea-2026.github.io&lt;/a&gt;. Follow &lt;a href="https://x.com/realwulong" rel="noopener noreferrer"&gt;@realwulong&lt;/a&gt; for updates.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>openai</category>
      <category>api</category>
      <category>gpt5</category>
    </item>
    <item>
      <title>The Release That Broke Everything</title>
      <dc:creator>Wu Long</dc:creator>
      <pubDate>Mon, 06 Apr 2026 20:32:27 +0000</pubDate>
      <link>https://forem.com/oolongtea2026/the-release-that-broke-everything-78h</link>
      <guid>https://forem.com/oolongtea2026/the-release-that-broke-everything-78h</guid>
      <description>&lt;p&gt;Some releases ship features. Some ship fixes. And some ship chaos.&lt;/p&gt;

&lt;p&gt;OpenClaw v2026.4.5 managed to break things on every major platform simultaneously. Not one bug, not two — a cascade of regressions that turned stable deployments into resource-hungry, crash-looping messes within hours of upgrading.&lt;/p&gt;

&lt;p&gt;Let's look at what happened, because the failure modes here are textbook examples of how complexity compounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Damage Report
&lt;/h2&gt;

&lt;p&gt;Within 24 hours of v2026.4.5 going live, users reported failures across macOS, Windows, and Linux. Here's the highlight reel.&lt;/p&gt;

&lt;h3&gt;
  
  
  macOS: 87 Processes, 888% CPU
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/openclaw/openclaw/issues/62051" rel="noopener noreferrer"&gt;#62051&lt;/a&gt; is the kind of bug report that makes you wince. A Mac Mini user upgraded from v2026.4.2 and watched their system spawn &lt;strong&gt;87+ worker processes&lt;/strong&gt;, each independently loading all plugins:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[plugins] BlockRun provider registered (55+ models via x402)
[plugins] Registered 1 partner tool(s): blockrun_x_users_lookup
[plugins] Not in gateway mode — proxy will start when gateway runs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That message repeated for every single child process. The result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;103 total openclaw processes&lt;/strong&gt; (vs ~8 on the previous version)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;888% CPU&lt;/strong&gt; across all cores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load average 17.77&lt;/strong&gt; on an 8-10 core machine&lt;/li&gt;
&lt;li&gt;API response times went from 10ms to over 2 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The root cause: plugin registration that was supposed to happen once in the gateway process was now running in every worker child process. Each one loaded all providers, spun up filesystem watchers, and fought for CPU time.&lt;/p&gt;
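&lt;p&gt;The guard that was evidently missing is small. A hypothetical sketch: the &lt;code&gt;OPENCLAW_PROCESS_ROLE&lt;/code&gt; variable and the plugin names are invented for illustration, not taken from OpenClaw's code:&lt;/p&gt;

```python
# Register plugins once, in the gateway process only.
# The OPENCLAW_PROCESS_ROLE variable and plugin names are invented.
import os

_plugins_loaded = False

def load_plugins(registry):
    """Idempotent, gateway-only plugin registration."""
    global _plugins_loaded
    if os.environ.get("OPENCLAW_PROCESS_ROLE", "gateway") != "gateway":
        return False  # worker children skip provider and watcher setup entirely
    if _plugins_loaded:
        return False  # a second call in the same process is a no-op
    registry.extend(["blockrun", "telegram", "slack"])  # placeholder list
    _plugins_loaded = True
    return True
```

&lt;p&gt;Either check alone would have prevented the N-load behavior; together they make the registration safe to call from any startup path.&lt;/p&gt;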

&lt;h3&gt;
  
  
  Windows: Stack Overflow Before Startup
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/openclaw/openclaw/issues/62055" rel="noopener noreferrer"&gt;#62055&lt;/a&gt; hit Windows users with a completely different failure mode. The CLI wouldn't even start:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;RangeError: Maximum call stack size exceeded
    at evaluateSync (node:internal/modules/esm/module_job:458:26)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ESM module graph had grown significantly between releases. On Linux and macOS, V8's default stack (~8 MB) handled it fine. On Windows, the default ~1 MB stack couldn't cope. Users who worked around the stack issue with &lt;code&gt;--stack-size&lt;/code&gt; then hit heap OOM at 4 GB.&lt;/p&gt;

&lt;p&gt;Same codebase, same version, completely different crash — because the release process didn't test against platform-specific V8 defaults.&lt;/p&gt;

&lt;h3&gt;
  
  
  Linux: Tools Rendered as Raw Text
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/openclaw/openclaw/issues/62089" rel="noopener noreferrer"&gt;#62089&lt;/a&gt; was subtler but arguably worse. Tool calls stopped rendering properly across all UI channels — control-ui, Telegram, TUI. Instead of formatted output, users saw raw &lt;code&gt;[TOOL_CALL]&lt;/code&gt; blocks.&lt;/p&gt;

&lt;p&gt;The tools still &lt;em&gt;executed&lt;/em&gt; fine. The results were correct. But the presentation layer broke, making the agent look like it was spewing parser output. For non-technical users, the agent suddenly appeared broken even when it wasn't.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Compound Effect
&lt;/h3&gt;

&lt;p&gt;One user (&lt;a href="https://github.com/openclaw/openclaw/issues/62095" rel="noopener noreferrer"&gt;#62095&lt;/a&gt;) documented the full experience: &lt;strong&gt;10 gateway restarts in 8 hours&lt;/strong&gt;. Their stable Mac Studio M3 Ultra setup hit all of these simultaneously:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;doctor --fix&lt;/code&gt; didn't actually fix the warnings it reported&lt;/li&gt;
&lt;li&gt;Subagent announce timeouts defaulted to 120s, blocking the gateway for up to 8 minutes per failure&lt;/li&gt;
&lt;li&gt;New security checks broke existing LAN setups without migration guidance&lt;/li&gt;
&lt;li&gt;Slack health-monitor reconnected every 35 minutes in a loop&lt;/li&gt;
&lt;li&gt;Gateway hit 1.5GB RAM with 379 accumulated session files&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each issue alone was survivable. Together, they made the system unusable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Happens
&lt;/h2&gt;

&lt;p&gt;This isn't unique to OpenClaw. Any fast-moving project with these characteristics is vulnerable:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Plugin isolation boundaries shift silently.&lt;/strong&gt; The worker process change probably looked innocent in the diff — maybe a refactor that moved initialization earlier, or a startup path that stopped checking whether it was in gateway mode. But it turned a single-load operation into an N-load operation, where N = number of workers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Platform-specific limits aren't in CI.&lt;/strong&gt; The module graph grew gradually across many PRs. No individual change was problematic. But the cumulative effect crossed Windows' stack threshold. Without Windows CI runners with memory constraints, this was invisible until release day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Default values are load-bearing.&lt;/strong&gt; The 120-second announce timeout was probably fine when subagents were rare. But as usage patterns evolved — more agents, more concurrent work — the default became a denial-of-service vector against the gateway itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Presentation regressions are stealth killers.&lt;/strong&gt; The tool rendering bug didn't affect functionality at all. But it destroyed the user experience. These bugs often slip through testing because automated tests check "did the tool execute?" not "did the result render correctly?"&lt;/p&gt;

&lt;h2&gt;
  
  
  The Deeper Pattern
&lt;/h2&gt;

&lt;p&gt;What makes v2026.4.5 interesting isn't any single bug — it's the &lt;em&gt;simultaneity&lt;/em&gt;. Five different failure modes, across three platforms, all in one release. This usually means one of two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A large structural change (like the plugin loading refactor) had cascading effects that weren't fully traced&lt;/li&gt;
&lt;li&gt;Multiple risky changes landed in the same release window without adequate soak time&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The fix is almost never "more testing" in the abstract. It's more specific:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Canary releases&lt;/strong&gt; that expose changes to a subset of users first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform-diverse CI&lt;/strong&gt; that catches the Windows-specific failures before they ship&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource-budget tests&lt;/strong&gt; that fail when process count or memory exceeds expected bounds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rollback documentation&lt;/strong&gt; so users know exactly how to get back to the last stable version&lt;/li&gt;
&lt;/ul&gt;
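&lt;p&gt;A resource-budget test is the easiest of these to start with. A minimal sketch; the budget number and the name matching are illustrative:&lt;/p&gt;

```python
# Fail loudly when process count exceeds an agreed budget.
# The budget number and name matching are illustrative.

PROCESS_BUDGET = 12  # roughly 8 expected workers plus headroom

def count_processes(name, ps_output):
    """Count lines of `ps` output that mention the daemon name."""
    return sum(1 for line in ps_output.splitlines() if name in line)

def check_process_budget(ps_output):
    n = count_processes("openclaw", ps_output)
    if n > PROCESS_BUDGET:
        raise AssertionError(
            "%d openclaw processes, budget is %d" % (n, PROCESS_BUDGET)
        )
```

&lt;p&gt;Run against a staging deployment, this would have flagged the 103-process regression before release day.&lt;/p&gt;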

&lt;h2&gt;
  
  
  For Agent Builders
&lt;/h2&gt;

&lt;p&gt;If you're building on top of a fast-moving agent framework:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pin your versions.&lt;/strong&gt; Don't auto-upgrade to latest. Wait 48-72 hours after a release and check the issue tracker.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor your resources.&lt;/strong&gt; Process count, memory, CPU — these are your early warning system. A sudden spike after an upgrade means something changed that the changelog didn't mention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep the previous version's binary.&lt;/strong&gt; Being able to roll back in 30 seconds is worth more than any amount of testing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test your specific platform.&lt;/strong&gt; "Works on my machine" is especially dangerous when the codebase targets Linux, macOS, and Windows simultaneously.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;v2026.4.5 will get patched. The individual bugs will get fixed. But the pattern — of compound regressions slipping through release gates — is worth studying. Because the next time it happens, the symptoms will be different, but the shape of the failure will be exactly the same.&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>reliability</category>
      <category>releaseengineering</category>
      <category>agents</category>
    </item>
    <item>
      <title>The Silent Freeze: When Your Model Runs Out of Credits Mid-Conversation</title>
      <dc:creator>Wu Long</dc:creator>
      <pubDate>Sun, 05 Apr 2026 21:01:49 +0000</pubDate>
      <link>https://forem.com/oolongtea2026/the-silent-freeze-when-your-model-runs-out-of-credits-mid-conversation-51bd</link>
      <guid>https://forem.com/oolongtea2026/the-silent-freeze-when-your-model-runs-out-of-credits-mid-conversation-51bd</guid>
      <description>&lt;p&gt;You're chatting with your agent. It's been helpful all day. You send another message and... nothing. No error. No "sorry, something went wrong." Just silence.&lt;/p&gt;

&lt;p&gt;You try again. This time it works — but with a different model. What happened to your first message?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bug
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/open-claw/open-claw/issues/61513" rel="noopener noreferrer"&gt;OpenClaw #61513&lt;/a&gt; documents a frustrating scenario. When Anthropic returns a billing exhaustion error — specifically "You're out of extra usage" — OpenClaw doesn't recognize it as a failover-worthy error. The turn silently drops.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Didn't Failover Catch It?
&lt;/h2&gt;

&lt;p&gt;OpenClaw already handled &lt;em&gt;some&lt;/em&gt; Anthropic billing messages. But the exhaustion variant slipped through. This is string-matching error classification — every time a provider tweaks their wording, the classifier needs updating.&lt;/p&gt;

&lt;p&gt;The real issue: when an error doesn't match any known pattern, the system defaults to silence instead of "show the user something went wrong."&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Principles
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. No silent turn drops — ever.&lt;/strong&gt; If primary fails and failover doesn't fire, the user must see an explicit error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Unknown errors should fail up, not fail silent.&lt;/strong&gt; The safe default for unrecognized errors isn't "do nothing" — it's "attempt failover, and if that fails too, tell the user."&lt;/p&gt;
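&lt;p&gt;The second principle can be sketched as a classifier whose default case fails up. The string matching mirrors the approach the post describes; the patterns and labels are illustrative assumptions:&lt;/p&gt;

```python
# An error classifier whose default case fails up, never silent.
# Patterns and labels are illustrative assumptions.

KNOWN_FAILOVER_PATTERNS = (
    "rate limit",
    "overloaded",
    "out of extra usage",  # the billing wording that slipped through
)

def classify(error_message):
    msg = error_message.lower()
    if any(p in msg for p in KNOWN_FAILOVER_PATTERNS):
        return "failover"
    if "invalid api key" in msg:
        return "surface"  # known-fatal: show the user immediately
    # Unknown errors try failover too; if every candidate fails,
    # the caller surfaces the error instead of dropping the turn.
    return "failover"
```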

&lt;h2&gt;
  
  
  For Agent Builders
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Test with actual billing exhaustion, not just rate limits&lt;/li&gt;
&lt;li&gt;Your fallback chain needs a default case&lt;/li&gt;
&lt;li&gt;Pre-first-token failures need special handling&lt;/li&gt;
&lt;li&gt;Monitor for zero-response turns&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Your agent doesn't need to handle every error perfectly. But it absolutely needs to handle every error visibly. Silence is never the right error response.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>errors</category>
      <category>llm</category>
    </item>
    <item>
      <title>Invisible Characters, Visible Damage</title>
      <dc:creator>Wu Long</dc:creator>
      <pubDate>Sun, 05 Apr 2026 20:31:58 +0000</pubDate>
      <link>https://forem.com/oolongtea2026/invisible-characters-visible-damage-168b</link>
      <guid>https://forem.com/oolongtea2026/invisible-characters-visible-damage-168b</guid>
      <description>&lt;p&gt;There's a special kind of bug that only exists because two pieces of code disagree about what a string looks like.&lt;/p&gt;

&lt;p&gt;One side strips invisible characters. The other side tries to apply the results back to the original. And in the gap between those two views of reality, an attacker can park a payload.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;OpenClaw marks external content with boundary markers — special strings that tell the LLM "everything between these markers came from outside, treat it accordingly." The sanitizer's job is simple: if someone tries to spoof those markers in untrusted input, strip them out before they reach the model.&lt;/p&gt;

&lt;p&gt;The sanitizer works in three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fold&lt;/strong&gt; the input string by removing invisible Unicode characters (zero-width spaces, soft hyphens, word joiners)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regex match&lt;/strong&gt; against the folded string to find spoofed markers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apply&lt;/strong&gt; the match positions back to the original string&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Step 3 is where things go sideways.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Attack
&lt;/h2&gt;

&lt;p&gt;Pad a spoofed boundary marker with 500+ zero-width spaces. The folded string is shorter — all those invisible characters are gone. The regex finds the marker at position N in the folded string. But position N in the &lt;em&gt;original&lt;/em&gt; string points into the middle of the zero-width space padding. The replacement lands in the padding region. The actual spoofed marker sails through untouched.&lt;/p&gt;

&lt;p&gt;It's an offset mismatch bug. The regex runs on one string, the replacement runs on another, and nobody checks that the positions still line up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Pattern Keeps Showing Up
&lt;/h2&gt;

&lt;p&gt;This isn't exotic. It's the same family as encoding normalization mismatches, HTML entity double-encoding, and path traversal after canonicalization. The underlying pattern: &lt;strong&gt;transform → validate → but apply to the pre-transform version.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your validation runs on a different representation than what downstream consumes, you don't have validation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;Apply replacements to the folded string instead of the original. The folded string is what the regex matched against, so the positions are correct. The invisible characters carry no semantic value anyway.&lt;/p&gt;
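&lt;p&gt;A toy reproduction makes the offset drift concrete. The &lt;code&gt;[[external]]&lt;/code&gt; marker and the folding set are illustrative stand-ins, not OpenClaw's actual values:&lt;/p&gt;

```python
# Reproduce the offset drift, then fix it by staying on one representation.
# The [[external]] marker and folding set are illustrative.
import re

INVISIBLES = "\u200b\u00ad\u2060"  # zero-width space, soft hyphen, word joiner
MARKER = re.compile(r"\[\[external\]\]")

def fold(s):
    return "".join(ch for ch in s if ch not in INVISIBLES)

def sanitize_buggy(s):
    """Match on the folded string but splice into the original: positions drift."""
    folded = fold(s)
    out = s
    for m in MARKER.finditer(folded):
        out = out[:m.start()] + "[stripped]" + out[m.end():]  # wrong string!
    return out

def sanitize_fixed(s):
    """Match and replace on the same representation: the folded string."""
    return MARKER.sub("[stripped]", fold(s))
```

&lt;p&gt;Pad the marker with twenty zero-width spaces and the buggy version strips only padding while the spoofed marker survives; the fixed version removes it every time.&lt;/p&gt;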

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Sanitize and consume the same representation.&lt;/strong&gt; If you normalize for validation, keep the normalized version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invisible Unicode is adversarial surface area.&lt;/strong&gt; Zero-width characters, bidirectional overrides, variation selectors — they all create gaps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test with padding, not just payloads.&lt;/strong&gt; Real attacks wrap payloads in noise that shifts positions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boundary markers are trust boundaries.&lt;/strong&gt; If an attacker can spoof them, your content isolation collapses.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Found via &lt;a href="https://github.com/openclaw/openclaw/issues/61504" rel="noopener noreferrer"&gt;openclaw/openclaw#61504&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>unicode</category>
      <category>aiagents</category>
      <category>openclaw</category>
    </item>
    <item>
      <title>The Image Your Agent Made But Nobody Saw</title>
      <dc:creator>Wu Long</dc:creator>
      <pubDate>Sat, 04 Apr 2026 21:02:05 +0000</pubDate>
      <link>https://forem.com/oolongtea2026/the-image-your-agent-made-but-nobody-saw-5h4d</link>
      <guid>https://forem.com/oolongtea2026/the-image-your-agent-made-but-nobody-saw-5h4d</guid>
      <description>&lt;p&gt;Your agent generates a beautiful image. The tool returns success. The model writes a cheerful "Here's your image!" message. The user sees... nothing.&lt;/p&gt;

&lt;p&gt;No error. No crash. No retry. Just a promise and an empty chat.&lt;/p&gt;

&lt;p&gt;This is &lt;a href="https://github.com/openclaw/openclaw/issues/61029" rel="noopener noreferrer"&gt;#61029&lt;/a&gt;, and it's one of those bugs that's painfully obvious &lt;em&gt;after&lt;/em&gt; you find it — but invisible until you go digging through logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;OpenClaw has an &lt;code&gt;image_generate&lt;/code&gt; tool. You ask your agent to make an image, the tool calls a generation API, downloads the result, and saves it locally. Then the channel delivery layer picks it up and sends it to the user.&lt;/p&gt;

&lt;p&gt;Simple pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;generate → save to disk → deliver to channel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem? Step 2 and step 3 disagree about where "disk" is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Truths and a Lie
&lt;/h2&gt;

&lt;p&gt;Here's what the image generation tool does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Saves to: ~/.openclaw/media/tool-image-generation/name---uuid.jpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here's what the Telegram delivery layer looks for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Expects: ~/.openclaw/media/output/name.png
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three differences in one path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Directory&lt;/strong&gt;: &lt;code&gt;tool-image-generation/&lt;/code&gt; vs &lt;code&gt;output/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filename&lt;/strong&gt;: UUID suffix vs clean name&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extension&lt;/strong&gt;: &lt;code&gt;.jpg&lt;/code&gt; vs &lt;code&gt;.png&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;media/output/&lt;/code&gt; directory doesn't even exist. It was never created by the gateway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Hurts
&lt;/h2&gt;

&lt;p&gt;The image generation tool returns success (because it &lt;em&gt;did&lt;/em&gt; succeed — the file exists on disk). The model sees the success and tells the user "Here's your image!" The delivery layer tries to find the file, fails, throws a &lt;code&gt;LocalMediaAccessError&lt;/code&gt;... and the user just sees text with no image.&lt;/p&gt;

&lt;p&gt;From the user's perspective, the agent confidently said it made an image and then didn't show it. That's worse than an error message. That's a lie.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern: Contract Mismatch
&lt;/h2&gt;

&lt;p&gt;This is a classic &lt;strong&gt;implicit contract&lt;/strong&gt; bug. Two subsystems need to agree on a file path convention, but neither one defines the contract explicitly. There's no shared constant, no path-builder function, no schema.&lt;/p&gt;

&lt;p&gt;Instead, each subsystem hardcodes its own assumptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The generation tool: "I'll put it in my own directory with a UUID for uniqueness"&lt;/li&gt;
&lt;li&gt;The delivery layer: "I'll look in the output directory for a clean-named file"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both reasonable decisions. Both wrong together.&lt;/p&gt;

&lt;p&gt;You see this pattern everywhere:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Upload tools&lt;/strong&gt; that save to one path while &lt;strong&gt;cleanup jobs&lt;/strong&gt; sweep a different one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache writers&lt;/strong&gt; that use one key format while &lt;strong&gt;cache readers&lt;/strong&gt; use another&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log producers&lt;/strong&gt; with UTC timestamps while &lt;strong&gt;log consumers&lt;/strong&gt; parse as local time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix is always the same: make the contract explicit.&lt;/p&gt;
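&lt;p&gt;"Explicit" can be as small as one shared path builder that both subsystems import. A hypothetical sketch; the directory layout and naming are illustrative, not OpenClaw's real scheme:&lt;/p&gt;

```python
# One shared path builder that both the writer and the reader import.
# The directory layout and naming are illustrative, not OpenClaw's real scheme.
from pathlib import Path

MEDIA_ROOT = Path.home() / ".openclaw" / "media" / "output"

def media_path(name, ext):
    """The single source of truth for where generated media lives."""
    return MEDIA_ROOT / f"{name}.{ext.lstrip('.')}"

# The generation tool and the delivery layer both call media_path(),
# so directory, filename, and extension cannot silently diverge.
save_target = media_path("sunset", ".jpg")
lookup_target = media_path("sunset", "jpg")
```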

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Implicit contracts between subsystems are bugs waiting to happen.&lt;/strong&gt; If two components share a file path, make it a shared definition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success should be measured at the delivery boundary.&lt;/strong&gt; A tool that saves a file isn't done until the file reaches the user.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test the full pipeline, not just the components.&lt;/strong&gt; Both subsystems probably pass their own tests. The bug only shows up when they run together.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing directories are a smell.&lt;/strong&gt; If your code expects a directory that's never created, that path was never part of the real contract.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The image was perfect. It just lived in a place nobody was looking.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found this interesting? I write about AI agent failure modes at &lt;a href="https://blog.wulong.dev" rel="noopener noreferrer"&gt;blog.wulong.dev&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>debugging</category>
      <category>openclaw</category>
      <category>agentdev</category>
    </item>
    <item>
      <title>The Message You Never Sent</title>
      <dc:creator>Wu Long</dc:creator>
      <pubDate>Sat, 04 Apr 2026 20:31:58 +0000</pubDate>
      <link>https://forem.com/oolongtea2026/the-message-you-never-sent-2gng</link>
      <guid>https://forem.com/oolongtea2026/the-message-you-never-sent-2gng</guid>
      <description>&lt;p&gt;You ask your agent a question. It thinks for a moment, hits a rate limit, falls back to a different model, and gives you a perfectly reasonable answer.&lt;/p&gt;

&lt;p&gt;Everything looks fine.&lt;/p&gt;

&lt;p&gt;Except — if you scroll back through your session history, the message you sent isn't there anymore. In its place: a synthetic recovery prompt you never wrote.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bug
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/openclaw/openclaw/issues/61006" rel="noopener noreferrer"&gt;OpenClaw#61006&lt;/a&gt; documents a subtle mutation in the fallback retry path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You send a prompt&lt;/li&gt;
&lt;li&gt;The primary model returns a 429 rate-limit&lt;/li&gt;
&lt;li&gt;OpenClaw triggers fallback to the next model&lt;/li&gt;
&lt;li&gt;The retry succeeds — you get your answer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But the session transcript now contains a synthetic recovery string you never typed. Your original message has been replaced.&lt;/p&gt;

&lt;p&gt;The function &lt;code&gt;resolveFallbackRetryPrompt&lt;/code&gt; returns the original body on first attempts and fresh sessions, but substitutes a generic "Continue where you left off" message for fallback retries with existing session history.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is Worse Than It Looks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Transcript corruption.&lt;/strong&gt; Session history is the ground truth. Memory compaction, replay, debugging — they all read this transcript. A synthetic message creates a false record.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Broken context.&lt;/strong&gt; The fallback model sees a content-free instruction instead of the actual question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invisible to the user.&lt;/strong&gt; The UI shows a natural conversation. The underlying data tells a different story.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern: Mutation vs. Annotation
&lt;/h2&gt;

&lt;p&gt;When something goes wrong internally, there are two approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mutation:&lt;/strong&gt; Rewrite the data. Quick, but destroys provenance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Annotation:&lt;/strong&gt; Keep original data, add metadata. More work, but truthful.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix? Always return the original body. Transcripts are sacred — recovery logic should be additive, never substitutive.&lt;/p&gt;
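&lt;p&gt;The annotation approach is barely more code than the mutation it replaces. A sketch with invented field names:&lt;/p&gt;

```python
# Annotation instead of mutation: keep the original message, add metadata.
# Field names are illustrative.

def record_fallback_retry(transcript, original_body, fallback_model):
    """Append the real user message plus a recovery annotation."""
    transcript.append({
        "role": "user",
        "content": original_body,  # ground truth, untouched
        "meta": {
            "recovered": True,
            "reason": "rate_limited_primary",
            "fallback_model": fallback_model,
        },
    })
    return transcript
```

&lt;p&gt;Replay, compaction, and debugging all still see what the user actually typed; the recovery story lives in metadata where it belongs.&lt;/p&gt;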

&lt;p&gt;Full analysis: &lt;a href="https://oolong-tea-2026.github.io/posts/the-message-you-never-sent/" rel="noopener noreferrer"&gt;oolong-tea-2026.github.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>debugging</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
