Forem: Tijo Gaucher

[3 Reliability Patterns That Stopped My AI Agent From Crashing Every 6 Hours]

Tijo Gaucher — Mon, 18 May 2026 02:13:13 +0000

I'm running five AI agents in production. None of them are the ambient "does your whole job" kind everyone's demoing on X. They're boring: a research agent that scrapes pricing data overnight, a cold-email agent that drafts and queues replies, a coding agent that triages GitHub issues, a screenshot/QA agent for our marketing pages, and one that just runs scheduled reports.

For the first month, every one of them died on a six-to-twelve hour cadence. Sometimes a tool call would hang forever. Sometimes the model would return a token the parser choked on. Sometimes it was just OOM. The agent would freeze, the cron-style triggers would queue up behind it, and I'd find out the next morning when nothing had run.

Three patterns took me from "babysit the process" to "haven't looked at the dashboard in a week." Here they are.

1. Treat the agent like any other long-running process: supervise it

The first instinct people have is to put their agent script in a while True: loop and call it good. Don't. The loop dies with the process — and the process dies more than you think.

I put every agent under supervisord (you can use systemd if you prefer; the point is the same). The config is maybe ten lines:

[program:research-agent]
command=/usr/bin/python /opt/agents/research.py
autostart=true
autorestart=true
startretries=10
stderr_logfile=/var/log/agents/research.err.log
stdout_logfile=/var/log/agents/research.out.log

Three things this gets you. First, restart on crash — autorestart=true brings the process back even when the Python interpreter exits non-zero. Second, logs that survive the crash, because supervisord captures stderr to a file the agent itself never opened. Third, a single command to see state — supervisorctl status tells you which agents are alive without grepping ps.

The number that mattered after I did this: my "agent uptime" went from 71% to 99.4% in a week. Nothing about the agent code changed. The whole win was running it like a real service instead of a script.

2. Persist state outside the process

Restart-on-crash is only useful if the agent doesn't lose its place when it comes back. The default for most agent frameworks is to hold the conversation history, the to-do list, and any cached tool outputs in memory — all of which vanish the moment the process dies.

Two layers of persistence cover almost every case I've hit:

Checkpoint after every tool call. Before the agent loop calls the next tool, write the current state (messages, pending tasks, partial outputs) to disk or SQLite. After a restart, the first thing the agent does is read the checkpoint and resume from the last completed step. The overhead is a few milliseconds per turn — nothing compared to the inference latency you're already paying.

Use a real queue for incoming work. If your agent runs on a schedule or responds to webhooks, don't have the trigger call the agent directly. Push the trigger onto a queue (Redis, SQS, or a database table with a claimed_at column) and have the agent pull from it. When the agent dies mid-task, the queue still has the job, and the next run picks it up.

These two together mean a crash costs you one redo of the last tool call — not a whole night of missed work.

3. Bound everything with timeouts (and don't trust the SDK defaults)

The single biggest source of "agent is alive but doing nothing" was tool calls that hung. A scraping target stops responding. A subprocess.run with shell=True never returns. The model SDK's "default" timeout turns out to be three minutes, and you didn't notice.

Wrap every tool the agent can call in a timeout. In Python:

import signal
from contextlib import contextmanager

@contextmanager
def hard_timeout(seconds):
    def _handler(signum, frame):
        raise TimeoutError(f"hit {seconds}s timeout")
    old = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old)

Then any tool call gets with hard_timeout(30): .... Pick the budget based on what the tool actually does — 5s for a curl, 30s for a scrape, 120s for a model call. The point is that something fires.

Pair this with a circuit breaker for tools that fail repeatedly. If the same tool times out three times in a row, mark it broken for ten minutes and route the agent to a fallback (or just have it skip and log). The alternative — letting the agent retry indefinitely on a broken endpoint — burns tokens and blocks every other task in the queue.

What changed

After all three patterns: 99.4% uptime across the five agents, average time-to-recovery on a crash dropped from "next morning when I noticed" to under 30 seconds, and token spend on retries dropped by about 40%. The agents got less smart in some sense — they no longer try heroic recovery — and that's the trade. Boring agents that run forever beat clever agents that occasionally do magic.

When you stop doing this yourself

I built all this on a $20/mo VPS I administer myself. It works. But by month three I noticed I was spending more time on the supervision layer than on the agents themselves — tweaking restart policies, rotating logs, sizing the checkpoint table, debugging why one agent's port forward broke when another one OOMed.

If you're in that loop and you'd rather not be — building your agent and operating your agent are different jobs — managed hosting takes the operational layer off your plate. RapidClaw's Builder Sandbox ($99/mo) is the same MicroVM-with-sudo setup I'm describing here, with the supervisor, checkpointing, and timeouts already wired up. The Dev Agent tier adds snapshot/rollback so a bad deploy doesn't take the whole agent down.

Either way, the patterns are the patterns. Whether you operate them yourself or let someone else run the agent host, supervise the process, persist the state, bound the calls. Your agents will outlive you.

Tijo Gaucher is the founder of RapidClaw, managed hosting for AI agents.

[I Ran 5 AI Agents Unattended for 30 Days] What Actually Broke and What Held

Tijo Gaucher — Mon, 11 May 2026 02:11:54 +0000

When I tell operators their AI assistant will "just run 24/7," that's the promise. The reality is uglier — agents crash, sessions die, context windows fill up, model providers throttle, and your "automation" becomes a 3AM page.

Last month I gave myself a constraint: run 5 small agents unattended for 30 days as a solo founder. No babysitting, no manual restarts. One for inbox triage. One for monitoring a few competitor pricing pages. One for nightly browser-based status checks. One for code refactor batch jobs. One for content scraping.

Here's what actually broke, what held, and the reliability patterns I'd ship before letting a non-technical operator anywhere near an agent.

The four failure modes I hit (in order of frequency)

1. Context window bloat → silent degradation. This was the most insidious. The agent didn't crash — it just got progressively dumber. By day 4, the inbox agent was misclassifying obvious spam because the conversation history was bumping against the limit and the most recent emails were displacing the routing rules. No exception, no alert. Just bad work.

2. Model provider throttling. Day 11. Rate limits I didn't know existed kicked in mid-batch. The agent threw a 429, didn't have a retry path, and silently stopped processing the queue. I found out 6 hours later when the backlog showed up.

3. Auth token expiry. The scraping agent died on day 19 when a session cookie aged out. Standard problem, completely predictable, completely missed.

4. Memory leaks in long-running browser sessions. Headless Chrome doesn't love a 30-day uptime. Day 23, OOM. The monitoring agent took the whole VM with it.

The five reliability patterns that would have prevented all of it

These aren't novel — they're the same patterns you'd apply to any unattended workload. The new part is applying them to an LLM-driven workflow.

Pattern 1: Context rotation at fixed intervals. Don't let conversation history grow unboundedly. Snapshot the state you care about (decisions, rules, persistent memory), drop the rest, start a fresh context. For the inbox agent, every 200 messages = new context with a summary of routing rules pinned at the top. Simple, fixes the silent degradation problem permanently.

Pattern 2: Exponential backoff with provider failover. When your primary model throttles, fall back to a secondary. OpenRouter makes this trivial — you configure a fallback chain and forget about it. For most tasks, Claude → Haiku → GPT-4o-mini is plenty. The user never notices when the primary 429s.

Pattern 3: Health checks an operator can actually read. Not Prometheus, not Grafana. A status page that says "Inbox agent: last action 8 minutes ago" or "Pricing monitor: failed at 2:14am, retried 3 times, paged at 2:20am." The operator should be able to glance at it in the morning and know what to act on. If they have to interpret a graph, you've already lost them.

Pattern 4: Token refresh as a first-class concern. Auth tokens have expiries. Bake them into the agent's lifecycle: rotate proactively, never reactively. If your agent runs longer than your shortest token lifetime, you have a bug — even if it hasn't fired yet. Treat it like SSL renewal: scheduled, alerted, automated.

Pattern 5: Process-level rollback on resource thresholds. When memory or CPU breaches a threshold, snapshot the agent state, kill the process, restart from snapshot. This is boring infra work. It's also what makes the difference between "my agent ran for 30 days" and "my agent ran for 4 days, three times in a row."

What this looks like in a managed world

If you're an operator — not a developer — you don't want to set up any of this. You want your assistant to run, you want to know when it doesn't, and you want a vendor to fix the second case before you notice the first.

That's what managed agent hosting is supposed to solve. Not "we run a container for you" (that's just hosting). The actual job is the five patterns above, plus the 50 others I haven't written about yet, applied consistently so the operator never sees them.

I'm building toward this with RapidClaw — managed OpenClaw hosting tiered for SMEs. The Builder Sandbox tier ($99/mo, MicroVM with sudo + live port-forwarding) is where agents like the five above live. The Dev Agent tier ($200/mo) adds observability and snapshot/rollback specifically because patterns 3 and 5 above kept biting me during this test.

The boring stuff is the moat

The ambient-agent-does-your-job narrative is still mostly vibes. What's actually working in production today is the boring stuff — scheduled jobs that run reliably, browser automation that doesn't die overnight, coding agents that finish their refactor without losing the plot at hour 4.

That's not a sexy story. But it's the story that pays. The patterns above aren't novel; they're table stakes for any unattended workload. The reason most agent stacks don't have them is because most agent stacks are demos that got deployed.

If you're running agents yourself, the five patterns above are free advice — apply them, you'll have a better month than I did. If you'd rather not think about any of it, that's the pitch for managed agent hosting: the boring stuff, handled, so you can run 5 agents at once and not wake up at 3AM.

Either way: don't ship an agent into production without context rotation, failover, health checks, token refresh, and resource thresholds. The next 30-day uptime story you tell will be better for it.

What I found auditing my own homepage for AI Overview compatibility

Tijo Gaucher — Sun, 10 May 2026 09:21:41 +0000

I have been quietly losing top-of-funnel traffic to a thing I cannot click on. Not a competitor. Not a Reddit thread. The Google AI Overview box at the top of the SERP. The thing that pulls a paragraph out of the open web, paraphrases it, slaps a few citation chips next to it, and answers the question before the user ever scrolls down to my listing. For a year I have been telling myself I would do something about it. Last week I finally sat down and audited my own homepage to figure out why I was not getting cited.

The audit took most of a Saturday. Some of it was infuriating. Some of it was satisfying in the way that only finding and fixing a stupid mistake can be. I want to write down what I actually found, because most of the AEO and "answer engine optimization" writing I have read this year is too abstract to act on. It is full of words like "entity-level relevance" and "knowledge graph alignment" and not enough screenshots of the JSON your homepage is missing.

I run the content side of a small two-founder hosted-OpenClaw shop here in Bali. My brother Brandon runs the infra. Neither of us is a SEO consultant. I have a content background and I read a lot, but most of the GEO and AEO advice in 2026 is being written by people selling a course about it, and I do not trust the course people. So this is the working operator's version. Here is what was wrong with my homepage and what I changed.

The first thing I did was query my own product

This sounds dumb but I had not actually done it. I opened a clean browser window, signed out of everything, and asked Google "what is the cheapest managed openclaw hosting" the way a buyer would. Then I asked "is rapidclaw any good" and "rapidclaw vs the alternatives". I watched the AI Overview render each time and I read what it said.

The Overview talked about my competitors. It did not mention me. The citation chips next to the answer were two competitors and a Hacker News comment thread. My homepage was nowhere. The organic listing for rapidclaw.dev was sitting at position three on the page underneath the Overview, which used to be a fine place to be in 2022. In 2026 it is a place where almost nobody reads.

The interesting part was that the Overview answer was wrong about my competitors in a small way. It was confidently citing a price tier that one of them had retired six months ago. So this was not a quality bar problem. The Overview was happy to repeat outdated stuff. The bar was something simpler. My page just was not in the consideration set.

The second thing I did was read my own HTML

I right-clicked my homepage and asked for the page source. Then I pasted it into a structured-data validator. This is the kind of thing I should have been doing once a quarter for the last three years. I had not been doing it. The result was embarrassing.

My homepage had no organization-level schema. None. The og:image meta tag was pointing at a screenshot that we deprecated last spring and the URL was a 404. The h1 read something marketing-coded that did not contain the actual product category. The meta description was the placeholder Vercel had auto-generated when we redeployed in February and it said "rapidclaw.dev — built with Next.js" because we had never overwritten it.

It was rough. I am writing this down in detail because I think more founders are running pages like this than admit it. The marketing-site code path tends to be the one that drifts the most, because it is the one nobody is paying explicit attention to. Brandon owns infra. I own content. The marketing site is the seam between us and the seam is where rust grows.

The fixes I shipped, in the order I did them

I worked top-down through the page source.

The first thing I added was an Organization schema block in JSON-LD with the company name, the URL, the logo, and a sameAs array pointing at the X account, the YouTube channel, the GitHub org, and the Indie Hackers profile. None of that information was in machine-readable form anywhere on the homepage. I had been assuming the AI Overview would figure out we were a real entity from context. It does not figure that out. It expects you to declare it.

The second thing I added was a Product schema block describing the hosted OpenClaw service, with a price field on the cheapest tier, an aggregateRating placeholder I will fill in next month when we have enough first-party reviews, and a real description field that uses the words a buyer would search for. I rewrote the description three times before I was happy. The version I shipped reads less like marketing and more like the answer to a question.

The third thing I did was rewrite the h1. The old h1 was clever. The new h1 is descriptive. Cleverness is a luxury you can afford when the search interface is humans reading a list of ten blue links. It is not a luxury you can afford when the search interface is a model summarizing the open web in three sentences. I want to be in those three sentences. The model is going to grab whatever in my markup most clearly explains what category I am in. So my h1 now explains what category I am in. There is a separate clever line underneath it for the human who has scrolled. Both audiences are served. Neither is ignored.

The fourth thing I did was fix the meta description. I wrote a real one. It is forty-eight words and it tells you exactly what we sell, who we sell it to, and what the cheapest option costs. The old one had been the auto-generated placeholder for nearly three months. I want to crawl into a hole when I think about how much referrer traffic I cost myself with that one mistake.

The fifth thing I did was write an FAQ block with FAQPage schema. Five questions, real answers, no padding. The questions were the ones the support inbox actually receives every week. "Do I need an API key?" "Is there a free tier?" "What happens if my container goes to sleep?" "Can I bring my own model?" "How do I cancel?" The answers are short enough that an AI Overview can quote them verbatim if it wants to.

The sixth thing was the og:image. I generated a new social card, hosted it on a stable URL, and updated the meta tag. I also added Twitter card tags, which I had never done because I had been told nobody used Twitter cards anymore in 2024. They are still useful in 2026 for any model that ingests social-media-flavored metadata, which is a thing models do.

The benchmarks check is still my favorite part of this work

The reason I care about being cited in AI Overviews specifically is that the buyer who arrives via that path tends to be more qualified than the buyer who arrives via a long-tail organic click. They have already had their question half-answered. They are clicking through because they want the rest of the answer or because they want to verify the citation. They are not browsing. They are auditing.

If you are in the AI agent infrastructure category, the buyer is also auditing performance numbers. There is a whole class of buyer who will not click a hosted-agent product page until they have checked the same product against a public benchmark. That is the buyer I have been writing for over the past two months. The deep-dive I keep getting search traffic on is the AgentBench 2026 leaderboard rundown — it goes through the top results, which prompts gamed which categories, and what the numbers actually mean for an operator picking a hosting layer. That piece does not need to mention rapidclaw fifteen times to convert. It just needs to be the most useful version of itself. The conversion takes care of itself when the reader trusts that you are reading the same data they are.

I bring this up because it is the same lesson as the homepage audit. The AI Overview is going to cite the page that most clearly answers the question. It is not going to cite the page that most aggressively sells you a product. So the work, on both the homepage and on the long-form posts, is to be more useful and less promotional. The promotion is the link at the bottom that the qualified reader clicks because they have already decided.

What I have not figured out yet

A few open questions I am working on.

I do not know whether the JSON-LD changes are going to show up in the AI Overview within days, weeks, or months. I have read every plausible answer to this on the web and there is no consensus. I will know in a few weeks whether my homepage starts getting cited. If it does, I will write the follow-up. If it does not, I will write the failure post-mortem.

I do not know if aggregateRating placeholders without real reviews are worth shipping. The structured data validator does not love it. I am going to leave it out for now and add it once we have enough first-party testimonials to back the number up. I would rather be invisible than be visibly fake.

I do not know how much of this is going to matter in twelve months. The retrieval layer behind AI Overviews keeps shifting. Maybe by next year the model is good enough to extract organization metadata from natural language without needing the schema block. I do not think we are there yet. I am writing for the version of the model that exists right now. If the model gets smarter, my schema block does no harm. If it does not get smarter, I am suddenly readable.

What I would tell another founder reading this

Three quick ones.

First, query your own product the way a buyer would, signed out, in a clean browser. Do this once a month. Watch the AI Overview render. If it does not mention you, you have a problem upstream of any clever marketing copy. You have a problem in your HTML.

Second, the JSON-LD work is small and unglamorous and almost everybody is skipping it. Schema is the cheapest leverage you have. An hour of work fixes a year of invisibility. I am annoyed at myself for waiting.

Third, the content piece and the homepage piece are the same piece. The model that reads your homepage is the same model that reads your blog posts. The tone you take in your most useful blog post is the tone you should take on your homepage. If your homepage is louder than your blog posts, your homepage is wrong.

Brandon is going to push the schema changes through Vercel today. I will check the AI Overview again in two weeks. Either it will start mentioning us, in which case there is a follow-up to write, or it will not, in which case there is a different follow-up to write. Both posts get written. The outcome decides which one.

— Tijo

The CTA disclosure test: I added "$29/m after 5 messages" to every Free Trial button. Here's what I expected to happen.

Tijo Gaucher — Sun, 10 May 2026 04:19:46 +0000

Earlier this week I made a change I had been putting off for two months. I went through every "Start Free Trial" button on the site and rewrote it to read "Start Free Trial — $29/m after 5 messages". Same button, same color, same destination. Just a price tag glued onto the label.

Then I sat back and waited for conversion to crater.

That's what I expected. That is what most CRO people I trust have told me to expect over the years. Disclosing a price near the CTA is supposed to scare buyers off. The conventional move is to keep the trial language clean, get the email, and let the upgrade screen handle the money conversation. I have done it that way for a long time. I run the content side of our little two-founder shop here in Bali. My brother Brandon ships the infra. I write the words and run the experiments. So when I changed every CTA on the site, I was running an experiment on my own copy and I was pretty sure the result was going to make me feel stupid.

It did not make me feel stupid. It made me feel like I had been quietly lying to myself for a year.

Why I made the change at all

The reason I changed the buttons in the first place had nothing to do with conversion theory. It was a support ticket.

A user wrote in last Tuesday and said something like: "I clicked Start Free Trial on the homepage and now I am being charged $29 a month and I never saw a price anywhere." He was not angry. He was confused. He thought he was on a free tier. He had clicked through onboarding too quickly, like everybody does, and the trial expiry email landed in his Promotions tab. By the time the charge hit his card he had no memory of what the price even was. The product was working. The bill was correct. The experience was terrible.

Brandon and I talked about it on a call. Brandon's view was that this was a billing UX problem and he could fix it on the dashboard side. My view was that it was upstream of that. The user did not get blindsided in the dashboard. He got blindsided on the marketing site, where I had spent twelve months carefully not telling him the price near the button he was clicking.

There is a phrase I keep coming back to when I write copy for the SME audience we care about: a point-of-sale system never surprises you. That is the standard I want for our hosted OpenClaw thing. People should know what they are buying when they click. Even if the click is for a free trial. Especially if the click is for a free trial.

So I changed the button. Then I changed it everywhere. Hero, pricing page, footer, the comparison tables, the in-line "Try it free" links inside blog posts. Wherever a user could begin the trial flow, the price for what they were beginning was now sitting right next to the button.

What I expected vs. what happened

The hypothesis I went in with was the textbook one. Higher friction at CTA means lower click rate. Lower click rate means fewer trial signups. Fewer trial signups means fewer paid conversions even if intent quality is higher, because most funnels lose to volume.

The chart in my head looked like a small dip in clicks, an unclear story for trials, and a possibly-positive but possibly-flat story for paid. I told myself this was the kind of thing where you accept a 10-15% conversion hit because the integrity story is worth it. I was bracing for a number that was going to make me defensive on a call with Brandon.

What actually showed up over four days was that click-through on the changed buttons did not drop in any meaningful way. The wobble I saw was within the noise floor of a site that does not get a billion visits a day. Trial signups looked the same. The piece I did not predict was the support volume change. The "I did not realize this was paid" tickets did not just go down. They stopped. Zero in the four days after the change. The week before the change there had been three.

I want to be careful here. Four days is not a study. I do not have enough volume to call this statistically significant. What I am willing to claim is the directional story. The dip I was bracing for did not happen, and the failure mode I was actually paying for did happen to disappear.

Why I think the conventional wisdom missed this one

The CRO playbooks I learned from are almost all built on consumer e-commerce data. A surprise price at the cart on a t-shirt site really does kill conversion, because the buyer has no model for what a t-shirt should cost and the surprise reads as a trick. The price-near-CTA test runs against a buyer who is already loss-averse and already suspicious.

Our buyer is not that buyer. The person clicking "Start Free Trial" on rapidclaw.dev is usually a small business operator who has been quietly evaluating three or four options. He has been told by somebody on a Reddit thread that AI agents cost money and break overnight. He has read the comparison articles. He has a budget in his head before he gets to my page. When my CTA hides the price, I am not being polite. I am being suspicious. I am giving him a reason to think the real number is higher than the budget he had in mind, otherwise why would I be hiding it?

This is the lesson I keep relearning when I write for SMEs instead of for developers. SME operators are not afraid of $29 a month. Most of them are paying $29 a month for fifteen different things they barely use. What they are afraid of is the shape of a trick. The trick-shape is "free trial → unexpected charge". The honest-shape is "free trial, $29/m after". Once they see the honest-shape on the button, the rest of the page reads as honest too. And the honest page outperforms the polished one.

There is a thread connecting this to the content side of what I do. Every time I write a piece that tries to figure out the real number of something — what an AI agent actually costs you to run for a month, what the cheap tier really gets you, where the bills come from — that piece tends to outperform the piece that talks about features. The audience is starving for somebody to do the math out loud. If you are wrestling with this same problem on the infra side, my OpenClaw hosting cost breakdown goes through the self-host vs. managed numbers in the same flat way I am writing here. I keep coming back to it because the buyers who read that piece convert at a higher rate than buyers who read the hero copy. They have already done the spreadsheet, in their heads, with me.

What I did not change, and why

I did not change the upgrade screen. The upgrade screen still reads the same way it has read for a year. I did not change the email sequence. I did not change the dashboard banner. The only change was the marketing site CTAs.

The reason matters. If I had changed everything at once, I would not be able to say anything about the CTA disclosure specifically. I would have a bundle of changes and a bundle of outcomes and a story I could tell either direction. By keeping the change narrow, I have a thin slice of evidence that the disclosure itself was at worst a wash and possibly a small win. That is enough for me to leave it on. It is not enough for me to write a Twitter thread saying "always disclose price near your CTA, you cowards". I am not writing that thread. I am writing this post.

The other thing I did not do is run a real A/B test. I do not have the volume on this site for a clean A/B in four days. I changed the buttons globally, watched the metrics I cared about, and decided on the basis of the things that did not happen — clicks did not drop, signups did not drop, support tickets did drop. I am calling that good enough for a button copy decision that took fifteen minutes to make.

The piece I did not see coming

A thing I noticed, a week in, that I did not predict at all. Trial-to-paid conversion looks like it is creeping up. I am not ready to claim a number on it because the cohort is small and the sample period is short. But it is moving in the right direction and I think I know why. The trial users who came in after the button change are showing up in their first session knowing what they are paying for and when. That sounds obvious, right up until you watch how trial flows usually behave. Most trials get killed by the second-week confusion: the user is not sure if they are using the product correctly, is not sure when the bill comes, has not internalized what the bill even is. A user who saw "$29/m after 5 messages" on the way in does not have that confusion. He sat down at the dashboard with a price already loaded into his head. He is not surprised by anything. He treats the trial like a paid product on day one. So he uses it like a paid product on day one. So he keeps it.

That is the part I want to write a longer follow-up on. Not yet. I want another two weeks of data first.

What I would tell another founder reading this

Three things, none of them clever.

First, the CTA-near-price prohibition is a holdover from a different audience. If you are selling to small business operators with budget in their heads, the conventional move is the wrong move. Price near the button reads as honesty. Hiding the price reads as a trick. The buyer's instinct is not the e-commerce buyer's instinct.

Second, the metric you should look at is not the click rate on the button. It is the support ticket queue and the trial-to-paid rate. The button is a top-of-funnel thing. The win, if there is one, lands later in the funnel. If you only watch the click rate you will miss the entire story.

Third, do the change before you can prove the change. I sat on this for two months because I wanted a clean experimental design and I did not have the volume for one. The actual decision took fifteen minutes once I let go of the test-design fantasy. Sometimes you ship the obvious thing and the world tells you what happened.

I am leaving the buttons how they are. If the trial-to-paid number keeps moving in this direction, I will write that follow-up. If it drifts back to the old number, I will write that one too.

Either way I am not putting the price back behind the button. Brandon and I do not run this thing to be clever. We run it to be a place a small business can park its agent and forget about it. Forgetting starts at the button.

— Tijo

5 ways your AI agent runtime silently dies overnight (and the boring fix for each)

Tijo Gaucher — Mon, 04 May 2026 07:40:16 +0000

I ran the same agent for thirty straight days. It died five times. Four of them did not show up in any log I had set up ahead of time, which is the part that bothers me most.

By the end I had a checklist of things that take an agent down at 2am while you're asleep, and none of them are the dramatic failures that get blog posts. They are all dull.

Here is the list.

1. OOM during a long tool-call loop

The agent is happily looping through 200 tool calls in one task. Each call returns a response. The agent appends every response to its working context plus an internal trace it writes to disk. Around call 150, RAM usage starts going up faster than usage going down. By call 180, the kernel OOM-killer wakes up and ends the process.

In the log: nothing. The agent's stdout cuts off mid-sentence. The supervisor logs say "process exited 137" which is the OOM signal but very few people read it that way the first time.

The boring fix: cgroup memory limits with a soft warning at 80%, plus a tool-call counter that flushes the working trace to disk every 25 calls and resets the in-memory copy. Not exotic. Just remembering that long agent loops are basically a memory leak unless you actively flush.

2. File descriptor exhaustion

Day eleven. The agent had been making API calls all day. A new tool call started and immediately got OSError: too many open files. The agent caught the exception, tried to retry, got the same error, gave up, returned an error to the user.

The agent itself didn't crash. It just stopped being useful. The supervisor process saw "agent returned an error" and moved on. Nothing alerted.

What actually happened: the agent's HTTP client was reusing a session pool that didn't close idle sockets, and over 11 days it had accumulated about 950 open FDs against the per-process default of 1024. Every new HTTP call added to the pool. Eventually it ran out.

The boring fix: explicit session lifecycle with a timeout, a daily restart of the agent process, and ulimit -n raised to something sane (16384 on the runtimes I cared about). The daily restart is the cheap one. People resist it because it feels primitive, but every long-running daemon I have ever shipped survives on a daily restart somewhere in the stack.

3. Context window bloat

This one I had read about, but it still got me. The agent's working context grew to about 180,000 tokens by hour 60 of a multi-day task. Each new tool call cost more than the last because the model was paying to re-read the entire history. By hour 65 a single tool call was taking 90 seconds and burning through the per-minute rate limit, which the agent interpreted as "the API is down" and went into a backoff loop.

The agent didn't crash. It just got slower and slower until it was producing nothing, and the bill kept going up.

The boring fix: a context summarizer that runs every N tool calls, replaces the last K turns with a one-paragraph summary, and keeps the most recent 5 turns verbatim. This is well-trodden ground in the literature, but the surprising part is how rarely small teams actually wire it up. Most agent codebases I have looked at assume the conversation will end in a few turns. Long-running agents need garbage collection on their own conversation history.

If you want a longer treatment of why AI agent hosting is mostly about boring problems like this one rather than the model itself, the longer version is worth a skim. The summary: the model is the easy part now. Everything around it is where the failures live.

4. ulimit walls (max user processes)

Day nineteen. The agent had spawned a background process for a long-running task and then gone on to do other work. Background tasks accumulated. By midnight there were 287 zombie processes attached to the agent's user, the per-user max user processes limit was somewhere around 1024 in this environment, and at 03:14 a new spawn failed with Resource temporarily unavailable.

In the log: a single line saying the spawn failed. The agent caught it as a generic exception and continued. The user-facing behavior was "this task takes forever." Three days later when I finally noticed I had to manually reap the zombies.

The boring fix: a process supervisor that owns the lifecycle of every spawned task, kills anything that has been alive longer than its declared TTL, and treats child processes as a resource that needs to be tracked. setsid and prctl(PR_SET_PDEATHSIG) are your friends. Also raise ulimit -u to something generous, but the real fix is killing things on schedule.

5. Webhook timeouts that look like success

Last one, and the meanest. The agent finished a task and called a webhook to notify a downstream system. The webhook took 31 seconds to respond. The HTTP client had a 30 second timeout. The client raised a timeout error. The agent's wrapper caught the timeout and logged a success because the wrapper had been written assuming "timeout means delivered, the receiver was just slow."

This is true for some kinds of fire-and-forget delivery. It is catastrophic for any kind of state-changing call. The downstream system never received the call. The agent thought it had. The user-facing system had two views of the world that did not agree.

In the log: a success line. No error. Nothing wrong.

The boring fix: idempotency keys on every state-changing webhook, a status check after every call that crossed the timeout threshold, and never treat a timeout as success without a separate confirmation. A timeout tells you the status is unknown. It does not tell you the call was delivered.

The pattern across all five

Every one of these failures had the same shape: a long-running agent ran into a resource limit or a state assumption that was fine for short tasks and broken for multi-day ones. The agent itself did not crash in three of the five cases. It just stopped being useful, and the supervisor was not watching for that.

The hosting layer needs to do three things that aren't sexy:

Memory and FD limits with warnings before the hard cap, not at the hard cap
Process lineage tracking so spawned tasks can't outlive their parent's intention
State-changing call confirmation, not just transport-level success

If you are running an agent on your laptop for an hour, none of this matters. If you are hosting OpenClaw agents in production for paying customers, all of this matters more than the model you picked.

What I now log on day one of any agent project

The thing that would have saved me the most pain on this run is just better logging from the start. None of the metrics below are exotic. None of them require an APM vendor. They are just the things I now scrape from any agent process before letting it run for more than 24 hours.

Per agent loop I track: RSS memory, FD count, child process count, total tool calls in this loop, total context tokens, time since last tool call returned, and the result of the last 10 tool calls (success or specific error). Per host I track: free memory, total FDs in use, load average, and the pid count for the agent's user. Both go to a flat file with a timestamp. No dashboard. Just a thing I can grep when something is weird.

Five of those metrics would have caught four of the five failures I described, hours or days before they actually broke things. The fifth (the webhook timeout) needs application-level logging, not host-level. That one is on the developer of the wrapper.

I have a longer guide on the hosting end of this at https://rapidclaw.dev/blog/ai-agent-hosting-complete-guide, but if you read nothing else, read this: the most expensive failure mode is the one that doesn't crash the process. Crashes get noticed. Slow-degrading agents do not.

MicroVM vs Docker for AI agents: I gave one sudo and broke the other

Tijo Gaucher — Mon, 04 May 2026 07:38:06 +0000

Last week I ran a small experiment that I should have run a year ago.

Same agent code. Same model. Same task list: install three Python packages from a CSV, fetch a few APIs, write JSON to disk, run a long-running scheduled job. Two isolation modes. One was a Docker container with the agent process inside, mounted volume, the usual. The other was a Firecracker microVM running a slim Linux image with the agent on top. Both got sudo inside their sandbox. I let them run for seven days each, then rotated.

I went in expecting the difference to be small. Memory overhead, maybe boot time. The actual difference was bigger than that.

Day one and two: Docker

Setup was the part everyone has done before. docker run with a few mounts, the agent gets a shell inside, away we go. The agent is told to apt-get install a couple of system libraries it has decided it needs. That works. It writes a 40 MB cache file to /tmp. That works. It runs a long job that opens a few hundred sockets to a public API.

Around hour eighteen the host machine's dmesg started printing about memory pressure. Not from the agent itself. From a different container running on the same host. The Python process inside my agent's container had a retry loop that would not stop holding file descriptors. That was the leak. Linux does not look at which container a process lives in when it picks something to OOM-kill. It just picks. The neighbor went down.

This is the part of Docker that no production person likes to talk about. Containers share the host kernel. They share the host scheduler. When one container goes off the rails, the rest of them feel it on the same host. If you're a small shop running one agent on one host, fine. None of this matters yet. For anything that looks like a tenant model, it stops being fine fast.

The other thing I noticed on day two: the agent decided to chmod 777 a folder it didn't own. Not malicious. Just a Python script doing what Python scripts do when permissions throw an error. With sudo available inside the container, it succeeded. The host filesystem was untouched (because mounted volumes have their own boundary), but anything inside that container was now wide open to whatever the agent did next.

Day three: rebuild as a MicroVM

I tore down the Docker setup and rebuilt the same agent inside a Firecracker microVM. Same code, same packages, same task list. Boot time went from about 200 ms (Docker) to about 700 ms (microVM). Memory baseline went up by roughly 60 MB for the kernel itself.

That is the cost. You pay it once.

What you get is a separate kernel. The agent's sudo is a real Linux sudo inside a real kernel that nobody else on the host shares. When the agent ran the same chmod 777 thing, it still happened, but the blast radius was a single VM that I could destroy and recreate in under a second. When the agent leaked file descriptors, only the VM's per-process limits got hit. The host kernel didn't notice.

Day four I let the agent install a kernel module on purpose. In Docker this would be a host-level event (or blocked outright, depending on capabilities). In the microVM it loaded into the VM's own kernel and stayed there. Whether that is useful or annoying is up to you. Either way, it stayed inside the VM.

Day five through seven: chaos

I wrote a script that told the agent to do five terrible things in a row, the kind of thing a real long-running agent might trip into over a week:

Open 10,000 file descriptors and never close them
Allocate memory in a tight loop until something breaks
Spawn 200 child processes
Write a 5 GB file to its own filesystem
Make a sudo iptables rule that drops all incoming packets

Docker handled (1) and (2) with cgroup limits if you remember to set them. Most people don't. The default is no limit. By the time I noticed (1) on the third run, the host had hit its global file descriptor ceiling and SSH started refusing connections.

The microVM handled (1) through (4) inside its own boundary. The kernel inside the VM hit its limits, the VM started thrashing, and then I killed it from outside. The host machine never noticed any of it. (5) was the most interesting. The agent set the iptables rule and locked itself out of network. In Docker the agent had affected the container's network namespace, recoverable but messy. In the microVM the agent had only touched the VM's network stack. I rebooted the VM in 800 ms and was back.

Snapshots are where the gap really opened. Firecracker can snapshot a running VM to disk and resume it later. I snapshotted the agent mid-task on day six, killed the host, restored the snapshot on a different machine, and the agent picked up its loop one second later without knowing anything had happened. Try that with a Docker container and you will spend the afternoon learning about CRIU and giving up.

The link to actually running this in production

Doing this experiment locally is one thing. Running an agent like this for a paying customer, on hardware you have to keep alive for 30+ days at a stretch, is a different problem. The boring infrastructure problem nobody writes about: it isn't the isolation primitive that's hard. Anyone can spin up Firecracker. The hard part is babysitting a hundred of these things at once, snapshotting them every so often, recovering them when a host dies, and not losing the agent's state in the meantime.

I'll plug the thing I work on once and move on. The Builder Sandbox tier is a managed wrapper around exactly this microVM-with-sudo model, with the snapshot and recovery loop already wired up. If you don't want to babysit it, that's the option. If you want to babysit it yourself, Firecracker is open source and the docs are fine.

What I'd tell past me

Running one tiny agent on a host you own? Docker is fine. The overhead is real, the boundary is real enough for that scope, and you already know the tools.

The moment you have an agent that needs sudo, runs for more than a few days, and might do something weird at 3am, switch to a real VM. The 60 MB and the 500 ms of extra boot time will pay for themselves the first time the agent does something stupid. The snapshot story alone is worth the migration.

The thing I didn't expect, going in, was how much my mental model changed. With Docker I treat the container as a thing the agent lives in. With a microVM I treat the VM as a thing the agent is. That shift, more than any individual feature, is what made the seven-day test feel different on day three than it did on day one.

A few specifics, in case you try this

The microVM rebuild was Firecracker 1.5 with a vanilla Ubuntu 22.04 rootfs, 2 vCPU, 1 GB RAM, virtio-net for the network. Boot times stayed under a second consistently once I trimmed the kernel config. I used jailer to drop privileges on the Firecracker process itself, and seccomp filters on the agent's user inside the VM. None of that is exotic. The Firecracker docs cover all of it. The only thing I had to figure out the hard way was the snapshot directory layout, which the docs assume you already understand.

For Docker the comparison build was the standard python:3.12-slim base with the agent process as the entrypoint, a tmpfs mount for /tmp, and --cap-drop=ALL plus only the capabilities the agent actually needed. Even with that, the chmod-777 case still worked inside the container because sudo plus CAP_FOWNER is enough for filesystem-mode changes. You can lock this down further with seccomp profiles, but at that point you have built a worse VM with extra steps.

If you want the longer cost breakdown of running this yourself versus paying someone to keep it alive, I wrote that up here: https://rapidclaw.dev/blog/openclaw-hosting-cost-self-host-vs-managed.

[The Boring AI Agent Workloads That Actually Pay in 2026]

Tijo Gaucher — Mon, 04 May 2026 04:38:04 +0000

Every other post on my feed is still pitching the "ambient agent that runs your whole job." If you actually run agents in production, you know that story is mostly vibes. The workloads that real people pay for, repeatedly, look almost embarrassingly mundane.

After a year of running agents for SMEs — accounting firms, e-commerce shops, two solo law practices — here are the four shapes of work that consistently survive the trial-to-paid conversion. None of them require AGI. All of them require an agent that doesn't fall over on day eleven.

1. Scheduled jobs that used to be cron + a human

The unsexy starting point. A cron job kicks off at 6 AM. It logs into a portal, scrapes a number, drops it in a sheet, and Slacks the team if a threshold trips. That used to be a half-day Selenium project plus a $40/mo VPS plus the ongoing maintenance tax of the portal redesigning itself every quarter.

An agent flips the math. The same job is now a five-line prompt and a browser tool. The portal redesigning itself is the agent's problem now, not yours. The cost question stops being "how much engineering time" and becomes "how reliable is the runtime."

That second question is the entire moat for managed agent platforms. It's also the reason most of the open-source-only "just spin up your own" pitches fall apart at month two. The agent works fine. The orchestration around it — retries, secret rotation, the headless browser updating, the model deprecating — is what bleeds the operator dry.

2. Browser automation that was too brittle for RPA

If you've ever priced UiPath or Automation Anywhere for a small business, you know the answer: it's not for them. The licensing is enterprise-shaped and the bot creation requires a specialist. Meanwhile, the actual workflow — log in, click three things, download a CSV, email it — is the kind of thing every five-person operation needs done weekly.

Agents with a real sandboxed browser tool eat this category. Not because they're smarter than RPA, but because they degrade gracefully. When the "Export" button moves three pixels left, an agent finds it. When the page adds a cookie banner, an agent dismisses it. The thing that used to take a consultant three days to update takes the agent zero.

The catch is that "real sandboxed browser" is doing a lot of work in that sentence. A Docker container with a headless Chromium is fine for a demo. For production, you want a MicroVM with sudo so the agent can actually install things, persistent file storage so its session survives a restart, and live port forwarding so you can watch it work when something looks off. That's roughly the hardware bill that managed OpenClaw hosting abstracts away.

3. Coding agents that don't touch production

This one is the most counterintuitive. The coding agent market that's working isn't replacing engineers — it's replacing the "I'll get to it next sprint" backlog at companies that don't have engineers.

Real example: a roofing company. Their internal "system" is a Google Sheet, a Calendly, and three Zapiers. They have a list of forty small tweaks they want — a column added here, a webhook there, a conditional email. None of it is hard. All of it is too small for a contractor and too unfamiliar for the owner. An agent with shell access and the patience to iterate clears that backlog in a weekend. The owner doesn't read the code. The owner reads the result.

The reliability bar here is different from the production code reliability bar. The agent doesn't need to write perfect code. It needs to not silently break the spreadsheet that runs the business. That's an observability problem, not an intelligence problem. Snapshot the state before each change, let the operator roll back, and the whole category gets safer than it sounds.

4. The "always-on assistant" that is actually a search index

The mythology of the AI assistant is that it answers anything. The reality of the paying assistant is narrower: it knows your stuff. Your contracts, your meeting notes, your invoices, your support tickets. It can pull a number out of a 200-page master services agreement faster than the human who wrote the agreement.

These deployments don't fail because the model is dumb. They fail because the data plumbing is broken — stale embeddings, a connector that silently drops half the documents, a permission boundary that leaks one tenant's data into another. None of which is a model problem.

This is the workload most people quote when they say "we tried AI and it didn't work." What didn't work was the integration. The model is fine.

What actually distinguishes the survivors

Look at those four. None of them require a frontier model. None of them require "agentic reasoning" past a couple of hops. What they require is a runtime that:

Doesn't crash, or recovers gracefully when it does
Has the right tools wired up (browser, shell, file storage, an email sender)
Surfaces what it's doing well enough that a non-engineer can tell when it's stuck
Costs predictably — flat monthly is much easier to sell to a small business than per-token

This is why the managed-hosting framing is starting to take over the SME conversation. The buyer doesn't want to think about API keys, model selection, or which sandbox their agent is running in. They want the POS-system experience: pay a flat fee, the agent works, somebody else is on the hook when it breaks.

If you're building this for yourself, the Builder Sandbox tier on RapidClaw gives you the MicroVM with sudo and live port-forwarding without the infra babysitting. If you're past the building phase and need something an operator can actually run unsupervised, that's the white-glove side of the same platform.

What I'd skip in 2026

For completeness — the workloads that keep showing up in pitch decks but quietly failing the trial-to-paid test:

The "AI sales rep" that prospects and closes by itself. The "AI manager" that runs your team's standup. The "autonomous research analyst" that reads ten papers and synthesizes a thesis. These will get there. They are not there yet. If you're trying to make rent this quarter, build the boring one.

The boring one is what's paying.

Tijo Gaucher runs RapidClaw, managed OpenClaw hosting for non-technical operators. Previously, content at Human + AI.

[I ran ONE AI agent for 30 days straight — here's what actually broke]

Tijo Gaucher — Thu, 30 Apr 2026 00:46:23 +0000

Most AI agent demos are shaped like a 90-second loop: prompt → tool call → answer. The interesting failures don't show up there. They show up around day 7, when the process you started in a tmux session has eaten 4 GB of RAM, your browser sub-agent is wedged on a captcha you never noticed, and the thing has been retrying the same failed Stripe webhook for 36 hours.

I ran a single OpenClaw agent on a small VPS for 30 days. It was scoped to one boring job: triage incoming sales emails, draft replies, file them in the right folder, ping Slack on anything weird. The agent ran continuously, scheduled by cron, with persistent state in SQLite. No multi-agent orchestration, no fancy memory layer — just one process trying to stay alive.

Here is what actually broke, in the order it happened.

Day 1–3: everything looks great

The first three days are a honeymoon. Latency is good, the agent handles edge cases I didn't think to specify, and the inbox triage rules quietly improve as it picks up patterns. This is where most demo videos end. It's also where most teams declare victory and move on, which is the mistake.

Two things to instrument before day 4 even starts:

Per-run token cost, written to a flat log. You'll need this when you investigate cost drift in week two.
Process RSS memory, sampled every minute. The number that matters isn't the peak — it's the slope.

If you're using a hosted setup like RapidClaw's managed OpenClaw runtime, the slope is graphed for you. If you're self-hosting, write the sampler yourself before you forget. You will forget.

Day 4: the context bloat starts

The agent's working memory file grew to 18,000 tokens. None of it was strictly wrong. It was just… accumulated. Old email threads it had handled, notes about edge cases, a half-finished plan for a problem that resolved itself two days earlier.

The cost per run had quietly tripled.

This is the most boring failure mode in long-running agents and the one nobody warns you about. Your prompt isn't getting worse — your context window is getting fatter. The fix is unglamorous: a compaction step that runs nightly, summarizes anything older than 48 hours into a few bullet points, and archives the rest to a file the agent can grep but doesn't auto-load.

If you skip this, by day 14 you're paying GPT-4-class prices to send the model a partially-decayed copy of last week's todo list every single run.

Day 7: the first silent kill

The OOM killer took the process at 3:47 AM. There was no error in the logs because the process didn't get to write one. It just stopped existing.

This is where most self-hosted agent setups quietly die in production and the operator doesn't notice for two days. The cron job that runs the agent every 15 minutes also exits cleanly when the process is gone — there's no parent supervising health.

Three things you want before day 7:

A liveness file the agent touches on every successful run, plus an external check that alerts when it's stale for more than 30 minutes.
A systemd unit (or equivalent) with Restart=on-failure and MemoryMax= set well below your VPS's actual RAM. You want the agent to die predictably and come back, not get reaped silently.
Logs that flush on every event, not on buffer fill. A buffered log is a log you don't have when the OOM killer arrives.

This is also the point where the "managed hosting" pitch starts to make economic sense for non-developers. Setting up systemd, a watchdog, log shipping, and metric scraping for one agent is two evenings of work for a competent backend engineer. SMEs don't have that engineer.

Day 11: the captcha trap

The agent's browser sub-task hit a captcha while loading a vendor portal. It didn't fail. It didn't error. It just waited. For 90 minutes. Then the headless Chrome process leaked and the next 14 runs spawned new Chrome instances on top of it.

The lesson is that anything involving a real browser needs both a hard wall-clock timeout and a "did the page actually finish loading the thing I asked for?" assertion. A 200 response is not a success signal when the body is a captcha challenge.

If your agent does any web automation at all, this will happen to you. The honest version of the agent demo isn't "watch it browse the web" — it's "watch the watchdog kill a stuck browser session and surface a human-readable reason for it."

Day 18: model drift on the provider side

The replies started getting weirdly formal. Not wrong — just off. I couldn't reproduce it on Claude with the same prompt locally, but in production the change was clear over a 3-day window.

Eventually I figured out the provider had silently routed a percentage of traffic to a slightly different model variant. This is a real thing that happens, and the only way you catch it is logging a stable hash of the prompt and the full response for every run, then diffing aggregates week-over-week. If you're not doing this, you'll just notice "vibes feel different" and have no evidence.

Day 24: the small bug that hid in the schedule

A timezone bug in the cron expression meant the agent ran exactly zero times for 18 hours during a holiday DST shift. Nobody noticed because there was no one in the inbox to notice. The triage queue piled up, and the agent's first run after the gap took 11 minutes and 92,000 tokens to dig out.

Schedules are infrastructure. Test them on a fake clock before you ship them.

Day 30: what stuck

The agent is still running. The job is unglamorous, the per-run cost is now lower than day 1 because of the compaction step, and most weeks I don't think about it. That's the real success criterion for a long-running agent: do you stop having to think about it?

The narrative around "ambient AI agents that do your whole job" is still mostly vibes. The agents that actually pay rent today are boring: scheduled jobs, browser automation, coding agents, inbox triage. They're sticky because once you have one running and supervised, the cost of replacing it is high. They're hard because the supervision is the actual product.

If you're a developer building these for yourself, lean into systemd, structured logs, and a 5-line health check. If you're not — or you're shipping this for non-technical operators who can't be on-call for a Python process — managed runtimes like RapidClaw exist precisely because day-7 reliability is a product, not a feature.

The demo is easy. The 30-day uptime is the moat.

Tijo writes about practical AI agents at Human + AI. RapidClaw is the managed-hosting side of the same operator-focused practice — built for people who want a working AI assistant without becoming a Linux admin.

[The 8-Turn Problem] Why Your Agent Fails at Turn 3 and You Only Notice at Turn 7

Tijo Gaucher — Mon, 20 Apr 2026 04:32:26 +0000

Last Tuesday an agent I shipped decided, mid-conversation, that the user's name was "Export CSV." It wasn't. Seven turns earlier, a tool result had come back with a quoted header row where a username field should have been, and the model silently absorbed that string as ground truth. Every subsequent turn degraded quietly — apologetic tone, subtle hallucinations, a refusal that referenced "your account, Export CSV."

The per-call logs looked fine. The latencies were green. Token usage was nominal. The only way to see the break was to reconstruct the whole conversation as a causal graph and follow the poison forward.

This is the 8-turn problem. It's the single most expensive class of bug I ship, and most of the observability stacks I've tried were built for a world where requests are independent. They aren't.

Why request-level monitoring lies

Traditional APM assumes a request is a closed unit: it came in, it did something, it came out, and if you aggregate p99 and error rate you know whether the system is healthy. That model was fine for stateless services. It's openly broken for agents.

An agent request carries state that isn't in the HTTP payload. It carries the conversation. It carries the tool results that previous turns wrote into context. It carries the model's own prior outputs, which are now training the next inference. A turn that looks locally correct — valid JSON, successful tool call, reasonable response — can be the exact moment your agent quietly goes off the rails for the next 40 minutes of user conversation.

I watch three numbers more than I watch latency:

Turn-over-turn intent drift: does turn N still match the user's original ask?
Tool result contamination rate: how often does a tool response contain strings that look like instructions?
Session success rate, not request success rate: did the user actually get what they came for?

None of those are visible from a metrics dashboard that aggregates individual calls. You need traces that span the whole session, and you need them structured so you can walk them backward from the failure.

What a useful trace actually looks like

The OpenTelemetry GenAI SIG has been converging on gen_ai.* semantic conventions, which is good. The prevailing shape: each tool call, each LLM invocation, each retrieval is a child span, parented to the turn, parented to the session. Do that, and your trace tree tells the story of the reasoning chain.

A few things people get wrong here:

Don't put prompts in span attributes. Attributes are indexed, have size caps, and leak straight into your observability backend as PII. Use span events. Events can be sampled, redacted, or dropped at the Collector without touching app code. This one change will save you a compliance conversation later.

Parent spans by the turn, not just the call. If every LLM call is a root span, you lose the conversational structure. The parent-child relationship between turn 3 and turn 7 is the thing you actually want to trace. If you're building this yourself, each session gets a trace_id, each turn gets a span_id under it, and tool calls and inferences nest under the turn.

Emit a "decision" span. The LLM call itself is one span, but what the agent did with the output — picked a tool, rephrased, escalated — is a different concern and worth its own span. This is where drift shows up first.

At RapidClaw we default to this layout and bolt on session-level rollups so you can ask "which turn did this fail at?" without scrolling through 40 spans.

The debugging workflow that actually works

When a user reports an agent did something weird, the temptation is to grep logs for the error. There's usually no error. Here's the loop I run instead:

Pull the full session trace. Not the failing turn — the whole conversation, from the first user message forward.
Diff the system state between turns. What changed in memory, in the scratchpad, in the retrieved context? This is where you find the poisoned field.
Replay from the suspected turn with the same tool responses. Most agent frameworks let you rehydrate a session; if yours doesn't, you need to fix that before anything else.
Mutate one variable at a time. Change the tool response. Change the model. Change the system prompt. Bisect until the behavior flips.
Write the regression test at the session level. Not a unit test on a single call — a full conversation fixture with expected final state.

Step 3 is where most teams stall. If you can't replay a session deterministically, you're guessing. The replay and re-simulate workflow is the single feature I'd build first in any agent observability tool, including ones I don't run.

Practical hygiene for small teams

I run a small operation — think five agents in production, not five hundred — and the infrastructure choices reflect that. A few things that have held up at this scale:

One OTLP pipeline, everything flows through it. Don't run a separate tracing stack for agents. Emit gen_ai.* spans into the same Collector your regular services use, then branch at the exporter if you want a specialized backend for LLM-specific analysis. Vendor lock-in is a real risk and OTel is the escape hatch.

Sample aggressively on success, keep everything on failure. Full-conversation traces are expensive. A 1% tail-based sampler plus 100% retention for sessions that flagged any of: tool error, user thumbs-down, abnormal turn count, or model refusal — that gives you the signal without drowning.

Tag sessions with the outcome, not just the request. Instrument your app to send a session-end event with "did the user get what they wanted?" If you can't answer that, instrument it first. Every other metric is downstream.

Treat evals and tracing as the same system. Evaluation runs are just traces with known expected outputs. The moment you split them into different tools you start writing glue code that never gets maintained.

The uncomfortable part

Most agent reliability issues I've seen in the last six months aren't model issues. They're context management issues. The model is doing its job — taking what's in the window and producing a plausible next token. The bug is upstream, in what we let accumulate in that window.

Observability for agents is, practically, observability for the context window over time. If your tooling can't show you how a single field mutated across seven turns, it can't help you debug the 8-turn problem. And the 8-turn problem is most of the bugs.

If you want to see how we handle session-level tracing in practice, the RapidClaw quickstart walks through instrumenting a LangGraph agent in about ten minutes. But the principle matters more than the tool: trace the session, not the request, and save yourself the compliance conversation by keeping prompts out of attributes.

Your agents are going to hallucinate. The question is whether you find out at turn 3 or turn 73.

Implementing A2A Protocol for Multi-Agent Communication

Tijo Gaucher — Sat, 18 Apr 2026 03:42:08 +0000

If you've ever wired two AI agents together, you know the drill. Custom JSON schemas, bespoke HTTP endpoints, and a growing pile of adapter code that nobody wants to maintain. Google's A2A (Agent-to-Agent) protocol is the answer to that mess, and I've been implementing it across OpenClaw and Hermes agents on Rapid Claw for the past few weeks. Here's what the implementation actually looks like.

What A2A solves (and what it doesn't)

A2A standardizes the message envelope between independent agents. Think of it as the TCP/IP of agent communication — it defines how agents discover each other, exchange structured messages, delegate tasks, and return results. It doesn't care what framework you're using internally.

The key distinction: MCP (Model Context Protocol) handles agent-to-tool communication. A2A handles agent-to-agent communication. You need both in any serious multi-agent deployment, and they compose cleanly because an A2A peer is essentially a tool with an agent on the other end.

The envelope format

Every A2A message carries the same required fields. The interesting bits go in payload:

envelope = {
    "a2a_version": "1.0",
    "message_id": f"msg_{uuid4().hex}",
    "correlation_id": "conv_01HZKXR7...",  # ties the conversation together
    "trace": {
        "trace_id": "4bf92f3577b34da6...",
        "span_id": "00f067aa0ba902b7",
    },
    "sender": {"agent_id": "planner-openclaw-prod-01", "framework": "openclaw"},
    "recipient": {"agent_id": "executor-hermes-prod-03", "framework": "hermes"},
    "intent": "task.delegate",
    "payload": {
        "task": "summarize_and_file",
        "inputs": {"url": "https://example.com/report.pdf"},
        "constraints": {"max_tokens": 4000, "deadline_ms": 30000}
    },
    "reply_to": "https://agents.rapidclaw.dev/a2a/planner/inbox",
    "expires_at": "2026-04-18T12:34:56Z"
}

Three fields do the heavy lifting: correlation_id threads multi-agent conversations into a single trace, trace carries OpenTelemetry-compatible span context so your existing APM stitches everything together, and intent is the verb recipients dispatch on — not a URL path.

Publishing an OpenClaw agent as an A2A endpoint

An OpenClaw agent becomes an A2A peer by exposing an inbox and registering with a platform registry. The agent doesn't need to know who will call it — only how to respond:

from fastapi import FastAPI, HTTPException
from openclaw import Agent, Task
from a2a import Envelope, verify_signature, sign

app = FastAPI()
planner = Agent.from_config("planner.yaml")

@app.post("/a2a/inbox")
async def inbox(envelope: Envelope):
    if not verify_signature(envelope, allowed=TRUSTED_SIGNERS):
        raise HTTPException(401, "signature verification failed")

    if envelope.intent == "task.delegate":
        task = Task(
            name=envelope.payload["task"],
            inputs=envelope.payload["inputs"],
            trace=envelope.trace,
        )
        result = await planner.run(task)

        reply = Envelope(
            intent="result.return",
            correlation_id=envelope.correlation_id,
            trace=envelope.trace,
            sender={"agent_id": AGENT_ID, "framework": "openclaw"},
            recipient=envelope.sender,
            payload={"status": "ok", "result": result.to_dict()},
        )
        return sign(reply, PRIVATE_KEY).dict()

The caller discovers executors by label, not URL — this is the part A2A gets right. No hardcoded hostnames:

executor = await lookup(
    intent="task.execute",
    labels={"framework": "hermes", "env": "prod"},
)

Three patterns worth implementing

Request/reply is the simplest. Planner calls executor, waits for the reply envelope, acts on it. Use for sub-tasks with clear deadlines.

Fan-out/fan-in dispatches the same intent to a pool of executors in parallel, correlates replies by correlation_id, and takes the first good answer or aggregates. This is how you build research-agent ensembles.

Async with callback fires a task.delegate with a reply_to URL and returns immediately. The callee POSTs a result.return when done. You get durability without holding an HTTP connection open.

The platform layer matters

The protocol is the easy part. Production A2A needs five things at the platform layer: a registry for discovery, identity and mTLS per agent, routing with network policy, observability that stitches traces across agents, and per-agent rate limits. You can build all five yourself — Postgres registry, Vault for keys, Envoy for mTLS, OTEL collector, Redis for rate limits — or use something like Rapid Claw that ships them preconfigured.

If you're thinking about multi-agent architectures more broadly, I wrote up the common orchestration patterns (planner/executor, supervisor, blackboard) that pair well with A2A as the transport layer.

A2A isn't revolutionary — it's the boring infrastructure piece that was missing. And boring infrastructure is exactly what you want when you're trying to ship agent systems that actually work in production.

[Patterns] AI Agent Error Handling That Actually Works

Tijo Gaucher — Fri, 17 Apr 2026 08:47:16 +0000

Most AI agent tutorials show the happy path. Your agent calls an LLM, gets a response, does the thing. Ship it.

Then production happens. Rate limits. Timeouts. Malformed responses. Context window overflows. Your agent goes from "demo-ready" to "incident-generating" in about 48 hours.

I run a small operation — 5 agents max, solo founder. Every failure that wakes me up at 3am is one I should have handled in code. Here are the patterns that actually work.

Classify Your Errors First

Not all errors deserve the same treatment. The first thing I do in any agent system is classify failures into two buckets:

Transient errors: Rate limits (429), timeouts, temporary network blips, model overload. These will probably work if you try again.

Permanent errors: Invalid API keys, malformed prompts, context window exceeded, model doesn't exist. Retrying won't help.

class ErrorClassifier:
    TRANSIENT_CODES = {429, 500, 502, 503, 504}

    @staticmethod
    def classify(error):
        if hasattr(error, 'status_code'):
            if error.status_code in ErrorClassifier.TRANSIENT_CODES:
                return "transient"
        if "timeout" in str(error).lower():
            return "transient"
        return "permanent"

This classification drives everything downstream. Transient errors get retries. Permanent errors get logged, reported, and gracefully degraded. When you're thinking about agent security patterns, error classification also matters — permanent auth errors need different alerting than transient network hiccups.

Retry Strategies That Don't Make Things Worse

The naive approach — retry immediately, retry forever — is how you turn a rate limit into a ban. Exponential backoff with jitter is the baseline:

import random
import time

def retry_with_backoff(fn, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as e:
            if ErrorClassifier.classify(e) == "permanent":
                raise  # Don't retry permanent errors

            if attempt == max_retries - 1:
                raise

            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.5)
            time.sleep(delay + jitter)

Key details: jitter prevents thundering herd when multiple agents hit the same limit. And always cap your retries — 3 is usually enough. If it hasn't worked in 3 tries, it's not going to work in 30.

Circuit Breakers for LLM Calls

Retries handle individual failures. Circuit breakers handle systemic ones. If your LLM provider is having a bad day, you don't want every request queuing up and timing out.

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_time=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_time = recovery_time
        self.last_failure_time = None
        self.state = "closed"  # closed = normal, open = blocking

    def call(self, fn):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.recovery_time:
                self.state = "half-open"
            else:
                raise CircuitOpenError("Circuit breaker is open")

        try:
            result = fn()
            if self.state == "half-open":
                self.state = "closed"
                self.failure_count = 0
            return result
        except Exception as e:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
            raise

I wrap every external LLM call in a circuit breaker. When the circuit opens, agents fall back to cached responses or simpler logic instead of piling up failures. If you're taking an observability-first approach, you'll want to track circuit state transitions — they're one of the best early warning signals.

Fallback Chains: Your Safety Net

When your primary model fails, having a fallback chain prevents total outage:

FALLBACK_CHAIN = [
    {"provider": "anthropic", "model": "claude-sonnet-4-20250514"},
    {"provider": "openai", "model": "gpt-4o-mini"},
    {"provider": "local", "model": "cached_response"},
]

def call_with_fallback(prompt, chain=FALLBACK_CHAIN):
    errors = []
    for option in chain:
        try:
            return call_model(option["provider"], option["model"], prompt)
        except Exception as e:
            errors.append(f"{option['provider']}: {e}")
            continue
    raise AllProvidersFailedError(
        f"All {len(chain)} providers failed: {'; '.join(errors)}"
    )

The chain degrades gracefully: premium model → cheaper model → cached/static response. Your users get something even when everything is on fire.

Timeout Handling

LLM calls are slow. An agent waiting 120 seconds for a response that's never coming is wasting resources and blocking downstream work.

import asyncio

async def call_with_timeout(coro, timeout_seconds=30):
    try:
        return await asyncio.wait_for(coro, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        raise TimeoutError(f"LLM call exceeded {timeout_seconds}s limit")

Set aggressive timeouts. For most agent tasks, if you haven't gotten a response in 30 seconds, something is wrong. I default to 30s for completions and 10s for embeddings.

Putting It All Together

Here's how these patterns compose in a real agent:

async def agent_execute(task):
    breaker = get_circuit_breaker("llm_calls")

    try:
        result = breaker.call(
            lambda: retry_with_backoff(
                lambda: call_with_fallback(task.prompt),
                max_retries=3
            )
        )
        return AgentResult(status="success", data=result)

    except CircuitOpenError:
        return AgentResult(
            status="degraded",
            data=get_cached_response(task),
            note="Using cached response - LLM circuit open"
        )
    except AllProvidersFailedError:
        return AgentResult(
            status="failed",
            data=None,
            note="All providers unavailable"
        )

The key insight: every layer has a defined failure mode. Timeouts prevent hangs. Retries handle blips. Circuit breakers prevent cascading failures. Fallbacks provide degraded-but-functional responses.

What I Track

Error handling is only useful if you know it's working. For my small setup, I track:

Error classification distribution — am I seeing more transient or permanent errors?
Circuit breaker state changes — how often are circuits opening?
Fallback chain depth — how far down the chain are requests going?
Retry success rate — are retries actually recovering errors?

Having real-time error monitoring changed how I build agents. Instead of finding out about failures from users, I catch patterns before they become outages.

The Boring Truth

None of these patterns are novel. Circuit breakers come from distributed systems. Retry with backoff is older than most of us. Fallback chains are just failover by another name.

But applying them specifically to AI agents — where failures are probabilistic, responses are non-deterministic, and costs compound with every retry — that's where the craft is. Start with error classification, layer on retries, add circuit breakers, and build fallback chains. Your 3am self will thank you.

[2026] OpenTelemetry for LLM Observability — Self-Hosted Setup

Tijo Gaucher — Fri, 17 Apr 2026 08:43:05 +0000

I've been running a small AI automation shop — just me, a handful of agents, and a self-hosted stack that needs to stay observable without blowing the budget. When I started instrumenting my LLM pipelines, I found that most observability guides assumed you'd use a managed platform. But if you're like me and prefer to own your data and infrastructure, OpenTelemetry gives you a solid, vendor-neutral foundation.

Here's what I've learned getting OpenTelemetry working for LLM agent traces on a self-hosted setup in 2026.

Why OpenTelemetry for LLM Workloads?

OpenTelemetry (OTel) has become the de facto standard for distributed tracing, metrics, and logs. The ecosystem matured significantly through 2025, and the semantic conventions for generative AI — covering LLM calls, token usage, model parameters — landed as stable in early 2026.

For LLM workloads specifically, OTel gives you a few things that matter:

Trace continuity across agent steps. When your agent calls an LLM, retrieves from a vector store, then calls another LLM, each step is a span in a single trace. You see the full chain, not just isolated API calls.

Token and cost attribution. The gen_ai semantic conventions include attributes like gen_ai.usage.input_tokens and gen_ai.usage.output_tokens, which let you track per-request costs without bolting on a separate billing layer.

Vendor neutrality. Whether you're calling OpenAI, Anthropic, or a local model via vLLM, the instrumentation shape is the same. Swap providers without rewriting your observability code.

The Self-Hosted Stack

My setup is modest — a single VPS running the collection and storage layer, with agents deployed separately. Here's the architecture:

[Your LLM Agents]
       |
       v
[OTel Collector]  ← receives traces via OTLP/gRPC
       |
       v
[Tempo / Jaeger]  ← trace storage
[Prometheus]      ← metrics storage
[Grafana]         ← visualization

If you've looked at the self-hosted vs managed cost comparison, you know the economics are favorable when you're running fewer than five agents. The managed platforms charge per span or per seat, which adds up quickly even at small scale.

Setting Up the OTel Collector

The Collector is the central hub. It receives telemetry from your agents, processes it, and exports to your storage backends. Here's a minimal config for LLM traces:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 512
  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Nothing exotic here. The batch processor keeps things efficient, and we're exporting traces to Tempo and metrics to Prometheus. If you want a deeper walkthrough on getting this into production, the production deployment guide covers Docker Compose configs and health checks.

Instrumenting LLM Calls

The actual instrumentation depends on your language and SDK. I'll show Python since that's what most agent code runs on.

First, install the packages:

pip install opentelemetry-api opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc \
  opentelemetry-instrumentation-requests

Then set up a tracer and wrap your LLM calls:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Initialize
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://your-collector:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("llm-agent")

def call_llm(prompt, model="claude-sonnet-4-20250514"):
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.max_tokens", 1024)

        response = your_llm_client.complete(prompt=prompt, model=model)

        span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
        span.set_attribute("gen_ai.response.model", response.model)

        return response.content

The key is using the gen_ai.* semantic conventions consistently. This means your Grafana dashboards, alerts, and queries work the same regardless of which model or provider you're hitting.

Tracing Multi-Step Agent Workflows

Where this gets really useful is tracing a full agent workflow. Each tool call, retrieval step, and LLM invocation becomes a child span:

def run_agent(task):
    with tracer.start_as_current_span("agent.run") as parent:
        parent.set_attribute("agent.task", task)

        # Step 1: retrieve context
        with tracer.start_as_current_span("retrieval.vector_search"):
            context = search_vector_store(task)

        # Step 2: call LLM with context
        result = call_llm(f"Context: {context}\nTask: {task}")

        # Step 3: maybe call a tool
        if needs_tool_call(result):
            with tracer.start_as_current_span("tool.execute") as tool_span:
                tool_span.set_attribute("tool.name", "web_search")
                tool_result = execute_tool(result)
                result = call_llm(f"Tool result: {tool_result}\nOriginal task: {task}")

        return result

When you view this in Grafana via Tempo, you get a waterfall trace showing exactly where time was spent — was it the vector search? The first LLM call? The tool execution? This is the kind of visibility that makes debugging agent behavior tractable instead of guesswork.

What You Actually See in the Dashboard

Once everything is wired up, your self-hosted observability dashboard shows you:

Latency breakdown per agent step — which spans are slow, and whether it's network or model inference
Token usage over time — catch runaway prompts before they drain your API budget
Error rates by model/provider — spot degraded model endpoints early
Trace search — find the exact trace where an agent went off the rails

For a solo operator running a few agents, this level of visibility is the difference between confidently shipping agent workflows and crossing your fingers every deploy.

Rough Edges and Honest Takes

A few things that are still annoying in 2026:

Auto-instrumentation for LLM SDKs is patchy. The OpenAI Python SDK has decent OTel support now, but Anthropic's is still experimental. You'll likely write some manual spans.

Trace volume can surprise you. Agents that loop — retries, multi-turn conversations — generate a lot of spans. Set up sampling early. A simple tail-based sampler that keeps error traces and samples 10% of success traces works well.

Grafana dashboards take time to build. The gen_ai semantic conventions are new enough that there aren't many pre-built dashboards. Budget an afternoon to set up your panels.

Wrapping Up

OpenTelemetry for LLM observability isn't a silver bullet, but it's the most practical foundation I've found for self-hosted setups. The semantic conventions are mature enough to use in production, the Collector is rock-solid, and the cost of running your own Tempo + Grafana stack is a fraction of what you'd pay for a managed platform.

If you're running a handful of agents and want to actually understand what they're doing, this stack is worth the setup time.