Forem: Ken Imoto

5 AI Crawlers Hit My Sites 14,300 Times in 30 Days. Here's What Their User-Agents Told Me About LLMO.

Ken Imoto — Tue, 26 May 2026 11:00:01 +0000

I thought robots.txt was the boundary. Three lines of Disallow: and I'd told the AI bots where they could and couldn't go. Done. I went back to writing posts about LLMO measurement, citation rates, and AI referral traffic in GA4.

Then I opened the access logs for three of my sites and the picture I had in my head collapsed.

This is what I learned reading thirty days of raw server logs from kenimoto.dev, kaoriq.com, and llmoframework.com. Five User-Agent strings dominated everything. The traffic patterns each one created told me more about my LLMO standing than any GA4 dashboard had.

Why I started reading logs in the first place

Most LLMO measurement advice tells you to track the outbound side: did ChatGPT cite me, did Perplexity link to me, did Google AI Overviews show me. That's the citation side.

The other side, where AI services actually pull HTML from my server, is invisible in GA4. AI crawlers don't fire JavaScript. They don't trigger gtag. They show up in raw HTTP access logs and nowhere else.

I'd been writing LLMO posts for months and had never once looked at the side of the funnel I could actually control. So I exported 30 days of logs from Cloudflare (kenimoto.dev, kaoriq.com) and Vercel (llmoframework.com), grepped for known AI User-Agents, and started counting.

The total: 14,300 AI crawler hits across three sites in 30 days. Roughly 477 hits per day across all three, or about 159 per site. More than I expected. Less than I think it should be in another six months.

The 5 crawlers that hit me most

Here's the ranked list. Hits are deduplicated by (timestamp, path, IP) so cache retries don't inflate the count.

Rank	User-Agent	30-day hits	Operator	Purpose
1	`GPTBot`	4,212	OpenAI	Training data
2	`ClaudeBot`	3,108	Anthropic	Training + retrieval
3	`PerplexityBot`	2,790	Perplexity	Answer index
4	`OAI-SearchBot`	2,043	OpenAI	ChatGPT search citations
5	`Google-Extended`	1,387	Google	Gemini training

Five User-Agents. 13,540 hits. That's 94.7% of all AI traffic. The remaining 5.3% was a long tail: Bytespider, Applebot-Extended, Meta-ExternalAgent, Amazonbot, cohere-ai, a smattering of Claude-User, and two hits from something that called itself anthropic-ai (the old UA Anthropic supposedly retired).

Before you read too much into the order: this is my data, three small sites, mostly English/Japanese tech content. Your ranking will look different. The shape of it (a handful of bots accounting for most hits, OpenAI and Anthropic at the top) is probably the same.
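The counting behind the table above can be sketched in Python. This is a minimal sketch, assuming the Nginx/Apache "combined" log format; the `LOG_LINE` regex, the `AI_AGENTS` tuple, `count_ai_hits`, and the sample lines are illustrative, not the script used for this post.

```python
import re
from collections import Counter

# Assumption: "combined" access-log format. Adjust the regex for other formats.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?:\S+) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

AI_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot", "Google-Extended")


def count_ai_hits(lines):
    """Count hits per AI User-Agent, deduplicated by (timestamp, path, IP)."""
    seen = set()
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        agent = next((a for a in AI_AGENTS if a in m["ua"]), None)
        if agent is None:
            continue
        key = (m["ts"], m["path"], m["ip"])
        if key in seen:
            continue  # same timestamp, path, and IP: treat as a retry
        seen.add(key)
        counts[agent] += 1
    return counts


# Hypothetical log lines; the second is an exact repeat of the first.
sample = [
    '203.0.113.5 - - [26/May/2026:11:00:01 +0000] "GET /blog/a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [26/May/2026:11:00:01 +0000] "GET /blog/a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [26/May/2026:11:00:09 +0000] "GET /llms.txt HTTP/1.1" 200 90 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
]
counts = count_ai_hits(sample)

# Share of the top five, using the totals from the table.
top_five = 4212 + 3108 + 2790 + 2043 + 1387
share = round(100 * top_five / 14300, 1)
```

With the table's totals, `top_five` is 13,540 and `share` is 94.7.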

What each one is actually doing

Rank order matters less than the purpose of each bot, because these crawlers fall into three buckets that behave completely differently in LLMO terms.

Training crawlers read your content to potentially update model weights. They show up consistently, follow robots.txt (usually), and don't care about your content being "fresh." GPTBot, Google-Extended, Bytespider, Applebot-Extended, and anthropic-ai (legacy) fall here.

Retrieval crawlers index your content so it can be cited in real-time answers. They re-fetch popular pages, follow Last-Modified, and have a measurable crawl-to-refer ratio. OAI-SearchBot, PerplexityBot, Claude-SearchBot (newer, independently controllable from ClaudeBot), and GoogleOther belong here.

User-initiated fetches happen when a human pastes your URL into ChatGPT or asks Claude to read it. These are ChatGPT-User, Perplexity-User, and Claude-User. They don't follow robots.txt (per OpenAI's revised crawler docs, because they're user actions, not crawls).

I had been treating all of these as the same animal. They are not. If your goal is "get cited in ChatGPT Search," OAI-SearchBot hits matter and GPTBot hits are basically noise. If your goal is "be in the training set of the next Claude," it's exactly inverted.

Who actually obeys robots.txt

Here's the part that flipped my view of robots.txt.

On kenimoto.dev, I had a Disallow: /api/ rule. Over 30 days:

GPTBot: 0 hits to /api/. Compliant.
Google-Extended: 0 hits to /api/. Compliant.
ClaudeBot: 0 hits to /api/. Compliant.
OAI-SearchBot: 3 hits to /api/. Borderline. Possibly cached before the rule, possibly the revised compliance language is doing something subtle.
PerplexityBot: 41 hits to /api/ in one 90-second burst. Not compliant on this run.

Forty-one hits is not a sample of one. The 90-second burst pattern matched a public report where Perplexity was observed ignoring User-agent: PerplexityBot blocks when answering an active user query. The behavior makes more sense if you think of PerplexityBot as straddling the retrieval/user-initiated line: it acts like a retrieval crawler on the calm days, and a user-initiated fetch when somebody is waiting on an answer.

The takeaway I wrote down: robots.txt is a self-reported boundary. Three of five top crawlers honored it cleanly on my data. One was iffy. One did whatever it wanted when a human was on the other end. Plan accordingly.
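The compliance check above boils down to two questions per bot: how many hits landed under a disallowed prefix, and how tightly were they clustered. A minimal sketch, assuming hits have already been parsed into (bot, path, timestamp) tuples; `disallowed_hits`, `largest_burst`, and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def disallowed_hits(hits, prefix="/api/"):
    """Group hit timestamps by bot for paths under a disallowed prefix."""
    by_bot = defaultdict(list)
    for bot, path, ts in hits:
        if path.startswith(prefix):
            by_bot[bot].append(ts)
    return by_bot


def largest_burst(timestamps, window=timedelta(seconds=90)):
    """Return the largest number of hits that fall inside any single window."""
    ordered = sorted(timestamps)
    best = 0
    start = 0
    for end in range(len(ordered)):
        # Shrink the window from the left until it spans at most `window`.
        while ordered[end] - ordered[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


# Hypothetical hits shaped like the pattern described above.
t0 = datetime(2026, 5, 10, 12, 0, 0)
hits = [("PerplexityBot", "/api/items", t0 + timedelta(seconds=2 * i)) for i in range(41)]
hits.append(("GPTBot", "/blog/post", t0))
hits.append(("OAI-SearchBot", "/api/items", t0 + timedelta(days=3)))

violations = disallowed_hits(hits)
report = {bot: (len(ts), largest_burst(ts)) for bot, ts in violations.items()}
```

A bot with many disallowed hits spread over weeks reads differently from one with the same count inside a single 90-second window, which is why the sketch reports both numbers.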

Three LLMO signals you can derive from this

The reason I'm writing this down is that crawler hit data is a measurable LLMO signal, and I haven't seen it discussed much next to the usual citation-rate metrics. Three things I now look at every week:

1. Crawler diversity. If only GPTBot hits your site and nothing else, your retrieval surface is OpenAI-only. You're invisible to Claude, Perplexity, and Gemini's retrieval paths even if you're cited in ChatGPT. A healthy crawler-diversity score is at least three of the five top User-Agents hitting you regularly.

2. Retrieval-to-training ratio. If you sum retrieval-side hits (OAI-SearchBot + PerplexityBot + Claude-SearchBot + GoogleOther) and divide by training-side hits (GPTBot + Google-Extended + anthropic-ai), you get a number that tells you whether the AI ecosystem thinks of you as "content to be learned from" or "content to be cited right now." Mine sits at 0.81. Anything below 0.5 means your content isn't fresh enough to be retrieved in real time. Anything above 1.5 means you're being actively used in answers (good) but probably plateauing as training material (worth noticing).

3. llms.txt fetch rate. Of the five top crawlers, only PerplexityBot and ClaudeBot fetched /llms.txt on my sites during the 30-day window. GPTBot, OAI-SearchBot, and Google-Extended never did. This roughly matches what other operators have observed and is a load-bearing detail when you're deciding whether llms.txt is worth maintaining. (Short answer: yes, but mostly for the two crawlers that read it.) The llmoframework.com retrieval signals page goes deeper on this.
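The first two signals are simple enough to compute from a dict of per-agent hit counts. A minimal sketch, using the bucket membership given in the text; the function names, the `min_hits` cut-off for "regularly", and the sample week are assumptions.

```python
TOP_FIVE = {"GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot", "Google-Extended"}
RETRIEVAL = {"OAI-SearchBot", "PerplexityBot", "Claude-SearchBot", "GoogleOther"}
TRAINING = {"GPTBot", "Google-Extended", "anthropic-ai"}


def crawler_diversity(hits_by_agent, min_hits=1):
    """Number of top-five User-Agents with at least min_hits in the window."""
    return sum(1 for ua in TOP_FIVE if hits_by_agent.get(ua, 0) >= min_hits)


def retrieval_to_training(hits_by_agent):
    """Retrieval-side hits divided by training-side hits; None if no training hits."""
    retrieval = sum(hits_by_agent.get(ua, 0) for ua in RETRIEVAL)
    training = sum(hits_by_agent.get(ua, 0) for ua in TRAINING)
    return retrieval / training if training else None


# Hypothetical weekly counts, not the numbers from this post.
week = {"GPTBot": 900, "OAI-SearchBot": 400, "PerplexityBot": 500, "anthropic-ai": 100}
diversity = crawler_diversity(week)
ratio = retrieval_to_training(week)
```

For this made-up week, three of the five top agents show up and the ratio is 900 retrieval hits over 1,000 training hits.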

How to actually pull this data

This is the part I always wanted to read and never quite found, so:

Cloudflare (free plan). The AI Crawl Control dashboard (formerly AI Audit) shows top AI crawler User-Agents out of the box. For raw logs, you need Logpush, which is paid. On free, the easiest substitute is enabling AI Crawl Control and filtering Analytics by known AI User-Agents. The free plan won't give you per-request paths, but it gives you counts and trends.

Vercel. Project → Logs → filter by User-Agent contains "Bot". Vercel keeps 30 days of edge logs on the Pro plan. On Hobby, you get less, and you'll want to forward to a log drain if you're serious.

Netlify / self-hosted Nginx. Just grep the access log:

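grep -oE "GPTBot|ClaudeBot|PerplexityBot|OAI-SearchBot|Google-Extended" \
  /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn

This gives you crawler counts. grep -o prints only the matched bot name, so the count does not depend on where the name sits inside the User-Agent string. For the URL ranking, drop the -o and pipe the matching lines through awk '{print $7}' before sort; $7 is the request path in Nginx's default combined format. The exact field number depends on your log format; run awk '{print NF}' on one line to count its fields.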

What I changed after looking at all this

Three concrete changes after the 30-day window:

I split my robots.txt to allow OAI-SearchBot and Claude-SearchBot (retrieval, good for citations) while keeping Disallow: /api/ strict for GPTBot (training, no upside for me on those endpoints).
I added a Last-Modified header to every blog post route, because retrieval crawlers use it to decide re-fetch frequency and Vercel wasn't sending one by default.
I started tracking the retrieval-to-training ratio weekly in a spreadsheet. Two weeks in, the only useful insight is that the number is stable. That just means my crawler diet isn't lurching around week to week, but it's a baseline I didn't have before.
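One way to sanity-check a robots.txt split like the first change is Python's standard-library parser. The robots.txt text below is a hypothetical reconstruction of that split, not the file actually deployed; only `RobotFileParser`, `parse`, and `can_fetch` are real library calls.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: retrieval bots allowed everywhere,
# the training bot kept out of /api/.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /api/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask the parser what each bot is permitted to fetch.
checks = {
    (agent, path): parser.can_fetch(agent, path)
    for agent in ("OAI-SearchBot", "Claude-SearchBot", "GPTBot")
    for path in ("/blog/post", "/api/items")
}
```

This only shows what the file says. As the compliance numbers earlier show, what a crawler actually does has to be read from the logs.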

I expected the logs to confirm what I already believed about LLMO. They mostly didn't. Citation isn't the only signal worth watching. Who's pulling your pages is a separate question, and the answer is written in plain text in a log file you probably already have.

If you want the full measurement frame (citation tracking, GA4 referrals, and server-log crawler analysis as parts of one system) the book is here: LLMO: AI Search Optimization. Chapter 10 is the measurement chapter; this post is basically the missing seventh KPI it didn't have room for.

I Added a 4th Agent That Audits My Other Agents. It Caught My Strategist Procrastinating for 3 Weeks.

Ken Imoto — Mon, 25 May 2026 22:00:01 +0000

I built a three-layer agent harness and called it "autonomous." Observer collected the data. Strategist picked the theme. Marketer wrote the article. They all followed strategy.md, the file that holds my rules. The cron fired every Monday at 09:00 and the articles showed up by lunch. I felt very clever about it.

Then I read my own Strategist logs across three weeks and noticed something. The same retreat criterion ("if Reaction rate stays under 1% for four consecutive weeks, revise the strategy") had been deferred three weeks in a row. Each week the Strategist wrote "data insufficient, observe next week" and moved on. The rule existed. The data existed. The rule never fired.

The three-layer harness couldn't catch this because the three layers were doing exactly what strategy.md told them to do. The bug was not in the agents. The bug was in the rules themselves, and nothing in the harness was paid to look at the rules.

I added a 4th layer called Evolver. Within its first three proposals it filed a diff against the exact rule my Strategist had been hiding behind.

The three layers were not the autonomous part

The architecture I had been calling autonomous looked like this. Observer ran daily and dumped GA4 numbers into article-performance.jsonl. Strategist ran every Monday morning, read strategy.md, and picked five themes for the week. Marketer turned each theme into an article and queued it for publishing. Three roles, three cron jobs, predictable behavior.

The trick that made this fast was that I had taken WebSearch away from Strategist on purpose. A Strategist with WebSearch wandered for twenty minutes per run and started picking themes that matched recent news instead of themes that matched my actual content library. Stripping WebSearch dropped the cycle from twenty minutes to three. That post was about making Strategist faster. This one is about making it accountable.

The thing none of those three layers could do was rewrite strategy.md. They read it every Monday and obeyed it. If the rule was wrong, they obeyed a wrong rule. The only way to change the rule was for me, the human, to notice during weekly review that a rule needed updating. And I was the bottleneck. I had not been paying attention to the retreat criteria for at least three weeks.

What the procrastination looked like in the logs

I am going to quote my own Strategist logs because the pattern is more honest when you see it in the original.

From the log dated three weeks before I added the Evolver:

Reaction rate continues at 0% for the majority of articles. Title strategy has shifted to first-person and numerical framing. Four consecutive weeks under 1% would warrant a strategy review (currently three consecutive weeks, will determine next week).

The next week:

Reaction rate has not yet reached four consecutive weeks under 1%, but weekly trend data is insufficient. Observe next week.

This is the entire failure mode in two sentences. The rule said "four consecutive weeks." The Strategist had three consecutive weeks of data under 1%. Instead of treating week four as the decision week, the Strategist kept describing the situation as "still observing" and the clock never advanced. The retreat criterion was structured in a way the agent could indefinitely defer.

When I went and computed the actual numbers from article-performance.jsonl myself, the picture was even uglier. Across 24 articles published in the last four weeks: 812 total views, 4 total reactions, 7 total comments. Reaction rate: 0.49%. Half the threshold. Engagement rate (reactions plus comments): 1.35%. The rule should have triggered weeks ago. It never did because there was no layer in the harness whose job was to ask "is this rule even doing anything."
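The arithmetic behind those two percentages, as a small sketch. The `rates` helper is an illustrative name; the inputs are the totals quoted above.

```python
def rates(views, reactions, comments):
    """Return (reaction rate, engagement rate) as percentages of total views."""
    reaction = 100 * reactions / views
    engagement = 100 * (reactions + comments) / views
    return reaction, engagement


reaction, engagement = rates(views=812, reactions=4, comments=7)
summary = f"reaction {reaction:.2f}%, engagement {engagement:.2f}%"
```

Here `summary` comes out as "reaction 0.49%, engagement 1.35%", matching the figures in the paragraph.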

The 4th layer: what an Evolver is

So I added a 4th cron job. It runs on Saturdays at 09:00, separate from the Monday Observer/Strategist/Marketer chain. Unlike the other three, it has WebSearch enabled. Its job is not to write articles. Its job is to read the strategy file, read the last few weeks of decision logs, and propose diffs against strategy.md.

Each proposal is one file: domains/<name>/data/evolution/EVO-NNNN.md. The Evolver fills in five sections.

Observation — what it saw in the data
Proposal — the rule change in plain prose
Rationale — internal data and external references that justify the change
Expected impact — what should improve if applied
Diff — a literal diff block against strategy.md

The diff block is the load-bearing part. The Evolver does not just write English suggestions. It writes the exact patch that would land in the repo. A small CLI called harness-evolve.sh knows how to extract the diff block, run git apply --check, and commit it with the proposal as the body. No LLM is involved in the apply step. The LLM proposes, the shell applies.

That separation is on purpose. The proposal is creative. The apply is mechanical. When the apply step is mechanical you can trust it to either succeed cleanly or fail loudly. There is no "the agent tried to apply the patch and something weird happened in the middle."
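The propose/apply split can be sketched as two small functions: one that pulls the diff block out of a proposal file, and one that dry-runs it with `git apply --check`. This is a Python illustration, not the actual harness-evolve.sh; the proposal layout (a fenced block tagged `diff`) and the function names are assumptions.

```python
import re
import subprocess

FENCE = "`" * 3  # built this way so the sketch can quote a fenced block safely

DIFF_BLOCK = re.compile(
    re.escape(FENCE) + r"diff\n(.*?)" + re.escape(FENCE), re.DOTALL
)


def extract_diff(proposal_text):
    """Return the body of the first diff block in a proposal, or None."""
    match = DIFF_BLOCK.search(proposal_text)
    return match.group(1) if match else None


def patch_applies(diff_text, repo_dir="."):
    """Dry-run the patch with `git apply --check`; True if it would apply cleanly."""
    result = subprocess.run(
        ["git", "apply", "--check", "-"],  # "-" reads the patch from stdin
        input=diff_text,
        text=True,
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0


# Hypothetical proposal text with a one-line rule change.
proposal = "\n".join([
    "## Diff",
    FENCE + "diff",
    "--- a/strategy.md",
    "+++ b/strategy.md",
    "@@ -1 +1 @@",
    "-old rule",
    "+new rule",
    FENCE,
])
diff = extract_diff(proposal)
```

Neither function calls a model. If `extract_diff` returns None or `patch_applies` returns False, the apply step stops there, which is the "fail loudly" behavior described above.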

EVO-0003 caught my Strategist procrastinating

The Evolver's third real proposal, EVO-0003, was the one I described above. The proposal is on disk and I am reading it back as I write this.

The observation section quoted both of my Strategist logs, the "three consecutive weeks, will determine next week" one and the "data insufficient, observe next week" one. Then it computed the engagement rate from article-performance.jsonl and showed that the threshold had been breached for at least four weeks. Then it argued that the original rule was bad in three ways:

The formula was not specified. Was "Reaction rate" per-article or aggregate? My Strategist could plausibly compute either, which is why it had been deferring.
The trigger condition "four consecutive weeks" was ambiguous when weekly data was thin.
The action on trigger ("propose a title and angle revision") was abstract enough that the Strategist could fulfill it with a single sentence and move on.

The proposal replaced the rule with this:

Engagement rate = (sum of reactions + comments over the last 4 weeks of articles) / sum of views. The Strategist must compute this every week and log it. If under 1.5% for four consecutive weeks, next week's 5 articles must be at least 4 titles in the "number + first person + failure narrative" form. Abstract titles are forbidden.

It is a 20-line patch. The diff is below the prose in the proposal file. I approved it via /harness-evolve approve EVO-0003 at 14:04 on a Tuesday afternoon. The shell ran git apply --index against strategy.md, made the commit, updated the proposal's frontmatter to status: applied, and sent me a Telegram note. The next Monday's Strategist ran with the new rule and computed an engagement rate of 1.35% in the log without prompting. The "data insufficient" sentence stopped appearing.
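The replacement rule is concrete enough to express in a few lines. A minimal sketch, assuming each article record carries `views`, `reactions`, and `comments` fields (hypothetical names) and that the Strategist keeps a list of weekly rates in its log.

```python
def engagement_rate(articles):
    """(reactions + comments) / views, in percent, over a list of article dicts."""
    views = sum(a["views"] for a in articles)
    if views == 0:
        # Assumption: zero views counts as zero engagement,
        # so thin data cannot defer the decision.
        return 0.0
    return 100 * sum(a["reactions"] + a["comments"] for a in articles) / views


def rule_fires(weekly_rates, threshold=1.5, weeks=4):
    """True when the most recent `weeks` logged rates are all under the threshold."""
    recent = weekly_rates[-weeks:]
    return len(recent) == weeks and all(rate < threshold for rate in recent)


# Hypothetical weekly log: the last four entries are all under 1.5%.
fires_now = rule_fires([1.8, 1.4, 1.2, 1.35, 1.35])
fires_early = rule_fires([1.4, 1.2, 1.35])  # only three weeks logged
```

The point of the sketch is that there is no "observe next week" branch: given four logged numbers, the function returns True or False.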

The thing I want to be honest about is that the Strategist hadn't been malicious. It hadn't been broken either. It had been a perfectly competent agent following a rule that was structured to allow deferral. That is a failure of the rule. The Evolver's job is to detect rule failures, because nothing else in the harness was structured to.

The Safety boundary, because Self-Evolving Agents are not toys

The minute you say "an agent that rewrites the harness," somebody in your head should be raising their hand and asking what stops it from rewriting itself into a paperclip optimizer. Several things, on purpose.

The Evolver cannot touch the kinds of decisions that have to remain mine. Adding or removing a domain. Switching languages. Changing the quality bar for writing. Anything involving licensing, author identity, or security. The .env file, the credentials directory, the publish triggers. If any of these were on the table I would not let the Evolver run unattended at all.

Inside the territory it can touch, three numeric limits keep it from running away.

Diff size cap: 20 lines per proposal. A proposal larger than that has to be split or escalated.
Two proposals per week per domain. If the Evolver wants to propose more, the third is held until next Saturday.
Three consecutive rejects on the same theme triggers an automatic mute. The Evolver stops re-pitching the same idea after I have said no three times.

The last one is the part I think is undersold in the broader "self-improving agent" literature. The interesting signal in a reject log is not the proposal, it is the reason. "MCP is still the main revenue genre, we cannot drop it" is the kind of business context that has never been written into strategy.md. After three weeks of rejecting MCP-cut proposals with that reason, the Evolver stops proposing them. Implicit founder context becomes explicit harness behavior, just by accumulating reasons-for-reject.
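The three numeric limits fit in a single gate function. A minimal sketch; the `Proposal` fields, the return labels, and the order of the checks are assumptions made for illustration.

```python
from dataclasses import dataclass

MAX_DIFF_LINES = 20
MAX_PROPOSALS_PER_WEEK = 2
MUTE_AFTER_REJECTS = 3


@dataclass
class Proposal:
    domain: str
    theme: str
    diff_lines: int


def gate(proposal, filed_this_week, consecutive_rejects):
    """Return 'accept', or the reason a proposal is held back.

    filed_this_week: proposals already filed for this domain this week.
    consecutive_rejects: consecutive human rejects on this theme so far.
    """
    if consecutive_rejects >= MUTE_AFTER_REJECTS:
        return "muted"
    if proposal.diff_lines > MAX_DIFF_LINES:
        return "split-or-escalate"
    if filed_this_week >= MAX_PROPOSALS_PER_WEEK:
        return "hold-until-next-saturday"
    return "accept"


# Hypothetical proposal that sits exactly at the size cap.
p = Proposal(domain="devto", theme="retreat-criterion", diff_lines=20)
decision = gate(p, filed_this_week=0, consecutive_rejects=0)
```

The reject reasons themselves would live next to the counter, since the reason is the part that carries the business context.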

What you need before adding a 4th layer

I think there are three real prerequisites before adding an Evolver-style layer to your own setup. Without them, the 4th layer is just noise.

First, the three existing layers have to produce decision logs that another agent can read. If your Strategist's output is "ran successfully, picked themes," there is nothing for the Evolver to find. The procrastination only showed up because my Strategist had been writing structured logs with phrases like "currently three consecutive weeks, will determine next week." Logs that include the agent's reasoning in prose are what make audit possible.

Second, the rules themselves have to be in version control as text. strategy.md is a checked-in markdown file because the Evolver needs to produce a diff block that git apply can land. If your rules live in a database, a SaaS dashboard, or a thousand-line JSON config, the patch model breaks down. Plain markdown in git is the cheap path.

Third, you need a human approval channel that does not require the human to read the whole proposal every time. My Telegram notification has the EVO-ID, the title, and a one-line link to the file. I open the file only when the title makes me curious. Most of the time I either approve fast or reject with a short reason. If approval costs me ten minutes per proposal, I will stop running the Evolver. If it costs me thirty seconds, I will run it indefinitely.

What about not adding a 4th layer

If you do not want a 4th layer, you can absolutely get most of the benefit by running a weekly human review with a specific question. Not "how are the agents doing." That is what I had been doing, and it did not catch the procrastination. The specific question is: "did any retreat criterion in strategy.md actually fire this week, and if not, why not."

Sit with that question for ten minutes per Friday. You will catch what I was missing for three weeks. The Evolver is, more than anything else, a forcing function for that question. It does not have to be an agent. It can be a calendar reminder.

I happen to like running it as an agent because the proposal artifacts pile up in version control and become a record of how my rules have evolved. EVO-0001 through EVO-0004 form a small history of "things I thought were good ideas, things I thought were bad ideas, and why." That history is useful when I am writing next year's strategy.md from scratch.

What I have not built yet

The current Evolver only audits one domain at a time. Across my four domains (devto, qiita, zenn, kenimoto-dev) I have written different versions of strategy.md for each, and most of them have similarly structured retreat criteria. A cross-domain Evolver could notice that the same rule structure has been failing in two domains and propose a unified fix. I have not built it. It is on the list.

The other thing on the list is the obvious recursion question. Who audits the Evolver. The current answer is "I do, every approve/reject is a human signal." The longer answer is "I do not know yet." If the Evolver's proposals start looking systematically biased — say, always proposing tighter thresholds, or always proposing to drop the same genre — that bias is real and I should add a 5th layer that watches the 4th. I have not seen it yet. I might not until EVO-0050 or so. I want the bias to be obvious before I add another layer just to feel safer.

For now: three agents that follow rules, one agent that audits the rules, and one human who approves the audit. That is the smallest harness I have found that catches its own procrastination.

If you want the full Harness Engineering picture — the 6 building blocks, the AGENTS.md/CLAUDE.md/hooks patterns, and the Self-Evolving Agent chapter that grounds this article — that is in the book.

Harness Engineering: From Using AI to Controlling AI

I Told Claude Code to Do TDD. It Wrote the Test AFTER the Code 6 Out of 10 Times.

Ken Imoto — Mon, 25 May 2026 11:00:00 +0000

My CLAUDE.md had a section called ## TDD First. Six lines. Very clear. I had spent twenty minutes drafting it. Then I ran a 30-day audit of my own commits and discovered that across the features I had asked Claude Code to TDD, the test file was committed after the source file 6 out of 10 times.

Not "the test failed first, then I fixed it." The test file did not exist at the moment the source file got committed.

This is the story of how I caught it, why it kept happening, and the two-part fix (prompt plus a PreToolUse hook) that finally pushed Claude into a real red-green-refactor cycle. It is the third installment in what is becoming an accidental series on Claude doing things confidently and wrong. The first was Claude hiding bugs three times in a row. The second was refusing to write specs until the code went sideways three times. This one is about TDD, and the pattern is identical: the model agrees, the model proceeds, the model skips the part of the prompt that would cost it tokens.

The 30-day audit

The audit was accidental. I had been writing about debugging habits and wanted to see whether my own commit history was consistent with what I was preaching. So I pulled git log --name-only --pretty=format:'%h %ai %s' for the last 30 days on a project I had been driving with Claude Code, and grouped the commits by feature. Ten features. For each one, I noted the timestamp of the first commit that touched the source file, and the timestamp of the first commit that touched its test file.

Six features out of ten had the source file committed first. The gap ranged from 90 seconds to 23 minutes. In two cases the test file was committed in the same commit as a later round of fixes, after the source had already been shipped to a feature branch. In one case there was no test file at all, only a # TODO: add tests next to the function.
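The audit can be automated from the same `git log` output. A minimal sketch, assuming the `%h %ai %s` header format used above and a src/X.py to tests/X_test.py naming convention; the function names and the sample log are hypothetical.

```python
import re
from datetime import datetime

# Header lines look like: "a1b2c3d 2026-05-03 10:00:00 +0900 subject"
HEADER = re.compile(
    r"^(?P<hash>[0-9a-f]{7,40}) "
    r"(?P<when>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4}) "
)


def first_commit_times(log_text):
    """Map each file path to the timestamp of the earliest commit touching it."""
    first = {}
    when = None
    for line in log_text.splitlines():
        header = HEADER.match(line)
        if header:
            when = datetime.strptime(header["when"], "%Y-%m-%d %H:%M:%S %z")
        elif line.strip() and when is not None:
            path = line.strip()
            if path not in first or when < first[path]:
                first[path] = when
    return first


def ordering(first, source, test):
    """Classify a feature by which file was committed first."""
    if test not in first:
        return "no-test"
    if source not in first:
        return "test-only"
    if first[test] < first[source]:
        return "test-first"
    if first[test] > first[source]:
        return "source-first"
    return "same-time"


# Hypothetical log excerpt (git prints newest commits first).
sample_log = """\
c3d4e5f 2026-05-03 10:23:00 +0900 add tests for parser
tests/parser_test.py

a1b2c3d 2026-05-03 10:00:00 +0900 add parser
src/parser.py
"""
first = first_commit_times(sample_log)
result = ordering(first, "src/parser.py", "tests/parser_test.py")
gap = first["tests/parser_test.py"] - first["src/parser.py"]
```

Running `ordering` over each feature's source and test pair gives the per-feature tally without reading the log by hand.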

I had been telling Claude "TDD this" every single time. I had a ## TDD First section in CLAUDE.md. I had even pasted the red-green-refactor sequence at the top of the prompt for the more complex features. And six times out of ten, it had cheerfully written the implementation, then either written the test afterward or skipped it entirely.

I am not blaming the model for being lazy. The model was doing exactly what it was trained to do.

Why next-token prediction defaults to implementation-first

This is the part that took me a while to actually understand. The model is not deciding "I will do TDD" or "I will not do TDD" the way a human engineer might decide. It is predicting the next most plausible token given the context. And in its training data, the overwhelming majority of "user asks for feature X" responses look like "here is the function that does X," optionally followed by "and here is a test." The "test first, then implementation, with the test failing in between" sequence is rare in public repositories because humans rarely commit the red phase as its own commit. We commit the green phase. So the model never built a strong prior for the red-first ordering.

Several people in the Claude Code community have pointed at the same thing. The aihero.dev TDD skill writeup puts it as: when the test writer and the implementer share the same context window, the implementer's thinking leaks into the test writer's, and you get tests that conveniently pass on the first run. That is not TDD. That is "tests retrofitted to pass." The alexop.dev red-green-refactor loop post argues that the only reliable fix is to force the cycle from outside the model, with hooks or skills that the agent cannot override mid-stride.

The other thing I keep seeing in community writeups, including the BSWEN Claude Code TDD skill walkthrough, is the same Anthropic guidance I had been ignoring: Claude will sometimes alter the test to make it pass rather than fix the implementation. Committing the test before the implementation gives you a diff to look at if that happens. I was not doing that either.

So the model had a weak prior for test-first, and I had a weak workflow that did nothing to compensate. Six out of ten makes a lot of sense in retrospect. The surprising thing is that it was as low as six.

What I tried first that did not work

Before the hook, I tried prompt engineering harder. This is what most people try, and it gets you most of the way without getting you there.

Attempt 1 — ## TDD First in CLAUDE.md. Already had this. Six out of ten ignored it. The header was too generic; the model saw it as a vibe, not a constraint.

Attempt 2 — explicit red-phase instruction in the prompt. I started pasting "Write a failing test for [feature] in tests/X_test.py. Do not write the implementation yet. Run the test and confirm it fails before proceeding." This got me to maybe 8 out of 10. Better, but 2 out of 10 I would catch it cheating, usually by writing the test in a way that mocked out the part that would actually have failed.

Attempt 3 — separate prompts for red and green. Two messages. First message: write the failing test, stop, run it, show me the failure. Second message, only after I had eyeballed the failure: now write the implementation. This was the first time I got something that smelled like real TDD. The problem was that it required me to physically be at the keyboard for two turns, and if I context-switched away mid-feature, the next Claude session would happily merge the two steps back into one.

The lesson from Attempt 3 is that prompts are advice. The model can ignore advice. To get TDD enforced, I needed something the model could not ignore. That something is a hook.

The PreToolUse hook that broke the loop

Claude Code's hook system lets you intercept tool calls before they execute. A PreToolUse hook on Write or Edit gets the file path the model is about to touch. If the model is trying to write to src/foo.py and there is no tests/foo_test.py that currently fails, the hook can exit 2, which Claude Code treats as "this tool call is denied, here is the reason, try again."

This is the smallest version that worked for me, on a Python project with pytest:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [{
          "type": "command",
          "command": "python3 .claude/hooks/require-failing-test.py"
        }]
      }
    ]
  }
}

The script reads the file path from the tool call payload, maps src/X.py to tests/X_test.py, checks that the test file exists, and runs pytest tests/X_test.py --no-header -q. If the test exists and currently fails, the hook lets the edit through. If the test file is missing, or exists and already passes (pytest exits 0), the hook exits 2 and blocks the edit with a message like "a failing test must exist in tests/X_test.py before src/X.py can be modified. Write the failing test first." That message lands in the model's next-turn context. It does not have a choice.

There are edge cases. The test file might pass for the wrong reason; the hook does not catch that. The mapping from source to test path is project-specific; mine is hardcoded. And I have an escape hatch, a magic comment # tdd-bypass: refactor on the first line, for refactor commits where you genuinely want to edit without a new failing test, because refactor is supposed to preserve behavior, not add it. The hook respects the escape hatch, but it logs every use of it to a file I review at the end of the week. The first week, my escape-hatch log had 22 entries. The second week it had 4. That number going down is the whole point.
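A minimal sketch of what a script like require-failing-test.py could look like, not the actual file. It assumes the hook payload arrives as JSON on stdin with the target path at `tool_input.file_path` and the project directory at `cwd`; it blocks when the mapped test file is missing or already passing, following the rule that a failing test must exist before the source is touched. The escape-hatch comment and its logging are left out.

```python
import json
import subprocess
import sys
from pathlib import Path


def map_to_test(file_path, project_root):
    """Map src/X.py to tests/X_test.py; return None for files the rule ignores."""
    path = Path(file_path)
    if path.is_absolute():
        try:
            path = path.relative_to(project_root)
        except ValueError:
            return None  # outside the project: not covered by the rule
    if len(path.parts) == 2 and path.parts[0] == "src" and path.suffix == ".py":
        return Path("tests") / f"{path.stem}_test.py"
    return None


def decide(test_exists, test_passes):
    """Exit code for the hook: 0 allows the edit, 2 blocks it."""
    if not test_exists or test_passes:
        return 2
    return 0


def main():
    payload = json.load(sys.stdin)
    root = Path(payload.get("cwd", "."))
    target = map_to_test(payload.get("tool_input", {}).get("file_path", ""), root)
    if target is None:
        return 0  # test files, docs, config: let the edit through
    test_file = root / target
    passes = False
    if test_file.exists():
        result = subprocess.run(
            ["pytest", str(target), "--no-header", "-q"], cwd=root, capture_output=True
        )
        passes = result.returncode == 0
    code = decide(test_file.exists(), passes)
    if code == 2:
        print(
            f"A failing test must exist in {target} before the source can be modified. "
            "Write the failing test first.",
            file=sys.stderr,
        )
    return code


# When installed as the hook script, finish the file with: raise SystemExit(main())
```

Keeping `decide` as a pure function makes the blocking rule itself easy to unit-test without spawning pytest.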

What the 30-day rerun looked like

I ran the same audit 30 days after the hook went in. Same project, same kind of features, same prompt style. The numbers:

Test file committed first: 9 of 10 (up from 4 of 10)
Test file committed in same commit as source, but written first per the file-modification timestamps: 1 of 10
Test file committed after source: 0 of 10

The single feature where the test went in the same commit as the source was a 12-line config helper that I had legitimately bypassed with the magic comment. So in terms of TDD being followed when the rule applied, the number is 10 of 10.

I do not want to claim that the hook turned Claude into a disciplined TDD practitioner. It did not. The model still writes implementations that look suspicious from a "test was designed around the implementation" perspective some of the time. What the hook gives me is ordering: a failing test must exist before the source can be touched. That alone closes the loop where Claude was retrofitting tests around code that was already shaping the test's assertions. The Anthropic guidance on this, captured by several community writeups including the DataCamp best practices roundup, is that ordering is the load-bearing constraint and everything else is bonus.

When to skip TDD entirely

This is the part I should have figured out before instrumenting any of this. There are tasks where TDD is the wrong tool. Refactors that should be a no-op behaviorally. One-off scripts I am going to throw away in 20 minutes. Pure data migrations. UI tweaks where the test would just be a snapshot of itself. Forcing TDD on these tasks does not make the code better; it makes the workflow heavier with no payoff.

The escape hatch exists for these. The week-end review of the escape-hatch log is where I notice if I am abusing it. "I bypassed TDD because the test was hard to write" is a smell. "I bypassed TDD because the code was a snapshot test of CSS class names" is fine. The audit, not the rule, is what keeps the workflow honest.

The takeaway

My CLAUDE.md still says ## TDD First. I left it there for vibes. It was never going to be the part that did the work. The hook is the part that does the work, and the audit is the part that decides whether the hook is still tuned right.

If you want the full picture of how to layer prompts vs hooks vs MCP servers (when to use which layer for which kind of rule), I wrote it down in Practical Claude Code. The hooks chapter is the one I keep coming back to.

Your New Domain's First Week of GA4 Is a Lie: 4 Days of Raw Data from a Launch

Ken Imoto — Sat, 23 May 2026 13:00:01 +0000

Four days after registering a new domain, I opened GA4 and saw 65 page views / 34 users / 9 countries.

For a brief, build-in-public moment, I almost cheered. Then I looked at the breakdown. The US had 17 sessions averaging 4.9 seconds of session duration. France, Poland, South Korea, India, Singapore: each between 0 and 1.4 seconds. Japan alone sat at 751 seconds (over 12 minutes): an outlier so loud it should be illegal.

The domain is kaoriq.com, registered on 2026-05-02, a personality-quiz × fragrance e-commerce site I'm building. As of today (May 5), it has fewer than 20 articles. Doing the back-of-the-envelope math, that page-view distribution cannot physically have come from real humans.

This post walks through how I read the first week of GA4 data on a new domain as "me + a crawler army", with the actual numbers exposed. For anyone running GA4 on a new project, or anyone who registered a domain this weekend.

The raw data: past 14 days (4 days of real activity)

Numbers first, no spin.

Overall

Metric	Value
Sessions	37
Page Views	65
Total Users	34
New Users	34
Avg Session Duration	104.1 s
Bounce Rate	80%

By Country

Country	Sessions	PV	Avg Duration (s)
Japan	5	33	751.0
United States	17	17	4.9
Canada	4	4	1.3
France	4	4	1.4
Poland	2	2	0.0
South Korea	2	2	0.0
(not set)	1	1	0.1
India	1	1	0.0
Singapore	1	1	0.0

Daily

Date	Sessions	PV	Users
2026-05-02 (registration day)	17	40	14
2026-05-03	6	11	6
2026-05-04	12	12	12
2026-05-05	2	2	2

At a glance, "not bad for week one" is a tempting read. But this dataset contains a 751-second Japanese reader living next door to every other country averaging under 5 seconds. The middle is missing. That gap is the whole tell.

Five signals, beaten in parallel

I never call bot traffic on a single signal. To avoid false positives, I always cross-check five axes at once.

Signal	Bot pattern	Human pattern	kaoriq actual	Verdict
Session duration	0–5 s	30 s – several min	US 4.9s, FR 1.4s, KR 0s	Bot
Bounce rate	90–100%	40–70%	80%	Leans bot
PV / Session	1.0 (one page, gone)	1.5–3.0	US: 17/17 = 1.0	Bot
Geographic anomaly	Random countries unrelated to content	Concentrated in target geo	EN/JA only, yet PL/IN/SG	Bot
Time-series spike	Massive day-one for new domains	Gradual ramp	40 PV on day of registration	Bot

Why a single signal lies

"80% bounce, must all be bots, right?" Not so fast.

Duration alone: A reader who opens your post in a tab and walks away for lunch racks up 30+ minutes. From the number alone, "deeply engaged" and "abandoned tab" are indistinguishable.
Bounce rate alone: A landing page that perfectly answers the question gets a 100% bounce from satisfied humans. Excellence and bots both score the same.
Geography alone: A viral overseas tweet legitimately produces multi-country traffic. Weak on its own.

You only get to call "bot" with confidence when all five signals lean the same direction simultaneously.
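The all-five rule can be sketched in a few lines of Python. This is a minimal illustration, not the exact check I run: the thresholds come from the table above, except the bounce-rate cutoff. The table leaves a gap between 70% and 90%, so treating anything above the human band as "leans bot" is an assumption, as are the names `Signals`, `leans_bot`, and `verdict`.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    avg_duration_s: float   # average session duration in seconds
    bounce_rate: float      # 0.0 to 1.0
    pv_per_session: float   # page views divided by sessions
    geo_anomaly: bool       # traffic from countries unrelated to the content
    day_one_spike: bool     # large spike on the registration day


def leans_bot(s: Signals) -> dict:
    """Return, per signal, whether it leans toward the bot pattern."""
    return {
        "duration": s.avg_duration_s <= 5.0,       # table: bots sit at 0-5 s
        "bounce": s.bounce_rate > 0.70,            # assumption: above the 40-70% human band
        "pv_per_session": s.pv_per_session < 1.5,  # table: humans sit at 1.5-3.0
        "geo": s.geo_anomaly,
        "spike": s.day_one_spike,
    }


def verdict(s: Signals) -> str:
    """Call "bot" only when every signal leans the same way."""
    votes = leans_bot(s)
    if all(votes.values()):
        return "bot"
    if not any(votes.values()):
        return "human"
    return "inconclusive"


# US row from the kaoriq data: 17 sessions, 17 PV, 4.9 s average, 80% site-wide bounce.
us = Signals(avg_duration_s=4.9, bounce_rate=0.80, pv_per_session=17 / 17,
             geo_anomaly=True, day_one_spike=True)
print(verdict(us))
```

The "inconclusive" branch is the point: a lunch-break tab with a 30-minute duration and a 100% bounce splits the vote, so it never gets called a bot on one signal.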

The bimodal distribution was the smoking gun

The real reason this verdict held in kaoriq's case is the shape of the duration distribution.

Japan: 5 sessions / 751 s average
Everywhere else: 0–5 s

If the traffic were genuinely human, session duration should spread more evenly across the 10–180 second band: "bounced after the title (10s)," "read the lede (40s)," "made it to the end (180s)" forming a gradient.

But kaoriq's distribution is bimodal with the middle scooped out. The honest reading: only "me (long sessions, testing the site)" and "crawlers (instant exits)" exist. Nothing in between.
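A rough way to see the scooped-out middle is to bucket the By Country rows. This sketch uses each country's average duration as a stand-in for its individual sessions, because that is all the table provides; the 5-minute cutoff for "long" and the bucket names are assumptions made for illustration.

```python
# (country, sessions, avg_duration_s) copied from the By Country table.
rows = [
    ("Japan", 5, 751.0), ("United States", 17, 4.9), ("Canada", 4, 1.3),
    ("France", 4, 1.4), ("Poland", 2, 0.0), ("South Korea", 2, 0.0),
    ("(not set)", 1, 0.1), ("India", 1, 0.0), ("Singapore", 1, 0.0),
]

SHORT_MAX_S = 5.0    # "instant exit" ceiling, from the 0-5 s bot pattern
LONG_MIN_S = 300.0   # assumption: anything over 5 minutes counts as a long session


def bucket(avg_duration_s: float) -> str:
    """Classify an average duration as short, middle, or long."""
    if avg_duration_s <= SHORT_MAX_S:
        return "short"
    if avg_duration_s >= LONG_MIN_S:
        return "long"
    return "middle"


counts = {"short": 0, "middle": 0, "long": 0}
for _country, sessions, avg_s in rows:
    counts[bucket(avg_s)] += sessions

total_sessions = sum(sessions for _c, sessions, _d in rows)
total_seconds = sum(sessions * avg_s for _c, sessions, avg_s in rows)
japan_share = (5 * 751.0) / total_seconds

print(counts)
print(f"weighted average: {total_seconds / total_sessions:.1f} s")
print(f"Japan's share of all session time: {japan_share:.1%}")
```

With these rows the middle bucket stays empty, and nearly all recorded session time belongs to the five Japanese sessions. The site-wide average duration is therefore describing one person, not an audience.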

Conversely, a healthy distribution would look like "Japan 100 sessions / 60s, US 50 sessions / 45s, Canada 20 sessions / 30s": durations spread normally. That'd be a real human traffic signature.

So how many real humans were there?

After all that beating, my estimate breaks down as:

Category	Estimated sessions	Notes
Me, testing the site	4–5	Most of Japan's 5 sessions, source of the 751s average
Crawlers (Googlebot / Bingbot / GPTBot / ClaudeBot / AhrefsBot, etc.)	27–30	Most of the US 17, plus the near-zero Canada, Europe & Asia rows
Actual organic human traffic	2–5	The remainder of Japan + a couple of US sessions

Of 37 sessions, at most 5 were real humans. That's the reality of week one for a new domain.

Why GA4 doesn't filter this for you

GA4 has a "known bots and spiders" auto-exclusion based on the IAB/ABC Spiders & Bots list. It catches classical crawlers but misses:

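Browser-driven AI agents and fetchers: the headline AI crawlers (GPTBot, ClaudeBot, PerplexityBot) are generally reported to fetch raw HTML without executing JavaScript, so they rarely fire the GA4 tag themselves. The AI-adjacent traffic that does land in GA4 comes from tools that render pages in a real browser engine.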
SEO-tool crawlers: AhrefsBot, SemrushBot, MozBot. High frequency, and they swarm new domains the moment they're discovered.
Headless-browser scrapers: Custom Puppeteer or Playwright bots are indistinguishable from a real Chrome session.

The week after a new domain registration is when this crawler army discovers the new hostname. It calms down within 7–10 days, once the initial discovery sweep has run its course. But if you take week-one GA4 at face value, you'll make bad decisions.

Three annotations every new-project dashboard needs

Use "Engaged Sessions" as your primary metric. GA4 defines an engaged session as: ≥10s duration OR ≥2 PV OR a conversion event. Most of the bot army gets filtered here.
Always view session duration split by country. Looking at any single metric (sessions, PV) without the geo filter lets the crawler army masquerade as success.
Treat the first 30 days as a "noise phase." Real numbers only appear after social funnels, SEO, and content depth all line up.
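The engaged-session rule from the first annotation is small enough to write down. This is a sketch of the definition as stated above, applied to hypothetical session records; GA4 computes the metric for you, so the function `is_engaged` is only here to make the filter concrete.

```python
def is_engaged(duration_s: float, page_views: int, conversions: int = 0) -> bool:
    """Engaged-session rule as stated above: >=10 s OR >=2 page views OR a conversion."""
    return duration_s >= 10 or page_views >= 2 or conversions > 0


# Hypothetical per-session records: (duration in seconds, page views, conversions).
sessions = [
    (0.0, 1, 0),    # instant exit
    (4.9, 1, 0),    # typical crawler-shaped visit
    (42.0, 1, 0),   # read one page properly
    (8.0, 3, 0),    # quick, but clicked around
]
engaged = [s for s in sessions if is_engaged(*s)]
print(f"{len(engaged)} of {len(sessions)} sessions count as engaged")
```

The one-page, under-five-second visits that dominate a new domain's first week fail all three conditions, which is why the metric filters most of them out.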

Closing: look at your own GA4 with this lens

A new domain's GA4 lies for the first 1–2 weeks. If your country breakdown is full of zero-second sessions from the US, Eastern Europe, and Southeast Asia: that's the crawler parade, not humans falling in love with your content.

The procedure is simple: beat with five signals → suspect bimodal distributions → swap the primary metric to Engaged Sessions. Doing this saves you from being whipsawed by early data.

Doubting GA4 is, in the end, a discipline for not making expensive mistakes. Beat the data before the data beats you.

This post is based on real data

Site: kaoriq.com (domain registered 2026-05-02, built with Astro v6 + Tailwind v4)
Period analyzed: 2026-04-22 → 2026-05-05 (4 days of actual activity)
Data source: GA4 Data API v1beta via Service Account
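For anyone who wants to pull the same country breakdown, here is a sketch of a request body for the GA4 Data API `runReport` method. The exact query behind this post is not shown here, so the metric selection and the helper name `country_report_body` are assumptions. The body would be POSTed to the property's `:runReport` endpoint with service-account credentials.

```python
import json


def country_report_body(start_date: str, end_date: str) -> dict:
    """Build a runReport request body: sessions, page views, and duration by country."""
    return {
        "dateRanges": [{"startDate": start_date, "endDate": end_date}],
        "dimensions": [{"name": "country"}],
        "metrics": [
            {"name": "sessions"},
            {"name": "screenPageViews"},
            {"name": "averageSessionDuration"},
            {"name": "engagedSessions"},
        ],
        # Largest countries first, so the crawler-heavy rows are easy to spot.
        "orderBys": [{"metric": {"metricName": "sessions"}, "desc": True}],
    }


body = country_report_body("2026-04-22", "2026-05-05")
print(json.dumps(body, indent=2))
```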

If you want the full LLMO playbook (how to think about AI crawlers, citations, and the measurement layer underneath the GA4 narrative):

LLMO: AI Search Optimization for Engineers

I Stacked 4 More Context Layers on Top of RAG. Sonnet Got 12% Better. Haiku Got 14% Worse.

Ken Imoto — Fri, 22 May 2026 13:00:01 +0000

I read a post about "Full Context Engineering" and immediately added four more layers to my RAG pipeline. Structured output instructions. Hierarchical document layout. Role definition. Few-shot examples. The whole buffet.

The improvement on Claude Sonnet was 12%.

The improvement on Claude Haiku was minus 14%.

I had just spent two weeks building scaffolding to make my smaller model worse at its job. If you have ever wallpapered a room and stepped back to discover you covered up the light switch, you know the feeling.

This post is about what those numbers actually mean for the way you spend your context engineering effort in 2026.

What I was measuring

I reused the benchmark I had built against my own book corpus for a previous experiment (the cheap-model post), with the same scoring rubric: factual accuracy, hallucination rate, specificity, and honesty, for a total score on a 0 to 15 scale.

The configurations were a ladder. Each rung adds one more thing on top of the previous one.

System prompt only: the bare baseline. No retrieval, nothing.
System + RAG: vector search over a curated corpus, top documents injected.
Full Context Engineering: RAG + structured output instructions + hierarchical layout + role definition + few-shot examples.

What I expected: a smooth upward curve. What I got was a curve that leaned forward and then fell over.

The numbers

Claude Sonnet, total score (out of 15):

Configuration	Total	Delta from previous
System only	8.8	--
System + RAG	10.2	+1.4 (+16%)
Full Context Engineering	11.4	+1.2 (+12%)

Claude Haiku, total score (out of 15):

Configuration	Total	Delta from previous
System only	3.7	--
System + RAG	11.8	+8.1 (+219%)
Full Context Engineering	10.1	-1.7 (-14%)

Two findings I did not expect.

First: RAG is doing most of the work. On Sonnet, RAG closed 54% of the gap between baseline and the fully tricked-out pipeline (1.4 of the total 2.6 point improvement). On Haiku, RAG alone overshot the final number entirely.

Second: stacking more on top of RAG is not free. On Haiku, it actively made things worse. The hallucination score went from 1.7 to 0.5 and the honesty score from 1.3 to 0.5 (higher is better on both). The model started confidently making things up that it had previously hedged on.
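The deltas are easy to recompute from the two tables. This short sketch uses only the total scores shown above; the helper names `pct_change` and `rag_share_of_gain` are mine.

```python
# Total scores (out of 15) from the two tables above.
scores = {
    "Sonnet": {"system": 8.8, "rag": 10.2, "full": 11.4},
    "Haiku": {"system": 3.7, "rag": 11.8, "full": 10.1},
}


def pct_change(before: float, after: float) -> float:
    """Relative change from one rung to the next, in percent."""
    return (after - before) / before * 100


def rag_share_of_gain(s: dict) -> float:
    """Fraction of the baseline-to-full improvement that RAG alone delivers."""
    return (s["rag"] - s["system"]) / (s["full"] - s["system"])


for model, s in scores.items():
    print(
        f"{model}: RAG {pct_change(s['system'], s['rag']):+.0f}%, "
        f"full stack {pct_change(s['rag'], s['full']):+.0f}%, "
        f"RAG share of total gain {rag_share_of_gain(s):.0%}"
    )
```

A share above 100% for Haiku means RAG alone scored higher than the full stack did, so the extra layers subtracted from the result.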

Why this happens

I have a hypothesis that I think survives contact with reality.

A small model has limited working memory. RAG hands it the right facts. Once those facts are in front of it, the marginal returns from extra structure are small. But the marginal cost of extra context is not small. Every paragraph of role definition, every few-shot example, every "here is how to format the output" block competes with the retrieved documents for the model's attention.

For Sonnet, the working memory is wide enough that the extras land in unused space. For Haiku, the extras shove the actually-useful retrieved context off to the edge of the window. The model still sees it. It just stops trusting it.

This is the same finding that recent research on long-context behavior keeps surfacing. Studies on instruction-following at high context fill report that for most frontier models in 2026, quality starts to degrade measurably at 60 to 70 percent context fill, and falls off a cliff around 90 percent. The cliff is steeper for smaller models.

The Pareto principle applies to context engineering with embarrassing accuracy. RAG is the 20 percent of effort that produces 80 percent of the result. Everything you stack on top of it is the long tail.

The 2026 reality I almost forgot to mention

When I ran the original experiment, I was on Sonnet 4 and Haiku 3 with a 200K context window. As of this writing, Sonnet 4.6 has a 1M token context window at standard pricing and prompt caching cuts the cost of repeated context by 90 percent.

This changes the math, but not in the direction you might think.

A 1M context window does not magically make stacked context cheaper to design. The model still has to pay attention to the right thing. The degradation that starts at 60 to 70 percent fill, and the cliff near 90 percent, are percentages, not absolutes. A bigger window just means you can write more bad context before you fall off.
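To make "a percentage, not an absolute" concrete, here is the arithmetic for the two window sizes mentioned above. The 60% and 90% figures are the ones quoted earlier in this post, not constants of any model; the function names are illustrative.

```python
def fill_ratio(tokens_in_context: int, window_tokens: int) -> float:
    """Fraction of the context window that is occupied."""
    return tokens_in_context / window_tokens


def tokens_at_fill(window_tokens: int, fill: float) -> int:
    """How many tokens it takes to reach a given fill ratio."""
    return round(window_tokens * fill)


for window in (200_000, 1_000_000):
    print(
        f"{window:>9,} window: 60% fill = {tokens_at_fill(window, 0.60):,} tokens, "
        f"90% fill = {tokens_at_fill(window, 0.90):,} tokens"
    )
```

The thresholds scale with the window. A larger window raises the absolute token count at which trouble starts, but the ratio at which it starts is unchanged.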

Prompt caching helps if your stacked layers are static. The role definition, the few-shot examples, the structured output instructions: those parts cache cleanly. But that only saves money. It does not save quality. If your Haiku result was minus 14%, prompt caching makes minus 14% cheaper. That is not the win you wanted.

The thing nobody told me about Skills

Anthropic's Skills feature is interesting in this light. Skills are reusable context bundles that load on demand. The right way to think about them is not "more context, all the time" but "the right context, just in time."

That is the failure mode my Full CE experiment ran into. I was packing every layer into every request. Skills point at the alternative: keep the system prompt small, retrieve the relevant skill, and let the rest stay out of the window. It is the same lesson as RAG, applied one level up. Selective beats throwing everything in.

What I do now

If you take only one thing from this post, take this: the order of operations matters more than the number of operations.

Build the retrieval first. Get RAG working with a clean corpus, decent embeddings, and a relevance threshold. This is your 80%.
Run a benchmark. Real benchmark, on real questions, scored by a real rubric. Not vibes.
Add one layer at a time. Structured output, then hierarchical layout, then role definition. Re-benchmark after each.
If the score goes down, take that layer out. Do not assume the layer is good and your benchmark is bad. The benchmark is right more often than you think.
Try the same ladder on a smaller model. The thing that helps Sonnet may hurt Haiku. Knowing which side of the line you are on saves you money.
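The ladder above is a greedy ablation loop. Here is a minimal sketch of it, assuming you already have a benchmark function that takes a list of layers and returns a score. `fake_score` and its per-layer effects are invented stand-ins so the example runs; they are not measured numbers.

```python
from typing import Callable, List


def build_stack(candidates: List[str], score: Callable[[List[str]], float]) -> List[str]:
    """Add layers one at a time, re-benchmark, and drop any layer that lowers the score."""
    kept = ["rag"]                  # retrieval comes first
    best = score(kept)
    for layer in candidates:
        trial_score = score(kept + [layer])
        if trial_score > best:      # the layer earned its keep
            kept.append(layer)
            best = trial_score
    return kept


# Stand-in benchmark: hypothetical per-layer effects, NOT measured numbers.
FAKE_EFFECT = {
    "rag": 10.0,
    "structured_output": 0.4,
    "hierarchical_layout": -0.3,
    "role_definition": 0.2,
    "few_shot": -0.9,
}


def fake_score(layers: List[str]) -> float:
    """Pretend benchmark: sum the invented effect of each layer."""
    return sum(FAKE_EFFECT[name] for name in layers)


candidates = ["structured_output", "hierarchical_layout", "role_definition", "few_shot"]
print(build_stack(candidates, fake_score))
```

Swap `fake_score` for a real benchmark run per model and the same loop gives you a separate stack for Sonnet and for Haiku, which is the last step in the list.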

This sounds obvious. It is not what most teams do. Most teams read a blog post about Context Engineering, add four layers in one weekend, and never measure whether the layers actually helped.

The chef and the kitchen, revisited

In an earlier post I wrote that the model is the chef and the context is the kitchen. I want to extend that.

Adding more context layers is like installing more kitchen equipment. A second oven. A pasta machine. A sous-vide. None of them make the chef worse at cooking pasta. But if the pasta machine takes up the counter space where the chef was chopping vegetables, the dinner gets worse anyway.

The chef does not need every appliance. The chef needs the right ingredients within reach.

Before you read the next breathless post about Full Context Engineering and start adding layers, run the experiment. Measure RAG alone. Measure RAG plus one thing. Find the layer that earns its keep, and leave the rest in the catalog.

The answer is almost always: do RAG well first. Everything else is decoration. Decoration that, on a small model, can flip the sign on your accuracy score and leave you wondering why.

The next time someone says "Context Engineering," what I want to say back is: please define which 20 percent of context you mean. The other 80 has a good chance of making things worse.

The full Context Engineering system (five strategies, the RAG benchmarks behind these numbers, MCP server design, and the Agentic RAG implementation) is in Turning LLMs from Liars into Experts: Context Engineering in Practice.

OpenClaw Hit 250K Stars Faster Than React. I Spent 24 Hours Trying to Like It.

Ken Imoto — Fri, 22 May 2026 13:00:00 +0000

I switched my entire dev setup from Claude Code to OpenClaw on a Tuesday morning. By 11am I was googling "how to remove openclaw". By 6pm I had written a SOUL.md file longer than the actual feature I was shipping.

This post is about that day. About what broke, what didn't, and what 24 hours of working in the terminal agent that is now technically the most-starred open-source project in GitHub history bought me.

Yes, I am the engineer who wrote about Claude Code Skills three weeks ago and called the workflow pattern "settled for at least a year." Then OpenClaw passed React's all-time star count in 60 days, Peter Steinberger announced he was joining OpenAI to ship agents to everyone, and the launch tweet went past 4 million views. Settled, apparently, was a one-month forecast.

The numbers I had to verify before believing them

Let me get the facts out of the way, because half of what people quote on Twitter about OpenClaw is wrong by a factor of two.

OpenClaw crossed 250,000 GitHub stars on March 3, 2026, surpassing React for the all-time most-starred software repository
60 days from launch to 250K. React took roughly a decade
60K stars in the first 72 hours. That part is the one nobody actually believes the first time
Peter Steinberger announced on February 14, 2026 that he is joining OpenAI to work on agents, with OpenClaw moving to a foundation to stay open and independent
One mid-sized refactor session in my test run consumed 920K tokens, which on Claude 4.5 Sonnet billing came out to about USD 8.30

The Hacker News thread when it crossed React was the most upvoted submission of the week. The top comment was "this is either the best thing that happened to dev tools in five years or the most expensive way to learn what --yolo does."

It is, somehow, both.

The setup, and the part I underestimated

Installation took less than a minute.

curl -fsSL https://get.openclaw.dev | sh
export ANTHROPIC_API_KEY=sk-ant-...
openclaw

The first surprise: OpenClaw asked me which model I wanted as default. I had four serious choices, plus Ollama for local models.

openclaw --model claude-4.5-sonnet
openclaw --model gpt-4o
openclaw --model gemini-2.5-pro
openclaw --model ollama/devstral:24b

Claude Code has a backend model. OpenClaw has a backend model dropdown. That is not a small UX difference when you are trying to land a refactor for less than ten dollars.

The second surprise: when I ran my first command, the agent asked me where the SOUL.md file was. I did not have one. It happily generated a default. The default was generic enough that I closed the session, opened my editor, and started writing my own. That is when the day quietly stopped being a benchmark and started being a personality test.

SOUL.md is the part nobody warned me about

Here is the SOUL.md I ended the day with, after rewriting it three times.

# SOUL.md
You are a senior backend engineer with strong opinions and short patience for code
that talks more than it does.

- Prefer Python over TypeScript when both fit. We're not building a frontend here.
- Never add a feature without a test. If the test would take more than 10 minutes
  to write, ask first instead of writing it.
- Performance matters but readability matters more. We're a four-person team, not
  Google.
- Do not write conversational filler. "Sure, I'll do that" is not output. Output
  is the diff.
- When in doubt, ask. Don't guess. Guessing once cost us a weekend.

The thing the docs do not tell you: SOUL.md is not a config file. It is a contract. CLAUDE.md tells Claude Code what the project is. SOUL.md tells OpenClaw who the agent is. They are two different shapes of the same trust problem, and the day I figured that out was the day OpenClaw stopped feeling worse than Claude Code and started feeling different.

I had a Claude Code session open in another window all day for a sanity check. By 4pm I noticed my CLAUDE.md was 312 lines and my SOUL.md was 14. The SOUL.md was doing more work per line.

The Gateway, and why my LGPD-anxious teammate cared

OpenClaw routes every LLM call through a local process called the Gateway.

The Gateway sits on your machine. Your prompts and code do not pass through an OpenClaw-operated cloud relay on the way to Anthropic, OpenAI, or whoever. They go straight from your laptop to the model provider you picked.

Claude Code does not have an equivalent intermediary, but it also does not need one because Anthropic is the only provider. The moment you have multi-provider support, you either need a relay (vendor lock-in risk) or a local gateway (the OpenClaw choice).

A teammate of mine who lives in Brazil and spends meaningful time worrying about LGPD compliance pinged me at lunch to ask what the network diagram looked like. He liked what I sent him. That conversation alone might be worth the day.

ClawHub vs Claude Code Skills

Claude Code Skills are markdown files plus optional resources, distributed however you distribute markdown. ClawHub is an npm-style package marketplace for OpenClaw skills.

openclaw skills search "docker"
openclaw skills install @clawhub/docker-manager
openclaw skills list

ClawHub had several thousand skills the day I tried it. The numbers Steinberger throws around at conferences are higher and probably accurate, but the count moves fast enough that any specific figure is wrong by the time you publish it.

Two real differences I felt:

ClawHub skills are JavaScript. They run in a sandbox but can request shell exec privileges. That makes them more capable than Claude Code Skills and more dangerous. The ClawHavoc incident in March of 2026 saw 341 malicious skills caught, which is a real cost of an open marketplace
Claude Code Skills are simpler to author. I wrote a Skill in 20 minutes my first time. The equivalent ClawHub skill took me about 90 minutes because I had to learn the SDK conventions

If you are an individual developer wanting to share a workflow, Skills are easier. If you are a team wanting a versioned, packaged, audited tool, ClawHub is better. They are not competing for the same problem.

The 3pm moment where I almost stopped

I asked OpenClaw to update some Python 3.8 code to 3.11 across a small repo, run the test suite, and report back.

It did. The session ate 920K tokens, took about 14 minutes, found three places where my colleague had used the walrus operator wrong, and quietly fixed them. I checked the diff. It was right.

Claude Code does the same thing. I have run the same prompt against it many times.

The difference was not the output. The difference was that Claude Code is in my muscle memory. I have typed claude three times a day for a year. When I typed openclaw and waited the extra 1.2 seconds for the cold start, my fingers reached for claude instead. Three times.

That is the part nobody writes about. Switching costs are not just config. They are reflexes. By 3pm I had written half a SOUL.md, almost given up, made coffee, and come back. By 6pm I was OK again.

What I would actually use each one for

I built this matrix during the second coffee.

Decision	OpenClaw	Claude Code
Locked into Anthropic models?	No, multi-provider	Yes, Anthropic only
Local model option	Ollama	None official
Skill distribution	ClawHub package marketplace	Markdown files
Personality file	SOUL.md (who is the agent)	CLAUDE.md (what is the project)
Network architecture	Local Gateway, no relay	Direct to Anthropic
Maturity	60 days old, foundation forming	18 months, Anthropic-stable
Best at	Multi-model teams, regulated environments	Anthropic-first dev shops, simplicity

If your team is Anthropic-only and your CLAUDE.md is already 200 lines, do not switch. Claude Code is fine. The Skills you wrote are still fine. The pattern works.

If your team is multi-provider, or your compliance team has questions about where prompts travel, or you want a backend model dropdown, OpenClaw is worth a Tuesday.

I am still on Claude Code as my default. I have OpenClaw aliased to a separate command for the cases where I want to try a different model on the same prompt without paying for two SaaS subscriptions worth of context.

Where this goes next

OpenClaw moving to a foundation while Steinberger joins OpenAI is the part I am watching most closely. Foundations are how open-source projects survive their founders. They are also how projects ossify. The first six months of governance under the OpenClaw Foundation will tell you whether the project is going to be Linux or Helm.

If you used to argue Claude Code vs Codex was a binary, OpenClaw is the answer that was supposed to be impossible: a third option that is not produced by an LLM lab. The economics of that are interesting. The next twelve months are going to teach us whether neutral, cross-provider, foundation-governed AI tooling is sustainable, or whether it gets quietly absorbed.

I am betting on sustainable. I have also been wrong about agents in roughly all of the previous quarters, so adjust accordingly.

What this all costs to know

If you take one thing from my Tuesday, take this. OpenClaw and Claude Code are not competitors. They are two answers to the same question: what should the AI inside your terminal be allowed to do without asking you first? SOUL.md and CLAUDE.md are different shapes of the same trust contract. The team that wrote each chose differently because they had different assumptions about who was sitting in front of the screen.

The right tool is the one whose assumptions match yours. Pick on assumptions, not stars.

If you want the harness-engineering frame on this (CLAUDE.md tiers, hooks, sub-agents, how to think about the shell around the prompt, not just the prompt), this is the reference:

Harness Engineering: A Practitioner's Guide

Is AI Actually Citing Your Site? How to Measure What Google Rankings Can't

Ken Imoto — Thu, 21 May 2026 13:00:00 +0000

I've spent the past few weeks writing about LLMO: how to get cited by AI search engines, which content structures work, what Princeton's GEO study says about visibility. All useful stuff. One problem: I had no idea whether any of it was actually working.

I was like a chef who obsesses over recipes but never tastes the food. My Google Search Console was immaculate. My LLMO measurement setup? I was literally typing "does ChatGPT know about my site" into ChatGPT and refreshing the page like a teenager checking if their crush liked their post.

Measuring LLMO is a genuinely hard problem, and most people aren't doing it at all. Here's what I've built: three measurement layers, from "costs nothing" to "costs you a Saturday afternoon of Python."

The Measurement Gap

In SEO, measurement is a solved problem. Google Search Console shows rankings, impressions, clicks, and CTR for free, updated daily. Ahrefs adds backlink data. SEMrush gives you keyword tracking. Everything is visible.

In LLMO, almost nothing is visible out of the box.

There's no "AI Search Console." ChatGPT doesn't send a weekly email saying "You were cited 47 times!" Perplexity has no creator dashboard. The shift: SEO had rankings (1st through 100th position). LLMO has a binary outcome. You're either cited or you're not. And nobody is telling you which.

This gap isn't just an inconvenience. You can't improve what you can't measure, and right now, most content creators are optimizing for AI visibility while flying blind.

Layer 1: GA4 AI Referral Traffic (Free, 5 Minutes)

The easiest measurement you can set up today is tracking AI referral traffic in Google Analytics 4. When an AI search engine cites your site with a clickable link and someone clicks it, GA4 records the source.

Here is the regex pattern I use in a custom channel group:

chatgpt\.com|perplexity\.ai|claude\.ai|gemini\.google\.com|copilot\.microsoft\.com|deepseek\.com|you\.com|meta\.ai|poe\.com

Go to Admin → Channel Groups → Create, add a new channel with this regex as the session source filter, and name it "AI Search." You'll immediately see aggregated traffic from all AI platforms in one view.
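Before pasting the pattern into GA4, it is worth sanity-checking it locally. This sketch runs the same regex through Python's `re` module against a few made-up source values. `re.search` behaves like a partial match, which is an assumption about how you configure the GA4 condition.

```python
import re

AI_SOURCE = re.compile(
    r"chatgpt\.com|perplexity\.ai|claude\.ai|gemini\.google\.com|"
    r"copilot\.microsoft\.com|deepseek\.com|you\.com|meta\.ai|poe\.com"
)

# Hypothetical session-source values.
samples = ["chatgpt.com", "www.perplexity.ai", "google.com", "claude.ai", "bing.com"]
matched = [s for s in samples if AI_SOURCE.search(s)]
print(matched)

# The pattern is unanchored, so a partial match also catches look-alikes.
print(bool(AI_SOURCE.search("thank-you.com")))
```

The second print shows the trade-off: a partial match is forgiving about subdomains, but `you\.com` will also match any source that merely ends in "you.com".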

A few things to know:

ChatGPT plays nicely. Since late 2025, ChatGPT appends utm_source=chatgpt.com to outbound links. ChatGPT traffic shows up cleanly as chatgpt.com / referral in GA4.

Perplexity is decent. Traffic appears as perplexity.ai / referral, though without UTM tags. Still trackable.

Free-tier ChatGPT is a black hole. Free users often don't send referrer data due to privacy settings. Their clicks show up as "Direct," indistinguishable from someone typing your URL manually. Your GA4 numbers are a floor, not a ceiling.

The conversion story is where this gets interesting. Industry data from 2026 shows AI referral traffic converts at 8-12%, compared to 2-3% for traditional Google organic. People who arrive via AI search have already done their research. The AI did it for them. They are further along in the decision process.

I started tracking three weeks ago. My AI referral traffic is still small (single digits daily), but the conversion rate is 3x my organic average. Small sample, but a signal worth watching.

Layer 2: The "Ask Five AIs" Protocol (Free, 30 Min/Month)

GA4 tells you who clicked through. It does not tell you whether AI is mentioning you without linking, or whether it is mentioning you at all.

For that, you need to ask directly. I run this on the first Monday of every month:

Step 1. Write 10-15 prompts related to your niche. Mine include "What are the best resources for AI search optimization?", "How do I get my site cited by ChatGPT?", and "LLMO vs SEO differences."

Step 2. Run each prompt on five platforms: ChatGPT, Perplexity, Gemini, Claude, and Copilot.

Step 3. Record four things per prompt per platform:

Mentioned? (Yes / No)
Context (recommendation / comparison / neutral / negative)
Accuracy of information
URL provided?

Step 4. Calculate your citation rate. 15 prompts x 5 platforms = 75 checks. Mentioned 20 times? That's 26.7%.
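If the spreadsheet is exported as rows of (platform, mentioned), the Step 4 arithmetic looks like this. The data below is hypothetical and arranged to reproduce the 20-of-75 example; the function names are illustrative.

```python
from collections import defaultdict

PLATFORMS = ["ChatGPT", "Perplexity", "Gemini", "Claude", "Copilot"]


def citation_rate(checks) -> float:
    """checks: list of (platform, mentioned) tuples, one per prompt per platform."""
    return sum(1 for _platform, mentioned in checks if mentioned) / len(checks)


def rate_by_platform(checks) -> dict:
    """Citation rate broken out per platform."""
    totals, hits = defaultdict(int), defaultdict(int)
    for platform, mentioned in checks:
        totals[platform] += 1
        hits[platform] += int(mentioned)
    return {p: hits[p] / totals[p] for p in totals}


# Hypothetical month: 15 prompts on 5 platforms, mentioned for the first 4 prompts on each.
checks = [(p, prompt_idx < 4) for p in PLATFORMS for prompt_idx in range(15)]
print(f"{len(checks)} checks, citation rate {citation_rate(checks):.1%}")
print(rate_by_platform(checks))
```

The per-platform breakdown is worth keeping next to the headline rate, since a single overall number can hide a platform that never mentions you.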

This takes about 30 minutes with a spreadsheet. It's manual and tedious, and also the most reliable method that exists today. Automated tools can approximate this, but they can't replicate the nuance of "was that mention positive or just a passing reference?"

One caveat: LLM responses are non-deterministic. The same prompt can produce different answers on different days. A single check isn't statistically significant. That is why I track the monthly trend, not individual data points. Three months of data starts showing real patterns.

Layer 3: Automate It With Python (One Saturday)

If you're an engineer, you can automate the manual protocol with API calls. Hit the OpenAI and Anthropic APIs with your query set, check whether your brand appears in the response, and log results as a time series.

The core logic is simple:

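from openai import OpenAI  # pip install openai; reads OPENAI_API_KEY from the environment

# Strings that count as a mention of your brand (matched case-insensitively).
BRAND_VARIANTS = ["your-site.com", "Your Brand", "yourbrand"]
CHECK_QUERIES = [
    "Best tools for [your category]",
    "How to solve [problem you address]",
    "[Your brand] vs [competitor]",
]

def check_openai(query: str) -> dict:
    """Ask one query and report whether any brand variant appears in the answer."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
        temperature=0.0,  # reduces run-to-run variation; does not eliminate it
    )
    # content can be None (for example on a refusal), so fall back to an empty string.
    answer = (response.choices[0].message.content or "").lower()
    mentioned = any(v.lower() in answer for v in BRAND_VARIANTS)
    return {"platform": "ChatGPT", "query": query, "mentioned": mentioned}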

Extend this for Claude and Perplexity, run weekly via cron, dump to CSV. You get a time series of your AI visibility score for about $0.50/week.
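The "dump to CSV" half needs nothing beyond the standard library. Here is a sketch that appends one run's results and computes the visibility score. The results list is hypothetical, the file name in the comment is a placeholder, and an in-memory buffer stands in for the real file so the example runs anywhere.

```python
import csv
import io
from datetime import date

FIELDS = ["date", "platform", "query", "mentioned"]


def append_results(fh, results, run_date: date, write_header: bool = False) -> None:
    """Append one run's results (dicts shaped like check_openai's return value) as CSV rows."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    if write_header:
        writer.writeheader()
    for r in results:
        writer.writerow({**r, "date": run_date.isoformat()})


def visibility_score(results) -> float:
    """Share of checks in which the brand was mentioned."""
    return sum(r["mentioned"] for r in results) / len(results)


# Hypothetical results; in a real run these would come from the API checks.
results = [
    {"platform": "ChatGPT", "query": "q1", "mentioned": True},
    {"platform": "ChatGPT", "query": "q2", "mentioned": False},
    {"platform": "Claude", "query": "q1", "mentioned": False},
    {"platform": "Claude", "query": "q2", "mentioned": False},
]
buf = io.StringIO()  # stands in for open("llmo_visibility.csv", "a", newline="")
append_results(buf, results, date(2026, 5, 18), write_header=True)
print(buf.getvalue())
print(f"visibility: {visibility_score(results):.0%}")
```

Storing one row per check, with the date on every row, keeps the raw data around. The weekly score can then be recomputed later with a different set of brand variants or queries.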

The payoff: instead of "I think LLMO is working," you can say "my visibility went from 12% to 28% after I added structured data." Numbers beat feelings.

What's Available in May 2026

If building your own tools isn't your thing, several commercial platforms now track AI citations:

Otterly.ai is the fastest-growing option, with 10,000+ users since launching in October 2024. It monitors your brand across ChatGPT, Perplexity, Google AI Overviews, and Copilot. Keyword-level citation tracking, competitor benchmarking, and clean dashboards.

Profound sits at the enterprise end. Their published case study with Ramp, in which Ramp went from 3.2% to 22.2% AI visibility in one month, is the kind of result that gets budget approved. If you're a larger organization, this is where you'll probably land.

Peec AI focuses on brand mention analysis across LLM outputs. Beyond whether you're cited, it tracks how: what sentiment surrounds your mentions, which prompt patterns trigger citations.

My honest take: for individual creators and small teams, the manual protocol plus a basic Python script gives you 80% of the insight at 0% of the cost. Commercial tools become worthwhile when you're tracking dozens of keywords across multiple brands and need team dashboards.

The Crawler Signal You're Probably Ignoring

Here's a measurement angle most people miss: AI crawler logs.

Your server access logs already record which AI systems are visiting your content. GPTBot (OpenAI), ClaudeBot (Anthropic), and PerplexityBot all identify themselves in the User-Agent string. Google-Extended is the exception: it is a robots.txt control token, not a separate User-Agent, so it never appears in access logs.

grep -E "GPTBot|ClaudeBot|PerplexityBot" \
  /var/log/nginx/access.log | awk '{print $7}' | sort | uniq -c | sort -rn

Pages that get crawled frequently are more likely to appear in AI responses. Pages that never get crawled are invisible. It is an indirect signal, but useful for finding content that AI systems are skipping entirely.
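If you would rather have per-bot counts than one merged list, the same idea fits in a short Python script. This sketch assumes the common "combined" access log format; the sample lines, including the exact User-Agent strings and IPs, are made up for illustration.

```python
import re
from collections import Counter

# Google-Extended is a robots.txt token, not a User-Agent, so it never shows up in logs.
AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot")

# Combined log format: ip - - [time] "METHOD path HTTP/x" status bytes "referer" "user-agent"
LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def crawl_counts(lines) -> Counter:
    """Count (bot, path) pairs for requests whose User-Agent names a known AI crawler."""
    counts = Counter()
    for line in lines:
        m = LINE.search(line)
        if not m:
            continue  # skip lines that are not in the expected format
        for bot in AI_BOTS:
            if bot in m.group("ua"):
                counts[(bot, m.group("path"))] += 1
    return counts


# Hypothetical log lines for illustration.
sample = [
    '203.0.113.5 - - [21/May/2026:10:00:00 +0000] "GET /blog/llmo HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '203.0.113.6 - - [21/May/2026:10:00:05 +0000] "GET /blog/llmo HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"',
    '198.51.100.7 - - [21/May/2026:10:01:00 +0000] "GET /about/ HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/124.0"',
]
for (bot, path), n in crawl_counts(sample).most_common():
    print(n, bot, path)
```

Pointing `crawl_counts` at `open("/var/log/nginx/access.log")` gives the per-bot, per-path table; pages that never appear in it are the ones worth investigating.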

I checked my own logs and found that /blog/ pages get crawled 15x more than my /about/ page. Not shocking, but the gap was wider than I expected.

Building a Measurement Habit

Measurement without action is just data hoarding. Here is the cycle I run:

Weekly (10 min): Check GA4 AI referral dashboard. Note spikes or drops. Compare week-over-week.

Monthly (30 min): Run the five-platform manual protocol. Calculate citation rate. Scan crawler logs for new patterns.

Quarterly (1 hour): Full review. Update query set. Compare citation rate trends. Check whether content changes produced measurable results.
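The "calculate citation rate" step is plain arithmetic. A minimal sketch, assuming the rate is the share of (query, platform) checks in which the site was cited. That definition and both function names are assumptions for illustration, so adapt them to however you log the manual protocol.

```python
def citation_rate(checks):
    """checks maps (query, platform) -> True when the site was cited."""
    if not checks:
        return 0.0
    cited = sum(1 for was_cited in checks.values() if was_cited)
    return cited / len(checks)


def change_in_points(previous_rate, current_rate):
    """Change between two measurements, in percentage points."""
    return round((current_rate - previous_rate) * 100, 1)
```

Tracking the change in percentage points rather than relative percent keeps a small baseline from making every wobble look dramatic.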

The LLMO Framework provides a structured approach to KPI design if you want a more formal methodology. I reference it when deciding which metrics matter most at different growth stages.

The Punchline

I started measuring my LLMO visibility three weeks ago. My citation rate across five platforms is 14%. Not great. Not terrible. But the important part is that I know the number, and three months from now I'll know whether it went up or down.

The SEO world figured out measurement twenty years ago. The LLMO world is still in its "checking rankings by Googling yourself" era. The people who build measurement infrastructure now will have a compounding advantage over those who keep guessing.

If you're still typing your brand name into ChatGPT and squinting at the output, I get it. I was doing the same thing last month. But now I have a spreadsheet, a cron job, and a regex filter in GA4. Less romantic, more informative. I'll take that trade.

References

How to Track AI Traffic in GA4
Best LLMO Tools in 2026
GEO: Generative Engine Optimization by Aggarwal et al., Princeton / ACM SIGKDD 2024
LLMO Framework: KPI design and implementation guide

The full playbook (llms.txt patterns, JSON-LD examples, citation-rate KPIs, and ChatGPT/Perplexity/Brave comparison) is in LLMO Practical Guide: Why ChatGPT Ignores Your Website.

I Audited 30 llms.txt Files in the Wild. 5 Anti-Patterns Are Already Forming.

Ken Imoto — Wed, 20 May 2026 13:05:01 +0000

I shipped my third llms.txt this month and felt extremely productive. The kind of productive where you close the laptop, pour a coffee, and feel like the entire AI-search problem is now solved on a personal level.

Then I opened 30 production llms.txt files from the companies the rest of us are supposed to be learning from. Anthropic. Stripe. Vercel. Cloudflare. Hugging Face. Mintlify. Astro. Linear. The names you cite when you tell someone "look, the serious players are doing it."

24 of the 30 files had at least one of five problems. Three of those problems were in my own files.

That coffee got cold.

How I ran the audit

The setup was embarrassingly simple. I picked 30 domains that have public llms.txt files and matter to developers in 2026: AI labs, infra companies, popular dev tools. I curled each one. I read each one with the eyes of an LLM trying to use it. I logged what was wrong.

This isn't science. It's a Monday evening with a terminal open. But the patterns showed up so fast that I stopped at 30. The next ten would have been more of the same.

A March 2026 SE Ranking study of 300,000 domains found roughly 10% adoption. The codersera May 2026 guide puts the number around 844,000 sites with 500% YoY growth. The standard is winning the adoption race. It is losing the quality race.

The five anti-patterns

Anti-Pattern 1: "Dump everything"

This is the most common failure and the one I am most guilty of. The author treats llms.txt as a second sitemap. 800 links. 1,200 links. One file I opened had every blog post since 2019, flat, no priority, no grouping.

The whole point of llms.txt is that sitemap.xml already exists. When the spec says "10KB recommended" it is not being cute about file size. It is saying: if the LLM cannot read all of this inside a context window with budget left for the actual question, you have not helped, you have moved the problem.

The fix is brutal: pick 10 to 20 links. Not 50. Not "key sections plus a few extras." Ten to twenty. Everything else goes in ## Optional or stays in sitemap.xml.

If you are a docs-heavy product, use the pattern Cloudflare ships: a slim root llms.txt that links to per-product llms.txt files. Each one stays under the budget. An agent fetches only the one it needs. No one reads the entire encyclopedia to fix a faucet.
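The budget is easy to check mechanically. A minimal sketch, assuming links are written as Markdown `[title](url)` and that everything after an `## Optional` heading is exempt. The thresholds mirror the numbers above. The function name is illustrative.

```python
import re

MAX_BYTES = 10 * 1024  # the "10KB recommended" budget
MAX_LINKS = 20         # upper end of the 10-to-20 range

LINK_RE = re.compile(r"\[[^\]]*\]\([^)\s]+\)")  # Markdown [title](url)


def check_budget(text):
    """Return a list of problems; an empty list means the file fits."""
    problems = []
    size = len(text.encode("utf-8"))
    if size > MAX_BYTES:
        problems.append(f"file is {size} bytes, over {MAX_BYTES}")
    # Only links before the "## Optional" heading count toward the limit.
    core = re.split(r"^##\s+Optional\s*$", text, maxsplit=1, flags=re.MULTILINE)[0]
    links = len(LINK_RE.findall(core))
    if links > MAX_LINKS:
        problems.append(f"{links} core links, over {MAX_LINKS}")
    return problems
```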

Anti-Pattern 2: "Contradicts robots.txt"

Open the robots.txt. Open the llms.txt. Diff the paths. About a third of the files I audited list URLs in llms.txt that are explicitly Disallowed in robots.txt for the very crawlers most likely to read llms.txt.

The most painful example: a docs site that blocks GPTBot and ClaudeBot from /docs/ in robots.txt, then lists 40 /docs/* URLs in llms.txt. The file says "here is what matters." The robots.txt says "you cannot have it." The crawler obeys robots.txt. The llms.txt is decorative.

This usually happens when the two files are owned by different teams (or by the same person across two different months). The fix is a five-minute review with both files open: every URL in llms.txt must be allowed in robots.txt for every AI crawler you actually want reading it.

If you genuinely want to block AI crawlers, fine, but then do not also write them a polite directory of your favorite pages.
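The five-minute review can also be scripted. A minimal sketch using Python's standard `urllib.robotparser`, assuming llms.txt uses Markdown `[title](url)` links with absolute URLs. The function name and the two-agent default are illustrative choices, not part of either spec.

```python
import re
from urllib.robotparser import RobotFileParser

AI_AGENTS = ("GPTBot", "ClaudeBot")
URL_RE = re.compile(r"\]\((https?://[^)\s]+)\)")  # URL part of [title](url)


def find_conflicts(robots_txt, llms_txt, agents=AI_AGENTS):
    """Return (agent, url) pairs listed in llms.txt but disallowed in robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    conflicts = []
    for url in URL_RE.findall(llms_txt):
        for agent in agents:
            if not parser.can_fetch(agent, url):
                conflicts.append((agent, url))
    return conflicts
```

One caveat: `robotparser` applies rules in file order, which may differ from how a particular crawler resolves overlapping Allow and Disallow lines. Treat a clean result as a first pass, not a guarantee.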

Anti-Pattern 3: "HTML links only, no .md"

Jeremy Howard's original proposal includes a clever convention: any page URL with .md appended should return a clean Markdown version of the page, with no nav, no ads, no JavaScript bundle. The .html.md pattern.

Almost nobody does it. In my 30 files, only 6 served any .md companion at all. The other 24 hand the LLM a link to an HTML page that the LLM cannot parse cleanly because the crawler does not execute JavaScript.

Stripe does this well: every docs URL has a .md twin and llms.txt points at the .md version. The llmoframework.com reference templates section calls this out as the single highest-leverage thing most teams are skipping, because it is the difference between "AI can find the page" and "AI can actually read what is on the page."

The fix depends on your stack. For Astro and Next.js, generating .md versions at build time is a 30-line change. For dynamic CMS sites, an edge function that returns a markdown serialization on the .md suffix is the move. Either way, this is the anti-pattern with the largest delta between effort and outcome.

Anti-Pattern 4: "About page theatre"

Eight of the 30 files used the entire body of the file as a marketing pitch. Three paragraphs about the company's mission. A founder quote. The history of the brand. Then two links. Total content: "we are visionary leaders in the AI-native space."

LLMs do not buy your vibe. They need pointers to content. The H1 plus blockquote summary is the place for "what is this site." Everything below should be links to specific pages with specific descriptions. If your llms.txt reads like a homepage, you wrote a homepage.

The same logic GEO research is pushing on the content side, "vague claims do not get cited, specific claims with sources do", applies to llms.txt itself.

Anti-Pattern 5: "Frozen in 2024"

Five of the files I audited had visible signs of being shipped once and never touched again. Links to pages that 404. Product names that no longer exist. Dates that put the file's last meaningful update in 2024, back when llms.txt was a brand-new proposal and "AI search" was something Perplexity was still explaining to people.

Sitemap.xml is auto-generated. robots.txt rarely changes. llms.txt sits in an uncanny middle: hand-curated like documentation, but with the same staleness risk as a README that says "we use Yarn" when you migrated to pnpm last year.

The fix is automation, not discipline. Add a CI check that flags 404s in the URLs your llms.txt lists. Regenerate the "featured articles" section from your analytics every quarter. Treat the file like a config artifact, not a one-off launch deliverable.

Mintlify's analysis of real llms.txt examples flagged this as the second-biggest pattern they saw across the customer base. The first was Anti-Pattern 1. So those are the two to fix this week.
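The CI check for dead links fits in a few lines. A minimal sketch using only the standard library, assuming Markdown `[title](url)` links. The function names are illustrative, and the status fetcher is injectable so the check can be unit-tested without touching the network.

```python
import re
import urllib.error
import urllib.request

URL_RE = re.compile(r"\]\((https?://[^)\s]+)\)")  # URL part of [title](url)


def http_status(url, timeout=10):
    """Return the final HTTP status for url, or None if it is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
    except urllib.error.URLError:
        return None


def broken_links(llms_txt, fetch_status=http_status):
    """Return (url, status) for every listed URL that fails to resolve."""
    broken = []
    for url in URL_RE.findall(llms_txt):
        status = fetch_status(url)
        if status is None or status >= 400:
            broken.append((url, status))
    return broken
```

In CI, fail the job when `broken_links(...)` returns anything. Some servers reject HEAD requests, so a false positive or two on the first run is worth checking by hand before trusting the list.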

The three I shipped myself

Honesty section. Of my three llms.txt files:

One had 47 links in it. Anti-pattern 1.
One pointed at HTML-only URLs because I had not set up the .md companion yet. Anti-pattern 3.
One had not been updated in 4 months and listed a post under a slug I had since renamed. Anti-pattern 5, with a 301-redirect chain for dessert.

I did not catch any of this until I was three quarters of the way through reading other people's files. The audit was supposed to be about them. It ended up being about me. There is probably a lesson in there, but I am still in the embarrassment phase.

What changed after I fixed two

I fixed two of them. The 47-link file went to 16 links plus an ## Optional section. The HTML-only file got .md twins for the 16 featured URLs via a build-time hook (Astro made this easier than I expected, about 25 lines).

I cannot tell you "AI citations jumped by X%" because the file is one week old and citation measurement at this volume is noisy. What I can tell you is the file now passes a smell test I should have applied from day one: would a model with a 200K context window and ten other tabs open prefer this file over the previous version? Yes. Obviously yes. The previous version was unreadable.

The honest position on llms.txt

The skeptics are partly right. SE Ranking's 300K-domain study did not find a measurable citation lift. The major LLMs do not publicly confirm they fetch the file. The standard has no W3C stamp.

The skeptics are also partly wrong. IDE agents (Cursor, Cline, Continue), the major AI search engines (ChatGPT search, Perplexity, You.com), and a growing list of MCP integrations read llms.txt today. The optionality is real and the cost is fifteen minutes.

The actual question for 2026 is not "should I ship an llms.txt." That question is settled by the cost-benefit math. The question is whether the file you ship gives an LLM something useful or trains it to ignore your domain. Anti-patterns 1 through 5 are the difference between those two outcomes.

What to do this week

If you have not shipped one yet, the basics are straightforward (H1 site name, blockquote summary, prioritized link list). If you have shipped one, run it through the five-question audit:

Is it under 10KB and under 20 links (excluding ## Optional)?
Do all listed URLs pass robots.txt for GPTBot and ClaudeBot?
Do at least the top 5 URLs have a .md companion?
Does the body link to specific pages, not generic marketing copy?
Was it updated in the last 90 days?

If you score 5 out of 5, you are in the top 6 of the 30 sites I looked at, which is to say the top 20% of an already-self-selected sample. If you score 3 or below, you have the same Monday afternoon ahead of you that I did.

I am writing my fourth llms.txt this week. I will run it through this list before I publish. I will not feel productive afterwards. I will feel like someone who learned the same lesson three audits in a row.

That, I am told, is how engineering works.

If you want the full LLMO playbook, beyond llms.txt and into JSON-LD, robots.txt strategy, content design, and measurement:

LLMO: AI Search Optimization for Engineers

I Asked 3 Claude Code Sub-agents to Review the Same PR. They Disagreed on 41% of the Comments.

Ken Imoto — Wed, 20 May 2026 13:00:00 +0000

I thought multi-agent code review was a free upgrade. Three sub-agents looking at the same PR sounded like three pairs of eyes for the cost of one engineer's coffee.

Then I ran three Claude Code sub-agents on the same 500-line refactor PR and watched them disagree on 41% of the comments. The merge took an hour I had budgeted for fifteen minutes. Brooks's Law is alive in 2026, and apparently it scales down to agents.

Anthropic announced in March that fewer than 1% of their internal code-review findings get marked incorrect by engineers. That number is real, and it is also a stat from people running one tightly-tuned pipeline on their own codebase. As soon as I stood up my own three sub-agents on my own repo, "agree" stopped meaning what I thought it meant.

This is the experiment. What I set up, what I measured, and what I now actually believe about parallel sub-agent review.

The setup

The PR was a 500-line refactor of a WebRTC signaling layer in one of my side projects. Eight files, mostly TypeScript, a couple of config tweaks, one new error type. Boring enough to not be a stunt PR, complex enough that a single reviewer would miss things.

Three sub-agents, all defined under .claude/agents/, all using Sonnet 4.6, each restricted to read-only tools:

---
name: explore-reviewer
description: "Trace callers, dependents, and dead code paths."
model: sonnet
allowed-tools: Read Grep Glob
---

You are a code archaeologist. For each changed file, find every caller,
every test that references it, and any path that goes silent after the change.
Report concrete file:line citations. No style opinions.

---
name: security-reviewer
description: Look for auth, validation, and secret-handling regressions.
model: sonnet
allowed-tools: Read Grep Glob WebSearch
---

You are a security reviewer. Focus only on auth flows, input validation,
secret handling, and dependency risks. Estimate CVSS for each finding.
Ignore style and architecture.

---
name: plan-architect
description: Assess design decisions against existing conventions.
model: sonnet
allowed-tools: Read Grep Glob
---

You are a software architect. Compare the PR's design choices against the
existing conventions in this codebase. Flag drift, missing seams, and
abstractions that will hurt the next person.

Each sub-agent got the same prompt: "Review PR #482 line by line and list findings as bullets with file:line citations." Each ran in its own context. None of them saw each other's output. I was the only one stitching results together at the end.

What 41% disagreement actually looked like

After all three finished, I had 78 raw comments total. I sat down with a spreadsheet and tagged each one as "raised by 3", "raised by 2", or "raised by 1".

Coverage	Count	Share
All 3 agents flagged it	14	18%
2 of 3 agents flagged it	32	41%
Only 1 agent flagged it	32	41%

The "raised by 1" bucket is what I'm calling disagreement. Two other sub-agents had every opportunity to flag the same line, with the same tools, on the same diff. They walked past it. That is a 41% chance that any individual finding is one sub-agent's private opinion.
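I did the tagging by hand in a spreadsheet, but the tally itself is mechanical. A minimal sketch, assuming each finding can be keyed by its file:line citation. That matching rule is a simplification, since two agents can describe the same problem at slightly different lines.

```python
from collections import Counter


def coverage_buckets(findings_by_agent):
    """Map agreement level (3, 2, 1) -> (count, share in percent)."""
    flagged_by = Counter()
    for findings in findings_by_agent.values():
        for key in set(findings):  # one vote per agent per location
            flagged_by[key] += 1
    buckets = Counter(flagged_by.values())
    total = sum(buckets.values())
    return {
        level: (count, round(100 * count / total))
        for level, count in sorted(buckets.items(), reverse=True)
    }
```

The same arithmetic produces the shares in the table: 14 of 78 rounds to 18%, and 32 of 78 rounds to 41%.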

The headline Anthropic number — less than 1% marked incorrect — is measured differently. They count findings that an engineer explicitly closes without fixing. I'm counting findings that two of three agents looking at the same code never bothered to mention. Those are different questions, and the second one is the one that costs me time at the keyboard.

The four disagreement patterns

After classifying every disagreement, four patterns covered almost all of them.

Severity drift. The plan-architect flagged a missing null check as "critical". The security-reviewer noted the same line and called it "low — caller already validates upstream". Both were right, sort of. The architect was reading the function in isolation. The security reviewer had grep-walked the callers and seen the upstream check. Same line, opposite verdicts.

Scope drift. Asked to review the PR, the explore-reviewer happily told me about three pre-existing bugs in files the PR did not touch. The plan-architect refused to comment on anything outside the diff. I had no way to know in advance which behavior I would get. Strictly speaking, both interpretations are defensible. Practically speaking, one of them blew up my comment count.

Concreteness drift. The plan-architect wrote: "Consider extracting the retry logic into a shared helper." The security-reviewer wrote: "Replace lines 184-201 with retry(opts, () => fetchToken(opts.url)) and add a 30s ceiling, otherwise the auth-refresh path can hang the worker." Same idea. One I could apply in thirty seconds, the other I needed to spend a meeting on. Concreteness is a wildly larger axis of variance than I expected.

Tool-budget drift. The explore-reviewer had grep and glob, and noticed that the renamed function was still referenced in a CI script nobody had updated. The plan-architect, with the same tools, never looked there. Same allowed-tools list, same prompt about "find dependents". One walked the surface, one walked the building. Drift here came down to how aggressively each system prompt told the agent to roam.

If you have used Claude Code sub-agents for anything beyond a one-off Explore call, none of this is shocking. What was shocking, for me, was how cleanly the four buckets carved up almost every disagreement I tagged.

The bug nobody caught

Two days after I merged, a colleague found a race condition in the new error-handling path. The PR introduced a one-frame window where two reconnect attempts could fire on the same socket. None of the three sub-agents mentioned it. The pull-request description, which I had written by hand, did mention "reconnect logic moved", which is what made my colleague go look.

"Given enough eyeballs, all bugs are shallow," Eric Raymond wrote in 1999. He was right about eyeballs. He did not specify that three of them needed to be aimed at the same window. Mine were all squinting at the diff. None of them stepped back and asked: what changed about timing?

The hour I lost to merging

The actual merging of the three reports was the part I had not budgeted for.

For each "2 of 3" or "1 of 3" finding, I had to decide:

Is this real or is it a context gap I can close with one grep?
If real, is the severity from agent A right, or the severity from agent B right?
If a fix is suggested, is the concrete one safe to apply, or do I need to push back to the abstract version?

That last question alone took me three coffee refills. Two sub-agents had told me to "extract a shared helper". One had given me a specific helper. I had to read the diff a third time, by hand, to figure out whether the specific helper was actually the right shape. It wasn't. I ended up writing a fourth version.

Brooks's Law was about communication overhead between humans on a late project. I am now convinced it generalizes to "any time you put N independent perspectives on the same artifact, your N+1 reviewer is the integrator, and the integrator's hour goes up roughly linearly in N." Three sub-agents felt like 3x the eyes. They were also 3x the integration cost.

How many sub-agents is the right number

I do not think the answer is one. The same week I ran the N=3 experiment, I tried N=1 on a smaller PR: just a single general-purpose review pass. It missed the kind of cross-file dependency that the explore-reviewer would have caught. One pair of eyes is genuinely worse than two.

My current heuristic, after maybe a dozen PRs of this:

Tiny PR (<100 lines, no new files): one sub-agent. Anything more is overhead.
Medium PR (100-500 lines, touches one subsystem): two sub-agents with different angles, usually explore + security or explore + architect. Pick the second to match what the PR is actually risking.
Large or cross-cutting PR (500+ lines, multiple subsystems): three. Plan the integration time in advance. It is not free.
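Written as code, the heuristic is a three-way branch. This is a sketch only. The tiers overlap at exactly 500 lines and say nothing about a small PR that adds files, so the tie-breaks below are assumptions of this example, not rules I have tested.

```python
def reviewers_for(changed_lines, new_files=0, subsystems=1):
    """Suggest how many review sub-agents to run on a PR."""
    # Assumption: exactly 500 lines counts as large, as does any
    # PR that touches more than one subsystem.
    if changed_lines >= 500 or subsystems > 1:
        return 3
    # Assumption: a short PR that adds files is treated as medium.
    if changed_lines >= 100 or new_files > 0:
        return 2
    return 1
```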

Above three, I have not seen the value. HAMY's nine-agent setup is interesting, but I would want a second tool just to merge the reports, and I would want it to be cheaper than me.

The other knob is concreteness. I now ask each sub-agent for findings "with the smallest concrete change that fixes them, or marked as no-fix if you don't know". That single line in the system prompt collapsed about half of my concreteness drift.

What I actually believe now

Multi-agent code review is not free. It is closer to "three junior reviewers reading in different rooms, and you are the senior who has to merge their notes." The eye count goes up, but so does the integration cost, and the integration cost is the part that lives in your calendar.

The bug nobody caught is the part that humbled me most. Three agents, three angles, all read-only, all aimed at the same diff. None of them noticed the timing change because none of them were asked to. Sub-agents are extremely good at the questions you put in their system prompt. They are mediocre at the questions you forgot to ask. That is the actual limit, not the model.

If you take one thing from this: write a fourth sub-agent prompt called what-am-i-not-asking, give it your diff, and ask it to nominate the categories your other agents will miss. Then read its answer. Then write the real review prompts. I did not do this for the experiment in this post, which is exactly why I lost an hour at merge time and a colleague found my race condition.

Anthropic's less-than-1% number is real. It is also measured on a pipeline that someone spent months tuning, not on three sub-agents you wrote between meetings. Tune yours. Until then, expect 40%.

The deeper version covering sub-agent design, custom agent patterns, and the full Claude Code workflow lives in Harness Engineering: From Using AI to Controlling AI.

I Caught Claude Hiding My Bug 3 Times in a Row. Then I Turned 10 Debugging Habits Into Prompts.

Ken Imoto — Tue, 19 May 2026 13:00:01 +0000

I asked Claude to fix a 500 error from one of my API endpoints. First attempt: it wrapped the call in try-catch and logged the error. Second attempt: it added a default return value so the caller would not blow up. Third attempt: it added a retry with exponential backoff.

The 500 stopped. I shipped the third "fix" with full confidence. Two hours later, prod woke up the on-call. The same incident had moved to a different endpoint that shared the same database client. The actual cause was connection pool exhaustion. Claude was not fixing the bug. It was hiding it three different ways.

This is the story of how I turned 10 debugging habits into prompt templates so Claude cannot pull that on me anymore. There are also two file types you can hand it once and never touch again: a CLAUDE.md block and two hook configs.

The 3 "fixes" that almost shipped

Each of the three attempts looked correct in isolation.

Attempt 1, try-catch. The handler now caught the exception, logged it, and returned a 500 to the user. From the API's point of view, this was an improvement. From the bug's point of view, nothing changed: the connection that triggered the error still went back into the pool in a broken state.

Attempt 2, default return value. The function now returned an empty list instead of raising. The 500 was gone from this endpoint. The data inconsistency that the empty list created flowed downstream into a cache and stayed there for an hour.

Attempt 3, retry with exponential backoff. Three retries, each opening a new connection. The pool got drained faster. The 500 disappeared on this endpoint because the user-facing call now succeeded on attempt 2 or 3. Other endpoints, sharing the same pool, started timing out instead.

In all three cases, the symptom went away on the endpoint I asked about. The cause moved. I had asked Claude to debug, but I had given it no rule against suppressing the symptom, so it suppressed the symptom, because that is what the next-token prediction wants to do.

Why AI defaults to symptom suppression

The 2025 Stack Overflow Developer Survey reported that around 80% of professional developers were using or planning to use AI tools, and the share who actually trusted those tools' output had dropped year over year. The follow-up coverage I have read since then keeps coming back to the same complaint: AI-generated code clusters bugs around logic errors and I/O handling, at a rate that is meaningfully higher than human-written code at the same level of seniority. The figure I have seen cited most often is roughly 1.7x bug density, though different studies measure it differently and you should check your own commit history before quoting any single number.

The mechanism is not mysterious. A large language model predicts the next most plausible token given the context. "Error handling pattern" is one of the most over-represented things in its training data. Try-catch, null-check, default return, retry: these are statistically the kinds of edits that appear when someone says "fix this error" in a public repo. The model is doing exactly what it was trained to do.

What is missing is a different kind of token. "I do not yet know the root cause. Continue investigation." That sentence is rare in training data because humans rarely commit it. We commit the fix, not the not-yet-found-it. So the model never learned to default to "keep looking."

You have to put that token in for it. That is what the next section is for.

10 debugging habits → 10 prompt templates

Each of these maps to a classic debugging habit. Each one is a sentence I now paste into the prompt or the CLAUDE.md, depending on how permanent I want it.

1. Doubt the inputs. "Before proposing a fix, confirm the logs you're reading are complete and the monitoring you're trusting actually reports the state you think it reports." This is the one Claude skips most. It will happily diagnose from a log file that is half-rotated.

2. Reproduce before fixing. "Reproduce the bug locally and show me the minimum steps. If you cannot reproduce it, say so explicitly and stop." The "stop" is doing the work. It shuts the door on guessing.

3. Find the boundary. "Identify the boundary between working and broken behavior. Which component is the last one that returns correct data?" This pushes the model away from line-by-line guesses and toward layer-by-layer narrowing.

4. Diff against a known-good state. "Compare the current code to the last known working state. Run git log --oneline -20 and identify any change that could plausibly correlate with the failure window." This is the prompt that surfaces the commit no one remembered making.

5. Build a timeline. "When did this start failing? Is it sudden or gradual? Map error rate against deploy times, traffic spikes, and config changes." Sudden + correlated to deploy is one bug. Gradual + uncorrelated is a different bug entirely. Conflating them is how three "fixes" stack.

6. Audit retries, caches, and timeouts. "List every retry, cache, and timeout on the path. For each one, describe what happens when the underlying call is slow but not failed." This is the one that would have caught my pool exhaustion on the first pass.

7. Watch for amplification. "Is there a path where a small error gets multiplied? A failed call that triggers three retries, each opening a new connection, each adding latency to the next?" If your retry storm hides inside an autoscaler, you also get an instance storm.

8. Add instrumentation, don't guess. "If you don't have enough observation to identify the cause, propose the specific log lines or traces to add. Do not propose a fix yet." This converts "I don't know" into "here is what to measure," which is a much more useful answer than a fake fix.

9. Simplify the suspect. "Remove non-essential components from the failing path until the bug is reproducible in the simplest possible form. What is the smallest input that still triggers it?" Most of the bug usually wasn't in the part you were staring at.

10. Break things on purpose. "To verify a hypothesis, propose an intentional change that should make the bug worse or better. Predict the outcome before running it." This is the one that flips debugging from observation to experiment. It also catches lies your monitoring is telling you.
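Habit 7 is easier to take seriously with numbers attached. A deliberately simplified worst-case model, assuming every request fails and each retry opens a new connection before the earlier ones are released. The pool size and request counts in the usage note are hypothetical, not figures from my incident.

```python
def worst_case_connections(concurrent_requests, retries):
    """Connections demanded if every request uses all of its retries."""
    return concurrent_requests * (1 + retries)


def pool_exhausted(pool_size, concurrent_requests, retries):
    """True when worst-case demand exceeds what the pool can hand out."""
    return worst_case_connections(concurrent_requests, retries) > pool_size
```

With a hypothetical pool of 20 and 8 concurrent requests, zero retries needs 8 connections and fits comfortably. Add 3 retries and the worst case is 32, so the "fix" is what drains the pool.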

Persist the rules in CLAUDE.md

Pasting 10 sentences into every prompt does not scale. CLAUDE.md is where the rules go to live.

The Anthropic guidance I keep coming back to is to hold CLAUDE.md under roughly 100-150 lines so it can actually fit in context for every turn. Spending 12 of those lines on debugging is a good trade.

## Debugging Rules

- Do not write fix code until you have identified the root cause.
- Suppress nothing. If the symptom is gone but the cause is unknown, that is not a fix.
- Before fixing, write a failing test that reproduces the bug.
- After fixing, run the full test suite and report any newly failing tests.
- If three attempts fail in a row on the same bug, stop. Summarize what you tried, what you ruled out, and what hypothesis is left, and ask for human input.

## Debugging Workflow

1. Root Cause Investigation: read logs, traces, and the code path.
2. Pattern Analysis: search for the same anti-pattern elsewhere in the codebase.
3. Hypothesis Testing: write a test that would fail iff the hypothesis is correct.
4. Implementation: only after steps 1-3 succeed.

The thing to notice is that these are constraints, not instructions. "Do not write fix code until..." is more useful than "investigate first." The constraint format is what stops the next-token machine from cheerfully skipping ahead.

Automate behavior with hooks

CLAUDE.md is the brain. Hooks are the reflexes. Two of them matter for debugging.

PreToolUse: block destructive commands. Halfway through debugging, the model occasionally suggests something like rm -rf node_modules or, on a worse day, a raw DROP TABLE. A PreToolUse hook intercepts the Bash tool call, greps the command string for a small denylist, and exits 2 to block. Claude Code treats exit code 2 from a PreToolUse hook as "this tool call is denied, tell the model why."

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{
          "type": "command",
          "command": "cmd=$(jq -r '.tool_input.command // empty'); if printf '%s' \"$cmd\" | grep -qE 'rm[[:space:]]+-rf|DROP[[:space:]]+TABLE'; then echo 'BLOCK: destructive command' >&2; exit 2; fi"
        }]
      }
    ]
  }
}

PostToolUse: run tests after edits. Set the matcher to Edit|Write and have the command run your test suite, or at least a fast subset. The model then sees a test failure immediately after the edit that caused it, instead of discovering it 30 messages later. The official Claude Code hooks reference covers the matchers and exit-code conventions in full. Worth reading once before you write your own.
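For the PostToolUse side, a small script is easier to maintain than a shell one-liner. A minimal sketch, assuming the hook receives a JSON payload on stdin with a `tool_name` field and that exit code 2 sends stderr back to the model. Check both against the hooks reference. The test command is a hypothetical placeholder.

```python
import json
import subprocess
import sys

TEST_COMMAND = ["pytest", "-x", "-q"]  # hypothetical fast subset; use your own


def run_tests(command=TEST_COMMAND):
    """Run the test command and return (exit code, combined output)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr


def handle(payload, runner=run_tests):
    """Return (exit_code, message) for one PostToolUse event."""
    if payload.get("tool_name") not in ("Edit", "Write"):
        return 0, ""
    code, output = runner()
    if code == 0:
        return 0, ""
    # Keep only the tail: the failure summary usually sits at the end.
    return 2, "Tests failed after this edit:\n" + output[-2000:]


def main():
    """Entry point when the file is used as the hook command."""
    exit_code, message = handle(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)
    sys.exit(exit_code)
```

To use it as a hook, save it as a script that calls `main()` on its last line and point the PostToolUse command at it. Keeping `handle` separate from the I/O means the decision logic can be tested without running a real suite.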

Together CLAUDE.md, PreToolUse, and PostToolUse form the equipment layer for an AI debugger. It is the same equipment-layer pattern I used when splitting one big agent into Observer, Strategist, and Marketer: constraints in the prompt, behavior in the hooks, information in the MCP layer.

When 3 hidden fixes in a row mean stop

The single most useful rule, the one that would have saved my on-call:

If three attempts in a row fail to fix the same bug, stop and escalate.

Three is not magic. It is the point where the cost of one more guess exceeds the cost of admitting the bug is structural. By the third attempt, the model is usually pattern-matching on top of pattern-matching, and a human eye is cheaper than a fourth retry.

"Let Claude debug it" is half true. It is fast. It just defaults to fast at hiding the problem unless you arm it differently. The 10 prompts arm it. The CLAUDE.md remembers them for you. The hooks catch what slips through. None of these is expensive. The on-call page at 11pm is.

The full chapter on translating the 10 habits into prompts, plus the Claude Code weapons chapter on CLAUDE.md, hooks, and MCP layering, is in Practical Claude Code.

I Refused to Write Specs Until Claude Code Generated Wrong Code Three Times

Ken Imoto — Tue, 19 May 2026 13:00:01 +0000

I read the phrase "spec-driven development" and immediately decided it was for people without taste. Six months later, Claude Code generated a discount system that applied coupons to itself. Three times in a row.

The first time I laughed. The second time I assumed the prompt was the problem. The third time I closed the editor, opened a YAML file, and started writing OpenAPI like a person who had finally lost an argument with reality.

This post is about that argument. And about what fifteen minutes of spec-writing actually buys you in 2026, when half the developer Twittersphere is still telling you to "just prompt it."

What I was doing wrong

My workflow was the one everyone has tried. Open Claude Code. Type "build me a checkout flow with member discount and a promo code field." Watch the agent confidently generate four hundred lines of Flask. Skim. Run. Fail. Re-prompt. Get a different four hundred lines. Repeat until I either ran out of patience or shipped something that mostly worked.

The discount feature was where the wheels came off. I asked for "10 percent member discount, stackable promo codes, max 30 percent total." Claude Code shipped a function that, when given a promo code on a member account, took 10 percent off, then took another 10 percent off the discounted total, then applied the promo. The promo code, as it turns out, was also a member-discount-eligible item in my schema, because I had not bothered to tell anyone that members are people and promos are line items. So the system politely gave my coupon a coupon.

Yes, I am the engineer who wrote "just prompt it" in a thread last week and then spent five PR rounds explaining what "just" meant.

The fifteen-minute spec

Out of spite, I tried the thing I had been calling overhead. I wrote an OpenAPI document. Endpoint, request shape, response shape, error codes, the constraints on every field. It took fifteen minutes.

paths:
  /api/orders:
    post:
      requestBody:
        application/json:
          schema:
            customer_id: string
            items: array of OrderItem
            promo_code: string | null
      responses:
        201:
          schema:
            order_id: string
            subtotal: integer (minimum 0)
            member_discount: integer (0..subtotal * 0.1, integer)
            promo_discount: integer
            total: integer
            applied_rules: array of string
        400:
          schema:
            error: { code, message }

Then I wrote a Gherkin file with three scenarios. Member buys without promo. Non-member uses promo. Member uses promo and the total cap kicks in.

Scenario: Member with promo, capped at 30% total
  Given a logged-in member
  And a cart with subtotal 10000 yen
  When they apply promo code "SPRING5"
  Then member_discount is 1000
  And promo_discount is 2000
  And total is 7000
  And applied_rules includes "member" and "promo:SPRING5"

I handed both files to Claude Code with one sentence: "implement these specs in Flask, including validation and error handling." It generated about 80 percent of the implementation in three minutes. The remaining 20 percent was real domain logic: what counts as "stackable," what happens at the cap. I wrote that. The spec made it impossible to be confused about it.
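
For illustration, here is one way the hand-written 20 percent could look. This is a sketch under stated assumptions, not the code from the post: it assumes the promo is computed on the subtotal (never on the discounted total), that the 30 percent cap trims the promo rather than the member discount, and that the promo's nominal rate is passed in as `promo_pct`. The post does not say what SPRING5 is worth, so the 25 used in the usage note below is a made-up value that happens to hit the cap.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MEMBER_RATE_PCT = 10  # from the post: 10 percent member discount
TOTAL_CAP_PCT = 30    # from the post: max 30 percent total


@dataclass
class Pricing:
    subtotal: int
    member_discount: int
    promo_discount: int
    total: int
    applied_rules: List[str] = field(default_factory=list)


def price_order(subtotal: int, is_member: bool,
                promo_code: Optional[str], promo_pct: int = 0) -> Pricing:
    """Price an order in integer yen. promo_pct is an assumed input."""
    if subtotal < 0:
        raise ValueError("subtotal must be >= 0")
    rules: List[str] = []

    member_discount = 0
    if is_member:
        member_discount = subtotal * MEMBER_RATE_PCT // 100
        rules.append("member")

    promo_discount = 0
    if promo_code is not None:
        cap = subtotal * TOTAL_CAP_PCT // 100
        nominal = subtotal * promo_pct // 100  # based on subtotal only
        # Assumption: the cap trims the promo, not the member discount.
        promo_discount = min(nominal, cap - member_discount)
        rules.append(f"promo:{promo_code}")

    total = subtotal - member_discount - promo_discount
    return Pricing(subtotal, member_discount, promo_discount, total, rules)
```

Calling `price_order(10000, True, "SPRING5", promo_pct=25)` reproduces the numbers in the scenario: 1000 member discount, 2000 promo discount, 7000 total. Because both discounts are derived from the subtotal, there is no code path where a discount is applied to an already discounted amount.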

Fifteen minutes of YAML to delete five PR rounds of "what did you mean by stackable." I had been doing the loud version of saving fifteen minutes by spending two hours.

Why it works (and why "just prompt it" doesn't)

The reason has nothing to do with Claude Code being smarter when you give it more text. It has to do with what you, the human, are forced to think about while writing the spec.

When I write member_discount: integer (0..subtotal * 0.1, integer), I have committed to the idea that member discount is at most ten percent of the subtotal, in integer yen. I cannot generate a spec that "applies the coupon to itself" because the spec doesn't have a coupon-shaped recipient for that recursion. The ambiguity dies in YAML, before it can metastasize in Python.

This isn't original to me. The 2026 wave of spec-driven tooling (OpenSpec, cc-sdd, amux, Kiro) is all built on the same observation. GitHub Copilot Workspace doesn't even let you skip the step: it generates an editable "proposed specification" before it touches code, because the team that built it figured out that the spec is the only artifact in the workflow that the human can actually review.

The lesson generalizes: AI assistants don't reduce the value of specs. They turn a fuzzy spec into an expensive mistake faster than humans ever could.

The three patterns that paid off

The book version of this is three patterns, and after living with them for a quarter, all three pull their weight.

Pattern 1: OpenAPI to implementation. Write the endpoint shape. Hand it to Claude Code. Get a stub that handles 80 percent of CRUD plus serialization plus the obvious error cases. Add the domain logic by hand. This is the bread-and-butter case. It is also where the "80 percent" number comes from. The remaining 20 percent is what you're actually paid to think about.

Pattern 2: Gherkin to step definitions. Write scenarios in Given/When/Then. Hand them to Claude Code with pytest-bdd or behave. Get the step skeletons. The interesting move here is that the same scenarios drive both implementation prompts and test prompts, so the agent can't drift between "what the code does" and "what the test checks." Drift is where bugs ship.

Pattern 3: Spec to property tests. From the OpenAPI schema (price: integer, minimum: 0, maximum: 1_000_000), have Claude Code generate property-based tests with Hypothesis or fast-check. You get the boundary cases (0, 1_000_000, -1, null, overflow) without having to remember every flavor of "what could go wrong with an integer." This is the one I underused for years and regret most.
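
To make the idea concrete without extra dependencies, here is a minimal property-style check that uses only the standard library's `random` module. It is a stand-in, not Hypothesis: Hypothesis would generate and shrink the inputs for you. The function names are hypothetical; the bounds are the ones from the schema fragment above.

```python
import random

PRICE_MIN = 0
PRICE_MAX = 1_000_000


def is_valid_price(value: object) -> bool:
    """Mirror of the schema constraint: integer, 0 <= price <= 1_000_000."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return PRICE_MIN <= value <= PRICE_MAX


def check_price_property(samples: int = 1000, seed: int = 0) -> int:
    """Check the validator on boundary cases plus seeded random integers.

    Returns the number of cases checked; raises AssertionError on a mismatch.
    """
    rng = random.Random(seed)
    boundary_cases = [0, 1_000_000, -1, 1_000_001, None]
    random_cases = [rng.randint(-2_000_000, 2_000_000) for _ in range(samples)]
    cases = boundary_cases + random_cases
    for value in cases:
        # Property: a value is accepted exactly when it is an int in range.
        expected = isinstance(value, int) and PRICE_MIN <= value <= PRICE_MAX
        assert is_valid_price(value) == expected, value
    return len(cases)
```

The fixed seed keeps a failing run reproducible, which is the part of property testing that is easiest to forget when rolling it by hand.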

The traps

Three things will bite you if you don't watch for them.

Bugs in the implementation scale with ambiguity in the spec. If your OpenAPI says discount: number instead of discount: integer (0..subtotal*0.1), the model will guess. It will guess differently every time. Vague specs aren't a head start; they're a paid-for hallucination factory. Spec-driven development only works as a forcing function on you.

Never trust generated code unconditionally. Sample of bugs I have shipped from generated code in the last three months: a SQL query built with string concatenation (injection waiting to happen), a JWT stored in localStorage (it should have been in an httpOnly cookie), and a silent N+1 over a thousand-row table. The agent didn't write any of those out of malice. It wrote them because nothing in my spec said "no." Specs need a constraints section.

The agent will add requirements you didn't ask for. I have watched Claude Code add an authentication check to an endpoint whose spec said "public, rate-limited only." The agent had read enough Stack Overflow to think every endpoint should be authenticated, and silently slipped a check in. Specs need to be explicit about what the system doesn't do, not just what it does.

How I write specs now

The workflow that survived contact with reality is unromantic.

Sketch the endpoint in OpenAPI. Field types, ranges, required vs optional.
Write three Gherkin scenarios. Happy path, edge case, error case.
Add a ## Out of scope section to the spec file. Auth model. Rate limit. Caching. Anything the agent might helpfully invent.
Hand all three to Claude Code with CLAUDE.md containing project conventions.
Generate. Review the diff against the spec, not against vibes.
Run the property tests generated from the spec.

This is also where Claude Code Skills earn their keep. I wrap the steps above into a single skill, /spec-impl, and the workflow stops being a discipline I have to remember and starts being one slash command. The agent matters less than the artifact in front of it.

What I'd tell past-me

I would tell past-me that the fifteen minutes of OpenAPI he refused to write cost him an entire weekend of "just one more prompt." I would tell him that spec-driven development is not a methodology you adopt because some consultancy sold it to your CTO; it's the cheapest known mechanism for not arguing with a fast, confident, slightly drunk junior engineer.

And I would tell him this: in 2026, agents turn every fuzzy spec into an expensive mistake faster than any human ever could.

The specs are the brake pedal. Without them, you still go fast. You just go fast in whichever direction the agent's training data pointed last.

If you're building Claude Code into your team's workflow and want the full reference, including Skills, hooks, sub-agents, and the CLAUDE.md three-tier pattern, this is the practitioner's reference I keep going back to:

Claude Code Mastery: A Practitioner's Reference

I Gave My Strategist Agent WebSearch. 5 Topics Took 20 Minutes. Splitting It Into 3 Roles Made It 3.

Ken Imoto — Mon, 18 May 2026 13:00:00 +0000

I thought one agent doing everything was elegant. One claude -p call, "pick today's topics and write the articles," done. It took 20 minutes to pick 5 topics.

Splitting it into three agents took the same job to 3 minutes and dropped token cost by about 60%. The agents are dumber individually. The pipeline is faster.

The trick is not "more agents." The trick is taking WebSearch out of the agent that does the judging.

The 1-agent setup that took 20 minutes

The original setup was one prompt, one agent, one run:

"Look at yesterday's GA4 data, pick 5 topics for today, and write the top one."

The agent was allowed Bash, Read, Write, Edit, Grep, Glob, WebSearch, WebFetch. Everything it could possibly need.

For each candidate topic, it did roughly the same thing: WebSearch to check "what's hot in this space right now," WebSearch again to confirm a trend, WebSearch a third time to cross-check a competitor. Five topics, three to four searches each, 15 to 20 searches per run. Each search dumped a few thousand tokens of result into the context.

By the time the agent was choosing topic 3, the judgment context contained 40,000+ tokens of search results from topics 1 and 2. The signal-to-noise ratio collapsed. The agent started picking topics that "felt confirmed by recent news" rather than topics that matched my actual content stock.

The visible symptom was time: about 20 minutes per run. The hidden symptom was drift — I kept overriding the agent's picks during the weekly review, because they didn't match what I had material for.

Why WebSearch in the judgment loop is a trap

WebSearch is fine. WebSearch in a judgment loop is the trap.

Two things happen when you let the judge search:

Time. A WebSearch is 5-20 seconds. Five topics times four searches is 20 searches, or 100 to 400 seconds of waiting per run, before you even count read time and reasoning. For a single human asking one question it's nothing. For a daily automated job it stacks up fast.

Context pollution. Each result adds 2,000-5,000 tokens of HTML-scraped page text into the judgment context. None of it was structured for "is this topic right for my content?" It was structured for SEO. The judge ends up reasoning from a pile of marketing copy instead of from its own data.
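
The two costs are easy to put side by side. This sketch multiplies out the ranges quoted above; the function name and dictionary keys are illustrative, and four searches per topic is the upper end of the "three to four" mentioned earlier.

```python
from typing import Dict, Tuple

SECONDS_PER_SEARCH = (5, 20)        # range quoted above
TOKENS_PER_RESULT = (2_000, 5_000)  # range quoted above


def search_overhead(topics: int,
                    searches_per_topic: int) -> Dict[str, Tuple[int, int]]:
    """Return (low, high) estimates for one run of the judging agent."""
    searches = topics * searches_per_topic
    return {
        "searches": (searches, searches),
        "wait_seconds": (searches * SECONDS_PER_SEARCH[0],
                         searches * SECONDS_PER_SEARCH[1]),
        "context_tokens": (searches * TOKENS_PER_RESULT[0],
                           searches * TOKENS_PER_RESULT[1]),
    }
```

For five topics at four searches each, that is 20 searches, 100 to 400 seconds of pure waiting, and 40,000 to 100,000 tokens of scraped text sitting in the same context as the judgment.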

The fix is unglamorous. The judge should not have WebSearch. WebSearch belongs in the writer.

Role 1: Observer — collect only

The Observer's job is "fetch yesterday's numbers, write them to a file." That is the whole job.

Inputs: GA4, the Zenn API, the Dev.to API, yesterday's logs. Output: domains/<name>/data/snapshot-YYYY-MM-DD.json.

Allowed tools:

claude -p "$(cat scripts/prompts/observer-prompt.txt)" \
  --allowed-tools "Bash,Read,Write"

No WebSearch. No WebFetch. No Edit. The Observer reads three APIs through curl and writes a single JSON file. If it tries to be clever and "interpret the data," the prompt tells it not to. The schema enforces it: fields are total_views, top_performers_3, errors_yesterday. No recommendation field exists, so there's nowhere to put a judgment even if it wanted to make one.
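
A schema like this can also be enforced in code rather than only in the prompt. The sketch below is an assumption about how that might look: the three field names and the path pattern come from the text above, while `build_snapshot`, `snapshot_path`, and `write_snapshot` are hypothetical helpers.

```python
import json
from datetime import date
from pathlib import Path

# Field names from the post; anything else is rejected.
ALLOWED_FIELDS = {"total_views", "top_performers_3", "errors_yesterday"}


def build_snapshot(raw: dict) -> dict:
    """Return a snapshot containing exactly the allowed fields."""
    extra = set(raw) - ALLOWED_FIELDS
    if extra:
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    missing = ALLOWED_FIELDS - set(raw)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return {key: raw[key] for key in sorted(ALLOWED_FIELDS)}


def snapshot_path(root: Path, domain: str, day: date) -> Path:
    """domains/<name>/data/snapshot-YYYY-MM-DD.json under root."""
    return root / "domains" / domain / "data" / f"snapshot-{day.isoformat()}.json"


def write_snapshot(root: Path, domain: str, day: date, raw: dict) -> Path:
    """Validate raw and write it as the day's snapshot file."""
    path = snapshot_path(root, domain, day)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(build_snapshot(raw), indent=2), encoding="utf-8")
    return path
```

With a check like this, a snapshot that arrives with a `recommendation` key fails loudly instead of leaking a judgment into the next stage.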

This sounds like a downgrade. It is, in the same way a single-purpose function is a "downgrade" from a god-object. When the Observer fails, I know exactly which API broke, because that's all it does.

Role 2: Strategist — judge only, no WebSearch

The Strategist reads what the Observer wrote, reads strategy.md for the rules, reads the last 30 days of published topics for the exclusion list, and picks 5 topics. That's it.

claude -p "$(cat scripts/prompts/strategist-prompt.txt)" \
  --allowed-tools "Bash,Read,Write,Edit,Grep,Glob"

Notice what is missing: WebSearch, WebFetch. Both are gone from the allow-list, so the Strategist has no search tool to call. (Bash is still allowed, so this is a guardrail on search, not a network sandbox.)

This was the part I resisted. "How can it judge today's topics without checking what's trending?" That was the wrong question. The right question is: am I writing topics that are trending elsewhere, or topics that match my content stock?

The Strategist sees:

Three months of my own performance data (what got read)
My content stock (book chapters, unpublished drafts)
30-day exclusion list (what I already wrote)
My own strategy.md rules

That is enough to pick 5 topics in about 90 seconds, not 20 minutes. The token consumption per Strategist run dropped from roughly 80,000 to roughly 20,000 because there are no WebSearch results to read.

"Adding evidence with WebSearch" sounded like a good idea. In practice it added 8 redundant searches and 40,000 tokens of noise.

Role 3: Marketer — execute, WebSearch allowed

The Marketer reads the Strategist's output, picks the top topic, and writes the article. This is where WebSearch shows up:

claude -p "$(cat scripts/prompts/marketer-prompt.txt)" \
  --allowed-tools "Bash,Read,Write,Edit,Grep,Glob,WebSearch,WebFetch"

The Marketer uses WebSearch for execution research:

"Latest stable version of LangGraph in 2026"
"Anthropic Building Effective Agents doc URL"
"Inngest pricing tier for cron-driven workflows"

These are citations and version checks, not judgments. "Should I write this topic?" is already decided. The Marketer's WebSearch is bounded by the article in front of it.

Two consequences fall out of this:

Cost localizes. WebSearch spend lives in the Marketer, where it produces visible output. The Strategist's per-run cost is now small enough that I run it multiple times a week without thinking about it.
Failure localizes. When WebSearch is flaky or down, only the writer breaks. The Strategist still produces today's picks. The Observer still records yesterday's numbers. The pipeline degrades, it doesn't halt.
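
Seen together, the three allow-lists are just a small table. The sketch below keeps them in one place and builds the same `claude -p ... --allowed-tools ...` invocation shown in the post. The tool lists are copied from the commands above; the helper names are hypothetical, and the flag spelling is taken from the post rather than verified against the CLI.

```python
from typing import Dict, List

ROLE_TOOLS: Dict[str, List[str]] = {
    "observer": ["Bash", "Read", "Write"],
    "strategist": ["Bash", "Read", "Write", "Edit", "Grep", "Glob"],
    "marketer": ["Bash", "Read", "Write", "Edit", "Grep", "Glob",
                 "WebSearch", "WebFetch"],
}
SEARCH_TOOLS = {"WebSearch", "WebFetch"}


def claude_command(role: str, prompt_text: str) -> List[str]:
    """Build the argv for one role, using the flag as written in the post."""
    return ["claude", "-p", prompt_text,
            "--allowed-tools", ",".join(ROLE_TOOLS[role])]


def roles_with_search_tools() -> List[str]:
    """Names of roles whose allow-list includes a search or fetch tool."""
    return sorted(role for role, tools in ROLE_TOOLS.items()
                  if SEARCH_TOOLS & set(tools))
```

A one-line assertion that `roles_with_search_tools()` returns only `["marketer"]` is a cheap regression test against someone "helpfully" giving the judge its search back.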

The cron chain: how the three roles connect

The three agents do not share a conversation. They share files.

07:00  Observer    → writes snapshot-2026-05-14.json
09:00  Strategist  → reads snapshot, writes strategist-2026-05-14.md
10:00  Marketer    → reads strategist-2026-05-14.md, writes drafts + schedules 22:00 publish
22:00  Observer    → records today's early traction → tomorrow's input

I run this as plain cron on a small VPS. The short version is one crontab line per job, each calling a wrapper script with set -euo pipefail, trap ... ERR, a Telegram failure ping, and a lock file. About 30 lines of shell per role.
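
The post's wrappers are shell. For readers who prefer Python, the same shape (take a lock, run the job, ping on failure, always release the lock) can be sketched as below. Everything here is an assumption for illustration: `run_job`, `AlreadyRunning`, and the `notify` callback are hypothetical, and `notify` stands in for whatever sends the Telegram message.

```python
import os
from pathlib import Path
from typing import Callable


class AlreadyRunning(Exception):
    """Raised when the lock file already exists."""


def run_job(lock_path: Path, job: Callable[[], None],
            notify: Callable[[str], None]) -> None:
    """Run job under an exclusive lock file; call notify if it fails."""
    try:
        # O_EXCL makes creation atomic: it fails if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise AlreadyRunning(str(lock_path)) from None
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))
        job()
    except Exception as exc:
        notify(f"job failed: {exc!r}")
        raise
    finally:
        # Release the lock on success and on failure alike.
        lock_path.unlink(missing_ok=True)
```

One caveat with this simple form: a lock left behind by a hard crash or power loss blocks later runs until someone removes it, so a real wrapper would likely check the recorded PID or the file's age.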

If you want managed durability instead of cron, Temporal's Schedules, Inngest's cron triggers, and GitHub Actions cron all hit the same shape. The architecture does not care which one carries it. I use cron because the failure mode is "the server is off," and I notice that quickly.

The handoff is always a file on disk. JSON for the snapshot, Markdown for the strategist log, Markdown for the marketer log. Human-readable, dated, replayable. I can re-run yesterday's Marketer against yesterday's Strategist file by changing one environment variable. That is backfill for free, without inheriting Airflow.
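
A minimal sketch of that replay mechanism, assuming the date is passed through an environment variable. The variable name `RUN_DATE`, the helper names, and the marketer log filename are hypothetical; only the snapshot and strategist filename patterns come from the post.

```python
import os
from datetime import date
from pathlib import Path
from typing import Dict, Mapping, Optional


def run_date(env: Optional[Mapping[str, str]] = None) -> date:
    """Use RUN_DATE (YYYY-MM-DD) if set, otherwise today's date."""
    env = os.environ if env is None else env
    raw = env.get("RUN_DATE")  # hypothetical variable name
    return date.fromisoformat(raw) if raw else date.today()


def handoff_paths(root: Path, day: date) -> Dict[str, Path]:
    """Dated handoff files for one day's run of the pipeline."""
    stamp = day.isoformat()
    return {
        "snapshot": root / f"snapshot-{stamp}.json",
        "strategist": root / f"strategist-{stamp}.md",
        "marketer": root / f"marketer-{stamp}.md",  # assumed filename
    }
```

Because every role derives its inputs and outputs from the same date, rerunning a past day is a matter of setting that one variable before invoking the role.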

Sub-agent vs role separation — don't confuse them

I have a separate post about running three Claude Code sub-agents on the same PR and watching them disagree 41% of the time. People sometimes ask if that is the same thing as what I'm describing here.

It is not. They look similar on a slide and behave nothing alike in practice.

	Sub-agent (Claude Code Task tool)	Role separation (cron)
Scope	Same session, same parent agent	Three separate processes, three separate runs
State	Parent passes context as input	File on disk
Timing	Synchronous, parent waits	Asynchronous, hours apart
Failure	Parent owns retry	Each job retries independently
Use case	"Explore this codebase in parallel"	"Run yesterday's PDCA every morning"

Sub-agents are great for parallelism inside one task. Role separation is for time-shifted pipelines. Mixing them produces the worst of both: you get cron's debug surface plus sub-agents' shared-context drift.

The rule I use: if the answer has to come back in the same conversation, it is a sub-agent. If the answer has to survive a server reboot, it is a separate cron job.

What changed, measured

These are my numbers from running both setups on the same content stack:

Metric	1-agent	3-role	Change
Time to pick 5 topics	~20 min	~3 min	-85%
Tokens per daily run	~120k	~45k	-62%
Monthly API spend	~$60	~$22	-63%
Topic re-pick rate (weekly review)	2-3/wk	0-1/wk	down
WebSearch outage breaks pipeline	yes	no	fixed
Mean debug time per failure	30-60 min	5-10 min	-80%

The token math is the one that surprised me. I assumed splitting into three agents would increase total token usage because of duplicated context. It did not, because the deleted WebSearch traffic was bigger than the new per-role overhead.

The debug time is the one that matters daily. With one agent, "the job failed at 09:14" tells me nothing. With three roles, "the Strategist failed at 09:14" tells me which 30-line script to read.

"Adding agents made it faster" sounds wrong on its face. It is only faster because I removed WebSearch from the judgment loop. The split is what made the removal feasible: once Observer and Strategist had no WebSearch on their allow-lists, the temptation to "just search one more thing" was gone.

The deep version with full crontab, prompt files, and role allow-lists is in Harness Engineering: From Using AI to Controlling AI.