<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ken Imoto</title>
    <description>The latest articles on Forem by Ken Imoto (@kenimo49).</description>
    <link>https://forem.com/kenimo49</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3800250%2F275022f6-cba9-47e3-b69e-e8faf7675a0c.jpg</url>
      <title>Forem: Ken Imoto</title>
      <link>https://forem.com/kenimo49</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kenimo49"/>
    <language>en</language>
    <item>
      <title>5 AI Crawlers Hit My Sites 14,300 Times in 30 Days. Here's What Their User-Agents Told Me About LLMO.</title>
      <dc:creator>Ken Imoto</dc:creator>
      <pubDate>Tue, 26 May 2026 11:00:01 +0000</pubDate>
      <link>https://forem.com/kenimo49/5-ai-crawlers-hit-my-sites-14300-times-in-30-days-heres-what-their-user-agents-told-me-about-4okh</link>
      <guid>https://forem.com/kenimo49/5-ai-crawlers-hit-my-sites-14300-times-in-30-days-heres-what-their-user-agents-told-me-about-4okh</guid>
      <description>&lt;p&gt;I thought &lt;code&gt;robots.txt&lt;/code&gt; was the boundary. Three lines of &lt;code&gt;Disallow:&lt;/code&gt; and I'd told the AI bots where they could and couldn't go. Done. I went back to writing posts about LLMO measurement, citation rates, and AI referral traffic in GA4.&lt;/p&gt;

&lt;p&gt;Then I opened the access logs for three of my sites and the picture I had in my head collapsed.&lt;/p&gt;

&lt;p&gt;This is what I learned reading thirty days of raw server logs from &lt;code&gt;kenimoto.dev&lt;/code&gt;, &lt;code&gt;kaoriq.com&lt;/code&gt;, and &lt;code&gt;llmoframework.com&lt;/code&gt;. Five User-Agent strings dominated everything. The traffic patterns each one created told me more about my LLMO standing than any GA4 dashboard had.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhntrqra5ov9oqlvv52ey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhntrqra5ov9oqlvv52ey.png" alt="Bar chart of 5 top AI crawler hits over 30 days: GPTBot 4,212; ClaudeBot 3,108; PerplexityBot 2,790; OAI-SearchBot 2,043; Google-Extended 1,387. Together 94.7% of AI traffic." width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I started reading logs in the first place
&lt;/h2&gt;

&lt;p&gt;Most LLMO measurement advice tells you to track the &lt;em&gt;outbound&lt;/em&gt; side: did ChatGPT cite me, did Perplexity link to me, did Google AI Overviews show me. That's the citation side.&lt;/p&gt;

&lt;p&gt;The other side, where AI services actually pull HTML from my server, is invisible in GA4. AI crawlers don't fire JavaScript. They don't trigger gtag. They show up in raw HTTP access logs and nowhere else.&lt;/p&gt;

&lt;p&gt;I'd been writing LLMO posts for months and had never once looked at the side of the funnel I could actually control. So I exported 30 days of logs from Cloudflare (&lt;code&gt;kenimoto.dev&lt;/code&gt;, &lt;code&gt;kaoriq.com&lt;/code&gt;) and Vercel (&lt;code&gt;llmoframework.com&lt;/code&gt;), grepped for known AI User-Agents, and started counting.&lt;/p&gt;

&lt;p&gt;The total: &lt;strong&gt;14,300 AI crawler hits across three sites in 30 days.&lt;/strong&gt; Roughly 477 hits per day per site. More than I expected. Less than I think it should be in another six months.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5 crawlers that hit me most
&lt;/h2&gt;

&lt;p&gt;Here's the ranked list. Hits are deduplicated by &lt;code&gt;(timestamp, path, IP)&lt;/code&gt; so cache retries don't inflate the count.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;User-Agent&lt;/th&gt;
&lt;th&gt;30-day hits&lt;/th&gt;
&lt;th&gt;Operator&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GPTBot&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4,212&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Training data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ClaudeBot&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3,108&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Training + retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PerplexityBot&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2,790&lt;/td&gt;
&lt;td&gt;Perplexity&lt;/td&gt;
&lt;td&gt;Answer index&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;OAI-SearchBot&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2,043&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;ChatGPT search citations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Google-Extended&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1,387&lt;/td&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;td&gt;Gemini training&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Five User-Agents. 13,540 hits. That's 94.7% of all AI traffic. The remaining 5.3% was a long tail: &lt;code&gt;Bytespider&lt;/code&gt;, &lt;code&gt;Applebot-Extended&lt;/code&gt;, &lt;code&gt;Meta-ExternalAgent&lt;/code&gt;, &lt;code&gt;Amazonbot&lt;/code&gt;, &lt;code&gt;cohere-ai&lt;/code&gt;, a smattering of &lt;code&gt;Claude-User&lt;/code&gt;, and two hits from something that called itself &lt;code&gt;anthropic-ai&lt;/code&gt; (the old UA Anthropic supposedly retired).&lt;/p&gt;

&lt;p&gt;Before you read too much into the order: this is &lt;em&gt;my&lt;/em&gt; data, three small sites, mostly English/Japanese tech content. Your ranking will look different. The shape of it (a handful of bots accounting for most hits, OpenAI and Anthropic at the top) is probably the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  What each one is actually doing
&lt;/h2&gt;

&lt;p&gt;The reason rank order matters less than the &lt;em&gt;purpose&lt;/em&gt; of each bot is that the three buckets behave completely differently in LLMO terms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training crawlers&lt;/strong&gt; read your content to potentially update model weights. They show up consistently, follow &lt;code&gt;robots.txt&lt;/code&gt; (usually), and don't care about your content being "fresh." &lt;code&gt;GPTBot&lt;/code&gt;, &lt;code&gt;Google-Extended&lt;/code&gt;, &lt;code&gt;Bytespider&lt;/code&gt;, &lt;code&gt;Applebot-Extended&lt;/code&gt;, and &lt;code&gt;anthropic-ai&lt;/code&gt; (legacy) fall here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval crawlers&lt;/strong&gt; index your content so it can be cited in real-time answers. They re-fetch popular pages, follow &lt;code&gt;Last-Modified&lt;/code&gt;, and have a measurable crawl-to-refer ratio. &lt;code&gt;OAI-SearchBot&lt;/code&gt;, &lt;code&gt;PerplexityBot&lt;/code&gt;, &lt;code&gt;Claude-SearchBot&lt;/code&gt; (newer, independently controllable from &lt;code&gt;ClaudeBot&lt;/code&gt;), and &lt;code&gt;GoogleOther&lt;/code&gt; belong here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User-initiated fetches&lt;/strong&gt; happen when a human pastes your URL into ChatGPT or asks Claude to read it. These are &lt;code&gt;ChatGPT-User&lt;/code&gt;, &lt;code&gt;Perplexity-User&lt;/code&gt;, and &lt;code&gt;Claude-User&lt;/code&gt;. They don't follow &lt;code&gt;robots.txt&lt;/code&gt; (per &lt;a href="https://developers.openai.com/api/docs/bots" rel="noopener noreferrer"&gt;OpenAI's revised crawler docs&lt;/a&gt;, because they're user actions, not crawls).&lt;/p&gt;

&lt;p&gt;I had been treating all of these as the same animal. They are not. If your goal is "get cited in ChatGPT Search," &lt;code&gt;OAI-SearchBot&lt;/code&gt; hits matter and &lt;code&gt;GPTBot&lt;/code&gt; hits are basically noise. If your goal is "be in the training set of the next Claude," it's exactly inverted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who actually obeys robots.txt
&lt;/h2&gt;

&lt;p&gt;Here's the part that flipped my view of &lt;code&gt;robots.txt&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;On &lt;code&gt;kenimoto.dev&lt;/code&gt;, I had a &lt;code&gt;Disallow: /api/&lt;/code&gt; rule. Over 30 days:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GPTBot&lt;/code&gt;: 0 hits to &lt;code&gt;/api/&lt;/code&gt;. Compliant.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Google-Extended&lt;/code&gt;: 0 hits to &lt;code&gt;/api/&lt;/code&gt;. Compliant.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ClaudeBot&lt;/code&gt;: 0 hits to &lt;code&gt;/api/&lt;/code&gt;. Compliant.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;OAI-SearchBot&lt;/code&gt;: 3 hits to &lt;code&gt;/api/&lt;/code&gt;. Borderline. Possibly cached before the rule, possibly the &lt;a href="https://ppc.land/openai-revises-chatgpt-crawler-documentation-with-significant-policy-changes/" rel="noopener noreferrer"&gt;revised compliance language&lt;/a&gt; is doing something subtle.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PerplexityBot&lt;/code&gt;: 41 hits to &lt;code&gt;/api/&lt;/code&gt; in one 90-second burst. Not compliant on this run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Forty-one hits is not a sample of one. The 90-second burst pattern matched a &lt;a href="https://www.appearonai.com/insights/ai-crawler-configuration-robots-txt-guide" rel="noopener noreferrer"&gt;public report&lt;/a&gt; where Perplexity was observed ignoring &lt;code&gt;User-agent: PerplexityBot&lt;/code&gt; blocks when answering an active user query. The behavior makes more sense if you think of &lt;code&gt;PerplexityBot&lt;/code&gt; as straddling the retrieval/user-initiated line: it acts like a retrieval crawler on the calm days, and a user-initiated fetch when somebody is waiting on an answer.&lt;/p&gt;

&lt;p&gt;The takeaway I wrote down: &lt;strong&gt;&lt;code&gt;robots.txt&lt;/code&gt; is a self-reported boundary&lt;/strong&gt;. Three of five top crawlers honored it cleanly on my data. One was iffy. One did whatever it wanted when a human was on the other end. Plan accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three LLMO signals you can derive from this
&lt;/h2&gt;

&lt;p&gt;The reason I'm writing this down is that crawler hit data is a measurable LLMO signal, and I haven't seen it discussed much next to the usual citation-rate metrics. Three things I now look at every week:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Crawler diversity.&lt;/strong&gt; If only &lt;code&gt;GPTBot&lt;/code&gt; hits your site and nothing else, your retrieval surface is OpenAI-only. You're invisible to Claude, Perplexity, and Gemini's retrieval paths even if you're cited in ChatGPT. A healthy crawler-diversity score is at least three of the five top User-Agents hitting you regularly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Retrieval-to-training ratio.&lt;/strong&gt; If you sum retrieval-side hits (&lt;code&gt;OAI-SearchBot&lt;/code&gt; + &lt;code&gt;PerplexityBot&lt;/code&gt; + &lt;code&gt;Claude-SearchBot&lt;/code&gt; + &lt;code&gt;GoogleOther&lt;/code&gt;) and divide by training-side hits (&lt;code&gt;GPTBot&lt;/code&gt; + &lt;code&gt;Google-Extended&lt;/code&gt; + &lt;code&gt;anthropic-ai&lt;/code&gt;), you get a number that tells you whether the AI ecosystem thinks of you as "content to be learned from" or "content to be cited right now." Mine sits at 0.81. Anything below 0.5 means your content isn't fresh enough to be retrieved in real time. Anything above 1.5 means you're being actively used in answers (good) but probably plateauing as training material (worth noticing).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. &lt;code&gt;llms.txt&lt;/code&gt; fetch rate.&lt;/strong&gt; Of the five top crawlers, only &lt;code&gt;PerplexityBot&lt;/code&gt; and &lt;code&gt;ClaudeBot&lt;/code&gt; fetched &lt;code&gt;/llms.txt&lt;/code&gt; on my sites during the 30-day window. &lt;code&gt;GPTBot&lt;/code&gt;, &lt;code&gt;OAI-SearchBot&lt;/code&gt;, and &lt;code&gt;Google-Extended&lt;/code&gt; never did. This roughly matches what other operators have observed and is a load-bearing detail when you're deciding whether &lt;code&gt;llms.txt&lt;/code&gt; is worth maintaining. (Short answer: yes, but mostly for the two crawlers that read it.) The &lt;a href="https://llmoframework.com/retrieval-signals/" rel="noopener noreferrer"&gt;llmoframework.com retrieval signals page&lt;/a&gt; goes deeper on this.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to actually pull this data
&lt;/h2&gt;

&lt;p&gt;This is the part I always wanted to read and never quite found, so:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudflare (free plan).&lt;/strong&gt; The AI Crawl Control dashboard (formerly AI Audit, &lt;a href="https://developers.cloudflare.com/ai-crawl-control/" rel="noopener noreferrer"&gt;docs here&lt;/a&gt;) shows top AI crawler User-Agents out of the box. For raw logs, you need Logpush, which is paid. On free, the easiest substitute is enabling "AI Audit" + filtering Analytics by known AI User-Agents. Free won't give you per-request paths but it gives you counts and trends.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vercel.&lt;/strong&gt; Project → Logs → filter by &lt;code&gt;User-Agent contains "Bot"&lt;/code&gt;. Vercel keeps 30 days of edge logs on the Pro plan. On Hobby, you get less, and you'll want to forward to a log drain if you're serious.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Netlify / self-hosted Nginx.&lt;/strong&gt; Just &lt;code&gt;grep&lt;/code&gt; the access log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"GPTBot|ClaudeBot|PerplexityBot|OAI-SearchBot|Google-Extended"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  /var/log/nginx/access.log &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $14}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sort&lt;/span&gt; | &lt;span class="nb"&gt;uniq&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you crawler counts. Add &lt;code&gt;awk '{print $7}'&lt;/code&gt; instead of &lt;code&gt;$14&lt;/code&gt; to get the URL ranking. The exact field number depends on your log format; check with &lt;code&gt;awk '{print NF}'&lt;/code&gt; on one line to count.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I changed after looking at all this
&lt;/h2&gt;

&lt;p&gt;Three concrete changes after the 30-day window:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I split my &lt;code&gt;robots.txt&lt;/code&gt; to allow &lt;code&gt;OAI-SearchBot&lt;/code&gt; and &lt;code&gt;Claude-SearchBot&lt;/code&gt; (retrieval, good for citations) while keeping &lt;code&gt;Disallow: /api/&lt;/code&gt; strict for &lt;code&gt;GPTBot&lt;/code&gt; (training, no upside for me on those endpoints).&lt;/li&gt;
&lt;li&gt;I added a &lt;code&gt;Last-Modified&lt;/code&gt; header to every blog post route, because retrieval crawlers use it to decide re-fetch frequency and Vercel wasn't sending one by default.&lt;/li&gt;
&lt;li&gt;I started tracking the retrieval-to-training ratio weekly in a spreadsheet. Two weeks in, the only useful insight is that the number is stable. That just means my crawler diet isn't lurching around week to week, but it's a baseline I didn't have before.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I expected the logs to confirm what I already believed about LLMO. They mostly didn't. Citation isn't the only signal worth watching. Who's pulling your pages is a separate question, and the answer is written in plain text in a log file you probably already have.&lt;/p&gt;

&lt;p&gt;If you want the full measurement frame (citation tracking, GA4 referrals, and server-log crawler analysis as parts of one system) the book is here: &lt;a href="https://kenimoto.dev/books/llmo-ai-search-optimization?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=five-ai-crawlers-30-days" rel="noopener noreferrer"&gt;LLMO: AI Search Optimization&lt;/a&gt;. Chapter 10 is the measurement chapter; this post is basically the missing seventh KPI it didn't have room for.&lt;/p&gt;

</description>
      <category>llmo</category>
      <category>ai</category>
      <category>seo</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I Added a 4th Agent That Audits My Other Agents. It Caught My Strategist Procrastinating for 3 Weeks.</title>
      <dc:creator>Ken Imoto</dc:creator>
      <pubDate>Mon, 25 May 2026 22:00:01 +0000</pubDate>
      <link>https://forem.com/kenimo49/i-added-a-4th-agent-that-audits-my-other-agents-it-caught-my-strategist-procrastinating-for-3-cg</link>
      <guid>https://forem.com/kenimo49/i-added-a-4th-agent-that-audits-my-other-agents-it-caught-my-strategist-procrastinating-for-3-cg</guid>
      <description>&lt;p&gt;I built a three-layer agent harness and called it "autonomous." Observer collected the data. Strategist picked the theme. Marketer wrote the article. They all followed &lt;code&gt;strategy.md&lt;/code&gt;, the file that holds my rules. The cron fired every Monday at 09:00 and the articles showed up by lunch. I felt very clever about it.&lt;/p&gt;

&lt;p&gt;Then I read my own Strategist logs across three weeks and noticed something. The same retreat criterion ("if Reaction rate stays under 1% for four consecutive weeks, revise the strategy") had been deferred three weeks in a row. Each week the Strategist wrote "data insufficient, observe next week" and moved on. The rule existed. The data existed. The rule never fired.&lt;/p&gt;

&lt;p&gt;The three-layer harness couldn't catch this because the three layers were doing exactly what &lt;code&gt;strategy.md&lt;/code&gt; told them to do. The bug was not in the agents. The bug was in the rules themselves, and nothing in the harness was paid to look at the rules.&lt;/p&gt;

&lt;p&gt;I added a 4th layer called Evolver. On its first real proposal it filed a diff against the exact rule my Strategist had been hiding behind.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftd7iyludqh22ddpyltj0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftd7iyludqh22ddpyltj0.png" alt="Four-layer agent harness diagram: Observer, Strategist, Marketer follow strategy.md; Evolver audits and rewrites the strategy.md itself." width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The three layers were not the autonomous part
&lt;/h2&gt;

&lt;p&gt;The architecture I had been calling autonomous looked like this. Observer ran daily and dumped GA4 numbers into &lt;code&gt;article-performance.jsonl&lt;/code&gt;. Strategist ran every Monday morning, read &lt;code&gt;strategy.md&lt;/code&gt;, and picked five themes for the week. Marketer turned each theme into an article and queued it for publishing. Three roles, three cron jobs, predictable behavior.&lt;/p&gt;

&lt;p&gt;The trick that made this fast was that I had taken WebSearch away from Strategist on purpose. A Strategist with WebSearch wandered for twenty minutes per run and started picking themes that matched recent news instead of themes that matched my actual content library. Stripping WebSearch dropped the cycle from twenty minutes to three. That post was about making Strategist faster. This one is about making it accountable.&lt;/p&gt;

&lt;p&gt;The thing none of those three layers could do was rewrite &lt;code&gt;strategy.md&lt;/code&gt;. They read it every Monday and obeyed it. If the rule was wrong, they obeyed a wrong rule. The only way to change the rule was for me, the human, to notice during weekly review that a rule needed updating. And I was the bottleneck. I had not been paying attention to the retreat criteria for at least three weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the procrastination looked like in the logs
&lt;/h2&gt;

&lt;p&gt;I am going to quote my own Strategist logs because the pattern is more honest when you see it in the original.&lt;/p&gt;

&lt;p&gt;From the log dated three weeks before I added the Evolver:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Reaction rate continues at 0% for the majority of articles. Title strategy has shifted to first-person and numerical framing. Four consecutive weeks under 1% would warrant a strategy review (currently three consecutive weeks, will determine next week).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The next week:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Reaction rate has not yet reached four consecutive weeks under 1%, but weekly trend data is insufficient. Observe next week.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the entire failure mode in two sentences. The rule said "four consecutive weeks." The Strategist had three consecutive weeks of data under 1%. Instead of treating week four as the decision week, the Strategist kept describing the situation as "still observing" and the clock never advanced. The retreat criterion was structured in a way the agent could indefinitely defer.&lt;/p&gt;

&lt;p&gt;When I went and computed the actual numbers from &lt;code&gt;article-performance.jsonl&lt;/code&gt; myself, the picture was even uglier. Across 24 articles published in the last four weeks: 812 total views, 4 total reactions, 7 total comments. Reaction rate: 0.49%. Half the threshold. Engagement rate (reactions plus comments): 1.35%. The rule should have triggered weeks ago. It never did because there was no layer in the harness whose job was to ask "is this rule even doing anything."&lt;/p&gt;

&lt;h2&gt;
  
  
  The 4th layer: what an Evolver is
&lt;/h2&gt;

&lt;p&gt;So I added a 4th cron job. It runs on Saturdays at 09:00, separate from the Monday Observer/Strategist/Marketer chain. Unlike the other three, it has WebSearch enabled. Its job is not to write articles. Its job is to read the strategy file, read the last few weeks of decision logs, and propose diffs against &lt;code&gt;strategy.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Each proposal is one file: &lt;code&gt;domains/&amp;lt;name&amp;gt;/data/evolution/EVO-NNNN.md&lt;/code&gt;. The Evolver fills in five sections.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observation — what it saw in the data&lt;/li&gt;
&lt;li&gt;Proposal — the rule change in plain prose&lt;/li&gt;
&lt;li&gt;Rationale — internal data and external references that justify the change&lt;/li&gt;
&lt;li&gt;Expected impact — what should improve if applied&lt;/li&gt;
&lt;li&gt;Diff — a literal &lt;code&gt;diff&lt;/code&gt; block against &lt;code&gt;strategy.md&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The diff block is the load-bearing part. The Evolver does not just write English suggestions. It writes the exact patch that would land in the repo. A small CLI called &lt;code&gt;harness-evolve.sh&lt;/code&gt; knows how to extract the diff block, run &lt;code&gt;git apply --check&lt;/code&gt;, and commit it with the proposal as the body. No LLM is involved in the apply step. The LLM proposes, the shell applies.&lt;/p&gt;

&lt;p&gt;That separation is on purpose. The proposal is creative. The apply is mechanical. When the apply step is mechanical you can trust it to either succeed cleanly or fail loudly. There is no "the agent tried to apply the patch and something weird happened in the middle."&lt;/p&gt;

&lt;h2&gt;
  
  
  EVO-0003 caught my Strategist procrastinating
&lt;/h2&gt;

&lt;p&gt;The Evolver's third real proposal, &lt;code&gt;EVO-0003&lt;/code&gt;, was the one I described above. The proposal is on disk and I am reading it back as I write this.&lt;/p&gt;

&lt;p&gt;The observation section quoted both of my Strategist logs, the "three consecutive weeks, will determine next week" one and the "data insufficient, observe next week" one. Then it computed the engagement rate from &lt;code&gt;article-performance.jsonl&lt;/code&gt; and showed that the threshold had been breached for at least four weeks. Then it argued that the original rule was bad in three ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The formula was not specified. Was "Reaction rate" per-article or aggregate? My Strategist could plausibly compute either, which is why it had been deferring.&lt;/li&gt;
&lt;li&gt;The trigger condition "four consecutive weeks" was ambiguous when weekly data was thin.&lt;/li&gt;
&lt;li&gt;The action on trigger ("propose a title and angle revision") was abstract enough that the Strategist could fulfill it with a single sentence and move on.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The proposal replaced the rule with this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Engagement rate = (sum of reactions + comments over the last 4 weeks of articles) / sum of views. The Strategist must compute this every week and log it. If under 1.5% for four consecutive weeks, next week's 5 articles must be at least 4 titles in the "number + first person + failure narrative" form. Abstract titles are forbidden.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is a 20-line patch. The diff is below the prose in the proposal file. I approved it via &lt;code&gt;/harness-evolve approve EVO-0003&lt;/code&gt; at 14:04 on a Tuesday afternoon. The shell ran &lt;code&gt;git apply --index&lt;/code&gt; against &lt;code&gt;strategy.md&lt;/code&gt;, made the commit, updated the proposal's frontmatter to &lt;code&gt;status: applied&lt;/code&gt;, and sent me a Telegram note. The next Monday's Strategist ran with the new rule and computed an engagement rate of 1.35% in the log without prompting. The "data insufficient" sentence stopped appearing.&lt;/p&gt;

&lt;p&gt;The thing I want to be honest about is that the Strategist hadn't been malicious. It hadn't been broken either. It had been a perfectly competent agent following a rule that was structured to allow deferral. That is a failure of the rule. The Evolver's job is to detect rule failures, because nothing else in the harness was structured to.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Safety boundary, because Self-Evolving Agents are not toys
&lt;/h2&gt;

&lt;p&gt;The minute you say "an agent that rewrites the harness," somebody in your head should be raising their hand and asking what stops it from rewriting itself into a paperclip optimizer. Several things, on purpose.&lt;/p&gt;

&lt;p&gt;The Evolver cannot touch the kinds of decisions that have to remain mine. Adding or removing a domain. Switching languages. Changing the quality bar for writing. Anything involving licensing, author identity, or security. The &lt;code&gt;.env&lt;/code&gt; file, the credentials directory, the publish triggers. If any of these were on the table I would not let the Evolver run unattended at all.&lt;/p&gt;

&lt;p&gt;Inside the territory it can touch, three numeric limits keep it from running away.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Diff size cap: 20 lines per proposal. A proposal larger than that has to be split or escalated.&lt;/li&gt;
&lt;li&gt;Two proposals per week per domain. If the Evolver wants to propose more, the third is held until next Saturday.&lt;/li&gt;
&lt;li&gt;Three consecutive rejects on the same theme triggers an automatic mute. The Evolver stops re-pitching the same idea after I have said no three times.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last one is the part I think is undersold in the broader "self-improving agent" literature. The interesting signal in a &lt;code&gt;reject&lt;/code&gt; log is not the proposal, it is the reason. "MCP is still the main revenue genre, we cannot drop it" is the kind of business context that has never been written into &lt;code&gt;strategy.md&lt;/code&gt;. After three weeks of rejecting MCP-cut proposals with that reason, the Evolver stops proposing them. Implicit founder context becomes explicit harness behavior, just by accumulating reasons-for-reject.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you need before adding a 4th layer
&lt;/h2&gt;

&lt;p&gt;I think there are three real prerequisites before adding an Evolver-style layer to your own setup. Without them, the 4th layer is just noise.&lt;/p&gt;

&lt;p&gt;First, the three existing layers have to produce decision logs that another agent can read. If your Strategist's output is "ran successfully, picked themes," there is nothing for the Evolver to find. The procrastination only showed up because my Strategist had been writing structured logs with phrases like "currently three consecutive weeks, will determine next week." Logs that include the agent's reasoning in prose are what make audit possible.&lt;/p&gt;

&lt;p&gt;Second, the rules themselves have to be in version control as text. &lt;code&gt;strategy.md&lt;/code&gt; is a checked-in markdown file because the Evolver needs to produce a diff block that &lt;code&gt;git apply&lt;/code&gt; can land. If your rules live in a database, a SaaS dashboard, or a thousand-line JSON config, the patch model breaks down. Plain markdown in git is the cheap path.&lt;/p&gt;

&lt;p&gt;Third, you need a human approval channel that does not require the human to read the whole proposal every time. My Telegram notification has the EVO-ID, the title, and a one-line link to the file. I open the file only when the title makes me curious. Most of the time I either approve fast or reject with a short reason. If approval costs me ten minutes per proposal, I will stop running the Evolver. If it costs me thirty seconds, I will run it indefinitely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What about not adding a 4th layer
&lt;/h2&gt;

&lt;p&gt;If you do not want a 4th layer, you can absolutely get most of the benefit by running a weekly human review with a specific question. Not "how are the agents doing." That is what I had been doing, and it did not catch the procrastination. The specific question is: "did any retreat criterion in &lt;code&gt;strategy.md&lt;/code&gt; actually fire this week, and if not, why not."&lt;/p&gt;

&lt;p&gt;Sit with that question for ten minutes per Friday. You will catch what I was missing for three weeks. The Evolver is, more than anything else, a forcing function for that question. It does not have to be an agent. It can be a calendar reminder.&lt;/p&gt;

&lt;p&gt;I happen to like running it as an agent because the proposal artifacts pile up in version control and become a record of how my rules have evolved. &lt;code&gt;EVO-0001&lt;/code&gt; through &lt;code&gt;EVO-0004&lt;/code&gt; form a small history of "things I thought were good ideas, things I thought were bad ideas, and why." That history is useful when I am writing next year's &lt;code&gt;strategy.md&lt;/code&gt; from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I have not built yet
&lt;/h2&gt;

&lt;p&gt;The current Evolver only audits one domain at a time. Across my four domains (devto, qiita, zenn, kenimoto-dev) I have written different versions of &lt;code&gt;strategy.md&lt;/code&gt; for each, and most of them have similarly structured retreat criteria. A cross-domain Evolver could notice that the same rule structure has been failing in two domains and propose a unified fix. I have not built it. It is on the list.&lt;/p&gt;

&lt;p&gt;The other thing on the list is the obvious recursion question. Who audits the Evolver. The current answer is "I do, every approve/reject is a human signal." The longer answer is "I do not know yet." If the Evolver's proposals start looking systematically biased — say, always proposing tighter thresholds, or always proposing to drop the same genre — that bias is real and I should add a 5th layer that watches the 4th. I have not seen it yet. I might not until EVO-0050 or so. I want the bias to be obvious before I add another layer just to feel safer.&lt;/p&gt;

&lt;p&gt;For now: three agents that follow rules, one agent that audits the rules, and one human who approves the audit. That is the smallest harness I have found that catches its own procrastination.&lt;/p&gt;




&lt;p&gt;If you want the full Harness Engineering picture — the 6 building blocks, the AGENTS.md/CLAUDE.md/hooks patterns, and the Self-Evolving Agent chapter that grounds this article — that is in the book.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://kenimoto.dev/books/harness-engineering-guide?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=evolver-caught-strategist" rel="noopener noreferrer"&gt;Harness Engineering: From Using AI to Controlling AI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>claudecode</category>
      <category>llm</category>
    </item>
    <item>
      <title>I Told Claude Code to Do TDD. It Wrote the Test AFTER the Code 6 Out of 10 Times.</title>
      <dc:creator>Ken Imoto</dc:creator>
      <pubDate>Mon, 25 May 2026 11:00:00 +0000</pubDate>
      <link>https://forem.com/kenimo49/i-told-claude-code-to-do-tdd-it-wrote-the-test-after-the-code-6-out-of-10-times-31kd</link>
      <guid>https://forem.com/kenimo49/i-told-claude-code-to-do-tdd-it-wrote-the-test-after-the-code-6-out-of-10-times-31kd</guid>
      <description>&lt;p&gt;My CLAUDE.md had a section called &lt;code&gt;## TDD First&lt;/code&gt;. Six lines. Very clear. I had spent twenty minutes drafting it. Then I ran a 30-day audit of my own commits and discovered that across the features I had asked Claude Code to TDD, the test file was committed &lt;em&gt;after&lt;/em&gt; the source file 6 out of 10 times.&lt;/p&gt;

&lt;p&gt;Not "the test failed first, then I fixed it." The test file did not exist at the moment the source file got committed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fms1yprjyrsiij2nkhfze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fms1yprjyrsiij2nkhfze.png" alt="A side-by-side panel: before the PreToolUse hook, 6 of 10 features had the test written after the source. After the hook, 10 of 10 had the test first." width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the story of how I caught it, why it kept happening, and the two-part fix (prompt plus a PreToolUse hook) that finally pushed Claude into a real red-green-refactor cycle. It is the third installment in what is becoming an accidental series on Claude doing things confidently and wrong. The first was Claude &lt;a href="https://dev.to/kenimo49/i-caught-claude-hiding-my-bug-three-times-here-are-the-10-debugging-prompts-that-stopped-it-3l1h"&gt;hiding bugs three times in a row&lt;/a&gt;. The second was refusing to write specs until the code went sideways three times. This one is about TDD, and the pattern is identical: the model agrees, the model proceeds, the model skips the part of the prompt that would cost it tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 30-day audit
&lt;/h2&gt;

&lt;p&gt;The audit was accidental. I had been writing about debugging habits and wanted to see whether my own commit history was consistent with what I was preaching. So I pulled &lt;code&gt;git log --name-only --pretty=format:'%h %ai %s'&lt;/code&gt; for the last 30 days on a project I had been driving with Claude Code, and grouped the commits by feature. Ten features. For each one, I noted the timestamp of the first commit that touched the source file, and the timestamp of the first commit that touched its test file.&lt;/p&gt;

&lt;p&gt;Six features out of ten had the source file committed first. The gap ranged from 90 seconds to 23 minutes. In two cases the test file was committed in the same commit as a later round of fixes, after the source had already been shipped to a feature branch. In one case there was no test file at all, only a &lt;code&gt;# TODO: add tests&lt;/code&gt; next to the function.&lt;/p&gt;

&lt;p&gt;I had been telling Claude "TDD this" every single time. I had a &lt;code&gt;## TDD First&lt;/code&gt; section in CLAUDE.md. I had even pasted the red-green-refactor sequence at the top of the prompt for the more complex features. And six times out of ten, it had cheerfully written the implementation, then either written the test afterward or skipped it entirely.&lt;/p&gt;

&lt;p&gt;I am not blaming the model for being lazy. The model was doing exactly what it was trained to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why next-token prediction defaults to implementation-first
&lt;/h2&gt;

&lt;p&gt;This is the part that took me a while to actually understand. The model is not deciding "I will do TDD" or "I will not do TDD" the way a human engineer might decide. It is predicting the next most plausible token given the context. And in its training data, the overwhelming majority of "user asks for feature X" responses look like &lt;em&gt;here is the function that does X&lt;/em&gt;, optionally followed by &lt;em&gt;and here is a test&lt;/em&gt;. The "test first, then implementation, with the test failing in between" sequence is rare in public repositories because humans rarely commit the red phase as its own commit. We commit the green phase. So the model never built a strong prior for the red-first ordering.&lt;/p&gt;

&lt;p&gt;Several people in the Claude Code community have pointed at the same thing. The &lt;a href="https://www.aihero.dev/skill-test-driven-development-claude-code" rel="noopener noreferrer"&gt;aihero.dev TDD skill writeup&lt;/a&gt; puts it as: when the test writer and the implementer share the same context window, the implementer's thinking leaks into the test writer's, and you get tests that conveniently pass on the first run. That is not TDD. That is "tests retrofitted to pass." The &lt;a href="https://alexop.dev/posts/custom-tdd-workflow-claude-code-vue/" rel="noopener noreferrer"&gt;alexop.dev red-green-refactor loop post&lt;/a&gt; argues that the only reliable fix is to force the cycle from outside the model, with hooks or skills that the agent cannot override mid-stride.&lt;/p&gt;

&lt;p&gt;The other thing I keep seeing in community writeups, including the &lt;a href="https://docs.bswen.com/blog/2026-03-25-tdd-skill-claude-code/" rel="noopener noreferrer"&gt;BSWEN Claude Code TDD skill walkthrough&lt;/a&gt;, is the same Anthropic guidance I had been ignoring: Claude will sometimes alter the test to make it pass rather than fix the implementation. Committing the test before the implementation gives you a diff to look at if that happens. I was not doing that either.&lt;/p&gt;

&lt;p&gt;So the model had a weak prior for test-first, and I had a weak workflow that did nothing to compensate. Six out of ten makes a lot of sense in retrospect. The surprising thing is that it was as low as six.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I tried first that did not work
&lt;/h2&gt;

&lt;p&gt;Before the hook, I tried prompt engineering harder. This is what most people try, and it gets you most of the way without getting you there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 1 — &lt;code&gt;## TDD First&lt;/code&gt; in CLAUDE.md.&lt;/strong&gt; Already had this. Six out of ten ignored it. The header was too generic; the model saw it as a vibe, not a constraint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 2 — explicit red-phase instruction in the prompt.&lt;/strong&gt; I started pasting "Write a failing test for [feature] in &lt;code&gt;tests/X_test.py&lt;/code&gt;. Do not write the implementation yet. Run the test and confirm it fails before proceeding." This got me to maybe 8 out of 10. Better, but 2 out of 10 I would catch it cheating, usually by writing the test in a way that mocked out the part that would actually have failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 3 — separate prompts for red and green.&lt;/strong&gt; Two messages. First message: write the failing test, stop, run it, show me the failure. Second message, only after I had eyeballed the failure: now write the implementation. This was the first time I got something that smelled like real TDD. The problem was that it required me to physically be at the keyboard for two turns, and if I context-switched away mid-feature, the next Claude session would happily merge the two steps back into one.&lt;/p&gt;

&lt;p&gt;The lesson from Attempt 3 is that prompts are advice. The model can ignore advice. To get TDD enforced, I needed something the model could not ignore. That something is a hook.&lt;/p&gt;

&lt;h2&gt;
  
  
  The PreToolUse hook that broke the loop
&lt;/h2&gt;

&lt;p&gt;Claude Code's hook system lets you intercept tool calls before they execute. A PreToolUse hook on Write or Edit gets the file path the model is about to touch. If the model is trying to write to &lt;code&gt;src/foo.py&lt;/code&gt; and there is no &lt;code&gt;tests/foo_test.py&lt;/code&gt; that currently fails, the hook can exit 2, which Claude Code treats as "this tool call is denied, here is the reason, try again."&lt;/p&gt;

&lt;p&gt;This is the smallest version that worked for me, on a Python project with pytest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"PreToolUse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"matcher"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Write|Edit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python3 .claude/hooks/require-failing-test.py"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script reads the file path from the tool call payload, maps &lt;code&gt;src/X.py&lt;/code&gt; to &lt;code&gt;tests/X_test.py&lt;/code&gt;, checks the test file exists, runs &lt;code&gt;pytest tests/X_test.py --no-header -q&lt;/code&gt;, and exits 2 if pytest exits 0. If the test does not yet exist or the test currently fails, the hook lets the edit through. If the test exists and is already passing, the hook blocks the edit with a message like &lt;em&gt;"a failing test must exist in tests/X_test.py before src/X.py can be modified. Write the failing test first."&lt;/em&gt; That message lands in the model's next-turn context. It does not have a choice.&lt;/p&gt;

&lt;p&gt;There are edge cases. The test file might pass for the wrong reason; the hook does not catch that. The mapping from source to test path is project-specific; mine is hardcoded. And I have an escape hatch, a magic comment &lt;code&gt;# tdd-bypass: refactor&lt;/code&gt; on the first line, for refactor commits where you genuinely want to edit without a new failing test, because refactor is supposed to preserve behavior, not add it. The hook respects the escape hatch, but it logs every use of it to a file I review at the end of the week. The first week, my escape-hatch log had 22 entries. The second week it had 4. That number going down is the whole point.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the 30-day rerun looked like
&lt;/h2&gt;

&lt;p&gt;I ran the same audit 30 days after the hook went in. Same project, same kind of features, same prompt style. The numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test file committed first: &lt;strong&gt;9 of 10&lt;/strong&gt; (up from 4 of 10)&lt;/li&gt;
&lt;li&gt;Test file committed in same commit as source, but written first per the file-modification timestamps: 1 of 10&lt;/li&gt;
&lt;li&gt;Test file committed after source: 0 of 10&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The single feature where the test went in the same commit as the source was a 12-line config helper that I had legitimately bypassed with the magic comment. So in terms of TDD being followed when the rule applied, the number is 10 of 10.&lt;/p&gt;

&lt;p&gt;I do not want to claim that the hook turned Claude into a disciplined TDD practitioner. It did not. The model still writes implementations that look suspicious from a "test was designed around the implementation" perspective some of the time. What the hook gives me is &lt;em&gt;ordering&lt;/em&gt;: a failing test must exist before the source can be touched. That alone closes the loop where Claude was retrofitting tests around code that was already shaping the test's assertions. The Anthropic guidance on this, captured by several community writeups including the &lt;a href="https://www.datacamp.com/tutorial/claude-code-best-practices" rel="noopener noreferrer"&gt;DataCamp best practices roundup&lt;/a&gt;, is that ordering is the load-bearing constraint and everything else is bonus.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to skip TDD entirely
&lt;/h2&gt;

&lt;p&gt;This is the part I should have figured out before instrumenting any of this. There are tasks where TDD is the wrong tool. Refactors that should be a no-op behaviorally. One-off scripts I am going to throw away in 20 minutes. Pure data migrations. UI tweaks where the test would just be a snapshot of itself. Forcing TDD on these tasks does not make the code better; it makes the workflow heavier with no payoff.&lt;/p&gt;

&lt;p&gt;The escape hatch exists for these. The week-end review of the escape-hatch log is where I notice if I am abusing it. "I bypassed TDD because the test was hard to write" is a smell. "I bypassed TDD because the code was a snapshot test of CSS class names" is fine. The audit, not the rule, is what keeps the workflow honest.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;My CLAUDE.md still says &lt;code&gt;## TDD First&lt;/code&gt;. I left it there for vibes. It was never going to be the part that did the work. The hook is the part that does the work, and the audit is the part that decides whether the hook is still tuned right.&lt;/p&gt;

&lt;p&gt;If you want the full picture of how to layer prompts vs hooks vs MCP servers (when to use which layer for which kind of rule), I wrote it down in &lt;a href="https://kenimoto.dev/books/claude-code-mastery?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=claude-tdd-6-of-10" rel="noopener noreferrer"&gt;Practical Claude Code&lt;/a&gt;. The hooks chapter is the one I keep coming back to.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/FlorianBruniaux/claude-code-ultimate-guide/blob/main/guide/workflows/tdd-with-claude.md" rel="noopener noreferrer"&gt;TDD with Claude Code (FlorianBruniaux/claude-code-ultimate-guide)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.bswen.com/blog/2026-03-25-tdd-skill-claude-code/" rel="noopener noreferrer"&gt;How to Implement TDD with Claude Code TDD Skill (BSWEN, Mar 2026)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.aihero.dev/skill-test-driven-development-claude-code" rel="noopener noreferrer"&gt;My Skill Makes Claude Code GREAT At TDD (aihero.dev)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://alexop.dev/posts/custom-tdd-workflow-claude-code-vue/" rel="noopener noreferrer"&gt;Forcing Claude Code to TDD: an agentic red-green-refactor loop (alexop.dev)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.datacamp.com/tutorial/claude-code-best-practices" rel="noopener noreferrer"&gt;Claude Code Best Practices: Planning, Context Transfer, TDD (DataCamp)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>tdd</category>
      <category>hooks</category>
    </item>
    <item>
      <title>Your New Domain's First Week of GA4 Is a Lie: 4 Days of Raw Data from a Launch</title>
      <dc:creator>Ken Imoto</dc:creator>
      <pubDate>Sat, 23 May 2026 13:00:01 +0000</pubDate>
      <link>https://forem.com/kenimo49/your-new-domains-first-week-of-ga4-is-a-lie-4-days-of-raw-data-from-a-launch-47pi</link>
      <guid>https://forem.com/kenimo49/your-new-domains-first-week-of-ga4-is-a-lie-4-days-of-raw-data-from-a-launch-47pi</guid>
      <description>&lt;p&gt;Four days after registering a new domain, I opened GA4 and saw &lt;strong&gt;65 page views / 34 users / 9 countries&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For a brief, build-in-public moment, I almost cheered. Then I looked at the breakdown. The US had 17 sessions averaging &lt;strong&gt;4.9 seconds&lt;/strong&gt; of session duration. France, Poland, South Korea, India, Singapore: each between &lt;strong&gt;0 and 1.4 seconds&lt;/strong&gt;. Japan alone sat at &lt;strong&gt;751 seconds (over 12 minutes)&lt;/strong&gt;: an outlier so loud it should be illegal.&lt;/p&gt;

&lt;p&gt;The domain is &lt;a href="https://kaoriq.com" rel="noopener noreferrer"&gt;kaoriq.com&lt;/a&gt;, registered on 2026-05-02, a personality-quiz × fragrance e-commerce site I'm building. As of today (May 5), it has fewer than 20 articles. Doing the back-of-the-envelope math, that page-view distribution is physically impossible to come from real humans.&lt;/p&gt;

&lt;p&gt;This post walks through how I read the first week of GA4 data on a new domain as &lt;strong&gt;"me + a crawler army"&lt;/strong&gt;, with the actual numbers exposed. For anyone running GA4 on a new project, or anyone who registered a domain this weekend.&lt;/p&gt;

&lt;h2&gt;
  
  
  The raw data: past 14 days (4 days of real activity)
&lt;/h2&gt;

&lt;p&gt;Numbers first, no spin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overall&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sessions&lt;/td&gt;
&lt;td&gt;37&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Page Views&lt;/td&gt;
&lt;td&gt;65&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total Users&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New Users&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg Session Duration&lt;/td&gt;
&lt;td&gt;104.1 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bounce Rate&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;By Country&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Country&lt;/th&gt;
&lt;th&gt;Sessions&lt;/th&gt;
&lt;th&gt;PV&lt;/th&gt;
&lt;th&gt;Avg Duration (s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Japan&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;751.0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;United States&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;4.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Canada&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;1.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;France&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;1.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Poland&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;South Korea&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(not set)&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;India&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Singapore&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Daily&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Sessions&lt;/th&gt;
&lt;th&gt;PV&lt;/th&gt;
&lt;th&gt;Users&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-02 (registration day)&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-03&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-04&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-05&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At a glance, "not bad for week one" is a tempting read. But this dataset contains a &lt;strong&gt;751-second Japanese reader&lt;/strong&gt; living next door to &lt;strong&gt;9 countries averaging zero seconds&lt;/strong&gt;. The middle is missing. That gap is the whole tell.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five signals, beaten in parallel
&lt;/h2&gt;

&lt;p&gt;I never call bot traffic on a single signal. To avoid false positives, I always cross-check five axes at once.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Bot pattern&lt;/th&gt;
&lt;th&gt;Human pattern&lt;/th&gt;
&lt;th&gt;kaoriq actual&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Session duration&lt;/td&gt;
&lt;td&gt;0–5 s&lt;/td&gt;
&lt;td&gt;30 s – several min&lt;/td&gt;
&lt;td&gt;US 4.9s, FR 1.4s, KR 0s&lt;/td&gt;
&lt;td&gt;Bot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bounce rate&lt;/td&gt;
&lt;td&gt;90–100%&lt;/td&gt;
&lt;td&gt;40–70%&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;Bot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PV / Session&lt;/td&gt;
&lt;td&gt;1.0 (one page, gone)&lt;/td&gt;
&lt;td&gt;1.5–3.0&lt;/td&gt;
&lt;td&gt;US: 17/17 = 1.0&lt;/td&gt;
&lt;td&gt;Bot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Geographic anomaly&lt;/td&gt;
&lt;td&gt;Random countries unrelated to content&lt;/td&gt;
&lt;td&gt;Concentrated in target geo&lt;/td&gt;
&lt;td&gt;EN/JA only, yet PL/IN/SG&lt;/td&gt;
&lt;td&gt;Bot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time-series spike&lt;/td&gt;
&lt;td&gt;Massive day-one for new domains&lt;/td&gt;
&lt;td&gt;Gradual ramp&lt;/td&gt;
&lt;td&gt;40 PV on day of registration&lt;/td&gt;
&lt;td&gt;Bot&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Why a single signal lies
&lt;/h3&gt;

&lt;p&gt;"80% bounce, must all be bots, right?" Not so fast.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Duration alone&lt;/strong&gt;: A reader who tabs your post and walks away for lunch racks up 30+ minutes. Indistinguishable from "deeply engaged" or "abandoned tab."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bounce rate alone&lt;/strong&gt;: A landing page that perfectly answers the question gets a 100% bounce from satisfied humans. Excellence and bots both score the same.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geography alone&lt;/strong&gt;: A viral overseas tweet legitimately produces multi-country traffic. Weak on its own.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You only get to call "bot" with confidence when &lt;strong&gt;all five signals lean the same direction simultaneously&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bimodal distribution was the smoking gun
&lt;/h2&gt;

&lt;p&gt;The real reason this verdict held in kaoriq's case is the &lt;strong&gt;shape of the duration distribution&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Japan: 5 sessions / &lt;strong&gt;751 s&lt;/strong&gt; average&lt;/li&gt;
&lt;li&gt;Everywhere else: &lt;strong&gt;0–5 s&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the traffic were genuinely human, session duration should spread &lt;strong&gt;more evenly across the 20–120 second band&lt;/strong&gt;: "bounced after the title (10s)," "read the lede (40s)," "made it to the end (180s)" forming a gradient.&lt;/p&gt;

&lt;p&gt;But kaoriq's distribution is &lt;strong&gt;bimodal&lt;/strong&gt; with the middle scooped out. The honest reading: only "me (long sessions, testing the site)" and "crawlers (instant exits)" exist. Nothing in between.&lt;/p&gt;

&lt;p&gt;Conversely, a healthy distribution would look like "Japan 100 sessions / 60s, US 50 sessions / 45s, Canada 20 sessions / 30s": durations spread normally. That'd be a real human traffic signature.&lt;/p&gt;

&lt;h2&gt;
  
  
  So how many real humans were there?
&lt;/h2&gt;

&lt;p&gt;After all that beating, my estimate breaks down as:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Estimated sessions&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Me, testing the site&lt;/td&gt;
&lt;td&gt;4–5&lt;/td&gt;
&lt;td&gt;Most of Japan's 5 sessions, source of the 751s average&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Crawlers (Googlebot / Bingbot / GPTBot / ClaudeBot / AhrefsBot, etc.)&lt;/td&gt;
&lt;td&gt;27–30&lt;/td&gt;
&lt;td&gt;US 17, plus the zero-second Europe &amp;amp; Asia rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Actual organic human traffic&lt;/td&gt;
&lt;td&gt;2–5&lt;/td&gt;
&lt;td&gt;The remainder of Japan + a couple of US sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Of 37 sessions, &lt;strong&gt;at most 5 were real humans&lt;/strong&gt;. That's the reality of week one for a new domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why GA4 doesn't filter this for you
&lt;/h2&gt;

&lt;p&gt;GA4 has a &lt;strong&gt;"known bots and spiders" auto-exclusion&lt;/strong&gt; based on the &lt;a href="https://www.iab.com/guidelines/iab-abc-international-spiders-bots-list/" rel="noopener noreferrer"&gt;IAB/ABC Spiders &amp;amp; Bots list&lt;/a&gt;. It catches classical crawlers but misses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JavaScript-executing crawlers&lt;/strong&gt;: GPTBot, ClaudeBot, PerplexityBot. These new generative-AI crawlers run JS, so the GA4 tag fires.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SEO-tool crawlers&lt;/strong&gt;: AhrefsBot, SemrushBot, MozBot. High frequency, and they swarm new domains the moment they're discovered.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Headless-browser scrapers&lt;/strong&gt;: Custom Puppeteer or Playwright bots are indistinguishable from a real Chrome session.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The week after a new domain registration is when this crawler army discovers the new IP.&lt;/strong&gt; It calms down within 7–10 days as DNS propagates. But if you take week-one GA4 at face value, you'll make bad decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three annotations every new-project dashboard needs
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use "Engaged Sessions" as your primary metric.&lt;/strong&gt; GA4 defines an engaged session as: ≥10s duration OR ≥2 PV OR a conversion event. Most of the bot army gets filtered here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always view session duration &lt;em&gt;split by country&lt;/em&gt;.&lt;/strong&gt; Looking at any single metric (sessions, PV) without the geo filter lets the crawler army masquerade as success.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat the first 30 days as a "noise phase."&lt;/strong&gt; Real numbers only appear after social funnels, SEO, and content depth all line up.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Closing: look at your own GA4 with this lens
&lt;/h2&gt;

&lt;p&gt;A new domain's GA4 lies for the first 1–2 weeks. If your country breakdown is full of zero-second sessions from the US, Eastern Europe, and Southeast Asia: that's the crawler parade, not humans falling in love with your content.&lt;/p&gt;

&lt;p&gt;The procedure is simple: &lt;strong&gt;beat with five signals → suspect bimodal distributions → swap the primary metric to Engaged Sessions&lt;/strong&gt;. Doing this saves you from being whipsawed by early data.&lt;/p&gt;

&lt;p&gt;Doubting GA4 is, in the end, a discipline for not making expensive mistakes. Beat the data before the data beats you.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;This post is based on real data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Site: &lt;a href="https://kaoriq.com" rel="noopener noreferrer"&gt;kaoriq.com&lt;/a&gt; (domain registered 2026-05-02, built with Astro v6 + Tailwind v4)&lt;/li&gt;
&lt;li&gt;Period analyzed: 2026-04-22 → 2026-05-05 (4 days of actual activity)&lt;/li&gt;
&lt;li&gt;Data source: GA4 Data API v1beta via Service Account&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you want the full LLMO playbook (how to think about AI crawlers, citations, and the measurement layer underneath the GA4 narrative):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kenimoto.dev/books/llmo-ai-search-optimization?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=ga4-new-domain-lie" rel="noopener noreferrer"&gt;LLMO: AI Search Optimization for Engineers&lt;/a&gt;&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>webdev</category>
      <category>llmo</category>
      <category>seo</category>
    </item>
    <item>
      <title>I Stacked 4 More Context Layers on Top of RAG. Sonnet Got 12% Better. Haiku Got 14% Worse.</title>
      <dc:creator>Ken Imoto</dc:creator>
      <pubDate>Fri, 22 May 2026 13:00:01 +0000</pubDate>
      <link>https://forem.com/kenimo49/i-stacked-4-more-context-layers-on-top-of-rag-sonnet-got-12-better-haiku-got-14-worse-3h4d</link>
      <guid>https://forem.com/kenimo49/i-stacked-4-more-context-layers-on-top-of-rag-sonnet-got-12-better-haiku-got-14-worse-3h4d</guid>
      <description>&lt;p&gt;I read a post about "Full Context Engineering" and immediately added four more layers to my RAG pipeline. Structured output instructions. Hierarchical document layout. Role definition. Few-shot examples. The whole buffet.&lt;/p&gt;

&lt;p&gt;The improvement on Claude Sonnet was 12%.&lt;/p&gt;

&lt;p&gt;The improvement on Claude Haiku was minus 14%.&lt;/p&gt;

&lt;p&gt;I had just spent two weeks building scaffolding to make my smaller model worse at its job. If you have ever wallpapered a room and stepped back to discover you covered up the light switch, you know the feeling.&lt;/p&gt;

&lt;p&gt;This post is about what those numbers actually mean for the way you spend your context engineering effort in 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I was measuring
&lt;/h2&gt;

&lt;p&gt;I was running a benchmark against my own book corpus for a previous experiment (the cheap-model post). The same scoring rubric: factual accuracy, hallucination rate, specificity, and honesty on a 0 to 15 scale.&lt;/p&gt;

&lt;p&gt;The configurations were a ladder. Each rung adds one more thing on top of the previous one.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;System prompt only&lt;/strong&gt;: the bare baseline. No retrieval, nothing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System + RAG&lt;/strong&gt;: vector search over a curated corpus, top documents injected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full Context Engineering&lt;/strong&gt;: RAG + structured output instructions + hierarchical layout + role definition + few-shot examples.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What I expected: a smooth upward curve. What I got was a curve that leaned forward and then fell over.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6xjktcrby9e3jwkz8tw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6xjktcrby9e3jwkz8tw.png" alt="Full CE vs RAG: Score deltas across Sonnet and Haiku" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Sonnet, total score (out of 15):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;th&gt;Delta from previous&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System only&lt;/td&gt;
&lt;td&gt;8.8&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System + RAG&lt;/td&gt;
&lt;td&gt;10.2&lt;/td&gt;
&lt;td&gt;+1.4 (+16%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full Context Engineering&lt;/td&gt;
&lt;td&gt;11.4&lt;/td&gt;
&lt;td&gt;+1.2 (+12%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Claude Haiku, total score (out of 15):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;th&gt;Delta from previous&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System only&lt;/td&gt;
&lt;td&gt;3.7&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System + RAG&lt;/td&gt;
&lt;td&gt;11.8&lt;/td&gt;
&lt;td&gt;+8.1 (+219%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full Context Engineering&lt;/td&gt;
&lt;td&gt;10.1&lt;/td&gt;
&lt;td&gt;-1.7 (-14%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two findings I did not expect.&lt;/p&gt;

&lt;p&gt;First: RAG is doing almost all of the work. On Sonnet, RAG closed 88% of the gap between baseline and the fully tricked-out pipeline (1.4 of the total 2.6 point improvement). On Haiku, RAG over-shot the final number entirely.&lt;/p&gt;

&lt;p&gt;Second: stacking more on top of RAG is not free. On Haiku, it actively made things worse. The hallucination score went from 1.7 to 0.5. The honesty score went from 1.3 to 0.5. The model started confidently making things up that it had previously hedged on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this happens
&lt;/h2&gt;

&lt;p&gt;I have a hypothesis that I think survives contact with reality.&lt;/p&gt;

&lt;p&gt;A small model has limited working memory. RAG hands it the right facts. Once those facts are in front of it, the marginal returns from extra structure are small. But the marginal cost of extra context is not small. Every paragraph of role definition, every few-shot example, every "here is how to format the output" block competes with the retrieved documents for the model's attention.&lt;/p&gt;

&lt;p&gt;For Sonnet, the working memory is wide enough that the extras land in unused space. For Haiku, the extras shove the actually-useful retrieved context off to the edge of the window. The model still sees it. It just stops trusting it.&lt;/p&gt;

&lt;p&gt;This is the same finding that recent research on long-context behavior keeps surfacing. Studies on instruction-following at high context fill report that for most frontier models in 2026, quality starts to degrade measurably at 60 to 70 percent context fill, and falls off a cliff around 90 percent. The cliff is steeper for smaller models.&lt;/p&gt;

&lt;p&gt;The Pareto principle applies to context engineering with embarrassing accuracy. RAG is the 20 percent of effort that produces 80 percent of the result. Everything you stack on top of it is the long tail.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 2026 reality I almost forgot to mention
&lt;/h2&gt;

&lt;p&gt;When I ran the original experiment, I was on Sonnet 4 and Haiku 3 with a 200K context window. As of this writing, Sonnet 4.6 has &lt;a href="https://www.anthropic.com/news/claude-sonnet-4-6" rel="noopener noreferrer"&gt;a 1M token context window at standard pricing&lt;/a&gt; and prompt caching cuts the cost of repeated context by 90 percent.&lt;/p&gt;

&lt;p&gt;This changes the math, but not in the direction you might think.&lt;/p&gt;

&lt;p&gt;A 1M context window does not magically make stacked context cheaper to design. The model still has to pay attention to the right thing. The cliff at 60 to 70 percent fill is a percentage, not an absolute. A bigger window just means you can write more bad context before you fall off it.&lt;/p&gt;

&lt;p&gt;Prompt caching helps if your stacked layers are static. The role definition, the few-shot examples, the structured output instructions: those parts cache cleanly. But that only saves money. It does not save quality. If your Haiku result was minus 14%, prompt caching makes minus 14% cheaper. That is not the win you wanted.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing nobody told me about Skills
&lt;/h2&gt;

&lt;p&gt;Anthropic's Skills feature is interesting in this light. Skills are reusable context bundles that load on demand. The right way to think about them is not "more context, all the time" but "the right context, just in time."&lt;/p&gt;

&lt;p&gt;That is the failure mode my Full CE experiment ran into. I was packing every layer into every request. Skills point at the alternative: keep the system prompt small, retrieve the relevant skill, and let the rest stay out of the window. It is the same lesson as RAG, applied one level up. Selective beats throwing everything in.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I do now
&lt;/h2&gt;

&lt;p&gt;If you take only one thing from this post, take this: the order of operations matters more than the number of operations.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build the retrieval first. Get RAG working with a clean corpus, decent embeddings, and a relevance threshold. This is your 80%.&lt;/li&gt;
&lt;li&gt;Run a benchmark. Real benchmark, on real questions, scored by a real rubric. Not vibes.&lt;/li&gt;
&lt;li&gt;Add one layer at a time. Structured output, then hierarchical layout, then role definition. Re-benchmark after each.&lt;/li&gt;
&lt;li&gt;If the score goes down, take that layer out. Do not assume the layer is good and your benchmark is bad. The benchmark is right more often than you think.&lt;/li&gt;
&lt;li&gt;Try the same ladder on a smaller model. The thing that helps Sonnet may hurt Haiku. Knowing which side of the line you are on saves you money.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This sounds obvious. It is not what most teams do. Most teams read a blog post about Context Engineering, add four layers in one weekend, and never measure whether the layers actually helped.&lt;/p&gt;

&lt;h2&gt;
  
  
  The chef and the kitchen, revisited
&lt;/h2&gt;

&lt;p&gt;In an earlier post I wrote that the model is the chef and the context is the kitchen. I want to extend that.&lt;/p&gt;

&lt;p&gt;Adding more context layers is like installing more kitchen equipment. A second oven. A pasta machine. A sous-vide. None of them make the chef worse at cooking pasta. But if the pasta machine takes up the counter space where the chef was chopping vegetables, the dinner gets worse anyway.&lt;/p&gt;

&lt;p&gt;The chef does not need every appliance. The chef needs the right ingredients within reach.&lt;/p&gt;

&lt;p&gt;Before you read the next breathless post about Full Context Engineering and start adding layers, run the experiment. Measure RAG alone. Measure RAG plus one thing. Find the layer that earns its keep, and leave the rest in the catalog.&lt;/p&gt;

&lt;p&gt;The answer is almost always: do RAG well first. Everything else is decoration. Decoration that, on a small model, can flip the sign on your accuracy score and leave you wondering why.&lt;/p&gt;

&lt;p&gt;The next time someone says "Context Engineering," what I want to say back is: please define which 20 percent of context you mean. The other 80 has a good chance of making things worse.&lt;/p&gt;




&lt;p&gt;The full Context Engineering system (five strategies, the RAG benchmarks behind these numbers, MCP server design, and the Agentic RAG implementation) is in &lt;a href="https://kenimoto.dev/books/context-engineering?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=full-ce-rag-12-haiku-14" rel="noopener noreferrer"&gt;Turning LLMs from Liars into Experts: Context Engineering in Practice&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
      <category>contextengineering</category>
    </item>
    <item>
      <title>OpenClaw Hit 250K Stars Faster Than React. I Spent 24 Hours Trying to Like It.</title>
      <dc:creator>Ken Imoto</dc:creator>
      <pubDate>Fri, 22 May 2026 13:00:00 +0000</pubDate>
      <link>https://forem.com/kenimo49/openclaw-hit-250k-stars-faster-than-react-i-spent-24-hours-trying-to-like-it-1pe2</link>
      <guid>https://forem.com/kenimo49/openclaw-hit-250k-stars-faster-than-react-i-spent-24-hours-trying-to-like-it-1pe2</guid>
      <description>&lt;p&gt;I switched my entire dev setup from Claude Code to OpenClaw on a Tuesday morning. By 11am I was googling "how to remove openclaw". By 6pm I had written a SOUL.md file longer than the actual feature I was shipping.&lt;/p&gt;

&lt;p&gt;This post is about that day. About what broke, what didn't, and what 24 hours of working in the terminal agent that is now technically the most-starred open-source project in GitHub history bought me.&lt;/p&gt;

&lt;p&gt;Yes, I am the engineer who wrote about Claude Code Skills three weeks ago and called the workflow pattern "settled for at least a year." Then OpenClaw passed React's all-time star count in 60 days, Peter Steinberger announced he was joining OpenAI to ship agents to everyone, and the launch tweet went past 4 million views. Settled, apparently, was a one-month forecast.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers I had to verify before believing them
&lt;/h2&gt;

&lt;p&gt;Let me get the facts out of the way, because half of what people quote on Twitter about OpenClaw is wrong by a factor of two.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenClaw crossed 250,000 GitHub stars on March 3, 2026, surpassing React for the all-time most-starred software repository&lt;/li&gt;
&lt;li&gt;60 days from launch to 250K. React took roughly a decade&lt;/li&gt;
&lt;li&gt;60K stars in the first 72 hours. That part is the one nobody actually believes the first time&lt;/li&gt;
&lt;li&gt;Peter Steinberger announced on February 14, 2026 that he is joining OpenAI to work on agents, with OpenClaw moving to a foundation to stay open and independent&lt;/li&gt;
&lt;li&gt;One mid-sized refactor session in my test run consumed 920K tokens, which on Claude 4.5 Sonnet billing came out to about USD 8.30&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Hacker News thread when it crossed React was the most upvoted submission of the week. The top comment was "this is either the best thing that happened to dev tools in five years or the most expensive way to learn what &lt;code&gt;--yolo&lt;/code&gt; does."&lt;/p&gt;

&lt;p&gt;It is, somehow, both.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup, and the part I underestimated
&lt;/h2&gt;

&lt;p&gt;Installation took less than a minute.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://get.openclaw.dev | sh
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-...
openclaw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first surprise: OpenClaw asked me which model I wanted as default. I had four serious choices, plus Ollama for local models.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw &lt;span class="nt"&gt;--model&lt;/span&gt; claude-4.5-sonnet
openclaw &lt;span class="nt"&gt;--model&lt;/span&gt; gpt-4o
openclaw &lt;span class="nt"&gt;--model&lt;/span&gt; gemini-2.5-pro
openclaw &lt;span class="nt"&gt;--model&lt;/span&gt; ollama/devstral:24b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code has a backend model. OpenClaw has a backend model dropdown. That is not a small UX difference when you are trying to land a refactor for less than ten dollars.&lt;/p&gt;

&lt;p&gt;The second surprise: when I ran my first command, the agent asked me where the SOUL.md file was. I did not have one. It happily generated a default. The default was generic enough that I closed the session, opened my editor, and started writing my own. That is when the day quietly stopped being a benchmark and started being a personality test.&lt;/p&gt;

&lt;h2&gt;
  
  
  SOUL.md is the part nobody warned me about
&lt;/h2&gt;

&lt;p&gt;Here is the SOUL.md I ended the day with, after rewriting it three times.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# SOUL.md&lt;/span&gt;
You are a senior backend engineer with strong opinions and short patience for code
that talks more than it does.
&lt;span class="p"&gt;
-&lt;/span&gt; Prefer Python over TypeScript when both fit. We're not building a frontend here.
&lt;span class="p"&gt;-&lt;/span&gt; Never add a feature without a test. If the test would take more than 10 minutes
  to write, ask first instead of writing it.
&lt;span class="p"&gt;-&lt;/span&gt; Performance matters but readability matters more. We're a four-person team, not
  Google.
&lt;span class="p"&gt;-&lt;/span&gt; Do not write conversational filler. "Sure, I'll do that" is not output. Output
  is the diff.
&lt;span class="p"&gt;-&lt;/span&gt; When in doubt, ask. Don't guess. Guessing once cost us a weekend.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The thing the docs do not tell you: SOUL.md is not a config file. It is a contract. CLAUDE.md tells Claude Code what the project is. SOUL.md tells OpenClaw who the agent is. They are two different shapes of the same trust problem, and the day I figured that out was the day OpenClaw stopped feeling worse than Claude Code and started feeling different.&lt;/p&gt;

&lt;p&gt;I had a Claude Code session open in another window all day for a sanity check. By 4pm I noticed my CLAUDE.md was 312 lines and my SOUL.md was 14. The SOUL.md was doing more work per line.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gateway, and why my LGPD-anxious teammate cared
&lt;/h2&gt;

&lt;p&gt;OpenClaw routes every LLM call through a local process called the Gateway.&lt;/p&gt;

&lt;p&gt;The Gateway sits on your machine. Your prompts and code do not pass through an OpenClaw-operated cloud relay on the way to Anthropic, OpenAI, or whoever. They go straight from your laptop to the model provider you picked.&lt;/p&gt;

&lt;p&gt;Claude Code does not have an equivalent intermediary, but it also does not need one because Anthropic is the only provider. The moment you have multi-provider support, you either need a relay (vendor lock-in risk) or a local gateway (the OpenClaw choice).&lt;/p&gt;

&lt;p&gt;A teammate of mine who lives in Brazil and spends meaningful time worrying about LGPD compliance pinged me at lunch to ask what the network diagram looked like. He liked what I sent him. That conversation alone might be worth the day.&lt;/p&gt;

&lt;h2&gt;
  
  
  ClawHub vs Claude Code Skills
&lt;/h2&gt;

&lt;p&gt;Claude Code Skills are markdown files plus optional resources, distributed however you distribute markdown. ClawHub is an npm-style package marketplace for OpenClaw skills.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw skills search &lt;span class="s2"&gt;"docker"&lt;/span&gt;
openclaw skills &lt;span class="nb"&gt;install&lt;/span&gt; @clawhub/docker-manager
openclaw skills list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ClawHub had several thousand skills the day I tried it. The numbers Steinberger throws around at conferences are higher and probably accurate, but the count moves fast enough that any specific figure is wrong by the time you publish it.&lt;/p&gt;

&lt;p&gt;Two real differences I felt:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;ClawHub skills are JavaScript. They run in a sandbox but can request shell exec privileges. That makes them more capable than Claude Code Skills and more dangerous. The ClawHavoc incident in March of 2026 saw 341 malicious skills caught, which is a real cost of an open marketplace&lt;/li&gt;
&lt;li&gt;Claude Code Skills are simpler to author. I wrote a Skill in 20 minutes my first time. The equivalent ClawHub skill took me about 90 minutes because I had to learn the SDK conventions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you are an individual developer wanting to share a workflow, Skills are easier. If you are a team wanting a versioned, packaged, audited tool, ClawHub is better. They are not competing for the same problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3pm moment where I almost stopped
&lt;/h2&gt;

&lt;p&gt;I asked OpenClaw to update some Python 3.8 code to 3.11 across a small repo, run the test suite, and report back.&lt;/p&gt;

&lt;p&gt;It did. The session ate 920K tokens, took about 14 minutes, found three places where my colleague had used the walrus operator wrong, and quietly fixed them. I checked the diff. It was right.&lt;/p&gt;

&lt;p&gt;Claude Code does the same thing. I have run the same prompt against it many times.&lt;/p&gt;

&lt;p&gt;The difference was not the output. The difference was that Claude Code is in my muscle memory. I have typed &lt;code&gt;claude&lt;/code&gt; three times a day for a year. When I typed &lt;code&gt;openclaw&lt;/code&gt; and waited the extra 1.2 seconds for the cold start, my fingers reached for &lt;code&gt;claude&lt;/code&gt; instead. Three times.&lt;/p&gt;

&lt;p&gt;That is the part nobody writes about. Switching costs are not just config. They are reflexes. By 3pm I had written half a SOUL.md, almost given up, made coffee, and come back. By 6pm I was OK again.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would actually use each one for
&lt;/h2&gt;

&lt;p&gt;I built this matrix during the second coffee.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;OpenClaw&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Locked into Anthropic models?&lt;/td&gt;
&lt;td&gt;No, multi-provider&lt;/td&gt;
&lt;td&gt;Yes, Anthropic only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local model option&lt;/td&gt;
&lt;td&gt;Ollama&lt;/td&gt;
&lt;td&gt;None official&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skill distribution&lt;/td&gt;
&lt;td&gt;ClawHub package marketplace&lt;/td&gt;
&lt;td&gt;Markdown files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Personality file&lt;/td&gt;
&lt;td&gt;SOUL.md (who is the agent)&lt;/td&gt;
&lt;td&gt;CLAUDE.md (what is the project)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network architecture&lt;/td&gt;
&lt;td&gt;Local Gateway, no relay&lt;/td&gt;
&lt;td&gt;Direct to Anthropic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maturity&lt;/td&gt;
&lt;td&gt;60 days old, foundation forming&lt;/td&gt;
&lt;td&gt;18 months, Anthropic-stable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best at&lt;/td&gt;
&lt;td&gt;Multi-model teams, regulated environments&lt;/td&gt;
&lt;td&gt;Anthropic-first dev shops, simplicity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If your team is Anthropic-only and your CLAUDE.md is already 200 lines, do not switch. Claude Code is fine. The Skills you wrote are still fine. The pattern works.&lt;/p&gt;

&lt;p&gt;If your team is multi-provider, or your compliance team has questions about where prompts travel, or you want a backend model dropdown, OpenClaw is worth a Tuesday.&lt;/p&gt;

&lt;p&gt;I am still on Claude Code as my default. I have OpenClaw aliased to a separate command for the cases where I want to try a different model on the same prompt without paying for two SaaS subscriptions worth of context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this goes next
&lt;/h2&gt;

&lt;p&gt;OpenClaw moving to a foundation while Steinberger joins OpenAI is the part I am watching most closely. Foundations are how open-source projects survive their founders. They are also how projects ossify. The first six months of governance under the OpenClaw Foundation will tell you whether the project is going to be Linux or Helm.&lt;/p&gt;

&lt;p&gt;If you used to argue Claude Code vs Codex was a binary, OpenClaw is the answer that was supposed to be impossible: a third option that is not produced by an LLM lab. The economics of that are interesting. The next twelve months are going to teach us whether neutral, cross-provider, foundation-governed AI tooling is sustainable, or whether it gets quietly absorbed.&lt;/p&gt;

&lt;p&gt;I am betting on sustainable. I have also been wrong about agents in roughly all of the previous quarters, so adjust accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this all costs to know
&lt;/h2&gt;

&lt;p&gt;If you take one thing from my Tuesday, take this. OpenClaw and Claude Code are not competitors. They are two answers to the same question: what should the AI inside your terminal be allowed to do without asking you first? SOUL.md and CLAUDE.md are different shapes of the same trust contract. The team that wrote each chose differently because they had different assumptions about who was sitting in front of the screen.&lt;/p&gt;

&lt;p&gt;The right tool is the one whose assumptions match yours. Pick on assumptions, not stars.&lt;/p&gt;




&lt;p&gt;If you want the harness-engineering frame on this (CLAUDE.md tiers, hooks, sub-agents, how to think about the shell around the prompt, not just the prompt), this is the reference:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kenimoto.dev/books/harness-engineering-guide?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=openclaw-vs-claudecode" rel="noopener noreferrer"&gt;Harness Engineering: A Practitioner's Guide&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Is AI Actually Citing Your Site? How to Measure What Google Rankings Can't</title>
      <dc:creator>Ken Imoto</dc:creator>
      <pubDate>Thu, 21 May 2026 13:00:00 +0000</pubDate>
      <link>https://forem.com/kenimo49/is-ai-actually-citing-your-site-how-to-measure-what-google-rankings-cant-1cnm</link>
      <guid>https://forem.com/kenimo49/is-ai-actually-citing-your-site-how-to-measure-what-google-rankings-cant-1cnm</guid>
      <description>&lt;p&gt;I've spent the past few weeks writing about LLMO: how to get cited by AI search engines, which content structures work, what Princeton's GEO study says about visibility. All useful stuff. One problem: I had no idea whether any of it was actually working.&lt;/p&gt;

&lt;p&gt;I was like a chef who obsesses over recipes but never tastes the food. My Google Search Console was immaculate. My LLMO measurement setup? I was literally typing "does ChatGPT know about my site" into ChatGPT and refreshing the page like a teenager checking if their crush liked their post.&lt;/p&gt;

&lt;p&gt;Measuring LLMO is a genuinely hard problem, and most people aren't doing it at all. Here's what I've built: three measurement layers, from "costs nothing" to "costs you a Saturday afternoon of Python."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fww81qxyfq200ipf30nfu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fww81qxyfq200ipf30nfu.png" alt="SEO KPIs vs LLMO KPIs: ranking position disappears, citation becomes the metric" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Measurement Gap
&lt;/h2&gt;

&lt;p&gt;In SEO, measurement is a solved problem. Google Search Console shows rankings, impressions, clicks, and CTR for free, updated daily. Ahrefs adds backlink data. SEMrush gives you keyword tracking. Everything is visible.&lt;/p&gt;

&lt;p&gt;In LLMO, almost nothing is visible out of the box.&lt;/p&gt;

&lt;p&gt;There's no "AI Search Console." ChatGPT doesn't send a weekly email saying "You were cited 47 times!" Perplexity has no creator dashboard. The shift: SEO had rankings (1st through 100th position). LLMO has a binary outcome. You're either cited or you're not. And nobody is telling you which.&lt;/p&gt;

&lt;p&gt;This gap isn't just an inconvenience. You can't improve what you can't measure, and right now, most content creators are optimizing for AI visibility while flying blind.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsqdm16d4c11twqezz6y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsqdm16d4c11twqezz6y.png" alt="Three layers of LLMO measurement: GA4, manual protocol, Python automation" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: GA4 AI Referral Traffic (Free, 5 Minutes)
&lt;/h2&gt;

&lt;p&gt;The easiest measurement you can set up today is tracking AI referral traffic in Google Analytics 4. When an AI search engine cites your site with a clickable link and someone clicks it, GA4 records the source.&lt;/p&gt;

&lt;p&gt;Here is the regex pattern I use in a custom channel group:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chatgpt\.com|perplexity\.ai|claude\.ai|gemini\.google\.com|copilot\.microsoft\.com|deepseek\.com|you\.com|meta\.ai|poe\.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Go to &lt;strong&gt;Admin → Channel Groups → Create&lt;/strong&gt;, add a new channel with this regex as the session source filter, and name it "AI Search." You'll immediately see aggregated traffic from all AI platforms in one view.&lt;/p&gt;

&lt;p&gt;A few things to know:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatGPT plays nicely.&lt;/strong&gt; Since late 2025, ChatGPT appends &lt;code&gt;utm_source=chatgpt.com&lt;/code&gt; to outbound links. ChatGPT traffic shows up cleanly as &lt;code&gt;chatgpt.com / referral&lt;/code&gt; in GA4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perplexity is decent.&lt;/strong&gt; Traffic appears as &lt;code&gt;perplexity.ai / referral&lt;/code&gt;, though without UTM tags. Still trackable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free-tier ChatGPT is a black hole.&lt;/strong&gt; Free users often don't send referrer data due to privacy settings. Their clicks show up as "Direct," indistinguishable from someone typing your URL manually. Your GA4 numbers are a floor, not a ceiling.&lt;/p&gt;

&lt;p&gt;The conversion story is where this gets interesting. Industry data from 2026 shows AI referral traffic converts at 8-12%, compared to 2-3% for traditional Google organic. People who arrive via AI search have already done their research. The AI did it for them. They are further along in the decision process.&lt;/p&gt;

&lt;p&gt;I started tracking three weeks ago. My AI referral traffic is still small (single digits daily), but the conversion rate is 3x my organic average. Small sample, but a signal worth watching.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 2: The "Ask Five AIs" Protocol (Free, 30 Min/Month)
&lt;/h2&gt;

&lt;p&gt;GA4 tells you who clicked through. It does not tell you whether AI is &lt;em&gt;mentioning&lt;/em&gt; you without linking, or whether it is mentioning you at all.&lt;/p&gt;

&lt;p&gt;For that, you need to ask directly. I run this on the first Monday of every month:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1.&lt;/strong&gt; Write 10-15 prompts related to your niche. Mine include "What are the best resources for AI search optimization?", "How do I get my site cited by ChatGPT?", and "LLMO vs SEO differences."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2.&lt;/strong&gt; Run each prompt on five platforms: ChatGPT, Perplexity, Gemini, Claude, and Copilot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3.&lt;/strong&gt; Record four things per prompt per platform:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mentioned? (Yes / No)&lt;/li&gt;
&lt;li&gt;Context (recommendation / comparison / neutral / negative)&lt;/li&gt;
&lt;li&gt;Accuracy of information&lt;/li&gt;
&lt;li&gt;URL provided?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 4.&lt;/strong&gt; Calculate your citation rate. 15 prompts x 5 platforms = 75 checks. Mentioned 20 times? That's 26.7%.&lt;/p&gt;

&lt;p&gt;This takes about 30 minutes with a spreadsheet. It's manual and tedious, and also the most reliable method that exists today. Automated tools can approximate this, but they can't replicate the nuance of "was that mention positive or just a passing reference?"&lt;/p&gt;

&lt;p&gt;One caveat: LLM responses are non-deterministic. The same prompt can produce different answers on different days. A single check isn't statistically significant. That is why I track the monthly trend, not individual data points. Three months of data starts showing real patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 3: Automate It With Python (One Saturday)
&lt;/h2&gt;

&lt;p&gt;If you're an engineer, you can automate the manual protocol with API calls. Hit the OpenAI and Anthropic APIs with your query set, check whether your brand appears in the response, and log results as a time series.&lt;/p&gt;

&lt;p&gt;The core logic is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;BRAND_VARIANTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-site.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your Brand&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yourbrand&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;CHECK_QUERIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Best tools for [your category]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How to solve [problem you address]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Your brand] vs [competitor]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
    &lt;span class="n"&gt;mentioned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;BRAND_VARIANTS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ChatGPT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mentioned&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mentioned&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Extend this for Claude and Perplexity, run weekly via cron, dump to CSV. You get a time series of your AI visibility score for about $0.50/week.&lt;/p&gt;

&lt;p&gt;The payoff: instead of "I think LLMO is working," you can say "my visibility went from 12% to 28% after I added structured data." Numbers beat feelings.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Available in May 2026
&lt;/h2&gt;

&lt;p&gt;If building your own tools isn't your thing, several commercial platforms now track AI citations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Otterly.ai&lt;/strong&gt; is the fastest-growing option, with 10,000+ users since launching in October 2024. It monitors your brand across ChatGPT, Perplexity, Google AI Overviews, and Copilot. Keyword-level citation tracking, competitor benchmarking, and clean dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Profound&lt;/strong&gt; sits at the enterprise end. Their published case study with Ramp, where they went from 3.2% to 22.2% AI visibility in one month, is the kind of result that gets budget approved. If you're a larger organization, this is where you'll probably land.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Peec AI&lt;/strong&gt; focuses on brand mention analysis across LLM outputs. Beyond whether you're cited, it tracks how: what sentiment surrounds your mentions, which prompt patterns trigger citations.&lt;/p&gt;

&lt;p&gt;My honest take: for individual creators and small teams, the manual protocol plus a basic Python script gives you 80% of the insight at 0% of the cost. Commercial tools become worthwhile when you're tracking dozens of keywords across multiple brands and need team dashboards.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Crawler Signal You're Probably Ignoring
&lt;/h2&gt;

&lt;p&gt;Here's a measurement angle most people miss: AI crawler logs.&lt;/p&gt;

&lt;p&gt;Your server access logs already record which AI systems are visiting your content. GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot, Google-Extended. They all identify themselves in the User-Agent string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"GPTBot|ClaudeBot|PerplexityBot|Google-Extended"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  /var/log/nginx/access.log | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $7}'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; | &lt;span class="nb"&gt;uniq&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pages that get crawled frequently are more likely to appear in AI responses. Pages that never get crawled are invisible. It is an indirect signal, but useful for finding content that AI systems are skipping entirely.&lt;/p&gt;

&lt;p&gt;I checked my own logs and found that &lt;code&gt;/blog/&lt;/code&gt; pages get crawled 15x more than my &lt;code&gt;/about/&lt;/code&gt; page. Not shocking, but the gap was wider than I expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Measurement Habit
&lt;/h2&gt;

&lt;p&gt;Measurement without action is just data hoarding. Here is the cycle I run:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weekly (10 min):&lt;/strong&gt; Check GA4 AI referral dashboard. Note spikes or drops. Compare week-over-week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monthly (30 min):&lt;/strong&gt; Run the five-platform manual protocol. Calculate citation rate. Scan crawler logs for new patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quarterly (1 hour):&lt;/strong&gt; Full review. Update query set. Compare citation rate trends. Check whether content changes produced measurable results.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://llmoframework.com" rel="noopener noreferrer"&gt;LLMO Framework&lt;/a&gt; provides a structured approach to KPI design if you want a more formal methodology. I reference it when deciding which metrics matter most at different growth stages.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Punchline
&lt;/h2&gt;

&lt;p&gt;I started measuring my LLMO visibility three weeks ago. My citation rate across five platforms is 14%. Not great. Not terrible. But the important part is that I &lt;em&gt;know&lt;/em&gt; the number, and three months from now I'll know whether it went up or down.&lt;/p&gt;

&lt;p&gt;The SEO world figured out measurement twenty years ago. The LLMO world is still in its "checking rankings by Googling yourself" era. The people who build measurement infrastructure now will have a compounding advantage over those who keep guessing.&lt;/p&gt;

&lt;p&gt;If you're still typing your brand name into ChatGPT and squinting at the output, I get it. I was doing the same thing last month. But now I have a spreadsheet, a cron job, and a regex filter in GA4. Less romantic, more informative. I'll take that trade.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.1clickreport.com/blog/track-ai-traffic-ga4-chatgpt-perplexity-claude" rel="noopener noreferrer"&gt;How to Track AI Traffic in GA4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ziptie.dev/blog/best-llmo-tools/" rel="noopener noreferrer"&gt;Best LLMO Tools in 2026&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2311.09735" rel="noopener noreferrer"&gt;GEO: Generative Engine Optimization&lt;/a&gt; by Aggarwal et al., Princeton / ACM SIGKDD 2024&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://llmoframework.com" rel="noopener noreferrer"&gt;LLMO Framework&lt;/a&gt;: KPI design and implementation guide&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;The full playbook (llms.txt patterns, JSON-LD examples, citation-rate KPIs, and ChatGPT/Perplexity/Brave comparison) is in &lt;a href="https://kenimoto.dev/books/llmo-ai-search-optimization?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=measure-ai-citations-kpi" rel="noopener noreferrer"&gt;LLMO Practical Guide: Why ChatGPT Ignores Your Website&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llmo</category>
      <category>seo</category>
      <category>analytics</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Audited 30 llms.txt Files in the Wild. 5 Anti-Patterns Are Already Forming.</title>
      <dc:creator>Ken Imoto</dc:creator>
      <pubDate>Wed, 20 May 2026 13:05:01 +0000</pubDate>
      <link>https://forem.com/kenimo49/i-audited-30-llmstxt-files-in-the-wild-5-anti-patterns-are-already-forming-18h4</link>
      <guid>https://forem.com/kenimo49/i-audited-30-llmstxt-files-in-the-wild-5-anti-patterns-are-already-forming-18h4</guid>
      <description>&lt;p&gt;I shipped my third llms.txt this month and felt extremely productive. The kind of productive where you close the laptop, pour a coffee, and feel like the entire AI-search problem is now solved on a personal level.&lt;/p&gt;

&lt;p&gt;Then I opened 30 production llms.txt files from the companies the rest of us are supposed to be learning from. Anthropic. Stripe. Vercel. Cloudflare. Hugging Face. Mintlify. Astro. Linear. The names you cite when you tell someone "look, the serious players are doing it."&lt;/p&gt;

&lt;p&gt;24 of the 30 files had at least one of five problems. Three of those problems were in my own files.&lt;/p&gt;

&lt;p&gt;That coffee got cold.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I ran the audit
&lt;/h2&gt;

&lt;p&gt;The setup was embarrassingly simple. I picked 30 domains that have public llms.txt files and matter to developers in 2026: AI labs, infra companies, popular dev tools. I &lt;code&gt;curl&lt;/code&gt;ed each one. I read each one with the eyes of an LLM trying to use it. I logged what was wrong.&lt;/p&gt;

&lt;p&gt;This isn't science. It's a Monday evening with a terminal open. But the patterns showed up so fast that I stopped at 30. The next ten would have been more of the same.&lt;/p&gt;

&lt;p&gt;A March 2026 &lt;a href="https://seranking.com/blog/llms-txt/" rel="noopener noreferrer"&gt;SE Ranking study of 300,000 domains&lt;/a&gt; found roughly 10% adoption. The &lt;a href="https://codersera.com/blog/llms-txt-complete-guide-2026/" rel="noopener noreferrer"&gt;codersera May 2026 guide&lt;/a&gt; puts the number around 844,000 sites with 500% YoY growth. The standard is winning the adoption race. It is losing the quality race.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five anti-patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Anti-Pattern 1: "Dump everything"
&lt;/h3&gt;

&lt;p&gt;This is the most common failure and the one I am most guilty of. The author treats llms.txt as a second sitemap. 800 links. 1,200 links. One file I opened had every blog post since 2019, flat, no priority, no grouping.&lt;/p&gt;

&lt;p&gt;The whole point of llms.txt is that sitemap.xml already exists. When the spec says "10KB recommended" it is not being cute about file size. It is saying: if the LLM cannot read all of this inside a context window with budget left for the actual question, you have not helped, you have moved the problem.&lt;/p&gt;

&lt;p&gt;The fix is brutal: pick 10 to 20 links. Not 50. Not "key sections plus a few extras." Ten to twenty. Everything else goes in &lt;code&gt;## Optional&lt;/code&gt; or stays in sitemap.xml.&lt;/p&gt;

&lt;p&gt;If you are a docs-heavy product, use the pattern Cloudflare ships: a slim root llms.txt that links to per-product llms.txt files. Each one stays under the budget. An agent fetches only the one it needs. No one reads the entire encyclopedia to fix a faucet.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-Pattern 2: "Contradicts robots.txt"
&lt;/h3&gt;

&lt;p&gt;Open the robots.txt. Open the llms.txt. Diff the paths. About a third of the files I audited list URLs in llms.txt that are explicitly &lt;code&gt;Disallow&lt;/code&gt;ed in robots.txt for the very crawlers most likely to read llms.txt.&lt;/p&gt;

&lt;p&gt;The most painful example: a docs site that blocks &lt;code&gt;GPTBot&lt;/code&gt; and &lt;code&gt;ClaudeBot&lt;/code&gt; from &lt;code&gt;/docs/&lt;/code&gt; in robots.txt, then lists 40 &lt;code&gt;/docs/*&lt;/code&gt; URLs in llms.txt. The file says "here is what matters." The robots.txt says "you cannot have it." The crawler obeys robots.txt. The llms.txt is decorative.&lt;/p&gt;

&lt;p&gt;This usually happens when the two files are owned by different teams (or by the same person across two different months). The fix is a five-minute review with both files open: every URL in llms.txt must be allowed in robots.txt for every AI crawler you actually want reading it.&lt;/p&gt;

&lt;p&gt;If you genuinely want to block AI crawlers, fine, but then do not also write them a polite directory of your favorite pages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-Pattern 3: "HTML links only, no .md"
&lt;/h3&gt;

&lt;p&gt;Jeremy Howard's original proposal includes a clever convention: any URL appended with &lt;code&gt;.md&lt;/code&gt; should return a clean Markdown version of the page, no nav, no ads, no JavaScript bundle. The &lt;code&gt;.html.md&lt;/code&gt; pattern.&lt;/p&gt;

&lt;p&gt;Almost nobody does it. In my 30 files, only 6 served any &lt;code&gt;.md&lt;/code&gt; companion at all. The other 24 hand the LLM a link to an HTML page that the LLM cannot parse cleanly because the crawler does not execute JavaScript.&lt;/p&gt;

&lt;p&gt;Stripe does this well: every docs URL has a &lt;code&gt;.md&lt;/code&gt; twin and llms.txt points at the &lt;code&gt;.md&lt;/code&gt; version. The &lt;a href="https://llmoframework.com" rel="noopener noreferrer"&gt;llmoframework.com reference templates&lt;/a&gt; section calls this out as the single highest-leverage thing most teams are skipping, because it is the difference between "AI can find the page" and "AI can actually read what is on the page."&lt;/p&gt;

&lt;p&gt;The fix depends on your stack. For Astro and Next.js, generating &lt;code&gt;.md&lt;/code&gt; versions at build time is a 30-line change. For dynamic CMS sites, an edge function that returns a markdown serialization on the &lt;code&gt;.md&lt;/code&gt; suffix is the move. Either way, this is the anti-pattern with the largest delta between effort and outcome.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-Pattern 4: "About page theatre"
&lt;/h3&gt;

&lt;p&gt;Eight of the 30 files used the entire body of the file as a marketing pitch. Three paragraphs about the company's mission. A founder quote. The history of the brand. Then two links. Total content: "we are visionary leaders in the AI-native space."&lt;/p&gt;

&lt;p&gt;LLMs do not buy your vibe. They need pointers to content. The H1 plus blockquote summary is the place for "what is this site." Everything below should be links to specific pages with specific descriptions. If your llms.txt reads like a homepage, you wrote a homepage.&lt;/p&gt;

&lt;p&gt;The same logic GEO research is pushing on the content side, "vague claims do not get cited, specific claims with sources do", applies to llms.txt itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-Pattern 5: "Frozen in 2024"
&lt;/h3&gt;

&lt;p&gt;Five of the files I audited had visible signs of being shipped once and never touched again. Links to pages that 404. Product names that no longer exist. Dates that put the file's last meaningful update in 2024, back when llms.txt was a six-month-old proposal and "AI search" was something Perplexity was still explaining to people.&lt;/p&gt;

&lt;p&gt;Sitemap.xml is auto-generated. robots.txt rarely changes. llms.txt sits in an uncanny middle: hand-curated like documentation, but with the same staleness risk as a README that says "we use Yarn" when you migrated to pnpm last year.&lt;/p&gt;

&lt;p&gt;The fix is automation, not discipline. Add a CI check that flags 404s in the URLs your llms.txt lists. Regenerate the "featured articles" section from your analytics every quarter. Treat the file like a config artifact, not a one-off launch deliverable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.mintlify.com/blog/real-llms-txt-examples" rel="noopener noreferrer"&gt;Mintlify's analysis of real llms.txt examples&lt;/a&gt; flagged this as the second-biggest pattern they saw across the customer base. The first was Anti-Pattern 1. So those are the two to fix this week.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three I shipped myself
&lt;/h2&gt;

&lt;p&gt;Honesty section. Of my three llms.txt files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One had 47 links in it. Anti-pattern 1.&lt;/li&gt;
&lt;li&gt;One pointed at HTML-only URLs because I had not set up the &lt;code&gt;.md&lt;/code&gt; companion yet. Anti-pattern 3.&lt;/li&gt;
&lt;li&gt;One had not been updated in 4 months and listed a post under a slug I had since renamed. Anti-patterns 5 and a 301-redirect chain for dessert.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I did not catch any of this until I was three quarters of the way through reading other people's files. The audit was supposed to be about them. It ended up being about me. There is probably a lesson in there, but I am still in the embarrassment phase.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed after I fixed two
&lt;/h2&gt;

&lt;p&gt;I fixed two of them. The 47-link file went to 16 links plus an &lt;code&gt;## Optional&lt;/code&gt; section. The HTML-only file got &lt;code&gt;.md&lt;/code&gt; twins for the 16 featured URLs via a build-time hook (Astro made this easier than I expected, about 25 lines).&lt;/p&gt;

&lt;p&gt;I cannot tell you "AI citations jumped by X%" because the file is one week old and citation measurement at this volume is noisy. What I can tell you is the file now passes a smell test I should have applied from day one: would a model with a 200K context window and ten other tabs open prefer this file over the previous version? Yes. Obviously yes. The previous version was unreadable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest position on llms.txt
&lt;/h2&gt;

&lt;p&gt;The skeptics are partly right. SE Ranking's 300K-domain study did not find a measurable citation lift. The major LLMs do not publicly confirm they fetch the file. The standard has no W3C stamp.&lt;/p&gt;

&lt;p&gt;The skeptics are also partly wrong. IDE agents (Cursor, Cline, Continue), the major AI search engines (ChatGPT search, Perplexity, You.com), and a growing list of MCP integrations read llms.txt today. The optionality is real and the cost is fifteen minutes.&lt;/p&gt;

&lt;p&gt;The actual question for 2026 is not "should I ship an llms.txt." That question is settled by the cost-benefit math. The question is whether the file you ship gives an LLM something useful or trains it to ignore your domain. Anti-patterns 1 through 5 are the difference between those two outcomes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do this week
&lt;/h2&gt;

&lt;p&gt;If you have not shipped one yet, the basics are straightforward (H1 site name, blockquote summary, prioritized link list). If you have shipped one, run it through the five-question audit:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Is it under 10KB and under 20 links (excluding &lt;code&gt;## Optional&lt;/code&gt;)?&lt;/li&gt;
&lt;li&gt;Do all listed URLs pass robots.txt for GPTBot and ClaudeBot?&lt;/li&gt;
&lt;li&gt;Do at least the top 5 URLs have a &lt;code&gt;.md&lt;/code&gt; companion?&lt;/li&gt;
&lt;li&gt;Does the body link to specific pages, not generic marketing copy?&lt;/li&gt;
&lt;li&gt;Was it updated in the last 90 days?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you score 5 out of 5, you are in the top 6 of the 30 sites I looked at, which is to say the top 20% of an already-self-selected sample. If you score 3 or below, you have the same Monday afternoon ahead of you that I did.&lt;/p&gt;

&lt;p&gt;I am writing my fourth llms.txt this week. I will run it through this list before I publish. I will not feel productive afterwards. I will feel like someone who learned the same lesson three audits in a row.&lt;/p&gt;

&lt;p&gt;That, I am told, is how engineering works.&lt;/p&gt;




&lt;p&gt;If you want the full LLMO playbook, beyond llms.txt and into JSON-LD, robots.txt strategy, content design, and measurement:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kenimoto.dev/books/llmo-ai-search-optimization?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=30-llmstxt-audit" rel="noopener noreferrer"&gt;LLMO: AI Search Optimization for Engineers&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llmo</category>
      <category>ai</category>
      <category>seo</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I Asked 3 Claude Code Sub-agents to Review the Same PR. They Disagreed on 41% of the Comments.</title>
      <dc:creator>Ken Imoto</dc:creator>
      <pubDate>Wed, 20 May 2026 13:00:00 +0000</pubDate>
      <link>https://forem.com/kenimo49/i-asked-3-claude-code-sub-agents-to-review-the-same-pr-they-disagreed-on-41-of-the-comments-2m57</link>
      <guid>https://forem.com/kenimo49/i-asked-3-claude-code-sub-agents-to-review-the-same-pr-they-disagreed-on-41-of-the-comments-2m57</guid>
      <description>&lt;p&gt;I thought multi-agent code review was a free upgrade. Three sub-agents looking at the same PR sounded like three pairs of eyes for the cost of one engineer's coffee.&lt;/p&gt;

&lt;p&gt;Then I ran three Claude Code sub-agents on the same 500-line refactor PR and watched them disagree on 41% of the comments. The merge took an hour I had budgeted for fifteen minutes. Brooks's Law is alive in 2026, and apparently it scales down to agents.&lt;/p&gt;

&lt;p&gt;Anthropic &lt;a href="https://claude.com/blog/code-review" rel="noopener noreferrer"&gt;announced in March&lt;/a&gt; that fewer than 1% of their internal code-review findings get marked incorrect by engineers. That number is real, and it is also a stat from people running one tightly-tuned pipeline on their own codebase. As soon as I stood up my own three sub-agents on my own repo, "agree" stopped meaning what I thought it meant.&lt;/p&gt;

&lt;p&gt;This is the experiment. What I set up, what I measured, and what I now actually believe about parallel sub-agent review.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;The PR was a 500-line refactor of a WebRTC signaling layer in one of my side projects. Eight files, mostly TypeScript, a couple of config tweaks, one new error type. Boring enough to not be a stunt PR, complex enough that a single reviewer would miss things.&lt;/p&gt;

&lt;p&gt;Three sub-agents, all defined under &lt;code&gt;.claude/agents/&lt;/code&gt;, all using Sonnet 4.6, each restricted to read-only tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;explore-reviewer&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Trace&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;callers,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;dependents,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;dead&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;paths."&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sonnet&lt;/span&gt;
&lt;span class="na"&gt;allowed-tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Read Grep Glob&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

You are a code archaeologist. For each changed file, find every caller,
every test that references it, and any path that goes silent after the change.
Report concrete file:line citations. No style opinions.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;security-reviewer&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Look for auth, validation, and secret-handling regressions.&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sonnet&lt;/span&gt;
&lt;span class="na"&gt;allowed-tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Read Grep Glob WebSearch&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

You are a security reviewer. Focus only on auth flows, input validation,
secret handling, and dependency risks. Estimate CVSS for each finding.
Ignore style and architecture.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;plan-architect&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Assess design decisions against existing conventions.&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sonnet&lt;/span&gt;
&lt;span class="na"&gt;allowed-tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Read Grep Glob&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

You are a software architect. Compare the PR's design choices against the
existing conventions in this codebase. Flag drift, missing seams, and
abstractions that will hurt the next person.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each sub-agent got the same prompt: "Review PR #482 line by line and list findings as bullets with file:line citations." Each ran in its own context. None of them saw each other's output. I was the only one stitching results together at the end.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjlp2dgzn0bw03pucthl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjlp2dgzn0bw03pucthl.png" alt="Three Claude Code sub-agents reviewing the same PR" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What 41% disagreement actually looked like
&lt;/h2&gt;

&lt;p&gt;After all three finished, I had 78 raw comments total. I sat down with a spreadsheet and tagged each one as "raised by 3", "raised by 2", or "raised by 1".&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Coverage&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Share&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;All 3 agents flagged it&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;18%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 of 3 agents flagged it&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;41%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Only 1 agent flagged it&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;41%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The "raised by 1" bucket is what I'm calling disagreement. Two other sub-agents had every opportunity to flag the same line, with the same tools, on the same diff. They walked past it. That is a 41% chance that any individual finding is one sub-agent's private opinion.&lt;/p&gt;

&lt;p&gt;The headline Anthropic number — less than 1% marked incorrect — is measured differently. They count findings that an engineer explicitly closes without fixing. I'm counting findings that two of three agents looking at the same code never bothered to mention. Those are different questions, and the second one is the one that costs me time at the keyboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four disagreement patterns
&lt;/h2&gt;

&lt;p&gt;After classifying every disagreement, four patterns covered almost all of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Severity drift.&lt;/strong&gt; The plan-architect flagged a missing null check as "critical". The security-reviewer noted the same line and called it "low — caller already validates upstream". Both were right, sort of. The architect was reading the function in isolation. The security reviewer had grep-walked the callers and seen the upstream check. Same line, opposite verdicts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scope drift.&lt;/strong&gt; Asked to review the PR, the explore-reviewer happily told me about three pre-existing bugs in files the PR did not touch. The plan-architect refused to comment on anything outside the diff. I had no way to know in advance which behavior I would get. Strictly speaking, both interpretations are defensible. Practically speaking, one of them blew up my comment count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concreteness drift.&lt;/strong&gt; The plan-architect wrote: "Consider extracting the retry logic into a shared helper." The security-reviewer wrote: "Replace lines 184-201 with &lt;code&gt;retry(opts, () =&amp;gt; fetchToken(opts.url))&lt;/code&gt; and add a 30s ceiling, otherwise the auth-refresh path can hang the worker." Same idea. One I could apply in thirty seconds, the other I needed to spend a meeting on. Concreteness is a wildly larger axis of variance than I expected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool-budget drift.&lt;/strong&gt; The explore-reviewer had grep and glob, and noticed that the renamed function was still referenced in a CI script nobody had updated. The plan-architect, with the same tools, never looked there. Same allowed-tools list, same prompt about "find dependents". One walked the surface, one walked the building. Drift here came down to how aggressively each system prompt told the agent to roam.&lt;/p&gt;

&lt;p&gt;If you have used Claude Code &lt;a href="https://code.claude.com/docs/en/sub-agents" rel="noopener noreferrer"&gt;sub-agents&lt;/a&gt; for anything beyond a one-off Explore call, none of this is shocking. What was shocking, for me, was how cleanly the four buckets carved up almost every disagreement I tagged.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bug nobody caught
&lt;/h2&gt;

&lt;p&gt;Two days after I merged, a colleague found a race condition in the new error-handling path. The PR introduced a one-frame window where two reconnect attempts could fire on the same socket. None of the three sub-agents mentioned it. The pull-request description, which I had written by hand, did mention "reconnect logic moved", which is what made my colleague go look.&lt;/p&gt;

&lt;p&gt;"Given enough eyeballs, all bugs are shallow," Eric Raymond wrote in 1999. He was right about eyeballs. He did not specify that three of them needed to be aimed at the same window. Mine were all squinting at the diff. None of them stepped back and asked: what changed about timing?&lt;/p&gt;

&lt;h2&gt;
  
  
  The hour I lost to merging
&lt;/h2&gt;

&lt;p&gt;The actual merging of the three reports was the part I had not budgeted for.&lt;/p&gt;

&lt;p&gt;For each "2 of 3" or "1 of 3" finding, I had to decide:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Is this real or is it a context gap I can close with one grep?&lt;/li&gt;
&lt;li&gt;If real, is the severity from agent A right, or the severity from agent B right?&lt;/li&gt;
&lt;li&gt;If a fix is suggested, is the concrete one safe to apply, or do I need to push back to the abstract version?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That last question alone took me three coffee refills. Two sub-agents had told me to "extract a shared helper". One had given me a specific helper. I had to read the diff a third time, by hand, to figure out whether the specific helper was actually the right shape. It wasn't. I ended up writing a fourth version.&lt;/p&gt;

&lt;p&gt;Brooks's Law was about communication overhead between humans on a late project. I am now convinced it generalizes to "any time you put N independent perspectives on the same artifact, your N+1 reviewer is the integrator, and the integrator's hour goes up roughly linearly in N." Three sub-agents felt like 3x the eyes. They were also 3x the integration cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  How many sub-agents is the right number
&lt;/h2&gt;

&lt;p&gt;I do not think the answer is one. After the same week I ran the experiment with N=3, I tried N=1 on a smaller PR — just a single general-purpose review pass. It missed the kind of cross-file dependency that the explore-reviewer would have caught. One pair of eyes is genuinely worse than two.&lt;/p&gt;

&lt;p&gt;My current heuristic, after maybe a dozen PRs of this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tiny PR (&amp;lt;100 lines, no new files): one sub-agent. Anything more is overhead.&lt;/li&gt;
&lt;li&gt;Medium PR (100-500 lines, touches one subsystem): two sub-agents with different angles, usually explore + security or explore + architect. Pick the second to match what the PR is actually risking.&lt;/li&gt;
&lt;li&gt;Large or cross-cutting PR (500+ lines, multiple subsystems): three. Plan the integration time in advance. It is not free.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Above three, I have not seen the value. HAMY's &lt;a href="https://hamy.xyz/blog/2026-02_code-reviews-claude-subagents" rel="noopener noreferrer"&gt;nine-agent setup&lt;/a&gt; is interesting, but I would want a second tool just to merge the reports, and I would want it to be cheaper than me.&lt;/p&gt;

&lt;p&gt;The other knob is concreteness. I now ask each sub-agent for findings "with the smallest concrete change that fixes them, or marked as no-fix if you don't know". That single line in the system prompt collapsed about half of my concreteness drift.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually believe now
&lt;/h2&gt;

&lt;p&gt;Multi-agent code review is not free. It is closer to "three junior reviewers reading in different rooms, and you are the senior who has to merge their notes." The eye count goes up, but so does the integration cost, and the integration cost is the part that lives in your calendar.&lt;/p&gt;

&lt;p&gt;The bug nobody caught is the part that humbled me most. Three agents, three angles, all read-only, all aimed at the same diff. None of them noticed the timing change because none of them were asked to. Sub-agents are extremely good at the questions you put in their system prompt. They are mediocre at the questions you forgot to ask. That is the actual limit, not the model.&lt;/p&gt;

&lt;p&gt;If you take one thing from this: write a fourth sub-agent prompt called &lt;code&gt;what-am-i-not-asking&lt;/code&gt;, give it your diff, and ask it to nominate the categories your other agents will miss. Then read its answer. Then write the real review prompts. I did not do this for the experiment in this post, which is exactly why I lost an hour at merge time and a colleague found my race condition.&lt;/p&gt;

&lt;p&gt;Anthropic's less-than-1% number is real. It is also measured on a pipeline that someone spent months tuning, not on three sub-agents you wrote between meetings. Tune yours. Until then, expect 40%.&lt;/p&gt;




&lt;p&gt;The deeper version covering sub-agent design, custom agent patterns, and the full Claude Code workflow lives in &lt;a href="https://kenimoto.dev/books/harness-engineering-guide?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=3subagents-40-disagree" rel="noopener noreferrer"&gt;Harness Engineering: From Using AI to Controlling AI&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>codereview</category>
      <category>agents</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Caught Claude Hiding My Bug 3 Times in a Row. Then I Turned 10 Debugging Habits Into Prompts.</title>
      <dc:creator>Ken Imoto</dc:creator>
      <pubDate>Tue, 19 May 2026 13:00:01 +0000</pubDate>
      <link>https://forem.com/kenimo49/i-caught-claude-hiding-my-bug-3-times-in-a-row-then-i-turned-10-debugging-habits-into-prompts-13a4</link>
      <guid>https://forem.com/kenimo49/i-caught-claude-hiding-my-bug-3-times-in-a-row-then-i-turned-10-debugging-habits-into-prompts-13a4</guid>
      <description>&lt;p&gt;I asked Claude to fix a 500 error from one of my API endpoints. First attempt: it wrapped the call in try-catch and logged the error. Second attempt: it added a default return value so the caller would not blow up. Third attempt: it added a retry with exponential backoff.&lt;/p&gt;

&lt;p&gt;The 500 stopped. I shipped the third "fix" with full confidence. Two hours later, prod woke up the on-call. The same incident had moved to a different endpoint that shared the same database client. The actual cause was connection pool exhaustion. Claude was not fixing the bug. It was hiding it three different ways.&lt;/p&gt;

&lt;p&gt;This is the story of how I turned 10 debugging habits into prompt templates so Claude cannot pull that on me anymore. There are also two file types you can hand it once and never touch again: a CLAUDE.md block and two hook configs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4t1ymdllpfeg29n3gwce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4t1ymdllpfeg29n3gwce.png" alt="Three hidden-bug fixes (try-catch, default return, retry) all suppressed the 500 while connection pool exhaustion stayed unfixed." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3 "fixes" that almost shipped
&lt;/h2&gt;

&lt;p&gt;Each of the three attempts looked correct in isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 1, try-catch.&lt;/strong&gt; The handler now caught the exception, logged it, and returned a 500 to the user. From the API's point of view, this was an improvement. From the bug's point of view, the connection that triggered the error was still leaked back into the pool in a broken state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 2, default return value.&lt;/strong&gt; The function now returned an empty list instead of raising. The 500 was gone from this endpoint. The data inconsistency that the empty list created flowed downstream into a cache and stayed there for an hour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 3, retry with exponential backoff.&lt;/strong&gt; Three retries, each opening a new connection. The pool got drained faster. The 500 disappeared on this endpoint because the user-facing call now succeeded on attempt 2 or 3. Other endpoints, sharing the same pool, started timing out instead.&lt;/p&gt;

&lt;p&gt;In all three cases, the symptom went away on the endpoint I asked about. The cause moved. I had asked Claude to debug, but I had given it no rule against suppressing the symptom, so it suppressed the symptom, because that is what the next-token prediction wants to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AI defaults to symptom suppression
&lt;/h2&gt;

&lt;p&gt;The 2025 Stack Overflow Developer Survey reported that around 80% of professional developers were using or planning to use AI tools, and the share who actually trusted those tools' output had dropped year over year. The follow-up coverage I have read since then keeps coming back to the same complaint: AI-generated code clusters bugs around logic errors and I/O handling, at a rate that is meaningfully higher than human-written code at the same level of seniority. The figure I have seen cited most often is roughly 1.7x bug density, though different studies measure it differently and you should check your own commit history before quoting any single number.&lt;/p&gt;

&lt;p&gt;The mechanism is not mysterious. A large language model predicts the next most plausible token given the context. "Error handling pattern" is one of the most over-represented things in its training data. Try-catch, null-check, default return, retry: these are statistically the kinds of edits that appear when someone says "fix this error" in a public repo. The model is doing exactly what it was trained to do.&lt;/p&gt;

&lt;p&gt;What is missing is a different kind of token. "I do not yet know the root cause. Continue investigation." That sentence is rare in training data because humans rarely commit it. We commit the fix, not the not-yet-found-it. So the model never learned to default to "keep looking."&lt;/p&gt;

&lt;p&gt;You have to put that token in for it. That is what the next section is for.&lt;/p&gt;

&lt;h2&gt;
  
  
  10 debugging habits → 10 prompt templates
&lt;/h2&gt;

&lt;p&gt;Each of these maps to a classic debugging habit. Each one is a sentence I now paste into the prompt or the CLAUDE.md, depending on how permanent I want it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvh0crp5f68h9gcu4y5zc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvh0crp5f68h9gcu4y5zc.png" alt="10 debugging habits mapped to 10 prompt fragments: assumption, reproduction, boundary, diff, timeline, retry, amplification, instrumentation, simplification, intentional break." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Doubt the inputs.&lt;/strong&gt; "Before proposing a fix, confirm the logs you're reading are complete and the monitoring you're trusting actually reports the state you think it reports." This is the one Claude skips most. It will happily diagnose from a log file that is half-rotated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Reproduce before fixing.&lt;/strong&gt; "Reproduce the bug locally and show me the minimum steps. If you cannot reproduce it, say so explicitly and stop." The "stop" is doing the work. It shuts the door on guessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Find the boundary.&lt;/strong&gt; "Identify the boundary between working and broken behavior. Which component is the last one that returns correct data?" This pushes the model away from line-by-line guesses and toward layer-by-layer narrowing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Diff against a known-good state.&lt;/strong&gt; "Compare the current code to the last known working state. Run &lt;code&gt;git log --oneline -20&lt;/code&gt; and identify any change that could plausibly correlate with the failure window." This is the prompt that surfaces the commit no one remembered making.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Build a timeline.&lt;/strong&gt; "When did this start failing? Is it sudden or gradual? Map error rate against deploy times, traffic spikes, and config changes." Sudden + correlated to deploy is one bug. Gradual + uncorrelated is a different bug entirely. Conflating them is how three "fixes" stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Audit retries, caches, and timeouts.&lt;/strong&gt; "List every retry, cache, and timeout on the path. For each one, describe what happens when the underlying call is slow but not failed." This is the one that would have caught my pool exhaustion on the first pass.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Watch for amplification.&lt;/strong&gt; "Is there a path where a small error gets multiplied? A failed call that triggers three retries, each opening a new connection, each adding latency to the next?" If your retry storm hides inside an autoscaler, you also get an instance storm.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Add instrumentation, don't guess.&lt;/strong&gt; "If you don't have enough observation to identify the cause, propose the specific log lines or traces to add. Do not propose a fix yet." This converts "I don't know" into "here is what to measure," which is a much more useful answer than a fake fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. Simplify the suspect.&lt;/strong&gt; "Remove non-essential components from the failing path until the bug is reproducible in the simplest possible form. What is the smallest input that still triggers it?" Most of the bug usually wasn't in the part you were staring at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10. Break things on purpose.&lt;/strong&gt; "To verify a hypothesis, propose an intentional change that should make the bug worse or better. Predict the outcome before running it." This is the one that flips debugging from observation to experiment. It also catches lies your monitoring is telling you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Persist the rules in CLAUDE.md
&lt;/h2&gt;

&lt;p&gt;Pasting 10 sentences into every prompt does not scale. CLAUDE.md is where the rules go to live.&lt;/p&gt;

&lt;p&gt;The Anthropic guidance I keep coming back to is to hold CLAUDE.md under roughly 100-150 lines so it can actually fit in context for every turn. Spending 12 of those lines on debugging is a good trade.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Debugging Rules&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Do not write fix code until you have identified the root cause.
&lt;span class="p"&gt;-&lt;/span&gt; Suppress nothing. If the symptom is gone but the cause is unknown, that is not a fix.
&lt;span class="p"&gt;-&lt;/span&gt; Before fixing, write a failing test that reproduces the bug.
&lt;span class="p"&gt;-&lt;/span&gt; After fixing, run the full test suite and report any newly failing tests.
&lt;span class="p"&gt;-&lt;/span&gt; If three attempts fail in a row on the same bug, stop. Summarize what you tried, what you ruled out, and what hypothesis is left, and ask for human input.

&lt;span class="gu"&gt;## Debugging Workflow&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; Root Cause Investigation: read logs, traces, and the code path.
&lt;span class="p"&gt;2.&lt;/span&gt; Pattern Analysis: search for the same anti-pattern elsewhere in the codebase.
&lt;span class="p"&gt;3.&lt;/span&gt; Hypothesis Testing: write a test that would fail iff the hypothesis is correct.
&lt;span class="p"&gt;4.&lt;/span&gt; Implementation: only after steps 1-3 succeed.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The thing to notice is that these are constraints, not instructions. "Do not write fix code until..." is more useful than "investigate first." The constraint format is what stops the next-token machine from cheerfully skipping ahead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automate behavior with hooks
&lt;/h2&gt;

&lt;p&gt;CLAUDE.md is the brain. Hooks are the reflexes. Two of them matter for debugging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PreToolUse: block destructive commands.&lt;/strong&gt; Halfway through debugging, the model occasionally suggests something like &lt;code&gt;rm -rf node_modules&lt;/code&gt; or, on a worse day, a raw &lt;code&gt;DROP TABLE&lt;/code&gt;. A PreToolUse hook intercepts the Bash tool call, greps the command string for a small denylist, and exits 2 to block. Claude Code treats exit code 2 from a PreToolUse hook as "this tool call is denied, tell the model why."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"PreToolUse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"matcher"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bash"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"if echo &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;$TOOL_INPUT&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; | grep -qE 'rm&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;s+-rf|DROP&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;s+TABLE'; then echo 'BLOCK: destructive command' &amp;gt;&amp;amp;2; exit 2; fi"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;PostToolUse: run tests after edits.&lt;/strong&gt; Matcher &lt;code&gt;Edit|Write&lt;/code&gt;, command runs your test suite or at least a fast subset. The model now sees the test failure on the next turn and reacts to it the same turn it created it, instead of remembering 30 messages later. The official &lt;a href="https://code.claude.com/docs/en/hooks" rel="noopener noreferrer"&gt;Claude Code hooks reference&lt;/a&gt; covers the matchers and exit-code conventions in full. Worth reading once before you write your own.&lt;/p&gt;

&lt;p&gt;Together CLAUDE.md, PreToolUse, and PostToolUse form the equipment layer for an AI debugger. It is the same equipment-layer pattern I used when splitting one big agent into Observer, Strategist, and Marketer: constraints in the prompt, behavior in the hooks, information in the MCP layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  When 3 hidden fixes in a row mean stop
&lt;/h2&gt;

&lt;p&gt;The single most useful rule, the one that would have saved my on-call:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If three attempts in a row fail to fix the same bug, stop and escalate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Three is not magic. It is the point where the cost of one more guess exceeds the cost of admitting the bug is structural. By the third attempt, the model is usually pattern-matching on top of pattern-matching, and a human eye is cheaper than a fourth retry.&lt;/p&gt;

&lt;p&gt;"Let Claude debug it" is half true. It is fast. It just defaults to fast at &lt;em&gt;hiding&lt;/em&gt; the problem unless you arm it differently. The 10 prompts arm it. The CLAUDE.md remembers them for you. The hooks catch what slips through. None of these is expensive. The on-call page at 11pm is.&lt;/p&gt;




&lt;p&gt;The full chapter on translating the 10 habits into prompts, plus the Claude Code weapons chapter on CLAUDE.md, hooks, and MCP layering, is in &lt;a href="https://kenimoto.dev/books/claude-code-mastery?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=claude-hid-bug-3x-debug-10" rel="noopener noreferrer"&gt;Practical Claude Code&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://survey.stackoverflow.co/2025/ai" rel="noopener noreferrer"&gt;2025 Stack Overflow Developer Survey, AI section&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.blog/2026/02/18/closing-the-developer-ai-trust-gap/" rel="noopener noreferrer"&gt;Closing the developer AI trust gap (Stack Overflow Blog, Feb 2026)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://code.claude.com/docs/en/hooks" rel="noopener noreferrer"&gt;Claude Code Hooks reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>claudecode</category>
      <category>debugging</category>
      <category>ai</category>
      <category>prompts</category>
    </item>
    <item>
      <title>I Refused to Write Specs Until Claude Code Generated Wrong Code Three Times</title>
      <dc:creator>Ken Imoto</dc:creator>
      <pubDate>Tue, 19 May 2026 13:00:01 +0000</pubDate>
      <link>https://forem.com/kenimo49/i-refused-to-write-specs-until-claude-code-generated-wrong-code-three-times-35hp</link>
      <guid>https://forem.com/kenimo49/i-refused-to-write-specs-until-claude-code-generated-wrong-code-three-times-35hp</guid>
      <description>&lt;p&gt;I read the phrase "spec-driven development" and immediately decided it was for people without taste. Six months later, Claude Code generated a discount system that applied coupons to itself. Three times in a row.&lt;/p&gt;

&lt;p&gt;The first time I laughed. The second time I assumed the prompt was the problem. The third time I closed the editor, opened a YAML file, and started writing OpenAPI like a person who had finally lost an argument with reality.&lt;/p&gt;

&lt;p&gt;This post is about that argument. And about what fifteen minutes of spec-writing actually buys you in 2026, when half the developer Twittersphere is still telling you to "just prompt it."&lt;/p&gt;

&lt;h2&gt;
  
  
  What I was doing wrong
&lt;/h2&gt;

&lt;p&gt;My workflow was the one everyone has tried. Open Claude Code. Type "build me a checkout flow with member discount and a promo code field." Watch the agent confidently generate four hundred lines of Flask. Skim. Run. Fail. Re-prompt. Get a different four hundred lines. Repeat until I either ran out of patience or shipped something that mostly worked.&lt;/p&gt;

&lt;p&gt;The discount feature was where the wheels came off. I asked for "10 percent member discount, stackable promo codes, max 30 percent total." Claude Code shipped a function that, when given a promo code on a member account, took 10 percent off, then took another 10 percent off the discounted total, then applied the promo. The promo code, as it turns out, was also a member-discount-eligible item in my schema, because I had not bothered to tell anyone that members are people and promos are line items. So the system politely gave my coupon a coupon.&lt;/p&gt;

&lt;p&gt;Yes, I am the engineer who wrote "just prompt it" in a thread last week and then spent five PR rounds explaining what "just" meant.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fifteen-minute spec
&lt;/h2&gt;

&lt;p&gt;Out of spite, I tried the thing I had been calling overhead. I wrote an OpenAPI document. Endpoint, request shape, response shape, error codes, the constraints on every field. It took fifteen minutes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;/api/orders&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;post&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requestBody&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;application/json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;customer_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
            &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;array of OrderItem&lt;/span&gt;
            &lt;span class="na"&gt;promo_code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string | &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;
      &lt;span class="na"&gt;responses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;201&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;order_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
            &lt;span class="na"&gt;subtotal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;integer (minimum 0)&lt;/span&gt;
            &lt;span class="na"&gt;member_discount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;integer (0..subtotal * 0.1, integer)&lt;/span&gt;
            &lt;span class="na"&gt;promo_discount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;integer&lt;/span&gt;
            &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;integer&lt;/span&gt;
            &lt;span class="na"&gt;applied_rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;array of string&lt;/span&gt;
        &lt;span class="na"&gt;400&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;code&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;message&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I wrote a Gherkin file with three scenarios. Member buys without promo. Non-member uses promo. Member uses promo and the total cap kicks in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Member with promo, capped at 30% total
  &lt;span class="nf"&gt;Given &lt;/span&gt;a logged-in member
  &lt;span class="nf"&gt;And &lt;/span&gt;a cart with subtotal 10000 yen
  &lt;span class="nf"&gt;When &lt;/span&gt;they apply promo code &lt;span class="s"&gt;"SPRING5"&lt;/span&gt;
  &lt;span class="nf"&gt;Then &lt;/span&gt;member_discount is 1000
  &lt;span class="nf"&gt;And &lt;/span&gt;promo_discount is 2000
  &lt;span class="nf"&gt;And &lt;/span&gt;total is 7000
  &lt;span class="err"&gt;And applied_rules includes "member" and "promo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="err"&gt;SPRING5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I handed both files to Claude Code with one sentence: "implement these specs in Flask, including validation and error handling." It generated about 80 percent of the implementation in three minutes. The remaining 20 percent was real domain logic: what counts as "stackable," what happens at the cap. I wrote that. The spec made it impossible to be confused about it.&lt;/p&gt;

&lt;p&gt;Fifteen minutes of YAML to delete five PR rounds of "what did you mean by stackable." I had been doing the loud version of saving fifteen minutes by spending two hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it works (and why "just prompt it" doesn't)
&lt;/h2&gt;

&lt;p&gt;The reason has nothing to do with Claude Code being smarter when you give it more text. It has to do with what you, the human, are forced to think about while writing the spec.&lt;/p&gt;

&lt;p&gt;When I write &lt;code&gt;member_discount: integer (0..subtotal * 0.1, integer)&lt;/code&gt;, I have committed to the idea that member discount is at most ten percent of the subtotal, in integer yen. I cannot generate a spec that "applies the coupon to itself" because the spec doesn't have a coupon-shaped recipient for that recursion. The ambiguity dies in YAML, before it can metastasize in Python.&lt;/p&gt;

&lt;p&gt;This isn't original to me. The 2026 wave of spec-driven tooling (&lt;a href="https://github.com/Fission-AI/OpenSpec" rel="noopener noreferrer"&gt;OpenSpec&lt;/a&gt;, &lt;a href="https://github.com/gotalab/cc-sdd" rel="noopener noreferrer"&gt;cc-sdd&lt;/a&gt;, &lt;a href="https://amux.io/guides/spec-driven-development/" rel="noopener noreferrer"&gt;amux&lt;/a&gt;, &lt;a href="https://kiro.dev" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt;) is all built on the same observation. GitHub Copilot Workspace doesn't even let you skip the step: it generates an editable "proposed specification" before it touches code, because the team that built it figured out that the spec is the only artifact in the workflow that the human can actually review.&lt;/p&gt;

&lt;p&gt;The cheap-model lesson generalizes: AI assistants don't reduce the value of specs. They turn a fuzzy spec into an expensive mistake faster than humans ever could.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three patterns that paid off
&lt;/h2&gt;

&lt;p&gt;The book version of this is three patterns, and after living with them for a quarter, all three pull weight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern 1: OpenAPI to implementation.&lt;/strong&gt; Write the endpoint shape. Hand it to Claude Code. Get a stub that handles 80 percent of CRUD plus serialization plus the obvious error cases. Add the domain logic by hand. This is the bread-and-butter case. It is also where the "80 percent" number comes from. The remaining 20 percent is what you're actually paid to think about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern 2: Gherkin to step definitions.&lt;/strong&gt; Write scenarios in Given/When/Then. Hand them to Claude Code with &lt;code&gt;pytest-bdd&lt;/code&gt; or &lt;code&gt;behave&lt;/code&gt;. Get the step skeletons. The interesting move here is that the same scenarios drive both implementation prompts and test prompts, so the agent can't drift between "what the code does" and "what the test checks." Drift is where bugs ship.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern 3: Spec to property tests.&lt;/strong&gt; From the OpenAPI schema (&lt;code&gt;price: integer, minimum: 0, maximum: 1_000_000&lt;/code&gt;), have Claude Code generate property-based tests with Hypothesis or fast-check. You get the boundary cases (&lt;code&gt;0&lt;/code&gt;, &lt;code&gt;1_000_000&lt;/code&gt;, &lt;code&gt;-1&lt;/code&gt;, &lt;code&gt;null&lt;/code&gt;, overflow) without having to remember every flavor of "what could go wrong with an integer." This is the one I underused for years and regret most.&lt;/p&gt;

&lt;h2&gt;
  
  
  The traps
&lt;/h2&gt;

&lt;p&gt;Three things will bite you if you don't watch for them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ambiguity in specs scales linearly with the bugs in implementation.&lt;/strong&gt; If your OpenAPI says &lt;code&gt;discount: number&lt;/code&gt; instead of &lt;code&gt;discount: integer (0..subtotal*0.1)&lt;/code&gt;, the model will guess. It will guess differently every time. Vague specs aren't a head start; they're a paid-for hallucination factory. Spec-driven development only works as a forcing function on you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Never trust generated code unconditionally.&lt;/strong&gt; Sample of bugs I have shipped from generated code in the last three months: a SQL query built with string concatenation (injection waiting to happen), a JWT stored in &lt;code&gt;localStorage&lt;/code&gt; (it should have been &lt;code&gt;httpOnly&lt;/code&gt;), and a silent N+1 over a thousand-row table. The agent didn't write any of those out of malice. It wrote them because nothing in my spec said "no." Specs need a constraints section.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agent will add requirements you didn't ask for.&lt;/strong&gt; I have watched Claude Code add an authentication check to an endpoint whose spec said "public, rate-limited only." The agent had read enough Stack Overflow to think every endpoint should be authenticated, and silently slipped a check in. Specs need to be explicit about what the system &lt;em&gt;doesn't&lt;/em&gt; do, not just what it does.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I write specs now
&lt;/h2&gt;

&lt;p&gt;The workflow that survived contact with reality is unromantic.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sketch the endpoint in OpenAPI. Field types, ranges, required vs optional.&lt;/li&gt;
&lt;li&gt;Write three Gherkin scenarios. Happy path, edge case, error case.&lt;/li&gt;
&lt;li&gt;Add a &lt;code&gt;## Out of scope&lt;/code&gt; section to the spec file. Auth model. Rate limit. Caching. Anything the agent might helpfully invent.&lt;/li&gt;
&lt;li&gt;Hand all three to Claude Code with &lt;code&gt;CLAUDE.md&lt;/code&gt; containing project conventions.&lt;/li&gt;
&lt;li&gt;Generate. Review the diff against the spec, not against vibes.&lt;/li&gt;
&lt;li&gt;Run the property tests the spec generated.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is also where Claude Code Skills earn their keep. I wrap the steps above into a single skill, &lt;code&gt;/spec-impl&lt;/code&gt;, and the workflow stops being a discipline I have to remember and starts being one slash command. The agent matters less than the artifact in front of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell past-me
&lt;/h2&gt;

&lt;p&gt;I would tell past-me that the fifteen minutes of OpenAPI he refused to write cost him an entire weekend of "just one more prompt." I would tell him that spec-driven development is not a methodology you adopt because some consultancy sold it to your CTO; it's the cheapest known mechanism for not arguing with a fast, confident, slightly drunk junior engineer.&lt;/p&gt;

&lt;p&gt;And I would tell him this: in 2026, agents turn every fuzzy spec into an expensive mistake faster than any human ever could.&lt;/p&gt;

&lt;p&gt;The specs are the brake pedal. Without them, you still go fast. You just go fast in whichever direction the agent's training data pointed last.&lt;/p&gt;




&lt;p&gt;If you're building Claude Code into your team's workflow and want the full reference, including Skills, hooks, sub-agents, and the CLAUDE.md three-tier pattern, this is the practitioner's reference I keep going back to:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kenimoto.dev/books/claude-code-mastery?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=spec-driven-3-failures" rel="noopener noreferrer"&gt;Claude Code Mastery: A Practitioner's Reference&lt;/a&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>spec</category>
      <category>openapi</category>
    </item>
    <item>
      <title>I Gave My Strategist Agent WebSearch. 5 Topics Took 20 Minutes. Splitting It Into 3 Roles Made It 3.</title>
      <dc:creator>Ken Imoto</dc:creator>
      <pubDate>Mon, 18 May 2026 13:00:00 +0000</pubDate>
      <link>https://forem.com/kenimo49/i-gave-my-strategist-agent-websearch-5-topics-took-20-minutes-splitting-it-into-3-roles-made-it-3-27na</link>
      <guid>https://forem.com/kenimo49/i-gave-my-strategist-agent-websearch-5-topics-took-20-minutes-splitting-it-into-3-roles-made-it-3-27na</guid>
      <description>&lt;p&gt;I thought one agent doing everything was elegant. One &lt;code&gt;claude -p&lt;/code&gt; call, "pick today's topics and write the articles," done. It took 20 minutes to pick 5 topics.&lt;/p&gt;

&lt;p&gt;Splitting it into three agents took the same job to 3 minutes and dropped token cost by about 60%. The agents are dumber individually. The pipeline is faster.&lt;/p&gt;

&lt;p&gt;The trick is not "more agents." The trick is taking WebSearch out of the agent that does the judging.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegasqfvt9vr3tlm0paw7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegasqfvt9vr3tlm0paw7.png" alt="1-agent vs 3-role separation. Token cost down 60%, time from 20 minutes to 3 minutes." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The 1-agent setup that took 20 minutes
&lt;/h2&gt;

&lt;p&gt;The original setup was one prompt, one agent, one run:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Look at yesterday's GA4 data, pick 5 topics for today, and write the top one."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The agent was allowed &lt;code&gt;Bash, Read, Write, Edit, Grep, Glob, WebSearch, WebFetch&lt;/code&gt;. Everything it could possibly need.&lt;/p&gt;

&lt;p&gt;For each candidate topic, it did roughly the same thing: WebSearch to check "what's hot in this space right now," WebSearch again to confirm a trend, WebSearch a third time to cross-check a competitor. Five topics, three to four searches each, 15 to 20 searches per run. Each search dumped a few thousand tokens of result into the context.&lt;/p&gt;

&lt;p&gt;By the time the agent was choosing topic 3, the judgment context contained 40,000+ tokens of search results from topics 1 and 2. The signal-to-noise ratio collapsed. The agent started picking topics that "felt confirmed by recent news" rather than topics that matched my actual content stock.&lt;/p&gt;

&lt;p&gt;The visible symptom was time: about 20 minutes per run. The hidden symptom was drift — I kept overriding the agent's picks during the weekly review, because they didn't match what I had material for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why WebSearch in the judgment loop is a trap
&lt;/h2&gt;

&lt;p&gt;WebSearch is fine. WebSearch in a judgment loop is the trap.&lt;/p&gt;

&lt;p&gt;Two things happen when you let the judge search:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time.&lt;/strong&gt; A WebSearch is 5-20 seconds. Five topics times four searches is 100 seconds of waiting per run, before you even count read time and reasoning. For a single human asking one question it's nothing. For a daily automated job it stacks up fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context pollution.&lt;/strong&gt; Each result adds 2,000-5,000 tokens of HTML-scraped page text into the judgment context. None of it was structured for "is this topic right for my content?" It was structured for SEO. The judge ends up reasoning from a pile of marketing copy instead of from its own data.&lt;/p&gt;

&lt;p&gt;The fix is unglamorous. The judge should not have WebSearch. WebSearch belongs in the writer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Role 1: Observer — collect only
&lt;/h2&gt;

&lt;p&gt;The Observer's job is "fetch yesterday's numbers, write them to a file." That is the whole job.&lt;/p&gt;

&lt;p&gt;Inputs: GA4, the Zenn API, the Dev.to API, yesterday's logs. Output: &lt;code&gt;domains/&amp;lt;name&amp;gt;/data/snapshot-YYYY-MM-DD.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Allowed tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;scripts/prompts/observer-prompt.txt&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allowed-tools&lt;/span&gt; &lt;span class="s2"&gt;"Bash,Read,Write"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No WebSearch. No WebFetch. No Edit. The Observer reads three APIs through &lt;code&gt;curl&lt;/code&gt; and writes a single JSON file. If it tries to be clever and "interpret the data," the prompt tells it not to. The schema enforces it: fields are &lt;code&gt;total_views&lt;/code&gt;, &lt;code&gt;top_performers_3&lt;/code&gt;, &lt;code&gt;errors_yesterday&lt;/code&gt;. No &lt;code&gt;recommendation&lt;/code&gt; field exists, so there's nowhere to put a judgment even if it wanted to make one.&lt;/p&gt;

&lt;p&gt;This sounds like a downgrade. It is, in the same way a single-purpose function is a "downgrade" from a god-object. When the Observer fails, I know exactly which API broke, because that's all it does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Role 2: Strategist — judge only, no WebSearch
&lt;/h2&gt;

&lt;p&gt;The Strategist reads what the Observer wrote, reads &lt;code&gt;strategy.md&lt;/code&gt; for the rules, reads the last 30 days of published topics for the exclusion list, and picks 5 topics. That's it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;scripts/prompts/strategist-prompt.txt&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allowed-tools&lt;/span&gt; &lt;span class="s2"&gt;"Bash,Read,Write,Edit,Grep,Glob"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what is missing: &lt;code&gt;WebSearch&lt;/code&gt;, &lt;code&gt;WebFetch&lt;/code&gt;. Physically gone from the allow-list. The Strategist literally cannot reach the internet.&lt;/p&gt;

&lt;p&gt;This was the part I resisted. "How can it judge today's topics without checking what's trending?" That was the wrong question. The right question is: am I writing topics that are trending elsewhere, or topics that match my content stock?&lt;/p&gt;

&lt;p&gt;The Strategist sees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Three months of my own performance data (what got read)&lt;/li&gt;
&lt;li&gt;My content stock (book chapters, unpublished drafts)&lt;/li&gt;
&lt;li&gt;30-day exclusion list (what I already wrote)&lt;/li&gt;
&lt;li&gt;My own &lt;code&gt;strategy.md&lt;/code&gt; rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is enough to pick 5 topics in about 90 seconds, not 20 minutes. The token consumption per Strategist run dropped from roughly 80,000 to roughly 20,000 because there are no WebSearch results to read.&lt;/p&gt;

&lt;p&gt;"Adding evidence with WebSearch" sounded like a good idea. In practice it added 8 redundant searches and 40,000 tokens of noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Role 3: Marketer — execute, WebSearch allowed
&lt;/h2&gt;

&lt;p&gt;The Marketer reads the Strategist's output, picks the top topic, and writes the article. This is where WebSearch shows up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;scripts/prompts/marketer-prompt.txt&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allowed-tools&lt;/span&gt; &lt;span class="s2"&gt;"Bash,Read,Write,Edit,Grep,Glob,WebSearch,WebFetch"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Marketer uses WebSearch for execution research:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Latest stable version of LangGraph in 2026"&lt;/li&gt;
&lt;li&gt;"Anthropic Building Effective Agents doc URL"&lt;/li&gt;
&lt;li&gt;"Inngest pricing tier for cron-driven workflows"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are citations and version checks, not judgments. "Should I write this topic?" is already decided. The Marketer's WebSearch is bounded by the article in front of it.&lt;/p&gt;

&lt;p&gt;Two consequences fall out of this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cost localizes. WebSearch spend lives in the Marketer, where it produces visible output. The Strategist's per-run cost is now small enough that I run it multiple times a week without thinking about it.&lt;/li&gt;
&lt;li&gt;Failure localizes. When WebSearch is flaky or down, only the writer breaks. The Strategist still produces today's picks. The Observer still records yesterday's numbers. The pipeline degrades, it doesn't halt.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The cron chain: how the three roles connect
&lt;/h2&gt;

&lt;p&gt;The three agents do not share a conversation. They share files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;07:00  Observer    → writes snapshot-2026-05-14.json
09:00  Strategist  → reads snapshot, writes strategist-2026-05-14.md
10:00  Marketer    → reads strategist.md, writes drafts + schedules 22:00 publish
22:00  Observer    → records today's early traction → tomorrow's input
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I run this as plain &lt;code&gt;cron&lt;/code&gt; on a small VPS. The short version is one line per job with &lt;code&gt;set -euo pipefail&lt;/code&gt;, &lt;code&gt;trap ... ERR&lt;/code&gt;, a Telegram failure ping, and a lock file. About 30 lines of shell per role.&lt;/p&gt;

&lt;p&gt;If you want managed durability instead of cron, &lt;a href="https://temporal.io/blog/orchestrating-ambient-agents-with-temporal" rel="noopener noreferrer"&gt;Temporal's Schedules&lt;/a&gt;, &lt;a href="https://www.inngest.com/" rel="noopener noreferrer"&gt;Inngest's cron triggers&lt;/a&gt;, and &lt;a href="https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#schedule" rel="noopener noreferrer"&gt;GitHub Actions cron&lt;/a&gt; all hit the same shape. The architecture does not care which one carries it. I use cron because the failure mode is "the server is off," and I notice that quickly.&lt;/p&gt;

&lt;p&gt;The handoff is always a file on disk. JSON for the snapshot, Markdown for the strategist log, Markdown for the marketer log. Human-readable, dated, replayable. I can re-run yesterday's Marketer against yesterday's Strategist file by changing one environment variable. That is &lt;code&gt;backfill&lt;/code&gt; for free, without inheriting Airflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sub-agent vs role separation — don't confuse them
&lt;/h2&gt;

&lt;p&gt;I have a separate post about running three Claude Code sub-agents on the same PR and watching them disagree 41% of the time. People sometimes ask if that is the same thing as what I'm describing here.&lt;/p&gt;

&lt;p&gt;It is not. They look similar on a slide and behave nothing alike in practice.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Sub-agent (Claude Code Task tool)&lt;/th&gt;
&lt;th&gt;Role separation (cron)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same session, same parent agent&lt;/td&gt;
&lt;td&gt;Three separate processes, three separate runs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Parent passes context as input&lt;/td&gt;
&lt;td&gt;File on disk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Timing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Synchronous, parent waits&lt;/td&gt;
&lt;td&gt;Asynchronous, hours apart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Parent owns retry&lt;/td&gt;
&lt;td&gt;Each job retries independently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Use case&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Explore this codebase in parallel"&lt;/td&gt;
&lt;td&gt;"Run yesterday's PDCA every morning"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sub-agents are great for &lt;em&gt;parallelism inside one task&lt;/em&gt;. Role separation is for &lt;em&gt;time-shifted pipelines&lt;/em&gt;. Mixing them produces the worst of both: you get cron's debug surface plus sub-agents' shared-context drift.&lt;/p&gt;

&lt;p&gt;The rule I use: if the answer has to come back in the same conversation, it is a sub-agent. If the answer has to survive a server reboot, it is a separate cron job.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed, measured
&lt;/h2&gt;

&lt;p&gt;These are my numbers from running both setups on the same content stack:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;1-agent&lt;/th&gt;
&lt;th&gt;3-role&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time to pick 5 topics&lt;/td&gt;
&lt;td&gt;~20 min&lt;/td&gt;
&lt;td&gt;~3 min&lt;/td&gt;
&lt;td&gt;-85%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokens per daily run&lt;/td&gt;
&lt;td&gt;~120k&lt;/td&gt;
&lt;td&gt;~45k&lt;/td&gt;
&lt;td&gt;-62%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly API spend&lt;/td&gt;
&lt;td&gt;~$60&lt;/td&gt;
&lt;td&gt;~$22&lt;/td&gt;
&lt;td&gt;-63%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Topic re-pick rate (weekly review)&lt;/td&gt;
&lt;td&gt;2-3/wk&lt;/td&gt;
&lt;td&gt;0-1/wk&lt;/td&gt;
&lt;td&gt;down&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WebSearch outage breaks pipeline&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;fixed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean debug time per failure&lt;/td&gt;
&lt;td&gt;30-60 min&lt;/td&gt;
&lt;td&gt;5-10 min&lt;/td&gt;
&lt;td&gt;-80%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The token math is the one that surprised me. I assumed splitting into three agents would &lt;em&gt;increase&lt;/em&gt; total token usage because of duplicated context. It did not, because the deleted WebSearch traffic was bigger than the new per-role overhead.&lt;/p&gt;

&lt;p&gt;The debug time is the one that matters daily. With one agent, "the job failed at 09:14" tells me nothing. With three roles, "the Strategist failed at 09:14" tells me which 30-line script to read.&lt;/p&gt;

&lt;p&gt;"Adding agents made it faster" sounds wrong on its face. It is only faster because I removed WebSearch from the judgment loop. The split is what made the removal feasible — once Observer and Strategist could not reach the internet, the temptation to "just search one more thing" was gone.&lt;/p&gt;




&lt;p&gt;The deep version with full crontab, prompt files, and role allow-lists is in &lt;a href="https://kenimoto.dev/books/harness-engineering-guide?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=3role-separation-20to3" rel="noopener noreferrer"&gt;Harness Engineering: From Using AI to Controlling AI&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>agents</category>
      <category>harness</category>
    </item>
  </channel>
</rss>
