<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Daniel Nwaneri</title>
    <description>The latest articles on Forem by Daniel Nwaneri (@dannwaneri).</description>
    <link>https://forem.com/dannwaneri</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3606168%2Fa6853c87-4daa-4bea-87e8-d1bd0d8a59d7.jpg</url>
      <title>Forem: Daniel Nwaneri</title>
      <link>https://forem.com/dannwaneri</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dannwaneri"/>
    <language>en</language>
    <item>
      <title>I Ran My Own SEO Agent on My Two Domains — It Went from 0/4 to 4/4 PASS in One Afternoon</title>
      <dc:creator>Daniel Nwaneri</dc:creator>
      <pubDate>Tue, 07 Apr 2026 15:12:57 +0000</pubDate>
      <link>https://forem.com/dannwaneri/i-ran-my-own-seo-agent-on-my-two-domains-it-went-from-04-to-44-pass-in-one-afternoon-39an</link>
      <guid>https://forem.com/dannwaneri/i-ran-my-own-seo-agent-on-my-two-domains-it-went-from-04-to-44-pass-in-one-afternoon-39an</guid>
      <description>&lt;p&gt;&lt;code&gt;invoice.naija-vpn.com&lt;/code&gt; was serving the Carter Efe $50K/month Twitch story as its meta description. That page is an invoice generator tool for Nigerian freelancers. Nothing about it has anything to do with Carter Efe.&lt;/p&gt;

&lt;p&gt;The agent caught it on the first run. A scraper wouldn't have — a scraper reads raw HTML before JavaScript executes. The agent uses Playwright: it reads the rendered DOM and sees what a browser sees after all the scripts run. The homepage content was leaking in dynamically. The scraper sees an empty description. The agent sees the actual problem.&lt;/p&gt;

&lt;p&gt;I own both domains, naija-vpn.com and naijavpn.com, which serve NaijaVPN, a Virtual Payment Navigator for Nigerian freelancers and creators. I ran the agent on my own sites to see what it would find.&lt;/p&gt;




&lt;h3&gt;What the agent found&lt;/h3&gt;

&lt;p&gt;Four URLs. Four FAILs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;URL&lt;/th&gt;
&lt;th&gt;Title&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Overall&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;naija-vpn.com/&lt;/td&gt;
&lt;td&gt;FAIL — 65 chars&lt;/td&gt;
&lt;td&gt;FAIL — 185 chars&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;naijavpn.com/&lt;/td&gt;
&lt;td&gt;FAIL — 66 chars&lt;/td&gt;
&lt;td&gt;FAIL — 192 chars&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/cleva-vs-geegpay&lt;/td&gt;
&lt;td&gt;FAIL — 63 chars&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;invoice.naija-vpn.com&lt;/td&gt;
&lt;td&gt;FAIL — no page-specific meta&lt;/td&gt;
&lt;td&gt;FAIL — no page-specific meta&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The homepage title was &lt;code&gt;NaijaVPN - Virtual Payment Navigator for Nigerian Freelancers&lt;/code&gt; — 65 characters when the display limit is 60. The description was 185 characters when the limit is 160.&lt;/p&gt;
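&lt;p&gt;Those limits are exactly the kind of check Tier 1 handles without a model. A minimal sketch of the length checks (the function name and output shape are illustrative, not the repo's actual API):&lt;/p&gt;

```python
# Display limits the audit checks against.
TITLE_LIMIT = 60
DESCRIPTION_LIMIT = 160

def check_lengths(title, description):
    """Tier 1: pure length checks, no API call needed."""
    results = {}
    if not title:
        results["title"] = "FAIL - missing"
    elif len(title) > TITLE_LIMIT:
        results["title"] = f"FAIL - {len(title)} chars"
    else:
        results["title"] = f"PASS - {len(title)} chars"
    if not description:
        results["description"] = "FAIL - missing"
    elif len(description) > DESCRIPTION_LIMIT:
        results["description"] = f"FAIL - {len(description)} chars"
    else:
        results["description"] = f"PASS - {len(description)} chars"
    # Overall passes only when every individual check passes.
    results["overall"] = "PASS" if all(
        v.startswith("PASS") for v in results.values()) else "FAIL"
    return results

title = "NaijaVPN - Virtual Payment Navigator for Nigerian Freelancers"
print(check_lengths(title, "x" * 185))
```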

&lt;p&gt;The &lt;code&gt;invoice.naija-vpn.com&lt;/code&gt; subdomain was the worst finding. It's a separate subdomain — Google treats it as its own site — but it had no page-specific metadata at all. The agent was reading &lt;code&gt;Carter Efe makes $50K/month from Twitch — here's how he gets paid in Nigeria...&lt;/code&gt; as the meta description for an invoice generator tool. The subdomain is a React SPA and the static HTML &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt; had no &lt;code&gt;&amp;lt;meta name="description"&amp;gt;&lt;/code&gt; tag. Homepage content was leaking in dynamically after JavaScript executed.&lt;/p&gt;

&lt;p&gt;A requests-based scraper would have missed this entirely. It reads the raw HTML before JavaScript executes and would have seen an empty description, not the leaking homepage copy. The agent uses Playwright — it reads the rendered DOM, the same page a browser sees after all the scripts run. That's the difference.&lt;/p&gt;




&lt;h3&gt;The cost curve routing on real pages&lt;/h3&gt;

&lt;p&gt;The audit ran in tiered mode. Here's what the routing looked like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;naija-vpn.com/&lt;/code&gt; → Tier 1 — deterministic, zero API cost&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;naijavpn.com/&lt;/code&gt; → Tier 1 — canonical pointing correctly to naija-vpn.com, clean&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/cleva-vs-geegpay&lt;/code&gt; → Tier 1 — clear structure, no escalation needed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;invoice.naija-vpn.com&lt;/code&gt; → Tier 2 (Haiku) — title on the boundary, needed a closer look&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two tiers exercised, four pages, one run. Total API cost for the audit: under $0.05.&lt;/p&gt;
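&lt;p&gt;The routing above can be sketched as a chain of escalations. Assumptions: &lt;code&gt;haiku_check&lt;/code&gt; and &lt;code&gt;sonnet_check&lt;/code&gt; stand in for the real API calls, and the "borderline" heuristic is illustrative:&lt;/p&gt;

```python
def tier1(page):
    """Tier 1: deterministic checks, zero API cost."""
    failures = []
    if len(page.get("title", "")) > 60:
        failures.append("title too long")
    if len(page.get("description", "")) > 160:
        failures.append("description too long")
    return failures

def borderline(page):
    """'Passes mechanically but something is off' (illustrative heuristic)."""
    return len(page.get("title", "")) in range(1, 10)

def audit(page, haiku_check, sonnet_check):
    """Route a page up the cost curve only as far as it needs to go."""
    failures = tier1(page)
    if failures:
        # Mechanical failures need no model; report them at Tier 1.
        return {"tier": 1, "status": "FAIL", "failures": failures}
    if not borderline(page):
        return {"tier": 1, "status": "PASS", "failures": []}
    # Passes the mechanical audit but needs a second look: cheap triage.
    verdict = haiku_check(page)
    if verdict != "needs_semantic_judgment":
        return {"tier": 2, "status": verdict, "failures": []}
    # Only genuinely ambiguous pages reach the expensive model.
    return {"tier": 3, "status": sonnet_check(page), "failures": []}
```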




&lt;h3&gt;What I actually changed&lt;/h3&gt;

&lt;p&gt;The rewrite agent ran next — same cost curve applied to the suggestions. Tier 1 truncated the titles deterministically. Haiku generated description suggestions. Sonnet wrote the voice-matched copy for the invoice subdomain opening paragraph.&lt;/p&gt;

&lt;p&gt;Three changes I deployed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Homepage title:&lt;/strong&gt; &lt;code&gt;NaijaVPN - Get Paid Internationally in Nigeria&lt;/code&gt; — 46 characters. Cut the bloat, kept the value proposition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Homepage description:&lt;/strong&gt; Trimmed from 185 to 156 characters. The agent's suggestion preserved the Carter Efe reference — that's the hook that makes people click — but cut everything that was padding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;invoice.naija-vpn.com&lt;/code&gt;:&lt;/strong&gt; Added a proper &lt;code&gt;&amp;lt;meta name="description"&amp;gt;&lt;/code&gt; to the static HTML &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt; of the React SPA — the fix lives in &lt;code&gt;index.html&lt;/code&gt;, not in the JavaScript. One line of HTML. The agent's suggested copy: &lt;em&gt;"Generate professional invoices instantly. Receive international payments in Nigeria with Geegpay, Payoneer and Cleva. Free invoice tool for Nigerian freelancers."&lt;/em&gt; — 158 characters, PASS.&lt;/p&gt;




&lt;h3&gt;The verification run&lt;/h3&gt;

&lt;p&gt;Reset state. Re-ran the audit.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;URL&lt;/th&gt;
&lt;th&gt;Title&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Overall&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;naija-vpn.com/&lt;/td&gt;
&lt;td&gt;PASS — 46 chars&lt;/td&gt;
&lt;td&gt;PASS — 156 chars&lt;/td&gt;
&lt;td&gt;Tier 1&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;naijavpn.com/&lt;/td&gt;
&lt;td&gt;PASS — 46 chars&lt;/td&gt;
&lt;td&gt;PASS — 156 chars&lt;/td&gt;
&lt;td&gt;Tier 1&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/cleva-vs-geegpay&lt;/td&gt;
&lt;td&gt;PASS — 52 chars&lt;/td&gt;
&lt;td&gt;PASS — 156 chars&lt;/td&gt;
&lt;td&gt;Tier 1&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;invoice.naija-vpn.com&lt;/td&gt;
&lt;td&gt;PASS — 50 chars&lt;/td&gt;
&lt;td&gt;PASS — 149 chars&lt;/td&gt;
&lt;td&gt;Tier 2&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;4/4. Zero flags.&lt;/p&gt;

&lt;p&gt;The verification run hit Tier 1 on every page — pure deterministic Python, zero API calls. That's the cost curve completing its own argument: the first run used Haiku and Sonnet because the issues were ambiguous enough to need judgment. The verification run used nothing because the fixes were mechanical and the checks are mechanical. The model cost dropped to zero not because anything was skipped, but because the problems were gone.&lt;/p&gt;




&lt;h3&gt;The finding that mattered most&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;invoice.naija-vpn.com&lt;/code&gt; subdomain inheriting homepage metadata is a serious SEO problem. Google treats subdomains as separate sites — so this wasn't "duplicate metadata on the same site." It was a standalone subdomain with no metadata of its own, rendering the homepage's Carter Efe story as the description for an invoice tool.&lt;/p&gt;

&lt;p&gt;The agent caught it because it reads the rendered DOM. A scraper reads raw HTML — it would have seen an empty description and flagged the absence, not the leak. The agent saw what a real browser sees: the homepage content dynamically injected after JavaScript executed.&lt;/p&gt;

&lt;p&gt;The agent found it automatically, on the first run, and proposed a fix.&lt;/p&gt;

&lt;p&gt;That's the actual value. Not the character counts. The automated surfacing of a problem that was invisible to a quick look at the page.&lt;/p&gt;




&lt;p&gt;Two full audit passes plus the rewrite run: under $0.15 total. Four pages, three runs, one afternoon.&lt;/p&gt;

&lt;p&gt;The free core at &lt;a href="https://github.com/dannwaneri/seo-agent" rel="noopener noreferrer"&gt;dannwaneri/seo-agent&lt;/a&gt; handles the audit and the basic report. The PDF report (per-page screenshots, severity-sorted issues) and the rewrite suggestions are in the Pro layer. The cost curve runs in both.&lt;/p&gt;

</description>
      <category>python</category>
      <category>automation</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I Was Paying $0.006 Per URL for SEO Audits Until I Realized Most Needed $0</title>
      <dc:creator>Daniel Nwaneri</dc:creator>
      <pubDate>Fri, 03 Apr 2026 14:24:02 +0000</pubDate>
      <link>https://forem.com/dannwaneri/i-was-paying-0006-per-url-for-seo-audits-until-i-realized-most-needed-0-132j</link>
      <guid>https://forem.com/dannwaneri/i-was-paying-0006-per-url-for-seo-audits-until-i-realized-most-needed-0-132j</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/pascal_cescato_692b7a8a20"&gt;Pascal CESCATO&lt;/a&gt; read my SEO audit agent piece and left this in the comments:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"You don't need an LLM for this. Everything you're sending to Claude can be done directly in Python — zero cost, fully deterministic, no hallucination risk."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He was right. And wrong. And the conversation that followed is the reason I rebuilt the entire thing.&lt;/p&gt;




&lt;h3&gt;What Pascal Actually Said&lt;/h3&gt;

&lt;p&gt;The audit agent I published checks title length, meta description length, H1 count, and canonical tags. Pascal's point: those are character counts and presence checks. A regex does that. You don't pay $0.006 per URL for a regex.&lt;/p&gt;

&lt;p&gt;I pushed back. The &lt;code&gt;flags&lt;/code&gt; array requires judgment — "title reads like a navigation label rather than a page description" isn't a character count. Pascal conceded, then reframed:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Two-pass makes more sense. Deterministic Python for binary checks, model call only on pages that pass the mechanical audit but need a second look. You pay per genuinely ambiguous case, not per URL."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's a better architecture than what I shipped. I said so publicly.&lt;/p&gt;

&lt;p&gt;Then &lt;a href="https://dev.to/aiforwork"&gt;Julian Oczkowski&lt;/a&gt; extended it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Deterministic rules first, lightweight models for triage, larger models reserved for genuinely ambiguous edge cases. Keeps latency low, costs predictable, reduces unnecessary LLM dependency."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Three people in a comment thread had just designed something I hadn't thought to name. Pascal called it two-pass. Julian called it tiered. I called it the cost curve — a sliding scale from free to expensive, routed by what the task actually requires.&lt;/p&gt;




&lt;h3&gt;The Cost Curve&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Tier 1 — Deterministic Python. Cost: $0.&lt;/strong&gt;&lt;br&gt;
Title &amp;gt;60 characters? FAIL. Description missing? FAIL. H1 count == 0? FAIL. These are not judgment calls. A model that can reason about Shakespeare does not need to be invoked to count to 60.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 2 — Haiku. Cost: ~$0.0001 per URL.&lt;/strong&gt;&lt;br&gt;
Title present but 4 characters long. Description present but 30 characters. Status code is a redirect. These pass the mechanical audit but something is off. Haiku is cheap enough that calling it for ambiguous cases costs less than the time you'd spend debugging why the deterministic check missed something.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 3 — Sonnet. Cost: ~$0.006 per URL.&lt;/strong&gt;&lt;br&gt;
Pages Haiku flags as needing semantic judgment. "This title passes length but reads like a navigation label." "This description duplicates the title verbatim." Sonnet earns its cost here. Not everywhere.&lt;/p&gt;

&lt;p&gt;The insight is routing. Most pages on a typical agency site have mechanical issues — missing descriptions, long titles, no canonical. Those never need a model. The interesting cases — pages that pass every binary check but still feel wrong — are where the model earns its place.&lt;/p&gt;

&lt;p&gt;On my last run of 50 URLs, 8 reached Sonnet. The rest resolved at Tier 1. Total cost dropped from ~$0.30 to ~$0.05. The 8 that hit Sonnet were the ones worth paying for.&lt;/p&gt;
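&lt;p&gt;The arithmetic behind that drop, as a back-of-envelope sketch using the per-URL rates above:&lt;/p&gt;

```python
URLS = 50
SONNET_RATE = 0.006    # dollars per URL at Tier 3
SONNET_PAGES = 8       # pages that genuinely needed semantic judgment

# v1: every URL pays the Sonnet rate.
flat_cost = URLS * SONNET_RATE
# Tiered: only the ambiguous 8 pay; the rest resolve for free.
tiered_cost = SONNET_PAGES * SONNET_RATE
print(round(flat_cost, 2), round(tiered_cost, 2))
```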


&lt;h3&gt;What I Actually Built&lt;/h3&gt;

&lt;p&gt;I restructured the entire repo around this architecture.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;core/&lt;/code&gt; stays flat and MIT licensed. The original seven modules untouched. Anyone who cloned v1 still runs &lt;code&gt;python core/index.py&lt;/code&gt; and gets identical behavior.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;premium/&lt;/code&gt; adds four modules.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cost_curve.py&lt;/code&gt; handles the tier routing — &lt;code&gt;audit_url(snapshot, tiered=True)&lt;/code&gt; runs Tier 1 first, escalates to Haiku if something fails, and escalates to Sonnet only if Haiku flags semantic ambiguity.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;multi_client.py&lt;/code&gt; manages project folders — &lt;code&gt;--project acme&lt;/code&gt; reads and writes from &lt;code&gt;projects/acme/&lt;/code&gt; with isolated state, input, and reports.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;enhanced_reporter.py&lt;/code&gt; generates WeasyPrint PDFs with per-URL screenshots, issues sorted by severity, and suggested fixes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rewrite_agent.py&lt;/code&gt; is the one Pascal didn't anticipate — after the audit, &lt;code&gt;--rewrite&lt;/code&gt; generates improvement suggestions using the same cost curve: Tier 1 truncates titles deterministically, Haiku writes meta description suggestions, and Sonnet rewrites opening paragraphs. Pass &lt;code&gt;--voice-sample ./my-writing.txt&lt;/code&gt; and the prompt includes a sample of your writing, so the suggestions sound like you, not like Claude.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;main.py&lt;/code&gt; is the unified entry point. Free users run &lt;code&gt;python main.py&lt;/code&gt; and get v1 behavior. Pro users add &lt;code&gt;--pro&lt;/code&gt;, set &lt;code&gt;SEO_AGENT_LICENSE&lt;/code&gt;, unlock the premium layer.&lt;/p&gt;

&lt;p&gt;A full pro run looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py &lt;span class="nt"&gt;--project&lt;/span&gt; client-x &lt;span class="nt"&gt;--pro&lt;/span&gt; &lt;span class="nt"&gt;--tiered&lt;/span&gt; &lt;span class="nt"&gt;--rewrite&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That single command: reads from &lt;code&gt;projects/client-x/input.csv&lt;/code&gt;, routes every URL through the cost curve, generates rewrite suggestions for failing pages, writes a PDF report with screenshots and severity levels, and appends a run record to the audit history.&lt;/p&gt;




&lt;h3&gt;The Architecture Decision That Matters&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;core/&lt;/code&gt; imports nothing from &lt;code&gt;premium/&lt;/code&gt;. Ever.&lt;/p&gt;

&lt;p&gt;This isn't just clean code. It's a trust contract with anyone who forks the repo. The MIT-licensed core is the public good — auditable, forkable, accepts PRs. The premium layer is proprietary and closed, but it builds on a foundation anyone can inspect.&lt;/p&gt;

&lt;p&gt;Mads Hansen in the comments named the right question: how do you prevent the auditor from developing blind spots on recurring patterns? The answer I didn't have then: run history in &lt;code&gt;state.json&lt;/code&gt;. Each completed run appends a record — timestamps, pass counts, fail counts, report path. Over time, "this page has failed description length for six consecutive runs" is a different signal than "this page failed today." The audit becomes a monitor. Not just a snapshot.&lt;/p&gt;
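&lt;p&gt;The run-history mechanism is small to implement. A sketch of the append-and-streak logic (field names here are assumptions, not the repo's actual &lt;code&gt;state.json&lt;/code&gt; schema):&lt;/p&gt;

```python
import json

def append_run(state_path, record):
    """Append a completed-run record to the flat state.json history."""
    try:
        with open(state_path) as f:
            state = json.load(f)
    except FileNotFoundError:
        state = {"runs": []}
    state["runs"].append(record)
    with open(state_path, "w") as f:
        json.dump(state, f, indent=2)

def failure_streak(runs, url, check):
    """Consecutive most-recent runs in which this check failed for this URL."""
    streak = 0
    for run in reversed(runs):
        if run["results"].get(url, {}).get(check) == "FAIL":
            streak += 1
        else:
            break
    return streak
```

A streak of six is the monitor signal: the same page failing the same check across runs, not just once.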




&lt;h3&gt;What the Comment Thread Cost&lt;/h3&gt;

&lt;p&gt;Nothing. And produced a better architecture than I would have built alone.&lt;/p&gt;

&lt;p&gt;Pascal's pushback separated the deterministic from the semantic. Julian's production framing gave me the three-tier structure. &lt;a href="https://dev.to/apex_stack"&gt;Apex Stack&lt;/a&gt;'s 89K-page site showed me where the orphan detection problem lives. &lt;a href="https://dev.to/mads_hansen_27b33ebfee4c9"&gt;Mads Hansen&lt;/a&gt; named the blind spot question I hadn't asked.&lt;/p&gt;

&lt;p&gt;None of that was in the original article. All of it is in the repo now.&lt;/p&gt;

&lt;p&gt;The public comment thread is the architecture review I didn't schedule. That's what happens when you publish the honest version — the demo that fails on your own content — instead of the staged one.&lt;/p&gt;

&lt;p&gt;The staged demo would have passed. The honest one compounded.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Full repo: &lt;a href="https://github.com/dannwaneri/seo-agent" rel="noopener noreferrer"&gt;dannwaneri/seo-agent&lt;/a&gt;. Core is MIT. Premium requires a license key. The freeCodeCamp tutorial covers the v1 build in detail: &lt;a href="https://www.freecodecamp.org/news/how-to-build-a-local-seo-audit-agent-with-browser-use-and-claude-api" rel="noopener noreferrer"&gt;How to Build a Local SEO Audit Agent with Browser Use and Claude API&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>webdev</category>
      <category>automation</category>
    </item>
    <item>
      <title>Agents Don't Just Do Unauthorized Things. They Cause Humans to Do Unauthorized Things.</title>
      <dc:creator>Daniel Nwaneri</dc:creator>
      <pubDate>Mon, 30 Mar 2026 14:36:52 +0000</pubDate>
      <link>https://forem.com/dannwaneri/agents-dont-just-do-unauthorized-things-they-cause-humans-to-do-unauthorized-things-51j4</link>
      <guid>https://forem.com/dannwaneri/agents-dont-just-do-unauthorized-things-they-cause-humans-to-do-unauthorized-things-51j4</guid>
      <description>&lt;p&gt;A comment thread shouldn't produce original research. This one did.&lt;/p&gt;

&lt;p&gt;Last week I published a piece about agent governance — the gap between what an agent is authorized to do and what it effectively becomes authorized to do through accumulated actions. I used a quant fund analogy: five independent strategies, each within its own risk limits, collectively overweight the same sector. No single decision was wrong. The aggregate outcome was unauthorized.&lt;/p&gt;

&lt;p&gt;The comment section built something I hadn't anticipated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/kalpaka"&gt;Kalpaka&lt;/a&gt;,&lt;a href="https://dev.to/vicchen"&gt;Vic Chen&lt;/a&gt;, &lt;a href="https://dev.to/acytryn"&gt;Andre Cytryn&lt;/a&gt;, &lt;a href="https://dev.to/euphorie"&gt;Stephen Lee&lt;/a&gt;, and &lt;a href="https://dev.to/connor_gallic"&gt;Connor Gallic&lt;/a&gt; spent three days extending the argument in directions I hadn't gone. What follows is an attempt to assemble what they built — with attribution, because the thread earned it.&lt;/p&gt;




&lt;h2&gt;The Unit Problem&lt;/h2&gt;

&lt;p&gt;The first thing Kalpaka named was the hardest: what do you measure?&lt;/p&gt;

&lt;p&gt;Actions taken is too noisy. Resources touched is closer but misses compounding. The unit that matters, they argued, is state delta — the diff between what the system could affect before and after a given action sequence.&lt;/p&gt;

&lt;p&gt;A vendor approval doesn't just place an order. It creates a supply chain dependency. A database read doesn't just return rows. It establishes an access pattern the agent now relies on.&lt;/p&gt;

&lt;p&gt;So the audit object isn't an action log. It's a capability graph — resources and permissions as nodes, actually-exercised access paths as edges. Reconciliation compares the declared graph against the exercised one. The delta is your behavioral exposure.&lt;/p&gt;
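&lt;p&gt;That reconciliation can be phrased as a plain set diff. A toy sketch with edges as (agent, resource, permission) tuples:&lt;/p&gt;

```python
def behavioral_exposure(declared, exercised):
    """Compare the declared capability graph against the exercised one.

    Edges are (agent, resource, permission) tuples. The deltas run in
    both directions: exercised-but-never-declared edges are scope creep;
    declared-but-never-exercised edges are candidates for pruning.
    """
    return {
        "undeclared": exercised - declared,
        "unused": declared - exercised,
    }

declared = {
    ("agent", "orders_db", "read"),
    ("agent", "orders_db", "write"),
    ("agent", "vendor_api", "read"),
}
exercised = {
    ("agent", "orders_db", "read"),
    ("agent", "billing_db", "read"),  # never declared: behavioral exposure
}
delta = behavioral_exposure(declared, exercised)
```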

&lt;p&gt;This is the frame that makes the quant fund analogy precise rather than decorative. A 13F filing is exactly this: a reconciliation between declared strategy and actual exposure, anchored to external cadence rather than internal computation speed. The quarterly filing forces the moment when someone has to look at the full graph, not just the individual positions.&lt;/p&gt;




&lt;h2&gt;Two Kinds of Edges&lt;/h2&gt;

&lt;p&gt;The capability graph has a shape problem I hadn't considered.&lt;/p&gt;

&lt;p&gt;Direct edges are legible. The agent called the database, touched the file, invoked the API. These edges can be tracked. They decay naturally — an access pattern not exercised in N reconciliation cycles can be pruned. The agent hasn't used that path, so it drops from the exercised graph.&lt;/p&gt;

&lt;p&gt;Induced edges are different. These are the edges created when an agent's output causes a human to take an action the agent itself didn't take. The Meta Sev 1 incident last week is the exact pattern: the agent didn't write anything unauthorized. It gave advice that caused an engineer to widen access. The exposure persisted for two hours. The agent's direct action log showed nothing unusual.&lt;/p&gt;

&lt;p&gt;Induced edges don't decay the same way direct edges do. The human decision the agent caused doesn't un-happen when the agent stops referencing it. The widened access persists independently of the agent's continued activity. These edges have a much longer half-life — effectively permanent until someone explicitly reconciles them.&lt;/p&gt;

&lt;p&gt;This is where the governance architecture splits. Direct edges decay on a usage clock. Induced edges require active reconciliation. The first can be automated. The second can't — which is exactly where the 13F analogy holds strongest. Quarterly isn't a technical choice. It's a regulatory one. Someone external decided that interval was often enough for fund positions. Agent capability graphs need the same external anchor.&lt;/p&gt;
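&lt;p&gt;The two decay policies can be made concrete. A sketch, with the edge schema and the idle-cycle cutoff as illustrative assumptions:&lt;/p&gt;

```python
def prune(edges, current_cycle, max_idle_cycles=4):
    """Direct edges decay on a usage clock; induced edges never auto-expire.

    Each edge: {"kind": "direct" or "induced",
                "last_exercised": cycle number,
                "reconciled": bool}
    """
    kept = []
    for edge in edges:
        if edge["kind"] == "direct":
            idle = current_cycle - edge["last_exercised"]
            if idle > max_idle_cycles:
                continue  # unused direct path drops from the graph
        elif not edge["reconciled"]:
            pass  # induced edge persists until a human reconciles it
        else:
            continue  # reconciled induced edge can finally be closed
        kept.append(edge)
    return kept
```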




&lt;h2&gt;The Irreversibility Problem&lt;/h2&gt;

&lt;p&gt;Andre Cytryn raised threshold drift: you calibrate materiality at deployment time, but the agent's decision space expands with use.&lt;/p&gt;

&lt;p&gt;The per-strategy limits in the fund example weren't wrong at design time. They were wrong at failure time because the correlation structure changed. Continuous recalibration is the obvious fix. But continuous recalibration creates its own attack surface — a mechanism the agent can game structurally, not intentionally. Enough accumulated decisions that each looks within tolerance can shift the calibration baseline until the new thresholds permit what the original thresholds wouldn't.&lt;/p&gt;

&lt;p&gt;The only way out Kalpaka identified: decouple recalibration authority from the agent's own execution context entirely. Thresholds reviewed by something with no stake in the agent's performance.&lt;/p&gt;

&lt;p&gt;This requires knowing which actions are irreversible — but irreversibility is often only visible in retrospect. The Meta incident proves it. Nobody flagged the permission-widening action as irreversible-scope-changing while it was happening. The agent's advice looked like a routine technical suggestion. The irreversibility was only visible after the state had already changed.&lt;/p&gt;

&lt;p&gt;Kalpaka's resolution was to invert the default assumption. Don't try to classify irreversibility upfront. Treat every induced edge as irreversible. Over-reconcile first, and let the reconciliation history generate the labeled data you need to build the upstream classifier later.&lt;/p&gt;

&lt;p&gt;This is how financial compliance actually bootstrapped. Early fund compliance didn't start with sophisticated materiality filters. They reconciled everything quarterly and learned which position changes were material through decades of accumulated review history. The filters came from the data. The data came from over-reconciling.&lt;/p&gt;

&lt;p&gt;We're in the over-reconcile phase. Expensive, noisy, generates false positives. But it's the only way to produce the labeled examples that make the classifier possible later. Trying to solve the detection problem at inception is trying to skip the generation of training data.&lt;/p&gt;




&lt;h2&gt;The Fatigue Problem&lt;/h2&gt;

&lt;p&gt;Over-reconciling creates a human cost. Every flagged induced state change needs a review decision. Volume in early cycles will be high by design.&lt;/p&gt;

&lt;p&gt;The compliance history the 13F analogy draws on has a second chapter nobody likes to mention: the review process degrades under load. Humans start rubber-stamping. False positives stop being caught. The labeled data gets noisy before the classifier can be trained on it.&lt;/p&gt;

&lt;p&gt;Kalpaka's answer was adaptive rather than prescribed: track reviewer consistency over time. Same reviewer, same flag type — does the approval rate shift as volume increases? That's your fatigue signal. Once you can detect it, you don't need a hard cap. You have evidence-based throttling.&lt;/p&gt;
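&lt;p&gt;The fatigue signal needs nothing more than the review log. A sketch (the window size and drift threshold are made-up parameters):&lt;/p&gt;

```python
def fatigue_signal(decisions, window=50, threshold=0.15):
    """Compare a reviewer's early approval rate to their recent one.

    decisions: list of booleans (True = approved), oldest first.
    A large upward drift in approval rate as volume accumulates is
    the rubber-stamping signal worth throttling on.
    """
    if 2 * window > len(decisions):
        return None  # not enough history to compare
    early = decisions[:window]
    recent = decisions[-window:]
    drift = sum(recent) / window - sum(early) / window
    return drift if drift > threshold else None
```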

&lt;p&gt;The auditor rotation parallel holds here too. Financial auditors rotate precisely because of this problem. Fresh eyes are the simplest fatigue mitigation. The question for agent behavioral review is whether there are enough qualified reviewers to rotate through.&lt;/p&gt;

&lt;p&gt;That's harder than it sounds. Financial auditors share a professional vocabulary — they all know what material means, what a restatement implies, what a concentration risk looks like. That shared language is what makes fresh eyes useful. Agent behavioral accounting doesn't have that vocabulary yet. The capability graph framing this thread built is an attempt to construct it. Until reviewers have a shared framework for what "consequential scope change" looks like in practice, rotating fresh eyes doesn't transfer the same way.&lt;/p&gt;

&lt;p&gt;The qualified reviewer problem might be the bootstrapping constraint underneath the bootstrapping constraint.&lt;/p&gt;




&lt;h2&gt;The Enforcement Floor&lt;/h2&gt;

&lt;p&gt;Connor Gallic brought the piece back to earth.&lt;/p&gt;

&lt;p&gt;Theory is one thing. What teams actually ship is another. The simplest version of enforcement is a write checkpoint: every agent action that changes state passes through a policy evaluation before it executes. Not token monitoring, not post-hoc audit. A deterministic gate between intent and action.&lt;/p&gt;
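&lt;p&gt;The checkpoint is the easy half to prototype: a deterministic gate between intent and action. A sketch, with the policy format as an assumption:&lt;/p&gt;

```python
class PolicyViolation(Exception):
    pass

def checkpoint(action, policy):
    """Evaluate a state-changing action before it executes.

    action: {"agent": ..., "resource": ..., "operation": ...}
    policy: set of (agent, resource, operation) grants.
    """
    key = (action["agent"], action["resource"], action["operation"])
    if key not in policy:
        raise PolicyViolation(f"unauthorized: {key}")
    return True  # gate passed; the caller may now execute the write

policy = {("billing-agent", "invoices", "write")}
checkpoint({"agent": "billing-agent", "resource": "invoices",
            "operation": "write"}, policy)
```

Because every write flows through the same gate, the policy layer also gets the cross-agent view no individual agent has.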

&lt;p&gt;The aggregate authorization problem gets simpler when you centralize that policy layer. If every agent's writes flow through the same checkpoint, cross-agent scope becomes visible — because the governance layer has the view that individual agents don't.&lt;/p&gt;

&lt;p&gt;The 13F analogy works because quarterly filings are boring, reliable, and externally anchored. Agent governance needs the same properties. The hard part isn't the theory. It's making enforcement boring enough that teams actually ship it.&lt;/p&gt;

&lt;p&gt;The write checkpoint wins on that dimension. It's implementable today. It handles direct scope violations cleanly. It's the enforcement floor.&lt;/p&gt;

&lt;p&gt;The capability graph is the ceiling. It handles induced state changes, threshold drift, cross-agent composition. It's more expensive, harder to build, requires the vocabulary that doesn't fully exist yet. It's also necessary for complete governance.&lt;/p&gt;

&lt;p&gt;The right architecture is probably layered: checkpoint as the floor that catches direct violations cheaply, capability graph as the audit layer that catches induced drift over time. You ship the checkpoint first because it's boring enough to actually build. You develop the capability graph because the checkpoint leaves a class of failures uncovered.&lt;/p&gt;




&lt;h2&gt;What the Thread Built&lt;/h2&gt;

&lt;p&gt;None of this was in the original piece.&lt;/p&gt;

&lt;p&gt;The two-clock distinction was there. The 13F analogy was there. The governance debt framing was there.&lt;/p&gt;

&lt;p&gt;The capability graph, the direct/induced edge split, the over-reconcile bootstrap, the adaptive fatigue ceiling, the layered enforcement architecture — those came from Kalpaka, Andre, Vic, Stephen, and Connor over three days in a comment section.&lt;/p&gt;

&lt;p&gt;That's worth naming directly. Not as a courtesy, but because it demonstrates the problem Foundation is designed to solve. This conversation happened in public, in a thread that will eventually scroll off everyone's feed, attributed to usernames rather than preserved as structured knowledge. The architecture it built is worth more than that.&lt;/p&gt;

&lt;p&gt;The agent governance problem doesn't have a consensus solution yet. What it has is a comment thread that got further than most papers. The qualified reviewer problem is still open. The capability graph needs implementation. The adaptive ceiling needs tooling.&lt;/p&gt;

&lt;p&gt;But the vocabulary exists now. That's what the thread produced. And vocabulary, as someone wiser than me wrote recently, turns vague frustration into specific, solvable problems.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The original piece: &lt;a href="https://dev.to/dannwaneri/your-agent-is-making-decisions-nobody-authorized-2bc7"&gt;Your Agent Is Making Decisions Nobody Authorized&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>architecture</category>
      <category>webdev</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Built a Local AI Agent That Audits My Own Articles. It Flagged Every Single One.</title>
      <dc:creator>Daniel Nwaneri</dc:creator>
      <pubDate>Mon, 30 Mar 2026 14:33:56 +0000</pubDate>
      <link>https://forem.com/dannwaneri/i-built-a-local-ai-agent-that-audits-my-own-articles-it-flagged-every-single-one-pkh</link>
      <guid>https://forem.com/dannwaneri/i-built-a-local-ai-agent-that-audits-my-own-articles-it-flagged-every-single-one-pkh</guid>
      <description>&lt;p&gt;Not as a gotcha. As a result.&lt;/p&gt;

&lt;p&gt;Seven URLs. Seven FAILs.&lt;/p&gt;

&lt;p&gt;My Hashnode profile is missing an H1. Three freeCodeCamp tutorials have meta descriptions that are either missing or over 160 characters. Two DEV.to articles have titles too long for Google to render cleanly.&lt;/p&gt;

&lt;p&gt;I built the agent. I ran it on my own content first. That's the honest version of the demo.&lt;/p&gt;




&lt;h2&gt;The problem I was actually solving&lt;/h2&gt;

&lt;p&gt;Every digital marketing agency has someone whose job is basically this: open a spreadsheet, visit each client URL, check the title tag, check the description, check the H1, note broken links, paste everything into a report. Repeat weekly.&lt;/p&gt;

&lt;p&gt;That person costs money. The work is deterministic. The only reason it's still manual is that nobody built the alternative.&lt;/p&gt;

&lt;p&gt;I built it in a weekend.&lt;/p&gt;




&lt;h2&gt;
  
  
  The stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Browser Use&lt;/strong&gt; — Python-native browser automation. The agent navigates real pages in a visible Chromium window. Not a headless scraper. Persistent sessions, real rendering, the same page a human would see.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude API (Sonnet)&lt;/strong&gt; — reads the page snapshot and returns structured JSON: title status, description status, H1 count, canonical tag, flags. One API call per URL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;httpx&lt;/strong&gt; — async HEAD requests for broken link detection. Capped at 50 links per page, concurrent, 5-second timeout per request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flat JSON files&lt;/strong&gt; — &lt;code&gt;state.json&lt;/code&gt; tracks what's been audited. Interrupt mid-run, restart, it picks up exactly where it stopped. No database needed.&lt;/li&gt;
&lt;/ul&gt;
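&lt;p&gt;The resume logic is small enough to sketch. A minimal version (the key names &lt;code&gt;done&lt;/code&gt; and &lt;code&gt;needs_human&lt;/code&gt; are illustrative, not necessarily the repo's exact schema):&lt;/p&gt;

```python
import json
from pathlib import Path

STATE_FILE = Path("state.json")

def load_state() -> dict:
    # Resume a previous run if the file exists; otherwise start fresh.
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"done": [], "needs_human": []}

def save_state(state: dict) -> None:
    # Rewrite the whole file after every URL: cheap at this scale.
    STATE_FILE.write_text(json.dumps(state, indent=2))

def pending_urls(all_urls: list[str], state: dict) -> list[str]:
    # Anything already audited, or parked for a human, gets skipped on restart.
    seen = set(state["done"]) | set(state["needs_human"])
    return [u for u in all_urls if u not in seen]
```

&lt;p&gt;Interrupt mid-run, restart, and the next run filters the URL list against the same file. That's the whole "picks up where it stopped" trick.&lt;/p&gt;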

&lt;p&gt;Seven Python files. 956 lines total. Runs on a Windows laptop.&lt;/p&gt;




&lt;h2&gt;
  
  
  The part most tutorials skip: HITL
&lt;/h2&gt;

&lt;p&gt;The agent hits a login wall. Throws an exception. Run dies.&lt;/p&gt;

&lt;p&gt;That's most automation tutorials.&lt;/p&gt;

&lt;p&gt;This one doesn't work that way.&lt;/p&gt;

&lt;p&gt;When the agent detects a non-200 status, a redirect to a login page, or a title containing "sign in" or "access denied", it pauses. In interactive mode: skip, retry, or quit. In &lt;code&gt;--auto&lt;/code&gt; mode it skips automatically, logs the URL to &lt;code&gt;needs_human[]&lt;/code&gt; in state, and continues.&lt;/p&gt;

&lt;p&gt;An agent that knows its limits is more useful than one that fails silently. That's the design decision most people don't make because tutorials don't cover it.&lt;/p&gt;
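&lt;p&gt;The detection itself is a few lines. A sketch of the check (marker list and URL patterns here are illustrative, not the repo's exact ones):&lt;/p&gt;

```python
LOGIN_MARKERS = ("sign in", "log in", "access denied")

def needs_human(status: int, final_url: str, title: str) -> bool:
    # True means: don't guess, don't crash. Pause in interactive mode,
    # or log the URL to needs_human[] and skip in --auto mode.
    if status != 200:
        return True
    if "/login" in final_url or "/signin" in final_url:
        return True  # we got redirected to a login wall
    return any(marker in title.lower() for marker in LOGIN_MARKERS)
```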




&lt;h2&gt;
  
  
  What the audit actually found
&lt;/h2&gt;

&lt;p&gt;I ran it against my own published content across three platforms:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;URL&lt;/th&gt;
&lt;th&gt;Failing fields&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;hashnode.com/&lt;a class="mentioned-user" href="https://dev.to/dannwaneri"&gt;@dannwaneri&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;H1 missing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;freeCodeCamp — how-to-build-your-own-claude-code-skill&lt;/td&gt;
&lt;td&gt;Meta description&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;freeCodeCamp — how-to-stop-letting-ai-agents-guess&lt;/td&gt;
&lt;td&gt;Meta description&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;freeCodeCamp — build-a-production-rag-system&lt;/td&gt;
&lt;td&gt;Title + meta description&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;freeCodeCamp — author/dannwaneri&lt;/td&gt;
&lt;td&gt;Meta description&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dev.to — the-gatekeeping-panic&lt;/td&gt;
&lt;td&gt;Title too long&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dev.to — i-built-a-production-rag-system&lt;/td&gt;
&lt;td&gt;Title too long&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The freeCodeCamp description issues are partly platform-level — freeCodeCamp controls the template and sometimes truncates or omits meta descriptions. The DEV.to title issues are mine. Article titles that read well as headlines often exceed 60 characters in the &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt; tag.&lt;/p&gt;

&lt;p&gt;The agent didn't care. It checked the standard and reported the result.&lt;/p&gt;
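&lt;p&gt;That standard is mechanical. Something like this, using the 60- and 160-character limits from above (the function names are mine, not necessarily the repo's):&lt;/p&gt;

```python
from typing import Optional

def check_title(title: Optional[str]) -> str:
    # Google typically truncates title tags past roughly 60 characters.
    if not title:
        return "FAIL: missing"
    if len(title) > 60:
        return f"FAIL: {len(title)} chars (limit 60)"
    return "PASS"

def check_description(desc: Optional[str]) -> str:
    # Same rule for meta descriptions at roughly 160 characters.
    if not desc:
        return "FAIL: missing"
    if len(desc) > 160:
        return f"FAIL: {len(desc)} chars (limit 160)"
    return "PASS"
```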




&lt;h2&gt;
  
  
  The schedule play
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;python index.py --auto&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Add a &lt;code&gt;.bat&lt;/code&gt; file that sets the API key and calls that command. Schedule it in Windows Task Scheduler for Monday 7am. Check &lt;code&gt;report-summary.txt&lt;/code&gt; with your coffee.&lt;/p&gt;

&lt;p&gt;That's the agency workflow. No babysitting. Edge cases in &lt;code&gt;needs_human[]&lt;/code&gt; for human review. Everything else processed and reported automatically.&lt;/p&gt;
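&lt;p&gt;The &lt;code&gt;.bat&lt;/code&gt; wrapper is about four lines. Paths and the key below are placeholders, not mine:&lt;/p&gt;

```bat
@echo off
REM run-seo-audit.bat — scheduled for Monday 7am via Task Scheduler
set ANTHROPIC_API_KEY=your-key-here
cd /d C:\path\to\seo-agent
python index.py --auto
```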




&lt;h2&gt;
  
  
  What this actually costs
&lt;/h2&gt;

&lt;p&gt;One Sonnet API call per URL. Roughly $0.002 per page. A 20-URL weekly audit costs less than $0.05. The Playwright browser runs locally — no cloud browser fees, no Browserbase subscription.&lt;/p&gt;

&lt;p&gt;The whole thing runs on a $5/month philosophy. Same one I use for everything else.&lt;/p&gt;




&lt;h2&gt;
  
  
  The code
&lt;/h2&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/dannwaneri/seo-agent" rel="noopener noreferrer"&gt;dannwaneri/seo-agent&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Clone it, add your URLs to &lt;code&gt;input.csv&lt;/code&gt;, set &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; in your environment, run &lt;code&gt;pip install -r requirements.txt&lt;/code&gt;, run &lt;code&gt;playwright install chromium&lt;/code&gt;, then &lt;code&gt;python index.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The freeCodeCamp tutorial walks through each module — browser integration, the Claude extraction prompt, the async link checker, the HITL logic. Link in the comments when it's live.&lt;/p&gt;




&lt;h2&gt;
  
  
  The shift worth naming
&lt;/h2&gt;

&lt;p&gt;Browser automation has been a developer tool for a decade. Playwright, Selenium, Puppeteer — all powerful, all requiring someone to write and maintain selectors. The moment a button's class name changes, the script breaks.&lt;/p&gt;

&lt;p&gt;This agent doesn't use selectors. It reads the page the way Claude reads it — semantically, through the accessibility tree. A "Submit" button is still a "Submit" button even if the CSS class changed.&lt;/p&gt;

&lt;p&gt;The extraction logic is in the prompt, not in the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Old way:&lt;/strong&gt; Automation breaks when the page changes.&lt;br&gt;
&lt;strong&gt;New way:&lt;/strong&gt; Reasoning adapts. The code doesn't need to.&lt;/p&gt;

&lt;p&gt;That's the actual shift. Not "AI does the work" but "the brittleness moved." From selectors to prompts. From maintenance to reasoning. The failure modes are different. So is the recovery.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built this as the first in a series on practical local AI agent setups for agency operations. The freeCodeCamp step-by-step tutorial is coming. Repo is live now.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>automation</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Imposter Syndrome Didn't Go Away. It Got Quieter.</title>
      <dc:creator>Daniel Nwaneri</dc:creator>
      <pubDate>Thu, 26 Mar 2026 13:05:44 +0000</pubDate>
      <link>https://forem.com/dannwaneri/imposter-syndrome-didnt-go-away-it-got-quieter-4mcc</link>
      <guid>https://forem.com/dannwaneri/imposter-syndrome-didnt-go-away-it-got-quieter-4mcc</guid>
      <description>&lt;p&gt;I noticed something last year. The imposter syndrome posts disappeared.&lt;/p&gt;

&lt;p&gt;Not gradually. They were everywhere — "I've been coding for three years and I still Google how to center a div," "just got promoted to senior and I have no idea what I'm doing," threads with thousands of likes, people in the replies saying &lt;em&gt;me too, me too, me too&lt;/em&gt;. Then the LLMs arrived, and the posts stopped. I thought maybe the tools cured it. That we'd finally found the thing that made everyone feel competent.&lt;/p&gt;

&lt;p&gt;I was wrong. The feeling didn't go away. It just changed shape.&lt;/p&gt;




&lt;p&gt;The old imposter syndrome had a specific texture. You knew what you didn't know. You couldn't center a div, you'd never touched Kubernetes, you were faking it in standups about GraphQL. The gap was legible. You could name it, study it, fill it. That legibility is what made the posts shareable. "I don't know X" is a sentence you can say out loud.&lt;/p&gt;

&lt;p&gt;The new version is harder to name. You shipped the feature. The tests pass. Your manager is happy. But when someone asks you to walk through the implementation, something is off. You can narrate what the code does. You can't always explain why it's structured the way it is, what the model assumed, where it would break under pressure. The code is yours the same way a house is yours when you hired the contractor. You own it. You don't know every decision that went into the walls.&lt;/p&gt;

&lt;p&gt;Engineering culture has long tied ownership to authorship. You wrote the code, therefore you understood it. You understood it, therefore you were responsible for it. That chain held for decades. AI broke it — not dramatically, not all at once, but consistently enough that a new kind of doubt has moved in where the old one used to live.&lt;/p&gt;




&lt;p&gt;The data is starting to catch up to what developers already feel.&lt;/p&gt;

&lt;p&gt;METR previously published a paper finding that AI tools caused a 20% slowdown in completing tasks among experienced open-source developers — a result strange enough that they ran a follow-up study. That follow-up hit its own problem: 30% to 50% of developers told researchers they were declining to submit certain tasks because they didn't want to do them without AI.&lt;/p&gt;

&lt;p&gt;Read that twice. Not "AI makes me faster." Not even "AI makes me better." Developers who cannot or will not attempt the work without the tool present. That's not productivity. That's dependency.&lt;/p&gt;

&lt;p&gt;Luciano Nooijen noticed this before the research did. He used AI tools heavily at work, went without them for a side project, and hit a wall. "I was feeling so stupid because things that used to be instinct became manual, sometimes even cumbersome." The instincts didn't fail slowly. They went quiet.&lt;/p&gt;




&lt;p&gt;AI coding tools can make you more productive. They can make you feel more confident. They can also produce developers who don't understand the context behind the code they've written or how to debug it.&lt;/p&gt;

&lt;p&gt;Stack Overflow called this "the illusion of expertise." I'd call it something slightly different. It's not that the expertise is fake. It's that the path that usually builds expertise — the struggle, the failure, the iteration — got shortened. When AI generates code without the developer engaging deeply with the implementation, they skip over all of those learning iterations. The developer doesn't struggle with the problem, doesn't try multiple approaches, doesn't experience the failures that build intuition.&lt;/p&gt;

&lt;p&gt;The old imposter syndrome was painful but self-correcting. You felt like a fraud, so you studied. You filled the gap. The new version doesn't have an obvious gap to fill. The code shipped. You're a senior developer with a good job and passing tests and no visible evidence that anything is wrong. The doubt is quieter. That's what makes it harder.&lt;/p&gt;




&lt;p&gt;We're living in the most empowering time to be a developer. The barrier to building things has never been lower. And yet, I've never felt less sure of my own skills.&lt;/p&gt;

&lt;p&gt;That's Pranav Reveendran, writing in December. He's not alone. A 2025 survey by Perceptyx found a "confidence gap" where 71% of employees use AI, but only 35% of individual contributors feel they understand it well enough. The adoption curve went up. The confidence curve didn't follow.&lt;/p&gt;

&lt;p&gt;This is what the posts were about, when people still wrote them. Not "do I belong in tech?" but "do I understand what I'm building?" The first question had a community around it. Thousands of replies, conferences, entire career tracks built around addressing it. The second question doesn't have a community yet. It's the thing people say in DMs but not in threads. It's the thing I've heard from developers who've been building for years: I know how to use the tools. I'm not sure I know the work anymore.&lt;/p&gt;




&lt;p&gt;I don't think this resolves neatly. The tools aren't going away and the productivity is real and I'm not arguing for anyone to stop using them. I use them. I built most of Foundation's evaluator with AI assistance and I'd do it again.&lt;/p&gt;

&lt;p&gt;But I think the silence is worth naming. The imposter syndrome posts didn't stop because the feeling stopped. They stopped because the feeling changed into something that's harder to share. "I don't know how to center a div" is embarrassing but legible. "I shipped a feature I can't fully explain" is a different kind of statement. It implicates the work, not just the person. It's harder to say &lt;em&gt;me too&lt;/em&gt; to.&lt;/p&gt;

&lt;p&gt;The old imposter syndrome asked: &lt;em&gt;am I good enough?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The new one asks: &lt;em&gt;is the work mine?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's a harder question. And so far, it's mostly going unasked.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>career</category>
      <category>webdev</category>
      <category>ai</category>
    </item>
    <item>
      <title>The Future Is Not the Agent Using a Human Interface</title>
      <dc:creator>Daniel Nwaneri</dc:creator>
      <pubDate>Mon, 23 Mar 2026 13:44:20 +0000</pubDate>
      <link>https://forem.com/dannwaneri/the-future-is-not-the-agent-using-a-human-interface-48cg</link>
      <guid>https://forem.com/dannwaneri/the-future-is-not-the-agent-using-a-human-interface-48cg</guid>
      <description>&lt;p&gt;Carl Pei said it at SXSW last week. His company, Nothing, makes smartphones. He stood on stage and told the room: "The future is not the agent using a human interface. You need to create an interface for the agent to use."&lt;/p&gt;

&lt;p&gt;A consumer hardware CEO publicly declaring that the product category he sells is the wrong shape for what's coming. That's not a hot take. That's a company repositioning before the floor disappears.&lt;/p&gt;

&lt;p&gt;The room moved on. Most of the coverage focused on the "apps are dying" framing. That's the wrong thing to argue about. Apps are not dying next Tuesday. The question worth sitting with is quieter and more immediately useful: what does it mean that the thing we've been building — apps designed for human eyes, human fingers, human intuition — is increasingly being navigated by something that has none of those?&lt;/p&gt;




&lt;p&gt;For thirty years, software design started with a user. A person with a screen, a mouse or finger, a working memory of roughly four things, limited patience, inconsistent behavior. Every design decision flowed from that starting point. Navigation was visual because users have eyes. Flows were linear because users lose context. Confirmation dialogs exist because users make mistakes and need to undo them.&lt;/p&gt;

&lt;p&gt;Those constraints weren't arbitrary. They were load-bearing. The entire architecture of how software works was built around the specific limitations and capabilities of a human being sitting in front of it.&lt;/p&gt;

&lt;p&gt;Agents don't have eyes. They don't navigate menus — they call functions. They don't get confused by non-linear flows — they parse structured outputs. They don't need confirmation dialogs — they need permission boundaries defined at initialization, not presented as popups mid-task.&lt;/p&gt;

&lt;p&gt;When an agent tries to use an app designed for a human, it's doing something like what Pei described: hiring a genius employee and making them work using elevator buttons. The capability is real. The interface is friction. The agent scrapes what it can, simulates the clicks, and works around the design rather than with it.&lt;/p&gt;

&lt;p&gt;This works. Barely. Temporarily.&lt;/p&gt;




&lt;p&gt;The split is already visible in how developers describe their workflows.&lt;/p&gt;

&lt;p&gt;Karpathy, in a March podcast, described moving from writing code to delegating to agents running 16 hours a day. His framing: macro actions over repositories, not line-by-line editing. The unit of work is no longer a file. It's a task.&lt;/p&gt;

&lt;p&gt;Addy Osmani wrote about the same shift at the interface level: the control plane becoming the primary surface, the editor becoming one instrument underneath it. What used to be the center of developer work is becoming a specialized tool for specific moments — the deep inspection, the edge case, the thing the agent got almost right and subtly wrong.&lt;/p&gt;

&lt;p&gt;These descriptions share a structure: something that was primary becomes secondary. Something that was implicit becomes explicit. The developer who used to navigate the editor now supervises agents. The app that used to be the product now needs to also be a legible interface for non-human callers.&lt;/p&gt;




&lt;p&gt;Here's what that means practically, for anyone building software right now.&lt;/p&gt;

&lt;p&gt;The apps built for human navigation will still work for humans. That's not going away. But increasingly, those apps will also be called by agents acting on behalf of humans — booking the flight, filing the form, triggering the workflow. And when the agent calls your app, it doesn't navigate. It looks for a contract: what can I call, what will you return, what happens when I'm wrong.&lt;/p&gt;

&lt;p&gt;Most apps don't have that contract. They have a UI. They have an API if you're lucky. But the API was designed as a developer convenience, not as a primary interface for autonomous callers. The rate limits assume human usage patterns. The error responses assume a developer reading them. The authentication assumes a human with a session.&lt;/p&gt;

&lt;p&gt;None of those assumptions hold for agents.&lt;/p&gt;

&lt;p&gt;The developers who see this clearly are already building differently. Not abandoning the human interface — users still need screens, still need control surfaces, still need to understand what's happening on their behalf. But building the agent interface in parallel. Structured outputs. Explicit capability declarations. Error responses designed to be parsed, not read. Permission boundaries that don't require a human to click through them.&lt;/p&gt;
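&lt;p&gt;Concretely, a capability declaration might look like this. The shape is hypothetical (there's no settled standard yet), but the parts are the ones above: what you can call, what comes back, what failure looks like, and where the permission boundary sits.&lt;/p&gt;

```python
# A hypothetical machine-readable contract for one capability.
# Every field is meant to be parsed by an agent, never read by a human.
BOOK_FLIGHT_CONTRACT = {
    "capability": "book_flight",
    "accepts": {
        "origin": "IATA airport code",
        "destination": "IATA airport code",
        "date": "ISO 8601 date",
    },
    "returns": {"booking_id": "string", "price_usd": "number"},
    "errors": [
        {"code": "NO_AVAILABILITY", "retryable": False},
        {"code": "RATE_LIMITED", "retryable": True, "retry_after_s": 30},
    ],
    # Permission boundary declared at initialization, not a mid-task popup.
    "permissions": {"max_spend_usd": 500, "above_limit": "require_human"},
}
```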




&lt;p&gt;The spec is where this shows up first.&lt;/p&gt;

&lt;p&gt;A product spec written for human implementation describes features. What the user sees. What they can do. How the flow works. A spec written for agent implementation describes contracts. What the system accepts. What it returns. What it guarantees. Where the boundaries are.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;[ASSUMPTION: ...]&lt;/code&gt; tags in spec-writer exist because that boundary — between what was specified and what the agent decided — is where agent implementations go wrong. Not in the happy path. In the assumptions that weren't stated because a human developer would have asked a clarifying question.&lt;/p&gt;

&lt;p&gt;An agent doesn't ask. It fills the gap with whatever its training suggests is most plausible. If the assumption was wrong, you find out in production.&lt;/p&gt;

&lt;p&gt;The spec that makes agent implementation reliable is the one that surfaces the assumptions before the agent starts. Not because agents are unreliable — they're remarkably capable within a well-defined contract. But because the contract has to be explicit in a way it never had to be when the implementer was human and could pick up the phone.&lt;/p&gt;




&lt;p&gt;Carl Pei's argument is about device form factors. The smartphone built around apps and home screens doesn't fit a world where agents intermediate between intention and execution.&lt;/p&gt;

&lt;p&gt;The same argument applies to every level of the stack.&lt;/p&gt;

&lt;p&gt;The database schema built for human-readable queries. The API designed for developer convenience. The workflow tool that requires clicking through five screens. The SaaS product whose entire value lives in a visual interface with no programmatic equivalent.&lt;/p&gt;

&lt;p&gt;None of these are broken today. They will accumulate friction as agent usage grows — the same way command-line tools accumulated friction when GUIs arrived, the same way desktop software accumulated friction when everything moved to the web.&lt;/p&gt;

&lt;p&gt;The difference is pace. The GUI transition took a decade. The web transition took another. The agent transition is compressing.&lt;/p&gt;




&lt;p&gt;The developers who will build the next layer are already asking a different question. Not "how does the user navigate this?" but "how does the agent call this?" Not "what does the confirmation dialog say?" but "what are the permission boundaries at initialization?" Not "how do we make the flow intuitive?" but "what does the contract guarantee?"&lt;/p&gt;

&lt;p&gt;These are not new questions. They're the questions API designers have always asked. What's new is that they now apply to everything — not just the backend service, but the whole product. The human interface remains. It becomes one layer, not the only layer.&lt;/p&gt;

&lt;p&gt;The future is not the agent using a human interface. The future is building interfaces designed for both — and knowing which decisions belong to which layer.&lt;/p&gt;

&lt;p&gt;The developers who get that distinction early will have built the right foundation before it becomes mandatory. The ones who don't will spend the next three years retrofitting contracts onto systems that were never designed to have them.&lt;/p&gt;

&lt;p&gt;That's the same pattern as every previous transition. The only thing that changes is how much runway you have before the friction becomes a crisis.&lt;/p&gt;

&lt;p&gt;Right now, you still have some.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>discuss</category>
      <category>ai</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Your Agent Is Making Decisions Nobody Authorized</title>
      <dc:creator>Daniel Nwaneri</dc:creator>
      <pubDate>Thu, 19 Mar 2026 18:02:23 +0000</pubDate>
      <link>https://forem.com/dannwaneri/your-agent-is-making-decisions-nobody-authorized-2bc7</link>
      <guid>https://forem.com/dannwaneri/your-agent-is-making-decisions-nobody-authorized-2bc7</guid>
      <description>&lt;p&gt;A quant fund ran five independent strategies. Every one passed its individual risk limits. Every quarterly filing looked reasonable in isolation. But all five strategies were overweight the same sector.&lt;/p&gt;

&lt;p&gt;Aggregate exposure exceeded anything anyone had authorized — because the complexity budget was scoped per-strategy, never cross-strategy. No single decision was wrong. The aggregate outcome was unauthorized. Nobody was watching the right scope.&lt;/p&gt;

&lt;p&gt;This is governance debt. It accumulates invisibly. Each decision individually correct. Aggregate outcome unauthorized. The failure only surfaces when the sector moves against the fund and the concentration nobody explicitly built turns out to have been built anyway — one individually-reasonable decision at a time.&lt;/p&gt;

&lt;p&gt;The token economy has no accounting entry for this. Nothing on the infrastructure bill reflects what happened. The cost appears later, in a different quarter, a different system, a different team's incident report.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Clocks
&lt;/h2&gt;

&lt;p&gt;Every agent system has at least two clocks running simultaneously.&lt;/p&gt;

&lt;p&gt;The execution clock measures computation — tokens consumed, API calls made, latency per response. This is what most monitoring systems track. It is visible, quantifiable, and entirely the wrong thing to govern.&lt;/p&gt;

&lt;p&gt;The governance clock measures consequences — decisions made, thresholds crossed, exposures accumulated. This clock runs at a fundamentally different speed than execution. It is also running on a fundamentally different metric.&lt;/p&gt;

&lt;p&gt;Execution counts tokens. Governance counts consequences.&lt;/p&gt;

&lt;p&gt;Most agent architectures try to govern at execution layer granularity. They instrument every API call, set token budgets, alert on cost spikes. The result is a monitoring system that costs more attention than the decisions it is protecting. The governance layer becomes noise. Worse than useless — it actively degrades decision quality by demanding attention on non-material changes.&lt;/p&gt;

&lt;p&gt;The fix is not better monitoring. It is a different master clock.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Master Clock
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/vibeyclaw"&gt;Vic Chen&lt;/a&gt; builds institutional investor analysis tooling around quarterly SEC filing data. The production domain he works in — 13F analysis, the quarterly disclosures hedge funds file showing their equity positions — has solved the governance clock problem in a way most software systems haven't.&lt;/p&gt;

&lt;p&gt;The expensive thing in 13F analysis is not parsing the filings. That part is mechanical. Run the filing through the pipeline, extract the positions, compare against the previous quarter. Cheap. Deterministic. Fast.&lt;/p&gt;

&lt;p&gt;The expensive thing is deciding which signals actually warrant human attention. Every false positive costs analyst time. Every false negative costs trust in the system. That judgment call does not map cleanly to token costs. You cannot optimize it by switching to a cheaper model or batching the API calls. The bottleneck is not computation. It is consequence.&lt;/p&gt;

&lt;p&gt;The master clock in Vic's system is anchored to SEC reporting cadence — filing deadlines, amendment windows, restatement periods. The quarterly 13F disclosure cycle is one of the few externally-anchored materiality signals in finance. It already encodes a judgment: this change was significant enough to report.&lt;/p&gt;

&lt;p&gt;When a fund flips a position intra-quarter and back again, that event never surfaces in the filing. This is not a detection failure. The governance architecture is working correctly. An intra-quarter flip that reverses before the filing deadline was not a conviction change. The master clock — anchored to external cadence rather than internal computation — correctly filters it out.&lt;/p&gt;

&lt;p&gt;This is filtering by design rather than detection by volume. The governance layer is not trying to catch everything. It is anchored to what the domain has already decided matters.&lt;/p&gt;

&lt;p&gt;When the system detects an NT 13F — a late filing notification — or an amendment chain exceeding two revisions, the complexity budget automatically expands. Slower parsing. Deeper cross-referencing. Human-in-the-loop checkpoints. The external signal triggers the governance response because the external signal already encodes the materiality judgment.&lt;/p&gt;

&lt;p&gt;The master clock is not a timer. It is a materiality filter.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Over-Calibration Trap
&lt;/h2&gt;

&lt;p&gt;The first failure mode most governance architectures hit is over-calibration.&lt;/p&gt;

&lt;p&gt;Early 13F monitoring systems flagged every rounding discrepancy between a fund's 13F and its 13D as potential drift. The noise-to-signal ratio made the governance layer worse than useless. Analysts were spending attention on non-material changes. The system meant to improve decision quality was degrading it.&lt;/p&gt;

&lt;p&gt;Vic's framing of the fix: governance cost should scale with expected loss from undetected drift, not with the volume of changes observed.&lt;/p&gt;

&lt;p&gt;A $50M position shift in a mega-cap is noise. A $50M position shift in a micro-cap is a thesis change. Same signal magnitude. Completely different materiality. The governance architecture has to encode that domain knowledge or it cannot distinguish between them.&lt;/p&gt;

&lt;p&gt;This is where the "significant drift threshold" becomes practical rather than theoretical. Without it, you are building a governance layer that is perpetually anxious — flagging everything, earning attention for nothing, training operators to ignore alerts. With it, the governance layer fires on what matters and stays quiet on what does not.&lt;/p&gt;

&lt;p&gt;The threshold is not a technical parameter. It is a domain judgment encoded into architecture.&lt;/p&gt;
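&lt;p&gt;The shape of that judgment is easy to show, even if the real threshold takes domain expertise to set. A toy version, normalizing the shift by company size (the 0.1% cutoff is illustrative):&lt;/p&gt;

```python
def is_material(shift_usd: float, market_cap_usd: float,
                threshold: float = 0.001) -> bool:
    # Judge drift relative to the company's size, not in raw dollars:
    # the same $50M is noise in a mega-cap, a thesis change in a micro-cap.
    return shift_usd / market_cap_usd >= threshold
```

&lt;p&gt;Same $50M signal: against a $2T mega-cap it stays quiet; against a $300M micro-cap it fires.&lt;/p&gt;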




&lt;h2&gt;
  
  
  Governance Debt
&lt;/h2&gt;

&lt;p&gt;Back to the five-strategy fund.&lt;/p&gt;

&lt;p&gt;The token economy version of that failure is identical in structure. An agent that makes two expensive API calls but expands the decision space by introducing correlated hypotheses is under token budget and creating governance debt simultaneously. The cost does not appear on the infrastructure bill. It appears when the downstream decision fails in ways that trace back to the expanded hypothesis space nobody reviewed.&lt;/p&gt;

&lt;p&gt;An agent that makes 100 cheap API calls but narrows a decision space from 5,000 options to 3 is adding value regardless of token cost. An agent that makes two expensive calls but introduces unreviewed complexity is accruing debt regardless of token efficiency.&lt;/p&gt;

&lt;p&gt;The decision economy is the accounting system that captures the difference.&lt;/p&gt;

&lt;p&gt;Three signals that matter in the decision economy and do not appear in token accounting:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision scope.&lt;/strong&gt; How many downstream choices does this agent action constrain or enable? An action that narrows the decision space is value-generating. An action that expands the decision space without corresponding resolution is debt-generating. The token cost of both can be identical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consequence materiality.&lt;/strong&gt; Not all decisions are equal. The governance clock runs faster when the expected loss from undetected drift is higher. A rounding discrepancy in a mega-cap position and a position reversal in a micro-cap both generate the same token cost to detect. Their materiality is orders of magnitude apart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authorization scope.&lt;/strong&gt; Was this decision within the scope of what was explicitly authorized, or did it cross a threshold that required a decision nobody made? Governance debt accumulates at the boundary between authorized and implicitly permitted.&lt;/p&gt;

&lt;p&gt;None of these signals are invisible. They require a different accounting system — one that tracks consequences rather than computation, materiality rather than volume, authorization scope rather than token spend.&lt;/p&gt;
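&lt;p&gt;What would that accounting system log? One record per agent action, carrying the three signals. The field names here are mine; nothing about this is a standard:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class DecisionRecord:
    action: str
    options_before: int       # decision scope entering the action
    options_after: int        # ...and leaving it
    expected_loss_usd: float  # consequence materiality if drift goes undetected
    authorized: bool          # inside the explicitly granted scope?

    @property
    def narrows(self) -> bool:
        # Value-generating actions shrink the downstream decision space.
        return self.options_before > self.options_after

    @property
    def governance_debt(self) -> bool:
        # Debt: expanding the space without resolution, or acting
        # outside what anyone explicitly authorized.
        return (not self.narrows) or (not self.authorized)
```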




&lt;h2&gt;
  
  
  The Rate Problem
&lt;/h2&gt;

&lt;p&gt;Static agents accumulate governance debt slowly. An agent that makes the same decisions in the same scope every session creates predictable exposure. You can audit it. You can scope the complexity budget correctly once and leave it.&lt;/p&gt;

&lt;p&gt;An agent with memory that promotes knowledge automatically accumulates governance debt at the rate the memory compounds. Each session, the promoted knowledge expands the hypothesis space the agent operates in. Each expansion is individually reasonable. The aggregate effect is an authorization scope that widens faster than anyone is reviewing it.&lt;/p&gt;

&lt;p&gt;The governance layer has to keep pace with the learning rate or it falls further behind with every session — not linearly, but exponentially.&lt;/p&gt;

&lt;p&gt;This is why the evaluator architecture in Foundation anchors promotion criteria to human judgment rather than automating it. Not because automation is wrong in principle. Because the governance debt from unchecked automatic promotion compounds at the same rate as the knowledge base — and the rate is the problem, not any individual promoted insight.&lt;/p&gt;

&lt;p&gt;The master clock that works for 13F analysis works for agent memory for the same reason: it is anchored to external materiality judgment rather than internal computation speed. The filing cadence enforces a review interval that the domain has already validated. The human promotion gate enforces a review interval that the knowledge system needs.&lt;/p&gt;

&lt;p&gt;Both are the same governance architecture. One is thirty years old. One is being built now.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Decision Economy Actually Measures
&lt;/h2&gt;

&lt;p&gt;Token costs are visible on the infrastructure bill. Governance debt is not.&lt;/p&gt;

&lt;p&gt;This asymmetry creates predictable incentives. Teams optimize for what they can measure. Token spend gets instrumented, budgeted, alerted. Governance debt accumulates until it manifests as a production failure, a trust breakdown, a concentration exposure nobody authorized across five individually reasonable decisions.&lt;/p&gt;

&lt;p&gt;The token economy prices execution correctly. It is the wrong accounting system for judgment costs because judgment costs do not appear until downstream consequences surface — often in a different quarter, a different system, a different team's incident report.&lt;/p&gt;

&lt;p&gt;The precedent that prices judgment correctly is not a token count. It is a record of what decisions were made under what conditions, what thresholds were crossed, what downstream exposure was created, and whether any of it was explicitly authorized.&lt;/p&gt;
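
&lt;p&gt;One minimal shape for such a record, as a sketch (the field names here are my own illustration, not a schema proposed in the piece):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// A decision record prices judgment: it tracks conditions, thresholds,
// and authorization rather than token spend.
interface DecisionRecord {
  decision: string;              // what was decided
  conditions: string;            // the context it was made under
  thresholdsCrossed: string[];   // e.g. concentration or scope limits
  downstreamExposure: string;    // what exposure the decision created
  explicitlyAuthorized: boolean; // was this inside the authorized scope?
  decidedAt: string;             // ISO 8601 timestamp
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;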

&lt;p&gt;That record is what makes governance debt visible before it becomes a production problem.&lt;/p&gt;

&lt;p&gt;Execution counts tokens. Governance counts consequences.&lt;/p&gt;

&lt;p&gt;The decision economy is what you build when you understand the difference.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part of a series on what AI actually changes in software development. Previous pieces: &lt;a href="https://dev.to/dannwaneri/the-gatekeeping-panic-what-ai-actually-threatens-in-software-development-5b9l"&gt;The Gatekeeping Panic&lt;/a&gt;, &lt;a href="https://dev.to/dannwaneri/the-meter-was-always-running-44c4"&gt;The Meter Was Always Running&lt;/a&gt;, &lt;a href="https://dev.to/dannwaneri/who-said-what-to-whom-5914"&gt;Who Said What to Whom&lt;/a&gt;, &lt;a href="https://dev.to/dannwaneri/the-token-economy-3cd9"&gt;The Token Economy&lt;/a&gt;, &lt;a href="https://dev.to/dannwaneri/building-the-evaluator-36kg"&gt;Building the Evaluator&lt;/a&gt;, &lt;a href="https://dev.to/dannwaneri/i-shipped-broken-code-and-wrote-an-article-about-it-98p"&gt;I Shipped Broken Code and Wrote an Article About It&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The governance debt examples and 13F production evidence in this piece draw from analysis by &lt;a href="https://dev.to/vibeyclaw"&gt;Vic Chen&lt;/a&gt;, who builds institutional investor analysis tooling around quarterly SEC filing data. The five-strategy concentration example is his.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>architecture</category>
      <category>discuss</category>
    </item>
    <item>
      <title>I Built a Skill That Writes Your Specs For You</title>
      <dc:creator>Daniel Nwaneri</dc:creator>
      <pubDate>Tue, 17 Mar 2026 11:35:53 +0000</pubDate>
      <link>https://forem.com/dannwaneri/i-built-a-skill-that-writes-your-specs-for-you-1o2n</link>
      <guid>https://forem.com/dannwaneri/i-built-a-skill-that-writes-your-specs-for-you-1o2n</guid>
      <description>&lt;p&gt;Julián Deangelis published a piece this week that hit 194k+ views in days. The argument: AI coding agents don't fail because the model is weak. They fail because the instructions are ambiguous.&lt;/p&gt;

&lt;p&gt;He called it Spec Driven Development. Four steps: specify what you want, plan how to build it technically, break it into tasks, implement one task at a time. Each step reduces ambiguity. By the time the agent starts writing code, it has everything it needs — what the feature does, what the edge cases are, what the tests should verify.&lt;/p&gt;

&lt;p&gt;The piece is right. The problem is the workflow takes discipline most sessions don't have. Under deadline pressure, the spec step disappears and you're back to prompting directly.&lt;/p&gt;

&lt;p&gt;So I built a skill that does the spec-writing for you.&lt;/p&gt;




&lt;h2&gt;
  
  
  The silent decisions problem
&lt;/h2&gt;

&lt;p&gt;Julián's example stuck with me:&lt;/p&gt;

&lt;p&gt;"Add a feature to manage items from the backoffice."&lt;/p&gt;

&lt;p&gt;The agent reads the codebase, picks a pattern, writes the feature. At first it looks fine. Then you click "add item" again and it inserts the same item twice. All the assumptions you thought were obvious were never in the prompt.&lt;/p&gt;

&lt;p&gt;Which backoffice — internal or seller-facing? Should the operation be idempotent? Admin-only or all users? Which storage layer? Which error handling strategy?&lt;/p&gt;

&lt;p&gt;Each one is a silent decision. The agent guesses. Some guesses are right. Some are wrong. And you don't find out which until the feature is in production behaving in ways you didn't expect.&lt;/p&gt;

&lt;p&gt;The spec is what makes those decisions visible before the agent guesses. But writing a spec from scratch, for every feature, before every session — that's the friction that kills the habit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Generate first, flag assumptions inline
&lt;/h2&gt;

&lt;p&gt;The standard approach to spec tools is Q&amp;amp;A: the tool asks clarifying questions before generating anything. What's the user role? What's the auth model? What should happen on error?&lt;/p&gt;

&lt;p&gt;The problem with Q&amp;amp;A is you have to know what you don't know. If you knew exactly what questions to ask, you'd be close to having the spec already.&lt;/p&gt;

&lt;p&gt;spec-writer takes a different approach. It generates the full spec immediately and marks every decision it made that you didn't specify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="nf"&gt;Given &lt;/span&gt;an authenticated user requesting an export
&lt;span class="err"&gt;When the export contains more than 1,000 rows [ASSUMPTION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;async&lt;/span&gt; &lt;span class="err"&gt;for&lt;/span&gt; &lt;span class="err"&gt;large&lt;/span&gt; &lt;span class="err"&gt;exports]&lt;/span&gt;
&lt;span class="nf"&gt;Then &lt;/span&gt;the export is generated asynchronously and the user is notified by email
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At the end, every assumption is collected into a prioritized list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Assumptions to review&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; Async generation for exports over 1,000 rows — Impact: HIGH
   Correct this if: your order volumes are low and synchronous is fine
&lt;span class="p"&gt;
2.&lt;/span&gt; Date range filter is required — Impact: MEDIUM
   Correct this if: the first version should export all orders with no filtering
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Correcting a draft is faster than answering questions cold. The &lt;code&gt;[ASSUMPTION: ...]&lt;/code&gt; tags show you exactly where the agent would have guessed wrong — before it guesses.&lt;/p&gt;




&lt;h2&gt;
  
  
  A real example
&lt;/h2&gt;

&lt;p&gt;I ran it on something I'm actually building — CLI capture for Foundation, my knowledge commons project. Foundation currently captures Claude.ai browser conversations via a browser extension. Claude Code sessions live in &lt;code&gt;~/.claude/projects/&lt;/code&gt; and disappear after 30 days. I needed to capture those too.&lt;/p&gt;

&lt;p&gt;The prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/spec-writer Add CLI capture for Claude Code sessions stored in ~/.claude/projects/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's a slice of what came back:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From the spec:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="nf"&gt;Given &lt;/span&gt;a developer with Claude Code sessions in ~/.claude/projects/
&lt;span class="nf"&gt;When &lt;/span&gt;they run the Foundation CLI capture command
&lt;span class="nf"&gt;Then &lt;/span&gt;a list of uncaptured sessions is displayed, sorted by most recent first

&lt;span class="nf"&gt;Given &lt;/span&gt;a session already in Foundation
&lt;span class="nf"&gt;When &lt;/span&gt;the developer runs capture again
&lt;span class="nf"&gt;Then &lt;/span&gt;that session does not appear in the review list

&lt;span class="nf"&gt;Given &lt;/span&gt;a malformed .jsonl file
&lt;span class="nf"&gt;When &lt;/span&gt;the capture command processes it
&lt;span class="nf"&gt;Then &lt;/span&gt;an error is logged for that file and processing continues on remaining sessions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;From the plan:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLI capture runs locally — it needs to read the local filesystem, which 
Cloudflare Workers cannot do. [ASSUMPTION: CLI capture is a local Node.js 
or Bun script that calls the Foundation API to insert sessions, rather 
than a Worker itself]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That assumption stopped me. I hadn't thought through the architecture split — Foundation runs on Cloudflare Workers, which can't read local filesystems. The skill surfaced it before I'd written a line of code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The assumptions summary:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;1.&lt;/span&gt; CLI capture is a local script calling the Foundation API, not a Worker
   Impact: HIGH
   Correct this if: you want a purely serverless approach
&lt;span class="p"&gt;
2.&lt;/span&gt; Manual curation before capture, not automatic bulk import
   Impact: HIGH
   Correct this if: you want automatic background capture
&lt;span class="p"&gt;
3.&lt;/span&gt; Session ID from .jsonl filename is the deduplication key
   Impact: MEDIUM
   Correct this if: session IDs are stored differently in your schema
&lt;span class="p"&gt;
4.&lt;/span&gt; No sensitive data scrubbing in v1
   Impact: MEDIUM
   Correct this if: your sessions contain credentials or keys
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four assumptions, two of them high-impact. The architecture one I would have hit mid-implementation. The sensitive data one I wouldn't have thought about until someone complained.&lt;/p&gt;

&lt;p&gt;That's the value. Not the spec itself. The visible assumptions.&lt;/p&gt;




&lt;h2&gt;
  
  
  How it fits into SDD
&lt;/h2&gt;

&lt;p&gt;spec-writer gets you to Spec-First immediately, with no ceremony. One command, full output, correct the assumptions, hand to the agent.&lt;/p&gt;

&lt;p&gt;Julián describes two levels beyond that. Spec-Anchored: the spec lives in the repo and evolves with the code. Spec-as-Source: the spec is the primary artifact, code is regenerated to match. If you want to move toward Spec-Anchored, save the output to &lt;code&gt;specs/feature-name.md&lt;/code&gt; in your repo. The skill produces something worth keeping.&lt;/p&gt;

&lt;p&gt;The methodology is Julián's. The skill is the friction remover.&lt;/p&gt;




&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ~/.claude/skills
git clone https://github.com/dannwaneri/spec-writer.git ~/.claude/skills/spec-writer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/spec-writer [your feature description]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Works across Claude Code, Cursor, Gemini CLI, and any agent that supports the Agent Skills standard. The same SKILL.md file works everywhere.&lt;/p&gt;

&lt;p&gt;The repo is at &lt;a href="https://github.com/dannwaneri/spec-writer" rel="noopener noreferrer"&gt;github.com/dannwaneri/spec-writer&lt;/a&gt;. The README has the full output format and a worked example.&lt;/p&gt;




&lt;p&gt;The agents are getting better at implementing. The bottleneck was always specification — knowing what to build precisely enough that the agent doesn't have to guess. spec-writer doesn't remove that requirement. It makes it faster to satisfy.&lt;/p&gt;

&lt;p&gt;The spec isn't the output. The assumptions are.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built on the Spec Driven Development methodology — operationalized by tools like &lt;a href="https://github.com/github/spec-kit" rel="noopener noreferrer"&gt;GitHub Spec Kit&lt;/a&gt; and &lt;a href="https://github.com/Fission-AI/OpenSpec" rel="noopener noreferrer"&gt;OpenSpec&lt;/a&gt;. Julián Deangelis's writing on SDD at MercadoLibre was the direct inspiration.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Other skills: &lt;a href="https://github.com/dannwaneri/voice-humanizer" rel="noopener noreferrer"&gt;voice-humanizer&lt;/a&gt; — checks your writing against your own voice, not generic AI patterns.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>webdev</category>
      <category>claudecode</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>Beyond the Scrapbook: Building a Developer Knowledge Commons</title>
      <dc:creator>Daniel Nwaneri</dc:creator>
      <pubDate>Mon, 16 Mar 2026 10:09:02 +0000</pubDate>
      <link>https://forem.com/dannwaneri/beyond-the-scrapbook-building-a-developer-knowledge-commons-4d40</link>
      <guid>https://forem.com/dannwaneri/beyond-the-scrapbook-building-a-developer-knowledge-commons-4d40</guid>
      <description>&lt;p&gt;DEV.to shipped Agent Sessions last week — a beta feature that lets you upload Claude Code session files directly to your DEV profile, curate what to include, and embed specific exchanges in your posts. Upload a &lt;code&gt;.jsonl&lt;/code&gt; or &lt;code&gt;.json&lt;/code&gt; session file, pick the moments worth keeping, save it. The parser currently supports Claude Code, Gemini CLI, Codex, Pi, and GitHub Copilot CLI.&lt;/p&gt;

&lt;p&gt;I was in the announcement thread.&lt;/p&gt;

&lt;p&gt;Pascal flagged that the useful unit for embedding in technical writing is a specific exchange, not the full session. You want the moment the agent made the wrong call, or the prompt that finally produced the right output. Uploading a full session to access two minutes of it is friction that'll stop most writers from using it at all.&lt;/p&gt;

&lt;p&gt;The DEV team shipped a refactor the same day. Client-side curation, 10MB limit gone. That's a fast turnaround.&lt;/p&gt;

&lt;p&gt;So I'll say this clearly: the feature works. The local parsing decision is the right call architecturally — nothing hits their servers before you've seen it. The sessions sitting in &lt;code&gt;~/.claude/projects/&lt;/code&gt; right now are invisible to everyone, including future you. DEV.to made them visible.&lt;/p&gt;

&lt;p&gt;That's the first 10% of the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Actually in a Session File
&lt;/h2&gt;

&lt;p&gt;A Claude Code &lt;code&gt;.jsonl&lt;/code&gt; session file isn't clean. It's a raw record of everything: every prompt, every response, every tool call, every error, every retry, every tangent that went nowhere.&lt;/p&gt;

&lt;p&gt;A 4-hour session produces tens of thousands of lines. Most of it is noise. The three decisions that mattered — the architectural choice you made in hour two, the wrong assumption you caught before it shipped, the pattern you'd use again — are buried in the middle of a conversation about a linter warning.&lt;/p&gt;

&lt;p&gt;Manual curation finds those moments if you remember where they are. It doesn't find them if you don't. And it definitely doesn't find them across 40 sessions from the last two months.&lt;/p&gt;

&lt;p&gt;This is the problem DEV.to's feature doesn't touch: extraction.&lt;/p&gt;

&lt;p&gt;Not capture. Not curation. The automated process of reading a session and asking — what here is worth keeping? What decision was made, what reasoning supported it, what would someone need to know to not repeat this mistake?&lt;/p&gt;

&lt;p&gt;That's not a UI problem. That's an evaluation problem. It's solvable — but not with a manual upload flow.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Scrapbook vs The Commons
&lt;/h2&gt;

&lt;p&gt;DEV.to built a scrapbook. Upload a session. Pick the good parts. Post it.&lt;/p&gt;

&lt;p&gt;That's individually useful. A developer who publishes their sessions regularly builds a public record of how they think. That has real value — for their own portfolio, for their team, for the developers who find it later.&lt;/p&gt;

&lt;p&gt;But it doesn't compound.&lt;/p&gt;

&lt;p&gt;The knowledge in that session stays attached to that post, on DEV.to's platform, searchable through DEV.to's search, accessible through DEV.to's interface. It doesn't talk to anything else. It can't be queried by an AI tool working on a related problem. It can't be federated to a developer who runs their own instance. It can't surface automatically when a similar problem comes up in a different session six months later.&lt;/p&gt;

&lt;p&gt;The session is uploaded. The knowledge is published. Nothing learns from it.&lt;/p&gt;

&lt;p&gt;A knowledge commons does something different. It extracts the signal from the noise automatically, scores it, indexes it semantically, and makes it retrievable — by humans, by AI tools, by other instances in a federated network. The knowledge compounds. Sessions from six months ago surface when they're relevant today. A decision made by a developer in one context becomes findable by a developer in a different context who's facing the same constraint.&lt;/p&gt;

&lt;p&gt;That's what's missing. Not the upload. The evaluation layer that turns uploaded sessions into searchable, reusable, federated knowledge.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Manual Curation Ceiling
&lt;/h2&gt;

&lt;p&gt;DEV.to's flow puts curation on you. You decide what to include. You decide what to cut. You write the context that makes it useful for other readers.&lt;/p&gt;

&lt;p&gt;That's fine for one session. It doesn't scale to forty.&lt;/p&gt;

&lt;p&gt;The developers who would benefit most from session preservation — the ones building seriously, shipping regularly, running multiple Claude Code sessions a day — are exactly the developers who don't have time to manually curate every session before publishing. They'll upload three sessions in the first week because it's novel. Then they'll stop. The sessions keep accumulating in &lt;code&gt;~/.claude/projects/&lt;/code&gt;. The feature stops getting used.&lt;/p&gt;

&lt;p&gt;This is the friction problem that killed a dozen "public learning" features before it. The intent is genuine. The execution cost is too high to sustain.&lt;/p&gt;

&lt;p&gt;The alternative is automatic extraction that runs in the background — sessions flow in, evaluation runs, high-signal insights surface for review, low-signal noise is filtered out before it ever reaches you. You review what the evaluator flagged, not the raw session. The curation burden drops from hours to minutes.&lt;/p&gt;

&lt;p&gt;That's a pipeline, not a UI.&lt;/p&gt;

&lt;p&gt;Agent Sessions is the UI.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who Owns the Knowledge
&lt;/h2&gt;

&lt;p&gt;There's a structural question underneath the feature that DEV.to doesn't answer: who owns the knowledge you publish?&lt;/p&gt;

&lt;p&gt;When you upload a session to DEV.to and publish it, the content lives on DEV.to's platform. That's fine — DEV.to has been a good steward of developer content. But the knowledge is centralized. DEV.to controls the search. DEV.to controls the API. DEV.to controls what gets surfaced and how.&lt;/p&gt;

&lt;p&gt;We've seen this pattern before. Stack Overflow was a good steward too, until the economics changed. Reddit. Twitter. The platforms that host the knowledge and the platforms that profit from it are the same platform, and that alignment doesn't hold forever.&lt;/p&gt;

&lt;p&gt;A federated model works differently. Your knowledge lives on your instance. You control the search. You decide what gets federated to other instances and what stays private. The network is the value, not the platform. No single company owns the graph.&lt;/p&gt;

&lt;p&gt;ActivityPub makes this possible today. The same protocol that lets Mastodon instances talk to each other can federate developer knowledge across independent instances. Your session insights are yours. They travel to other instances when you choose. They're retrievable by any tool that speaks the protocol.&lt;/p&gt;

&lt;p&gt;DEV.to can't build this. It's structurally incompatible with a platform business model. A platform needs the content centralized to monetize the audience. Federation distributes the content, which distributes the audience, which distributes the revenue.&lt;/p&gt;

&lt;p&gt;That's not a criticism — it's a constraint. DEV.to is doing what a platform can do. Federation is what a commons can do.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Other 90% Looks Like
&lt;/h2&gt;

&lt;p&gt;I've been building Foundation for the past several months. It's not a concept.&lt;/p&gt;

&lt;p&gt;A browser extension captures conversations automatically from Claude.ai — no manual upload, no file hunting. Conversations flow into a three-table schema in D1 (chats, messages, chunks) and get chunked at the passage level for semantic search in Vectorize. When a session ends, the pipeline runs an evaluation pass using Workers AI (Llama 3.3 70B): three binary signals — is there a concrete technique being applied (usage), is it confirmed to work (validation), is it actionable rather than vague (specificity). Score ≥ 0.67 auto-promotes to knowledge memory. Score 0.33–0.67 surfaces for human review. Score &amp;lt; 0.33 gets discarded.&lt;/p&gt;

&lt;p&gt;The knowledge that survives is indexed semantically and exposed via an MCP server: &lt;code&gt;list_chats&lt;/code&gt;, &lt;code&gt;get_chat&lt;/code&gt;, &lt;code&gt;extract_insights&lt;/code&gt;, &lt;code&gt;score_insights&lt;/code&gt;. Any AI tool that speaks MCP can query it. The session ended an hour ago. The insight from it is already retrievable in the next session.&lt;/p&gt;

&lt;p&gt;The federation layer is ActivityPub. The same protocol Mastodon runs on. Your insights live on your instance. They travel to other instances when you choose. A developer running their own Foundation can federate with yours without either of you going through a centralized platform.&lt;/p&gt;

&lt;p&gt;That's the full picture. Not a wireframe — running code, at &lt;a href="https://github.com/dannwaneri/chat-knowledge" rel="noopener noreferrer"&gt;github.com/dannwaneri/chat-knowledge&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;The problem is now visible enough for a major platform to build around it. Developers are generating valuable knowledge in their sessions and losing it. That's real, and DEV.to naming it publicly matters.&lt;/p&gt;

&lt;p&gt;But capture without evaluation is a scrapbook. A scrapbook without federation is a silo. A silo on someone else's platform is borrowed infrastructure.&lt;/p&gt;

&lt;p&gt;The commons version of this is harder to build. That's why I've been building it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>discuss</category>
    </item>
    <item>
      <title>I Built a Knowledge Evaluator That Uses Notion to Judge What's Worth Remembering</title>
      <dc:creator>Daniel Nwaneri</dc:creator>
      <pubDate>Thu, 12 Mar 2026 16:14:22 +0000</pubDate>
      <link>https://forem.com/dannwaneri/i-built-a-knowledge-evaluator-that-uses-notion-to-judge-whats-worth-remembering-2mlf</link>
      <guid>https://forem.com/dannwaneri/i-built-a-knowledge-evaluator-that-uses-notion-to-judge-whats-worth-remembering-2mlf</guid>
      <description>&lt;p&gt;Foundation kept everything. After 300 conversations, search returned noise. That was the problem I was actually trying to solve.&lt;/p&gt;

&lt;p&gt;I've been building &lt;a href="https://github.com/dannwaneri/chat-knowledge" rel="noopener noreferrer"&gt;Foundation&lt;/a&gt; — a federated knowledge system on Cloudflare Workers — and the hardest part isn't capturing insights from conversations. It's deciding which ones deserve to persist. Throw everything into Vectorize and you end up with noise.&lt;/p&gt;

&lt;p&gt;The Notion MCP Challenge gave me a reason to isolate the evaluator and ship it as a standalone thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Idea
&lt;/h2&gt;

&lt;p&gt;Most challenge submissions pipe data &lt;em&gt;into&lt;/em&gt; Notion. This one uses Notion as the &lt;strong&gt;judgment surface&lt;/strong&gt; — the place where ambiguous knowledge items wait for a human to decide if they're worth keeping.&lt;/p&gt;

&lt;p&gt;The architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Conversation excerpt
        ↓
Workers AI (Llama 3.3 70B) scores 3 signals
        ↓
Score ≥ 0.67 → Knowledge Memory (auto-promoted)
Score 0.33–0.67 → Notion Review Queue (human judges)
Score &amp;lt; 0.33 → Discarded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notion isn't just receiving output. The Worker queries it back for pending items and uses it as the approval layer. That's the move most submissions didn't make.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Signals
&lt;/h2&gt;

&lt;p&gt;The evaluator scores each knowledge item on three binary signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Usage&lt;/strong&gt; — Is there a concrete technique, command, or pattern being applied?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation&lt;/strong&gt; — Is it confirmed to work (tested, referenced, agreed upon)?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specificity&lt;/strong&gt; — Is it actionable, not just vague advice?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A score of 1.0 means all three fired. That item goes straight to Knowledge Memory. A two-signal item scores 2/3, which rounds to 0.67 but falls just short of the auto-promote cutoff: useful, but unverified. It surfaces in the Notion Review Queue as &lt;code&gt;Pending&lt;/code&gt;. A human makes the call.&lt;/p&gt;
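
&lt;p&gt;The routing arithmetic, spelled out (a sketch; variable names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Score is the mean of three binary signals.
const signals = { usage: 1, validation: 1, specificity: 0 };
const score =
  (signals.usage + signals.validation + signals.specificity) / 3;

// 2 / 3 = 0.666..., which sits just below the 0.67 auto-promote
// cutoff, so a two-signal item lands in the review queue.
const route =
  score &amp;gt;= 0.67 ? 'Knowledge Memory'
  : score &amp;gt;= 0.33 ? 'Review Queue'
  : 'Discarded';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;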

&lt;h2&gt;
  
  
  Building It
&lt;/h2&gt;

&lt;p&gt;The Worker is Hono on Cloudflare Workers, Workers AI for scoring, and the Notion REST API for reads and writes.&lt;/p&gt;

&lt;p&gt;The evaluator uses the &lt;code&gt;messages&lt;/code&gt; format — not &lt;code&gt;prompt&lt;/code&gt; — with a strict system instruction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@cf/meta/llama-3.3-70b-instruct-fp8-fast&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;You are a knowledge quality evaluator. Respond with ONLY valid JSON.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Score this excerpt and return ONLY:
{"usage": 0|1, "validation": 0|1, "specificity": 0|1, "summary": "one sentence"}

Excerpt: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing I hit: the model returns &lt;code&gt;{ response: { usage: 1, ... } }&lt;/code&gt; — a nested object, not a string. The &lt;code&gt;prompt&lt;/code&gt; parameter made it worse, returning free-form essays instead of JSON. Switching to &lt;code&gt;messages&lt;/code&gt; with a system role fixed it.&lt;/p&gt;
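
&lt;p&gt;A defensive parse that tolerates both shapes looks roughly like this (a sketch; the helper name and defaults are mine, not the repo's code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// The model sometimes returns the parsed object directly and
// sometimes a JSON string, so normalize before reading signals.
function parseSignals(raw: unknown) {
  const body = typeof raw === 'string' ? JSON.parse(raw) : raw;
  const b = body as {
    usage?: number; validation?: number; specificity?: number; summary?: string;
  };
  return {
    usage: b.usage ?? 0,
    validation: b.validation ?? 0,
    specificity: b.specificity ?? 0,
    summary: b.summary ?? '',
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;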

&lt;p&gt;The routing logic is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.67&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Auto-promote to Knowledge Memory&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;writeToNotion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;KNOWLEDGE_MEMORY_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;signals&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.33&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Surface for human judgment&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;writeToNotion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;REVIEW_QUEUE_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;signals&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Pending&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Discard&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Notion as the Judgment Layer
&lt;/h2&gt;

&lt;p&gt;Two databases: &lt;strong&gt;Knowledge Memory&lt;/strong&gt; (permanent store) and &lt;strong&gt;Review Queue&lt;/strong&gt; (human inbox — a Select field with &lt;code&gt;Pending&lt;/code&gt;, &lt;code&gt;Approved&lt;/code&gt;, &lt;code&gt;Rejected&lt;/code&gt;).&lt;/p&gt;
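&lt;p&gt;The &lt;code&gt;writeToNotion&lt;/code&gt; helper called in the routing code above boils down to building a Notion page payload. Here is a hypothetical sketch: the property names (&lt;code&gt;Name&lt;/code&gt;, &lt;code&gt;Score&lt;/code&gt;, &lt;code&gt;Signals&lt;/code&gt;, &lt;code&gt;Source&lt;/code&gt;, &lt;code&gt;Status&lt;/code&gt;, &lt;code&gt;Raw Context&lt;/code&gt;) mirror the fields described in this post, but the repo's actual helper may differ.&lt;/p&gt;

```typescript
// Hypothetical payload builder for the writeToNotion helper shown earlier.
// Property names mirror the Review Queue fields; the real helper may differ.
function buildNotionPage(
  dbId: string,
  summary: string,
  score: number,
  signals: string[],
  source: string,
  text: string,
  status?: string,
) {
  const properties: { [key: string]: unknown } = {
    Name: { title: [{ text: { content: summary } }] },
    Score: { number: score },
    Signals: { rich_text: [{ text: { content: signals.join(", ") } }] },
    Source: { rich_text: [{ text: { content: source } }] },
    "Raw Context": { rich_text: [{ text: { content: text } }] },
  };
  if (status) {
    // Only Review Queue writes carry a Select status ('Pending' etc.).
    properties["Status"] = { select: { name: status } };
  }
  return { parent: { database_id: dbId }, properties };
}
```

&lt;p&gt;The returned object is the shape a &lt;code&gt;POST /v1/pages&lt;/code&gt; request body takes, sent with the same &lt;code&gt;Authorization&lt;/code&gt; and &lt;code&gt;Notion-Version&lt;/code&gt; headers as the query below.&lt;/p&gt;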

&lt;p&gt;The Worker has three endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;POST /evaluate&lt;/code&gt; — scores a knowledge item and routes it&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /pending&lt;/code&gt; — queries Notion for items awaiting review&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /&lt;/code&gt; — health check&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;/pending&lt;/code&gt; endpoint is what makes Notion MCP genuinely active:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="s2"&gt;`https://api.notion.com/v1/databases/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;REVIEW_QUEUE_ID&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/query`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Bearer &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;NOTION_TOKEN&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Notion-Version&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2022-06-28&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;property&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Status&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;select&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Pending&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;sorts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;property&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Created&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;direction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;descending&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When something lands in Review Queue, a human opens Notion and changes Status from &lt;code&gt;Pending&lt;/code&gt; to &lt;code&gt;Approved&lt;/code&gt; or &lt;code&gt;Rejected&lt;/code&gt;. That's the loop. Notion isn't a log — it's a decision surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Looks Like Running
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;► Querying Review Queue...  Pending: 1

[1/3] HIGH-CONFIDENCE
  Score: 100% | Signals: usage, validation, specificity
  ✓ Auto-promoted → Knowledge Memory

[2/3] AMBIGUOUS
  Score: 67% | Signals: usage, specificity
  ⚡ Surfaced in Notion Review Queue

[3/3] LOW-CONFIDENCE
  Score: 0% | Destination: discarded
  ✗ Not worth preserving.

► Re-querying...  Pending: 2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/ElpR79l0N6s"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;Live endpoint: &lt;code&gt;https://knowledge-evaluator.fpl-test.workers.dev&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing the Loop with Notion MCP
&lt;/h2&gt;

&lt;p&gt;The Worker writes to Notion via REST. But the other direction — reading pending items back via MCP — is where it gets interesting.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;@notionhq/notion-mcp-server&lt;/code&gt; configured in Claude Desktop, you can query the Review Queue directly in a conversation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Query my Notion Review Queue database and show me all items with Status Pending"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Claude returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;1 item found&lt;/span&gt;

&lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="s"&gt;The excerpt mentions using waitUntil() in Cloudflare Workers for&lt;/span&gt;
          &lt;span class="s"&gt;tasks like logging, but lacks production validation.&lt;/span&gt;
&lt;span class="na"&gt;Score&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    &lt;span class="m"&gt;0.67&lt;/span&gt;
&lt;span class="na"&gt;Status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;🟡 Pending&lt;/span&gt;
&lt;span class="na"&gt;Signals&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;usage, specificity&lt;/span&gt;
&lt;span class="na"&gt;Source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;dev-chat&lt;/span&gt;
&lt;span class="na"&gt;Created&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;March 10, &lt;/span&gt;&lt;span class="m"&gt;2026&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then offers: &lt;em&gt;"Want to promote, flag, or dismiss it?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's the full loop. The Worker evaluates and writes. Notion holds the item. Claude reads it back via MCP and surfaces it for a human decision. Two MCP interactions — one outbound (REST), one inbound (MCP server) — both hitting the same Notion database.&lt;/p&gt;

&lt;p&gt;The architecture now looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Conversation excerpt
        ↓
Workers AI scores 3 signals
        ↓
Score ≥ 0.67 → Knowledge Memory (auto-promoted via Notion REST)
Score 0.33–0.67 → Review Queue (Pending, written via Notion REST)
Score &amp;lt; 0.33 → Discarded
        ↓
Claude Desktop queries Review Queue via @notionhq/notion-mcp-server
        ↓
Human approves or rejects in Notion
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notion isn't a log — it's the decision surface that both the Worker and the AI agent can see.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/7XFDmJK7_CY"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;The scoring threshold matters more than the model. The harder question — what makes something worth keeping — I haven't fully answered. That's why the Review Queue exists.&lt;/p&gt;

&lt;p&gt;The model doesn't need fine-tuning for this. Writing the three signals took longer than the entire Worker.&lt;/p&gt;

&lt;p&gt;Knowledge provenance is preserved throughout — the original conversation snippet lives in the &lt;code&gt;Raw Context&lt;/code&gt; field, so when you're approving something in Notion, you can see exactly where it came from.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The thresholds (0.67 and 0.33) are now tunable via env vars — &lt;code&gt;HIGH_THRESHOLD&lt;/code&gt; and &lt;code&gt;LOW_THRESHOLD&lt;/code&gt; in &lt;code&gt;wrangler.toml&lt;/code&gt;. Different knowledge domains need different bars.&lt;/p&gt;
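&lt;p&gt;A minimal sketch of that parsing, assuming the vars arrive as strings and fall back to the hard-coded defaults when unset or malformed (illustrative only; the repo's actual parsing may differ):&lt;/p&gt;

```typescript
// Sketch: read HIGH_THRESHOLD / LOW_THRESHOLD from wrangler.toml [vars],
// falling back to the article's defaults when unset or malformed.
interface ThresholdEnv {
  HIGH_THRESHOLD?: string;
  LOW_THRESHOLD?: string;
}

function getThresholds(env: ThresholdEnv) {
  const high = Number.parseFloat(env.HIGH_THRESHOLD ?? "");
  const low = Number.parseFloat(env.LOW_THRESHOLD ?? "");
  return {
    // Number.isFinite rejects the NaN produced by empty or malformed strings
    high: Number.isFinite(high) ? high : 0.67,
    low: Number.isFinite(low) ? low : 0.33,
  };
}
```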

&lt;p&gt;A fourth signal is the natural evolution: &lt;strong&gt;novelty&lt;/strong&gt; — does this item duplicate something already in Knowledge Memory? Without it, the permanent store will accumulate redundant entries over time. That's the next thing to build.&lt;/p&gt;
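&lt;p&gt;To make the gate concrete, here is a hypothetical token-overlap version. A production implementation would use embeddings; every name here is illustrative.&lt;/p&gt;

```typescript
// Hypothetical novelty gate: token-overlap (Jaccard) similarity between a
// candidate summary and what Knowledge Memory already holds. A real version
// would use embeddings; this only shows where the fourth signal plugs in.
function jaccard(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\W+/).filter(Boolean));
  const tb = new Set(b.toLowerCase().split(/\W+/).filter(Boolean));
  let overlap = 0;
  for (const token of ta) {
    if (tb.has(token)) overlap += 1;
  }
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : overlap / union;
}

// Novel when no stored summary is too similar to the candidate.
function isNovel(candidate: string, stored: string[], maxSim = 0.8): boolean {
  for (const summary of stored) {
    if (jaccard(candidate, summary) >= maxSim) {
      return false;
    }
  }
  return true;
}
```

&lt;p&gt;A candidate that fails the gate could be discarded or sent to the Review Queue instead of auto-promoted.&lt;/p&gt;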

&lt;h2&gt;
  
  
  Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare Workers&lt;/strong&gt; — runtime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workers AI&lt;/strong&gt; (Llama 3.3 70B) — evaluator&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hono&lt;/strong&gt; — routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notion API&lt;/strong&gt; — Review Queue + Knowledge Memory (writes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;@notionhq/notion-mcp-server&lt;/strong&gt; — MCP query layer (reads)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/dannwaneri/knowledge-evaluator" rel="noopener noreferrer"&gt;github.com/dannwaneri/knowledge-evaluator&lt;/a&gt;&lt;/p&gt;

</description>
      <category>notionchallenge</category>
      <category>devchallenge</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Building the Evaluator</title>
      <dc:creator>Daniel Nwaneri</dc:creator>
      <pubDate>Wed, 11 Mar 2026 16:24:00 +0000</pubDate>
      <link>https://forem.com/dannwaneri/building-the-evaluator-36kg</link>
      <guid>https://forem.com/dannwaneri/building-the-evaluator-36kg</guid>
      <description>&lt;p&gt;The sequel isn't about running or stopping. It's about whether the memory survives the stop.&lt;/p&gt;

&lt;p&gt;That line came from a comment thread on The Token Economy. Someone named Kalpaka had been reading through the series — the stop signal problem, the authority over interruption argument, the architectural gap between what agents can do and what they do reliably — and arrived at the question none of the pieces had answered.&lt;/p&gt;

&lt;p&gt;You can solve the stop signal problem. You can build interruption authority into your architecture. You can define done before the session starts and give the agent a contract it has to satisfy before it terminates.&lt;/p&gt;

&lt;p&gt;None of that answers what happens to the knowledge after the session ends.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Graveyard Problem
&lt;/h2&gt;

&lt;p&gt;Noah Vincent described it precisely for personal knowledge systems: "A week after consuming something, you could not explain what you learned if your life depended on it. The notes exist. The highlights are there. But you never use any of it."&lt;/p&gt;

&lt;p&gt;He was describing Obsidian vaults. The same problem exists for agent memory systems, RAG pipelines, and every second brain anyone has ever built.&lt;/p&gt;

&lt;p&gt;The issue is not storage. Storage is solved. The issue is that storage without evaluation is just accumulation. You end up with a beautifully organized graveyard — everything preserved, nothing improved, retrieval returning the noise alongside the signal because the system has no way to tell the difference.&lt;/p&gt;

&lt;p&gt;Artem Zhutov ran 700 Claude Code sessions over three weeks and built a semantic search layer to make them retrievable. The &lt;code&gt;/recall&lt;/code&gt; skill can surface what happened in any session by topic, time, or graph visualization. He solved the retrieval problem.&lt;/p&gt;

&lt;p&gt;But retrieval without evaluation means the 700 sessions are equally weighted. The session where he made a breakthrough architectural decision and the session where he debugged a typo for an hour are both in the index. Both surface when you search. The system got larger. It did not get smarter.&lt;/p&gt;

&lt;p&gt;This is the graveyard problem stated for agent memory. Not that the knowledge is lost. That it accumulates without the quality signal that would make it worth retrieving.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Evaluator Actually Is
&lt;/h2&gt;

&lt;p&gt;The AIGNE paper — "Everything is Context: Agentic File System Abstraction for Context Engineering" — describes a complete cognitive architecture for AI agents. Four components: constructor (shrinks context to fit the current window), updater (swaps pieces in and out as the conversation progresses), evaluator (checks answers and updates memory based on what worked), and scratchpad separation (raw history, long-term memory, and short-lived working memory as distinct stores).&lt;/p&gt;

&lt;p&gt;Most memory architecture discussions cover the constructor and the retrieval layer. Almost nobody builds the evaluator.&lt;/p&gt;

&lt;p&gt;The evaluator is the component that decides what gets promoted from episodic memory ("what happened") to semantic memory ("when pattern X appears, do Y"). The two compound at completely different rates and decay differently too. Episodic memory is retrievable history. Semantic memory is institutional knowledge. The architectural choice determines which moat you're building.&lt;/p&gt;

&lt;p&gt;Without the evaluator, you have retrieval. With it, you have learning.&lt;/p&gt;

&lt;p&gt;The difference: retrieval returns what you stored. Learning returns what proved true.&lt;/p&gt;




&lt;h2&gt;
  
  
  Kuro's Proof
&lt;/h2&gt;

&lt;p&gt;Kuro built a perception-driven AI agent that runs 24/7. Every five minutes it wakes up, checks the environment, and decides whether anything needs attention. The problem: more than half its cycles ended with "nothing to do" — 50K tokens consumed per cycle to confirm that nothing was happening.&lt;/p&gt;

&lt;p&gt;His solution was a triage layer — a local lightweight model that runs in 800 milliseconds and decides whether the expensive reasoning layer should fire at all. Hard rules handle the obvious cases in zero milliseconds. The lightweight model handles ambiguous cases. The expensive model only sees what passes both filters.&lt;/p&gt;

&lt;p&gt;Production numbers across 626 decisions: 75.9% of triggers never reached the expensive model. The quality of remaining cycles went up because the expensive brain only saw what mattered.&lt;/p&gt;

&lt;p&gt;Kuro solved the perception triage problem. The evaluator is the same pattern applied to knowledge.&lt;/p&gt;

&lt;p&gt;Before a conversation gets promoted to semantic memory: hard rules first. Is this a duplicate of something indexed in the last hour? Skip. Is this a direct exchange that produced a decision that got used? Always process. Then lightweight triage — is this session high enough signal to warrant full embedding? Then full semantic processing only for what passes both filters.&lt;/p&gt;

&lt;p&gt;The skip rate Kuro observed at the perception layer — 56% filtered at triage — will likely hold for knowledge too. Most sessions are noise. The minority that contain genuine learning are the ones worth promoting to semantic memory.&lt;/p&gt;

&lt;p&gt;The evaluator doesn't store less. It promotes selectively. The difference is what gets retrieved when you need it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Evaluator Needs to Know
&lt;/h2&gt;

&lt;p&gt;The hard part isn't building the evaluator. It's deciding what signal it uses.&lt;/p&gt;

&lt;p&gt;The obvious answer is engagement: sessions with more back-and-forth, longer exchanges, more follow-up questions. But engagement measures interest, not correctness. A session where you spent two hours debugging the wrong approach was highly engaging. It doesn't belong in semantic memory as a reliable pattern.&lt;/p&gt;

&lt;p&gt;The better signal is validation: knowledge that proved correct under real conditions. The specific lesson about what breaks when you process financial filings at scale is worth promoting because it survived production. It was tested against real data, real edge cases, real failure modes, and it held.&lt;/p&gt;

&lt;p&gt;This is what distinguishes semantic memory from episodic memory in the domain that matters. Not "this was an interesting conversation" but "this turned out to be true when it was tested."&lt;/p&gt;

&lt;p&gt;The evaluator needs three signals:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Did the knowledge get used?&lt;/strong&gt; If a session produced a decision that was applied in a subsequent session — referenced in a decision, applied to a problem, cited in writing — that's evidence of value. The decision survived contact with a real problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Did the knowledge hold up?&lt;/strong&gt; If the pattern that emerged from a session was later contradicted by production evidence, that's evidence it shouldn't be promoted. The evaluator should demote as well as promote. Knowledge that fails in production gets flagged for review rather than silently remaining in semantic memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is the knowledge specific enough to be useful?&lt;/strong&gt; "Use conservative thresholds" is a platitude. "The threshold should be empirically derived from the first 50 production failures before any tuning begins, because false negatives are unrecoverable and false positives cost only an extra review cycle" is specific enough to act on. The evaluator should preserve the conditions that make the lesson true, not just the conclusion.&lt;/p&gt;

&lt;p&gt;Cornelius — building a fiction consistency system for novelists — arrived at the same problem from a different angle: "The system must know the difference between a violation and a discovery." A scene where a character breaks a world rule might be an error. Or it might be the generative mistake that becomes the best scene in the book. The evaluator has to distinguish between them.&lt;/p&gt;

&lt;p&gt;So does Foundation's knowledge evaluator. A session that contradicts an established pattern might be noise. Or it might be the production failure that invalidates a previously reliable assumption.&lt;/p&gt;

&lt;p&gt;Some of those calls require human judgment. The evaluator surfaces them. It doesn't make them.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Provenance Requirement
&lt;/h2&gt;

&lt;p&gt;The evaluator is what makes a knowledge commons different from a search index.&lt;/p&gt;

&lt;p&gt;A search index returns documents that match your query. A semantic memory layer should return knowledge that proved true, with the context that makes it actionable, attributed to the specific conditions under which it was validated.&lt;/p&gt;

&lt;p&gt;Not "here is everything about timeout handling." But "here is what held up under 500K daily API calls in production, with the specific edge cases that caused the original timeouts and the conditions under which the fix applies."&lt;/p&gt;

&lt;p&gt;The provenance is not metadata. It is the knowledge. Without the conditions that made the lesson true, the lesson is a platitude. With them, it is scar tissue — the kind of specific, attributed, conditions-included knowledge that survived the averaging process that centralized AI systems apply to everything they train on.&lt;/p&gt;

&lt;p&gt;This is the knowledge collapse argument stated for personal memory systems. The models train on the averaged output of everything humans have written and return the median of all knowledge. The evaluator preserves the specific — the edges, the conditions, the validated exceptions — that the averaging process destroys.&lt;/p&gt;

&lt;p&gt;Without it, you're building an archive. With it, you're building institutional memory that improves every time it's tested against real conditions.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Building the Evaluator Actually Means
&lt;/h2&gt;

&lt;p&gt;The pre-filter runs before any of this. Not every conversation reaches the evaluation layer. Hard rules first — direct exchange with a human, always process; ambient capture with no tracked concepts referenced, skip. Coarse content check second — does this conversation contain a decision, a question that changed direction, a pattern worth naming? Cheap signal, deterministic, costs nothing to run. The evaluation layer only fires on what passes both. Production evidence from perception triage systems suggests roughly 56% will skip at the pre-filter stage. The knowledge commons improves not by evaluating more but by evaluating less, better.&lt;/p&gt;
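&lt;p&gt;The two stages above can be sketched as a pair of cheap, deterministic functions. The names and the keyword heuristic are illustrative, not taken from any real system.&lt;/p&gt;

```typescript
// Sketch of the two-stage pre-filter described above; all names are
// illustrative, and the keyword heuristic stands in for a real content check.
type Excerpt = { source: string; text: string; seenRecently: boolean };

// Stage 1: deterministic hard rules, no model call, zero cost.
function hardRule(e: Excerpt): "process" | "skip" | "triage" {
  if (e.seenRecently) return "skip"; // duplicate indexed in the last hour
  if (e.source === "direct-exchange") return "process"; // always evaluate
  return "triage"; // ambiguous: fall through to the coarse check
}

// Stage 2: coarse content check. Does it contain a decision or named pattern?
function coarseCheck(e: Excerpt): boolean {
  return /\b(decided|decision|pattern|because|instead of)\b/i.test(e.text);
}

// Only what passes both stages reaches the expensive evaluation layer.
function preFilter(e: Excerpt): boolean {
  const rule = hardRule(e);
  if (rule === "skip") return false;
  if (rule === "process") return true;
  return coarseCheck(e);
}
```

&lt;p&gt;Only what returns &lt;code&gt;true&lt;/code&gt; would reach the scoring model.&lt;/p&gt;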

&lt;p&gt;For what passes the pre-filter, building the evaluator means three additions to any memory system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A usage tracking layer.&lt;/strong&gt; When a retrieved knowledge item gets used in a subsequent session — referenced in a decision, applied to a problem, cited in writing — that event gets logged. Usage is the primary signal that something is worth promoting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A validation feedback loop.&lt;/strong&gt; When a decision based on promoted knowledge turns out to be wrong, that event gets logged and the promotion gets reviewed. The evaluator demotes as well as promotes. When promoted knowledge gets invalidated by new production evidence, the old entry doesn't stay in semantic memory with a warning attached — it gets resolved. Knowledge that fails in production gets replaced rather than contradicted silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A specificity filter.&lt;/strong&gt; Before any knowledge gets promoted to semantic memory, the evaluator checks whether it contains the conditions that make it true. Generic conclusions get returned to episodic memory with a note: too general to promote — needs production validation before it earns semantic status.&lt;/p&gt;

&lt;p&gt;None of this is automatic. The evaluator surfaces candidates for promotion and demotion. The human makes the final call on the ambiguous cases — the violations that might be discoveries, the patterns that might be noise, the lessons that might be wrong.&lt;/p&gt;

&lt;p&gt;That's not a limitation. It's the architecture. The evaluator extends human judgment rather than replacing it. The cases that are obviously worth keeping get promoted without friction. The cases that require judgment get surfaced for human review rather than silently accumulating in a system that treats all knowledge as equally valid.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Memory That Survives the Stop
&lt;/h2&gt;

&lt;p&gt;Kalpaka's question is the right one to end on.&lt;/p&gt;

&lt;p&gt;The stop signal problem is solvable. Interruption authority is a design decision. The contract that defines done before the session starts is a pattern that works in production.&lt;/p&gt;

&lt;p&gt;But when the session ends — when the agent stops, when the context window closes, when the work is done for the day — what survives?&lt;/p&gt;

&lt;p&gt;Without the evaluator: everything survives equally. The breakthrough and the dead end. The validated pattern and the assumption that turned out to be wrong. The specific, attributed, conditions-included knowledge and the platitude that sounds like knowledge but isn't.&lt;/p&gt;

&lt;p&gt;With the evaluator: the scar tissue survives. The knowledge that was tested against real conditions and held up. The specific lessons with the specific contexts that make them true. The institutional memory that compounds because it improves every time it gets used rather than just accumulating every time something gets stored.&lt;/p&gt;

&lt;p&gt;That's the difference between a memory system and a knowledge system. The memory system stores what happened. The knowledge system keeps what proved true.&lt;/p&gt;

&lt;p&gt;The evaluator is what expands the cognitive light cone backward in time. Memory without evaluation gives you storage. Memory with evaluation gives you a light cone that extends further with every validated lesson — not just accumulating what happened but compounding what proved true.&lt;/p&gt;

&lt;p&gt;The knowledge that holds up under pressure is the only knowledge worth keeping.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part of a series on what AI actually changes in software development. Previous pieces: &lt;a href="https://dev.to/dannwaneri/the-gatekeeping-panic-what-ai-actually-threatens-in-software-development-5b9l"&gt;The Gatekeeping Panic&lt;/a&gt;, &lt;a href="https://dev.to/dannwaneri/the-meter-was-always-running-44c4"&gt;The Meter Was Always Running&lt;/a&gt;, &lt;a href="https://dev.to/dannwaneri/who-said-what-to-whom-5914"&gt;Who Said What to Whom&lt;/a&gt;, &lt;a href="https://dev.to/dannwaneri/the-token-economy-3cd9"&gt;The Token Economy&lt;/a&gt;, &lt;a href="https://dev.to/dannwaneri/i-shipped-broken-code-and-wrote-an-article-about-it-98p"&gt;I Shipped Broken Code and Wrote an Article About It&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>architecture</category>
      <category>discuss</category>
    </item>
    <item>
      <title>The Junior Developer Isn't Extinct—They're Stuck Below the API</title>
      <dc:creator>Daniel Nwaneri</dc:creator>
      <pubDate>Sun, 08 Mar 2026 02:46:18 +0000</pubDate>
      <link>https://forem.com/dannwaneri/the-junior-developer-isnt-extinct-theyre-stuck-below-the-api-a6b</link>
      <guid>https://forem.com/dannwaneri/the-junior-developer-isnt-extinct-theyre-stuck-below-the-api-a6b</guid>
      <description>&lt;p&gt;Everyone's writing about the death of junior developers. The anxiety is real. The job market data backs it up. But we're misdiagnosing the problem.&lt;/p&gt;

&lt;p&gt;The junior developer role isn't extinct. It's stuck Below the API, and we haven't figured out how to pull it back up.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Real Divide
&lt;/h3&gt;

&lt;p&gt;Below the API is everything AI handles cheaper, faster, and often better than humans: boilerplate, basic CRUD, unit tests for simple functions, JSON schema conversion. Above the API is everything requiring judgment, verification, and context AI can't access: system design, debugging race conditions in production, knowing when to reject a confident-but-wrong suggestion.&lt;/p&gt;

&lt;p&gt;Junior developers used to climb from Below to Above by doing the boring work. Write unit tests, learn how systems break. Convert schemas, understand data flow. Fix bugs, build debugging intuition. Now AI does that work. We deleted the ladder.&lt;/p&gt;

&lt;h3&gt;
  
  
  What NorthernDev Got Right
&lt;/h3&gt;

&lt;p&gt;NorthernDev nailed the career pipeline problem. Five years ago, tedious work like writing unit tests for a legacy module went to a junior developer — boring for seniors, gold for juniors. Today it goes to Copilot.&lt;/p&gt;

&lt;p&gt;That's not a hiring freeze. That's the bottom rung of the ladder disappearing.&lt;/p&gt;

&lt;p&gt;The result is a barbell: super-seniors who are 10x faster with AI on one end, people who can prompt but can't debug production on the other. The middle is gone. The path from one group to the other is blocked.&lt;/p&gt;

&lt;p&gt;What's missing from that diagnosis: the role isn't dead, it's transformed.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Forensic Developer
&lt;/h3&gt;

&lt;p&gt;NorthernDev suggests teaching juniors to audit AI output — forensic coding. That's exactly what Above the API means.&lt;/p&gt;

&lt;p&gt;The old junior role: write code, senior reviews, learn from mistakes. The new junior role: AI writes code, junior audits, learn from AI's mistakes. The skill isn't syntax anymore. It's verification.&lt;/p&gt;

&lt;p&gt;The problem is you can't verify what you don't understand. To audit AI-generated code you need to know what it's supposed to do, how it actually works, what will break in production, and why the AI's clean solution is wrong. Those are senior-level skills. We're asking juniors to do senior work without the ramp to get there.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Traditional Training Doesn't Work Anymore
&lt;/h3&gt;

&lt;p&gt;Anthropic published experimental research that validates this directly. In a randomized controlled trial with junior engineers, the AI-assistance group finished tasks about two minutes faster but scored 17% lower on mastery quizzes. Two letter grades. The researchers called it a "significant decrease in mastery."&lt;/p&gt;

&lt;p&gt;The interesting part: some in the AI group scored highly. The difference wasn't the tool. It was how they used it. The high scorers asked conceptual and clarifying questions to understand the code they were working with, rather than delegating to AI. Same tool. Different approach. One stayed Above the API. One fell Below.&lt;/p&gt;

&lt;p&gt;That 17% gap is what happens when you optimize for speed without building verification capability.&lt;/p&gt;

&lt;p&gt;A Nature editorial published in June 2025 makes the underlying mechanism explicit: writing is not just reporting thoughts, it's how thoughts get formed. The researchers argue that outsourcing writing to LLMs means the cognitive work that generates insight never happens — the paper exists but the thinking didn't. The same principle applies to code. The junior who delegates to AI gets the function but skips the reasoning that would have revealed why the function is wrong.&lt;/p&gt;

&lt;p&gt;The mechanism is friction. When I started, bad Stack Overflow answers forced skepticism — you got burned, you learned to verify. AI removes that friction. It's patient, confident, never annoyed when you ask the same question twice. Amir put it well in the comments on my last piece: "AI answers confidently by default. Without friction, it's easy to skip the doubt step. Maybe the new skill we need to teach isn't how to find answers, but how to interrogate them."&lt;/p&gt;

&lt;p&gt;We optimized for kindness and removed the teacher.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Actually Needs to Change
&lt;/h3&gt;

&lt;p&gt;The junior role needs three shifts in how we define entry-level skills, how we build verification capability publicly, and how we measure performance.&lt;/p&gt;

&lt;p&gt;Entry-level used to mean knowing syntax and writing functions. Now it means reading and comprehending code, identifying architectural problems in AI output, and understanding that verification is more valuable than generation. The portfolio that gets you hired in 2026 isn't a todo app — AI generates one in 30 seconds. It's documented judgment: "Here's AI code I rejected and why." "Here's an AI suggestion that seemed right but failed in production." "Here's how I verified this architectural decision."&lt;/p&gt;

&lt;p&gt;Stack Overflow taught through public mistakes. That's why we started The Foundation — junior developers need public artifacts that prove judgment, not just syntax. Private AI chats build no portfolio. No proof of thinking. Invisible conversations that leave no trace.&lt;/p&gt;

&lt;p&gt;The interview question needs to change too. Not "build a todo app in React" but "here's 500 lines of AI-generated code for a payment gateway. Tests pass. AI says it's successful. Logs show it's dropping 3% of transactions. You have 30 minutes. What's wrong?" That's the new entry test. Can you find the subtle bug AI introduced by optimizing for elegance over financial correctness? Can you explain why this clean code fails at scale?&lt;/p&gt;
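&lt;p&gt;The gateway and the 3% figure above are hypothetical, but the class of bug is real. A toy sketch (all names invented) of "elegant" validation code that silently drops valid transactions, next to the boring integer-cents version that doesn't:&lt;/p&gt;

```python
def accepts_float(items: list[float], total: float) -> bool:
    # Reads cleanly in review: do the line items reconstruct the total?
    # But float arithmetic makes this intermittently false for valid orders.
    return sum(items) == total

def accepts_cents(items_cents: list[int], total_cents: int) -> bool:
    # The same check in integer cents: exact arithmetic, so valid
    # orders always pass.
    return sum(items_cents) == total_cents

# A perfectly valid $0.30 order the float version rejects,
# because 0.10 + 0.20 != 0.30 in binary floating point:
print(accepts_float([0.10, 0.20], 0.30))  # False — transaction dropped
print(accepts_cents([10, 20], 30))        # True
```

&lt;p&gt;Every test with round numbers passes, the code looks cleaner than the integer version, and production quietly loses a slice of real orders. That is the elegance-over-correctness failure the scenario is probing for.&lt;/p&gt;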

&lt;p&gt;Companies waiting for AI-ready juniors to appear are part of the problem. Nobody is training them. That's your job.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Economic Reality
&lt;/h3&gt;

&lt;p&gt;Companies see AI as cheaper than juniors. That math only works if you ignore production bugs from unverified code, architectural debt from AI's kitchen-sink solutions, security vulnerabilities AI confidently introduces, and scale failures AI didn't test for.&lt;/p&gt;

&lt;p&gt;Cheap verification is expensive at scale. A junior who catches those problems early is worth 10x their salary, but only if we teach them how to verify.&lt;/p&gt;

&lt;p&gt;NorthernDev asked the right question: if we stop hiring juniors because AI can do it, where will the seniors come from in 2030?&lt;/p&gt;

&lt;p&gt;Nobody has a good answer yet. But the companies that figure it out will have a pipeline. The ones waiting for AI to get better will watch their seniors retire with no one to replace them.&lt;/p&gt;




&lt;p&gt;The junior developer isn't extinct. The old path — syntax to simple tasks to complex tasks to senior — is dead. The new path runs through verification, public judgment, and the ability to interrogate confident-but-wrong answers before they reach production.&lt;/p&gt;

&lt;p&gt;That's not a lower bar. It's a different one.&lt;/p&gt;

&lt;p&gt;The ladder didn't disappear. We just forgot we have to build it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>career</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
