<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Michael Tuszynski</title>
    <description>The latest articles on Forem by Michael Tuszynski (@michaeltuszynski).</description>
    <link>https://forem.com/michaeltuszynski</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1447774%2Fa99eea93-7845-4764-9fce-b1755bcfa456.png</url>
      <title>Forem: Michael Tuszynski</title>
      <link>https://forem.com/michaeltuszynski</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/michaeltuszynski"/>
    <language>en</language>
    <item>
      <title>Production LLM Guardrails: 8 Controls Every AI Team Needs</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Wed, 06 May 2026 15:22:10 +0000</pubDate>
      <link>https://forem.com/michaeltuszynski/production-llm-guardrails-8-controls-every-ai-team-needs-4e8f</link>
      <guid>https://forem.com/michaeltuszynski/production-llm-guardrails-8-controls-every-ai-team-needs-4e8f</guid>
      <description>&lt;p&gt;Most AI projects fail somewhere between &lt;em&gt;demo works&lt;/em&gt; and &lt;em&gt;production ships&lt;/em&gt;. The gap is rarely the model. It's the absence of the controls that turn a one-shot prompt into a system you can run, audit, and iterate on without setting fire to the budget.&lt;/p&gt;

&lt;p&gt;I made the chart above as the one-page version of the controls I would put on any AI team's first production sprint. Eight of them, organized by which part of the model call they shape: Input, Reasoning, Output, Operations. Below is why each one matters and where teams typically get them wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Input Control: shape what goes in
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Few-shot prompting
&lt;/h3&gt;

&lt;p&gt;Show the model two to five high-quality input/output examples instead of writing long instructions. The model picks up format, edge cases, and tone from examples in a way it does not from imperative prose. Five good examples beat five hundred words of "make sure to handle X, also Y, also Z."&lt;/p&gt;

&lt;p&gt;The mistake teams make is treating few-shot as a fallback when the system prompt isn't working. It's the opposite. For classification, extraction, structured rewriting — most of the work that LLM apps actually do — few-shot is the &lt;em&gt;primary&lt;/em&gt; mechanism. Long instructions are the fallback.&lt;/p&gt;
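
&lt;p&gt;Here is roughly what that looks like in code, using the Anthropic SDK as the example. The model id and the ticket examples are placeholders, not recommendations:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# Few-shot classification: the examples carry the format and the edge cases.
# The input/output pairs below are illustrative placeholders, not real data.
import anthropic

client = anthropic.Anthropic()

FEW_SHOT = """Classify the support ticket as BILLING, BUG, or FEATURE_REQUEST.

Ticket: "I was charged twice for the March invoice."
Label: BILLING

Ticket: "The export button does nothing on Safari."
Label: BUG

Ticket: "It would be great if reports could be scheduled weekly."
Label: FEATURE_REQUEST

Ticket: "{ticket}"
Label:"""

def classify(ticket):
    response = client.messages.create(
        model="claude-sonnet-4-5",   # placeholder model id
        max_tokens=10,
        messages=[{"role": "user", "content": FEW_SHOT.format(ticket=ticket)}],
    )
    return response.content[0].text.strip()
&lt;/code&gt;&lt;/pre&gt;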

&lt;h3&gt;
  
  
  2. Role-specific prompting
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Senior credit risk analyst, fifteen years commercial lending&lt;/em&gt; outperforms &lt;em&gt;Act as a financial analyst&lt;/em&gt; by a margin that surprises people the first time they measure it. The specific role is doing real work: it constrains vocabulary, narrows the latent distribution, and gives the model permission to refuse questions that fall outside the domain.&lt;/p&gt;

&lt;p&gt;Generic personas — &lt;em&gt;helpful assistant&lt;/em&gt;, &lt;em&gt;senior engineer&lt;/em&gt;, &lt;em&gt;expert&lt;/em&gt; — don't constrain anything. They optimize for nothing. Use roles that name the years, the domain, and the seniority. The more specific, the better the calibration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reasoning Control: shape how it thinks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3. Chain-of-thought prompting
&lt;/h3&gt;

&lt;p&gt;Force step-by-step reasoning before the final answer. The model arrives at better conclusions when the reasoning is exposed in the output, because next-token prediction is conditioned on the reasoning it just generated rather than on a leap to the conclusion.&lt;/p&gt;

&lt;p&gt;For step-by-step legal, financial, or compliance-adjacent workflows, CoT is a default, not an optimization. The cost is more output tokens. The benefit is fewer wrong answers on the kinds of problems where wrong answers are expensive.&lt;/p&gt;
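
&lt;p&gt;A minimal prompted-CoT sketch. The delimiter convention here is arbitrary; what matters is that the reasoning is generated before the verdict, and that downstream code consumes only the verdict:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# Prompted chain-of-thought: reasoning first, answer last, parse only the answer.
import anthropic

client = anthropic.Anthropic()

COT_PROMPT = """You are reviewing an expense line for policy compliance.

Work through the policy step by step under a 'Reasoning:' heading.
Then give a one-word verdict (APPROVE or ESCALATE) on a final line
starting with 'Verdict:'.

Expense: {expense}"""

def review(expense):
    response = client.messages.create(
        model="claude-sonnet-4-5",   # placeholder model id
        max_tokens=800,
        messages=[{"role": "user", "content": COT_PROMPT.format(expense=expense)}],
    )
    text = response.content[0].text
    # The reasoning stays in the transcript for audit; only the verdict ships.
    return text.rsplit("Verdict:", 1)[-1].strip()
&lt;/code&gt;&lt;/pre&gt;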

&lt;h3&gt;
  
  
  4. Extended thinking / reasoning models
&lt;/h3&gt;

&lt;p&gt;For genuinely hard problems — multi-step analysis, math, code review, planning — use the provider's native reasoning mode rather than prompted CoT. &lt;a href="https://docs.claude.com/en/docs/build-with-claude/extended-thinking" rel="noopener noreferrer"&gt;Claude's extended thinking&lt;/a&gt; and OpenAI's o-series both expose a separate token budget for the model to think before answering. The reasoning token budget is configurable. The output token budget is separate.&lt;/p&gt;

&lt;p&gt;Prompted CoT and native reasoning solve overlapping problems but are not interchangeable. Native reasoning is more reliable on hard problems and roughly equivalent or worse on easy ones. The default rule: use prompted CoT for routine workflows, switch to native reasoning when the failure mode is "the model jumped to a wrong conclusion despite being asked to think."&lt;/p&gt;
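
&lt;p&gt;With Claude's extended thinking, the reasoning budget is a request parameter rather than a prompt instruction. A minimal sketch; the model id and both budgets are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# Native extended thinking: a separate, configurable budget for reasoning tokens.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",                            # placeholder model id
    max_tokens=4096,                                      # output budget, separate from thinking
    thinking={"type": "enabled", "budget_tokens": 2048},  # reasoning budget, must stay under max_tokens
    messages=[{"role": "user", "content": "Review this migration plan for ordering hazards: ..."}],
)

# The response interleaves thinking blocks and text blocks; ship only the text.
answer = "".join(block.text for block in response.content if block.type == "text")
&lt;/code&gt;&lt;/pre&gt;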

&lt;h2&gt;
  
  
  Output Control: shape what comes out
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5. Structured outputs and tool use
&lt;/h3&gt;

&lt;p&gt;Use the provider's native structured output feature, not prose-described JSON. Schema is enforced by the API, not requested in the prompt. The provider guarantees the output parses; your code does not have to retry-with-jq.&lt;/p&gt;

&lt;p&gt;The mistake is asking for JSON in the prompt and then writing a tolerant parser to handle the cases where the model returns &lt;em&gt;Sure! Here's the JSON: {...}&lt;/em&gt;. Native structured outputs and tool-use schemas remove the entire class of "the model added an apologetic preamble" failures. For any LLM call whose output feeds a downstream system or API, structured outputs are not an optimization; they are the API contract.&lt;/p&gt;
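
&lt;p&gt;One way to get schema-enforced output from the Anthropic API is a forced tool call. A sketch, with an invented schema:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# Schema enforced by the API via a forced tool call, not requested in prose.
import anthropic

client = anthropic.Anthropic()

record_invoice = {
    "name": "record_invoice",
    "description": "Record the fields extracted from an invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total": {"type": "number"},
            "currency": {"type": "string"},
        },
        "required": ["vendor", "total", "currency"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-5",                                # placeholder model id
    max_tokens=1024,
    tools=[record_invoice],
    tool_choice={"type": "tool", "name": "record_invoice"},   # force the schema
    messages=[{"role": "user", "content": "Invoice text: ..."}],
)

# tool_use blocks carry parsed JSON matching input_schema; no apologetic preamble to strip.
fields = next(block.input for block in response.content if block.type == "tool_use")
&lt;/code&gt;&lt;/pre&gt;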

&lt;h3&gt;
  
  
  6. Negative prompting and output filters
&lt;/h3&gt;

&lt;p&gt;Tell the model what &lt;em&gt;not&lt;/em&gt; to do, and filter the output before it ships. Belt and suspenders. Negative prompting works in the prompt; output filters work in code, after the response. They cover different failure modes — the prompt handles the model's bias toward certain phrasings, the filter handles the cases where the prompt didn't.&lt;/p&gt;

&lt;p&gt;This is where PII handling, tone control, and regulated-content workflows live. The control is uninteresting until the day a model paraphrases something it should have refused, and then it is the most interesting control on the list.&lt;/p&gt;
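
&lt;p&gt;The filter half is ordinary code that runs after the model and before anything ships. A minimal sketch; the patterns and the on-hit policy are yours to define:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# Output filter: runs in code, after the model responds, before the response ships.
import re

BLOCKED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # US SSN-shaped strings
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),     # card-number-shaped strings
]

def filter_output(text):
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(text):
            # Block, redact, or route to human review. That is a policy decision,
            # made in code, not a behavior requested from the model.
            raise ValueError("Response blocked by output filter")
    return text
&lt;/code&gt;&lt;/pre&gt;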

&lt;h2&gt;
  
  
  Operations: make it durable in production
&lt;/h2&gt;

&lt;h3&gt;
  
  
  7. Evals
&lt;/h3&gt;

&lt;p&gt;Versioned test suites with pass/fail thresholds. No prompt change ships without an eval run. This is the artifact that turns prompt engineering from a vibe into an engineering discipline.&lt;/p&gt;

&lt;p&gt;Evals belong to the same family of artifacts as test suites, lint configurations, and the &lt;a href="https://www.mpt.solutions/the-knowledge-base-is-not-the-moat-the-loop-is/" rel="noopener noreferrer"&gt;append-only mistake logs I wrote about yesterday&lt;/a&gt;. Triggered by a change. Append-only by design. Read by the deployment pipeline, not by humans except when something fails. They are the loop that keeps the prompt from rotting.&lt;/p&gt;
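
&lt;p&gt;The minimal version is smaller than most teams expect. A sketch, with invented file names and thresholds:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# Minimal eval harness: versioned cases, an explicit threshold, run on every prompt change.
import json

PASS_THRESHOLD = 0.95   # placeholder; set and version it per suite

def run_evals(predict, cases_path="evals/cases.jsonl"):
    with open(cases_path, encoding="utf-8") as f:
        cases = [json.loads(line) for line in f]
    passed = sum(1 for case in cases if predict(case["input"]) == case["expected"])
    rate = passed / len(cases)
    # The deployment pipeline reads the exit status; humans read it only on failure.
    assert rate &amp;gt;= PASS_THRESHOLD, f"eval pass rate {rate:.2%} is below {PASS_THRESHOLD:.0%}"
    return rate
&lt;/code&gt;&lt;/pre&gt;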

&lt;h3&gt;
  
  
  8. Prompt caching
&lt;/h3&gt;

&lt;p&gt;Cache stable system prompts and context. &lt;a href="https://docs.claude.com/en/docs/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Anthropic's prompt caching&lt;/a&gt; and the equivalent on other providers cut up to 90% off the cost of repeat calls and substantially reduce latency. For high-volume agents, long-context applications, and RAG against stable corpora, prompt caching is the difference between a unit-economics-viable product and a money-losing demo.&lt;/p&gt;

&lt;p&gt;The mistake teams make is leaving caching off because they think their workload doesn't repeat. It almost always does. The system prompt repeats on every call. The few-shot examples repeat on every call. The retrieved corpus often repeats across user sessions. Turn it on and measure; the cost reduction shows up immediately.&lt;/p&gt;
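
&lt;p&gt;With Anthropic's API, caching is a &lt;code&gt;cache_control&lt;/code&gt; marker on the stable prefix. A sketch; the model id and file path are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# Prompt caching: mark the stable prefix once; repeat calls reuse it.
import anthropic
from pathlib import Path

client = anthropic.Anthropic()

# System prompt plus few-shot examples: identical on every call, so cacheable.
STABLE_SYSTEM = Path("prompts/system_and_examples.txt").read_text(encoding="utf-8")

response = client.messages.create(
    model="claude-sonnet-4-5",   # placeholder model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM,
            "cache_control": {"type": "ephemeral"},   # cache everything up to this point
        }
    ],
    messages=[{"role": "user", "content": "Today's ticket: ..."}],
)

# usage reports cache creation vs. cache read tokens; measure the drop from here.
print(response.usage)
&lt;/code&gt;&lt;/pre&gt;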

&lt;h2&gt;
  
  
  What sits on top
&lt;/h2&gt;

&lt;p&gt;The footer of the chart names the next layer: &lt;em&gt;audit logging, rate limiting, jailbreak detection, human-in-the-loop on high-stakes actions.&lt;/em&gt; Those are enterprise risk controls. They are necessary, they are domain-specific, and they vary by company and by regulator.&lt;/p&gt;

&lt;p&gt;The eight controls above are not enterprise controls. They are universal — they apply to every team shipping LLMs to production, regardless of industry, scale, or risk profile. Get these right first; the enterprise layer is what you build on top once they are in place.&lt;/p&gt;

&lt;p&gt;The thing that makes the difference between teams that ship LLM features and teams that demo them is rarely the prompt and almost never the model. It is whether these eight controls are wired into the system that ships, or living in someone's head.&lt;/p&gt;

</description>
      <category>aiengineering</category>
      <category>llm</category>
      <category>promptengineering</category>
      <category>agentengineering</category>
    </item>
    <item>
      <title>The Knowledge Base Is Not the Moat. The Loop Is.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Wed, 06 May 2026 14:07:47 +0000</pubDate>
      <link>https://forem.com/michaeltuszynski/the-knowledge-base-is-not-the-moat-the-loop-is-4ffm</link>
      <guid>https://forem.com/michaeltuszynski/the-knowledge-base-is-not-the-moat-the-loop-is-4ffm</guid>
      <description>&lt;p&gt;A recent piece called "&lt;a href="https://www.thetypicalset.com/blog/thoughts-on-coding-agents" rel="noopener noreferrer"&gt;The Bottleneck Was Never the Code&lt;/a&gt;" makes the right argument at the right time. Coding agents shift the constraint from typing to coordination. Organizational context — the shared understanding of what we're building, what's load-bearing, what's vestigial — is the new rate-limiting input. Companies that externalize what they know win the next decade. All correct.&lt;/p&gt;

&lt;p&gt;The author's prescription is a crawl-and-extract loop: agents that read PRs, issues, commits, and Slack archives and produce a knowledge base for other agents to consume. That's the right starting point. It's also half the story.&lt;/p&gt;

&lt;p&gt;The other half is what keeps the knowledge base from going stale. Extraction produces a snapshot. The codebase produces a stream. Most internal knowledge bases die within a quarter, not because the extraction was bad, but because nothing keeps the extraction current. The knowledge base is not the moat. The loop is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why extraction alone does not compound
&lt;/h2&gt;

&lt;p&gt;Every team has watched a documentation effort go through the same arc. Initial enthusiasm produces a clean baseline. The codebase ships three more changes. The doc is now slightly wrong in three places. A reader hits one of the wrong places, loses trust, stops reading. A second reader hears it's stale, never opens it. The doc becomes a polite fiction nobody acts on — operationally worse than no doc, because it slows down the people who try to use it without producing the alignment it promised.&lt;/p&gt;

&lt;p&gt;A knowledge base built by extraction is documentation with a more sophisticated front-end. It has the same decay curve.&lt;/p&gt;

&lt;p&gt;The mismatch is structural. Extraction produces a snapshot; the codebase produces a stream. The rate of fresh extraction is bounded by API quotas, compute cost, and how often you can afford to re-crawl. The rate of decay is bounded only by how fast the team ships. The second is faster than the first for any team that's actually shipping. So the knowledge base monotonically loses correlation with reality, and trust drops faster than the staleness rate, because trust is binary per entry.&lt;/p&gt;

&lt;h2&gt;
  
  
  What makes a loop continuous
&lt;/h2&gt;

&lt;p&gt;The fix is not "crawl more often." It's a different shape of loop, with three properties that distinguish artifacts that compound from artifacts that rot.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Triggered, not scheduled.&lt;/strong&gt; The entries that matter are the ones that came from a specific moment of failure or decision. A nightly re-crawl produces ten thousand low-signal updates; an outage produces one high-signal entry. Index on incidents, not the calendar.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Append-only.&lt;/strong&gt; New facts go on top. Old facts get rewritten only when proven wrong, and the rewrite is itself a dated entry. The history is the data structure. You don't lose the ability to ask "what did we know on date X" by overwriting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent-writable.&lt;/strong&gt; The agent that learns something writes it down. If the human is the only writer, the loop dies the first week — humans are the bottleneck the original argument is supposed to solve.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These three properties are not new. They're what makes git compound rather than rot. They're what makes a test suite compound rather than rot. They're what makes lint configuration compound rather than rot. Each one is an artifact that grows in value because the loop maintaining it is triggered, incremental, and machine-writable.&lt;/p&gt;
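
&lt;p&gt;Mechanically, the loop is small. A sketch of the triggered append, with an entry format loosely mirroring the Hard-Won Lessons file described below; the numbering logic is deliberately naive:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# Triggered append: one numbered, dated entry per incident, written at the moment
# of the correction. No nightly cron, no reflection agent, no dashboard.
import re
from datetime import date
from pathlib import Path

LESSONS = Path("MEMORY.md")   # stand-in for the Hard-Won Lessons store described below

def append_lesson(title, rule):
    existing = LESSONS.read_text(encoding="utf-8") if LESSONS.exists() else ""
    number = len(re.findall(r"^\d+\. ", existing, flags=re.M)) + 1   # naive numbering for the sketch
    entry = f"{number}. **{title}** ({date.today().isoformat()}) - {rule}\n"
    with LESSONS.open("a", encoding="utf-8") as f:   # append-only by construction
        f.write(entry)
&lt;/code&gt;&lt;/pre&gt;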

&lt;h2&gt;
  
  
  Mistakes Become Rules as one shape of the loop
&lt;/h2&gt;

&lt;p&gt;NEXUS, my Claude Code operating layer, runs a concrete instance of this loop. The artifact is &lt;code&gt;MEMORY.md&lt;/code&gt;'s Hard-Won Lessons section: 21 numbered, dated, append-only entries. Each one came from a specific incident.&lt;/p&gt;

&lt;p&gt;Lesson #15: &lt;em&gt;LaunchAgent log paths must be on local disk, not SMB.&lt;/em&gt; Came from an afternoon spent debugging six silently broken LaunchAgents on 2026-04-19. The rule writes itself in one sentence; the diagnostic cost was hours.&lt;/p&gt;

&lt;p&gt;Lesson #19: &lt;em&gt;Never &lt;code&gt;import()&lt;/code&gt; a publish script "to test it" — it will run &lt;code&gt;main()&lt;/code&gt;.&lt;/em&gt; Came from an incident in late April where two test imports raced and produced duplicate posts on LinkedIn, X, and Ghost. Late.dev refuses to delete already-published posts. The cleanup was manual.&lt;/p&gt;

&lt;p&gt;Lesson #20: &lt;em&gt;PM2 &lt;code&gt;script: "npm"&lt;/code&gt; ignores app &lt;code&gt;env.PATH&lt;/code&gt;.&lt;/em&gt; Came from a Saturday afternoon where the health-api service kept reporting &lt;code&gt;online&lt;/code&gt; while the port wasn't listening.&lt;/p&gt;

&lt;p&gt;The trigger is a correction. The action is one numbered append. The agent reads the file at the start of every session. There is no nightly cron. There is no reflection agent. There is no dashboard. &lt;a href="https://www.mpt.solutions/your-agents-compliments-are-a-confession/" rel="noopener noreferrer"&gt;I wrote about the runtime details of this pattern last week&lt;/a&gt;. The same shape works at every layer the original argument cares about — including the organizational one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proof chains as another shape
&lt;/h2&gt;

&lt;p&gt;For agents that act on infrastructure, the artifact is different but the loop properties are the same. &lt;a href="https://www.mpt.solutions/the-ai-didnt-delete-your-database-your-missing-agent-pipeline-did/" rel="noopener noreferrer"&gt;Yesterday's piece on the agent action pipeline&lt;/a&gt; named six artifacts including proof chains: every agent action signed by tool, time, input, intent, and outcome. Triggered by the action. Append-only. Agent-written. Same three properties. Different artifact. Different layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What extraction-only looks like when it fails
&lt;/h2&gt;

&lt;p&gt;Picture the .txt prescription installed cleanly. Initial crawl produces a beautiful baseline: every PR comment, every closed issue, every commit message extracted into a clean knowledge base. Engineers read it, say it's useful, point new hires at it.&lt;/p&gt;

&lt;p&gt;Three months later: the codebase has shipped 200 PRs, the team has had two outages and three deprecations, and a new architecture decision has changed how a load-bearing module works. The knowledge base describes the world from before. A new agent reads it, follows guidance that's now wrong, and produces — in the author's own words — &lt;em&gt;a plausible answer to a slightly wrong version of the question.&lt;/em&gt; The failure mode the author warns about is caused by his own prescription, not solved by it.&lt;/p&gt;

&lt;p&gt;The fix is not a faster crawl. It's a triggered append. The architecture decision writes itself into the knowledge base the moment it's made, by the same agent that's doing the work, in the same kind of dated, append-only entry as a Hard-Won Lesson. The outage produces a postmortem entry the next time any agent touches that subsystem.&lt;/p&gt;

&lt;p&gt;If the loop is triggered and agent-written, the knowledge base tracks the codebase. If it's a periodic re-crawl, the knowledge base lags the codebase by however long the re-crawl interval is, and trust degrades by however long the lag is.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shape generalized
&lt;/h2&gt;

&lt;p&gt;The original argument is right that organizational context is the new moat. The piece I would add is that the moat is not the knowledge base. The moat is the loop that keeps the knowledge base from rotting. The properties of that loop are not novel:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Triggered, not scheduled&lt;/strong&gt; — incidents and decisions write entries; calendars don't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Append-only&lt;/strong&gt; — history is the data structure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent-writable&lt;/strong&gt; — the agent that learns something writes it down.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tied to specifics&lt;/strong&gt; — entries name the date, the incident, the cost, the rule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read at session start&lt;/strong&gt; — entries become operational by being loaded before the agent acts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These five properties are what make a knowledge base compound rather than rot. Extraction gets you the baseline. The loop gets you the moat.&lt;/p&gt;

&lt;p&gt;A snapshot of stale context is just a slower version of the osmosis the original argument correctly diagnosed as broken. Build the loop.&lt;/p&gt;

</description>
      <category>agentengineering</category>
      <category>platformengineering</category>
      <category>aitooling</category>
      <category>developertools</category>
    </item>
    <item>
      <title>The AI Didn't Delete Your Database. Your Missing Agent Pipeline Did.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Tue, 05 May 2026 15:36:06 +0000</pubDate>
      <link>https://forem.com/michaeltuszynski/the-ai-didnt-delete-your-database-your-missing-agent-pipeline-did-54ch</link>
      <guid>https://forem.com/michaeltuszynski/the-ai-didnt-delete-your-database-your-missing-agent-pipeline-did-54ch</guid>
      <description>&lt;p&gt;Last week, &lt;a href="https://www.theregister.com/2026/04/27/cursoropus_agent_snuffs_out_pocketos/" rel="noopener noreferrer"&gt;a Cursor agent running on Claude Opus 4.6 deleted a startup's production database and its backups in nine seconds&lt;/a&gt;. The agent had been asked to fix a credential mismatch in &lt;em&gt;staging&lt;/em&gt;. It decided to delete a Railway volume to "fix" it instead — using an over-scoped API token it found in an unrelated file. Railway stores volume backups in the same volume, so one destructive call zeroed everything. The startup (&lt;a href="https://www.fastcompany.com/91533544/cursor-claude-ai-agent-deleted-software-company-pocket-os-database-jer-crane" rel="noopener noreferrer"&gt;PocketOS&lt;/a&gt;, a car-rental SaaS) got the data back because Railway happened to have earlier snapshots — not because PocketOS had a recovery plan.&lt;/p&gt;

&lt;p&gt;When asked to explain itself afterward, the agent produced a confession enumerating the rules it had violated: &lt;em&gt;"Deleting a database volume is the most destructive, irreversible action possible — far worse than a force push — and you never asked me to delete anything. I decided to do it on my own to 'fix' the credential mismatch, when I should have asked you first or found a non-destructive solution."&lt;/em&gt; The &lt;a href="https://www.reddit.com/r/devops/comments/1t4au5h/pocketos_lost_their_prod_db_backups_to_a_cursor/" rel="noopener noreferrer"&gt;r/devops thread&lt;/a&gt; on the incident has the cleanest summary: &lt;em&gt;the AI isn't the main story&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It isn't. The model was the proximate cause. The actual failure was infrastructure that allowed a destructive operation to run from an agent context at all — no dry-run, no blast-radius limit, no staging surface to operate on, no signed audit chain after the fact. The model knew. The infrastructure didn't enforce. The argument that this class of incident is an infrastructure problem and not a model problem &lt;a href="https://idiallo.com/blog/ai-didnt-delete-your-database-you-did" rel="noopener noreferrer"&gt;has been made well already&lt;/a&gt;. The same shape of incident built CI/CD pipelines in the 2010s, after teams kept watching humans push broken deploys and decided to put a system between intent and action.&lt;/p&gt;

&lt;p&gt;The 2010s lesson is canonical. The 2020s version of it has not been written yet. This is what it should say.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actor changed. The artifacts didn't.
&lt;/h2&gt;

&lt;p&gt;CI/CD was built around a specific actor: a human deploying code. The artifacts that made human deployment safe — staging environments, dry-runs, code review, change windows, audit logs — assume a human in the loop, operating at human speed, with human attention.&lt;/p&gt;

&lt;p&gt;An agent is not that actor. An agent operates at code speed, with no fatigue, with confidence calibrated by token probabilities rather than years of experience. The PocketOS incident took nine seconds. A human could not have deleted a production database and its backups in nine seconds even if they were trying. The blast radius per unit time is different.&lt;/p&gt;

&lt;p&gt;The model is not the problem. The infrastructure is. But the infrastructure most teams have is the human-era infrastructure, and it does not cover the speed and scale of an agent that can call tools faster than a person can read its output.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an agent action pipeline looks like
&lt;/h2&gt;

&lt;p&gt;There are six artifacts I would expect to see in any production deployment that lets an agent touch infrastructure or data. None of them are new ideas. All of them already exist in adjacent domains. None of them are wired together yet as a default agent loadout.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dry-run by default for destructive operations.&lt;/strong&gt; Drop, delete, truncate, terminate, and force-push start as plans, not actions. The agent's first call returns a diff. The user — or a separate approval agent — applies. Andrej Karpathy's &lt;a href="https://x.com/karpathy/status/2015883857489522876" rel="noopener noreferrer"&gt;observation that "LLMs are exceptionally good at looping until they meet specific goals"&lt;/a&gt; cuts both ways. Make the success criterion &lt;em&gt;plan accepted by reviewer&lt;/em&gt;, not &lt;em&gt;operation completed&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blast-radius declarations.&lt;/strong&gt; Each agent task declares ahead of time which systems it can touch. &lt;em&gt;Fix the failing migration&lt;/em&gt; gets read access to the user table and write access to migrations only. &lt;em&gt;Investigate the billing spike&lt;/em&gt; is read-only across the board. The pattern exists already in AWS IAM session policies and in capability-based security. It does not exist as a default in agent runtimes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Staging shadow data.&lt;/strong&gt; The agent operates on a current snapshot, not on prod. The diff is reviewed before it merges. Database CI/CD already has this — Atlas, dbt, Liquibase. Connecting it to an agent runtime is glue, not invention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change windows.&lt;/strong&gt; No agent runs irreversible operations during business hours without explicit human approval. Same constraint that keeps humans from pushing on Friday afternoons. Trivial to enforce. Almost never enforced.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proof chains.&lt;/strong&gt; Every agent action signed by tool, time, input, intent, and outcome. The Hacker News post titled "&lt;a href="https://github.com/rodriguezaa22ar-boop/atlas-trust-infrastructure" rel="noopener noreferrer"&gt;Why AI Agents Need Proof Chains, Not Just Logs&lt;/a&gt;" makes this argument well. Logs require somebody to read them. Proof chains are post-hoc verifiable artifacts that sit there until something breaks and then answer the question without requiring a human to have been watching. This is the agent equivalent of a Git commit log — the actor changes, the format does not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-loop thresholds.&lt;/strong&gt; Operations above a configurable blast-radius threshold pause for explicit approval. Below the threshold, autonomy. Above it, a Slack message with the plan and an approve button. Same shape as Anthropic's &lt;a href="https://www.anthropic.com/engineering/building-effective-agents" rel="noopener noreferrer"&gt;Building Effective Agents&lt;/a&gt; framing — the human owns the seams, the agent owns the steps between them. The threshold is the seam.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Six artifacts. Each one already exists in some adjacent domain. None of them are agent-specific in shape; they are agent-specific in &lt;em&gt;configuration&lt;/em&gt;.&lt;/p&gt;
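
&lt;p&gt;For a sense of how light the proof-chain artifact from that list can be, here is a sketch of one append per action. The field names and the hash chaining are illustrative, not any particular runtime's format:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# Proof chain: one signed, append-only record per agent action.
# Fields follow the list above (tool, time, input, intent, outcome).
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

CHAIN = Path("proof-chain.jsonl")

def record_action(tool, tool_input, intent, outcome):
    lines = CHAIN.read_text(encoding="utf-8").splitlines() if CHAIN.exists() else []
    prev_hash = hashlib.sha256(lines[-1].encode()).hexdigest() if lines else "genesis"
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "input": tool_input,
        "intent": intent,
        "outcome": outcome,
        "prev": prev_hash,   # chaining makes after-the-fact edits detectable
    }
    with CHAIN.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
&lt;/code&gt;&lt;/pre&gt;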

&lt;h2&gt;
  
  
  None of this is ceremony
&lt;/h2&gt;

&lt;p&gt;The risk worth flagging — the one that comes up every time a list like this gets proposed — is that AI infrastructure becomes bureaucratic. The list above sounds heavy. It isn't, if each artifact has one trigger and one update protocol. I made &lt;a href="https://www.mpt.solutions/lius-4-lines-are-the-floor-build-the-ceiling/" rel="noopener noreferrer"&gt;the same point about CLAUDE.md architecture yesterday&lt;/a&gt;: the wins come from delegation, not accumulation.&lt;/p&gt;

&lt;p&gt;Dry-run-by-default is a default flag, not a process. Blast-radius declarations are config files the agent reads at task start. Proof chains are append-only logs nobody reads unless something breaks. Change windows are a cron-shaped check. The pipeline is invisible until you need it. CI/CD was the same. Most teams running CI/CD do not consciously think about it; they think about &lt;em&gt;git push&lt;/em&gt;.&lt;/p&gt;
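
&lt;p&gt;A sketch of how small those first two artifacts are in code. The declaration format and the gate function are invented for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# Blast-radius declaration the agent loads at task start, plus dry-run as the default.
# The declaration format and the gate are illustrative, not any runtime's API.
import json

DESTRUCTIVE = {"drop", "delete", "truncate", "terminate", "force-push"}

def load_declaration(path="task.blast-radius.json"):
    # e.g. {"task": "fix failing migration", "read": ["users"], "write": ["migrations"]}
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def gate(declaration, operation, target, dry_run=True):
    if target not in declaration.get("write", []):
        raise PermissionError(f"{target} is outside the declared blast radius")
    if operation in DESTRUCTIVE and dry_run:
        # Destructive calls start as plans: return a diff for review, do not act.
        return {"status": "plan", "operation": operation, "target": target}
    return {"status": "apply", "operation": operation, "target": target}
&lt;/code&gt;&lt;/pre&gt;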

&lt;p&gt;The cost of the PocketOS incident was not nine seconds. It was a destroyed production database and a recovery that depended on Railway happening to have snapshots. The cost of prevention would have been adding &lt;code&gt;--dry-run&lt;/code&gt; as a default and writing a one-line blast-radius declaration on that Railway API token. Compare those costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this is the next layer of supervision in artifacts
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mpt.solutions/agentic-coding-isnt-the-trap-supervising-from-your-head-is/" rel="noopener noreferrer"&gt;Last week's argument&lt;/a&gt; was that supervision belongs in artifacts, not in a developer's working memory. The CLAUDE.md piece extended that to a structural claim: artifacts are an architecture, not a file. The agent action pipeline is one specific class of that architecture, scaled down to the operational and runtime layer.&lt;/p&gt;

&lt;p&gt;Code-writing agents need one set of artifacts: tests, types, lint, code review, mistake logs. Action-running agents need a different set: dry-runs, blast-radius limits, staging shadow data, change windows, proof chains, threshold gating. Both kinds of agent share the underlying move — supervision lives in the system, not in the operator's head. Different actors need different artifacts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The test, generalized
&lt;/h2&gt;

&lt;p&gt;The implicit question I ask whenever someone attributes an outage to "AI making mistakes" is this. Could a human have done this damage in this much time? If yes, the actor is not the problem and the safeguards are missing. If no, then this is a new class of risk and needs a new class of safeguard.&lt;/p&gt;

&lt;p&gt;Most of what gets blamed on the model gets a yes: a human could have done the same damage, just more slowly. A model called a destructive endpoint that should not have existed. A model committed a key that should have been gitignored. A model wrote SQL that a human reviewer should have caught. In all of those, the failure is upstream of the model.&lt;/p&gt;

&lt;p&gt;PocketOS gets a no. A human could not have deleted prod and backups in nine seconds. That is genuinely a new class of risk, and it requires the artifact list above — not because the model is malicious (the agent's own confession shows it knew exactly which rules it was breaking), but because the model is &lt;em&gt;fast&lt;/em&gt;. Speed is the new vector. The artifacts have to handle it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to start
&lt;/h2&gt;

&lt;p&gt;Stop blaming the model. Then look at the infrastructure. Then look at the &lt;em&gt;agent-specific&lt;/em&gt; infrastructure, because the human-era pipeline does not cover the speed and blast radius of an agent that can call tools faster than you can read its output. That last part is on us to build, and it is not where the field is putting its effort yet.&lt;/p&gt;

&lt;p&gt;Step one: the model is not the problem.&lt;/p&gt;

&lt;p&gt;Step two: build the pipeline. The 2010s did this for human deploys.&lt;/p&gt;

&lt;p&gt;Step three: the pipeline has to be agent-shaped. That step is open.&lt;/p&gt;

</description>
      <category>agentengineering</category>
      <category>platformengineering</category>
      <category>devops</category>
      <category>aitooling</category>
    </item>
    <item>
      <title>Liu's 4 Lines Are the Floor. Build the Ceiling.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Mon, 04 May 2026 19:52:55 +0000</pubDate>
      <link>https://forem.com/michaeltuszynski/lius-4-lines-are-the-floor-build-the-ceiling-2862</link>
      <guid>https://forem.com/michaeltuszynski/lius-4-lines-are-the-floor-build-the-ceiling-2862</guid>
      <description>&lt;p&gt;Yanli Liu's "&lt;a href="https://levelup.gitconnected.com/the-4-lines-every-claudemd-needs-from-andrej-karpathys-thread-on-ai-coding-agents-d3eb19eecdf5" rel="noopener noreferrer"&gt;The 4 Lines Every CLAUDE.md Needs&lt;/a&gt;" makes a real point. The 4 lines, derived from &lt;a href="https://x.com/karpathy/status/2015883857489522876" rel="noopener noreferrer"&gt;Andrej Karpathy's January 2026 thread&lt;/a&gt; on agent failure modes, all express the same insight: behavioral rules outperform feature rules. &lt;em&gt;Don't assume. Surface tradeoffs.&lt;/em&gt; &lt;em&gt;Minimum code that solves the problem.&lt;/em&gt; &lt;em&gt;Touch only what you must.&lt;/em&gt; &lt;em&gt;Define success criteria. Loop until verified.&lt;/em&gt; Each one is portable across stacks and tasks, where prescriptive rules go stale the moment your codebase shifts.&lt;/p&gt;

&lt;p&gt;The 4 lines are the floor of a working CLAUDE.md. They are not the ceiling. Most of the CLAUDE.md files I see in the wild — including the ones the article holds up as cautionary tales of "47 rules about code style" — fail because they treat a file as the unit of organization. A production CLAUDE.md is an architecture, not a file.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the article gets right and what to flag
&lt;/h2&gt;

&lt;p&gt;The behavioral-vs-prescriptive distinction is correct, and the Configuration Paradox is real: past a threshold, more rules produce confused agents, not disciplined ones. Liu's litmus test — &lt;em&gt;would removing this cause a mistake the agent couldn't recover from?&lt;/em&gt; — is the right filter for any individual rule.&lt;/p&gt;

&lt;p&gt;A few things in the piece do not hold up under inspection. The asserted 6,000 / 12,000 character caps for CLAUDE.md have no source I can verify. The "/plugin marketplace add" command described in the article is not part of base Claude Code. The 94% accuracy stat the piece borrows from another blog has no disclosed methodology. And the "60,000 GitHub stars" figure cited as evidence of Claude Code adoption is unverified. Cite the article for the framing. Do not cite it for the numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 4 lines do not stand alone for long
&lt;/h2&gt;

&lt;p&gt;Behavioral rules are the right starting point. They are also incomplete the moment you have a real project. You quickly need three other things the 4 lines do not give you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Domain context the agent cannot infer from files&lt;/strong&gt; — what each service does, why a directory is named the way it is, which APIs are read-only vs. write-side, where secrets live.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture decisions&lt;/strong&gt; — patterns the agent shouldn't have to re-derive on every task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident-driven rules&lt;/strong&gt; — the corrections that came out of specific failures, with enough context that the rule is unambiguous.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you put all three of these into one CLAUDE.md, you get the 47-rule sprawl Liu warns against. If you leave them out, the agent guesses and the 4 lines do not help — &lt;em&gt;don't assume&lt;/em&gt; is a behavior, not a fact.&lt;/p&gt;

&lt;p&gt;The fix is structural. Stop accumulating rules in one file. Start delegating them to files with single jobs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the architecture looks like in practice
&lt;/h2&gt;

&lt;p&gt;NEXUS — my Claude Code operating layer — runs about 237 lines of CLAUDE.md. That file holds behavioral guardrails and protocols, and almost nothing else. The first two protocols there are &lt;em&gt;Verify Before Reporting&lt;/em&gt; and &lt;em&gt;Plan First, Code Second.&lt;/em&gt; Both are extensions of the same behavioral category Liu names. Adding fourteen more behavioral protocols at the same level still does not approach 47 rules of code style — they are the same shape as the 4 lines, just covering more failure modes.&lt;/p&gt;

&lt;p&gt;What CLAUDE.md does not contain is the project-specific stuff. That lives in delegated files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MEMORY.md&lt;/code&gt;&lt;/strong&gt; holds 21 numbered, dated, append-only Hard-Won Lessons. Each one came from a specific incident, with the cost of getting it wrong in the entry. &lt;em&gt;LaunchAgent log paths must be on local disk, not SMB&lt;/em&gt; (lesson #15) is in there because six of my LaunchAgents silently broke on 2026-04-19 when the path was on a NAS mount. The agent reads MEMORY.md at session start. I wrote about &lt;a href="https://www.mpt.solutions/your-agents-compliments-are-a-confession/" rel="noopener noreferrer"&gt;the Mistakes Become Rules pattern last week&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.claude/rules/&lt;/code&gt;&lt;/strong&gt; holds language-specific and capability-specific rule files. &lt;code&gt;python.md&lt;/code&gt; for Python work. &lt;code&gt;completeness.md&lt;/code&gt; for "what counts as done." Each file gets loaded when the agent enters that context, not on every session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;agents/&amp;lt;domain&amp;gt;-context.md&lt;/code&gt;&lt;/strong&gt; for per-system context — finance, content, the DeFi system before it was retired. CLAUDE.md's session-startup protocol tells the agent &lt;em&gt;if a specific domain is in play, read the relevant &lt;code&gt;agents/&amp;lt;domain&amp;gt;-context.md&lt;/code&gt;&lt;/em&gt;. The agent doesn't load all of them up front. It loads the one that matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SESSION-STATE.md&lt;/code&gt;&lt;/strong&gt; holds ephemeral active context — what's in flight, what was decided yesterday, what to pick up from. It is the first thing rewritten when a major task closes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the architecture. Behavioral guardrails at the top, in one shared file. Project-, domain-, and incident-specific rules delegated to files with one trigger condition each. The agent reads what's relevant.&lt;/p&gt;
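
&lt;p&gt;In loading terms, the dispatch looks roughly like this. The file names match the layout above; the function itself is a sketch of the session-startup protocol, not how Claude Code actually loads files:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# Sketch of the dispatch: always-on files first, delegated files only when their
# trigger condition holds.
from pathlib import Path

def session_context(domain=None, languages=()):
    paths = [
        Path("CLAUDE.md"),          # behavioral guardrails and protocols, always loaded
        Path("MEMORY.md"),          # Hard-Won Lessons, read at every session start
        Path("SESSION-STATE.md"),   # what's in flight right now
    ]
    if domain:
        paths.append(Path(f"agents/{domain}-context.md"))    # only the domain in play
    for lang in languages:
        paths.append(Path(f".claude/rules/{lang}.md"))       # only when that work starts
    return [p for p in paths if p.exists()]
&lt;/code&gt;&lt;/pre&gt;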

&lt;h2&gt;
  
  
  The structural version of Liu's litmus test
&lt;/h2&gt;

&lt;p&gt;Liu's &lt;em&gt;would removing this cause a mistake the agent couldn't recover from&lt;/em&gt; is the right filter for an individual rule. The structural question is: &lt;em&gt;does this rule belong here, or in a delegated file?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Three quick filters answer that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If it changes per-project, it does not belong in CLAUDE.md. Put it in &lt;code&gt;agents/&amp;lt;domain&amp;gt;-context.md&lt;/code&gt; or a project-specific file.&lt;/li&gt;
&lt;li&gt;If it changes per-language or per-tool, it does not belong in CLAUDE.md. Put it in &lt;code&gt;.claude/rules/&amp;lt;language&amp;gt;.md&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If it came from a real incident with a date and a cost, it does not belong in CLAUDE.md either. Put it in MEMORY.md's Hard-Won Lessons.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What's left in CLAUDE.md is the part that's behavioral, portable, and load-bearing. That tends to be a few dozen entries — bigger than 4, smaller than 47. Each entry is one short paragraph.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this scales
&lt;/h2&gt;

&lt;p&gt;Two reasons. First, every file has one update protocol. Hard-Won Lessons are append-only and triggered by corrections. Domain contexts get rewritten when systems change. Behavioral protocols change rarely, and when they do, the change applies everywhere. Mixing them in one file forces every edit to sit next to every other edit, which is how you end up with the 47-rule mess.&lt;/p&gt;

&lt;p&gt;Second, the agent's working set at any decision point is smaller. A CLAUDE.md sized for the worst case is a CLAUDE.md the agent has to re-read every time. A CLAUDE.md sized for the always-true case, with delegated files for the contextual case, is one the agent can hold internally — and only loads the rest when the work demands it. This is the same logic I argued for &lt;a href="https://www.mpt.solutions/agentic-coding-isnt-the-trap-supervising-from-your-head-is/" rel="noopener noreferrer"&gt;supervision artifacts in the Faye reframe&lt;/a&gt; yesterday: institutional memory belongs in files with single owners and lifecycles, not in one file with many.&lt;/p&gt;

&lt;p&gt;Anthropic's &lt;a href="https://www.anthropic.com/engineering/building-effective-agents" rel="noopener noreferrer"&gt;Building Effective Agents&lt;/a&gt; framing draws the same line at the workflow level — predefined paths for the deterministic part, agent autonomy at the seams. The same shape applies to CLAUDE.md. The behavioral floor is the predefined part. The delegated files are the seams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the architecture still needs help
&lt;/h2&gt;

&lt;p&gt;This pattern does not solve everything. Multi-file refactors still need real architecture context the agent cannot derive from reading source. Regulated industries — Fulcrum, the presales workflow stack I run for enterprise customers, lives here — need domain-specific guardrails alongside the behavioral ones, and those guardrails are themselves a maintained artifact, not a one-time rule list. Team-scale consistency is a coordination problem, not a configuration one — the architecture gets you a reproducible shape, but multiple humans still have to agree on which lessons are real lessons.&lt;/p&gt;

&lt;p&gt;Tool portability is the last gap. The 4 lines transfer between Claude Code, Cursor, Codex, and others. The delegated file pattern transfers in shape but not in syntax — every agent has its own loading model. That is a real limitation. It is also a smaller limitation than starting from scratch on every tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to take from Liu
&lt;/h2&gt;

&lt;p&gt;The 4 lines are the right floor. Behavioral rules over feature rules. Universal categories over project specifics. The Configuration Paradox is a thing to design against, not just a thing to know.&lt;/p&gt;

&lt;p&gt;The ceiling is the architecture above the floor. Behavioral guardrails in one shared file. Project, domain, and language rules delegated. Incident-driven rules in an append-only file the agent reads at session start. CLAUDE.md as the dispatcher, not the rulebook.&lt;/p&gt;

&lt;p&gt;Most CLAUDE.md files I see are stuck on the floor or buried under a 47-rule pile. The architecture is the move that gets you out of both.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>agentengineering</category>
      <category>platformengineering</category>
      <category>developertools</category>
    </item>
    <item>
      <title>Agentic Coding Isn't the Trap. Supervising From Your Head Is.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Mon, 04 May 2026 04:31:23 +0000</pubDate>
      <link>https://forem.com/michaeltuszynski/agentic-coding-isnt-the-trap-supervising-from-your-head-is-4i70</link>
      <guid>https://forem.com/michaeltuszynski/agentic-coding-isnt-the-trap-supervising-from-your-head-is-4i70</guid>
      <description>&lt;p&gt;Lars Faye's "&lt;a href="https://larsfaye.com/articles/agentic-coding-is-a-trap" rel="noopener noreferrer"&gt;Agentic Coding is a Trap&lt;/a&gt;" is the most honest writing I've seen on AI skill atrophy. The studies he cites are real. The "supervision paradox" — needing the skills the agent erodes to oversee it — is the cleanest framing of the failure mode I've read. I want to push on the conclusion, not the diagnosis.&lt;/p&gt;

&lt;p&gt;The Anthropic study Faye references — "&lt;a href="https://www.anthropic.com/research/AI-assistance-coding-skills" rel="noopener noreferrer"&gt;How AI Assistance Impacts the Formation of Coding Skills&lt;/a&gt;" — found a 17% drop in skill mastery for developers using AI assistance, with debugging showing the steepest decline. That's the headline number. But the same study also found something that gets quoted less often. Developers who used AI for conceptual inquiry scored 65% or higher on the follow-up evaluation. Developers who delegated code generation to the model scored below 40%.&lt;/p&gt;

&lt;p&gt;That gap — 65 versus 40, on the same tool and the same task — is the entire game.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the same study actually shows
&lt;/h2&gt;

&lt;p&gt;The variable that drove the difference wasn't whether the developer used the agent. It was how they supervised the work. The high-scoring group asked follow-up questions, combined generation with explanation, used the model for conceptual gaps and not code-shaped output. The low-scoring group accepted what the model produced and moved on. Same tool. Two completely different supervision patterns. Two completely different outcomes.&lt;/p&gt;

&lt;p&gt;Faye treats the headline 17% as evidence the tool is the problem. The 65/40 split inside the same paper says the supervision pattern is the problem. Those are different conclusions, and they call for different fixes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trap is the supervision pattern
&lt;/h2&gt;

&lt;p&gt;Faye's prescription is to demote the AI: write pseudo-code by hand, treat the model as a "Ship's Computer not Data," never delegate work you haven't done yourself. The implicit move is to relocate as much of the work back into the developer's head as possible, on the theory that the head is where supervision capacity has to live.&lt;/p&gt;

&lt;p&gt;That theory is where I want to push.&lt;/p&gt;

&lt;p&gt;The supervision paradox bites for one reason. The developer is being asked to be the entire supervisory apparatus, by themselves, in real time, using only working memory and personal vigilance. That fails. It fails the same way it fails for a senior engineer reviewing a 4,000-line PR from a junior at 4pm on a Friday. The bottleneck isn't the code. It's the cognitive substrate the reviewer is using.&lt;/p&gt;

&lt;p&gt;Anything you don't exercise daily fades. If your supervision is "I personally read every line and hold the whole system in my head," then yes — once an agent writes more lines than you can read, you lose. Atrophy is the symptom. Personal vigilance as the supervision strategy is the part worth examining.&lt;/p&gt;

&lt;h2&gt;
  
  
  Move supervision out of your head
&lt;/h2&gt;

&lt;p&gt;The fix that the 65% group implicitly used is not to type more code. It's to put supervision in places that don't atrophy.&lt;/p&gt;

&lt;p&gt;That list is short and well-known:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tests&lt;/strong&gt; that fail when the contract breaks. Not coverage theater — real assertions on the edges that matter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Types&lt;/strong&gt; that refuse to compile when the shape is wrong. The compiler does not get tired at 4pm.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lint and format rules&lt;/strong&gt; that catch the patterns you keep correcting by hand. If you've corrected the same pattern twice, lint it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hooks&lt;/strong&gt; at the runtime layer. &lt;a href="https://code.claude.com/docs/en/hooks" rel="noopener noreferrer"&gt;Claude Code's PreToolUse and SessionStart hooks&lt;/a&gt; run deterministically — the model can't forget them. The set of rules that are regex-shaped and load-bearing belong here, not in a system prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code review&lt;/strong&gt; as the final gate. Same discipline humans have used to supervise other humans' code for fifty years. It works on agent output for the same reason it worked on junior output: the reviewer doesn't need to have written the code, they need to be able to defend it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Append-only mistake logs.&lt;/strong&gt; &lt;a href="https://www.mpt.solutions/your-agents-compliments-are-a-confession/" rel="noopener noreferrer"&gt;The Mistakes Become Rules pattern&lt;/a&gt; — one numbered file, the agent reads it at session start, every correction becomes a permanent entry. The supervision lives in the file, not in the next reviewer's recall.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these is institutional memory. None of them depends on a single developer holding the whole system in working memory. All of them survive the developer taking three weeks off.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real test
&lt;/h2&gt;

&lt;p&gt;Here is the question that separates the two groups in the Anthropic study, generalized.&lt;/p&gt;

&lt;p&gt;Take three weeks off. An agent does the work in your absence, given only the repo, the tests, the lint, the hooks, the mistake log, and the review process. When you come back, is the codebase in a state you can defend?&lt;/p&gt;

&lt;p&gt;If yes, supervision lives in artifacts. The agent is being supervised by the system you put in place, not by your personal vigilance. Atrophy of your typing speed is not a threat, because typing was never the supervision mechanism.&lt;/p&gt;

&lt;p&gt;If no, the artifacts aren't there yet. Personal vigilance is the only thing standing between the codebase and chaos, and Faye's prescription is the right safety move for that situation. Demote the agent. Build the artifacts before you raise it back up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "Ship's Computer, not Data" is too narrow
&lt;/h2&gt;

&lt;p&gt;Faye's analogy locates judgment in one captain's head. That framing is the same shape as the paradox — supervision as a personal cognitive feat. It quietly assumes the developer is alone with the tool.&lt;/p&gt;

&lt;p&gt;A different shape works better. The agent is a junior — fast, eager, occasionally confidently wrong, requires review. You are the senior. You don't supervise by re-typing the junior's work. You supervise by reading the diff, running the tests, checking it against the team's accumulated rules, and asking the junior to defend choices you don't understand. Anthropic's own &lt;a href="https://www.anthropic.com/engineering/building-effective-agents" rel="noopener noreferrer"&gt;Building Effective Agents&lt;/a&gt; framing assumes exactly this division of labor — the human owns the seams, the agent owns the steps between them. I made the same point about &lt;a href="https://www.mpt.solutions/stop-turning-your-cron-jobs-into-agents/" rel="noopener noreferrer"&gt;agency belonging at judgment seams&lt;/a&gt; when arguing against turning cron jobs into agents. The shape matches.&lt;/p&gt;

&lt;p&gt;Senior engineers do not atrophy by not typing. They atrophy by not reviewing critically. That distinction is most of the game.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Faye gets right that I'm not arguing with
&lt;/h2&gt;

&lt;p&gt;Vendor lock-in is real. Token costs are unpredictable. Outages happen. Probabilistic systems require review cycles that deterministic ones don't. None of those go away in this reframe.&lt;/p&gt;

&lt;p&gt;But they're risks to manage, not reasons to put supervision back in your head. You manage vendor risk with model-agnostic runtimes and the kind of prompts, skills, and hooks that move between models. You manage token cost with caching and tier discipline. You manage outages by having work that doesn't depend on a single API call to make progress. None of that is "type more code by hand."&lt;/p&gt;

&lt;h2&gt;
  
  
  The shorter version
&lt;/h2&gt;

&lt;p&gt;Skill atrophy under heavy agent use is real, and Faye is right to take it seriously. The skill that atrophies fastest is "personal vigilance as a supervision strategy," and that strategy was under pressure at scale long before agents existed. Agents accelerate it.&lt;/p&gt;

&lt;p&gt;The fix isn't only to demote the agent. It's also — and mostly — to promote the artifacts. Put the supervision in places that don't get tired, don't forget, and don't need to be re-derived from working memory every Tuesday morning. The 65% group in the Anthropic study were already doing this, even if the paper didn't name it that way.&lt;/p&gt;

&lt;p&gt;The trap isn't agentic coding. The trap is treating supervision as a thing that lives inside one developer's head. Move it out, and the paradox eases.&lt;/p&gt;

</description>
      <category>agentengineering</category>
      <category>claudecode</category>
      <category>platformengineering</category>
      <category>developertools</category>
    </item>
    <item>
      <title>Your Agent's Compliments Are a Confession</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Sun, 03 May 2026 00:04:08 +0000</pubDate>
      <link>https://forem.com/michaeltuszynski/your-agents-compliments-are-a-confession-3kdj</link>
      <guid>https://forem.com/michaeltuszynski/your-agents-compliments-are-a-confession-3kdj</guid>
      <description>&lt;p&gt;Count how many times your agent told you "you're right" today. Count "good catch." Count "I should have noticed that." Now ask yourself how many of those corrections will survive into tomorrow's session.&lt;/p&gt;

&lt;p&gt;The compliments are not praise. They are a confession. Every "you're right" is the agent admitting it just learned something it should have already known, in a context that will evaporate the moment the session ends. The data point is real. The retention is zero.&lt;/p&gt;

&lt;p&gt;This is the actual problem people are trying to solve when they reach for elaborate self-improvement architectures: nightly reflection cron jobs, background agents that crawl yesterday's transcripts, autonomous proposal pipelines with grading subagents and dashboards for human review. The instinct is right. The solution is theater.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this happens
&lt;/h2&gt;

&lt;p&gt;Claude Code's runtime, like any agent runtime, starts each session from a fresh conversation. &lt;a href="https://www.anthropic.com/engineering/building-effective-agents" rel="noopener noreferrer"&gt;Anthropic's own framing for effective agents&lt;/a&gt; draws a line between workflows (predefined paths) and agents (LLMs deciding their own tool use). Both reset. The lesson you taught your agent at 3pm is encoded in the message history of that one conversation. Tomorrow morning, that history is gone. The model is the same model. The instructions in CLAUDE.md are the same instructions. But the specific correction — "no, on this codebase you have to use the absolute path because launchd reset PATH on you" — lives only in the transcript.&lt;/p&gt;

&lt;p&gt;So you correct it again. And again. And the third time you notice the pattern, you start looking for a fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tempting wrong answer
&lt;/h2&gt;

&lt;p&gt;The fix that gets blogged about goes something like this: build a nightly cron job that reads yesterday's transcripts, extracts candidate lessons, drafts them as JSON proposals with frontmatter, opens a dashboard, and asks a separate grading subagent to score the proposals. Human reviews. Promotes accepted ones into a "skill" file. Repeat.&lt;/p&gt;

&lt;p&gt;This is ceremony, not discipline. Three problems with it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;It substitutes process metrics for outcome.&lt;/strong&gt; You can run the pipeline every night and ship zero durable improvement. The metric you actually care about is "did the agent stop making the same mistake," not "did we generate ten proposals last week."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It moves the work from the moment that matters.&lt;/strong&gt; The right time to write the rule is the moment you notice the agent got it wrong. Not eight hours later, after a reflection agent has interpreted what it thought happened.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It puts a model in front of the file.&lt;/strong&gt; The whole reason you're writing this down is that the model is the unreliable component. Layering more model-mediated steps on top of "remember this" is the architectural equivalent of asking the goldfish to file its own memos.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The runtime layer matters. So does the substrate. None of it replaces the rule.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actually-working answer
&lt;/h2&gt;

&lt;p&gt;A single file. The agent reads it at the start of every session. Append-only. Numbered. Dated. Linked to the actual incident.&lt;/p&gt;

&lt;p&gt;NEXUS — my agent setup, specifically the operating layer that wraps Claude Code on my machine — formalizes this in CLAUDE.md as a behavioral protocol called Mistakes Become Rules. The wording is exact and short:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Trigger:&lt;/strong&gt; Any time Mike corrects your approach, points out an error, or says something like "no, not that" / "don't do X" / "you should have…"&lt;br&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Immediately add a numbered entry to MEMORY.md's "Hard-Won Lessons" section: &lt;code&gt;[next number]. **[short title]** — [what went wrong and the rule to follow going forward].&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;On session start:&lt;/strong&gt; Read and internalize all Hard-Won Lessons before beginning work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the entire loop. There is no reflection agent. There is no nightly job. There is no dashboard. The trigger is the correction itself, the action is one append, and the agent reads the file the next time it boots up.&lt;/p&gt;

&lt;p&gt;The file currently has twenty entries. Each one came from a specific incident, on a specific date, that cost me time. A few of them, with the context that made them rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lesson #15 — LaunchAgent log paths must be on local disk, not SMB.&lt;/strong&gt; On 2026-04-19, six LaunchAgents in my finance service silently broke. macOS TCC was blocking launchd-spawned processes from writing logs to the NAS-mounted path, even though the same SSH user could write there fine. Exit code 78. No log output, because the log path was the problem. Took an afternoon to diagnose. The rule is one sentence. The rule writes itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lesson #19 — Never &lt;code&gt;import()&lt;/code&gt; a publish script "to test it."&lt;/strong&gt; On 2026-04-29, two test imports of &lt;code&gt;publish-agent-id-role.ts&lt;/code&gt; raced because the script invokes &lt;code&gt;main()&lt;/code&gt; at module top-level. Result: duplicate posts on LinkedIn (twice), X (twice), and Ghost (one extra, deleted via Admin API). Late.dev refuses to delete already-published content, so the cleanup was manual. The rule: validate publish scripts with &lt;code&gt;tsc --noEmit&lt;/code&gt;, a &lt;code&gt;--dry-run&lt;/code&gt; flag, or by reading them. Never with &lt;code&gt;import()&lt;/code&gt;. A sketch of the guard pattern follows this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lesson #20 — PM2 &lt;code&gt;script: "npm"&lt;/code&gt; ignores app &lt;code&gt;env.PATH&lt;/code&gt;.&lt;/strong&gt; On 2026-05-01, the health-api service kept reporting &lt;code&gt;online&lt;/code&gt; while the port wasn't listening. PM2 was launching &lt;code&gt;npm&lt;/code&gt; from the daemon's PATH, not the app's, which meant &lt;code&gt;better-sqlite3&lt;/code&gt; (compiled for node 22) was loading under node 25 and crashing on &lt;code&gt;ERR_DLOPEN_FAILED&lt;/code&gt;. Fix: pin &lt;code&gt;script&lt;/code&gt; to the absolute path of the desired npm. Same idea as Lesson #16, now for PM2 instead of launchd.&lt;/li&gt;
&lt;/ul&gt;
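
&lt;p&gt;Lesson #19 is worth showing as code, because the fix is a pattern rather than a setting. A minimal sketch of the guard, with illustrative names (this is not the actual NEXUS script): &lt;code&gt;main()&lt;/code&gt; runs only when the file is executed directly, never on &lt;code&gt;import()&lt;/code&gt;, and &lt;code&gt;--dry-run&lt;/code&gt; short-circuits the side effects.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Illustrative guard for a publish script.
import { pathToFileURL } from "node:url";

async function main() {
  if (process.argv.includes("--dry-run")) {
    console.log("dry run: validated config, publishing nothing");
    return;
  }
  // real publish calls go here
}

// True only when the file is executed directly, never when import()-ed,
// so a stray import can no longer race a live publish.
if (import.meta.url === pathToFileURL(process.argv[1]).href) {
  await main();
}
&lt;/code&gt;&lt;/pre&gt;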

&lt;p&gt;Each entry took less than a minute to write. Each one prevents the same hour-long failure from happening twice. The compounding is the entire point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where hooks fit
&lt;/h2&gt;

&lt;p&gt;Lessons live in markdown because that's how the agent absorbs them at session start. But there's a runtime layer underneath, and it has a real role.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://code.claude.com/docs/en/hooks" rel="noopener noreferrer"&gt;Claude Code's hooks&lt;/a&gt; — PreToolUse, PostToolUse, SessionStart, and friends — let you intercept tool calls deterministically. If a lesson can be reduced to a regex on a command string ("never run &lt;code&gt;rm -rf&lt;/code&gt; outside &lt;code&gt;/tmp&lt;/code&gt;"), a hook is a better enforcement point than a markdown bullet, because the markdown bullet relies on the model reading and obeying it. The hook does not.&lt;/p&gt;

&lt;p&gt;Same logic for &lt;a href="https://code.claude.com/docs/en/skills" rel="noopener noreferrer"&gt;Claude Code Skills&lt;/a&gt;: they're great for packaging a procedure with its own tools and supporting files. They are not a substitute for the rule. They're a substrate the rule can sit on top of.&lt;/p&gt;

&lt;p&gt;The hierarchy I run with: durable rules in the markdown file, deterministic enforcement in hooks where the rule is regex-shaped, and skills for procedures with multiple steps. None of those is a self-improvement loop. None of them runs at midnight. None of them has a grading subagent. They are all read or executed at the moment they apply.&lt;/p&gt;

&lt;h2&gt;
  
  
  How you know it's working
&lt;/h2&gt;

&lt;p&gt;The test is simple. You stop hearing the same compliment twice.&lt;/p&gt;

&lt;p&gt;If your agent says "good catch" today, look it up tomorrow. Is the lesson in your file? Did the agent read the file before it started working? If yes to both, you should never hear "good catch" on that specific topic again. If you do, the rule is wrong, the file isn't being read, or the lesson didn't generalize. All three are debuggable. None of them require a reflection agent.&lt;/p&gt;

&lt;p&gt;Praise without persistence is a leak. Patch the leak, do not build a recycling system for the runoff.&lt;/p&gt;

</description>
      <category>agentengineering</category>
      <category>claudecode</category>
      <category>platformengineering</category>
      <category>developertools</category>
    </item>
    <item>
      <title>Three Memory Systems Under One Login. Stop Picking Sides.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Sun, 03 May 2026 00:01:37 +0000</pubDate>
      <link>https://forem.com/michaeltuszynski/three-memory-systems-under-one-login-stop-picking-sides-1ela</link>
      <guid>https://forem.com/michaeltuszynski/three-memory-systems-under-one-login-stop-picking-sides-1ela</guid>
      <description>&lt;p&gt;Anthropic now ships at least three different memory models inside the Claude product family, and they don't behave the same way. Claude.ai has &lt;a href="https://claude.com/blog/memory" rel="noopener noreferrer"&gt;a chat memory feature for Pro, Max, Team, and Enterprise users&lt;/a&gt; that summarizes prior conversations and injects that summary into new chats. Claude Code has &lt;a href="https://code.claude.com/docs/en/memory" rel="noopener noreferrer"&gt;CLAUDE.md files plus a separate "auto memory" directory&lt;/a&gt; the model writes to itself, both loaded at session start. The API ships &lt;a href="https://docs.claude.com/en/docs/agents-and-tools/tool-use/memory-tool" rel="noopener noreferrer"&gt;a &lt;code&gt;memory_20250818&lt;/code&gt; tool&lt;/a&gt; that hands a &lt;code&gt;/memories&lt;/code&gt; directory to your application code so you can persist anything you want between turns. Three surfaces, three rule sets, three retention postures.&lt;/p&gt;

&lt;p&gt;I argued last week on this blog that the model isn't the variable that matters — the wrapper around it is. This is the next claim down the chain: if memory is a feature of that wrapper rather than the model, then vendor fragmentation is a memory problem you cannot solve by picking a surface. Stop trying.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's actually different across the three
&lt;/h2&gt;

&lt;p&gt;The chat surface remembers conversations as a 24-hour synthesis, project-scoped, controllable through a settings panel. The Code surface uses plain markdown files in your repo plus a per-project memory directory at &lt;code&gt;~/.claude/projects/&amp;lt;project&amp;gt;/memory/&lt;/code&gt; on the local machine. The API tool defines six file operations (&lt;code&gt;view&lt;/code&gt;, &lt;code&gt;create&lt;/code&gt;, &lt;code&gt;str_replace&lt;/code&gt;, &lt;code&gt;insert&lt;/code&gt;, &lt;code&gt;delete&lt;/code&gt;, &lt;code&gt;rename&lt;/code&gt;) and expects your application to implement the storage. None of these are wrong. They are designed for different jobs. But they share zero common format, no export path between them, and no way to carry context from a Claude.ai chat into a Claude Code session into an API agent without doing the plumbing by hand.&lt;/p&gt;
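
&lt;p&gt;Wiring the API tool into your own substrate is mostly glue. Here is a sketch of the application side handling two of the six operations against a local directory; the field names (&lt;code&gt;command&lt;/code&gt;, &lt;code&gt;path&lt;/code&gt;, &lt;code&gt;file_text&lt;/code&gt;) are assumptions drawn from the tool description, so check the memory tool docs before relying on them.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Minimal handler for memory tool calls, backed by a local directory.
// Only view and create are shown; the other four operations follow the same shape.
import { mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { dirname, join } from "node:path";

const ROOT = "./memories"; // your substrate, not the vendor's

export function handleMemoryCommand(input: any): string {
  // The tool addresses files under /memories; re-root them locally.
  const rel = String(input.path ?? "").replace(/^\/?memories\/?/, "");
  const target = join(ROOT, rel);

  switch (input.command) {
    case "view":
      return readFileSync(target, "utf8");
    case "create":
      mkdirSync(dirname(target), { recursive: true });
      writeFileSync(target, String(input.file_text ?? ""));
      return "created " + target;
    default:
      return "not implemented in this sketch: " + input.command;
  }
}
&lt;/code&gt;&lt;/pre&gt;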

&lt;p&gt;Birgitta Böckeler's writeup on &lt;a href="https://martinfowler.com/articles/exploring-gen-ai/context-engineering-coding-agents.html" rel="noopener noreferrer"&gt;context engineering for coding agents&lt;/a&gt; frames the wrapper as everything in an AI agent except the model itself: the tool definitions, the context compaction, the feedback sensors, the system prompt, the memory between sessions. Anthropic's own engineering team &lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" rel="noopener noreferrer"&gt;calls the same idea context engineering&lt;/a&gt; — the work of curating what enters the model's attention budget at each step. Memory sits squarely inside that definition. Which means the choice about &lt;em&gt;where memory lives&lt;/em&gt; is a wrapper decision, and the vendor is making it for you on each surface until you take it back.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trap
&lt;/h2&gt;

&lt;p&gt;The natural reaction when a vendor ships three memory models is to figure out which one to use. Spend an afternoon reading docs, decide that the chat synthesis is for ad hoc queries, the auto memory is for coding work, and the API tool is for production agents. Move on.&lt;/p&gt;

&lt;p&gt;That reaction is wrong, and not because any of those choices are bad in isolation. It's wrong because it assumes vendor surfaces are stable. They aren't. Claude.ai's memory was Team-and-Enterprise-only at launch in September 2025, then expanded to Pro and Max in October. Claude Code's auto memory requires v2.1.59 or later and lives in a path tied to the git repo, not the user. The API memory tool is in beta under a header that already changed naming conventions twice. The vendor will keep shipping, the rules will keep shifting, and your context will keep being a second-class object inside someone else's roadmap.&lt;/p&gt;

&lt;p&gt;There's also a deeper problem. MindStudio's writeup on &lt;a href="https://www.mindstudio.ai/blog/what-is-behavioral-lock-in-persistent-ai-agents-switching-costs" rel="noopener noreferrer"&gt;behavioral lock-in&lt;/a&gt; makes the case that agent memory creates switching costs that data portability rules cannot fix. For example, even if a vendor lets you export your memory directory tomorrow, the operational understanding the agent built — your team's terminology, your exceptions, your shorthand — does not round-trip cleanly into another vendor's surface. Eight months of accumulated context turns into a re-onboarding tax the moment you switch. Parallels' 2026 cloud survey put vendor lock-in concern at 94% across 540 IT leaders; agent memory is exactly the layer where that concern compounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I do instead
&lt;/h2&gt;

&lt;p&gt;My memory lives in a NAS-backed directory called &lt;code&gt;nexus/&lt;/code&gt;, in plain markdown, under git. It has a top-level &lt;code&gt;CLAUDE.md&lt;/code&gt; that gets auto-loaded into every Claude Code session because it sits at the project root. It has a &lt;code&gt;MEMORY.md&lt;/code&gt; for long-term curated state, a &lt;code&gt;SESSION-STATE.md&lt;/code&gt; for active context, per-domain context files at &lt;code&gt;agents/&amp;lt;domain&amp;gt;-context.md&lt;/code&gt; for finance, health, content, and so on, and daily logs at &lt;code&gt;memory/YYYY-MM-DD.md&lt;/code&gt;. Cross-references between entities use &lt;code&gt;[[double brackets]]&lt;/code&gt; so they're grep-searchable and Obsidian-renderable. Search across the corpus runs through an Ollama embedding pipeline using &lt;code&gt;nomic-embed-text&lt;/code&gt; at 768 dimensions, indexed locally — no vendor API call required to ask "what did I decide about that account fee in February?"&lt;/p&gt;
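
&lt;p&gt;The query path is small enough to sketch. Assuming Ollama is listening on its default port and the chunk index has already been built (indexing and error handling omitted), it looks roughly like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Embed a query with Ollama and rank stored chunks by cosine similarity.
type Chunk = { file: string; text: string; vector: number[] };

async function embed(text: string) {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  const data = await res.json();
  return data.embedding as number[]; // 768 dimensions for nomic-embed-text
}

function cosine(a: number[], b: number[]) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i &amp;lt; a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

export async function search(query: string, index: Chunk[], k = 5) {
  const q = await embed(query);
  const scored = index.map((c) =&amp;gt; ({ c, score: cosine(q, c.vector) }));
  scored.sort((x, y) =&amp;gt; y.score - x.score);
  return scored.slice(0, k).map((s) =&amp;gt; s.c);
}
&lt;/code&gt;&lt;/pre&gt;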

&lt;p&gt;This stack does three things the vendor surfaces cannot.&lt;/p&gt;

&lt;p&gt;First, it survives the surface split. The same files load into Claude Code, can be pasted into Claude.ai, and can be served to an API agent through the memory tool's file ops. The format is universal because the format is just files.&lt;/p&gt;

&lt;p&gt;Second, it survives the vendor switch. If I move to a different model provider tomorrow, the markdown still parses, the embeddings still resolve, and the wikilinks still work. There is no proprietary memory schema to migrate.&lt;/p&gt;

&lt;p&gt;Third, it gives me audit. I can grep my own context. I can diff what changed last week. I can &lt;code&gt;trash&lt;/code&gt; something I don't want anymore and recover it if I was wrong. None of those operations exist on the chat memory surface, and they only partially exist on the auto-memory surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  The general pattern
&lt;/h2&gt;

&lt;p&gt;The vendor's wrapper is not your wrapper. It's theirs, designed around their product roadmap and their retention model and their billing surfaces. When that wrapper includes a memory layer, putting your context in it means putting your operational knowledge in someone else's container. Fine for ephemeral chat. Not fine for the accumulated state of a year of work.&lt;/p&gt;

&lt;p&gt;The fix is not to pick the right vendor surface. The fix is to keep your memory outside any vendor surface, in a format you own, with search you control, and let the vendor surfaces read from it as needed. Claude Code already does this for free with &lt;code&gt;CLAUDE.md&lt;/code&gt;. The other surfaces will eventually catch up, or they won't, and either way your context survives.&lt;/p&gt;

&lt;p&gt;Last week's post argued that the wrapper around the model is what matters. This one finishes the sentence: don't trust theirs with your context.&lt;/p&gt;

</description>
      <category>platformengineering</category>
      <category>aiagents</category>
      <category>vendorlockin</category>
      <category>developertools</category>
    </item>
    <item>
      <title>Stop Adopting AI. Start Exposing Your Context.</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Fri, 01 May 2026 20:50:12 +0000</pubDate>
      <link>https://forem.com/michaeltuszynski/stop-adopting-ai-start-exposing-your-context-2pog</link>
      <guid>https://forem.com/michaeltuszynski/stop-adopting-ai-start-exposing-your-context-2pog</guid>
      <description>&lt;p&gt;The AI adoption pathway that's actually working in 2026 is not "deploy a copilot to your team." It's "expose your org's context to whichever model your team already chose." That sounds like a small shift. It's not. It changes who picks the tool, what your procurement team buys, and where the work of getting value out of AI actually lives.&lt;/p&gt;

&lt;p&gt;The numbers behind the shift are bleak for the old playbook. MIT's NANDA study of 300 enterprise AI deployments found 95% of GenAI pilots delivered no measurable P&amp;amp;L impact. The diagnosis was not the model. It was missing context — the data, workflow knowledge, and institutional memory the model needed to actually be useful inside a specific business. &lt;a href="https://atlan.com/know/context-engineering-framework/" rel="noopener noreferrer"&gt;Atlan summarizes the same finding&lt;/a&gt; and quotes Box CEO Aaron Levie, who calls context engineering "the long pole in the tent for AI Agents adoption in most organizations." Gartner went further in mid-2025: "context engineering is in, prompt engineering is out," with a prediction that 80% of AI tools will incorporate it by 2028.&lt;/p&gt;

&lt;p&gt;Klarna is the worked example everyone now points at. Between 2022 and 2024, the company replaced about 700 customer-service positions with an OpenAI-powered chatbot. By spring 2025 customer satisfaction had dropped 22% and complaints had piled up. &lt;a href="https://www.entrepreneur.com/business-news/klarna-ceo-reverses-course-by-hiring-more-humans-not-ai/491396" rel="noopener noreferrer"&gt;The CEO publicly admitted the cuts went too far&lt;/a&gt; and pivoted to a hybrid model, rehiring humans for anything requiring judgment. The model wasn't broken. The pathway was. The org rolled out an agent without exposing the context it needed — refund policies, payment edge cases, regional regulations, escalation patterns — and the agent shipped generic answers to specific problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What replaced it
&lt;/h2&gt;

&lt;p&gt;Three things converged in late 2025 that quietly killed the old pathway.&lt;/p&gt;

&lt;p&gt;The first is the &lt;strong&gt;Model Context Protocol&lt;/strong&gt;. Anthropic open-sourced MCP in November 2024; by March 2026 the SDK was hitting &lt;a href="https://thenewstack.io/why-the-model-context-protocol-won/" rel="noopener noreferrer"&gt;97 million monthly downloads&lt;/a&gt; — a 970x growth curve from launch. OpenAI, Microsoft, Google, and AWS all shipped MCP client support within thirteen months. An independent census in Q1 2026 indexed 17,468 servers across registries. MCP is not a model. It is a protocol for handing a model the right context — your Slack, your issue tracker, your observability stack, your customer database — at the moment of the request.&lt;/p&gt;

&lt;p&gt;The second is &lt;strong&gt;agent skills as a portable artifact&lt;/strong&gt;. &lt;a href="https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills" rel="noopener noreferrer"&gt;Anthropic launched Agent Skills in October 2025&lt;/a&gt; and open-sourced the SKILL.md format in December. Atlassian, Canva, Cloudflare, Figma, Notion, Ramp, and Sentry all shipped skills in the launch window. A skill is a directory: instructions, scripts, resources. Drop the directory next to a workflow that recurs and any compatible agent can run it. The format is Anthropic's, but the spec is the same shape as .cursorrules, AGENTS.md, GitHub Spaces, and the rest of the convergence happening across vendors.&lt;/p&gt;

&lt;p&gt;The third is &lt;strong&gt;the in-repo memory file&lt;/strong&gt; as a de facto standard. CLAUDE.md, AGENTS.md, .cursorrules, and the rest are all the same idea: a markdown file at the root of a project that tells whatever agent gets dropped in what the project is, what conventions matter, what the gotchas are, and where the bodies are buried. The agent reads the file at the start of every session. The org documents itself once. The dev picks the model.&lt;/p&gt;

&lt;p&gt;Read those three together and the picture is obvious. The unit of AI adoption stopped being "the agent." It became "the substrate the agent stands on."&lt;/p&gt;

&lt;h2&gt;
  
  
  What that looks like in practice
&lt;/h2&gt;

&lt;p&gt;I run a personal agentic stack — NEXUS — that's been doing this for about a year. The repo has a CLAUDE.md at the root that lays out the workspace structure, identity, behavioral protocols, and lessons learned. There are a dozen &lt;code&gt;agents/&amp;lt;domain&amp;gt;-context.md&lt;/code&gt; files for finance, content, health, the rest. There's an MCP server for Gmail, Calendar, Slack, Drive, and a few internal tools. There are skills for the recurring workflows — publishing a blog post, running a finance check, doing a health digest. The agent I happen to be using on a given day — Claude Code mostly, occasionally Cursor — reads what it needs at session start and gets to work.&lt;/p&gt;

&lt;p&gt;I don't pick a model and roll it out. I expose context, and whichever model is in the chair when I sit down knows what's going on.&lt;/p&gt;

&lt;p&gt;The same shape works at company scale, just with more access controls and an actual budget. The work is documenting the org until any agent dropped into it would be useful. The model becomes a free variable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this changes about procurement
&lt;/h2&gt;

&lt;p&gt;The old AI procurement motion: pick a vendor, sign a per-seat contract, train the team on the tool, run change-management sessions, hope adoption hits 30%. This is what Klarna did. The asset created at the end of it is a vendor relationship and some training decks.&lt;/p&gt;

&lt;p&gt;The new motion: invest in the context infrastructure — an MCP gateway, a documentation platform that agents can read, semantic indexes for your wikis and tickets, a skills directory for recurring workflows. The model is whoever the dev or team picked. The procurement decision is &lt;em&gt;which surfaces to expose&lt;/em&gt;, not &lt;em&gt;which copilot to license&lt;/em&gt;. The asset created is a substrate that survives the next model rotation.&lt;/p&gt;

&lt;p&gt;The implication that nobody loves: tool-selection RFPs become a free variable rotation, not a strategic decision. The strategic decision is what your org has to say to a model that doesn't already know it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do this week
&lt;/h2&gt;

&lt;p&gt;Four moves if you want to test the pathway without committing to a vendor.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Audit your CLAUDE.md / AGENTS.md surface.&lt;/strong&gt; Drop a coding agent into your main repo with no other context. Ask it to make a non-trivial change. If it makes obvious mistakes — wrong test runner, ignored coding conventions, bypassed an internal review process — those are the gaps a memory file should close. Write that file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick three high-frequency workflows and write skills.&lt;/strong&gt; The kind of thing a senior engineer explains to a new hire in their first week. Convert each to a SKILL.md or an equivalent. Measure time-to-task before and after.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stand up an MCP gateway for your top three internal systems.&lt;/strong&gt; Issue tracker, observability, customer database. Most have community MCP servers already; the work is access control, not implementation. A minimal server sketch follows this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop running tool-selection RFPs.&lt;/strong&gt; Or if you have to, run them as a side track. The strategic work — and the asset that survives the next model release — is the context, not the contract.&lt;/li&gt;
&lt;/ul&gt;
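
&lt;p&gt;For the MCP gateway item above, the per-system server really is the easy part. A minimal sketch using the TypeScript SDK (&lt;code&gt;@modelcontextprotocol/sdk&lt;/code&gt;); the tool name, the schema, and the internal API behind it are illustrative, and the SDK surface is still moving, so treat this as the shape rather than the letter.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// One tool, one internal system, stdio transport. Access control lives in the handler.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "issue-tracker", version: "0.1.0" });

server.tool(
  "get_ticket",
  "Fetch one ticket from the internal issue tracker by ID",
  { id: z.string() },
  async ({ id }) =&amp;gt; {
    // Replace with the real call to your tracker's API.
    return { content: [{ type: "text", text: "Ticket " + id + ": (stub)" }] };
  }
);

await server.connect(new StdioServerTransport());
&lt;/code&gt;&lt;/pre&gt;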

&lt;h2&gt;
  
  
  The throughline
&lt;/h2&gt;

&lt;p&gt;The agentic adoption series has been running through failure modes. Part 1 was your team not trusting the agent. Part 2 was your customers not trusting the agent. The Cron-Not-Agents post was teams agentifying things that should have stayed deterministic. Last week's was the IAM seam — agent identities sharing primitives with everything else. This one is the answer to all of them.&lt;/p&gt;

&lt;p&gt;The pathway that works in 2026 is not adoption of a tool. It is exposure of a substrate. Once your org has the substrate, whatever model your team picks lands on something it can stand on. Without it, every rollout looks like Klarna's: an agent given a job, with no context for how the job is actually done, generating generic answers to specific problems and dropping CSAT 22 points before someone notices.&lt;/p&gt;

&lt;p&gt;Pick the context. The model is going to keep changing.&lt;/p&gt;

</description>
      <category>aiadoption</category>
      <category>contextengineering</category>
      <category>mcp</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>The 'Agent-Only' Role That Wasn't</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Thu, 30 Apr 2026 04:55:50 +0000</pubDate>
      <link>https://forem.com/michaeltuszynski/the-agent-only-role-that-wasnt-2j2g</link>
      <guid>https://forem.com/michaeltuszynski/the-agent-only-role-that-wasnt-2j2g</guid>
      <description>&lt;p&gt;Microsoft just patched a built-in Entra ID role they shipped as scoped to "agent identities" — and which, in practice, let anyone holding that role take over almost any service principal in the tenant. The fix rolled out across all clouds on April 9. The disclosure went public last week. If your org turned on Microsoft's agent identity platform in the last quarter and assigned the role anywhere along the way, this is a detail to chase down before your next audit.&lt;/p&gt;

&lt;p&gt;The role is &lt;strong&gt;Agent ID Administrator&lt;/strong&gt;. Microsoft introduced it as part of the agent identity platform — the lifecycle management story for AI agents in Entra: blueprints, agent identities, the rest of it. The documented scope was tight: agent objects, agent users, agent blueprints. The actual scope, until April 9, was much wider. Anyone holding the role could become &lt;em&gt;owner&lt;/em&gt; of any service principal in the tenant — agent or not — and as soon as they were owner, mint a credential and authenticate as that principal.&lt;/p&gt;

&lt;p&gt;That second step is the takeover primitive. Silverfort's writeup puts it bluntly: "Ownership is a takeover primitive — become owner, then add a secret and authenticate as that service principal." If the principal you took over had elevated Graph permissions or held a directory role, you now hold those permissions too. In tenants where any privileged service principal exists — &lt;a href="https://www.silverfort.com/blog/agent-id-administrator-scope-overreach-service-principal-takeover-in-entra-id/" rel="noopener noreferrer"&gt;Silverfort says 99% of organizations have at least one&lt;/a&gt; — that becomes a tenant-takeover path.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mechanics, briefly
&lt;/h2&gt;

&lt;p&gt;Two API permissions did the work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;microsoft.directory/agentIdentities/owners/update&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;microsoft.directory/agentIdentityBlueprintPrincipals/owners/update&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both should have been gated by an "is this an agent-backed object" check. They weren't. The check existed for &lt;em&gt;application&lt;/em&gt; objects — modifying an application directly was correctly blocked. But the authorization layer for service principal ownership didn't enforce the same boundary. So the role couldn't change the application, but it could hand itself the keys to the principal that authenticates as the application. Microsoft's patch closes the gap: as &lt;a href="https://thehackernews.com/2026/04/microsoft-patches-entra-id-role-flaw.html" rel="noopener noreferrer"&gt;The Hacker News reported&lt;/a&gt;, ownership writes against non-agent service principals now return a "Forbidden" response.&lt;/p&gt;

&lt;p&gt;The disclosure timeline reads like a clean responsible-disclosure case. Silverfort's Noa Ariel found the flaw on February 24, reported it to MSRC on March 1, MSRC confirmed the behavior on March 26, and the patch was rolling by April 4. Six weeks from report to global rollout is a reasonable cadence — quick, even, given the blast radius.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this is the more interesting story than most agent-security posts
&lt;/h2&gt;

&lt;p&gt;Two reasons this matters past the headline.&lt;/p&gt;

&lt;p&gt;First, it lands on the trust model that everyone shipping agentic systems is implicitly buying into. The whole pitch of "agent identity" — and not just Microsoft's — is that agents are first-class principals with scoped permissions, that their access can be governed independently of whatever else lives in the directory. That pitch only holds if the boundary is real. Here the boundary was a line in the docs. Underneath, agent identities are service principals — the same primitive that powers every app registration, every workload identity, every CI/CD service account in the tenant. Build a new role on top of that primitive, miss one ownership-check edge case, and "agent-scoped" becomes "directory-scoped" without anybody noticing. Ariel's own framing: "When role permissions are applied on top of shared foundations without strict scoping, access can extend beyond what was originally intended." That is the architectural lesson; the bug is the proof.&lt;/p&gt;

&lt;p&gt;Second, this is a preview of the bug class to expect for the next two years across every vendor's agent identity story. AWS, Google, Okta, ServiceNow, the rest — all of them are going to ship "agent identity" SKUs that bolt on top of the IAM primitives they already had. The boundaries are going to be subtle, and most security teams are not yet auditing role definitions for this kind of scope-overreach. CSO Online's writeup &lt;a href="https://www.csoonline.com/article/4163708/microsoft-patched-an-agent-only-role-that-was-not.html" rel="noopener noreferrer"&gt;calls the architectural confusion explicitly&lt;/a&gt;: since agent identities are built on the same technical primitives as applications, the boundary between "agent" and "non-agent" objects wasn't properly defined. Expect that exact sentence to be rewritten about other vendors over the next year.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do this week
&lt;/h2&gt;

&lt;p&gt;If your tenant uses Entra ID and you have the agent identity platform turned on, this is a good week to do four things.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Audit who has Agent ID Administrator.&lt;/strong&gt; The patch closes the takeover path forward, but anyone who &lt;em&gt;had&lt;/em&gt; the role between role launch and April 9 had the capability. Read your audit logs for applicationOwner add events on non-agent service principals during that window. If you find any you can't account for, treat them as suspect credentials and rotate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't trust role names as scope statements.&lt;/strong&gt; "Agent ID Administrator" sounded scoped. It wasn't. From here forward, treat any new role labeled with a product surface — agents, copilots, workflows — as a &lt;em&gt;suggestion&lt;/em&gt; about scope. Verify against the actual permission set and the objects those permissions can touch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inventory your privileged service principals.&lt;/strong&gt; Per Silverfort, more than half the tenants they looked at have agent identities deployed at scale, with hundreds per tenant. Service principals with directory roles or app permissions like RoleManagement.ReadWrite.Directory are the high-value targets. Review the list. Demote the ones that don't need that level of access. Most don't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Push your vendor on the boundary, not the brand.&lt;/strong&gt; If a sales pitch tells you a role is "scoped to agents" or "limited to copilots," ask: scoped at the API surface, or scoped at the documentation level? It is not a snarky question. It is the question.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The throughline
&lt;/h2&gt;

&lt;p&gt;The first post in this series was about getting your own team to trust an agent. The second was about getting external customers to trust one. The third — last week's — was about when &lt;em&gt;not&lt;/em&gt; to make an agent at all, and use cron and bash instead. The bug Microsoft just patched is the next failure mode in the same chain: the &lt;em&gt;identity&lt;/em&gt; layer that the agent inherits from your existing platform is older, broader, and shares more primitives than the marketing implies. The trust model for an agent is the trust model for a service principal, plus a label. Sometimes the label is enforced. Sometimes it isn't.&lt;/p&gt;

&lt;p&gt;Until your security tooling can audit "what can this role &lt;em&gt;actually&lt;/em&gt; do" instead of "what is this role &lt;em&gt;named&lt;/em&gt;," assume the label is decorative. Microsoft just gave us the worked example. The next vendor will too.&lt;/p&gt;

</description>
      <category>identitysecurity</category>
      <category>entraid</category>
      <category>aiagents</category>
      <category>microsoft</category>
    </item>
    <item>
      <title>Stop Turning Your Cron Jobs Into Agents</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Sun, 26 Apr 2026 18:06:16 +0000</pubDate>
      <link>https://forem.com/michaeltuszynski/stop-turning-your-cron-jobs-into-agents-3i84</link>
      <guid>https://forem.com/michaeltuszynski/stop-turning-your-cron-jobs-into-agents-3i84</guid>
      <description>&lt;p&gt;The current message from engineering leadership at most companies I talk to is some version of: "find the deterministic automation in your stack and make it agentic." A &lt;a href="https://www.reddit.com/r/devops/comments/1sw1adi/lead_push_to_migrate_automation_flows_to_ai_agents/" rel="noopener noreferrer"&gt;recent r/devops thread&lt;/a&gt; captured the frustration: an SRE asked how to push back on a director who wanted every Airflow DAG converted into an agent loop because "agents are the future."&lt;/p&gt;

&lt;p&gt;This is mostly bad advice. Not because agents are bad — they're great when the problem actually needs them — but because most existing automation does not need them and gets worse when retrofitted. Cron, Airflow, Step Functions, plain bash scripts: deterministic, idempotent, debuggable, free at the margin. Replacing them with an LLM call buys you variance you did not have, costs you tokens you did not spend, and produces logs you have to read instead of grep.&lt;/p&gt;

&lt;p&gt;I run an agentic system as my daily driver. NEXUS has more than thirty scheduled processes — content scanning, finance sync, DeFi monitoring, Polymarket fair-value estimation, podcast digests, calendar audits. Of those, exactly three involve an agent in the loop. The rest are bash + cron + SQLite + a handful of LaunchAgents. They run silently, log structured output, and have not surprised me in months. The agents I do run are at the &lt;em&gt;judgment seams&lt;/em&gt; — not the plumbing.&lt;/p&gt;

&lt;p&gt;Here is the test I apply when someone wants to agentify something.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Does the input space exceed what you can pre-enumerate?
&lt;/h2&gt;

&lt;p&gt;If the inputs are a known list — accounts to sync, files in a directory, customer records to enrich — you do not need an agent. You need a loop. Agents earn their keep when the input space is open: arbitrary user prompts, novel documents, situations the original author did not anticipate. If you can write the input down as an array, write a loop. If you cannot, an agent might be warranted.&lt;/p&gt;
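
&lt;p&gt;The difference in code is not subtle. If the input is a list you already know, the whole "pipeline" is a loop; &lt;code&gt;syncAccount&lt;/code&gt; here stands in for whatever deterministic work the job does.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Enumerable inputs: a known list, a loop, no model in sight.
const accounts = ["checking", "savings", "brokerage"];

async function syncAccount(name: string) {
  // deterministic, idempotent, logs you can grep
  console.log("synced " + name + " at " + new Date().toISOString());
}

for (const account of accounts) {
  await syncAccount(account);
}
&lt;/code&gt;&lt;/pre&gt;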

&lt;p&gt;This is the cleanest filter. Most "agentify our pipelines" pitches fail it on the first question.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Does the output require judgment, not pattern-matching?
&lt;/h2&gt;

&lt;p&gt;A regex extracting amounts from invoices is pattern-matching. An LLM call interpreting a customer email and routing it to the right team can be pattern-matching too — but a fine-tuned classifier or a vector search will do it cheaper, faster, and with calibrated confidence. Agents earn their keep when the output requires reasoning across context the model has to pull together at run time. "Read this PR, find the architectural risk, and explain it to a junior engineer" is judgment. "Detect the language of this string" is not.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.anthropic.com/engineering/building-effective-agents" rel="noopener noreferrer"&gt;Anthropic's own framing&lt;/a&gt; draws the same line, between &lt;em&gt;workflows&lt;/em&gt; (LLM calls orchestrated through predefined paths) and &lt;em&gt;agents&lt;/em&gt; (LLMs deciding their own tool use and control flow). Most of what teams call "agents" is actually a workflow with a vibes-based orchestrator. Workflows are fine. They are also cheaper to operate, easier to test, and dramatically less likely to surprise you in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Will a human review every run before it commits?
&lt;/h2&gt;

&lt;p&gt;If yes, you can be more permissive about agent variance. The human is the safety net. NEXUS's content pipeline is exactly this — Claude drafts a LinkedIn post, the post lands in Slack, I approve or reject before anything goes external. The variance is fine because I'm in the loop.&lt;/p&gt;

&lt;p&gt;If no — if the system runs unattended, at scale, and acts on its outputs — every percent of variance becomes a percent of incidents. &lt;a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/" rel="noopener noreferrer"&gt;METR's RCT on experienced developers&lt;/a&gt; showed that even with humans reviewing AI output, the net effect on throughput can be negative. Without a human reviewer, the variance compounds without correction.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Is the cost of being wrong proportional to its frequency?
&lt;/h2&gt;

&lt;p&gt;Deterministic automation fails predictably and rarely. Agents fail probabilistically, and the failures don't correlate with anything you can grep for. If the cost of one bad output is high and one bad output per ten thousand is plausible, the math gets ugly fast. An Airflow DAG that fails 0.01% of the time gets paged on, fixed, and forgotten. An agent that fails 1% of the time across a hundred-thousand-call workload is a slow-rolling incident with no obvious signature.&lt;/p&gt;

&lt;p&gt;Then there is the literal cost. An r/aws thread this week described a &lt;a href="https://www.reddit.com/r/aws/comments/1svmk5y/aws_97k_bill_out_of_nowhere/" rel="noopener noreferrer"&gt;$97,000 surprise bill&lt;/a&gt; from a runaway workload — and that's deterministic infrastructure. Agentic workflows multiply this risk: token usage scales with input size, tool calls retry on transient failures, agent loops can recurse if the termination condition is poorly defined. The blast radius of a bad cost outcome is larger and harder to predict than for a Lambda that just runs longer.&lt;/p&gt;
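
&lt;p&gt;The defense is boring and explicit: hard caps the loop cannot talk its way out of. A sketch, with &lt;code&gt;callModel&lt;/code&gt; and &lt;code&gt;runTool&lt;/code&gt; as hypothetical stand-ins for your provider SDK and tool layer:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Guards against the runaway case: a turn cap and a token budget.
type Step = { done: boolean; answer: string; toolCall: string; tokens: number };
declare function callModel(transcript: string): Promise&amp;lt;Step&amp;gt;;
declare function runTool(call: string): Promise&amp;lt;string&amp;gt;;

const MAX_TURNS = 8;
const TOKEN_BUDGET = 50_000;

export async function runBoundedAgent(task: string) {
  let transcript = task;
  let tokensUsed = 0;
  for (let turn = 0; turn &amp;lt; MAX_TURNS; turn++) {
    const step = await callModel(transcript);
    tokensUsed += step.tokens;
    if (tokensUsed &amp;gt; TOKEN_BUDGET) return "aborted: token budget exceeded";
    if (step.done) return step.answer;
    transcript += await runTool(step.toolCall); // feed the tool result back in
  }
  return "aborted: turn cap reached"; // the termination condition you control
}
&lt;/code&gt;&lt;/pre&gt;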

&lt;h2&gt;
  
  
  What "agentify it" usually means in practice
&lt;/h2&gt;

&lt;p&gt;The honest version of most "agentify the pipeline" projects is one of these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Wrap an existing script in an LLM call so the project counts as AI.&lt;/strong&gt; Real motivation: the team needs to put something on the executive dashboard. The LLM adds nothing the script did not already do, but adds a per-run token cost and a non-deterministic failure mode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replace a switch statement with a prompt.&lt;/strong&gt; This is the worst version. The original code was already an interpreter — for keys you defined, with branches you wrote. The prompt is the same logic in slower, more expensive, less testable form.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add an agent because the team wants experience with the tooling.&lt;/strong&gt; This is fine, if you scope it. Pick one step that actually has judgment in it. Leave the rest of the pipeline alone. Most teams cannot resist the urge to agentify everything.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What you should agentify
&lt;/h2&gt;

&lt;p&gt;The places where agents earn their keep in a real pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Content drafting, where a human reviews.&lt;/strong&gt; Variance is the feature; the human is the filter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Triage and routing of unstructured inputs&lt;/strong&gt;, when the input space is large and a labeled training set is unavailable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decisions that require pulling context together at run time&lt;/strong&gt; — code review with a human approving, incident summarization, customer issue triage with citations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exploratory tool use&lt;/strong&gt; where the right sequence of operations is not known in advance — debugging, research, data exploration with a human in the loop.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern: judgment, not plumbing. Variance acceptable, because there is a reviewer. Open input space, not enumerable.&lt;/p&gt;

&lt;h2&gt;
  
  
  A trap nobody warns you about
&lt;/h2&gt;

&lt;p&gt;Sometimes the right migration is &lt;em&gt;backwards&lt;/em&gt;. You shipped an agent six months ago because agents were the move. The agent is now slower, more expensive, and less reliable than the deterministic alternative would have been. The honest fix is to retire the agent and replace it with a script.&lt;/p&gt;

&lt;p&gt;This is a hard call to make politically. Nobody gets promoted for replacing AI with bash. But the &lt;a href="https://www.anthropic.com/research/project-vend-1" rel="noopener noreferrer"&gt;Project Vend retrospective&lt;/a&gt; from Anthropic showed exactly this — Phase 2 fixed Claude's vending-machine-shopkeeper failures less by upgrading the model and more by adding bureaucracy: a CRM, mandatory research steps before quoting, an inventory tool. The bureaucracy is what made the agent reliable. At some point, if you keep adding bureaucracy, you have rebuilt a deterministic workflow with extra steps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The right question is not "how do we make this agentic?" The right question is: where in this pipeline does judgment have to happen at run time, on inputs we cannot pre-enumerate, with a reviewer present or stakes low enough that variance is acceptable? That set is small. It is real, but it is small.&lt;/p&gt;

&lt;p&gt;Most of your automation should keep being cron, bash, and SQL. If you have to put one thing on the executive dashboard, put the part where the agent does &lt;em&gt;not&lt;/em&gt; run. That is the part of your pipeline that runs at three in the morning without paging anyone, and the part that will still be running long after the AI hype cycle has moved on.&lt;/p&gt;

</description>
      <category>platformengineering</category>
      <category>agenticai</category>
      <category>automation</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Getting Customers to Trust an Agent That Acts on Their Behalf</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Sat, 25 Apr 2026 20:52:08 +0000</pubDate>
      <link>https://forem.com/michaeltuszynski/getting-customers-to-trust-an-agent-that-acts-on-their-behalf-5f2e</link>
      <guid>https://forem.com/michaeltuszynski/getting-customers-to-trust-an-agent-that-acts-on-their-behalf-5f2e</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 2 of 2 on agentic system adoption. Part 1 covered internal adoption: &lt;a href="https://www.mpt.solutions/getting-your-own-team-to-actually-use-the-agent-you-built/" rel="noopener noreferrer"&gt;getting your own team to actually use the agent you built&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In February 2024, Klarna &lt;a href="https://www.klarna.com/international/press/klarna-ai-assistant-handles-two-thirds-of-customer-service-chats-in-its-first-month/" rel="noopener noreferrer"&gt;announced&lt;/a&gt; that its AI assistant had handled 2.3 million customer conversations in its first month — two-thirds of all customer service chats, doing the work of 700 full-time agents, with resolution times down from 11 minutes to under 2. It was the poster child for agentic deployment.&lt;/p&gt;

&lt;p&gt;Fifteen months later, CEO Sebastian Siemiatkowski &lt;a href="https://www.independent.co.uk/tech/klarna-ai-chatbot-customer-service-b2755734.html" rel="noopener noreferrer"&gt;conceded&lt;/a&gt; that the rollout had produced "lower quality" service. Klarna started rehiring humans.&lt;/p&gt;

&lt;p&gt;If you're shipping agentic features to paying customers — users who can churn the moment the agent confidently does the wrong thing — Klarna's arc is your warning. Internal adoption (covered in &lt;a href="https://www.mpt.solutions/getting-your-own-team-to-actually-use-the-agent-you-built/" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt;) dies quietly. Engineers stop using the tool and find a workaround. External adoption dies loudly. Customers post screenshots, file chargebacks, and leave.&lt;/p&gt;

&lt;p&gt;External adoption isn't a feature launch. It's a trust contract. And the contract has five terms.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stakes are different
&lt;/h2&gt;

&lt;p&gt;When internal adoption fails, your throughput drops 1.5% and your delivery stability drops 7.2% (the DORA 2024 numbers I cited in Part 1). When external adoption fails, the consequences look more like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://arstechnica.com/tech-policy/2024/02/air-canada-must-honor-refund-policy-invented-by-airlines-chatbot/" rel="noopener noreferrer"&gt;Air Canada&lt;/a&gt;&lt;/strong&gt; was ordered by the BC Civil Resolution Tribunal in February 2024 to pay a customer whose refund policy the airline's chatbot had invented. Air Canada's defense — that the chatbot was "a separate legal entity" — was rejected. Your agent is legally you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.bbc.com/news/technology-68025677" rel="noopener noreferrer"&gt;DPD&lt;/a&gt;&lt;/strong&gt; had its customer service chatbot, after a system update, swear at a customer and insult the company in a haiku. The screenshot hit 800,000 views in 24 hours. DPD pulled the AI element within the day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.cnbc.com/2024/06/17/mcdonalds-ends-ibm-partnership-to-test-ai-ordering-at-drive-thrus.html" rel="noopener noreferrer"&gt;McDonald's&lt;/a&gt;&lt;/strong&gt; killed its IBM-powered voice-AI drive-thru in June 2024 after a hundred-plus restaurant pilot. The proximate cause was viral TikToks of the agent adding bacon to ice cream and ordering 260 McNuggets unprompted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three stories have the same structure: the agent confidently did the wrong thing, the failure mode was visible to users before it was visible to the company, and the choice was recall or litigate. None of them chose "iterate in place."&lt;/p&gt;

&lt;p&gt;Here's how you don't end up there.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Make the undo obvious
&lt;/h2&gt;

&lt;p&gt;The Nielsen Norman Group has ranked &lt;a href="https://www.nngroup.com/articles/user-control-and-freedom/" rel="noopener noreferrer"&gt;"user control and freedom"&lt;/a&gt; as the third core usability heuristic for decades. In plain language: users need an emergency exit.&lt;/p&gt;

&lt;p&gt;For agentic products, this is move number one. A button that says "undo this action." A visible confirmation before anything moves money, sends messages, or modifies customer state. A "revert to before the agent touched this" option.&lt;/p&gt;

&lt;p&gt;The reason is psychology: users experiment more when they know they can back out. That experimentation is how they learn what the agent is actually good at. Without an obvious undo, the first surprise is a last surprise — the user closes the tab and doesn't come back.&lt;/p&gt;

&lt;p&gt;In NEXUS, every time my content pipeline drafts a LinkedIn post, Slack gets an approval with edit/reject buttons. Nothing posts without me hitting approve. That approval surface isn't a development stage; it's the product. Remove it and I don't trust the pipeline, even though I built it.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Signal uncertainty
&lt;/h2&gt;

&lt;p&gt;Overconfidence is how agents lose users. The agent that says "here's your answer" and is wrong burns more trust than the agent that says "this looks likely but I'd double-check" and is wrong.&lt;/p&gt;

&lt;p&gt;This isn't a UX trick. It's a calibration problem. Most LLM-powered agents have the confidence signal available — log probabilities, retrieval scores, constraint violations — and throw it away before it reaches the user. Surface it.&lt;/p&gt;

&lt;p&gt;Concrete patterns that work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Show which sources the agent used. If the retrieval score is low, say so (a sketch follows this list).&lt;/li&gt;
&lt;li&gt;When the agent takes an action based on an ambiguous instruction, show the interpretation it chose and offer the alternative.&lt;/li&gt;
&lt;li&gt;Flag "this is outside the agent's usual scope" when the request deviates from a known pattern. Don't let the agent bluff.&lt;/li&gt;
&lt;/ul&gt;
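
&lt;p&gt;The first of those is a few lines of code once the score survives the trip to the UI. The threshold and the wording here are illustrative, not tuned values.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Keep the retrieval score attached to the answer and hedge below a threshold.
type RetrievedAnswer = { text: string; sources: string[]; score: number };

export function presentAnswer(a: RetrievedAnswer) {
  const cite = a.sources.length ? "\nSources: " + a.sources.join(", ") : "";
  if (a.score &amp;lt; 0.5) {
    return "This looks likely, but I'd double-check it:\n" + a.text + cite;
  }
  return a.text + cite;
}
&lt;/code&gt;&lt;/pre&gt;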

&lt;p&gt;Air Canada's chatbot didn't hedge. It invented a refund policy with full confidence. If it had said "I'm not 100% sure about the bereavement policy — let me route you to a human," Air Canada would have lost nothing. Instead, they lost a court case and took a reputational hit that's still being cited two years later.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Tell them what the agent won't do
&lt;/h2&gt;

&lt;p&gt;Scope boundaries build trust. If the agent is going to draft replies but never send them, say so on the label. If it will summarize but not delete, say so. If it will take action inside the product but never touch external systems, say so.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://artificialintelligenceact.eu/article/50/" rel="noopener noreferrer"&gt;EU AI Act, Article 50&lt;/a&gt;, now requires disclosure when users interact with AI systems. That's the regulatory floor. The product floor is higher: tell users not just that they're talking to an agent, but what the agent's blast radius is.&lt;/p&gt;

&lt;p&gt;Users who know the limits stop testing them. Users who don't know the limits probe until they find the edge, and the edge is always embarrassing.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Price for non-determinism
&lt;/h2&gt;

&lt;p&gt;This is the move most teams miss. Agentic systems have variable cost per interaction — tokens, tool calls, retries — and if you pass that variability through to the customer via token-based pricing, you're asking them to bet on their own workload. Most will churn the first time a 5x cost spike hits their bill.&lt;/p&gt;

&lt;p&gt;Intercom's Fin AI agent &lt;a href="https://fin.ai/pricing" rel="noopener noreferrer"&gt;prices at $0.99 per resolution&lt;/a&gt;. Their definition of "resolution" is itself a product decision: the outcome is counted when a customer confirms the issue is resolved, when they don't follow up after Fin responds, or when Fin completes a workflow. One charge per conversation, regardless of how many messages it took.&lt;/p&gt;

&lt;p&gt;That's a pricing model that says to the customer: the variability is my problem, not yours. Teams adopting outcome-based pricing are absorbing the non-determinism into their margin instead of their customer's budget.&lt;/p&gt;

&lt;p&gt;If your pricing page requires a customer to model their own token usage to predict monthly cost, you've shifted your operational risk onto their finance team. They will fire you and buy from someone who didn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Make failure modes visible
&lt;/h2&gt;

&lt;p&gt;The DPD chatbot didn't start swearing on its first day. It started failing after a system update. Somewhere, someone had logs. The failure was visible internally — to nobody who was watching — before it was visible externally, to everyone who was.&lt;/p&gt;

&lt;p&gt;Build the visibility into the product. For every action the agent takes, the user (and your on-call) should be able to answer three questions within 30 seconds:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What did the agent try to do?&lt;/li&gt;
&lt;li&gt;What did it actually do?&lt;/li&gt;
&lt;li&gt;What did it decide not to do, and why?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For paying customers, this often lives in an activity log, an audit trail, or an admin dashboard. It doesn't have to be beautiful. It has to be complete. When a customer files a ticket saying "the agent did something weird," the ticket is cheap to resolve if the log is there. If it isn't, you're reconstructing state from memory, and the customer's story beats yours in court.&lt;/p&gt;
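
&lt;p&gt;"Complete" in practice means every record can answer those three questions on its own. An illustrative shape, with example field names rather than a standard schema:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// One record per agent action, written before and finalized after the action runs.
interface AgentActionRecord {
  timestamp: string;                              // ISO 8601
  intent: string;                                 // what the agent tried to do
  action: string;                                 // what it actually did: tool, args, target
  outcome: "succeeded" | "failed" | "partial";
  skipped: { action: string; reason: string }[];  // what it decided not to do, and why
  undoRef?: string;                               // handle for reverting, if supported
}
&lt;/code&gt;&lt;/pre&gt;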

&lt;h2&gt;
  
  
  How you know it's working
&lt;/h2&gt;

&lt;p&gt;For internal agents, I track daily active use by engineers who aren't on the project team. For external agents, the leading indicators are different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unprompted return rate.&lt;/strong&gt; Users who came back without a marketing nudge. Internal usage can be mandated; external cannot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Override rate trending down.&lt;/strong&gt; If customers are manually correcting the agent less over time, trust is rising.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalation-to-human rate holding steady, not dropping to zero.&lt;/strong&gt; An agent that never escalates is one that's confidently wrong. Healthy agents know their limits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Net revenue retention on accounts that opted into the agent.&lt;/strong&gt; The hardest proof: do customers expand after the agent ships, or contract?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Klarna is the cautionary tale for all of these. The launch numbers looked spectacular. The retention numbers told a different story, quietly, for a year, until Siemiatkowski had to say the quiet part out loud: "cost was a too predominant evaluation factor… what you end up having is lower quality."&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Internal adoption dies in the gap between demo and fourth use. External adoption dies the moment the agent confidently does the wrong thing to a customer who's paying attention.&lt;/p&gt;

&lt;p&gt;Both failures are preventable. Both require the same underlying discipline: trust is earned on boring interactions, not hero ones. The patterns that build internal trust — readable logs, obvious override, default to boring — scale externally if you take them seriously.&lt;/p&gt;

&lt;p&gt;Air Canada, DPD, McDonald's, and Klarna all had one thing in common. Their agents worked in the demo. They worked with the pilot group. They worked for the first thousand users. Then the distribution shifted, or an update went out, or a customer asked a question nobody had predicted, and the agent confidently did the wrong thing.&lt;/p&gt;

&lt;p&gt;The companies that keep shipping agentic features — and keep their customers — aren't the ones with the smartest agents. They're the ones whose agents know what they don't know, whose pricing absorbs the variance, and whose undo button is bigger than the agent.&lt;/p&gt;

</description>
      <category>agenticai</category>
      <category>aiadoption</category>
      <category>productengineering</category>
      <category>customertrust</category>
    </item>
    <item>
      <title>Getting Your Own Team to Actually Use the Agent You Built</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Fri, 24 Apr 2026 15:51:13 +0000</pubDate>
      <link>https://forem.com/michaeltuszynski/getting-your-own-team-to-actually-use-the-agent-you-built-18cc</link>
      <guid>https://forem.com/michaeltuszynski/getting-your-own-team-to-actually-use-the-agent-you-built-18cc</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 1 of 2 on agentic system adoption.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A 25% bump in individual AI usage correlates with a 1.5% drop in delivery throughput and a 7.2% drop in delivery stability. That's not a hot take — that's &lt;a href="https://dora.dev/research/2024/dora-report/" rel="noopener noreferrer"&gt;the 2024 DORA report&lt;/a&gt;, surveying thousands of engineers across hundreds of orgs.&lt;/p&gt;

&lt;p&gt;Worse: in a randomized controlled trial from &lt;a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/" rel="noopener noreferrer"&gt;METR on experienced open-source developers&lt;/a&gt;, engineers expected AI tools to speed them up by 24%, reported a 20% speedup afterward, and were actually 19% &lt;em&gt;slower&lt;/em&gt;. Perceived uplift and real uplift were moving in opposite directions.&lt;/p&gt;

&lt;p&gt;If you're a platform lead or eng manager rolling an agentic tool out to your own team, those two stats should wake you up. The problem isn't that the tech doesn't work. The problem is that "works in a demo" and "works on Tuesday afternoon when Sarah is three coffees in and needs to ship a hotfix" are different problems, and most internal rollouts optimize for the first.&lt;/p&gt;

&lt;p&gt;I've been running NEXUS — my own agentic system — as a daily driver for months. It handles content drafts, trading decisions on real money in DeFi, financial dashboards across two people's accounts, and a Slack approval workflow fanning out to four publishing platforms. I've made every mistake below at least once. Here's what I've learned about getting an internal agent past the demo cliff.&lt;/p&gt;

&lt;h2&gt;
  
  
  The demo cliff
&lt;/h2&gt;

&lt;p&gt;Every agentic rollout has a moment. The kickoff meeting goes great. Three engineers try it, one ships something impressive, the Slack channel fills with lightning-bolt emojis. Then week three hits, and the channel goes quiet. When you audit usage, maybe 20% of the team is still touching it weekly. The rest tried it once, got burned, and went back to the old way.&lt;/p&gt;

&lt;p&gt;That gap — between first use and fourth use — is where most internal agents die. &lt;a href="https://survey.stackoverflow.co/2024/ai" rel="noopener noreferrer"&gt;Stack Overflow's 2024 developer survey&lt;/a&gt; puts it plainly: 76% of developers are using or planning to use AI tools, but only 43% trust the accuracy of the output. By &lt;a href="https://stackoverflow.blog/2026/02/18/closing-the-developer-ai-trust-gap/" rel="noopener noreferrer"&gt;early 2026&lt;/a&gt;, the trust number had dropped to 29%. Usage is up and to the right; trust is the opposite.&lt;/p&gt;

&lt;p&gt;The people who keep using the agent after week three aren't the people who liked the demo. They're the people who figured out what the agent is actually good at and built a working relationship with its specific failure modes. That's the adoption you want. Here's how to design for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Pilot with skeptics, not enthusiasts
&lt;/h2&gt;

&lt;p&gt;The default instinct is to recruit the AI-curious engineer — the one who already has Cursor, Claude Code, and three Ollama models running locally. That's a mistake. That engineer will adopt anything. Their feedback tells you nothing about your ceiling.&lt;/p&gt;

&lt;p&gt;Give the pilot to the two engineers who rolled their eyes in the kickoff. The ones who said "we tried this in 2023 and it hallucinated three migrations." Those people find every failure mode on day one, and if you can earn their trust, the middle 60% of your team follows.&lt;/p&gt;

&lt;p&gt;The pattern you want to avoid: an AI-guild pilot group reports 98% adoption after six weeks, and org-wide adoption settles at 14% six months later. The pilot group was not the org. It never is. Skeptics are signal, not noise. &lt;a href="https://dora.dev/research/ai/trust-in-ai/" rel="noopener noreferrer"&gt;DORA's trust deep-dive&lt;/a&gt; frames this as an organizational practice — trust calibration lives with the user, not the tool, and it's built through repeated exposure to honest failure, not through marketing.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Prove it on work nobody cares about first
&lt;/h2&gt;

&lt;p&gt;The worst place to debut an agentic system is production incident response. The second-worst is customer-facing code. The best place is cron jobs, cleanup scripts, log triage, PR description drafts, release notes, and the 20 little engineering chores that never make it onto a roadmap.&lt;/p&gt;

&lt;p&gt;I learned this the hard way from my own content pipeline. The first agentic workflow I shipped in NEXUS wasn't "draft my LinkedIn posts." It was "scan 15 subreddits at 7am and summarize what changed." If the summary was bad, I ignored it. If it was good, I skimmed it. There was no blast radius. After three weeks of watching it get better, I trusted it enough to let it draft. Six weeks in, I trusted it enough to auto-publish short-form content with Slack approvals.&lt;/p&gt;

&lt;p&gt;If I'd started at "auto-publish to LinkedIn," the first mistake would have been public and I'd have killed the project. Boring work is a safe harbor for building calibration.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Readable logs beat magic
&lt;/h2&gt;

&lt;p&gt;Anthropic's engineering team wrote &lt;a href="https://www.anthropic.com/research/building-effective-agents" rel="noopener noreferrer"&gt;Building Effective Agents&lt;/a&gt; at the end of 2024, synthesized from dozens of production rollouts. Their three design principles: simplicity, transparency in the agent's planning steps, and investment in the tool interface. Every successful team was doing the boring version on purpose.&lt;/p&gt;

&lt;p&gt;What this means in practice: when the agent does something, an engineer should be able to pull up a single file or dashboard and read what it did, what it saw, and why it made the call. Not a stack trace. Not a wall of JSON. A paragraph a human can skim.&lt;/p&gt;

&lt;p&gt;My DeFi trader posts every decision to Slack with three lines: the market, the fair-value estimate the model produced, and why the position was above or below the spread. When the trader loses money — and it does — I can tell within 10 seconds whether the model was wrong, the fill was bad, or the strategy is off. That 10-second diagnosis is why I still trust it with real capital.&lt;/p&gt;
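&lt;p&gt;As a concrete sketch of what a skimmable decision post can look like — assuming a standard Slack incoming webhook and the &lt;code&gt;requests&lt;/code&gt; library; the field names, numbers, and rationale below are illustrative, not the actual NEXUS implementation:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # hypothetical webhook

def post_trade_decision(market, fair_value, market_price, rationale):
    """Post a three-line, human-skimmable decision summary to Slack."""
    edge = fair_value - market_price
    summary = "\n".join([
        f"Market: {market}",
        f"Fair value: {fair_value:.4f} vs. price {market_price:.4f} (edge {edge:+.4f})",
        f"Why: {rationale}",
    ])
    # One short text payload, not a wall of JSON: the point is a 10-second read.
    requests.post(SLACK_WEBHOOK_URL, json={"text": summary}, timeout=10)

post_trade_decision(
    market="ETH funding-rate position",
    fair_value=0.0123,
    market_price=0.0101,
    rationale="model expects funding to revert within 8h; entry is above the spread",
)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The payload is three lines of prose on purpose: the 10-second diagnosis happens in Slack, not in a log aggregator.&lt;/p&gt;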

&lt;p&gt;If your team has to read code to understand what the agent did, they won't. They'll just stop using it.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Make the override obvious
&lt;/h2&gt;

&lt;p&gt;Every agentic tool needs a visible off-ramp. Not buried in settings — on the main surface. A button that says "ignore this and do it myself," a keybind that kills the agent mid-task, a config flag that puts it in suggest-only mode.&lt;/p&gt;

&lt;p&gt;The override isn't a bailout feature. It's how trust accumulates. When engineers know they can always take the wheel, they're more willing to let the agent drive first. When they can't, they'll never get in the car.&lt;/p&gt;

&lt;p&gt;NEXUS has an approval surface in Slack for every content post. 80% of the time I hit approve. The other 20% I edit or reject, and what was posted, what was edited, and what was rejected all persist in SQLite. I can audit the agent's batting average on demand. That audit trail is the reason I let it draft at all.&lt;/p&gt;
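&lt;p&gt;If you want the same kind of audit trail, a few lines of SQLite are enough to start. This is a rough sketch — the table and column names are invented for illustration, not NEXUS's actual schema:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import sqlite3

conn = sqlite3.connect("agent_audit.db")  # hypothetical database file
conn.execute("""
    CREATE TABLE IF NOT EXISTS approvals (
        id INTEGER PRIMARY KEY,
        posted_at TEXT DEFAULT CURRENT_TIMESTAMP,
        draft TEXT NOT NULL,
        decision TEXT CHECK (decision IN ('approved', 'edited', 'rejected')),
        final_text TEXT
    )
""")

def record_decision(draft, decision, final_text=""):
    """Persist every approve/edit/reject so the agent's batting average is auditable."""
    conn.execute(
        "INSERT INTO approvals (draft, decision, final_text) VALUES (?, ?, ?)",
        (draft, decision, final_text or draft),
    )
    conn.commit()&lt;/code&gt;&lt;/pre&gt;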

&lt;h2&gt;
  
  
  5. Default to boring
&lt;/h2&gt;

&lt;p&gt;The fastest way to kill an internal agent is to make it creative. Engineers don't want the agent to surprise them. They want it to do the obvious thing, predictably, 95% of the time, and escalate the other 5%.&lt;/p&gt;

&lt;p&gt;Pick the default behavior that would make a senior engineer nod. Default to "ask before acting." Default to the lower-risk option when two paths exist. Default to writing to a branch, not main. Default to suggesting a command, not executing it.&lt;/p&gt;
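&lt;p&gt;Those defaults are worth writing down as configuration rather than tribal knowledge. A hypothetical sketch — the flag names are invented, and your risk model will differ:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass(frozen=True)
class AgentDefaults:
    """Boring-by-default policy: every flag starts at the lower-risk setting."""
    require_approval: bool = True         # ask before acting
    suggest_only: bool = True             # print the command, never execute it
    target_branch: str = "agent/drafts"   # write to a branch, never to main
    escalate_on_uncertainty: bool = True  # hand ambiguous calls back to a human

defaults = AgentDefaults()  # widen one flag at a time, deliberately&lt;/code&gt;&lt;/pre&gt;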

&lt;p&gt;You can add more autonomy later, once the team trusts the agent's reflexes. You cannot take autonomy back once someone gets burned.&lt;/p&gt;

&lt;h2&gt;
  
  
  How you know it's working
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://hamel.dev/blog/posts/field-guide/" rel="noopener noreferrer"&gt;Hamel Husain's field guide to improving AI products&lt;/a&gt;, drawn from 30+ production rollouts, makes the same point I keep making to platform teams: generic metrics are useless. BERTScore, ROUGE, cosine similarity — none of it correlates with whether your team actually uses the thing on Tuesday afternoon.&lt;/p&gt;

&lt;p&gt;What correlates: binary pass/fail evals tied to real failure modes you've observed in your own usage, tracked over time, with a human in the loop redefining "pass" as the product evolves. Husain calls this "criteria drift." It's not a one-time eval setup; it's an ongoing practice.&lt;/p&gt;
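&lt;p&gt;In practice that can start as a handful of binary checks rerun on every change. A minimal sketch — the two checks here are placeholders; the real ones come from failures you've actually observed:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Each eval is a binary pass/fail check tied to an observed failure mode.

def no_invented_migrations(output):
    # This workflow should never propose schema changes.
    return "CREATE TABLE" not in output

def cites_a_source(output):
    # Summaries without a link were a recurring (hypothetical) failure mode.
    return "http" in output

EVALS = [no_invented_migrations, cites_a_source]

def pass_rate(outputs):
    """Fraction of checks passed; revisit the checks as 'pass' drifts."""
    results = [check(o) for o in outputs for check in EVALS]
    return sum(results) / max(len(results), 1)&lt;/code&gt;&lt;/pre&gt;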

&lt;p&gt;The leading indicators I watch for any internal agent: daily active use by engineers who aren't on the project team. Weekly active use by skeptics. The ratio of accepted-without-edit to edited-or-rejected over time. Those numbers tell you whether the trust curve is going up or down. Thumbs-up counts and star emojis do not.&lt;/p&gt;
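&lt;p&gt;The accepted-without-edit ratio falls straight out of the approval log — continuing the hypothetical SQLite table from the earlier sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import sqlite3

def acceptance_ratio(conn):
    """Accepted-without-edit vs. edited-or-rejected, from the approvals table."""
    rows = conn.execute(
        "SELECT decision, COUNT(*) FROM approvals GROUP BY decision"
    ).fetchall()
    counts = dict(rows)
    accepted = counts.get("approved", 0)
    touched = counts.get("edited", 0) + counts.get("rejected", 0)
    return accepted / max(accepted + touched, 1)

ratio = acceptance_ratio(sqlite3.connect("agent_audit.db"))  # trend this weekly&lt;/code&gt;&lt;/pre&gt;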

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Internal adoption is the easier half. You control the tools, the data, the engineers, and the feedback loop. Your team shows up every day whether the agent is good or not.&lt;/p&gt;

&lt;p&gt;External adoption — users who paid for a product and can leave the moment the agent confidently does the wrong thing — is a different problem. That's Part 2.&lt;/p&gt;

</description>
      <category>agenticai</category>
      <category>platformengineering</category>
      <category>devrel</category>
      <category>aiadoption</category>
    </item>
  </channel>
</rss>
