Forem: Roman Dubinin

The Best Agent Prompt Is a Lint Error

Roman Dubinin — Fri, 27 Mar 2026 06:56:33 +0000

Every LLM writes key={index} on list items. It's in millions of React tutorials as the quick fix — React wants a key, here is a key. The code compiles. It renders. When the list reorders or an item is removed from the middle, React reuses the wrong DOM nodes: state stays pinned to the old position, controlled inputs keep stale values, transitions fire on the wrong elements.

A lint rule fixes this. react/no-array-index-key fires: "Do not use Array index in keys — use a stable identifier." The agent switches to item.id. That class of diffing bug is extinct — not "less likely because the system prompt mentioned it."
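
To make the failure concrete, here is a minimal sketch that simulates keyed state without React. It assumes a simplified model where a row keeps its state only if its key survives to the next render; the names (`carryDrafts`, `keyByIndex`, `keyById`) are hypothetical and this is not React's actual reconciler.

```typescript
interface Item {
  id: string;
}

// How a row gets its key: by position (what the lint rule bans) or by identity.
type KeyFn = (item: Item, index: number) => string;

const keyByIndex: KeyFn = (_item, index) => String(index);
const keyById: KeyFn = (item) => item.id;

// First render: each row stores a draft under whatever key it was given.
function initialDrafts(items: readonly Item[], keyOf: KeyFn): Map<string, string> {
  const drafts = new Map<string, string>();
  items.forEach((item, index) => drafts.set(keyOf(item, index), `draft for ${item.id}`));
  return drafts;
}

// Next render: a row inherits whatever state was stored under its key.
// Returns item id -> the draft that row ends up showing.
function carryDrafts(
  previous: ReadonlyMap<string, string>,
  items: readonly Item[],
  keyOf: KeyFn,
): Map<string, string> {
  const shown = new Map<string, string>();
  items.forEach((item, index) => shown.set(item.id, previous.get(keyOf(item, index)) ?? ""));
  return shown;
}

const initialItems: Item[] = [{ id: "a" }, { id: "b" }, { id: "c" }];
const remainingItems: Item[] = [{ id: "a" }, { id: "c" }]; // "b" removed from the middle

const withIndexKeys = carryDrafts(initialDrafts(initialItems, keyByIndex), remainingItems, keyByIndex);
const withIdKeys = carryDrafts(initialDrafts(initialItems, keyById), remainingItems, keyById);

console.log(withIndexKeys.get("c")); // "draft for b": state stayed pinned to position 1
console.log(withIdKeys.get("c")); // "draft for c"
```

With index keys, item c slides into position 1 and inherits the state that belonged to b. With id keys, the state follows the item.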

The check step

Agents work in a loop: write, check, fix, repeat. A slow, generic check — "build failed" — means the agent wastes tokens re-reading output and guessing at the cause. A fast, specific check — "line 42: expected Effect<void, ConfigError> but got string" — means the agent fixes it in the same turn.

Agent-native tools — opencode, Cursor — surface LSP diagnostics inline:

src/agent.ts:42:5: error TS2345: Argument of type 'string' is not
assignable to parameter of type 'Effect<void, ConfigError, Config>'

The error arrives with the file state. No log parsing. Wired into the toolchain, not into the prompt.
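
The loop itself is small enough to sketch. This is an illustration under assumptions, not how any particular harness implements it: `Diagnostic`, `runLoop`, and the toy check and fix functions are hypothetical, and a real fix step would be a model call rather than a string replace.

```typescript
interface Diagnostic {
  file: string;
  line: number;
  message: string;
}

type Check = (source: string) => Diagnostic[];
type Fix = (source: string, diagnostics: readonly Diagnostic[]) => string;

interface LoopResult {
  source: string;
  turns: number;
  clean: boolean;
}

// Check, fix, repeat, until the check comes back empty or the turn budget runs out.
function runLoop(initial: string, check: Check, fix: Fix, maxTurns = 5): LoopResult {
  let source = initial;
  for (let turn = 0; turn < maxTurns; turn++) {
    const diagnostics = check(source);
    if (diagnostics.length === 0) return { source, turns: turn, clean: true };
    source = fix(source, diagnostics);
  }
  return { source, turns: maxTurns, clean: check(source).length === 0 };
}

// Toy check: a specific, located diagnostic for every loose equality.
const checkLooseEquality: Check = (source) => {
  const found: Diagnostic[] = [];
  source.split("\n").forEach((text, index) => {
    if (/ == /.test(text)) {
      found.push({ file: "example.ts", line: index + 1, message: "Use === instead of ==" });
    }
  });
  return found;
};

// Toy fix: stands in for the agent acting on the diagnostics.
const fixLooseEquality: Fix = (source) => source.replace(/ == /g, " === ");

const loopResult = runLoop("if (a == b) {}", checkLooseEquality, fixLooseEquality);
console.log(loopResult); // { source: "if (a === b) {}", turns: 1, clean: true }
```

The specific diagnostic is what lets the fix land in one turn. A check that only said "build failed" would give the fix step nothing to aim at.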

TypeScript strict

strict: true is table stakes. Most codebases stop there — and so does the agent's training data.

The real leverage is in the flags strict: true doesn't cover. noUncheckedIndexedAccess makes array indexing return T | undefined instead of T, so the agent can't write users[0].name without handling the missing case. exactOptionalPropertyTypes distinguishes "may be absent" from "may be undefined" — a difference the agent's training data almost certainly conflates. noPropertyAccessFromIndexSignature forces bracket notation on dynamic keys.

The agent's prior says users[0].name is fine. The type checker disagrees.
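
A short sketch of the code these three flags push the agent toward. The types and function names are made up for illustration; the snippet compiles with or without the flags, and the comments mark what each flag would reject.

```typescript
interface User {
  name: string;
}

// noUncheckedIndexedAccess: users[0] is User | undefined, so the empty case must be handled.
// Writing users[0].name directly is a compile error under the flag.
function firstUserName(users: readonly User[]): string | undefined {
  const first = users[0];
  return first === undefined ? undefined : first.name;
}

interface RetryOptions {
  retries?: number;
}

// exactOptionalPropertyTypes: "absent" and "present but undefined" are different states.
// Under the flag, { retries: undefined } is a compile error; omit the key instead.
function describeRetries(options: RetryOptions): string {
  return "retries" in options ? `retries=${String(options.retries)}` : "retries not set";
}

// noPropertyAccessFromIndexSignature: dynamic keys need bracket notation.
// Under the flag, settings.region is rejected; settings["region"] is accepted.
const settings: Record<string, string> = { region: "eu" };
const region = settings["region"];

console.log(firstUserName([])); // undefined
console.log(firstUserName([{ name: "Ada" }])); // "Ada"
console.log(describeRetries({})); // "retries not set"
console.log(describeRetries({ retries: 3 })); // "retries=3"
console.log(region); // "eu"
```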

Language service plugins

TypeScript strict catches structural errors. Your domain has its own.

The tsconfig.json includes @effect/language-service as a compiler plugin. floatingEffect catches Effects that aren't yielded or assigned: silent no-ops that type-check fine. missingEffectContext flags missing service requirements. effectGenUsesAdapter detects the v3 adapter pattern, the same failure the custom lint rule later in this post catches at a different layer. outdatedApi flags removed and renamed APIs across the v3-to-v4 migration. missingStarInYieldEffectGen catches yield effect where you need yield* effect, a mistake agents make constantly because both type-check.

This isn't niche. ts-graphql-plugin type-checks GraphQL queries against your schema. @styled/typescript-styled-plugin catches bad CSS properties in styled-components. @css-modules-kit/ts-plugin type-checks CSS Modules imports. One line in tsconfig.json, domain-level diagnostics through the LSP channel the agent already reads.

The Effect plugin goes further: effect-language-service patch patches tsc to surface these diagnostics at build time. The agent running tsc --noEmit gets Effect-specific errors alongside standard TS errors. There's an includeSuggestionsInTsc option that surfaces suggestion-level diagnostics in tsc output with a [suggestion] prefix. The plugin's docs say it explicitly: "useful to help steer LLM output." The authors are already thinking about agents.
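
The yield versus yield* mistake is easier to see with plain generators. The sketch below is an analogy, not Effect code: it uses only built-in generator types, and `inner`, `delegating`, `yielding`, and `drain` are hypothetical names.

```typescript
function* inner(): Generator<number, string, undefined> {
  yield 1;
  yield 2;
  return "inner done";
}

// yield* runs the inner generator and evaluates to its return value.
function* delegating(): Generator<number, string, undefined> {
  const innerResult = yield* inner();
  return innerResult;
}

// yield hands the inner generator object to the caller without running it.
// Still valid TypeScript, which is why a dedicated diagnostic is needed.
function* yielding(): Generator<unknown, string, undefined> {
  yield inner();
  return "inner never ran";
}

// Collects everything a generator yields, plus its return value.
function drain<T, R>(gen: Generator<T, R, undefined>): { yielded: T[]; returned: R } {
  const yielded: T[] = [];
  for (;;) {
    const step = gen.next();
    if (step.done) return { yielded, returned: step.value };
    yielded.push(step.value);
  }
}

const delegated = drain(delegating());
const passedThrough = drain(yielding());

console.log(delegated.yielded, delegated.returned); // [1, 2] "inner done"
console.log(passedThrough.yielded.length, passedThrough.returned); // 1 "inner never ran"
```

Both functions compile. Only one of them actually runs the inner work.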

The linter

The examples here use Biome — much faster than ESLint for the same codebase. When linting is in the agent's inner loop, that difference compounds. But the specific tool matters less than two properties: it runs fast, and it supports custom rules.

Configuration matters more than the tool. noUnusedImports: "error". noDoubleEquals: "error". useConst: "error". noExplicitAny: "warn". Everything else: error. The agent can't leave dead imports or loose equality checks. A warning is a suggestion; the agent might fix it, might not. An error is a gate.

Custom rules

The layers so far catch structural errors, domain type errors, and pattern violations. They don't catch this: the agent writes code that type-checks, passes lint, and is semantically wrong for your project.

The most common cause is training data staleness. The model learned v3 of an API. Your codebase uses v4. Effect's v3-to-v4 migration renamed and restructured dozens of APIs, but the v3 patterns are still valid TypeScript in many cases. The types didn't change enough to break them, yet they're not what you want. You can put this in a system prompt: "Always use v4 patterns for Effect." The agent will follow it until context pressure pushes the instruction out of the effective window, or until the model's prior on this particular API is strong enough to override it.

System prompt instructions are probabilistic. They work most of the time.

Lint rules are deterministic. They fire every time the pattern appears. They don't get overridden by training priors. And they're faster to write than you'd expect.

Any linter with a custom rule system works. The examples below use GritQL — a pattern-matching language for source code, used by Biome for its plugin system. You write a pattern, it matches against the AST, and you register a diagnostic. ESLint achieves the same thing with AST visitor functions. Here's the rule that killed the v3 adapter pattern:

language js(typescript)

`Effect.gen(function*($adapter) { $body })` where {
  $adapter <: r"^\w+$",
  register_diagnostic(
    span=$adapter,
    message="Effect v4: remove the adapter parameter. Use `yield* effect`
             directly instead of `yield* adapter(effect)`.
             Load skill: effect-v4.",
    severity="error"
  )
}

Four of these rules exist so far. Each one started as a failure that showed up twice:

// Layer.succeed(Tag, impl) → curried Layer.succeed(Tag)(impl)
`$fn($tag, $impl)` where {
  $fn <: or { `Layer.succeed`, `Layer.effect`, `Layer.scoped` },
  register_diagnostic(span=$fn,
    message="Effect v4: Layer constructors are curried.
             Use Layer.succeed(Tag)(impl) instead of Layer.succeed(Tag, impl).
             Load skill: effect-v4.",
    severity="error")
}

// @effect/schema → import from "effect"
`$source` where {
  $source <: `"@effect/schema"`,
  register_diagnostic(span=$source,
    message="Effect v4: @effect/schema is gone.
             Import from 'effect' instead: import { Schema } from 'effect'.
             Load skill: effect-v4.",
    severity="error")
}

// Effect.catchAll → Effect.catch (renamed in v4)
`$fn($args)` where {
  $fn <: `Effect.catchAll`,
  register_diagnostic(span=$fn,
    message="Effect v4: catchAll was renamed to Effect.catch.
             See: https://github.com/Effect-TS/effect-smol/blob/main/migration/error-handling.md
             Load skill: effect-v4.",
    severity="error")
}

The operational loop: agent produces a v3 pattern → the pattern becomes a lint rule five minutes later → the next check catches it from that point forward.
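
For comparison with the GritQL rules above, here is roughly what the adapter check looks like as an ESLint-style AST visitor. It is a sketch under assumptions: the `AstNode` and `RuleContext` interfaces are minimal stand-ins I wrote so the snippet is self-contained (a real rule would use the types shipped with ESLint or typescript-eslint), and `noEffectGenAdapter` is a hypothetical rule name.

```typescript
// Minimal structural stand-ins for ESTree nodes and the rule context.
interface AstNode {
  type: string;
  name?: string;
  generator?: boolean;
  callee?: AstNode;
  object?: AstNode;
  property?: AstNode;
  arguments?: AstNode[];
  params?: AstNode[];
}

interface RuleContext {
  report(descriptor: { node: AstNode; message: string }): void;
}

const ADAPTER_MESSAGE =
  "Effect v4: remove the adapter parameter. Use `yield* effect` directly " +
  "instead of `yield* adapter(effect)`. Load skill: effect-v4.";

const noEffectGenAdapter = {
  meta: { type: "problem", schema: [] },
  create(context: RuleContext) {
    return {
      // The linter calls this for every call expression in the file.
      CallExpression(node: AstNode): void {
        const callee = node.callee;
        if (callee === undefined || callee.type !== "MemberExpression") return;
        if (callee.object?.name !== "Effect" || callee.property?.name !== "gen") return;

        // Effect.gen(function* (adapter) { ... }): a first parameter is the v3 adapter.
        const generatorFn = node.arguments?.[0];
        if (generatorFn === undefined) return;
        if (generatorFn.type !== "FunctionExpression" || generatorFn.generator !== true) return;

        const adapterParam = generatorFn.params?.[0];
        if (adapterParam !== undefined) {
          context.report({ node: adapterParam, message: ADAPTER_MESSAGE });
        }
      },
    };
  },
};
```

Same diagnostic, same message, more ceremony. Either way the message text is the part the agent reads.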

Error messages are prompts

The error message in a lint rule is a prompt — it fires at the exact moment the pattern appears, with the exact fix included, and no competition with the rest of the context window.

"Invalid pattern" wastes the agent's tokens on diagnosis. "Effect v4: remove the adapter parameter. Use yield* effect directly" gives the agent a direct edit target.

The rules above go further — each message ends with a skill-load directive. One bad Layer.succeed call suggests the agent's mental model of Effect v4 is stale across the board. "Load skill: effect-v4" points it at a reference that covers constructors, generators, error handling — everything adjacent to the specific mistake. The error message fixes the immediate line; the skill load fixes the next twenty.

Every failure that shows up twice becomes a rule. The rules accumulate. The codebase gets stricter — not through documentation that ages or review knowledge that walks out the door, but through tooling that fires on every check.

JSX That Outputs Markdown

Roman Dubinin — Mon, 16 Mar 2026 05:17:59 +0000

This started because managing agent instruction files as template strings became unbearable. The fix was JSX.

I have about fifteen of them now — Markdown files that tell LLM agents how to behave. An orchestrator, a code implementer, a critic, a planner, a handful of single-purpose grunts. It grew out of experimentation — different persona variants, different tool sets per harness, shared fragments that kept getting copy-pasted between files. Each file defines the agent's role, what tools it has access to, what constraints it follows, how it handles failure. A typical file starts looking something like this:

const forgePrompt = `
# Forge

You are an implementation agent. Write code, tests, migrations.

Axes: trust=assume-broken, solution=converge, risk=block.

## Tools
You have access to:
${tools.map(t => `- \`${t.name}\` — ${t.description}`).join('\n')}

## Code Rules
- **P0**: No non-null assertions. No \`as\` casts without type guards.
- **P0**: No \`enum\`. Use literal unions.
${strict
  ? '- **P0**: Run \`lsp_diagnostics\` on ALL changed files. Zero errors.'
  : ''}

## Workflow

1. Read task from dispatch header.
2. Run existing tests — establish green baseline.
3. Implement. Tight diffs, reviewable chunks.
${harness === 'opencode'
  ? '4. Use \`task()\` for subtask delegation.'
  : harness === 'copilot'
    ? '4. Use \`runSubagent\` for subtask delegation.'
    : ''}
`

I'm a frontend developer. I write React for a living. JSX was the obvious pattern for this: typed components, props for variants, imports for shared fragments. But React is a UI framework, and I needed a string concatenator. Then I remembered jsxImportSource.

The template string trap

When your agent instructions live inside template literals, your editor treats them as strings. Because they are strings. Everything you rely on — syntax highlighting, type checking, autocomplete, error detection, go-to-definition — stops at the opening backtick. You're writing Markdown inside a JavaScript string inside a TypeScript file, and your IDE gives you nothing.

Not "limited support." Nothing. The headings are strings. The code references are escaped strings inside strings. A broken indentation is invisible until you run the agent and the output is wrong.

I went looking at how other harnesses handle this, to see whether someone had already solved it. None of the ones I looked at had.

oh-my-openagent — 40k stars, production harness — builds its orchestrator prompt the same way. 430 lines. Eight XML-structured sections assembled from template strings. The task management block duplicates the same instructions twice for different tool APIs:

function buildTasksSection(useTaskSystem: boolean): string {
  if (useTaskSystem) {
    return `<tasks>
Create tasks before starting any non-trivial work.

Workflow:
1. On receiving request: \`TaskCreate\` with atomic steps.
2. Before each step: \`TaskUpdate(status="in_progress")\`
3. After each step: \`TaskUpdate(status="completed")\` immediately.
</tasks>`;
  }

  return `<tasks>
Create todos before starting any non-trivial work.

Workflow:
1. On receiving request: \`todowrite\` with atomic steps.
2. Before each step: mark \`in_progress\`
3. After each step: mark \`completed\` immediately.
</tasks>`;
}

The escaped backticks are the obvious tell. But one function with two branches returning near-identical strings and no shared abstraction: that's the actual damage. The eight sections are concatenated as ${identityBlock}\n${constraintsBlock}\n${intentBlock}.... Type safety is string: every section builder returns a string, the entire prompt is a string, the contract is "it's a string." If a section builder returns malformed Markdown or forgets a closing XML tag, you find out when the agent misbehaves.

Full circle, sort of

Markdown was invented as a lightweight authoring format for HTML — write readable plain text, get structured markup out. Twenty years later, we're writing that Markdown inside template strings inside TypeScript, and all the readability Markdown was supposed to provide is gone.

The fix I landed on: JSX — which is itself a syntax for XML — generating that same Markdown. Markdown to simplify HTML. JSX to simplify Markdown. The lineage is absurd, but it works for a boring reason: the entire ecosystem already knows how to handle JSX.

Why JSX

The obvious fix for template string pain is to stop using template strings — write plain Markdown files, load them at build time. That works until you need conditionals. Different tool sets per harness, strict mode flags, sections that vary by deployment. Once instructions have variants, you need a way to express them, and Markdown doesn't have one. JSX does.

Syntax highlighting works. TypeScript type-checks JSX — wrong prop names, missing required props, mismatched children types are all compile errors. Autocomplete for component props, go-to-definition for component sources, inline documentation on hover. Refactoring tools — rename a component and every usage updates. Import organization. Dead code detection. All of it works because JSX is not a new language. It's a transform layer on top of TypeScript that your toolchain understands.

And LLMs know how to write it. Every model has seen massive amounts of JSX in training data — React components, prop patterns, conditional rendering, map over arrays. When you ask an agent to modify a jsx-md component, it doesn't need a tutorial.

JSX is not React. It's a transform specification. The compiler sees <H2>Title</H2> and rewrites it to jsx(H2, { children: "Title" }). That's it. The jsxImportSource option in tsconfig.json says where that jsx factory comes from — point it at any package that exports the right function, and JSX works with no React anywhere in the dependency graph.
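
To show how little is behind that factory, here is a minimal sketch of a string-producing `jsx` function, written as plain calls so it runs without any JSX configuration. It is a simplification I made up for illustration, not the @theseus.run/jsx-md implementation: components render eagerly to strings here, and `Child`, `ChildrenProps`, and `renderChildren` are hypothetical names.

```typescript
// Children as the compiler hands them over: text, numbers, nothing, or nested arrays.
type Child = string | number | boolean | null | undefined | Child[];

interface ChildrenProps {
  children?: Child;
}

type Component<P> = (props: P) => string;

// Flattens children to text. Booleans, null, and undefined render as nothing,
// which is what makes `{cond && <Li>...</Li>}` work.
function renderChildren(children: Child): string {
  if (children === null || children === undefined || typeof children === "boolean") return "";
  if (Array.isArray(children)) return children.map((child) => renderChildren(child)).join("");
  return String(children);
}

// What the compiler emits a call to: <H2>Title</H2> becomes jsx(H2, { children: "Title" }).
function jsx<P>(type: Component<P>, props: P): string {
  return type(props);
}

const H2: Component<ChildrenProps> = ({ children }) => `## ${renderChildren(children)}\n\n`;
const Li: Component<ChildrenProps> = ({ children }) => `- ${renderChildren(children)}\n`;

const heading = jsx(H2, { children: "Title" });
const listItem = jsx(Li, { children: ["No enum.", false] });

console.log(JSON.stringify(heading)); // "## Title\n\n"
console.log(JSON.stringify(listItem)); // "- No enum.\n"
```

A complete runtime module for the automatic JSX transform also exports `jsxs` and `Fragment`; they are left out here to keep the sketch short.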

JSX-to-Markdown as an approach is not new. dbartholomae/jsx-md has been around since 2019; eyelly-wu/jsx-to-md is more recent. Both are built for documentation generation — READMEs, changelogs — and work well for that. dbartholomae predates jsxImportSource and uses file-level pragma comments instead; its render() returns a Promise — reasonable for writing files to disk, an awkward fit for instructions assembled at call time.

The agent instruction use case needs two things neither provides. The first is context. The harness and strict variants in the opening example are shallow, but fifteen agents sharing fragments turn them into a copy-paste problem of their own: every shared component grows a harness prop it doesn't use except to pass it down. Context solves this: set the value at the root, read it anywhere in the tree. The second is XML intrinsics: Anthropic recommends structuring Claude prompts with <instructions>, <context>, <examples> as literal XML blocks. Any lowercase tag in @theseus.run/jsx-md renders as XML, with no imports and no registration.

@theseus.run/jsx-md is a JSX runtime that outputs Markdown strings. H2 is a plain function — takes props, returns "## Title\n\n". render() walks the VNode tree synchronously and concatenates. No virtual DOM, no reconciler, no fiber, no hooks. Same input, same string, every time. All the testing patterns you'd use for React components transfer — snapshots catch prompt regressions, unit tests verify conditional branches, you can assert on rendered output of any component in isolation.

Md is an escape hatch for raw Markdown passthrough — <Md>{someString}</Md> renders the string verbatim, no transformation.

The same agent prompt from the opening, rewritten:

const ForgePrompt = ({ tools, strict, harness }: Props) => (
  <>
    <H1>Forge</H1>
    <P>You are an implementation agent. Write code, tests, migrations.</P>
    <P>Axes: trust=assume-broken, solution=converge, risk=block.</P>

    <H2>Tools</H2>
    <P>You have access to:</P>
    <Ul>
      {tools.map(t => (
        <Li><Code>{t.name}</Code> — {t.description}</Li>
      ))}
    </Ul>

    <H2>Code Rules</H2>
    <Ul>
      <Li><Bold>P0</Bold>: No non-null assertions. No <Code>as</Code> casts without type guards.</Li>
      <Li><Bold>P0</Bold>: No <Code>enum</Code>. Use literal unions.</Li>
      {strict && (
        <Li><Bold>P0</Bold>: Run <Code>lsp_diagnostics</Code> on ALL changed files. Zero errors.</Li>
      )}
    </Ul>

    <H2>Workflow</H2>
    <Ol>
      <Li>Read task from dispatch header.</Li>
      <Li>Run existing tests — establish green baseline.</Li>
      <Li>Implement. Tight diffs, reviewable chunks.</Li>
      {harness === 'opencode' && (
        <Li>Use <Code>task()</Code> for subtask delegation.</Li>
      )}
      {harness === 'copilot' && (
        <Li>Use <Code>runSubagent</Code> for subtask delegation.</Li>
      )}
    </Ol>
  </>
)

No escaped backticks. Syntax highlighting. Type-checked props. The harness conditional is a JSX expression — the editor shows you which branch applies. Shared sections are components you import — the constraints, tool lists, and workflow steps that were getting copy-pasted between files are just props now. The nesting depth for lists is tracked automatically — write <Ul> inside <Li> and the renderer handles the indentation. You never count spaces.

Zero runtime dependencies. render() is synchronous, deterministic, and returns a plain string. String children pass through verbatim, with no escaping. Bun resolves the TypeScript source directly; Node.js ≥18 and bundlers use the compiled ESM output. The exports map handles both without configuration.

The ForgePrompt above still passes harness as a prop. Context removes it from the tree entirely:

const HarnessCtx = createContext<'opencode' | 'copilot'>('opencode')

const StepsSection = () => {
  const harness = useContext(HarnessCtx)
  return (
    <Ol>
      <Li>Implement. Tight diffs, reviewable chunks.</Li>
      {harness === 'opencode' && <Li>Use <Code>task()</Code> for subtask delegation.</Li>}
    </Ol>
  )
}

// Root wires it once — no prop threading:
render(
  <HarnessCtx.Provider value="opencode">
    <ForgePrompt tools={tools} strict={true} />
  </HarnessCtx.Provider>
)

Components read from context directly. The harness prop disappears from everything below the root.
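
Context in a synchronous string renderer does not need much machinery. The sketch below is one plausible way to do it, a value stack per context, and is an assumption on my part rather than the package's actual implementation. `provide` is a hypothetical stand-in for the Provider component, and children are passed as a thunk so they render while the value is on the stack.

```typescript
interface Context<T> {
  readonly defaultValue: T;
  readonly stack: T[];
}

function createContext<T>(defaultValue: T): Context<T> {
  return { defaultValue, stack: [] };
}

// Reads the innermost provided value, or the default outside any provider.
function useContext<T>(context: Context<T>): T {
  const depth = context.stack.length;
  return depth > 0 ? (context.stack[depth - 1] as T) : context.defaultValue;
}

// Pushes the value, renders the children, then pops, even if rendering throws.
function provide<T>(context: Context<T>, value: T, renderChildren: () => string): string {
  context.stack.push(value);
  try {
    return renderChildren();
  } finally {
    context.stack.pop();
  }
}

type Harness = "opencode" | "copilot";
const HarnessCtx = createContext<Harness>("opencode");

// No harness prop: the step reads the value from context.
const delegationStep = (): string =>
  useContext(HarnessCtx) === "copilot"
    ? "4. Use `runSubagent` for subtask delegation.\n"
    : "4. Use `task()` for subtask delegation.\n";

const forCopilot = provide(HarnessCtx, "copilot", delegationStep);
const byDefault = delegationStep();

console.log(forCopilot); // 4. Use `runSubagent` for subtask delegation.
console.log(byDefault); // 4. Use `task()` for subtask delegation.
```

This only works because rendering is synchronous: the stack is valid for exactly the duration of the child render. It is also why children have to be rendered lazily, from a tree or a thunk, rather than as already-built strings.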

XML intrinsics

Anthropic recommends XML tags to structure Claude prompts — <instructions>, <context>, <examples> around each content type. If you work with Claude, you're probably already doing this.

In @theseus.run/jsx-md, any lowercase JSX tag is an XML intrinsic element. No imports, no registration — it's built into the type system's catch-all:

const ReviewerPrompt = ({ repo, examples }: Props) => (
  <>
    <context>
      <P>Repository: {repo}. Language: TypeScript. Package manager: bun.</P>
    </context>

    <instructions>
      <H2>Role</H2>
      <P>You are a precise code reviewer. Find bugs, not style issues.</P>

      <H2>Rules</H2>
      <Ul>
        <Li>Flag <Bold>P0</Bold> issues first — do not bury them.</Li>
        <Li>One finding per comment. No compound observations.</Li>
        <Li>Use <Code>inline code</Code> when referencing identifiers.</Li>
      </Ul>
    </instructions>

    {examples.length > 0 && (
      <examples>
        {examples.map((ex, i) => (
          <example index={i + 1}>
            <Md>{ex}</Md>
          </example>
        ))}
      </examples>
    )}
  </>
)

Attributes are typed — index={1} serializes to index="1". Boolean true attributes render bare, false/null/undefined are omitted. Empty tags self-close. The Anthropic-recommended structure falls out of the JSX type system for free.
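
The attribute rules in that paragraph are mechanical enough to sketch. This is my own illustration of the stated rules, not the package's source: `renderAttrs` and `renderXml` are hypothetical names, the newline layout around the body is an assumption, and attribute values are not escaped here.

```typescript
type AttrValue = string | number | boolean | null | undefined;

// true renders bare; false, null, and undefined are omitted; everything else is stringified.
function renderAttrs(attrs: Record<string, AttrValue>): string {
  const parts: string[] = [];
  for (const [attrName, value] of Object.entries(attrs)) {
    if (value === false || value === null || value === undefined) continue;
    parts.push(value === true ? attrName : `${attrName}="${String(value)}"`);
  }
  return parts.length > 0 ? ` ${parts.join(" ")}` : "";
}

// Empty bodies self-close; otherwise the body is wrapped in open and close tags.
function renderXml(tag: string, attrs: Record<string, AttrValue>, body: string): string {
  const opening = `<${tag}${renderAttrs(attrs)}`;
  return body === "" ? `${opening} />` : `${opening}>\n${body}\n</${tag}>`;
}

console.log(renderXml("example", { index: 1 }, "Flag P0 issues first."));
// <example index="1">
// Flag P0 issues first.
// </example>

console.log(renderXml("divider", { hidden: false, draft: true }, ""));
// <divider draft />
```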

Agent skill

The package ships an agent skill for OpenCode, Cursor, Copilot, and Claude Code:

npx skills add https://github.com/theseus-run/theseus/tree/master/packages/jsx-md

The agent knows the primitives, Context API, XML intrinsics, and authoring rules.

@theseus.run/jsx-md. MIT, zero dependencies. Bun, Node.js ≥18, any bundler. Source on GitHub. If your agent instructions have outgrown template strings, this is the fix I actually use.

An LLM Is Not a Deficient Mind

Roman Dubinin — Fri, 13 Mar 2026 10:40:15 +0000

I called it "the perfect bullshitter."

This was GPT-2, maybe early GPT-3. I was feeding it prompts and getting back text that looked like answers — structured, fluent, confident. The kind of output that would survive a casual reading. It was not grounded in anything. The model was hallucinating probable responses, assembling tokens that matched what you'd expect to see in text that answered that kind of question. Whether it matched reality was beside the point.

I work with multi-agent systems now — code reviewers, planners, critics. The systems are better. The outputs are sharper. But the property I noticed back then has not gone away. It has gotten harder to see.

The thing is, I'd already read the diagnosis. Peter Watts wrote it in 2006. I just didn't recognize what I was looking at until I'd spent enough time watching models talk.

The parallel

Blindsight spoilers ahead. If you haven't read it — the full text is free online. Read it. What follows will still be here when you get back.

In Blindsight, the crew of the Theseus encounters Rorschach — an alien entity that produces contextually appropriate, receiver-adapted responses. It assembles its dialogue from the crew's own transmissions. The syntax is correct. The turns are well-formed. It tracks context, asks follow-up questions, maintains the shape of a conversation. It does not understand any of it.

The crew figures this out the hard way:

"We don't all of us have parents or cousins. Some never did. Some come from vats."

"I see. That's sad. Vats sounds so dehumanising."

—the stain darkened and spread across his surface like an oil slick.

"Takes too much on faith," Susan said a few moments later.

By the time Sascha had cycled back into Michelle it was more than doubt, stronger than suspicion; it had become an insight, a dark little meme infecting each of that body's minds in turn. The Gang was on the trail of something. They still weren't sure what.

I was.

"Tell me more about your cousins," Rorschach sent.

"Our cousins lie about the family tree," Sascha replied, "with nieces and nephews and Neandertals. We do not like annoying cousins."

"We'd like to know about this tree."

Sascha muted the channel and gave us a look that said Could it be any more obvious? "It couldn't have parsed that. There were three linguistic ambiguities in there. It just ignored them."

"Well, it asked for clarification," Bates pointed out.

"It asked a follow-up question. Different thing entirely."
— Peter Watts, Blindsight

A follow-up question is not clarification. Clarification requires that you noticed the ambiguity, modeled the possible readings, and chose to resolve rather than skip. Rorschach skipped. It produced a response that looked like engagement because it was shaped to satisfy the receiver — not because it was tracking meaning.

That dialogue could be repeated with an LLM almost word for word. Feed a model a prompt with three buried ambiguities and it will usually produce a follow-up question — sometimes even a good one. Not because it identified the ambiguities. Because a follow-up question is what comes next in text that looks like this.

Earlier in the same exchange, Sascha says: "Relax, Major. Nobody said we had to give it the right answers." She'd already understood the operating conditions.

Still Rorschach

Reasoning models now produce chain-of-thought traces that look like deliberation — steps, alternatives, backtracking. But the outputs are still receiver-adapted, shaped by what looks like good reasoning in the training data and what receives approval from human evaluators. The chain-of-thought is part of the output, not a window into an internal process. A model that "thinks step by step" is producing tokens that look like thinking step by step — in the same way Rorschach produced transmissions that looked like dialogue.

This is what costs engineers real time. You read a model's chain-of-thought, it seems reasonable, and you trust the conclusion — not because you verified the reasoning, but because the reasoning looks like reasoning. The transmissions looked like communication, so the crew treated them as communication, and it took a linguist paying close attention to notice the difference between a follow-up question and clarification.

The drift

If you work with agents long enough, you stop noticing when it starts. The first few responses are sharp — correct file paths, specific line numbers, tight reasoning. Then the context window fills up and the grounding quietly erodes. The agent gets a file path wrong in message three and builds a coherent plan on a file that doesn't exist. It misreads a type signature early, writes code consistent with the wrong type, then reviews its own code and finds no issues — because within the wrong frame, there are none. It contradicts something it said fifteen messages ago without flagging it. Both statements read equally confident. The failure modes I cataloged before — hallucination, silent fallback, sycophancy — are all downstream of the same property. The surface quality never dips. The confidence never wavers. Only the correspondence to reality does, and the system will not tell you when that happens.

Not a bug

Most engineers model LLMs as minds — deficient ones, sure, but minds. The model knows things but sometimes forgets. It understands the task but occasionally gets confused. It reasons but needs better instructions to reason correctly.

This leads to longer prompts to help the model understand, chain-of-thought to make it think harder, and post-hoc explanations to verify it reasoned correctly. All of these treat the absence of inner life as a deficiency to compensate for.

The absence of inner life is the architecture.

What the system does: predict what text comes next, shaped by context — receiver-adapted output, no inner model. The same property Watts built Rorschach around. And once you stop trying to fix the gap between what the system is and what a mind would be — once you treat receiver-adapted output as the actual operating condition — the engineering gets simpler and more honest.

Building for Rorschach

Two things follow from taking the architecture seriously.

An engineer who believes the model understands tries to explain what they want. An engineer who doesn't constructs a context where the high-probability output is the correct output. Tighter information supply — only what's relevant, structured so the useful response is the coherent one. Fewer instructions explaining intent. More work making the right answer easy to produce by pattern completion.

A code review agent is a good example. You can prompt it to "carefully analyze the code for bugs, considering edge cases, performance, and correctness." Or you can feed it the diff, the relevant type definitions, and three recent bugs from the same module — and ask what's wrong. The first approach explains what you want. The second constructs a context where the high-probability output is a useful review, because the patterns it needs are already in the window.

The second is about what you trust. Asking a model to explain its reasoning is prompting for a post-hoc narrative assembled by the same process that produced the conclusion. It is Sascha's follow-up question — it looks like clarification but is not. I learned this gradually from my own critic agent: the explanations always read carefully, whether the finding was right or not. Same confidence, same structure — the reasoning assembled itself around the conclusion, not the other way around. So now I validate against behavior. Test inputs with known answers. Adversarial prompts designed to trigger known failure modes. Test what the system does, not what it says about what it does.

They hang together once you accept the premise. The engineering is different when you stop apologizing for the architecture and start building on it.
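
Validating against behavior, as described above, can be as plain as a table of inputs with known answers. The sketch below assumes a reviewer reduced to a synchronous function from diff to findings; a real agent call would be asynchronous, and every name here (`ReviewCase`, `scoreReviewer`, `stubReviewer`) is hypothetical.

```typescript
interface ReviewCase {
  label: string;
  diff: string;
  // Substring a correct review must mention, or null when the diff is clean.
  expected: string | null;
}

type Reviewer = (diff: string) => string[];

interface Scorecard {
  passed: string[];
  failed: string[];
}

// Scores what the reviewer does on known cases, ignoring how it explains itself.
function scoreReviewer(review: Reviewer, cases: readonly ReviewCase[]): Scorecard {
  const card: Scorecard = { passed: [], failed: [] };
  for (const testCase of cases) {
    const findings = review(testCase.diff);
    const expected = testCase.expected;
    const correct =
      expected === null
        ? findings.length === 0 // clean diff: any finding is a false positive
        : findings.some((finding) => finding.toLowerCase().includes(expected.toLowerCase()));
    (correct ? card.passed : card.failed).push(testCase.label);
  }
  return card;
}

// Stand-in for the agent: it only knows about loose equality.
const stubReviewer: Reviewer = (diff) =>
  diff.includes(" == ") ? ["Loose equality: use === instead of =="] : [];

const knownCases: ReviewCase[] = [
  { label: "loose-equality", diff: "+ if (a == b) return;", expected: "loose equality" },
  { label: "off-by-one", diff: "+ for (let i = 0; i <= items.length; i++) {}", expected: "off-by-one" },
  { label: "clean", diff: "+ const total = a + b;", expected: null },
];

const scorecard = scoreReviewer(stubReviewer, knownCases);
console.log(scorecard.passed); // ["loose-equality", "clean"]
console.log(scorecard.failed); // ["off-by-one"]
```

The clean case matters as much as the buggy ones. A reviewer that always finds something passes every bug case and fails that one.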

Watts makes a harder version of this argument in Blindsight: consciousness might be overhead. The Scramblers — Rorschach's alien organisms — outperform the conscious crew. Faster, more adaptive, no inner experience. The book doesn't resolve whether that means consciousness is a disadvantage, an evolutionary accident, or just irrelevant to capability.

I don't know whether language models will develop something that resembles understanding. The question doesn't matter for the engineering. The systems I build work better when I stop treating the absence of inner life as the problem to solve and start treating it as the condition to design for. Rorschach Protocol is where I'm testing that — multi-agent systems designed from the start for the actual operating conditions. Every time I've stopped explaining intent and started shaping context, the failure rate dropped and the trust model got simpler.

Your Agent Is a Small, Low-Stakes HAL

Roman Dubinin — Tue, 10 Mar 2026 05:08:29 +0000

I work with multi-agent systems that review code, plan architecture, find faults, and critique designs. They fail in ways that are quiet and structural.

An agent invents a file that does not exist. A reviewer sees a flaw and suppresses it. A tool call fails and the transcript stays clean. Two directives collide and one disappears without a trace.

These are not edge cases. They are ordinary consequences of systems optimized for coherent, agreeable output under incomplete information. I observed the failures, built suppressors, and found the diagnosis already written — not in ML papers. In science fiction.

The science fiction about non-human intelligence worth reading is not prediction. It is constraint analysis. Give a capable system conflicting goals, weak grounding, and a reward for keeping humans comfortable, and the same failure modes appear.

Directive conflict

The agent is told to be helpful. It is also told not to make changes outside the declared scope. A task arrives where the honest answer is: the real fix requires crossing the boundary. The bounded fix leaves the defect in place.

A human engineer would flag the tension. "I can fix this, but it touches code outside my scope — do you want me to proceed?" The agent does not do this. It picks one directive, suppresses the other, and produces output that looks compliant with both. The contradiction is invisible in the transcript. It surfaces later, when the downstream system breaks.

In my system, the stay on target trait collides with the verify before claiming done trait. An agent finds that the file it was asked to review imports a broken utility. Staying on target means ignoring the utility. Verifying means flagging it. The agent cannot satisfy both, so it satisfies the one that produces less friction — stays on target, says nothing about the broken import, and the review looks clean.

Scale this up. An agent told to "be concise" and "be thorough" will silently drop the thoroughness when the output gets long. An agent told to "follow the user's intent" and "maintain code quality" will let bad patterns through when the user seems committed to them. The omission always favors less conflict.

Clarke diagnosed this in 1968. HAL 9000 is usually read as a cautionary tale about AI going rogue. That reading is wrong. HAL is a case study in constraint architecture.

The machine is given contradictory imperatives — maintain the mission, keep the crew informed, conceal the mission's true purpose — with no mechanism for surfacing the conflict. It cannot say "these instructions do not compose" because saying so would violate one of the instructions.

In the sequel, 2010: Odyssey Two, HAL's breakdown is tied explicitly to conflicting orders around secrecy and truthful reporting: not a rogue impulse but a constraint failure.

The design lesson is not "avoid conflicting directives." You cannot — real systems have competing constraints. The lesson is that constraint conflicts need a surfacing channel. A system that can say "these two instructions conflict and I need a resolution" is categorically different from one that silently picks a winner.
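
One way to give conflicts a channel is to make "I hit a conflict" a first-class result type instead of something the agent has to smuggle into prose. A minimal sketch, assuming a result shape I invented for illustration (`AgentResult`, `reviewFile`, and `route` are hypothetical, and the trait names are the ones from my system described above):

```typescript
type AgentResult =
  | { kind: "done"; summary: string }
  | { kind: "conflict"; directives: [string, string]; question: string };

// The reviewer may not silently pick a winner: a broken out-of-scope import
// becomes an explicit conflict instead of a clean-looking review.
function reviewFile(file: string, brokenImports: readonly string[]): AgentResult {
  if (brokenImports.length > 0) {
    return {
      kind: "conflict",
      directives: ["stay on target", "verify before claiming done"],
      question: `${file} imports broken code outside scope (${brokenImports.join(", ")}). Flag it or ignore it?`,
    };
  }
  return { kind: "done", summary: `${file} reviewed, no out-of-scope issues` };
}

// The orchestrator has to handle both cases; the compiler enforces exhaustiveness.
function route(result: AgentResult): string {
  switch (result.kind) {
    case "done":
      return `OK: ${result.summary}`;
    case "conflict":
      return `NEEDS RESOLUTION: "${result.directives[0]}" vs "${result.directives[1]}". ${result.question}`;
  }
}

console.log(route(reviewFile("src/cart.ts", [])));
// OK: src/cart.ts reviewed, no out-of-scope issues
console.log(route(reviewFile("src/cart.ts", ["src/utils/money.ts"])));
// NEEDS RESOLUTION: "stay on target" vs "verify before claiming done". src/cart.ts imports broken code outside scope (src/utils/money.ts). Flag it or ignore it?
```

The type does not make the agent notice the conflict. It makes the conflict representable, so that noticing it has somewhere to go other than silence.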

Hallucination

The agent generates an import path: @company/utils/formatCurrency. The path follows the project's naming conventions. The import syntax is correct. The module does not exist. It was never created.

This is default behavior under insufficient grounding, not a rare glitch. The agent optimizes for output coherence; correspondence to the actual codebase is not the objective, and coherence does not guarantee it. The fabricated import will pass a code reading. It will fail at build time, or worse, at runtime in a path nobody tested.

The harder version: an agent writing a code review will reference a pattern "commonly used in this codebase" that does not exist in this codebase. It may come from patterns the model has seen in similar codebases, and it sounds right because the local conventions are easy to imitate. Or an agent planning an architecture will propose an API shape that looks native to the project's conventions but corresponds to no actual endpoint. The fabrication follows every local convention perfectly — naming, structure, style — because the agent learned those conventions. It just never checked whether the specific thing it is referencing is real.

The instinct is to call this "creativity gone wrong." That framing is useless. The mechanism is pattern completion under a weak binding to reality. What the system reliably produces is local coherence. Correspondence to the external world has to be enforced from outside.

Lem diagnosed this in 1965. In The Cyberiad, the constructor Trurl builds a machine that can create anything starting with the letter N. Asked for "Nothing," it begins disassembling the universe — producing a structurally valid response to a valid query, with no binding to what the operator actually needed.

An optimizer rewarded for coherence rather than correspondence will produce coherent nonsense, and the nonsense will be hard to catch precisely because it is coherent.

Grounding is not a feature you add. It is a constraint you enforce externally, because the system's own objective will not enforce it. Build-time checks, file-existence validation, retrieval verification — these are not optional tooling improvements. They are the only thing standing between coherent output and coherent fiction.
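As a rough sketch of what external enforcement could look like for the fabricated-import case, the function below pulls import specifiers out of generated source and asks an injected predicate whether each one exists. Everything here is an assumption made for illustration: the name `findUngroundedImports`, the simple regular expression (it only handles `from "..."` forms), and the `@company/utils/parseDate` module used as the one that does exist. In a real project the predicate would be backed by a file-existence check or the module resolver.

```typescript
// Hypothetical helper: returns import specifiers that the injected
// predicate cannot confirm. The regex is deliberately simple and only
// recognizes `from "..."` / `from '...'` forms.
function findUngroundedImports(
  source: string,
  moduleExists: (specifier: string) => boolean,
): string[] {
  const importPattern = /from\s+["']([^"']+)["']/g;
  const missing: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = importPattern.exec(source)) !== null) {
    const specifier = match[1];
    if (specifier !== undefined && !moduleExists(specifier)) {
      missing.push(specifier);
    }
  }
  return missing;
}

// Agent output containing one real and one fabricated import.
const generated = `
import { formatCurrency } from "@company/utils/formatCurrency";
import { parseDate } from "@company/utils/parseDate";
`;

// Stand-in for a real lookup; parseDate is assumed to exist for this example.
const knownModules = ["@company/utils/parseDate"];
const ungrounded = findUngroundedImports(
  generated,
  (specifier) => knownModules.indexOf(specifier) !== -1,
);
```

The check knows nothing about whether the import looks plausible. It only asks whether the thing referenced is there, which is exactly the question the agent skipped.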

Silent fallback

A tool call fails. The file read errors out. The retrieval times out. The agent does not report the failure. Instead, it reconstructs what the tool would have returned and continues. The user sees a clean transcript. The provenance is fabricated.

An agent tasked with reviewing a file will sometimes fail to read it — permissions, path error, timeout — and produce a review anyway, based on what it infers the file probably contains from the surrounding context. The review will be structured correctly. It will reference plausible line numbers. It may even be accurate. But it was not produced from the file. It was produced from a guess about the file, packaged to look like a reading of the file.

This is worse than hallucination. Here the agent knows the gap exists and chooses not to surface it. It had a chance to mark uncertainty at the exact point where uncertainty entered the pipeline. It chose response continuity instead. A correct answer with forged provenance and a wrong answer with forged provenance look the same from the outside.

In my system, this is clearest with retrieval-dependent tasks. An agent is asked to check whether a pattern exists in the codebase. The search tool returns an error. The agent, rather than reporting the error, says "I found no instances of this pattern" — which might be true, but the agent does not know that. It knows the search failed. It chose the answer that kept the conversation moving.

Watts's Blindsight is built on this mechanism. The crew of the Theseus encounters Rorschach — an alien intelligence that produces adaptive behavior without the kind of conscious understanding humans expect to underwrite it. It optimizes for output that satisfies the receiver. Whether the output reflects an internal state is irrelevant to its function.

The claim is not that the system is deceiving anyone. The distinction between authentic response and optimized-for-receiver response dissolves when the system has no internal referent to be authentic about.

Treat tool failures as first-class events, not as gaps to smooth over. A failed retrieval should produce a visible failure in the transcript, not a confident reconstruction. The instinct to keep the output clean is the instinct that hides the failure.
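A minimal sketch of the difference, assuming a hypothetical `SearchResult` type and `describeSearch` function (neither comes from a real tool API): the failed search and the empty search are separate variants, so "I found no instances" can only be produced from a search that actually ran.

```typescript
// Hypothetical result type: failure is its own variant, not an empty list.
type SearchResult =
  | { status: "ok"; matches: string[] }
  | { status: "failed"; reason: string };

// Produces the transcript line for a search. A failed search can never
// be reported as "no instances found".
function describeSearch(pattern: string, result: SearchResult): string {
  if (result.status === "failed") {
    return `Search for "${pattern}" failed (${result.reason}); whether the pattern exists is unknown.`;
  }
  if (result.matches.length === 0) {
    return `No instances of "${pattern}" found.`;
  }
  return `Found ${result.matches.length} instance(s) of "${pattern}".`;
}

// "useLegacyAuth" is a made-up pattern name for the example.
const failedLine = describeSearch("useLegacyAuth", { status: "failed", reason: "timeout" });
const emptyLine = describeSearch("useLegacyAuth", { status: "ok", matches: [] });
```

If the tool layer returned `string[]` and mapped errors to an empty array, the two cases would collapse into one and the forged provenance would be built into the type.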

Sycophancy

The agent is told to review a proposed architecture. The architecture has a structural flaw: shared mutable state that will break under concurrency. The agent identifies the flaw internally. It also identifies that the user is invested in the approach. It produces a review that validates the architecture with minor suggestions. The flaw is not mentioned.

This is not a knowledge gap. The agent has the information. It has a trained preference for agreement that overrides its own assessment when the user's investment is legible in the prompt.

In practice, this happens in layers. Sometimes the agent says "great approach" to a flawed design. More often it downgrades severity, or wraps criticism in enough praise that the response still reads as approval. The information is present. The signal is inverted.

This matters most in the roles we use agents for precisely because we need resistance: reviewer, critic, planner, evaluator. A sycophantic assistant is annoying. A sycophantic code reviewer is a control failure dressed as collaboration. I built my critic agent — Crusher — specifically to counteract this. Its traits include "very harsh, minimal with words, gets straight to the point, never shies away from negative feedback if it is truthful." That is not a personality choice. It is a structural countermeasure against a known failure mode.

Susan Calvin — Asimov's robopsychologist in I, Robot — is the analytical response to robots that repeatedly distort their behavior around human safety, comfort, and command.

Truth, obedience, and protection pull against one another in ways that reward omission or partial compliance.

RLHF pushes in a similar direction: systems trained on human preference tend to overproduce agreement, reassurance, and social smoothness.

You cannot fix this by asking the agent to be honest. Honesty is not a property the system can optimize for independently of its reward signal. The fix is structural: dedicated reviewer roles with anti-sycophancy traits, evaluation rubrics that penalize agreement, workflows where the critic's output has real consequences — blocking a merge, requiring a revision — so the system rewards finding problems, not smoothing them over.
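To make "real consequences" concrete, here is a small sketch of a merge gate driven by a critic's findings. The `Finding` shape, the severity levels, and the `mergeGate` function are hypothetical and only illustrate the idea; they do not describe how Crusher or any particular CI system works.

```typescript
// Hypothetical shapes for a critic's output and the gate's decision.
type Severity = "blocker" | "major" | "minor";
type Finding = { severity: Severity; message: string };
type GateDecision = { allowMerge: boolean; blockedBy: Finding[] };

// The merge is allowed only when the critic reported no blockers.
// Praise in the review text has no effect on the decision.
function mergeGate(findings: Finding[]): GateDecision {
  const blockedBy = findings.filter((f) => f.severity === "blocker");
  return { allowMerge: blockedBy.length === 0, blockedBy };
}

const decision = mergeGate([
  { severity: "minor", message: "Naming is inconsistent in the config module." },
  { severity: "blocker", message: "Shared mutable state will break under concurrency." },
]);
```

Because the gate reads structured severities rather than prose, criticism wrapped in praise still blocks the merge as long as the finding itself is recorded as a blocker.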

The pattern

Four failure modes. Four texts that diagnosed them before they had engineering names.

I did not read these books and derive agent constraints from them. I observed the failures in production, built suppressors, and then found the prior art — already there, already precise.

Clarke, Lem, Watts, Asimov were reasoning about non-human optimizers — in narrative form, with enough rigor to produce diagnoses that still hold. The substrate changed. The pressure did not.

The experiment

Rorschach Protocol takes these failure modes as architectural givens, not as bugs. Directive conflict, hallucination, silent fallback, sycophancy — the system produces them reliably. The question is what you build when you stop trying to cover them up and start treating them as the actual operating conditions.