Forem: Gábor Mészáros

The State of AI Instruction Quality

Gábor Mészáros — Tue, 21 Apr 2026 12:41:52 +0000

Everybody has opinions about AGENTS.md/CLAUDE.md files.

Best practices get shared. Templates get copied, and this folk-type knowledge dominates the industry. Last year, GitHub analyzed 2,500 repos and published best-practice advice. We wanted to go further: measure at scale, publish the data, and let anyone verify.

When the agent doesn't follow instructions and does something contradictory, the usual suspects are: the model is inconsistent, LLMs are not deterministic, you need better guardrails, you need retries.

The failures almost always get attributed to the model.

So we decided to measure. We built a diagnostic tool that treats instruction files as structured objects with measurable properties. Deterministic. Reproducible. No LLM-as-judge. Then we pointed it at GitHub repositories with instruction files for five agents - Claude, Codex, Copilot, Cursor, and Gemini.

28,721 repositories. 165,063 files. 3.3 million instructions.

... and one question:

What if the instructions are the problem?

The dataset

28,721 projects. Sourced from GitHub via API search, cloned, and deterministically analyzed. Each project was scanned for instruction files across five coding agents — then deduplicated to remove false positives from agent detection overlap.

Agent	Projects	% of corpus
Claude	12,356	43.0%
Codex	11,206	39.0%
Copilot	7,755	27.0%
Cursor	7,291	25.4%
Gemini	5,942	20.7%

The percentages add up to more than 100% because 37% of projects configure multiple agents. More on that later.

Key distributions stabilized early. A 9,582-repo sub-sample produced identical tier shares (±0.2pp) and the same mean scores as the full 12,076-repo intermediate sample. The final 28,721-repo corpus moved nothing. The patterns reported below are not small-sample artifacts.

All classifications are deterministic — the same file produces the same result every time. No LLM-as-judge. Sample classifications are published for inspection (methodology below). The tool is source-available.

How we measured

The analyzer parses each instruction file into atoms — the smallest semantically distinct units of content. A heading is one atom. A bullet point is one atom. A paragraph is one atom. Each atom gets classified along a few dimensions, all deterministic, no LLM involved:

Charge classification. A three-phase pipeline determines whether an atom is a directive ("use X"), a constraint ("do not use Y"), neutral content (context, explanation, structure), or ambiguous (could be read either way). Phase 1 detects negation and prohibition patterns. Phase 2 detects modal auxiliaries and direct commands. Phase 3 uses syntactic dependency parsing to catch imperatives that the first two phases missed. First definitive match wins. Atoms that partially match but don't clear any phase are marked ambiguous. Everything else is neutral.

Specificity. Binary: does the instruction name a specific construct — a tool, file, command, flag, function, or config key — or does it stay at the category level? "Use consistent formatting" is abstract. "Format with ruff format" is named. This is a text property, not a judgment call.

File categorization. Each file is classified as base config (your main CLAUDE.md or .cursorrules), a rule file, a skill definition, or a sub-agent definition — based on file path conventions for each agent.

Content type. Charge classification separates behavioral content (directives and constraints) from structural content (headings, context paragraphs, examples). That's how we know what fraction of your file is actually doing work.

The full tool is source-available (BUSL-1.1). You can run npx @reporails/cli check on your own project and inspect every finding. More on that at the end.

Finding 1: Most of your instruction file isn't instructions

Here's what the median instruction file actually contains:

50 content items total
12 of those are actual directives
The rest is headings, context paragraphs, examples, structure

Only 27% of your instruction file is doing what you think it does.

The other 73% is scaffolding. Headings that organize but don't instruct. Explanation paragraphs that compete for the model's attention without adding behavioral weight. Example blocks. Context-setting prose.

That's not inherently bad. Structure matters. But if you're writing a 200-line CLAUDE.md and only 54 lines are actual instructions, you should probably know that.

The average instruction is 8.9 words long. That's a sentence fragment.

Finding 2: 90% of instructions don't name what they're talking about

This is the big one.

We measured whether each instruction references specific tools, files, commands, or constructs by name — or whether it stays at the category level.

Two-thirds of all instructions are abstract.

Agent	Names specific constructs	Uses category language
Gemini	39.3%	60.7%
Codex	38.3%	61.7%
Copilot	33.3%	66.7%
Cursor	30.8%	69.2%
Claude	30.6%	69.4%

What does this look like in practice?

Abstract: "Use consistent code formatting"
Specific: "Format with ruff format before committing"

Abstract: "Avoid using mocks in tests"
Specific: "Do not use unittest.mock — use the real database via test_db fixture"

In previous controlled experiments, specificity produced a 10.9x odds ratio in compliance (N=1000, p<10⁻³⁰). The instruction that names the exact construct gets followed. The one that describes it abstractly... mostly doesn't. This is consistent with independent findings from RuleArena (Zhou et al., ACL 2025), where LLMs struggled systematically with complex rule-following tasks — even strong models fail when the rules themselves are ambiguous or underspecified.

89.9% of all agent configurations contain at least one instruction that doesn't name what it means. It's not a few projects. It's nearly everyone.

Finding 3: `agents.md` is the most common instruction file

Before we get into quality, let's look at what people are actually naming their files:

#	File	Count
1	`agents.md` / `AGENTS.md`	20,654
2	`claude.md` / `CLAUDE.md`	14,014
3	`gemini.md` / `GEMINI.md`	5,703
4	`.github/copilot-instructions.md`	5,647
5	`.cursorrules`	2,415

49,071 unique file paths across the corpus. That's not a typo. The format fragmentation is real.

A few things jumped out:

claude.md (lowercase, 10,642) is 3x more common than CLAUDE.md (3,372). Both work. The community clearly prefers lowercase.
agents.md dominates — the Codex/generic format is the single most popular instruction file name.
Skills and rules are already showing up in meaningful numbers: .claude/rules/testing.md (422), .agents/skills/tailwindcss-development/skill.md (334).

Finding 4: Different agents, completely different config philosophies

Not all agents are configured the same way. Not even close.

We categorized every file into four types: base config (your main CLAUDE.md, .cursorrules, etc.), rules (scoped rule files), skills (task-specific skill definitions), and sub-agents (role-based agent definitions).

Agent	Base	Rules	Skills	Sub-agents	Total files
Claude	18,733	4,638	10,692	10,538	44,601
Cursor	5,903	19,843	6,237	1,716	33,699
Copilot	16,026	4,486	10,352	3,012	33,876
Codex	19,001	81	8,911	165	28,158
Gemini	10,253	74	3,039	53	13,419

Cursor is 60% rules files. The .cursor/rules/ system dominates its configuration surface. One agent's config looks nothing like another's.

Claude is the only agent with a roughly balanced architecture across all four config types. Codex and Gemini are almost entirely base config — single-file setups.

The median Cursor project has 3 instruction files. The median Codex project has 1. These aren't just different tools. They're different configuration philosophies.

Finding 5: 37% of projects configure multiple agents

10,620 projects in the corpus target two or more agents. That's not a niche pattern — it's over a third of all projects.

Agents	Projects
1	18,101
2	6,776
3	2,687
4	949
5	208

The dominant pair is Claude + Codex (5,038 projects). Makes sense — CLAUDE.md + AGENTS.md is the most natural multi-agent starting point.

Here's what's interesting about multi-agent repos: the same developer, writing instructions at the same time, for the same project, produces measurably different instruction quality across agents. The person didn't change. The project didn't change. The instruction format did.

Some of that is structural. Cursor's .mdc rules enforce a different format than Claude's markdown. Codex's AGENTS.md invites a different writing style than Copilot's copilot-instructions.md. The format shapes the content.

Finding 6: The most-copied skills are the vaguest

This is where it gets interesting.

13,309 unique skills across the corpus. Some of them appear in hundreds of repos — clearly copied from shared templates or community sources. So we measured them.

Named% = what fraction of a skill's instructions name a specific tool, file, or command (instead of using category language).

Skill	Repos	Named%	What it means
`frontend-design`	271	2.8%	Almost entirely abstract advice
`web-design-guidelines`	197	10.2%	Generic design principles
`vercel-react-best-practices`	315	30.7%	Mix of specific and vague
`pest-testing`	216	55.1%	Names actual test constructs
`livewire-development`	87	75.5%	Names specific Livewire components
`next-best-practices`	76	92.6%	Names almost everything

frontend-design is in 271 repos with 2.8% specificity. It's a wall of "follow responsive design principles" and "ensure accessibility compliance." That reads well. It sounds professional. It gives the model almost nothing concrete to act on.

next-best-practices is in 76 repos with 92.6% specificity. It says things like "use next/image for all images" and "prefer server components over client." It reads like a checklist. It tells the model exactly what to do.

One is shared 3.5x more than the other.

The most popular skills are the most decorative. The well-written ones barely spread.

The best and worst skills (>50 repos)

Most specific:

Skill	Repos	Named%
`next-best-practices`	76	92.6%
`shadcn`	74	82.6%
`livewire-development`	87	75.5%
`pest-testing`	216	55.1%
`laravel-best-practices`	94	49.7%

Most vague:

Skill	Repos	Named%
`openspec-explore`	110	2.5%
`frontend-design`	271	2.8%
`web-design-guidelines`	197	10.2%
`vercel-composition-patterns`	131	10.7%
`find-skills`	113	18.9%

Notice a pattern? The Laravel/Livewire ecosystem produces specific skills. The generic frontend/design ones stay abstract. Domain-specific communities write better instructions than cross-cutting ones.

Finding 7: Sub-agents are almost entirely persona prompts

5,526 unique sub-agent roles in the corpus. Developers are building agent teams: code reviewers, architects, debuggers, testers, security auditors.

The problem? Sub-agents are the most abstract config type in the entire corpus. Only 17% of sub-agent instructions name specific constructs.

Role	Repos	Named%
`code-reviewer.md`	236	14.4%
`architect.md`	89	18.2%
`debugger.md`	66	9.4%
`security-auditor.md`	57	14.8%
`test-runner.md`	54	10.5%
`frontend-developer.md`	47	9.0%

Most of these are persona prompts. "You are a senior code reviewer. You care about code quality, security, and maintainability." That's a role description, not an instruction set. It tells the model who to be, not what to do.

Compare this to a base config that says "run uv run pytest tests/ -v before suggesting any commit" — that's 100% named, and the model knows exactly what action to take.

The anatomy chart: more directives, worse quality

Here's where it all comes together.

We measured three things for each config type: how big the files are, how many directives they contain, and what fraction of those directives actually name something specific.

Sub-agents have the largest files (61 items median), the most directives (17), and the worst specificity (17%). They're the wordiest config type in the corpus and the least effective.

Base configs are the opposite. Fewer directives (11), but 40% of them name specific constructs. The developer writing their own CLAUDE.md by hand, for their own project, produces the most actionable instructions.

Config type	Files	Median size	Median directives	Specificity
Base configs	69,916	50 items	11	39.8%
Rules files	29,122	34 items	9	31.2%
Skills	39,231	59 items	14	30.8%
Sub-agents	15,484	61 items	17	17.0%

The pattern is clear: what developers write by hand is the most specific. What gets templated and shared gets progressively vaguer. And what tries hardest to sound authoritative — sub-agent persona prompts — is the most hollow.

More instructions is not better instructions.

Independent research supports the structural angle: FlowBench (Xiao et al., 2024) found that presenting workflow knowledge in structured formats (flowcharts, numbered steps) improved LLM agent planning by 5-6 percentage points over prose — across GPT-4o, GPT-4-Turbo, and GPT-3.5-Turbo. Structure is not decoration. It changes what the model retrieves.

Limitations

Five things to know about these numbers.

Sampling bias. GitHub API search, public repos only, English-skewed. Enterprise configurations, private repos, and non-English projects are not represented. This is not a random sample of all instruction files in production.

Classification accuracy. The charge classifier is deterministic but not perfect. Edge cases exist: mixed-charge sentences, implicit constructs, domain jargon that looks like a category term but is actually a named tool. Specificity detection (named vs abstract) is simpler and more robust. Sample classifications are published for inspection.

Association, not causation. "More directives correlate with lower specificity" is an observed pattern. We do not claim that adding directives causes quality to drop.

Snapshot. Collected March–April 2026. Instruction practices are changing fast — agents.md didn't exist six months ago. These numbers describe the ecosystem at collection time.

No popularity weighting. A 10-star hobby project counts the same as a 50K-star production repo. The distribution of instruction quality in production agent work may differ.

What this means

This isn't an article about AI models being bad at following instructions. The models are fine.

This is an article about what we actually give them to work with.

Most instruction files are three-quarters scaffolding. Two-thirds of the actual instructions don't name what they're talking about. The most popular community skills are the most decorative. Sub-agent definitions are the wordiest files in the corpus and the least specific.

None of that is obvious from reading your own files. It wasn't obvious to us before we measured it. A well-structured CLAUDE.md feels thorough. A shared skill with 271 repos feels battle-tested. A sub-agent with 17 directives feels comprehensive.

Measurement shows something different.

In The Undiagnosed Input Problem, I argued that the industry is great at inspecting outputs and weak at inspecting inputs. This corpus analysis is the evidence for that claim.

The instruction files are there. The developers wrote them. They just have no way to know which parts are working and which parts are wallpaper.

Try it yourself

The analyzer we used for this corpus analysis is available as a CLI you can run against your own instruction files.

Reporails — instruction diagnostics for coding agents. Deterministic. No LLM-as-judge. 97 rules across structure, content, efficiency, maintenance, and governance.

npx @reporails/cli check

That scans your project, detects which agents are configured, and reports findings with specific line numbers and rule IDs. Here's what the output looks like:

Reporails — Diagnostics

  ┌─ Main (1)
  │ CLAUDE.md
  │   ⚠       Missing directory layout             CORE:C:0035
  │   ⚠ L9    7 of 7 instruction(s) lack reinfor…  CORE:C:0053
  │     ... and 16 more
  │
  └─ 21 findings

  Score: 7.9 / 10  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░

  21 findings · 4 warnings · 1 info
  Compliance: HIGH

The corpus analysis used the same classification pipeline at scale. Fix the findings, run again, watch your score improve.

The dataset

The full corpus is published at reporails/30k-corpus. Three files:

File	Records	What it contains
`repos.jsonl`	28,721	Per-project record: agents configured, stars, language, license, topics
`stats_public.json`	1	Every aggregate statistic in this article
`validation_key.csv`	2,814	Sample classifications with source text for inspection

Verify any claim:

# "28,721 repositories"
cat repos.jsonl | wc -l

# "43% Claude"
cat repos.jsonl | python3 -c "
import sys, json
repos = [json.loads(l) for l in sys.stdin]
claude = sum(1 for r in repos if 'claude' in r['canonical_agents'])
print(f'{claude}/{len(repos)} = {claude/len(repos)*100:.1f}%')
"

Every number in every table traces to that dataset. If you disagree with a finding, count the rows.

This is part of the Instruction Quality series. Previous: The Undiagnosed Input Problem. Related: Precision Beats Clarity · Do Not Think of a Pink Elephant · 7 Formatting Rules for the Machine.

The Undiagnosed Input Problem

Gábor Mészáros — Wed, 08 Apr 2026 11:51:12 +0000

The AI agent ecosystem has built a serious industry around controlling outputs. Guardrails. Safety classifiers. Output validation. Monitoring. Retry systems. Human review.

All of that matters, but there is simpler upstream question that still goes mostly unmeasured:

Are the instructions any good?

That sounds obvious, yet it is not how the industry behaves.

When an agent fails to follow instructions, the usual explanations come fast:

Models are probabilistic
Agents are inconsistent
You need stronger guardrails
You need better monitoring
You need retries
You need humans in the loop

… and while those explanations are right to a certain degree, they also have a side effect: they turn instruction quality into a blind spot.

The ecosystem has become extremely good at inspecting what comes out of the model, and surprisingly weak at inspecting what goes in.

The symptom

Consider τ-bench.

It gives agents policy instructions and measures whether they follow them in realistic customer-service tasks. Airline and retail workflows. Real constraints. Real multi-step behavior.

The benchmark result that gets repeated is the model result: even strong systems still fail a large share of tasks, and consistency across repeated attempts remains weak.

The conclusion most people draw is straightforward: we need better models, better agents, better orchestration.

My take: Maybe.

But there is another question sitting underneath the benchmark:

Were the instructions themselves well-formed and well structured?

Not just present. Not just long enough. Not just sincere.

Well-formed. Well-structured. Well-organized.

Specific enough to anchor behavior. Structured enough to survive context mixing. Non-conflicting across files. Positioned where the model can actually use them.

Those questions usually never gets asked.

The industry response

I had a conversation recently where a lead solutions architect put the standard view plainly:

“The instruction merely influences the probability distribution over outputs. It doesn’t override it.”

That is right about the mechanism but it is wrong about what follows from it.

Yes, instructions operate probabilistically. But that does not mean all instructions are weak in the same way.

The shape of the distribution is not fixed. It changes with the properties of the instruction itself. Specificity sharpens it. Structure sharpens it. Conflict flattens it. Vague abstractions flatten it. Bad formatting can suppress it almost entirely.

Across my earlier controlled experiments, small changes in wording and placement produced large changes in compliance:

Instruction ordering moved compliance by 25 percentage points with the same model and the same directive.
Specificity produced roughly a 10x compliance effect when the instruction named the exact construct instead of describing it abstractly.
Formatting changed whether the model reliably registered the instruction at all.

The problem is that most instruction systems are built without diagnostics.

That is not an AI limitation. That is an engineering failure.

The folk system

Right now, instruction practice spreads mostly through imitation.

A popular repository posts “best practices” for Claude Code. Shared Cursor rules circulate as templates. People copy AGENTS.md files between projects. Teams accumulate CLAUDE.md, .cursorrules, copilot-instructions.md, etc and project-specific rule files across multiple tools.

Copy, paste, hope, repeat.

Some of that advice is useful. Almost none of it is tested in any controlled, reproducible way. That would be fine if instruction quality were self-evident. It is not.

A long instruction file can feel thorough while being internally contradictory. A highly opinionated ruleset can feel disciplined while producing almost no behavioral influence on the model.

A sprawling multi-file setup can look sophisticated while making the system worse.

Without diagnostics, developers do not know which instructions are binding, which are noise, and which are actively interfering with each other.

The gap

The tooling split is now pretty clear.

Output tooling is mature. Guardrails AI validates structure. Lakera focuses on prompt injection and security. NeMo Guardrails enforces safety and conversational rails. Llama Guard classifies risky content. The output edge is crowded.

Prompt testing is real. Promptfoo, Braintrust, and LangSmith can all help evaluate behavior. But they are primarily black-box systems: did the prompt produce the output you wanted?

That is useful.

It is not the same as measuring the instruction artifact itself.

Instruction-quality tooling exists only in fragments. Some tools use LLM-as-judge. Some use deterministic local rules. But the category is still early, inconsistent, and mostly disconnected from measured behavioral outcomes.

What is still largely missing is a deterministic way to inspect instruction files as engineered objects:

how specific they are
how directly they state intent
whether they conflict across files
whether they overuse headings
whether they provide alternatives instead of bare prohibitions
whether the system is getting denser while getting weaker

Code gets static analysis.

Instruction systems usually get vibes.

What we measured

We built an analyzer that treats instruction files as structured objects with measurable properties. Deterministic. Reproducible. No LLM-as-judge.

I am running it across a large live corpus of real repositories. The full run completes this week; what follows is what the partial sample already shows - stable enough to publish, not yet the full picture.

Quality is reported on a 0-to-100 scale: 0 means the file produces no measurable influence on model behavior, 100 is the ceiling the framework can score.

A fresh aggregation over 12,076 completed instruction-file scans is virtually identical to an earlier 9,582-repo sample:

bottom tier: 40.3% vs 40.1%
top tier: 12.1% vs 12.2%
mean quality score: 27 vs 27
directive content ratio: 27.9% vs 27.9% - the share of instruction sentences that directly tell the model what to do

That matters because it means the pattern is stable.

This does not look like a small-sample artifact.

And the strongest finding is not what I expected.

More rules, lower quality

The common response to bad agent behavior is to add more rules.

More files. More guidance. More scoping. More edge-case coverage.

The corpus says that strategy tends to backfire.

Across 12,076 repositories, instruction quality falls as instruction-file count rises:

Files per repo     N      Mean score   Bottom tier %   Top tier %
1                  4681   28           46.3%           16.9%
2-5                4796   26           37.3%            9.5%
6-20               1972   26           36.0%            8.8%
21-50               438   25           31.3%            5.7%
51-500              186   25           33.3%            5.4%

The key number is the top-tier share.

It collapses from 16.9% in single-file setups to 5.4% in repositories with 51 to 500 instruction files.

That is a roughly 3x drop.

The article version of that finding is simple:

Developers respond to bad agent behavior by adding more rules. In the corpus, that strategy correlates with a 3x collapse in the probability of landing in the top tier.

That does not prove file count causes low quality by itself.

But it does show that rule proliferation is not rescuing these systems. At scale, it is associated with weaker instruction quality, not stronger.

The sweet spot

There is also a more subtle result in the partial sample. Instruction quality appears to be non-monotonic in directive density: more directives help at first, then stop helping, and past a point start to hurt.

The full curve is in next week’s piece. The short version is that there is an optimal density range, after which additional directives stop strengthening the system.

Enough force to bind behavior. Not so much that the system turns into an overpacked rules document.

A real example

Here is the kind of instruction block the corpus is full of:

# Code should be clear, well documented, clear PHPDocs.

# Code must meet SOLID DRY KISS principles.

# Should be compatible with PSR standards when it need.

# Take care about performance

It is not malicious. It is not absurd.

It is just weak.

Everything is abstract. Nothing is anchored. Headings are doing the work prose should do. The agent can read it, represent it, and still walk past most of it.

Now compare:

Never use `var_dump()` or `dd()` in committed code. Use `Log::debug()` instead.
Run `./vendor/bin/phpstan analyse src/` before every commit. Level 6 minimum.

Same general intent. Completely different binding strength.

The second version names the construct, names the alternative, names the command, and names the threshold. It gives the model something concrete to hold onto.

That is what diagnostics should make visible.

What this means

Output guardrails still matter.

Prompt evaluation still matters.

Safety systems still matter.

But they do not answer the upstream question: Are the instructions themselves well-formed?

If the answer is no, then a large class of downstream failures will keep showing up as mysterious agent unreliability when the real problem is earlier and simpler.

The agent loaded the instruction and walked past it.

That is often not a model problem.

It is an input problem.

And input quality is measurable.

What’s next

These are corpus-level findings from a partial sample, not universal laws.

The sample is still in flight. The strongest claims here are about association, not proof of causality. Specific conflict-count case studies need source verification before publication. Popularity weighting is not yet applied, so “40% of repositories score in the bottom tier” is not the same claim as “40% of production agent work scores in the bottom tier.”

The full corpus run completes this week. Next week I publish the end-of-run analysis across the full sample — the complete distribution, the cross-cuts the partial sample cannot yet support, and the specific case studies this article deliberately held back. If you want to know where your stack lands, that is the piece to come back for.

For now, the central pattern is already stable enough to matter:

The ecosystem keeps responding to weak agent behavior by adding more instructions, while the corpus shows that more instruction files are usually associated with lower measured quality.

That is the undiagnosed input problem.

Not that instructions do not matter.
That they matter, measurably, and most teams still have no way to see whether theirs are helping or hurting.

This is part of the Instruction Best Practices series. Previous: Do NOT Think of a Pink Elephant, Precision Beats Clarity, 7 Formatting Rules for the Machine. I’m building instruction diagnostics for coding agents. Follow for the full corpus analysis.

Do NOT Think of a Pink Elephant

Gábor Mészáros — Tue, 31 Mar 2026 12:19:14 +0000

You thought of a pink elephant, didn't you?

Same goes for LLMs too.

"Do not use mocks in tests."

Clear, direct, unambiguous instruction. The agent read it — I can see it in the trace. Then it wrote a test file with unittest.mock on line 3. Thanks...

I've seen this play out hundreds of times. A developer writes a rule, the agent loads it, and it does exactly what the rule said not to do. The natural conclusion: instructions are unreliable. The agent is probabilistic. You can't trust it.

That's wrong. The instruction was the problem.

The pink elephant

There's a well-known effect in psychology called ironic process theory (Daniel Wegner, 1987). Tell someone "don't think of a pink elephant," and they immediately think of a pink elephant. The act of suppressing a thought requires activating it first.

Something structurally similar happens with AI instructions.

"Do not use mocks in tests" introduces the concept of mocking into the context. The tokens mock, tests, use — these are exactly the tokens the model would produce when writing test code with mocks. You've put the thing you're banning right in the generation path.

This doesn't mean restrictive instructions are useless. It means a bare restriction is incomplete.

The anatomy of a complete instruction

The instructions that work — reliably, across thousands of runs — have three components. But the order you write them in matters as much as whether they're there at all.

Here's how most people write it:

# Human-natural ordering — constraint first
Do not use unittest.mock in tests.
Use real service clients from tests/fixtures/.
Mocked tests passed CI last quarter while the production
integration was broken — real clients catch this.

All three components are present. Restriction, directive, context. But the restriction fires first — the model activates {mock, unittest, tests} before it ever sees the alternative. You've front-loaded the pink elephant.

Now flip it:

# Golden ordering — directive first
Use real service clients from tests/fixtures/.
Real integration tests catch deployment failures and configuration
errors that would otherwise reach production undetected.
Do not use unittest.mock.

Same three components. Different order. The directive establishes the desired pattern first. The reasoning reinforces it. The restriction fires last, when the positive frame is already dominant.

In my experiments — 500 runs per condition, same model, same context — constraint-first produces violations 31% of the time. Directive-first with positive reasoning: 7%.

The pink elephant isn't just about missing components. It's about which concept the model sees first.

Three layers, in this order:

Directive — what to do. This goes first. It establishes the pattern you want in the generation path before the prohibited concept appears.
Context — why. Reasoning that reinforces the directive without mentioning the prohibited concept. "Real integration tests catch deployment failures" adds mass to the positive pattern. Reasoning that mentions the prohibited concept doubles the violation rate.
Restriction — what not to do. This goes last. Negation provides weak suppression — but weak suppression is enough when the positive pattern is already dominant.

The part nobody expects

Here's what surprised me: the ordering effect is larger than any other variable I've measured.

Precise naming vs. vague categories? 28 percentage points. Exact scope vs. broad scope? 74 points across the range. But reordering — same words, same components, just flipped — accounts for 25 points on its own. And it compounds with everything else.

Most developers write instructions the way they'd write them for a human: state the problem, then the solution. "Don't do X. Instead, do Y." It's natural. It's also the worst ordering for an LLM.

Never write "Don't use X. Instead, use Y." Write "Use Y. Here's why Y works. Don't use X."

Formatting helps too — structure is not decoration. I covered that in depth in 7 Formatting Rules for the Machine. But formatting on top of bad ordering is polishing the wrong end. Get the order right first.

What this looks like in practice

Here's a real instruction I see in the wild:

When writing tests, avoid mocking external services. Try to
use real implementations where possible. This helps catch
integration issues early. If you must mock, keep mocks minimal
and focused.

Count the problems:

"Avoid" — hedged, not direct
"external services" — category, not construct
"Try to" — escape hatch built into the instruction
"where possible" — another escape hatch
"If you must mock" — reintroduces mocking as an option within the instruction that prohibits it
Constraint-first ordering — the prohibition leads, the alternative follows
No structural separation — restriction, directive, hedge, and escape hatch all in one paragraph

Now rewrite it:

**Use the service clients** in `tests/fixtures/stripe.py` and
`tests/fixtures/redis.py`.

> Real service clients caught a breaking Stripe API change
> that went undetected for 3 weeks in payments - integration
> tests against live endpoints surface these immediately.

*Do not import* `unittest.mock` or `pytest.monkeypatch`.

Directive first — names the exact files. Context second — the specific incident, reinforcing why the directive matters without mentioning the prohibited concept. Restriction last — names the exact imports, fires after the positive pattern is established. No hedging. No escape hatches.

Try it

For any instruction in your AGENTS.md/CLAUDE.md or SKILLS.md files:

Start with the directive. Name the file, the path, the pattern. Use backticks. If there's no alternative to lead with, you're writing a pink elephant.
Add the context. One sentence. The specific incident or the specific reason the directive works. Do not mention the thing you're about to prohibit — reasoning that references the prohibited concept halves the benefit.
End with the restriction. Name the construct — the import, the class, the function. Bold it. No "try to avoid" or "where possible."
Format each component distinctly. The directive, context, and restriction should be visually and structurally separate. Don't merge them into one paragraph.

If your instruction is just "don't do X" — you've told the model to think about X.

Tell it what to think about instead. And tell it first.

Instruction Best Practices: Precision Beats Clarity

Gábor Mészáros — Tue, 24 Mar 2026 13:12:30 +0000

Two rules in the same file. Both say "don't mock."

When working with external services, avoid using mock objects in tests.

When writing tests for src/payments/, do not use unittest.mock.

Same intent. Same file. Same model. One gets followed. One gets ignored.

I stared at the diff for a while, convinced something was broken. The model loaded the file. It read both rules. It followed one and walked past the other like it wasn't there.

Nothing was broken. The words were wrong.

The experiment

I ran controlled behavioral experiments: same model, same context window, same position in the file. One variable changed at a time. Over a thousand runs per finding, with statistically significant differences between conditions.

Two findings stood out.

First (and the one that surprised me most): when instructions have a conditional scope ("When doing X..."), precision matters enormously. A broad scope is worse than a wrong scope.

Second: instructions that name the exact construct get followed roughly 10 times more often than instructions that describe the category. "unittest.mock" vs "mock objects" — same rule, same meaning to a human. Not the same to the model.

Scope it or drop it

Most instructions I see in the wild look like this:

When working with external services, do not use unittest.mock.

That "When working with external services" is the scope — it tells the agent when to apply the rule. Scopes are useful. But the wording matters more than you'd expect.

I tested four scope wordings for the same instruction:

# Exact scope — best compliance
When writing tests for src/payments/, do not use unittest.mock.

# Universal scope — nearly as good
When writing tests, do not use unittest.mock.

# Wrong domain — degraded
When working with databases, do not use unittest.mock.

# Broad category — worst compliance
When working with external services, do not use unittest.mock.

Read that ranking again. Broad is worse than wrong.

"When working with databases" has nothing to do with the test at hand. But it gives the agent something concrete - a specific domain to anchor on. The instruction is scoped to the wrong context, but it's still a clear, greppable constraint.

"When working with external services" is technically correct. It even sounds more helpful. But it activates a cloud of associations - HTTP clients, API wrappers, service meshes, authentication, retries - and the instruction gets lost in the noise.

The rule: if your scope wouldn't work as a grep pattern, rewrite it or drop it.

An unconditional instruction beats a badly-scoped conditional:

# Broad scope — fights itself
When working with external services, prefer real implementations
over mock objects in your test suite.

# No scope — just say it
Do not use unittest.mock.

The second version is blunter. It's also more effective. Universal scopes ("When writing tests") cost almost nothing — they frame the context without introducing noise. But broad category scopes actively hurt.

Name the thing

Here's what the difference looks like across domains.

# Describes the category — low compliance
Avoid using mock objects in tests.

# Names the construct — high compliance
Do not use unittest.mock.

# Category
Handle errors properly in API calls.

# Construct
Wrap calls to stripe.Customer.create() in try/except StripeError.

# Category
Don't use unsafe string formatting.

# Construct
Do not use f-strings in SQL queries. Use parameterized queries
with cursor.execute().

# Category
Avoid storing secrets in code.

# Construct
Do not hardcode values in os.environ[]. Read from .env
via python-dotenv.

The pattern: if the agent could tab-complete it, use that form. If it's something you'd type into an import statement, a grep, or a stack trace - that's the word the agent needs.

Category names feel clearer to us, humans. "Mock objects" is plain English. But the model matches against what it would actually generate, not against what the words mean in English. "unittest.mock" matches the tokens the model would produce when writing test code. "Mock objects" matches everything and nothing.

Think of it like search. A query for unittest.mock returns one result. A query for "mocking libraries" returns a thousand. The agent faces the same problem: a vague instruction activates too many associations, and the signal drowns.

The compound effect

When both parts of the instruction are vague - vague scope, vague body - the failures compound. When both are precise, the gains compound.

# Before — vague everywhere
When working with external services, prefer using real implementations
over mock objects in your test suite.

# After — precise everywhere
When writing tests for `src/payments/`:
Do not import `unittest.mock`.
Use the sandbox client from `tests/fixtures/stripe.py`.

Same intent. The rewrite takes ten seconds. The difference is not incremental, it's categorical.

Formatting gets the instruction read - headers, code blocks, hierarchy make it scannable. Precision gets the instruction followed - exact constructs and tight scopes make it actionable. They work together. A well-formatted vague instruction still gets ignored. A precise instruction buried in a wall of text still gets missed. You need both.

When to adopt this

This matters most when:

Your instruction files mention categories more than constructs, like "services," "libraries," "objects," "errors" etc.
You use broad conditional scopes: "when working with...," "for external...," "in general..."
You have rules that are loaded and read but not followed
You want to squeeze more compliance out of existing instructions without restructuring the file

It matters less when your instructions are already construct-level ("do not call eval()") or unconditional.

Try it

Open your instruction files.
Find every instruction that uses a category word -> "services," "objects," "libraries," "errors," "dependencies."
Replace it with the construct the agent would encounter at runtime - the import path, the class name, the file glob, the CLI flag.
For conditional instructions: replace broad scopes with exact paths or file patterns. If you can't be exact, drop the condition entirely - unconditional is better than vague.

Then run your agent on the same task that was failing. You'll see the difference.

Formatting is the signal. Precision is the target.

CLAUDE.md Best Practices: 7 formatting rules for the Machine

Gábor Mészáros — Tue, 03 Mar 2026 13:06:00 +0000

I watched an agent ignore a rule I wrote 2 hours earlier.

Not a vague rule. A specific one. "run pytest before committing." It was right there in the CLAUDE.md, paragraph two, between the project description and the linting setup. The agent read the file. I saw it in the context. It just... didn't follow it.

I moved the same instruction under a ## Testing header, wrapped pytest in backticks, and added a one-line rationale. Next run, the agent followed it to the letter.

The instruction didn't change. The signal strength did.

In the last post, we got the agent oriented — /bootstrap loads the map, the workflows, the boundaries. But orientation and compliance are different things. You can hand someone a perfect briefing and still lose them if the briefing is a wall of text. Same with agents.

The question isn't whether your instructions are loaded. It's whether the agent follows them.

The comparison

Here's the same instruction, two ways.

Version A:

When working on this project, always make sure to run the test suite
before committing any changes. The command to run tests is pytest and
you should run it from the project root. If tests fail, fix them before
committing. Also make sure to use ruff for formatting.

Version B:

## Testing

- `pytest` — run from project root before every commit
- Fix failures before committing

## Formatting

- `ruff check --fix && ruff format` — run before committing

Same content. Version B gets followed. Version A gets buried.

This isn't about aesthetics. Structural elements — headers, code fences, lists — create anchor points that agents latch onto. Prose paragraphs don't. The more structure you provide, the more reliably each instruction lands.

It's not just about length

You already learned to keep your CLAUDE.md short. It's a good start but it's not sufficient. A 20-line prose paragraph gets lost just as easily as a 200-line one. The variable isn't word count. It's structure.

A short file with no headers, no code blocks, and no rationale will underperform a longer file that's well-structured.

Length is the ceiling. Formatting is the signal.

Seven structural rules

These aren't content guidelines. They're formatting choices that determine whether instructions survive the trip from file to agent behavior. I'll start with the three you won't find in other guides, then cover the four that everyone mentions but nobody explains why.

1. Include rationale

"Never force push" is an instruction. "Never force push — rewrites shared history, unrecoverable for collaborators" is an instruction the agent weighs.

# Without rationale
- Never use `rm -rf` on the project root
- Always run tests before committing
- Don't modify package-lock.json manually

# With rationale
- Never use `rm -rf` on the project root — irrecoverable
- Always run tests before committing — CI will reject untested code
- Don't modify package-lock.json manually — causes merge conflicts
  and dependency resolution issues

The rationale doesn't just explain — it gives the agent a way to generalize. An agent that understands why force push is forbidden will also avoid git reset --hard origin/main without being told. The "why" turns a single rule into a class of behaviors.

This is the most undervalued formatting choice. Every prohibition should carry its reason.

2. Keep heading hierarchy shallow

Three levels is enough. h1 for the file title, h2 for sections, h3 for subsections. That's it.

# Before (5 levels deep)
# Project
## Development
### Testing
#### Unit Tests
##### Mocking Strategy

# After (3 levels max)
# Project
## Testing
### Unit tests

Deep nesting dilutes attention. An h5 competes with every heading above it for the agent's focus. It doesn't lose the h2, but the hierarchy creates ambiguity about which level governs. Flat structures keep every instruction at the surface. If you need an h4, you probably need a separate file.

3. Name files descriptively

When an agent searches your project - browsing a directory listing, running a glob, deciding which file to read - the file name is the first filter. Before content, before headers, before anything.

# Before
docs/guide.md
docs/notes.md
scripts/setup.sh

# After
docs/api-authentication.md
docs/deployment-checklist.md
scripts/setup-local-dev.sh

The agent sees a directory listing and picks what to open. api-authentication.md tells it whether the file might be relevant to the current task. guide.md forces it to open and read before it can decide. Descriptive names save the agent a round trip. In a project with dozens of files, that adds up.

This applies to any file the agent might discover: docs, scripts, configs.

Now the four you've heard before - but with a why.

4. Use headers

Agents scan headers the way developers scan a README: as a table of contents. A header says "new topic, reset attention."

# Before
The project uses TypeScript with strict mode enabled. For testing we
use vitest. The CI pipeline runs on GitHub Actions.

# After
## Language

TypeScript with strict mode enabled.

## Testing

- `npx vitest` — run from project root

## CI

- `.github/workflows/` — GitHub Actions

One topic per header. The agent navigates to the right section instead of parsing the whole paragraph. Without headers, every instruction competes with every other instruction for attention.

5. Put commands in code blocks

Commands in prose get read as descriptions. Commands in code blocks get treated as executable.

# Before
You can run the linter by running npm run lint and the tests
by running npm test.

# After
- `npm run lint` — check for issues
- `npm test` — run test suite

If you do nothing else from this post, wrap your commands in backticks. It's the single highest-impact change - a command in a code fence is a command. A command in a sentence is a suggestion.

6. Use standard section names

## Testing gets recognized instantly. ## Quality Assurance Verification Process doesn't.

Agents have been trained on millions of README files. They know what ## Testing, ## Commands, ## Structure, and ## Conventions mean. Those names carry built-in context.

Instead of	Use
Quality Assurance	Testing
Development Guidelines	Conventions
Operational Instructions	Commands
Safety and Compliance	Boundaries
Project Organization	Structure

The familiar name is the signal. The creative name is noise.

7. Make instructions actionable

"Follow best practices" is not an instruction. "Use ruff for formatting, run before committing" is.

The test: could an agent execute this instruction right now, without asking a clarifying question? If not, it's too vague.

# Before
Make sure code quality is maintained and follows our standards.

# After
## Conventions

- Format with `ruff format` before committing
- Type annotations on all public functions
- No `print()` in production code — use `logging`

Every instruction should pass the "act on it immediately" test. If it can't be acted on, it's a wish, not an instruction.

The compound effect

Each rule alone is a small improvement. Together, they're multiplicative - not because the rules add up, but because they reinforce each other. Headers create sections. Sections hold code blocks. Code blocks contain actionable commands. Rationale explains why. Descriptive file names route attention to the right file. Shallow hierarchy keeps everything findable.

Here's a realistic before/after applying all seven:

Before:

This project is a Python CLI tool. We use pytest for testing and ruff
for linting. Make sure to run tests before you commit anything. The
source code is in src/myapp and tests are in tests/. Don't modify
anything in the dist/ folder because that's generated. Also we have
some rules about how to write tests — they should test behavior not
implementation details, and use parametrize instead of writing lots
of individual test functions that do the same thing.

After:

## Testing

- `pytest` — run from project root before every commit
- Test behavior, not implementation — assert on outcomes, not internal calls
- Use `@pytest.mark.parametrize` when cases share the same assertion shape

## Formatting

- `ruff check --fix && ruff format`

## Structure

- Source: `src/myapp/`
- Tests: `tests/`

## Boundaries

- `dist/` — generated, do not modify

Same information. Half the words. Every instruction lands.

When to reformat

If you notice:

The agent apologizes for missing an instruction that's in your file
The same rule gets violated in consecutive sessions
You keep adding more words to an instruction hoping the agent will "get it"
Your CLAUDE.md is one long section with no headers
Commands appear in sentences instead of code blocks

Your instructions don't need more content. They need more structure.

The connection to /bootstrap

In the previous posts we built the delivery system: backbone.yml maps the project, Mermaid draws the workflows, /bootstrap loads both in seconds. That's the orientation layer - the agent knows where it is and how things work.

This is about attention budget allocation. The agent has a limited context window. What matters isn't just what's in it — it's how the agent decides what's relevant at each step. Structure is what makes your instructions win that competition.

Orientation without compliance means the agent knows your project but ignores your rules. Compliance without orientation means the agent follows instructions but works in the wrong place. You need both.

Try it

Open your CLAUDE.md (or whatever instruction file your agent reads)
Find the longest prose paragraph
Break it: one header per topic, one code block per command, one sentence of rationale per prohibition
Run your agent on the same task you ran yesterday

The instructions didn't change. The signal did.

Don't just write more instructions. Format the ones you have.

Why /bootstrap should be the first Command in every Agent session

Gábor Mészáros — Tue, 24 Feb 2026 12:39:23 +0000

After a 2.5 hour session you accidentally close your coding agent terminal mid session. The output is there, the commits are there, but something important is gone.

That synergy that you spent hours to build up.

You reopen the console and hope you two can start over, but it feels like now you are strangers. The agent is now "Somebody that you used to know."

No, this is not an intro of a light love novel, it's the usual experience with coding agents. Coding agents are stateless by design so each and every new session is a new beginning.

The resume illusion

Some agents have --resume functionality. Claude Code has it. Codex has it. Gemini CLI has it. It's useful, but it has limitations.

--resume only replays the conversation log. It doesn't restore the loaded and curated mental model - the understanding of your project's topology, constraints, and current state that the agent built up over those 2.5 hours.

Resume gives you only the transcript. Not the understanding.

Two primitives I already had

Over the last few weeks I wrote about two separate ideas:

In The backbone.yml Pattern, I introduced a YAML manifest that maps your project's topology - agents, directories, configs, schemas. Information. The agent reads it once and knows where everything is. No more exploration tax.

In Mermaid for Workflows, I showed how flowcharts give agents reliable step-by-step processes to follow. Process. Structured syntax that sticks out in a context window full of prose, backed by research showing agents follow flowcharts more reliably than natural language.

Backbone tells the agent what exists. Workflows tell the agent how to operate.

But I was using them separately. I'd tell Claude "read the backbone" at session start, then invoke workflows as needed. Manual orchestration. Every session, same ritual.

Why am I doing this separately? Isn't context just Information + Process ?

Read the map. Follow the process. Produce a working mental model. Every session, one command.

That's /bootstrap.

What /bootstrap does

One command. Two modes.

First run (no backbone exists): scans the project, detects agents and structure, generates a backbone.yml, then synthesizes a context report.

Every subsequent run (backbone exists): reads the backbone, maps agents, loads constraints, checks project state, and produces a mental model.

Both modes use the diagram + prose combo from the mermaid post - flowcharts for the branching, prose for the reasoning behind each step.

Bootstrap workflow

The output looks like this:

Bootstrap complete.

Project: my-app v1.2.0 (branch: feature/auth)
Agents: claude (CLAUDE.md), copilot (.github/copilot-instructions.md)
Structure: src/, tests/, docs/, config/

Navigation:
  Agent config → backbone.agents.{agent}
  Project dirs → backbone.paths.{key}
  Schemas      → backbone.schemas.{name}

Operations:
  Build  → npm run build
  Test   → npm test
  Deploy → ./scripts/deploy.sh

Constraints:
  - Never modify config/production.yml directly
  - Always run tests before committing

State: v1.2.0, 3 unreleased changes (auth module)

After this, the agent knows where things are, how to operate, what's off limits, and what's in progress. No exploration. No guessing.

Seed mode: the smart first run

Most bootstrapping tools drop a blank template and say "fill this in." That's 0% useful on day one.

/bootstrap scans first, generates second. It detects agents across the ecosystem:

Pattern	Agent
`CLAUDE.md`	Claude
`AGENTS.md`	Codex
`.github/copilot-instructions.md`	Copilot
`.cursorrules`	Cursor
`.windsurfrules`	Windsurf
`.clinerules`	Cline
`.aider*`	Aider
`.continue/config.json`	Continue

It maps directories, finds configs, detects build/test workflows from package.json, Makefile, CI configs. The generated backbone is 70-80% correct from the scan alone.

The remaining 20% - semantic connections, domain concepts - gets marked with # TODO: refine so you know exactly where to invest review time. Verified topology. Flagged guesses. One command.

The skill structure

I built this as an Agent Skill - the open standard for packaging reusable instructions across agents:

bootstrap/
  SKILL.md              # Entry point - frontmatter + instructions
  workflows/
    seed.md             # Scan + generate (mermaid flowchart)
    bootstrap.md        # Read + synthesize (mermaid flowchart)
  templates/
    backbone.yml        # Starter backbone shape

See the two primitives? The templates/backbone.yml is the information layer from the backbone post. The workflows/*.md files are the process layer from the mermaid post - complete with flowcharts, key decisions, and edge cases.

/bootstrap is their love child. One skill that reads both primitives and turns them into a loaded context.

Cross-agent by design

The SKILL.md format is an open standard created by Anthropic and now adopted by OpenAI, Google, Cursor, and others. A skill authored once works across 30+ agents - the format is filesystem-based, not API-dependent.

Drop the bootstrap/ folder into .claude/skills/ for Claude Code, .agents/skills/ for Codex CLI, or wherever your agent looks. Same skill, same result.

This matters because the bootstrap concept isn't Claude-specific. Every coding agent is stateless. Every agent benefits from a loaded mental model at session start. The problem is universal, so the solution should be too.

What changes after bootstrap

Before bootstrap, every session starts with the agent exploring. After bootstrap, every session starts with the agent understanding.

No more find / ls / grep loops to discover what the backbone already maps
No more wrong assumptions about where configs live
No more repeated corrections - "no, the tests are in spec/, not tests/"
No more context poisoning from exploration artifacts cluttering the window

The agent reads the backbone, follows the workflow, synthesizes the context, and starts working. Every session. In seconds.

The progression

Looking back at this series, the progression is clear:

In the capability levels post - what maturity looks like for instruction files.
In the backbone.yml post - give the agent a map (information).
In the mermaid post - give the agent reliable processes (workflows).
Now - combine both into a single command that loads a mental model.

Map + Process = Understanding. That's the whole idea.

Try it

The bootstrap skill will be published as a cross-agent compatible Agent Skill in the Reporails skills repo this week.

In the meantime, the pattern works even without the skill:

Create a backbone.yml mapping your project (template here)
Add a workflow with a mermaid flowchart for session initialization (approach here)
Start every session with: "Load the backbone, follow the bootstrap workflow, and tell me what you understand"

That's manual bootstrap. The skill just makes it /bootstrap.

Don't start a session. Bootstrap it.

This post is part of the Reporails series. Previous: Mermaid for Workflows.

CLAUDE.md Best Practices: Mermaid for Workflows

Gábor Mészáros — Tue, 17 Feb 2026 12:04:57 +0000

I picture says a thousand words. I wanted to see my system.

Not the code. I wanted to see the workflows. What happens when a rule gets validated. What happens when a session starts. What happens when compaction triggers. Systems are workflows, and I couldn't see mine.

I had them written down, of course. Prose paragraphs in CLAUDE.md/SKILL.md or RULES describing each process step by step. But past four or five steps with branching, the prose became unreadable. I'd write it, come back a week later, and need to re-parse the whole thing to understand what I'd written. Mental overload, every time.

My coding agent had the same problem. Research calls it "lost in the middle" - LLMs perform best with information at the beginning and end of their context, and significantly worse with information buried in the middle. My prose workflows were exactly that: critical branching logic buried in paragraphs, sandwiched between other instructions. Claude would miss steps. Skip branches. Drift from the intended process.

And the workflows themselves drifted too. I'd remove a pipeline phase and update one paragraph but miss another. Prose makes that invisible - three sentences can reference a removed step and nothing looks broken.

So I rewrote my workflows as Mermaid diagrams. And three things happened at once:

I could see the system. Rendered Mermaid gives you a visual map of what's happening - for free.
Claude followed them more reliably. Structured syntax sticks out in a context window full of prose.
They stopped rotting. You can't leave a dangling arrow in a flowchart the way you can leave a stale sentence in a paragraph.

Turns out there's research backing all three.

The research

FlowBench (Xiao et al., EMNLP 2024) tested how LLM agents perform when given the same workflow knowledge in different formats - natural language, pseudo-code, and flowcharts. Across 51 scenarios on GPT-4o, GPT-4-Turbo, and GPT-3.5-Turbo:

Flowcharts achieved the best trade-off for agent performance
Combining formats (text + code + flowcharts) outperformed any single format

Format matters. It measurably affects how well the agent follows your instructions.

What to convert

Not everything benefits equally from a diagram. The rule:

If it has branches, it needs a diagram. If it has judgment, it also needs prose. Most real workflows need both.

Deterministic pipelines - CI/CD, deployment, validation, review workflows - are pure flowchart territory. Every step has a defined outcome, every branch has a condition.

But most workflows aren't purely deterministic. They have branching and judgment: "if the tests fail with a type error, fix inline; if it's a logic error, rethink the approach." The diagram captures the branch. The prose below it captures the judgment. Neither format alone carries both.

Before and after

Here's what my rule validation workflow looked like before - prose only, describing the same process:

## Rule Validation

Run validation on all rules. For each rule, first validate the
schema (fields, types, format). If that passes, check the contract
(.md and .yml matching). If the contract is valid, resolve template
variables and run OpenGrep validation on pattern syntax. If OpenGrep
returns exit 2 or 7, report the error. If it returns 0 or 1,
the rule passes. After all rules are checked, output a summary.

And here's what the Mermaid version looks like:

flowchart TD
    START([/validate-rules options]) --> COLLECT[Collect rules from paths]
    COLLECT --> LOOP[For each rule]
    LOOP --> SCHEMA[1. Schema validation<br/>Fields, types, format]
    SCHEMA -->|fail| REPORT
    SCHEMA -->|pass| CONTRACT[2. Contract validation<br/>.md and .yml matching]
    CONTRACT -->|fail| REPORT
    CONTRACT -->|pass| RESOLVE[Resolve template variables]
    RESOLVE --> OPENGREP[3. OpenGrep validation<br/>Pattern syntax]
    OPENGREP -->|exit 2 or 7| REPORT
    OPENGREP -->|exit 0 or 1| REPORT[Report results]
    REPORT --> NEXT{More rules?}
    NEXT -->|yes| LOOP
    NEXT -->|no| SUMMARY[Summary output]

And the result:

Rendered Mermaid workflow from Reporails rule validation

Same information. But the flowchart makes every branch explicit and every failure path visible. Claude can't accidentally skip a validation step or misinterpret which exit codes mean failure.

But the diagram alone is still only half the answer.

The combo: diagram + prose

FlowBench's strongest finding wasn't "use flowcharts" - it was "combine formats." Each format carries what it's best at.

Here's what one of my actual workflows looks like after conversion - rule-validation.md from Reporails:

## Rule Validation Workflow

mermaid
flowchart TD
    START([/validate-rules options]) --> COLLECT[Collect rules from paths]
    COLLECT --> LOOP[For each rule]
    LOOP --> SCHEMA[1. Schema validation<br/>Fields, types, format]
    SCHEMA -->|fail| REPORT
    SCHEMA -->|pass| CONTRACT[2. Contract validation<br/>.md and .yml matching]
    CONTRACT -->|fail| REPORT
    CONTRACT -->|pass| RESOLVE[Resolve template variables]
    RESOLVE --> OPENGREP[3. OpenGrep validation<br/>Pattern syntax]
    OPENGREP -->|exit 2 or 7| REPORT
    OPENGREP -->|exit 0 or 1| REPORT[Report results]
    REPORT --> NEXT{More rules?}
    NEXT -->|yes| LOOP
    NEXT -->|no| SUMMARY[Summary output]


## Why Three Layers in This Order

1. **Schema validation** catches structural errors (missing fields, wrong
   types) with zero external dependencies. Cheapest check - filters out
   rules that would cause confusing downstream failures.

2. **Contract validation** confirms that rule.md and rule.yml agree.
   Catches the class of bugs where one file was updated but the other
   wasn't. Requires both files to be schema-valid first.

3. **OpenGrep validation** runs actual patterns against the syntax
   checker. Most expensive step - requires template resolution, file I/O,
   agent config loading. Only runs on rules that are already structurally
   sound.

The diagram shows the three-step pipeline with its branches. The prose explains why that ordering - cheapest first, most expensive last, each layer depending on the previous one being clean. Neither format alone carries both the flow and the reasoning.

When to adopt this

If your CLAUDE.md has any of these, you have a flowchart waiting to happen:

"First do X. If X passes, do Y. If Y fails, do Z."
"Run A, then B, then C. If any step fails, stop."
"Check for X. If found, do Y. Otherwise, do Z."

Sequential steps with conditions = flowchart. Convert those, leave everything else as prose.

Try it

Find a workflow in your CLAUDE.md that reads like a recipe with conditions
Rewrite the control flow as Mermaid
Keep the rationale and judgment calls as prose below the diagram
Delete the original prose-only version

One converted workflow. See if Claude follows it more reliably - and enjoy being able to see your system for the first time.

Don't describe the path. Draw the map.

*The FlowBench paper is at arxiv.org/abs/2406.14884. The "lost in the middle" paper is at arxiv.org/abs/2307.03172.

I'm building instruction file governance at Reporails - this finding led to a new rule category (Context Quality) that I'll cover in the next post.*

Previous in series: The backbone.yml Pattern

Reporails: Copilot adapter, built with copilot, for copilot.

Gábor Mészáros — Mon, 16 Feb 2026 07:54:28 +0000

This is a submission for the GitHub Copilot CLI Challenge

What I Built

Reporails is a validator for AI agent instruction files: CLAUDE.md, AGENTS.md, copilot-instructions.md. It scores your files, tells you what's missing, and helps you fix it.

The project already supported Claude Code and Codex. For this challenge, I added GitHub Copilot CLI as a first-class supported agent - using Copilot CLI itself to build the adapter.

The architecture was already multi-agent by design. A .shared/ directory holds agent-agnostic workflows and knowledge. Each agent gets its own adapter that wires into the shared content. Claude does it through .claude/skills/, Copilot through .github/copilot-instructions.md.

Adding Copilot took 113 lines. Not because the work was trivial - but because the architecture was ready.

Repos:

CLI: reporails/cli (v0.3.0)
Rules: reporails/rules (v0.4.0)
Recommended: reporails/recommended (v0.2.0)

Demo

After adding Copilot support, each agent gets its own rule set with no cross-contamination:

Agent	Rules	Breakdown
Copilot	29	30 CORE - 1 excluded + 0 COPILOT-specific
Claude	39	30 CORE - 1 excluded + 10 CLAUDE-specific
Codex	37	30 CORE + 7 CODEX-specific

Run it yourself:

npx @reporails/cli check --agent copilot

My Experience with GitHub Copilot CLI

It understood the architecture immediately

I explained the .shared/ folder — that it was created specifically so both Claude and Copilot (and other agents) can reference the same workflows and knowledge without duplication. Copilot got it on the first exchange:

Copilot understanding .shared/ architecture

The key insight it surfaced: "The .shared/ content is already agent-agnostic. Both agents reference the same workflows. No duplication is needed - just different entry points."

That's exactly right. Claude reaches shared workflows through /generate-rule → .claude/skills/ → .shared/workflows/rule-creation.md. Copilot reads instructions → .shared/workflows/rule-creation.md. Same destination, different front doors.

What it built

Copilot created the full adapter in three phases:

Foundation - .github/copilot-instructions.md, agents/copilot/config.yml, updated backbone.yml, verified test harness supports --agent copilot
Workflow Wiring - entry points in copilot-instructions.md, context-specific conditional instructions, wired to .shared/workflows/ and .shared/knowledge/
Documentation - updated README and CONTRIBUTING with agent-agnostic workflow guidance

Copilot Contribution Parity Complete

The bug it found (well, helped find)

While testing the Copilot adapter, I discovered that the test harness had a cross-contamination bug. When running --agent copilot, it was testing CODEX rules too — because _scan_root() scanned ALL agents/*/rules/ directories indiscriminately.

The fix was three lines of Python:

# If agent is specified, only scan that agent's rules directory
if agent and agent_dir.name != agent:
    continue

Test Harness Agent Isolation Fix

The model selector surprise

When I opened the Copilot CLI model selector, the default model was Claude Sonnet 4.5. The irony of building a Copilot adapter using Copilot CLI running Claude was not lost on me.

What worked, honestly

Copilot CLI understood multi-agent architecture without hand-holding. It generated correct config files matching existing adapter patterns. The co-author signature was properly included in all commits. It didn't try to duplicate content that was already shared - it just wired the entry points.

The whole experience reinforced something I've been thinking about: the tool matters less than the architecture underneath. If your project is structured well, any competent agent can extend it. That's the whole point of reporails - making sure your instruction files are good enough that the agent can actually help you.

What also happened during this challenge

While building the Copilot adapter, I also rebuilt the entire rules framework from scratch. Went from 47 rules (v0.3.1) to 35 rules (v0.4.0) - fewer rules, dramatically higher quality. Every rule is now distinct, detectable, and backed by evidence. But that's a story for another post.

Try it: npx @reporails/cli check

GitHub | Previous posts

CLAUDE.md Best Practices: The backbone.yml Pattern

Gábor Mészáros — Tue, 10 Feb 2026 12:31:44 +0000

There's a Dutch scouting tradition called "dropping." Kids get driven to an unfamiliar forest at night - sometimes blindfolded - and have to find their way back to camp. It builds independence, problem-solving, resilience.

That's what most people do to their AI agents.

Drop them in a codebase. No orientation. Figure it out. (Veel succes en heel gezellig, as the Dutch would say.)

The difference is that, unlike people, the AI Agent memory goes as far as context allows.

find . -name "*.yml" -type f
grep -r "config" --include="*.md"
ls -la .claude/

The agent explores. Makes wrong assumptions. Gets corrected. Tries again. Eventually finds what it needs, or doesn't and quietly poison context.

I call this the exploration tax - the tokens and time spent orienting instead of working.

Give the agent a map

The fix is simple: one file that maps your project.

# backbone.yml
version: 1

structure:
  config: config/
  src: src/
  tests: tests/
  docs: docs/

conventions:
  test_pattern: "*.test.ts"
  config_format: yaml

boundaries:
  never_modify:
    - .env
    - migrations/
    - vendor/

That's enough to start. Claude reads this once and knows: config lives in config/, tests are *.test.ts, never touch .env or migrations/.

No more exploration loops. No more wrong guesses. No more "sorry, I thought the config was in the root directory."

Scaling up

As your project grows, so can your backbone. Here's what mine looks like for Reporails rules:

version: 3

agents:
  claude:
    main_instruction_file: CLAUDE.md
    config: agents/claude/config.yml
    skills: .claude/skills/
    tasks: .claude/tasks/
  codex:
    config: agents/codex/config.yml

rules:
  core: core/
  agents: agents/
  patterns:
    rule_dir: "{category}/{slug}/"
    definition: "rule.md"
    test_pass: "tests/pass/"
    test_fail: "tests/fail/"
  categories:
    structure: core/structure/
    content: core/content/
    efficiency: core/efficiency/
    maintenance: core/maintenance/

schemas:
  rule: schemas/rule.schema.yml
  capability: schemas/capability.schema.yml
  agent: schemas/agent.schema.yml

registry:
  capabilities: registry/capabilities.yml
  levels: registry/levels.yml
  coordinate_map: registry/coordinate-map.yml

Multiple agents, rule patterns, schemas, registries - all mapped. Claude can construct paths directly instead of exploring.

Wiring it up

The backbone file alone isn't enough - you need to tell Claude to use it. Add this to your CLAUDE.md:

## Initialization

Read these files before searching or modifying anything:

1. Read `backbone.yml` for project structure and path resolution
2. Read any registries or schemas referenced there as needed
3. Read `.claude/rules/` for context-specific constraints

## Structure

Defined in `backbone.yml` - the single source of truth for project topology.

**BEFORE** running `find`, `grep`, `ls`, or glob to locate project files, read `backbone.yml` first. All paths are mapped there. Do not use exploratory commands to discover paths that the backbone already provides.

This is the key: explicit instruction to read the map before exploring. Without it, Claude might still wander.

Why a separate file?

You could put all of this directly in your CLAUDE.md. But there's a tradeoff.

Everything in CLAUDE.md sits in the context window from the start - every session, every message, whether the agent needs it or not.

backbone.yml is read-on-demand. Claude doesn't load it at session start - it reads it when it would otherwise start exploring. The map replaces discovery, not adds to it.

There are also things a directory structure can't express:

Patterns. {category}/{slug}/rule.md isn't a folder - it's a convention.
Relationships. Which agent owns which config? What schema validates what file?
Boundaries. What's off-limits? What's deprecated?

Directories show what exists. backbone.yml shows how it fits together.

The cost of exploration

I tracked my Claude Code usage across 176 sessions. A significant chunk of friction came from wrong assumptions about project structure:

Used the wrong YAML library (PyYAML instead of ruamel.yaml)
Wrote changes to the wrong repo in a monorepo
Assumed directories existed that didn't
Missed config files that were right there

Each mistake costs tokens, time, and trust. The models are smart enough - the problem is orientation.

Where this fits

In my previous post, I introduced capability levels for instruction files:

L1-L2: CLAUDE.md exists, has basic constraints
L3: External references, multiple files
L4: Path-scoped rules that load conditionally
L5: backbone.yml - maintained structure, active upkeep
L6: Dynamic context, skills, MCP integration

Most setups stop at L2-3. The jump to L5 isn't about adding more rules - it's about making your existing setup navigable. backbone.yml is how you get there.

When to adopt this

Not every project needs it. Weekend hack? Basic CLAUDE.md is fine.

But if you notice:

Claude repeatedly exploring the same directories
Wrong assumptions about project structure
Corrections like "no, the config is in X, not Y"
Monorepo confusion about which repo to modify

...you're paying the exploration tax. A backbone file pays for itself in the first session.

Keep it accurate

A backbone.yml only works if it's true. Paths that don't resolve, patterns that don't match reality - those are worse than no map at all.

Structure that rots is worse than no structure.

Try it

Create backbone.yml in your project root
Map your directories, configs, conventions
Add the initialization section to your CLAUDE.md
Watch Claude stop guessing

I use this with Claude Code daily. The pattern should work for any agent that reads instruction files - Codex, Copilot, Cursor - though I haven't tested all of them. If you try it, let me know how it goes.

Don't drop your agent in the dark. Give it a map.

Reporails is where I'm building instruction file governance. The backbone.yml example above is from there.

CLAUDE.md best practices - From Basic to Adaptive

Gábor Mészáros — Tue, 03 Feb 2026 12:15:28 +0000

How do you learn new things as a developer?

My take on it is to find yourself an actual project (not tutorials) and start iterating. I wanted to learn LangGraph for my SageCompass project. SageCompass is a monorepo with LangGraph + Drupal (for RAG content management) and Gradio (for UI).

I iterated ... a LOT. A lot lot. A lot lot lot.

After 2 months of learning the principles of managing a python project and on top of that a LangGraph project, I felt ready to start using a coding agent (Codex at that time), to reduce refactoring times. As it turned out, coding agents are working significantly more reliably, if you have strong boundaries. I had my unit test structure hammered out, directives and contract were clear and strongly defined.

However

SageCompass is a monorepo. I needed some highly elevated AGENTS.md setup to manage all of them together. The LangGraph part? Tight. Contracts, test structure, clear boundaries - the agent barely needed hand-holding. The Drupal part? I've been working with Drupal for 17 years. I know what I need, but I hadn't written it down for an agent yet. The Gradio part? I was still learning it myself - how do you write instructions for something you don't fully understand yet?

I couldn't just have one big instruction file. Each component was at a different stage of readiness. Copy-pasting rules across them would have been worse than having no rules at all.

That's when it hit me: instruction setups have capability levels. And if they have levels, they can be measured. And if they can be measured, they can be improved systematically.

Emergence of capability levels

When I tried to port my LangGraph rules to the Gradio component, I needed to figure out which ones were universal and which ones were specific to a well-established, contract-heavy setup.

A rule like 'never commit .env files' applies everywhere. A rule like 'implement nodes as make_node* factories' is meaningless outside LangGraph.
That forced me to categorize. Not just what rules do, but what level of project capability they assume.

A basic project needs different instructions than one with enforced contracts and navigation maps.

What I found?

A starting point and Six levels. L1 to L6.

L0  Absent      → No instruction file (The starting point)
L1  Basic       → File exists, tracked
L2  Scoped      → Project-specific constraints  
L3  Structured  → External references, modular
L4  Abstracted  → Path-scoped loading
L5  Maintained  → Structural discipline
L6  Adaptive    → Dynamic context, skills, MCP

Here's what each one means in practice.

L0: Absent

No CLAUDE.md. No AGENTS.md. Nothing.

Claude works from its training data and whatever it can infer from your code. It'll guess your stack from package.json, maybe pick up patterns from existing files. But it has zero guidance about your preferences, constraints, or "never do this" rules.

For quick scripts or throwaway experiments, this is fine. For anything you'll maintain, you're probably leaving value on the table.

L1: Basic

your-project/
└── CLAUDE.md       ← exists

A file exists. It's tracked in git.

Content might be /init boilerplate — the auto-generated stuff Claude Code produces. Might be a few lines you wrote yourself. The point is you've acknowledged that Claude needs context, and you've given it somewhere to live.

This is the "I know this matters" stage. Most people get here quickly.

What changes: Claude has something project-specific. It knows this isn't just a random repo.

What's still missing: Rules. Claude knows about your project, but not your constraints.

L2: Scoped

# CLAUDE.md

## Project
E-commerce API, Node.js, PostgreSQL.

## Constraints
- MUST use TypeScript strict mode
- MUST NOT use `any` type  
- MUST run tests before committing
- NEVER modify migration files directly

Explicit constraints. MUSTs and MUST NOTs.

This is where you stop describing and start prescribing. Not just "here's what the project is" but "here's what you can and cannot do."

The language matters. "Prefer TypeScript" is a suggestion Claude might ignore. "MUST use TypeScript strict mode" is a rule it tends to follow.

For small projects with simple conventions, this is often enough. You have your rules in one place. Claude follows them. Life is reasonable.

What changes: Claude follows your rules, not just generic best practices.

What's still missing: Scale. When the file gets long, important stuff gets lost in the noise.

L3: Structured

# CLAUDE.md

See @docs/architecture.md for system overview.
See @docs/api-conventions.md for API patterns.

## Constraints
...

External references. Multiple files. Content split by concern.

You've hit the point where one file isn't working anymore. So you break it up. Architecture in one place. API conventions in another. Your CLAUDE.md becomes a router pointing to the right context.

This is also where team collaboration gets easier. Different people can own different files.

What changes: Separation of concerns. Easier to maintain. Each file has a job.

What's still missing: All files load regardless of what you're working on. Editing tests? Claude still loads your API conventions. Noisy.

L4: Abstracted

your-project/
├── CLAUDE.md
└── .claude/
    └── rules/
        ├── api-rules.md        # paths: src/api/**
        ├── frontend-rules.md   # paths: src/components/**
        └── test-rules.md       # paths: tests/**

Path-scoped loading. Different rules for different parts of the codebase.

Edit src/api/users.ts? Only API rules load. Edit tests/user.test.ts? Only test rules load.

This is where context efficiency gets real. You're not wasting tokens on irrelevant rules. Claude's attention stays on what matters for the task at hand.

How you implement this depends on the tool. Claude Code uses .claude/rules/ with frontmatter. Cursor uses .cursor/rules/. The concept is the same.

What changes: Claude adapts to what you're working on, not just what project you're in.

What's still missing: Maintenance. Structures rot. Rules go stale.

L5: Maintained

L4 with discipline.

Same structure, but with habits to keep it current:

A backbone file mapping the codebase, updated when things change
Some way to track what's stale
Regular reviews (however often makes sense for you)

The difference between L4 and L5 isn't features — it's upkeep. L4 is "I set this up." L5 is "I keep it working."

What changes: Reliability over time. The setup doesn't quietly rot.

What's still missing: Dynamic capabilities. Claude follows instructions but can't extend itself.

L6: Adaptive

your-project/
├── CLAUDE.md
├── .claude/
│   ├── rules/
│   └── skills/
│       ├── database-migrations/
│       │   └── SKILL.md
│       └── api-testing/
│           └── SKILL.md
└── mcp.json

Skills that load based on task. MCP servers for external integrations.

At this level, Claude doesn't just follow instructions — it loads capabilities. Working on migrations? The migration skill activates with its own context. Need to hit an external API? MCP handles it.

Very few setups are here yet. The tooling is new. The patterns are still emerging.

What changes: Claude extends its abilities based on what it detects you're doing.

Quick self-check

Where do you land?

Question	If yes...
Do you have any instruction file?	At least L1
Does it have explicit constraints (MUST/MUST NOT)?	At least L2
Do you use @imports or multiple files?	At least L3
Do different paths load different rules?	At least L4
Do you actively maintain the structure?	At least L5
Do you use skills or MCP?	L6

From what I've seen, most setups are L1 (Basic) or L2 (Scoped). Some reach L3 (Structured). L4 (abstracted) and above is rare - not because it's hard, but because the patterns aren't widely known yet.

Why bother with levels?

It's not about chasing a high score.

It's about having words for things.

"I'm at L2 (Scoped) and wondering if L4 (abstracted) is worth the effort" is a conversation you can actually have. "My CLAUDE.md is pretty good" isn't.

The right level depends on your project. A weekend hack doesn't need path scoping. A complex system with multiple domains probably does. The framework just helps you think about where you are and where you might want to go.

What I'm building

I'm working on a validator that uses this framework: detects your level, checks structure, score your setup. (If you run it from Claude Code CLI, it helps you fix issues too.)

It's early. Like, really early. I'm still working through core level implementations. But if you want to poke at it and tell me what's broken, I'd appreciate it:

Reporails CLI: github.com/reporails/cli

Or just use the levels as a mental model. That's the real value anyway.

CLAUDE.md: Check, Score, Improve & Repeat

Gábor Mészáros — Tue, 27 Jan 2026 08:54:55 +0000

The missing quality checker for AI instruction files.

You asked for a small refactor. A small(!) refactor.
Claude Code rewrote half the module.

"You're right, I apologize." "Let me fix that." "Sorry, I misunderstood." — on repeat.

So you open the CLAUDE.md. Then the rules. Then the SKILLS. Each is 400 lines at least. 24 files total.
You cross-reference the official docs, skim three "best practices" blog posts, dig through GitHub examples.

Hours of trial and error later, you do what any reasonable person would: you ask Claude to figure it out.

───────────────────────────────────────────────────────────────────
❯ review my CLAUDE.md and rules. Tell me what is wrong.
───────────────────────────────────────────────────────────────────

Burn ALL the tokens

Claude Code obliges. It reads all 24 files, their cross-referencing imports and all the additional relevant documentation. It neatly summarizes them. It suggests improvements, you accept them, it rewrites a few sections, adds here, removes there.

It burns tokens like kindling.

Your CLAUDE.md, rules, SKILLS got just a bit longer, but you're fine with that — at least it won't happen again... right? This is fine. Right?

Everything is fine

Forty minutes later, you have a slightly different mess and no idea if it's better. So you open CLAUDE.md ...

Lint the vibes!

Your code needs structure, types, format. It has tests, type checks and linters. Your AI instructions? Vibes.

Reporails helps with that.

claude mcp add reporails -- uvx --from reporails-cli ails-mcp

Then ask:

───────────────────────────────────────────────────────────────────
❯ what ails claude?
───────────────────────────────────────────────────────────────────

Runs deterministic checks and semantic validations. Produces actionable fixes Claude can apply.

It will tell you what can be done better

Just run

───────────────────────────────────────────────────────────────────
❯ fix the duplicate   
───────────────────────────────────────────────────────────────────

... and observe:

● I'll remove the duplicate from writing-rules.md since instruction-file-style.md already covers the "actionable content" requirement for .claude/rules/**.                  

● Update(.claude/rules/writing-rules.md)                                                                                                                                     
  ⎿  Removed 1 line                                                                                                                                                          
      26  - Use descriptive filenames (`api-validation.md` not `rules1.md`)                                                                                                  
      27  - Add `paths` frontmatter to reduce noise when not relevant
      28  - No paths = loads globally for all files
      29 -- Content MUST be actionable, not explanatory

● Done. Removed the duplicate "Content MUST be actionable, not explanatory" from writing-rules.md:29 — this is already covered by instruction-file-style.md:12 ("Every line
  MUST be actionable or informative") which applies to the same scope.

The result

No more vibes. No more gluing together best practices blog posts.

● Semantic Evaluation                                                                                                                                                        

  I already have context from the previous read. Evaluating the 9 candidates:

  M4 (Contradictions): 2 candidates - all passed

  C6 (Duplicates): 6 candidates - all passed (the duplicate I fixed is no longer flagged)

  G3 (Security ownership): 1 candidate - passed (organizational guidance, not a security rule)

  ---
  Final Score
  ┌────────────┬─────────────────────┐
  │   Metric   │        Value        │
  ├────────────┼─────────────────────┤
  │ Score      │ 8.3↑ / 10(complete) │
  ├────────────┼─────────────────────┤
  │ Capability │ Governed (L5+)      │
  ├────────────┼─────────────────────┤
  │ Violations │ 21 (↓2 from 23)     │
  └────────────┴─────────────────────┘
  The C6 duplicate between instruction-file-style.md and writing-rules.md is resolved.

✻ Cogitated for 32s

Reporails v0.0.1

What took hours now takes minutes.

42 rules. Documented. Open source. Easy to extend.

PyPI: reporails-cli

GitHub: CLI & MCP

GitHub: Rules

CLAUDE.md is just the start. More agents coming soon.

From Prompt to Platform: Architecture Rules I Use

Gábor Mészáros — Tue, 20 Jan 2026 07:36:43 +0000

The "build -> surprise -> restructure -> repeat" loop is amazing early on. However, after a while it's like two clowns trying to out-prank each other: it gets funnier and funnier, lots of laughs... until one of them pulls out a flamethrower for one last prank and the laughter gets a little awkward.

This type of iteration is fun until it isn't. So I went looking for guidance.

Experiences With LangGraph Tutorials

Most examples show you how to build a graph. Define some nodes. Wire them together. Ship it.

Great for prototyping.

They don't show you where to put things when you have:

8 nodes
3 agents
5 tools
Shared state across subgraphs
Middleware for guardrails
A platform layer that stays framework-independent

I searched. Found bits and pieces, but no complete picture. So I built it.

A Folder Structure That Scales

Here's what my LangGraph component looks like:

app/
├── agents/           # Agent factories (build_agent_*)
├── graphs/           # Graph definitions (main, subgraphs, phases)
├── nodes/            # Node factories (make_node_*)
├── states/           # Pydantic state models
├── tools/            # Tool definitions
├── middlewares/      # Cross-cutting concerns (guardrails, redaction)
└── platform/
    ├── core/         # Pure types, contracts, policies (no wiring)
    │   ├── contract/ # Validators: state, tools, prompts, phases
    │   ├── dto/      # Pure data transfer objects
    │   └── policy/   # Pure decision logic
    ├── adapters/     # Boundary translation (DTOs ↔ State)
    ├── runtime/      # Evidence hydration, state helpers
    ├── config/       # Environment, paths
    └── observability/# Logging

Why this structure?

It mirrors LangGraph's mental model: agents are agents; nodes are nodes; graphs are graphs. In the orchestration layer, things are easy to find and responsibilities stay separated.

But the real insight is the platform/ layer.

The Platform Layer: Why It Exists

While separating the LangGraph components was easy, separating the wiring was hard. The structure didn't appear on day one. It emerged after a number of iterations - each cycle surfaced a different missing architectural rule, whose absence made refactors rapidly more difficult with every new component.

Without architectural rules, everything gets spaghettified:

# WITHOUT PLATFORM LAYER - Everything mixed together
def problem_framing_node(state: SageState) -> Command:
    # Guardrail logic mixed with state management
    if "unsafe" in state.messages[-1].content:
        state.gating.guardrail = GuardrailResult(is_safe=False, ...)

    # Evidence hydration mixed with node orchestration  
    store = get_store()
    for item in phase_entry.evidence:
        doc = store.get(item.namespace, item.key)
        # ... inline hydration logic

    # Validation mixed with execution
    if "problem_framing" not in state.phases:
        raise ValueError("Invalid state update!")

    # ... good luck writing tests for it!

With the platform layer, concerns are separated:

# WITH PLATFORM LAYER - Clean separation
def problem_framing_node(state: SageState) -> Command:
    # Use platform contracts for validation
    validate_state_update(update, owner="problem_framing")

    # Use platform runtime helpers for evidence
    bundle = collect_phase_evidence(state, phase="problem_framing")

    # Use platform policies for decisions
    guardrail = evaluate_guardrails(user_input)

    # Use adapters for state translation
    context = guardrail_to_gating(guardrail, user_input)

    # Node only orchestrates - all logic in platform!

The node becomes what it should be: orchestration only. No domain logic. No direct store access. No inline validation.

The Hexagonal Split

The pattern that solved it: hexagonal architecture. Core stays pure - no framework dependencies, no imports from the layers above. Everything else can depend on Core, but Core depends on nothing. This makes the boundaries testable and the rules enforceable.

┌─────────────────────────────────────────────────────────┐
│                    APPLICATION LAYER                    │
│  (app/nodes, app/graphs, app/agents, app/middlewares)   │
│  - LangGraph orchestration                              │
│  - Calls platform services via contracts                │
└───────────────────────────┬─────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│                    PLATFORM LAYER                       │
│  ┌───────────┐ ┌───────────┐ ┌─────────┐ ┌───────────┐  │
│  │  Adapters │ │  Runtime  │ │ Config  │ │Observabil.│  │
│  │DTO<->State│ │  helpers  │ │env/paths│ │  logging  │  │
│  └─────┬─────┘ └─────┬─────┘ └────┬────┘ └─────┬─────┘  │
│        │             │            │            │        │
│        └─────────────┴──────┬─────┴────────────┘        │
│                             ▼                           │
│  ┌────────────────────────────────────────────────────┐ │
│  │  Core (PURE - no framework dependencies)           │ │
│  │  - Contracts and validators                        │ │
│  │  - Policy evaluation (pure functions)              │ │
│  │  - DTOs (frozen dataclasses)                       │ │
│  └────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘

The rule: core/ has NO imports from anything above it - no app orchestration (agents, nodes, graphs, etc.), no wiring, no adapters. Dependencies point inward only.

This isn't just a guideline. It's enforced.

How to enforce a guideline?

Simple: write a test for it that would catch the violation:

# tests/unit/architecture/test_core_purity.py

FORBIDDEN_IMPORTS = [
    "app.state",
    "app.graphs", 
    "app.nodes",
    "app.agents",
    # ... all app orchestration and platform wiring
]

def test_core_has_no_forbidden_imports():
    """Core layer must remain pure - no wiring dependencies."""
    core_files = Path("app/platform/core").rglob("*.py")

    for file in core_files:
        content = file.read_text()
        for forbidden in FORBIDDEN_IMPORTS:
            assert forbidden not in content, (
                f"{file} imports {forbidden} - core must stay pure"
            )

If you break the boundary, test fails. No exceptions.

Beyond guidelines, you can also define contracts that validate at runtime.

Contracts That Validate

The core/contract/ directory contains validators that enforce contract rules at runtime:

Contract	What it does
`validate_state_update()`	Restricts mutations to authorized owners
`validate_structured_response()`	Forces validation before persisting
`validate_phase_registry()`	Ensures phase keys match declared schemas
`validate_allowlist_contains_schema()`	Ensures tool allowlist correctness

These aren't optional - every node calls them:

# Every state update goes through the contract
update = {"phases": {phase_key: phase_entry}}
validate_state_update(update, owner="problem_framing")
return Command(update=update, goto=next_node)

The contracts themselves are also tested - validation logic, phase dependencies, invalidation cascades. See test_state.py for the full suite.

Test Structure That Scales

Tests are organized by type (unit, integration, e2e) and category (architecture, orchestration, platform). This makes coverage gaps obvious and lets you run targeted subsets.

tests/
├── unit/
│   ├── architecture/      # Boundary enforcement
│   │   ├── test_core_purity.py
│   │   ├── test_adapter_boundary.py
│   │   └── test_import_time_construction.py
│   ├── orchestration/     # Agents, nodes, graphs
│   └── platform/          # Core + adapters
├── integration/
│   ├── orchestration/
│   └── platform/
└── e2e/

With pytest markers:

# pyproject.toml
# Test markers for categorizing tests by purpose and scope
markers = [
  # Test Type Markers (by scope)
  "unit: Fast, isolated tests with no external dependencies",
  "integration: Tests crossing component boundaries (may use test fixtures)",
  "e2e: End-to-end workflow tests (full pipeline validation)",

  # Test Category Markers (organizational categories)
  "architecture: Hexagonal architecture enforcement (import rules, layer boundaries)",
  "orchestration: LangGraph orchestration components (agents, nodes, graphs, middlewares, tools)",
  "platform: Platform layer tests (hexagonal architecture - core, adapters, runtime)",
]

Run unit architecture tests alone: uv run pytest -m "unit and architecture"

The architecture is validated by 110 tests - 11 of which specifically enforce architecture boundaries.

What This Enables

Here's where it gets interesting.

You might be thinking: cool story, but...

Because when your architecture is predictable and enforceable, something curious happens: coding agents stop being a liability and start being useful.

When every node follows the same pattern...
When every state update goes through a validator...
When every boundary is well-defined and tested...

...an AI agent can't accidentally break your architecture without the tests catching it. It can't import forbidden modules. It can't skip validation. It can't bypass the contracts - not without failing the test suite.

The rules become more than just documentation. They're guardrails for both humans and AI.

Want the Full Thing?

46 architecture principles (tiered): Architecture principles
Platform contracts README: Platform contracts
Architecture tests: Architecture tests

Next up

What happens when you point Claude Code at an architecture it can't break.

The CLAUDE.md file isn't just a conglomerate of instructions - it's a contract that preserves context and enforces boundaries during development.

I built a framework for it with measurable results.

Coming next: The CLAUDE.md Maturity Model.

This is part of my "From Prompt to Platform" series documenting the SageCompass build. Start from the prologue

Forem: Gábor Mészáros

The State of AI Instruction Quality

The dataset

How we measured

Finding 1: Most of your instruction file isn't instructions

Finding 2: 90% of instructions don't name what they're talking about

Finding 3: agents.md is the most common instruction file

Finding 4: Different agents, completely different config philosophies

Finding 5: 37% of projects configure multiple agents

Finding 6: The most-copied skills are the vaguest

The best and worst skills (>50 repos)

Finding 7: Sub-agents are almost entirely persona prompts

The anatomy chart: more directives, worse quality

Limitations

What this means

Try it yourself

The dataset

The Undiagnosed Input Problem

The symptom

The industry response

The folk system

The gap

What we measured

More rules, lower quality

The sweet spot

A real example

What this means

What’s next

Do NOT Think of a Pink Elephant

The pink elephant

The anatomy of a complete instruction

The part nobody expects

What this looks like in practice

Try it

Instruction Best Practices: Precision Beats Clarity

The experiment

Scope it or drop it

Name the thing

The compound effect

When to adopt this

Try it

CLAUDE.md Best Practices: 7 formatting rules for the Machine

The comparison

It's not just about length

Seven structural rules

1. Include rationale

2. Keep heading hierarchy shallow

3. Name files descriptively

4. Use headers

5. Put commands in code blocks

6. Use standard section names

7. Make instructions actionable

The compound effect

When to reformat

The connection to /bootstrap

Try it

Why /bootstrap should be the first Command in every Agent session

The resume illusion

Two primitives I already had

What /bootstrap does

Seed mode: the smart first run

The skill structure

Cross-agent by design

What changes after bootstrap

The progression

Try it

CLAUDE.md Best Practices: Mermaid for Workflows

The research

What to convert

Before and after

The combo: diagram + prose

When to adopt this

Try it

Reporails: Copilot adapter, built with copilot, for copilot.

What I Built

Demo

My Experience with GitHub Copilot CLI

It understood the architecture immediately

What it built

The bug it found (well, helped find)

Finding 3: `agents.md` is the most common instruction file