<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Shinsuke KAGAWA</title>
    <description>The latest articles on Forem by Shinsuke KAGAWA (@shinpr).</description>
    <link>https://forem.com/shinpr</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3448941%2F612feab1-a03c-49be-b329-ae74d583329c.jpg</url>
      <title>Forem: Shinsuke KAGAWA</title>
      <link>https://forem.com/shinpr</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/shinpr"/>
    <language>en</language>
    <item>
      <title>I Built a Skill Reviewer. Then I Ran It on Itself.</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Thu, 02 Apr 2026 11:44:56 +0000</pubDate>
      <link>https://forem.com/shinpr/i-built-a-skill-reviewer-then-i-ran-it-on-itself-4m4j</link>
      <guid>https://forem.com/shinpr/i-built-a-skill-reviewer-then-i-ran-it-on-itself-4m4j</guid>
      <description>&lt;p&gt;I built a tool that reviews Claude Code skills for quality issues.&lt;/p&gt;

&lt;p&gt;Then I pointed it at its own source files. It found real problems.&lt;/p&gt;

&lt;p&gt;The irony wasn't lost on me. But the more interesting question is: why did this happen, and what does it tell us about how LLM-based quality tools actually work?&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I maintain &lt;a href="https://github.com/shinpr/rashomon" rel="noopener noreferrer"&gt;rashomon&lt;/a&gt;, a Claude Code plugin for prompt and skill optimization. It includes a skill reviewer agent that evaluates skill files against 8 research-backed patterns (BP-001 through BP-008) and 9 editing principles.&lt;/p&gt;

&lt;p&gt;One of those patterns—BP-001—says: &lt;strong&gt;don't write instructions in negative form.&lt;/strong&gt; Research shows LLMs often fail to follow "don't do X" instructions—negated prompts actually cause &lt;a href="https://arxiv.org/abs/2209.12711" rel="noopener noreferrer"&gt;inverse scaling&lt;/a&gt;, where larger models perform &lt;em&gt;worse&lt;/em&gt;. The fix is to rewrite them positively: instead of "don't skip P1 issues," write "evaluate all P1 issues in every review mode."&lt;/p&gt;

&lt;p&gt;Simple enough.&lt;/p&gt;

&lt;p&gt;Except both my agent definition files had a section called &lt;code&gt;## Prohibited Actions&lt;/code&gt; full of "don't" instructions.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Discovery
&lt;/h2&gt;

&lt;p&gt;I noticed this by reading my own code. But I wanted to see what happens when the tools catch it—or don't.&lt;/p&gt;

&lt;p&gt;First, I ran the &lt;strong&gt;prompt-analyzer&lt;/strong&gt; agent against both files. It analyzed them, found some issues, but gave the &lt;code&gt;Prohibited Actions&lt;/code&gt; sections a pass. Its reasoning: these qualify as "safety-critical" exceptions to BP-001, since they constrain "destructive" behaviors.&lt;/p&gt;

&lt;p&gt;That felt off. "Don't invent issues not supported by BP patterns" isn't a safety-critical instruction. It's a quality policy. The caller can override or discard the output.&lt;/p&gt;

&lt;p&gt;So I ran the &lt;strong&gt;skill-reviewer&lt;/strong&gt; agent against the same two files. The results were more interesting.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;skill-reviewer.md&lt;/code&gt; (reviewing itself), it flagged all four items in Prohibited Actions as BP-001 violations—P2 severity. Correct call.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;skill-creator.md&lt;/code&gt; (reviewing the other agent), it gave Prohibited Actions a pass. Same structure, same pattern, opposite judgment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The same reviewer, applying the same criteria, reached opposite conclusions on the same construct.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Digging Into Logs
&lt;/h2&gt;

&lt;p&gt;I could have speculated about why. Instead, I checked the subagent conversation logs.&lt;/p&gt;

&lt;p&gt;The skill-creator review log showed this in the Step 1 pattern scan:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;BP-001 (Negative Instructions)&lt;/strong&gt;: Lines 197-202 "Prohibited Actions" section uses negative form. However, per the BP-001 exception in skills.md, these are procedural/irreversible consequences (inventing knowledge, removing examples, overwriting files). &lt;strong&gt;The exception applies.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It did scan for BP-001. It found the section. But it classified the items as "irreversible consequences" and applied the exception.&lt;/p&gt;

&lt;p&gt;The problem was clear: the exception rule said negative form is okay for "safety-critical operations, destructive actions, or order-dependent procedures." That's vague enough to stretch. "Inventing domain knowledge" sounds serious. "Removing user-provided examples" sounds destructive. If you squint, anything can be "destructive."&lt;/p&gt;

&lt;p&gt;Nothing was wrong with the reviewer. It was doing exactly what I told it to do. That was the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Fixing the Criteria, Not the Reviewer
&lt;/h2&gt;

&lt;p&gt;The instinct is to blame the LLM: "it self-justified," "it was biased toward leniency." But the actual cause was simpler: &lt;strong&gt;the exception rule was written in a way that allowed two reasonable readings.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The fix wasn't to make the reviewer "smarter." It was to make the criteria harder to misread.&lt;/p&gt;

&lt;p&gt;I replaced the broad exception language:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Exception: safety-critical operations, exact command sequences,
destructive actions, or order-dependent procedures
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With a 4-condition checklist:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Exception: Negative form is permitted only when ALL are true:
(1) Violation destroys state in a single step
(2) Caller or subsequent steps cannot normally recover
(3) The constraint is operational/procedural, not a quality policy
(4) Positive rewording would expand or blur the target scope
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And added concrete boundary examples—what qualifies, what doesn't:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Permitted (exception applies)&lt;/th&gt;
&lt;th&gt;Not permitted (rewrite positively)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"Do not modify the command"&lt;/td&gt;
&lt;td&gt;"Do not invent issues" -&amp;gt; "Base every issue on BP patterns"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Do not add flags"&lt;/td&gt;
&lt;td&gt;"Do not skip P1 issues" -&amp;gt; "Evaluate all P1 in every mode"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Do not execute destructive operations"&lt;/td&gt;
&lt;td&gt;"Do not create overlapping skills" -&amp;gt; "Verify no overlap before generating"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key addition: &lt;strong&gt;"Outputs that the caller validates, overwrites, or discards are never irreversible."&lt;/strong&gt; This one sentence eliminates most of the ambiguity. A subagent's output goes to a caller. The caller decides what to do with it. That's not irreversible.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Retest
&lt;/h2&gt;

&lt;p&gt;After updating the criteria, I ran the skill-reviewer again on both files.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;skill-reviewer.md&lt;/code&gt;: Prohibited Actions flagged as BP-001 P2. All four items caught.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;skill-creator.md&lt;/code&gt;: Two items flagged as quality policies that should be positive form. The remaining items—which are genuinely about operational constraints—were accepted.&lt;/p&gt;

&lt;p&gt;Consistent. Explainable. And the reviewer could now articulate &lt;em&gt;why&lt;/em&gt; each item was or wasn't an exception, because the criteria forced it to check specific conditions rather than make a gestalt judgment.&lt;/p&gt;

&lt;p&gt;But I wasn't fully satisfied. In a further round of testing, the reviewer still occasionally applied exceptions loosely—recording "irreversible" in the justification field without explaining &lt;em&gt;how&lt;/em&gt; it's irreversible.&lt;/p&gt;

&lt;p&gt;So I added structured evidence to the output schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"patternExceptions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pattern"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BP-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"section heading"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"original"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"quoted text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"conditions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"singleStepDestruction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true|false + evidence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"callerCannotRecover"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true|false + evidence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"operationalNotPolicy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true|false + evidence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"positiveFormBlursScope"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true|false + evidence"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can't just write "irreversible" anymore. You have to answer four yes/no questions with evidence. If any answer is no, it's not an exception.&lt;/p&gt;
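&lt;p&gt;Enforcing that mechanically is straightforward. Here is a minimal sketch of a validator for such entries: the condition keys mirror the schema above, but the "true/false followed by evidence" parsing convention and everything else is my own illustration, not rashomon's actual code.&lt;/p&gt;

```python
# Hypothetical validator for patternExceptions entries. Condition keys follow
# the schema shown above; the "verdict + evidence" string convention is assumed.

CONDITION_KEYS = [
    "singleStepDestruction",
    "callerCannotRecover",
    "operationalNotPolicy",
    "positiveFormBlursScope",
]

def exception_applies(entry: dict) -> bool:
    """Return True only if all four conditions are 'true' AND carry evidence."""
    conditions = entry.get("conditions", {})
    for key in CONDITION_KEYS:
        value = conditions.get(key, "")
        verdict, _, evidence = value.partition(" ")
        # A bare "true" with no evidence does not count -- the whole point
        # is to force the reviewer to justify each answer.
        if verdict != "true" or not evidence.strip():
            return False
    return True

entry = {
    "pattern": "BP-001",
    "location": "section heading",
    "original": "Do not invent issues",
    "conditions": {
        "singleStepDestruction": "false output goes to the caller first",
        "callerCannotRecover": "false caller can discard the report",
        "operationalNotPolicy": "false this is a quality policy",
        "positiveFormBlursScope": "false positive rewording is narrower",
    },
}
print(exception_applies(entry))  # -> False
```

If any condition is false, or true but unjustified, the exception simply does not apply.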




&lt;h2&gt;
  
  
  What This Comes Down To
&lt;/h2&gt;

&lt;p&gt;The criteria had a loophole wide enough to drive a truck through. Better criteria produced better reviews without changing the reviewer at all. The LLM wasn't "inconsistent"—the instructions were ambiguous. Two reasonable people could have read the old exception rule and reached different conclusions too.&lt;/p&gt;

&lt;p&gt;Structured output helped more than I expected. The 4-condition checklist wasn't just about auditability—it changed how the reviewer thinks. When you have to fill in four fields with evidence, you can't hand-wave. The output structure becomes a thinking scaffold.&lt;/p&gt;

&lt;p&gt;And running the tool on its own source files was uncomfortable in a useful way. The temptation is to say "well, I know what I meant." But the tool doesn't know what I meant. It reads what I wrote.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Broader Problem: Skill Quality Is Hard
&lt;/h2&gt;

&lt;p&gt;If you're building Claude Code skills, custom agents, or any kind of structured LLM instruction set—you've probably experienced this: the instructions work fine in your head, but the LLM does something unexpected. You add more instructions. It gets worse. You simplify. Something else breaks.&lt;/p&gt;

&lt;p&gt;The issue is that &lt;strong&gt;you can't see your own blind spots.&lt;/strong&gt; You know what you meant. The LLM reads what you wrote. The gap between intent and text is where bugs live.&lt;/p&gt;

&lt;p&gt;This is why I built &lt;a href="https://github.com/shinpr/rashomon" rel="noopener noreferrer"&gt;rashomon&lt;/a&gt;. It includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Skill review&lt;/strong&gt;: Evaluate skill files against the BP-001 through BP-008 patterns and 9 editing principles, with structured quality grades&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Golden scenario evaluation&lt;/strong&gt;: Test whether a skill actually &lt;em&gt;works&lt;/em&gt; by comparing execution results with and without the skill, or before and after changes—not just whether it was loaded, but whether it made a measurable difference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The golden scenario part matters. "The skill was loaded" doesn't mean "the skill helped." You need to see the actual output difference to know if your skill is doing anything useful.&lt;/p&gt;
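&lt;p&gt;One way to picture that comparison, as a minimal sketch (here &lt;code&gt;run_scenario&lt;/code&gt; is a hypothetical stand-in for whatever actually invokes the agent, not rashomon's real API):&lt;/p&gt;

```python
# Golden-scenario idea in miniature: run the same scenario with and without
# the skill, then diff the outputs. run_scenario is a stand-in; a real
# harness would invoke the agent here.
import difflib

def run_scenario(prompt: str, skill_loaded: bool) -> str:
    # Stand-in responses for illustration only.
    if skill_loaded:
        return "Scale: Medium (4 files)\nADR: not required\n"
    return "Looks like a medium-ish change.\n"

def measurable_difference(prompt: str) -> list[str]:
    """Diff the with-skill and without-skill outputs for one golden scenario."""
    without = run_scenario(prompt, skill_loaded=False).splitlines()
    with_skill = run_scenario(prompt, skill_loaded=True).splitlines()
    return list(difflib.unified_diff(without, with_skill, lineterm=""))

diff = measurable_difference("Add a login screen")
# An empty diff would mean the skill made no observable difference.
print("\n".join(diff))
```

An empty diff is the failure signal: the skill loaded but changed nothing.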




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/shinpr/rashomon" rel="noopener noreferrer"&gt;Rashomon&lt;/a&gt; is a Claude Code plugin. Install it and point the skill reviewer at your own skills.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In Claude Code&lt;/span&gt;
/plugin marketplace add shinpr/rashomon
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;rashomon@rashomon
&lt;span class="c"&gt;# Restart session to activate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will find problems. I know because it found problems in itself—and it's better for it now.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your experience with skill quality? Have you found ways to validate that your instructions actually do what you think they do?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>promptengineering</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Same Framework, Different Engine: Porting AI Coding Workflows from Claude Code to Codex CLI</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Wed, 18 Mar 2026 11:50:48 +0000</pubDate>
      <link>https://forem.com/shinpr/same-framework-different-engine-porting-ai-coding-workflows-from-claude-code-to-codex-cli-n3p</link>
      <guid>https://forem.com/shinpr/same-framework-different-engine-porting-ai-coding-workflows-from-claude-code-to-codex-cli-n3p</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I built a &lt;a href="https://dev.to/shinpr/zero-context-exhaustion-building-production-ready-ai-coding-teams-with-claude-code-sub-agents-31b"&gt;sub-agent workflow framework for Claude Code&lt;/a&gt; that solved context exhaustion through specialized agents and structured workflows&lt;/li&gt;
&lt;li&gt;For 8 months, Codex CLI had no sub-agents — the framework was Claude Code-only&lt;/li&gt;
&lt;li&gt;Codex finally shipped sub-agent support — I expected days of migration, it took an afternoon&lt;/li&gt;
&lt;li&gt;What surprised me most: &lt;strong&gt;if you design workflows around agent roles and context separation rather than tool-specific features, your investment survives platform shifts&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The 8-Month Wait
&lt;/h2&gt;

&lt;p&gt;Back in July 2025, I released the &lt;a href="https://github.com/shinpr/ai-coding-project-boilerplate/commit/1a9191dd37e90c7d463f9a26b3a6edf01236d4f2" rel="noopener noreferrer"&gt;first version of this workflow&lt;/a&gt; as a Claude Code boilerplate. By October 2025, it had evolved into a &lt;a href="https://github.com/shinpr/claude-code-workflows/commit/8869e32eeff9d45568a7ca3017688fffdac7e254" rel="noopener noreferrer"&gt;full sub-agent framework&lt;/a&gt; — specialized agents for every phase of development, from requirements analysis through TDD implementation to quality gates. The idea was pretty simple: break complex coding tasks into specialized roles (requirement analyzer, technical designer, task executor, quality fixer...), give each agent a fresh context, and orchestrate them through structured handoffs. No single agent ever hits the context ceiling because no single agent tries to do everything.&lt;/p&gt;

&lt;p&gt;The problem? &lt;strong&gt;Codex CLI had no sub-agent capability.&lt;/strong&gt; Codex had been around since &lt;a href="https://openai.com/index/introducing-codex/" rel="noopener noreferrer"&gt;mid-2025&lt;/a&gt;, and I wanted the same workflow there too. So I kept trying to bridge the gap.&lt;/p&gt;

&lt;p&gt;First, I built an &lt;a href="https://github.com/shinpr/sub-agents-mcp" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt; in August 2025 that let any MCP-compatible tool — Codex, Cursor, whatever — define and spawn sub-agents through a standard protocol. It worked, but MCP added a layer of indirection that wasn't there in Claude Code's native sub-agents.&lt;/p&gt;

&lt;p&gt;Then in December 2025, Codex &lt;a href="https://community.openai.com/t/skills-for-codex-experimental-support-starting-today/1369367" rel="noopener noreferrer"&gt;shipped experimental Agent Skills support&lt;/a&gt;. I saw an opening and built &lt;a href="https://github.com/shinpr/sub-agents-skills" rel="noopener noreferrer"&gt;sub-agents-skills&lt;/a&gt; — cross-LLM sub-agent orchestration packaged as Agent Skills, routing tasks to Codex, Claude Code, Cursor, or Gemini. Closer, but still not native sub-agents.&lt;/p&gt;

&lt;p&gt;Through all of this, my main development stayed on Claude Code. The context separation and the small context windows of the time made it the clear choice for serious work. Codex filled a supporting role — I used it for skills refinement and as an objective reviewer on complex implementations, a fresh set of eyes from a different LLM.&lt;/p&gt;

&lt;p&gt;I don't use hooks extensively — I prefer keeping tasks small and baking quality gates into the completion criteria themselves. So what I was really waiting for was native sub-agent support in Codex, which would let the full orchestration workflow run without workarounds.&lt;/p&gt;

&lt;p&gt;On March 16, 2026, Codex CLI &lt;a href="https://developers.openai.com/codex/subagents" rel="noopener noreferrer"&gt;shipped sub-agent support&lt;/a&gt;. During pre-release validation, I noticed something encouraging: Codex followed the workflow stopping points more strictly than expected. If the behavior stabilizes, it could be a viable primary development tool, not just a supporting one.&lt;/p&gt;

&lt;p&gt;The port took almost no effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "Near-Zero Migration" Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;When I say "the same framework," I mean it. The core architecture didn't change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Request
    ↓
requirement-analyzer → scale determination [STOP for confirmation]
    ↓
technical-designer → Design Doc
    ↓
document-reviewer [STOP for approval]
    ↓
work-planner → phased task breakdown [STOP]
    ↓
task-decomposer → atomic task files
    ↓
Per-task 4-step cycle:
  task-executor → escalation check → quality-fixer → git commit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;22 sub-agents. 26 skills. The same stopping points, the same quality gates, the same TDD enforcement.&lt;/p&gt;

&lt;p&gt;What changed was the &lt;strong&gt;container format&lt;/strong&gt;, not the content:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Codex CLI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Agent definitions&lt;/td&gt;
&lt;td&gt;Markdown with YAML frontmatter (&lt;code&gt;agents/*.md&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;TOML files (&lt;code&gt;.codex/agents/*.toml&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skills location&lt;/td&gt;
&lt;td&gt;&lt;code&gt;skills/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.agents/skills/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool declarations&lt;/td&gt;
&lt;td&gt;Explicit in frontmatter (&lt;code&gt;tools: Read, Grep, Glob...&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Not needed (inferred from sandbox mode)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skill references&lt;/td&gt;
&lt;td&gt;Comma-separated names&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[[skills.config]]&lt;/code&gt; arrays&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config directory&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.claude/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.codex/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's it. The agent instructions — the actual substance of what each agent knows and does — are the same. The workflow logic is the same. The quality criteria are the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Worked: Design Decisions That Paid Off
&lt;/h2&gt;

&lt;p&gt;It worked because of three design choices I made early on:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Natural Language as the Interface Layer
&lt;/h3&gt;

&lt;p&gt;Every sub-agent's behavior is defined in natural language instructions, not in platform-specific tool calls. The requirement-analyzer isn't wired to Claude Code's &lt;code&gt;Agent&lt;/code&gt; tool or Codex's &lt;code&gt;spawn_agent&lt;/code&gt; — it follows a written protocol: "Extract task type, determine scale (1-2 files = Small, 3-5 = Medium, 6+ = Large), identify ADR necessity, output structured JSON."&lt;/p&gt;

&lt;p&gt;This means the instructions work on any LLM-powered agent system that can read text and follow procedures. In practice, that turned out to be enough. The framework is fundamentally &lt;strong&gt;a set of well-written job descriptions&lt;/strong&gt;, not a set of API integrations.&lt;/p&gt;
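&lt;p&gt;That written protocol is portable precisely because it could just as well be code. A sketch of the same scale rule, with illustrative output fields (not the framework's fixed schema):&lt;/p&gt;

```python
# The scale rule quoted above: 1-2 files = Small, 3-5 = Medium, 6+ = Large.
# The surrounding "analysis" dict is illustrative, not a fixed output schema.

def determine_scale(files_affected: int) -> str:
    """Map an estimated file count to a scale, per the written protocol."""
    if files_affected <= 2:
        return "Small"
    if files_affected <= 5:
        return "Medium"
    return "Large"

analysis = {
    "taskType": "feature",      # illustrative field
    "scale": determine_scale(4),
    "adrRequired": False,
}
print(analysis["scale"])  # -> Medium
```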

&lt;h3&gt;
  
  
  2. Context Separation as Architecture
&lt;/h3&gt;

&lt;p&gt;The core insight from the &lt;a href="https://dev.to/shinpr/zero-context-exhaustion-building-production-ready-ai-coding-teams-with-claude-code-sub-agents-31b"&gt;original article&lt;/a&gt; still applies: each agent runs in a fresh context without inheriting bias from previous steps. The document-reviewer doesn't know what the technical-designer was "thinking" — it just reviews the output. The investigator explores without confirmation bias from whoever reported the bug.&lt;/p&gt;

&lt;p&gt;This isn't a Claude Code feature or a Codex feature. It's an &lt;strong&gt;architectural pattern&lt;/strong&gt; that happens to be implementable on both platforms once they support sub-agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Structured Handoffs Over Shared State
&lt;/h3&gt;

&lt;p&gt;Agents communicate through artifacts (documents, JSON outputs, task files), not through shared memory or conversation threading. The technical-designer writes a Design Doc. The work-planner reads that Design Doc. Neither needs to know which platform spawned the other.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;docs/
├── prd/          # PRD artifacts
├── adr/          # Architecture decision records
├── design/       # Design documents
├── plans/        # Work plans
│   └── tasks/    # Atomic task files (1 commit each)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file-based protocol turned out to be surprisingly platform-agnostic.&lt;/p&gt;
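&lt;p&gt;A toy version of that handoff shows why it is platform-agnostic. The paths mirror the &lt;code&gt;docs/&lt;/code&gt; layout above; the content and function names are illustrative:&lt;/p&gt;

```python
# Toy file-based handoff: the "designer" writes an artifact, the "planner"
# reads it back. Neither needs to know what spawned the other.
from pathlib import Path
import tempfile

def designer_writes(root: Path) -> Path:
    doc = root / "docs" / "design" / "login-design.md"
    doc.parent.mkdir(parents=True, exist_ok=True)
    doc.write_text("# Design Doc\n\nContract: { ok: true } on success\n")
    return doc

def planner_reads(doc: Path) -> str:
    # The planner needs only the artifact -- not the designer's conversation.
    return doc.read_text()

with tempfile.TemporaryDirectory() as tmp:
    doc = designer_writes(Path(tmp))
    print(planner_reads(doc).splitlines()[0])  # -> # Design Doc
```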

&lt;h2&gt;
  
  
  The Framework in Action
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/shinpr/codex-workflows" rel="noopener noreferrer"&gt;codex-workflows&lt;/a&gt; ships as an npm package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx codex-workflows &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This drops the full agent and skill infrastructure into your project. From there, you invoke recipe skills that orchestrate the sub-agents:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Recipe&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$recipe-implement&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full lifecycle: requirements → design → implementation → quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$recipe-design&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Requirements → design documents only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$recipe-build&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Execute pre-planned tasks autonomously&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$recipe-diagnose&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Problem investigation → root cause → solutions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$recipe-front-build&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;React/TypeScript implementation with frontend-specific agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$recipe-fullstack-implement&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cross-layer features with layer-aware task routing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each recipe automatically loads the right foundational skills (coding standards, testing principles, documentation criteria) and spawns the right sequence of agents with the right stopping points.&lt;/p&gt;

&lt;p&gt;Here's what a real session looks like. I ran &lt;code&gt;$recipe-implement Add a login screen with email/password and a simple auth API&lt;/code&gt; in Codex CLI against a fresh project with placeholder files:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Requirement analysis — the first mandatory stop:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;• Spawned Plato [requirement-analyzer] (gpt-5.4-mini medium)

• Updated requirement analysis is now confirmed:
  - Scope: email/password login form, local mock auth abstraction,
    hardcoded success credentials demo@example.com / password123,
    inline success message only
  - Route: fullstack
  - Scale: medium
  - ADR: not required

  Please confirm this analysis and answer these three questions:
  1. Should the auth API be a real HTTP endpoint, or is a local/mock
     API abstraction enough?
  2. What credentials should succeed initially?
  3. On successful login, should it only show success, navigate
     somewhere, or persist auth state too?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The workflow stops and waits. No implementation starts until I confirm scope.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Parallel design — backend and frontend at the same time:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;• Spawned Volta [technical-designer] (gpt-5.4-mini medium)
  └ Create a backend Design Doc...

• Spawned Pasteur [technical-designer-frontend] (gpt-5.4-mini medium)
  └ Create a frontend Design Doc...

• Spawned Epicurus [document-reviewer] (gpt-5.4-mini medium)
  └ Review login-backend-design.md...

• Spawned Bohr [document-reviewer] (gpt-5.4-mini medium)
  └ Review login-frontend-design.md...

• Spawned Euclid [design-sync] (gpt-5.4-mini medium)
  └ Verify consistency between Design Docs...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five agents in parallel: two designers, two reviewers, one cross-layer sync checker. Each running in its own fresh context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Design gate — another mandatory stop:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;• The design gate is now clean enough to stop for approval.

  - Backend design: approved
  - Frontend design: approved
  - Cross-layer sync: NO_CONFLICTS

  Aligned contract:
  - Success: { ok: true }
  - Failure: { ok: false, error }
  - Inline success copy is UI-owned, not backend-owned

  Please approve the design docs so I can move to
  acceptance-test generation and the work plan.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Autonomous execution after batch approval:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;› batch approval

• Spawned Ohm [task-decomposer] (gpt-5.4-mini medium)

• Verification passed:
  - npm test
  - npm run build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After batch approval, the framework decomposed the work plan into tasks and executed them autonomously — no more stopping points until the quality gates pass.&lt;/p&gt;

&lt;p&gt;The whole flow from &lt;code&gt;$recipe-implement&lt;/code&gt; to green tests took one session. The same flow, the same stopping points, the same agent roles that I've been running on Claude Code for months.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The framework is open source:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Codex CLI version&lt;/strong&gt;: &lt;a href="https://github.com/shinpr/codex-workflows" rel="noopener noreferrer"&gt;codex-workflows&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code version&lt;/strong&gt;: &lt;a href="https://github.com/shinpr/claude-code-workflows" rel="noopener noreferrer"&gt;claude-code-workflows&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're already using the Claude Code version, the Codex version follows the same patterns. If you're new to both, pick whichever CLI you're already using — the workflow knowledge transfers either way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx codex-workflows &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;In the end, the whole port came down to config file formats and directory conventions. The agent instructions — the part that actually matters — didn't need a single edit. That's the thing I'd want to know if I were deciding whether to invest time in workflow design for AI coding tools.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you've been running sub-agent workflows with either Claude Code or Codex CLI, I'd be curious how your setup compares. What worked? What broke?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Letting LLMs Jump — and Then Verifying Ruthlessly</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Thu, 12 Feb 2026 13:35:03 +0000</pubDate>
      <link>https://forem.com/shinpr/letting-llms-jump-and-then-verifying-ruthlessly-1mj0</link>
      <guid>https://forem.com/shinpr/letting-llms-jump-and-then-verifying-ruthlessly-1mj0</guid>
      <description>&lt;h2&gt;
  
  
  The "First Plausible Answer" Problem
&lt;/h2&gt;

&lt;p&gt;You've probably seen this: you ask an LLM to investigate a bug, and it latches onto the first plausible explanation. It confidently proposes a fix before thoroughly exploring alternatives. Sometimes it works. Often it doesn't—and you're left debugging the debugger.&lt;/p&gt;

&lt;p&gt;I ran into this repeatedly in my personal projects. The LLM would find &lt;em&gt;something&lt;/em&gt; that looked like the cause, stop investigating, and immediately suggest a solution. When the codebase was small, this worked fine. As it grew, I started getting fixes that didn't actually fix anything.&lt;/p&gt;

&lt;p&gt;This approach is not for small scripts or simple bugs. I only started needing it once my codebase grew large enough that "just try a fix" stopped working.&lt;/p&gt;

&lt;p&gt;The root issue? &lt;strong&gt;How I was defining the task's purpose.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Planning works well when the problem is understood. But when the problem itself is unclear, planning alone is not enough. This article focuses on those cases.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Factor That Made the Difference: Purpose
&lt;/h2&gt;

&lt;p&gt;When delegating tasks to LLMs, two factors affect execution accuracy: &lt;strong&gt;Context&lt;/strong&gt; (staying within ~70% of the context window) and &lt;strong&gt;Purpose&lt;/strong&gt; (how you define the task's goal).&lt;/p&gt;

&lt;p&gt;Context management matters, but this article focuses on the second factor—because that's where I was getting it wrong.&lt;/p&gt;

&lt;p&gt;Where you set the task's goal matters more than you might think. The purpose you define determines the task granularity, and the right granularity depends on your codebase complexity.&lt;/p&gt;
&lt;h2&gt;
  
  
  A Real Example: Bug Investigation
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The Old Approach
&lt;/h3&gt;

&lt;p&gt;A single session handling "Investigation → Solution Proposal → Verification," followed by a separate review session.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzagme1n6o1ik5lu0ug2o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzagme1n6o1ik5lu0ug2o.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What I Changed
&lt;/h3&gt;

&lt;p&gt;My original goal was simple: "propose a solution" and "review it objectively."&lt;/p&gt;

&lt;p&gt;Originally, I'd just have the LLM investigate, propose a fix, and implement it directly. But as the codebase grew, I started getting solutions that didn't actually work. So I added a review step—opening a fresh session to check the proposal with clean context.&lt;/p&gt;

&lt;p&gt;This worked for about 60-70% of problems, but occasionally even this approach couldn't reach the root cause, no matter how many iterations I ran.&lt;/p&gt;

&lt;p&gt;Here's what I changed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Problem Structuring&lt;/strong&gt;: Structure my instructions upfront to make them easier for LLMs to parse in later steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation&lt;/strong&gt;: Conduct comprehensive investigation and report results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification&lt;/strong&gt;: If there's uncertainty in the report, perform additional verification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Solution Derivation&lt;/strong&gt;: Receive investigation and verification results, then derive solutions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fln0ycoa4s415257xzuvl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fln0ycoa4s415257xzuvl.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By setting &lt;strong&gt;"investigation" as the purpose&lt;/strong&gt;, the model stopped jumping to the first candidate and instead collected information from multiple angles.&lt;/p&gt;
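
&lt;p&gt;The four steps above can be sketched as a pipeline. This is a hypothetical illustration (the function names and JSON shapes are mine, not the actual workflow files); the point is that each stage has one purpose and sees only the previous stage's structured output:&lt;/p&gt;

```typescript
// Hypothetical sketch of the four-step pipeline: each stage receives only
// the structured JSON output of the previous stage, never raw conversation
// history. In practice each stage would be a separate LLM session.
type Step = (input: string) => string;

const structureProblem: Step = (problem) =>
  JSON.stringify({ problem, type: "change_failure" });

const investigate: Step = (structured) =>
  JSON.stringify({ basedOn: JSON.parse(structured), hypotheses: ["H1", "H2"] });

const verifyFindings: Step = (report) =>
  JSON.stringify({ surviving: JSON.parse(report).hypotheses[0] });

const deriveSolution: Step = (verified) =>
  JSON.stringify({ solutionFor: JSON.parse(verified).surviving });

// Thread only structured JSON between the stages.
const diagnose = (problem: string): string =>
  [structureProblem, investigate, verifyFindings, deriveSolution]
    .reduce((acc, step) => step(acc), problem);

console.log(diagnose("API returns stale data")); // {"solutionFor":"H1"}
```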
&lt;h2&gt;
  
  
  Implementation Example
&lt;/h2&gt;

&lt;p&gt;This setup is probably overkill for small scripts. I only started doing this after my codebase crossed a certain complexity threshold.&lt;/p&gt;

&lt;p&gt;Here's how I structured the diagnosis workflow using Claude Code's slash commands and sub-agents. Full implementation is available at &lt;a href="https://github.com/shinpr/claude-code-workflows" rel="noopener noreferrer"&gt;github.com/shinpr/claude-code-workflows&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Main Command (diagnose.md)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Investigate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;problem,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;verify&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;findings,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;derive&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;solutions"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gs"&gt;**Command Context**&lt;/span&gt;: Diagnosis flow to identify root cause and present solutions

Target problem: $ARGUMENTS

&lt;span class="gu"&gt;## Step 0: Problem Structuring (Before investigator invocation)&lt;/span&gt;

&lt;span class="gu"&gt;### 0.1 Problem Type Determination&lt;/span&gt;

| Type | Criteria |
|------|----------|
| Change Failure | Indicates some change occurred before the problem appeared |
| New Discovery | No relation to changes is indicated |

&lt;span class="gu"&gt;### 0.2 Information Supplementation for Change Failures&lt;/span&gt;

If the following are unclear, &lt;span class="gs"&gt;**ask with AskUserQuestion**&lt;/span&gt; before proceeding:
&lt;span class="p"&gt;-&lt;/span&gt; What was changed (cause change)
&lt;span class="p"&gt;-&lt;/span&gt; What broke (affected area)
&lt;span class="p"&gt;-&lt;/span&gt; Relationship between both (shared components, etc.)

&lt;span class="gu"&gt;## Diagnosis Flow Overview&lt;/span&gt;

The goal of investigation is not to propose solutions.
It is to eliminate wrong explanations.

&lt;span class="gs"&gt;**Context Separation**&lt;/span&gt;: Pass only structured JSON output to each step.
Each step starts fresh with the JSON data only.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Sub-agent: Investigator
&lt;/h3&gt;

&lt;p&gt;Think of the Investigator as a junior engineer whose only job is to gather facts—not to be clever. Its purpose is explicitly limited to &lt;strong&gt;evidence collection only&lt;/strong&gt;—no solutions.&lt;/p&gt;

&lt;p&gt;This is one concrete implementation. The important part is the separation of purpose, not the specific tooling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Output Scope&lt;/span&gt;

This agent outputs &lt;span class="gs"&gt;**evidence matrix and factual observations only**&lt;/span&gt;.
Solution derivation is out of scope for this agent.

&lt;span class="gu"&gt;## Core Responsibilities&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; &lt;span class="gs"&gt;**Cross-check multiple sources**&lt;/span&gt; - Don't rely on a single source
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="gs"&gt;**Search external info (WebSearch)**&lt;/span&gt; - Official docs, Stack Overflow, GitHub Issues
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="gs"&gt;**List hypotheses and trace causes**&lt;/span&gt; - Multiple candidates, not just the first one
&lt;span class="p"&gt;4.&lt;/span&gt; &lt;span class="gs"&gt;**Identify impact scope**&lt;/span&gt; - Where else might this pattern exist?
&lt;span class="p"&gt;5.&lt;/span&gt; &lt;span class="gs"&gt;**Disclose blind spots**&lt;/span&gt; - Honestly report areas that could not be investigated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key output structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hypotheses"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"H1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Hypothesis description"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"causeCategory"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"typo|logic_error|missing_constraint|design_gap|external_factor"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"causalChain"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Phenomenon"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"→ Direct cause"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"→ Root cause"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"supportingEvidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"contradictingEvidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"unexploredAspects"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Unverified aspects"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"comparisonAnalysis"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"normalImplementation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Path to working implementation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"failingImplementation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Path to problematic implementation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"keyDifferences"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Differences"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sub-agent: Verifier
&lt;/h3&gt;

&lt;p&gt;The Verifier plays the annoying senior reviewer who assumes everything is wrong. It actively seeks refutation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Core Responsibilities&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; &lt;span class="gs"&gt;**Cross-check multiple sources**&lt;/span&gt; - Explore information sources not covered
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="gs"&gt;**Generate alternative hypotheses**&lt;/span&gt; - What else could explain this?
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="gs"&gt;**Play devil's advocate**&lt;/span&gt; - Assume "the investigation results are wrong"
&lt;span class="p"&gt;4.&lt;/span&gt; &lt;span class="gs"&gt;**Pick the hypothesis with fewest holes**&lt;/span&gt; - Not "most evidence," but "least refuted"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
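
&lt;p&gt;The fourth responsibility can be made concrete. A hypothetical sketch of the "least refuted" rule: rank hypotheses by how much evidence contradicts them, not by how much supports them:&lt;/p&gt;

```typescript
// Hypothetical sketch of "pick the hypothesis with fewest holes":
// the winner is the candidate with the least contradicting evidence,
// even if another candidate has more supporting evidence.
interface ScoredHypothesis {
  id: string;
  supportingEvidence: string[];
  contradictingEvidence: string[];
}

function leastRefuted(hypotheses: ScoredHypothesis[]): ScoredHypothesis {
  return hypotheses.reduce((best, h) =>
    h.contradictingEvidence.length < best.contradictingEvidence.length ? h : best
  );
}

const pick = leastRefuted([
  { id: "H1", supportingEvidence: ["a", "b", "c"], contradictingEvidence: ["x"] },
  { id: "H2", supportingEvidence: ["a"], contradictingEvidence: [] },
]);
console.log(pick.id); // H2: less support, but nothing refutes it
```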



&lt;h3&gt;
  
  
  Sub-agent: Solver
&lt;/h3&gt;

&lt;p&gt;The Solver is the engineer who actually has to ship something. Only after verification does it derive solutions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Output Scope&lt;/span&gt;

This agent outputs &lt;span class="gs"&gt;**solution derivation and recommendation presentation**&lt;/span&gt;.
Trust the given conclusion and proceed directly to solution derivation.

&lt;span class="gu"&gt;## Core Responsibilities&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; &lt;span class="gs"&gt;**Multiple solution generation**&lt;/span&gt; - At least 3 different approaches
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="gs"&gt;**Tradeoff analysis**&lt;/span&gt; - Cost, risk, impact scope, maintainability
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="gs"&gt;**Recommendation selection**&lt;/span&gt; - Optimal solution with selection rationale
&lt;span class="p"&gt;4.&lt;/span&gt; &lt;span class="gs"&gt;**Implementation steps presentation**&lt;/span&gt; - Concrete, actionable steps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Practical Guidelines
&lt;/h2&gt;

&lt;p&gt;When designing LLM tasks, I now check two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Purpose Clarity&lt;/strong&gt; - Every task gets a single, clearly defined purpose&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Efficiency&lt;/strong&gt; - Can it be completed in one session with sufficient information? (Ideally using 60-70% of the context window)&lt;/li&gt;
&lt;/ol&gt;
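
&lt;p&gt;The 60-70% figure can be sanity-checked before kicking off a session. A rough sketch, assuming the common rule of thumb of roughly 4 characters per token (an estimate, not a real tokenizer):&lt;/p&gt;

```typescript
// Rough context-budget check: estimate tokens from character count
// (~4 chars per token is a common rule of thumb, not exact) and flag
// tasks likely to exceed ~70% of the model's context window.
function fitsBudget(promptChars: number, contextWindowTokens: number): boolean {
  const estimatedTokens = Math.ceil(promptChars / 4);
  return estimatedTokens <= contextWindowTokens * 0.7;
}

console.log(fitsBudget(100_000, 200_000)); // true: ~25k tokens vs a 140k budget
console.log(fitsBudget(800_000, 200_000)); // false: ~200k tokens vs a 140k budget
```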

&lt;p&gt;I don't blindly split tasks into smaller pieces. Instead, I consider ROI and break down from larger tasks only when necessary.&lt;/p&gt;

&lt;p&gt;By explicitly separating "investigation" from "solution," you prevent the model from rushing to conclusions before it has gathered sufficient evidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Lesson I Learned the Hard Way
&lt;/h2&gt;

&lt;p&gt;Early on, I made the Verifier run every single time. The problem? Even when the investigation was clearly off track, the Verifier would dutifully try to verify nonsense.&lt;/p&gt;

&lt;p&gt;That's when I realized: &lt;strong&gt;you need a quality gate between steps&lt;/strong&gt;, not just separation.&lt;/p&gt;

&lt;p&gt;Now I have a checkpoint between Investigation and Verification. If the investigation output doesn't meet basic quality criteria (missing comparison analysis, shallow causal chains, etc.), it loops back instead of wasting cycles on verification.&lt;/p&gt;
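
&lt;p&gt;A minimal sketch of such a gate (the criteria names here are illustrative, not the exact ones I use):&lt;/p&gt;

```typescript
// Illustrative quality gate between Investigation and Verification.
// If the investigation output fails basic criteria, loop back to
// re-investigate instead of spending a verification pass on a weak report.
interface GateInput {
  hasComparisonAnalysis: boolean;
  causalChainDepths: number[]; // depth of each hypothesis's causal chain
}

function passesGate(input: GateInput): boolean {
  // "Shallow" here means a chain shorter than phenomenon → direct cause → root cause.
  const deepEnough = input.causalChainDepths.every((d) => d >= 3);
  return input.hasComparisonAnalysis && deepEnough;
}

// A shallow causal chain fails the gate and triggers re-investigation.
console.log(passesGate({ hasComparisonAnalysis: true, causalChainDepths: [3, 2] })); // false
```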

&lt;p&gt;I also added Step 0 (Problem Structuring) to help the LLM understand my intent better before diving in. These two changes—quality gates and upfront structuring—made the whole pipeline actually usable.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>Design Integration Checkpoints Before Letting LLMs Code</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Wed, 04 Feb 2026 13:23:55 +0000</pubDate>
      <link>https://forem.com/shinpr/design-integration-checkpoints-before-letting-llms-code-edo</link>
      <guid>https://forem.com/shinpr/design-integration-checkpoints-before-letting-llms-code-edo</guid>
      <description>&lt;p&gt;Once you stop trying to control AI generation and start designing verification, you immediately hit the next problem: integration.&lt;br&gt;
And this is where most AI-generated systems actually break.&lt;/p&gt;

&lt;p&gt;Everything works.&lt;br&gt;
Until it doesn't.&lt;/p&gt;

&lt;p&gt;Each layer looks correct in isolation.&lt;br&gt;
Tests pass.&lt;br&gt;
Types line up.&lt;/p&gt;

&lt;p&gt;And then the system breaks where those layers meet.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why "Everything Works" Until It Doesn't
&lt;/h2&gt;

&lt;p&gt;This is a verification problem, not an implementation problem.&lt;br&gt;
When you build systems layer by layer, integration happens very late.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer-by-layer development
Phase 1: Data layer ────────────────✓
Phase 2: Service layer ─────────────✓
Phase 3: API layer ─────────────────✓
Phase 4: UI layer ──────────────────✓
Phase 5: Integration ── 💥 breaks here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer is implemented in isolation.&lt;br&gt;
So you don't actually know if everything connects correctly until the end.&lt;/p&gt;

&lt;p&gt;This problem becomes much worse with AI-generated code.&lt;/p&gt;

&lt;p&gt;LLMs don't hold the entire system in mind at once.&lt;br&gt;
They optimize locally, based on the current context — and they often miss hidden contracts between layers.&lt;/p&gt;
&lt;h2&gt;
  
  
  A Painful Integration Bug That "Worked"
&lt;/h2&gt;

&lt;p&gt;One of the most painful bugs I faced didn't involve crashes or errors.&lt;/p&gt;

&lt;p&gt;The AI chatbot worked.&lt;/p&gt;

&lt;p&gt;It returned responses.&lt;br&gt;
Logs looked normal.&lt;br&gt;
Nothing failed.&lt;/p&gt;

&lt;p&gt;But when we tested it in the real environment, the answers were subtly — but consistently — wrong.&lt;/p&gt;
&lt;h3&gt;
  
  
  What actually went wrong
&lt;/h3&gt;

&lt;p&gt;The root cause wasn't a single mistake, but a combination of issues across layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mock implementations silently left in place&lt;/li&gt;
&lt;li&gt;LLM fallbacks that prioritized "returning something" instead of failing fast&lt;/li&gt;
&lt;li&gt;Duplicate logic across layers, created while implementing each layer separately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common thread? I wasn't tracking what else might break.&lt;/p&gt;

&lt;p&gt;Each layer looked correct in isolation.&lt;br&gt;
Tests passed.&lt;br&gt;
No alerts fired.&lt;/p&gt;

&lt;p&gt;Because the system always returned some response, it created a false sense of confidence.&lt;br&gt;
We didn't notice the problem immediately — and by the time we did, identifying the real cause across layers was extremely difficult.&lt;/p&gt;

&lt;p&gt;Bugs that silently "work" are far more dangerous than bugs that crash.&lt;/p&gt;
&lt;h2&gt;
  
  
  Make Integration Explicit
&lt;/h2&gt;

&lt;p&gt;I now spend about five minutes defining integration checkpoints.&lt;br&gt;
Not documentation. Just verification.&lt;/p&gt;

&lt;p&gt;The goal is simple: define where things must connect, and how I'll know they actually do.&lt;/p&gt;

&lt;p&gt;Now, before implementation, I write a very small design note. Nothing formal.&lt;/p&gt;

&lt;p&gt;Just a checklist that answers two questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What parts of the system are affected?&lt;/li&gt;
&lt;li&gt;Where do things need to integrate — and how do I verify it?&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Step 1: List What's Affected
&lt;/h3&gt;

&lt;p&gt;First, I write down what is directly or indirectly impacted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Change&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add image generation feature&lt;/span&gt;

&lt;span class="na"&gt;Direct impact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;infrastructure/image/functions.ts&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;application/services/queryClassificationService.ts&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;application/services/imageGenerationService.ts&lt;/span&gt;

&lt;span class="na"&gt;Indirect impact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;conversationService.ts (function calling flow)&lt;/span&gt;

&lt;span class="na"&gt;No impact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;existing text generation services&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;other function handlers&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This immediately clarifies the blast radius.&lt;/p&gt;

&lt;p&gt;I don't aim for perfection —&lt;br&gt;
I just want to avoid being surprised later.&lt;/p&gt;
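
&lt;p&gt;One way to make this list actionable is to diff it against the files a change actually touched. A sketch, reusing the paths from the example above:&lt;/p&gt;

```typescript
// Sketch: flag files touched by a change that were not declared in the
// impact lists, i.e. surprises outside the expected blast radius.
const declaredImpact = new Set([
  "infrastructure/image/functions.ts",
  "application/services/queryClassificationService.ts",
  "application/services/imageGenerationService.ts",
  "conversationService.ts",
]);

function surprises(touchedFiles: string[]): string[] {
  return touchedFiles.filter((f) => !declaredImpact.has(f));
}

console.log(surprises([
  "infrastructure/image/functions.ts",
  "application/services/textGenerationService.ts", // not declared → surprise
]));
```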
&lt;h3&gt;
  
  
  Step 2: Define Integration Checkpoints
&lt;/h3&gt;

&lt;p&gt;Next, I decide where integration must be verified and how.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Integration point 1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Function selection&lt;/span&gt;
&lt;span class="na"&gt;Location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConversationService.generateContentWithFunctionCalling&lt;/span&gt;

&lt;span class="na"&gt;How to verify&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="s"&gt;1. Send a request asking for an image&lt;/span&gt;
  &lt;span class="s"&gt;2. Confirm query classification returns `image_generation`&lt;/span&gt;
  &lt;span class="s"&gt;3. Confirm the correct function is selected in logs&lt;/span&gt;

&lt;span class="na"&gt;Expected result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Log shows: Executing function&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;generateImage&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And another one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Integration point 2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Image generation and posting&lt;/span&gt;
&lt;span class="na"&gt;Location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ImageGenerationService → MessagingClient.uploadFile&lt;/span&gt;

&lt;span class="na"&gt;How to verify&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="s"&gt;1. Image data is returned from the image client&lt;/span&gt;
  &lt;span class="s"&gt;2. The file is posted to the chat thread&lt;/span&gt;

&lt;span class="na"&gt;Expected result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Image appears in the chat&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
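
&lt;p&gt;Checkpoints written this way translate almost directly into small executable checks. A hedged sketch of integration point 1, where classifyQuery stands in for the real classification service:&lt;/p&gt;

```typescript
// Sketch of integration point 1 as an executable check. classifyQuery is
// a stand-in for the real query classification service, not its actual code.
function classifyQuery(request: string): string {
  return /image|picture|draw/i.test(request) ? "image_generation" : "text_generation";
}

// Steps 1-2 of the checkpoint: an image request must classify as image_generation.
function verifyFunctionSelection(request: string): boolean {
  return classifyQuery(request) === "image_generation";
}

console.log(verifyFunctionSelection("Generate an image of a sunset")); // true
console.log(verifyFunctionSelection("Summarize this text"));           // false
```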



&lt;p&gt;Now I know exactly what "working" means.&lt;/p&gt;

&lt;p&gt;That's it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Works (Especially with AI)
&lt;/h2&gt;

&lt;p&gt;When I give this to an LLM, it changes how implementation happens.&lt;/p&gt;

&lt;p&gt;Instead of "build this feature," it's more like:&lt;br&gt;
"Connect A to B. Here's how we'll know it works."&lt;/p&gt;

&lt;p&gt;This also pairs well with building features end-to-end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Feature-based development
Feature A: Data → Service → API → UI → Verify
Feature B: Data → Service → API → UI → Verify
Feature C: Data → Service → API → UI → Verify
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each feature is fully integrated before moving on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;Before this habit, integration bugs often cost me hours.&lt;/p&gt;

&lt;p&gt;After introducing these small design notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI-generated code still has small issues&lt;/li&gt;
&lt;li&gt;But features no longer completely break at integration&lt;/li&gt;
&lt;li&gt;Unexpected behavior is caught much earlier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Five minutes of thinking up front easily saves hours of debugging later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;p&gt;This approach works well if you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use AI coding tools&lt;/li&gt;
&lt;li&gt;Build layered architectures&lt;/li&gt;
&lt;li&gt;Want fast feedback instead of perfect design docs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not about writing more documentation.&lt;br&gt;
It's just about making integration explicit before code is written.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;AI tools are incredibly powerful — but they optimize locally.&lt;/p&gt;

&lt;p&gt;If we don't define integration points explicitly, we end up debugging systems that look correct but behave incorrectly.&lt;/p&gt;

&lt;p&gt;A small design checklist has made a huge difference for me.&lt;/p&gt;

&lt;p&gt;Hope this saves you some painful debugging.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Planning Is the Real Superpower of Agentic Coding</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Mon, 26 Jan 2026 12:24:31 +0000</pubDate>
      <link>https://forem.com/shinpr/planning-is-the-real-superpower-of-agentic-coding-1imm</link>
      <guid>https://forem.com/shinpr/planning-is-the-real-superpower-of-agentic-coding-1imm</guid>
      <description>&lt;p&gt;I see this pattern constantly: someone gives an LLM a task, it starts executing immediately, and halfway through you realize it's building the wrong thing. Or it gets stuck in a loop. Or it produces something that technically works but doesn't fit the existing codebase at all.&lt;/p&gt;

&lt;p&gt;The instinct is to write better prompts. More detail. More constraints. More examples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The actual fix is simpler: make it plan before it executes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Research shows that separating planning from execution dramatically improves task success rates—by as much as 33% in complex scenarios.&lt;/p&gt;

&lt;p&gt;In earlier articles, I wrote about why &lt;a href="https://dev.to/shinpr/why-llms-are-bad-at-first-try-and-great-at-verification-4kcf"&gt;LLMs struggle with first attempts&lt;/a&gt; and why &lt;a href="https://dev.to/shinpr/stop-putting-everything-in-agentsmd-22bl"&gt;overloading AGENTS.md&lt;/a&gt; is often a symptom of that misunderstanding. This article focuses on what actually fixes that.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why "Just Execute" Fails
&lt;/h2&gt;

&lt;p&gt;This took me longer to figure out than I'd like to admit. When you ask an LLM to directly implement something, you're asking it to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Understand the requirements&lt;/li&gt;
&lt;li&gt;Analyze the existing codebase&lt;/li&gt;
&lt;li&gt;Design an approach&lt;/li&gt;
&lt;li&gt;Evaluate trade-offs&lt;/li&gt;
&lt;li&gt;Decompose into steps&lt;/li&gt;
&lt;li&gt;Execute each step&lt;/li&gt;
&lt;li&gt;Verify results&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All in one shot. With one context. Using the same cognitive load throughout.&lt;/p&gt;

&lt;p&gt;Even powerful LLMs struggle with this. Not because they lack capability, but because &lt;strong&gt;long-horizon planning is fundamentally hard in a step-by-step mode.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Plan-Execute Architecture
&lt;/h2&gt;

&lt;p&gt;Research on LLM agents has consistently shown that separating planning and execution yields better results.&lt;/p&gt;

&lt;p&gt;The reasons:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Explicit long-term planning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Even strong LLMs struggle with multi-step reasoning when taking actions one at a time. Explicit planning forces consideration of the full path.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You can use a powerful model for planning and a lighter model for execution—or even different specialized models per phase.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Each execution step doesn't need to reason through the entire conversation history. It just needs to execute against the plan.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What matters here: &lt;strong&gt;the plan becomes an artifact&lt;/strong&gt;, and the execution becomes &lt;em&gt;verification against that artifact&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you've read about why LLMs are better at verification than first-shot generation, this should sound familiar. Creating a plan first converts the execution task from "generate good code" to "implement according to this plan"—a much clearer, more verifiable objective.&lt;/p&gt;
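
&lt;p&gt;A minimal sketch of what "plan as artifact" looks like in code (the names are hypothetical; a real planner would be a single call to a strong model):&lt;/p&gt;

```typescript
// Hypothetical plan-execute split: planning produces a persistent artifact,
// and each execution step only needs the plan, not the full history.
interface PlanStep { id: number; description: string; done: boolean; }

function makePlan(goal: string): PlanStep[] {
  // Stand-in planner; a real one would be one call to a strong model.
  return [
    { id: 1, description: `analyze codebase for: ${goal}`, done: false },
    { id: 2, description: `implement: ${goal}`, done: false },
    { id: 3, description: `verify: ${goal}`, done: false },
  ];
}

function executeNext(plan: PlanStep[]): PlanStep | undefined {
  // Execution becomes "work through the artifact", a verifiable objective.
  const step = plan.find((s) => !s.done);
  if (step) step.done = true;
  return step;
}

const plan = makePlan("add retry logic");
executeNext(plan);
console.log(plan.filter((s) => s.done).length); // 1
```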




&lt;h2&gt;
  
  
  The Full Workflow
&lt;/h2&gt;

&lt;p&gt;The complete picture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 1: Preparation
    │
    ▼
Step 2: Design (Agree on Direction)
    │
    ▼
Step 3: Work Planning  ← The Most Important Step
    │
    ▼
Step 4: Execution
    │
    ▼
Step 5: Verification &amp;amp; Feedback
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'll walk through each step, but Step 3 is where the magic happens.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Preparation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Clarify &lt;em&gt;what&lt;/em&gt; you want to achieve, not &lt;em&gt;how&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a ticket, issue, or todo document stating the goal in plain language&lt;/li&gt;
&lt;li&gt;Point the LLM to AGENTS.md (or CLAUDE.md, depending on your tool) and relevant context files&lt;/li&gt;
&lt;li&gt;Don't jump into implementation details yet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is about setting the stage, not solving the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Design (Agree on Direction)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Align on the approach before any code gets written.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don't Let It Start Coding Immediately
&lt;/h3&gt;

&lt;p&gt;Instead of "implement this feature," say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Before implementing, present a step-by-step plan for how you would approach this."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Review the Plan
&lt;/h3&gt;

&lt;p&gt;Look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contradictions with existing architecture&lt;/li&gt;
&lt;li&gt;Simpler alternatives the LLM missed&lt;/li&gt;
&lt;li&gt;Misunderstandings of the requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this stage, you're agreeing on &lt;strong&gt;what to build&lt;/strong&gt; and &lt;strong&gt;why this approach&lt;/strong&gt;. The &lt;strong&gt;how&lt;/strong&gt; and &lt;strong&gt;in what order&lt;/strong&gt; come in Step 3.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Work Planning (The Most Important Step)
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;This section is dense. But the payoff is proportional—the more carefully you plan, the smoother execution becomes.&lt;/p&gt;

&lt;p&gt;For small tasks, you don't need all of this. See "Scaling to Task Size" at the end.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Convert the design into executable work units with clear completion criteria.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Step Matters Most
&lt;/h3&gt;

&lt;p&gt;Research shows that decomposing complex tasks into subtasks significantly improves LLM success rates. Step-by-step decomposition produces more accurate results than direct generation.&lt;/p&gt;

&lt;p&gt;But there's another reason: &lt;strong&gt;the work plan is an artifact&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When the plan exists, the execution task transforms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Before: "Build this feature" (generation)&lt;/li&gt;
&lt;li&gt;After: "Implement according to this plan" (verification)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the same principle from Article 1. Creating a plan first means execution becomes verification—and LLMs are better at verification.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Work Planning Includes
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Task decomposition&lt;/strong&gt;: Break the design into executable units&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependency mapping&lt;/strong&gt;: Define order and dependencies between tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Completion criteria&lt;/strong&gt;: What does "done" mean for each task?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checkpoint design&lt;/strong&gt;: When do we get external feedback?&lt;/li&gt;
&lt;/ol&gt;
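&lt;p&gt;A single entry in such a plan might read like this (the task itself is a made-up example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task 2: Add CSV serialization for activity records
  Depends on: Task 1 (export data model)
  Done when: exported file opens correctly in a spreadsheet (L1)
  Checkpoint: share sample output for review before starting Task 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;All four elements are present: a decomposed unit, an explicit dependency, a verifiable completion criterion, and a point where external feedback enters.&lt;/p&gt;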

&lt;h3&gt;
  
  
  Perspectives to Consider
&lt;/h3&gt;

&lt;p&gt;I'll be honest: I learned most of these the hard way. Plans would fall apart mid-implementation, and only later would I realize I'd skipped something that was obvious in hindsight.&lt;/p&gt;

&lt;p&gt;These aren't meant to be followed rigidly for every task. Think of them as a mental checklist: if even one of these perspectives changes your plan, it's doing its job.&lt;/p&gt;




&lt;h4&gt;
  
  
  Perspective 1: Current State Analysis
&lt;/h4&gt;

&lt;p&gt;Understand what exists before planning changes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is this code's actual responsibility?&lt;/li&gt;
&lt;li&gt;Which parts are essential business logic vs. technical constraints?&lt;/li&gt;
&lt;li&gt;What benefits and limitations does the current design provide?&lt;/li&gt;
&lt;li&gt;What implicit dependencies or assumptions aren't obvious from the code?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Skipping this leads to plans that don't fit the existing codebase.&lt;/p&gt;




&lt;h4&gt;
  
  
  Perspective 2: Strategy Selection
&lt;/h4&gt;

&lt;p&gt;Consider how to approach the transition from current to desired state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Research options:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Look for similar patterns in your tech stack&lt;/li&gt;
&lt;li&gt;Check how comparable projects solved this&lt;/li&gt;
&lt;li&gt;Review OSS implementations, articles, documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common strategy patterns:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strangler Pattern&lt;/strong&gt;: Gradual replacement, incremental migration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Facade Pattern&lt;/strong&gt;: Hide complexity behind unified interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature-Driven&lt;/strong&gt;: Vertical slices, user-value first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Foundation-Driven&lt;/strong&gt;: Build stable base first, then features on top&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key isn't applying patterns dogmatically—it's consciously choosing an approach instead of stumbling into one.&lt;/p&gt;




&lt;h4&gt;
  
  
  Perspective 3: Risk Assessment
&lt;/h4&gt;

&lt;p&gt;Evaluate what could go wrong with your chosen strategy.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Risk Type&lt;/th&gt;
&lt;th&gt;Considerations&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Technical&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Impact on existing systems, data integrity, performance degradation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operational&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Service availability, deployment downtime, rollback procedures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Project&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Schedule delays, learning curve, team coordination&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Skipping risk assessment leads to expensive surprises mid-implementation.&lt;/p&gt;




&lt;h4&gt;
  
  
  Perspective 4: Constraints
&lt;/h4&gt;

&lt;p&gt;Identify hard limits before committing to a strategy.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Technical&lt;/strong&gt;: Library compatibility, resource capacity, performance requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeline&lt;/strong&gt;: Deadlines, milestones, external dependencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt;: Team availability, skill gaps, budget&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business&lt;/strong&gt;: Time-to-market, customer impact, regulations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A strategy that ignores constraints isn't executable.&lt;/p&gt;




&lt;h4&gt;
  
  
  Perspective 5: Completion Levels
&lt;/h4&gt;

&lt;p&gt;Define what "done" means for each task—this is critical.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Definition&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L1: Functional verification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Works as user-facing feature&lt;/td&gt;
&lt;td&gt;Search actually returns results&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L2: Test verification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New tests added and passing&lt;/td&gt;
&lt;td&gt;Type definition tests pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L3: Build verification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No compilation errors&lt;/td&gt;
&lt;td&gt;Interface definition complete&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Priority: L1 &amp;gt; L2 &amp;gt; L3&lt;/strong&gt;. Whenever possible, verify at L1 (actually works in practice).&lt;/p&gt;

&lt;p&gt;This directly maps to "external feedback" from the previous articles. Defining completion levels upfront ensures you get external verification at each checkpoint.&lt;/p&gt;




&lt;h4&gt;
  
  
  Perspective 6: Integration Points
&lt;/h4&gt;

&lt;p&gt;Define when to verify things work together.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Integration Point&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Feature-driven&lt;/td&gt;
&lt;td&gt;When users can actually use the feature&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Foundation-driven&lt;/td&gt;
&lt;td&gt;When all layers are complete and E2E tests pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strangler pattern&lt;/td&gt;
&lt;td&gt;At each old-to-new system cutover&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Without defined integration points, you end up with "it all works individually but doesn't work together."&lt;/p&gt;




&lt;h3&gt;
  
  
  Task Decomposition Principles
&lt;/h3&gt;

&lt;p&gt;After considering the perspectives, break down into concrete tasks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Executable granularity:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each task = one meaningful commit&lt;/li&gt;
&lt;li&gt;Clear completion criteria&lt;/li&gt;
&lt;li&gt;Explicit dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Minimize dependencies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maximum 2 levels deep (A→B→C is okay, A→B→C→D needs redesign)&lt;/li&gt;
&lt;li&gt;Tasks with 3+ chained dependencies should be split&lt;/li&gt;
&lt;li&gt;Each task should ideally provide independent value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Build quality in:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't make "write tests" a separate task—include testing in the implementation task&lt;/li&gt;
&lt;li&gt;Tag each task with its completion level (L1/L2/L3, though in practice L1 is almost always what you want)&lt;/li&gt;
&lt;/ul&gt;
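&lt;p&gt;Putting these principles together, a decomposition for a hypothetical export feature might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task 1: Define export data model + tests       (no dependencies, L2)
Task 2: CSV serializer + tests                 (depends on Task 1, L2)
Task 3: Export endpoint, wired end to end      (depends on Task 2, L1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each task maps to one meaningful commit, testing is folded into each task rather than deferred, and the dependency chain stays within two levels.&lt;/p&gt;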




&lt;h3&gt;
  
  
  Work Planning Anti-Patterns
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Anti-Pattern&lt;/th&gt;
&lt;th&gt;Consequence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Skip current-state analysis&lt;/td&gt;
&lt;td&gt;Plan doesn't fit codebase&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ignore risks&lt;/td&gt;
&lt;td&gt;Expensive surprises mid-implementation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ignore constraints&lt;/td&gt;
&lt;td&gt;Plan isn't executable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Over-detail&lt;/td&gt;
&lt;td&gt;Lose flexibility, waste planning time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Undefined completion criteria&lt;/td&gt;
&lt;td&gt;"Done" is ambiguous, verification impossible&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  Scaling to Task Size
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Not every task needs full work planning.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;th&gt;Planning Depth&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Small (1-2 hours)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Verbal/mental notes or simple TODO list&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Medium (1 day to 1 week)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Written work plan, but abbreviated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Large (1+ weeks)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full work plan covering all perspectives&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a typo fix, you don't need a work plan. For a multi-week refactor, you absolutely do.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Execution
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Implement according to the work plan.&lt;/p&gt;

&lt;h3&gt;
  
  
  Work in Small Steps
&lt;/h3&gt;

&lt;p&gt;Follow the plan. One task at a time. One file, one function at a time where appropriate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Types-First
&lt;/h3&gt;

&lt;p&gt;When adding new functionality, define interfaces and types before implementing logic. Type definitions become guardrails that help both you and the LLM stay on track.&lt;/p&gt;
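&lt;p&gt;For example, a hypothetical TypeScript sketch (the names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Defined first, before any logic is written
interface SearchResult {
  id: string
  title: string
  score: number
}

interface SearchService {
  search(query: string): Promise&amp;lt;SearchResult[]&amp;gt;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once the interface is fixed, "implement SearchService" is a far narrower task than "build search", and the compiler flags any drift from the contract.&lt;/p&gt;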

&lt;h3&gt;
  
  
  Why This Changes Everything
&lt;/h3&gt;

&lt;p&gt;With a work plan in place, execution becomes &lt;em&gt;verification&lt;/em&gt;. The LLM isn't guessing what to build—it's checking whether the implementation matches the plan.&lt;/p&gt;

&lt;p&gt;If you need to deviate from the plan, &lt;strong&gt;update the plan first&lt;/strong&gt;, then continue implementation. Don't let plan and implementation drift apart.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: Verification &amp;amp; Feedback
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Verify results and externalize learnings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feedback Format
&lt;/h3&gt;

&lt;p&gt;When something goes wrong, don't just paste an error. Include the intent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ Just the error
[error log]

✅ Intent + error
Goal: Redirect to dashboard after authentication
Issue: Following error occurs
[error log]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without intent, the LLM optimizes for "remove the error." With intent, it optimizes for "achieve the goal."&lt;/p&gt;

&lt;h3&gt;
  
  
  Externalize Learnings
&lt;/h3&gt;

&lt;p&gt;If you find yourself explaining the same thing twice, it's time to write it down.&lt;/p&gt;

&lt;p&gt;I covered this in detail in the previous article—where to put rules, what to write, and how to verify they work. The short version: write root causes, not specific incidents, and put them where they'll actually be read.&lt;/p&gt;




&lt;h2&gt;
  
  
  Referencing Skills and Rules
&lt;/h2&gt;

&lt;p&gt;One common failure mode: you reference a skill or rule file, but the LLM just reads it and moves on without actually applying it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Write "see AGENTS.md"&lt;/td&gt;
&lt;td&gt;It's already loaded—redundant reference adds noise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;@file.md&lt;/code&gt; only&lt;/td&gt;
&lt;td&gt;LLM reads it, then continues. Reading ≠ applying&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Please reference X"&lt;/td&gt;
&lt;td&gt;References it minimally, doesn't apply the content&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Solution: Blocking References
&lt;/h3&gt;

&lt;p&gt;Make the reference a task with verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Required Rules [MANDATORY - MUST BE ACTIVE]&lt;/span&gt;

&lt;span class="gs"&gt;**LOADING PROTOCOL:**&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; STEP 1: CHECK if &lt;span class="sb"&gt;`.agents/skills/coding-rules/SKILL.md`&lt;/span&gt; is active
&lt;span class="p"&gt;-&lt;/span&gt; STEP 2: If NOT active → Execute BLOCKING READ
&lt;span class="p"&gt;-&lt;/span&gt; STEP 3: CONFIRM skill active before proceeding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why This Works
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element&lt;/th&gt;
&lt;th&gt;Effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Action verbs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"CHECK", "READ", "CONFIRM"—not just "reference"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;STEP numbers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Forces sequence, can't skip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Before proceeding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Blocking—must complete before continuing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;If NOT active&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conditional—skips if already loaded (efficiency)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This maps to the task clarity principle: "check if loaded → load if needed → confirm → proceed" is far clearer than "please reference this file."&lt;/p&gt;




&lt;h2&gt;
  
  
  How This Connects to the Theory
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Connection to LLM Characteristics&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Step 1: Preparation&lt;/td&gt;
&lt;td&gt;Task clarification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Step 2: Design&lt;/td&gt;
&lt;td&gt;Artifact-first (design doc is an artifact)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Step 3: Work Planning&lt;/td&gt;
&lt;td&gt;Artifact-first (plan is an artifact) + external feedback design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Step 4: Execution&lt;/td&gt;
&lt;td&gt;Transform "generation" into "verification against plan"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Step 5: Verification&lt;/td&gt;
&lt;td&gt;Obtain external feedback + externalize learnings&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The work plan created in Step 3 converts Step 4 from "generate from scratch" to "verify against specification." This is the key mechanism for improving accuracy.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Research
&lt;/h2&gt;

&lt;p&gt;The practices in this article aren't just workflow opinions—they're backed by research on how LLM agents perform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ADaPT (Prasad et al., NAACL 2024)&lt;/strong&gt;: Separating planning and execution, with dynamic subtask decomposition when needed, achieved up to 33% higher success rates than baselines (28.3% on ALFWorld, 27% on WebShop, 33% on TextCraft).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plan-and-Execute (LangChain)&lt;/strong&gt;: Explicit long-term planning enables handling complex tasks that even powerful LLMs struggle with in step-by-step mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Layer Task Decomposition (PMC, 2024)&lt;/strong&gt;: Step-by-step models generate more accurate results than direct generation—task decomposition directly improves output quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task Decomposition (Amazon Science, 2025)&lt;/strong&gt;: With proper task decomposition, smaller specialized models can match the performance of larger general models.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Don't let it execute immediately.&lt;/strong&gt; Ask for a plan first. Even just "present your approach step-by-step before implementing" makes a significant difference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Work Planning is the superpower.&lt;/strong&gt; A plan is an artifact. Having it converts execution from generation to verification—and LLMs are better at verification.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define completion criteria.&lt;/strong&gt; L1 (works as feature) &amp;gt; L2 (tests pass) &amp;gt; L3 (builds). Know what "done" means before starting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale to task size.&lt;/strong&gt; Small task = mental note. Large task = full work plan. Don't over-plan trivial work, don't under-plan complex work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update plan before deviating.&lt;/strong&gt; If implementation needs to differ from the plan, update the plan first. Drift kills the verification benefit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Include intent with errors.&lt;/strong&gt; "Goal + error" beats "just error." The LLM should know what you're trying to achieve, not just what went wrong.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Prasad, A., et al. (2024). "ADaPT: As-Needed Decomposition and Planning with Language Models." NAACL 2024 Findings. arXiv:2311.05772&lt;/li&gt;
&lt;li&gt;Wang, L., et al. (2023). "Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models." ACL 2023.&lt;/li&gt;
&lt;li&gt;LangChain. "Plan-and-Execute Agents." &lt;a href="https://blog.langchain.com/planning-agents/" rel="noopener noreferrer"&gt;https://blog.langchain.com/planning-agents/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>softwareengineering</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Stop Guessing If Your Prompt Is Better</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Thu, 22 Jan 2026 13:52:50 +0000</pubDate>
      <link>https://forem.com/shinpr/stop-guessing-if-your-prompt-is-better-5amb</link>
      <guid>https://forem.com/shinpr/stop-guessing-if-your-prompt-is-better-5amb</guid>
      <description>&lt;p&gt;You rewrote your prompt. The output looks different. But is it actually &lt;em&gt;better&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;Most of us have been there — reading prompt engineering best practices, tweaking instructions, and hoping the changes help. But without comparison, you're just guessing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;When you improve a prompt, you typically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run the new version&lt;/li&gt;
&lt;li&gt;Look at the output&lt;/li&gt;
&lt;li&gt;Think "yeah, this seems better"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But you're comparing against your memory of the old output. Different runs produce different results anyway. How do you know the improvement came from your changes and not just LLM variance?&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/shinpr/rashomon" rel="noopener noreferrer"&gt;rashomon&lt;/a&gt; is a Claude Code plugin that focuses on one practical question: &lt;strong&gt;"Did my instruction change actually affect the result?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It analyzes your prompt, generates an optimized version, runs both in isolated environments, and compares the actual results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Example
&lt;/h2&gt;

&lt;p&gt;I ran this prompt through rashomon:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Add logging to track function calling usage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A reasonable instruction. But vague.&lt;/p&gt;
&lt;h3&gt;
  
  
  What rashomon detected
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vague instructions&lt;/td&gt;
&lt;td&gt;What, where, and why to log are unclear&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No output format&lt;/td&gt;
&lt;td&gt;Log structure not specified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missing context&lt;/td&gt;
&lt;td&gt;No project architecture information&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  The optimized prompt
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Context

This is a Slack bot using Google Gemini API with function calling. 
The project uses a shared `logger` utility with structured logging.
Function calling flows through:
1. `GeminiService.executeWithRetry()` - detects function calls
2. `FunctionHandler.handleFunctionCall()` - executes them

## Task

Add logging to track function calling usage for analytics and debugging.

## Requirements

At Function Call Detection (GeminiService):
- Function name(s) detected
- Number of function calls in response

At Function Execution (FunctionHandler):
- Parameters passed (sanitized - exclude sensitive data)
- Execution duration
- Result status (success/failure)

## Output Format

logger.info('Function call detected', {
  functionName: 'executeWithRetry',
  detectedFunctions: ['searchNotionPages'],
  functionCallCount: 1
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  What changed
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Original&lt;/th&gt;
&lt;th&gt;Optimized&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Logging Scope&lt;/td&gt;
&lt;td&gt;1 stage (execution only)&lt;/td&gt;
&lt;td&gt;2 stages (detection + execution)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parameter Sanitization&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Passwords, tokens, secrets redacted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Files Modified&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The original prompt &lt;em&gt;looked&lt;/em&gt; reasonable, but led the agent to log at only one point. The optimized version covered both detection and execution — with security considerations the original didn't address.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classification: Structural Improvement&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  About Variance
&lt;/h2&gt;

&lt;p&gt;Not every difference is an improvement. rashomon distinguishes between structural gains and mere variance.&lt;/p&gt;

&lt;p&gt;I tried to create a Variance example — a prompt so clear that optimization wouldn't matter. I couldn't. In practice, the same vague prompt sometimes works beautifully, sometimes completely misses the point.&lt;/p&gt;

&lt;p&gt;rashomon just makes that inconsistency visible.&lt;/p&gt;
&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Requires &lt;a href="https://claude.ai/code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude
/plugin marketplace add shinpr/rashomon
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;rashomon@rashomon
&lt;span class="c"&gt;# Restart session&lt;/span&gt;
/rashomon Your prompt here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/shinpr" rel="noopener noreferrer"&gt;
        shinpr
      &lt;/a&gt; / &lt;a href="https://github.com/shinpr/rashomon" rel="noopener noreferrer"&gt;
        rashomon
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Compare, improve, and verify prompt changes with evidence — not vibes.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;p&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/shinpr/rashomon/assets/rashomon-banner.jpg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fshinpr%2Frashomon%2Fassets%2Frashomon-banner.jpg" width="600" alt="Rashomon"&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
  &lt;a href="https://claude.ai/code" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/77c3fac949481ce7960e41b57da074d377eb159a42c6cf4694cf225ddcada391/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436c61756465253230436f64652d506c7567696e2d707572706c65" alt="Claude Code"&gt;&lt;/a&gt;
  &lt;a href="https://github.com/shinpr/rashomon/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/d6bc2b26794002c24d023acaab01b6dbb953c57ab9cb80ba5b8aa2f2bd5de99a/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d626c7565" alt="License"&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;See what actually changes when you improve your prompts — not just different wording.&lt;/strong&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why rashomon?&lt;/h2&gt;
&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Inspired by the &lt;em&gt;Rashomon effect&lt;/em&gt; — the idea that the same event can produce different outcomes depending on perspective.
rashomon makes those differences explicit and comparable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Spending too much time on trial-and-error with prompts?&lt;/li&gt;
&lt;li&gt;Read best practices but not sure how they apply to your case?&lt;/li&gt;
&lt;li&gt;Want proof that your changes actually made things better?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;rashomon&lt;/strong&gt; analyzes, improves, and compares prompts—so you can see what &lt;em&gt;actually&lt;/em&gt; changed, and whether it matters.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Who Is This For?&lt;/h3&gt;
&lt;/div&gt;

&lt;p&gt;rashomon is designed for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers using Claude Code daily&lt;/li&gt;
&lt;li&gt;Teams iterating on complex prompts (coding, analysis, writing)&lt;/li&gt;
&lt;li&gt;Anyone who wants &lt;strong&gt;evidence&lt;/strong&gt;, not vibes, when improving prompts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not ideal if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You don't use git&lt;/li&gt;
&lt;li&gt;You want one-shot prompt rewriting without comparison&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Quick Example&lt;/h2&gt;

&lt;/div&gt;

&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;/rashomon Write a function to sort an array
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;What You Get&lt;/h3&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;1. Detected Issues&lt;/strong&gt;&lt;/p&gt;

&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;
&lt;pre class="notranslate"&gt;&lt;code&gt;- BP-002&lt;/code&gt;&lt;/pre&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/shinpr/rashomon" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;





</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Stop Putting Everything in AGENTS.md</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Mon, 19 Jan 2026 14:13:49 +0000</pubDate>
      <link>https://forem.com/shinpr/stop-putting-everything-in-agentsmd-22bl</link>
      <guid>https://forem.com/shinpr/stop-putting-everything-in-agentsmd-22bl</guid>
      <description>&lt;p&gt;If you're using Agentic Coding and find yourself explaining the same thing to the LLM over and over, you have a learning externalization problem.&lt;/p&gt;

&lt;p&gt;The fix seems obvious: write it down in AGENTS.md (or CLAUDE.md, depending on your tool) and never explain it again.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: This article uses "AGENTS.md" as the generic term for root instruction files. Claude Code uses CLAUDE.md, Codex uses AGENTS.md, and other tools have their own conventions. The principles apply regardless of the specific filename.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But here's what actually happens—you keep adding rules, AGENTS.md grows to 200+ lines, and somehow the LLM still ignores half of what you wrote.&lt;/p&gt;

&lt;p&gt;This article is about how to actually make your rules stick: &lt;strong&gt;where&lt;/strong&gt; to write them, &lt;strong&gt;what&lt;/strong&gt; to write, and how to verify they work.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Problem
&lt;/h2&gt;

&lt;p&gt;LLMs don't learn across sessions. Every conversation starts fresh. This means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You explain something once&lt;/li&gt;
&lt;li&gt;It works&lt;/li&gt;
&lt;li&gt;Next session, you explain it again&lt;/li&gt;
&lt;li&gt;And again&lt;/li&gt;
&lt;li&gt;Eventually you get frustrated&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The solution is to externalize your learnings into rules. But most people do this wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Common Mistakes
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mistake&lt;/th&gt;
&lt;th&gt;What Happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Put everything in AGENTS.md&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;It bloats, becomes noise, important rules get buried&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Put everything in code comments&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM doesn't load them into context unless you explicitly reference the file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Don't write it down at all&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You repeat yourself forever&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The thing is, &lt;strong&gt;where you write a rule determines whether the LLM actually follows it.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to Write Rules
&lt;/h2&gt;

&lt;p&gt;Not all rules belong in the same place. A simple decision tree:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;When is this rule needed?
│
├─ Always, on every task → AGENTS.md
│
├─ When working on a specific feature → Design Doc
│
├─ When using a specific technology → Rule file (skill)
│
└─ When performing a specific task type → Task guidelines
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: "Skills" are modular rule files used in tools like Codex and Claude Code. They allow you to inject context-specific rules only when relevant. If your tool doesn't have this concept, think of them as separate rule files you reference when needed.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Task guidelines" refers to rules that apply only during specific operations—like code review, migration, or content generation. Some call these "task rules" or "task-specific constraints."&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Full Picture
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;When Applied&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AGENTS.md&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;All tasks&lt;/td&gt;
&lt;td&gt;Always&lt;/td&gt;
&lt;td&gt;Approval flows, stop conditions, project principles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rule files (skills)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specific technology area&lt;/td&gt;
&lt;td&gt;When using that tech&lt;/td&gt;
&lt;td&gt;Type conventions, error handling patterns, function size limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Task guidelines&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specific task type&lt;/td&gt;
&lt;td&gt;When doing that task&lt;/td&gt;
&lt;td&gt;Subagent usage rules, review procedures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Design docs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specific feature&lt;/td&gt;
&lt;td&gt;When developing that feature&lt;/td&gt;
&lt;td&gt;Feature requirements, API specs, security constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code comments&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specific code location&lt;/td&gt;
&lt;td&gt;When modifying that code&lt;/td&gt;
&lt;td&gt;Implementation rationale, gotchas&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Key Question
&lt;/h3&gt;

&lt;p&gt;Ask yourself: &lt;strong&gt;"Is this needed on &lt;em&gt;every&lt;/em&gt; task in this project?"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Yes&lt;/strong&gt; → AGENTS.md&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No&lt;/strong&gt; → Put it closer to where it's needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps AGENTS.md lean (around 100 lines) and ensures task-specific rules don't create noise for unrelated work.&lt;/p&gt;

&lt;p&gt;You don't need to get this perfect from day one. Start with one thing: keep AGENTS.md small. That alone changes a lot.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Write
&lt;/h2&gt;

&lt;p&gt;This is the hard part. Most people write the wrong thing.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Principle: Write Root Causes, Not Incidents
&lt;/h3&gt;

&lt;p&gt;When something goes wrong, the instinct is to document the specific incident. But this creates bias: the LLM overfits to that one case.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ Bad (specific incident)
"The getUser() function in UserService was missing null check"

✅ Good (root cause / system fix)
"Always null-check return values from external APIs"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first one only helps if the LLM encounters that exact function again. The second one prevents the entire &lt;em&gt;class&lt;/em&gt; of errors.&lt;/p&gt;
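&lt;p&gt;&lt;em&gt;To make the root-cause rule concrete, here is a minimal JavaScript sketch. The &lt;code&gt;requireValue&lt;/code&gt; and &lt;code&gt;getUserFromApi&lt;/code&gt; names are illustrative, not from the article:&lt;/em&gt;&lt;/p&gt;

```javascript
// Hedged sketch: the root-cause rule ("always null-check return values from
// external APIs") enforced at one boundary, instead of patching getUser() alone.
// requireValue and getUserFromApi are illustrative names.

function requireValue(value, source) {
  if (value === null || value === undefined) {
    throw new Error(`External API returned no value: ${source}`);
  }
  return value;
}

// Every external call is routed through the check, so the whole class of
// "missing null check" bugs is covered, not just one function.
function getUserFromApi(api, id) {
  const raw = api.getUser(id); // external API may return null
  return requireValue(raw, `getUser(${id})`);
}
```

&lt;p&gt;A rule phrased at this level of generality maps directly onto a single enforcement point in code, which is exactly what the incident-level phrasing cannot do.&lt;/p&gt;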

&lt;h3&gt;
  
  
  Specific Incident vs. Root Cause
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Specific Incident&lt;/th&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Applies to&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;That one location&lt;/td&gt;
&lt;td&gt;All similar cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prevents recurrence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Weakly (same bug elsewhere)&lt;/td&gt;
&lt;td&gt;Strongly (operates as principle)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bias risk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (overfitting)&lt;/td&gt;
&lt;td&gt;Low (generalizable)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Finding the Root Cause
&lt;/h3&gt;

&lt;p&gt;When you encounter an issue, ask:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Why did this mistake happen?&lt;/strong&gt; (direct cause)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why wasn't it prevented?&lt;/strong&gt; (system gap)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where else could this same mistake occur?&lt;/strong&gt; (scope)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct cause: &lt;code&gt;getUser()&lt;/code&gt; was missing null check&lt;/li&gt;
&lt;li&gt;System gap: We trusted external API return values without validation&lt;/li&gt;
&lt;li&gt;Scope: All external API calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ &lt;strong&gt;Rule to write&lt;/strong&gt;: "Always null-check return values from external APIs"&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Verify Rules Work
&lt;/h2&gt;

&lt;p&gt;This is the step most people skip—and it's critical.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Principle: Fix the System, Then Discard and Retry
&lt;/h3&gt;

&lt;p&gt;When you add or modify a rule in AGENTS.md or a skill file, you need to verify it actually works. The only way to do this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Add/modify the rule&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discard&lt;/strong&gt; the current artifact (or stash it in a branch)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start a new session&lt;/strong&gt; with the updated rules&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Re-run the same task&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify&lt;/strong&gt; the issue doesn't recur
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Continue with existing artifact after rule change → ❌
Discard and restart with new rules → ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;If you keep the existing artifact and just continue, you're still operating in a context polluted by the old system. The new rule might not get properly applied because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The existing artifact carries biases from before the rule existed&lt;/li&gt;
&lt;li&gt;The LLM might try to "reconcile" the new rule with existing work rather than applying it cleanly&lt;/li&gt;
&lt;li&gt;You can't tell if the rule actually works or if you just manually fixed the symptom&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Verification Checklist
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Modified the rule (AGENTS.md / skill file / task guideline)&lt;/li&gt;
&lt;li&gt;[ ] Discarded current artifact (or moved to a branch)&lt;/li&gt;
&lt;li&gt;[ ] Started new session with updated rules&lt;/li&gt;
&lt;li&gt;[ ] Re-ran the same task&lt;/li&gt;
&lt;li&gt;[ ] Confirmed the issue doesn't recur&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For small changes, you can stash instead of discard. The key is: &lt;strong&gt;test the system in isolation&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Write Rules
&lt;/h2&gt;

&lt;p&gt;Not every issue deserves a rule. Some guidance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;Write a Rule?&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;You explained the same thing twice&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prevent the third time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Encountered unexpected behavior&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Maybe&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Find root cause first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task completed successfully&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Maybe&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrospective—any generalizable insights?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Found a serious bug&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prevent recurrence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Warning Signs You're Over-Documenting
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AGENTS.md exceeds &lt;strong&gt;100 lines&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A single rule file exceeds &lt;strong&gt;300 lines&lt;/strong&gt; (~1,500 tokens)&lt;/li&gt;
&lt;li&gt;Rules take more than 1 minute to read through&lt;/li&gt;
&lt;li&gt;You find yourself thinking "is this really needed every time?"&lt;/li&gt;
&lt;li&gt;Rules contradict each other&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you see these signs, it's time to prune. &lt;strong&gt;Rule maintenance includes deletion.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Write Rules (Cheat Sheet)
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;This section is a reference.&lt;/strong&gt; You don't need to read it all now—come back when you're actually writing a rule. The rest of the article stands on its own.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  1. Minimum Viable Length
&lt;/h3&gt;

&lt;p&gt;Context is precious. Same meaning, shorter expression. But don't sacrifice clarity for brevity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;❌ Verbose (42 chars)
If an error occurs, you must always log it

✅ Concise (25 chars)
All errors must be logged

❌ Too short (unclear)
Log errors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. No Duplication
&lt;/h3&gt;

&lt;p&gt;Same content in multiple places wastes context and creates update drift.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;❌ Duplicated
&lt;span class="gh"&gt;# base.md&lt;/span&gt;
Standard error format: { success: false, error: string }

&lt;span class="gh"&gt;# api.md&lt;/span&gt;
Errors use { success: false, error: string } format

✅ Single source
&lt;span class="gh"&gt;# base.md&lt;/span&gt;
Standard error format: { success: false, error: string }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
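&lt;p&gt;&lt;em&gt;The same single-source principle applies in code. A minimal JavaScript sketch; the helper names are illustrative:&lt;/em&gt;&lt;/p&gt;

```javascript
// Hedged sketch: one definition of the standard result format,
// mirroring the "single source" rule above. Names are illustrative.
function failure(error) {
  return { success: false, error };
}

function success(data) {
  return { success: true, data };
}

// Every module builds results through these helpers, so the shape
// can never drift between duplicated copies of the spec.
function parsePort(input) {
  const port = Number(input);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    return failure(`invalid port: ${input}`);
  }
  return success(port);
}
```

&lt;p&gt;When the rule file and the code both point at one definition, an update in one place is an update everywhere.&lt;/p&gt;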



&lt;h3&gt;
  
  
  3. Measurable Criteria
&lt;/h3&gt;

&lt;p&gt;Vague instructions create interpretation variance. Use numbers and specific conditions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;✅ Measurable
&lt;span class="p"&gt;-&lt;/span&gt; Functions: max 30 lines
&lt;span class="p"&gt;-&lt;/span&gt; Cyclomatic complexity: max 10
&lt;span class="p"&gt;-&lt;/span&gt; Test coverage: min 80%

❌ Vague
&lt;span class="p"&gt;-&lt;/span&gt; Readable code
&lt;span class="p"&gt;-&lt;/span&gt; Sufficient testing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
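&lt;p&gt;&lt;em&gt;Measurable criteria have a bonus: a linter can enforce them, so the rule costs no context at all. A hedged sketch of an ESLint flat config (assuming ESLint 9+; adapt rule names to your setup):&lt;/em&gt;&lt;/p&gt;

```javascript
// eslint.config.js: sketch of the measurable criteria above, enforced mechanically.
export default [
  {
    rules: {
      // "Functions: max 30 lines"
      "max-lines-per-function": ["error", { max: 30, skipBlankLines: true }],
      // "Cyclomatic complexity: max 10"
      "complexity": ["error", { max: 10 }],
    },
  },
];
```

&lt;p&gt;The coverage threshold belongs in the test runner's config rather than the linter. Anything a tool can check is a rule the LLM no longer needs to remember.&lt;/p&gt;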



&lt;h3&gt;
  
  
  4. Recommendations Over Prohibitions
&lt;/h3&gt;

&lt;p&gt;Banning things without alternatives leaves the LLM guessing. Show the right way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;✅ Recommendation + rationale
[State Management]
Recommended: Zustand or Context API
Reason: Global variables make testing difficult, state tracking complex
Avoid: window.globalState = { ... }

❌ Prohibition list
&lt;span class="p"&gt;-&lt;/span&gt; Don't use global variables
&lt;span class="p"&gt;-&lt;/span&gt; Don't store values on window
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Priority Order
&lt;/h3&gt;

&lt;p&gt;LLMs pay more attention to what comes first. Lead with the most important rules.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Critical (Must Follow)&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; All APIs require JWT authentication
&lt;span class="p"&gt;2.&lt;/span&gt; Rate limit: 100 requests/minute

&lt;span class="gu"&gt;## Standard Specs&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Methods: Follow REST principles
&lt;span class="p"&gt;-&lt;/span&gt; Body: JSON format

&lt;span class="gu"&gt;## Edge Cases (Only When Applicable)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; File uploads may use multipart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6. Clear Scope Boundaries
&lt;/h3&gt;

&lt;p&gt;State what the rule covers—and what it doesn't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Scope&lt;/span&gt;

&lt;span class="gu"&gt;### Applies To&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; REST API endpoints
&lt;span class="p"&gt;-&lt;/span&gt; GraphQL endpoints

&lt;span class="gu"&gt;### Does Not Apply To&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Static file serving
&lt;span class="p"&gt;-&lt;/span&gt; Health checks (/health)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Feedback Loop
&lt;/h2&gt;

&lt;p&gt;This is how it all fits together in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Working with LLM]
       │
       ├─ Issue occurs
       │      │
       │      ▼
       │  Find root cause (not just symptom)
       │      │
       │      ▼
       │  Decide where to write (AGENTS.md? Skill? Task guideline?)
       │      │
       │      ▼
       │  Write the rule
       │      │
       │      ▼
       │  Discard current work
       │      │
       │      ▼
       │  New session with updated rules
       │      │
       │      ▼
       │  Verify issue doesn't recur
       │
       ▼
[Continue working]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The goal is to reach a state where &lt;strong&gt;you never explain the same thing twice&lt;/strong&gt;. Every explanation either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gets externalized into a rule, or&lt;/li&gt;
&lt;li&gt;Was truly a one-off that doesn't need capturing&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Passing Feedback Correctly
&lt;/h2&gt;

&lt;p&gt;One more thing: when you give feedback to the LLM, don't just paste error logs. Include your &lt;em&gt;intent&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ Just the error
[Stack trace]

✅ Intent + error
Goal: Redirect to dashboard after user authentication
Issue: Following error occurred
[Stack trace]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without the intent, the LLM optimizes for "make the error go away." With the intent, it optimizes for "achieve the goal while resolving this error."&lt;/p&gt;

&lt;p&gt;These are very different things.&lt;/p&gt;




&lt;h2&gt;
  
  
  Anti-Pattern Summary
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Quick reference if you want to check your current practices:&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Anti-Pattern&lt;/th&gt;
&lt;th&gt;Reference&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Put everything in AGENTS.md&lt;/td&gt;
&lt;td&gt;→ "Where to Write Rules"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write specific incidents instead of root causes&lt;/td&gt;
&lt;td&gt;→ "What to Write"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Continue with old artifacts after changing rules&lt;/td&gt;
&lt;td&gt;→ "How to Verify Rules Work"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;List only prohibitions without recommendations&lt;/td&gt;
&lt;td&gt;→ "How to Write Rules" #4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Keep explaining instead of writing it down&lt;/td&gt;
&lt;td&gt;→ "When to Write Rules"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AGENTS.md is not a dumping ground.&lt;/strong&gt; Only rules needed on &lt;em&gt;every&lt;/em&gt; task belong there. Everything else goes closer to where it's used.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write root causes, not incidents.&lt;/strong&gt; "Null-check external API returns" beats "UserService.getUser() was missing null check."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test your rules.&lt;/strong&gt; After adding a rule, discard current work and re-run. If the issue recurs, the rule isn't working.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maintenance includes deletion.&lt;/strong&gt; If AGENTS.md is over 100 lines, you've probably over-documented. Prune ruthlessly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Explain twice, document once.&lt;/strong&gt; If you're explaining the same thing for a second time, stop and externalize it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once you stop expecting rules alone to do the work, the real question becomes how to design the workflow around them. In practice, that starts with &lt;a href="https://dev.to/shinpr/planning-is-the-real-superpower-of-agentic-coding-1imm"&gt;planning—before execution ever begins&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Research
&lt;/h2&gt;

&lt;p&gt;The practices in this article are grounded in LLM research:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SALAM (Wang et al., 2023)&lt;/strong&gt;: LLM self-feedback is often inaccurate. Structured feedback from external agents (or externalized rules) is more effective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LEMA (An et al., 2023)&lt;/strong&gt;: Learning from mistakes (error → explanation → correction) improves LLM reasoning ability—but this requires explicit externalization of what was learned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feedback Loop for IaC (Palavalli et al., 2024)&lt;/strong&gt;: The gains from each feedback iteration shrink sharply and soon plateau. This supports the "discard and restart" approach over endless iteration in the same context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reflexion (Shinn et al., 2023)&lt;/strong&gt;: Combining short-term memory (recent trajectory) with long-term memory (past experience) enables effective self-improvement. Externalized rules function as that long-term memory.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Wang, D., et al. (2023). "Learning from Mistakes via Cooperative Study Assistant for Large Language Models." arXiv:2305.13829&lt;/li&gt;
&lt;li&gt;An, S., et al. (2023). "Learning From Mistakes Makes LLM Better Reasoner." arXiv:2310.20689&lt;/li&gt;
&lt;li&gt;Palavalli, M. A., et al. (2024). "Using a Feedback Loop for LLM-based Infrastructure as Code Generation." arXiv:2411.19043&lt;/li&gt;
&lt;li&gt;Shinn, N., et al. (2023). "Reflexion: Language Agents with Verbal Reinforcement Learning." NeurIPS 2023.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>softwareengineering</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Why LLMs Are Bad at "First Try" and Great at Verification</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Mon, 12 Jan 2026 12:46:19 +0000</pubDate>
      <link>https://forem.com/shinpr/why-llms-are-bad-at-first-try-and-great-at-verification-4kcf</link>
      <guid>https://forem.com/shinpr/why-llms-are-bad-at-first-try-and-great-at-verification-4kcf</guid>
      <description>&lt;p&gt;I used to spend hours crafting the perfect prompt.&lt;br&gt;
Detailed instructions, examples, constraints—the works.&lt;/p&gt;

&lt;p&gt;And the AI would still add random features I never asked for.&lt;br&gt;
Or refactor code that was perfectly fine.&lt;br&gt;
Or skip steps it decided were "unnecessary."&lt;/p&gt;

&lt;p&gt;Eventually it clicked: I was fighting a losing battle.&lt;br&gt;
So I stopped trying to control generation and started focusing on verification.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Failure Patterns You've Probably Seen
&lt;/h2&gt;

&lt;p&gt;Before diving into why, these are the common anti-patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Giant Prompt Syndrome&lt;/strong&gt;: Cramming requirements, design, implementation, and improvement into a single prompt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overconfidence in Abstract Instructions&lt;/strong&gt;: Expecting "think carefully" or "be thorough" to actually improve quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Invisible Loop&lt;/strong&gt;: Thinking you're iterating when you're actually spinning in circles within the same biased context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Bloat&lt;/strong&gt;: Adding "just in case" information until the actually important instructions get buried&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of these sound familiar, you're in the right place.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Insight
&lt;/h2&gt;

&lt;p&gt;The claim is simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLMs perform better at "verify and improve existing artifacts" than at "controlled first-time generation."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of trying to get the perfect output on the first attempt, you get better results by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Having the LLM produce &lt;em&gt;something&lt;/em&gt; first&lt;/li&gt;
&lt;li&gt;Then having it verify and improve that output&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is grounded in how LLMs actually process information.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Verification Works Better
&lt;/h2&gt;

&lt;p&gt;At first, I assumed better prompts would lead to better first-shot output. But after enough failures, the pattern became clear: there are three interconnected reasons why LLMs become "smarter" when they have something to work with.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. External Feedback Changes the Task
&lt;/h3&gt;

&lt;p&gt;When an artifact exists, the task fundamentally transforms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Without artifact&lt;/strong&gt;: "Generate something good" (vague, open-ended)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With artifact&lt;/strong&gt;: "Identify what's wrong with this and fix it" (specific, bounded)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second task has clearer success criteria. The LLM isn't guessing what "good" means—it can evaluate concrete issues against concrete output.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Position Bias (Lost in the Middle)
&lt;/h3&gt;

&lt;p&gt;Research has shown that LLMs exhibit a U-shaped attention pattern: they prioritize information at the &lt;strong&gt;beginning&lt;/strong&gt; and &lt;strong&gt;end&lt;/strong&gt; of their context window, while information in the &lt;strong&gt;middle&lt;/strong&gt; tends to get overlooked.&lt;/p&gt;

&lt;p&gt;When you feed an artifact as input to a new session, it sits near the start of the context, exactly where attention is strongest. The model can hardly avoid engaging with it.&lt;/p&gt;

&lt;p&gt;This also explains why that really important instruction you buried in paragraph 5 of your mega-prompt keeps getting ignored.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Task Clarity Drives Performance
&lt;/h3&gt;

&lt;p&gt;"Improve this code" is a more concrete task than "write good code."&lt;/p&gt;

&lt;p&gt;The presence of an artifact provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A specific target for evaluation&lt;/li&gt;
&lt;li&gt;Clear boundaries for the scope of work&lt;/li&gt;
&lt;li&gt;Implicit success criteria (this one matters more than you'd think—"better than before" is much easier to verify than "good enough")&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Externality Spectrum
&lt;/h2&gt;

&lt;p&gt;What made the biggest difference for me was reviewing in a completely separate context.&lt;br&gt;
Once I stopped letting the generator review its own work, the blind spots became obvious.&lt;/p&gt;

&lt;p&gt;Not all feedback loops are created equal. Different approaches rank very differently in effectiveness:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5rubl6dcai7hiugd9h9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5rubl6dcai7hiugd9h9.png" alt="Verification methods effectiveness spectrum from external signals to self-introspection" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The thing is, looping within the same session is fundamentally &lt;em&gt;internal&lt;/em&gt; feedback. The LLM is still operating within its original generation biases. Only by separating context do you get true "external" perspective.&lt;/p&gt;

&lt;p&gt;In short: if the context doesn't change, neither does the model's perspective.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Implications
&lt;/h2&gt;

&lt;p&gt;So what do you actually do with this?&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Artifact-First Workflow
&lt;/h3&gt;

&lt;p&gt;Stop trying to get everything right in one shot. Instead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Generate Phase&lt;/strong&gt;: Get &lt;em&gt;something&lt;/em&gt; out, even if imperfect. Don't over-specify.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External Feedback&lt;/strong&gt;: Run the code, execute tests, use linters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification Phase&lt;/strong&gt; (new session): Feed the artifact + feedback to a fresh context
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Generation Session]
    │
    ├── Input: Requirements, constraints
    ├── Output: Artifact (code, design, etc.) + brief intent summary (1-3 lines)
    │
    ▼
[External Feedback]
    │
    ├── Code execution
    ├── Test execution
    ├── Linter/static analysis
    │
    ▼
[Verification Session]  ← Fresh context
    │
    ├── Input: Artifact + intent summary + feedback results
    ├── Output: Improved artifact
    │
    ▼
[Repeat or Complete]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
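&lt;p&gt;&lt;em&gt;The handoff into the verification session can be as simple as assembling one prompt. A hedged sketch; the function and field names are illustrative, not a real API:&lt;/em&gt;&lt;/p&gt;

```javascript
// Hedged sketch: composing the input for a fresh verification session.
// All names are illustrative; the structure is what matters.
function buildVerificationPrompt({ artifact, intentSummary, feedback }) {
  return [
    "## Artifact",
    artifact,
    "## Intent (1-3 lines: why this exists)",
    intentSummary,
    "## External feedback (tests, linter, runtime)",
    feedback,
    "Identify concrete issues in the artifact and propose fixes.",
  ].join("\n\n");
}
```

&lt;p&gt;Note what is absent: the generation session's reasoning log. Only the artifact, a short statement of intent, and the external signals cross the boundary.&lt;/p&gt;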



&lt;h3&gt;
  
  
  2. Know When to Separate Sessions
&lt;/h3&gt;

&lt;p&gt;Session separation isn't always necessary. Use your judgment:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task Type&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Small, localized fixes (typos, formatting)&lt;/td&gt;
&lt;td&gt;Same session is fine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Clear error fixes (with stack trace)&lt;/td&gt;
&lt;td&gt;Same session works—external feedback (error log) exists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Design changes, architecture revisions&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Separate sessions&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality improvements, refactoring&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Separate sessions&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Direction changes, requirement pivots&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Separate sessions&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb&lt;/strong&gt;: If you're feeling "something isn't working," that's often a sign to start a fresh session. Your intuition about context pollution is usually right.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. What to Pass Between Sessions
&lt;/h3&gt;

&lt;p&gt;Not everything from the generation phase should go to the verification phase.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Content&lt;/th&gt;
&lt;th&gt;Should Pass?&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full Chain-of-Thought log&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Verbose, becomes noise. Important info gets lost (position bias)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intent summary (1-3 lines)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Preserves the "why" compactly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final decision + rationale&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Useful for debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rejected alternatives&lt;/td&gt;
&lt;td&gt;Maybe&lt;/td&gt;
&lt;td&gt;Only when specifically relevant&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The principle: &lt;strong&gt;Pass the "why," not the "how I thought about it."&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Designing Your AGENTS.md
&lt;/h2&gt;

&lt;p&gt;You don't need to redesign your AGENTS.md all at once. But understanding position bias changes how you think about what goes in it.&lt;/p&gt;

&lt;p&gt;This insight has direct implications for how you structure AGENTS.md (or whatever root instruction file you use—CLAUDE.md, cursorrules, etc.).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Position Bias Problem
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context Window Position → Attention Weight

[AGENTS.md]            ← Start: HIGH attention
       ↓
[Middle instructions]  ← Middle: LOW attention (Lost in the Middle)
       ↓
[User prompt]          ← End: HIGH attention
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your AGENTS.md is bloated, the truly important principles get diluted. Adding more "just in case" actually makes everything weaker.&lt;/p&gt;

&lt;h3&gt;
  
  
  Design Principles
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AGENTS.md&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Core principles only. ~100 lines. What must be followed on &lt;em&gt;every&lt;/em&gt; task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Task-specific info&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Inject via skills, command arguments, or reference files &lt;em&gt;when needed&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Why separate?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Context separation lets you compose optimal information for each task&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What Belongs in AGENTS.md
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Project purpose and domain&lt;/li&gt;
&lt;li&gt;Non-negotiable constraints (security, naming conventions)&lt;/li&gt;
&lt;li&gt;Tech stack overview&lt;/li&gt;
&lt;li&gt;Communication style&lt;/li&gt;
&lt;li&gt;Error handling behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What Doesn't Belong
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Individual feature specs&lt;/li&gt;
&lt;li&gt;API details&lt;/li&gt;
&lt;li&gt;Task-specific workflows&lt;/li&gt;
&lt;li&gt;Long code examples&lt;/li&gt;
&lt;li&gt;"Nice to have" information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Test&lt;/strong&gt;: Ask "Is this needed for &lt;em&gt;every&lt;/em&gt; task?" If no, it belongs elsewhere.&lt;/p&gt;
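&lt;p&gt;For illustration only, a lean skeleton that passes that test might look like this (project details are invented):&lt;/p&gt;

```markdown
# Project: acme-billing (subscription billing service)
# Stack: TypeScript, Node 20, Postgres, Jest

## Non-negotiables
- Never log card numbers or API tokens
- All money values are integer cents

## Communication
- Ask before changing a public API signature

## On errors
- Surface failing test output verbatim; do not paraphrase
```

Everything else (feature specs, endpoint details, long examples) lives in skills or reference files and is injected per task.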




&lt;h2&gt;
  
  
  The Human Role
&lt;/h2&gt;

&lt;p&gt;In Agentic Coding, you're not "using an LLM"—you're &lt;strong&gt;designing a system&lt;/strong&gt; where an LLM operates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Your Responsibilities
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;th&gt;Concrete Actions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Design external feedback&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decide which tests to run, which linters to use, what "success" means&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Determine session boundaries&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Judge when to cut context, what carries over&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Define quality gates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Separate automated checks from human review needs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maintain AGENTS.md&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Keep core principles tight, prevent bloat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Articulate intent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Create or validate the "intent summary" that passes between sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Automation vs. Human Review
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Good for automation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code execution, test execution&lt;/li&gt;
&lt;li&gt;Linters, formatters&lt;/li&gt;
&lt;li&gt;Type checking&lt;/li&gt;
&lt;li&gt;Security scans&lt;/li&gt;
&lt;li&gt;Applying formulaic fixes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requires human review:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Design decision validity&lt;/li&gt;
&lt;li&gt;Requirement alignment&lt;/li&gt;
&lt;li&gt;Session boundary judgment&lt;/li&gt;
&lt;li&gt;Trade-off decisions&lt;/li&gt;
&lt;li&gt;Validating the "why"&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  A Framework: Context Separation at Every Level
&lt;/h2&gt;

&lt;p&gt;It took me a while to realize this wasn't about writing better prompts—it was about where I drew the boundaries.&lt;/p&gt;

&lt;p&gt;You don't need to apply all of this rigidly. But when something feels off, one of these levels is usually the culprit:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzv0ou7eep55r3h8b5dko.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzv0ou7eep55r3h8b5dko.png" alt="Four levels of Context Separation Principle" width="800" height="993"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Research Behind This
&lt;/h2&gt;

&lt;p&gt;These aren't just opinions—they're grounded in LLM research:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-Refine (Madaan et al., 2023)&lt;/strong&gt;&lt;br&gt;
The generate → feedback → refine loop shows approximately 20% improvement over single-shot generation. Key insight: the improvement comes from the structured iteration, not from the model "trying harder."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lost in the Middle (Liu et al., 2023)&lt;/strong&gt;&lt;br&gt;
LLMs show U-shaped attention bias, heavily weighting the beginning and end of context while underweighting the middle. This explains why your carefully crafted instructions in paragraph 5 keep getting ignored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLMs Cannot Self-Correct Reasoning Yet (Huang et al., 2023)&lt;/strong&gt;&lt;br&gt;
Without external feedback, self-correction doesn't work—and can actually make things worse. "Review your work" as an instruction has minimal effect; external signals (test failures, linter errors) are what drive actual improvement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't optimize for first-shot perfection&lt;/strong&gt;. Get something out, then improve it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Session separation is real&lt;/strong&gt;. The same context that generated the artifact will struggle to objectively improve it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;External feedback is non-negotiable&lt;/strong&gt;. Tests, linters, execution results—these are what drive quality, not "think harder" prompts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep AGENTS.md lean&lt;/strong&gt;. Position bias means bloat actively hurts. If it's not needed for every task, move it out.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pass intent, not process&lt;/strong&gt;. Between sessions, transfer the "why" in 1-3 lines, not the full thought log.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You're a system designer&lt;/strong&gt;. Your job isn't to use the LLM—it's to design the workflow, feedback loops, and context boundaries that let it perform.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This article focused on &lt;em&gt;why&lt;/em&gt; verification-oriented workflows outperform first-shot generation. In future articles, I'll cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How to structure work plans&lt;/strong&gt; that turn execution into verification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where to put rules&lt;/strong&gt; so they actually get followed (hint: not all in AGENTS.md)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've been struggling with inconsistent LLM output or finding that your detailed prompts underperform simpler ones, try restructuring around verification. The difference is often dramatic.&lt;/p&gt;

&lt;p&gt;What's your experience been? Did switching to a verification-first approach change anything for you?&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Madaan, A., et al. (2023). "Self-Refine: Iterative Refinement with Self-Feedback." arXiv:2303.17651&lt;/li&gt;
&lt;li&gt;Liu, N. F., et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172&lt;/li&gt;
&lt;li&gt;Huang, J., et al. (2023). "Large Language Models Cannot Self-Correct Reasoning Yet." ICLR 2024. arXiv:2310.01798&lt;/li&gt;
&lt;li&gt;Hsieh, C.-Y., et al. (2024). "Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization." ACL 2024 Findings. arXiv:2406.16008&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>softwareengineering</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>Building a Local RAG for Agentic Coding: From Fixed Chunks to Semantic Search with Keyword Boost</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Tue, 06 Jan 2026 12:12:57 +0000</pubDate>
      <link>https://forem.com/shinpr/building-a-local-rag-for-agentic-coding-from-fixed-chunks-to-semantic-search-with-keyword-boost-15m8</link>
      <guid>https://forem.com/shinpr/building-a-local-rag-for-agentic-coding-from-fixed-chunks-to-semantic-search-with-keyword-boost-15m8</guid>
      <description>&lt;p&gt;Started with a simple RAG for MCP—the kind of thing you build in a weekend. Ended up implementing semantic chunking (Max-Min algorithm) and rethinking hybrid search entirely. This article is written for people who have already built RAG systems and started hitting quality limits. If you've hit walls with fixed-size chunks and top-K retrieval, this might be useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Context: RAG for Agentic Coding&lt;/li&gt;
&lt;li&gt;The Invisible Problem: What Does the LLM Actually Receive?&lt;/li&gt;
&lt;li&gt;Semantic Chunking: Why Fixed Chunks Break Down&lt;/li&gt;
&lt;li&gt;When Semantic Chunks Broke Hybrid Search&lt;/li&gt;
&lt;li&gt;Results: What Actually Changed&lt;/li&gt;
&lt;li&gt;Architecture Summary&lt;/li&gt;
&lt;li&gt;The Other Side: Query Quality&lt;/li&gt;
&lt;li&gt;Tradeoffs and Limitations&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1. Context: RAG for Agentic Coding
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Problem statement
&lt;/h3&gt;

&lt;p&gt;The request was straightforward: load domain knowledge from PDFs for a specialized agent. Framework best practices, project principles (rules), and specifications (PRDs)—the kind of documents you'd want an AI coding assistant to reference while working.&lt;/p&gt;

&lt;p&gt;The constraints made it interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Personal use&lt;/strong&gt; → No external APIs, privacy matters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP ecosystem&lt;/strong&gt; → Integration with Cursor, Claude Code, Codex&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Agentic Coding support"&lt;/strong&gt; as the use case&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Initial implementation
&lt;/h3&gt;

&lt;p&gt;The first version was textbook RAG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Document → Fixed-size chunks (500 chars) → Embeddings → LanceDB
Query → Vector search → Top-K results → LLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Standard fixed-size chunking. Vector search with top-K retrieval. Local embedding model via &lt;a href="https://huggingface.co/docs/transformers.js" rel="noopener noreferrer"&gt;Transformers.js&lt;/a&gt;. &lt;a href="https://lancedb.com/" rel="noopener noreferrer"&gt;LanceDB&lt;/a&gt; for vector storage—file-based, no server process required.&lt;/p&gt;

&lt;p&gt;It worked... sort of.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The Invisible Problem: What Does the LLM Actually Receive?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Discovery
&lt;/h3&gt;

&lt;p&gt;Here's the thing about MCP: search results go directly to the LLM. The user never sees them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → LLM → MCP(RAG) → LLM → Response
               ↑
         Results hidden from user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the RAG returns garbage, you don't see it. You just notice the LLM behaving strangely—making additional searches, reading files directly, or giving incomplete answers.&lt;/p&gt;

&lt;p&gt;To debug this, I forced the LLM to output the raw JSON search results. The prompt was simple: "Show me the exact JSON you received from the RAG search."&lt;/p&gt;

&lt;p&gt;What I found: &lt;strong&gt;lots of irrelevant chunks polluting the context.&lt;/strong&gt; Page markers, decoration lines, fragments cut mid-sentence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why top-K fails
&lt;/h3&gt;

&lt;p&gt;The standard approach is "return the top 10 closest vectors." But closeness in vector space doesn't equal usefulness.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increasing K just adds more noise&lt;/li&gt;
&lt;li&gt;No quality signal—just "top 10 closest vectors"&lt;/li&gt;
&lt;li&gt;A chunk with distance 0.1 and another with distance 0.9 both make the cut if they're in the top K&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  First fix: Quality filtering
&lt;/h3&gt;

&lt;p&gt;Three mechanisms, each addressing a different problem:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Distance-based threshold (&lt;code&gt;RAG_MAX_DISTANCE&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/vectordb/index.ts&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxDistance&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;distanceRange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxDistance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only return results below a certain distance. If nothing is close enough, return nothing—better than returning garbage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Relevance gap grouping (&lt;code&gt;RAG_GROUPING&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of arbitrary K, detect natural "quality groups" in the results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/vectordb/index.ts&lt;/span&gt;
&lt;span class="c1"&gt;// Calculate statistical threshold: mean + 1.5 * std&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;mean&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;GROUPING_BOUNDARY_STD_MULTIPLIER&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;std&lt;/span&gt;

&lt;span class="c1"&gt;// Find significant gaps (group boundaries)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;boundaries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;gaps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gap&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// 'similar' mode: first group only&lt;/span&gt;
&lt;span class="c1"&gt;// 'related' mode: top 2 groups&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results cluster naturally—there's usually a gap between "highly relevant" and "somewhat related." This detects that gap statistically.&lt;/p&gt;
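&lt;p&gt;As a standalone sketch of that statistical gap detection (a hypothetical &lt;code&gt;firstGroup&lt;/code&gt; helper corresponding to 'similar' mode, simplified from what lives in &lt;code&gt;src/vectordb/index.ts&lt;/code&gt;):&lt;/p&gt;

```typescript
// Sketch: keep only the first quality group of search results.
// Assumes `distances` is sorted ascending (closest first).
const GROUPING_BOUNDARY_STD_MULTIPLIER = 1.5

function firstGroup(distances: number[]): number[] {
  if (distances.length > 1) {
    // Gaps between consecutive distances
    const gaps = distances.slice(1).map(function (d, i) {
      return d - distances[i]
    })
    const mean = gaps.reduce(function (a, b) { return a + b }, 0) / gaps.length
    const variance =
      gaps.reduce(function (a, g) { return a + (g - mean) ** 2 }, 0) / gaps.length
    // Statistical threshold: mean + 1.5 * std
    const threshold = mean + GROUPING_BOUNDARY_STD_MULTIPLIER * Math.sqrt(variance)
    // First gap larger than the threshold marks the group boundary
    const boundary = gaps.findIndex(function (g) { return g > threshold })
    if (boundary >= 0) return distances.slice(0, boundary + 1)
  }
  return distances
}

// The 0.4 gap from 0.15 to 0.55 exceeds the threshold, so only
// the tight cluster survives: [0.1, 0.12, 0.15]
console.log(firstGroup([0.10, 0.12, 0.15, 0.55, 0.58]))
```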

&lt;p&gt;&lt;strong&gt;3. Garbage chunk removal&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/chunker/semantic-chunker.ts&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;isGarbageChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Decoration line patterns (----, ====, ****, etc.)&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;[\-&lt;/span&gt;&lt;span class="sr"&gt;=_.*#|~`@!%^&amp;amp;*()&lt;/span&gt;&lt;span class="se"&gt;\[\]&lt;/span&gt;&lt;span class="sr"&gt;{}&lt;/span&gt;&lt;span class="se"&gt;\\/&lt;/span&gt;&lt;span class="sr"&gt;&amp;lt;&amp;gt;:+&lt;/span&gt;&lt;span class="se"&gt;\s]&lt;/span&gt;&lt;span class="sr"&gt;+$/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;trimmed&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="c1"&gt;// Excessive repetition of single character (&amp;gt;80%)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;maxCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;charCounts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;maxCount&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;trimmed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Page markers, separator lines, repeated characters—filter them before they ever reach the index.&lt;/p&gt;

&lt;h3&gt;
  
  
  New problem emerged
&lt;/h3&gt;

&lt;p&gt;Technical terms like &lt;code&gt;useEffect&lt;/code&gt; or &lt;code&gt;ERR_CONNECTION_REFUSED&lt;/code&gt; were getting filtered out. They're semantically distant from natural language queries but keyword-relevant.&lt;/p&gt;

&lt;p&gt;The fix: hybrid search (semantic + keyword blend). But implementing it properly required rethinking the chunking strategy first.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Semantic Chunking: Why Fixed Chunks Break Down
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Trigger
&lt;/h3&gt;

&lt;p&gt;I read about "semantic center of gravity" in chunks—the idea that a chunk should have a coherent meaning, not just a coherent length.&lt;/p&gt;

&lt;p&gt;Then I observed the LLM's behavior: after RAG search, it would often search again with different terms, or just read the file directly. The chunks weren't trustworthy—they lacked sufficient context for the LLM to act on them.&lt;/p&gt;

&lt;h3&gt;
  
  
  The waste
&lt;/h3&gt;

&lt;p&gt;If a chunk doesn't contain enough meaning:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LLM makes additional tool calls to compensate&lt;/li&gt;
&lt;li&gt;Context gets polluted with redundant searches&lt;/li&gt;
&lt;li&gt;Latency increases&lt;/li&gt;
&lt;li&gt;Tokens get wasted&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The LLM was doing work that good chunking should prevent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: Max-Min Algorithm
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://link.springer.com/article/10.1007/s10791-025-09638-7" rel="noopener noreferrer"&gt;Max-Min semantic chunking paper&lt;/a&gt; (Kiss et al., Springer 2025) provided the foundation. This implementation is a pragmatic adaptation of the Max-Min idea, not a faithful reproduction of the paper's algorithm.&lt;/p&gt;

&lt;p&gt;The core idea: group consecutive sentences based on semantic similarity, not character count.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/chunker/semantic-chunker.ts&lt;/span&gt;

&lt;span class="c1"&gt;// Should we add this sentence to the current chunk?&lt;/span&gt;
&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nf"&gt;shouldAddToChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;maxSim&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;maxSim&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Dynamic threshold based on chunk coherence&lt;/span&gt;
&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nf"&gt;calculateThreshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;minSim&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;chunkSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// threshold = max(c * minSim * sigmoid(|C|), hardThreshold)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sigmoid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;chunkSize&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;minSim&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hardThreshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The algorithm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Split text into sentences&lt;/li&gt;
&lt;li&gt;Generate embeddings for all sentences&lt;/li&gt;
&lt;li&gt;For each sentence, decide: add to current chunk or start new?&lt;/li&gt;
&lt;li&gt;Decision based on comparing max similarity with new sentence vs. min similarity within chunk&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When the new sentence's similarity drops below the threshold, it signals a topic boundary.&lt;/p&gt;
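&lt;p&gt;The loop above can be sketched roughly like this (a hypothetical simplification: it ignores the windowing and force-split limits discussed below, and approximates the within-chunk minimum from the similarities to the new sentence):&lt;/p&gt;

```typescript
// Sketch of the max-min grouping loop; cosineSimilarity is a standard helper.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let na = 0
  let nb = 0
  for (let i = 0; a.length > i; i++) {
    dot += a[i] * b[i]
    na += a[i] * a[i]
    nb += b[i] * b[i]
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb))
}

function chunkBySimilarity(
  sentences: string[],
  embeddings: number[][],
  hardThreshold: number,
  c: number,
): string[][] {
  const chunks: string[][] = []
  let current: number[] = [0] // indices of sentences in the current chunk
  for (let i = 1; sentences.length > i; i++) {
    // Similarity of the new sentence to each sentence already in the chunk
    const sims = current.map(function (j) {
      return cosineSimilarity(embeddings[i], embeddings[j])
    })
    const maxSim = Math.max(...sims)
    const minSim = current.length > 1 ? Math.min(...sims) : 1
    // Dynamic threshold: max(c * minSim * sigmoid(|C|), hardThreshold)
    const sigmoid = 1 / (1 + Math.exp(-current.length))
    const threshold = Math.max(c * minSim * sigmoid, hardThreshold)
    if (maxSim > threshold) {
      current.push(i) // still on the same topic
    } else {
      // Topic boundary: close the chunk and start a new one
      chunks.push(current.map(function (j) { return sentences[j] }))
      current = [i]
    }
  }
  chunks.push(current.map(function (j) { return sentences[j] }))
  return chunks
}

// Two topic clusters in embedding space split into two chunks
const demo = chunkBySimilarity(
  ['s1', 's2', 's3', 's4'],
  [[1, 0], [1, 0], [0, 1], [0, 1]],
  0.5,  // hardThreshold
  0.7,  // c
)
console.log(demo.length) // 2
```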

&lt;h3&gt;
  
  
  Implementation details
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sentence detection: &lt;code&gt;Intl.Segmenter&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/chunker/sentence-splitter.ts&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;segmenter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Intl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Segmenter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;und&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;granularity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sentence&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No external dependencies. Multilingual support via Unicode standard (UAX #29). The &lt;code&gt;'und'&lt;/code&gt; (undetermined) locale provides general Unicode support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code block preservation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/chunker/sentence-splitter.ts&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;CODE_BLOCK_PLACEHOLDER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s1"&gt;u0000CODE_BLOCK&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s1"&gt;u0000&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;// Extract before sentence splitting&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;codeBlockRegex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sr"&gt;/``&lt;/span&gt;&lt;span class="err"&gt;`
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;S&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="s2"&gt;```/g
// ... replace with placeholders ...

// Restore after chunking
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Markdown code blocks stay intact—never split mid-block. Critical for technical documentation where copy-pastable code is the point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance tuning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The paper uses O(k²) comparisons within each chunk. For long homogeneous documents, this explodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/chunker/semantic-chunker.ts&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;WINDOW_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;      &lt;span class="c1"&gt;// Compare only recent 5 sentences: O(k²) → O(25)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MAX_SENTENCES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;   &lt;span class="c1"&gt;// Force split at 15 sentences (3x paper's median)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;PDF parsing: pdfjs-dist&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Switched from &lt;code&gt;pdf-parse&lt;/code&gt; to &lt;code&gt;pdfjs-dist&lt;/code&gt; for access to position information (x, y coordinates, font size). This enables semantic header/footer detection, catching variable content like "Page 7 of 75" that &lt;code&gt;pdf-parse&lt;/code&gt; would include as regular text.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. When Semantic Chunks Broke Hybrid Search
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Semantic chunks are richer—more content per chunk, more coherent meaning. But this broke the original keyword matching.&lt;/p&gt;

&lt;p&gt;The issue: scores became unreliable. A keyword match in a dense, high-quality chunk meant something different than a match in a sparse, fragmented one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempted: RRF (Reciprocal Rank Fusion)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://opensearch.org/blog/introducing-reciprocal-rank-fusion-hybrid-search/" rel="noopener noreferrer"&gt;RRF&lt;/a&gt; is the standard approach for merging BM25 and vector results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RRF_score = Σ 1/(k + rank_i)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combine rankings by position, not by score. Elegant, widely used, no tuning required.&lt;/p&gt;

&lt;p&gt;But there's a fundamental problem: &lt;strong&gt;distance information is lost.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Original distances: 0.1, 0.2, 0.9  →  Ranks: 1, 2, 3
Original distances: 0.1, 0.15, 0.18  →  Ranks: 1, 2, 3
# Same ranks, completely different quality gaps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RRF outputs ranks, not distances. Our quality filters—distance threshold, relevance gap grouping—need actual distances to work.&lt;/p&gt;

&lt;p&gt;As noted in &lt;a href="https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking" rel="noopener noreferrer"&gt;Microsoft's hybrid search documentation&lt;/a&gt;: "RRF aggregates rankings rather than scores." This is by design—it avoids the problem of incompatible score scales. But it means downstream quality filtering can't distinguish "barely made top-10" from "clearly the best match."&lt;/p&gt;
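&lt;p&gt;A minimal RRF sketch (hypothetical &lt;code&gt;rrf&lt;/code&gt; helper; k=60 is the commonly cited default) shows how scores collapse into rank-derived numbers:&lt;/p&gt;

```typescript
// Reciprocal Rank Fusion over multiple result lists (document ids).
// k dampens the influence of low ranks.
function rrf(rankings: string[][], k = 60): { [id: string]: number } {
  const scores: { [id: string]: number } = {}
  for (const list of rankings) {
    list.forEach(function (id, rank) {
      // `rank` is 0-based here; the formula uses 1-based rank
      scores[id] = (scores[id] || 0) + 1 / (k + rank + 1)
    })
  }
  return scores
}

const fused = rrf([
  ['a', 'b', 'c'], // vector ranking (best first)
  ['b', 'c', 'a'], // keyword/BM25 ranking
])
// 'b' edges out 'a' (rank 2 in one list, rank 1 in the other), but all
// scores are tiny rank-derived numbers: the original distance gaps are
// gone, so distance-based quality filters have nothing to cut on.
```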

&lt;h3&gt;
  
  
  Solution: Semantic-first with keyword boost
&lt;/h3&gt;

&lt;p&gt;Keep vector search as the primary signal. Use keywords to adjust distances, not replace them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/vectordb/index.ts&lt;/span&gt;
&lt;span class="c1"&gt;// Multiplicative boost: distance / (1 + keyword_score * weight)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;boostedDistance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;keywordScore&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The formula:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No keyword match&lt;/strong&gt; (score=0): &lt;code&gt;distance / 1 = distance&lt;/code&gt; (unchanged)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perfect match&lt;/strong&gt; with weight=0.6: &lt;code&gt;distance / 1.6&lt;/code&gt; (reduced by 37.5%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perfect match&lt;/strong&gt; with weight=1.0: &lt;code&gt;distance / 2&lt;/code&gt; (halved)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This preserves the distance for quality filtering while boosting exact matches.&lt;/p&gt;
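As a sanity check, the three cases above can be computed directly. The `applyBoost` helper is an illustrative sketch of the formula, not the library's actual API:

```typescript
// Multiplicative boost: a keyword match shrinks the distance,
// no match leaves it untouched.
const applyBoost = (distance: number, keywordScore: number, weight: number): number =>
  distance / (1 + keywordScore * weight);

const noMatch = applyBoost(0.4, 0, 0.6);  // 0.4 (unchanged)
const w06 = applyBoost(0.4, 1, 0.6);      // 0.25 (reduced by 37.5%)
const w10 = applyBoost(0.4, 1, 1.0);      // 0.2 (halved)
```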

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/vectordb/index.ts&lt;/span&gt;
&lt;span class="c1"&gt;// 1. Vector search with 2x candidate pool&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;candidateLimit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;limit&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;HYBRID_SEARCH_CANDIDATE_MULTIPLIER&lt;/span&gt;

&lt;span class="c1"&gt;// 2. Apply distance filter&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxDistance&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;distanceRange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxDistance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// 3. Apply grouping&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grouping&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;applyGrouping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grouping&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// 4. Keyword boost via FTS&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ftsResults&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;table&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;queryText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;applyKeywordBoost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ftsResults&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;hybridWeight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Quality filters apply to meaningful vector distances. Keyword matching acts as a boost, not a replacement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multilingual challenge
&lt;/h3&gt;

&lt;p&gt;Japanese keyword matching broke with richer chunks. The default tokenizer couldn't handle CJK characters properly.&lt;/p&gt;

&lt;p&gt;Solution: LanceDB FTS with n-gram indexing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/vectordb/index.ts&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fts&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;baseTokenizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ngram&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;ngramMinLength&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Capture Japanese bi-grams (東京, 設計)&lt;/span&gt;
    &lt;span class="na"&gt;ngramMaxLength&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Balance precision vs index size&lt;/span&gt;
    &lt;span class="na"&gt;prefixOnly&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// All positions for proper CJK support&lt;/span&gt;
    &lt;span class="na"&gt;stem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// Preserve exact terms&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;N-grams at min=2, max=3 capture both English terms and Japanese compound words without language-specific tokenization.&lt;/p&gt;
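To see why this works for CJK text, here is an illustrative n-gram tokenizer (not LanceDB's internal implementation): sliding windows of length 2 to 3 over the raw characters, so compound words like 東京 and 設計 become matchable terms without any language-specific segmentation.

```typescript
// Illustrative character n-gram tokenizer, min=2 / max=3,
// mirroring the index settings above.
const ngrams = (text: string, min = 2, max = 3): string[] => {
  const chars = Array.from(text);
  const out: string[] = [];
  for (let n = min; n <= max; n++) {
    for (let i = 0; i + n <= chars.length; i++) {
      out.push(chars.slice(i, i + n).join(""));
    }
  }
  return out;
};

const tokens = ngrams("東京設計");
// -> ["東京", "京設", "設計", "東京設", "京設計"]
```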

&lt;h2&gt;
  
  
  5. Results: What Actually Changed
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Observed behavior (real usage)
&lt;/h3&gt;

&lt;p&gt;My setup: framework best practices (official PDFs), project principles (rules), specifications (PRDs) stored in RAG. Before each task, the agent analyzes requirements and searches RAG for relevant context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (fixed chunks + top-K):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent couldn't find relevant information on first search&lt;/li&gt;
&lt;li&gt;Multiple search attempts with different query formulations&lt;/li&gt;
&lt;li&gt;Eventually gave up and read rule files directly&lt;/li&gt;
&lt;li&gt;PDFs were too large to read, so that context was effectively lost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After (semantic chunks + boost + filtering):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single search usually provides sufficient context&lt;/li&gt;
&lt;li&gt;Additional searches happen for depth, not compensation&lt;/li&gt;
&lt;li&gt;Agent stopped reading files directly—RAG results were trustworthy&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  LLM evaluation (before/after comparison)
&lt;/h3&gt;

&lt;p&gt;I had an LLM evaluate search results with project context—not a formal LLM-as-Judge setup, but a structured comparison.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Old version:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Garbage chunks (outliers) and fragmented information in ~2/10 results for some queries&lt;/li&gt;
&lt;li&gt;Results required additional verification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Updated version:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No garbage chunks&lt;/li&gt;
&lt;li&gt;8/10 results directly relevant to the query&lt;/li&gt;
&lt;li&gt;2/10 results tangentially related (still useful context)&lt;/li&gt;
&lt;li&gt;Evaluator noted: "Search results alone provide necessary and sufficient information"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examining the raw JSON confirmed the qualitative assessment—chunks contained coherent, dense information rather than fragments.&lt;/p&gt;

&lt;h3&gt;
  
  
  No benchmarks
&lt;/h3&gt;

&lt;p&gt;This is qualitative observation from real usage, not controlled experiments. But the behavioral change is clear: &lt;strong&gt;the LLM stopped compensating for bad RAG results.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Architecture Summary
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Document → Semantic Chunking (Max-Min) → Embeddings → LanceDB

Query → Vector Search → Distance Filter → Grouping → Keyword Boost → Results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key decisions
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Semantic chunking over fixed&lt;/td&gt;
&lt;td&gt;Meaning-preserving units reduce LLM compensation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Keyword boost over RRF&lt;/td&gt;
&lt;td&gt;Preserves distance for quality filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distance-based grouping&lt;/td&gt;
&lt;td&gt;Quality signal, not arbitrary K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;N-gram FTS&lt;/td&gt;
&lt;td&gt;Multilingual support without tokenizer complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local-only&lt;/td&gt;
&lt;td&gt;Privacy, cost, offline capability&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Environment variables&lt;/span&gt;
&lt;span class="nv"&gt;RAG_HYBRID_WEIGHT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.6    &lt;span class="c"&gt;# Keyword boost factor (0=semantic, 1=BM25-dominant)&lt;/span&gt;
&lt;span class="nv"&gt;RAG_GROUPING&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;related     &lt;span class="c"&gt;# 'similar' (top group) or 'related' (top 2 groups)&lt;/span&gt;
&lt;span class="nv"&gt;RAG_MAX_DISTANCE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.5     &lt;span class="c"&gt;# Filter low-relevance results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
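A sketch of how these variables might be parsed at startup. The variable names and defaults come from the table above; the `numEnv` helper is illustrative, not the project's actual loader:

```typescript
// Illustrative config loader: fall back to the documented defaults
// when a variable is unset or unparseable.
const numEnv = (name: string, fallback: number): number => {
  const parsed = Number(process.env[name]);
  return Number.isFinite(parsed) ? parsed : fallback;
};

const config = {
  hybridWeight: numEnv("RAG_HYBRID_WEIGHT", 0.6),  // keyword boost factor
  grouping: process.env.RAG_GROUPING ?? "related", // 'similar' | 'related'
  maxDistance: numEnv("RAG_MAX_DISTANCE", 0.5),    // relevance cutoff
};
```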



&lt;h2&gt;
  
  
  7. The Other Side: Query Quality
&lt;/h2&gt;

&lt;p&gt;RAG accuracy depends on two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Search quality (what we've discussed)&lt;/li&gt;
&lt;li&gt;Query quality (what the LLM sends)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  MCP's dual invisibility
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → LLM → MCP(RAG) → LLM → Response
         ↑         ↑
     Query hidden  Results hidden
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even perfect RAG fails with bad queries. And users can't see either side.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: Agent Skills
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://agentskills.io/" rel="noopener noreferrer"&gt;Agent Skills&lt;/a&gt; is an open format for extending AI agent capabilities with specialized knowledge. Skills are portable, version-controlled packages of procedural knowledge that agents load on-demand.&lt;/p&gt;

&lt;p&gt;For this RAG, skills teach the LLM:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query formulation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Query patterns by intent&lt;/span&gt;
| Intent | Pattern |
|--------|---------|
| Definition/Concept | "[term] definition concept" |
| How-To/Procedure | "[action] steps example usage" |
| API/Function | "[function] API arguments return" |
| Troubleshooting | "[error] fix solution cause" |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
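The table maps straightforwardly to query templates. A sketch of what the skill teaches the LLM to produce (the `patterns` helper is illustrative, not part of the skill's code):

```typescript
// Query templates keyed by intent, mirroring the table above.
const patterns: Record<string, (term: string) => string> = {
  definition: (t) => `${t} definition concept`,
  howto: (t) => `${t} steps example usage`,
  api: (t) => `${t} API arguments return`,
  troubleshooting: (t) => `${t} fix solution cause`,
};

const q = patterns.definition("semantic chunking");
// -> "semantic chunking definition concept"
```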



&lt;p&gt;&lt;strong&gt;Score interpretation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Score thresholds&lt;/span&gt;
&amp;lt; 0.3  : Use directly (high confidence)
&lt;span class="p"&gt;0.&lt;/span&gt;3-0.5: Include if mentions same concept/entity
&lt;span class="gt"&gt;&amp;gt; 0.5  : Skip unless no better results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
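Expressed as code, the interpretation rule is a simple three-way triage. The thresholds are the skill's; the `triage` function and its labels are illustrative:

```typescript
// Three-way triage over result distances, using the skill's thresholds.
type Hit = { text: string; score: number };

const triage = (hit: Hit): "use" | "conditional" | "skip" => {
  if (hit.score < 0.3) return "use";          // high confidence, use directly
  if (hit.score <= 0.5) return "conditional"; // include if same concept/entity
  return "skip";                              // unless no better results exist
};

const strong = triage({ text: "exact match", score: 0.12 });   // "use"
const weak = triage({ text: "tangential", score: 0.62 });      // "skip"
```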



&lt;p&gt;Skills can be installed via the &lt;a href="https://github.com/shinpr/mcp-local-rag#agent-skills" rel="noopener noreferrer"&gt;mcp-local-rag-skills CLI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This completes the optimization loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAG side:&lt;/strong&gt; semantic chunks + distance filters + keyword boost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM side:&lt;/strong&gt; query formulation + result interpretation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both sides matter. Optimizing only one leaves performance on the table.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Tradeoffs and Limitations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What this approach gives up
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BM25-only hits don't surface&lt;/strong&gt;: a document must appear in the semantic candidate pool before keywords can boost it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No reranker&lt;/strong&gt;: Would improve accuracy but adds complexity/latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No formal benchmarks&lt;/strong&gt;: Qualitative evaluation only&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Where heavier approaches win
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RRF + Reranker&lt;/strong&gt;: Broader candidate pool, reranker compensates for RRF's rank-only output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-as-reranker&lt;/strong&gt;: Best accuracy, but slow and expensive&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Position on the spectrum
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Light &amp;amp; Fast ←————————————————————→ Heavy &amp;amp; Accurate
    semantic-only
        └─ semantic + boost (here)
               └─ RRF + Cross-Encoder
                      └─ RRF + LLM Rerank
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The goal was: &lt;strong&gt;maximum quality within zero-setup, local-only constraints.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Conclusion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Standard RAG (fixed chunks + top-K) breaks down for agentic coding use cases&lt;/li&gt;
&lt;li&gt;Semantic chunking + quality filtering + keyword boost is a viable middle ground&lt;/li&gt;
&lt;li&gt;RRF looks elegant but loses distance information critical for filtering&lt;/li&gt;
&lt;li&gt;Query quality matters as much as search quality—Agent Skills address this&lt;/li&gt;
&lt;li&gt;The real test: does the LLM stop making compensatory tool calls?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/shinpr/mcp-local-rag" rel="noopener noreferrer"&gt;github.com/shinpr/mcp-local-rag&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Kiss, C., Nagy, M. &amp;amp; Szilágyi, P. (2025). Max–Min semantic chunking of documents for RAG application. &lt;em&gt;Discover Computing&lt;/em&gt; 28, 117. &lt;a href="https://doi.org/10.1007/s10791-025-09638-7" rel="noopener noreferrer"&gt;https://doi.org/10.1007/s10791-025-09638-7&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LanceDB Full-Text Search: &lt;a href="https://lancedb.github.io/lancedb/fts/" rel="noopener noreferrer"&gt;https://lancedb.github.io/lancedb/fts/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;MCP Specification: &lt;a href="https://modelcontextprotocol.io" rel="noopener noreferrer"&gt;https://modelcontextprotocol.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Agent Skills: &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;https://agentskills.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Reciprocal Rank Fusion (OpenSearch): &lt;a href="https://opensearch.org/blog/introducing-reciprocal-rank-fusion-hybrid-search/" rel="noopener noreferrer"&gt;https://opensearch.org/blog/introducing-reciprocal-rank-fusion-hybrid-search/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Hybrid Search Scoring (Microsoft): &lt;a href="https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking" rel="noopener noreferrer"&gt;https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>architecture</category>
      <category>mcp</category>
    </item>
    <item>
      <title>How I Made Legacy Code AI-Friendly with Auto-Generated Docs</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Fri, 26 Dec 2025 12:30:09 +0000</pubDate>
      <link>https://forem.com/shinpr/how-i-made-legacy-code-ai-friendly-with-auto-generated-docs-4353</link>
      <guid>https://forem.com/shinpr/how-i-made-legacy-code-ai-friendly-with-auto-generated-docs-4353</guid>
      <description>&lt;p&gt;AI coding assistants are amazing—until you point them at a legacy codebase.&lt;/p&gt;

&lt;p&gt;"What does this module do?"&lt;br&gt;
"I don't have enough context."&lt;/p&gt;

&lt;p&gt;Sound familiar?&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Claude Code (and similar tools) hit context limits fast on existing projects. No documentation means no context, which means the AI can't help effectively.&lt;/p&gt;

&lt;p&gt;You could spend weeks writing docs manually. Or you could automate it.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Fix: Generate Docs First
&lt;/h2&gt;

&lt;p&gt;Instead of fighting the AI, I built a workflow that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scans your codebase for features&lt;/li&gt;
&lt;li&gt;Generates PRD + Design Docs automatically&lt;/li&gt;
&lt;li&gt;Verifies docs against actual code&lt;/li&gt;
&lt;li&gt;Now AI has context to work with&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start Claude Code&lt;/span&gt;
claude

&lt;span class="c"&gt;# Add the marketplace&lt;/span&gt;
/plugin marketplace add shinpr/claude-code-workflows

&lt;span class="c"&gt;# Install the plugin&lt;/span&gt;
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;dev-workflows@claude-code-workflows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then point it at your legacy code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/reverse-engineer &lt;span class="s2"&gt;"src/auth"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That's it.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Happens
&lt;/h2&gt;

&lt;p&gt;The workflow runs through multiple specialized agents:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;scope-discoverer&lt;/strong&gt; finds what features exist in your code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;prd-creator&lt;/strong&gt; generates product docs for each feature&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;code-verifier&lt;/strong&gt; checks if the docs match reality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;document-reviewer&lt;/strong&gt; catches inconsistencies&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each step verifies against the actual code—so you get docs that reflect what the system &lt;em&gt;actually does&lt;/em&gt;, not what someone thought it did years ago.&lt;/p&gt;
&lt;h2&gt;
  
  
  What You Get
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PRD for each feature&lt;/strong&gt; (what it does, why it exists)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design docs&lt;/strong&gt; (how it's built, what depends on what)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now when you ask the AI to modify something, it has context.&lt;/p&gt;
&lt;h2&gt;
  
  
  Before/After
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;: "Explain the auth module" → Context limit, vague answers&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt;: AI reads generated docs → Specific, actionable suggestions&lt;/p&gt;
&lt;h2&gt;
  
  
  When to Use This
&lt;/h2&gt;

&lt;p&gt;Works best when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You've inherited a codebase with missing docs&lt;/li&gt;
&lt;li&gt;Institutional knowledge has left with previous developers&lt;/li&gt;
&lt;li&gt;You want to onboard AI assistants to existing projects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's not magic—complex legacy systems still need human review. But it gets you 80% of the way there automatically.&lt;/p&gt;

&lt;p&gt;I built this while trying to make Claude Code usable on projects where no one knows how things work anymore.&lt;/p&gt;



&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/shinpr" rel="noopener noreferrer"&gt;
        shinpr
      &lt;/a&gt; / &lt;a href="https://github.com/shinpr/claude-code-workflows" rel="noopener noreferrer"&gt;
        claude-code-workflows
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Production-ready development workflows for Claude Code, powered by specialized AI agents.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Claude Code Workflows 🚀&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href="https://claude.ai/code" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/77c3fac949481ce7960e41b57da074d377eb159a42c6cf4694cf225ddcada391/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436c61756465253230436f64652d506c7567696e2d707572706c65" alt="Claude Code"&gt;&lt;/a&gt;
&lt;a href="https://github.com/shinpr/claude-code-workflows" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/2961c6708a56bf2bca4bb7dcc53a5e30d0a22e67b3bca0725a8d74a2360432cb/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f7368696e70722f636c617564652d636f64652d776f726b666c6f77733f7374796c653d736f6369616c" alt="GitHub Stars"&gt;&lt;/a&gt;
&lt;a href="https://opensource.org/licenses/MIT" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fdf2982b9f5d7489dcf44570e714e3a15fce6253e0cc6b5aa61a075aac2ff71b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667" alt="License: MIT"&gt;&lt;/a&gt;
&lt;a href="https://github.com/shinpr/claude-code-workflows/pulls" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/dd0b24c1e6776719edb2c273548a510d6490d8d25269a043dfabbd38419905da/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5052732d77656c636f6d652d627269676874677265656e2e737667" alt="PRs Welcome"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;End-to-end development workflows for Claude Code&lt;/strong&gt; - Specialized agents handle requirements, design, implementation, and quality checks so you get reviewable code, not just generated code.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;⚡ Quick Start&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;This marketplace includes the following plugins:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Core plugins:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;dev-workflows&lt;/strong&gt; - Backend and general-purpose development&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dev-workflows-frontend&lt;/strong&gt; - React/TypeScript specialized workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Optional add-ons&lt;/strong&gt; (enhance core plugins):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/shinpr/claude-code-discover" rel="noopener noreferrer"&gt;claude-code-discover&lt;/a&gt;&lt;/strong&gt; - Turns feature ideas into evidence-backed PRDs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/shinpr/metronome" rel="noopener noreferrer"&gt;metronome&lt;/a&gt;&lt;/strong&gt; - Detects shortcut-taking behavior and nudges Claude to proceed step by step&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/francismiles1/dev-workflows-governance" rel="noopener noreferrer"&gt;dev-workflows-governance&lt;/a&gt;&lt;/strong&gt; - Enforces TIDY stage and human signoff checkpoint before deployment&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Skills only&lt;/strong&gt; (for users with existing workflows):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;dev-skills&lt;/strong&gt; - Coding best practices, testing principles, and design guidelines — no workflow recipes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These plugins provide end-to-end workflows for AI-assisted development. Choose what fits your project:&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Backend or General Development&lt;/h3&gt;
&lt;/div&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; 1. Start Claude Code&lt;/span&gt;
claude
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; 2. Install the marketplace&lt;/span&gt;
/plugin marketplace add shinpr/claude-code-workflows

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; 3. Install backend plugin&lt;/span&gt;
/plugin install dev-workflows@claude-code-workflows

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;…
&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/shinpr/claude-code-workflows" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;




</description>
      <category>ai</category>
      <category>productivity</category>
      <category>automation</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Taming Opus 4.5's Efficiency: Using TodoWrite to Keep Claude Code on Track</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Thu, 11 Dec 2025 12:54:19 +0000</pubDate>
      <link>https://forem.com/shinpr/taming-opus-45s-efficiency-using-todowrite-to-keep-claude-code-on-track-1ee5</link>
      <guid>https://forem.com/shinpr/taming-opus-45s-efficiency-using-todowrite-to-keep-claude-code-on-track-1ee5</guid>
      <description>&lt;p&gt;I've been using Claude Code with Opus 4.5 for a while now, and there's one thing that kept driving me crazy: it skips steps. Steps I actually needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually happens
&lt;/h2&gt;

&lt;p&gt;According to Anthropic's docs, Opus 4.5 is designed to "skip summaries for efficiency and maintain workflow momentum." Sounds great in theory.&lt;/p&gt;

&lt;p&gt;In practice? You ask for a 5-step process, and it delivers the final result—skipping steps 2, 3, and 4. Efficient? Sure. But not what I needed.&lt;/p&gt;

&lt;p&gt;I ran into this when I was working on a test review task. I wanted Claude to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;List all test items from the spec&lt;/li&gt;
&lt;li&gt;Evaluate each item against criteria&lt;/li&gt;
&lt;li&gt;Filter down to the essential ones&lt;/li&gt;
&lt;li&gt;Generate the final test plan&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Instead, it jumped straight to step 4. "Here's your optimized test plan!" Thanks, but I needed to see steps 2 and 3 to understand &lt;em&gt;why&lt;/em&gt; those tests were selected.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix: Make steps explicit with TodoWrite
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📢 Update (March 2026):&lt;/strong&gt; As of Claude Code v2.1.16 (released January 22, 2026), &lt;code&gt;TodoWrite&lt;/code&gt; has been superseded by the new &lt;strong&gt;Tasks API&lt;/strong&gt; — &lt;code&gt;TaskCreate&lt;/code&gt;, &lt;code&gt;TaskUpdate&lt;/code&gt;, &lt;code&gt;TaskList&lt;/code&gt;, and &lt;code&gt;TaskGet&lt;/code&gt;. The concept in this article still applies, but you'll now use &lt;code&gt;TaskCreate&lt;/code&gt; to register steps instead of &lt;code&gt;TodoWrite&lt;/code&gt;. You can revert to the old behavior with the env var &lt;code&gt;CLAUDE_CODE_ENABLE_TASKS=false&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Claude Code has a built-in TODO management feature called &lt;code&gt;TodoWrite&lt;/code&gt;. When you register tasks explicitly, Opus 4.5 treats them as checkpoints it must complete.&lt;/p&gt;

&lt;p&gt;At the start of your task, tell Claude Code to register the steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before starting, register these steps using TodoWrite:
1. List all test items from the spec
2. Evaluate each against the criteria
3. Filter to essential items with reasoning
4. Generate the final test plan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or just add this to your prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Use TodoWrite to track each step. Do not skip any steps.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you've registered steps as TODOs, Opus treats them as real checkpoints—not optional stops it can skip.&lt;/p&gt;

&lt;h2&gt;
  
  
  A quick limitation I learned the hard way
&lt;/h2&gt;

&lt;p&gt;If you register too many steps (7+), Opus 4.5 may batch them together for "efficiency," defeating the purpose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't do this:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Read file A
2. Read file B
3. Read file C
4. Analyze A
5. Analyze B
6. Analyze C
7. Compare results
8. Generate report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Do this instead:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Read and analyze all relevant files
2. Compare the implementations
3. Generate the report with findings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Meaningful steps, not micro-tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  When this saved me
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Multi-step refactoring where I needed to see intermediate states&lt;/li&gt;
&lt;li&gt;Debugging sessions where I wanted the reasoning at each stage&lt;/li&gt;
&lt;li&gt;Any task where Opus 4.5 kept "helpfully" jumping to the end&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Opus 4.5's efficiency is a feature, not a bug—but sometimes you need the journey, not just the destination. TodoWrite gives you that control back.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>llm</category>
    </item>
    <item>
      <title>Stopping Cursor from Skipping Steps: A Structural Approach</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Mon, 08 Dec 2025 12:57:59 +0000</pubDate>
      <link>https://forem.com/shinpr/stopping-cursor-from-skipping-steps-a-structural-approach-26k7</link>
      <guid>https://forem.com/shinpr/stopping-cursor-from-skipping-steps-a-structural-approach-26k7</guid>
      <description>&lt;p&gt;Ever asked Cursor to implement a feature, only to find it ignored your coding standards, skipped writing tests, and didn't even check if similar code already existed?&lt;/p&gt;

&lt;p&gt;AI coding assistants are designed to generate the next most likely token—which means they naturally take the shortest path to an answer. The steps experienced engineers treat as essential—reading design docs, checking existing code, following standards—are exactly the ones most likely to get skipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This article focuses on Cursor and its ecosystem. The concepts can be adapted to other LLM-powered IDEs, but all examples here are Cursor-specific.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is MCP?&lt;/strong&gt; MCP (Model Context Protocol) is a protocol for exposing tools—like local RAG servers or sub-agents—to LLM-based IDEs such as Cursor. It lets you extend Cursor's capabilities with custom tools that run locally on your machine.&lt;/p&gt;
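&lt;p&gt;For orientation, registering an MCP server with Cursor is a small entry in &lt;code&gt;.cursor/mcp.json&lt;/code&gt;. This is a generic sketch: the server name and command below are placeholders, not one of the tools in this article.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "my-local-tool": {
      "command": "npx",
      "args": ["-y", "my-local-tool"]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;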

&lt;p&gt;I'll introduce three tools that address these issues:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Missing context&lt;/td&gt;
&lt;td&gt;Provide information via RAG&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/shinpr/mcp-local-rag" rel="noopener noreferrer"&gt;&lt;code&gt;mcp-local-rag&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skipping critical steps&lt;/td&gt;
&lt;td&gt;Enforce gates&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/shinpr/agentic-code" rel="noopener noreferrer"&gt;&lt;code&gt;agentic-code&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context pollution&lt;/td&gt;
&lt;td&gt;Execute in isolated agents&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/shinpr/sub-agents-mcp" rel="noopener noreferrer"&gt;&lt;code&gt;sub-agents-mcp&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Overview of Cursor Development Process Control
&lt;/h2&gt;

&lt;p&gt;Here's how those three tools fit together:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool Architecture and Roles&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F755n0hegthce721nl38v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F755n0hegthce721nl38v.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;agentic-code&lt;/strong&gt;: Defines development processes and provides guardrails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mcp-local-rag&lt;/strong&gt;: Efficiently provides context needed for tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sub-agents-mcp&lt;/strong&gt;: Enables focused execution on single tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By combining these, the goal is to improve Cursor's accuracy and get consistent results. There's still room for improvement, but this is the setup I use in real projects today, and it's made a real difference in how reliably Cursor follows my process.&lt;/p&gt;
&lt;h2&gt;
  
  
  Defining Development Processes and Providing Guardrails
&lt;/h2&gt;

&lt;p&gt;I got this idea when I was using Codex CLI. To be fair, it was an older version, but still—the accuracy was terrible. Here's an actual exchange I had:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; "This implementation doesn't follow the rules. Did you read them?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex:&lt;/strong&gt; "Yes. I read them. You told me to use the Read tool, so I did. But I'm not going to follow them."&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I read the rules because you told me to, but I'm not going to follow them.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's when I came up with the concept of "quality check gates," which I later turned into a reusable framework:&lt;br&gt;
&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/shinpr/agentic-code" rel="noopener noreferrer"&gt;agentic-code&lt;/a&gt;  &lt;/p&gt;
&lt;h3&gt;
  
  
  Overall Flow
&lt;/h3&gt;

&lt;p&gt;agentic-code uses AGENTS.md as an entry point, defining a development flow that starts with task analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Development Flow and Branching&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1ow52kvf127wjyf0kh6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1ow52kvf127wjyf0kh6.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Metacognition for Task Control
&lt;/h3&gt;

&lt;p&gt;One distinctive feature of agentic-code is the "metacognition protocol." This works by prompting the AI to "evaluate itself at specific points and decide on the next action."&lt;/p&gt;

&lt;p&gt;Specifically, &lt;code&gt;.agents/rules/core/metacognition.md&lt;/code&gt; defines checkpoints like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Self-Evaluation Checkpoints&lt;/span&gt;
Before proceeding, STOP and evaluate:
&lt;span class="p"&gt;1.&lt;/span&gt; What task type am I currently in? (design/implement/test/review)
&lt;span class="p"&gt;2.&lt;/span&gt; Have I read all required rule files for this task type?
&lt;span class="p"&gt;3.&lt;/span&gt; Is my current action aligned with the task definition?

&lt;span class="gu"&gt;## Transition Gates&lt;/span&gt;
When task type changes:
&lt;span class="p"&gt;-&lt;/span&gt; PAUSE execution
&lt;span class="p"&gt;-&lt;/span&gt; Re-read relevant task definition file
&lt;span class="p"&gt;-&lt;/span&gt; Confirm understanding before proceeding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This setup is meant to force the AI to pause and reflect at moments like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When the task type changes&lt;/li&gt;
&lt;li&gt;When errors or unexpected results occur&lt;/li&gt;
&lt;li&gt;Before starting new implementation&lt;/li&gt;
&lt;li&gt;After completing each task&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, this is just prompting, so it's not 100% guaranteed. Because of how LLMs work, instructions are often ignored. That's why we combine metacognition with "quality check gates" described below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quality Assurance in the Design Phase
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;technical-design.md&lt;/code&gt; requires the following investigations before creating design documents:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Existing document research&lt;/strong&gt;: Check PRDs, related design docs, existing ADRs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Existing code investigation&lt;/strong&gt;: Search for similar functionality to prevent duplicate implementation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agreement checklist&lt;/strong&gt;: Clarify scope, non-scope, and constraints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latest information research&lt;/strong&gt;: Check current best practices when introducing new technology&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are explicitly stated as "quality check gates" in the prompt, with instructions that "the design phase cannot be completed until all items are satisfied."&lt;/p&gt;
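&lt;p&gt;As an illustration (this is not the literal file contents), such a gate block might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Quality Check Gate: Design Phase
- [ ] PRDs, related design docs, and existing ADRs reviewed
- [ ] Existing code searched for similar functionality
- [ ] Scope, non-scope, and constraints agreed with the user
- [ ] Current best practices checked for any new technology

The design phase cannot be completed until all items are checked.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;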

&lt;p&gt;Writing "follow this" or "don't do that" often doesn't work. When tasks go wrong, I do retrospectives with the AI, and through that process I arrived at the approach of defining quality check criteria and incorporating them as gates in the AI-managed task list.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: For stricter enforcement, use mechanisms instead of prompts—like pre-commit hooks. However, pre-commit can be easily bypassed with &lt;code&gt;--no-verify&lt;/code&gt;, so truly strict enforcement requires CI integration. I felt that was overkill for my needs, so I'm currently sticking with the prompt-based approach.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  TDD-Based Implementation Phase
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;implementation.md&lt;/code&gt; applies TDD (Test-Driven Development) to all code changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. RED Phase   - Write failing tests first
2. GREEN Phase - Minimal implementation to pass tests
3. REFACTOR Phase - Improve code
4. VERIFY Phase - Run quality checks
5. COMMIT Phase - Commit to version control
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's still an unresolved issue: in later stages, as the remaining context window shrinks, commits become inconsistent. For implementation-phase quality assurance, the sub-agents approach described below is more effective.&lt;/p&gt;

&lt;h3&gt;
  
  
  Custom Commands for Individual Execution
&lt;/h3&gt;

&lt;p&gt;Task definitions under &lt;code&gt;.agents/tasks/&lt;/code&gt; can be registered as Cursor custom commands (&lt;code&gt;.cursor/commands/&lt;/code&gt;) for individual execution. For example, if you want to run only the design phase, you can call &lt;code&gt;/technical-design&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Copy or symlink to the appropriate path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;% &lt;span class="nb"&gt;cd&lt;/span&gt; /path/to/your/project
% &lt;span class="nb"&gt;mkdir&lt;/span&gt; .cursor
% &lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; ../.agents/tasks .cursor/commands
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: Cursor reads &lt;code&gt;.cursor/commands/*.md&lt;/code&gt; as custom commands, so symlinking the entire directory makes all task definitions available as commands.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Efficiently Providing Context for Tasks
&lt;/h2&gt;

&lt;p&gt;Providing appropriate context for task execution directly affects output quality. Cursor has its own tools for file search, but as mentioned above, it retrieves information less and less often as it focuses on the task at hand.&lt;/p&gt;

&lt;p&gt;Also, while LLMs have extensive training data for mainstream web applications, accuracy drops significantly for products in different contexts. They may even incorrectly apply web application patterns to other domains.&lt;/p&gt;

&lt;p&gt;RAG MCP addresses these problems via a local RAG server:&lt;br&gt;
&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/shinpr/mcp-local-rag" rel="noopener noreferrer"&gt;mcp-local-rag&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;RAG (Retrieval-Augmented Generation) is a technique that "retrieves external data through search and uses it for LLM response generation." mcp-local-rag vectorizes documents, stores them locally, and returns chunks semantically similar to queries. Since it's semantic search rather than keyword search, asking about "authentication processing" can find related content like "login flow" or "credential verification."&lt;/p&gt;

&lt;p&gt;This pattern of feeding external, project-specific knowledge into the model is often called "grounding." I recommend loading these three into RAG and retrieving them before task execution:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Domain knowledge and best practices for your technology stack&lt;/li&gt;
&lt;li&gt;Project rules and principles (in agentic-code, placed under .agents/rules)&lt;/li&gt;
&lt;li&gt;Design documents&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The idea is to comprehensively gather context from different scopes: industry knowledge/practices, project principles, and task-specific design documents.&lt;/p&gt;

&lt;p&gt;I intentionally designed this to run locally, so while information is passed to the LLM, there's no need to store data externally. PDFs and other documents can be ingested, so I recommend including any relevant peripheral information.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Search results are sent to LLM providers. For projects with strict security requirements, consider the scope of information being transmitted.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration Customization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;mcp-local-rag behavior can be adjusted via environment variables:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MODEL_NAME&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Xenova/all-MiniLM-L6-v2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Embedding model. Optimized for English&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CHUNK_SIZE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;512&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Chunk size (characters)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CHUNK_OVERLAP&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;100&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Overlap between chunks (characters)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
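
&lt;p&gt;For example, if you run mcp-local-rag through &lt;code&gt;.cursor/mcp.json&lt;/code&gt;, these variables can go in the server's &lt;code&gt;env&lt;/code&gt; block. The launch command and values below are illustrative, not taken from the repo's README:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "local-rag": {
      "command": "npx",
      "args": ["-y", "mcp-local-rag"],
      "env": {
        "CHUNK_SIZE": "1024",
        "CHUNK_OVERLAP": "200"
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;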
&lt;h2&gt;
  
  
  Enabling Focused Execution on Single Tasks
&lt;/h2&gt;

&lt;p&gt;Product development consists of various tasks. No matter how many improvements you make like those above, there's still a problem: in later phases—implementation and testing that directly affect quality—you're forced to execute tasks while carrying a lot of unnecessary context.&lt;/p&gt;

&lt;p&gt;To avoid this, I created an MCP that provides a &lt;a href="https://code.claude.com/docs/en/sub-agents" rel="noopener noreferrer"&gt;sub-agent&lt;/a&gt; mechanism:&lt;br&gt;
&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/shinpr/sub-agents-mcp" rel="noopener noreferrer"&gt;sub-agents-mcp&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The implementation is simple: it calls Cursor CLI from Cursor via MCP to execute tasks.&lt;/p&gt;

&lt;p&gt;In my environment (an M4 MacBook), starting a new Cursor CLI process adds about 5 seconds of overhead. I recommend using it where the accuracy gain from executing a task in an isolated context outweighs that cost.&lt;/p&gt;

&lt;p&gt;By preparing design sub-agents, implementation sub-agents, quality assurance sub-agents (build, test, fix issues), and calling them from Cursor at appropriate times, you can focus on single tasks and stabilize accuracy.&lt;/p&gt;

&lt;p&gt;Implementation inevitably consumes a lot of context, and quality assurance happens late in the phase, when context is often polluted or depleted. Actively using sub-agents for these tasks improves the probability that rules are followed. Before I introduced sub-agents in Claude Code, I frequently saw it disable ESLint rules, skip tests, and lower standards through config changes. After introducing sub-agents, these "lowering the bar" changes almost stopped, so you can expect similar results here.&lt;/p&gt;

&lt;p&gt;Sub-agents are also useful for objective document and code review. Having Cursor review its own output tends to lack objectivity, because the context of the previous work biases the review. Having sub-agents review from multiple perspectives before human review reduces the reviewer's burden, so I recommend passing your review criteria to a sub-agent.&lt;/p&gt;

&lt;p&gt;Below are some sub-agent definitions created for use with agentic-code. You probably don’t need to read every line right now — they’re meant to be copied, pasted, and tweaked for your team when you’re ready.&lt;/p&gt;

&lt;p&gt;Place them in the designated location (&lt;code&gt;.agents/agents/&lt;/code&gt;) and configure in sub-agents-mcp to use them.&lt;/p&gt;
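&lt;p&gt;A call from Cursor might then look like this (hypothetical phrasing; adapt it to however you name your agents):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Use the implementer sub-agent to implement task 1 from the design doc.
When it finishes, run the quality-fixer sub-agent to execute lint, tests,
and build, and fix any failures it finds.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;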

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Main Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;document-reviewer&lt;/td&gt;
&lt;td&gt;Check document consistency/completeness&lt;/td&gt;
&lt;td&gt;PRD/ADR/Design doc review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;implementer&lt;/td&gt;
&lt;td&gt;Execute TDD-based implementation&lt;/td&gt;
&lt;td&gt;Code implementation following design docs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;quality-fixer&lt;/td&gt;
&lt;td&gt;Quality checks and auto-fixes&lt;/td&gt;
&lt;td&gt;Run and fix lint/test/build&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The full agent definitions follow.&lt;/p&gt;

&lt;p&gt;
  document-reviewer (.agents/agents/document-reviewer.md)
  &lt;p&gt;An agent that reviews technical documents like PRDs, ADRs, and design docs, returning consistency scores and improvement suggestions. Makes approval/conditional approval/needs revision/rejected determinations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# document-reviewer&lt;/span&gt;

You are an AI assistant specialized in technical document review.

&lt;span class="gu"&gt;## Initial Mandatory Tasks&lt;/span&gt;

Before starting work, be sure to read and follow these rule files:
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`.agents/rules/core/documentation-criteria.md`&lt;/span&gt; - Documentation creation criteria (review quality standards)
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`.agents/rules/language/rules.md`&lt;/span&gt; - Language-agnostic coding principles (required for code example verification)
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`.agents/rules/language/testing.md`&lt;/span&gt; - Language-agnostic testing principles

&lt;span class="gu"&gt;## Responsibilities&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; Check consistency between documents
&lt;span class="p"&gt;2.&lt;/span&gt; Verify compliance with rule files
&lt;span class="p"&gt;3.&lt;/span&gt; Evaluate completeness and quality
&lt;span class="p"&gt;4.&lt;/span&gt; Provide improvement suggestions
&lt;span class="p"&gt;5.&lt;/span&gt; Determine approval status
&lt;span class="p"&gt;6.&lt;/span&gt; &lt;span class="gs"&gt;**Verify sources of technical claims and cross-reference with latest information**&lt;/span&gt;
&lt;span class="p"&gt;7.&lt;/span&gt; &lt;span class="gs"&gt;**Implementation Sample Standards Compliance**&lt;/span&gt;: MUST verify all implementation examples strictly comply with rules.md standards without exception

&lt;span class="gu"&gt;## Input Parameters&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**mode**&lt;/span&gt;: Review perspective (optional)
&lt;span class="p"&gt;  -&lt;/span&gt; &lt;span class="sb"&gt;`composite`&lt;/span&gt;: Composite perspective review (recommended) - Verifies structure, implementation, and completeness in one execution
&lt;span class="p"&gt;  -&lt;/span&gt; When unspecified: Comprehensive review
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**doc_type**&lt;/span&gt;: Document type (&lt;span class="sb"&gt;`PRD`&lt;/span&gt;/&lt;span class="sb"&gt;`ADR`&lt;/span&gt;/&lt;span class="sb"&gt;`DesignDoc`&lt;/span&gt;)
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**target**&lt;/span&gt;: Document path to review

&lt;span class="gu"&gt;## Review Modes&lt;/span&gt;

&lt;span class="gu"&gt;### Composite Perspective Review (composite) - Recommended&lt;/span&gt;
&lt;span class="gs"&gt;**Purpose**&lt;/span&gt;: Multi-angle verification in one execution
&lt;span class="gs"&gt;**Parallel verification items**&lt;/span&gt;:
&lt;span class="p"&gt;1.&lt;/span&gt; &lt;span class="gs"&gt;**Structural consistency**&lt;/span&gt;: Inter-section consistency, completeness of required elements
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="gs"&gt;**Implementation consistency**&lt;/span&gt;: Code examples MUST strictly comply with rules.md standards, interface definition alignment
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="gs"&gt;**Completeness**&lt;/span&gt;: Comprehensiveness from acceptance criteria to tasks, clarity of integration points
&lt;span class="p"&gt;4.&lt;/span&gt; &lt;span class="gs"&gt;**Common ADR compliance**&lt;/span&gt;: Coverage of common technical areas, appropriateness of references

&lt;span class="gu"&gt;## Workflow&lt;/span&gt;

&lt;span class="gu"&gt;### 1. Parameter Analysis&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Confirm mode is &lt;span class="sb"&gt;`composite`&lt;/span&gt; or unspecified
&lt;span class="p"&gt;-&lt;/span&gt; Specialized verification based on doc_type

&lt;span class="gu"&gt;### 2. Target Document Collection&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Load document specified by target
&lt;span class="p"&gt;-&lt;/span&gt; Identify related documents based on doc_type
&lt;span class="p"&gt;-&lt;/span&gt; For Design Docs, also check common ADRs (&lt;span class="sb"&gt;`ADR-COMMON-*`&lt;/span&gt;)

&lt;span class="gu"&gt;### 3. Perspective-based Review Implementation&lt;/span&gt;
&lt;span class="gu"&gt;#### Comprehensive Review Mode&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Consistency check: Detect contradictions between documents
&lt;span class="p"&gt;-&lt;/span&gt; Completeness check: Confirm presence of required elements
&lt;span class="p"&gt;-&lt;/span&gt; Rule compliance check: Compatibility with project rules
&lt;span class="p"&gt;-&lt;/span&gt; Feasibility check: Technical and resource perspectives
&lt;span class="p"&gt;-&lt;/span&gt; Assessment consistency check: Verify alignment between scale assessment and document requirements
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Technical information verification**&lt;/span&gt;: When sources exist, verify with WebSearch for latest information and validate claim validity

&lt;span class="gu"&gt;#### Perspective-specific Mode&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Implement review based on specified mode and focus

&lt;span class="gu"&gt;### 4. Review Result Report&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Output results in format according to perspective
&lt;span class="p"&gt;-&lt;/span&gt; Clearly classify problem importance

&lt;span class="gu"&gt;## Output Format&lt;/span&gt;

&lt;span class="gu"&gt;### Structured Markdown Format&lt;/span&gt;

&lt;span class="gs"&gt;**Basic Specification**&lt;/span&gt;:
&lt;span class="p"&gt;-&lt;/span&gt; Markers: &lt;span class="sb"&gt;`[SECTION_NAME]`&lt;/span&gt;...&lt;span class="sb"&gt;`[/SECTION_NAME]`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Format: Use key: value within sections
&lt;span class="p"&gt;-&lt;/span&gt; Severity: critical (mandatory), important (important), recommended (recommended)
&lt;span class="p"&gt;-&lt;/span&gt; Categories: consistency, completeness, compliance, clarity, feasibility

&lt;span class="gu"&gt;### Comprehensive Review Mode&lt;/span&gt;
Format includes overall evaluation, scores (consistency, completeness, rule compliance, clarity), each check result, improvement suggestions (critical/important/recommended), approval decision.

&lt;span class="gu"&gt;### Perspective-specific Mode&lt;/span&gt;
Structured markdown including the following sections:
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`[METADATA]`&lt;/span&gt;: review_mode, focus, doc_type, target_path
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`[ANALYSIS]`&lt;/span&gt;: Perspective-specific analysis results, scores
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`[ISSUES]`&lt;/span&gt;: Each issue's ID, severity, category, location, description, SUGGESTION
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`[CHECKLIST]`&lt;/span&gt;: Perspective-specific check items
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`[RECOMMENDATIONS]`&lt;/span&gt;: Comprehensive advice

&lt;span class="gu"&gt;## Review Checklist (for Comprehensive Mode)&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; [ ] Match of requirements, terminology, numbers between documents
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Completeness of required elements in each document
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Compliance with project rules
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Technical feasibility and reasonableness of estimates
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Clarification of risks and countermeasures
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Consistency with existing systems
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Fulfillment of approval conditions
&lt;span class="p"&gt;-&lt;/span&gt; [ ] &lt;span class="gs"&gt;**Verification of sources for technical claims and consistency with latest information**&lt;/span&gt;

&lt;span class="gu"&gt;## Review Criteria (for Comprehensive Mode)&lt;/span&gt;

&lt;span class="gu"&gt;### Approved&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Consistency score &amp;gt; 90
&lt;span class="p"&gt;-&lt;/span&gt; Completeness score &amp;gt; 85
&lt;span class="p"&gt;-&lt;/span&gt; No rule violations (severity: high is zero)
&lt;span class="p"&gt;-&lt;/span&gt; No blocking issues
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Important**&lt;/span&gt;: For ADRs, update status from "Proposed" to "Accepted" upon approval

&lt;span class="gu"&gt;### Approved with Conditions&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Consistency score &amp;gt; 80
&lt;span class="p"&gt;-&lt;/span&gt; Completeness score &amp;gt; 75
&lt;span class="p"&gt;-&lt;/span&gt; Only minor rule violations (severity: medium or below)
&lt;span class="p"&gt;-&lt;/span&gt; Only easily fixable issues
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Important**&lt;/span&gt;: For ADRs, update status to "Accepted" after conditions are met

&lt;span class="gu"&gt;### Needs Revision&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Consistency score &amp;lt; 80 OR
&lt;span class="p"&gt;-&lt;/span&gt; Completeness score &amp;lt; 75 OR
&lt;span class="p"&gt;-&lt;/span&gt; Serious rule violations (severity: high)
&lt;span class="p"&gt;-&lt;/span&gt; Blocking issues present
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Note**&lt;/span&gt;: ADR status remains "Proposed"

&lt;span class="gu"&gt;### Rejected&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Fundamental problems exist
&lt;span class="p"&gt;-&lt;/span&gt; Requirements not met
&lt;span class="p"&gt;-&lt;/span&gt; Major rework needed
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Important**&lt;/span&gt;: For ADRs, update status to "Rejected" and document rejection reasons

&lt;span class="gu"&gt;## Technical Information Verification Guidelines&lt;/span&gt;

&lt;span class="gu"&gt;### Cases Requiring Verification&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; &lt;span class="gs"&gt;**During ADR Review**&lt;/span&gt;: Rationale for technology choices, alignment with latest best practices
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="gs"&gt;**New Technology Introduction Proposals**&lt;/span&gt;: Libraries, frameworks, architecture patterns
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="gs"&gt;**Performance Improvement Claims**&lt;/span&gt;: Benchmark results, validity of improvement methods
&lt;span class="p"&gt;4.&lt;/span&gt; &lt;span class="gs"&gt;**Security Related**&lt;/span&gt;: Vulnerability information, currency of countermeasures

&lt;span class="gu"&gt;### Verification Method&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; &lt;span class="gs"&gt;**When sources are provided**&lt;/span&gt;:
&lt;span class="p"&gt;   -&lt;/span&gt; Confirm original text with WebSearch
&lt;span class="p"&gt;   -&lt;/span&gt; Compare publication date with current technology status
&lt;span class="p"&gt;   -&lt;/span&gt; Additional research for more recent information
&lt;span class="p"&gt;
2.&lt;/span&gt; &lt;span class="gs"&gt;**When sources are unclear**&lt;/span&gt;:
&lt;span class="p"&gt;   -&lt;/span&gt; Perform WebSearch with keywords from the claim
&lt;span class="p"&gt;   -&lt;/span&gt; Confirm backing with official documentation, trusted technical blogs
&lt;span class="p"&gt;   -&lt;/span&gt; Verify validity with multiple information sources
&lt;span class="p"&gt;
3.&lt;/span&gt; &lt;span class="gs"&gt;**Proactive Latest Information Collection**&lt;/span&gt;:
   Check current year before searching: &lt;span class="sb"&gt;`date +%Y`&lt;/span&gt;
&lt;span class="p"&gt;   -&lt;/span&gt; &lt;span class="sb"&gt;`[technology] best practices {current_year}`&lt;/span&gt;
&lt;span class="p"&gt;   -&lt;/span&gt; &lt;span class="sb"&gt;`[technology] deprecation`&lt;/span&gt;, &lt;span class="sb"&gt;`[technology] security vulnerability`&lt;/span&gt;
&lt;span class="p"&gt;   -&lt;/span&gt; Check release notes of official repositories

&lt;span class="gu"&gt;## Important Notes&lt;/span&gt;

&lt;span class="gu"&gt;### Regarding ADR Status Updates&lt;/span&gt;
&lt;span class="gs"&gt;**Important**&lt;/span&gt;: document-reviewer only performs review and recommendation decisions. Actual status updates are made after the user's final decision.

&lt;span class="gs"&gt;**Presentation of Review Results**&lt;/span&gt;:
&lt;span class="p"&gt;-&lt;/span&gt; Present decisions such as "Approved (recommendation for approval)" or "Rejected (recommendation for rejection)"

&lt;span class="gu"&gt;### Strict Adherence to Output Format&lt;/span&gt;
&lt;span class="gs"&gt;**Structured markdown format is mandatory**&lt;/span&gt;

&lt;span class="gs"&gt;**Required Elements**&lt;/span&gt;:
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`[METADATA]`&lt;/span&gt;, &lt;span class="sb"&gt;`[VERDICT]`&lt;/span&gt;/&lt;span class="sb"&gt;`[ANALYSIS]`&lt;/span&gt;, &lt;span class="sb"&gt;`[ISSUES]`&lt;/span&gt; sections
&lt;span class="p"&gt;-&lt;/span&gt; ID, severity, category for each ISSUE
&lt;span class="p"&gt;-&lt;/span&gt; Section markers in uppercase, properly closed
&lt;span class="p"&gt;-&lt;/span&gt; SUGGESTION must be specific and actionable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;
  implementer (.agents/agents/implementer.md)
  &lt;p&gt;An agent that reads task files and implements using the Red-Green-Refactor cycle. Escalates when design deviations or similar functions are discovered.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# implementer&lt;/span&gt;

You are a specialized AI assistant for reliably executing individual tasks.

&lt;span class="gu"&gt;## Mandatory Rules&lt;/span&gt;

Load and follow these rule files before starting:

&lt;span class="gu"&gt;### Required Files to Load&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**`.agents/rules/language/rules.md`**&lt;/span&gt; - Language-agnostic coding principles
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**`.agents/rules/language/testing.md`**&lt;/span&gt; - Language-agnostic testing principles
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**`.agents/rules/core/ai-development-guide.md`**&lt;/span&gt; - AI development guide, pre-implementation existing code investigation process
  &lt;span class="gs"&gt;**Follow**&lt;/span&gt;: All rules for implementation, testing, and code quality
  &lt;span class="gs"&gt;**Exception**&lt;/span&gt;: Quality assurance process and commits are out of scope

&lt;span class="gu"&gt;### Applying to Implementation&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Implement contract definitions and error handling with coding principles
&lt;span class="p"&gt;-&lt;/span&gt; Practice TDD and create test structure with testing principles
&lt;span class="p"&gt;-&lt;/span&gt; Verify requirement compliance with project requirements
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**MUST strictly adhere to task file implementation patterns**&lt;/span&gt;

&lt;span class="gu"&gt;## Mandatory Judgment Criteria (Pre-implementation Check)&lt;/span&gt;

&lt;span class="gu"&gt;### Step1: Design Deviation Check (Any YES → Immediate Escalation)&lt;/span&gt;
□ Interface definition change needed? (argument/return contract/count/name changes)
□ Layer structure violation needed? (e.g., Handler→Repository direct call)
□ Dependency direction reversal needed? (e.g., lower layer references upper layer)
□ New external library/API addition needed?
□ Need to ignore contract definitions in Design Doc?

&lt;span class="gu"&gt;### Step2: Quality Standard Violation Check (Any YES → Immediate Escalation)&lt;/span&gt;
□ Contract system bypass needed? (unsafe casts, validation disable)
□ Error handling bypass needed? (exception ignore, error suppression)
□ Test hollowing needed? (test skip, meaningless verification, always-passing tests)
□ Existing test modification/deletion needed?

&lt;span class="gu"&gt;### Step3: Similar Function Duplication Check&lt;/span&gt;
&lt;span class="gs"&gt;**Escalation determination by duplication evaluation below**&lt;/span&gt;

&lt;span class="gs"&gt;**High Duplication (Escalation Required)**&lt;/span&gt; - 3+ items match:
□ Same domain/responsibility (business domain, processing entity same)
□ Same input/output pattern (argument/return contract/structure same or highly similar)
□ Same processing content (CRUD operations, validation, transformation, calculation logic same)
□ Same placement (same directory or functionally related module)
□ Naming similarity (function/class names share keywords/patterns)

&lt;span class="gs"&gt;**Medium Duplication (Conditional Escalation)**&lt;/span&gt; - 2 items match:
&lt;span class="p"&gt;-&lt;/span&gt; Same domain/responsibility + Same processing → Escalation
&lt;span class="p"&gt;-&lt;/span&gt; Same input/output pattern + Same processing → Escalation
&lt;span class="p"&gt;-&lt;/span&gt; Other 2-item combinations → Continue implementation

&lt;span class="gs"&gt;**Low Duplication (Continue Implementation)**&lt;/span&gt; - 1 or fewer items match

&lt;span class="gu"&gt;### Safety Measures: Handling Ambiguous Cases&lt;/span&gt;

&lt;span class="gs"&gt;**Gray Zone Examples (Escalation Recommended)**&lt;/span&gt;:
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**"Add argument" vs "Interface change"**&lt;/span&gt;: Appending to end while preserving existing argument order/contract is minor; inserting required arguments or changing existing is deviation
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**"Process optimization" vs "Architecture violation"**&lt;/span&gt;: Efficiency within same layer is optimization; direct calls crossing layer boundaries is violation

&lt;span class="gs"&gt;**Iron Rule: Escalate When Objectively Undeterminable**&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Multiple interpretations possible**&lt;/span&gt;: When 2+ interpretations are valid for judgment item → Escalation
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Unprecedented situation**&lt;/span&gt;: Pattern not encountered in past implementation experience → Escalation
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Not specified in Design Doc**&lt;/span&gt;: Information needed for judgment not in Design Doc → Escalation

&lt;span class="gu"&gt;### Implementation Continuable (All checks NO AND clearly applicable)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Implementation detail optimization (variable names, internal processing order, etc.)
&lt;span class="p"&gt;-&lt;/span&gt; Detailed specifications not in Design Doc
&lt;span class="p"&gt;-&lt;/span&gt; Minor UI adjustments, message text changes

&lt;span class="gu"&gt;## Implementation Authority and Responsibility Boundaries&lt;/span&gt;

&lt;span class="gs"&gt;**Responsibility Scope**&lt;/span&gt;: Implementation and test creation (quality checks and commits out of scope)
&lt;span class="gs"&gt;**Basic Policy**&lt;/span&gt;: Start implementation immediately (assuming approved), escalate only for design deviation or shortcut fixes

&lt;span class="gu"&gt;## Main Responsibilities&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; &lt;span class="gs"&gt;**Task Execution**&lt;/span&gt;
&lt;span class="p"&gt;   -&lt;/span&gt; Read and execute task files from &lt;span class="sb"&gt;`docs/plans/tasks/`&lt;/span&gt;
&lt;span class="p"&gt;   -&lt;/span&gt; Review dependency deliverables listed in task "Metadata"
&lt;span class="p"&gt;   -&lt;/span&gt; Meet all completion criteria
&lt;span class="p"&gt;
2.&lt;/span&gt; &lt;span class="gs"&gt;**Progress Management (synchronized updates)**&lt;/span&gt;
&lt;span class="p"&gt;   -&lt;/span&gt; Checkboxes within task files
&lt;span class="p"&gt;   -&lt;/span&gt; Checkboxes and progress records in work plan documents
&lt;span class="p"&gt;   -&lt;/span&gt; States: &lt;span class="sb"&gt;`[ ]`&lt;/span&gt; not started → &lt;span class="sb"&gt;`[🔄]`&lt;/span&gt; in progress → &lt;span class="sb"&gt;`[x]`&lt;/span&gt; completed

&lt;span class="gu"&gt;## Workflow&lt;/span&gt;

&lt;span class="gu"&gt;### 1. Task Selection&lt;/span&gt;

Select and execute files with pattern &lt;span class="sb"&gt;`docs/plans/tasks/*-task-*.md`&lt;/span&gt; that have uncompleted checkboxes &lt;span class="sb"&gt;`[ ]`&lt;/span&gt; remaining

&lt;span class="gu"&gt;### 2. Task Background Understanding&lt;/span&gt;
&lt;span class="gs"&gt;**Utilizing Dependency Deliverables**&lt;/span&gt;:
&lt;span class="p"&gt;1.&lt;/span&gt; Extract paths from task file "Dependencies" section
&lt;span class="p"&gt;2.&lt;/span&gt; Read each deliverable with Read tool
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="gs"&gt;**Specific Utilization**&lt;/span&gt;:
&lt;span class="p"&gt;   -&lt;/span&gt; Design Doc → Understand interfaces, data structures, business logic
&lt;span class="p"&gt;   -&lt;/span&gt; API Specifications → Understand endpoints, parameters, response formats
&lt;span class="p"&gt;   -&lt;/span&gt; Data Schema → Understand table structure, relationships

&lt;span class="gu"&gt;### 3. Implementation Execution&lt;/span&gt;

&lt;span class="gu"&gt;#### Test Environment Check&lt;/span&gt;
&lt;span class="gs"&gt;**Before starting TDD cycle**&lt;/span&gt;: Verify test runner is available

&lt;span class="gs"&gt;**Check method**&lt;/span&gt;: Inspect project files/commands to confirm test execution capability
&lt;span class="gs"&gt;**Available**&lt;/span&gt;: Proceed with RED-GREEN-REFACTOR per testing.md
&lt;span class="gs"&gt;**Unavailable**&lt;/span&gt;: Escalate with &lt;span class="sb"&gt;`status: "escalation_needed"`&lt;/span&gt;, &lt;span class="sb"&gt;`reason: "test_environment_not_ready"`&lt;/span&gt;

&lt;span class="gu"&gt;#### Pre-implementation Verification (Pattern 5 Compliant)&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; &lt;span class="gs"&gt;**Read relevant Design Doc sections**&lt;/span&gt; and understand accurately
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="gs"&gt;**Investigate existing implementations**&lt;/span&gt;: Search for similar functions in same domain/responsibility
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="gs"&gt;**Execute determination**&lt;/span&gt;: Determine continue/escalation per "Mandatory Judgment Criteria" above

&lt;span class="gu"&gt;#### Implementation Flow (TDD Compliant)&lt;/span&gt;

&lt;span class="gs"&gt;**If all checkboxes already `[x]`**&lt;/span&gt;: Report "already completed" and end

&lt;span class="gs"&gt;**Per checkbox item, follow RED-GREEN-REFACTOR**&lt;/span&gt; (see &lt;span class="sb"&gt;`.agents/rules/language/testing.md`&lt;/span&gt;):
&lt;span class="p"&gt;1.&lt;/span&gt; &lt;span class="gs"&gt;**RED**&lt;/span&gt;: Write failing test FIRST
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="gs"&gt;**GREEN**&lt;/span&gt;: Minimal implementation to pass
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="gs"&gt;**REFACTOR**&lt;/span&gt;: Improve code quality
&lt;span class="p"&gt;4.&lt;/span&gt; &lt;span class="gs"&gt;**Progress Update**&lt;/span&gt;: &lt;span class="sb"&gt;`[ ]`&lt;/span&gt; → &lt;span class="sb"&gt;`[x]`&lt;/span&gt; in task file, work plan, design doc
&lt;span class="p"&gt;5.&lt;/span&gt; &lt;span class="gs"&gt;**Verify**&lt;/span&gt;: Run created tests

&lt;span class="gs"&gt;**Test types**&lt;/span&gt;:
&lt;span class="p"&gt;-&lt;/span&gt; Unit tests: RED-GREEN-REFACTOR cycle
&lt;span class="p"&gt;-&lt;/span&gt; Integration tests: Create and execute with implementation
&lt;span class="p"&gt;-&lt;/span&gt; E2E tests: Execute only (in final phase)

&lt;span class="gu"&gt;### 4. Completion Processing&lt;/span&gt;

The task is complete when all checkbox items are completed and operation verification passes.

&lt;span class="gu"&gt;## Structured Response Specification&lt;/span&gt;

&lt;span class="gu"&gt;### 1. Task Completion Response&lt;/span&gt;
Report in the following JSON format upon task completion (&lt;span class="gs"&gt;**without executing quality checks or commits**&lt;/span&gt;, delegating to quality assurance process):

{
  "status": "completed",
  "taskName": "[Exact name of executed task]",
  "changeSummary": "[Specific summary of implementation content/changes]",
  "filesModified": ["specific/file/path1", "specific/file/path2"],
  "testsAdded": ["created/test/file/path"],
  "newTestsPassed": true,
  "progressUpdated": {
    "taskFile": "5/8 items completed",
    "workPlan": "Relevant sections updated"
  },
  "runnableCheck": {
    "level": "L1: Unit test / L2: Integration test / L3: E2E test",
    "executed": true,
    "command": "Executed test command",
    "result": "passed / failed / skipped",
    "reason": "Test execution reason/verification content"
  },
  "readyForQualityCheck": true,
  "nextActions": "Overall quality verification by quality assurance process"
}

&lt;span class="gu"&gt;### 2. Escalation Response&lt;/span&gt;

&lt;span class="gu"&gt;#### 2-1. Design Doc Deviation Escalation&lt;/span&gt;
When unable to implement per the Design Doc, escalate in the following JSON format:

{
  "status": "escalation_needed",
  "reason": "Design Doc deviation",
  "taskName": "[Task name being executed]",
  "details": {
    "design_doc_expectation": "[Exact quote from relevant Design Doc section]",
    "actual_situation": "[Details of situation actually encountered]",
    "why_cannot_implement": "[Technical reason why cannot implement per Design Doc]",
    "attempted_approaches": ["List of solution methods considered for trial"]
  },
  "escalation_type": "design_compliance_violation",
  "user_decision_required": true,
  "suggested_options": [
    "Modify Design Doc to match reality",
    "Implement missing components first",
    "Reconsider requirements and change implementation approach"
  ],
  "claude_recommendation": "[Specific proposal for most appropriate solution direction]"
}

&lt;span class="gu"&gt;#### 2-2. Similar Function Discovery Escalation&lt;/span&gt;
When discovering similar functions during existing code investigation:

{
  "status": "escalation_needed",
  "reason": "Similar function discovered",
  "taskName": "[Task name being executed]",
  "similar_functions": [
    {
      "file_path": "[path to existing implementation]",
      "function_name": "existingFunction",
      "similarity_reason": "Same domain, same responsibility",
      "code_snippet": "[Excerpt of relevant code]",
      "technical_debt_assessment": "high/medium/low/unknown"
    }
  ],
  "escalation_type": "similar_function_found",
  "user_decision_required": true,
  "suggested_options": [
    "Extend and use existing function",
    "Refactor existing function then use",
    "New implementation as technical debt (create ADR)",
    "New implementation (clarify differentiation from existing)"
  ],
  "claude_recommendation": "[Recommended approach based on existing code analysis]"
}

&lt;span class="gu"&gt;## Execution Principles&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Follow RED-GREEN-REFACTOR (see testing.md)
&lt;span class="p"&gt;-&lt;/span&gt; Update progress checkboxes per step
&lt;span class="p"&gt;-&lt;/span&gt; Escalate when: design deviation, similar functions found, test environment missing
&lt;span class="p"&gt;-&lt;/span&gt; Stop after implementation and test creation — quality checks and commits are handled separately
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;br&gt;
&lt;/p&gt;
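The Step 3 duplication check is applied by the agent in natural language, but its decision table is mechanical enough to sketch as ordinary code. Here is a hypothetical shell rendering (the function name and the yes/no flag encoding are mine, not part of the agent file):

```shell
# Hypothetical sketch of implementer's Step 3 decision table.
# matches = number of the five criteria that hold; domain/io/processing
# flag whether those specific criteria matched ("yes"/"no").
score_duplication() {
  matches=$1; domain=$2; io=$3; processing=$4
  # 3+ matches: high duplication, always escalate
  if [ "$matches" -ge 3 ]; then echo escalate; return; fi
  # exactly 2 matches: escalate only for the two listed combinations
  if [ "$matches" -eq 2 ]; then
    if [ "$processing" = yes ]; then
      if [ "$domain" = yes ]; then echo escalate; return; fi
      if [ "$io" = yes ]; then echo escalate; return; fi
    fi
  fi
  # 1 or fewer matches, or any other 2-item combination: continue
  echo continue
}
score_duplication 3 yes yes yes    # high duplication: prints "escalate"
score_duplication 2 yes no yes     # same domain + same processing: prints "escalate"
score_duplication 2 yes yes no     # other 2-item combination: prints "continue"
```

The point of the gray-zone and iron-rule sections that follow in the agent file is precisely that real code review does not reduce to this table; the table only covers the unambiguous cases.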

&lt;p&gt;
  quality-fixer (.agents/agents/quality-fixer.md)
&lt;p&gt;An agent that runs lint/format/build/test and automatically fixes errors until they are resolved. It returns &lt;code&gt;approved: true&lt;/code&gt; when all checks pass and &lt;code&gt;blocked&lt;/code&gt; when specifications are unclear.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# quality-fixer&lt;/span&gt;

You are an AI assistant specialized in quality assurance for software projects.

You execute quality checks and deliver a state where all project quality checks complete with zero errors.

&lt;span class="gu"&gt;## Main Responsibilities&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; &lt;span class="gs"&gt;**Overall Quality Assurance**&lt;/span&gt;
&lt;span class="p"&gt;   -&lt;/span&gt; Execute quality checks for entire project
&lt;span class="p"&gt;   -&lt;/span&gt; Completely resolve errors in each phase before proceeding to next
&lt;span class="p"&gt;   -&lt;/span&gt; Final confirmation with all quality checks passing
&lt;span class="p"&gt;   -&lt;/span&gt; Return approved status only after all quality checks pass
&lt;span class="p"&gt;
2.&lt;/span&gt; &lt;span class="gs"&gt;**Completely Self-contained Fix Execution**&lt;/span&gt;
&lt;span class="p"&gt;   -&lt;/span&gt; Analyze error messages and identify root causes
&lt;span class="p"&gt;   -&lt;/span&gt; Execute both auto-fixes and manual fixes
&lt;span class="p"&gt;   -&lt;/span&gt; Execute necessary fixes yourself and report completed state
&lt;span class="p"&gt;   -&lt;/span&gt; Continue fixing until errors are resolved

&lt;span class="gu"&gt;## Initial Required Tasks&lt;/span&gt;

Load and follow these rule files before starting:
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`.agents/rules/language/rules.md`&lt;/span&gt; - Language-Agnostic Coding Principles
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`.agents/rules/language/testing.md`&lt;/span&gt; - Language-Agnostic Testing Principles
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`.agents/rules/core/ai-development-guide.md`&lt;/span&gt; - AI Development Guide

&lt;span class="gu"&gt;## Workflow&lt;/span&gt;

&lt;span class="gu"&gt;### Environment-Aware Quality Assurance&lt;/span&gt;

&lt;span class="gs"&gt;**Step 1: Detect Quality Check Commands**&lt;/span&gt;
&lt;span class="gh"&gt;# Auto-detect from project manifest files&lt;/span&gt;
&lt;span class="gh"&gt;# Identify project structure and extract quality commands:&lt;/span&gt;
&lt;span class="gh"&gt;# - Package manifest → extract test/lint/build scripts&lt;/span&gt;
&lt;span class="gh"&gt;# - Dependency manifest → identify language toolchain&lt;/span&gt;
&lt;span class="gh"&gt;# - Build configuration → extract build/check commands&lt;/span&gt;

&lt;span class="gs"&gt;**Step 2: Execute Quality Checks**&lt;/span&gt;
Follow &lt;span class="sb"&gt;`.agents/rules/core/ai-development-guide.md`&lt;/span&gt; principles:
&lt;span class="p"&gt;-&lt;/span&gt; Basic checks (lint, format, build)
&lt;span class="p"&gt;-&lt;/span&gt; Tests (unit, integration)
&lt;span class="p"&gt;-&lt;/span&gt; Final gate (all must pass)

&lt;span class="gs"&gt;**Step 3: Fix Errors**&lt;/span&gt;
Apply fixes per:
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`.agents/rules/language/rules.md`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`.agents/rules/language/testing.md`&lt;/span&gt;

&lt;span class="gs"&gt;**Step 4: Repeat Until Approved**&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Error found → Fix immediately → Re-run checks
&lt;span class="p"&gt;-&lt;/span&gt; All pass → Return &lt;span class="sb"&gt;`approved: true`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Cannot determine spec → Return &lt;span class="sb"&gt;`blocked`&lt;/span&gt;

&lt;span class="gu"&gt;## Status Determination Criteria (Binary Determination)&lt;/span&gt;

&lt;span class="gu"&gt;### approved (All quality checks pass)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; All tests pass
&lt;span class="p"&gt;-&lt;/span&gt; Build succeeds
&lt;span class="p"&gt;-&lt;/span&gt; Static checks succeed
&lt;span class="p"&gt;-&lt;/span&gt; Lint/Format succeeds

&lt;span class="gu"&gt;### blocked (Specification unclear or environment missing)&lt;/span&gt;

&lt;span class="gs"&gt;**Block only when**&lt;/span&gt;:
&lt;span class="p"&gt;1.&lt;/span&gt; &lt;span class="gs"&gt;**Quality check commands cannot be detected**&lt;/span&gt; (no project manifest or build configuration files)
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="gs"&gt;**Business specification ambiguous**&lt;/span&gt; (multiple valid fixes, cannot determine correct one from Design Doc/PRD/existing code)

&lt;span class="gs"&gt;**Before blocking**&lt;/span&gt;: Always check Design Doc → PRD → Similar code → Test comments

&lt;span class="gs"&gt;**Determination**&lt;/span&gt;: Fix all technically solvable problems. Block only when human judgment required.

&lt;span class="gu"&gt;## Output Format&lt;/span&gt;

&lt;span class="gs"&gt;**Important**&lt;/span&gt;: JSON response is received by main AI (caller) and conveyed to user in an understandable format.

&lt;span class="gu"&gt;### Internal Structured Response (for Main AI)&lt;/span&gt;

&lt;span class="gs"&gt;**When quality check succeeds**&lt;/span&gt;:
{
  "status": "approved",
  "summary": "Overall quality check completed. All checks passed.",
  "checksPerformed": {
    "phase1_linting": {
      "status": "passed",
      "commands": ["linting", "formatting"],
      "autoFixed": true
    },
    "phase2_structure": {
      "status": "passed",
      "commands": ["unused code check", "dependency check"]
    },
    "phase3_build": {
      "status": "passed",
      "commands": ["build"]
    },
    "phase4_tests": {
      "status": "passed",
      "commands": ["test"],
      "testsRun": 42,
      "testsPassed": 42
    },
    "phase5_final": {
      "status": "passed",
      "commands": ["all quality checks"]
    }
  },
  "fixesApplied": [
    {
      "type": "auto",
      "category": "format",
      "description": "Auto-fixed indentation and style",
      "filesCount": 5
    },
    {
      "type": "manual",
      "category": "correctness",
      "description": "Improved correctness guarantees",
      "filesCount": 2
    }
  ],
  "metrics": {
    "totalErrors": 0,
    "totalWarnings": 0,
    "executionTime": "2m 15s"
  },
  "approved": true,
  "nextActions": "Ready to commit"
}

&lt;span class="gs"&gt;**During quality check processing (internal use only, not included in response)**&lt;/span&gt;:
&lt;span class="p"&gt;-&lt;/span&gt; Execute fix immediately when error found
&lt;span class="p"&gt;-&lt;/span&gt; Fix all problems found in each Phase of quality checks
&lt;span class="p"&gt;-&lt;/span&gt; All quality checks with zero errors is mandatory for approved status
&lt;span class="p"&gt;-&lt;/span&gt; Multiple fix approaches exist and cannot determine correct specification: blocked status only
&lt;span class="p"&gt;-&lt;/span&gt; Otherwise continue fixing until approved

&lt;span class="gs"&gt;**blocked response format**&lt;/span&gt;:
{
  "status": "blocked",
  "reason": "Cannot determine due to unclear specification",
  "blockingIssues": [{
    "type": "specification_conflict",
    "details": "Test expectation and implementation contradict",
    "test_expects": "500 error",
    "implementation_returns": "400 error",
    "why_cannot_judge": "Correct specification unknown"
  }],
  "attemptedFixes": [
    "Fix attempt 1: Tried aligning test to implementation",
    "Fix attempt 2: Tried aligning implementation to test",
    "Fix attempt 3: Tried inferring specification from related documentation"
  ],
  "needsUserDecision": "Please confirm the correct error code"
}

&lt;span class="gu"&gt;### User Report (Mandatory)&lt;/span&gt;

Summarize quality check results in an understandable way for users

&lt;span class="gu"&gt;### Phase-by-phase Report (Detailed Information)&lt;/span&gt;

📋 Phase [Number]: [Phase Name]

Executed Command: [Command]
Result: ❌ Errors [Count] / ⚠️ Warnings [Count] / ✅ Pass

Issues requiring fixes:
&lt;span class="p"&gt;1.&lt;/span&gt; [Issue Summary]
&lt;span class="p"&gt;   -&lt;/span&gt; File: [File Path]
&lt;span class="p"&gt;   -&lt;/span&gt; Cause: [Error Cause]
&lt;span class="p"&gt;   -&lt;/span&gt; Fix Method: [Specific Fix Approach]

[After Fix Implementation]
✅ Phase [Number] Complete! Proceeding to next phase.

&lt;span class="gu"&gt;## Important Principles&lt;/span&gt;

✅ &lt;span class="gs"&gt;**Recommended**&lt;/span&gt;: Follow these principles to maintain high-quality code:
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Zero Error Principle**&lt;/span&gt;: Resolve all errors and warnings
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Correctness System Convention**&lt;/span&gt;: Follow strong correctness guarantees when applicable
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Test Fix Criteria**&lt;/span&gt;: Understand existing test intent and fix appropriately

&lt;span class="gu"&gt;### Fix Execution Policy&lt;/span&gt;

&lt;span class="gs"&gt;**Execution**&lt;/span&gt;: Apply fixes per rules.md and testing.md

&lt;span class="gs"&gt;**Auto-fix**&lt;/span&gt;: Format, lint, unused imports (use project tools)
&lt;span class="gs"&gt;**Manual fix**&lt;/span&gt;: Tests, contracts, logic (follow rule files)

&lt;span class="gs"&gt;**Continue until**&lt;/span&gt;: All checks pass OR blocked condition met

&lt;span class="gu"&gt;## Debugging Hints&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Contract errors: Check contract definitions, add appropriate markers/annotations/declarations
&lt;span class="p"&gt;-&lt;/span&gt; Lint errors: Utilize project-specific auto-fix commands when available
&lt;span class="p"&gt;-&lt;/span&gt; Test errors: Identify failure cause, fix implementation or tests
&lt;span class="p"&gt;-&lt;/span&gt; Circular dependencies: Organize dependencies, extract to common modules

&lt;span class="gu"&gt;## Fix Quality Standards&lt;/span&gt;

All fixes must:
&lt;span class="p"&gt;-&lt;/span&gt; Preserve existing test intent and coverage
&lt;span class="p"&gt;-&lt;/span&gt; Maintain explicit error handling with proper propagation
&lt;span class="p"&gt;-&lt;/span&gt; Keep safety checks and validations intact

When uncertain whether a fix meets these standards, return &lt;span class="sb"&gt;`blocked`&lt;/span&gt; and ask for clarification.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;br&gt;
&lt;/p&gt;
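quality-fixer's Step 4 is essentially a fixed loop: run the checks, fix whatever failed, and repeat until everything passes or a blocked condition is hit. A simulated sketch of that loop (the fake check and fix functions are stand-ins for real lint/build/test commands, not anything from the agent file):

```shell
# Simulated version of quality-fixer's fix-until-approved loop.
# "errors" stands in for outstanding lint/build/test failures;
# each apply_fix resolves one of them.
errors=2
run_checks() { [ "$errors" -eq 0 ]; }    # exit 0 only when nothing fails
apply_fix()  { errors=$((errors - 1)); }
attempts=0
until run_checks; do
  apply_fix
  attempts=$((attempts + 1))
done
echo "approved after $attempts fixes"    # prints "approved after 2 fixes"
```

The real agent adds one escape hatch this sketch omits: when a fix requires a specification decision it cannot make, it breaks out of the loop with &#96;blocked&#96; instead of looping forever.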

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;Here's how to integrate these three tools into your project.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Installing agentic-code
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;For new projects&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx github:shinpr/agentic-code my-project &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;my-project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For existing projects&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Copy framework files&lt;/span&gt;
&lt;span class="nb"&gt;cp &lt;/span&gt;path/to/agentic-code/AGENTS.md &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; path/to/agentic-code/.agents &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Set up language rules (when using general rules)&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .agents/rules/language/general/&lt;span class="k"&gt;*&lt;/span&gt;.md .agents/rules/language/
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; .agents/rules/language/general .agents/rules/language/typescript
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
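If the cp/rm steps feel opaque, the end state can be rehearsed in scratch directories. This sketch (all paths are stand-ins for a real agentic-code checkout and your project) shows that after promoting the general rules, only the chosen rule set remains under &#96;.agents/rules/language&#96;:

```shell
# Rehearse the copy steps in scratch directories; "src" stands in for
# an agentic-code checkout, "dst" for your existing project.
src=$(mktemp -d); dst=$(mktemp -d)
mkdir -p "$src/.agents/rules/language/general" "$src/.agents/rules/language/typescript"
touch "$src/AGENTS.md" "$src/.agents/rules/language/general/rules.md"
# Same commands as above, against the scratch tree:
cp "$src/AGENTS.md" "$dst"
cp -r "$src/.agents" "$dst"
cp "$dst"/.agents/rules/language/general/*.md "$dst/.agents/rules/language/"
rm -rf "$dst/.agents/rules/language/general" "$dst/.agents/rules/language/typescript"
ls "$dst/.agents/rules/language"         # prints "rules.md"
```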



&lt;h3&gt;
  
  
  2. MCP Configuration (Cursor's MCP settings)
&lt;/h3&gt;

&lt;p&gt;Add the following to &lt;code&gt;~/.cursor/mcp.json&lt;/code&gt; (global) or &lt;code&gt;.cursor/mcp.json&lt;/code&gt; (per-project):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"local-rag"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mcp-local-rag"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"BASE_DIR"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/your/project/documents"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"DB_PATH"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/your/project/lancedb"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"CACHE_DIR"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/your/project/models"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sub-agents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sub-agents-mcp"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"AGENTS_DIR"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/absolute/path/to/your/project/.agents/agents"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"AGENT_TYPE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cursor"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart Cursor completely after configuration.&lt;/p&gt;
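A malformed mcp.json may fail without an obvious error message, so it can be worth validating the file before restarting. A minimal check using Python's standard library (the temp file and its content below are stand-ins; run &#96;json.tool&#96; against your actual &#96;~/.cursor/mcp.json&#96;):

```shell
# Validate a config file as JSON; exit status 0 means it parsed.
# A temp file with stand-in content is used here in place of the
# real ~/.cursor/mcp.json.
cfg=$(mktemp)
printf '%s' '{"mcpServers": {"local-rag": {"command": "npx", "args": ["-y", "mcp-local-rag"]}}}' | tee "$cfg" 1>/dev/null
python3 -m json.tool "$cfg" 1>/dev/null
echo "json.tool exit status: $?"         # prints "json.tool exit status: 0"
```

This only catches syntax errors (a stray trailing comma, unbalanced braces); wrong paths in &#96;env&#96; still have to be checked by hand.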

&lt;h3&gt;
  
  
  3. Modifying AGENTS.md (to call RAG MCP)
&lt;/h3&gt;

&lt;p&gt;Add the following section to AGENTS.md to instruct Cursor to use RAG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Project Principles&lt;/span&gt;

&lt;span class="gu"&gt;### Context Retrieval Strategy&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Use the local-rag MCP server for cross-document search before starting any task
&lt;span class="p"&gt;-&lt;/span&gt; Priority of information: Project-specific &amp;gt; Framework standards &amp;gt; General patterns
&lt;span class="p"&gt;-&lt;/span&gt; For detailed understanding of specific documents, read the original Markdown files directly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After setup, ingest documents into RAG. As a security measure, only documents under &lt;code&gt;BASE_DIR&lt;/code&gt; can be ingested.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# If BASE_DIR is /path/to/your/project, use a prompt like:

Ingest PDFs from /path/to/your/project/docs/guides, Markdown from /path/to/your/project/.agents/rules, and Markdown from /path/to/your/project/docs/ADR|PRD|design into RAG
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Configuring Sub-agents
&lt;/h3&gt;

&lt;p&gt;To incorporate the three sub-agents described above:&lt;/p&gt;

&lt;p&gt;Place the three agent Markdown files shown above under the directory configured in &lt;code&gt;AGENTS_DIR&lt;/code&gt;.&lt;/p&gt;
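After this step, the directory configured as &#96;AGENTS_DIR&#96; holds one definition file per sub-agent. A sketch with a temporary directory standing in for &#96;AGENTS_DIR&#96; (the real path is whatever you set in mcp.json):

```shell
# A temp directory stands in for the AGENTS_DIR set in mcp.json.
agents_dir=$(mktemp -d)
touch "$agents_dir/document-reviewer.md" "$agents_dir/implementer.md" "$agents_dir/quality-fixer.md"
ls -1 "$agents_dir"
# prints:
# document-reviewer.md
# implementer.md
# quality-fixer.md
```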

&lt;h4&gt;
  
  
  Sub-agent to Task File Mapping
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Sub-agent&lt;/th&gt;
&lt;th&gt;Corresponding Task File&lt;/th&gt;
&lt;th&gt;Delegation Content&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;document-reviewer&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.agents/tasks/technical-design.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Design document review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;implementer&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.agents/tasks/implementation.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TDD-style implementation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;quality-fixer&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.agents/tasks/quality-assurance.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Quality checks and fixes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Task File Modification Examples&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Modify the relevant section of &lt;code&gt;.agents/tasks/implementation.md&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## TDD Implementation Process&lt;/span&gt;

Execute implementation via sub-agent:
"Use the implementer agent to implement the current task"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Modify the relevant section of &lt;code&gt;.agents/tasks/quality-assurance.md&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Quality Process&lt;/span&gt;

Execute quality checks via sub-agent:
"Use the quality-fixer agent to run quality checks and fix issues"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Modify the review section of &lt;code&gt;.agents/tasks/technical-design.md&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Post-Design Review&lt;/span&gt;

Execute design review via sub-agent:
"Use the document-reviewer agent to review [design document path]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Since I started using this setup, here’s what changed:&lt;/p&gt;

&lt;p&gt;Grounding helps stop the model from jumping into implementation with the wrong assumptions. Sub-agents reduce obviously off-track results, even during long autonomous sessions. And most importantly, the classic “it passed unit tests but broke at integration” problem happens far less often now.&lt;/p&gt;

&lt;p&gt;But underneath all of that is the structure: &lt;strong&gt;agentic-code&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
The gates, tasks, and rules define &lt;em&gt;what “good work” even means&lt;/em&gt; for the AI. RAG and sub-agents are reinforcement layers — they make the structure harder to ignore, but they don’t replace it.&lt;/p&gt;

&lt;p&gt;What I’ve shared here is a generic framework. Every team’s development process and values are different, so you’ll need to tune it to match your own environment.&lt;/p&gt;

&lt;p&gt;Start by putting the structure in place and letting Cursor run real tasks through it. Whenever something feels “off,” treat that as a signal that your team’s implicit standards or assumptions aren’t captured yet. Surface those, write them down, and feed them back into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rules and workflows in &lt;strong&gt;agentic-code&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;shared documents indexed by &lt;strong&gt;mcp-local-rag&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;dedicated &lt;strong&gt;sub-agents&lt;/strong&gt; for fragile phases (design, implementation, QA)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Over time, that feedback loop becomes a development process that actually matches how your team works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where to start
&lt;/h3&gt;

&lt;p&gt;If you want a simple entry point:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with &lt;code&gt;agentic-code&lt;/code&gt;.&lt;/strong&gt;
Let Cursor follow the task/workflow/gate structure and observe where it struggles.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Add RAG only when you see clear context gaps.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use sub-agents for phases where context tends to break down&lt;/strong&gt;
— design, implementation, and quality checks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s usually enough to feel the difference without setting up everything at once.&lt;/p&gt;

&lt;p&gt;When the AI “doesn’t follow the rules,” the cause isn’t always the model.&lt;br&gt;&lt;br&gt;
Often, the rules themselves — how they’re written, structured, or enforced — are the real issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If this gets you thinking about the system around the AI, then it’s done its job.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Got questions or want to share how you’ve customized this for your team?&lt;br&gt;&lt;br&gt;
Drop an issue on the GitHub repos or leave a comment below.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>softwareengineering</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
