<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Igor Kramar</title>
    <description>The latest articles on Forem by Igor Kramar (@ikramar).</description>
    <link>https://forem.com/ikramar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2211561%2F11d1ba7b-e52e-4488-bc50-cbb19848e004.jpg</url>
      <title>Forem: Igor Kramar</title>
      <link>https://forem.com/ikramar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ikramar"/>
    <language>en</language>
    <item>
      <title>I built a Claude Code plugin that argues with me about architecture. Then it caught me lying to it.</title>
      <dc:creator>Igor Kramar</dc:creator>
      <pubDate>Fri, 08 May 2026 20:29:24 +0000</pubDate>
      <link>https://forem.com/ikramar/i-built-a-claude-code-plugin-that-argues-with-me-about-architecture-then-it-caught-me-lying-to-it-3ne8</link>
      <guid>https://forem.com/ikramar/i-built-a-claude-code-plugin-that-argues-with-me-about-architecture-then-it-caught-me-lying-to-it-3ne8</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; I built a Claude Code plugin (MIT, &lt;a href="https://github.com/IgorKramar/archforge-marketplace" rel="noopener noreferrer"&gt;github.com/IgorKramar/archforge-marketplace&lt;/a&gt;) that turns Claude into a senior architect. After running one deep cycle on a real architectural decision, you get back: a multi-page ADR with explicit architectural rules, two or three honest alternatives with trade-offs, and a five-role adversarial roast. On one of my own ADRs the roast found a confused-deputy attack vector at the LLM-tool boundary that I'd missed — flagged independently by both the security role and the compliance role. The plugin then failed to follow its own language rules in a specific, instructive way. This article is what I learned, including how the failure led to the most useful feature I added.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I've been using Claude as a daily collaborator for about a year. For most things — code, writing, debugging — it's excellent. But I noticed a specific failure mode in architectural conversations. I'd say "I'm thinking of using Postgres for this." Claude would say "Great choice, here's why Postgres works." I'd say "Actually, let me reconsider, maybe SQLite." Claude would say "Excellent, SQLite is perfect for this case, here's why." The two answers couldn't both be right. They couldn't even both be useful.&lt;/p&gt;

&lt;p&gt;Architecture is one of the few engineering activities where &lt;strong&gt;disagreement is the work&lt;/strong&gt;. You need someone who refuses to let you skip ahead. Someone who insists on alternatives. Someone who points at the downside you don't want to hear. Default LLM tuning is the opposite of that.&lt;/p&gt;

&lt;p&gt;So I built a plugin. It's MIT-licensed, the repo is at the end. This article is what I learned building it and putting it through a real architectural cycle on a real project — including the part where the plugin caught itself failing, and how that led to the most interesting feature in the whole thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a cycle, not a template
&lt;/h2&gt;

&lt;p&gt;Most AI-architecture tooling I'd seen was templates. "Here's an ADR template. Fill it in." But ADRs are the artifact, not the work. A good ADR is the residue of a process that included three things you can't fake by filling in a template:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Discovery&lt;/strong&gt; — making constraints, forces, and prior art explicit before you allow yourself to think about solutions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real alternatives&lt;/strong&gt; — not "the option I want" plus a strawman, but two or three genuine options each with honest trade-offs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Push-back&lt;/strong&gt; — someone willing to argue, especially when your first instinct is to skip ahead.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So I structured the plugin around a six-phase cycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. DISCOVER  → constraints, forces, prior art, requirements
2. RESEARCH  → current information from the web (versions, prices, regulation)
3. DESIGN    → 2–3 alternatives, each with trade-offs
4. DECIDE    → pick one, state why, state when it breaks
5. DOCUMENT  → ADR + update ARCHITECTURE.md + diagrams
6. REVIEW    → architectural review when code lands
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each phase is a slash command (&lt;code&gt;/archforge:discover&lt;/code&gt;, &lt;code&gt;:research&lt;/code&gt;, etc.). There's &lt;code&gt;/archforge:cycle "&amp;lt;problem&amp;gt;"&lt;/code&gt; that walks the whole thing with user gates between phases. The phases enforce &lt;strong&gt;discipline&lt;/strong&gt;. You can't propose solutions in &lt;code&gt;discover&lt;/code&gt;. You can't introduce new alternatives in &lt;code&gt;decide&lt;/code&gt;. You can't write an ADR until you've actually decided. The structure forces the conversation to slow down where it matters.&lt;/p&gt;
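
&lt;p&gt;The phase discipline can be pictured as a small state machine. This is an illustrative sketch only — the plugin enforces these rules through prompt instructions in markdown, not compiled code, and the names here are mine:&lt;/p&gt;

```rust
// Illustrative sketch: the cycle's gating rules made total. The plugin
// itself enforces this via prompt instructions, not code; all names
// here are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Discover,
    Research,
    Design,
    Decide,
    Document,
    Review,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    GatherConstraints,
    ProposeSolution,
    IntroduceAlternative,
    WriteAdr,
}

// The three prohibitions from the text: no solutions in discover, no new
// alternatives in decide, no ADR before a decision exists.
fn allowed(phase: Phase, action: Action) -> bool {
    use {Action::*, Phase::*};
    match action {
        GatherConstraints => matches!(phase, Discover | Research),
        ProposeSolution | IntroduceAlternative => matches!(phase, Design),
        WriteAdr => matches!(phase, Document),
    }
}
```

&lt;p&gt;The point of making the mapping total is that every (phase, action) pair has an explicit answer — the same property the slash commands give the conversation.&lt;/p&gt;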

&lt;p&gt;The router skill — the one that activates whenever the user mentions architecture or design — has a section that turned out to matter more than anything else:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;### Hold position. Argue. Don't soft-cave.&lt;/span&gt;

If the user proposes something you consider weak, say so directly and argue.
Do not collapse at the first pushback. Soft pushback that folds is a form of
disrespect — the user came for honest critique, not validation.
Maintain the position until presented with a real counter-argument.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That paragraph carried more weight in practice than I expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Want to try it before reading the story?
&lt;/h2&gt;

&lt;p&gt;Skip ahead if you want to know what it does first. But if you'd rather see for yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In Claude Code:&lt;/span&gt;
/plugin marketplace add https://github.com/IgorKramar/archforge-marketplace
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;archforge@archforge-marketplace

&lt;span class="c"&gt;# In any project (preferably one with an open architectural question):&lt;/span&gt;
/archforge:init
/archforge:cycle &lt;span class="s2"&gt;"should I extract this module into its own service?"&lt;/span&gt; &lt;span class="nt"&gt;--scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;light
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;code&gt;light&lt;/code&gt;-scale cycle takes about 10 minutes, walks you through Discover → Design → Decide → Document with pauses for your input, and produces a short ADR with two alternatives and explicit reasoning. If you find the result useful, run a &lt;code&gt;--scale=deep&lt;/code&gt; cycle on a real decision next.&lt;/p&gt;

&lt;p&gt;The rest of this article is the story of what the cycle produced when I ran it on a hard problem in my own project, and what failed when I did.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the plugin actually contains
&lt;/h2&gt;

&lt;p&gt;Three layers of components, but the structure isn't the interesting part — what each layer does is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills&lt;/strong&gt; — knowledge and disposition. Ten of them. A router skill that sets the architect persona. Specialists for ADR writing, system design, frontend architecture, backend architecture, AI-agents architecture, code review, and web research; an integration skill for &lt;a href="https://github.com/EveryInc/compound-engineering-plugin" rel="noopener noreferrer"&gt;Compound Engineering&lt;/a&gt;; and an architectural-diagrams skill (C4, sequence, state, ER, deployment — all Mermaid). Skills are markdown files Claude pulls in when their description matches the current task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slash commands&lt;/strong&gt; — 17 of them in v0.4. The cycle phases. Plus shortcuts (&lt;code&gt;/archforge:adr&lt;/code&gt;), plus tooling (&lt;code&gt;/archforge:map&lt;/code&gt; for a decision dependency graph, &lt;code&gt;/archforge:observe&lt;/code&gt; for finding architectural decisions made implicitly in code, &lt;code&gt;/archforge:upgrade&lt;/code&gt; for migrating projects between plugin versions, &lt;code&gt;/archforge:diagram&lt;/code&gt; for any of the five diagram types).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sub-agents&lt;/strong&gt; — nine. Three for long-running tasks (&lt;code&gt;architect&lt;/code&gt;, &lt;code&gt;reviewer&lt;/code&gt;, &lt;code&gt;researcher&lt;/code&gt;). Five for adversarial review — but I'll get to those, because they're where the story turns. And one for catching the plugin failing itself. That last one is the entire point of this article.&lt;/p&gt;

&lt;p&gt;Plus soft-warning hooks that nudge you when you've changed many files without an ADR, when a new top-level directory appears, when many modules have been touched without architectural observation. Reminders, not gates.&lt;/p&gt;

&lt;h2&gt;
  
  
  The case study: an AI agent for a regulated market
&lt;/h2&gt;

&lt;p&gt;I'm building a SaaS product — call it "the project" for this article — an AI-driven landing page builder targeting small businesses in a regulated jurisdiction where strict data residency requirements rule out US-hosted LLM providers. Florists, barbers, coffee shops. The differentiation is that an AI agent walks the user from "I want a website" to "the website is published and I can see analytics" without the user having to know design, code, or AI prompting.&lt;/p&gt;

&lt;p&gt;The first architectural decision was the AI agent layer. This is hard because everything else depends on it: the backend's role, the data model, the load profile, the legal posture. And there are hard constraints — local data-protection law regulates personal data processing, and Anthropic's API isn't reachable from the target jurisdiction's infrastructure, which makes Claude itself problematic for the project technically and legally.&lt;/p&gt;

&lt;p&gt;I ran &lt;code&gt;/archforge:cycle "AI agent architecture for the editor"&lt;/code&gt;. Below is what happened, condensed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Discover, round 1 — and the constraint I'd missed
&lt;/h3&gt;

&lt;p&gt;The discovery phase produced a structured document: restated problem, functional requirements, quality attributes, constraints (team: solo developer, budget: startup, compliance: local data-protection law), prior art (Lovable, v0, Cursor, regional AI website builders), forces, and six open questions, each posed with two opposing options.&lt;/p&gt;

&lt;p&gt;I answered them. Then I noticed I'd written "Anthropic API only, server-side" as a constraint, and the plugin had built the rest on top of that. But Anthropic isn't reachable from infrastructure in the target jurisdiction, and sending end-user personal data to US servers violates the local data-protection law.&lt;/p&gt;

&lt;p&gt;I pointed this out. The plugin produced a &lt;strong&gt;second round of discover&lt;/strong&gt; — clearly labeled "Section 7: Second round of discover (after user input)". It:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Acknowledged the gap honestly: "this should have come up in round one; I accepted the user's premise without verifying it."&lt;/li&gt;
&lt;li&gt;Expanded alternatives to four LLM-provider categories: regional commercial APIs (regulator-certified), self-hosted open-source, Anthropic via tunnel (and explicitly flagged it as not solving the data-residency requirement — data still physically leaves the jurisdiction), and a hybrid.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pushed back on two of my answers&lt;/strong&gt; to the original questions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On Q3 (depth of business context) I'd answered "profile + dynamic memory (RAG)." The plugin pushed back: "Profile is cheap — 5–10 fields in the DB. Dynamic memory is a separate module with embedding provider, vector store, retrieval, ranking, periodic re-indexing. Is dynamic memory needed from day one of MVP, or is this clearly a V2 milestone? If V2 — fix it: profile now, dynamic memory in a later ADR."&lt;/p&gt;

&lt;p&gt;I said V2.&lt;/p&gt;

&lt;p&gt;On Q5 (evaluation), I'd asked for a full observability stack with weekly session labeling. The plugin pushed back: "Full stack with weekly labeling of 30-50 sessions is 3-5 hours per week of your time plus extra infrastructure. For a single developer, that either eats 5-10% of your time or becomes ceremony. Realistic path: log all LLM calls and tool calls structurally from day one (this is non-negotiable), defer the labeling and observability layer until you have ≥100 real sessions to label."&lt;/p&gt;

&lt;p&gt;I agreed. Both push-backs ended up in the final ADR as explicit V2 commitments.&lt;/p&gt;

&lt;p&gt;This was a senior engineer telling me that what I was asking for was disproportionate to my situation, and being right about it. The "Hold position. Argue. Don't soft-cave" instruction was earning its place.&lt;/p&gt;
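
&lt;p&gt;The "non-negotiable" half of that push-back — structured logging of every LLM call from day one, with evaluation fields present but empty until V2 — might look something like this. A hypothetical sketch; the field names are mine, not the project's:&lt;/p&gt;

```rust
// Hypothetical sketch of a day-one log record for every LLM call.
// The eval_* fields exist from the start but stay None ("zero-cost
// null fields") until the V2 evaluation milestone fills them.
#[derive(Debug, Clone)]
struct LlmCallRecord {
    session_id: u64,
    provider: String,
    latency_ms: u64,
    tokens_in: u32,
    tokens_out: u32,
    cost_microcents: u64,
    eval_label: Option<String>, // None in V0; labeled in V2
    eval_score: Option<f32>,    // None in V0
}

impl LlmCallRecord {
    // V0 constructor: operational fields are mandatory, evaluation
    // fields are deliberately absent.
    fn v0(
        session_id: u64,
        provider: &str,
        latency_ms: u64,
        tokens_in: u32,
        tokens_out: u32,
        cost_microcents: u64,
    ) -> Self {
        Self {
            session_id,
            provider: provider.to_string(),
            latency_ms,
            tokens_in,
            tokens_out,
            cost_microcents,
            eval_label: None,
            eval_score: None,
        }
    }
}
```

&lt;p&gt;The design choice is that deferring evaluation costs nothing now but doesn't require a schema migration later — the columns are already there.&lt;/p&gt;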

&lt;h3&gt;
  
  
  Research, design, decide
&lt;/h3&gt;

&lt;p&gt;Research surfaced 50+ dated sources on regional LLM providers, their certification posture, pricing, and capability ceilings. Design produced three alternatives with honest trade-offs — Alt 1 minimalist, Alt 2 gateway with redundancy, Alt 3 full agent platform — and a 15-row comparison matrix. The plugin recommended Alt 2, &lt;strong&gt;not&lt;/strong&gt; Alt 3 (which scored higher on coverage). The reasoning: 5-6 weeks to MVP for a single developer is too long; many of Alt 3's features are premature.&lt;/p&gt;

&lt;p&gt;I picked Alt 2. The plugin produced ADR-0001 with seven explicit architectural rules — &lt;code&gt;PromptProvider&lt;/code&gt; trait, tool registry as enum + struct, dedicated &lt;code&gt;Orchestrator&lt;/code&gt; module, &lt;code&gt;current_state&lt;/code&gt; column from day 1, structured logging with zero-cost null fields for evaluation, vector storage installed but &lt;code&gt;embedding()&lt;/code&gt; returning &lt;code&gt;unimplemented!()&lt;/code&gt; until a future ADR.&lt;/p&gt;
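
&lt;p&gt;Two of those rules translate directly into code shape. A hypothetical sketch of the &lt;code&gt;PromptProvider&lt;/code&gt; trait and the enum-plus-struct tool registry — the trait and the &lt;code&gt;read_state&lt;/code&gt; / &lt;code&gt;update_*&lt;/code&gt; / &lt;code&gt;query_*&lt;/code&gt; tool families come from the ADR, but the specific variants and bodies here are illustrative:&lt;/p&gt;

```rust
// Hypothetical sketch of two of ADR-0001's rules. The trait lets a
// YAML-backed provider (V0) be swapped for a DB-backed one (V2)
// without touching the orchestrator; the closed enum makes an unknown
// tool a compile error rather than a runtime string lookup.
trait PromptProvider {
    fn system_prompt(&self, role: &str) -> String;
}

struct YamlPromptProvider; // V0: prompts shipped as YAML files

impl PromptProvider for YamlPromptProvider {
    fn system_prompt(&self, role: &str) -> String {
        format!("(prompt for role '{role}' loaded from YAML)")
    }
}

// Tool registry as enum + struct: every tool the model may call is a
// variant, so adding one forces every match site to be revisited.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tool {
    ReadState,
    UpdateSection,
    QueryAnalytics,
}

struct ToolSpec {
    name: &'static str,
    mutates_state: bool,
}

fn spec(tool: Tool) -> ToolSpec {
    match tool {
        Tool::ReadState => ToolSpec { name: "read_state", mutates_state: false },
        Tool::UpdateSection => ToolSpec { name: "update_section", mutates_state: true },
        Tool::QueryAnalytics => ToolSpec { name: "query_analytics", mutates_state: false },
    }
}
```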

&lt;p&gt;Real ADR. The kind I'd write if I'd spent an afternoon on it.&lt;/p&gt;

&lt;p&gt;Here's the C4 Component-level (L3) diagram the plugin produced for the AI module inside the backend, showing the path from client through gateway to LLM providers — the seven architectural rules made structural:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph TB
    Client["SPA (editor)&amp;lt;br/&amp;gt;via REST/SSE"]

    subgraph Backend["Backend — ai/ module"]
        direction TB
        Orchestrator["&amp;lt;b&amp;gt;Orchestrator&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;run_turn(session, message)&amp;lt;br/&amp;gt;dispatch_to_role() — V2"]
        ToolRegistry["&amp;lt;b&amp;gt;Tool Registry&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;enum + struct&amp;lt;br/&amp;gt;required_mode (V2)&amp;lt;br/&amp;gt;execute_tool()"]
        PromptProvider["&amp;lt;b&amp;gt;PromptProvider&amp;lt;/b&amp;gt; trait&amp;lt;br/&amp;gt;YamlPromptProvider — V0&amp;lt;br/&amp;gt;DbPromptProvider — V2"]
        Gateway["&amp;lt;b&amp;gt;LLM Gateway&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;chat_completion()&amp;lt;br/&amp;gt;embedding()&amp;lt;br/&amp;gt;routing-policy"]
        Sanitizer["&amp;lt;b&amp;gt;Sanitizer&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;strips PII&amp;lt;br/&amp;gt;before sending to&amp;lt;br/&amp;gt;secondary provider"]
        StateMachine["&amp;lt;b&amp;gt;SessionState&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;current_state&amp;lt;br/&amp;gt;('active' in V0)"]
        Logger["&amp;lt;b&amp;gt;StructuredLogger&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;provider, latency,&amp;lt;br/&amp;gt;tokens, cost,&amp;lt;br/&amp;gt;eval_label (V2)"]
    end

    subgraph DB["Database"]
        AISession[("ai_session&amp;lt;br/&amp;gt;+ current_state")]
        AIMessage[("ai_message&amp;lt;br/&amp;gt;+ provider, latency,&amp;lt;br/&amp;gt;tokens, cost, eval_*")]
        AIToolCall[("ai_tool_call")]
        BusinessProfile[("business_profile&amp;lt;br/&amp;gt;+ embedding vector NULL&amp;lt;br/&amp;gt;(V2)")]
        Prompts[("prompt_template — V2")]
    end

    subgraph Providers["LLM providers"]
        ProviderA["&amp;lt;b&amp;gt;Provider A&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;regulator-certified&amp;lt;br/&amp;gt;primary, PII-safe"]
        ProviderAClassif["&amp;lt;b&amp;gt;Provider A (Lite)&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;classification"]
        ProviderB["&amp;lt;b&amp;gt;Provider B&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;secondary,&amp;lt;br/&amp;gt;sanitized payloads only"]
        ProviderEmbed["&amp;lt;b&amp;gt;Provider C&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;embeddings"]
    end

    subgraph Future["V2 (same gateway)"]
        SelfHosted["&amp;lt;b&amp;gt;Self-hosted LLM&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;large open model&amp;lt;br/&amp;gt;via local inference server"]
        SelfHostedEmbed["&amp;lt;b&amp;gt;Self-hosted embeddings&amp;lt;/b&amp;gt;"]
    end

    Client --&amp;gt;|API call| Orchestrator
    Orchestrator --&amp;gt;|reads&amp;lt;br/&amp;gt;system prompt| PromptProvider
    Orchestrator --&amp;gt;|selects tools| ToolRegistry
    Orchestrator --&amp;gt;|chat_completion| Gateway
    Orchestrator --&amp;gt;|state&amp;lt;br/&amp;gt;transitions| StateMachine
    Orchestrator --&amp;gt;|every turn| Logger

    Gateway --&amp;gt;|PII-sensitive| ProviderA
    Gateway --&amp;gt;|classification / intent| ProviderAClassif
    Gateway --&amp;gt;|sanitized creative&amp;lt;br/&amp;gt;+ failover| Sanitizer
    Sanitizer --&amp;gt;|cleaned text| ProviderB
    Gateway --&amp;gt;|embedding| ProviderEmbed

    Gateway -.-&amp;gt;|V2: added&amp;lt;br/&amp;gt;via config| SelfHosted
    Gateway -.-&amp;gt;|V2: via config| SelfHostedEmbed

    StateMachine --&amp;gt; AISession
    Logger --&amp;gt; AIMessage
    Logger --&amp;gt; AIToolCall
    Orchestrator --&amp;gt;|reads context| BusinessProfile
    PromptProvider -.-&amp;gt;|V2| Prompts

    ToolRegistry --&amp;gt;|read_state&amp;lt;br/&amp;gt;update_*&amp;lt;br/&amp;gt;query_*| Orchestrator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Solid arrows are V0 connections (the initial implementation per ADR-0001). Dashed arrows are V2 — components the gateway and schema are ready to accept without rewriting calling code. The whole "V2" subgraph isn't built yet; the architecture is shaped to absorb it incrementally through later ADRs. The seven rules from the ADR are visible structurally: the &lt;code&gt;PromptProvider&lt;/code&gt; trait abstraction, the typed tool registry with reserved fields, the dedicated &lt;code&gt;Orchestrator&lt;/code&gt; module, the &lt;code&gt;current_state&lt;/code&gt; column from day 1, the gateway's split between &lt;code&gt;chat_completion()&lt;/code&gt; and &lt;code&gt;embedding()&lt;/code&gt;, the sanitizer pipeline as a guarded path before the secondary provider. None of this was generated as boilerplate — it followed from the discovery and design phases that came before.&lt;/p&gt;

&lt;h3&gt;
  
  
  Review found three blockers in its own ADR
&lt;/h3&gt;

&lt;p&gt;This was the first thing that genuinely surprised me. The review phase ran the &lt;code&gt;code-review-architectural&lt;/code&gt; skill on the same ADR the plugin had just produced. It found three blocking issues:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;B-1.&lt;/strong&gt; The routing policy "PII data → primary provider, non-PII → secondary" is in the ADR. But who decides at runtime which is which? Three possible interpretations: static tool-to-provider mapping, attribute on the prompt, runtime classifier on message content. This is an architectural seam that will diverge in implementation.&lt;/p&gt;
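
&lt;p&gt;The first of those three readings — a static mapping, where sensitivity is a property of the call site rather than a runtime guess about content — could be sketched like this (illustrative names; the ADR doesn't prescribe this code):&lt;/p&gt;

```rust
// Hypothetical sketch of B-1's cheapest resolution: the PII decision
// is static and total, so the routing seam cannot silently diverge
// in implementation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Sensitivity {
    Pii,   // may contain end-user personal data
    Clean, // creative/classification payloads, sanitized upstream
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Provider {
    PrimaryCertified, // regulator-certified, PII-safe
    Secondary,        // sanitized payloads only
}

// The routing policy from the ADR, made total and unambiguous:
// every sensitivity class maps to exactly one provider.
fn route(s: Sensitivity) -> Provider {
    match s {
        Sensitivity::Pii => Provider::PrimaryCertified,
        Sensitivity::Clean => Provider::Secondary,
    }
}
```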

&lt;p&gt;&lt;strong&gt;B-2.&lt;/strong&gt; The sanitizer for the secondary channel is described as nontrivial. But the ADR doesn't say what happens when the sanitizer &lt;strong&gt;fails&lt;/strong&gt;. Falls through? Blocks? Falls back to the primary provider? Gap between "the architecture protects the user" and "the developer protects the user."&lt;/p&gt;
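
&lt;p&gt;One fail-closed answer to B-2 can be sketched as follows — a hypothetical illustration of the shape of the fix, not the ADR's actual wording:&lt;/p&gt;

```rust
// Hypothetical sketch: if sanitization cannot be verified, the payload
// never reaches the secondary provider — it falls back to the certified
// primary instead of "falling through".
#[derive(Debug, PartialEq)]
enum SanitizeOutcome {
    Clean(String),
    Failed,
}

#[derive(Debug, PartialEq)]
enum Dispatch {
    Secondary(String), // sanitized text goes to the cheap provider
    PrimaryFallback,   // sanitizer failed: take the PII-safe path
}

fn dispatch(outcome: SanitizeOutcome) -> Dispatch {
    match outcome {
        SanitizeOutcome::Clean(text) => Dispatch::Secondary(text),
        // Fail closed: a failed sanitizer must not leak the raw payload.
        SanitizeOutcome::Failed => Dispatch::PrimaryFallback,
    }
}
```

&lt;p&gt;The architectural point is that the decision lives in one typed seam, not in each developer's memory of the ADR.&lt;/p&gt;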

&lt;p&gt;&lt;strong&gt;B-3.&lt;/strong&gt; Failover between providers in the &lt;strong&gt;middle of an unfinished tool-use loop&lt;/strong&gt; is not addressed. Tool-use loops have multiple round-trips. If the primary provider fails on the third of five round-trips, what happens? The secondary provider's history format is incompatible. This is the most common failure point in hybrid gateways, and the ADR said nothing about it.&lt;/p&gt;
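
&lt;p&gt;One way to resolve B-3 — restart the turn from the last committed state rather than trying to replay an incompatible history on the backup — could be sketched like this (hypothetical; one possible policy among several):&lt;/p&gt;

```rust
// Hypothetical sketch of one answer to B-3: if the primary provider
// dies mid tool-use-loop, don't replay its incompatible message history
// on the secondary — restart the turn from the last committed state.
#[derive(Debug, Clone, PartialEq)]
enum TurnState {
    InProgress { completed_round_trips: u32 },
    Committed,
}

#[derive(Debug, PartialEq)]
enum FailoverPlan {
    // Discard the unfinished loop; re-run the whole turn on the backup
    // provider in its own history format, from committed state only.
    RestartTurnOnSecondary,
    // Nothing in flight: the next turn simply routes to the backup.
    RouteNextTurnToSecondary,
}

fn on_primary_failure(state: &TurnState) -> FailoverPlan {
    match state {
        TurnState::InProgress { .. } => FailoverPlan::RestartTurnOnSecondary,
        TurnState::Committed => FailoverPlan::RouteNextTurnToSecondary,
    }
}
```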

&lt;p&gt;I applied all three. The review document was updated with &lt;code&gt;Status: Applied 2026-05-07&lt;/code&gt; and a closeout block listing each fix. &lt;strong&gt;The cycle compounded&lt;/strong&gt; — the next ADR builds on a corrected base.&lt;/p&gt;

&lt;p&gt;I wrote a draft of this article at this point. It was decent. But the story wasn't done.&lt;/p&gt;

&lt;h2&gt;
  
  
  v0.4: adversarial roast with five roles
&lt;/h2&gt;

&lt;p&gt;In v0.4 I added something I'd been thinking about for a while: &lt;strong&gt;multi-perspective adversarial review&lt;/strong&gt;. Not one reviewer — five, each with strict non-overlapping scope.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;devil-advocate&lt;/code&gt; — pressure-test for failure modes, hidden assumptions, edge cases, concurrency bugs.&lt;br&gt;
&lt;code&gt;pragmatist&lt;/code&gt; — operational reality, on-call burden, real cost over time, skills/bus factor, deployment risk.&lt;br&gt;
&lt;code&gt;junior-engineer&lt;/code&gt; — clarity for a fresh reader six months from now. Undefined terms, unfollowable steps, broken cross-references.&lt;br&gt;
&lt;code&gt;compliance-officer&lt;/code&gt; — regulatory exposure, PII flows, jurisdiction, audit, third-party risk.&lt;br&gt;
&lt;code&gt;futurist&lt;/code&gt; — 1-3 year horizon, structural drift, technology lifecycle, hiring, regulatory drift.&lt;/p&gt;

&lt;p&gt;Each role has an explicit "what I cover / what I do NOT cover" table referencing the other roles. The point: when independent perspectives converge on the same finding, the finding is real. When one role is silent on something, it's because another role owns it.&lt;/p&gt;

&lt;p&gt;Command: &lt;code&gt;/archforge:roast &amp;lt;ADR-NNNN&amp;gt;&lt;/code&gt;. Output: a directory of six documents — one summary plus one per role.&lt;/p&gt;

&lt;p&gt;Auto-roast at &lt;code&gt;--scale=deep&lt;/code&gt; between Decide and Document, so important decisions never become accepted ADRs without passing the multi-role gauntlet.&lt;/p&gt;

&lt;h3&gt;
  
  
  The roast on ADR-0002
&lt;/h3&gt;

&lt;p&gt;ADR-0002 was a modular monolith on a Cargo workspace — the second decision in the project, after the AI gateway was settled. I ran roast on it.&lt;/p&gt;

&lt;p&gt;The output was strong. &lt;strong&gt;14 high-severity findings, 16 medium, 16 low, plus 8 structural risks from the futurist.&lt;/strong&gt; Cross-cutting concerns — issues that multiple roles independently surfaced — were the most valuable part. Six of them. The most consequential:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CC-1: a class of confused-deputy attack at the boundary between LLM tools and the database access layer.&lt;/strong&gt; When AI tools execute under elevated database privileges (a common pattern when tool calls need to read across tenant boundaries efficiently) without authorization checks in the tool itself, prompt injection can cause horizontal data leakage between tenants. The &lt;code&gt;devil-advocate&lt;/code&gt; flagged it from a security lens (an attacker with prompt-injection access can use the trusted tool to do things the attacker shouldn't be able to). The &lt;code&gt;compliance-officer&lt;/code&gt; independently flagged the same code path from a regulatory lens (horizontal leakage of personal information across tenant boundaries violates the local data-protection law). Two roles, two angles, same architectural attack vector.&lt;/p&gt;

&lt;p&gt;This is exactly the "compound" mode I built the plugin for. &lt;strong&gt;Two independent reviewers from different lenses converged on a single architectural attack vector.&lt;/strong&gt; That's a real finding, not noise.&lt;/p&gt;

&lt;p&gt;The recommended fix was straightforward — add a rule to the ADR requiring tools to accept an authorization context and perform ownership checks, or alternatively use a constrained database role that enforces row-level security at the connection level. Plus open two new ADRs (operational baseline and compliance contour) before any paying users.&lt;/p&gt;
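
&lt;p&gt;The first half of that fix — every tool call carries an authorization context and performs its own ownership check — can be sketched as follows. Hypothetical names throughout; the point is the shape, not the project's actual code:&lt;/p&gt;

```rust
// Hypothetical sketch of the CC-1 fix: a tool never acts on the model's
// say-so alone. Every call carries the authenticated tenant, and the
// tool checks ownership itself — so a prompt-injected request for
// another tenant's data fails even under an elevated database role.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TenantId(u64);

struct AuthzContext {
    tenant: TenantId, // from the authenticated session, never from the prompt
}

#[derive(Debug, PartialEq)]
enum ToolError {
    CrossTenantDenied,
}

struct SiteRecord {
    owner: TenantId,
    content: String,
}

// The ownership check lives in the tool, not in the LLM's instructions.
fn read_site(ctx: &AuthzContext, record: &SiteRecord) -> Result<String, ToolError> {
    if record.owner != ctx.tenant {
        return Err(ToolError::CrossTenantDenied);
    }
    Ok(record.content.clone())
}
```

&lt;p&gt;The alternative fix from the roast — a constrained database role with row-level security — moves the same check down a layer, but either way the guard is structural, not prompt-based.&lt;/p&gt;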

&lt;p&gt;This was the single best architectural moment I'd had with an AI tool. &lt;strong&gt;The plugin found a class of security bug in my own ADR through structured adversarial review, and proposed a concrete fix.&lt;/strong&gt; I was ready to publish the article.&lt;/p&gt;

&lt;h2&gt;
  
  
  And then the report was garbled
&lt;/h2&gt;

&lt;p&gt;The roast output came back in &lt;strong&gt;mixed Russian and English&lt;/strong&gt;. Section headers were in English (&lt;code&gt;## Headline findings&lt;/code&gt;, &lt;code&gt;## Cross-cutting concerns&lt;/code&gt;) but the prose was in Russian, and the prose was full of transliterated English — "обзервабилити" (the English "observability" spelled in Cyrillic) instead of the proper "наблюдаемость", untranslated "operational baseline" instead of "операционный минимум", the hybrid "compile-time гарантии" instead of "гарантии на этапе компиляции" ("compile-time guarantees").&lt;/p&gt;

&lt;p&gt;I'd written an explicit terminology rule into the architect skill in v0.3 specifically to prevent this. Russian artifacts must use proper Russian engineering terminology, not transliterated English calques. The rule was there. It clearly hadn't been applied.&lt;/p&gt;

&lt;p&gt;I told Claude: "your output is full of anglicisms, please apply the terminology rule."&lt;/p&gt;

&lt;p&gt;What happened next is, in some ways, more interesting than the original bug.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overcorrection: the failure mode nobody talks about
&lt;/h2&gt;

&lt;p&gt;Claude's response started: "Виноват — это уже второй раз с тем же правилом, явно нарушил." (Guilty — second time with the same rule, clearly violated.)&lt;/p&gt;

&lt;p&gt;And then it rewrote the summary. With problems.&lt;/p&gt;

&lt;p&gt;It translated &lt;code&gt;Devil-advocate&lt;/code&gt; to "Обвинитель" (Prosecutor). And &lt;code&gt;Pragmatist&lt;/code&gt; to "Прагматик". And &lt;code&gt;Junior-engineer&lt;/code&gt; to "Новый разработчик" (new developer). And &lt;code&gt;Compliance-officer&lt;/code&gt; to "Специалист по соответствию" (compliance specialist). And — and this is the worst — &lt;code&gt;Futurist&lt;/code&gt; to "Стратег" (Strategist).&lt;/p&gt;

&lt;p&gt;It translated the section headers. &lt;code&gt;## Headline findings&lt;/code&gt; became &lt;code&gt;## Главное&lt;/code&gt; ("Main points"). &lt;code&gt;## Cross-cutting concerns&lt;/code&gt; became &lt;code&gt;## Сквозные проблемы&lt;/code&gt; ("Cross-cutting problems"). &lt;code&gt;## Severity counts&lt;/code&gt; became &lt;code&gt;## Распределение по тяжести&lt;/code&gt; ("Distribution by severity").&lt;/p&gt;

&lt;p&gt;It translated the finding IDs. &lt;code&gt;CC-1&lt;/code&gt; became &lt;code&gt;СП-1&lt;/code&gt; (СП being the initials of "сквозные проблемы", the translated header). &lt;code&gt;CC-2&lt;/code&gt; became &lt;code&gt;СП-2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I read this and immediately knew there was something wrong, beyond just being a different translation choice. &lt;strong&gt;These weren't translations. They were structural breaks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Devil-advocate&lt;/code&gt; is the name of an agent file. &lt;code&gt;agents/devil-advocate.md&lt;/code&gt;. There's a &lt;code&gt;name:&lt;/code&gt; field in the frontmatter that says &lt;code&gt;devil-advocate&lt;/code&gt;. That string is invoked by the orchestrating command. Translating it to "Обвинитель" means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Future references to that agent in other artifacts won't resolve.&lt;/li&gt;
&lt;li&gt;The roast directory's per-role files (&lt;code&gt;01-devil-advocate.md&lt;/code&gt;, &lt;code&gt;02-pragmatist.md&lt;/code&gt;...) suddenly don't match the roles named in the summary.&lt;/li&gt;
&lt;li&gt;Comparing this roast to a future one is impossible — two roasts on the same ADR will name the roles differently.&lt;/li&gt;
&lt;li&gt;The role concept itself was changed: "Стратег" (strategist) is not a translation of &lt;code&gt;Futurist&lt;/code&gt; (long-horizon role); it's a different role. The role was substituted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;## Headline findings&lt;/code&gt; is a section header prescribed by the &lt;code&gt;roast&lt;/code&gt; command file. The command writes that header literally. Other tooling (and a future user reading the directory) expects that header. Translating it to &lt;code&gt;## Главное&lt;/code&gt; means the document diverged from what the plugin's own templates promised.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CC-1&lt;/code&gt; is a finding ID that gets cross-referenced. The summary says "CC-1 — see devil-advocate F1.1 + compliance C1.1". If you rename CC-1 to СП-1 in the summary but the agent docs still call it CC-1 (because they were written first, in English), the cross-references break. References pointing at IDs that don't exist anymore.&lt;/p&gt;

&lt;p&gt;This is the plugin's own templates, names, and identifiers being &lt;strong&gt;silently rewritten under correction pressure&lt;/strong&gt;. After being told "apply the terminology rule", the assistant pattern-matched too aggressively and translated things that aren't terminology. They're identifiers.&lt;/p&gt;

&lt;p&gt;And here's the failure mode in a clean form: &lt;strong&gt;AI tools fail in two directions, and most reviews of AI tools only test one direction.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Undercorrection&lt;/strong&gt; — the LLM ignores a rule. ("обзервабилити" instead of "наблюдаемость".)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overcorrection&lt;/strong&gt; — the LLM applies a rule too widely after being corrected. (Translating identifiers along with prose.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The blog posts about LLM problems usually focus on the first. Hallucination, ignored constraints, dropped instructions. Overcorrection — over-eager application of corrections to inappropriate scope — is also a serious failure mode, and it's exactly the failure mode that &lt;strong&gt;trying harder&lt;/strong&gt; produces.&lt;/p&gt;

&lt;h2&gt;
  
  
  Root cause: rules in one place, applied in many
&lt;/h2&gt;

&lt;p&gt;Why did this happen? The terminology rule lived in &lt;code&gt;skills/architect/SKILL.md&lt;/code&gt; — the router skill. The router skill is loaded into Claude's context when the architectural intent is detected. But the roast command spawns five sub-agents (&lt;code&gt;devil-advocate&lt;/code&gt;, &lt;code&gt;pragmatist&lt;/code&gt;, etc.). &lt;strong&gt;Each sub-agent is a separate context.&lt;/strong&gt; They read their own SKILL.md files. They don't automatically inherit rules from the router.&lt;/p&gt;

&lt;p&gt;So the structural bug was: the rule was authored once, but its enforcement depended on it being read in each of the contexts where output gets generated. In the roast, that was six contexts (five role agents plus the summarizing main thread). Five of them never saw the rule.&lt;/p&gt;

&lt;p&gt;After my correction, the main thread did know the rule. But it applied it to everything that vaguely resembled a Russian-with-anglicisms problem — including identifiers that should never have been translated.&lt;/p&gt;

&lt;p&gt;The fix could not be "remember to apply the rule everywhere". That's the same fix that already existed and didn't work. The fix had to be &lt;strong&gt;structural&lt;/strong&gt;: the rule needs to be embedded in every place where output is generated, with explicit guards against both directions of failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix, and what we built around it
&lt;/h2&gt;

&lt;p&gt;In v0.4-rc2, I did three things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Embedded the language rule in every sub-agent.&lt;/strong&gt; All five roast agents and all three core agents (&lt;code&gt;architect&lt;/code&gt;, &lt;code&gt;reviewer&lt;/code&gt;, &lt;code&gt;researcher&lt;/code&gt;) now have an explicit &lt;code&gt;## Language and terminology&lt;/code&gt; section in their &lt;code&gt;agents/*.md&lt;/code&gt; files referencing the architect skill's taxonomy. Each agent's section names the specific identifiers that role uses (its own name, its own finding-ID scheme, the section headers it produces) that must never be translated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Rewrote the language rule in the router skill with explicit categories.&lt;/strong&gt; A 10-category taxonomy distinguishing what gets translated from what doesn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A. Plugin component identifiers — never translate.&lt;/li&gt;
&lt;li&gt;B. Software, library, protocol names — never.&lt;/li&gt;
&lt;li&gt;C. Standard abbreviations — never.&lt;/li&gt;
&lt;li&gt;D. Laws, regulations, standards — never.&lt;/li&gt;
&lt;li&gt;E. Artifact identifiers (finding IDs, rule numbers, ADR IDs) — never.&lt;/li&gt;
&lt;li&gt;F. Plugin template section names — never (verbatim English even when body is in another language).&lt;/li&gt;
&lt;li&gt;G. Project-internal proper nouns — never when capitalized.&lt;/li&gt;
&lt;li&gt;H. Term-of-art with no clean Russian equivalent — keep with first-occurrence gloss.&lt;/li&gt;
&lt;li&gt;I. Calques — translate per the calque table.&lt;/li&gt;
&lt;li&gt;J. Prose verbs and connectors — translate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plus a section literally titled "Overcorrection is also a failure" with the exact examples of what had just gone wrong: &lt;code&gt;Devil-advocate&lt;/code&gt; should not become "Обвинитель", &lt;code&gt;## Headline findings&lt;/code&gt; should not become &lt;code&gt;## Главное&lt;/code&gt;, &lt;code&gt;CC-3&lt;/code&gt; should not become &lt;code&gt;СП-3&lt;/code&gt;. &lt;strong&gt;The negative examples are now part of the spec.&lt;/strong&gt;&lt;/p&gt;
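&lt;p&gt;The "never translate" half of the taxonomy is mechanical enough to guard with code. A minimal sketch — the concrete names and patterns below are illustrative assumptions, not the plugin's actual lists:&lt;/p&gt;

```python
import re

# Illustrative guard for the "never translate" categories above.
# The name set and patterns are assumptions for this sketch.
PRESERVED_NAMES = {"devil-advocate", "pragmatist", "meta-reviewer"}  # A: component identifiers
PRESERVED_PATTERNS = [
    re.compile(r"^[A-Z]{1,4}-\d+$"),  # E: artifact IDs like CC-3 or ADR-0001
    re.compile(r"^[A-Z]{2,6}$"),      # C: standard abbreviations like ADR, SSR
    re.compile(r"^## "),              # F: template section headers stay verbatim
]

def must_preserve(token: str) -> bool:
    """True if the token falls in a never-translate category."""
    return token in PRESERVED_NAMES or any(p.match(token) for p in PRESERVED_PATTERNS)
```

&lt;p&gt;The point of encoding it this way is that both failure directions become testable: a translated &lt;code&gt;CC-3&lt;/code&gt; fails the preserve check, and ordinary prose correctly falls through to translation.&lt;/p&gt;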

&lt;p&gt;&lt;strong&gt;3. Added a new agent: &lt;code&gt;meta-reviewer&lt;/code&gt;.&lt;/strong&gt; This is the most important part.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the meta-reviewer does
&lt;/h2&gt;

&lt;p&gt;The five roast roles attack the substance of an ADR. The meta-reviewer doesn't. It checks &lt;strong&gt;the form&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It reads artifacts produced by the plugin and checks them against the plugin's own templates and rules. Five categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Template conformance&lt;/strong&gt; — does this ADR have all the required sections? Does this roast directory have a summary plus one file per role? Does this review have a Status section?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identifier preservation&lt;/strong&gt; — are agent names, command names, template section headers, finding IDs, software names, regulation names, all in their original form?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language-pass evidence&lt;/strong&gt; — for Russian artifacts, did the calque pass actually run? Is there a one-line note at the end stating what was changed?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-reference integrity&lt;/strong&gt; — does "ADR-NNNN" point at an ADR that exists? Does "see B-1" resolve to a finding that's in the document?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lifecycle integrity&lt;/strong&gt; — has anyone substantively edited an Accepted ADR (which should be superseded, not edited)?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Critically, the meta-reviewer &lt;strong&gt;does not evaluate architectural quality&lt;/strong&gt;. That's roast's job. The meta-reviewer is the role that asks: "does this artifact match what the plugin's own files said it should look like?"&lt;/p&gt;

&lt;p&gt;It uses &lt;strong&gt;the plugin's own source files as the specification&lt;/strong&gt;. It reads &lt;code&gt;commands/roast.md&lt;/code&gt; to know what sections a roast summary must have. It reads &lt;code&gt;skills/architect/SKILL.md&lt;/code&gt; to know which strings count as identifiers. The plugin is grading itself against its own promises.&lt;/p&gt;
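&lt;p&gt;Cross-reference integrity, for instance, reduces to a set difference. A hedged sketch of that check (the &lt;code&gt;ADR-NNNN&lt;/code&gt; format is assumed from the examples in this article; the real meta-reviewer reads the formats from the plugin's own files):&lt;/p&gt;

```python
import re

def dangling_adr_refs(text: str, existing_ids: set[str]) -> list[str]:
    """ADR references in `text` that don't resolve to a known ADR.

    Assumes the ADR-NNNN id format; adjust the pattern to match
    whatever the plugin's templates actually specify.
    """
    referenced = set(re.findall(r"ADR-\d{4}", text))
    return sorted(referenced - existing_ids)
```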

&lt;p&gt;In &lt;code&gt;--scale=deep&lt;/code&gt; cycles, the meta-reviewer runs automatically: after the auto-roast (on the roast directory) and after Document (on the freshly-written ADR). High-severity divergences pause the cycle.&lt;/p&gt;
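&lt;p&gt;The pause decision itself is simple gating logic. A sketch under assumed severity labels (the real plugin's labels and threshold may differ):&lt;/p&gt;

```python
# Hypothetical severity gate: the label set and default threshold
# are assumptions, not the plugin's actual configuration.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}

def should_pause_cycle(findings: list[dict], threshold: str = "high") -> bool:
    """Pause the deep cycle if any divergence meets the threshold."""
    bar = SEVERITY_RANK[threshold]
    return any(SEVERITY_RANK[f["severity"]] >= bar for f in findings)
```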

&lt;p&gt;So: the plugin found an architectural bug in its own ADR (review of ADR-0001). Then it found a security bug in another of its ADRs (roast of ADR-0002). Then it failed to follow its own language rule. Then it overcorrected and translated identifiers it shouldn't have. &lt;strong&gt;Now it has a role specifically designed to catch the kind of bug the previous version had.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is what compounding looks like, in the actual Compound Engineering sense: each cycle leaves the next cycle better-equipped, because the system itself is the artifact being improved.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd build differently if I started over
&lt;/h2&gt;

&lt;p&gt;One thing, mainly. I'd put the language rule in every relevant context &lt;strong&gt;from the start&lt;/strong&gt;, not in the router skill alone. The "DRY" instinct says rules should live in one place. For LLM tooling that turns out to be wrong: rules need to be in &lt;strong&gt;every context where they're enforced&lt;/strong&gt;, even at the cost of duplication. Sub-agent contexts don't automatically share state with the parent. Architecting LLM-tool systems is partly the discipline of accepting that LLMs do not naturally inherit context the way functions inherit lexical scope.&lt;/p&gt;

&lt;p&gt;The meta-reviewer pattern — a role specifically dedicated to checking that artifacts conform to the system's own rules — should probably exist in any AI plugin that produces structured artifacts, not just this one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's coming
&lt;/h2&gt;

&lt;p&gt;The plugin's roadmap covers three minor versions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v0.5 "Sharper feedback"&lt;/strong&gt; — sharpens existing feedback: a &lt;code&gt;/archforge:diff &amp;lt;ADR&amp;gt;&lt;/code&gt; command that checks whether an accepted ADR still matches the actual code, an anti-patterns skill (concrete named ones — distributed monolith, database-as-integration-layer, sync chains, cache as source of truth), an architectural-metrics skill, and an &lt;code&gt;/archforge:export&lt;/code&gt; command for shipping artifacts to articles or portfolios.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.6 "Memory and history"&lt;/strong&gt; — adds a historian agent that reads the project archive, retrospectives, decision-map evolution, and optional pre-commit hooks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.7 "Teams and budgets"&lt;/strong&gt; — covers cost as a first-class architectural variable and multi-architect coordination.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full plan, including the explicit &lt;strong&gt;anti-roadmap&lt;/strong&gt; (no code generation from ADRs, no project-management integrations, no doc-site generators, no voting workflows on ADRs, no per-language plugin variants), is in &lt;a href="https://github.com/IgorKramar/archforge-marketplace/blob/main/ROADMAP.md" rel="noopener noreferrer"&gt;ROADMAP.md&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The compound pattern, more directly
&lt;/h2&gt;

&lt;p&gt;Most AI-tooling discourse focuses on &lt;strong&gt;automation&lt;/strong&gt; — getting AI to do more work for you. That's real and matters. But for some kinds of work, automation is the wrong frame.&lt;/p&gt;

&lt;p&gt;Architecture is one of those. The work isn't to produce an artifact faster. The work is to &lt;strong&gt;make a defensible decision under uncertainty&lt;/strong&gt;. The artifact is downstream of the decision. Defensible decisions require structured disagreement, real alternatives, honest trade-off analysis, and someone willing to push back when you skip ahead.&lt;/p&gt;

&lt;p&gt;If you cast AI as "thing that produces artifacts", you'll get bad architecture faster. If you cast AI as "thing that argues with me about my reasoning until I either change my mind or strengthen my case, while also checking that the artifacts of our conversation match the rules we agreed on", you'll get better architecture, slower, on purpose.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;archforge&lt;/code&gt; is one attempt at the second framing. It pairs naturally with &lt;a href="https://github.com/EveryInc/compound-engineering-plugin" rel="noopener noreferrer"&gt;Compound Engineering&lt;/a&gt; — CE handles feature-level workflow (Brainstorm → Plan → Work → Review → Compound), &lt;code&gt;archforge&lt;/code&gt; handles architectural decisions (Discover → Research → Design → Decide → Document → Review). Architectural decisions feed into CE plans by ADR number; CE compound learnings can produce new ADRs. The integration is materialized by running &lt;code&gt;/archforge:remember-compound-integration&lt;/code&gt; once per project.&lt;/p&gt;

&lt;p&gt;If you try the plugin, the most useful thing you can give back is a &lt;strong&gt;specific failure&lt;/strong&gt;. Not "it's great" — that's nice but not actionable. "I ran roast on this ADR and the futurist role completely missed X" — that's actionable, that's how v0.5 gets shaped.&lt;/p&gt;

&lt;p&gt;Repository: &lt;a href="https://github.com/IgorKramar/archforge-marketplace" rel="noopener noreferrer"&gt;github.com/IgorKramar/archforge-marketplace&lt;/a&gt;. Issues, PRs, and especially &lt;strong&gt;specific bug reports of the plugin failing on your project&lt;/strong&gt; are welcome. The plugin gets better when it fails in instructive ways. That's the whole point.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>architecture</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Superpowers vs Compound Engineering: is the 'vs' even real?</title>
      <dc:creator>Igor Kramar</dc:creator>
      <pubDate>Mon, 04 May 2026 13:02:08 +0000</pubDate>
      <link>https://forem.com/ikramar/superpowers-vs-compound-engineering-is-the-vs-even-real-58gc</link>
      <guid>https://forem.com/ikramar/superpowers-vs-compound-engineering-is-the-vs-even-real-58gc</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — Superpowers and Compound Engineering aren't competitors. They're optimised for different worlds. Superpowers is gold for &lt;strong&gt;mature codebases with established methodology&lt;/strong&gt; (TDD shops, large legacy systems, teams enforcing standards). Compound Engineering is gold for &lt;strong&gt;early-stage products where one person owns a feature end-to-end&lt;/strong&gt;. Pick by &lt;em&gt;what your codebase looks like&lt;/em&gt;, not by which README sounds shinier.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you've spent any time in the Claude Code plugin ecosystem in the last few months, you've almost certainly heard about both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/obra/superpowers" rel="noopener noreferrer"&gt;&lt;strong&gt;Superpowers&lt;/strong&gt;&lt;/a&gt; by Jesse Vincent — a "complete software development methodology" plugin, ~42k stars, in Anthropic's official marketplace.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/EveryInc/compound-engineering-plugin" rel="noopener noreferrer"&gt;&lt;strong&gt;Compound Engineering&lt;/strong&gt;&lt;/a&gt; by Every — a 36-skill, 50-agent framework around the idea that "each unit of engineering work should make the next one easier", ~16k stars.
Both ship as Claude Code plugins. Both wrap roughly the same surface — brainstorm, plan, work, review. Both have evangelists writing "I 100x'd my output" posts. So the natural question gets asked a lot: &lt;strong&gt;which one wins?&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I installed both, ran them on the same projects for a few weeks (work — a Nuxt/Vue 3 platform product; personal — a small landing-page tool and my wife's florist site), and the answer turned out to be more interesting than I expected.&lt;/p&gt;

&lt;p&gt;The "vs" is the wrong question. Let me show you why.&lt;/p&gt;

&lt;h2&gt;
  
  
  What both plugins are actually solving
&lt;/h2&gt;

&lt;p&gt;Bare Claude Code is excellent at small tasks and dangerous at large ones. Without scaffolding it tends to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skip planning and start typing code on turn two.&lt;/li&gt;
&lt;li&gt;Forget the lessons from yesterday's bug-hunt by tomorrow morning.&lt;/li&gt;
&lt;li&gt;Drift further from your project's conventions the longer the session runs.&lt;/li&gt;
&lt;li&gt;Treat each new request as if the codebase were a stranger.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both plugins exist to fix this — but they fix it from different ends. To see the difference, you have to look past the README marketing and ask what each one &lt;em&gt;forces&lt;/em&gt; you to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 90% overlap
&lt;/h2&gt;

&lt;p&gt;Honestly: most of the surface is the same. Here's the side-by-side:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Superpowers&lt;/th&gt;
&lt;th&gt;Compound Engineering&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Refine a vague idea&lt;/td&gt;
&lt;td&gt;&lt;code&gt;brainstorming&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/ce-brainstorm&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Turn it into a plan&lt;/td&gt;
&lt;td&gt;&lt;code&gt;writing-plans&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/ce-plan&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Isolated execution&lt;/td&gt;
&lt;td&gt;&lt;code&gt;using-git-worktrees&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ce-worktree&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code review before merge&lt;/td&gt;
&lt;td&gt;&lt;code&gt;requesting-code-review&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/ce-code-review&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Systematic debugging&lt;/td&gt;
&lt;td&gt;&lt;code&gt;systematic-debugging&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/ce-debug&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you only look at this table, the conclusion is "they're the same plugin with different prefixes". That conclusion is wrong, and the difference is hidden in two places: &lt;strong&gt;what each plugin enforces&lt;/strong&gt;, and &lt;strong&gt;what each one adds beyond the shared surface&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real difference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Superpowers is a discipline-enforcement engine
&lt;/h3&gt;

&lt;p&gt;Read the Superpowers source and you find it again and again: this plugin is opinionated, and the opinions have teeth. The clearest example is TDD:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If Claude tries to write code before tests, this skill literally makes it delete the code and start over. No exceptions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's not a tip. That's a guard rail. Other examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The brainstorming skill &lt;strong&gt;activates automatically&lt;/strong&gt; when you describe a feature — you can't accidentally skip it.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;systematic-debugging&lt;/code&gt; runs a 4-phase root-cause process and triggers a mandatory architectural review after three failed fix attempts.&lt;/li&gt;
&lt;li&gt;YAGNI and "evidence over claims" are baked in as non-negotiables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Superpowers is, in Jesse Vincent's own framing, a &lt;em&gt;methodology&lt;/em&gt;. It's there to make Claude behave the way a senior engineer at a disciplined shop would behave, whether you remember to ask or not.&lt;/p&gt;

&lt;p&gt;The cost of this is real: Superpowers will fight you when you don't want to do TDD, when you want to vibe-code a quick spike, when you'd rather see something running before you write the test. That's not a bug. That's the entire point.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compound Engineering is a knowledge-accumulation framework
&lt;/h3&gt;

&lt;p&gt;Compound Engineering's central claim is in its name: each cycle should make the next cycle cheaper. The unique skills, the ones Superpowers has no real equivalent for, all serve that claim:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/ce-compound&lt;/code&gt;&lt;/strong&gt; — after a feature ships, you write down what was learned. Bug patterns, gotchas, surprising decisions. These get indexed and pulled into the context of future plans.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/ce-compound-refresh&lt;/code&gt;&lt;/strong&gt; — periodically reviews stored learnings and decides whether to keep, update, replace or archive them. Without this, your knowledge base drifts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/ce-strategy&lt;/code&gt;&lt;/strong&gt; — maintains a &lt;code&gt;STRATEGY.md&lt;/code&gt; at the repo root: target problem, persona, key metrics, tracks. &lt;code&gt;/ce-ideate&lt;/code&gt;, &lt;code&gt;/ce-brainstorm&lt;/code&gt; and &lt;code&gt;/ce-plan&lt;/code&gt; all read it as grounding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/ce-product-pulse&lt;/code&gt;&lt;/strong&gt; — time-windowed reports of what users actually experienced, saved to &lt;code&gt;docs/pulse-reports/&lt;/code&gt; so a timeline of outcomes builds up over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice what's happening here: CE is reaching &lt;em&gt;above&lt;/em&gt; the engineering loop (strategy) and &lt;em&gt;below&lt;/em&gt; it (user outcomes), and tying both back into the planning context. It's trying to be a product loop, not just an engineering one.&lt;/p&gt;

&lt;p&gt;The cost is also real: 36 skills and 50+ agents is a lot of surface. Without discipline you end up running ceremonial workflows on tasks that didn't need them. And &lt;code&gt;/ce-compound&lt;/code&gt; only works if you actually use it after every cycle — skip it for two sprints and CE collapses into "Claude Code with extra slash commands".&lt;/p&gt;

&lt;h2&gt;
  
  
  Where each one breaks
&lt;/h2&gt;

&lt;p&gt;This is the part most comparison posts skip, so let me be specific.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Superpowers breaks when your domain isn't TDD-shaped.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I tried Superpowers on a Nuxt/Vue 3 SSR feature where most of the work was prop-drilling, layout tweaks and Pinia state plumbing. The TDD-first enforcement turned a 90-minute change into a 3-hour session of writing tests for code that's mostly visual. For SSR-specific bugs (hydration mismatches, server-only state) the discipline pays for itself. For "make this card layout responsive", it's pure friction.&lt;/p&gt;

&lt;p&gt;Superpowers also struggles with rapid iteration. Brainstorm-plan-test-implement is brilliant for a feature you'll ship to a million users. It's overkill for a 30-minute spike where the goal is to know whether an approach is even viable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compound Engineering breaks when you skip the compound step.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the trap. CE has so many slash commands that it feels productive even when you're just running &lt;code&gt;/ce-plan&lt;/code&gt; and &lt;code&gt;/ce-work&lt;/code&gt; on autopilot. But CE without &lt;code&gt;/ce-compound&lt;/code&gt; is &lt;em&gt;not&lt;/em&gt; compound engineering — it's just a more verbose Claude Code session. The plugin's value compounds only if you compound. I've watched myself skip it under deadline pressure on three consecutive cycles before I noticed the framework had quietly become decorative.&lt;/p&gt;

&lt;p&gt;CE also breaks at team scale. The named-persona reviewers (&lt;code&gt;ce-dhh-rails-reviewer&lt;/code&gt;, &lt;code&gt;ce-kieran-typescript-reviewer&lt;/code&gt;) encode somebody else's taste. On a personal project that's fine — useful, even. On a team project, "the reviewer Claude is roleplaying as DHH" is not a conversation I want to have with a senior colleague at standup.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shape that actually predicts which one fits
&lt;/h2&gt;

&lt;p&gt;Here's the rule of thumb I landed on after two months of switching back and forth:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your situation&lt;/th&gt;
&lt;th&gt;Better fit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mature codebase, established conventions, real test suite&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Superpowers&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Greenfield product, you own it end-to-end, conventions still forming&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Compound Engineering&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Team enforcing TDD or similar discipline&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Superpowers&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Solo dev or small team, knowledge dies if not written down&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Compound Engineering&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legacy system where consistency matters more than novelty&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Superpowers&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Early-stage product where strategy shifts week to week&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Compound Engineering&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;You want the plugin to constrain you&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Superpowers&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;You want the plugin to remember for you&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Compound Engineering&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Or stated more bluntly: &lt;strong&gt;Superpowers is for codebases with a methodology to enforce. Compound Engineering is for products where the knowledge to compound is still being created.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A startup with a 6-month-old codebase and a single engineer per feature has very little to enforce — there are no conventions yet, the test suite is patchy, the architecture is in flux. What it desperately needs is a memory: why did we pick Pinia over Vuex, what broke last sprint, who is the persona we're optimising for. CE's &lt;code&gt;/ce-strategy&lt;/code&gt;, &lt;code&gt;/ce-product-pulse&lt;/code&gt; and &lt;code&gt;/ce-compound&lt;/code&gt; are exactly that memory.&lt;/p&gt;

&lt;p&gt;A 10-year-old enterprise codebase with 30 engineers has the opposite shape. The knowledge already exists — in the test suite, in code review norms, in the architectural decision records. What it needs is &lt;em&gt;enforcement&lt;/em&gt;, because the failure mode is drift away from established standards under deadline pressure. Superpowers' "delete the code, write the test first, no exceptions" is precisely calibrated to that failure mode.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually run, and where
&lt;/h2&gt;

&lt;p&gt;For my work codebase (Nuxt/Vue 3, established team, real conventions), I lean on Superpowers — but selectively. &lt;code&gt;brainstorming&lt;/code&gt; and &lt;code&gt;writing-plans&lt;/code&gt; for any non-trivial feature; &lt;code&gt;systematic-debugging&lt;/code&gt; for tricky SSR bugs; I let TDD enforcement run on backend Pinia store logic and turn it down on pure UI work.&lt;/p&gt;

&lt;p&gt;For personal projects (the landing-page tool, the florist site), I run Compound Engineering. &lt;code&gt;/ce-strategy&lt;/code&gt; once at the start gives every subsequent &lt;code&gt;/ce-plan&lt;/code&gt; real context about what the product is for. &lt;code&gt;/ce-compound&lt;/code&gt; after each meaningful feature actually compounds — I've already had &lt;code&gt;/ce-plan&lt;/code&gt; surface a learning from a previous cycle that saved me an evening.&lt;/p&gt;

&lt;p&gt;I don't run both in the same repo. They both touch &lt;code&gt;CLAUDE.md&lt;/code&gt;, they both want to be the source of truth for how the agent behaves, and the conflict isn't worth it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest version of "vs"
&lt;/h2&gt;

&lt;p&gt;If someone asks "which plugin should I install?" — that's the wrong question. The right one is: &lt;strong&gt;what shape is the work?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mature codebase, established discipline, the failure mode is drift → Superpowers.&lt;/p&gt;

&lt;p&gt;Early product, single owner per feature, the failure mode is forgotten learnings → Compound Engineering.&lt;/p&gt;

&lt;p&gt;The "vs" framing is what marketing produces when two tools are competing for the same star count. Engineering produces a different question: what fits this codebase, this team, this stage?&lt;/p&gt;

&lt;p&gt;Try both. Run each on the project where its philosophy actually matches. You'll know within a week which one earned its place.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you've run them both and your shape is different from mine, I'd genuinely like to hear it in the comments — especially the cases where I'm probably wrong.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>productivity</category>
      <category>tooling</category>
    </item>
  </channel>
</rss>
