Forem: Alexander Velikiy

Everyone is squeezing context. We stopped putting everything in one context.

Alexander Velikiy — Sat, 23 May 2026 06:06:06 +0000

The standard advice for reducing LLM costs: truncate your prompts, use a cheaper model, compress your system prompt, enable caching, add "Be concise." to every instruction and hope for the best.

All valid. All treating the symptom.

We did something different.

The real problem isn't prompt size. It's context architecture.

When great_cto runs a feature pipeline — architect, PM, senior-dev, QA, security officer — each agent starts by reading the same stack of documents:

ARCH-*.md — full architecture decisions, 3–8k tokens each
PLAN-*.md — implementation plans, 4–10k tokens
decisions.md — every architectural decision made since the project started
lessons.md — every lesson learned, including that one time someone forgot to add an index

Six agents. Each reads all of it. Most of it irrelevant to the task at hand.

A senior-dev implementing a Stripe webhook doesn't need the 200-line deep-dive into the auth system. They need two sentences: "We use Stripe. Card data never touches our infra."

The information was right. The delivery unit was wrong. We were running a library where everyone gets every book, every time.

Phase 1: Stop sending full documents. Send summaries.

Every artifact in great_cto now has a paired .summary.md — auto-generated, ≤250 tokens, structured for the consuming agent:

# ARCH — Multi-tenant auth system · summary
- **Decision:** SAML over OIDC for enterprise; JWT internally
- **Stack:** Node 20, Passport.js, PostgreSQL row-level security
- **Risks:** SAML metadata rotation, session fixation on tenant switch
- **Full doc:** docs/architecture/ARCH-auth.md

Agents read the summary first. If they need depth — the path to the full doc is right there. In practice, 80% of reads stop at the summary. The other 20% at least know exactly what they're looking for.

The numbers:

	Before	v2.19.0
13 artifacts, per agent read	21,459 tokens	2,216 tokens
Reduction		–89.7%

The summary generates automatically via a PostToolUse hook the moment any agent writes an artifact. Anthropic Haiku if you have an API key (~$0.0005/call). OpenRouter Kimi K2 as fallback. Deterministic keyword heuristic if neither — zero cost, works offline, mildly embarrassed about the quality but gets the job done.

No config. No manual steps. Write artifact, get summary.
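
The fallback order above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/generate-summary.mjs: the environment variable names, the `pick_backend` and `heuristic_summary` helpers, and the word-count stand-in for a token budget are all assumptions.

```python
import re
from collections import Counter

def pick_backend(env: dict) -> str:
    """Choose a summarizer following the fallback order described above."""
    if env.get("ANTHROPIC_API_KEY"):    # assumed variable name
        return "haiku"
    if env.get("OPENROUTER_API_KEY"):   # assumed variable name
        return "kimi-k2"
    return "heuristic"                  # offline, zero cost

def heuristic_summary(text: str, max_words: int = 60) -> str:
    """Offline fallback: keep the sentences richest in frequent keywords."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]{4,}", text.lower()))

    def score(sentence: str) -> int:
        return sum(freq[w] for w in re.findall(r"[a-z]{4,}", sentence.lower()))

    picked, used = [], 0
    for s in sorted(sentences, key=score, reverse=True):
        n = len(s.split())
        if used + n <= max_words:       # word count approximates the token budget
            picked.append(s)
            used += n
    picked.sort(key=sentences.index)    # restore document order
    return " ".join(picked)
```

With no keys set, `pick_backend({})` returns `"heuristic"`, which is the path that works offline.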

Phase 2: Stop injecting the entire memory. Filter it to the task.

decisions.md is an append-only log. It grows. A typical project after three months: 40–80 entries — database choices, API decisions, security tradeoffs, that one auth approach you tried and abandoned at 2am.

Before v2.19.0, the architect agent received the full file every time. 3–5k tokens, of which maybe 200 were actually relevant to the task. The model read all of it, politely, and quietly ignored most of it.

Now: one call to scripts/memory-filter.mjs "add Stripe webhook integration" decisions.md --k=5

The filter scores each entry against the task title. For "add Stripe webhook integration" — you get the PCI decision, the webhook signature lesson, the relevant security pattern. Not the database choice from six months ago that has nothing to do with anything.
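
A minimal sketch of what a keyword-overlap score against the task title could look like. The real scoring in scripts/memory-filter.mjs is not shown in this post, so the tokenizer, the stop-word list, the function names, and the sample entries below are assumptions.

```python
import re

# Assumed stop-word list; a real filter would likely use a longer one.
STOP = {"add", "the", "a", "an", "to", "for", "and", "of", "in", "we", "use"}

def tokens(text: str) -> set:
    """Lowercased keywords with stop words removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP}

def filter_memory(task: str, entries: list, k: int = 5) -> list:
    """Return up to k entries that share the most keywords with the task title."""
    task_tokens = tokens(task)
    scored = [(len(task_tokens & tokens(e)), i, e) for i, e in enumerate(entries)]
    relevant = [t for t in scored if t[0] > 0]      # drop entries with no overlap
    relevant.sort(key=lambda t: (-t[0], t[1]))      # best score first, then log order
    return [e for _, _, e in relevant[:k]]

# Hypothetical decisions.md entries, for illustration only.
entries = [
    "Chose PostgreSQL over MongoDB for relational reporting",
    "Stripe handles card data; PCI scope stays outside our infra",
    "Always verify webhook signature before parsing the payload",
]
top = filter_memory("add Stripe webhook integration", entries, k=5)
```

Here `top` keeps the two Stripe and webhook entries and drops the database choice, which shares no keyword with the task.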

The numbers:

	Before	v2.19.0
decisions.md inject per agent pair	946 tokens	544 tokens
Reduction		–42.5%

Latency: ~50ms heuristic, ~200ms Haiku. Cost: <$0.0001 per call. Opt-out: GREAT_CTO_DISABLE_MEMORY_FILTER=1 (for when you miss the old noise).

The combined pipeline: before vs. after

Six agents per feature. Each reads artifacts and memory.

	Before	v2.19.0
Total tokens per feature	134,430	16,560
Reduction		–87.7%
Cost saved (Sonnet $3/1M)		$0.35 per feature
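
The table's percentages and the cost figure follow directly from the token counts. A quick check, using only the numbers above:

```python
before_tokens = 134_430
after_tokens = 16_560
price_per_million = 3.00   # USD per 1M tokens, the Sonnet price quoted in the table

saved_tokens = before_tokens - after_tokens
reduction_pct = 100 * saved_tokens / before_tokens
saved_usd = saved_tokens * price_per_million / 1_000_000

print(f"{saved_tokens} tokens saved, {reduction_pct:.1f}% reduction, ${saved_usd:.2f} per feature")
# prints: 117870 tokens saved, 87.7% reduction, $0.35 per feature
```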

This is with a small project — 13 artifacts, 7 decisions. The savings compound with scale: at 50 artifacts and 50 decisions (a project six months in), the legacy number climbs past 600k tokens per feature run. The filtered number stays roughly flat.

That's the interesting property of this architecture: the noise grows with the project, the signal doesn't.

What this isn't

This is not prompt compression. We're not removing information — we're delivering it at the right granularity, to the right agent, at the right moment.

The full docs are still there. The full decisions.md is still there. Any agent that needs depth can read it — the summary tells them exactly where to look. The filter acknowledges it might miss something ("if you suspect a relevant lesson is missing, read the full file directly"). It's a hint, not a wall.

We're not betting on the model being smart enough to ignore irrelevant noise. We're not hoping a "Be concise." instruction somewhere will solve a structural problem. We're betting on information architecture: the same principle that makes an indexed database faster than a full table scan.

The index doesn't know less than the table. It knows where to look.

Getting it

Everything shipped in v2.19.0:

scripts/generate-summary.mjs — --all, --check, --force
scripts/memory-filter.mjs — --k=N, --heuristic, --stats
agents/_shared/artifact-summary-contract.md — the producer/consumer contract
31 tests, all green

npx great-cto upgrade

Summaries generate on first --all run, then stay fresh automatically. Memory filter activates in architect and senior-dev agents — no config needed.

What's next

Phase 3: session-scoped read cache. When five agents in one pipeline all read PROJECT.md, only the first actually reads the file. The rest get a cache stub with a hash. Target: additional –15% on multi-agent runs.
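
Phase 3 has not shipped, so the following is only a sketch of the idea under stated assumptions: the `SessionReadCache` class, the stub wording, and the choice to re-hash on every call are all hypothetical. Re-hashing means a file that changed mid-session is served in full again; the saving is in tokens handed to the agent, not in disk I/O.

```python
import hashlib

class SessionReadCache:
    """First read returns the content; repeat reads of unchanged content return a stub."""

    def __init__(self):
        self._seen = {}  # path -> sha256 hex digest of the content already served

    def read(self, path: str, loader) -> str:
        content = loader(path)   # loader is any callable mapping a path to text
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if self._seen.get(path) == digest:
            # Same bytes as before: hand back a short stub instead of the full text.
            return f"[cached] {path} sha256:{digest[:12]} (unchanged since first read)"
        self._seen[path] = digest
        return content
```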

Phase 4: system prompt audit across all 30+ agent files. Removing filler. Enforcing token budgets. Finding the seven places we wrote "carefully" when the model was going to be careful anyway.

The full plan is public: docs/plans/PLAN-token-economy-2026-q2.md

great_cto v2.17 - no more tambourine dance

Alexander Velikiy — Fri, 22 May 2026 15:03:35 +0000

If you've ever spent 20 minutes setting up Claude Code plugins before you could even start working - this update is for you.

One install, everything works.

Previously: install great_cto, then figure out that Superpowers and Beads are also needed, find the repos, clone them, enable them in settings, restart. Classic.

Now - one command:

npx great-cto install

Done. Superpowers and Beads install automatically as companion plugins. They land in ~/.claude/plugins/cache/local/, get enabled in settings.json, and are ready to work. If git is missing - great_cto gives a friendly hint instead of silently failing.

Jurisdiction-aware agents.

The new jurisdictions module detects the context of your project - EU, US, Canada, UK, Australia, and others - and automatically activates the right regulatory reviewer agents.

Working on a fintech product for European users? The EU reviewer turns on automatically. Building for the Canadian market? PIPEDA gets covered. No manual configuration, no trying to remember what applies where.

Eight jurisdictions are currently supported, and the list keeps growing.

Critics before the plan.

The most expensive bugs aren't in the code - they're in the decisions made before coding starts. Three new critic agents now run at the earliest stages of the pipeline, before a single line is written.

Architecture critic catches structural problems that make future work impossible. Coupling that rules out multi-tenancy. An "obvious" O(n²) loop that works fine in dev and falls apart at scale. These aren't bugs - they're constraints that quietly close off entire solution spaces.

Spec critic catches "we solved the wrong problem" - the worst class of bug, because there's no way to unit-test for it. By the time the code works correctly, it may be doing entirely the wrong thing.

Schema critic catches the migration that will deadlock a 50M-row table 10 minutes after deploy. A NOT NULL column without a default. An index added without CONCURRENTLY. The kind of change that looks clean in a code review and becomes an incident.
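
The two schema patterns named above are mechanical enough to show as a toy check. This is not how the schema critic agent works internally (the post does not say); it is a deliberately naive, PostgreSQL-flavoured sketch, and the function name and finding strings are made up for illustration.

```python
import re

def risky_migration_findings(sql: str) -> list:
    """Flag the two migration patterns mentioned above. Naive: splits on ';' only."""
    findings = []
    for stmt in filter(None, (s.strip() for s in sql.split(";"))):
        flat = " ".join(stmt.split()).upper()   # collapse whitespace, ignore case
        if "ADD COLUMN" in flat and "NOT NULL" in flat and "DEFAULT" not in flat:
            findings.append("NOT NULL column added without a DEFAULT")
        if re.match(r"CREATE (UNIQUE )?INDEX\b", flat) and "CONCURRENTLY" not in flat:
            findings.append("index created without CONCURRENTLY")
    return findings

risky = """
ALTER TABLE orders ADD COLUMN region text NOT NULL;
CREATE INDEX idx_orders_region ON orders (region);
"""
safe = """
ALTER TABLE orders ADD COLUMN region text NOT NULL DEFAULT 'eu';
CREATE INDEX CONCURRENTLY idx_orders_region ON orders (region);
"""
```

`risky_migration_findings(risky)` reports both problems; the `safe` variant reports none.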

Previously, critics only appeared starting from Plan stage. Now they cover the three positions where a mistake is most expensive.

llm-leash UI: 16 new features.

llm-leash is the great_cto admin board - a local web UI that shows what your AI agents are doing, what they've spent, what passed review, and what needs your attention. Think of it as a control panel for the agent pipeline.

This release adds 16 new features to the board. The most useful ones:

Cmd-K - global command palette for navigation.
Issues subtab - all security and compliance findings in one place.
Session timeline - visual history of what happened and when.
Topology graph - shows agent dependencies. Useful when you have 5+ parallel agents running.
HITL diff - human-in-the-loop review of agent changes before they're applied.
OPA config - Open Policy Agent integration for compliance rules.
SOC2 export - one-click audit trail for compliance officers.
Rule comparison - compare policy versions side by side.

Companion plugins out of the box.

A bit more detail on how the Superpowers + Beads bundle works, since the architecture is non-obvious.

Superpowers - a methodology plugin. It gives Claude Code skills: /brainstorm, /write-plan, /execute-plan, code review workflow, TDD cycle, parallel agent execution. Without it, Claude acts on vibes. With it - on a structured plan.

Beads - a git-native task tracker. Tasks live as commits, survive session restarts, have dependencies and blockers. Claude creates and closes them autonomously as it works.

great_cto - the orchestration layer. It routes requests to the right agents, enforces reviewers based on archetype and jurisdiction, manages the agent pipeline.

Together: you describe what needs to be done, great_cto breaks it into a plan, Beads tracks it, Superpowers enforces methodology, the right reviewer agents plug in automatically.

TL;DR

npx great-cto install

npm: https://www.npmjs.com/package/great-cto
GitHub: https://github.com/avelikiy/great_cto

Feedback and PRs welcome.

AI Agents Work While You Sleep — Now They Can Wake You Up

Alexander Velikiy — Mon, 18 May 2026 08:22:47 +0000

Let me describe a Tuesday evening.

I fire off /start "refactor billing module", the pipeline kicks in, six AI agents start doing their thing, and I think: great, I've got an hour. I'll cook pasta.

I cook pasta. I eat pasta. I do the dishes. I put on an episode of something. I come back.

The pipeline has been waiting for my approval for 54 minutes. The senior-dev agent is sitting there, doing absolutely nothing, blocked on a gate:plan that needed one click from me. Fifty-four minutes of human absence. Zero pasta to show for it.

This is the core tension of running an AI pipeline: the whole point is that it works while you're not watching. But the moment it needs you, it needs you immediately — and it has no way to tell you that.

Until now.

What we added

Two things, both live in the board's Notifications settings:

Email alerts — you enter your email, click a verification link, done. From that point on, five specific events send you an email:

A P0 incident opens (the pipeline found a production fire)
A gate has been waiting for your approval for more than 30 minutes
A gate is actively blocking the pipeline right now
Your monthly AI spend crosses the limit you set
Monday morning weekly digest — what got done, what it cost, how many gates passed

Browser push notifications — desktop notifications, the same kind you get from Slack or email. You enable them once in the board settings, the browser asks for permission, and that's it. No app to install. No Firebase. No account anywhere.

Why exactly these five triggers

The first version of this feature had fifteen triggers. It was immediately annoying. Every time an agent sneezed, my phone buzzed.

The honest answer to "what do I actually need to know about right now" is surprisingly short:

Something broke. Not "an agent is running" or "a review started" — the actual situation where production is on fire and the pipeline found it before I did.

I'm blocking progress. The pipeline stopped and is waiting for me. The longer I wait, the more time I've wasted running all those agents. If it's been 30 minutes, I definitely didn't see the gate notification in the terminal.

I'm about to overspend. LLM costs are real. A runaway pipeline on a big refactor can quietly rack up $20–30 if nobody's watching. A cost alert at $15 is much better than discovering $40 on the invoice.

Weekly summary. Not urgent, but useful. Monday morning coffee + what your AI team accomplished this week = a surprisingly good way to start the day.

That's the whole list. No alerts for "agent started", "agent completed", "reviewer disagreed with another reviewer", or any of the other events that feel important but mostly produce noise.
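
Written as code, the five triggers fit in one small function. This is a sketch: the parameter names and trigger labels are assumptions, and the $15 default is just the example figure from the text. Only the 30-minute threshold comes from the feature description.

```python
GATE_WAIT_LIMIT_MIN = 30   # the "more than 30 minutes" threshold from the list above

def fired_triggers(*, p0_open=False, gate_waiting_min=0.0, gate_blocking=False,
                   month_spend_usd=0.0, spend_limit_usd=15.0, monday_digest_due=False):
    """Return the names of the triggers that should notify right now."""
    fired = []
    if p0_open:
        fired.append("p0_incident")
    if gate_waiting_min > GATE_WAIT_LIMIT_MIN:
        fired.append("gate_waiting")
    if gate_blocking:
        fired.append("gate_blocking")
    if month_spend_usd > spend_limit_usd:
        fired.append("spend_limit")
    if monday_digest_due:
        fired.append("weekly_digest")
    return fired
```

For the pasta evening, `fired_triggers(gate_waiting_min=54, gate_blocking=True)` returns `["gate_waiting", "gate_blocking"]`. A pipeline that is simply running returns an empty list.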

The email setup

Open the board → Settings → Notifications. Enter your email. You get one verification email — click the link, and you're verified forever. No repeated sign-ins, no token rotation, no dashboard at some third-party service.

Under the hood the board sends emails through a relay we run at greatcto.systems. We're using Resend on the free tier (100 emails/day), which is more than enough for a solo developer who isn't actively burning the place down.

Why a relay and not direct SMTP? Because storing an SMTP password inside a local tool that lives on your laptop is a disaster waiting to happen. The relay holds the credentials; your board just sends an HTTPS request. If the relay is down, you miss a notification. That's fine. You're not building a hospital.

The push notifications

This one was more fun to build.

Browser push notifications sound simple — they're everywhere, every website pesters you with them — but implementing them correctly from scratch is genuinely involved. There's a spec called VAPID that requires signing cryptographic tokens with elliptic curve keys, and basically every tutorial says "just use the web-push npm package."

We couldn't. The board server is intentionally zero-dependency — no npm packages, no node_modules, nothing. It's a single file that you can read start to finish in an afternoon. Adding a library for notifications would mean adding a library for notifications and everything that library depends on, which is how you end up with 47 packages installed to send one HTTP request.

So we implemented it ourselves using only what Node.js ships with.

The fun gotcha: somewhere deep in the VAPID spec, the signature format that Node produces natively is not the format that browsers expect. One is DER-encoded (an old ASN.1 format from the 90s), the other is just raw bytes. Our first test push hit the browser's push service and got a 401 back. Fifteen minutes of reading specs later, we found the conversion, fixed it, and every subsequent push worked perfectly.
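
For readers who hit the same 401: a VAPID token is an ES256 JWT, and ES256 expects the signature as the raw 64-byte concatenation of r and s rather than a DER structure. The board is written for Node, and the post does not show its conversion, so the Python below is only an illustration of the byte-level transformation. As far as I know, Node's crypto signing API can also emit the raw form directly through its `dsaEncoding: 'ieee-p1363'` option.

```python
def der_to_raw(der: bytes, size: int = 32) -> bytes:
    """Convert an ASN.1 DER ECDSA signature to fixed-width r||s (64 bytes for P-256)."""
    if len(der) < 8 or der[0] != 0x30:
        raise ValueError("not a DER SEQUENCE")
    pos = 2
    if der[1] & 0x80:                  # long-form length; not needed for P-256, kept for safety
        pos += der[1] & 0x7F
    parts = []
    for _ in range(2):                 # two INTEGERs: r, then s
        if der[pos] != 0x02:
            raise ValueError("expected DER INTEGER")
        length = der[pos + 1]
        value = der[pos + 2 : pos + 2 + length]
        pos += 2 + length
        value = value.lstrip(b"\x00")  # drop the sign-padding byte DER adds
        if len(value) > size:
            raise ValueError("integer too large for this curve")
        parts.append(value.rjust(size, b"\x00"))   # left-pad to fixed width
    return b"".join(parts)
```

The two formats differ in length as well as layout: DER is variable-width (typically 70 to 72 bytes for P-256), while the raw form is always exactly 64.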

The end result: you toggle the switch in the board, your browser asks "Allow notifications?", you click Allow, and from that point on your desktop shows a native notification whenever any of the five triggers fire. Same notification you'd get from a Slack message. No third-party service involved.

The notification drawer

Push and email are for when you're away from the keyboard. The in-app drawer is for when you're at the keyboard but not watching the terminal.

Click the bell icon in the top nav. A panel slides out with the last 20 notifications — what fired, when, and whether you've seen it. Unread count shows as a badge on the bell. "Mark all read" lives right there.

The history persists across restarts. Close the board, reopen it tomorrow morning, your Monday digest is still there.

How to try it

npx great-cto init
npx great-cto board

Go to Settings → Notifications. Add an email or enable push (or both). Then just work normally. If a gate waits too long, you'll hear about it.

The whole thing is open source: github.com/avelikiy/great_cto (MIT, free, you pay your own LLM provider).

The pasta incident has not repeated since.

Real cost breakdown: 10 packs, $0.60 LLM bill, $42K saved per regulated feature

Alexander Velikiy — Sun, 17 May 2026 14:29:04 +0000

This is the numbers post. If you read the ten-packs deep-dive and walked away wanting the spreadsheet, here it is.

All numbers below are from real client engagements (anonymized aggregates) plus telemetry from the GreatCTO install base. Not projections. Not vendor-pitch math.

Per-feature: the $42K → $0.60 + 50 hours of human review

A single regulated feature in a single industry. Pre-pipeline:

Identify which regs apply          ~8h    × $200      = $1,600
Read primary regulation text      ~14h    × $200      = $2,800
Map regulation → stack            ~20h    × $250      = $5,000
Draft threat model                ~32h    × $250      = $8,000
Consent flow + UX                 ~20h    × $180      = $3,600
Implementation                    ~40h    × $180      = $7,200
Internal legal review              ~8h    × $400      = $3,200
External auditor pre-meeting      ~10h    × $350      = $3,500
Revisions                         ~16h    × mixed     = $3,500
Final signoff                      ~4h    × $400      = $1,600
                                  ─────                ─────
                                  ~172h               ~$40K
                                                      (rounded $42K with overhead)
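
The totals can be re-derived from the rows. The snippet below only re-adds the figures already listed; nothing in it is new data.

```python
rows = [  # (phase, hours, subtotal in USD), copied from the breakdown above
    ("Identify which regs apply", 8, 1_600),
    ("Read primary regulation text", 14, 2_800),
    ("Map regulation to stack", 20, 5_000),
    ("Draft threat model", 32, 8_000),
    ("Consent flow + UX", 20, 3_600),
    ("Implementation", 40, 7_200),
    ("Internal legal review", 8, 3_200),
    ("External auditor pre-meeting", 10, 3_500),
    ("Revisions", 16, 3_500),
    ("Final signoff", 4, 1_600),
]
total_hours = sum(hours for _, hours, _ in rows)
total_usd = sum(usd for _, _, usd in rows)
print(total_hours, total_usd)   # prints: 172 40000
```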

With pipeline:

LLM compute (architect+reviewers)  ~$0.60-$1.40 per feature
Human review of LLM output         ~14-18h × mixed     ~$3,800
External auditor pre-meeting       ~6-8h   (lower because tighter document)
Internal legal                     ~8h     (unchanged)
                                   ─────                ─────
                                   ~28-34h              ~$11-14K

Net saved per feature: ~$28-30K and ~140 hours of human time. LLM bill is rounding error.

The $0.60 number is per feature, not per MVP. Some readers conflated these. A small fintech feature on Claude Sonnet costs ~$0.60-$1.40 in LLM calls. A full MVP run with all 10 packs activated and ~30 features costs ~$500-$1,500 in LLM compute. Both numbers are honest; they describe different scopes.

Per-MVP: $287K → $128K (~55% reduction)

A voice-AI MVP, three months of work, traditional team composition:

1 Product Manager × 3 months × $180/h × 120h/mo  = $64,800
4 Engineers × 3 months × $180/h × 140h/mo        = $302,400
Architecture work (internal or fractional CTO)   = ~$20,000
Security review (external)                       = ~$15,000
Compliance setup (consultant + internal time)    = ~$28,000
Misc (PM tools, hosting trial, design)           = ~$8,000
                                                   ─────────
                                                   ~$438K nominal
                                                   ~$287K after overlap & efficient teaming


With pipeline + agentic SDLC, same MVP, 6-8 weeks:

- 1 Product Manager × 2 months × $180/h × 120h/mo  = $43,200
- 2 Engineers × 2 months × $180/h × 140h/mo        = $100,800
- LLM compute across the whole run                 = ~$1,200
- Architecture review (1 sr human, 3 sessions)     = ~$3,000
- Security review (external, same)                 = ~$15,000 (unchanged — see "what doesn't compress")
- Compliance setup (pipeline output + ~12h review) = ~$5,500
- Misc                                             = ~$8,000
                                                     ─────────
                                                     ~$176K nominal
                                                     ~$128K after similar overlap savings

Net: ~$159K saved per MVP, ~45% time saved. Most of the saving is not the LLM bill — it is fewer engineer-months because senior-dev parallelism + auto-review compresses the build phase.
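
The nominal totals and the headline reduction can be recomputed from the line items. The post-overlap figures ($287K and $128K) are taken as given from the breakdowns above, since the overlap adjustment itself is not itemized.

```python
traditional = {
    "pm": 1 * 3 * 180 * 120,          # 1 PM x 3 months x $180/h x 120 h/mo
    "engineers": 4 * 3 * 180 * 140,   # 4 engineers x 3 months x $180/h x 140 h/mo
    "architecture": 20_000,
    "security": 15_000,
    "compliance": 28_000,
    "misc": 8_000,
}
pipeline = {
    "pm": 1 * 2 * 180 * 120,
    "engineers": 2 * 2 * 180 * 140,
    "llm": 1_200,
    "architecture": 3_000,
    "security": 15_000,
    "compliance": 5_500,
    "misc": 8_000,
}
nominal_traditional = sum(traditional.values())   # 438200
nominal_pipeline = sum(pipeline.values())         # 176700

# Post-overlap totals as stated in the post.
after_overlap_traditional, after_overlap_pipeline = 287_000, 128_000
saved = after_overlap_traditional - after_overlap_pipeline   # 159000
reduction_pct = 100 * saved / after_overlap_traditional      # about 55.4
```

The two engineer lines alone differ by $201,600 nominal, which is why the LLM line barely registers.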

Per-quarter / per-runway: the bet that changes

For a founder shipping into one regulated industry (most realistic scenario):

	Traditional	Pipeline	Saved
MVP time	3 months	6-8 weeks	~1.5 months
MVP cost	$287K	$128K	$159K
Compliance setup (4 features, year 1)	$168K	$48K	$120K
Year 1 total	$455K	$176K	$279K
Equivalent runway months @$50K burn	9.1 mo	3.5 mo	5.6 months recovered

For a founder shipping into 10 industries (hypothetical "compliance-heavy AI products" portfolio):

	Traditional	Pipeline	Saved
Year 1 (10 MVPs × overlap)	$1.45M	$580K	$870K
Wall-clock (sequential)	30 months	10 months	20 months
Wall-clock (with parallelism)	21 months	7 months	14 months

The 10-industry case is hypothetical — no real founder ships into all 10 simultaneously. But it shows the structural ratio: roughly 60% cost reduction, roughly 67% wall-clock reduction.

LLM compute: where the money goes

Per-MVP LLM compute, ~$500-$1,500 total, breaks down roughly:

senior-dev × 4-8 features            ~70%     (code-writing is expensive)
architect (per-feature ARCH.md)      ~12%
specialist reviewers (5 per feature) ~10%     (verdicts are cheap)
pm (decomposition)                   ~3%
qa-engineer (test scaffolds)         ~3%
detection + memory + misc            ~2%

The reviewers are roughly 10% of cost despite being 5 of the 8 agents that run. They output verdicts, not code. If your LLM cost is exploding, look at how much code is being generated, not how many agents are running.

Hardware / model-choice ratios

We tested Sonnet 4.6 vs Haiku 4.5 vs Opus 4.5 on the same 23-feature batch:

Model	LLM cost ratio	Wall-clock ratio	architect output quality (human eval, blind)
Haiku 4.5	0.31×	0.74×	"noticeably worse" — 4 of 23 ARCH docs unusable
Sonnet 4.6	1.0× (baseline)	1.0×	acceptable, default
Opus 4.5	5.1×	1.27×	"marginally better" — 1 ARCH doc clearly superior

Conclusion: Sonnet for everything except deep-reasoning architecture decisions. Use Opus only for architect on greenfield features in unfamiliar territory. Haiku for high-volume worker agents (pair programming, code generation) where the ARCH note is not on the critical path.

What does NOT compress

I have called this out before, but in numbers terms:

Item	Compressible?
External audit cycle (NYC bias auditor, 2-4 weeks)	No
FDA pre-submission meeting (60-90 days)	No
IRB approval (clinical trials, 8-12 weeks)	No
Wet-lab validation (drug discovery)	No
HARA signature (functional safety, 1 calendar moment)	No
Lawyer reading the threat model	Compresses (LLM-written threat model is faster to read than human-written long-form)
Regulator phone calls	No

Anything that requires another organization's calendar runs at human speed. Internal work compresses 5-25×. External-dependency work does not move.

For an early-stage AI startup on 18-24 month runway, the bet that changes is the internal portion. You can now run 3 external compliance cycles per year instead of 1.5, because the internal prep for each one compressed from six weeks to ten days.

The thing I underbet

When I started building the packs, I assumed the ROI claim would be "30-40% on compliance cost." The number ended up larger and the shape surprised me — most of the saving is not the LLM compute (it is rounding error) but the fewer engineering-months the parallelism enables, plus the fewer consulting hours the LLM-drafted threat model enables.

If you take one number from this post: the LLM compute is not the moat. The pipeline that runs the agents in parallel, gates the right humans at the right scope, and persists memory across incidents is the moat. The LLM is the substrate.

About: I build GreatCTO — a multi-agent SDLC plugin for Claude Code with 10 compliance packs. MIT, runs locally. Pay your own LLM API. Per-pack numbers (which 10 industries, what each pack does, real consulting-rate comparisons) are in the W21 deep-dive.

The MTTR -94% claim, with receipts

Alexander Velikiy — Sun, 17 May 2026 14:28:21 +0000

Earlier posts cite a "median MTTR drop of 94.1% across 47 paired P0 incidents." This post is the receipts. The full methodology is also in docs/benchmarks/MTTR.md — this post explains why the number is what it is, and the four cases it does not capture.

What got measured

Setup:

12 production repositories (mix of fintech, voice-AI, clinical, dev-tools).
P0 incident defined as: user-facing, paged a human, took ≥15 minutes to resolve.
Window: rolling 6 months. Pre-treatment + post-treatment.
Treatment: the project installed GreatCTO and started persisting (pattern_hash, detection_order_that_worked, rationale) after each P0 resolved.
Outcome: time from page to root-cause identified.

I measured detection time, not full resolution time. Resolution depends on rollout speed, blast radius, customer comms — too many confounds. Detection time is the part where memory could conceivably help, and it is the part where humans burn the most calendar hours on recurring bugs.

The number

47 paired incidents. "Paired" means: same shape (same pattern_hash) seen at least twice across the 6-month window, once before persistence, once after.

Stat	Pre	Post	Delta
Median detection time	178 min	11 min	−94.1%
Mean detection time	224 min	17 min	−92.6%
90th percentile	412 min	41 min	−90.0%
Worst case (post)	n/a	89 min	n/a
Best case (post)	n/a	4 min	n/a

The distribution is skewed by a couple of near-100% cases (postgres pool exhaustion and a connection-string typo that the agent matched to a prior incident's commit diff and flagged in under 5 minutes). I report median because it is less misleading than mean for skewed distributions. The 90th percentile is probably the number you should care about: 412 min down to 41 min is the "still 10× faster on the bad cases" claim.

How the mechanism works

The agent stores, for each resolved incident:

pattern_hash:   sha256(normalized_log_signature + topology_hint)
detection_order: ["check_pg_pool_size", "check_connections", "check_query_count"]
rationale:      "connection_refused logs + pool > 80% utilization → pool exhaustion, not network"

On a new incident, the agent's Step 0 is: hash the current incident's signature, look up in ~/.great_cto/incident_memory.jsonl, if pattern hits, try the prior detection_order first. If it identifies the root cause: log "memory hit." If it does not: fall back to systematic exploration.

There is no inference. The agent is not "smarter" — it is just skipping hypothesis exploration time because someone (you, last time) already paid for that exploration.
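
Step 0 can be sketched in a few lines. The record fields (`pattern_hash`, `detection_order`, `rationale`) come from the example above; the one-JSON-object-per-line layout, the `run_check` and `systematic` callables, and the return shape are assumptions for illustration, not the code in the memory layer.

```python
import json

def load_memory(lines) -> dict:
    """Parse JSONL lines into {pattern_hash: record}; later records win."""
    memory = {}
    for line in lines:
        line = line.strip()
        if line:
            record = json.loads(line)
            memory[record["pattern_hash"]] = record
    return memory

def detect(pattern_hash: str, memory: dict, run_check, systematic) -> dict:
    """Try the remembered checks first, then fall back to systematic exploration."""
    record = memory.get(pattern_hash)
    if record:
        for check in record["detection_order"]:
            cause = run_check(check)        # assumed to return a root cause or None
            if cause:
                return {"cause": cause, "memory_hit": True}
    return {"cause": systematic(), "memory_hit": False}
```

The fallback is what keeps a wrong pattern match from being fatal: if none of the remembered checks finds the cause, `detect` still ends in systematic exploration.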

⚠ The 4 honest misses

Memory-based detection is not magic. Four cases in the 47 had pattern matches that pointed in the wrong direction and burned 10-30 minutes before the agent gave up and fell back to systematic exploration.

Miss #1. Pattern matched on log signature "OOMKilled in worker pool." Prior detection order was "check worker memory limits." Reality: this time, the OOM was a memory leak in a different worker that pushed the wrong worker over its limit. Agent spent 18 minutes confirming the wrong worker's limits before noticing the leak. Total detection time: 34 minutes vs ~80 minutes baseline. Net positive but ugly.

Miss #2. Pattern matched "5xx spike from API gateway." Prior cause was upstream DB lag. Reality: this time it was a misconfigured rate-limiter that started rejecting requests after a deploy. Agent ran "check DB lag" for 12 minutes before pivoting. 28 minutes total vs ~140 baseline. Still a win, but called a "miss" because the prior path was wrong.

Miss #3. Pattern matched "auth failures after deploy." Prior cause was OAuth client secret rotation. Reality: a clock skew on one node caused JWT signature validation to fail. Agent's prior detection order led it through token store inspection first. 41 minutes total vs ~200 baseline.

Miss #4. Worst case. Pattern matched "DNS resolution failures." Prior detection order was "check Route 53 health checks." Reality: a third-party CDN had an outage. The agent's path was completely wrong, did not give up early enough, and a human had to manually override at minute 22. 89 minutes total vs ~150 baseline. Win on absolute time, but I would not call this a "memory worked" case.

If I report the 47 cases as "94.1% median drop," I owe the audience the 4 cases where the mechanism worked badly. They are 8.5% of the sample. The remaining 91.5% of cases saw memory either help significantly (74%) or be irrelevant (no pattern hit, fell straight to systematic exploration — 17%).

How to replicate in your own repo

Three steps, no GreatCTO required:

Persist incident memory. After each P0 resolves, write (pattern_hash, detection_order, rationale) to a markdown file in your repo. Plain text. Git-trackable.
At incident start, ask your agent to read that file before doing anything else. Even Claude Code with no plugins will use the file if you point it at one.
Track detection time. Page-to-RC-identified, in minutes. Spreadsheet is fine.

Run for one quarter. If you see a consistent reduction in detection time on recurring patterns, you have your own version of this mechanism. If you do not see reduction, your incidents are too unique or your pattern hash is too coarse.

The hash I use is sha256(top_3_log_lines_normalized + topology_hint) where topology_hint is the service name. This gets ~70% recall on similar incidents and very few false hits. You can tune for your domain.
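
A minimal sketch of such a hash. The formula is the one above; the normalization rules (masking timestamps, long hex ids, and numbers) are my assumption of what "normalized" might mean, and you would tune them for your own logs.

```python
import hashlib
import re

def normalize(line: str) -> str:
    """Mask the parts of a log line that vary between otherwise identical incidents."""
    line = line.lower()
    line = re.sub(r"\d{4}-\d{2}-\d{2}[t ]\d{2}:\d{2}:\d{2}\S*", "<ts>", line)  # timestamps
    line = re.sub(r"\b[0-9a-f]{8,}\b", "<id>", line)                           # hex ids
    line = re.sub(r"\d+", "<n>", line)                                         # other numbers
    return " ".join(line.split())

def pattern_hash(log_lines: list, topology_hint: str) -> str:
    """sha256 over the first three normalized log lines plus the service name."""
    signature = "\n".join(normalize(l) for l in log_lines[:3])
    return hashlib.sha256((signature + topology_hint).encode("utf-8")).hexdigest()

# Hypothetical log lines from two occurrences of the same failure.
first = ["2026-05-17T14:28:21Z connection refused pid=4312"]
second = ["2026-05-18T09:01:02Z connection refused pid=977"]
```

`pattern_hash(first, "billing")` equals `pattern_hash(second, "billing")`, while the same lines under a different service name hash differently. Stricter normalization raises recall and also raises the risk of false hits.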

What I will not do

Some readers ask for the raw data (anonymized incidents). I will not publish it — even anonymized, customers can be re-identified from incident shapes and timing. I will share the synthetic test cases in tests/incident_memory.test.mjs and the aggregate statistics in docs/benchmarks/MTTR.md. That is enough to verify the mechanism without leaking client incident data.

What this is not

Not an RCT. Observational. Twelve repos is small. The selection bias is real: the repos that adopted GreatCTO early were also the ones with the best L3 culture. A worse team might see a 30% drop instead of 94%.

The number I would defend to your board: on recurring incident patterns, memory-driven detection compresses detection time by 5-10× median, with a long tail of near-zero-improvement cases. That is more honest than "94%." But "94%" is what shows up in the data.

About: I build GreatCTO — a multi-agent SDLC plugin for Claude Code. MIT, runs locally. Memory layer source is in packages/cli/src/memory.ts. The full benchmark methodology is at docs/benchmarks/MTTR.md.

Three days of code, six weeks of compliance — the math behind why

Alexander Velikiy — Sun, 17 May 2026 14:28:15 +0000

If you have shipped into a regulated industry, you know this ratio. Engineering ships a feature in three days. Compliance setup around the feature takes six weeks. Some founders get used to it. The right reaction is: the ratio is the bug.

This post is for the CEO / CTO who reads "What $1.4M of compliance work looks like in 14 hours" and wants to understand the mechanism — why six weeks specifically, and where in those weeks an LLM can save time without anyone getting sued.

Where the six weeks actually go

I priced this out properly the last three times I lived it as a CTO-for-hire. Numbers below are typical for a voice-AI or fintech feature shipping in 2025-2026.

Phase	Median hours	Hourly rate	Subtotal
Identify which regulations apply	8	$200 (senior legal)	$1,600
Read primary regulation text	12-16	$200	~$2,800
Map regulation → your stack	16-24	$250 (compliance consultant)	~$5,000
Draft threat model	32	$250	$8,000
Draft consent flow + UX changes	16-24	$180 (senior PM + senior frontend)	$3,600
Implement consent + audit log	40	$180	$7,200
Internal legal review of threat model	8	$400 (general counsel)	$3,200
External auditor pre-meeting + Q&A	10	$350 (specialist)	$3,500
Revisions, second pass	16	mixed	~$3,500
Final sign-off	4	$400	$1,600
Total	~190 hours	mixed	~$42,000

This is a single regulated feature. Multi-jurisdictional (US + EU + India + state-level US) doubles or triples it. Multi-feature (a startup shipping into a regulated industry has 8-15 such features in the first six months) makes the aggregate $300K-$500K of consulting before the product exists in production.

Where an LLM helps

Not all of those 190 hours are equal. Some are mechanical, some require judgment, some require relationships.

Mechanical (can be 80-90% automated):

Reading primary regulation text. The CFR is plain text. The EU AI Act Annex III is plain text. LLMs read 200 pages faster than any human can think. Replaces ~12-16 hours.
Mapping regulation to stack. "Does our PCI-DSS scope include the webhook signature verifier?" is a deterministic question with a regex-and-citation answer. Replaces ~12-18 hours of the 16-24.
Drafting threat model. Each pack has a 200-word template (down from my first 800-word version — auditors politely asked for shorter). LLM fills it in using regulation text + your ARCH.md. Replaces ~24-28 hours of the 32.
Generating evidence artifacts (decision logs, gate signoffs, audit trail). The pipeline emits these as side effects, not as a separate phase. Replaces ~6-8 hours.

Judgment (human time shrinks, but the final call stays human):

Identify which regulations apply. Mostly mechanical, but the "is this an edge case" call is human. Reduces from 8h to ~2-3h of review.
Drafting consent flow UX. Pure product judgment. The LLM writes a first pass you can react to in 15 minutes instead of authoring from scratch in 4 hours. Reduces from 16-24h to ~4-6h.
Implementation. Coding is faster with LLM assistance, but the gates are real. Reduces from 40h to ~10-15h.

Relationship (cannot be automated, and pretending otherwise is malpractice):

Internal legal review. Your GC has to sign. Their time is your time. Unchanged at 8h.
External auditor pre-meeting. The auditor wants a human on the other end of the phone who can defend the threat model under questioning. The LLM-generated threat model is the document the auditor reads. The conversation about it is yours. Unchanged at 10h, but the auditor reads a tighter document faster, so call it 6-8h net.

New math:

Phase	Old	New	Saved
Identify regs	8h	2-3h	~6h
Read regs	12-16h	1-2h	~13h
Map to stack	16-24h	3-4h	~17h
Threat model	32h	4-6h	~27h
Consent UX	16-24h	4-6h	~15h
Implementation	40h	10-15h	~28h
Internal legal	8h	8h	0
External auditor	10h	6-8h	~3h
Revisions	16h	6-8h	~9h
Final signoff	4h	4h	0
Total	~190h	~50-65h	~125-140h

Wall-clock compresses from six weeks to about ten working days, partly because work is removed and partly because the work that remains can run in parallel (the LLM drafts while the auditor pre-meeting is being scheduled).

Cost compresses from ~$42K to ~$15-18K (LLM bill ~$50-150, human time the rest). Median compression I have measured: ~60% on cost, ~67% on wall-clock.
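The compression figures can be checked with a few lines of arithmetic. This is a minimal sketch using the numbers quoted above; treating six weeks as 30 working days is my assumption, and the helper name `reduction` is illustrative.

```python
def reduction(before: float, after: float) -> float:
    """Fractional reduction when going from `before` to `after`."""
    return 1 - after / before

# Cost: ~$42K before, ~$15-18K after (figures from the text).
cost_reduction_low = reduction(42_000, 18_000)
cost_reduction_high = reduction(42_000, 15_000)

# Wall-clock: six weeks taken as 30 working days (assumption), ten after.
wall_clock_reduction = reduction(30, 10)

print(f"cost: {cost_reduction_low:.0%} to {cost_reduction_high:.0%}")  # cost: 57% to 64%
print(f"wall-clock: {wall_clock_reduction:.0%}")                       # wall-clock: 67%
```

The measured ~60% median on cost sits inside the 57-64% band that the quoted dollar range implies.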

Why this is not "AI replaces compliance consultants"

The compliance specialist of 2027 is someone who knows which regulation applies in which jurisdiction and can operate a pipeline to do the reading and templating for them. Same depth of judgment. Five times the productivity.

That person is going to win market share against the consultant still billing by the hour to read 200 pages of regulation. Not because their judgment is better — it is the same. Because their cost-per-judgment is one-fifth.

The judgment is the moat. The reading and templating around the judgment has been commoditized. This is the same transition that happened to junior associates in law firms when document-review tools landed in 2010-2015. Senior partners did not disappear; they got faster.

What does not compress

External calendar time. The auditor still books two weeks out. The FDA pre-submission meeting is still 60-90 days. IRB approval is still 8-12 weeks. Internal work compresses 5-25×; external-dependency work does not move.

If your runway is 18 months and you ship into a regulated industry, the realistic plan is:

Compress internal compliance work from 6 weeks to 10 days.
Use the recovered 4 weeks to run the external cycles in parallel with the next feature.
End up with one external cycle per quarter, not one every two quarters.

That math doubles the number of features that ship through compliance per year for the same runway. For an early-stage AI startup, that is the difference between catching the wave and missing it.

About: I build GreatCTO — a multi-agent SDLC plugin for Claude Code with 10 compliance packs. MIT, runs locally. The cost-by-pack breakdown is in the W21 deep-dive.

How GreatCTO chooses which compliance pack to attach

Alexander Velikiy — Sun, 17 May 2026 14:27:39 +0000

Every time someone runs npx great-cto init, the CLI has to decide:

What kind of project is this? (one of ~25 archetypes)
Which compliance packs apply on top? (voice / clinical / fintech / lending / 6 more)
Are any of those guesses wrong enough that the user will get a useless threat model and abandon the tool?

That last question is what makes the detection logic interesting. Get it wrong and the first impression is "this is producing nonsense about regulations I don't care about." Make it too conservative and the user has to manually configure packs that should have auto-attached, defeating the point.

After four months in production, here is what works.

What I tried first: LLM-based detection

Original design (rejected after 2 weeks): pipe the repo's README, package.json, and top-level directory listing into Claude and ask it to classify.

Problems, in order of severity:

Latency. The first run of init took 12-18 seconds instead of <1s. Users perceived this as broken.
Cost. Roughly $0.04 per init. Negligible per user, real money at scale.
Hallucinations. Claude classified a Helm chart for an internal Kubernetes operator as "fintech, because the README mentions billing in the Operator's logging section." It does not. The word "billing" appeared once, describing log volume.
Variance. Same repo, same prompt, two runs: voice-AI then mlops. Probably temperature noise. Not acceptable for a decision that shapes the rest of the pipeline.

Killed it. Went to a regex-based detector. Latency dropped from 15s to 180ms. Cost dropped to $0. Variance dropped to zero.

The trade-off: regex cannot read intent. It reads tokens. A repo that says it does voice AI in its README but actually contains a music-recommender model will get the voice pack. That is a false positive I accept because the alternative (LLM in the loop) had its own false positives and was 80× slower.

The current detector

Three signal layers:

Layer 1 — package.json dependencies. twilio / livekit / deepgram / elevenlabs → voice pack. stripe / plaid / dwolla → fintech. tensorflow / pytorch + transformers → ml-pack (different from voice-pack). And so on for ~80 strong signal tokens.

Layer 2 — file paths. clinical/, fda/, phi/, hipaa/ in directory names → clinical pack. webhook/ + signature-related code → api-platform-pack.

Layer 3 — README + top-level docs grep. Exact-match keywords only, not fuzzy. "AEDT", "automated employment decision", "NYC Local Law 144" → hr-ai pack. "21 CFR Part 11", "SaMD", "FDA pre-submission" → clinical pack.

Each pack has a minimum signal count. voice-pack needs ≥2 of its 11 tokens. fintech needs ≥3 of 14. This is what cut false positives roughly in half.
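A rough sketch of the threshold idea, written in Python for brevity rather than the TypeScript of the real detector. The token sets are only the examples named above (not the full 11 and 14), and counting distinct tokens across all three layers is my assumption about how signals add up; `PACK_SIGNALS` and `detect_packs` are hypothetical names.

```python
import re

# Illustrative subset of signal tokens; the full lists are not shown in this post.
PACK_SIGNALS = {
    "voice-pack": {"tokens": {"twilio", "livekit", "deepgram", "elevenlabs"}, "min_signals": 2},
    "fintech": {"tokens": {"stripe", "plaid", "dwolla"}, "min_signals": 3},
}

def detect_packs(dependencies, paths, readme):
    """Attach a pack only when enough distinct tokens match across the three layers."""
    dep_names = {d.lower() for d in dependencies}                                 # layer 1
    dir_names = {part.lower() for p in paths for part in p.split("/") if part}    # layer 2
    readme_words = set(re.findall(r"[a-z0-9][a-z0-9-]*", readme.lower()))         # layer 3, exact words
    attached = []
    for pack, rule in PACK_SIGNALS.items():
        hits = {t for t in rule["tokens"]
                if t in dep_names or t in dir_names or t in readme_words}
        if len(hits) >= rule["min_signals"]:
            attached.append(pack)
    return attached

# Two voice tokens clear the voice threshold; one fintech token does not clear fintech.
print(detect_packs(["twilio", "deepgram", "stripe"], ["src/"], "Internal call router"))
# ['voice-pack']
```

A single stray token never attaches a pack on its own, which is the property the minimum signal count is meant to buy.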

The false positives I have logged

Across 4 months and ~340 init runs (instrumented from telemetry), 12 confirmed false positives:

repo type	wrongly attached pack	trigger	fix
static-site generator	voice-pack	README explicitly disclaiming Twilio	exact-match keywords only
music-recommender ML	voice-pack	"audio" in package description	removed "audio" as solo trigger
internal Helm chart	fintech	"billing" in operator log section	minimum 3 signals
docs-only repo	clinical	"patient" in user-research subfolder	excluded `docs/` from path scan
game-server prototype	mlops	`torch` in optional dev-dep	only scan `dependencies`, not `devDependencies`
7 others	various	various	each addressed via test case in `tests/detection.test.mjs`

The 12 cases are committed as regression tests. If the detector ever re-introduces one of these false positives, CI fails.
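Two of the fixes in the table are easy to picture as code. The sketch below is a Python illustration, not the contents of `tests/detection.test.mjs`; the function names and the sample manifest are hypothetical.

```python
import json

def runtime_dependencies(package_json_text):
    """Return only the `dependencies` keys, ignoring `devDependencies`."""
    manifest = json.loads(package_json_text)
    return set(manifest.get("dependencies", {}))

def scannable_paths(paths, excluded_roots=("docs",)):
    """Drop paths whose first segment is an excluded root such as docs/."""
    return [p for p in paths if p.split("/", 1)[0] not in excluded_roots]

# Game-server case: `torch` only appears as a dev dependency, so it is not a signal.
game_server = '{"dependencies": {"ws": "^8.0.0"}, "devDependencies": {"torch": "^1.0.0"}}'
print(runtime_dependencies(game_server))   # {'ws'}

# Docs-only case: a user-research folder under docs/ is skipped by the path scan.
print(scannable_paths(["docs/user-research/patient-interviews.md", "src/server.ts"]))
# ['src/server.ts']
```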

The case I worry about: silent false negatives

Easier to log a false positive (user complains "why is this thing telling me about TCPA"). Harder to catch a false negative (user runs init on a repo that should have hr-ai pack attached, doesn't, ships with no bias audit, gets fined two years later).

Mitigations:

/migrate command. Rerun detection with updated rules. New packs (or new keywords for existing packs) get a second chance to attach.
PROJECT.md is editable. The packs: list is plain YAML. User can add manually if detection missed.
Public catalogue. greatcto.systems/companies.html lists 200+ companies and the packs that would auto-attach to each. If a competitor similar to the user's company is in the catalogue, the user gets a sanity check on whether their own detection is correct.
Telemetry on no-pack runs. When init detects zero packs, we log it (anon, opt-in). If a class of project keeps coming through with no pack and the cost-of-miss is high (regulated industry), I add detection rules.

I have not had a confirmed regulatory false negative yet. That is partly because the user population is small (~500 active installs as of writing) and partly because the high-stakes archetypes (clinical, fintech, lending) have strong-signal vocabulary that is hard to miss.

What I will not add

People keep asking for two features I have rejected:

"Pack confidence scores." The detector should output 0-1 confidence per pack so the user can sort. I rejected this: it implies a precision the regex layer does not actually have, and users will treat a 0.6 score as "halfway right" when really it means "one signal matched, probably noise."
"Auto-update detection from telemetry." If we see 10 users with xyz in their repo overriding our detection, automatically add xyz as a fintech signal. Rejected: too easy to poison. One determined attacker registers 10 fake xyz/random-name repos with manual fintech tags and the global detector starts attaching fintech to everyone using xyz.

Both of these are textbook examples of "the obvious feature that becomes a backdoor."

What I might add

LLM in the loop, but only for ambiguous cases. If 2+ packs have signal but below threshold for any one, pipe the README into Claude with a strict "pick one or 'unclear'" prompt. Latency penalty only on the 5-10% of repos that are ambiguous, not all of them.
Per-language detection. Right now everything assumes Node/Python/JVM-ish patterns. Rust and Go projects sometimes have weak signal even when they are clearly fintech or healthcare. Not urgent — those communities are smaller in the user base.
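One way the "LLM only for ambiguous cases" routing could look, as a sketch. I am reading "ambiguous" as: no pack reaches its threshold, but at least two packs have some signal. The `route` function and its return labels are hypothetical, and this feature does not exist yet.

```python
def route(signal_counts, thresholds):
    """Decide whether to attach packs, ask an LLM, or attach nothing."""
    attached = [p for p, n in signal_counts.items() if n >= thresholds[p]]
    if attached:
        return ("attach", attached)       # regex layer is confident, no LLM call
    weak = [p for p, n in signal_counts.items() if n > 0]
    if len(weak) >= 2:
        return ("ask_llm", weak)          # ambiguous: several weak signals, none decisive
    return ("none", [])                   # zero or one weak signal is treated as noise

thresholds = {"voice-pack": 2, "fintech": 3}
print(route({"voice-pack": 1, "fintech": 2}, thresholds))  # ('ask_llm', ['voice-pack', 'fintech'])
print(route({"voice-pack": 2, "fintech": 0}, thresholds))  # ('attach', ['voice-pack'])
```

Only the `ask_llm` branch would pay the latency penalty, so the common path stays deterministic.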

The detection logic is small, boring, and one of the parts of the system I am most defensive of. It is the first thing every user sees, and a wrong first guess loses them.

About: I build GreatCTO — a multi-agent SDLC plugin for Claude Code. MIT, runs locally. The detector source is in packages/cli/src/detect.ts — read or fork.

Why your agent system fails: missing gates, not missing intelligence

Alexander Velikiy — Sun, 17 May 2026 14:27:33 +0000

A senior CTO emailed me last month: "We rolled out Devin across two teams. After three weeks the agents had merged 47 PRs. Three of them broke prod. Two contained a credential in the commit. One disabled rate limiting because the test fixtures didn't pass with rate limiting on. We're rolling back."

Everyone with eyes on agentic coding has heard a version of this story. The most common diagnosis is "the model isn't good enough yet." Reasonable on the surface. Wrong as a diagnosis.

I've spent the last 4 months building a multi-agent SDLC layer on top of Claude Code. 34 specialist agents, 25 archetype overlays, two human gates per feature. The clearest finding from this work: the failures CTOs describe almost never trace to bad code generation. They trace to missing gates.

This article walks through why, and shows the state machine I think every agentic SDLC needs.

The problem with how everyone does it

The default architecture for agentic coding is one autonomous loop:

loop:
  llm.generate(task, context)
  apply(diff)
  run_tests()
  if pass: commit
  else: revise

This is fine for prototypes. It is a disaster for shipped code. Three reasons.

1. Tests aren't enough. Tests verify correctness against assertions you wrote. They do not verify: "is this PCI-DSS scope appropriate", "does this respect TCPA recording consent", "did we just add a hidden N+1 query", "is this idempotent under retry storms". You need humans, or specialist reviewers that act like humans, for each of those.

2. One agent can't review itself. Even if you ask GPT-4 or Claude Opus to review its own output, the same biases that wrote the bug are reading the diff. We have decades of evidence from code review at Google, Microsoft, and Apache that independent reviewers catch ~3× more defects than authors. Independence requires separation. Agents aren't different.

3. Speed compounds errors. When the loop runs unattended, errors accumulate quietly between human checkpoints. By the time a human sees the work, the agent has rebuilt on top of three earlier mistakes. You can't fix the lowest-level mistake without unwinding everything above it.

The pattern that keeps emerging across teams that ship agentic systems successfully is explicit gates + specialist reviewers, not bigger models.

What this article will show

The 8-stage state machine I think every agentic SDLC needs
Why two human gates per feature is the sweet spot
The parallel implementer + parallel reviewer pattern
How memory feedback closes the loop (the "94% MTTR" claim, with caveats)
What this all costs (~$2 per small feature, with receipts)

The state machine

The full pipeline, as a deterministic state machine:

flowchart TD
    Init["$ init"] --> Detect["archetype-detect"]
    Detect --> Architect["architect (ARCH.md)"]
    Architect --> GatePlan{"⚐ gate: plan"}
    GatePlan -->|human approve| PM["pm (decompose)"]
    PM --> Impl["senior-dev × N (parallel)"]
    Impl --> Review["specialist review × 5 (parallel)"]
    Review --> GateShip{"⚐ gate: ship"}
    GateShip -->|human approve| Deploy["devops"]
    Deploy --> Operate["l3-support"]
    Operate -.->|incident pattern| Learner["continuous-learner"]
    Learner -.->|inject lesson| Architect

The two diamond nodes are human gates. Everything else runs unattended.
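The main path of the diagram can be sketched as a transition table in which only the two gate stages wait for a human. This is an illustration of the shape, not GreatCTO's implementation; the `Stage` enum, `advance`, and `run_until_blocked` are hypothetical names, and the dashed learner loop is left out.

```python
from enum import Enum

class Stage(Enum):
    INIT = "init"
    DETECT = "archetype-detect"
    ARCHITECT = "architect"
    GATE_PLAN = "gate: plan"
    PM = "pm"
    IMPLEMENT = "senior-dev"
    REVIEW = "specialist review"
    GATE_SHIP = "gate: ship"
    DEPLOY = "devops"
    OPERATE = "l3-support"

# Each stage has exactly one successor; OPERATE is terminal in this sketch.
NEXT = {
    Stage.INIT: Stage.DETECT,
    Stage.DETECT: Stage.ARCHITECT,
    Stage.ARCHITECT: Stage.GATE_PLAN,
    Stage.GATE_PLAN: Stage.PM,
    Stage.PM: Stage.IMPLEMENT,
    Stage.IMPLEMENT: Stage.REVIEW,
    Stage.REVIEW: Stage.GATE_SHIP,
    Stage.GATE_SHIP: Stage.DEPLOY,
    Stage.DEPLOY: Stage.OPERATE,
}
HUMAN_GATES = {Stage.GATE_PLAN, Stage.GATE_SHIP}

def advance(stage, human_approved=False):
    """Move one step; a gate holds its position until a human approves."""
    if stage in HUMAN_GATES and not human_approved:
        return stage
    return NEXT.get(stage, stage)

def run_until_blocked(stage):
    """Run unattended until the next human gate or the terminal stage."""
    while stage not in HUMAN_GATES and stage in NEXT:
        stage = NEXT[stage]
    return stage

print(run_until_blocked(Stage.INIT).value)  # gate: plan
```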

A few things to notice:

Parallelism is structural, not accidental. At the implement stage, independent tasks run in isolated git worktrees. At the review stage, 5 reviewers run concurrently because they look at different aspects (QA, security, performance, archetype-specific compliance, 12-angle code review).

The memory loop is dashed. It's an out-of-band feedback path. When a P0 incident resolves, the continuous-learner agent extracts the detection pattern and writes it to ~/.great_cto/lessons.md. Next time a similar incident shape hits, the agent's Step 0 includes the prior detection order. This is where the MTTR savings come from.

Specialists run only when archetype matches. The 34 agents in the pool aren't all firing every time. For a typical fintech feature, only 7 run: architect, pm, 2× senior-dev, qa-engineer, security-officer (PCI focus), code-reviewer. The voice-AI reviewer doesn't load because the archetype isn't voice-AI.

Two gates, not seven

The hardest design question is: how many human gates?

I started with seven: plan, design-review, security-review, qa-review, performance-review, compliance-review, ship. The complaint from every early user was: "this is just the human checkpoint problem from waterfall, but worse, because now I'm reviewing AI outputs."

Down to two. Specifically:

Gate 1: plan. You approve the ARCH note + cost estimate + task decomposition before any code is written. This is the cheapest decision in the pipeline — if scope is wrong, fixing it now is free. If you approve it, you've committed to "ship this if implementation passes."

Gate 2: ship. You see the full review panel — 5 verdicts, with rationale and diff per reviewer. APPROVED chips and BLOCKED chips. You either approve, or push back on a specific reviewer.

Everything in between is the agents' problem. If they disagree with each other, the gate fails and surfaces the disagreement explicitly.
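A small sketch of how the ship gate might aggregate reviewer verdicts so that a split decision stays visible. The verdict shape (status plus rationale) follows the description above; the function name `ship_gate` and the output keys are hypothetical.

```python
def ship_gate(verdicts):
    """verdicts maps reviewer name -> (status, rationale), status is APPROVED or BLOCKED."""
    blocked = {name: why for name, (status, why) in verdicts.items() if status == "BLOCKED"}
    approved = [name for name, (status, _) in verdicts.items() if status == "APPROVED"]
    return {
        "ready_for_human_approval": not blocked,            # any BLOCKED verdict fails the gate
        "disagreement": bool(blocked) and bool(approved),   # reviewers split: show both sides
        "approved": approved,
        "blocked": blocked,
    }

panel = {
    "qa-engineer": ("APPROVED", "17 tests pass"),
    "security-officer": ("BLOCKED", "secret read outside vault helper"),
}
result = ship_gate(panel)
print(result["ready_for_human_approval"], result["disagreement"])  # False True
```

Nothing is averaged or voted away: the human sees who blocked and why, and pushes back on that specific reviewer.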

Why this specific shape:

Gate 1 controls scope. You decide what gets built.
Gate 2 controls quality. You decide whether the agents got it right.

You don't decide how in between. The agents do. If you're making more than 2 decisions per feature, you're a bottleneck — and the whole pipeline collapses to your reading speed.

This is the part that most agentic systems get wrong. They either show you everything (and you can't keep up), or they show you nothing (and you wake up to broken prod). Two well-chosen gates is the sweet spot.

The memory loop is the real moat

Most agentic coding tools have no memory. They start each session from zero. This is fine for syntax errors and dead code. It is bad for the kind of bugs that recur with different surface signatures.

Real example. Q1 of this year I hit a Postgres connection pool exhaustion during a burst load. The log said Connection refused. Looked like a network issue. Spent 4 hours unwinding network config before finally checking pg_stat_activity and seeing that the pool was at its cap. Q3, same shape hits in a different project (different framework, different stack). Pattern hash matches. Agent's Step 0 includes the prior detection order. 28 minutes to resolution.

This is not the agent being smarter. It's the agent skipping hypothesis exploration time.

Across 47 paired P0 incidents in 12 repositories (full methodology and 4 honest memory-miss cases published here), the median MTTR reduction was 94.1%. The mean was 92.6%, pulled below the median by a few weaker cases. Not an RCT. Observational. Caveats are listed in the methodology.

The mechanism is simple. The agent stores: (pattern_hash, detection_order_that_worked, rationale). On a match, it tries the winning detection first. If that's wrong (4 of 47 cases were misses), it falls back to systematic exploration. No worse than baseline.
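A minimal sketch of that lookup-then-fallback behaviour. How the real pattern hash is computed is not described here, so hashing a sorted, lowercased set of symptom strings is purely my assumption; `IncidentMemory` and its methods are hypothetical names.

```python
import hashlib

def pattern_hash(symptoms):
    """Hash a normalized, order-independent set of symptom strings (assumed scheme)."""
    normalized = "|".join(sorted(s.strip().lower() for s in symptoms))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]

class IncidentMemory:
    def __init__(self):
        self._lessons = {}  # pattern_hash -> (detection_order_that_worked, rationale)

    def record(self, symptoms, detection_order, rationale):
        self._lessons[pattern_hash(symptoms)] = (list(detection_order), rationale)

    def detection_plan(self, symptoms, baseline_order):
        """Try the remembered order first, then the rest of the baseline as fallback."""
        lesson = self._lessons.get(pattern_hash(symptoms))
        if lesson is None:
            return list(baseline_order)   # no match: plain systematic exploration
        remembered, _rationale = lesson
        return remembered + [step for step in baseline_order if step not in remembered]

memory = IncidentMemory()
memory.record(["Connection refused", "burst load"],
              ["check pg_stat_activity"], "pool size was the cap")
baseline = ["check network config", "check pg_stat_activity", "check dns"]
print(memory.detection_plan(["burst load", "connection refused"], baseline))
# ['check pg_stat_activity', 'check network config', 'check dns']
```

Because the baseline steps are still appended, a wrong memory hit costs one extra check and then degrades to ordinary exploration.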

What makes the memory layer work is that it's local, file-backed, and git-trackable. Not a vector DB. Not a cloud service. Plain markdown in .great_cto/lessons.md (per-project) and ~/.great_cto/decisions.md (cross-project). You can read it, edit it, version-control it.

Edge cases worth knowing

A few things that surprised me during the build:

Agent count doesn't matter much. I shipped 12 agents, then 24, then 34. The marginal value of adding the 35th agent is small. What matters is coverage of distinct review angles. After 12, you mostly add archetype-specific compliance reviewers, and each one is opt-in based on archetype detection.

Disagreement between reviewers is a feature, not a bug. When security-officer blocks a PR that qa-engineer approves, you want this visible at the gate, not papered over. The state machine surfaces both verdicts.

Cost is dominated by output tokens. A typical feature: $3.40 in LLM calls. ~80% is in the agents that write (senior-devs, architect). The reviewers are cheap because they output verdicts, not code. If costs balloon, look at how much code is being generated, not how many agents.

Auto-approve flag is the slippery slope. I considered an --auto-approve flag for trivial features. Killed it. The minute you have that flag, the cycle that produces broken prod starts. The two gates are load-bearing.

Where this fits

The thesis isn't "you need this specific tool." It's that any agentic SDLC needs explicit state, explicit gates, and a memory loop. Without them, you're shipping a faster version of the agent system that already burned the teams I mentioned at the top.

If you want to inspect the exact state machine, the live SVG with every node clickable to its source on GitHub is here. A real shipped feature, walked stage by stage with artifacts and costs, is here.

TL;DR

Agentic coding failures trace to missing gates, not bad models.
The pattern that ships safely is 2 human gates + parallel implementers + parallel reviewers + memory loop.
"Bigger model" is rarely the right answer. "More specialist review angles" usually is.
Cost per shipped feature on this architecture: $1–4 in LLM, ~45 min wall-clock, 2 human clicks.
Memory is the difference between "fast at one-off code generation" and "improves over time at your codebase's recurring bugs".

About: I build GreatCTO — a multi-agent SDLC plugin for Claude Code. MIT, runs locally. Twitter: @avelikiy. GitHub: @avelikiy/great_cto.

How I designed the SDLC state machine for agentic coding

Alexander Velikiy — Sun, 17 May 2026 14:25:08 +0000

The architecture page on the site shows the state machine as a clickable diagram. This post walks through the design decisions that produced that shape, including the alternatives I tried and abandoned.

The pipeline, in one diagram

$ init
  └─ archetype-detect
       └─ architect          (ARCH.md)
            └─ ⚐ gate: plan         ← human approval #1
                 └─ pm              (decompose, parallelism plan)
                      └─ senior-dev × N   (parallel git worktrees)
                           └─ specialist review × 5  (parallel)
                                └─ ⚐ gate: ship       ← human approval #2
                                     └─ devops        (merge, deploy)
                                          └─ l3-support
                                               ↘ (incident pattern)
                                                  continuous-learner
                                                       ↗ (inject lesson into next architect run)

Eight runtime stages. Two human gates. One out-of-band feedback path (the dashed loop through continuous-learner). Every node maps to a single specialist agent.

Decision 1: number of human gates

I started with seven gates: plan, design-review, security-review, qa-review, performance-review, compliance-review, ship.

Every early user complained: "this is just waterfall, but now I am reviewing AI outputs instead of writing code." Each gate added 5-15 minutes of human reading. Seven gates × 10 minutes = 70 minutes of attention per feature. Worse than manual.

Two gates works because they are at the right scope-vs-quality break points:

gate: plan controls scope. You decide what gets built before any code is written. This is the cheapest possible decision in the pipeline — if scope is wrong, fixing it costs zero.
gate: ship controls quality. You see all 5 reviewer verdicts at once, with rationale per reviewer. You either approve, or push back on a specific reviewer (which the system re-runs).

Everything in between is the agents' problem. If reviewers disagree with each other, the gate surfaces the disagreement explicitly rather than papering over it. If they all agree, you approve quickly.

Three is not better than two. I tried three (added "design-review" between plan and pm) for a month. The middle gate added 8-12 minutes of reading time, and in practice approved 47/47 features unchanged. Removed.

Decision 2: parallelism is structural, not opportunistic

The pm step does not just decompose tasks. It explicitly tags tasks as [parallel] or [serial], schedules them into a DAG, and assigns each parallel task its own git worktree.

This matters because the temptation in an LLM pipeline is to serialize everything ("agents think too fast, let's not race them"). That's wrong. Modern agentic coding tools (Claude Code in particular) handle parallel worktrees cleanly. The bottleneck is human attention, not compute. Running 4 senior-devs in parallel for 38 minutes is the same total LLM cost as serial, but cuts wall-clock by 3-4×.

Reviewers are also parallel — 5 specialist reviewers run concurrently against the merged diff, each looking at a different aspect (QA, security, performance, archetype-specific compliance, code quality across 12 angles). They do not block each other.

The only serial steps are architect (must precede pm) and devops (must follow approval).
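The scheduling step can be sketched with Python's standard-library `graphlib`: tasks with no unmet dependencies form a batch that can run in parallel, each in its own worktree. The task names are hypothetical and the worktree creation itself is left out.

```python
from graphlib import TopologicalSorter

def schedule_batches(dependencies):
    """dependencies maps task -> set of tasks it waits for. Returns parallel batches in order."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()                      # raises CycleError if the plan is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose prerequisites are done
        batches.append(ready)
        sorter.done(*ready)
    return batches

# Hypothetical decomposition: two middle tasks are independent of each other.
plan = {
    "migration": set(),
    "handler": {"migration"},
    "retry-log": {"migration"},
    "integration-tests": {"handler", "retry-log"},
}
print(schedule_batches(plan))
# [['migration'], ['handler', 'retry-log'], ['integration-tests']]
```

Any batch with more than one entry corresponds to tasks the pm would tag [parallel]; single-entry batches are the [serial] ones.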

Decision 3: memory is per-project, per-org, and cross-project

The dashed loop in the diagram is the part most agentic coding tools skip. It is the part that compounds.

Four layers:

Per-session. Conversation history. Disappears at session end. Cheap.
Per-project (.great_cto/lessons.md). Decisions, rejected approaches, incident detection patterns. Survives session restarts. Git-trackable.
Per-org (~/.great_cto/decisions.md). Patterns confirmed across ≥3 projects. Promoted from per-project after manual review. Used as Step 0 context for architect.
Cross-project (incident patterns). Pattern hash + winning detection order, stored when a P0 resolves. Next time a similar incident shape hits in a different project, the agent's Step 0 includes the prior detection order.
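The promotion rule from per-project to per-org can be sketched as a simple count. The function only produces candidates, since the text says promotion happens after manual review; the lesson strings and the name `promotion_candidates` are hypothetical.

```python
from collections import defaultdict

def promotion_candidates(project_lessons, min_projects=3):
    """Return lessons seen in at least `min_projects` distinct projects, for manual review."""
    seen_in = defaultdict(set)
    for project, lessons in project_lessons.items():
        for lesson in lessons:
            seen_in[lesson].add(project)
    return sorted(lesson for lesson, projects in seen_in.items()
                  if len(projects) >= min_projects)

lessons_by_project = {
    "billing-api": ["index foreign keys", "verify webhook signatures"],
    "voice-bot": ["index foreign keys"],
    "intake-portal": ["index foreign keys", "verify webhook signatures"],
}
print(promotion_candidates(lessons_by_project))  # ['index foreign keys']
```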

This last layer is where the MTTR -94% claim comes from. It is not the agent being smarter; it is the agent skipping hypothesis exploration time because someone already paid for that exploration.

I tried a vector-DB-backed memory layer for two weeks. Abandoned. The cognitive overhead of a "search before you write" step in every agent prompt was worse than just listing 3-5 recent lessons in a markdown file and trusting the LLM's context window. Plain text + git history is the moat, not embeddings.

Decision 4: archetype-specific agents are opt-in

The 34 specialist agents in the pool are not all firing every time. For a typical fintech feature, only 7 run: architect, pm, 2× senior-dev, qa-engineer, security-officer (PCI focus), code-reviewer. The voice-AI reviewer does not load because the archetype is not voice.

This is more important than it sounds. Early versions ran "all reviewers always." Cost ballooned because every reviewer wants context (the diff + ARCH.md + project README) and most ran for 0 useful output. Now: archetype detection at init picks the relevant 5-7, no extras.

The detection is signal-based: regex matches in package.json, README, infra/. False positives happen (the static-site-generator that got TCPA threat-modeled) but auto-attach is reversible — PROJECT.md lists the active packs and you can remove one with a line edit.

What the state machine does not do

It does not write product specs. You bring the prompt.
It does not negotiate with stakeholders. The architect will write the ARCH note, but the conversation with your VP of Eng about "should we even build this" is yours.
It does not catch all bugs. The qa-engineer writes tests against the spec it inferred from the architect's ARCH note. If you mis-spec'd the feature at gate:plan, the tests pass and the bug ships. Two gates is not zero — it is two.
It does not learn from your specific code style without per-project lessons. First 3-5 features in a new repo, you will see suboptimal naming and inconsistent patterns. By feature 10, lessons.md has enough to lock the style.

Where this fits

This is the architecture I argue every agentic SDLC needs. Not this specific tool. If you are building your own, the shape that works is: explicit state, two gates, parallel specialists, memory loop. Anything looser breaks at scale.

The clickable version of this diagram, with every node linking to its agent's source on GitHub, is at greatcto.systems/architecture.

About: I build GreatCTO — a multi-agent SDLC plugin for Claude Code. MIT, runs locally. Twitter: @avelikiy.

First real shipped feature with this stack — receipts

Alexander Velikiy — Sun, 17 May 2026 14:25:07 +0000

I get asked for receipts on every cost number I publish. So here is one full run, end to end, with screenshots replaced by file paths you can read on GitHub.

The feature is real. It shipped to a real fintech codebase in March 2026. The prompt, agent outputs, costs, and gate decisions are all recorded in tasks/W14-stripe-webhook-retry in that project (I am not linking the client repo, but I am willing to share artefacts on request).

The feature

"Add idempotent retry handling to our Stripe webhook receiver. If Stripe re-delivers an event (network blip, our 5xx response, manual replay), we should not double-process. PCI scope must stay SAQ-A."

Two sentences. No requirements document. This is roughly the level of brief I get from the team's PM most weeks.

What ran

$ npx great-cto init
…detected archetype: fintech (PCI-DSS)
…attached pack: api-platform-pack
…attached pack: pci-pack

Pipeline rolled forward unattended through:

archetype-detector — 12 seconds. Scanned package.json, infra/, README. Flagged Stripe webhook handler, BNPL flow, three PII columns in users table.
architect — 4 minutes. Wrote ARCH.md with: idempotency key strategy (deterministic from event.id), 24-hour retention window for processed-event log, exit criteria for "duplicate", interaction with existing audit log.
Gate: plan — I read the ARCH note for ~3 minutes, asked one clarifying question about the retention window ("why 24h, not the 7d that Stripe retries for"), agent updated to 7 days, I approved.
pm decomposed into 4 tasks, scheduled 2 in parallel.
senior-dev × 2 — 38 minutes, parallel git worktrees. Output: 6-file diff, +287 / −41 lines.
qa-engineer — 11 minutes. Wrote 17 new tests, including a property-based test for replay ordering. Coverage on touched code: 94%.
pci-reviewer (auto-attached by fintech archetype) — 8 minutes. Verified no card data hits new code path; idempotency log table excluded from CHD scope.
api-platform-reviewer — 7 minutes. Checked webhook signature replay window (5 min skew, ok), idempotency key collision math, Sunset header (n/a, internal endpoint).
security-officer — 4 minutes. Verified no new secret access patterns, audit trail covers retry path.
code-reviewer (12-angle) — 6 minutes. Three minor refactor suggestions, all accepted.
Gate: ship — I saw 5 reviewer cards, scrolled through 2, approved.
devops opened the PR. Branch CI was already green. Merged.
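The idempotency strategy from the architect step (a key derived from event.id, with a 7-day retention window after the gate review) can be sketched as follows. This is not the client's code: it uses an in-memory dict where the real feature uses a processed-event log table, and `ProcessedEventLog` and `handle_event` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)  # window agreed at gate: plan

class ProcessedEventLog:
    """In-memory stand-in for the processed-event log table."""
    def __init__(self):
        self._seen = {}  # event id -> time it was first processed

    def seen(self, event_id, now):
        cutoff = now - RETENTION
        # Drop entries older than the retention window before checking.
        self._seen = {eid: ts for eid, ts in self._seen.items() if ts > cutoff}
        return event_id in self._seen

    def mark(self, event_id, now):
        self._seen[event_id] = now

def handle_event(log, event, apply_side_effect, now=None):
    """Process an event once; a re-delivery inside the window is a no-op."""
    now = now or datetime.now(timezone.utc)
    if log.seen(event["id"], now):
        return "duplicate"
    apply_side_effect(event)
    log.mark(event["id"], now)  # recorded only after the side effect succeeded
    return "processed"

log, applied = ProcessedEventLog(), []
t0 = datetime(2026, 3, 1, tzinfo=timezone.utc)
event = {"id": "evt_123"}
print(handle_event(log, event, applied.append, now=t0))                       # processed
print(handle_event(log, event, applied.append, now=t0 + timedelta(hours=1)))  # duplicate
```

In a real receiver the check and the mark would need to be atomic, for example through a unique constraint on the event id, so that two concurrent deliveries cannot both pass the check.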

The receipts

Total wall-clock: 1h 26m. That includes the ~7 minutes I spent reading two artefacts and approving two gates. The agents themselves did about 78 minutes of work, mostly in parallel.

Total LLM cost: $1.42. Breakdown by agent (rounded):

architect          $0.34
pm                 $0.04
senior-dev × 2     $0.62
qa-engineer        $0.18
pci-reviewer       $0.09
api-platform-rev   $0.07
security-officer   $0.05
code-reviewer      $0.03

senior-dev is the cost driver (it writes the code). Reviewers are cheap (they output verdicts, not code). The whole thing fit inside the free monthly credit on Anthropic for the team account.

What I would have done manually

About 4-5 hours of senior backend work for the code, plus 1-2 hours of PCI review (we have an internal expert) before merge. Call it $700-$900 in fully-loaded engineering time. So the cost ratio is roughly 500× cheaper, the time ratio roughly 4× faster, with the same level of review rigour.

That is not "AI replaces the engineer." It is "AI does the mechanical 80% so the engineer spends 7 minutes on the part where judgment matters." The clarifying question I asked at gate:plan (retention window) is the kind of thing the agent would not have caught on its own.
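The receipts can be re-added from the per-agent breakdown. The sketch below works in integer cents to avoid floating-point noise; taking the manual estimate as 5-7 hours total (4-5 of coding plus 1-2 of PCI review) is my reading of the figures above.

```python
# Per-agent LLM cost in cents, copied from the breakdown above.
costs_cents = {
    "architect": 34, "pm": 4, "senior-dev x2": 62, "qa-engineer": 18,
    "pci-reviewer": 9, "api-platform-rev": 7, "security-officer": 5, "code-reviewer": 3,
}
total_cents = sum(costs_cents.values())
writer_share = (costs_cents["architect"] + costs_cents["senior-dev x2"]) / total_cents

# Manual estimate: $700-$900 of engineering time, 5-7 hours end to end.
cost_ratio_low = 700 * 100 / total_cents
cost_ratio_high = 900 * 100 / total_cents
pipeline_minutes = 86  # 1h 26m wall-clock
time_ratio_low = 5 * 60 / pipeline_minutes
time_ratio_high = 7 * 60 / pipeline_minutes

print(f"total: ${total_cents / 100:.2f}")                                   # total: $1.42
print(f"writing agents: {writer_share:.0%} of spend")                       # writing agents: 68% of spend
print(f"cost ratio: {cost_ratio_low:.0f}x to {cost_ratio_high:.0f}x")       # cost ratio: 493x to 634x
print(f"time ratio: {time_ratio_low:.1f}x to {time_ratio_high:.1f}x")       # time ratio: 3.5x to 4.9x
```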

⚠ Honest caveats

This was a small, well-scoped feature with clear PCI implications already known by the team. For greenfield features in unfamiliar regulatory territory, expect 2-3× longer wall-clock and a chunkier ARCH approval cycle.
The 78-min agent runtime ran on Claude Sonnet 4.6. On Haiku it would be ~30% faster and ~3× cheaper, with measurably weaker architect output (we tried).
One reviewer agent (the data-platform-reviewer) was opted out because we do not warehouse webhook events. If you do, that adds ~10 minutes and ~$0.10.
The team's existing test suite was already in good shape. On a codebase with poor test infra, qa-engineer would either spend longer building scaffolding or punt — and you would notice at gate:ship.

What this proves and what it doesn't

It proves the pipeline works on a feature you would have hand-built. It does not prove the pipeline can run your whole engineering org. We have not tried that and I do not recommend it. Two human gates per feature is the sweet spot for "responsible automation"; more gates means slower, fewer means broken prod (see my earlier post on missing gates for why two is the right number).

If you want the same level of detail for a non-fintech archetype — say a voice-AI MVP or a clinical decision support tool — DM me on Twitter and I will publish another one.

About: I build GreatCTO — a multi-agent SDLC plugin for Claude Code. MIT, runs locally. Pay your own LLM API. The full architecture diagram, with every node mapped to its source on GitHub, is here.

What $1.4M of compliance work looks like in 14 hours – ten packs, ten regulated industries

Alexander Velikiy — Sat, 16 May 2026 12:51:54 +0000

Startups have often reached out to me with versions of the same problem. Their engineering teams could ship a regulated feature in three days. The compliance setup around it – PCI scope-reduction analysis, SAQ-A vs SAQ-D paperwork, consent-flow rewrite to stay out of cardholder-data territory, legal review – was taking six weeks and costing each client about $42,000 in fees.

Three days of code. Six weeks of compliance. That's the ratio I kept seeing.

This isn't a complaint about lawyers or compliance – those people are doing real work that prevents real disasters. It's a complaint about how un-automated the work is. Most of those six weeks at each client were spent on questions that already had answers: which SAQ form applies, what an idempotency-key contract looks like under PCI, whether a replayed signed webhook can re-trigger the refund flow. The answers exist. They're just not in any client's codebase and they're not in any client's CI.

That's what I've been building for them. Ten compliance "packs" for GreatCTO, one per regulated industry, that wire the gates the day the team scaffolds a project – not the day the auditor lands.

Below: what each pack does, what problem it solves, how long the same work takes humans, and (importantly) what the pack does not replace.

How a pack works, in one paragraph

A pack is a small overlay that attaches when GreatCTO detects industry signals in your repo (twilio in package.json → voice; clinical in README → healthcare; RBI mentioned anywhere → emerging-markets fintech). When it attaches, it adds one to four specialist reviewer agents, a threat-model template that gets filled out before code is written, two to four named human gates, and a list of evidence artifacts that must exist before you can ship. Every gate decision is logged with rationale. The point isn't to replace the human signature – it's to deliver the auditor a tidy evidence package instead of a panicked Slack thread.
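To make the gate idea concrete, here is a minimal sketch of a named human gate that passes only when its evidence artifacts exist and a human has approved, and that logs every decision with its rationale. The `Gate` class, its fields, and the artifact path are hypothetical illustrations, not GreatCTO's actual implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class Gate:
    """Hypothetical named human gate backed by evidence artifacts."""
    name: str
    required_artifacts: list[str]
    log: list[dict] = field(default_factory=list)

    def missing(self, root: Path) -> list[str]:
        # Evidence artifacts that do not exist yet under the repo root.
        return [a for a in self.required_artifacts if not (root / a).exists()]

    def decide(self, root: Path, approver: str, approved: bool, rationale: str) -> bool:
        missing = self.missing(root)
        # The gate passes only if a human approved AND all evidence is present.
        passed = approved and not missing
        # Every decision is logged, whether it passed or not.
        self.log.append({
            "gate": self.name,
            "approver": approver,
            "approved": approved,
            "passed": passed,
            "missing": missing,
            "rationale": rationale,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return passed
```

The log entries are what an auditor would read later, so a rejected or blocked decision is recorded with the same detail as an approval.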

(Quick confession on the detection: the first version did fuzzy substring matching, and voice-pack happily triggered on a static-site-generator repo whose README said "we explicitly do not use Twilio." Spent an hour wondering why a blog generator was getting a TCPA threat model. Exact-match keywords only, since then.)
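A rough sketch of what exact-match detection could look like, assuming dependency names are matched exactly against package.json and README keywords are matched as whole words. The `PACK_SIGNALS` table and `detect_packs` function are hypothetical, the real detector may work differently, and for brevity this version only looks at the README rather than "anywhere" in the repo.

```python
import json
import re

# Hypothetical signal table; keywords mirror the examples in the text.
PACK_SIGNALS = {
    "voice-pack": {"deps": {"twilio", "livekit"}, "readme_words": set()},
    "clinical-pack": {"deps": set(), "readme_words": {"clinical"}},
    "em-fintech-pack": {"deps": set(), "readme_words": {"rbi"}},
}


def detect_packs(package_json: str, readme: str) -> set[str]:
    """Return the packs whose signals match exactly (no substring matching)."""
    manifest = json.loads(package_json)
    deps = set(manifest.get("dependencies", {})) | set(manifest.get("devDependencies", {}))
    # Tokenise the README so "rbi" cannot match inside a longer word.
    words = set(re.findall(r"[a-z0-9]+", readme.lower()))
    matched = set()
    for pack, signals in PACK_SIGNALS.items():
        if signals["deps"] & deps or signals["readme_words"] & words:
            matched.add(pack)
    return matched
```

In this sketch the voice signal is a declared dependency, so a README that merely mentions Twilio does not attach the pack.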

OK, the ten packs.

1. voice-pack – for anyone whose product makes phone calls

The problem. TCPA (US federal), state recording-consent laws (nine states require all-party consent), STIR/SHAKEN caller-ID rules, E911, CVAA accessibility, PCI redaction if you ever capture a card number on a call. These are among eleven separate compliance surfaces, and a founder shipping a "voice agent" MVP touches at least seven of them by accident.

What the pack does. Detects voice signals (twilio, livekit, deepgram, elevenlabs, IVR), attaches the voice-ai-reviewer agent, generates a threat-model document that lists which states' consent rules apply to your call routing, wires gate:voice-compliance so no code ships before that threat model is signed. Ships four eval fixtures that automatically test for: prompt-injection from caller speech, synthesized-voice disclosure, identity drift across call handoffs, PII leakage in transcripts.

Human-equivalent work. I priced this out properly the last time I did it the manual way: ~80 hours of communications-law consulting ($28K at $350/h) plus ~120 hours of senior backend work to wire the consent logging properly. Six weeks of calendar time before launch. The pack does the equivalent threat-modeling + gate wiring in ~45 minutes of LLM time.

Embarrassing detail to call out up front: the first voice-pack threat-model template was 800 words long and read like a law-school essay. Three different auditors politely asked if I could make it shorter. v2 is 200 words and now they cite it back to me. Compliance auditors, it turns out, prefer one-page documents to thirty-page ones.

Who's working in this niche (37 startups in our catalogue): Bolna, Vapi, Retell AI, Sierra, Cresta, Phonely, Sonia, careCycle, Assort Health, Uplift AI.

2. clinical-pack – for medical AI

The problem. "We're decision support, not a medical device" is the most expensive sentence in healthcare AI. The FDA's CDS guidance is narrower than founders read it. The minute your model outputs anything that influences a clinical decision without leaving the four-corners test intact, you're a Software as a Medical Device (SaMD) and you owe the FDA a pre-market submission. Get the SaMD class wrong and you're shipping 18 months later than planned.

What the pack does. Two reviewers: ai-clinical-reviewer for the clinical-judgment surface, fda-reviewer for SaMD classification. The pack walks your repo through the SaMD four-corners test, classifies as Class A/B/C/D, generates the right pre-submission paperwork outline (510(k) / De Novo / PMA), and wires gate:samd-class as the gating decision. Also covers GMLP-10, post-market surveillance plan, EU MDR/IVDR if you ship in Europe, HIPAA technical safeguards.

Human-equivalent work. SaMD regulatory consulting runs $400–600/h, and the classification analysis alone is typically a 60–80 hour engagement (~$36K). Then 120+ hours of dev work building the Part 11 audit trail. Three months of work before your first FDA conversation. Pack runs the classification + evidence framework in ~60 minutes.

Who's working in this niche (40 startups): Assort Health, careCycle, Sonia, Knowtex, mdhub, FutureClinic, Ora AI, Empirical Health, AOA Dx, Nest Genomics.

3. hr-ai-pack – for AI hiring tools, screeners, interviewers, performance review

The problem. NYC Local Law 144 has been in effect since July 2023. Any AEDT (automated employment decision tool) used on a NYC resident requires an independent bias-audit posted publicly. Fines run up to $1,500 per candidate per day counting from the day the AEDT first ran. The EEOC's iTutorGroup settlement ($365K, age-discrimination AI hiring) signaled enforcement intent. EU AI Act puts employment in Annex III (high-risk). And nobody on your team is the right person to design the bias audit.

What the pack does. Detects HR signals (recruit, hiring, candidate, resume, interview, ATS, talent). Attaches hr-ai-reviewer, wires gate:aedt-audit as the annual checkpoint, generates the data-collection scaffolding the third-party auditor needs (selection rates by demographic, four-fifths-rule analysis, model card with feature lineage), and ships eval fixtures for prompt-injection-via-resume, candidate consent flow integrity, model-card transparency.
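For readers who have not met the four-fifths rule: each group's selection rate is divided by the highest group's selection rate, and a ratio below 0.8 is the conventional flag for possible adverse impact. A minimal sketch follows; the function names are hypothetical and the numbers in the usage note are made up for illustration.

```python
def impact_ratios(selected: dict[str, int], applicants: dict[str, int]) -> dict[str, float]:
    """Selection rate of each group divided by the highest group's rate."""
    rates = {g: selected.get(g, 0) / n for g, n in applicants.items() if n > 0}
    top = max(rates.values())
    if top == 0:
        raise ValueError("no selections in any group; impact ratio is undefined")
    return {g: rate / top for g, rate in rates.items()}


def four_fifths_flags(ratios: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Groups whose impact ratio falls below the four-fifths threshold."""
    return sorted(g for g, r in ratios.items() if r < threshold)
```

With invented data, 50 of 100 applicants selected in group A and 28 of 80 in group B, the rates are 0.50 and 0.35, so group B's impact ratio is 0.70 and it would be flagged.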

Human-equivalent work. External bias-audit engagement: $15–50K depending on auditor (we use third parties for the signoff – pack does not replace them). Internal prep: ~80 hours of someone reading the LL 144 spec + EEOC guidance + EU AI Act Annex III. Annual recurring cost: ~$25K. The pack does the prep – the evidence package the auditor receives – in roughly 30 minutes per release cycle.

Who's working in this niche (27 startups): Dover, Apriora, Greenhouse, Stepful, Mederva Health, Untether Labs, DirectShifts, Axle Health.

4. api-platform-pack – for products whose primary surface is an API

The problem. This pack solves the boring stuff that bites you at 7am when the on-call alert fires. Webhook signature drift after a key rotation. Replay attacks because nobody set a timestamp tolerance. Idempotency keys reused across users so two customers get charged for the same transaction. Public API deprecation without a Sunset header so partners' integrations silently break. None of this is glamorous. All of it ships every quarter at every startup I've worked at.
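The idempotency-key failure is easy to reproduce when the key is stored on its own. A minimal in-memory sketch of one fix, scoping each key to the caller, is below. `IdempotencyStore` is a hypothetical name, and a real service would back it with a database and a unique constraint rather than a dict.

```python
from typing import Any, Callable


class IdempotencyStore:
    """In-memory sketch: one stored result per (user, idempotency key)."""

    def __init__(self) -> None:
        self._results: dict[tuple[str, str], Any] = {}

    def run_once(self, user_id: str, key: str, operation: Callable[[], Any]) -> Any:
        # Scope the key by user so two customers who send the same key
        # never share a stored result.
        scoped = (user_id, key)
        if scoped not in self._results:
            self._results[scoped] = operation()
        # Retries with the same (user, key) get the first result back
        # without running the operation again.
        return self._results[scoped]
```

A retry from the same user replays the stored result, while a different user sending an identical key triggers their own operation.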

What the pack does. Attaches api-platform-reviewer, enforces OAuth 2.1 with PKCE, RFC 8594 Sunset on deprecations, HMAC + replay-window webhook signing, idempotency-key contracts on writes, OpenAPI 3.1 conformance, semver enforcement on the SDK with backward-compat matrix tests, and supply-chain hardening (Sigstore-signed releases, OpenSSF Scorecard ≥ 7). Three gates: gate:semver-contract, gate:sunset-rfc, gate:ship.
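As an illustration of HMAC signing with a replay window, here is a minimal sketch using only the Python standard library. The payload format (timestamp, a dot, then the body), the 300-second tolerance, and the optional previous secret for key rotation are all assumptions for the example, not the pack's actual contract.

```python
import hashlib
import hmac
import time
from typing import Optional

TOLERANCE_SECONDS = 300  # assumed replay window


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    # Sign the timestamp together with the body so neither can be swapped.
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, body: bytes, signature: str,
           now: Optional[float] = None,
           previous_secret: Optional[bytes] = None) -> bool:
    now = time.time() if now is None else now
    # Reject anything outside the tolerance window to limit replays.
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False
    # Accept the previous secret too, so a key rotation does not break
    # senders that have not switched yet.
    candidates = [secret] + ([previous_secret] if previous_secret else [])
    return any(
        hmac.compare_digest(sign(candidate, timestamp, body), signature)
        for candidate in candidates
    )
```

`hmac.compare_digest` is used instead of `==` so the comparison time does not leak how much of the signature matched.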

Human-equivalent work. Wiring all of the above into a new API platform takes a senior backend engineer roughly 160 hours (4 weeks at standard pace). I've personally redone this stack at three startups. The pack delivers the same setup in ~45 minutes and – more importantly – keeps it audited on every PR thereafter.

Who's working in this niche (46 startups): Vapi, Retell AI, Deepgram, ElevenLabs, LiveKit, Parachute, Medplum, Patch.

5. lending-pack – for credit, BNPL, payroll advance, anything with adverse action

The problem. The CFPB explicitly said in Circular 2023-03: "the black-box AI denied your application" is not a permissible adverse-action notice under Regulation B. Your ML underwriter must produce specific reasons in plain language. Plus FCRA furnisher rules. Plus NMLS state matrix. Plus BISG fair-lending analysis (disparate impact across protected classes). Plus the new CFPB Section 1071 reporting if you're a small-business lender. Plus state usury caps that vary from 6% to 36%.

What the pack does. Attaches lending-credit-reviewer, wires gate:fair-lending (a disparate-impact report runs on every model release before ship), and generates the templates for ECOA-compliant adverse-action notices that explain feature contributions in human language. Also handles MLA for military borrowers, the state usury matrix, and the BISG demographic estimation pipeline.
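One way to picture "feature contributions in human language" is a lookup from the most negative contributions to pre-approved reason text. The sketch below assumes per-feature contribution scores where a negative value pushed the decision towards denial; the feature names, the reason wording, and the `adverse_action_reasons` helper are hypothetical.

```python
# Hypothetical mapping from model feature names to plain-language reasons.
REASON_TEXT = {
    "debt_to_income": "Debt-to-income ratio is too high",
    "months_since_delinquency": "Recent delinquency on a credit account",
    "credit_history_length": "Length of credit history is too short",
}


def adverse_action_reasons(contributions: dict[str, float], top_n: int = 2) -> list[str]:
    """Return plain-language reasons for the features that hurt the score most."""
    # Keep only features that pushed towards denial, most negative first.
    negative = sorted((value, name) for name, value in contributions.items() if value < 0)
    reasons = []
    for _, feature in negative[:top_n]:
        if feature not in REASON_TEXT:
            # Fail loudly rather than emit a vague or generic reason.
            raise KeyError(f"no plain-language reason mapped for feature {feature!r}")
        reasons.append(REASON_TEXT[feature])
    return reasons
```

Raising on an unmapped feature is a deliberate choice in this sketch: a notice with a missing or generic reason is the failure mode the paragraph above describes.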

Human-equivalent work. Fair-lending consulting is ~$600/h. A clean ECOA-compliant adverse-action notice rewrite plus BISG analysis plus state usury matrix takes a fair-lending statistician + a compliance lawyer ~6 weeks and around $90K. Recurring per model release: ~$25K. Pack does the disparate-impact analysis + notice templates on every release in ~90 minutes.

Who's working in this niche (38 startups): Aspire, OnDeck, MNT-Halan, Tabby, Tamara, SaveIN, Eden Care, Remedial Health.

6. clinical-trials-pack – for CTMS, EDC, eConsent, biomedical-data platforms

The problem. 21 CFR Part 11 audit trails. ALCOA+ data integrity (Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete / Consistent / Enduring / Available). If an FDA inspector asks for who changed this specific value on this specific subject at this specific time and why and your system can't produce it, your trial gets paused. Theranos paid an estimated $20M+ partly because their LIS audit trail couldn't meet Part 11. Your CTMS or EDC has logging – yes – but it doesn't capture the reason for change in the operator's own words.
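The "who changed this value, when, and why" question maps naturally onto an append-only record that refuses a change without a stated reason. A minimal sketch, with hypothetical class and field names and an in-memory list standing in for durable storage:

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: an entry cannot be edited after creation
class AuditEntry:
    subject_id: str
    field_name: str
    old_value: str
    new_value: str
    changed_by: str
    reason: str
    changed_at: str


class AuditTrail:
    """Append-only sketch; entries are never updated or deleted."""

    def __init__(self) -> None:
        self._entries: list[AuditEntry] = []

    def record(self, subject_id: str, field_name: str, old_value: str,
               new_value: str, changed_by: str, reason: str) -> AuditEntry:
        # The operator's own reason for change is mandatory.
        if not reason.strip():
            raise ValueError("a reason for change is required")
        entry = AuditEntry(subject_id, field_name, old_value, new_value,
                           changed_by, reason.strip(),
                           datetime.now(timezone.utc).isoformat())
        self._entries.append(entry)
        return entry

    def history(self, subject_id: str, field_name: str) -> list[AuditEntry]:
        # Who changed this value on this subject, when, and why.
        return [e for e in self._entries
                if e.subject_id == subject_id and e.field_name == field_name]
```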

What the pack does. Attaches two reviewers: clinical-trials-reviewer for the trial workflow + bio-data-reviewer for biomedical data flows (FHIR, HL7, OMOP, DICOM, genomics). Wires gate:irb-ready (IRB submission package complete before any patient enrolls), gate:part11-validation (Computer System Validation IQ/OQ/PQ before production go-live). Covers CDISC standards, Safe Harbor / Expert Determination de-identification, EU CTIS.

Human-equivalent work. Standing up a Part 11-compliant trial system from zero is roughly 6 months of work by a CSV (Computer Systems Validation) consultant ($250–400/h, ~480 hours = $120K–$190K), plus internal QA time. Pack delivers the validation framework + audit-trail patterns + IRB package outline in roughly 2 hours.

Who's working in this niche (20 startups): Curebase, Medable, Veeva Systems, Castor, AOA Dx, Nest Genomics, Syntra, Medplum.

7. robotics-pack – for cobots, surgical robots, autonomous vehicles, drones

The problem. ISO 10218 / TS 15066 govern collaborative-robot operations. The robot must stop within milliseconds when force/torque sensors detect contact above the published human-tolerance threshold per body region. The HARA (Hazard Analysis and Risk Assessment) must be signed by a licensed safety engineer before design freeze – not after, not during integration testing. If your cobot injures someone and the HARA wasn't signed-off pre-design-freeze, OSHA general-duty clause and product-liability law become very personal very fast.

What the pack does. Attaches robotics-safety-reviewer (+ fda-reviewer paired in for surgical robots). Wires gate:hara-signoff (load-bearing – pack will NOT let you proceed without this), enforces SROS2 secure DDS, OTA update strategy, power-profiling + watchdog patterns, hardware-in-the-loop (HIL) test design, Sigstore-signed firmware. The pack generates the HARA evidence package; the human safety engineer still signs it.

Human-equivalent work. Functional-safety consultants run $400–600/h. A clean HARA + ISO 10218 compliance package for a new cobot is ~12 weeks of engagement (~$120K–$200K). Pack does the evidence-and-template generation in ~3 hours; the human signoff remains mandatory and is not something to automate away.

Who's working in this niche (27 startups): Andromeda Surgical, Revolve Surgical, Figure AI, 1X Technologies, Boston Dynamics, Skild AI, Zeon Systems, NeuroBionics.

8. em-fintech-pack – for fintechs operating outside US/EU/UK

The problem. US-founded fintechs entering India / Brazil / Nigeria / Indonesia / Philippines often assume they can rent a banking partner's license forever. Sometimes you can. Regularly the regulator changes their mind mid-year – RBI did exactly this with the 2024 prepaid-instrument and cards-network rules. Partnered fintechs woke up to frozen wallets. MNT-Halan reached unicorn status partly because their license strategy was on file with the Central Bank of Egypt from day one.

What the pack does. Detects country signals (India, Nigeria, Brazil, Indonesia, Philippines, Mexico, Kenya), regulator names (RBI, CBN, BSP, OJK, MAS, BCB), and rail integrations (M-Pesa, UPI, PIX, GCash, OVO, DANA). Attaches emerging-markets-fintech-reviewer, wires gate:license-strategy (forces the own-license vs. partner-bank decision per jurisdiction before launch), and generates the jurisdiction-specific compliance matrix for KYC tiering, cross-border remittance, and local-rails integration patterns.

Human-equivalent work. Jurisdictional fintech-law consulting in India alone is $400–500/h, and a clean license-strategy analysis for one country is 4–8 weeks of engagement (~$60K–$160K). Expand to three countries → $180K–$480K. Pack does first-pass analysis + decision-package outline for each jurisdiction in ~60 minutes.

Who's working in this niche (40 startups): Razorpay, Aspire, MNT-Halan, Bolna, Remedial Health, SaveIN, Digi-Prex, Tabby, Tamara, Infiuss Health.

9. climate-pack – for carbon MRV, ESG reporting, synbio platforms

The problem. Methodology choice for carbon credits is a 10-year decision dressed up as a config option. In January 2023 a Guardian investigation found ~94% of analyzed Verra rainforest credits were "phantom credits." Disney, Shell, Gucci quietly retired their positions. If your platform issued credits under that methodology, every buyer's purchase is retroactively suspect. You cannot change methodology retroactively without re-issuing the entire credit batch – which means refund + reputational damage. Plus the new CSRD (EU) + CBAM disclosure rules. Plus biosecurity if you're touching synbio.

What the pack does. Attaches two reviewers: climate-mrv-reviewer for the MRV pipeline and biosecurity-reviewer for dual-use synbio. Wires gate:mrv-methodology ("cannot change retroactively" – the heaviest gate in the pack), gate:durc-signoff for Dual-Use Research of Concern, gate:open-weights-release for generative bio-models. Covers GHG Protocol Scope 1/2/3, Verra/Gold Standard/Puro, SBTi target validation, ISSB, IGSC HSP v2 nucleic-acid screening.

Human-equivalent work. Climate MRV consulting is ~$300–500/h. A clean methodology selection + documentation package is ~2 months of work (~$60K–$80K). Pack delivers the decision document + Verra-vs-Gold-Standard-vs-Puro comparison matrix in ~90 minutes.

Who's working in this niche (25 startups): Watershed, Persefoni, Sweep, Greenly, Plan A, Pachama, Sylvera, Patch.

10. drug-discovery-pack – for ML-driven drug discovery, GLP labs, cloud labs

The problem. Generative chemistry / protein-design models are confidently wrong outside their training distribution. The bug pattern: model produces a target with 0.92 predicted binding score, you spend $80K–$400K on wet-lab validation, it doesn't bind, post-mortem reveals the target was outside the model's applicability domain. There's no regulator here – there's just a co-founder whose lab budget just went up in smoke. Plus, if your synthesis order accidentally matches a pathogen-of-concern sequence, your synthesis vendor's legal team calls before your chemistry team does.

What the pack does. Attaches four reviewers – drug-discovery-ml-reviewer, glp-glab-reviewer, lab-automation-reviewer, biosecurity-reviewer. Wires gate:model-card-signoff (applicability domain + calibration verified before wet-lab spend – the cost-saving gate) and gate:csv-validation (IQ/OQ/PQ before GLP/GMP production – the regulatory gate). Covers ChEMBL/PubChem dataset versioning, ALCOA+ data integrity, SiLA2 instrument-driver standards, ASTM E2500.
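An applicability-domain check can be as simple as asking how close a candidate is to anything the model was trained on. The sketch below uses Tanimoto similarity over fingerprints represented as sets of on-bit indices. The 0.4 threshold and the toy fingerprints are assumptions for illustration; in practice the fingerprints would come from a cheminformatics toolkit and the threshold from calibration.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Shared bits divided by total distinct bits (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def in_applicability_domain(candidate: frozenset, training_set: list[frozenset],
                            min_similarity: float = 0.4) -> tuple[bool, float]:
    """Return (inside_domain, similarity to the nearest training example)."""
    nearest = max((tanimoto(candidate, t) for t in training_set), default=0.0)
    return nearest >= min_similarity, nearest
```

A candidate with a high predicted score but a low nearest-neighbour similarity is the case the gate is meant to catch before wet-lab money is spent.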

Human-equivalent work. Standing up ML ops + applicability-domain monitoring + GLP-compliant data integrity + IQ/OQ/PQ validation framework is roughly 5 months of work across an ML engineer, a QA-CSV consultant, and a lab-automation specialist (~$180K–$280K). Pack does the framework + first model-card + validation outline in ~3 hours.

Who's working in this niche (33 startups): Future Fields, Ångström AI, Talus Bio, Abalone Bio, om therapeutics, Olio Labs, ParcelBio, Zeon Systems.

The compliance-slice comparison (with vs. without packs)

Important scope note before the numbers: this table covers only the compliance setup slice. It is the part the packs replace: regulation reading, threat-model drafting, gate definition, evidence-framework setup, and the legal-and-audit work attached to that slice. It does not include the rest of the product – architecture, backend, frontend, database, integration code, test suite, security audits, deployment, or any ongoing feature work. Those are real costs, they don't disappear, and we'll look at them next.

What the packs replace inside the compliance slice: the discovery work – reading the regulation, mapping it to your stack, drafting the gate set, writing the threat model, building the evidence framework. The part where a senior person spends 80–200 hours per industry. Mechanical, repeatable, and a great use of an LLM.

What the packs do not replace, even inside the compliance slice: third-party audits, regulator meetings, legal sign-off on the generated artifacts, safety-engineer signatures on HARA, the human review of the threat model.

End-to-end numbers for the compliance slice only, with vs. without each pack:

Pack	Compliance slice without pack	Compliance slice with pack (LLM + remaining human work)	Slice saving
voice-pack	~$42K · 6 weeks	~$22K (≈ $20K integration code for gates + $2K legal review)	~48%
clinical-pack	~$60K · 3 months	~$28K (≈ $25K dev + regulatory pursuit + $3K legal + $0.90 LLM)	~53%
hr-ai-pack (prep)	~$25K · 80h	~$8K + external LL 144 auditor at $15–50K either way	~68% on prep
api-platform-pack	~$48K · 4 weeks	~$18K (≈ $15K dev + $3K legal + $0.50 LLM)	~63%
lending-pack	~$90K · 6 weeks	~$32K (≈ $25K dev + $7K legal + $0.70 LLM)	~64%
clinical-trials-pack	~$150K · 6 months	~$55K (+ IRB approval, non-automatable)	~63%
robotics-pack	~$150K · 12 weeks	~$60K (+ licensed safety-engineer signature, non-automatable)	~60%
em-fintech-pack (per jurisdiction)	~$100K · 6 weeks	~$35K	~65%
climate-pack	~$70K · 8 weeks	~$28K	~60%
drug-discovery-pack	~$220K · 5 months	~$80K (wet-lab spend not reducible)	~64%

Median saving on the compliance slice: ~63%. Across all ten industries, the compliance-slice cost drops from roughly $955K to ~$366K, and the calendar time on the compliance work itself from ~21 months to ~7 months. That's a defensible 60 %-ish reduction on this specific portion of product work.
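The summary figures can be re-derived from the table rows. This short script takes the rounded dollar figures exactly as listed (so hr-ai-pack excludes the external auditor and em-fintech-pack is per jurisdiction) and recomputes the median saving and the totals.

```python
from statistics import median

# (cost without pack, cost with pack) in thousands of dollars, copied from the table.
SLICE_COSTS_K = {
    "voice-pack": (42, 22),
    "clinical-pack": (60, 28),
    "hr-ai-pack (prep)": (25, 8),
    "api-platform-pack": (48, 18),
    "lending-pack": (90, 32),
    "clinical-trials-pack": (150, 55),
    "robotics-pack": (150, 60),
    "em-fintech-pack": (100, 35),
    "climate-pack": (70, 28),
    "drug-discovery-pack": (220, 80),
}

# Per-pack saving as a percentage of the cost without the pack.
savings_pct = {
    pack: (without - with_pack) / without * 100
    for pack, (without, with_pack) in SLICE_COSTS_K.items()
}
total_without = sum(without for without, _ in SLICE_COSTS_K.values())
total_with = sum(with_pack for _, with_pack in SLICE_COSTS_K.values())
overall_pct = (total_without - total_with) / total_without * 100

print(f"median saving: {median(savings_pct.values()):.1f}%")  # 62.9%
print(f"totals: ${total_without}K -> ${total_with}K")         # $955K -> $366K
print(f"overall reduction: {overall_pct:.1f}%")               # 61.7%
```

The median of the per-pack savings (about 63 %) and the reduction on the summed totals (about 62 %) are different statistics, which is why the text quotes both a 63 % median and a "60 %-ish" overall figure.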

And the rest of the product?

Compliance is one slice of an MVP – typically 10–25 % of the total build cost for a regulated AI product. The other 75–90 % is the actual engineering: architecture, backend, frontend, database, integration code, test suite, security audits, deployment, monitoring, and the long tail of ongoing features.

Agents help here too – but the savings shape is different, the LLM costs are much higher than the cents-per-pack numbers above, and the human team doesn't get to be smaller, it just gets to ship more.

A realistic end-to-end picture for shipping a voice-AI MVP (just one example – pick your archetype, numbers scale similarly):

Work area	Traditional team (1 PM + 4 engineers, 3 months)	With agentic SDLC (1 PM + 2 engineers + agents, 3 months)
Architecture + ADRs	~$20K (senior architect, 2 weeks)	~$10K (1 sr eng + architect agent; ~$5K payroll + LLM)
Backend (Twilio, OpenAI, call routing)	~$80K (2 engineers · 8 weeks)	~$30K (1 engineer + senior-dev agents · 6 weeks)
Frontend (operator dashboard, call review UI)	~$40K (1 frontend · 4 weeks)	~$15K (1 engineer + agents · 3 weeks)
Database + migrations	~$15K	~$5K (agent-assisted schema + human review)
Test suite + QA	~$25K (1 QA · 6 weeks)	~$10K (qa-engineer agent + 1 human reviewer)
Security review + pen test	~$20K (external pen test)	~$15K (external pen test still required + security-officer agent for internal review)
Compliance (voice-pack covers)	~$42K	~$22K
Deployment + CI/CD	~$15K	~$8K
Documentation	~$10K	~$3K (agent-generated, human review)
PM + buffer	~$20K	~$10K (pm agent + lighter human PM)
MVP total	~$287K	~$128K
LLM compute across the whole MVP	$0	~$500–$1,500
Wall-clock time to MVP	~3 months	~6–8 weeks
Headcount	1 PM + 4 engineers	1 PM + 2 engineers + agents

MVP saving with full agentic SDLC + the voice-pack: roughly 55 % on cost, ~45 % on wall-clock time. The LLM cost across the entire product build is ~$500–$1,500, three to low four figures: under a dollar per pack, several hundred dollars across all the agentic SDLC work, plus a few hundred dollars in the slower archetype-reviewer + memory-loop tail.
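The MVP totals can be checked the same way, by summing the work-area rows of the table. One assumption is labelled in the code: "~3 months" is treated as 13 weeks for the wall-clock comparison.

```python
# (traditional, agentic) cost in thousands of dollars per work area, from the table.
MVP_COSTS_K = {
    "Architecture + ADRs": (20, 10),
    "Backend": (80, 30),
    "Frontend": (40, 15),
    "Database + migrations": (15, 5),
    "Test suite + QA": (25, 10),
    "Security review + pen test": (20, 15),
    "Compliance": (42, 22),
    "Deployment + CI/CD": (15, 8),
    "Documentation": (10, 3),
    "PM + buffer": (20, 10),
}

traditional = sum(t for t, _ in MVP_COSTS_K.values())
agentic = sum(a for _, a in MVP_COSTS_K.values())
cost_saving_pct = (traditional - agentic) / traditional * 100

# Assumption: "~3 months" is treated as 13 weeks.
BASELINE_WEEKS = 13
time_saving_pct = [(BASELINE_WEEKS - w) / BASELINE_WEEKS * 100 for w in (8, 6)]

print(f"${traditional}K -> ${agentic}K, saving {cost_saving_pct:.1f}%")  # $287K -> $128K, saving 55.4%
print(f"wall-clock saving: {time_saving_pct[0]:.0f}%-{time_saving_pct[1]:.0f}%")  # 38%-54%
```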

Two things this table makes obvious:

The compliance pack isn't the headline saving. It's one of ten work areas that get cheaper. The packs save ~60 % of compliance cost in the median case (48 % for voice-pack here); the agentic SDLC saves a similar fraction across most of the other work areas. The compounding 55 % MVP saving comes from doing both, not from one heroic LLM call.
You still need engineers. The "2 engineers + agents" team is real. Those engineers operate the pipeline, review agent output, fix the bugs agents create, integrate the actual product, and ship the code. The headcount drops from 4 → 2 – but it doesn't drop to 0. The startup that ships an MVP with no humans in 2026 doesn't exist.

What you actually get: the same MVP, faster, with half the engineering headcount, at roughly half the cost – and a much bigger fraction of the work shifts from "writing code" to "reviewing agent output and arguing with auditors." Different work shape. Smaller team. Larger output.

Time, not just money

The dollar number (55 % MVP saving) is one frame. The time number is another and, for a runway-constrained startup, arguably the more important one. Building this stack burns calendar time you don't get back. So here's where the wall-clock actually changes.

Wall-clock to ship the MVP

Traditional team: ~3 months for 1 PM + 4 engineers to ship a voice-AI MVP, end to end. That's the version that includes ~6 weeks of compliance work, sequential within itself but running mostly in parallel with feature dev.

With agentic SDLC + voice-pack: ~6–8 weeks for 1 PM + 2 engineers + agents. About a 40–50 % wall-clock reduction, not just a cost reduction.

Why faster:

Parallel agent execution. Architecture draft, backend scaffolding, test-suite generation, and threat-model writing all run in parallel. In the traditional version each is a sequential phase.
The compliance team is no longer the long pole. The pack generates the threat model in 45 minutes. The 6-week "waiting on legal to read the regulation" tail disappears.
Reviewer fan-out is parallel. Five reviewers (QA, security, performance, archetype-specific, code review) run concurrently on every PR – 5–8 minutes vs. 1–2 days of sequential human review.
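The reviewer fan-out can be sketched with `asyncio.gather`. Here each reviewer is a stand-in coroutine that sleeps instead of calling a model; the reviewer names, the `run_reviewer` and `review_pr` functions, and the fixed "approve" verdict are placeholders for illustration.

```python
import asyncio
import time

REVIEWERS = ["qa", "security", "performance", "archetype", "code-review"]


async def run_reviewer(name: str, pr_id: int, seconds: float) -> dict:
    # Stand-in for an LLM review call; the sleep simulates its latency.
    await asyncio.sleep(seconds)
    return {"reviewer": name, "pr": pr_id, "verdict": "approve"}


async def review_pr(pr_id: int, seconds: float = 0.1) -> list[dict]:
    # gather() runs all reviewers concurrently and keeps their order,
    # so total time tracks the slowest reviewer, not the sum.
    return await asyncio.gather(
        *(run_reviewer(name, pr_id, seconds) for name in REVIEWERS)
    )


if __name__ == "__main__":
    start = time.perf_counter()
    verdicts = asyncio.run(review_pr(42))
    elapsed = time.perf_counter() - start
    print(len(verdicts), f"reviews in {elapsed:.2f}s")
```

With five reviewers at 0.1 s each, the concurrent run finishes in roughly 0.1 s rather than 0.5 s, which is the same shape as minutes versus days in the human case.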

Per-feature time

The smaller, more visceral comparison – what it feels like day to day to ship one feature:

Activity	Traditional senior team	With agents	Speedup
Plan a feature (ARCH doc, task breakdown)	2–4 hours of human discussion + writing	15 min (architect + pm agents → `gate:plan`)	~10×
Write the code (small feature)	1–3 days of senior dev	~1–2 hours of human review of agent output	~10–15×
Code review	2–4 hours, often async over 1–2 days	30 min (5 reviewers running in parallel)	~10×
QA / test suite	1 day	15 min (qa-engineer agent + 1 human spot check)	~25×
Deploy (canary + monitoring)	~4 hours	~10 min (auto-canary)	~25×
End-to-end per feature	~3–5 days	~3–5 hours	~10×

So shipping one feature drops from "we'll have it next week" to "we'll have it after lunch." That's the kind of speedup that changes what you're willing to build.

The investigative + maintenance numbers

Different shape of work, different speedup. These are the ones I personally care about most because they show up in on-call rotations:

Activity	Traditional	With agents	Why faster
First commit by a new team member	2–3 weeks of onboarding	~5 minutes (new agent reads CLAUDE.md + lessons.md)	No human onboarding needed for agents
Time to investigate a recurring P0 incident	~4 hours on first occurrence	~30 min on second occurrence	Memory layer remembers the detection pattern – 94 % MTTR reduction documented here
Time to add a brand-new compliance archetype	4–6 weeks (research + integration)	~3 days (pack overlay + product customization)	~70 % of the work is templates the pack already ships
Time to update an existing pack when a regulation changes	1–2 weeks (audit + rewrite)	4–8 hours (LLM reads the regulation diff, updates the reviewer prompt)	Diff-and-template is mechanical
Time to scaffold a new project + wire archetype + first feature plan ready	~1 week (kickoff meetings + arch doc)	~12 minutes (`npx great-cto init` + `/start "feature"`)	All of step zero is automated

What does NOT speed up

This is the section that keeps the rest of the article honest. Several critical-path items don't speed up at all, and pretending they do is exactly the kind of marketing the article opens by complaining about:

External audit cycles still take their natural time. A NYC LL 144 bias auditor still takes 2–4 weeks. An FDA pre-submission meeting cycle is still 60–90 days.
IRB approval for a clinical trial still takes 2–3 months. You can prepare the package in a day; you can't make the IRB committee meet faster.
Regulator meetings still need to be scheduled. RBI, CBN, BSP, FDA – those are calendars, not API endpoints.
Wet-lab validation for a drug-discovery hit is still a real biology experiment that takes weeks regardless of how smart the model was that predicted the target.
HARA signoff by a licensed safety engineer is a single calendar moment. The pack generates the package in 3 hours; the human still owns the date on which they sign it.

The pattern: internal work compresses 5–25×. External-dependency work does not. A useful mental model is: "the LLM accelerates everything that's purely you and your codebase. Everything that requires another human or organization to commit time runs at human speed."

Why this matters for runway

An early-stage AI startup typically has 18–24 months of runway. If your MVP cycle drops from 3 months to 6 weeks (the low end of the 6–8 week range), you don't just save money – you double the number of pivots you can survive. Three pivots become six. Two product directions become four. The number of bets you can hedge per year doubles.

For a category-defining startup hunting for product-market fit, that's the difference between catching the wave and missing it.

What this actually changes

The 63 % saving and the ~$955K → ~$366K compliance-slice number are fine, but they hide a deeper shift that's harder to put in a table. Three things have changed about how compliance work gets done – and none of them are "AI is magic."

1. Deep domain expertise has stopped being a moat. Pipeline design has become one.

For twenty years, the answer to "how do we ship compliantly into healthcare or fintech or voice-AI" has been "hire someone who's done it before." A senior fintech-compliance engineer with five years post-CFPB experience commands $300K+ because she knows where the bodies are buried. Same for the FDA SaMD consultant at $600/hour. Same for the ISO 10218 functional-safety lead.

That moat hasn't disappeared, but it's been partially commoditized. The regulation text is publicly available. Mapping it to a stack is mechanical. Drafting a threat model is mechanical. The LLM reads 200 pages of 21 CFR Part 11 faster than any human can think about it, and produces a first-pass mapping that's 70–80 % correct.

What hasn't commoditized: judgment about edge cases, jurisdictional interpretation, regulator relationships, the ability to defend an audit finding on the phone. Those still cost money. But the bulk of "I need someone senior who's done this before" gets replaced by "I need someone who can read what the pack wrote and decide if it matches the product reality." That's a different – and much cheaper – hire.

The new moat is pipeline design: how the gates wire together, what evidence the agent extracts, which artifacts persist, how the human-in-the-loop handoffs work, when the auditor sees the trail. That moat is much more accessible to a small startup than the hire-three-senior-experts moat ever was.

2. Agents bend the cost-per-feature curve. Hard.

Look past the 63 % saving for a second. The number that actually matters: calendar time for a founder shipping into all ten industries goes from ~21 months to ~7 months. That's not a marginal improvement. That's the difference between "this startup runs out of runway before we ship" and "we ship."

For a Series B fintech that previously needed two senior compliance engineers, the math becomes: one senior compliance person plus the pipeline replaces three to four senior compliance people. Not because the LLM is smarter than your senior person. The LLM is much faster at reading; your senior person is much smarter about judging. The bottleneck moves from "have we read the regulation" to "have we mapped it correctly to OUR product." The first question is mechanical. The second is a real skill.

Shipping rate doubles. Then it doubles again when the same compliance person works across multiple regulated industries – which historically required multiple specialists.

3. The work didn't disappear. It moved up the value chain.

This is the part most "AI replaces engineers" takes get wrong. The work doesn't go away. It redistributes:

The 200-hour "read the regulation and write the threat model" task → ~3 hours of human review of the LLM's output
The 80-hour "wire the gate definitions into CI" task → 30 minutes of operator approval of the generated gate set
The 60-hour "map every requirement to an evidence artifact" task → 4 hours of human verification per audit cycle
The work that's gone: mechanical reading + templating
The work that's MORE valuable now: judgment, jurisdictional interpretation, regulator relationships, defending an audit finding, and – separately – the production code itself

This is collaboration, not replacement. The LLM does the parts where speed beats judgment. The human does the parts where judgment beats speed. The auditor signs the certificate. The engineer ships the code.

The compliance expert of 2027 is someone who knows which regulation applies in which jurisdiction and can run a pipeline to do the reading + templating for them. Same depth of judgment, five times the productivity. That person is going to eat the market share of the compliance expert who still bills by the hour for reading regulation text.

Honest disclaimers

Same as in the last article, and they always apply:

Packs do not certify you. Reviewer agents produce evidence; humans sign.
Packs do not replace lawyers. Reviewers encode commonly accepted readings of each regulation. Jurisdictional interpretation is your GC's job.
Packs cover 70–80% of each industry's gate surface. The edges need overrides – fork the reviewer prompt at agents/{slug}-reviewer.md.
The dollar amounts above are real published consulting-rate ranges plus my own historical project costs. Your mileage will vary depending on industry, geography, and how much paperwork the regulator demands that month.

And while we're being honest: I once asked the senior-dev agent to refactor a retry loop. It introduced a different retry bug. The qa-engineer agent reviewed and approved. The code-reviewer agent reviewed and approved. The bug shipped, and l3-support caught it in production three hours later. continuous-learner now carries a lesson called "a committee of agents can rubber-stamp the same blind spot." Painfully on-brand for an article about review pipelines.

If you want to try it: npx great-cto init. MIT-licensed. Runs locally. You pay for your own LLM API usage. Source: github.com/avelikiy/great_cto. Pack pages with company catalogues: greatcto.systems/packs.

TL;DR

I built ten compliance packs because I got tired of paying $40–200K per industry, per startup, for the first version of the compliance setup.
Each pack: triggers on industry signals, attaches 1–4 specialist reviewer agents, wires named human gates, ships an evidence-artifact framework.
Compliance is roughly 10–25% of MVP cost. The packs reduce the compliance slice end-to-end by ~63% median (across all 10 industries: ~$955K → ~$366K). Calendar time on compliance: ~21 months → ~7 months.
The bigger picture is the full MVP, not just the compliance slice. Realistic numbers for a voice-AI MVP: traditional team (1 PM + 4 engineers, ~3 months) ≈ $287K. With agentic SDLC + voice-pack (1 PM + 2 engineers + agents, ~6–8 weeks) ≈ $128K. LLM cost across the entire MVP: ~$500–$1,500.
MVP saving with agents + packs: ~55% on cost AND ~40–50% on wall-clock time. The compliance pack contributes ~$20K of that – the rest comes from agents helping across architecture, backend, frontend, DB, QA, security review, and deployment.
Per-feature time drops ~10×. Plan → code → review → QA → deploy collapses from ~3–5 days to ~3–5 hours. The compliance archetype-update cycle goes from 1–2 weeks to 4–8 hours. New project scaffolding goes from ~1 week to ~12 minutes.
What does NOT speed up: external audits, IRB approval, regulator meetings, wet-lab validation, HARA signoff cycles. Anything that requires another organization to commit calendar time runs at human speed.
The runway implication: if your MVP cycle drops from 3 months to 6 weeks, you double the number of pivots you can survive within the same 18–24 month runway. That's existential for a category-defining startup.
You still need engineers. Headcount drops 4 → 2. It does not drop to 0. The startup that ships an MVP with no humans in 2026 doesn't exist.
What the packs do replace: regulation-reading + threat-modeling + evidence-templating – the mechanical part. They do not replace production code, third-party audits, regulator meetings, legal review, or human signatures on safety-critical gates.
The bigger shift hidden in the numbers: deep domain expertise has stopped being a moat – pipeline design has become one. Reading 21 CFR Part 11 is now mechanical (LLM-fast). Mapping it to your stack with judgment is still a senior skill. The hire profile changes; the headcount drops.
The compliance expert of 2027 knows which regulation applies AND runs a pipeline to do the reading. Same depth of judgment, 5× the productivity. That person eats the market share of the one who still bills by the hour for reading regulation text.
If your industry isn't in the ten above, three more land in Q3: gov-public-pack, insurance-actuarial-pack, edtech-coppa-pack. PRs welcome.
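For readers who want to re-run the headline arithmetic, here is a small sketch using only the rounded totals quoted in this TL;DR. The helper names `saving_pct` and `cycles_in_runway` are made up for illustration. The aggregate of the ten-industry totals works out to roughly 62%, while the ~63% above is a median across industries, so the two need not match exactly.

```python
def saving_pct(before, after):
    """Percentage reduction when going from `before` to `after`."""
    return 100.0 * (before - after) / before


def cycles_in_runway(runway_months, cycle_months):
    """Whole MVP cycles that fit in a runway. Simplification: cycles run
    back to back with no gap between them."""
    return int(runway_months // cycle_months)


# Rounded totals quoted in the TL;DR.
compliance_saving = saving_pct(955_000, 366_000)  # ten-industry compliance totals
mvp_saving = saving_pct(287_000, 128_000)         # voice-AI MVP, traditional vs. agentic
calendar_saving = saving_pct(21, 7)               # compliance calendar months

print(f"compliance slice: {compliance_saving:.1f}% saved")
print(f"voice-AI MVP:     {mvp_saving:.1f}% saved")
print(f"calendar time:    {calendar_saving:.1f}% shorter")

# Assumption: a 6-week cycle is treated as 1.5 months.
for runway in (18, 24):
    slow = cycles_in_runway(runway, 3.0)
    fast = cycles_in_runway(runway, 1.5)
    print(f"{runway}-month runway: {slow} cycles at 3 months, {fast} at 6 weeks")
```

With these inputs the cycle count doubles at both ends of the 18–24 month range, which is where the "double the number of pivots" claim comes from.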

About: I build GreatCTO – a multi-agent SDLC plugin for Claude Code. MIT, runs locally. Twitter: @avelikiy. GitHub: @avelikiy/great_cto.