Forem: Wrought

Fail-Open Pipelines Are Half the Answer: The Findings Tracker Is the Other Half

Wrought — Tue, 19 May 2026 12:37:04 +0000

I. The 65% datum

Real Python's recent audience research surveyed 278 Python developers about agentic coding workflows. Of those, 65% reported being stuck on a context-loss, chat-AI workflow pain point, a description that names the gap precisely. One respondent put it in 11 words: "AI is good for solving a small task, but only this."

The diagnosis behind that statistic is right. Real Python's framing — that the problem isn't the AI, it's the workflow — captured something the agentic-tooling space had been talking around. Most coverage of Claude Code, Cursor, and the rest of the agentic-coding category had defaulted to tool questions: which model, which IDE integration, which prompt template. The 65%-stuck cohort answered a different question. They had the tool. The tool worked on small tasks. The workflow was where everything fell apart.

There are two ways to extend that diagnosis into a prescription. The first is to chase the runtime: build pipelines that compose small tasks into bigger ones, add verifier stages that catch errors before they cascade, instrument the runtime with circuit breakers around the unreliable bits. That's the right move at the runtime layer, and a peer engineer named Drew published the canonical write-up of how to do it: Fail-Open LLM Architecture: Why Your Reviewer Stage Should Never Block a Decision.

The second extension is the one this article argues. The 65%-stuck pain isn't only a runtime problem. The runtime layer answers "how do we compose small tasks within a single session?" The other layer — the one Real Python's workshop curriculum legitimately doesn't cover, the one Drew's piece doesn't reach because it stops at the runtime boundary by design — is the session boundary. What happens when session N ends with the work half-finished and session N+1 starts cold? What happens when the context window fills up partway through a refactor? What happens when you close the editor and come back two days later and the agent has no memory of what it was doing?

The right diagnosis is workflow, not tool. The right prescription needs more than a tool. This article extends the prescription one layer.

II. Drew's runtime answer

Drew, writing as A3E Ecosystem on dev.to, sets up the runtime case in production terms. His Fail-Open LLM Architecture piece argues, correctly, that "every stage degrades to the simplest correct behaviour when its neighbour fails." The argument starts from a real production incident — OpenAI's API outage on December 11, 2024, when a newly-deployed telemetry service caused every node across hundreds of Kubernetes clusters to execute resource-intensive operations simultaneously, taking the API down for roughly four hours.

The piece then walks through what production-grade LLM pipelines should do about it. The pattern is a three-state circuit breaker around the unreliable stages — typically the reviewer-or-checker stages that sit downstream of the primary model decision. Closed state is normal: requests pass through, the reviewer runs, life is good. Open state is failure: the breaker has seen enough errors that it stops calling the reviewer entirely; requests fail-open with a logged warning, the primary's output passes through unaltered. Half-open state is recovery: after a cooldown window, a probe request tests whether the reviewer is back; if it passes, breaker closes and traffic resumes.
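The three states described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from Drew's piece: the class name `FailOpenBreaker`, the `max_failures` and `cooldown_s` parameters, and the injected `clock` are all hypothetical.

```python
import logging
import time
from typing import Callable, Optional

log = logging.getLogger("pipeline")


class FailOpenBreaker:
    """Three-state breaker around an optional reviewer stage (illustrative sketch)."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the breaker is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half-open"  # cooldown elapsed: the next call is a probe
        return "open"

    def review(self, draft: str, reviewer: Callable[[str], str]) -> tuple[str, bool]:
        """Return (output, reviewed). On failure the draft passes through unaltered."""
        if self.state == "open":
            log.warning("reviewer skipped: breaker open, passing draft through")
            return draft, False
        was_probe = self.state == "half-open"
        try:
            result = reviewer(draft)  # normal call, or the half-open probe
        except Exception:
            self.failures += 1
            if was_probe or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip, or re-trip after a failed probe
            log.warning("reviewer failed: passing draft through unverified")
            return draft, False
        self.failures = 0
        self.opened_at = None  # a success closes the breaker
        return result, True
```

The second element of the returned tuple is the "verification did not run" flag, so callers can record the degraded path instead of silently trusting the output.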

The numbers in the piece are not hypothetical. Well-instrumented circuit breakers around LLM reviewer stages produce 83.5% reductions in cascading failures during upstream incidents. The article cites Anthropic's own degraded period — a roughly nine-month window after the OpenAI outage — where peak failure rates reached 16% and median time-to-recovery sat between 0.77 and 1.23 hours across major providers. Three distinct bugs ran concurrently: a routing error, a TPU corruption, and a compiler bug returning wrong tokens. None of them announced themselves. Detection took weeks because the failures were silent and stochastic. The piece makes a precise claim about silent degradation as a class of failure: it is, the argument goes, always the real enemy, because full outages are at least obvious.

The fix is structural, not heroic. You don't write better reviewers; you build pipelines that don't depend on the reviewer being available. When the breaker fires, the reviewer is treated as missing. The primary's decision passes through with a logged warning that the verification step didn't run. The pipeline returns something correct — possibly slightly degraded — instead of returning nothing. That is what fail-open means at the runtime layer: every stage knows what to do when its neighbour fails. The reviewer falls back to pass-through. Authorization falls back to fail-closed-with-escalation. Content generation falls back to queue-and-retry. Domain-specific error handling, the piece concludes, is what separates production systems from demos.
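As a rough sketch, the per-stage fall-backs named above can be written down as a lookup table. The stage kinds and fall-back names come from the paragraph; the enum, the dictionary, and the default for unknown stage kinds are assumptions made for illustration.

```python
from enum import Enum


class Fallback(Enum):
    PASS_THROUGH = "pass-through"
    FAIL_CLOSED_ESCALATE = "fail-closed-with-escalation"
    QUEUE_AND_RETRY = "queue-and-retry"


# Stage kinds and their degraded behaviour, as named in the paragraph above.
FALLBACKS = {
    "reviewer": Fallback.PASS_THROUGH,
    "authorization": Fallback.FAIL_CLOSED_ESCALATE,
    "generation": Fallback.QUEUE_AND_RETRY,
}


def on_stage_failure(kind: str) -> Fallback:
    """Look up what a stage of this kind should do when it cannot run."""
    # Assumption of this sketch: unknown stage kinds take the most conservative path.
    return FALLBACKS.get(kind, Fallback.FAIL_CLOSED_ESCALATE)
```

Keeping the mapping in one place makes the domain decision explicit: each stage's degraded behaviour is chosen up front, not improvised during an incident.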

The runtime answer is right. It's not even controversial — Nygard's Release It! documented the pattern in 2007, Fowler's circuit-breaker bliki entry has been canonical for over a decade, and Netflix's Hystrix project shipped a production implementation engineering teams have leaned on for years. What's new is the application to LLM pipelines specifically: identifying which stages are reviewers (and so candidates for fail-open), which are gatekeepers (and so candidates for fail-closed), which are generators (and so candidates for queue-and-retry).

It stops at the runtime boundary by design. The piece is about what happens inside a single request when one of the stages fails. That's the right scope for a runtime-architecture article. But the 65%-stuck cohort Real Python surveyed wasn't reporting failures inside a single request. They were reporting failures across requests — across sessions, across days, across context windows. The pipeline frame is right; the boundary it covers is one of two.

The session boundary needs the same pattern.

III. The session-boundary mirror

Consider what session N looks like at the moment it ends. The agent has done some work. The user has reviewed some outputs. The conversation has accumulated context — open files, prior decisions, partial implementations, half-debugged bugs. All of that lives in the session: in the model's context window, in the editor's state, in the agent's working memory. None of it persists by default.

Now consider session N+1. It might start the next morning. It might start an hour later after the context window filled up and the conversation auto-compacted, leaving only a summary. It might start because the model timed out, or the network dropped, or the laptop rebooted. From session N+1's perspective, session N's context is gone. The agent has no memory of what was being implemented, what was already debugged, what the user already explained, what was already decided and ruled out.

This is the runtime stage whose neighbour just failed silently. Not a reviewer stage failing inside a single request — the prior session failing as an upstream stage. Without a fall-back, session N+1 has two options: block (ask the user "what were we working on, and where did we leave off?") or hallucinate (invent a context that may or may not match what session N actually did). Both are the cross-session equivalent of "the whole pipeline returns None." Both turn the user into the recovery mechanism.

The cross-session equivalent of Drew's circuit breaker is a Findings Tracker: a markdown artefact written by session N to disk, which session N+1 reads as its first action. The session boundary becomes a fail-open boundary. When the upstream session's context is missing (and at the session boundary it is always missing), the simplest correct behaviour is "read what's on disk." The next session resumes with the same warning convention Drew's reviewer logs: "I am reading the tracker, not the live state. Verify before mutating." The pattern transposes one layer up: every session degrades to the simplest correct behaviour when its predecessor's context fails.
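A minimal sketch of that first action, assuming trackers live in one directory and follow the timestamp-prefixed `*_FINDINGS_TRACKER.md` naming used later in this article. The function name `load_latest_tracker` and the logger name are hypothetical; this is not the Wrought plugin's code.

```python
import logging
from pathlib import Path
from typing import Optional

log = logging.getLogger("session")


def load_latest_tracker(findings_dir: Path) -> Optional[str]:
    """Return the text of the newest tracker by filename, or None if none exists."""
    trackers = sorted(findings_dir.glob("*_FINDINGS_TRACKER.md"))
    if not trackers:
        log.warning("No tracker on disk; starting cold.")
        return None
    latest = trackers[-1]  # names start with a timestamp, so name order is time order
    log.warning("Reading %s: tracker state, not live state. Verify before mutating.",
                latest.name)
    return latest.read_text(encoding="utf-8")
```

Returning `None` when nothing is on disk keeps the cold-start case explicit, so the caller can ask the user instead of inventing context.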

To name the artefact precisely, it helps to contrast it with what Claude Code projects already ship.

Claude Code projects ship with a CLAUDE.md file — a markdown document at the project root that captures static persona and preferences. "This project uses pytest. Use type hints. Prefer markdown tables over JSON for status reports." It tells session N+1 who the user is and what the project's conventions are. Real Python's workshop covers CLAUDE.md directly in its lesson on understanding the file.

A Findings Tracker is the task-state layer. It does not record who the user is or what conventions the project uses. It records what work is in flight: this finding is at the Designing stage, F1.4 implementation is checked, F1.5 forge-review hasn't run yet, F1.6 verification depends on an external event scheduled Friday. CLAUDE.md is to a project's .env what a Findings Tracker is to a job-queue checkpoint. One holds static configuration; the other holds dynamic task state.

The two artefacts are complementary — CLAUDE.md is the static project-config layer; the Findings Tracker is the dynamic task-state layer. The workshop curriculum legitimately covers the first; the methodology layer above the agentic tool is the second. They do not compete. They occupy different positions in the same agentic-coding stack.

Most engineering blogs would stop here, having defined the primitive. The interesting question is the next one: what does a Findings Tracker actually look like, and how does the cross-session fail-open behaviour show up in practice?

IV. What a Findings Tracker actually is

Here is one. The tracker below comes from a real ops workflow I run for FluxForge AI™ — an Upwork screening process that captures cohort data on which jobs to bid on, which to skip, and why. The work spanned five sessions across two weeks. I'll show what the file looks like and walk through how the cross-session boundary actually behaves.

The tracker file lives at docs/findings/2026-04-24_1025_upwork_activity_log_FINDINGS_TRACKER.md. The header opens with timestamps and origin:

**2026-04-24 10:25 UTC**

# Upwork Activity Log — Findings Tracker

**Created**: 2026-04-24 10:25 UTC
**Last Updated**: 2026-04-28 20:57 UTC
**Origin**: Session 5 start — S5 handoff Priority 1 (third carry-over across S2, S3, S5).
**Session**: ops 5
**Scope**: Missing rolling cohort record for post-F1 Upwork job screenings.

Created in session 5; last touched in session 10, four days later. The origin field tells session N+1 what triggered the work — in this case, a third carry-over from earlier sessions where the same problem had been deferred. The scope field tells session N+1 what's in and out: this tracker is about the capture mechanism, not about the filter logic that produces what gets captured (that's a sibling tracker).

The Overview section comes next, with a one-row table for F1:

| # | Finding | Type | Severity | Status | Stage |
|---|---------|------|----------|--------|-------|
| F1 | Upwork activity log artifact does not exist | Gap | Medium | Verified | Verified |

One finding per tracker; one row per finding. The columns name what kind of work it is (Gap), how urgent (Medium), where it sits in two orthogonal dimensions: Status (Open → In Progress → Resolved → Verified) and Stage (Open → Designing → Blueprint Ready → Planned → Implementing → Reviewed → Resolved → Verified). The Stage column is the closed/open/half-open of the cross-session circuit breaker — it tells the next session exactly where the work paused.
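Because the Overview is an ordinary pipe table, a new session can read the Stage column with a few lines of Python. The sketch below is illustrative: `parse_md_table` and `paused_at` are hypothetical helper names, and the parser assumes no cell contains a literal `|`.

```python
OVERVIEW = """
| # | Finding | Type | Severity | Status | Stage |
|---|---------|------|----------|--------|-------|
| F1 | Upwork activity log artifact does not exist | Gap | Medium | Verified | Verified |
"""


def parse_md_table(text: str) -> list[dict[str, str]]:
    """Parse a pipe table into a list of row dicts keyed by header cell."""
    def cells(line: str) -> list[str]:
        return [c.strip() for c in line.strip().strip("|").split("|")]

    rows = [ln for ln in text.splitlines() if ln.strip().startswith("|")]
    if len(rows) < 2:
        return []
    header = cells(rows[0])
    # rows[1] is the |---|---| separator, so data starts at rows[2]
    return [dict(zip(header, cells(ln))) for ln in rows[2:]]


def paused_at(text: str) -> dict[str, str]:
    """Map each finding id to the Stage where its work last paused."""
    return {row["#"]: row["Stage"] for row in parse_md_table(text)}
```

Calling `paused_at(OVERVIEW)` yields a mapping from finding id to stage, which is all a cold session needs in order to decide where to resume.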

The F1 section that follows the Overview contains the substantive work. Summary, root cause, resolution tasks as checkboxes:

- [x] **F1.1**: Design approach — decide schema, update protocol (→ /design tradeoff)
- [x] **F1.2**: Blueprint + implementation prompt (→ /blueprint)
- [x] **F1.3**: Implementation plan (→ /plan)
- [x] **F1.4**: Implement via /wrought-implement
- [x] **F1.5**: Code review via /forge-review --scope=diff
- [x] **F1.6**: Verify

Six tasks, each tied to a specific pipeline stage. Session N might check off F1.1 and F1.2; session N+1 reads the checkboxes and knows it should start at F1.3. The state is on disk, not in memory.
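Reading the checkboxes is mechanical. A small sketch, assuming the task lines keep the `- [x] **F1.n**: title` shape shown above; the regular expression and the name `next_open_task` are illustrative, not part of the plugin.

```python
import re
from typing import Optional

# Matches lines such as "- [x] **F1.2**: Blueprint + implementation prompt"
TASK = re.compile(r"^- \[(?P<done>[ xX])\] \*\*(?P<id>F\d+\.\d+)\*\*: (?P<title>.+)$")


def next_open_task(tracker_text: str) -> Optional[str]:
    """Return the id of the first unchecked resolution task, or None if all are done."""
    for line in tracker_text.splitlines():
        match = TASK.match(line.strip())
        if match and match.group("done") == " ":
            return match.group("id")
    return None
```

A session that finds F1.1 and F1.2 checked gets `"F1.3"` back and starts there; a fully checked list returns `None`.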

The cross-session magic moment is the Lifecycle table at the bottom of the F1 section:

| Stage           | Timestamp            | Session | Artifact |
|-----------------|----------------------|---------|----------|
| Open            | 2026-04-24 10:25 UTC | ops 5   | Finding Report |
| Designing       | 2026-04-24 10:40 UTC | ops 5   | Design Analysis |
| Blueprint Ready | 2026-04-24 12:45 UTC | ops 5   | Blueprint + Prompt |
| Planned         | 2026-04-24 13:00 UTC | ops 5   | Plan approved via ExitPlanMode |
| Implementing    | 2026-04-25 18:13 UTC | ops 5   | /wrought-implement iteration 1 |
| Resolved        | 2026-04-25 18:15 UTC | ops 5   | All 12 structural checks PASS |
| Reviewed        | 2026-04-25 18:18 UTC | ops 5   | /forge-review SKIPPED (doc-only) |
| Verified        | 2026-04-25 18:18 UTC | ops 5   | F1.6 closed; carry-over closed |

Eight stage-transition rows. Look at the timestamps: the Designing row was written at 10:40 UTC; the Implementing row was written more than 31 hours later, at 18:13 the next day. Different agent invocations, different conversations, same work item. None of those agents spoke to each other. They read the file.
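The gap between any two lifecycle rows can be computed directly from the timestamp strings, which is one way to spot where work crossed a session boundary. A short sketch, assuming every timestamp uses the `YYYY-MM-DD HH:MM UTC` format shown in the table; the helper names are hypothetical.

```python
from datetime import datetime, timezone


def parse_utc(stamp: str) -> datetime:
    """Parse a lifecycle timestamp such as '2026-04-24 10:40 UTC'."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M UTC").replace(tzinfo=timezone.utc)


def hours_between(start: str, end: str) -> float:
    """Elapsed hours from one lifecycle row to a later one."""
    return (parse_utc(end) - parse_utc(start)).total_seconds() / 3600


# Designing -> Implementing, using the two timestamps from the table above
gap = hours_between("2026-04-24 10:40 UTC", "2026-04-25 18:13 UTC")
```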

The Changelog at the bottom captures every session's actions in append-only form — one row per update. If session N+5 needs to understand why a particular decision was made, it reads the Changelog and finds the row that records the decision plus the session it happened in. This is the audit log. git log tells you what changed; the Changelog tells you what was happening. The two views are complementary.

Six months from now, an agent picking up this file knows exactly what was decided, when, and why. The reviewer-around-primary problem at the runtime layer has its session-around-session twin: structured fall-back, explicit warning, persistent log, automatic recovery on next session start. Same shape, one layer up.

V. The methodology layer above the agentic tool

So step back. The 65% of Python developers Real Python's audience research surfaced are stuck somewhere between "the agentic tool works on this small task" and "the agentic tool didn't help me ship the actual feature." That gap has a runtime layer and a session-boundary layer.

The agentic tool answers the runtime question. It composes small tasks within a session — read a file, edit a function, run the tests, observe the output, iterate. That's what the workshop curricula teach, and they teach it well. The Claude Code workshops, Cursor's tutorials, the increasingly mature category of agentic-IDE content — all of it is about the within-session loop.

The Findings Tracker answers the session-boundary question. It's the markdown artefact that lets the next session start where the previous one stopped, with the same fall-back behaviour Drew's circuit breaker provides at the runtime layer: read what's on disk, log a warning that this is post-context state not live state, verify before mutating, then resume.

Together they answer the 65%-stuck problem. Not as a tool fix and not as a runtime-pattern fix in isolation — both are necessary, neither is sufficient. The methodology layer is the layer that makes both work in production.

The Findings Tracker pattern is captured in the open-source Wrought™ plugin for Claude Code (MIT, https://github.com/fluxforgeai/wrought-plugin). Markdown is the substrate; the plugin just automates the writes. The pattern works in any editor with or without it — the trackers are plain markdown files that any session, any agent, any editor can read.

That's the open-core honesty of the artefact: there is nothing magic about the plugin. It generates markdown that follows a particular schema, and the schema is the load-bearing thing. If you fork the schema and never touch the plugin, you still get the cross-session fail-open behaviour. The plugin makes the writes consistent and saves you from re-deriving the schema; it does not own the pattern.

What this article argues is that the methodology layer — the schema, the stages, the verifier-gates, the changelog discipline, the zero-carry-over rule between forge-review runs — is the missing primitive in the agentic-coding stack as it stands in mid-2026. The runtime layer is the workshop's territory, and they own it well. Drew's piece set up half of this argument. The other half is what to do when the session is the stage that just failed.

VI. Closing

If you've built this layer differently — different schema, different state machine, different fall-back convention at the session boundary — I want to hear about it. The pattern as it stands in this article is one shape that works; it is almost certainly not the only shape. The goal is not the schema. The goal is the cross-session fail-open behaviour, expressed concretely enough that the next session knows exactly what the previous one was doing.

If you're chasing the 65%-stuck problem from a different angle — building it into Cursor's memory layer, into LangGraph state, into a custom MCP server, into something else — then the runtime layer and the session-boundary layer are the two places to look. Both need the same shape: the simplest correct behaviour when the upstream stage fails.

— Johan Genis (dev.to/usewrought)

88 Sessions. 201 Commits. Zero Users. What Building an AI Tool With AI Actually Taught Me.

Wrought — Fri, 24 Apr 2026 16:21:00 +0000

I have spent 88 sessions building a software engineering tool using AI — and I used the tool itself for every session. Here is what I learned.

The tool is called Wrought. It is a structured engineering process for AI coding assistants. Think of it as an engineering runbook that your AI assistant actually follows: pipelines for bug investigation, design analysis, implementation, and code review, all producing documentation that builds your project's institutional memory.

The unusual part is not the product. It is the method. Every feature, every bug fix, every architectural decision in Wrought was built using Wrought's own process. Dogfooding at its most literal. The tool that enforces design-before-code was designed before it was coded. The skill that generates findings trackers was tracked in a findings tracker. The code review system reviewed itself.

88 sessions. 201 commits. 87 days from first commit to this post. Here is what the numbers say about building with AI — and what they leave out.

The Numbers

Before the lessons, the raw data. These are not estimates; they are counted from the Git history and file system.

| Metric | Count |
|--------|-------|
| Sessions (Claude Code conversations) | 88 |
| Git commits | 201 |
| Findings trackers created | 47 |
| Design documents | 60 |
| Blueprints | 55 |
| Research reports | 33 |
| Code reviews (up-to-5-agent parallel review) | 23 |
| Investigation reports | 16 |
| RCA reports | 20 |
| Implementation prompts | 75 |
| Plans | 16 |
| Lines of Python (source) | ~2,300 |
| Lines of Python (tests) | ~3,100 |
| Skills (structured AI workflows) | 15+ |
| Files changed since first commit | 1,016 |
| Lines inserted | 193,000+ |
| Calendar days (Jan 26 to Apr 23) | 87 |

A few things jump out immediately. There are more lines of test code than production code. There are more design documents than there are commits to implement them. And 193,000 lines of insertions for a 2,300-line Python CLI means the overwhelming majority of the project is documentation, methodology artifacts, and process records — not source code.

That ratio — 84 to 1 — is the story.

Three Things That Worked

1. Cross-Session Memory Via Structured Artifacts

This is the single most valuable pattern I discovered.

AI coding assistants have a fundamental problem: they forget everything between sessions. Claude Code has auto-memory and CLAUDE.md, which help, but they are lossy. They capture vibes and preferences, not the state of a six-step pipeline with four open findings across three trackers.

The pattern that solved this is what I call the Findings Tracker. It is a markdown file — nothing fancy — that tracks every significant piece of work through a structured lifecycle:

Open -> Investigating/Designing -> Blueprint Ready -> Planned -> Implementing -> Resolved -> Verified

Each tracker has a dependency map, resolution tasks with checkboxes, lifecycle timestamps, and links to every artifact produced along the way. When a new session starts, the AI reads the tracker and knows exactly where work was interrupted, what has been tried, and what comes next.

Here is a real example. The "Context Compaction Resilience" tracker (docs/findings/2026-03-01_1600_context_compaction_resilience_FINDINGS_TRACKER.md) tracked a problem where Claude Code's auto-compaction would destroy in-flight state during long sessions. It spawned 5 sub-findings across a 5-layer defense architecture:

F1: No compact instructions in CLAUDE.md (solved: added a section the compactor reads)
F2: Context percentage data was siloed in the display (solved: bridged to a file)
F3: No last-chance backup before compaction (solved: PreCompact hook)
F4: No automated context threshold alerts (solved: Stop hook at 70% warn, 80% block)
F5: Context calculation was inaccurate (solved: fixed the math)
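The F4 thresholds reduce to a small pure function. The sketch below only illustrates the 70% warn / 80% block rule; it is not the hook's actual code. The function name, the token-count inputs, and the choice to treat both thresholds as inclusive are assumptions.

```python
def context_action(used_tokens: int, window_tokens: int,
                   warn_at: float = 0.70, block_at: float = 0.80) -> str:
    """Classify context usage as 'ok', 'warn', or 'block'."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    ratio = used_tokens / window_tokens
    if ratio >= block_at:  # check the stricter threshold first
        return "block"
    if ratio >= warn_at:
        return "warn"
    return "ok"
```

Keeping the rule separate from whatever hook calls it makes it easy to test in isolation, which matters given that F5 was a bug in the underlying percentage calculation.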

This work spanned 4 sessions. Without the tracker, each session would have started from scratch, rediscovering what had been tried. With it, every session picked up exactly where the last one left off.

I have 47 of these trackers. They are the project's institutional memory. Not AI-generated summaries — structured records with dependency maps, resolution tasks, and lifecycle stages.

2. Design-First Pipeline (Even When Code Is Cheap)

The most counterintuitive thing about building with AI: the faster code generation gets, the more design matters.

When generating code costs effectively nothing, the temptation is to skip analysis and start implementing. In the first few sessions, that is exactly what happened. And it produced mediocre results. The AI would generate code that worked but was architecturally questionable, or that solved the wrong problem, or that solved the right problem in a way that made the next feature harder to build.

The pipeline that emerged — and that Wrought now enforces — is:

/research -> /design -> /blueprint -> /wrought-implement -> /forge-review

Every feature starts with research (what exists, what are the constraints). Then a design analysis that evaluates multiple options with a structured tradeoff matrix. Then a blueprint with exact file specifications and acceptance criteria. Only then does implementation begin.

Here is what this looks like in practice. When I needed to set up the development environment (Session 1), the /design step evaluated 4 options:

Option A: Single .venv + uv dependency groups (scored 97/105)
Option B: Multiple virtual environments (scored 68/105)
Option C: Docker-only development (scored 51/105)
Option D: System Python + pip (scored 41/105)

Each option was scored across 7 weighted criteria. The analysis took maybe 10 minutes. The design document (docs/design/2026-02-11_1848_dev_environment_strategy.md) is still the reference I consult when questions about the dev setup arise.
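For readers who have not used a weighted tradeoff matrix, here is a minimal sketch of the arithmetic. The criteria names, the weights, and the 1-to-5 scoring scale are invented for illustration; they are chosen only so that seven criteria top out at 105, and they are not the criteria from the actual design document.

```python
# Hypothetical criteria and weights. The weights sum to 21, so a perfect
# score of 5 on every criterion totals 105.
WEIGHTS = {
    "reproducibility": 5, "setup_speed": 4, "ci_parity": 3, "tooling_support": 3,
    "disk_footprint": 2, "onboarding": 2, "portability": 2,
}


def weighted_total(scores: dict[str, int]) -> int:
    """Sum weight * score over all criteria; each score must be 1..5."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the weighted criteria")
    if any(not 1 <= s <= 5 for s in scores.values()):
        raise ValueError("each score must be between 1 and 5")
    return sum(WEIGHTS[name] * score for name, score in scores.items())


def rank(options: dict[str, dict[str, int]]) -> list[tuple[str, int]]:
    """Return (option, total) pairs, best first."""
    totals = {name: weighted_total(s) for name, s in options.items()}
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

The value of the matrix is less the final number than the record: the weights and per-criterion scores stay on disk, so the decision can be re-examined later.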

Compare that to the alternative: asking AI to "set up a dev environment" and getting whatever the model's default recommendation happens to be that day. That might work once. It does not produce decisions you can explain or revisit 6 months later.

60 design documents later, the pattern has proven itself. Design analysis is cheap with AI assistance. Rework from skipping it is expensive.

3. Self-Referential Testing (The Tool Reviews Itself)

The most powerful quality mechanism was not unit tests (though there are 324 of those). It was using the tool on itself.

Wrought's /forge-review skill runs up to 5 parallel AI subagents — 4 run on every review, each specialized in a different dimension of code quality: algorithmic complexity, data structure selection, paradigm consistency, and computational efficiency. A 5th — flow integrator — spawns conditionally when the diff touches navigation-surface files (routes, nav items, wizards, redirects). When this skill was built, it was immediately used to review its own codebase.

The results were humbling. The review found:

cli.py had grown to 936 lines with 6+ separate concerns (module cohesion violation)
update_index was doing O(n) linear scans for upsert operations
A module-level constant (DOCS_DIRS) was a mutable list — a classic Python footgun
A marker template was duplicated between cmd_init and cmd_upgrade

All four findings were tracked, designed, blueprinted, planned, implemented, and verified through the pipeline. The code review system found debt in the codebase, and the pipeline system fixed it. Self-referential quality assurance.
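Two of those findings have well-known general remedies, sketched below. This is not the fix that shipped in Wrought: the contents of `DOCS_DIRS`, the `path` key, and the function names `upsert_linear` and `upsert_keyed` are assumptions made for illustration.

```python
# A list constant can be mutated by any importer; a tuple cannot.
# The directory names here are illustrative.
DOCS_DIRS = ("docs/findings", "docs/design")


def upsert_linear(index: list[dict], entry: dict) -> None:
    """O(n) upsert: scan the list for a matching path, then replace or append."""
    for i, existing in enumerate(index):
        if existing["path"] == entry["path"]:
            index[i] = entry
            return
    index.append(entry)


def upsert_keyed(index: dict[str, dict], entry: dict) -> None:
    """O(1) average-case upsert: key the index by path instead of scanning."""
    index[entry["path"]] = entry
```

Both functions produce the same logical index; the keyed version simply trades the list for a dictionary so that each update no longer walks every entry.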

The review system had a blind spot. None of the four subagents checked flow integrity — how a change to a route, a nav item, or a wizard step affects paths through the product. That gap went unexamined until a post-ship navigation bug in a frontend project exposed it. The fix followed the same pipeline — finding, design, blueprint, implementation, review — and in Session 86 a fifth reviewer, flow integrator, was added. The review system found a gap in its own rubric, and the pipeline built the subagent to close it.

The same pattern applied throughout development. The workflow enforcement engine that prevents skipping pipeline steps? It was built after the AI skipped a pipeline step in Session 41. The context compaction defense system? Built after auto-compaction destroyed an in-flight session. Every failure became a finding, every finding became a fix, and every fix was tested by continuing to use the tool.

This is not just dogfooding. It is a continuous quality feedback loop where the product's own methodology catches and corrects its own defects.

Three Things That Did Not Work

1. Building for 72 Sessions Without a Single User

This is the hardest thing to write, because it is the most important lesson.

At Session 72, a competitive landscape analysis revealed that the market had shifted significantly during those first months of heads-down building. An open-source project in the same space had accumulated 50,000+ GitHub stars and 119 community-contributed skills. Anthropic had shipped native features (Agent Teams, Tasks, code review) that overlapped with planned Wrought capabilities. The market window had compressed from an estimated 12-18 months to 6-9 months.

And Wrought had zero users. Zero revenue. Zero external validation.

The go-to-market findings tracker (docs/findings/2026-03-23_1430_wrought_go_to_market_strategy_FINDINGS_TRACKER.md) logged this as F1, severity: Critical. It is the most important finding in the project.

The numbers told the story. By Session 72, I had 31 findings trackers, 47 design documents, 43 blueprints — all focused inward. A sophisticated methodology producing sophisticated artifacts about a tool that no one outside the project had touched.

The fix came late but came clearly: stop perfecting internals, start external validation. Consulting as a bridge to revenue. Content marketing (including this post) as a bridge to users. Plugin distribution as a bridge to developers. All three should have started by Session 20, not Session 72.

That was the nadir. Sixteen sessions later — Session 88 — something finally broke the other way.

A colleague who had been reviewing Wrought's material with a domain practitioner he'd worked with for years sent me the call recap. The practitioner had identified — unprompted, and repeatedly — a specific pain pattern: operators running internal workflow steps by hand, one at a time, with no systematic capture of what worked. The kind of repetitive, structured, correction-loop work that Wrought's pipelines are designed for. A follow-up conversation was offered. The conversation has not yet happened.

This is not a closed user. It is not a purchase, not a contract, not a proof of product-market fit. It is one practitioner, in one vertical, in one conversation, identifying one pain that the tool might address. It is, however, the first external technical signal this project produced in 88 sessions, and it arrived more than sixty sessions later than it should have.

The lesson still stands. The arithmetic of earlier distribution would have compounded that signal many times over.

2. Over-Engineering the Internals

The pipeline skip enforcement saga is instructive.

In Session 41, the AI agent skipped the /plan step after /blueprint. A legitimate process violation. The response? A 3-layer defense-in-depth architecture: skill language hardening, CLAUDE.md rule tightening, and a code-enforced stage gate.

This was implemented, verified, and shipped. In Session 46, the agent added editorial commentary suggesting a skip might be acceptable. So the rules were further hardened with an explicit "no commentary" clause.

In Session 62, the agent used EnterPlanMode directly instead of the reactive pipeline. So a new rule (Rule 8) was added to CLAUDE.md, creating what was by then an 8-rule governance framework just for pipeline adherence.

Three sessions of engineering, three findings tracked, three design documents — all to prevent an AI from occasionally suggesting a shortcut. The enforcement worked, but the effort was disproportionate. A simpler rule with a simpler enforcement mechanism would have captured 90% of the value at 20% of the cost.

When your tool is good at tracking and fixing things, everything looks like something to track and fix. Not everything deserves the full pipeline.

3. Underestimating Distribution

The product thesis — that AI coding tools need structured discipline, not just more features — is, I believe, correct. The execution thesis — that building a great tool is sufficient for distribution — was wrong.

Having 15+ skills, 324 tests, a workflow enforcement engine, an up-to-5-agent parallel code review system, and a self-documenting methodology means nothing if the tool is not where developers look for tools. For most of the project, it was not on any marketplace. It was not a plugin. It had no content presence. The website exists but has no organic traffic.

The Claude Code ecosystem has a plugin format. Anthropic has a marketplace (or will soon). Dev.to, Reddit, and Hacker News have active communities discussing exactly the problems Wrought addresses. None of these channels had been touched before Session 70.

That shifted late in the game. By Session 75 the plugin had shipped — MIT-licensed, 42 files, installable with one command. By Session 79 a first blog post had landed. By Session 88 a three-week publishing cadence was running, and this post is its first entry. The direction is right. The timing was late by sixty-plus sessions.

Distribution is not a post-launch activity. It is a pre-launch requirement. If I were starting over, the first 10 sessions would include publishing content and engaging with the community — even before the tool was ready. Feedback from real developers would have shaped the product better than 60 design documents written in isolation.

The Category Question (Layer 4)

By April 2026, the AI coding tool landscape has taken shape. Five layers are visible; four are contested.

Layer 1 — IDE and editor integrations (Cursor, Windsurf, Copilot inside VS Code).
Layer 2 — chat and client UIs (Ona's new two-panel interface, Claude Code's terminal, ChatGPT Codex).
Layer 3 — agent runtimes and infrastructure (Ona Cloud with warm pools and Project Veto sandboxing; Anthropic's Managed Agents, shipped April 2026; LangChain's Deep Agents; Anthropic's native Code Review; Nous Research's Hermes Agent).
Layer 4 — methodology, governance, process.
Layer 5 — tools and skills (skill marketplaces, community-contributed agents, plugin packs).

Layer 4 is empty.

There is no major player in the methodology layer. The closest public benchmark is Ona's six-criteria auto-approve-low-risk-PR mechanism — a 74% reduction in time-to-first-approval and a 3× increase in deployment rate. That result is a strong validation of process discipline as a lever, but it is rule-based automation, not methodology enforcement. It proves that a simple rubric can compound into measurable operational gains. It does not build the process layer.

Wrought is a bet on Layer 4. The 47 findings trackers, 60 design documents, 55 blueprints, and 23 forge-reviews are not product artifacts. They are methodology artifacts. The 84-to-1 ratio between lines of documentation and lines of source code is not a sign of over-engineering. It is what the methodology layer looks like when implemented.

The bet is that when everyone has AI coding assistants and the L1–L3 differentiators flatten, the variable that remains is whether teams have a process their assistant follows. That is Layer 4. Wrought claims it.

The Methodology: A Brief Overview

For those interested in the process itself, here is how Wrought's pipeline works. You can adopt this approach with or without the tool.

Every significant task starts with a Finding. A finding is a gap, defect, or drift — something that needs attention. It gets logged in a Findings Tracker with severity, type, and a proposed resolution path.

Two pipelines handle two types of work:

Reactive (something is broken): Incident -> Investigate -> RCA/Bugfix -> Implement Fix -> Code Review
Proactive (something needs building): Research -> Design -> Blueprint -> Implement -> Code Review
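As a rough illustration, the two pipelines can be written down as ordered stage lists with a helper that returns the next stage. The stage names come from the lines above; `PIPELINES`, `PipelineKind`, and `nextStage` are hypothetical names for this sketch and are not Wrought's actual API.

```typescript
// Stage names are taken from the two pipelines described above.
// The identifiers here are illustrative, not Wrought's real interface.
const PIPELINES = {
  reactive: ["Incident", "Investigate", "RCA/Bugfix", "Implement Fix", "Code Review"],
  proactive: ["Research", "Design", "Blueprint", "Implement", "Code Review"],
} as const;

type PipelineKind = keyof typeof PIPELINES;

// Returns the stage that follows `current`, or null when the pipeline is complete.
function nextStage(kind: PipelineKind, current: string): string | null {
  const stages: readonly string[] = PIPELINES[kind];
  const i = stages.indexOf(current);
  if (i === -1) {
    throw new Error(`Unknown stage "${current}" for the ${kind} pipeline`);
  }
  return i + 1 < stages.length ? stages[i + 1] : null;
}
```

Keeping the order in data rather than in prose means a tracker file or a script can answer "what comes next?" without anyone having to remember it.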

Design before code, always. The /design step produces a structured analysis with multiple options, weighted criteria, and a recommendation. It takes 10-30 minutes with AI assistance and saves hours of rework.

Blueprints specify acceptance criteria. Before implementation begins, there is a document listing exact file changes, acceptance criteria, and test expectations. The AI implements against this spec, not against a vague prompt.

Structured artifacts are cross-session memory. Every step produces a dated, typed document in a known location. When a new session starts, the AI can read these artifacts and resume exactly where work stopped.

Code review closes the loop. After implementation, up to 5 specialized AI subagents review the changes: 4 run on every review — algorithmic complexity, data structure selection, paradigm consistency, and computational efficiency; a 5th (flow integrator) spawns when the diff touches navigation-surface files such as routes, nav items, wizards, and redirects.

A typical feature takes three sessions and 3-6 hours: finding and research, then design and blueprint, then implementation and review. Each step produces a dated artifact. The methodology works. The artifacts prove it.

What Happens Next

Three things are now happening in parallel.

Consulting. The methodology that built Wrought is now offered as a service: production bug fixes with root cause analysis, feature architecture and implementation, codebase reviews, and Claude Code workflow setup. The pipeline works on any codebase, not just Wrought.

Plugin distribution. Wrought's 15+ skills, review agents, and hooks are packaged as an open-source Claude Code plugin — 42 files, MIT licensed, built in Session 75. Public marketplace distribution is scheduled to coincide with the V2.0 launch.

Content. This post is the first of a three-week publishing calendar that will land four long-form pieces: this retrospective, a benchmark response to Ona's 74%/3× auto-approve results (around 2026-04-25), a deep-dive on cross-session memory patterns (2026-04-29), and a piece on why design-first matters even more when AI makes code cheap (2026-05-08). Beyond that: structured incident response workflows, solo-founder retrospectives, and how the 88-session methodology maps onto team settings.

Wrought V1.0 is a local CLI tool. V2.0 will be an MCP server — a hosted service that any AI coding assistant can connect to over HTTP. The architecture is designed, the skills are built, and the distribution strategy is now in motion.

If the methodology interests you, there are two paths:

Try the approach. The Findings Tracker pattern and design-first pipeline work with any AI coding tool. Start by creating a markdown file that tracks your significant tasks through structured stages. You do not need Wrought to benefit from the process.
Follow the build. I will be publishing weekly at fluxforge.ai/blog and cross-posting to Dev.to and LinkedIn. The next piece covers cross-session memory in detail — the specific patterns that make AI assistants useful across long-running projects.

The 88 sessions taught me that AI does not replace engineering discipline. It amplifies whatever discipline you bring. Bring structure, get structured results. Bring chaos, get faster chaos.

Wrought is the structure I built. Now it is time to find out if anyone else needs it.

Originally published at fluxforge.ai/blog. Find me on LinkedIn or follow the build at fluxforge.ai.

I Built a Real-Time Artemis II 3D Tracker in One Session — Here's the Engineering Pipeline That Made It Possible

Wrought — Fri, 03 Apr 2026 22:14:47 +0000

On April 1, 2026, four astronauts launched aboard Orion on Artemis II — humanity's first crewed voyage beyond low Earth orbit since Apollo 17 in 1972.

I wanted to track it. Not on a static NASA page. Not on someone else's stream overlay. I wanted an interactive 3D visualization with real telemetry, in my browser, that I built myself.

Six hours later, after a single afternoon, I had one. It is live at artemis-tracker-murex.vercel.app.

47 files. ~8,000 lines of TypeScript. 15 unit tests. 5 serverless API proxies. Degree-8 Lagrange interpolation at 60fps. An AI mission chatbot. Deep Space Network status. Deployed on Vercel.

Built in a single session using Claude Code with a structured engineering pipeline called Wrought.

This post isn't about "look what AI can do." It's about what happens when you give an AI agent engineering discipline instead of just a prompt.

What the App Does

ARTEMIS is a real-time 3D mission tracker that combines three NASA data sources into one interactive visualization:

OEM Ephemeris Files from NASA's AROW system — actual spacecraft state vectors (position and velocity) at 4-minute intervals, interpolated to 60fps using Lagrange polynomials
Deep Space Network XML — live antenna status from Goldstone, Canberra, and Madrid
JPL Horizons API — Moon position in the same J2000 reference frame as the spacecraft data

The result: you can watch Orion move along its trajectory in real time, see its speed, distance from Earth, distance to the Moon, and which ground stations are currently talking to it.
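For readers who have not met Lagrange interpolation before, here is a minimal sketch of the textbook formula, assuming one scalar component (say, the x position) sampled at known times. The function name and signature are made up for this example and are not taken from the ARTEMIS source. With nine sample points the polynomial has degree 8.

```typescript
// Evaluate the Lagrange interpolating polynomial through the points
// (ts[i], ys[i]) at time t. Nine points give a degree-8 polynomial.
function lagrange(ts: readonly number[], ys: readonly number[], t: number): number {
  if (ts.length === 0 || ts.length !== ys.length) {
    throw new Error("ts and ys must be non-empty and the same length");
  }
  let sum = 0;
  for (let i = 0; i < ts.length; i++) {
    // Basis polynomial L_i(t): equals 1 at ts[i] and 0 at every other sample time.
    let basis = 1;
    for (let j = 0; j < ts.length; j++) {
      if (j !== i) basis *= (t - ts[j]) / (ts[i] - ts[j]);
    }
    sum += ys[i] * basis;
  }
  return sum;
}
```

Each position and velocity component would be interpolated separately with the same sample times. Using times relative to the first sample (rather than raw epoch milliseconds) keeps the products numerically well behaved.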

There's also an AI chatbot powered by Gemini 2.5 Flash. Common questions like "How long is the mission?" resolve instantly via client-side quick-answer buttons, with no API call needed. Free-text questions go to the model, with the curated knowledge base supplied in the system prompt.
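The split between canned answers and model calls can be sketched roughly as follows. Everything here is hypothetical: the button id, the placeholder answer text, and the `route` function are assumptions made for illustration, not the app's actual code.

```typescript
// Hypothetical quick-answer table keyed by button id; the answer text is a placeholder.
const QUICK_ANSWERS: Record<string, string> = {
  "mission-duration": "<canned answer from the curated knowledge base>",
};

type ChatInput =
  | { kind: "button"; id: string }   // user clicked a quick-answer button
  | { kind: "text"; text: string };  // user typed a free-text question

type Route =
  | { via: "client"; answer: string } // resolved locally, no API call
  | { via: "llm"; prompt: string };   // forwarded to the model

function route(input: ChatInput): Route {
  if (input.kind === "button") {
    const answer = QUICK_ANSWERS[input.id];
    if (answer === undefined) throw new Error(`No quick answer for "${input.id}"`);
    return { via: "client", answer };
  }
  return { via: "llm", prompt: input.text.trim() };
}
```

The design choice is simply that the cheap path is decided before any network code runs, so the free-tier request budget is spent only on questions that need it.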

The Problem with "Vibe Coding"

Every week, someone posts "I built X in 20 minutes with AI." And every week, the comments are the same: Does it have tests? How's the error handling? What happens when the API is down? Did you actually read the code?

These are fair questions. The dirty secret of AI-assisted speed runs is that most of them produce code that works for a demo and breaks in production. The AI generates plausible code. You accept it. Ship it. Move on.

The issue isn't speed — it's the absence of process. No design review. No architecture decision records. No code review. No root cause analysis when something goes wrong. Just "prompt → code → deploy → pray."

I wanted to show there's a better way.

The Pipeline

Wrought is a structured engineering pipeline I built for Claude Code. It enforces a specific sequence of skills for every significant piece of work:

Finding → Research → Design → Blueprint → Implementation → Code Review

Each stage produces a documented artifact. Each artifact feeds the next stage. The AI agent can't skip ahead — the pipeline is the process.

Here's how it played out for ARTEMIS.

Stage 1: Finding

Every task starts with a Findings Tracker — a structured document that captures what you're building, why, and tracks it through every stage.

Finding: Interactive Artemis II Live Visualization
Type: Gap
Severity: High
Rationale: Artemis II launched 2026-04-01. No unified interactive tracker exists.
  NASA data is scattered across OEM files, XML feeds, and on-demand APIs.
  Mission window is ~10 days — time-sensitive opportunity.

This isn't bureaucracy. It's cross-session memory. If I stop working and come back tomorrow, the tracker tells me exactly where I left off and what decisions were already made.
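If you want to try the pattern without the tool, a tracker entry can be modeled as a small typed record that renders to plain text. This is a sketch under assumptions: the field names mirror the example above, but the `Severity` values other than "High", the `stage` field, and the `renderFinding` helper are hypothetical and not Wrought's format.

```typescript
type FindingType = "Gap" | "Defect" | "Drift";
// Only "High" appears in the example above; the other levels are assumed.
type Severity = "Low" | "Medium" | "High" | "Critical";
type Stage = "Finding" | "Research" | "Design" | "Blueprint" | "Implementation" | "Code Review";

interface Finding {
  title: string;
  type: FindingType;
  severity: Severity;
  rationale: string[]; // one entry per rationale line
  stage: Stage;        // where the work currently stands
}

// Render the record as the plain-text block a new session can read back.
function renderFinding(f: Finding): string {
  return [
    `Finding: ${f.title}`,
    `Type: ${f.type}`,
    `Severity: ${f.severity}`,
    `Stage: ${f.stage}`,
    `Rationale: ${f.rationale.join("\n  ")}`,
  ].join("\n");
}
```

The `stage` line is what makes the file useful as memory: a fresh session reads it and knows which step to resume.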

Stage 2: Research

Before building a chatbot, I needed to decide how to build it. The research skill evaluated three approaches:

Approach	Pros	Cons	Verdict
FAQ Bot (pattern matching)	Zero cost, instant	Can't handle novel questions	Too rigid
System Prompt + LLM	Simple, full knowledge in context	Per-query API cost	Selected
RAG (vector search)	Scales to large corpora	Massive overengineering for ~3K tokens of facts	Overkill

The entire Artemis II knowledge base — mission timeline, crew bios, spacecraft specs, orbital mechanics — fits in about 3,000 tokens. That's smaller than this blog post. Building a RAG pipeline with embeddings, a vector database, and chunking strategy for 3,000 tokens of content would have been absurd.

The research also surfaced that Gemini 2.5 Flash has a genuinely free tier (15 requests per minute, 1,000 per day, no credit card). Claude would have been higher quality, but the $0 budget constraint made Gemini the pragmatic choice.

Stage 3: Design

The design stage evaluated four architecture options with weighted scoring:

Option	Stack	Score	Verdict / Why Not
A: Vite + R3F	Vite, React, React Three Fiber, Vercel	8.6/10	Selected
B: Vite + 2D	Vite, React, Canvas 2D	6.2/10	No depth perception for 3D trajectory
C: Next.js + R3F	Next.js, React, R3F	7.8/10	SSR adds hydration complexity for a pure client app
D: Vanilla Three.js	Three.js, no framework	5.4/10	Manual state management, no HMR for scene
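The article does not list the criteria or weights behind these scores, so the following is only a generic sketch of how a weighted score can be computed. The `Criterion` type, the `weightedScore` function, and any criteria you pass in are hypothetical.

```typescript
interface Criterion {
  name: string;
  weight: number; // relative importance; weights need not sum to 1
}

// Weighted average of per-criterion ratings (for example on a 0-10 scale).
function weightedScore(
  criteria: readonly Criterion[],
  ratings: Readonly<Record<string, number>>,
): number {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight <= 0) throw new Error("weights must sum to a positive number");
  let total = 0;
  for (const c of criteria) {
    const rating = ratings[c.name];
    if (rating === undefined) throw new Error(`missing rating for "${c.name}"`);
    total += c.weight * rating;
  }
  return total / totalWeight;
}
```

Writing the weights down before scoring the options is the point of the exercise: it forces the trade-offs to be explicit before a favorite option can bend them.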

The design document also specified the data pipeline for all four NASA sources, the coordinate system (J2000 Earth-centered, 1 unit = 10,000 km), the HUD layout, and camera presets.

Key insight: Next.js was actively wrong for this project. There's no content to server-render. No SEO to optimize. No dynamic routes. It's a WebGL canvas that talks to APIs. Vite gives you sub-second HMR and a smaller bundle without the hydration tax.

Stage 4: Blueprint

The blueprint translated the design into an implementation spec: 48 files across 8 phases, with acceptance criteria for each phase. It specified the exact file structure, the interfaces for the OEM parser and interpolator, the Zustand store shape, and the serverless proxy signatures.

This is where the pipeline pays dividends. By the time implementation starts, the AI agent has:

A clear architecture to follow (not just a vague prompt)
Specific interfaces to implement (not ad-hoc decisions mid-code)
An explicit scope boundary (what to build AND what not to build)

Stage 5: Implementation

One iteration. All 15 tests passing. Build succeeds. Deployed to Vercel.

That's not AI magic — that's the upstream work paying off. When the design document specifies "Zustand store with scalar selectors to avoid re-render storms" and the blueprint defines the exact store interface, the implementation becomes an execution problem, not a design problem.

Stage 6: Code Review (This Is Where It Gets Interesting)

After implementation, the pipeline runs a multi-agent code review called /forge-review. Four specialized agents review the code in parallel:

Complexity Analyst — algorithmic time/space complexity
Paradigm Enforcer — consistency within files
Efficiency Sentinel — performance anti-patterns
Data Structure Reviewer — access patterns vs. data structure selection

The first review found 5 critical issues:

Critical 1: 60fps Re-Render Storm

The HUD reads spacecraft state as an object from Zustand.
Zustand creates a new object reference every update.
React re-renders all HUD components every frame.
At 60fps, that's 60 full React reconciliation passes per second.

The fix: DataDriver writes position to a shared useRef for the 3D scene (zero React overhead), and throttles Zustand store updates to 4Hz for the HUD. Each HUD card uses a scalar selector (state => state.spacecraft.speed) instead of selecting the whole object.
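The throttling half of that fix can be sketched without React or Zustand. This is a minimal sketch, assuming the caller passes in the current time; `createThrottledWriter` is a hypothetical helper, not the app's code. At 4Hz the interval is 250 ms.

```typescript
// Returns a function that forwards a value to `write` at most once per `intervalMs`.
// The caller supplies the current time so the logic stays deterministic and testable.
function createThrottledWriter<T>(intervalMs: number, write: (value: T) => void) {
  let lastWriteMs = -Infinity;
  return (value: T, nowMs: number): boolean => {
    if (nowMs - lastWriteMs < intervalMs) return false; // too soon: skip this frame
    lastWriteMs = nowMs;
    write(value);
    return true;
  };
}
```

In a render loop, the per-frame callback would call the returned function with each new sample and the frame timestamp, so the store (and therefore the HUD) sees about four updates per second while the 3D scene keeps reading the latest value from the ref every frame.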

Critical 2: O(n) Linear Scan in Hot Path

The Lagrange interpolator was doing a linear search through all state vectors to find the nearest data point. With a 3,232-line OEM file at 60fps, that's ~194,000 comparisons per second.

The fix: binary search. O(log n). The data is already sorted by epoch.

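// Binary search (lower bound): find the first state vector at or after time t.
// Assumes vectors is sorted ascending by epochMs.
let lo = 0, hi = vectors.length - 1;
while (lo < hi) {
  const mid = (lo + hi) >> 1;
  if (vectors[mid].epochMs < t) lo = mid + 1;
  else hi = mid;
}
// lo is now the index of the first vector with epochMs >= t
// (or the last index if t is past the end of the data).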
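Once the search has located the neighborhood of t, a degree-8 interpolator needs nine consecutive samples around it. The sketch below shows one way to pick that window, clamped at both ends of the data. Centering the window on the search result is an assumption of this example, and `StateVector`, `lowerBound`, and `windowStart` are illustrative names rather than the app's own.

```typescript
interface StateVector {
  epochMs: number;
}

// Index of the first vector with epochMs >= t, clamped to the last index.
function lowerBound(vectors: readonly StateVector[], t: number): number {
  let lo = 0;
  let hi = vectors.length - 1;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (vectors[mid].epochMs < t) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Start index of `size` consecutive samples roughly centered on t.
function windowStart(vectors: readonly StateVector[], t: number, size = 9): number {
  if (vectors.length < size) throw new Error(`need at least ${size} state vectors`);
  const start = lowerBound(vectors, t) - Math.floor(size / 2);
  // Clamp so the window never runs off either end of the array.
  return Math.min(Math.max(start, 0), vectors.length - size);
}
```

Near the first and last samples the window stops being centered, which is usually acceptable because the polynomial is still evaluated inside the span of its sample points.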

Critical 3: Per-Frame Memory Allocations

The interpolator was calling .slice() and .map() inside the hot loop — allocating new arrays every frame. At 60fps, that's 120+ garbage-collected arrays per second.

The fix: module-level reusable buffer, direct array indexing instead of .map(), and accepting epochMs as a number instead of creating Date objects.
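A minimal sketch of the reusable-buffer idea, assuming a nine-point window and one scalar component per sample. The buffer names, the `Sample` shape, and `loadWindow` are hypothetical and chosen for this example.

```typescript
const WINDOW = 9;

// Module-level scratch buffers: allocated once, overwritten on every call.
const timesBuf = new Float64Array(WINDOW);
const valuesBuf = new Float64Array(WINDOW);

interface Sample {
  epochMs: number; // plain number, so no Date objects are created per frame
  x: number;
}

// Copy WINDOW consecutive samples into the scratch buffers using direct
// indexing, so the hot path performs no slice(), map(), or other allocation.
function loadWindow(samples: readonly Sample[], start: number): void {
  if (start < 0 || start + WINDOW > samples.length) {
    throw new Error("window out of range");
  }
  for (let i = 0; i < WINDOW; i++) {
    const s = samples[start + i];
    timesBuf[i] = s.epochMs;
    valuesBuf[i] = s.x;
  }
}
```

The trade-off is that the buffers are shared state: the caller has to finish reading them before the next call overwrites them, which holds when everything runs in one synchronous frame callback.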

Critical 4: Stale Closure in useChat

The chat hook captured messages in a closure at mount time. When a user sent a second message, the API call used the stale initial array — losing the conversation history.

The fix: useRef tracking the latest messages array, with useCallback reading from the ref instead of the closure.
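The difference between the bug and the fix can be shown in plain TypeScript, with no React involved. In this sketch the `Ref` interface is a stand-in for the object that React's `useRef` returns, and `makeChat` is a hypothetical function that simulates a hook's lifetime. It is an illustration of the mechanism, not the app's `useChat` hook.

```typescript
type Message = { role: "user" | "assistant"; text: string };

// Stand-in for the mutable object React's useRef returns.
interface Ref<T> {
  current: T;
}

function makeChat(initial: Message[]) {
  let messages = initial;                      // "state": replaced on every update
  const latest: Ref<Message[]> = { current: initial };
  const snapshot = messages;                   // what a mount-time closure captures

  const append = (m: Message): void => {
    messages = [...messages, m];               // new array, as a state update would produce
    latest.current = messages;                 // keep the ref pointing at the newest array
  };

  const staleHistory = (): Message[] => snapshot;        // bug: always the mount-time array
  const freshHistory = (): Message[] => latest.current;  // fix: read through the ref

  return { append, staleHistory, freshHistory };
}
```

A callback that reads `latest.current` at call time sees every message appended so far, while one that closed over `snapshot` keeps sending the empty initial history.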

Critical 5: StrictMode Double-Mount Breaking Polls

A fetchedRef guard prevented re-fetching in StrictMode's double-mount cycle, but also broke cleanup — orphaning intervals and timeouts.

The fix: remove the ref guard entirely. Use AbortController for cancellation. Clean up all intervals and timeouts in the effect's return function.
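The cleanup pattern can be sketched as a plain function that starts polling and returns its own teardown. This is a minimal sketch, assuming the poll callback accepts an `AbortSignal`; `startPolling` is a hypothetical helper, not the app's code.

```typescript
// Start polling immediately and then every `intervalMs`.
// Returns a cleanup function that cancels in-flight work and stops the timer.
function startPolling(
  poll: (signal: AbortSignal) => Promise<void>,
  intervalMs: number,
): () => void {
  const controller = new AbortController();

  const run = (): void => {
    // Swallow rejections here (including aborts); the next tick simply tries again.
    poll(controller.signal).catch(() => {});
  };

  run();
  const timer = setInterval(run, intervalMs);

  return () => {
    controller.abort();   // cancels any request that was given this signal
    clearInterval(timer); // no orphaned interval after unmount
  };
}
```

Inside a React effect, the effect body would return the function that `startPolling` returns. Under StrictMode's mount, unmount, remount cycle, the first run is then torn down completely and the second run starts clean, so no guard ref is needed.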

The Re-Review

After fixing all five criticals through an RCA (root cause analysis) cycle, the pipeline ran a second code review:

Severity	First Review	Re-Review
Critical	5	0
Warning	10	2
Suggestion	8	2

The two remaining warnings were both benign edge cases (a module-level buffer that's safe in single-threaded browsers, and a ref timing gap that's correctly compensated for). Zero criticals.

This is the pipeline's value proposition. Without the review stage, those five bugs would have shipped. The linear scan and re-render storm would have made the app noticeably janky on mobile. The stale closure would have broken multi-turn chat. And you'd never know until users complained.

The Numbers

Metric	Value
Source files	47
Lines of code	~8,000
Unit tests	15
Implementation iterations	1
Critical bugs caught	5 (all fixed)
Build time	~2 seconds
Bundle size	149KB app + 1.1MB Three.js/R3F
NASA data sources	4
Serverless proxies	5
Pipeline artifacts	12 documents
Production deploys	8

What I Learned

1. Upstream design eliminates downstream churn. The implementation completed in one iteration because the blueprint answered most design questions in advance. "What shape is the Zustand store?" isn't a question you want the AI deciding mid-implementation.

2. Code review catches real bugs, not style nits. The forge-review found a 60fps re-render storm and an O(n) hot path — genuine performance issues that would have shipped silently without the review stage.

3. RAG is the new microservices. Not everything needs it. A 3,000-token knowledge base doesn't need embeddings, vector search, and a retrieval pipeline. System prompt stuffing is boring and effective.

4. The audit trail is the product. Every decision — why Vite over Next.js, why Gemini over Claude, why Lagrange over Runge-Kutta — is documented in the pipeline artifacts. Six months from now, when someone asks "why did you build it this way?", the answer exists in the design doc, not in someone's memory.

Try It

Live app: artemis-tracker-murex.vercel.app
Source code: github.com/fluxforgeai/ARTEMIS
Wrought pipeline: wrought-web.vercel.app

The full pipeline artifacts — finding, research, design, blueprint, reviews, RCAs — are all in the repo's docs/ directory. The process is as open as the code.

Built with Claude Code + Wrought by FluxForge AI.