<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Joseph Yeo</title>
    <description>The latest articles on Forem by Joseph Yeo (@josephyeo).</description>
    <link>https://forem.com/josephyeo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3863060%2F14a6921b-eef9-4611-ba9b-c1a7b9835304.png</url>
      <title>Forem: Joseph Yeo</title>
      <link>https://forem.com/josephyeo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/josephyeo"/>
    <language>en</language>
    <item>
      <title>I Built a Local AI Coding Agent on M5 Max 128GB — It Failed 164 Times Before Passing 35 Tests</title>
      <dc:creator>Joseph Yeo</dc:creator>
      <pubDate>Sat, 09 May 2026 11:31:07 +0000</pubDate>
      <link>https://forem.com/josephyeo/i-built-a-local-ai-coding-agent-on-m5-max-128gb-it-failed-164-times-before-passing-35-tests-2fgj</link>
      <guid>https://forem.com/josephyeo/i-built-a-local-ai-coding-agent-on-m5-max-128gb-it-failed-164-times-before-passing-35-tests-2fgj</guid>
      <description>&lt;p&gt;&lt;strong&gt;Fully local. No cloud APIs during execution. TDD-enforced. 35 tests passing.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To be clear: planning, docs, the initial architecture, and correction-rule design were all handled with Claude (a cloud API). The experiment was strictly about whether a local LLM could survive the &lt;strong&gt;autonomous execution loop&lt;/strong&gt; without phoning home. The coding agent loop (Brain → Coder → Tester) ran 100% locally, with no API calls during execution.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Actual Setup
&lt;/h2&gt;

&lt;p&gt;Most AI coding agent posts you see rely on GPT-4o or Claude via API. The model lives in a data center, and you're paying per token. That's fine — but it means your code, your architecture decisions, and your project context are all leaving your machine.&lt;/p&gt;

&lt;p&gt;I wanted something different: a multi-agent system that runs &lt;em&gt;entirely on my MacBook Pro M5 Max 128GB&lt;/em&gt;. It autonomously writes code, runs tests in a Docker sandbox, and only commits when tests pass. No internet required once it's running.&lt;/p&gt;

&lt;p&gt;This is the story of ForgeFlow — what I built, what broke, and what the data showed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hardware Context
&lt;/h2&gt;

&lt;p&gt;The M5 Max 128GB is unusual hardware for this kind of work. Most local LLM setups top out at 32GB or 64GB unified memory, which forces you to choose between model quality and running multiple models simultaneously. At 128GB, that constraint disappears.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I ran simultaneously:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Size (Q4_K_M)&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-Coder-Next&lt;/td&gt;
&lt;td&gt;~45GB&lt;/td&gt;
&lt;td&gt;Brain + Coder&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemma4:26b&lt;/td&gt;
&lt;td&gt;~17GB&lt;/td&gt;
&lt;td&gt;QA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic-embed-text&lt;/td&gt;
&lt;td&gt;~0.3GB&lt;/td&gt;
&lt;td&gt;RAG embeddings&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Total: ~62GB loaded, ~66GB headroom for OS + KV cache. Both models stay warm in memory with &lt;code&gt;keep_alive: 24h&lt;/code&gt; — no reload latency between cycles.&lt;/p&gt;

&lt;p&gt;This isn't a flex. It's context: the architectural decisions I made (same model for Brain and Coder, both models always loaded) are only feasible at this memory tier. At 64GB, you'd need to make different tradeoffs.&lt;/p&gt;
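
&lt;p&gt;For reference, the &lt;code&gt;keep_alive&lt;/code&gt; setting mentioned above is just a field on a normal Ollama request. A minimal sketch (the model tag is whatever you pulled locally, not an official name):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# Any generate/chat request can carry keep_alive; "24h" keeps the weights
# resident in unified memory, so the next cycle pays no reload cost.
requests.post("http://localhost:11434/api/generate", json={
    "model": "qwen3-coder-next",  # hypothetical local tag -- use your own
    "prompt": "ping",
    "keep_alive": "24h",
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;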




&lt;h2&gt;
  
  
  The Architecture: What ForgeFlow Actually Does
&lt;/h2&gt;

&lt;p&gt;ForgeFlow is an n8n workflow that runs every 10 minutes, autonomously picks the next coding task, writes tests first, writes code second, and only commits if all tests pass.&lt;/p&gt;

&lt;p&gt;The full loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Schedule Trigger (10 min)
  → Load Context (working memory + results log + project rules)
  → Brain (Qwen3-Coder-Next): pick next task from PRD
  → Localization: RAG search for relevant existing code
  → Coder RED (same model): write a failing test
  → Verify RED: pytest must FAIL — if it passes, the test is wrong
  → Coder GREEN: write minimum code to pass the test
  → Phase 0 Gate: py_compile + ruff (deterministic, no LLM)
  → QA (gemma4:26b): run full test suite in Docker sandbox
  → Gate Decision: COMMIT / RETRY / DEADLOCK / ESCALATE
  → Commit &amp;amp; Update (on pass)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three design principles drove every decision:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. pytest exit code is the only truth.&lt;/strong&gt; I don't care if the LLM thinks the code is "clean." If the pytest exit code isn't 0, the code is garbage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The LLM proposes, n8n disposes.&lt;/strong&gt; No model has write access to the filesystem or git. n8n is the only actor that applies files, runs git commands, and updates state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Deterministic gates before LLM gates.&lt;/strong&gt; &lt;code&gt;py_compile&lt;/code&gt; and &lt;code&gt;ruff&lt;/code&gt; run in under 0.5 seconds. If they catch the error, there's no reason to spend 30 seconds calling gemma4.&lt;/p&gt;
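
&lt;p&gt;A minimal sketch of what that Phase 0 gate amounts to (my own approximation of the idea, not the actual n8n node):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess
import sys

def phase0_gate(path):
    """Deterministic pre-checks: compile + lint. Cheap, fast, and no LLM involved."""
    compile_ok = subprocess.run([sys.executable, "-m", "py_compile", path]).returncode == 0
    lint_ok = subprocess.run(["ruff", "check", path]).returncode == 0
    return compile_ok and lint_ok
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;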




&lt;h2&gt;
  
  
  The Memory System
&lt;/h2&gt;

&lt;p&gt;One of the underrated problems in autonomous coding agents is state management across cycles. The agent can't remember what it did last cycle unless you explicitly store it.&lt;/p&gt;

&lt;p&gt;ForgeFlow keeps track of state across six memory layers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Git history&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.git&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Permanent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code summaries&lt;/td&gt;
&lt;td&gt;ChromaDB (RAG)&lt;/td&gt;
&lt;td&gt;Project lifetime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;results.tsv&lt;/td&gt;
&lt;td&gt;TSV file&lt;/td&gt;
&lt;td&gt;Session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AGENTS.md&lt;/td&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;Cross-session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Working memory&lt;/td&gt;
&lt;td&gt;JSON file&lt;/td&gt;
&lt;td&gt;Current loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure patterns&lt;/td&gt;
&lt;td&gt;AGENTS.md auto-update&lt;/td&gt;
&lt;td&gt;Generalized&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each layer operates at a different time scale. Git is permanent. Working memory resets every cycle. AGENTS.md accumulates lessons across sessions — when the same failure type occurs 3+ times, a rule gets written: &lt;em&gt;"always include &lt;code&gt;from app.database import get_db&lt;/code&gt; — the model consistently forgets this."&lt;/em&gt;&lt;/p&gt;
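
&lt;p&gt;A sketch of that accumulation step (the threshold handling and file format here are my own simplification, not ForgeFlow internals):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path

def maybe_write_rule(failure_type, count, lesson):
    """Append a generalized rule to AGENTS.md the third time a failure type shows up."""
    if count == 3:  # write the rule exactly once, at the third occurrence
        with Path("AGENTS.md").open("a", encoding="utf-8") as f:
            f.write(f"\n- [{failure_type}] {lesson}\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;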




&lt;h2&gt;
  
  
  TDD Enforcement: Red-Green-Refactor as a System Constraint
&lt;/h2&gt;

&lt;p&gt;The TDD loop isn't a suggestion — it's mechanically enforced by the workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RED phase&lt;/strong&gt;: Coder writes a test. n8n runs pytest. If it &lt;em&gt;passes&lt;/em&gt;, the test is rejected — it's testing something that already works, which means it's the wrong test.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GREEN phase&lt;/strong&gt;: Coder writes minimum code to pass the test. n8n applies the files, runs the full test suite (not just the new test), checks for regressions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Commit&lt;/strong&gt;: Only happens if exit code is 0 across the entire test suite.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Enforcing this mechanically means the model can't shortcut. It can't write "good enough" code and hope the reviewer misses it. The test either passes or it doesn't.&lt;/p&gt;
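
&lt;p&gt;Condensed to code, the RED check is nothing more than reading pytest's exit code (the test path here is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

# Exit code 0 means everything passed -- which, in the RED phase, is a rejection:
# a brand-new test that already passes proves nothing about the behavior being added.
result = subprocess.run(["pytest", "tests/test_new_feature.py", "-q"])
if result.returncode == 0:
    raise RuntimeError("RED rejected: the new test already passes against current code")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;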




&lt;h2&gt;
  
  
  Failure Handling: Bounded Repair
&lt;/h2&gt;

&lt;p&gt;Blind retries are a token-burn trap. Instead, ForgeFlow fingerprints every failure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;failure_signature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SHA256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failure_type&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;first_50_chars_of_stderr&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If I see the same SHA256 signature three times, the agent hits a &lt;strong&gt;DEADLOCK&lt;/strong&gt; and walks away. It's better to skip a task than to let a model hallucinate in a loop.&lt;/p&gt;
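
&lt;p&gt;Condensed to code, the gate decision is just the pytest exit code plus that signature history (a sketch; ESCALATE and the retry budget are omitted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def gate_decision(exit_code, signature, history):
    """COMMIT on a clean suite; otherwise count the signature and stop at three."""
    if exit_code == 0:
        return "COMMIT"
    history[signature] = history.get(signature, 0) + 1
    return "DEADLOCK" if history[signature] &amp;gt;= 3 else "RETRY"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;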

&lt;p&gt;&lt;strong&gt;Failure classification:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;patch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Code logic error, syntax error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;environment&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Import error, missing module&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;localization&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Wrong file referenced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deadlock&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same signature 3×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Data: What Actually Happened
&lt;/h2&gt;

&lt;p&gt;I ran ForgeFlow on a Todo REST API (FastAPI + SQLAlchemy + pytest) — 12 tasks, classic CRUD.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overall:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total attempts&lt;/td&gt;
&lt;td&gt;164&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PASS (committed)&lt;/td&gt;
&lt;td&gt;11 (6.7%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FAIL (discarded)&lt;/td&gt;
&lt;td&gt;116 (70.7%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DEADLOCK (skipped)&lt;/td&gt;
&lt;td&gt;37 (22.6%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual interventions&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final test count&lt;/td&gt;
&lt;td&gt;35 passing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 6.7% raw PASS rate sounds bad. But that number is misleading — it includes the early cycles before deterministic corrections were added.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real signal is in the pass rate as the system "learned" (via manual rules):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Corrections active&lt;/th&gt;
&lt;th&gt;PASS rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0–5 corrections&lt;/td&gt;
&lt;td&gt;5.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6 corrections&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7–10 corrections&lt;/td&gt;
&lt;td&gt;40.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11–13 corrections&lt;/td&gt;
&lt;td&gt;62.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each "correction" is a deterministic rule applied before the LLM output reaches the filesystem. Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;from app.db.session import&lt;/code&gt; → auto-rewrite to &lt;code&gt;from app.database import&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@router.post(...)&lt;/code&gt; without &lt;code&gt;status_code=201&lt;/code&gt; → auto-insert&lt;/li&gt;
&lt;li&gt;File not in &lt;code&gt;target_files&lt;/code&gt; → reject with error message&lt;/li&gt;
&lt;/ul&gt;
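
&lt;p&gt;A condensed sketch of how the first two rules can be applied to generated code before anything touches disk (the regex is simplified for illustration; the target_files check is a workflow-level rejection, not a rewrite):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

def apply_corrections(source):
    """Deterministic rewrites applied to LLM output before n8n writes any file."""
    # The model keeps inventing app.db.session; this project uses app.database.
    source = source.replace("from app.db.session import", "from app.database import")
    # POST routes must return 201; insert status_code when the model omits it.
    source = re.sub(r'@router\.post\((".*?")\)', r'@router.post(\1, status_code=201)', source)
    return source
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;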

&lt;p&gt;As corrections accumulated, PASS rate went from 5.6% to 62.5%. The corrections are essentially a hand-built knowledge base of the model's systematic errors. It turns out those errors are highly consistent and predictable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure type distribution:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;patch&lt;/td&gt;
&lt;td&gt;99&lt;/td&gt;
&lt;td&gt;64.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;environment&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;td&gt;35.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 35.3%, my environment failure rate is nearly triple the ~13% typically reported on standard benchmarks. That's the "quantization tax" you pay for running Q4 models locally. The deterministic corrections target exactly these failure types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardest tasks:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Attempts&lt;/th&gt;
&lt;th&gt;Primary failure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TASK-002 (DB model)&lt;/td&gt;
&lt;td&gt;41&lt;/td&gt;
&lt;td&gt;PytestDeprecationWarning + ImportError&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TASK-006 (GET list)&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;ImportError conftest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TASK-012 (integration)&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;Regression (previous code overwritten)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;TASK-002 taking 41 attempts is the starkest number. Most failures were the same &lt;code&gt;PytestDeprecationWarning&lt;/code&gt; signature — the model couldn't fix a pytest configuration issue that required understanding the test infrastructure, not just the code under test. Eventually, a manual intervention resolved it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Broke (Honestly)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;3 manual interventions were required:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TASK-008 (PUT endpoint):&lt;/strong&gt; The Coder kept generating tests with wrong status codes. Added correction #13 (PUT 201→200 auto-fix) after diagnosing the pattern.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TASK-011 (filtering):&lt;/strong&gt; The Coder overwrote &lt;code&gt;routes/todo.py&lt;/code&gt; while working on filtering, destroying previously committed code. The target_files violation detection wasn't blocking writes — only logging them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TASK-012 (integration test):&lt;/strong&gt; DEADLOCK 3 times. The model couldn't figure out that &lt;code&gt;test_integration.py&lt;/code&gt; needed to use the existing &lt;code&gt;client&lt;/code&gt; fixture from &lt;code&gt;conftest.py&lt;/code&gt; rather than creating its own TestClient.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All three were fixed in the session after they occurred by adding deterministic corrections. The system learned — just not automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Is (and Isn't)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This is:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A proof that a fully local multi-agent TDD loop is viable on consumer hardware&lt;/li&gt;
&lt;li&gt;Evidence that deterministic corrections significantly outperform raw LLM retry for systematic errors&lt;/li&gt;
&lt;li&gt;A framework for thinking about autonomous coding at the task level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This isn't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A "set it and forget it" system. It's a force multiplier that still requires a human to untangle the logic when the model hits a wall.&lt;/li&gt;
&lt;li&gt;A system that works without oversight (3 interventions in 12 tasks is not zero)&lt;/li&gt;
&lt;li&gt;Generalizable beyond the hardware tier that makes it feasible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 62.5% PASS rate in the final correction set is meaningful. But the 3 required manual interventions mean the system isn't yet fully autonomous.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Second project:&lt;/strong&gt; A more complex backend (20+ tasks, non-trivial dependencies) to validate that the correction set generalizes and the dependency resolution logic holds under pressure. The goal is a two-project dataset for a proper write-up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 0.5 Gate:&lt;/strong&gt; I'm looking at implementing AST-based checks — inspired by the Khati et al. (2026) paper — to kill hallucinations before they even hit the Docker sandbox. The goal is to catch &lt;code&gt;app.routes.todo.get_todo_by_id&lt;/code&gt; (doesn't exist) before it reaches pytest.&lt;/p&gt;
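
&lt;p&gt;To give a sense of the direction, here's a minimal import-resolution check built on Python's &lt;code&gt;ast&lt;/code&gt; module. It only covers hallucinated modules, not hallucinated attributes, and it's my sketch rather than the paper's method:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ast
import importlib.util

def unresolved_imports(source):
    """Return imported module paths that cannot be resolved in the current environment."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        names = []
        if isinstance(node, ast.ImportFrom) and node.module and not node.level:
            names = [node.module]
        elif isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        for name in names:
            try:
                if importlib.util.find_spec(name) is None:
                    missing.append(name)
            except ModuleNotFoundError:  # a parent package does not even exist
                missing.append(name)
    return missing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;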

&lt;p&gt;&lt;strong&gt;Automatic correction learning:&lt;/strong&gt; Right now, corrections are written manually after pattern identification. The next step is having n8n automatically identify recurring failure signatures and propose corrections for human approval.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hardware Note for Other M-Series Users
&lt;/h2&gt;

&lt;p&gt;If you're on M2/M3/M4 Pro (36–48GB), the same architecture works with tradeoffs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run one model at a time (swap between Brain/Coder and QA)&lt;/li&gt;
&lt;li&gt;Use smaller QA model (gemma4:9b instead of 26b)&lt;/li&gt;
&lt;li&gt;Expect higher latency per cycle (~15–20 min instead of 10)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fundamental approach — deterministic orchestration + LLM proposal + test-as-truth — doesn't require 128GB. It just runs faster and with better models at that tier.&lt;/p&gt;




&lt;h2&gt;
  
  
  About
&lt;/h2&gt;

&lt;p&gt;I'm Joseph YEO, a solo builder from Seoul, Korea. I run multiple projects in parallel using AI agents — local AI automation (ForgeFlow), supply chain security (&lt;a href="https://devradarguard.dev" rel="noopener noreferrer"&gt;DevRadar Guard&lt;/a&gt;), and a few things currently under wraps.&lt;/p&gt;

&lt;p&gt;What I'm really interested in is how autonomous these agents can actually become before I have to step in as the human. ForgeFlow is one experiment. There will be more.&lt;/p&gt;

&lt;p&gt;Follow along:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;𝕏: &lt;a href="https://x.com/josephyeo_dev" rel="noopener noreferrer"&gt;@josephyeo_dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/joseph-yeo" rel="noopener noreferrer"&gt;joseph-yeo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Site: &lt;a href="https://projectjoseph.dev" rel="noopener noreferrer"&gt;projectjoseph.dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built over ~7 sessions, May 2026. All models run locally via Ollama 0.23.0 on macOS. No cloud APIs were used during autonomous execution.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was drafted with Claude and edited by me. I use AI tools to write, just like I use them to code. That's kind of the whole point.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>llm</category>
      <category>agents</category>
      <category>tdd</category>
      <category>ollama</category>
    </item>
    <item>
      <title>Case Study: How I Dogfood DevRadar Guard on a 954-Dependency Project</title>
      <dc:creator>Joseph Yeo</dc:creator>
      <pubDate>Mon, 06 Apr 2026 13:25:37 +0000</pubDate>
      <link>https://forem.com/josephyeo/case-study-how-i-dogfood-devradar-guard-on-a-954-dependency-project-d7e</link>
      <guid>https://forem.com/josephyeo/case-study-how-i-dogfood-devradar-guard-on-a-954-dependency-project-d7e</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a follow-up to my earlier post: &lt;a href="https://dev.to/devradarguard/axios-was-compromised-heres-what-it-means-for-your-repo-1hh0"&gt;Axios Was Compromised. Here's What It Means for Your Repo.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;GloriaPPT is a presentation tool I built and maintain. It's a fairly typical modern JavaScript app: a Next.js frontend, a Node.js backend, and deployment on Vercel. What makes it interesting for this case study is its dependency tree: &lt;strong&gt;954 npm packages&lt;/strong&gt; in the lockfile.&lt;/p&gt;

&lt;p&gt;Most of those packages are transitive. I haven't read the source code for most of them, and realistically, neither have most small teams. If one of them were compromised tomorrow, I probably wouldn't know right away.&lt;/p&gt;

&lt;p&gt;That's the problem I built DevRadar Guard to solve.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before DevRadar Guard
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dependency monitoring:&lt;/strong&gt; Manual &lt;code&gt;npm audit&lt;/code&gt; when I remembered to run it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supply chain alerts:&lt;/strong&gt; None — I found out about incidents from social feeds and security threads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.npmrc&lt;/code&gt; hardening:&lt;/strong&gt; Default settings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; security section:&lt;/strong&gt; Didn't exist&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-install hooks:&lt;/strong&gt; None&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD security checks:&lt;/strong&gt; Basic Dependabot, no custom policy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response time to incidents:&lt;/strong&gt; Hours to days, depending on when I saw the news&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  After DevRadar Guard
&lt;/h2&gt;

&lt;p&gt;I deployed DevRadar Guard's hosted monitoring on a small VPS that checks every 30 minutes. Here's what changed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Signal Collection
&lt;/h3&gt;

&lt;p&gt;The Signal Engine collects threat intelligence from GitHub Security Advisories every 30 minutes. In the first 24 hours, it ingested &lt;strong&gt;467 raw events&lt;/strong&gt; — advisories affecting npm packages — and normalized all of them into structured threat candidates with confidence scores.&lt;/p&gt;

&lt;p&gt;Each signal is scored across five dimensions: source quality, technical specificity, cross-reference validation, discussion velocity, and ecosystem relevance.&lt;/p&gt;
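
&lt;p&gt;The engine's internals aren't shown in this post; as a rough idea of the collection step, GitHub's global advisories endpoint can be polled like this (pagination, de-duplication, and the scoring itself are omitted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# Pull recent security advisories for the npm ecosystem from GitHub's public API.
resp = requests.get(
    "https://api.github.com/advisories",
    params={"ecosystem": "npm", "per_page": 100},
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
for advisory in resp.json():
    print(advisory["ghsa_id"], advisory.get("summary", ""))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;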

&lt;h3&gt;
  
  
  Exposure Matching
&lt;/h3&gt;

&lt;p&gt;Out of 467 normalized signals, the Exposure Engine matched &lt;strong&gt;1 against GloriaPPT's actual dependency tree&lt;/strong&gt;: axios.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Package:&lt;/strong&gt; axios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Installed version:&lt;/strong&gt; 1.14.1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence score:&lt;/strong&gt; 65/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exposure score:&lt;/strong&gt; 50/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final risk:&lt;/strong&gt; 57/100 (alert threshold: 50)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The confidence score reflects signal quality: a high-trust source, a named package and version, and enough technical detail to treat the advisory seriously.&lt;/p&gt;

&lt;p&gt;The exposure score reflects how directly the issue touched this repo: &lt;code&gt;axios&lt;/code&gt; was a direct dependency, and the affected version was present in the lockfile.&lt;/p&gt;
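
&lt;p&gt;The exact weighting DevRadar Guard uses isn't documented in this post, but for intuition, an even split of the two scores lands at the reported number:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;confidence, exposure = 65, 50
final_risk = int(0.5 * confidence + 0.5 * exposure)  # 57 -- above the alert threshold of 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;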

&lt;h3&gt;
  
  
  The Alert
&lt;/h3&gt;

&lt;p&gt;At 00:24 KST (UTC+9) on April 6, a Slack alert landed in #devradar-alerts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🛡️ DevRadar Guard Alert

Package: axios
Version: 1.14.1
Risk Score: 57/100
Confidence: 65
Exposure: 50
Path: direct

Signal: Compromised axios versions 1.14.1 and 0.30.4 were
reported to deliver a remote access trojan...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I didn't find out about the axios compromise from social media. The alert was waiting for me when I checked Slack.&lt;/p&gt;

&lt;h3&gt;
  
  
  Guardrail Bundle
&lt;/h3&gt;

&lt;p&gt;DevRadar Guard generates a guardrail bundle — a set of files you can drop into a repo to harden installs, guide AI coding agents, and surface risky dependency changes during review:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;CLAUDE.md&lt;/code&gt; security section&lt;/td&gt;
&lt;td&gt;Security policy for AI coding agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;.npmrc&lt;/code&gt; hardening&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ignore-scripts=true&lt;/code&gt;, &lt;code&gt;audit=true&lt;/code&gt;, registry pinning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-install hook&lt;/td&gt;
&lt;td&gt;Warns before installing packages younger than 7 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Actions workflow&lt;/td&gt;
&lt;td&gt;PR check that flags risky dependency changes (alert-only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;devradar-policy.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Machine-readable policy for CI/CD integration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;GloriaPPT now uses all 8 generated guardrail files. The pre-install hook would likely have flagged the malicious &lt;code&gt;plain-crypto-js&lt;/code&gt; dependency used in the attack, since it had been published less than 24 hours earlier.&lt;/p&gt;
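
&lt;p&gt;The shipped hook lives in the guardrail bundle; to illustrate the underlying check, the npm registry exposes publish timestamps that any script can compare against a freshness threshold (Python here only to show the idea):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timezone, timedelta
import requests

def published_within(package, version, days=7):
    """True if this exact version hit the npm registry within the last `days` days."""
    meta = requests.get(f"https://registry.npmjs.org/{package}", timeout=10).json()
    published = datetime.fromisoformat(meta["time"][version].replace("Z", "+00:00"))
    return datetime.now(timezone.utc) - published &amp;lt; timedelta(days=days)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;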

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dependencies monitored&lt;/td&gt;
&lt;td&gt;954&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Raw threat signals collected (first 24h)&lt;/td&gt;
&lt;td&gt;467&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Normalization success rate (first 24h sample)&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Signals matched to GloriaPPT&lt;/td&gt;
&lt;td&gt;1 (axios)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;False positives in this case study&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time from advisory to alert&lt;/td&gt;
&lt;td&gt;&amp;lt; 30 minutes (cron cycle)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guardrail files generated&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual intervention during detection&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What This Doesn't Prove
&lt;/h2&gt;

&lt;p&gt;I want to be honest about what this case study shows and what it doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It shows:&lt;/strong&gt; A real supply chain threat was detected, matched to a real project, and surfaced as an actionable alert — automatically, without manual intervention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It doesn't show:&lt;/strong&gt; DevRadar Guard catching a zero-day before anyone else. The axios advisory was already published when my pipeline picked it up. I'm not claiming to be faster than GitHub Advisory. I'm claiming to be faster than manual monitoring — finding out from social feeds, security threads, or a post after the fact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It doesn't show:&lt;/strong&gt; Protection against all supply chain attacks. The Signal Engine currently monitors GitHub Advisories only. Reddit, npm registry anomaly detection, and other sources are planned but not yet active in Alpha.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It doesn't show:&lt;/strong&gt; Automatic blocking. DevRadar Guard Alpha is alert-only. No PR failures, no install blocks, no surprises. You get the information; you decide what to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;DevRadar Guard is still in Alpha, and I'm testing it with a small number of pilot teams. Right now that includes hosted monitoring on a 30-minute cycle, matched alerts in Slack or Discord, a generated guardrail bundle for the repo, and a weekly threat briefing. All I ask in return is a few minutes of feedback each week.&lt;/p&gt;

&lt;p&gt;If your project has a &lt;code&gt;package-lock.json&lt;/code&gt; and you want earlier, repo-specific visibility into supply chain incidents, the starter kit and waitlist are below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/devradar-guard/devradar-guard/tree/main/examples/starter-kit" rel="noopener noreferrer"&gt;Starter Kit on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tally.so/r/GxDbbL" rel="noopener noreferrer"&gt;Join the waitlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;DevRadar Guard Alpha — alert-only, no automatic blocking. You stay in control.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>npm</category>
      <category>security</category>
      <category>supplychain</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Axios Was Compromised. Here's What It Means for Your Repo.</title>
      <dc:creator>Joseph Yeo</dc:creator>
      <pubDate>Mon, 06 Apr 2026 03:58:03 +0000</pubDate>
      <link>https://forem.com/josephyeo/axios-was-compromised-heres-what-it-means-for-your-repo-1hh0</link>
      <guid>https://forem.com/josephyeo/axios-was-compromised-heres-what-it-means-for-your-repo-1hh0</guid>
      <description>&lt;p&gt;On March 31, 2026, the &lt;code&gt;axios&lt;/code&gt; npm package — with over 100 million weekly downloads — was compromised and used to distribute malware.&lt;/p&gt;

&lt;p&gt;A threat actor took over the lead maintainer's npm account, published two backdoored versions (&lt;code&gt;1.14.1&lt;/code&gt; and &lt;code&gt;0.30.4&lt;/code&gt;), and added a hidden dependency that deployed a cross-platform remote access trojan. The payload targeted Windows, macOS, and Linux. The malicious versions were live for only about three hours before they were removed.&lt;/p&gt;

&lt;p&gt;In practice, three hours was enough.&lt;/p&gt;

&lt;p&gt;Microsoft attributed the attack to Sapphire Sleet, a North Korean state actor. Google's Threat Intelligence Group confirmed UNC1069 involvement. This was a coordinated, pre-staged operation — the malicious dependency was planted 18 hours before activation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters to You
&lt;/h2&gt;

&lt;p&gt;If your &lt;code&gt;package.json&lt;/code&gt; uses caret ranges like &lt;code&gt;^1.x&lt;/code&gt;, a routine &lt;code&gt;npm install&lt;/code&gt; could have pulled the compromised version automatically. No unusual action required. Just your normal CI/CD pipeline doing what it was designed to do.&lt;/p&gt;

&lt;p&gt;Most teams would not have caught this in time.&lt;/p&gt;

&lt;p&gt;Not because they're careless, but because the tooling gap is real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;npm audit&lt;/code&gt; looks for known CVEs. This wasn't a CVE when it hit.&lt;/li&gt;
&lt;li&gt;Dependabot follows published advisories. This version came from the real maintainer account.&lt;/li&gt;
&lt;li&gt;Lockfiles help, but only if they're pinned and not being updated automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The teams that stayed safe had one thing in common: they treated dependency management as part of their security practice, not just routine package maintenance.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happened in Our Setup
&lt;/h2&gt;

&lt;p&gt;I maintain a project called GloriaPPT — a typical Next.js app with 954 npm dependencies. When the axios advisory dropped, I wasn't refreshing Twitter. I got a Slack alert.&lt;/p&gt;

&lt;p&gt;I built DevRadar Guard to answer one practical question fast: does this incident actually touch one of my repos? In this case, the flow looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Signal Engine&lt;/strong&gt; picked up the GitHub Advisory within its 30-minute collection cycle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exposure Engine&lt;/strong&gt; matched it against GloriaPPT's &lt;code&gt;package-lock.json&lt;/code&gt;, where &lt;code&gt;axios&lt;/code&gt; was a direct dependency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrail Engine&lt;/strong&gt; sent a Slack alert with the risk score, confidence level, and affected version.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No manual checking. No scrolling through threads or advisories. The alert landed with the information I needed to decide what to do next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Axios Was One Incident. The Pattern Keeps Repeating.
&lt;/h2&gt;

&lt;p&gt;Axios will be patched. Credentials will be rotated. Postmortems will be published.&lt;/p&gt;

&lt;p&gt;But the pattern repeats. Before axios, it was event-stream. Before that, ua-parser-js. The attack surface keeps growing with every install that pulls in packages your team didn't explicitly choose or review.&lt;/p&gt;

&lt;p&gt;The question isn't whether the next supply chain attack will happen. It's whether your repo will know about it before your CI/CD pipeline installs it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Can Do Today
&lt;/h2&gt;

&lt;p&gt;Even without new tooling, these steps can reduce your risk right away:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pin your dependencies.&lt;/strong&gt; Remove &lt;code&gt;^&lt;/code&gt; and &lt;code&gt;~&lt;/code&gt; from critical packages. Use exact versions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set &lt;code&gt;ignore-scripts=true&lt;/code&gt; in &lt;code&gt;.npmrc&lt;/code&gt;.&lt;/strong&gt; In this incident, that setting would have blocked the malicious install script.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review your lockfile after every install.&lt;/strong&gt; If a new transitive dependency appears that you didn't add, investigate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit your CI/CD pipeline permissions.&lt;/strong&gt; Does your build environment need network access during &lt;code&gt;npm install&lt;/code&gt;?&lt;/li&gt;
&lt;/ul&gt;
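
&lt;p&gt;The first two items boil down to a few lines of &lt;code&gt;.npmrc&lt;/code&gt; (standard npm settings; adjust the registry pin if you use a private mirror):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .npmrc
ignore-scripts=true   # block install/postinstall scripts, the payload vector in this incident
save-exact=true       # new installs get pinned versions, no ^ or ~ ranges
audit=true
registry=https://registry.npmjs.org/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;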

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;I'm also building DevRadar Guard around this workflow: early signal collection, repo exposure checks, and guardrail generation. Part of it is open source, including starter config for &lt;code&gt;.npmrc&lt;/code&gt;, pre-install hooks, &lt;code&gt;CLAUDE.md&lt;/code&gt; (security policy for AI coding agents), and GitHub Actions.&lt;/p&gt;

&lt;p&gt;DevRadar Guard is still in Alpha and runs in alert-only mode. No automatic blocking, and no surprise PR failures. You stay in control.&lt;/p&gt;

&lt;p&gt;If your team depends on npm and this workflow sounds useful, take a look at the starter kit or join the waitlist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/devradar-guard/devradar-guard/tree/main/examples/starter-kit" rel="noopener noreferrer"&gt;Starter Kit on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tally.so/r/GxDbbL" rel="noopener noreferrer"&gt;Join the waitlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>security</category>
      <category>npm</category>
      <category>supplychain</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
