<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Joseph Yeo</title>
    <description>The latest articles on Forem by Joseph Yeo (@josephyeo).</description>
    <link>https://forem.com/josephyeo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3863060%2F14a6921b-eef9-4611-ba9b-c1a7b9835304.png</url>
      <title>Forem: Joseph Yeo</title>
      <link>https://forem.com/josephyeo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/josephyeo"/>
    <language>en</language>
    <item>
      <title>I Built a Local AI Coding Agent on M5 Max 128GB — It Failed 164 Times Before Passing 35 Tests</title>
      <dc:creator>Joseph Yeo</dc:creator>
      <pubDate>Sat, 09 May 2026 11:31:07 +0000</pubDate>
      <link>https://forem.com/josephyeo/i-built-a-local-ai-coding-agent-on-m5-max-128gb-it-failed-164-times-before-passing-35-tests-2fgj</link>
      <guid>https://forem.com/josephyeo/i-built-a-local-ai-coding-agent-on-m5-max-128gb-it-failed-164-times-before-passing-35-tests-2fgj</guid>
      <description>&lt;p&gt;&lt;strong&gt;Fully local. No cloud APIs during execution. TDD-enforced. 35 tests passing.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To be clear: planning, docs, the initial architecture, and correction-rule design were all handled with Claude (a cloud API). The experiment was strictly about whether a local LLM could survive the &lt;strong&gt;autonomous execution loop&lt;/strong&gt; without phoning home. The coding agent loop (Brain → Coder → Tester) ran 100% locally, with no API calls during execution.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Actual Setup
&lt;/h2&gt;

&lt;p&gt;Most AI coding agent posts you see rely on GPT-4o or Claude via API. The model lives in a data center, and you're paying per token. That's fine — but it means your code, your architecture decisions, and your project context are all leaving your machine.&lt;/p&gt;

&lt;p&gt;I wanted something different: a multi-agent system that runs &lt;em&gt;entirely on my MacBook Pro M5 Max 128GB&lt;/em&gt;. It autonomously writes code, runs tests in a Docker sandbox, and only commits when tests pass. No internet required once it's running.&lt;/p&gt;

&lt;p&gt;This is the story of ForgeFlow — what I built, what broke, and what the data showed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hardware Context
&lt;/h2&gt;

&lt;p&gt;The M5 Max 128GB is unusual hardware for this kind of work. Most local LLM setups top out at 32GB or 64GB unified memory, which forces you to choose between model quality and running multiple models simultaneously. At 128GB, that constraint disappears.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I ran simultaneously:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Size (Q4_K_M)&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-Coder-Next&lt;/td&gt;
&lt;td&gt;~45GB&lt;/td&gt;
&lt;td&gt;Brain + Coder&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemma4:26b&lt;/td&gt;
&lt;td&gt;~17GB&lt;/td&gt;
&lt;td&gt;QA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic-embed-text&lt;/td&gt;
&lt;td&gt;~0.3GB&lt;/td&gt;
&lt;td&gt;RAG embeddings&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Total: ~62GB loaded, ~66GB headroom for OS + KV cache. Both models stay warm in memory with &lt;code&gt;keep_alive: 24h&lt;/code&gt; — no reload latency between cycles.&lt;/p&gt;

&lt;p&gt;This isn't a flex. It's context: the architectural decisions I made (same model for Brain and Coder, both models always loaded) are only feasible at this memory tier. At 64GB, you'd need to make different tradeoffs.&lt;/p&gt;
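
&lt;p&gt;For reference, the &lt;code&gt;keep_alive&lt;/code&gt; setting mentioned above is just a field on a normal Ollama request. A minimal sketch (the model tag is whatever you pulled locally, not an official name):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# Any generate/chat request can carry keep_alive; "24h" keeps the weights
# resident in unified memory, so the next cycle pays no reload cost.
requests.post("http://localhost:11434/api/generate", json={
    "model": "qwen3-coder-next",  # hypothetical local tag -- use your own
    "prompt": "ping",
    "keep_alive": "24h",
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;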




&lt;h2&gt;
  
  
  The Architecture: What ForgeFlow Actually Does
&lt;/h2&gt;

&lt;p&gt;ForgeFlow is an n8n workflow that runs every 10 minutes, autonomously picks the next coding task, writes tests first, writes code second, and only commits if all tests pass.&lt;/p&gt;

&lt;p&gt;The full loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Schedule Trigger (10 min)
  → Load Context (working memory + results log + project rules)
  → Brain (Qwen3-Coder-Next): pick next task from PRD
  → Localization: RAG search for relevant existing code
  → Coder RED (same model): write a failing test
  → Verify RED: pytest must FAIL — if it passes, the test is wrong
  → Coder GREEN: write minimum code to pass the test
  → Phase 0 Gate: py_compile + ruff (deterministic, no LLM)
  → QA (gemma4:26b): run full test suite in Docker sandbox
  → Gate Decision: COMMIT / RETRY / DEADLOCK / ESCALATE
  → Commit &amp;amp; Update (on pass)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three design principles drove every decision:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. pytest exit code is the only truth.&lt;/strong&gt; I don't care if the LLM thinks the code is "clean." If the pytest exit code isn't 0, the code is garbage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The LLM proposes, n8n disposes.&lt;/strong&gt; No model has write access to the filesystem or git. n8n is the only actor that applies files, runs git commands, and updates state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Deterministic gates before LLM gates.&lt;/strong&gt; &lt;code&gt;py_compile&lt;/code&gt; and &lt;code&gt;ruff&lt;/code&gt; run in under 0.5 seconds. If they catch the error, there's no reason to spend 30 seconds calling gemma4.&lt;/p&gt;
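
&lt;p&gt;A minimal sketch of what that Phase 0 gate amounts to (my own approximation of the idea, not the actual n8n node):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess
import sys

def phase0_gate(path):
    """Deterministic pre-checks: compile + lint. Cheap, fast, and no LLM involved."""
    compile_ok = subprocess.run([sys.executable, "-m", "py_compile", path]).returncode == 0
    lint_ok = subprocess.run(["ruff", "check", path]).returncode == 0
    return compile_ok and lint_ok
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;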




&lt;h2&gt;
  
  
  The Memory System
&lt;/h2&gt;

&lt;p&gt;One of the underrated problems in autonomous coding agents is state management across cycles. The agent can't remember what it did last cycle unless you explicitly store it.&lt;/p&gt;

&lt;p&gt;ForgeFlow keeps track of state across six memory layers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Git history&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.git&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Permanent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code summaries&lt;/td&gt;
&lt;td&gt;ChromaDB (RAG)&lt;/td&gt;
&lt;td&gt;Project lifetime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;results.tsv&lt;/td&gt;
&lt;td&gt;TSV file&lt;/td&gt;
&lt;td&gt;Session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AGENTS.md&lt;/td&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;Cross-session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Working memory&lt;/td&gt;
&lt;td&gt;JSON file&lt;/td&gt;
&lt;td&gt;Current loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure patterns&lt;/td&gt;
&lt;td&gt;AGENTS.md auto-update&lt;/td&gt;
&lt;td&gt;Generalized&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each layer operates at a different time scale. Git is permanent. Working memory resets every cycle. AGENTS.md accumulates lessons across sessions — when the same failure type occurs 3+ times, a rule gets written: &lt;em&gt;"always include &lt;code&gt;from app.database import get_db&lt;/code&gt; — the model consistently forgets this."&lt;/em&gt;&lt;/p&gt;
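
&lt;p&gt;A sketch of that accumulation step (the threshold handling and file format here are my own simplification, not ForgeFlow internals):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path

def maybe_write_rule(failure_type, count, lesson):
    """Append a generalized rule to AGENTS.md the third time a failure type shows up."""
    if count == 3:  # write the rule exactly once, at the third occurrence
        with Path("AGENTS.md").open("a", encoding="utf-8") as f:
            f.write(f"\n- [{failure_type}] {lesson}\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;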




&lt;h2&gt;
  
  
  TDD Enforcement: Red-Green-Refactor as a System Constraint
&lt;/h2&gt;

&lt;p&gt;The TDD loop isn't a suggestion — it's mechanically enforced by the workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RED phase&lt;/strong&gt;: Coder writes a test. n8n runs pytest. If it &lt;em&gt;passes&lt;/em&gt;, the test is rejected — it's testing something that already works, which means it's the wrong test.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GREEN phase&lt;/strong&gt;: Coder writes minimum code to pass the test. n8n applies the files, runs the full test suite (not just the new test), checks for regressions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Commit&lt;/strong&gt;: Only happens if exit code is 0 across the entire test suite.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Enforcing this mechanically means the model can't shortcut. It can't write "good enough" code and hope the reviewer misses it. The test either passes or it doesn't.&lt;/p&gt;
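
&lt;p&gt;Condensed to code, the RED check is nothing more than reading pytest's exit code (the test path here is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

# Exit code 0 means everything passed -- which, in the RED phase, is a rejection:
# a brand-new test that already passes proves nothing about the behavior being added.
result = subprocess.run(["pytest", "tests/test_new_feature.py", "-q"])
if result.returncode == 0:
    raise RuntimeError("RED rejected: the new test already passes against current code")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;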




&lt;h2&gt;
  
  
  Failure Handling: Bounded Repair
&lt;/h2&gt;

&lt;p&gt;Blind retries are a token-burn trap. Instead, ForgeFlow fingerprints every failure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;failure_signature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SHA256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failure_type&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;first_50_chars_of_stderr&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If I see the same SHA256 signature three times, the agent hits a &lt;strong&gt;DEADLOCK&lt;/strong&gt; and walks away. It's better to skip a task than to let a model hallucinate in a loop.&lt;/p&gt;
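
&lt;p&gt;Condensed to code, the gate decision is just the pytest exit code plus that signature history (a sketch; ESCALATE and the retry budget are omitted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def gate_decision(exit_code, signature, history):
    """COMMIT on a clean suite; otherwise count the signature and stop at three."""
    if exit_code == 0:
        return "COMMIT"
    history[signature] = history.get(signature, 0) + 1
    return "DEADLOCK" if history[signature] &amp;gt;= 3 else "RETRY"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;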

&lt;p&gt;&lt;strong&gt;Failure classification:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;patch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Code logic error, syntax error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;environment&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Import error, missing module&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;localization&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Wrong file referenced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deadlock&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same signature 3×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Data: What Actually Happened
&lt;/h2&gt;

&lt;p&gt;I ran ForgeFlow on a Todo REST API (FastAPI + SQLAlchemy + pytest) — 12 tasks, classic CRUD.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overall:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total attempts&lt;/td&gt;
&lt;td&gt;164&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PASS (committed)&lt;/td&gt;
&lt;td&gt;11 (6.7%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FAIL (discarded)&lt;/td&gt;
&lt;td&gt;116 (70.7%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DEADLOCK (skipped)&lt;/td&gt;
&lt;td&gt;37 (22.6%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual interventions&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final test count&lt;/td&gt;
&lt;td&gt;35 passing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 6.7% raw PASS rate sounds bad. But that number is misleading — it includes the early cycles before deterministic corrections were added.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real signal is in the pass rate as the system "learned" (via manual rules):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Corrections active&lt;/th&gt;
&lt;th&gt;PASS rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0–5 corrections&lt;/td&gt;
&lt;td&gt;5.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6 corrections&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7–10 corrections&lt;/td&gt;
&lt;td&gt;40.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11–13 corrections&lt;/td&gt;
&lt;td&gt;62.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each "correction" is a deterministic rule applied before the LLM output reaches the filesystem. Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;from app.db.session import&lt;/code&gt; → auto-rewrite to &lt;code&gt;from app.database import&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@router.post(...)&lt;/code&gt; without &lt;code&gt;status_code=201&lt;/code&gt; → auto-insert&lt;/li&gt;
&lt;li&gt;File not in &lt;code&gt;target_files&lt;/code&gt; → reject with error message&lt;/li&gt;
&lt;/ul&gt;
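
&lt;p&gt;A condensed sketch of how the first two rules can be applied to generated code before anything touches disk (the regex is simplified for illustration; the target_files check is a workflow-level rejection, not a rewrite):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

def apply_corrections(source):
    """Deterministic rewrites applied to LLM output before n8n writes any file."""
    # The model keeps inventing app.db.session; this project uses app.database.
    source = source.replace("from app.db.session import", "from app.database import")
    # POST routes must return 201; insert status_code when the model omits it.
    source = re.sub(r'@router\.post\((".*?")\)', r'@router.post(\1, status_code=201)', source)
    return source
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;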

&lt;p&gt;As corrections accumulated, PASS rate went from 5.6% to 62.5%. The corrections are essentially a hand-built knowledge base of the model's systematic errors. It turns out those errors are highly consistent and predictable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure type distribution:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;patch&lt;/td&gt;
&lt;td&gt;99&lt;/td&gt;
&lt;td&gt;64.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;environment&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;td&gt;35.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 35.3%, my environment failure rate is nearly triple the ~13% typically reported on standard benchmarks. That's the "quantization tax" you pay for running Q4 models locally. The deterministic corrections target exactly these failure types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardest tasks:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Attempts&lt;/th&gt;
&lt;th&gt;Primary failure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TASK-002 (DB model)&lt;/td&gt;
&lt;td&gt;41&lt;/td&gt;
&lt;td&gt;PytestDeprecationWarning + ImportError&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TASK-006 (GET list)&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;ImportError conftest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TASK-012 (integration)&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;Regression (previous code overwritten)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;TASK-002 taking 41 attempts is the starkest number. Most failures were the same &lt;code&gt;PytestDeprecationWarning&lt;/code&gt; signature — the model couldn't fix a pytest configuration issue that required understanding the test infrastructure, not just the code under test. Eventually, a manual intervention resolved it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Broke (Honestly)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;3 manual interventions were required:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TASK-008 (PUT endpoint):&lt;/strong&gt; The Coder kept generating tests with wrong status codes. Added correction #13 (PUT 201→200 auto-fix) after diagnosing the pattern.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TASK-011 (filtering):&lt;/strong&gt; The Coder overwrote &lt;code&gt;routes/todo.py&lt;/code&gt; while working on filtering, destroying previously committed code. The target_files violation detection wasn't blocking writes — only logging them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TASK-012 (integration test):&lt;/strong&gt; DEADLOCK 3 times. The model couldn't figure out that &lt;code&gt;test_integration.py&lt;/code&gt; needed to use the existing &lt;code&gt;client&lt;/code&gt; fixture from &lt;code&gt;conftest.py&lt;/code&gt; rather than creating its own TestClient.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All three were fixed in the session after they occurred by adding deterministic corrections. The system learned — just not automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Is (and Isn't)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This is:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A proof that a fully local multi-agent TDD loop is viable on consumer hardware&lt;/li&gt;
&lt;li&gt;Evidence that deterministic corrections significantly outperform raw LLM retry for systematic errors&lt;/li&gt;
&lt;li&gt;A framework for thinking about autonomous coding at the task level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This isn't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A "set it and forget it" system. It's a force multiplier that still requires a human to untangle the logic when the model hits a wall.&lt;/li&gt;
&lt;li&gt;A system that works without oversight (3 interventions in 12 tasks is not zero)&lt;/li&gt;
&lt;li&gt;Generalizable beyond the hardware tier that makes it feasible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 62.5% PASS rate in the final correction set is meaningful. But the 3 required manual interventions mean the system isn't yet fully autonomous.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Second project:&lt;/strong&gt; A more complex backend (20+ tasks, non-trivial dependencies) to validate that the correction set generalizes and the dependency resolution logic holds under pressure. The goal is a two-project dataset for a proper write-up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 0.5 Gate:&lt;/strong&gt; I'm looking at implementing AST-based checks — inspired by the Khati et al. (2026) paper — to kill hallucinations before they even hit the Docker sandbox. The goal is to catch &lt;code&gt;app.routes.todo.get_todo_by_id&lt;/code&gt; (doesn't exist) before it reaches pytest.&lt;/p&gt;
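
&lt;p&gt;To give a sense of the direction, here's a minimal import-resolution check built on Python's &lt;code&gt;ast&lt;/code&gt; module. It only covers hallucinated modules, not hallucinated attributes, and it's my sketch rather than the paper's method:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ast
import importlib.util

def unresolved_imports(source):
    """Return imported module paths that cannot be resolved in the current environment."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        names = []
        if isinstance(node, ast.ImportFrom) and node.module and not node.level:
            names = [node.module]
        elif isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        for name in names:
            try:
                if importlib.util.find_spec(name) is None:
                    missing.append(name)
            except ModuleNotFoundError:  # a parent package does not even exist
                missing.append(name)
    return missing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;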

&lt;p&gt;&lt;strong&gt;Automatic correction learning:&lt;/strong&gt; Right now, corrections are written manually after pattern identification. The next step is having n8n automatically identify recurring failure signatures and propose corrections for human approval.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hardware Note for Other M-Series Users
&lt;/h2&gt;

&lt;p&gt;If you're on M2/M3/M4 Pro (36–48GB), the same architecture works with tradeoffs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run one model at a time (swap between Brain/Coder and QA)&lt;/li&gt;
&lt;li&gt;Use smaller QA model (gemma4:9b instead of 26b)&lt;/li&gt;
&lt;li&gt;Expect higher latency per cycle (~15–20 min instead of 10)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fundamental approach — deterministic orchestration + LLM proposal + test-as-truth — doesn't require 128GB. It just runs faster and with better models at that tier.&lt;/p&gt;




&lt;h2&gt;
  
  
  About
&lt;/h2&gt;

&lt;p&gt;I'm Joseph YEO, a solo builder from Seoul, Korea. I run multiple projects in parallel using AI agents — local AI automation (ForgeFlow), supply chain security (&lt;a href="https://devradarguard.dev" rel="noopener noreferrer"&gt;DevRadar Guard&lt;/a&gt;), and a few things currently under wraps.&lt;/p&gt;

&lt;p&gt;What I'm really interested in is how autonomous these agents can actually become before I have to step in as the human. ForgeFlow is one experiment. There will be more.&lt;/p&gt;

&lt;p&gt;Follow along:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;𝕏: &lt;a href="https://x.com/josephyeo_dev" rel="noopener noreferrer"&gt;@josephyeo_dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/joseph-yeo" rel="noopener noreferrer"&gt;joseph-yeo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Site: &lt;a href="https://projectjoseph.dev" rel="noopener noreferrer"&gt;projectjoseph.dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built over ~7 sessions, May 2026. All models run locally via Ollama 0.23.0 on macOS. No cloud APIs were used during autonomous execution.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was drafted with Claude and edited by me. I use AI tools to write, just like I use them to code. That's kind of the whole point.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>llm</category>
      <category>agents</category>
      <category>tdd</category>
      <category>ollama</category>
    </item>
    <item>
      <title>Case Study: How I Dogfood DevRadar Guard on a 954-Dependency Project</title>
      <dc:creator>Joseph Yeo</dc:creator>
      <pubDate>Mon, 06 Apr 2026 13:25:37 +0000</pubDate>
      <link>https://forem.com/josephyeo/case-study-how-i-dogfood-devradar-guard-on-a-954-dependency-project-d7e</link>
      <guid>https://forem.com/josephyeo/case-study-how-i-dogfood-devradar-guard-on-a-954-dependency-project-d7e</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a follow-up to my earlier post: &lt;a href="https://dev.to/devradarguard/axios-was-compromised-heres-what-it-means-for-your-repo-1hh0"&gt;Axios Was Compromised. Here's What It Means for Your Repo.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;GloriaPPT is a presentation tool I built and maintain. It's a fairly typical modern JavaScript app: a Next.js frontend, a Node.js backend, and deployment on Vercel. What makes it interesting for this case study is its dependency tree: &lt;strong&gt;954 npm packages&lt;/strong&gt; in the lockfile.&lt;/p&gt;

&lt;p&gt;Most of those packages are transitive. I haven't read the source code for most of them, and realistically, neither have most small teams. If one of them were compromised tomorrow, I probably wouldn't know right away.&lt;/p&gt;

&lt;p&gt;That's the problem I built DevRadar Guard to solve.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before DevRadar Guard
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dependency monitoring:&lt;/strong&gt; Manual &lt;code&gt;npm audit&lt;/code&gt; when I remembered to run it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supply chain alerts:&lt;/strong&gt; None — I found out about incidents from social feeds and security threads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.npmrc&lt;/code&gt; hardening:&lt;/strong&gt; Default settings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; security section:&lt;/strong&gt; Didn't exist&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-install hooks:&lt;/strong&gt; None&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD security checks:&lt;/strong&gt; Basic Dependabot, no custom policy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response time to incidents:&lt;/strong&gt; Hours to days, depending on when I saw the news&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  After DevRadar Guard
&lt;/h2&gt;

&lt;p&gt;I deployed DevRadar Guard's hosted monitoring on a small VPS that checks every 30 minutes. Here's what changed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Signal Collection
&lt;/h3&gt;

&lt;p&gt;The Signal Engine collects threat intelligence from GitHub Security Advisories every 30 minutes. In the first 24 hours, it ingested &lt;strong&gt;467 raw events&lt;/strong&gt; — advisories affecting npm packages — and normalized all of them into structured threat candidates with confidence scores.&lt;/p&gt;

&lt;p&gt;Each signal is scored across five dimensions: source quality, technical specificity, cross-reference validation, discussion velocity, and ecosystem relevance.&lt;/p&gt;
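
&lt;p&gt;The engine's internals aren't shown in this post; as a rough idea of the collection step, GitHub's global advisories endpoint can be polled like this (pagination, de-duplication, and the scoring itself are omitted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# Pull recent security advisories for the npm ecosystem from GitHub's public API.
resp = requests.get(
    "https://api.github.com/advisories",
    params={"ecosystem": "npm", "per_page": 100},
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
for advisory in resp.json():
    print(advisory["ghsa_id"], advisory.get("summary", ""))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;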

&lt;h3&gt;
  
  
  Exposure Matching
&lt;/h3&gt;

&lt;p&gt;Out of 467 normalized signals, the Exposure Engine matched &lt;strong&gt;1 against GloriaPPT's actual dependency tree&lt;/strong&gt;: axios.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Package:&lt;/strong&gt; axios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Installed version:&lt;/strong&gt; 1.14.1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence score:&lt;/strong&gt; 65/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exposure score:&lt;/strong&gt; 50/100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final risk:&lt;/strong&gt; 57/100 (alert threshold: 50)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The confidence score reflects signal quality: a high-trust source, a named package and version, and enough technical detail to treat the advisory seriously.&lt;/p&gt;

&lt;p&gt;The exposure score reflects how directly the issue touched this repo: &lt;code&gt;axios&lt;/code&gt; was a direct dependency, and the affected version was present in the lockfile.&lt;/p&gt;
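
&lt;p&gt;The exact weighting DevRadar Guard uses isn't documented in this post, but for intuition, an even split of the two scores lands at the reported number:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;confidence, exposure = 65, 50
final_risk = int(0.5 * confidence + 0.5 * exposure)  # 57 -- above the alert threshold of 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;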

&lt;h3&gt;
  
  
  The Alert
&lt;/h3&gt;

&lt;p&gt;At 00:24 KST (UTC+9) on April 6, a Slack alert landed in #devradar-alerts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🛡️ DevRadar Guard Alert

Package: axios
Version: 1.14.1
Risk Score: 57/100
Confidence: 65
Exposure: 50
Path: direct

Signal: Compromised axios versions 1.14.1 and 0.30.4 were
reported to deliver a remote access trojan...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I didn't find out about the axios compromise from social media. The alert was waiting for me when I checked Slack.&lt;/p&gt;

&lt;h3&gt;
  
  
  Guardrail Bundle
&lt;/h3&gt;

&lt;p&gt;DevRadar Guard generates a guardrail bundle — a set of files you can drop into a repo to harden installs, guide AI coding agents, and surface risky dependency changes during review:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;CLAUDE.md&lt;/code&gt; security section&lt;/td&gt;
&lt;td&gt;Security policy for AI coding agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;.npmrc&lt;/code&gt; hardening&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ignore-scripts=true&lt;/code&gt;, &lt;code&gt;audit=true&lt;/code&gt;, registry pinning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-install hook&lt;/td&gt;
&lt;td&gt;Warns before installing packages younger than 7 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Actions workflow&lt;/td&gt;
&lt;td&gt;PR check that flags risky dependency changes (alert-only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;devradar-policy.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Machine-readable policy for CI/CD integration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;GloriaPPT now uses all 8 generated guardrail files. The pre-install hook would likely have flagged the malicious &lt;code&gt;plain-crypto-js&lt;/code&gt; dependency used in the attack, since it had been published less than 24 hours earlier.&lt;/p&gt;
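
&lt;p&gt;The shipped hook lives in the guardrail bundle; to illustrate the underlying check, the npm registry exposes publish timestamps that any script can compare against a freshness threshold (Python here only to show the idea):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timezone, timedelta
import requests

def published_within(package, version, days=7):
    """True if this exact version hit the npm registry within the last `days` days."""
    meta = requests.get(f"https://registry.npmjs.org/{package}", timeout=10).json()
    published = datetime.fromisoformat(meta["time"][version].replace("Z", "+00:00"))
    return datetime.now(timezone.utc) - published &amp;lt; timedelta(days=days)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;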

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dependencies monitored&lt;/td&gt;
&lt;td&gt;954&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Raw threat signals collected (first 24h)&lt;/td&gt;
&lt;td&gt;467&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Normalization success rate (first 24h sample)&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Signals matched to GloriaPPT&lt;/td&gt;
&lt;td&gt;1 (axios)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;False positives in this case study&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time from advisory to alert&lt;/td&gt;
&lt;td&gt;&amp;lt; 30 minutes (cron cycle)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guardrail files generated&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual intervention during detection&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What This Doesn't Prove
&lt;/h2&gt;

&lt;p&gt;I want to be honest about what this case study shows and what it doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It shows:&lt;/strong&gt; A real supply chain threat was detected, matched to a real project, and surfaced as an actionable alert — automatically, without manual intervention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It doesn't show:&lt;/strong&gt; DevRadar Guard catching a zero-day before anyone else. The axios advisory was already published when my pipeline picked it up. I'm not claiming to be faster than GitHub Advisory. I'm claiming to be faster than manual monitoring — finding out from social feeds, security threads, or a post after the fact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It doesn't show:&lt;/strong&gt; Protection against all supply chain attacks. The Signal Engine currently monitors GitHub Advisories only. Reddit, npm registry anomaly detection, and other sources are planned but not yet active in Alpha.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It doesn't show:&lt;/strong&gt; Automatic blocking. DevRadar Guard Alpha is alert-only. No PR failures, no install blocks, no surprises. You get the information; you decide what to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;DevRadar Guard is still in Alpha, and I'm testing it with a small number of pilot teams. Right now that includes hosted monitoring on a 30-minute cycle, matched alerts in Slack or Discord, a generated guardrail bundle for the repo, and a weekly threat briefing. All I ask in return is a few minutes of feedback each week.&lt;/p&gt;

&lt;p&gt;If your project has a &lt;code&gt;package-lock.json&lt;/code&gt; and you want earlier, repo-specific visibility into supply chain incidents, the starter kit and waitlist are below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/devradar-guard/devradar-guard/tree/main/examples/starter-kit" rel="noopener noreferrer"&gt;Starter Kit on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tally.so/r/GxDbbL" rel="noopener noreferrer"&gt;Join the waitlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;DevRadar Guard Alpha — alert-only, no automatic blocking. You stay in control.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>npm</category>
      <category>security</category>
      <category>supplychain</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Axios Was Compromised. Here's What It Means for Your Repo.</title>
      <dc:creator>Joseph Yeo</dc:creator>
      <pubDate>Mon, 06 Apr 2026 03:58:03 +0000</pubDate>
      <link>https://forem.com/josephyeo/axios-was-compromised-heres-what-it-means-for-your-repo-1hh0</link>
      <guid>https://forem.com/josephyeo/axios-was-compromised-heres-what-it-means-for-your-repo-1hh0</guid>
      <description>&lt;p&gt;On March 31, 2026, the &lt;code&gt;axios&lt;/code&gt; npm package — with over 100 million weekly downloads — was compromised and used to distribute malware.&lt;/p&gt;

&lt;p&gt;A threat actor took over the lead maintainer's npm account, published two backdoored versions (&lt;code&gt;1.14.1&lt;/code&gt; and &lt;code&gt;0.30.4&lt;/code&gt;), and added a hidden dependency that deployed a cross-platform remote access trojan. The payload targeted Windows, macOS, and Linux. The malicious versions were live for only about three hours before they were removed.&lt;/p&gt;

&lt;p&gt;In practice, three hours was enough.&lt;/p&gt;

&lt;p&gt;Microsoft attributed the attack to Sapphire Sleet, a North Korean state actor. Google's Threat Intelligence Group confirmed UNC1069 involvement. This was a coordinated, pre-staged operation — the malicious dependency was planted 18 hours before activation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters to You
&lt;/h2&gt;

&lt;p&gt;If your &lt;code&gt;package.json&lt;/code&gt; uses caret ranges like &lt;code&gt;^1.x&lt;/code&gt;, a routine &lt;code&gt;npm install&lt;/code&gt; could have pulled the compromised version automatically. No unusual action required. Just your normal CI/CD pipeline doing what it was designed to do.&lt;/p&gt;

&lt;p&gt;Most teams would not have caught this in time.&lt;/p&gt;

&lt;p&gt;Not because they're careless, but because the tooling gap is real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;npm audit&lt;/code&gt; looks for known CVEs. This wasn't a CVE when it hit.&lt;/li&gt;
&lt;li&gt;Dependabot follows published advisories. This version came from the real maintainer account.&lt;/li&gt;
&lt;li&gt;Lockfiles help, but only if they're pinned and not being updated automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The teams that stayed safe had one thing in common: they treated dependency management as part of their security practice, not just routine package maintenance.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happened in Our Setup
&lt;/h2&gt;

&lt;p&gt;I maintain a project called GloriaPPT — a typical Next.js app with 954 npm dependencies. When the axios advisory dropped, I wasn't refreshing Twitter. I got a Slack alert.&lt;/p&gt;

&lt;p&gt;I built DevRadar Guard to answer one practical question fast: does this incident actually touch one of my repos? In this case, the flow looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Signal Engine&lt;/strong&gt; picked up the GitHub Advisory within its 30-minute collection cycle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exposure Engine&lt;/strong&gt; matched it against GloriaPPT's &lt;code&gt;package-lock.json&lt;/code&gt;, where &lt;code&gt;axios&lt;/code&gt; was a direct dependency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrail Engine&lt;/strong&gt; sent a Slack alert with the risk score, confidence level, and affected version.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No manual checking. No scrolling through threads or advisories. The alert landed with the information I needed to decide what to do next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Axios Was One Incident. The Pattern Keeps Repeating.
&lt;/h2&gt;

&lt;p&gt;Axios will be patched. Credentials will be rotated. Postmortems will be published.&lt;/p&gt;

&lt;p&gt;But the pattern repeats. Before axios, it was event-stream. Before that, ua-parser-js. The attack surface keeps growing with every install that pulls in packages your team didn't explicitly choose or review.&lt;/p&gt;

&lt;p&gt;The question isn't whether the next supply chain attack will happen. It's whether your repo will know about it before your CI/CD pipeline installs it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Can Do Today
&lt;/h2&gt;

&lt;p&gt;Even without new tooling, these steps can reduce your risk right away:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pin your dependencies.&lt;/strong&gt; Remove &lt;code&gt;^&lt;/code&gt; and &lt;code&gt;~&lt;/code&gt; from critical packages. Use exact versions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set &lt;code&gt;ignore-scripts=true&lt;/code&gt; in &lt;code&gt;.npmrc&lt;/code&gt;.&lt;/strong&gt; In this incident, that setting would have blocked the malicious install script.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review your lockfile after every install.&lt;/strong&gt; If a new transitive dependency appears that you didn't add, investigate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit your CI/CD pipeline permissions.&lt;/strong&gt; Does your build environment need network access during &lt;code&gt;npm install&lt;/code&gt;?&lt;/li&gt;
&lt;/ul&gt;
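
&lt;p&gt;The first two items boil down to a few lines of &lt;code&gt;.npmrc&lt;/code&gt; (standard npm settings; adjust the registry pin if you use a private mirror):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .npmrc
ignore-scripts=true   # block install/postinstall scripts, the payload vector in this incident
save-exact=true       # new installs get pinned versions, no ^ or ~ ranges
audit=true
registry=https://registry.npmjs.org/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;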

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;I'm also building DevRadar Guard around this workflow: early signal collection, repo exposure checks, and guardrail generation. Part of it is open source, including starter config for &lt;code&gt;.npmrc&lt;/code&gt;, pre-install hooks, &lt;code&gt;CLAUDE.md&lt;/code&gt; (security policy for AI coding agents), and GitHub Actions.&lt;/p&gt;

&lt;p&gt;DevRadar Guard is still in Alpha and runs in alert-only mode. No automatic blocking, and no surprise PR failures. You stay in control.&lt;/p&gt;

&lt;p&gt;If your team depends on npm and this workflow sounds useful, take a look at the starter kit or join the waitlist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/devradar-guard/devradar-guard/tree/main/examples/starter-kit" rel="noopener noreferrer"&gt;Starter Kit on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tally.so/r/GxDbbL" rel="noopener noreferrer"&gt;Join the waitlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>security</category>
      <category>npm</category>
      <category>supplychain</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
