<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: bchtitihi</title>
    <description>The latest articles on Forem by bchtitihi (@bchtitihi).</description>
    <link>https://forem.com/bchtitihi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3812055%2F6fab1811-aa05-45ed-be68-8c15922950b8.png</url>
      <title>Forem: bchtitihi</title>
      <link>https://forem.com/bchtitihi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/bchtitihi"/>
    <language>en</language>
    <item>
      <title>How I Audited 250K Lines of Legacy Code with 11 AI Agents in One Week</title>
      <dc:creator>bchtitihi</dc:creator>
      <pubDate>Sat, 07 Mar 2026 19:41:54 +0000</pubDate>
      <link>https://forem.com/bchtitihi/how-i-audited-250k-lines-of-legacy-code-with-11-ai-agents-in-one-week-4mfe</link>
      <guid>https://forem.com/bchtitihi/how-i-audited-250k-lines-of-legacy-code-with-11-ai-agents-in-one-week-4mfe</guid>
      <description>&lt;p&gt;I inherited a monolith. 250,000 lines of Python. 20+ years old. The framework was end-of-life since 2018. The language was end-of-life since 2020. Zero tests. Passwords stored in plain text. A proprietary library maintained by 2 people, embedded in 133 imports across 47 files. A database with 462 tables using exotic PostgreSQL inheritance instead of standard ORM patterns. And 900+ production websites depending on it.&lt;/p&gt;

&lt;p&gt;My job: audit the entire thing before a rebuild decision. Traditional approach: 2-3 senior consultants, 4-8 weeks, six figures. &lt;/p&gt;

&lt;p&gt;My approach: 11 AI agents, 2 adversarial teams, 7 iterations, 10 days.&lt;/p&gt;

&lt;p&gt;Here's what happened — including the mistakes that made it work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Iteration 1: The Naive Start (1 agent)
&lt;/h2&gt;

&lt;p&gt;I started where everyone starts. One Claude conversation. Upload the codebase. Ask questions.&lt;/p&gt;

&lt;p&gt;The results looked impressive: 1,100 paragraphs, 18 sections covering architecture, security, performance, business rules. My first thought was "this is amazing."&lt;/p&gt;

&lt;p&gt;My second thought, three days later, was "half of this is wrong."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hallucinations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent claimed a major frontend library was "not present in the codebase." A simple &lt;code&gt;grep&lt;/code&gt; later found it in 11 files.&lt;/li&gt;
&lt;li&gt;It estimated "200+ SQL triggers." The actual count was 401.&lt;/li&gt;
&lt;li&gt;Most findings had no file references. When I tried to verify them, I couldn't find what the agent was talking about.&lt;/li&gt;
&lt;li&gt;4 database classes were referenced that &lt;strong&gt;didn't exist anywhere in the codebase&lt;/strong&gt;. The agent had invented plausible-sounding names with field counts and relationships.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; AI hallucinates when it can't verify. Without &lt;code&gt;file:line&lt;/code&gt; proof, findings are fiction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule #1 established:&lt;/strong&gt; Every finding must include &lt;code&gt;file:line&lt;/code&gt; proof. No exceptions.&lt;/p&gt;
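&lt;p&gt;Rule #1 is enforceable mechanically. A minimal sketch of such a gate (the finding fields and the &lt;code&gt;proof&lt;/code&gt; format here are illustrative, not the repo's actual schema):&lt;/p&gt;

```python
import re

# Illustrative finding records -- field names are hypothetical,
# not the actual schema from the audit repo.
FINDINGS = [
    {"claim": "Plaintext passwords in auth module", "proof": "core/auth.py:118"},
    {"claim": "Session tokens never expire", "proof": None},
]

# Expects file:line, e.g. core/auth.py:118
PROOF_RE = re.compile(r"^[\w./-]+:\d+$")

def accept(finding):
    """Rule #1: no file:line proof, no finding."""
    proof = finding.get("proof")
    return bool(proof) and bool(PROOF_RE.match(proof))

accepted = [f for f in FINDINGS if accept(f)]
rejected = [f for f in FINDINGS if not accept(f)]
```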

&lt;h2&gt;
  
  
  Iteration 2: Filesystem Access (4 agents)
&lt;/h2&gt;

&lt;p&gt;The fix seemed obvious: give the agents filesystem access so they could actually &lt;code&gt;grep&lt;/code&gt; and &lt;code&gt;find&lt;/code&gt; before making claims.&lt;/p&gt;

&lt;p&gt;I set up 4 specialized agents running sequentially:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Security Hunter&lt;/td&gt;
&lt;td&gt;OWASP Top 10, credentials, injections&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Code Archaeologist&lt;/td&gt;
&lt;td&gt;Dead code, business rules, module scope&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Metrics Counter&lt;/td&gt;
&lt;td&gt;Exact counts, schema, performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Cross-Checker&lt;/td&gt;
&lt;td&gt;Consolidation, contradictions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I used a configuration inspired by the &lt;a href="https://github.com/shanraisshan/claude-code-best-practice" rel="noopener noreferrer"&gt;shanraisshan/claude-code-best-practice&lt;/a&gt; repo: YAML frontmatter for agents, glob-based rules that load only when needed, and progressive disclosure for skills — only descriptions load at startup, full content on demand. This saved ~60% of the context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What worked:&lt;/strong&gt; Agent 3 ran &lt;code&gt;grep -c "CREATE TRIGGER"&lt;/code&gt; and got 954 (not the "200+" estimate from Iteration 1). Real numbers replaced guesses.&lt;/p&gt;
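&lt;p&gt;The same kind of counting pass is a few lines of Python; the SQL fragment below is invented for illustration:&lt;/p&gt;

```python
import re

# Invented SQL dump fragment, standing in for the real schema files.
schema_sql = """
CREATE TRIGGER trg_orders_audit AFTER INSERT ON orders EXECUTE PROCEDURE audit();
create trigger trg_users_sync AFTER UPDATE ON users EXECUTE PROCEDURE sync();
CREATE TABLE orders (id serial PRIMARY KEY);
"""

# Case-insensitive match, like `grep -ci "CREATE TRIGGER"`:
# an exact count replaces "200+"-style guesses.
count = len(re.findall(r"CREATE\s+TRIGGER", schema_sql, re.IGNORECASE))
```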

&lt;p&gt;&lt;strong&gt;What went wrong:&lt;/strong&gt; 15+ findings were left marked "[TO VERIFY]." The agents couldn't verify each other because they ran sequentially — by the time Agent 4 found issues, Agents 1-3 were done.&lt;/p&gt;

&lt;p&gt;Also, Agent 4 estimated security remediation at "€95K + 4 weeks." Agents have zero basis for cost estimation. This was pure hallucination dressed as analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rules established:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No "[TO VERIFY]" in final deliverables&lt;/li&gt;
&lt;li&gt;No effort estimates — agents audit, humans estimate&lt;/li&gt;
&lt;li&gt;Cross-review between agents is mandatory&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Iteration 3: Parallel with Cross-Review (5 agents)
&lt;/h2&gt;

&lt;p&gt;The big change: agents running in parallel, each reviewing one other agent's work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The breakthrough:&lt;/strong&gt; Zero "[TO VERIFY]" markers. When Agent 1 claimed 35 imports and Agent 3 counted 38, the consolidator re-ran the grep and settled it (38 was correct).&lt;/p&gt;
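&lt;p&gt;The arbitration step itself is mechanical: when two counts disagree, re-run the measurement and let it win. A sketch with an invented claim structure:&lt;/p&gt;

```python
def reconcile(claims, recount):
    """If agents disagree on a number, re-run the measurement and let it win."""
    values = set(claims.values())
    if len(values) == 1:
        return values.pop(), False   # agents already agree
    return recount(), True           # authoritative re-measurement settles it

# Invented example: two agents disagree on an import count.
claims = {"security-hunter": 35, "metrics-counter": 38}
codebase = ["import legacy_lib"] * 38   # stand-in for re-running the real grep
verified, disputed = reconcile(claims, recount=lambda: len(codebase))
```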

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; Good breadth, shallow depth. The security agent found 12 vulnerabilities. A deeper audit later found 19+, 7 of them critical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context management became critical.&lt;/strong&gt; I learned that compacting at 70% of context usage (not the default 95%) prevents agents from losing instructions mid-analysis. And CLAUDE.md files over 200 lines get partially ignored — details need to move into separate rule and skill files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Iteration 4: Specialized Deep-Dive (10 agents)
&lt;/h2&gt;

&lt;p&gt;One agent per domain: workflows, batch processes, forms, templates, middleware, database schema, integrations, module classification, security+GDPR, quality arbiter.&lt;/p&gt;

&lt;p&gt;This produced exhaustive reports. The Module Classifier categorized every view, model, and route as IN_SIMPLE / IN_COMPLEX / OUT / GRAY_ZONE — giving the CTO a clear decision framework: "28 things are easy, 47 are hard, 120 are out of scope, 12 need your decision."&lt;/p&gt;
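&lt;p&gt;A decision summary like that reduces to a grouping pass. The module names below are invented; the four labels are the audit's real categories:&lt;/p&gt;

```python
from collections import Counter

# Invented module inventory -- the real audit classified every view, model, route.
modules = {
    "billing/views.py": "IN_SIMPLE",
    "workflow/engine.py": "IN_COMPLEX",
    "legacy/fax_gateway.py": "OUT",
    "reports/exports.py": "GRAY_ZONE",
    "accounts/models.py": "IN_SIMPLE",
}

summary = Counter(modules.values())
# Only GRAY_ZONE items go back to the CTO for a human decision.
needs_decision = sorted(m for m, label in modules.items() if label == "GRAY_ZONE")
```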

&lt;p&gt;&lt;strong&gt;But this is also where the biggest error was introduced.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Agent 6 (Schema Architect) reported &lt;strong&gt;889 foreign keys&lt;/strong&gt;. This number came from counting columns named &lt;code&gt;ref_*&lt;/code&gt; — a naming convention. Previous iterations had reported the same number. Nobody questioned it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;889 traveled from Iteration 2 → 3 → 4 without anyone verifying the actual database constraints.&lt;/strong&gt; I'll tell you the real number soon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Iteration 5: The Validation Tribunal (15 agents)
&lt;/h2&gt;

&lt;p&gt;15 fresh agents, one per domain, each re-verifying every finding from Iterations 1-4. &lt;strong&gt;They could only read source code — not previous reports.&lt;/strong&gt; This prevented them from anchoring on earlier conclusions.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Confirmed&lt;/td&gt;
&lt;td&gt;131&lt;/td&gt;
&lt;td&gt;78.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partially confirmed&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;11.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Invalidated&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;8.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Not verifiable&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New findings&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reliability score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;89.4%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 4 hallucinations from earlier iterations were caught here. The agents had invented database classes that sounded plausible but didn't exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But the 889 FK survived.&lt;/strong&gt; The validators re-ran &lt;code&gt;grep -c "ref_"&lt;/code&gt; and got 889 again. The query was correct. The interpretation was wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-domain scores revealed the weak spot:&lt;/strong&gt; Architecture: 100%. Business rules: 100%. &lt;strong&gt;Database: 37.5%.&lt;/strong&gt; That score screamed "investigate further."&lt;/p&gt;
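&lt;p&gt;Per-domain scoring can be sketched with an assumed weighting (confirmed = 1, partially confirmed = 0.5, invalidated = 0); the repo publishes the actual formula, and the per-domain verdict counts here are invented:&lt;/p&gt;

```python
# Invented per-domain verdict counts. Weighting is an assumption
# (confirmed = 1.0, partial = 0.5, invalidated = 0) -- the repo
# publishes the actual scoring formula.
domains = {
    "architecture": {"confirmed": 12, "partial": 0, "invalidated": 0},
    "database": {"confirmed": 3, "partial": 0, "invalidated": 5},
}

def reliability(v):
    total = v["confirmed"] + v["partial"] + v["invalidated"]
    return 100 * (v["confirmed"] + 0.5 * v["partial"]) / total

scores = {d: round(reliability(v), 1) for d, v in domains.items()}
# Flag domains scoring under 60 for a deeper follow-up pass.
weak = [d for d, s in scores.items() if 60 > s]
```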

&lt;h2&gt;
  
  
  Iteration 6: Active Exploration (7 agents)
&lt;/h2&gt;

&lt;p&gt;Instead of re-reading the same code, I introduced &lt;strong&gt;new data sources&lt;/strong&gt;: git history, CVE scanning (&lt;code&gt;pip-audit&lt;/code&gt;), and the production database schema.&lt;/p&gt;

&lt;h3&gt;
  
  
  The 889 → 15 moment
&lt;/h3&gt;

&lt;p&gt;The Schema Inspector obtained the &lt;strong&gt;production schema&lt;/strong&gt; and ran:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;table_constraints&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;constraint_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'FOREIGN KEY'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: &lt;strong&gt;15.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All 15 were on system tables. &lt;strong&gt;Zero on business tables.&lt;/strong&gt; For 20+ years, the application had operated with zero referential integrity on its core data.&lt;/p&gt;

&lt;p&gt;The migration strategy changed completely. Instead of "migrate 889 FK relationships," it became "design proper constraints for the new system."&lt;/p&gt;
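&lt;p&gt;The gap between a naming convention and an enforced constraint is easy to reproduce. A toy &lt;code&gt;sqlite3&lt;/code&gt; sketch (standing in for PostgreSQL and &lt;code&gt;information_schema&lt;/code&gt;, which the real check used):&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A `ref_` naming convention with NO declared foreign key -- like the audited schema.
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, ref_customer INTEGER)")

# Counting by naming convention says "1 relationship"...
cols = con.execute("PRAGMA table_info(orders)").fetchall()
ref_columns = [c[1] for c in cols if c[1].startswith("ref_")]

# ...but the engine knows of zero enforced constraints.
declared_fks = con.execute("PRAGMA foreign_key_list(orders)").fetchall()
```

&lt;p&gt;Against a real PostgreSQL instance, the equivalent check is the &lt;code&gt;information_schema.table_constraints&lt;/code&gt; query shown above.&lt;/p&gt;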

&lt;h3&gt;
  
  
  Other discoveries
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;97% of code was dormant&lt;/strong&gt; (135 active files out of 4,369)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;36 CVEs confirmed&lt;/strong&gt; (4 critical, CVSS ≥ 9.0)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bus factor = 1&lt;/strong&gt; (one developer owned 67-72% of commits on 7/8 critical modules)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The database dump was 8 years old&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
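&lt;p&gt;The bus-factor number falls out of &lt;code&gt;git shortlog&lt;/code&gt;-style data. A sketch over invented commit records (the real analysis ran on the actual git history):&lt;/p&gt;

```python
from collections import Counter

# Invented (module, author) commit records standing in for `git log` output.
commits = [
    ("billing", "alice"), ("billing", "alice"), ("billing", "bob"),
    ("workflow", "alice"), ("workflow", "alice"), ("workflow", "alice"),
    ("workflow", "carol"),
]

def top_author_share(module):
    """Fraction of a module's commits owned by its single biggest committer."""
    authors = Counter(a for m, a in commits if m == module)
    return max(authors.values()) / sum(authors.values())

# Bus factor 1 warning: one person owns a majority of a critical module.
risky = sorted(m for m in {"billing", "workflow"} if top_author_share(m) > 0.6)
```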

&lt;h2&gt;
  
  
  Iteration 7: The Adversarial Dual-Team (11 agents)
&lt;/h2&gt;

&lt;p&gt;The final iteration. Two independent teams: Team A (7 agents) audits, Team B (4 agents) tries to prove Team A wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical rule:&lt;/strong&gt; Team B cannot modify Team A's reports. It produces its own files.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Findings challenged&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Confirmed&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nuanced (correct but misleading)&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Invalidated&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reliability score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.8%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Why 81.8% is better than 89.4%
&lt;/h3&gt;

&lt;p&gt;Iteration 5's 89.4% was a &lt;strong&gt;validation&lt;/strong&gt; score: "Are the facts correct?" Iteration 7's 81.8% was an &lt;strong&gt;adversarial&lt;/strong&gt; score: "Can we find reasons these facts are wrong or misleading?"&lt;/p&gt;

&lt;p&gt;The lower score is more trustworthy. It means the process works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 8 Rules (Learned the Hard Way)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Learned in&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Every finding needs &lt;code&gt;file:line&lt;/code&gt; proof&lt;/td&gt;
&lt;td&gt;Iteration 1&lt;/td&gt;
&lt;td&gt;Without proof, agents hallucinate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Search before claiming&lt;/td&gt;
&lt;td&gt;Iteration 1&lt;/td&gt;
&lt;td&gt;"Library absent" — found in 11 files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No effort estimates&lt;/td&gt;
&lt;td&gt;Iteration 2&lt;/td&gt;
&lt;td&gt;Agents are terrible at estimation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-review between agents&lt;/td&gt;
&lt;td&gt;Iteration 3&lt;/td&gt;
&lt;td&gt;Agents contradict each other silently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Classify everything&lt;/td&gt;
&lt;td&gt;Iteration 4&lt;/td&gt;
&lt;td&gt;Decision-makers need decisions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Re-verify previous iterations&lt;/td&gt;
&lt;td&gt;Iteration 5&lt;/td&gt;
&lt;td&gt;889 FK traveled 4 iterations unchecked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use production data&lt;/td&gt;
&lt;td&gt;Iteration 6&lt;/td&gt;
&lt;td&gt;889 → 15 changed everything&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adversarial beats validation&lt;/td&gt;
&lt;td&gt;Iteration 7&lt;/td&gt;
&lt;td&gt;89.4% looked good, 81.8% was honest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Open-Source Methodology
&lt;/h2&gt;

&lt;p&gt;I've open-sourced everything: &lt;strong&gt;&lt;a href="https://github.com/bchtitihi/legacy-audit-agents" rel="noopener noreferrer"&gt;github.com/bchtitihi/legacy-audit-agents&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The repo includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;7 detailed iteration documents (every mistake documented)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6 progressive setup levels&lt;/strong&gt; (Level 2 → 7) — you can iterate the same way I did&lt;/li&gt;
&lt;li&gt;11 agent definitions (7 Team A + 4 Team B)&lt;/li&gt;
&lt;li&gt;Rules, skills, commands, hooks — all battle-tested&lt;/li&gt;
&lt;li&gt;Stack-specific examples (Django, Rails, Node.js)&lt;/li&gt;
&lt;li&gt;Reliability scoring formula&lt;/li&gt;
&lt;li&gt;References to &lt;a href="https://www.anthropic.com/engineering/claude-code-best-practices" rel="noopener noreferrer"&gt;Anthropic's official docs&lt;/a&gt; and community best practices&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Quick start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/bchtitihi/legacy-audit-agents.git
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; legacy-audit-agents/setup/.claude your-project/.claude
&lt;span class="nb"&gt;cp &lt;/span&gt;legacy-audit-agents/setup/CLAUDE.md your-project/CLAUDE.md
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; claude &lt;span class="nt"&gt;--dangerously-skip-permissions&lt;/span&gt;
&lt;span class="c"&gt;# Type: /audit-run&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or follow the progressive path: start at Level 2, iterate to Level 7.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I'm Publishing This
&lt;/h2&gt;

&lt;p&gt;Every other Claude Code repo gives you configs and templates. This one gives you a &lt;strong&gt;methodology&lt;/strong&gt; — with every mistake documented so you don't repeat them.&lt;/p&gt;

&lt;p&gt;The methodology didn't start with 11 agents. It started with 1 agent that hallucinated. &lt;strong&gt;This progression is the real value.&lt;/strong&gt; Not just the final setup — the entire journey from naive to adversarial.&lt;/p&gt;

&lt;p&gt;As far as I can tell, nobody has published this progression before. I want others to build on it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/bchtitihi/legacy-audit-agents" rel="noopener noreferrer"&gt;⭐ Star the repo&lt;/a&gt; if this is useful. Questions? &lt;a href="https://github.com/bchtitihi/legacy-audit-agents/issues" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>programming</category>
      <category>methodology</category>
    </item>
  </channel>
</rss>
