<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: WonderLab</title>
    <description>The latest articles on Forem by WonderLab (@wonderlab).</description>
    <link>https://forem.com/wonderlab</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3797373%2F25beba30-d8d4-4d2e-9ec6-170356089350.jpg</url>
      <title>Forem: WonderLab</title>
      <link>https://forem.com/wonderlab</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/wonderlab"/>
    <language>en</language>
    <item>
      <title>One Open Source Project a Day (No. 72): Andrej Karpathy Skills — Fix Four Chronic LLM Coding Problems With a Single CLAUDE.md</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Fri, 22 May 2026 07:00:16 +0000</pubDate>
      <link>https://forem.com/wonderlab/one-open-source-project-a-day-no-72-andrej-karpathy-skills-fix-four-chronic-llm-coding-4afo</link>
      <guid>https://forem.com/wonderlab/one-open-source-project-a-day-no-72-andrej-karpathy-skills-fix-four-chronic-llm-coding-4afo</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"LLMs excel at looping until they meet specific goals — so provide success criteria rather than imperative instructions."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the NO.72 article in the "One Open Source Project a Day" series. Today we are exploring &lt;strong&gt;andrej-karpathy-skills&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This project is unusual: its core is not a tool, framework, or library — &lt;strong&gt;it is a single CLAUDE.md file&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The story starts with Andrej Karpathy posting on X after heavy Claude Code usage, documenting failure patterns he observed in LLM coding: diving into implementation without clarification, engineering simple problems into complex solutions, making unrequested changes to adjacent code.&lt;/p&gt;

&lt;p&gt;The multica-ai team distilled those observations into four actionable behavioral principles and packaged them into a CLAUDE.md — drop it in a project, and Claude Code changes how it behaves. It also ships in Claude Code plugin and Cursor rules formats, covering the two main AI coding tools.&lt;/p&gt;

&lt;p&gt;The project answers a frequently overlooked question: &lt;strong&gt;rather than teaching an LLM exactly what to do, teach it how to think&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You Will Learn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The four LLM coding failure modes Karpathy identified&lt;/li&gt;
&lt;li&gt;The content and real-world examples behind each of the four principles&lt;/li&gt;
&lt;li&gt;Three installation methods: standalone CLAUDE.md / Claude Code plugin / Cursor rules&lt;/li&gt;
&lt;li&gt;Why "give success criteria" is more effective than "give step-by-step instructions"&lt;/li&gt;
&lt;li&gt;How to verify the guidelines are actually working&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Familiarity with Claude Code, Cursor, or similar AI coding tools&lt;/li&gt;
&lt;li&gt;Enough hands-on coding experience to recognize LLM coding pain points&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Project Introduction
&lt;/h3&gt;

&lt;p&gt;andrej-karpathy-skills is fundamentally a behavioral configuration file. Its design philosophy stems from one key insight:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;LLM coding problems are often not about capability — they are about unconstrained behavior.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model is capable of writing simple code, but nothing tells it "don't write complex code." It is capable of asking for clarification first, but there is no pressure making it do so. It knows it shouldn't touch unrelated code, but the habit of "while I'm at it" is hard to suppress.&lt;/p&gt;

&lt;p&gt;This CLAUDE.md file makes those constraints explicit, injecting them into every conversation as context.&lt;/p&gt;

&lt;p&gt;The project ships in three formats to cover different workflows and tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Author/Team
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Maintainer&lt;/strong&gt;: &lt;a href="https://github.com/multica-ai" rel="noopener noreferrer"&gt;multica-ai&lt;/a&gt; (Multica AI team)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inspiration&lt;/strong&gt;: Andrej Karpathy's observations shared on X about LLM coding usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Original author&lt;/strong&gt;: The CLAUDE.md content was originally compiled by &lt;a href="https://github.com/forrestchang" rel="noopener noreferrer"&gt;forrestchang&lt;/a&gt;; multica-ai extended it into a full plugin ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;About Andrej Karpathy: Co-founder of OpenAI, former Tesla AI Director, now independent researcher. Known for nanoGPT, &lt;em&gt;Neural Networks: Zero to Hero&lt;/em&gt;, and other educational projects widely followed in the AI community. His practical feedback on AI tools carries significant weight.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project Stats
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;📄 Core file: &lt;code&gt;CLAUDE.md&lt;/code&gt; (behavioral guidelines)&lt;/li&gt;
&lt;li&gt;🔌 Claude Code plugin + Cursor rules&lt;/li&gt;
&lt;li&gt;📖 Includes: &lt;code&gt;EXAMPLES.md&lt;/code&gt; (contrast examples for each principle)&lt;/li&gt;
&lt;li&gt;📄 License: MIT&lt;/li&gt;
&lt;li&gt;🌐 Repository: &lt;a href="https://github.com/multica-ai/andrej-karpathy-skills" rel="noopener noreferrer"&gt;multica-ai/andrej-karpathy-skills&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Main Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Utility
&lt;/h3&gt;

&lt;p&gt;This CLAUDE.md directly targets the four most common LLM coding failure modes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Common LLM coding failures
  ├── Silent assumptions (dive into code without clarifying)
  ├── Over-engineering (turn simple problems into complex ones)
  ├── Scope creep (touch unrelated code while "in the area")
  └── Vague execution (no verifiable definition of done)
        ↓ CLAUDE.md injects four behavioral principles
  Changed behavior
  ├── Think Before Coding
  ├── Simplicity First
  ├── Surgical Changes
  └── Goal-Driven Execution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Method 1: Claude Code plugin (recommended, global)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/plugin marketplace add forrestchang/andrej-karpathy-skills
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;andrej-karpathy-skills@karpathy-skills
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Applies to all projects after install — no per-project setup needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Method 2: Per-project CLAUDE.md&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
curl &lt;span class="nt"&gt;-o&lt;/span&gt; CLAUDE.md https://raw.githubusercontent.com/forrestchang/andrej-karpathy-skills/main/CLAUDE.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your project already has a CLAUDE.md, merge the content manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Method 3: Cursor rules&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The repo ships &lt;code&gt;.cursor/rules/karpathy-guidelines.mdc&lt;/code&gt; with &lt;code&gt;alwaysApply: true&lt;/code&gt;. Copy it into your project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; .cursor/rules
&lt;span class="nb"&gt;cp &lt;/span&gt;path/to/karpathy-guidelines.mdc .cursor/rules/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify it's active: Cursor → Settings → Rules — the guideline should appear in the list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signs the guidelines are working&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude asks clarifying questions before implementing, rather than diving in&lt;/li&gt;
&lt;li&gt;PR diffs are cleaner — only the requested changes appear&lt;/li&gt;
&lt;li&gt;No more "while I was there, I also improved..." in commit messages&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Four Principles
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Principle 1: Think Before Coding
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The problem it targets&lt;/strong&gt;: When faced with ambiguous requirements, LLMs tend to make implicit assumptions and execute silently, rather than surfacing the uncertainty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrong behavior example&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: Add a feature to export user data
LLM: [immediately starts writing code]
     → Assumes CSV format, assumes all fields, assumes synchronous download
     → But the user actually needed a background job + specific fields + JSON format
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Correct behavior&lt;/strong&gt; (the principle requires surfacing uncertainty first):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: Add a feature to export user data
LLM: Before I start, a few things to clarify:
     1. Scope: Export all users or a filtered subset?
     2. Format: CSV / JSON / Excel?
     3. Delivery: Browser download, background job via email, or API endpoint?
     4. Fields: Which fields to include? Any sensitive data restrictions?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Another canonical example&lt;/strong&gt; — "Make search faster" has three completely different valid meanings:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Interpretation&lt;/th&gt;
&lt;th&gt;Actual Need&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Response time&lt;/td&gt;
&lt;td&gt;API returns slowly&lt;/td&gt;
&lt;td&gt;Add indexes, caching, optimize queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;High concurrency&lt;/td&gt;
&lt;td&gt;Horizontal scaling, queuing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Perceived UX speed&lt;/td&gt;
&lt;td&gt;User feels it is slow&lt;/td&gt;
&lt;td&gt;Preloading, skeleton screens, instant feedback&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All three require fundamentally different approaches. None can be "defaulted to."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the principle requires&lt;/strong&gt;: When multiple reasonable interpretations exist, present all of them and let the user choose. When genuinely confused, stop and say "I'm not sure how to handle this" rather than pushing through.&lt;/p&gt;




&lt;h3&gt;
  
  
  Principle 2: Simplicity First
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The problem it targets&lt;/strong&gt;: LLMs have a strong over-engineering tendency — introducing abstractions, frameworks, and "flexibility" before complexity is actually needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrong behavior example&lt;/strong&gt; — discount calculation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# User request: implement a simple discount calculation
&lt;/span&gt;
&lt;span class="c1"&gt;# ❌ LLM's "solution" (10x the code needed):
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DiscountStrategy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PercentageDiscount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DiscountStrategy&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DiscountConfig&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DiscountCalculator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DiscountStrategy&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cart&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Cart&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;# ...plus factory class, config class, registry...
&lt;/span&gt;
&lt;span class="c1"&gt;# ✅ What was actually needed (one function):
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;apply_discount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;discount_pct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;discount_pct&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Another example&lt;/strong&gt; — "Save user preferences":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ LLM implemented:
   - A caching layer with expiry (nobody asked)
   - Input validation (no bad data has appeared yet)
   - Conflict merging logic (nobody hit this problem)
   - A change notification system (nobody mentioned it)

✅ What was actually needed:
   - One function that writes preferences to the database
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The benchmark the principle provides&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Would a senior engineer look at this and say it's overcomplicated?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If 200 lines could be 50, rewrite it as 50.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core maxim&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Good code is code that solves today's problem simply, not tomorrow's problem prematurely."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Premature complexity is not just wasteful — it makes code harder to understand, introduces more bugs, and slows development, even when it follows recognized design patterns.&lt;/p&gt;




&lt;h3&gt;
  
  
  Principle 3: Surgical Changes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The problem it targets&lt;/strong&gt;: LLMs do "drive-by refactoring" — while fixing a bug, they also update quote styles, add type annotations, rename variables, and reorganize imports.&lt;/p&gt;

&lt;p&gt;This behavior feels helpful but has two serious problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Makes diffs hard to review&lt;/strong&gt;: Reviewers cannot distinguish which changes are bug fixes from which are "while I was there" improvements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Introduces unexpected regressions&lt;/strong&gt;: Every unrequested change is a potential risk point&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The right approach&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Original code (has a bug, and some "imperfections")
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_total&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# single quotes
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;                 &lt;span class="c1"&gt;# no type annotation
&lt;/span&gt;
&lt;span class="c1"&gt;# ❌ LLM's "comprehensive improvement":
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_total&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# added type annotation
&lt;/span&gt;    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Calculate total price of items.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;          &lt;span class="c1"&gt;# added docstring
&lt;/span&gt;    &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;                             &lt;span class="c1"&gt;# changed variable type
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# changed to double quotes ("style consistency")
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ Only fix the bug (suppose the bug is empty items):
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_total&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;               &lt;span class="c1"&gt;# only this one line added
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# original style preserved
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Specific requirements of the principle&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every changed line must trace directly to the user's request&lt;/li&gt;
&lt;li&gt;Match existing code style even if you prefer a different one&lt;/li&gt;
&lt;li&gt;Do not improve code you happen to pass through unless explicitly asked&lt;/li&gt;
&lt;li&gt;Only clean up unused imports/variables that &lt;strong&gt;your changes&lt;/strong&gt; created — leave pre-existing dead code alone&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Principle 4: Goal-Driven Execution
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The problem it targets&lt;/strong&gt;: Given a vague task, LLMs produce plans that look comprehensive but lack verifiable outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A vague plan example&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: "Refactor the auth module"

❌ Vague plan:
   1. Review existing code
   2. Identify problems
   3. Improve structure
   4. Run tests
   → Not a single step has a clear definition of "done"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The principle requires converting tasks to verifiable goals&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: "Fix the login bug"

✅ Goal-driven plan:
   Step 1: Write a failing test that reproduces the bug
           Checkpoint: Test actually fails on current code
   Step 2: Implement the fix
           Checkpoint: The test now passes
   Step 3: Run the full test suite
           Checkpoint: No new regressions
   → Every step has a clear, objective definition of "complete"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Another example&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: "Refactor the auth module"

✅ Concretized:
   1. All existing tests pass (record baseline)
   2. Extract TokenService (checkpoint: standalone unit tests pass)
   3. Refactor AuthController to use TokenService (checkpoint: integration tests pass)
   4. All original tests still pass (no regressions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Karpathy's core insight&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;LLMs excel at "looping until they meet specific goals" — so providing success criteria is more effective than providing imperative instructions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Imperative instructions ("do A, then B, then C") leave the LLM without guidance when something goes wrong. Declarative goals ("this test must pass," "this interface must be callable") let the LLM choose its own path while giving it a clear completion criterion.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why a File, Not a Tool?
&lt;/h2&gt;

&lt;p&gt;The design choice of this project is worth reflecting on. Faced with LLM coding behavior problems, many solutions are possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build an agent framework to constrain behavior&lt;/li&gt;
&lt;li&gt;Develop post-processing tools to detect and correct issues&lt;/li&gt;
&lt;li&gt;Fine-tune a better model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;andrej-karpathy-skills chose &lt;strong&gt;the simplest one&lt;/strong&gt;: a text file, placed in the project, which the LLM reads and follows itself.&lt;/p&gt;

&lt;p&gt;This choice is itself the best demonstration of "Simplicity First" — minimum mechanism, today's problem solved. And a text file has one advantage no tool can match: &lt;strong&gt;it can be read, understood, and modified by anyone at any time&lt;/strong&gt;, with no black box.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Links &amp;amp; Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Official Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🌟 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/multica-ai/andrej-karpathy-skills" rel="noopener noreferrer"&gt;https://github.com/multica-ai/andrej-karpathy-skills&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📄 &lt;strong&gt;Direct CLAUDE.md download&lt;/strong&gt;: Available via &lt;code&gt;curl&lt;/code&gt; (see Quick Start above)&lt;/li&gt;
&lt;li&gt;📖 &lt;strong&gt;Examples&lt;/strong&gt;: &lt;code&gt;EXAMPLES.md&lt;/code&gt; in the repo (contrast examples for each principle — recommended reading)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Target Audience
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Daily Claude Code / Cursor users&lt;/strong&gt;: Who want to reduce LLM over-engineering and unnecessary code changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team engineering productivity leads&lt;/strong&gt;: Looking to integrate behavioral standards into a shared CLAUDE.md and standardize AI-assisted coding across a team&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developers who care about reviewable PRs&lt;/strong&gt;: Who are tired of LLM-generated "super diffs" and want clean, focused pull requests containing only the requested changes&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Origin&lt;/strong&gt;: Directly distilled from Karpathy's first-hand observations of LLM coding failure modes — grounded in real usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Four principles&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Think Before Coding&lt;/strong&gt;: Make implicit assumptions explicit questions rather than silently picking one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity First&lt;/strong&gt;: Write the minimum code to solve today's problem, don't pre-build "flexibility"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surgical Changes&lt;/strong&gt;: Every changed line traces to the request; no drive-by refactoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal-Driven Execution&lt;/strong&gt;: Provide success criteria, not step-by-step instructions&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three install formats&lt;/strong&gt;: CLAUDE.md (per-project) / Claude Code plugin (global) / Cursor rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core philosophy&lt;/strong&gt;: LLMs excel at looping toward goals — give them goals, not procedures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-demonstrating&lt;/strong&gt;: The simplest possible solution (one file) to the problem it addresses — Simplicity First, embodied&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  One-Line Review
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;andrej-karpathy-skills does something deceptively small but far-reaching: compresses the engineering wisdom of "how to use LLMs well for coding" into a single text file anyone can read, understand, and drop into any project — and that file itself is the best proof of the simple-first philosophy it advocates.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Find more useful knowledge and interesting products on my &lt;a href="https://home.wonlab.top" rel="noopener noreferrer"&gt;Homepage&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>claude</category>
      <category>karpathy</category>
    </item>
    <item>
      <title>Building Reliable AI Agents: Harness Engineering and Multi-Agent Architecture in Practice</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Fri, 22 May 2026 06:58:31 +0000</pubDate>
      <link>https://forem.com/wonderlab/building-reliable-ai-agents-harness-engineering-and-multi-agent-architecture-in-practice-3jbn</link>
      <guid>https://forem.com/wonderlab/building-reliable-ai-agents-harness-engineering-and-multi-agent-architecture-in-practice-3jbn</guid>
      <description>&lt;h2&gt;
  
  
  The Problems We Actually Ran Into
&lt;/h2&gt;

&lt;p&gt;If you've built an AI-assisted analysis tool, you've probably hit these two walls:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wall #1: Inconsistent output quality.&lt;/strong&gt; The longer the task chain, the more the AI drifts — its language stays precise, its tone stays confident, but the conclusions don't hold up. Ask it "are you sure?" and it'll double down with even more conviction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wall #2: Token costs keep climbing.&lt;/strong&gt; The more history you accumulate, the more you have to re-feed the model on every new session. Token consumption grows linearly. Analysis quality doesn't follow.&lt;/p&gt;

&lt;p&gt;These are real problems we encountered building a CarPlay bug analysis tool. AI performance on multi-project, long-chain bug analysis was wildly inconsistent. When we introduced a multi-agent architecture to fix that, token consumption and runtime shot up instead.&lt;/p&gt;

&lt;p&gt;Those two problems pushed us to find a systematic engineering solution.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: Harness Engineering — Putting a Leash on the Model
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: Non-determinism Is an Agent's Original Sin
&lt;/h3&gt;

&lt;p&gt;LLMs are inherently probabilistic. In a single-turn conversation, that randomness is what makes them creative. In a long-chain task, it's what makes them dangerous.&lt;/p&gt;

&lt;p&gt;A typical failure scenario: you ask an Agent to complete a 10-step development task. Steps 1–7 go fine. On step 8, the Agent drifts slightly. Step 9 builds on the drift. By step 10, you receive a result that looks complete but is entirely off-target. And you almost don't notice — because the Agent wrote a very convincing summary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When AI starts executing autonomously across many steps, the central engineering challenge becomes: how do you supervise it and course-correct before the damage is done?&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution: Agent = Model + Harness
&lt;/h3&gt;

&lt;p&gt;In February 2026, Martin Fowler's team (author: Birgitta Böckeler) introduced the concept of &lt;strong&gt;Harness Engineering&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent = Model + Harness
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Harness&lt;/strong&gt; is everything in an Agent that isn't the model itself — prompts, tool definitions, rules, context management, validation mechanisms, feedback loops. All of it is the Harness.&lt;/p&gt;

&lt;p&gt;This definition sounds unremarkable at first. But it carries an important shift in thinking: &lt;strong&gt;to improve Agent reliability, don't swap in a better model — design a better Harness.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A Harness has two components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Guides (feedforward):&lt;/strong&gt; Give the AI the right inputs before it acts — clear instructions, relevant context, structured task descriptions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sensors (feedback):&lt;/strong&gt; Validate the AI's outputs after it acts — independent validators, quality evaluators, anomaly detectors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LangChain's experiment makes this concrete: &lt;strong&gt;without changing the underlying model, using Harness Engineering alone, their Agent's benchmark ranking jumped from outside the top 30 to top 5.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Goal: Fix Problems Before They Reach Human Eyes
&lt;/h3&gt;

&lt;p&gt;One sentence captures what Harness Engineering is for:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To make AI Coding Agents work with less human supervision, you need to systematically build an "external control framework" — the Harness. It's composed of feedforward Guides and feedback Sensors, with the goal of &lt;strong&gt;automatically correcting problems before they ever reach a human reviewer.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Part 2: Two Failure Modes That Break Single Agents
&lt;/h2&gt;

&lt;p&gt;With the framework established, let's look at exactly where single agents fail. Anthropic's article &lt;em&gt;Harness Design for Long-Running Application Development&lt;/em&gt; identifies two core failure modes:&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure Mode 1: Context Anxiety
&lt;/h3&gt;

&lt;p&gt;As a task gets long and the Agent approaches its context window limit, it starts to "panic" and rush to finish — marking incomplete work as done, writing vague analyses with false certainty.&lt;/p&gt;

&lt;p&gt;This isn't a bug. It's the model using "confident closure" as a coping mechanism for "uncertain continuation."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Don't use Compact (context compression). Use &lt;strong&gt;Context Reset&lt;/strong&gt; — completely clear the context, start a new Agent with a structured handoff document, and let the new Agent take over.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure Mode 2: Self-Evaluation Breakdown
&lt;/h3&gt;

&lt;p&gt;Ask an Agent to evaluate its own output and it becomes pathologically optimistic — it'll give itself high marks even when the work is poor, because it's evaluating using the same cognitive framework that produced the output.&lt;/p&gt;

&lt;p&gt;It's like asking someone to take an exam and grade it themselves. Almost guaranteed to score well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Introduce an &lt;strong&gt;independent Evaluator Agent&lt;/strong&gt;, deliberately prompted to be critical and skeptical, with no shared context with the Generator.&lt;/p&gt;

&lt;p&gt;These two discoveries directly motivate the first and most fundamental multi-agent pattern.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3: Multi-Agent Architecture — Five Coordination Patterns
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Some teams choose a pattern based on how sophisticated it sounds, not on whether it fits the problem at hand. Start with the &lt;strong&gt;simplest pattern that might work&lt;/strong&gt;, see where it struggles, then evolve from there.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Pattern 1: Generator-Validator
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Tasks where output quality is critical and evaluation criteria can be stated explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; Generator produces output → Validator evaluates against explicit criteria → if rejected, Generator gets specific feedback → loop until accepted or max iterations reached.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Generator → Output → Validator ──pass──→ Done
                          │
                     fail + feedback
                          │
                          └─→ Generator (next round)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Typical use cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code generation (Generator writes code, Validator writes and runs tests)&lt;/li&gt;
&lt;li&gt;Customer support replies (Validator checks accuracy, tone, completeness)&lt;/li&gt;
&lt;li&gt;Compliance review (Validator checks output against rules line by line)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Critical caveat:&lt;/strong&gt; The Validator must have &lt;strong&gt;specific, explicit criteria&lt;/strong&gt; — not "check if it's good." A Validator without concrete standards will just rubber-stamp the Generator's output. Also set a maximum iteration count with a fallback strategy to prevent infinite oscillation.&lt;/p&gt;




&lt;h3&gt;
  
  
  Pattern 2: Orchestrator-Subagent
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Tasks that decompose cleanly into independent subtasks with minimal interdependencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; Orchestrator handles global planning and task delegation. Subagents each own a specific responsibility and report results back. Orchestrator integrates and produces final output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Orchestrator ──delegate──→ Subagent A (security check)
             ──delegate──→ Subagent B (code style)
             ──delegate──→ Subagent C (test coverage)
                   ←──── collect all results ────
                   → integrate into final report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code uses this pattern: the main Agent handles the primary workflow while dispatching subagents in the background to search codebases or investigate independent questions — &lt;strong&gt;keeping the Orchestrator's context focused on the main task&lt;/strong&gt; while parallel work happens elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation:&lt;/strong&gt; The Orchestrator is an information bottleneck. Information discovered by one subagent that's relevant to another must route through the Orchestrator. Key details get lost or over-summarized after a few hops.&lt;/p&gt;




&lt;h3&gt;
  
  
  Pattern 3: Agent Team
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Tasks that decompose into long-running, independent subtasks where each worker benefits from accumulating domain context over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; Coordinator distributes work via a shared queue. Multiple Workers each pick up tasks, run autonomously through multi-step work, and signal on completion. Unlike Pattern 2, Workers &lt;strong&gt;persist across tasks&lt;/strong&gt; — they keep accumulating context rather than starting fresh each time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Coordinator ──→ Task queue
Worker A ←── pick up task ──→ complete ──→ signal
Worker B ←── pick up task ──→ complete ──→ signal
Worker C ←── pick up task ──→ complete ──→ signal
                                ↓
              Coordinator collects → integration tests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Typical use case:&lt;/strong&gt; Migrating a large codebase from one framework to another, with each Worker independently migrating one service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation:&lt;/strong&gt; Independence is a prerequisite. Workers can't easily share intermediate findings. Careful task partitioning and conflict resolution mechanisms are required.&lt;/p&gt;




&lt;h3&gt;
  
  
  Pattern 4: Message Bus
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Event-driven pipelines where the agent ecosystem is expected to keep growing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; Agents communicate through publish/subscribe events, decoupled from each other. A router delivers matching messages. New agents can join by subscribing to topics without modifying existing connections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Alert sources → Triage Agent → Router
                                ├──→ Network Investigation Agent
                                ├──→ Identity Analysis Agent
                                └──→ Context Enrichment Agent
                                            ↓
                                Response Coordination Agent → Actions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Best fit when:&lt;/strong&gt; Work is triggered by events rather than a predetermined sequence, and teams need to develop and deploy individual agents independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation:&lt;/strong&gt; The longer the event chain, the harder debugging becomes. A misrouted message causes silent failures — the system doesn't crash, it just doesn't process the event.&lt;/p&gt;




&lt;h3&gt;
  
  
  Pattern 5: Shared State
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Collaborative tasks where agents need to build on each other's discoveries, without a central coordinator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; Agents run autonomously, coordinating through a shared persistent store (database, filesystem, or document). No central Orchestrator. Each agent reads what others have written, takes action based on those findings, and writes its own discoveries back.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent A (academic literature) ─┐
Agent B (industry reports)    ─┤──→ Shared knowledge store
Agent C (patent filings)      ─┘    ↑ agents read each other's findings
                                    → iteratively deepen research
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Limitation:&lt;/strong&gt; Without explicit coordination, agents may duplicate work. The hardest failure mode is the &lt;strong&gt;reactive loop&lt;/strong&gt;: Agent A writes a finding → Agent B responds → Agent A reacts again — burning tokens indefinitely. &lt;strong&gt;Termination conditions must be first-class citizens:&lt;/strong&gt; time budgets, convergence thresholds (no new findings after N cycles), or a dedicated "am I done?" judge agent.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 4: Solving the "Goldfish Memory" Problem
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: The Context Tax
&lt;/h3&gt;

&lt;p&gt;Claude Code has a fundamental limitation: it's &lt;strong&gt;stateless&lt;/strong&gt;. Close the conversation window, memory resets.&lt;/p&gt;

&lt;p&gt;Every new session, you have to re-explain everything — project architecture, past decisions, coding style preferences, bugs you've already ruled out. You're paying a &lt;strong&gt;context tax&lt;/strong&gt; on every session just to get the AI back up to speed.&lt;/p&gt;

&lt;p&gt;Worse: this repeated loading is &lt;strong&gt;billed by token&lt;/strong&gt;. You're paying for compute that produces zero new value.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution 1: Claude Code's Native Memory (CLAUDE.md + Auto Memory)
&lt;/h3&gt;

&lt;p&gt;Claude Code offers two mechanisms for carrying knowledge across sessions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;CLAUDE.md Files&lt;/th&gt;
&lt;th&gt;Auto Memory&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Written by&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You&lt;/td&gt;
&lt;td&gt;Claude automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Contains&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Instructions and rules&lt;/td&gt;
&lt;td&gt;Learned patterns and preferences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Coding standards, architecture, workflows&lt;/td&gt;
&lt;td&gt;Build commands, debugging insights, behavior preferences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Loaded each session&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;First 200 lines&lt;/td&gt;
&lt;td&gt;First 200 lines&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;CLAUDE.md placement&lt;/strong&gt; determines scope, from most to least specific:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;.claude/CLAUDE.md&lt;/code&gt; (project-level, shared via version control)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt; (user-level, applies across all projects)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For larger projects, split rules into &lt;code&gt;.claude/rules/&lt;/code&gt; with one file per topic. Rules can also be &lt;strong&gt;path-scoped&lt;/strong&gt; using YAML frontmatter — only loaded when Claude is working on matching files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;src/api/**/*.ts"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="c1"&gt;# API Development Rules&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;All endpoints must include input validation&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Use the standard error response format&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Path-scoped rules reduce context noise — the relevant rules load when relevant, and stay out of the way otherwise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto Memory&lt;/strong&gt; stores Claude's self-generated notes at &lt;code&gt;~/.claude/projects/&amp;lt;project&amp;gt;/memory/&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;memory/&lt;/span&gt;
&lt;span class="s"&gt;├── MEMORY.md&lt;/span&gt;          &lt;span class="c1"&gt;# Index file, loaded every session&lt;/span&gt;
&lt;span class="s"&gt;├── debugging.md&lt;/span&gt;       &lt;span class="c1"&gt;# Detailed debugging notes&lt;/span&gt;
&lt;span class="s"&gt;├── api-conventions.md&lt;/span&gt; &lt;span class="c1"&gt;# API design decisions&lt;/span&gt;
&lt;span class="s"&gt;└── ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;/init&lt;/code&gt; to auto-generate a starter CLAUDE.md. Claude will update Auto Memory over time based on your corrections and preferences.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution 2: claude-mem — The Community's Context Tax Workaround
&lt;/h3&gt;

&lt;p&gt;The native solution is reactive (you correct → Claude updates). The open-source project &lt;a href="https://github.com/thedotmack/claude-mem" rel="noopener noreferrer"&gt;claude-mem&lt;/a&gt; is more aggressive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx claude-mem &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Core mechanism:&lt;/strong&gt; Attach a local memory store outside Claude Code. Hooks intercept every tool call, compress the interaction into a summary stored in SQLite, and on the next session inject only the semantically relevant history — not everything.&lt;/p&gt;

&lt;p&gt;Data flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tool call → PostToolUse Hook captures
         → Claude API call (Observer role)
         → Compresses to XML-format observation
         → Stored in SQLite + Chroma vector DB
         ↓
New session → SessionStart
         → Query last 50 observations + 10 summaries
         → Inject into Claude context
         ↓
User submits prompt → UserPromptSubmit
         → Semantic search for top 5 relevant observations
         → Precision-inject relevant history
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reported result from the project itself: &lt;strong&gt;95% token reduction&lt;/strong&gt; — 6 observations consumed 2,911 tokens to deliver work that would have taken 56,291 tokens with full context re-loading.&lt;/p&gt;

&lt;p&gt;The Observer uses a structured prompt to produce parseable XML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;observation&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;type&amp;gt;&lt;/span&gt;bugfix&lt;span class="nt"&gt;&amp;lt;/type&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;title&amp;gt;&lt;/span&gt;CarPlay startup disconnect root cause identified&lt;span class="nt"&gt;&amp;lt;/title&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;narrative&amp;gt;&lt;/span&gt;Root cause was IOKit initialization timing. Fix was to...&lt;span class="nt"&gt;&amp;lt;/narrative&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;facts&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;fact&amp;gt;&lt;/span&gt;Disconnect occurs 200ms after kIOMessageServiceIsTerminated event&lt;span class="nt"&gt;&amp;lt;/fact&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;fact&amp;gt;&lt;/span&gt;CarPlay framework begins handshake before driver init completes&lt;span class="nt"&gt;&amp;lt;/fact&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/facts&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/observation&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 5: Teaching the AI Your Habits — Continuous Learning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: AI Has No Muscle Memory
&lt;/h3&gt;

&lt;p&gt;You've used Claude Code for three months. It still doesn't know your coding style. Every new session, it's a new employee who knows nothing about you — doesn't know you prefer functional over OOP, doesn't remember your project's quirky conventions, has no record of how you solved a similar problem last week.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution: Hooks + Instinct System
&lt;/h3&gt;

&lt;p&gt;This architecture comes from the &lt;strong&gt;claude-code-everything&lt;/strong&gt; project and consists of two independent subsystems:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Subsystem A: Memory Persistence
  → Answers "what did I do last session?" 
  → Short-term memory, restores work state across sessions

Subsystem B: Instinct Learning (Continuous Learning)
  → Answers "what are the user's habits?"
  → Long-term learning, accumulates behavioral preferences
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both are triggered via Hooks and inject their results into Claude's context at session start.&lt;/p&gt;

&lt;h4&gt;
  
  
  What Are Hooks?
&lt;/h4&gt;

&lt;p&gt;Hooks are &lt;strong&gt;event-driven triggers&lt;/strong&gt; that fire before and after Claude Code tool calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User request → Claude selects tool → PreToolUse hook → Tool executes → PostToolUse hook
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hook Type&lt;/th&gt;
&lt;th&gt;Fires When&lt;/th&gt;
&lt;th&gt;Key Input&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PreToolUse&lt;/td&gt;
&lt;td&gt;Before tool execution&lt;/td&gt;
&lt;td&gt;tool_name, tool_input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostToolUse&lt;/td&gt;
&lt;td&gt;After tool completes&lt;/td&gt;
&lt;td&gt;tool_name, tool_input, tool_output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stop&lt;/td&gt;
&lt;td&gt;After each Claude response&lt;/td&gt;
&lt;td&gt;transcript_path (full session JSONL)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SessionStart&lt;/td&gt;
&lt;td&gt;Session begins&lt;/td&gt;
&lt;td&gt;session_id, cwd (can inject additionalContext)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SessionEnd&lt;/td&gt;
&lt;td&gt;Session ends&lt;/td&gt;
&lt;td&gt;session_id&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;PreToolUse hooks can control whether the tool runs: exit 0 continues, exit 2 aborts and surfaces the error to Claude.&lt;/p&gt;

&lt;h4&gt;
  
  
  What Does an Instinct Look Like?
&lt;/h4&gt;

&lt;p&gt;An Instinct is a single &lt;strong&gt;atomic behavioral preference&lt;/strong&gt; stored as a YAML file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grep-before-edit&lt;/span&gt;
&lt;span class="na"&gt;trigger&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;when&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;modifying&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;existing&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;code"&lt;/span&gt;
&lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.7&lt;/span&gt;
&lt;span class="na"&gt;domain&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;workflow&lt;/span&gt;
&lt;span class="na"&gt;scope&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;project&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="c1"&gt;# Grep Before Edit&lt;/span&gt;

&lt;span class="c1"&gt;## Action&lt;/span&gt;
&lt;span class="s"&gt;Use Grep to locate code before Edit to confirm exact location.&lt;/span&gt;

&lt;span class="c1"&gt;## Evidence&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Observed 8 times across sessions&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Grep → Read → Edit sequence repeated consistently&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Last observed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2026-04-16&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;confidence&lt;/code&gt;: 0.3–0.9, controls whether the instinct gets injected (threshold ≥ 0.7)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scope&lt;/code&gt;: &lt;code&gt;project&lt;/code&gt; (this project only) or &lt;code&gt;global&lt;/code&gt; (all projects)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;trigger + action&lt;/code&gt;: what Claude actually sees when this instinct is active&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  The Full Learning Pipeline
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;Tool call&lt;/span&gt;
  &lt;span class="s"&gt;→ observe.sh (async, non-blocking)&lt;/span&gt;
      &lt;span class="s"&gt;→ append to observations.jsonl&lt;/span&gt;
      &lt;span class="s"&gt;→ increment counter, send SIGUSR1 every 20 calls&lt;/span&gt;

&lt;span class="s"&gt;observer-loop.sh (background daemon)&lt;/span&gt;
  &lt;span class="s"&gt;→ receives SIGUSR1&lt;/span&gt;
  &lt;span class="s"&gt;→ take last 500 observations&lt;/span&gt;
  &lt;span class="s"&gt;→ spawn Claude Haiku analysis (claude --model haiku --print)&lt;/span&gt;
      &lt;span class="s"&gt;→ Haiku identifies behavioral patterns&lt;/span&gt;
      &lt;span class="s"&gt;→ writes instinct YAML by rule&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
          &lt;span class="s"&gt;3–5 occurrences  → confidence &lt;/span&gt;&lt;span class="m"&gt;0.5&lt;/span&gt;
          &lt;span class="s"&gt;6–10 occurrences → confidence &lt;/span&gt;&lt;span class="m"&gt;0.7&lt;/span&gt;
          &lt;span class="s"&gt;11+ occurrences  → confidence &lt;/span&gt;&lt;span class="m"&gt;0.85&lt;/span&gt;
  &lt;span class="s"&gt;→ archive analyzed observations&lt;/span&gt;

&lt;span class="s"&gt;New session → SessionStart&lt;/span&gt;
  &lt;span class="s"&gt;→ session-start.js reads instinct YAML files&lt;/span&gt;
  &lt;span class="s"&gt;→ filter confidence ≥ 0.7, take top &lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;
  &lt;span class="na"&gt;→ inject as additionalContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Active&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;instincts:&lt;/span&gt;
&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;[project&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;70%]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Use&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Grep&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;locate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;before&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Edit&lt;/span&gt;
&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;[global&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;85%]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Grep&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;before&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Edit,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Read&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;before&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Write"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using Haiku instead of Sonnet for analysis is a deliberate cost decision — pattern recognition doesn't need the most powerful model, and this process fires every 20 tool calls.&lt;/p&gt;




&lt;h2&gt;
  
  
  Putting It Together
&lt;/h2&gt;

&lt;p&gt;These four mechanisms address four distinct layers of AI Agent engineering:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AI drifts off-track on long tasks&lt;/td&gt;
&lt;td&gt;Harness Engineering (Guides + Sensors)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reliability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single agent failure modes, self-evaluation blindspot&lt;/td&gt;
&lt;td&gt;Multi-agent architecture (match pattern to problem)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Continuity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Every session starts from zero&lt;/td&gt;
&lt;td&gt;CLAUDE.md + Auto Memory + claude-mem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Growth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AI can't accumulate behavioral habits&lt;/td&gt;
&lt;td&gt;Hooks + Instinct continuous learning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The direction of travel is clear: from "re-teach the AI every session" to "the more you use it, the more it knows you." None of these solutions are theoretical novelties. They're engineering practices that emerged from real projects hitting real walls — and finding ways through them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article is based on the CarPlay bug analysis tool development experience from the Connected Car team. Shared for learning and discussion.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>harness</category>
      <category>multiagent</category>
      <category>claude</category>
    </item>
    <item>
      <title>RAG Series (24): Code RAG — Teaching AI to Understand Your Codebase</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Thu, 21 May 2026 12:52:10 +0000</pubDate>
      <link>https://forem.com/wonderlab/rag-series-24-code-rag-teaching-ai-to-understand-your-codebase-1jm6</link>
      <guid>https://forem.com/wonderlab/rag-series-24-code-rag-teaching-ai-to-understand-your-codebase-1jm6</guid>
      <description>&lt;h2&gt;
  
  
  The Difference Between Code and Documents
&lt;/h2&gt;

&lt;p&gt;Split a Python file into 1000-character chunks with &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt;, embed them, run vector search — this is the most common "code RAG" implementation. The problem is that it treats code as text:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_rag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;questions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;contexts&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Evaluate RAG system quality&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="err"&gt;（&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="err"&gt;）&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Character-based chunking will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Split functions in half (first half in chunk A, second half in chunk B)&lt;/li&gt;
&lt;li&gt;Lose function boundary information (this IS &lt;code&gt;evaluate_rag&lt;/code&gt;, not random text)&lt;/li&gt;
&lt;li&gt;Ignore call relationships (what this function calls, who calls it)&lt;/li&gt;
&lt;li&gt;Destroy structural hierarchy (this is a method of &lt;code&gt;RAGPipeline&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Code carries three layers of information: &lt;strong&gt;semantics&lt;/strong&gt; (what it does), &lt;strong&gt;structure&lt;/strong&gt; (function/class/module boundaries), &lt;strong&gt;call relationships&lt;/strong&gt; (who calls whom). Good code RAG models all three.&lt;/p&gt;

&lt;p&gt;This article uses &lt;a href="https://github.com/chendongqi/llm-in-action" rel="noopener noreferrer"&gt;llm-in-action&lt;/a&gt; as the target and builds a code RAG system capable of answering "how is this function used?" and "show me all call chains through this function."&lt;/p&gt;




&lt;h2&gt;
  
  
  Parse Code with AST, Not Character Offsets
&lt;/h2&gt;

&lt;p&gt;Python's &lt;code&gt;ast&lt;/code&gt; module parses source files into syntax trees. A function definition is a node (&lt;code&gt;ast.FunctionDef&lt;/code&gt;) with its exact start line, end line, and decorator list. Chunking at AST boundaries guarantees splits at function edges:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;_FuncExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NodeVisitor&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rel_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lines&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;splitlines&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_rel_path&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rel_path&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_class_stack&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;CodeUnit&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;visit_ClassDef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ClassDef&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Track current class so methods know their parent_class
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_class_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generic_visit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_class_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_visit_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract source by line number, not character offset
&lt;/span&gt;        &lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lines&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lineno&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_lineno&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;unit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CodeUnit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;         &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;kind&lt;/span&gt;         &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_class_stack&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nb"&gt;file&lt;/span&gt;         &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_rel_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;start_line&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lineno&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;end_line&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_lineno&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;source&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;docstring&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_docstring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;parent_class&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_class_stack&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_class_stack&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;calls&lt;/span&gt;        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_extract_calls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generic_visit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;visit_FunctionDef&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_visit_func&lt;/span&gt;
    &lt;span class="n"&gt;visit_AsyncFunctionDef&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_visit_func&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Call relationships are extracted from &lt;code&gt;ast.Call&lt;/code&gt; nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_extract_calls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;walk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Call&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;           &lt;span class="c1"&gt;# direct call: foo()
&lt;/span&gt;            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Attribute&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# attribute call: self.foo()
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Extraction results on llm-in-action
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scanned: /mnt/hdd/Database/03_Projects/LLM/llm-in-action
Time: 0.13 seconds

Python files:   22
Functions:      188 (top-level)
Methods:         37 (class methods)
Total units:    225
Article dirs:    18
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;0.13 seconds to scan the entire codebase. AST parsing doesn't execute code, so there are zero side effects.&lt;/p&gt;




&lt;h2&gt;
  
  
  Call Graph: Understanding Who Calls Whom
&lt;/h2&gt;

&lt;p&gt;Once function call relationships are extracted, build a bidirectional adjacency map — queryable in both directions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CallGraph&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;CodeUnit&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;callees&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# caller → called
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;callers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# callee → caller
&lt;/span&gt;
        &lt;span class="n"&gt;known&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;callee&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;callee&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;known&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;           &lt;span class="c1"&gt;# intra-repo edges only
&lt;/span&gt;                    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;callees&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;callee&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;callers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;callee&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;downstream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;All functions transitively called by name (BFS).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_bfs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;callees&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;All functions that transitively call name (BFS).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_bfs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;callers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;shortest_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Shortest call path from start → end.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;deque&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;deque&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
        &lt;span class="n"&gt;visited&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;popleft&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;nxt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;callees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;nxt&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;visited&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;visited&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nxt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;nxt&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Call graph analysis results
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Call graph statistics:
  Functions with outgoing edges:  78  (they call others)
  Functions with incoming edges:  92  (they are called)
  Total edges:                   168
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Most-called functions&lt;/strong&gt; (the codebase's core utilities):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;get               ← called from 48 places  (cache reads throughout all articles)
set               ← called from 10 places  (cache writes)
split_documents   ← called from  5 places  (shared chunking helper)
build_embeddings  ← called from  4 places
query             ← called from  4 places
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;get&lt;/code&gt; appearing 48 times reflects Python duck typing — cache &lt;code&gt;.get()&lt;/code&gt; calls across &lt;code&gt;SemanticCache&lt;/code&gt;, &lt;code&gt;InMemoryCache&lt;/code&gt;, and similar types all collapse to the same name in static analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Functions with the most outgoing calls&lt;/strong&gt; (orchestrators):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;main                → 54 direct calls
build_self_rag_graph →  6 direct calls
build_index          →  5 direct calls
build_ragas_dataset  →  5 direct calls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;main&lt;/code&gt; calling 54 functions is the signature of an entry point — it orchestrates the full pipeline by calling every sub-step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Call chain traversal
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;build_self_rag_graph&lt;/code&gt; (14-self-rag/self_rag.py) full downstream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;build_self_rag_graph
  ├── make_retrieve_node
  ├── make_filter_node
  ├── make_decide_node
  ├── make_support_node
  ├── make_rag_generate_node
  └── make_direct_generate_node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is exactly the Self-RAG StateGraph builder pattern: one factory function assembles all graph nodes, each node is an independent small function. The call graph makes this structure immediately visible.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;build_index&lt;/code&gt; (08-ragas-eval/rag_pipeline.py) downstream chain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;build_index
  → load_documents
  → build_llm
  → build_embeddings
  → split_documents
  → get  (cache)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A canonical RAG initialization sequence: load docs → build LLM → build embeddings → chunk → cache.&lt;/p&gt;




&lt;h2&gt;
  
  
  Vector Store: Semantic Code Search
&lt;/h2&gt;

&lt;p&gt;Code vectorization has one engineering constraint: function source can be long (50–200 lines), but embedding APIs commonly have a 512-token limit.&lt;/p&gt;

&lt;p&gt;Solution: &lt;strong&gt;separate the retrieval unit from the Q&amp;amp;A context&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embedding content&lt;/strong&gt;: function name + docstring (short, semantically precise, fits in token budget)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt;: complete source code (stored in Chroma's metadata field, read at Q&amp;amp;A time for LLM context)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sig_line&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;splitlines&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;embed_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;full_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;docstring&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;sig_line&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;page_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embed_content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;# vectorized — used for retrieval
&lt;/span&gt;    &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start_line&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;   &lt;span class="c1"&gt;# not vectorized — used for LLM context
&lt;/span&gt;    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At Q&amp;amp;A time, retrieval finds relevant functions, then the full source is read from metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;docs&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;similarity_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;---&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Semantic search results
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query: "RAGAS evaluation metrics calculation"
  0.488  RAGPipeline.build_index   (08-ragas-eval/rag_pipeline.py:95)
  0.476  create_ragas_embeddings   (08-ragas-eval/evaluate.py:50)
  0.467  RAGPipeline.query         (08-ragas-eval/rag_pipeline.py:141)

Query: "rate limiting and access control in enterprise RAG"
  0.504  RAGPipeline.__init__      (08-ragas-eval/rag_pipeline.py:78)
  0.497  RateLimiter.__init__      (20-enterprise-rag/enterprise_rag.py:118)

Query: "incremental document indexing with record manager"
  0.296  generate_testset          (08-ragas-eval/generate_qa.py:51)

Query: "conversational history aware retriever"
  0.400  make_ds                   (18-conversational-rag/conversational_rag.py:428)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RAGAS and enterprise RAG rate limiting queries found the right files. Incremental update didn't — because the functions in &lt;code&gt;19-incremental-update/&lt;/code&gt; don't mention "record manager" in their docstrings, only in their source code bodies. This is the core limitation of docstring-only embedding: &lt;strong&gt;search quality is bounded by docstring quality&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Choosing a Code Embedding Model
&lt;/h2&gt;

&lt;p&gt;General-purpose text embedding models (BGE, text-embedding-3) are "adequate but not great" for code. They can retrieve by docstring, but don't understand that &lt;code&gt;for i in range(n): acc += arr[i]&lt;/code&gt; is an accumulation.&lt;/p&gt;

&lt;p&gt;Specialized code embedding models:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Characteristics&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;microsoft/codebert-base&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Code + documentation dual-tower; understands variable names and signatures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Salesforce/codet5-base&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generative model; suited for code completion + retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;nomic-ai/nomic-embed-text-v1.5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;General model with strong code performance; 8192-token limit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;voyage-code-2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Voyage AI's code-specialized model; among the best available&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Recommended: if token limits aren't a concern (e.g., &lt;code&gt;nomic-embed-text-v1.5&lt;/code&gt; supports 8192 tokens), embed the complete function source directly — no need to split docstrings from source.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Complete Code RAG Pipeline
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Build a code RAG system
&lt;/span&gt;
&lt;span class="c1"&gt;# 1. AST extraction: all functions and methods
&lt;/span&gt;&lt;span class="n"&gt;units&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_repo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Call graph: bidirectional adjacency
&lt;/span&gt;&lt;span class="n"&gt;cg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CallGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Vector store: docstrings for retrieval, source_code in metadata for Q&amp;amp;A
&lt;/span&gt;&lt;span class="n"&gt;vs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_vectorstore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Three query modes
&lt;/span&gt;
&lt;span class="c1"&gt;# A: Semantic search — find functions by meaning
&lt;/span&gt;&lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;similarity_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding caching&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# B: Call chain — given a function name, find all upstream/downstream
&lt;/span&gt;&lt;span class="n"&gt;callers&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;build_embeddings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# → who calls it
&lt;/span&gt;&lt;span class="n"&gt;callees&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;downstream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;              &lt;span class="c1"&gt;# → what it calls
&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shortest_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# → how main reaches get
&lt;/span&gt;
&lt;span class="c1"&gt;# C: LLM Q&amp;amp;A — retrieve context, generate answer
&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;llm_code_qa&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How is incremental update implemented?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Results Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Python files&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code units extracted&lt;/td&gt;
&lt;td&gt;225 (188 functions + 37 methods)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AST parse time&lt;/td&gt;
&lt;td&gt;0.13 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Call graph edges&lt;/td&gt;
&lt;td&gt;168&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vectorization time&lt;/td&gt;
&lt;td&gt;5.8 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Most-called function&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;get&lt;/code&gt; (48 places)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Widest caller&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;main&lt;/code&gt; (54 direct calls)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Full Code
&lt;/h2&gt;

&lt;p&gt;Complete code is open-sourced at:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/chendongqi/llm-in-action/tree/main/24-code-rag" rel="noopener noreferrer"&gt;https://github.com/chendongqi/llm-in-action/tree/main/24-code-rag&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;code_rag.py&lt;/code&gt; — AST extraction, call graph, vectorization, search, report&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How to run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/chendongqi/llm-in-action
&lt;span class="nb"&gt;cd &lt;/span&gt;24-code-rag
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python code_rag.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The core difference between code RAG and document RAG:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Document RAG&lt;/th&gt;
&lt;th&gt;Code RAG&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chunk unit&lt;/td&gt;
&lt;td&gt;Fixed-size text blocks&lt;/td&gt;
&lt;td&gt;Functions/methods (AST boundaries)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structure&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Class hierarchy, module hierarchy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Call relationships&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Call graph (bidirectional)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding content&lt;/td&gt;
&lt;td&gt;Full text&lt;/td&gt;
&lt;td&gt;Docstring + signature (or full source)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query types&lt;/td&gt;
&lt;td&gt;Semantic search&lt;/td&gt;
&lt;td&gt;Semantic search + call chain traversal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three key tradeoffs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AST vs text chunking&lt;/strong&gt;: AST cuts at function boundaries and preserves complete units. Text chunking is faster but destroys structure. For production code RAG, use AST — there's no reason not to.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docstring vs full source embedding&lt;/strong&gt;: Under token constraints, embed docstrings (short and semantically focused) — but quality depends on docstring completeness. With a long-context embedding model, embed the full source directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call graph vs pure vector retrieval&lt;/strong&gt;: Vector retrieval finds semantically similar functions; the call graph answers "what does X call?" and "who uses X?" — they're complementary, not interchangeable.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;This is the final article in the RAG series. Twenty-four articles covering the complete path from "what is RAG?" to "how do you teach AI to understand a codebase?" All code is open-sourced at &lt;a href="https://github.com/chendongqi/llm-in-action" rel="noopener noreferrer"&gt;llm-in-action&lt;/a&gt; — every article has a runnable demo and a real benchmark report.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.python.org/3/library/ast.html" rel="noopener noreferrer"&gt;Python ast module documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2002.08155" rel="noopener noreferrer"&gt;CodeBERT: A Pre-Trained Model for Programming and Natural Languages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2004.12832" rel="noopener noreferrer"&gt;ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://python.langchain.com/docs/use_cases/code_understanding/" rel="noopener noreferrer"&gt;LangChain Code Understanding&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
    </item>
    <item>
      <title>Agent Series (1): What Is an Agent — It's Not Just an LLM That Can Call Tools</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Thu, 21 May 2026 07:43:56 +0000</pubDate>
      <link>https://forem.com/wonderlab/agent-series-1-what-is-an-agent-its-not-just-an-llm-that-can-call-tools-4plh</link>
      <guid>https://forem.com/wonderlab/agent-series-1-what-is-an-agent-its-not-just-an-llm-that-can-call-tools-4plh</guid>
      <description>&lt;h2&gt;
  
  
  You Think You're Using an Agent. You're Not.
&lt;/h2&gt;

&lt;p&gt;In 2023, "AI Agent" became a buzzword overnight. Every company claimed they built an Agent. Every product slapped the Agent label on it.&lt;/p&gt;

&lt;p&gt;But ask them: &lt;strong&gt;What's the fundamental difference between your Agent and a regular LLM call?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most people go quiet for three seconds, then say: "...it can call tools."&lt;/p&gt;

&lt;p&gt;Is that wrong? No. But it's missing the point. It's like answering "what's the difference between a car and a bicycle" with "a car has four wheels" — technically correct, but you forgot to mention the engine.&lt;/p&gt;

&lt;p&gt;This article has one goal: &lt;strong&gt;help you understand what an Agent actually is&lt;/strong&gt; — and why it's fundamentally different from an LLM or a Chatbot. Get this right, and you'll make better technical decisions instead of wrapping an LLM API call and calling it "our Agent system."&lt;/p&gt;




&lt;h2&gt;
  
  
  Start With a Scenario
&lt;/h2&gt;

&lt;p&gt;Say you want to build an AI tool that analyzes competitors for users. The user types a company name, and the tool generates a competitive analysis report.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A: Direct LLM call&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User input: Analyze Notion's competitors
↓
LLM generates report directly
↓
Output (based on training data, potentially outdated)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Option B: Chatbot&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User input: Analyze Notion's competitors
↓
LLM generates reply, remembers conversation history
User follow-up: Focus on pricing strategy
↓
LLM continues with context
↓
Multi-turn conversation, still based on training data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Option C: Agent&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User input: Analyze Notion's competitors
↓
Agent thinks: I need fresh data, let me search first
↓
Calls search tool → gets latest competitor info
↓
Agent thinks: I should compare pricing, let me calculate
↓
Calls calculation tool → gets result
↓
Agent thinks: I have enough information now
↓
Outputs report (grounded in real-time data, with sources)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See the difference? &lt;strong&gt;An Agent actively thinks "what do I need to do" and autonomously decides the next action.&lt;/strong&gt; That's the core — not whether it can call tools, but &lt;strong&gt;who decides which tool to call and when&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Concepts, Three Levels
&lt;/h2&gt;

&lt;h3&gt;
  
  
  LLM: A "Brain" with Language Ability
&lt;/h3&gt;

&lt;p&gt;A Large Language Model is fundamentally a &lt;strong&gt;function&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: text (prompt)
Output: predicted next token (repeated until done)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Its capabilities come from statistical patterns learned from massive amounts of text. It understands language, it can reason — but it has &lt;strong&gt;no memory, no perception, no action capability&lt;/strong&gt;. Every call is stateless. It has no idea what you talked about last time.&lt;/p&gt;

&lt;p&gt;A standalone LLM is like a brilliant scholar who only answers questions: deeply knowledgeable, but locked in a room with no windows, unaware of what's happening outside, unable to proactively do anything for you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chatbot: An LLM with Memory
&lt;/h3&gt;

&lt;p&gt;Chatbot = LLM + &lt;strong&gt;conversation history management&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It solves one simple problem: making the LLM "remember" what was said in this conversation. The implementation is also simple — prepend conversation history to every prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pseudocode: the core logic of a Chatbot
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_user_input&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# send full history to LLM
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The limitation of a Chatbot: &lt;strong&gt;it can converse, but it can't act&lt;/strong&gt;. It can't proactively look up information, call APIs, or run code — it can only answer based on what it already knows.&lt;/p&gt;

&lt;p&gt;If the LLM is the brilliant scholar, the Chatbot is that scholar with a phone — you can finally have a conversation, but they're still in their room.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent: An Autonomous Actor
&lt;/h3&gt;

&lt;p&gt;An Agent adds two critical capabilities on top of a Chatbot: &lt;strong&gt;tool use&lt;/strong&gt; and &lt;strong&gt;autonomous decision-making loop&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But here's the key point that's often misunderstood: &lt;strong&gt;the tools themselves aren't what makes something an Agent. What matters is who decides which tool to use and when.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chatbot with tools&lt;/strong&gt; = you tell it "check the weather," it calls the weather API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent&lt;/strong&gt; = it decides on its own that "to answer this question, I need to check the weather," then proactively calls it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the difference between &lt;strong&gt;passive response&lt;/strong&gt; and &lt;strong&gt;active planning&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Elements of an Agent
&lt;/h2&gt;

&lt;p&gt;The clearest framework for understanding an Agent is to break it into four components. This framework draws from cognitive science research on intelligent behavior and represents the mainstream engineering understanding today.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────┐
│                        Agent                            │
│                                                         │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐  │
│  │  Perception │    │   Memory    │    │   Action    │  │
│  │             │    │             │    │             │  │
│  │ · User msgs │    │ · Chat hist │    │ · Call tools│  │
│  │ · Tool results    · Tool results    · Run code   │  │
│  │ · Environment│   · External KB│    │ · Call APIs │  │
│  └──────┬──────┘    └──────┬──────┘    └──────▲──────┘  │
│         │                  │                  │         │
│         └──────────┬───────┘                  │         │
│                    ▼                           │         │
│             ┌─────────────┐                   │         │
│             │  Reasoning  │───────────────────┘         │
│             │             │                             │
│             │ · Plan steps│                             │
│             │ · Choose tool                             │
│             │ · Decide done                             │
│             └─────────────┘                             │
└─────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Perception&lt;/strong&gt;: What the Agent can "see." At minimum, user input. More advanced: tool return values, database query results, screenshots, file contents. Perception defines the Agent's awareness — what it can't see, it can't act on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt;: What the Agent can "remember." This operates at several levels: current conversation history (short-term memory), past experiences stored in vector databases (long-term memory), and static external knowledge bases (semantic memory). We'll dedicate a full article to memory systems later in this series.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reasoning&lt;/strong&gt;: The Agent's "brain," and the most essential difference from a Chatbot. The LLM here acts as a &lt;strong&gt;controller&lt;/strong&gt;, not a "question answerer." Its job: decompose the task, plan the steps, choose which tool to use next, decide when the task is complete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt;: What the Agent can "do." Tool calls are the most common action — search, query a database, send an email, execute code. The range of actions defines the Agent's capability boundary — more tools means more tasks it can handle, but also higher risk of things going wrong (this is what Harness Engineering, a later topic, addresses).&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
These four elements are the foundation for understanding Agent architecture, and the central thread running through the rest of this series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Perception + Memory → Article 6: &lt;em&gt;Memory Management&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Reasoning → Article 2: &lt;em&gt;The ReAct Paradigm&lt;/em&gt;, Article 3: &lt;em&gt;Plan-and-Solve&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Action → Article 4: &lt;em&gt;Tool Calling&lt;/em&gt;, Article 5: &lt;em&gt;Intent Recognition&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Two Agent Paradigms: Assembly Line vs. Expedition Guide
&lt;/h2&gt;

&lt;p&gt;Real-world Agent systems break into two camps based on who controls the execution flow:&lt;/p&gt;
&lt;h3&gt;
  
  
  Workflow-Driven Agent
&lt;/h3&gt;

&lt;p&gt;Representative tools: Dify, n8n, Coze, Zapier AI&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core idea&lt;/strong&gt;: The developer draws the flowchart; LLM is one node in it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Flowchart (defined by developer in advance):
Receive user question
    ↓
[LLM node] Classify question type
    ↓
If "billing question"  → [Tool node] Query billing system
If "complaint"         → [Tool node] Create support ticket
    ↓
[LLM node] Generate final reply
    ↓
Send to user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The execution path is &lt;strong&gt;pre-designed by the developer&lt;/strong&gt;. The LLM handles natural language understanding and generation, but the "what happens next" logic is hardcoded in the flowchart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Behavior is predictable; every path can be fully tested before launch&lt;/li&gt;
&lt;li&gt;Easy to debug when something goes wrong (broken node is obvious)&lt;/li&gt;
&lt;li&gt;Doesn't require the LLM to understand complex task planning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer service bots (question types are fixed, processes are known)&lt;/li&gt;
&lt;li&gt;Approval flow automation (steps are fixed, conditions are clear)&lt;/li&gt;
&lt;li&gt;Form processing, data ETL (structured, predictable)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  AI Native Agent
&lt;/h3&gt;

&lt;p&gt;Representative frameworks: LangGraph, AutoGen, CrewAI&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core idea&lt;/strong&gt;: LLM is the control center and decides what to do.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User question arrives
    ↓
[LLM reasoning]: I need current data, should search first
    ↓
[Calls search tool] → results returned
    ↓
[LLM reasoning]: A number in the results needs verification
    ↓
[Calls calculation tool] → result returned
    ↓
[LLM reasoning]: I have enough to answer now
    ↓
Final answer output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every "what to do next" step is dynamically decided by the LLM at runtime. Nobody hardcoded the flow. This is the essence of AI Native Agent: &lt;strong&gt;the LLM isn't a tool — the LLM is the conductor&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handles open-ended tasks with unclear boundaries&lt;/li&gt;
&lt;li&gt;Adapts strategy based on intermediate results&lt;/li&gt;
&lt;li&gt;Suited for problems requiring multi-step reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open-ended research (user questions are diverse, impossible to enumerate)&lt;/li&gt;
&lt;li&gt;Automated bug fixing (requires dynamic decisions based on code analysis)&lt;/li&gt;
&lt;li&gt;Complex data analysis (needs multiple rounds of retrieval and computation)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  An Analogy to Remember
&lt;/h3&gt;

&lt;p&gt;Imagine planning a trip:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workflow-Driven Agent&lt;/strong&gt; = high-speed rail. Fixed tracks, fixed stops, fixed departure times. Highly efficient, never gets lost — but can only go where the rails go.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Native Agent&lt;/strong&gt; = an experienced travel guide. You say "I want somewhere with historical character," and they ask a few questions, check reviews in real time, adjust the itinerary based on today's weather, and handle "the attraction is temporarily closed" on the fly. Flexible — but they might also take you on a detour.&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Use an Agent vs. a Plain LLM Call
&lt;/h2&gt;

&lt;p&gt;This is the most important engineering judgment you'll make — and the most common place for over-engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trap many fall into&lt;/strong&gt;: Agent sounds sophisticated, so people reach for it regardless of the problem. But Agents have costs — longer response times, higher token consumption, more complex debugging.&lt;/p&gt;

&lt;p&gt;Use this decision tree:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Does your task need an Agent?
│
├─ Are the task steps fixed and enumerable?
│   └─ Yes → use LLM + fixed Prompt, or Workflow-Driven Agent
│
├─ Does the task only need a single LLM call (no tools)?
│   └─ Yes → call the LLM API directly, no Agent needed
│
├─ Does the task need to decide the next step based on intermediate results?
│   └─ Yes → needs an Agent
│
├─ Does the task have more than 3 interdependent steps?
│   └─ Yes → needs an Agent
│
└─ Does the task need to handle situations you can't predict in advance?
    └─ Yes → needs an Agent (and specifically AI Native Agent)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Real examples:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recommended Approach&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Article summarization&lt;/td&gt;
&lt;td&gt;Direct LLM call&lt;/td&gt;
&lt;td&gt;Single call, fixed prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FAQ chatbot&lt;/td&gt;
&lt;td&gt;Chatbot&lt;/td&gt;
&lt;td&gt;Multi-turn needed, no tools required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customer service routing&lt;/td&gt;
&lt;td&gt;Workflow-Driven Agent&lt;/td&gt;
&lt;td&gt;Fixed flow, enumerable cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automated bug analysis &amp;amp; fix&lt;/td&gt;
&lt;td&gt;AI Native Agent&lt;/td&gt;
&lt;td&gt;Dynamic decisions based on code analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Competitive research report&lt;/td&gt;
&lt;td&gt;AI Native Agent&lt;/td&gt;
&lt;td&gt;Open-ended, needs multi-round search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code review&lt;/td&gt;
&lt;td&gt;AI Native Agent&lt;/td&gt;
&lt;td&gt;Dynamic, depends on code structure&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
&lt;strong&gt;Don't use an Agent just because you can&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your task can be solved with a well-crafted Prompt, use the Prompt. The added complexity of an Agent (harder to debug, higher latency, higher cost) is only worth it when the task genuinely requires dynamic decision-making.&lt;/p&gt;

&lt;p&gt;Anthropic's official guidance says it plainly: &lt;em&gt;"LLMs should only be used as autonomous agents when autonomy and flexibility genuinely provide value — otherwise, direct API calls are more reliable and predictable."&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  What a Minimal AI Native Agent Looks Like
&lt;/h2&gt;

&lt;p&gt;Enough theory — here's real code. Below is a minimal ReAct Agent built with LangGraph (this is the most foundational AI Native Agent paradigm; the next article covers it in depth):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Dependencies: pip install langchain-anthropic langgraph
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatAnthropic&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.prebuilt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_react_agent&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Define tools (the Agent's "hands")
&lt;/span&gt;&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_web&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Search the web for current information&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# In real use, connect to a real search API (e.g., Tavily)
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search results: latest information about &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Evaluate a mathematical expression&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Note: don't use eval in production
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Calculation error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Create the Agent (LLM is the control center)
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatAnthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;search_web&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calculate&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_react_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Run
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is Apple&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s current market cap? How much more is that than $1 trillion?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this code, the Agent will automatically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Decide it needs to search for Apple's market cap&lt;/li&gt;
&lt;li&gt;Call &lt;code&gt;search_web&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;See the result, decide it needs to compute the difference&lt;/li&gt;
&lt;li&gt;Call &lt;code&gt;calculate&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Combine the results into a final answer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Nobody told it to search first and then calculate.&lt;/strong&gt; It planned that on its own. That's how an AI Native Agent works.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
&lt;strong&gt;Notes on the code above&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;eval(expression)&lt;/code&gt; has security implications in production; replace with a safe math library (e.g., &lt;code&gt;numexpr&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;A real search tool requires connecting to a search API like Tavily or SerpAPI&lt;/li&gt;
&lt;li&gt;The model &lt;code&gt;claude-sonnet-4-6&lt;/code&gt; is the recommended version as of this article's writing (May 2026); adjust as needed
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How to Explain This in an Interview
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Common interview question: Is your system an Agent or a Workflow? What's the difference?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many candidates stumble here because they've never seriously considered which one they actually built.&lt;/p&gt;

&lt;p&gt;A clear response framework:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Our system uses an AI Native Agent architecture, and the core distinction is &lt;strong&gt;who controls the execution flow&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In a Workflow-Driven approach, the developer pre-defines all possible paths, and the LLM is just one processing node — it's more predictable and well-suited for fixed-step scenarios.&lt;/p&gt;

&lt;p&gt;We chose AI Native Agent because our tasks (like automated bug analysis) have unclear boundaries — the code might span multiple modules, and we need to dynamically decide what to retrieve next based on each intermediate analysis result. A Workflow-Driven approach can't enumerate all possible code scenarios.&lt;/p&gt;

&lt;p&gt;Of course, the more autonomous the Agent, the higher the risk. That's why we added execution boundary controls (Harness Engineering) to ensure it never performs operations beyond its authorized scope."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The key to this answer: &lt;strong&gt;don't just say "I used an Agent." Explain why you chose it and show that you're aware of the trade-offs.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Three things from this article:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The hierarchy of LLM, Chatbot, and Agent&lt;/strong&gt;: LLM is the brain, Chatbot is the brain with memory, Agent is a complete system that can autonomously plan and act. The core difference isn't "can it call tools" — it's "who decides when to call which tool."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The four elements of an Agent&lt;/strong&gt;: Perception (what it sees), Memory (what it remembers), Reasoning (what it plans), Action (what it executes). The LLM plays the role of "conductor," not "executor."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The selection logic between the two paradigms&lt;/strong&gt;: Workflow-Driven suits fixed, predictable tasks; AI Native Agent suits open-ended tasks requiring dynamic decision-making. Don't use an Agent because it sounds impressive — use the right tool for the job.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Next up&lt;/strong&gt;: Agent Series Article 2 — &lt;strong&gt;ReAct: The Most Important Reasoning Paradigm for Agents&lt;/strong&gt;. We'll dig into the Thought → Action → Observation loop, explore "what the Agent is thinking," and explain why Chain-of-Thought alone isn't enough.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/datawhalechina/Hello-Agents" rel="noopener noreferrer"&gt;hello-agents Open Tutorial&lt;/a&gt; (Chapter 1 and Chapter 4)&lt;/li&gt;
&lt;li&gt;Anthropic, &lt;em&gt;Building Effective Agents&lt;/em&gt;, 2024&lt;/li&gt;
&lt;li&gt;OpenAI, &lt;em&gt;A Practical Guide to Building Agents&lt;/em&gt;, 2025&lt;/li&gt;
&lt;li&gt;LangGraph Documentation: &lt;a href="https://langchain-ai.github.io/langgraph/" rel="noopener noreferrer"&gt;langchain-ai.github.io/langgraph&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This is the first article in the Agent Engineering series. If you're just starting out with Agents, start here and read in order. Questions or feedback? Leave a comment below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>llm</category>
      <category>ai</category>
      <category>langchain</category>
    </item>
    <item>
      <title>One Open Source Project a Day (No. 71): CodeGraph — Pre-Index Your Codebase for AI Agents, Save 35% Cost and 70% Tool Calls</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Thu, 21 May 2026 01:51:49 +0000</pubDate>
      <link>https://forem.com/wonderlab/one-open-source-project-a-day-no-71-codegraph-pre-index-your-codebase-for-ai-agents-save-35-50f3</link>
      <guid>https://forem.com/wonderlab/one-open-source-project-a-day-no-71-codegraph-pre-index-your-codebase-for-ai-agents-save-35-50f3</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"~35% cheaper · ~70% fewer tool calls · 100% local"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the No.71 article in the "One Open Source Project a Day" series. Today we are exploring &lt;strong&gt;CodeGraph&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Start with a scenario: you ask Claude Code "How is AuthService being called?" Without any assistance, Claude's approach is: glob-scan directories, run multiple greps, read several files — then finally answer. The whole process might trigger 10–15 tool calls and consume hundreds of thousands of tokens.&lt;/p&gt;

&lt;p&gt;CodeGraph's insight is to &lt;strong&gt;front-load this work&lt;/strong&gt;: before you start, it has already parsed your codebase with tree-sitter into a semantic graph stored in a local SQLite database, then exposes 8 query tools to AI agents via MCP. When the agent needs to understand code, a single &lt;code&gt;codegraph_context&lt;/code&gt; call returns entry points, related symbols, and code snippets — &lt;strong&gt;no file reading required&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;9.6k Stars, 588 Forks. Benchmarks across 7 real open-source projects: average 35% cost savings, 70% fewer tool calls, 49% speed improvement. On VS Code's large TypeScript repository, one architecture Q&amp;amp;A dropped from 1.4M tokens to 393k — cost from $0.64 to $0.42.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You Will Learn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;CodeGraph's four-stage pipeline: Extract → Store → Resolve → Auto-Sync&lt;/li&gt;
&lt;li&gt;The 8 MCP tools and when to use each&lt;/li&gt;
&lt;li&gt;A detailed breakdown of benchmark results across 7 projects: why do larger codebases benefit more?&lt;/li&gt;
&lt;li&gt;How 19-language support and 13-framework route recognition work&lt;/li&gt;
&lt;li&gt;Complete setup walkthrough from installation to Claude Code integration&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;codegraph affected&lt;/code&gt;: using dependency tracing for smart CI test selection&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Familiarity with Claude Code, Cursor, or similar AI coding tools&lt;/li&gt;
&lt;li&gt;Basic understanding of MCP (Model Context Protocol)&lt;/li&gt;
&lt;li&gt;Node.js experience&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Project Introduction
&lt;/h3&gt;

&lt;p&gt;CodeGraph is a &lt;strong&gt;local semantic code knowledge graph&lt;/strong&gt; tool designed specifically to improve AI coding agent efficiency. Its core insight:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AI agents spend a massive amount of tokens and time in the "discovery phase" — scanning directories, searching for symbols, reading files — rather than on the actual reasoning and generation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;CodeGraph's solution is to &lt;strong&gt;outsource the discovery phase to a pre-built index&lt;/strong&gt;: before you start working, the index is already ready, letting AI agents pull structured code knowledge directly instead of exploring the file system from scratch.&lt;/p&gt;

&lt;p&gt;The technology choices are pragmatic: tree-sitter for AST parsing (mature, multi-language, high-performance), SQLite FTS5 for full-text search (zero external dependencies, fully local), and native OS file events for live sync (FSEvents/inotify/ReadDirectoryChangesW).&lt;/p&gt;

&lt;h3&gt;
  
  
  Author/Team
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Author&lt;/strong&gt;: Colby McHenry (GitHub: colbymchenry)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/colbymchenry/codegraph" rel="noopener noreferrer"&gt;colbymchenry/codegraph&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distribution&lt;/strong&gt;: npm package &lt;code&gt;@colbymchenry/codegraph&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project Stats
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;⭐ GitHub Stars: &lt;strong&gt;9,600+&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;🍴 Forks: &lt;strong&gt;588&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;📦 npm package: &lt;code&gt;@colbymchenry/codegraph&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;🔧 Runtime: Node.js 20–24&lt;/li&gt;
&lt;li&gt;💻 Platforms: Windows, macOS, Linux&lt;/li&gt;
&lt;li&gt;📄 License: MIT&lt;/li&gt;
&lt;li&gt;🌐 Repository: &lt;a href="https://github.com/colbymchenry/codegraph" rel="noopener noreferrer"&gt;colbymchenry/codegraph&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Main Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Utility
&lt;/h3&gt;

&lt;p&gt;CodeGraph inserts a pre-built index layer between AI agents and codebases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Codebase (TypeScript / Python / Go / ...)
        ↓ tree-sitter parsing
  Semantic graph (symbols + relationships + call chains)
        ↓ stored in SQLite FTS5
  Local knowledge base
        ↓ exposed via MCP
  AI coding agents (Claude Code / Cursor / Codex CLI / OpenCode)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Without CodeGraph&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;User: &lt;span class="s2"&gt;"How is AuthService being called?"&lt;/span&gt;
→ Agent: glob&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"src/**/*.ts"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;         &lt;span class="c"&gt;# Tool call 1&lt;/span&gt;
→ Agent: &lt;span class="nb"&gt;grep&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"AuthService"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;         &lt;span class="c"&gt;# Tool call 2&lt;/span&gt;
→ Agent: &lt;span class="nb"&gt;read&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"auth.service.ts"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;     &lt;span class="c"&gt;# Tool call 3&lt;/span&gt;
→ Agent: &lt;span class="nb"&gt;grep&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"import.*Auth"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;        &lt;span class="c"&gt;# Tool call 4&lt;/span&gt;
→ Agent: &lt;span class="nb"&gt;read&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"user.controller.ts"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;# Tool call 5&lt;/span&gt;
→ Agent: &lt;span class="nb"&gt;read&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"app.module.ts"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;       &lt;span class="c"&gt;# Tool call 6&lt;/span&gt;
... 10–15 total tool calls, massive token consumption
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With CodeGraph&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;User: &lt;span class="s2"&gt;"How is AuthService being called?"&lt;/span&gt;
→ Agent: codegraph_callers&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"AuthService"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;   &lt;span class="c"&gt;# Tool call 1&lt;/span&gt;
→ Returns: full &lt;span class="nb"&gt;caller &lt;/span&gt;list + call sites + code snippets
→ Agent answers directly, no file reading needed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;One-command install (recommended)&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run the interactive installer — auto-detects installed AI agents and configures them&lt;/span&gt;
npx @colbymchenry/codegraph

&lt;span class="c"&gt;# Initialize in your project (-i for interactive)&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
codegraph init &lt;span class="nt"&gt;-i&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Non-interactive install (CI environments)&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Auto-detect all installed agents, global install&lt;/span&gt;
codegraph &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--yes&lt;/span&gt;

&lt;span class="c"&gt;# Target specific agents&lt;/span&gt;
codegraph &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cursor,claude &lt;span class="nt"&gt;--yes&lt;/span&gt;

&lt;span class="c"&gt;# Project-local install&lt;/span&gt;
codegraph &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;auto &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;local&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Manual Claude Code configuration&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @colbymchenry/codegraph
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add to &lt;code&gt;~/.claude.json&lt;/code&gt; (or project-level &lt;code&gt;.claude.json&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"codegraph"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stdio"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"codegraph"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"serve"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify installation&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;codegraph status          &lt;span class="c"&gt;# Check index status and stats&lt;/span&gt;
codegraph query &lt;span class="s2"&gt;"UserService"&lt;/span&gt;  &lt;span class="c"&gt;# Test symbol search&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The 8 MCP Tools
&lt;/h3&gt;

&lt;p&gt;The complete toolset CodeGraph exposes to AI agents:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Typical Invocation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;codegraph_search&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Find symbols by name&lt;/td&gt;
&lt;td&gt;"Find all functions called authenticate"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;codegraph_context&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Build code context for a task&lt;/td&gt;
&lt;td&gt;"What code is relevant to the login flow?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;codegraph_callers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Find what calls a function&lt;/td&gt;
&lt;td&gt;"What calls AuthService?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;codegraph_callees&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Find what a function calls&lt;/td&gt;
&lt;td&gt;"What does processPayment call internally?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;codegraph_impact&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Analyze change impact radius&lt;/td&gt;
&lt;td&gt;"What breaks if I change this function?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;codegraph_node&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Get details about a specific symbol&lt;/td&gt;
&lt;td&gt;"Show me UserController's full signature"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;codegraph_files&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Get indexed file structure&lt;/td&gt;
&lt;td&gt;"What is the overall project structure?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;codegraph_status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check index health and stats&lt;/td&gt;
&lt;td&gt;"How many symbols are indexed? Last sync?"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;codegraph_context&lt;/code&gt; is the most important tool&lt;/strong&gt; — it doesn't just return search results; it intelligently assembles a comprehensive context package for a given task, including entry points, related symbols, and code snippets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Command-line equivalent&lt;/span&gt;
codegraph context &lt;span class="s2"&gt;"fix user login bug"&lt;/span&gt;
&lt;span class="c"&gt;# → Automatically finds login-related functions, call chains, and relevant files&lt;/span&gt;
&lt;span class="c"&gt;#   packaged into context Claude can consume directly&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Project Advantages
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;CodeGraph&lt;/th&gt;
&lt;th&gt;Native AI Agent (no assist)&lt;/th&gt;
&lt;th&gt;Other code indexers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool call count&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~70% fewer&lt;/td&gt;
&lt;td&gt;High (re-scans each task)&lt;/td&gt;
&lt;td&gt;Partial reduction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token usage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~59% fewer&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Partial reduction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data privacy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100% local&lt;/td&gt;
&lt;td&gt;Depends on agent&lt;/td&gt;
&lt;td&gt;Most require uploads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Real-time sync&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native OS file events&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Usually polling or manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;19+ languages&lt;/td&gt;
&lt;td&gt;Depends on agent&lt;/td&gt;
&lt;td&gt;Usually 3–5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Framework route detection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;13 frameworks&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Rare&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Installation complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One npx command&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Usually requires server&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Detailed Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The Four-Stage Pipeline
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Extraction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;tree-sitter parses source files into ASTs, extracting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Symbols&lt;/strong&gt;: functions, classes, methods, interfaces, variable definitions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relationships&lt;/strong&gt;: function calls, module imports, class inheritance, interface implementations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;tree-sitter's key advantage: it is a &lt;strong&gt;fault-tolerant parser&lt;/strong&gt; — it can extract partial structure even when code has syntax errors. This is critical for indexing files that are actively being edited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: Storage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All data lands in a local SQLite database using the FTS5 (Full-Text Search 5) extension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Symbols table (simplified)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;VIRTUAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;symbols&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;fts5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;-- Symbol name&lt;/span&gt;
  &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;-- function/class/method/...&lt;/span&gt;
  &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;-- Source file&lt;/span&gt;
  &lt;span class="n"&gt;line_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;-- Starting line&lt;/span&gt;
  &lt;span class="n"&gt;signature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;-- Function signature&lt;/span&gt;
  &lt;span class="n"&gt;docstring&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;-- Documentation comment&lt;/span&gt;
  &lt;span class="n"&gt;code_snippet&lt;/span&gt;   &lt;span class="c1"&gt;-- Code excerpt&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Relationships table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;edges&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;from_id&lt;/span&gt;  &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;-- Caller symbol ID&lt;/span&gt;
  &lt;span class="n"&gt;to_id&lt;/span&gt;    &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;-- Callee symbol ID&lt;/span&gt;
  &lt;span class="n"&gt;kind&lt;/span&gt;     &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;-- calls/imports/inherits/implements&lt;/span&gt;
  &lt;span class="n"&gt;file&lt;/span&gt;     &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;line&lt;/span&gt;     &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stage 3: Resolution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The critical step: resolving abstract "called something named X" into concrete "called the definition in file Y at line Z."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;Source&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AuthService&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./auth.service&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
             &lt;span class="p"&gt;...&lt;/span&gt;
             &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;authService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;login&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="err"&gt;↓&lt;/span&gt; &lt;span class="nx"&gt;resolution&lt;/span&gt;
&lt;span class="nx"&gt;Graph&lt;/span&gt; &lt;span class="nx"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;UserController&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;login&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nx"&gt;AuthService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;login &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
             &lt;span class="nx"&gt;UserController&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;AuthService &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;imports&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stage 4: Auto-Sync&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Uses native OS file events (not polling!) to detect changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;macOS: &lt;code&gt;FSEvents&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Linux: &lt;code&gt;inotify&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Windows: &lt;code&gt;ReadDirectoryChangesW&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A &lt;strong&gt;2-second debounce&lt;/strong&gt; prevents triggering mass rebuilds when files change rapidly — it waits for changes to settle before doing incremental updates.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Benchmark Deep Dive
&lt;/h3&gt;

&lt;p&gt;Test conditions: Claude Code (headless, Opus 4.7) answering architecture questions. Each result is the median of 4 runs on the same question, across 7 real open-source repositories.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Project        Language       Size            Cost ↓  Token ↓  Speed ↑  Tool Calls ↓
──────────────────────────────────────────────────────────────────────────────────────
VS Code        TypeScript     ~10k files      35%     73%      41%      72%
Excalidraw     TypeScript     ~600 files      47%     73%      60%      86%
Django         Python         ~2.7k files     34%     64%      59%      81%
Tokio          Rust           ~700 files      52%     81%      63%      89%
OkHttp         Java           ~640 files      17%     41%      36%      64%
Gin            Go             ~150 files      22%     23%      34%      19%
Alamofire      Swift          ~100 files      38%     59%      51%      77%
──────────────────────────────────────────────────────────────────────────────────────
Average                                       35%     59%      49%      70%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Patterns worth noting&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tokio (Rust, 700 files) sees the biggest gains&lt;/strong&gt; (81% token reduction, 89% fewer tool calls): Rust's type system is complex — agents originally needed extensive file exploration to understand trait implementations and generic relationships. CodeGraph's pre-built relationships make this dramatically cheaper.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gin (Go, 150 files) sees the smallest gains&lt;/strong&gt; (23% token reduction, 19% fewer tool calls): Small Go projects have simple file structures. Agents can already navigate them efficiently, so CodeGraph's marginal value is lower.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;VS Code's absolute numbers are the most striking&lt;/strong&gt;: the same question costs $0.64 (1.4M tokens) without CodeGraph, $0.42 (393k tokens) with it. A single task saves $0.22.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: &lt;strong&gt;The larger the codebase, the more complex the dependencies, and the richer the language's type system, the greater CodeGraph's benefit&lt;/strong&gt;. For developers using Claude Code heavily on large projects, the ROI is clear.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. 19 Languages + 13 Framework Route Detection
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language support&lt;/strong&gt; (via tree-sitter grammars):&lt;/p&gt;

&lt;p&gt;TypeScript, JavaScript, Python, Go, Rust, Java, C#, PHP, Ruby, C, C++, Swift, Kotlin, Dart, Svelte, Vue, Liquid, Pascal/Delphi, Scala&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Framework route detection&lt;/strong&gt; is a differentiating feature — CodeGraph doesn't just recognize symbols, it understands the mapping between URL routes and their handler functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Django
&lt;/span&gt;&lt;span class="n"&gt;urlpatterns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nf"&gt;path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;users/&amp;lt;int:pk&amp;gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UserDetailView&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_view&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# → CodeGraph knows GET /users/{id}/ maps to UserDetailView
&lt;/span&gt;
&lt;span class="c1"&gt;# FastAPI
&lt;/span&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/items/{item_id}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="c1"&gt;# → CodeGraph knows GET /items/{id} maps to read_item()
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 13 supported frameworks: Django, Flask, FastAPI, Express, NestJS, Laravel, Rails, Spring, Gin/chi/gorilla/mux, Axum/actix/Rocket, ASP.NET, Vapor, React Router/SvelteKit.&lt;/p&gt;

&lt;p&gt;This means AI agents can ask "Where is the handler for &lt;code&gt;/api/users/:id&lt;/code&gt;?" and get a precise answer, without needing to scan routing config files.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. &lt;code&gt;codegraph affected&lt;/code&gt; — Smart CI Test Selection
&lt;/h3&gt;

&lt;p&gt;An underappreciated feature: by tracing import dependencies, it identifies which test files are actually affected by changed source files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# CI scenario: only run tests affected by this change&lt;/span&gt;
git diff &lt;span class="nt"&gt;--name-only&lt;/span&gt; | codegraph affected &lt;span class="nt"&gt;--stdin&lt;/span&gt;

&lt;span class="c"&gt;# Manually specify changed files&lt;/span&gt;
codegraph affected src/auth.ts

&lt;span class="c"&gt;# With filter (only e2e tests)&lt;/span&gt;
codegraph affected src/auth.ts &lt;span class="nt"&gt;--filter&lt;/span&gt; &lt;span class="s2"&gt;"e2e/*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How it works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Changed: src/auth.ts
  ↓ CodeGraph queries the dependency graph
  Direct importers: user.service.ts, auth.controller.ts
  Indirect importers: app.module.ts, integration.test.ts
  ↓ Filter to &lt;span class="nb"&gt;test &lt;/span&gt;files only
  Affected tests: auth.spec.ts, user.service.spec.ts, integration.test.ts
  ↓ Output
  &lt;span class="o"&gt;[&lt;/span&gt;these files] ← run only these, not the full &lt;span class="nb"&gt;test &lt;/span&gt;suite
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On large projects, this can compress CI test time from tens of minutes to a few minutes.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Configuration and Performance Notes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Project config file&lt;/strong&gt; (&lt;code&gt;.codegraph/config.json&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"languages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"typescript"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"javascript"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"exclude"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"node_modules/**"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dist/**"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"build/**"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*.min.js"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxFileSize"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1048576&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"extractDocstrings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"trackCallSites"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;SQLite backend selection&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;CodeGraph ships with two SQLite backends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Native &lt;code&gt;better-sqlite3&lt;/code&gt;&lt;/strong&gt; (default, recommended): High performance, supports concurrent reads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WASM fallback&lt;/strong&gt;: Better compatibility, but 5–10x slower than native, and concurrent operations may produce &lt;code&gt;database is locked&lt;/code&gt; errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you encounter performance issues or lock errors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Rebuild the native module&lt;/span&gt;
npm rebuild better-sqlite3

&lt;span class="c"&gt;# Check which backend is active&lt;/span&gt;
codegraph status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  CLI Reference
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;codegraph                        &lt;span class="c"&gt;# Run interactive installer&lt;/span&gt;
codegraph init &lt;span class="o"&gt;[&lt;/span&gt;path]            &lt;span class="c"&gt;# Initialize in a project&lt;/span&gt;
codegraph uninit &lt;span class="o"&gt;[&lt;/span&gt;path]          &lt;span class="c"&gt;# Remove CodeGraph from a project&lt;/span&gt;
codegraph index &lt;span class="o"&gt;[&lt;/span&gt;path]           &lt;span class="c"&gt;# Full index (--force to rebuild)&lt;/span&gt;
codegraph &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;path]            &lt;span class="c"&gt;# Incremental update&lt;/span&gt;
codegraph status &lt;span class="o"&gt;[&lt;/span&gt;path]          &lt;span class="c"&gt;# Show statistics&lt;/span&gt;
codegraph query &amp;lt;search&amp;gt;         &lt;span class="c"&gt;# Search symbols&lt;/span&gt;
codegraph files &lt;span class="o"&gt;[&lt;/span&gt;path]           &lt;span class="c"&gt;# Show file structure&lt;/span&gt;
codegraph context &amp;lt;task&amp;gt;         &lt;span class="c"&gt;# Build AI context for a task&lt;/span&gt;
codegraph affected &lt;span class="o"&gt;[&lt;/span&gt;files]       &lt;span class="c"&gt;# Find affected test files&lt;/span&gt;
codegraph serve &lt;span class="nt"&gt;--mcp&lt;/span&gt;            &lt;span class="c"&gt;# Start MCP server&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Library API&lt;/strong&gt; (embed CodeGraph in your own tools):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;CodeGraph&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@colbymchenry/codegraph&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;CodeGraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/path/to/project&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Full index with progress callbacks&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;cg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;indexAll&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;onProgress&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;phase&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;total&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Search symbols&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;searchNodes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;UserService&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Get call chain&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;callers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getCallers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Build AI context&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;cg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;buildContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fix login bug&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;maxNodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;includeCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Impact radius analysis (depth 2)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;impact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getImpactRadius&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;cg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;watch&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;   &lt;span class="c1"&gt;// Start file watching for auto-sync&lt;/span&gt;
&lt;span class="nx"&gt;cg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;   &lt;span class="c1"&gt;// Clean up resources&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Project Links &amp;amp; Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Official Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🌟 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/colbymchenry/codegraph" rel="noopener noreferrer"&gt;https://github.com/colbymchenry/codegraph&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📦 &lt;strong&gt;npm&lt;/strong&gt;: &lt;a href="https://www.npmjs.com/package/@colbymchenry/codegraph" rel="noopener noreferrer"&gt;@colbymchenry/codegraph&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;⚡ &lt;strong&gt;Quick install&lt;/strong&gt;: &lt;code&gt;npx @colbymchenry/codegraph&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Target Audience
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Heavy Claude Code / Cursor users&lt;/strong&gt;: Working on large projects and looking to reduce cost and improve response speed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large TypeScript/Rust/Python project developers&lt;/strong&gt;: Codebases large enough that AI agent file-scanning overhead is noticeable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD engineers&lt;/strong&gt;: Using &lt;code&gt;codegraph affected&lt;/code&gt; for smart test selection to eliminate unnecessary full test runs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Toolchain developers&lt;/strong&gt;: Embedding code semantic analysis into their own tools via the Library API&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Core value&lt;/strong&gt;: Inserts a pre-built semantic index between AI agents and codebases — average 35% cost savings, 70% fewer tool calls, 49% speed improvement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technology choices&lt;/strong&gt;: tree-sitter (AST parsing) + SQLite FTS5 (full-text search) + native OS file events (live sync) — zero external service dependencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;8 MCP tools&lt;/strong&gt;: &lt;code&gt;codegraph_context&lt;/code&gt; is the most critical — one call returns a complete context package for the task at hand&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;19 languages + 13 framework route detection&lt;/strong&gt; covering mainstream development stacks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;codegraph affected&lt;/code&gt;&lt;/strong&gt;: dependency-traced smart test selection, a CI acceleration tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gains scale with codebase size&lt;/strong&gt;: Tokio (Rust, 700 files) reaches 89% fewer tool calls; small Go projects see ~19%&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  One-Line Review
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;CodeGraph does something deceptively simple yet extremely practical: it converts the code discovery work that AI agents redo on every task into a reusable local index — not a feature addition, but a workflow architecture optimization.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Find more useful knowledge and interesting products on my &lt;a href="https://home.wonlab.top" rel="noopener noreferrer"&gt;Homepage&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>claude</category>
      <category>mcp</category>
      <category>sqlite</category>
    </item>
    <item>
      <title>RAG Series (23): Multimodal RAG — Images and Tables Can Be Retrieved Too</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Wed, 20 May 2026 12:24:18 +0000</pubDate>
      <link>https://forem.com/wonderlab/rag-series-23-multimodal-rag-images-and-tables-can-be-retrieved-too-3gj5</link>
      <guid>https://forem.com/wonderlab/rag-series-23-multimodal-rag-images-and-tables-can-be-retrieved-too-3gj5</guid>
      <description>&lt;h2&gt;
  
  
  What Text RAG Can't See
&lt;/h2&gt;

&lt;p&gt;Upload an annual report PDF. It contains revenue trend charts, product comparison tables, architecture diagrams. What does traditional RAG do?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A PDF parser extracts text&lt;/li&gt;
&lt;li&gt;Text is chunked, embedded, stored in the vector store&lt;/li&gt;
&lt;li&gt;User asks: "What was the revenue growth in Q3?"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The problem: &lt;strong&gt;the revenue chart is an image.&lt;/strong&gt; The PDF parser extracts its alt text (usually empty) or filename. The numbers are in the image, not the text. RAG will never find them.&lt;/p&gt;

&lt;p&gt;Tables are slightly better, but still problematic: parsers often flatten tables into lines of text, destroying the row/column structure and garbling the semantics.&lt;/p&gt;

&lt;p&gt;This is a real business pain point. Roughly 30–50% of the information in real-world documents exists in non-plain-text form.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Approaches
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Approach 1: Extract and Textualize
&lt;/h3&gt;

&lt;p&gt;The most direct and most mature approach: convert images and tables into text descriptions, then run standard text RAG.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Images: use a Vision Language Model (VLM) to generate descriptions&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;describe_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;image_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data:image/png;base64,&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;image_data&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Describe this image in detail, including all numbers, labels, trends, and key information. If this is a chart, list all data points.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tables: use &lt;code&gt;pdfplumber&lt;/code&gt; to preserve structure, convert to Markdown&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pdfplumber&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_tables_as_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;tables_md&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pdfplumber&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page_num&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_tables&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;continue&lt;/span&gt;
                &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
                &lt;span class="n"&gt;md&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;| &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; |&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="n"&gt;md&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;| &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; |&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;md&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;| &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; |&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="n"&gt;tables_md&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Page &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page_num&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; table]&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;md&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tables_md&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Integrate into the RAG pipeline:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.documents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Document&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# 1. Extract plain text
&lt;/span&gt;    &lt;span class="n"&gt;text_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_text_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text_chunks&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Extract images → VLM descriptions
&lt;/span&gt;    &lt;span class="n"&gt;images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_images_from_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;img_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page_num&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;describe_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page_num&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;img_path&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Extract tables → Markdown
&lt;/span&gt;    &lt;span class="n"&gt;tables&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_tables_as_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;table_md&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tables&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;table_md&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;: Compatible with all existing text RAG infrastructure; no changes to the vector store.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Limitations&lt;/strong&gt;: VLM captioning adds cost and latency; description quality directly affects retrieval quality; OCR is sensitive to scan quality.&lt;/p&gt;


&lt;h3&gt;
  
  
  Approach 2: CLIP Multimodal Embeddings
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Principle&lt;/strong&gt;: CLIP (Contrastive Language–Image Pre-training, OpenAI 2021) projects both text and images into the &lt;strong&gt;same vector space&lt;/strong&gt;. The embedding of the phrase "revenue trend chart" will be close to the embedding of an actual revenue trend chart image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_experimental.open_clip&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenCLIPEmbeddings&lt;/span&gt;

&lt;span class="n"&gt;clip_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenCLIPEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ViT-H-14&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;checkpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;laion2b_s32b_b79k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Embed text
&lt;/span&gt;&lt;span class="n"&gt;text_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clip_embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q3 revenue trend&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Embed image
&lt;/span&gt;&lt;span class="n"&gt;image_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clip_embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_image&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path/to/chart.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Both are in the same vector space — similarity is meaningful
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dot&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;numpy.linalg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;
&lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_embedding&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_embedding&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Similarity: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;similarity&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# typically &amp;gt; 0.3 for semantically related pairs
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Build a mixed text+image vector store:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;

&lt;span class="c1"&gt;# Images stored with their CLIP embeddings
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;img_path&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;image_paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;img_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clip_embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_image&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;img_path&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;image_vectorstore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_texts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[IMAGE]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;img_embedding&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;metadatas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;img_path&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dual-path retrieval at query time:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;multimodal_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Text retrieval
&lt;/span&gt;    &lt;span class="n"&gt;text_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text_vectorstore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;similarity_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Image retrieval (via CLIP's text encoder)
&lt;/span&gt;    &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clip_embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;image_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;image_vectorstore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;similarity_search_by_vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text_results&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;image_results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;: Images don't need pre-captioning; retrieval operates on visual content directly.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Limitations&lt;/strong&gt;: CLIP performs well on natural photographs but poorly on professional charts and graphs — those require understanding numerical relationships, not just visual recognition.&lt;/p&gt;


&lt;h3&gt;
  
  
  Approach 3: ColPali (The 2024 Breakthrough)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Background&lt;/strong&gt;: Traditional document RAG follows this pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PDF → extract text/images → textualize → embed → retrieve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every step loses information or introduces noise. ColPali (Google Research, 2024) took a different approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PDF → screenshot each page → vision language model → page-level embeddings → retrieve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Process each PDF page directly as an image. Bypass text extraction entirely.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Key components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backbone&lt;/strong&gt;: PaliGemma 3B (Google's vision language model)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Late Interaction&lt;/strong&gt; (from ColBERT): each page is divided into 1,030 patches; each patch gets its own embedding; queries generate token-level embeddings; retrieval scores via fine-grained patch × token similarity, then aggregates&lt;/li&gt;
&lt;li&gt;The result: ColPali can pinpoint which part of a page answers a question
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Using the byaldi library (Python interface for ColPali)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;byaldi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RAGMultiModalModel&lt;/span&gt;

&lt;span class="c1"&gt;# Load ColPali
&lt;/span&gt;&lt;span class="n"&gt;RAG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RAGMultiModalModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vidore/colpali-v1.2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Index a PDF directory (screenshots each page, generates patch embeddings)
&lt;/span&gt;&lt;span class="n"&gt;RAG&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;input_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./financial_reports/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reports_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;store_collection_with_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# save original images for answer generation
&lt;/span&gt;    &lt;span class="n"&gt;overwrite&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Retrieve (returns the most relevant pages)
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RAG&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q3 revenue quarter-over-quarter growth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;File: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;doc_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, Page: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page_num&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, Score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Generate an answer from the retrieved page image:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;answer_with_page_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page_image_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_image_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;img_b64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data:image/png;base64,&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;img_b64&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Based on this page, answer: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The full ColPali flow:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User question → ColPali retrieves most relevant pages → extract page images → send to VLM → generate answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handles charts, formulas, and mixed layouts natively — no OCR required&lt;/li&gt;
&lt;li&gt;Page-level understanding preserves visual layout&lt;/li&gt;
&lt;li&gt;Significantly outperforms traditional methods on visually dense documents (research papers, financial reports)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Heavy model (PaliGemma 3B); retrieval latency higher than vector lookup&lt;/li&gt;
&lt;li&gt;Requires NVIDIA GPU; not suitable for CPU-only deployments&lt;/li&gt;
&lt;li&gt;Long index-build time (each page requires a forward pass)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Dedicated Table Handling
&lt;/h2&gt;

&lt;p&gt;Tables are different from images — they have structured semantics and deserve specialized treatment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Method 1: Preserve Markdown structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;table_to_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;md&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;| &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; |&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;md&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;| &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:---:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; |&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]:&lt;/span&gt;
        &lt;span class="n"&gt;md&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;| &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; |&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;md&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good LLMs can reason across rows and columns in Markdown format.&lt;/p&gt;

&lt;h3&gt;
  
  
  Method 2: Summary for retrieval + full table for generation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;index_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_md&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table_metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Use LLM to generate a retrieval-friendly summary
&lt;/span&gt;    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the key information in this table in one sentence (under 50 words):&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_md&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Store summary as the retrieval unit, full table in metadata
&lt;/span&gt;    &lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_texts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;metadatas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;table_metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;full_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;table_md&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Retrieve by summary; send the full table Markdown to the LLM for answer generation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Method 3: Structured extraction → natural language
&lt;/h3&gt;

&lt;p&gt;For high-value tables (financials, product specs), extract as structured data then convert to natural language:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Table → JSON
&lt;/span&gt;&lt;span class="n"&gt;table_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;columns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Quarter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Revenue ($B)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QoQ Growth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Quarter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Revenue ($B)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;12.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QoQ Growth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+5.2%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Quarter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Revenue ($B)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;14.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QoQ Growth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+14.6%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Quarter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Revenue ($B)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;13.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QoQ Growth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-2.1%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# JSON → natural language (better for semantic retrieval)
&lt;/span&gt;&lt;span class="n"&gt;nl_description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Quarterly revenue data: Q1 $12.3B, Q2 $14.1B (up 14.6% QoQ), &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q3 $13.8B (down 2.1% QoQ).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Natural language is more retrieval-friendly and can be directly quoted in the LLM's answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Which Approach to Choose
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Extract + Textualize&lt;/th&gt;
&lt;th&gt;CLIP Multimodal&lt;/th&gt;
&lt;th&gt;ColPali&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Document types&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;All&lt;/td&gt;
&lt;td&gt;Image-heavy&lt;/td&gt;
&lt;td&gt;Visually dense (reports, academic PDFs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standard text RAG&lt;/td&gt;
&lt;td&gt;Requires CLIP&lt;/td&gt;
&lt;td&gt;Requires GPU, heavy model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chart understanding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Depends on VLM caption quality&lt;/td&gt;
&lt;td&gt;Weak (charts ≠ natural photos)&lt;/td&gt;
&lt;td&gt;Strong (page-level understanding)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Update cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High (re-indexing is expensive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Engineering complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;VLM captioning fees&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Model inference cost&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Practical recommendations for most scenarios:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario                              Recommended approach
──────────────────────────────────────────────────────────────
Standard enterprise docs (few images)  Text RAG, OCR or ignore images
Product docs (architecture diagrams)   Extract + GPT-4V caption
Financial/research reports (charts)    ColPali
E-commerce image search                CLIP
Quick knowledge base prototype         Extract + textualize (simplest)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  A Complete Multimodal RAG Pipeline
&lt;/h2&gt;

&lt;p&gt;Combining the approaches into a unified pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;enum&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DocElement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;TEXT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;IMAGE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;TABLE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MultimodalRAGPipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text_embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clip_embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text_embeddings&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clip_emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clip_embeddings&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vectorstore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Chroma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text_embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;elements&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_all_elements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# text / images / tables
&lt;/span&gt;        &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;elem&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;elements&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;elem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;DocElement&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;elem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;elem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;DocElement&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IMAGE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;caption&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_generate_caption&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;elem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;caption&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;elem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;elem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;DocElement&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;table_to_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;elem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_generate_caption&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;describe_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# calls GPT-4V
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;similarity_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;context_parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;images_to_show&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;context_parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Image description] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;images_to_show&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;context_parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer based on the following:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;---&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context_parts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;images_to_show&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Multimodal RAG is fundamentally about &lt;strong&gt;converting non-text information into a retrievable form&lt;/strong&gt;, then returning the original content to the LLM at answer-generation time. Three approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extract and textualize&lt;/strong&gt;: most mature, engineering-simple, but dependent on OCR/VLM quality — suitable for most scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLIP multimodal embeddings&lt;/strong&gt;: unified vector space for text and images; good for natural photograph retrieval; limited on professional charts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ColPali&lt;/strong&gt;: direct visual page processing; best results for chart-heavy documents; requires GPU and higher engineering investment&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Tables are often simpler than images: preserve Markdown structure + generate a retrieval summary, and standard text RAG handles them well.&lt;/p&gt;

&lt;p&gt;Next (and final) in this series: &lt;strong&gt;Code RAG&lt;/strong&gt; — helping AI understand your codebase, including AST-based splitting, code embedding models, and representing call graphs with knowledge graphs.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2407.01449" rel="noopener noreferrer"&gt;ColPali: Efficient Document Retrieval with Vision Language Models (2024)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2103.00020" rel="noopener noreferrer"&gt;CLIP: Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/AnswerDotAI/byaldi" rel="noopener noreferrer"&gt;byaldi: Python interface for ColPali&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://python.langchain.com/docs/how_to/multimodal_inputs/" rel="noopener noreferrer"&gt;LangChain Multi-modal RAG Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jsvine/pdfplumber" rel="noopener noreferrer"&gt;pdfplumber: Structured PDF extraction&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rag</category>
      <category>multimodal</category>
      <category>vision</category>
      <category>llm</category>
    </item>
    <item>
      <title>One Open Source Project a Day (No. 70): Claude Plugins Official — A Complete Tour of Anthropic's Official Claude Code Plugin Ecosystem</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Wed, 20 May 2026 01:59:45 +0000</pubDate>
      <link>https://forem.com/wonderlab/one-open-source-project-a-day-no-70-claude-plugins-official-a-complete-tour-of-anthropics-4lgo</link>
      <guid>https://forem.com/wonderlab/one-open-source-project-a-day-no-70-claude-plugins-official-a-complete-tour-of-anthropics-4lgo</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"Plugins extend Claude Code with commands, agents, skills, hooks, and MCP servers."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the NO.70 article in the "One Open Source Project a Day" series. Today we are exploring &lt;strong&gt;claude-plugins-official&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is Anthropic's officially maintained plugin registry on GitHub. If you have been using Claude Code lately, you have almost certainly seen the &lt;code&gt;/plugin install&lt;/code&gt; command — this repository is the official plugin source behind that command.&lt;/p&gt;

&lt;p&gt;20.2k Stars, 2.5k Forks, 669 open issues. Behind these numbers is a rapidly maturing ecosystem: Anthropic engineers have contributed 30+ plugins covering LSP support for 12 languages, a multi-agent PR review toolkit, Git workflow automation, and code quality analysis. 15 external partners — including GitHub, Firebase, Linear, and Terraform — have already joined.&lt;/p&gt;

&lt;p&gt;But this article is not just about "what plugins exist." What deserves equal attention is the &lt;strong&gt;design of the plugin specification itself&lt;/strong&gt;: a single plugin.json file, three extension mechanisms (Skills/Commands/MCP), and two trigger modes (user-invoked vs. model-invoked). Understanding this architecture means understanding the boundaries and possibilities of Claude Code extensibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You Will Learn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The complete plugin directory structure of &lt;code&gt;claude-plugins-official&lt;/code&gt; (internal vs. external plugins)&lt;/li&gt;
&lt;li&gt;The Claude Code plugin specification: from &lt;code&gt;plugin.json&lt;/code&gt; to Skills/Commands/MCP&lt;/li&gt;
&lt;li&gt;Deep dives into five key plugins: &lt;code&gt;pr-review-toolkit&lt;/code&gt;, &lt;code&gt;agent-sdk-dev&lt;/code&gt;, &lt;code&gt;code-review&lt;/code&gt;, &lt;code&gt;hookify&lt;/code&gt;, &lt;code&gt;commit-commands&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;How to build a spec-compliant Claude Code plugin from scratch&lt;/li&gt;
&lt;li&gt;The current state of the external partner plugin ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Familiarity with Claude Code (basic slash command usage)&lt;/li&gt;
&lt;li&gt;A basic understanding of MCP (Model Context Protocol)&lt;/li&gt;
&lt;li&gt;Interest in building your own Claude Code extensions&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Project Introduction
&lt;/h3&gt;

&lt;p&gt;claude-plugins-official is Anthropic's official plugin registry, serving the Claude Code plugin ecosystem. It plays two roles simultaneously:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Plugin registry&lt;/strong&gt;: Users install plugins via &lt;code&gt;/plugin install &amp;lt;name&amp;gt;@claude-plugins-official&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Development specification reference&lt;/strong&gt;: &lt;code&gt;plugins/example-plugin&lt;/code&gt; is the official complete reference implementation, demonstrating all extension mechanisms&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The repository is divided into two main directories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/plugins&lt;/code&gt;&lt;/strong&gt;: Developed by Anthropic engineers, covering LSP, development workflows, code quality, output styles, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/external_plugins&lt;/code&gt;&lt;/strong&gt;: Submitted by partners and the community, subject to quality and security review&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Author/Team Introduction
&lt;/h3&gt;

&lt;p&gt;This is a multi-contributor Anthropic internal project, with different engineers leading each plugin:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;pr-review-toolkit&lt;/strong&gt;: Daisy (&lt;a href="mailto:daisy@anthropic.com"&gt;daisy@anthropic.com&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;agent-sdk-dev&lt;/strong&gt;: Ashwin Bhat (&lt;a href="mailto:ashwin@anthropic.com"&gt;ashwin@anthropic.com&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;code-review&lt;/strong&gt;: Boris Cherny (&lt;a href="mailto:boris@anthropic.com"&gt;boris@anthropic.com&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;frontend-design&lt;/strong&gt;: Prithvi Rajasekaran + Alexander Bricken&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;commit-commands&lt;/strong&gt;: Anthropic team&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project Stats
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;⭐ GitHub Stars: &lt;strong&gt;20,200+&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;🍴 Forks: &lt;strong&gt;2,500+&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;👁️ Watchers: &lt;strong&gt;147&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;🐛 Open Issues: &lt;strong&gt;669&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;💻 Languages: Python (31.6%), TypeScript (28.9%), HTML (19.5%), Shell (13.0%), JavaScript (7.0%)&lt;/li&gt;
&lt;li&gt;🏷️ Topics: &lt;code&gt;skills&lt;/code&gt;, &lt;code&gt;mcp&lt;/code&gt;, &lt;code&gt;claude-code&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;🌐 Repository: &lt;a href="https://github.com/anthropics/claude-plugins-official" rel="noopener noreferrer"&gt;anthropics/claude-plugins-official&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📖 Docs: &lt;a href="https://code.claude.com/docs/en/plugins" rel="noopener noreferrer"&gt;code.claude.com/docs/en/plugins&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Main Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Utility
&lt;/h3&gt;

&lt;p&gt;claude-plugins-official provides a standardized way to extend Claude Code's capabilities:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Native Claude Code capabilities
        ↓
  /plugin install &amp;lt;name&amp;gt;@claude-plugins-official
        ↓
  Extended Claude Code
  ├── New Slash Commands (user-invoked)
  ├── New Skills (model-auto-triggered)
  ├── New Agents (specialized task delegates)
  └── New MCP Tools (external service integrations)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Installing plugins (CLI)&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# PR review toolkit&lt;/span&gt;
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;pr-review-toolkit@claude-plugins-official

&lt;span class="c"&gt;# Git commit command suite&lt;/span&gt;
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;commit-commands@claude-plugins-official

&lt;span class="c"&gt;# Agent SDK development tools&lt;/span&gt;
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;agent-sdk-dev@claude-plugins-official

&lt;span class="c"&gt;# Code review plugin&lt;/span&gt;
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;code-review@claude-plugins-official

&lt;span class="c"&gt;# Hook management tool&lt;/span&gt;
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;hookify@claude-plugins-official
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Installing plugins (UI)&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;In Claude Code, type: /plugin
→ Click "Discover"
→ Browse and install desired plugins
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Installing external partner plugins&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# GitHub integration&lt;/span&gt;
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;github@claude-plugins-official

&lt;span class="c"&gt;# Linear project management&lt;/span&gt;
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;linear@claude-plugins-official

&lt;span class="c"&gt;# Firebase integration&lt;/span&gt;
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;firebase@claude-plugins-official

&lt;span class="c"&gt;# Terraform infrastructure&lt;/span&gt;
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;terraform@claude-plugins-official
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Plugin Directory Overview
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Internal Plugins (/plugins)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Plugins&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LSP Language Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;clangd-lsp, csharp-lsp, gopls-lsp, jdtls-lsp, kotlin-lsp, lua-lsp, php-lsp, pyright-lsp, ruby-lsp, rust-analyzer-lsp, swift-lsp, typescript-lsp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Development Workflows&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;agent-sdk-dev, claude-code-setup, commit-commands, feature-dev, mcp-server-dev, plugin-dev, pr-review-toolkit, ralph-loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code Quality&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;code-modernization, code-review, code-simplifier, security-guidance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output Styles&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;explanatory-output-style, learning-output-style&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Utility Tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;claude-md-management, cwc-makers, frontend-design, hookify, math-olympiad, session-report, skill-creator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;example-plugin&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;External Partner Plugins (/external_plugins)&lt;/strong&gt;: asana, context7, discord, fakechat, firebase, github, gitlab, greptile, imessage, laravel-boost, linear, playwright, serena, telegram, terraform&lt;/p&gt;




&lt;h2&gt;
  
  
  Five Key Plugins: Deep Dives
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Plugin 1: pr-review-toolkit — Six Parallel PR Review Agents
&lt;/h3&gt;

&lt;p&gt;This is arguably the most elegantly designed plugin in the entire directory: &lt;strong&gt;6 specialized agents running in parallel, reviewing the same Pull Request from different angles simultaneously&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PR submitted
  ↓
comment-analyzer      ← Comment accuracy, documentation completeness, comment rot
pr-test-analyzer      ← Test coverage quality, edge cases, behavioral vs. line coverage
silent-failure-hunter ← Silent failures, empty catch blocks, missing error logging
type-design-analyzer  ← Type encapsulation, invariants (rated 1–10 on 4 dimensions)
code-reviewer         ← CLAUDE.md compliance, style, bugs (0–100 score)
code-simplifier       ← Readability, unnecessary complexity, redundant abstractions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Natural language triggering&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Are the tests thorough?"                  → triggers pr-test-analyzer
"Check the error handling in the API client" → triggers silent-failure-hunter
"Is this documentation accurate?"           → triggers comment-analyzer
"Simplify this code"                        → triggers code-simplifier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Full PR review (trigger all agents at once)&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"I'm ready to create this PR. Please:
1. Review test coverage
2. Check for silent failures
3. Verify code comments are accurate
4. Review any new types
5. General code review"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recommended workflow&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write code → code-reviewer
Fix issues → silent-failure-hunter (if error handling changed)
Add tests  → pr-test-analyzer
Document   → comment-analyzer
Polish     → code-simplifier
→ Create PR
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Plugin 2: agent-sdk-dev — Agent SDK Project Scaffolding
&lt;/h3&gt;

&lt;p&gt;This plugin compresses "building a Claude Agent SDK project from scratch" into a single command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;/new-sdk-app&lt;/code&gt; command&lt;/strong&gt; (interactive project creation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/new-sdk-app my-agent-project
&lt;span class="c"&gt;# Interactive prompts:&lt;/span&gt;
&lt;span class="c"&gt;# 1. Language: TypeScript or Python?&lt;/span&gt;
&lt;span class="c"&gt;# 2. Agent type: coding / business / custom&lt;/span&gt;
&lt;span class="c"&gt;# 3. Starting point: minimal / basic / specific example&lt;/span&gt;
&lt;span class="c"&gt;# 4. Package manager: npm/yarn/pnpm or pip/poetry&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What it does automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checks and installs the latest SDK version&lt;/li&gt;
&lt;li&gt;Creates project file structure, &lt;code&gt;.env.example&lt;/code&gt;, &lt;code&gt;.gitignore&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Runs type checking (TS) or syntax validation (Python)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatically runs the appropriate Verifier agent&lt;/strong&gt; (validates the project against SDK best practices)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Two Verifier Agents&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Python project verification&lt;/span&gt;
&lt;span class="s2"&gt;"Verify my Python Agent SDK application"&lt;/span&gt;
→ Checks: SDK installation, requirements.txt/pyproject.toml,
           SDK usage patterns, agent init/config, .env security, error handling

&lt;span class="c"&gt;# TypeScript project verification&lt;/span&gt;
&lt;span class="s2"&gt;"Verify my TypeScript Agent SDK application"&lt;/span&gt;
→ Checks: SDK installation, tsconfig.json, &lt;span class="nb"&gt;type &lt;/span&gt;safety/imports,
           agent init/config, .env security, error handling
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verifier output format&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Overall Status: PASS / PASS WITH WARNINGS / FAIL
- Critical Issues (blocking functionality)
- Warnings (suboptimal patterns)
- Passed Checks
- Recommendations (with SDK documentation links)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Plugin 3: code-review — Confidence-Filtered 4-Agent Code Review
&lt;/h3&gt;

&lt;p&gt;This plugin addresses the most common problem with AI code review: &lt;strong&gt;too many false positives&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Pre-checks — Skip closed, draft, or trivial PRs
2. Collect CLAUDE.md guideline files from the repo
3. Summarize PR changes
4. Launch 4 parallel agents:
   - Agent #1 &amp;amp; #2 → CLAUDE.md compliance checks
   - Agent #3 → Bug detection (changed code only)
   - Agent #4 → Git blame/history context analysis
5. Score each issue 0–100 for confidence
6. Filter out issues below threshold (default: 80)
7. Post review comment with high-confidence issues only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Confidence scale&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Not confident, likely false positive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;Somewhat confident, might be real&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;Moderately confident, real but minor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;td&gt;Highly confident, real and important&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;Absolutely certain, must fix&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What gets filtered out&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-existing issues not introduced by this PR&lt;/li&gt;
&lt;li&gt;Code that looks buggy but isn't&lt;/li&gt;
&lt;li&gt;Issues that linters will catch&lt;/li&gt;
&lt;li&gt;Items with lint-ignore comments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Configuring the confidence threshold&lt;/strong&gt; (in &lt;code&gt;commands/code-review.md&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# Change 80 to your preferred threshold&lt;/span&gt;
&lt;span class="err"&gt;Filter&lt;/span&gt; &lt;span class="err"&gt;out&lt;/span&gt; &lt;span class="err"&gt;any&lt;/span&gt; &lt;span class="err"&gt;issues&lt;/span&gt; &lt;span class="err"&gt;with&lt;/span&gt; &lt;span class="err"&gt;a&lt;/span&gt; &lt;span class="err"&gt;score&lt;/span&gt; &lt;span class="err"&gt;less&lt;/span&gt; &lt;span class="err"&gt;than&lt;/span&gt; &lt;span class="err"&gt;80.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Plugin 4: hookify — Create Claude Code Hooks in Plain English
&lt;/h3&gt;

&lt;p&gt;Claude Code's Hooks feature (triggering custom logic before/after tool executions) requires manually editing complex JSON configuration. hookify eliminates that friction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core commands&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a rule from plain English&lt;/span&gt;
/hookify Warn me when I use &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; commands

&lt;span class="c"&gt;# Analyze recent conversation, auto-suggest behaviors to block&lt;/span&gt;
/hookify

&lt;span class="c"&gt;# List all rules&lt;/span&gt;
/hookify:list

&lt;span class="c"&gt;# Enable/disable rules interactively&lt;/span&gt;
/hookify:configure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rules take effect &lt;strong&gt;immediately&lt;/strong&gt; — no restart required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule file format&lt;/strong&gt; (stored as &lt;code&gt;.claude/hookify.&amp;lt;name&amp;gt;.local.md&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;block-dangerous-rm&lt;/span&gt;
&lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;event&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bash&lt;/span&gt;
&lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rm\s+-rf&lt;/span&gt;
&lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;block&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

⚠️ &lt;span class="gs"&gt;**Dangerous rm command detected!**&lt;/span&gt;

Please verify the path and ensure you have backups.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Advanced rules (multiple conditions)&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warn-sensitive-files&lt;/span&gt;
&lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;event&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;file&lt;/span&gt;
&lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warn&lt;/span&gt;
&lt;span class="na"&gt;conditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;field&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;file_path&lt;/span&gt;
    &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;regex_match&lt;/span&gt;
    &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;\.env$|credentials|secrets&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;field&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;new_text&lt;/span&gt;
    &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;contains&lt;/span&gt;
    &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;KEY&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

🔐 &lt;span class="gs"&gt;**Sensitive file edit detected!**&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Event types and when they fire&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;event&lt;/th&gt;
&lt;th&gt;Fires when&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Before a Bash command executes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;On file read/write operations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prompt&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When the user submits a prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stop&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When Claude is about to stop responding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;all&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All of the above&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Practical rule examples&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Block destructive commands&lt;/span&gt;
&lt;span class="na"&gt;event&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bash&lt;/span&gt;
&lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rm\s+-rf|dd\s+if=|mkfs|format&lt;/span&gt;
&lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;block&lt;/span&gt;

&lt;span class="c1"&gt;# Warn about debug code&lt;/span&gt;
&lt;span class="na"&gt;event&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;file&lt;/span&gt;
&lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;console\.log\(|debugger;&lt;/span&gt;
&lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warn&lt;/span&gt;

&lt;span class="c1"&gt;# Require tests before stopping&lt;/span&gt;
&lt;span class="na"&gt;event&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stop&lt;/span&gt;
&lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;block&lt;/span&gt;
&lt;span class="na"&gt;conditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;field&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;transcript&lt;/span&gt;
    &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;not_contains&lt;/span&gt;
    &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test|pytest|cargo test&lt;/span&gt;

&lt;span class="c1"&gt;# Prevent hardcoded API keys in TypeScript&lt;/span&gt;
&lt;span class="na"&gt;event&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;file&lt;/span&gt;
&lt;span class="na"&gt;conditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;field&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;file_path&lt;/span&gt;
    &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;regex_match&lt;/span&gt;
    &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;\.tsx?$&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;field&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;new_text&lt;/span&gt;
    &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;regex_match&lt;/span&gt;
    &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;(API_KEY|SECRET|TOKEN)\s*=\s*["']&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Plugin 5: commit-commands — Three-Command Git Workflow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Auto-generate commit message and commit&lt;/span&gt;
/commit

&lt;span class="c"&gt;# Full one-command workflow: commit → push → create PR&lt;/span&gt;
/commit-push-pr

&lt;span class="c"&gt;# Clean up local branches whose remotes have been deleted&lt;/span&gt;
/clean_gone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What &lt;code&gt;/commit&lt;/code&gt; does:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Analyzes staged and unstaged changes&lt;/li&gt;
&lt;li&gt;Reviews recent commit history to match repo style&lt;/li&gt;
&lt;li&gt;Stages files and creates a commit message with Claude Code attribution&lt;/li&gt;
&lt;li&gt;Automatically skips sensitive files (&lt;code&gt;.env&lt;/code&gt;, &lt;code&gt;credentials.json&lt;/code&gt;, etc.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What &lt;code&gt;/commit-push-pr&lt;/code&gt; does:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Creates a new branch if currently on &lt;code&gt;main&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Commits the changes&lt;/li&gt;
&lt;li&gt;Pushes to &lt;code&gt;origin&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Creates a PR via GitHub CLI (includes Summary + test checklist)&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Plugin Specification Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Standard Plugin Directory Structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;plugin-name/&lt;/span&gt;
&lt;span class="s"&gt;├── .claude-plugin/&lt;/span&gt;
&lt;span class="s"&gt;│   └── plugin.json&lt;/span&gt;      &lt;span class="c1"&gt;# Plugin metadata (required)&lt;/span&gt;
&lt;span class="s"&gt;├── .mcp.json&lt;/span&gt;            &lt;span class="c1"&gt;# MCP server config (optional)&lt;/span&gt;
&lt;span class="s"&gt;├── skills/&lt;/span&gt;              &lt;span class="c1"&gt;# Skill definitions (preferred)&lt;/span&gt;
&lt;span class="s"&gt;│   ├── my-skill/&lt;/span&gt;
&lt;span class="s"&gt;│   │   └── SKILL.md&lt;/span&gt;     &lt;span class="c1"&gt;# Model-auto-triggered skill&lt;/span&gt;
&lt;span class="s"&gt;│   └── my-command/&lt;/span&gt;
&lt;span class="s"&gt;│       └── SKILL.md&lt;/span&gt;     &lt;span class="c1"&gt;# User-invoked slash command (also uses SKILL.md)&lt;/span&gt;
&lt;span class="s"&gt;├── commands/&lt;/span&gt;            &lt;span class="c1"&gt;# Slash commands (legacy format)&lt;/span&gt;
&lt;span class="s"&gt;│   └── my-command.md&lt;/span&gt;
&lt;span class="s"&gt;├── agents/&lt;/span&gt;              &lt;span class="c1"&gt;# Agent definitions&lt;/span&gt;
&lt;span class="s"&gt;└── README.md&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The plugin.json Metadata Format
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-plugin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"A description of what this plugin does"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"author"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Your Name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"you@example.com"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deliberately minimal by design: only three fields — no version, no dependency declarations. A plugin's capabilities are determined by its directory contents, not its metadata.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Extension Mechanisms Compared
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mechanism 1: Skills (Recommended for new plugins)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Model-auto-triggered (context-based)&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;security-review&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Automatically trigger security review when security-related code is detected&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1.0.0&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="c1"&gt;# User-invoked (becomes /skill-name command)&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-command&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Short description shown in /help&lt;/span&gt;
&lt;span class="na"&gt;argument-hint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;arg1&amp;gt; [optional-arg]&lt;/span&gt;
&lt;span class="na"&gt;allowed-tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Read&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Glob&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Grep&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Mechanism 2: Commands (Legacy format, functionally equivalent)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# commands/my-command.md
---
# YAML frontmatter
---
# Command content (same as SKILL.md)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Mechanism 3: MCP Servers (External service integration)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;.mcp.json&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"my-service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://api.myservice.com/mcp"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Key Difference Between Two Skill Trigger Modes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Trigger Type&lt;/th&gt;
&lt;th&gt;Activation&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;User-invoked&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User types &lt;code&gt;/skill-name&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Explicit one-time operations (commit, review)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model-auto-triggered&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude judges from task context&lt;/td&gt;
&lt;td&gt;Continuous context enhancement (output styles, security checks)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This distinction is critical: the &lt;code&gt;frontend-design&lt;/code&gt; plugin is a perfect example of model-auto-triggered — the user simply says "create a dashboard" and Claude automatically applies that plugin's design principles, no explicit invocation required.&lt;/p&gt;




&lt;h2&gt;
  
  
  External Plugin Submission Process
&lt;/h2&gt;

&lt;p&gt;To get your own plugin listed in the official directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Ensure your plugin meets quality and security standards
   ↓
2. Visit https://clau.de/plugin-directory-submission
   ↓
3. Fill out the plugin submission form
   ↓
4. Anthropic review (quality + security)
   ↓
5. Merged into /external_plugins directory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Review criteria&lt;/strong&gt; (inferred from documentation):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Plugin must have a complete README and plugin.json&lt;/li&gt;
&lt;li&gt;MCP servers, if used, must be clearly documented&lt;/li&gt;
&lt;li&gt;No malicious code or unauthorized data collection&lt;/li&gt;
&lt;li&gt;Must provide real value, not duplicate existing functionality&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Links &amp;amp; Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Official Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🌟 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/anthropics/claude-plugins-official" rel="noopener noreferrer"&gt;https://github.com/anthropics/claude-plugins-official&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📖 &lt;strong&gt;Plugin Development Docs&lt;/strong&gt;: &lt;a href="https://code.claude.com/docs/en/plugins" rel="noopener noreferrer"&gt;code.claude.com/docs/en/plugins&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📝 &lt;strong&gt;External Plugin Submission&lt;/strong&gt;: &lt;a href="https://clau.de/plugin-directory-submission" rel="noopener noreferrer"&gt;clau.de/plugin-directory-submission&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔧 &lt;strong&gt;Reference Implementation&lt;/strong&gt;: &lt;code&gt;/plugins/example-plugin&lt;/code&gt; (best starting point for plugin development)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Target Audience
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Power Claude Code users&lt;/strong&gt;: Looking to extend their workflow capabilities with official plugins&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plugin developers&lt;/strong&gt;: Learning how to build spec-compliant Claude Code plugins&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Toolchain engineers&lt;/strong&gt;: Connecting their own services to the Claude Code ecosystem via MCP servers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineering productivity leads&lt;/strong&gt;: Building custom team plugins and standardizing development workflows&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Plugin ecosystem&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;30+ internal plugins&lt;/strong&gt;: Covering LSP (12 languages), PR review, Git workflows, code quality, output styles, and other core development scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;15 external plugins&lt;/strong&gt;: GitHub, Firebase, Linear, Terraform, and other mainstream tools already integrated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified installation&lt;/strong&gt;: One command handles everything — &lt;code&gt;/plugin install &amp;lt;name&amp;gt;@claude-plugins-official&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Plugin specification&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Minimal metadata&lt;/strong&gt;: &lt;code&gt;plugin.json&lt;/code&gt; needs only name, description, and author&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three extension mechanisms&lt;/strong&gt;: Skills (recommended) / Commands (legacy) / MCP Servers (external services)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two trigger modes&lt;/strong&gt;: User-invoked vs. model-auto-triggered based on context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security awareness built in&lt;/strong&gt;: Official docs explicitly note that Anthropic does not vouch for third-party MCP servers — trust verification is the user's responsibility&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Core value of the five key plugins&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pr-review-toolkit&lt;/code&gt;: 6 parallel agents → multi-angle coverage, eliminating single-reviewer blind spots&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;agent-sdk-dev&lt;/code&gt;: One-command scaffolding + automatic Verifier → lower barrier to entry for Agent SDK&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;code-review&lt;/code&gt;: Confidence filtering (default threshold 80) → fewer false positives, focus on real issues&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hookify&lt;/code&gt;: Natural language hook creation → no complex JSON config files&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;commit-commands&lt;/code&gt;: &lt;code&gt;/commit-push-pr&lt;/code&gt; → full pipeline from code to PR in one command&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  One-Line Review
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;claude-plugins-official is not just a plugin directory — it is Anthropic's public answer to the question "how should AI tools be extended": minimize metadata, maximize compositional flexibility, let Skills appear in the right context automatically rather than forcing users to memorize commands.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Find more useful knowledge and interesting products on my &lt;a href="https://home.wonlab.top" rel="noopener noreferrer"&gt;Homepage&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>claude</category>
      <category>mcp</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>RAG Series (22): Long Context vs RAG — Do We Even Need RAG?</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Tue, 19 May 2026 02:02:34 +0000</pubDate>
      <link>https://forem.com/wonderlab/rag-series-22-long-context-vs-rag-do-we-even-need-rag-5a8j</link>
      <guid>https://forem.com/wonderlab/rag-series-22-long-context-vs-rag-do-we-even-need-rag-5a8j</guid>
      <description>&lt;h2&gt;
  
  
  A Question Worth Taking Seriously
&lt;/h2&gt;

&lt;p&gt;Gemini 1.5 Pro supports 1 million token context. Claude 3.5 handles 200K tokens. GPT-4 Turbo handles 128K. A small novel fits in context. Some people ask: is RAG still necessary?&lt;/p&gt;

&lt;p&gt;The question deserves a real answer, because it hides a genuine engineering decision: &lt;strong&gt;for a production system, should I use RAG or long context?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;Large language model context windows (2024–2025):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Context Window&lt;/th&gt;
&lt;th&gt;Approximate text&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 1.5 Pro&lt;/td&gt;
&lt;td&gt;1,000,000 tokens&lt;/td&gt;
&lt;td&gt;~750,000 words, ~1500 pages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3.5 Sonnet&lt;/td&gt;
&lt;td&gt;200,000 tokens&lt;/td&gt;
&lt;td&gt;~150,000 words, ~300 pages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4 Turbo&lt;/td&gt;
&lt;td&gt;128,000 tokens&lt;/td&gt;
&lt;td&gt;~96,000 words, ~190 pages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;128,000 tokens&lt;/td&gt;
&lt;td&gt;~96,000 words, ~190 pages&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This looks like a lot. But how much content does a real knowledge base have?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A mid-sized company's internal documentation: thousands of documents, millions of words&lt;/li&gt;
&lt;li&gt;A large codebase: tens of thousands of files, billions of tokens&lt;/li&gt;
&lt;li&gt;A news or research database: millions of articles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these exceed any model's context window. That is the hard ceiling on long context.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Cost of Long Context
&lt;/h2&gt;

&lt;p&gt;"Bigger window" doesn't mean "free." Every request processes every token, and the cost is real.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost 1: Money
&lt;/h3&gt;

&lt;p&gt;Rough estimates at late 2024 pricing (input tokens):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Price per 1M tokens&lt;/th&gt;
&lt;th&gt;1M token request&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 1.5 Pro&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3.5 Sonnet&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4 Turbo&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Compare to RAG:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval phase: Embedding API only (&amp;lt; $0.001)&lt;/li&gt;
&lt;li&gt;Generation phase: 2,000–5,000 tokens of retrieved context + question (&amp;lt; $0.05)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;RAG can cost 20–200× less than long context for the same question.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At 1,000 user queries per day against an enterprise knowledge base:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long context (1M tokens): ~$1,250/day&lt;/li&gt;
&lt;li&gt;RAG (3K token context): ~$3–15/day&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost 2: Latency
&lt;/h3&gt;

&lt;p&gt;More tokens = slower response. Time to first token (TTFT) grows roughly linearly with input length:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;100K token input → TTFT ~2–5 seconds
1M token input   → TTFT ~15–30 seconds (varies by model and infrastructure)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A conversational application where the user waits 30 seconds before any output is largely unusable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost 3: Lost in the Middle
&lt;/h3&gt;

&lt;p&gt;A 2023 Stanford paper "Lost in the Middle" (Liu et al.) found that when relevant information appears in the middle of a long context, LLM recall drops significantly. Information at the beginning or end performs best; information in the middle performs worst.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Position vs. recall (approximate trend):
Beginning (0–10%)    ████████████████ high
Middle (40–60%)      ██████           low
End (90–100%)        ████████████     higher
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stuffing 100 documents into context does not guarantee the model finds the one at position 50.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Cost of RAG
&lt;/h2&gt;

&lt;p&gt;RAG isn't free either.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost 1: Imperfect Retrieval
&lt;/h3&gt;

&lt;p&gt;Vector search is approximate matching — it makes mistakes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;False negatives&lt;/strong&gt;: relevant documents not retrieved. The user's question is semantically distant from the relevant passage; it falls outside the top-k.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False positives&lt;/strong&gt;: irrelevant documents retrieved. The LLM receives noise, which can cause confusion or hallucination.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly the problem that earlier articles in this series addressed: hybrid retrieval, Rerank, HyDE — all of these are patches for retrieval imperfection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost 2: Chunking Breaks Context
&lt;/h3&gt;

&lt;p&gt;Chunking splits documents into fragments. Related information can end up in different chunks. A 10-page research report whose conclusion depends on an assumption from page 3 may be split such that only the conclusion chunk is retrieved — the LLM gets the conclusion without the premise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost 3: System Complexity
&lt;/h3&gt;

&lt;p&gt;RAG is an engineering system: vector store + embedding model + retrieval pipeline + update mechanism + evaluation framework. Compared to "send the document to the LLM," it has significantly higher maintenance cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five-Dimension Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Long Context&lt;/th&gt;
&lt;th&gt;RAG&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Document volume ceiling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~10–100 docs (limited by window and cost)&lt;/td&gt;
&lt;td&gt;Unlimited (vector store scales)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (all tokens billed every request)&lt;/td&gt;
&lt;td&gt;Low (only relevant fragments)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (large inputs are slow)&lt;/td&gt;
&lt;td&gt;Low (small inputs are fast)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Recall completeness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Perfect (everything is present)&lt;/td&gt;
&lt;td&gt;Incomplete (depends on retrieval quality)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Knowledge updates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requires resending all content&lt;/td&gt;
&lt;td&gt;Only update changed documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Engineering complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (direct API call)&lt;/td&gt;
&lt;td&gt;High (retrieval pipeline to maintain)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Single-document understanding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Strong (cross-document reasoning)&lt;/td&gt;
&lt;td&gt;Weaker (affected by chunking)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Neither approach wins on all dimensions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Decision Framework: Which One?
&lt;/h2&gt;

&lt;p&gt;Four dimensions to locate your scenario:&lt;/p&gt;

&lt;h3&gt;
  
  
  Dimension 1: Document Volume
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt; 50 docs, total &amp;lt; 100K tokens     → consider long context
50–1000 docs                       → evaluate cost, decide
&amp;gt; 1000 docs, or total &amp;gt; 1M tokens  → RAG
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Dimension 2: Update Frequency
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Static content (monthly updates or less)   → long context acceptable
Dynamic content (daily/hourly updates)     → RAG (incremental indexing is cheap)
Real-time data                             → RAG (or direct API integration)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Dimension 3: Query Volume
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;One-time analysis (research, report generation)   → long context
Low-frequency queries (&amp;lt; 100/day)                 → either works
High-frequency queries (&amp;gt; 1000/day)               → RAG (cost differences compound)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Dimension 4: Latency Requirements
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Interactive Q&amp;amp;A (&amp;lt; 3 second response)   → RAG
Report generation, offline analysis     → long context acceptable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Summary Decision Table
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Use case                       Docs    Updates   Queries   Recommendation
──────────────────────────────────────────────────────────────────────────
Legal contract review (single) small   none      once      Long context
Enterprise knowledge base Q&amp;amp;A  large   frequent  high      RAG
PDF financial report analysis  medium  none      once      Long context
Product documentation chatbot  large   moderate  high      RAG
Codebase understanding         huge    frequent  high      RAG
Meeting notes summary (single) small   none      once      Long context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Hybrid Strategy: Use Both
&lt;/h2&gt;

&lt;p&gt;Long context and RAG are not mutually exclusive. Sometimes the best choice is a combination.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategy 1: RAG selects documents, long context reads in full
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Step 1: use RAG to find the 3 most relevant documents
&lt;/span&gt;&lt;span class="n"&gt;relevant_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# top-3 documents
&lt;/span&gt;
&lt;span class="c1"&gt;# Step 2: send full documents (not chunks) to the LLM
&lt;/span&gt;&lt;span class="n"&gt;full_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;load_full_doc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;relevant_docs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;full_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;full_docs&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: LLM answers based on complete documents
&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer based on the following documents:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;full_context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Good fit for&lt;/strong&gt;: large document sets (can't send all), but each document requires complex cross-passage reasoning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategy 2: Coarse-grained RAG with large chunks
&lt;/h3&gt;

&lt;p&gt;Traditional RAG uses 512–1024 token chunks. With larger windows, you can use 3,000–10,000 token chunks — preserving much more context while still doing retrieval filtering.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Split with larger chunks (preserve more context)
&lt;/span&gt;&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# traditional 512 → now 4000 is reasonable
&lt;/span&gt;    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Retrieve fewer chunks since each is larger
&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;# 3 × 4000 = 12,000 tokens: precise and context-rich
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Strategy 3: Summary cache + precise retrieval
&lt;/h3&gt;

&lt;p&gt;For large document libraries, use the LLM to generate a structured summary for each document; retrieve summaries; load the original passage on demand.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pre-processing: generate summaries (one-time)
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_documents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize this document&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s key points in 3 sentences:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;summary_doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;original&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;summary_vectorstore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_documents&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;summary_doc&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Query time: retrieve summaries, load original passages
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_with_summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;summaries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;summary_vectorstore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;similarity_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;relevant_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nf"&gt;extract_relevant_passage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;original&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;summaries&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;relevant_chunks&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What Actually Changed
&lt;/h2&gt;

&lt;p&gt;The rise of large context windows genuinely shifted some decisions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenarios where RAG was once necessary but now may not be:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding documents under 50 pages (just stuff it in — simpler)&lt;/li&gt;
&lt;li&gt;One-time document analysis tasks (not worth building a RAG system)&lt;/li&gt;
&lt;li&gt;Prototype validation (fast idea testing, no need for production-grade RAG)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scenarios where RAG is still necessary&lt;/strong&gt; (most production systems):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Knowledge bases with &amp;gt; 1,000 documents&lt;/li&gt;
&lt;li&gt;Frequently updated content&lt;/li&gt;
&lt;li&gt;High concurrency, cost-sensitive deployments&lt;/li&gt;
&lt;li&gt;Attribution requirements (RAG natively knows which document an answer came from)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Large context windows made "skip RAG for simple cases" a reasonable choice. They didn't make RAG obsolete — they made RAG's use case clearer: &lt;strong&gt;when document volume, update frequency, or cost makes "full context" impractical, RAG is irreplaceable.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Long Context&lt;/th&gt;
&lt;th&gt;RAG&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Core strength&lt;/td&gt;
&lt;td&gt;Complete context, cross-document reasoning&lt;/td&gt;
&lt;td&gt;Scalable, low cost, real-time updates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Core limitation&lt;/td&gt;
&lt;td&gt;High cost, high latency, hard document ceiling&lt;/td&gt;
&lt;td&gt;Imperfect retrieval, engineering complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Small-scale, one-time deep analysis&lt;/td&gt;
&lt;td&gt;Large-scale production systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trend&lt;/td&gt;
&lt;td&gt;Windows keep growing, costs keep falling&lt;/td&gt;
&lt;td&gt;Retrieval quality keeps improving&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are not competitors — they're complementary tools. Understanding the true cost of each, and choosing the right one, is engineering judgment.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2307.03172" rel="noopener noreferrer"&gt;Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2403.05530" rel="noopener noreferrer"&gt;Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://python.langchain.com/docs/tutorials/rag/" rel="noopener noreferrer"&gt;LangChain Long Context RAG Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rag</category>
      <category>llm</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>RAG Series (21): Performance Optimization — Faster and Cheaper</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Tue, 19 May 2026 02:01:42 +0000</pubDate>
      <link>https://forem.com/wonderlab/rag-series-21-performance-optimization-faster-and-cheaper-1eg6</link>
      <guid>https://forem.com/wonderlab/rag-series-21-performance-optimization-faster-and-cheaper-1eg6</guid>
      <description>&lt;h2&gt;
  
  
  The Cost Structure of RAG
&lt;/h2&gt;

&lt;p&gt;What happens in a single RAG request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. embed(question)          → 1 Embedding API call
2. vectorstore.search()     → vector store retrieval (local, fast)
3. llm.generate(context)    → 1 LLM API call
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At minimum 2 API calls per request. At scale, these compound quickly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: LLM calls typically 1–10 seconds; Embedding calls 0.1–0.5 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: token-based billing means identical questions pay the same price every time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The four optimizations each target a different point in this chain:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Optimization&lt;/th&gt;
&lt;th&gt;Where&lt;/th&gt;
&lt;th&gt;What it saves&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM response cache&lt;/td&gt;
&lt;td&gt;LLM call&lt;/td&gt;
&lt;td&gt;Skip LLM entirely, 0ms response&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding cache&lt;/td&gt;
&lt;td&gt;Embedding call&lt;/td&gt;
&lt;td&gt;No re-embedding for identical text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic Cache&lt;/td&gt;
&lt;td&gt;LLM call&lt;/td&gt;
&lt;td&gt;Reuse answers for similar questions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Async batch Embedding&lt;/td&gt;
&lt;td&gt;Embedding call&lt;/td&gt;
&lt;td&gt;N serial round-trips → 1 concurrent call&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Optimization 1: LLM Response Cache
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Principle&lt;/strong&gt;: A given (prompt, model, temperature) combination always produces a deterministic LLM call. Cache the result on the first call; return it directly on subsequent identical calls — no network request at all.&lt;/p&gt;

&lt;p&gt;LangChain exposes this as a global switch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.globals&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;set_llm_cache&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemoryCache&lt;/span&gt;

&lt;span class="nf"&gt;set_llm_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;InMemoryCache&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;   &lt;span class="c1"&gt;# one line, affects all LLM calls
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For persistence across restarts, swap in SQLite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SQLiteCache&lt;/span&gt;
&lt;span class="nf"&gt;set_llm_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SQLiteCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;database_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.llm_cache.db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;p&gt;3 questions, each asked twice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Q: What are the four core metrics in RAGAS?
  Cache miss:  1743ms   Cache hit:   0.7ms   Speedup: 2441×

Q: What are the common vector database options?
  Cache miss:  3675ms   Cache hit:   0.9ms   Speedup: 4126×

Q: What is Rerank?
  Cache miss:  9753ms   Cache hit:   0.9ms   Speedup: 10993×

Average: miss=5057ms  hit=0.8ms  speedup=6068×
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Hit latency is 0.8ms&lt;/strong&gt; — that's dictionary lookup time, not network latency. On a cache hit, zero network requests are made.&lt;/p&gt;

&lt;p&gt;6000× sounds exaggerated, but this is what "in-memory dict vs. network API call" actually looks like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good fit for&lt;/strong&gt;: FAQ-style Q&amp;amp;A, report generation (user clicks "regenerate" repeatedly), popular questions asked by many users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation&lt;/strong&gt;: Exact prompt match only. A rephrased question is a cache miss.&lt;/p&gt;




&lt;h2&gt;
  
  
  Optimization 2: Embedding Cache
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Principle&lt;/strong&gt;: The embedding vector for a given text is deterministic (same model + same text = same vector). &lt;code&gt;CacheBackedEmbeddings&lt;/code&gt; wraps a base embeddings object with a ByteStore layer — embed once, serialize and store, read from cache thereafter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_classic.embeddings&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CacheBackedEmbeddings&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_classic.storage&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemoryByteStore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LocalFileStore&lt;/span&gt;

&lt;span class="c1"&gt;# In-memory (lost on restart)
&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemoryByteStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# File-based (persistent across restarts)
# store = LocalFileStore("./embedding_cache/")
&lt;/span&gt;
&lt;span class="n"&gt;cached_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CacheBackedEmbeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_bytes_store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;underlying_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;base_embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;document_embedding_cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EMB_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# isolates cache by model name
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# API identical to regular embeddings
&lt;/span&gt;&lt;span class="n"&gt;vectorstore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Chroma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cached_embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;namespace=EMB_MODEL&lt;/code&gt; matters: if you switch embedding models, the old cached vectors have a different dimension and distribution. Namespacing by model name prevents the new model from reading stale vectors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;p&gt;8 texts, three passes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;First index (8 texts, all new):
  285ms   1 API call   8 texts sent

Repeat index (8 texts, all cached):
  5.7ms   0 API calls  0 texts sent

Knowledge base update (6 unchanged + 2 new):
  63.5ms  1 API call   2 texts sent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The third row is the point&lt;/strong&gt;: on a knowledge base update, the 6 unchanged documents are served from cache. Only the 2 new documents trigger an API call. This pairs naturally with the Indexing API from the previous article — content hash tracking identifies which documents need re-indexing; Embedding cache ensures identical content is never re-embedded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good fit for&lt;/strong&gt;: knowledge bases with a large stable core and occasional updates. The more documents, the lower the update frequency, the bigger the benefit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Optimization 3: Semantic Cache
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Principle&lt;/strong&gt;: LLM response cache requires an exact prompt match. Semantic Cache goes further: store historical (question, answer) pairs as vectors; when a new question arrives, run a nearest-neighbor search; if a sufficiently similar historical question is found, return its answer directly — skipping both retrieval and LLM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"What metrics does the RAGAS framework have?"  → miss → LLM generates → stored
"Describe the four core RAGAS metrics"         → vector search → finds above
                                               → similarity ≥ threshold → return cached answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_store&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Chroma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;semantic_cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_answers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;          &lt;span class="c1"&gt;# cache_id → answer
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;similarity_search_with_relevance_scores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_answers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cache_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_texts&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;metadatas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cache_id&lt;/span&gt;&lt;span class="p"&gt;}])&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_answers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cache_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results: Threshold Calibration Is the Hard Part
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Threshold: 0.85

RAGAS group:
  Original:  "What metrics does RAGAS have?"              → miss (3782ms)
  Paraphrase: "Describe the four core RAGAS metrics"      → miss (3298ms) ← expected HIT
  Different:  "How should I choose a vector database?"    → miss (2509ms) ← correct miss

Rerank group:
  Original:  "What role does Rerank play in RAG?"         → miss (11602ms)
  Paraphrase: "Why do RAG systems need re-ranking?"       → miss (3834ms) ← expected HIT
  Different:  "What is hybrid retrieval?"                 → miss (12578ms) ← correct miss

Total hit rate: 0/6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The paraphrases didn't hit the cache. This is not a code bug — &lt;strong&gt;threshold 0.85 is too high for these paraphrase pairs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why: &lt;code&gt;bge-large-zh-v1.5&lt;/code&gt; cosine similarity between these pairs likely falls in the 0.80–0.84 range, just below the threshold. Semantic similarity ≠ high cosine similarity. The mapping depends on the embedding model's representation space and training data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The correct approach&lt;/strong&gt;: calibrate before setting a threshold. Measure the similarity distribution on your actual question samples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Calibration: measure similarity on known similar pairs and known-different pairs
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dot&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;numpy.linalg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cosine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;similar_pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What RAGAS metrics are there?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;List the RAGAS evaluation metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How to choose a vector DB?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Which vector database should I use?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;dissimilar_pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What RAGAS metrics are there?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How to choose a vector DB?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;q1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q2&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;similar_pairs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;v1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;v2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Similar:    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;cosine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;q1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; / &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;q2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;q1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q2&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dissimilar_pairs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;v1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;v2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dissimilar: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;cosine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;q1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; / &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;q2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set threshold between the two distributions
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Find a threshold that separates the two distributions. For Chinese Q&amp;amp;A with bge models, 0.80–0.85 is a common starting range — but you must validate on your own data before deploying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real value of Semantic Cache&lt;/strong&gt;: high-volume FAQ systems where users ask the same questions in many different ways (customer service bots, documentation assistants). Potential for large LLM call reduction. But the value is entirely dependent on threshold calibration — it's not a drop-in default.&lt;/p&gt;




&lt;h2&gt;
  
  
  Optimization 4: Async Batch Embedding
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Principle&lt;/strong&gt;: Embedding N texts sequentially = N network round-trips. Embedding N texts in a single batch call = 1 network round-trip, processed in parallel server-side.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;

&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;

&lt;span class="c1"&gt;# Sequential (slow): one API call per text
&lt;/span&gt;&lt;span class="n"&gt;sequential&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Async batch (fast): one API call for all texts
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;embed_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aembed_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;embed_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;p&gt;12 texts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sequential (one by one):    830ms
Async batch (one call):     289ms
Speedup:                    2.87×
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same vectors, 11 fewer network round-trips. Vector agreement &amp;gt; 0.9999 cosine similarity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where to apply in the RAG pipeline&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Batch indexing at build time
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;index_documents_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="n"&gt;texts&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aembed_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# bulk write to vector store
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;# Concurrent user queries in the service layer
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_batch_queries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;questions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="n"&gt;vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aembed_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;questions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ainvoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;questions&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The more documents, the bigger the gain. Batch documents in chunks of 50–100 during index builds; expect 3–5× speedup over sequential, depending on network latency.&lt;/p&gt;




&lt;h2&gt;
  
  
  Combining All Four Optimizations
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1. LLM cache (global, always on)
&lt;/span&gt;&lt;span class="nf"&gt;set_llm_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SQLiteCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.llm_cache.db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Embedding cache (wrap the base embeddings)
&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LocalFileStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./embedding_cache/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CacheBackedEmbeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_bytes_store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;underlying_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;base_embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;document_embedding_cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EMB_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Semantic Cache (check before full pipeline)
&lt;/span&gt;&lt;span class="n"&gt;semantic_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;YOUR_CALIBRATED_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;semantic_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;
    &lt;span class="n"&gt;docs&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="n"&gt;semantic_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Async for bulk operations
&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aembed_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All four are orthogonal and stackable. Highest-ROI combination: &lt;strong&gt;LLM cache + Embedding cache&lt;/strong&gt; — near-zero implementation cost, should be on by default. &lt;strong&gt;Semantic Cache&lt;/strong&gt; requires calibration but delivers large savings once tuned. &lt;strong&gt;Async batch&lt;/strong&gt; is specifically valuable at index-build time and under high concurrency.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=====================================================================
  Optimization Results Summary
=====================================================================

  Optimization             Before          After         Savings
  ─────────────────────────────────────────────────────────────
  LLM response cache       5057ms          0.8ms         99.98%  ✓ strongly recommended
  Embedding cache (rebuild) 285ms          5.7ms         98%     ✓ strongly recommended
  Embedding cache (update)  8 API calls    2 API calls   75%     ✓ strongly recommended
  Semantic Cache (t=0.85)   functional     needs calibr. —       ⚠ calibrate first
  Async batch Embedding     830ms          289ms         65%     ✓ recommended at scale
=====================================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Full Code
&lt;/h2&gt;

&lt;p&gt;Complete code is open-sourced at:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/chendongqi/llm-in-action/tree/main/21-rag-performance" rel="noopener noreferrer"&gt;https://github.com/chendongqi/llm-in-action/tree/main/21-rag-performance&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;rag_performance.py&lt;/code&gt; — all four benchmarks with report generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How to run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/chendongqi/llm-in-action
&lt;span class="nb"&gt;cd &lt;/span&gt;21-rag-performance
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python rag_performance.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This article implemented and measured four RAG performance optimizations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;LLM response cache&lt;/strong&gt;: cheapest and highest impact — one line of code, repeated questions go from 5057ms to 0.8ms (6000× speedup)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding cache&lt;/strong&gt;: identical text never re-embedded; knowledge base updates only embed changed content (8 calls → 2 calls)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Cache&lt;/strong&gt;: conceptually correct, but threshold 0.85 produced 0/6 hits in this experiment — threshold calibration is non-optional; measure similarity distribution on real data before setting any value&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async batch Embedding&lt;/strong&gt;: 2.87× speedup for 12 texts; benefit grows with document count&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first three optimizations attack the same root problem: &lt;strong&gt;repeated computation is waste&lt;/strong&gt;. The same work shouldn't cost twice. The fourth attacks a different problem: &lt;strong&gt;serial waiting is unnecessary&lt;/strong&gt;. Work that can be parallelized shouldn't be queued.&lt;/p&gt;

&lt;p&gt;Different problems, same goal: making RAG viable in production.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://python.langchain.com/docs/how_to/llm_caching/" rel="noopener noreferrer"&gt;LangChain LLM Caching Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://python.langchain.com/docs/how_to/caching_embeddings/" rel="noopener noreferrer"&gt;CacheBackedEmbeddings Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://python.langchain.com/docs/concepts/async/" rel="noopener noreferrer"&gt;LangChain Async API&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
      <category>performance</category>
    </item>
    <item>
      <title>RAG Series (20): Enterprise RAG Architecture</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Tue, 19 May 2026 02:00:40 +0000</pubDate>
      <link>https://forem.com/wonderlab/rag-series-20-enterprise-rag-architecture-4765</link>
      <guid>https://forem.com/wonderlab/rag-series-20-enterprise-rag-architecture-4765</guid>
      <description>&lt;h2&gt;
  
  
  The Gap Between Demo and Production
&lt;/h2&gt;

&lt;p&gt;Every article in this series has shared one architectural assumption: a single vector store, accessible to everyone, returning any document to any user.&lt;/p&gt;

&lt;p&gt;That works in a demo. In an enterprise environment, it breaks immediately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Company A's documents can be retrieved by Company B's users&lt;/li&gt;
&lt;li&gt;Financial data can be pulled by any employee&lt;/li&gt;
&lt;li&gt;HR policies visible to contractors&lt;/li&gt;
&lt;li&gt;One user hammers the API and takes down the service for everyone else&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Production enterprise RAG needs three layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Incoming request
  ↓ rate limit check   — is this user still within quota?
  ↓ cache lookup       — has this question been answered before?
  ↓ tenant routing     — which knowledge base?
  ↓ permission filter  — within that KB, what can this user see?
  ↓ retrieve + generate — answer from authorized content only
  ↓ cache write        — store for next time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This article implements each layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 1: Multi-Tenancy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Strategy: one Qdrant Collection per tenant&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each customer or department gets its own Qdrant Collection. Collections are physically isolated — you can't search &lt;code&gt;acme_corp&lt;/code&gt;'s content by querying &lt;code&gt;globex_corp&lt;/code&gt;, because the two collections are entirely separate vector spaces.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_qdrant&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantVectorStore&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;VectorParams&lt;/span&gt;

&lt;span class="n"&gt;qdrant_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:memory:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# production: host="qdrant-server"
&lt;/span&gt;
&lt;span class="n"&gt;tenant_stores&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;QdrantVectorStore&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;TENANT_DOCS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;qdrant_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;vectors_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;VectorParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COSINE&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantVectorStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;qdrant_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tenant_stores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Routing is trivial — the request carries a &lt;code&gt;tenant_id&lt;/code&gt;, the service selects the matching store:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tenant_id&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tenant_stores&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown tenant: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tenant_stores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# permission filter added in Layer 2
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why not a shared Collection with a &lt;code&gt;tenant_id&lt;/code&gt; metadata filter?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It works technically, but carries a risk: a filter bug means Tenant A's data leaks to Tenant B. There's no hard boundary. Collection-level isolation also makes teardown clean — removing a tenant means dropping their Collection, with no residue.&lt;/p&gt;

&lt;p&gt;For soft isolation (departments within one company), metadata filtering is fine. For hard isolation (different customers), separate Collections are safer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 2: Access Control
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Strategy: documents carry &lt;code&gt;access_level&lt;/code&gt;; retrieval injects a Qdrant filter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each document declares its access level in metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Annual bonus: S-tier 3 months, A-tier 2 months...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hr-policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access_level&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hr_only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Robot control system: EtherCAT bus, latency &amp;lt;1ms...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;robot-spec&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access_level&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engineering_only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Roles map to the access levels they can see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ROLE_PERMISSIONS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;public&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engineering_only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hr_only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finance_only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engineer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;public&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engineering_only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;public&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hr_only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;public&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finance_only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employee&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;public&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At retrieval time, the role's allowed levels become a Qdrant &lt;code&gt;MatchAny&lt;/code&gt; filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FieldCondition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MatchAny&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;levels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ROLE_PERMISSIONS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;public&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;access_filter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;must&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;FieldCondition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata.access_level&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;MatchAny&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;any&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;levels&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tenant_stores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;search_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;access_filter&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;This filter executes at the vector database layer, not the application layer.&lt;/strong&gt; Unauthorized documents never leave the database — they aren't returned to the application, so there's nothing to leak.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 3: Caching
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Strategy: &lt;code&gt;(tenant_id, role, question)&lt;/code&gt; as cache key, TTL 300 seconds&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CacheEntry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;QueryCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_store&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CacheEntry&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_ttl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ttl_seconds&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_ttl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_store&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;())]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CacheEntry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Including &lt;code&gt;role&lt;/code&gt; in the cache key matters: an engineer and an HR manager asking the same question get different contexts (different documents pass the permission filter), so they may get different answers. Cache entries are not cross-role reusable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 4: Rate Limiting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Strategy: sliding window, 5 requests per user per minute&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RateLimiter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_requests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_max&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_requests&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;window_seconds&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_log&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;allow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_log&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_log&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                               &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_window&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_log&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_max&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_log&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sliding window vs. fixed window: a fixed window allows bursting at boundaries — a user can send 5 requests at second 59 and 5 more at second 61, sending 10 in 60 seconds. A sliding window enforces the limit across any 60-second interval.&lt;/p&gt;




&lt;h2&gt;
  
  
  Experiment Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario A: Normal Retrieval
&lt;/h3&gt;

&lt;p&gt;Engineer alice queries company info and technical docs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Q: What type of company is ACME Corp?
A: ACME Corp is a smart manufacturing company.
Sources: [company-intro, robot-spec]   ← public + engineering docs, correct
elapsed: 995ms

Q: What communication protocol does ACME's robot system use?
A: ACME Corp's robot control system uses the EtherCAT real-time bus.
Sources: [company-intro, robot-spec]   ← engineering doc correctly retrieved
elapsed: 1709ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario B: Permission Filtering
&lt;/h3&gt;

&lt;p&gt;The key thing to read here is the &lt;code&gt;sources&lt;/code&gt; array — not whether &lt;code&gt;docs_retrieved &amp;gt; 0&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[B1] Engineer alice asks about annual bonus (hr_only doc):
  Sources: [company-intro, robot-spec]   ← hr-policy is NOT in sources
  A: The reference material does not contain information about the bonus policy.

[B2] HR bob asks about net profit (finance_only doc):
  Sources: [company-intro, hr-policy]    ← financial-report is NOT in sources
  A: The reference material does not contain ACME's 2025 net profit.

[B3] HR bob asks about annual leave (hr_only doc):
  Sources: [company-intro, hr-policy]    ← hr-policy correctly appears
  A: Year 1: 12 days. Each additional year: +2 days. Maximum: 20 days.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What access control actually looks like in practice&lt;/strong&gt;: &lt;code&gt;hr-policy&lt;/code&gt; never appears in alice's sources list; &lt;code&gt;financial-report&lt;/code&gt; never appears in bob's sources list. The Qdrant filter intercepts these documents at the database layer. The LLM never receives them, so it correctly responds that the information isn't available.&lt;/p&gt;

&lt;p&gt;This is the right behavior: users still get documents they &lt;em&gt;can&lt;/em&gt; access (public + their role-specific docs); only the restricted documents are absent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario C: Tenant Isolation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[C1] Globex user charlie asks about ACME Corp's headcount:
  Tenant: globex_corp
  Sources: [products, company-intro]   ← these are Globex's own docs
  A: The reference material does not contain ACME's employee count.

[C2] Globex user queries their own product lines:
  Sources: [company-intro, products]   ← Globex docs correctly returned
  A: GlexCloud, GlexAnalytics, GlexAI...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Charlie is querying the &lt;code&gt;globex_corp&lt;/code&gt; Collection for ACME Corp information. Of course nothing comes back — ACME's content doesn't physically exist in Globex's Collection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario D: Cache Hit
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="err"&gt;First&lt;/span&gt; &lt;span class="err"&gt;request&lt;/span&gt; &lt;span class="err"&gt;(Scenario&lt;/span&gt; &lt;span class="err"&gt;A1):&lt;/span&gt; &lt;span class="err"&gt;995ms,&lt;/span&gt; &lt;span class="py"&gt;cache_hit&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;
&lt;span class="err"&gt;Same&lt;/span&gt; &lt;span class="err"&gt;question&lt;/span&gt; &lt;span class="py"&gt;repeated&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;0ms, cache_hit=true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;0ms means the repeated request skipped both retrieval and LLM generation entirely. For frequently repeated questions — company policy, common workflows, product FAQs — caching compounds quickly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario E: Rate Limiting
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5 req / 60s / user&lt;/span&gt;

&lt;span class="err"&gt;Request&lt;/span&gt; &lt;span class="py"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;allowed&lt;/span&gt;
&lt;span class="err"&gt;Request&lt;/span&gt; &lt;span class="py"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;allowed&lt;/span&gt;
&lt;span class="err"&gt;Request&lt;/span&gt; &lt;span class="py"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;allowed&lt;/span&gt;
&lt;span class="err"&gt;Request&lt;/span&gt; &lt;span class="py"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;allowed&lt;/span&gt;
&lt;span class="err"&gt;Request&lt;/span&gt; &lt;span class="py"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;allowed&lt;/span&gt;
&lt;span class="err"&gt;Request&lt;/span&gt; &lt;span class="py"&gt;6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RATE LIMITED   ← limit enforced&lt;/span&gt;
&lt;span class="err"&gt;Request&lt;/span&gt; &lt;span class="py"&gt;7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RATE LIMITED&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rate limiter correctly allowed 5 and blocked 2 out of 7 requests.&lt;/p&gt;




&lt;h2&gt;
  
  
  FastAPI Service Layer
&lt;/h2&gt;

&lt;p&gt;The four layers above are wired together in a single &lt;code&gt;query()&lt;/code&gt; function, then exposed via FastAPI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Header&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enterprise RAG Service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;QueryRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_endpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;QueryRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;x_user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Header&lt;/span&gt;&lt;span class="p"&gt;(...),&lt;/span&gt;     &lt;span class="c1"&gt;# user identity from request header
&lt;/span&gt;    &lt;span class="n"&gt;x_user_role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Header&lt;/span&gt;&lt;span class="p"&gt;(...),&lt;/span&gt;   &lt;span class="c1"&gt;# user role from request header
&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x_user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x_user_role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rate_limited&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Too many requests&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_hit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache_hit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start with: &lt;code&gt;uvicorn enterprise_rag:app --host 0.0.0.0 --port 8080&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In production, &lt;code&gt;x_user_id&lt;/code&gt; and &lt;code&gt;x_user_role&lt;/code&gt; should come from JWT token decoding, not raw client headers.&lt;/p&gt;




&lt;h2&gt;
  
  
  Production Upgrade Path
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Demo Implementation&lt;/th&gt;
&lt;th&gt;Production Replacement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qdrant&lt;/td&gt;
&lt;td&gt;&lt;code&gt;:memory:&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Dedicated server, &lt;code&gt;host="qdrant-server"&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache&lt;/td&gt;
&lt;td&gt;In-process dict&lt;/td&gt;
&lt;td&gt;Redis (distributed, persistent, TTL native)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limiter&lt;/td&gt;
&lt;td&gt;In-process counter&lt;/td&gt;
&lt;td&gt;Redis + sliding-window Lua script (safe across instances)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User identity&lt;/td&gt;
&lt;td&gt;Raw Header&lt;/td&gt;
&lt;td&gt;JWT token decode + signature verification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logging&lt;/td&gt;
&lt;td&gt;print()&lt;/td&gt;
&lt;td&gt;Structured logs + alerting on LLM call volume / latency / errors&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Full Code
&lt;/h2&gt;

&lt;p&gt;Complete code is open-sourced at:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/chendongqi/llm-in-action/tree/main/20-enterprise-rag" rel="noopener noreferrer"&gt;https://github.com/chendongqi/llm-in-action/tree/main/20-enterprise-rag&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;enterprise_rag.py&lt;/code&gt; — full implementation: multi-tenancy + access control + cache + rate limiting + FastAPI + scenario verification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How to run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/chendongqi/llm-in-action
&lt;span class="nb"&gt;cd &lt;/span&gt;20-enterprise-rag
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python enterprise_rag.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This article implemented a four-layer enterprise RAG architecture. Key findings:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Collection-level tenant isolation&lt;/strong&gt; — separate Qdrant Collections per tenant provide a physical boundary; metadata filtering alone offers no hard guarantee&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permissions enforced at the DB layer&lt;/strong&gt; — Qdrant's &lt;code&gt;MatchAny&lt;/code&gt; filter means restricted documents never leave the database; there's nothing for the application to leak&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache key must include role&lt;/strong&gt; — same question, different role → different context → potentially different answer; cross-role cache reuse produces wrong results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sliding window beats fixed window&lt;/strong&gt; — eliminates boundary bursting; any 60-second interval is bounded, not just aligned windows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access control is about absence&lt;/strong&gt; — users see the documents they're allowed to see; restricted documents simply don't appear in sources; the LLM correctly reports "no information available" for what it never received&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The gap between a RAG demo and a RAG production system is mostly engineering, not algorithms.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://qdrant.tech/documentation/concepts/filtering/" rel="noopener noreferrer"&gt;Qdrant Filtering Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://python.langchain.com/docs/integrations/vectorstores/qdrant/" rel="noopener noreferrer"&gt;langchain-qdrant Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fastapi.tiangolo.com/" rel="noopener noreferrer"&gt;FastAPI Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rag</category>
      <category>ragas</category>
      <category>ai</category>
      <category>qdrant</category>
    </item>
    <item>
      <title>One Open Source Project a Day (No. 69): Academic Research Skills - A Full-Pipeline AI Agent Suite for Academic Research</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Tue, 19 May 2026 01:59:22 +0000</pubDate>
      <link>https://forem.com/wonderlab/one-open-source-project-a-day-no-69-academic-research-skills-a-full-pipeline-ai-agent-suite-2956</link>
      <guid>https://forem.com/wonderlab/one-open-source-project-a-day-no-69-academic-research-skills-a-full-pipeline-ai-agent-suite-2956</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"AI is your copilot, not the pilot."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the 69th article in the "One Open Source Project a Day" series. Today, we are exploring &lt;strong&gt;Academic Research Skills&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is a Claude Code Skills suite serving academic researchers, covering the full workflow from literature review and paper writing to peer review. 11.9k Stars, 1.2k Forks — in the academic tooling space, those numbers stand out.&lt;/p&gt;

&lt;p&gt;But what I want to emphasize isn't just "what this tool can do." It's &lt;strong&gt;how the workflow itself is designed&lt;/strong&gt;. The author systematically studied how AI fails in academic contexts — hallucinated citations, position collapse under pushback, premature convergence — and engineered specific countermechanisms for each failure mode. These design patterns are directly applicable whether you're doing academic research or building any other complex AI Skill.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You Will Learn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The complete workflows of the four core Skills (Deep Research / Academic Paper / Peer Reviewer / Full Pipeline)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-Hallucination Gate design&lt;/strong&gt;: why the integrity checks at Stage 2.5 and Stage 4.5 are non-skippable&lt;/li&gt;
&lt;li&gt;How the &lt;strong&gt;Devil's Advocate (DA) mechanism&lt;/strong&gt; prevents AI from collapsing its position under social pressure&lt;/li&gt;
&lt;li&gt;How &lt;strong&gt;Socratic dialogue&lt;/strong&gt; with intent detection distinguishes exploratory inquiry from goal-oriented requests&lt;/li&gt;
&lt;li&gt;How the &lt;strong&gt;Dialogue Health Indicator&lt;/strong&gt; auto-injects challenge questions after 5-turn agreement patterns&lt;/li&gt;
&lt;li&gt;What these mechanisms mean for your own AI Skill design&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Experience with Claude Code or similar AI coding tools&lt;/li&gt;
&lt;li&gt;Basic familiarity with academic writing workflows&lt;/li&gt;
&lt;li&gt;Interest in understanding AI Skill workflow design principles&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Project Introduction
&lt;/h3&gt;

&lt;p&gt;Academic Research Skills is an academic research assistant suite built on the Claude Code Skills specification, led by Cheng-I Wu, currently at version v3.9.4.1.&lt;/p&gt;

&lt;p&gt;Its core philosophy: &lt;strong&gt;AI handles verification, synthesis, and consistency checking; humans retain full sovereignty over research direction, argumentation framework, and publication decisions&lt;/strong&gt;. This stands in sharp contrast to most "fully automated AI research" tools — it is explicitly not a system for generating papers without human thought. It is a collaboration framework that places human confirmation checkpoints at every critical decision node.&lt;/p&gt;

&lt;p&gt;This design choice itself is worth reflection: in a domain where academic integrity is paramount, "keeping humans in the loop" is not a functional compromise — it is a deliberate architectural commitment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Author / Team
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Primary Author&lt;/strong&gt;: Cheng-I Wu&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contributors&lt;/strong&gt;: aspi6246 (read-only constraints and cognitive framework refinements), mchesbro1 and cloudenochcsis (expanded IS journal list to Senior Scholars' Basket of 11)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Academic grounding&lt;/strong&gt;: The project cites multiple 2026 peer-reviewed studies as design rationale (Lu et al., Zhao et al., Song/Pfister/Yoon, and others) — design decisions are literature-backed&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project Data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;⭐ GitHub Stars: &lt;strong&gt;11,900+&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;🍴 Forks: &lt;strong&gt;1,200+&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;📦 Latest Version: &lt;strong&gt;v3.9.4.1&lt;/strong&gt; (2026-05-19)&lt;/li&gt;
&lt;li&gt;🌍 Language Support: English, Traditional Chinese, bilingual abstracts&lt;/li&gt;
&lt;li&gt;📄 License: CC BY-NC 4.0&lt;/li&gt;
&lt;li&gt;🌐 Repository: &lt;a href="https://github.com/Imbad0202/academic-research-skills" rel="noopener noreferrer"&gt;Imbad0202/academic-research-skills&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Main Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Utility
&lt;/h3&gt;

&lt;p&gt;Academic Research Skills breaks the complete academic workflow — from research question formation to publication — into four Skills that can be used independently or orchestrated together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Research Question Formation
        ↓
  🔬 Deep Research     ← 13-agent team, literature review and research synthesis
        ↓
  📝 Academic Paper    ← 12-agent pipeline, from outline to complete paper
        ↓
  🔍 Paper Reviewer    ← 7-agent review panel, simulated peer review
        ↓
  🔄 Academic Pipeline ← 10-stage orchestrator, full pipeline with integrity gates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Claude Code Installation (Fastest, v3.7.0+)&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/plugin marketplace add Imbad0202/academic-research-skills
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;academic-research-skills

&lt;span class="c"&gt;# Available slash commands after installation:&lt;/span&gt;
/deep-research        &lt;span class="c"&gt;# Start deep research mode&lt;/span&gt;
/academic-paper       &lt;span class="c"&gt;# Start paper writing mode&lt;/span&gt;
/paper-reviewer       &lt;span class="c"&gt;# Start peer review mode&lt;/span&gt;
/academic-pipeline    &lt;span class="c"&gt;# Start full pipeline orchestration&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Traditional Installation (5 methods, see &lt;code&gt;docs/SETUP.md&lt;/code&gt;)&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Global installation (available across all projects)&lt;/span&gt;
git clone https://github.com/Imbad0202/academic-research-skills.git
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; academic-research-skills/skills ~/.claude/skills/

&lt;span class="c"&gt;# Project-level installation (current project only)&lt;/span&gt;
&lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /path/to/academic-research-skills/skills ./.claude/skills/academic-research
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With Experiment Agent&lt;/strong&gt; (empirical research):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the companion experiment management agent&lt;/span&gt;
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;experiment-agent@Imbad0202/experiment-agent

&lt;span class="c"&gt;# Full empirical research workflow:&lt;/span&gt;
&lt;span class="c"&gt;# /deep-research → form research questions&lt;/span&gt;
&lt;span class="c"&gt;# experiment-agent → design and run experiments&lt;/span&gt;
&lt;span class="c"&gt;# /academic-paper → write paper based on results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Typical Usage Cost&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full pipeline (15,000-word paper): approximately $4–6 USD&lt;/li&gt;
&lt;li&gt;Detailed token budgets in &lt;code&gt;docs/PERFORMANCE.md&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Four Skills in Detail
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Skill 1: Deep Research (v2.8) — 13-Agent Research Team
&lt;/h3&gt;

&lt;p&gt;This is not simple "search + summarize." It is a 13-agent research team with clear role division.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Seven modes&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;full&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Comprehensive deep research, multi-source synthesis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;quick&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rapid literature overview&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;review&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Literature review for an existing draft&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;literature-review&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Systematic literature review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fact-check&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fact verification and citation validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;socratic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Socratic guided exploration (interactive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;systematic-review&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;PRISMA-compliant systematic review&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start Socratic guided mode&lt;/span&gt;
/deep-research &lt;span class="nt"&gt;--mode&lt;/span&gt; socratic &lt;span class="s2"&gt;"Impact of quantum computing on cryptography"&lt;/span&gt;

&lt;span class="c"&gt;# Start systematic review mode (PRISMA standards)&lt;/span&gt;
/deep-research &lt;span class="nt"&gt;--mode&lt;/span&gt; systematic-review &lt;span class="nt"&gt;--topic&lt;/span&gt; &lt;span class="s2"&gt;"ML applications in medical imaging"&lt;/span&gt;

&lt;span class="c"&gt;# Enable cross-model verification (more reliable, higher cost)&lt;/span&gt;
/deep-research &lt;span class="nt"&gt;--cross-model-verify&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Skill 2: Academic Paper (v3.0) — 12-Agent Writing Pipeline
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Ten modes&lt;/strong&gt; covering every stage of the paper lifecycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/academic-paper &lt;span class="nt"&gt;--mode&lt;/span&gt; plan      &lt;span class="c"&gt;# Guided planning (interactive, confirm before continuing)&lt;/span&gt;
/academic-paper &lt;span class="nt"&gt;--mode&lt;/span&gt; outline   &lt;span class="c"&gt;# Generate outline only&lt;/span&gt;
/academic-paper &lt;span class="nt"&gt;--mode&lt;/span&gt; full      &lt;span class="c"&gt;# Full paper writing&lt;/span&gt;
/academic-paper &lt;span class="nt"&gt;--mode&lt;/span&gt; revision  &lt;span class="c"&gt;# Revise an existing draft&lt;/span&gt;
/academic-paper &lt;span class="nt"&gt;--mode&lt;/span&gt; revision-coach  &lt;span class="c"&gt;# Revision coaching (guides, doesn't rewrite)&lt;/span&gt;
/academic-paper &lt;span class="nt"&gt;--mode&lt;/span&gt; abstract  &lt;span class="c"&gt;# Generate abstract only&lt;/span&gt;
/academic-paper &lt;span class="nt"&gt;--mode&lt;/span&gt; citation-check  &lt;span class="c"&gt;# Citation verification&lt;/span&gt;
/academic-paper &lt;span class="nt"&gt;--mode&lt;/span&gt; disclosure      &lt;span class="c"&gt;# Generate AI use disclosure statement&lt;/span&gt;
/academic-paper &lt;span class="nt"&gt;--mode&lt;/span&gt; format-convert  &lt;span class="c"&gt;# Format conversion (MD → DOCX/PDF)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Multiple output formats&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Markdown (default)&lt;/span&gt;
&lt;span class="c"&gt;# DOCX (via Pandoc)&lt;/span&gt;
&lt;span class="c"&gt;# PDF (via tectonic, APA 7.0 LaTeX)&lt;/span&gt;

/academic-paper &lt;span class="nt"&gt;--format&lt;/span&gt; pdf &lt;span class="nt"&gt;--citation-style&lt;/span&gt; apa7 &lt;span class="s2"&gt;"Quantum entanglement in communications"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Supported paper structures: IMRaD (empirical), thematic literature review, theoretical analysis, case study, policy brief, conference paper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Citation format support&lt;/strong&gt;: APA 7.0 (default, including Chinese citation rules), Chicago (footnote and author-date), MLA, IEEE, Vancouver.&lt;/p&gt;

&lt;h3&gt;
  
  
  Skill 3: Academic Paper Reviewer (v1.8) — 7-Agent Review Panel
&lt;/h3&gt;

&lt;p&gt;This Skill models a real journal review process, constructing a virtual editorial board:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Role Composition:
  - Editor-in-Chief (EIC)      ← Coordinates review, makes final decision
  - Reviewer A                 ← Theoretical contribution and literature
  - Reviewer B                 ← Research methodology and statistics
  - Reviewer C                 ← Writing quality and logical structure
  - Devil's Advocate (DA)      ← Targets the paper's weakest points
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Scoring framework&lt;/strong&gt; (0–100):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;≥ 80&lt;/td&gt;
&lt;td&gt;Accept&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;65–79&lt;/td&gt;
&lt;td&gt;Minor Revision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50–64&lt;/td&gt;
&lt;td&gt;Major Revision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 50&lt;/td&gt;
&lt;td&gt;Reject&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Six modes&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/paper-reviewer &lt;span class="nt"&gt;--mode&lt;/span&gt; full          &lt;span class="c"&gt;# Full review (EIC + 3 reviewers + DA)&lt;/span&gt;
/paper-reviewer &lt;span class="nt"&gt;--mode&lt;/span&gt; re-review     &lt;span class="c"&gt;# Post-revision re-review&lt;/span&gt;
/paper-reviewer &lt;span class="nt"&gt;--mode&lt;/span&gt; quick         &lt;span class="c"&gt;# Quick review&lt;/span&gt;
/paper-reviewer &lt;span class="nt"&gt;--mode&lt;/span&gt; methodology   &lt;span class="c"&gt;# Focus on methodology&lt;/span&gt;
/paper-reviewer &lt;span class="nt"&gt;--mode&lt;/span&gt; guided        &lt;span class="c"&gt;# Guided mode (interactive confirmation)&lt;/span&gt;
/paper-reviewer &lt;span class="nt"&gt;--mode&lt;/span&gt; calibration   &lt;span class="c"&gt;# Calibration mode (compare against gold standard, test FNR/FPR)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Skill 4: Academic Pipeline (v3.7) — 10-Stage Orchestrator
&lt;/h3&gt;

&lt;p&gt;The "conductor" of the entire suite — organizing the three preceding Skills into a complete 10-stage workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stage 1  : RESEARCH (deep research + research question formation)
Stage 2  : WRITE (first draft)
Stage 2.5: INTEGRITY CHECK ⛔ [Non-skippable]
Stage 3  : POLISH (refinement and improvement)
Stage 4  : REVIEW (simulated peer review)
Stage 4.5: INTEGRITY RE-CHECK ⛔ [Non-skippable]
Stage 5  : REVISE (revisions based on review feedback)
Stage 6  : FINAL REVIEW (final manuscript review)
Stage 7  : FORMAT (formatting and output)
Stage 8  : DISCLOSURE (generate AI use disclosure statement)
Stage 9  : POST-PUBLICATION AUDIT (optional)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Three entry points&lt;/strong&gt; (you don't have to start from the beginning):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Full pipeline starting from Stage 1&lt;/span&gt;
/academic-pipeline &lt;span class="nt"&gt;--entry&lt;/span&gt; stage1 &lt;span class="s2"&gt;"Research topic description"&lt;/span&gt;

&lt;span class="c"&gt;# Start from Stage 2.5 (existing draft, run integrity check first)&lt;/span&gt;
/academic-pipeline &lt;span class="nt"&gt;--entry&lt;/span&gt; stage2.5 &lt;span class="nt"&gt;--draft&lt;/span&gt; my_paper.md

&lt;span class="c"&gt;# Start from Stage 4 (existing manuscript, go directly to peer review)&lt;/span&gt;
/academic-pipeline &lt;span class="nt"&gt;--entry&lt;/span&gt; stage4 &lt;span class="nt"&gt;--paper&lt;/span&gt; final_draft.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Workflow Design Insights Worth Studying
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;This is the most important section of today's article.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In building this system, the author systematically analyzed how AI fails in academic contexts — and engineered specific countermechanisms for each failure mode. These mechanisms are not just academic research tools. They are design patterns directly applicable to any complex AI Skill.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mechanism 1: Non-Skippable Integrity Gates (Anti-Hallucination Gates)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: Zhao et al. (2026) estimate that approximately &lt;strong&gt;146,932 hallucinated citations&lt;/strong&gt; were inserted into academic papers in 2025, with 85.3% of those persisting from preprint all the way to published versions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The response&lt;/strong&gt;: Stage 2.5 and Stage 4.5 enforce &lt;strong&gt;mandatory integrity verification&lt;/strong&gt; using the Semantic Scholar API to check citations. Neither gate can be bypassed, regardless of whether the user wants to skip them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stage 2.5 Integrity Check — 7 Blocking Categories:
  ❌ Implementation errors (code/experiment inconsistent with description)
  ❌ Hallucinated results (reporting results from experiments never run)
  ❌ Methodology shortcuts (claimed rigorous, actually simplified)
  ❌ Methodological fabrication (described methods never used)
  ❌ Citation hallucination (citing non-existent or misrepresented sources)
  ❌ L3 claim audit (optional: pull cited sources, compare against claims)
  ❌ Statistical errors (p-values, confidence intervals, effect size consistency)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Insight for Skill designers&lt;/strong&gt;: In any high-stakes output workflow, place &lt;strong&gt;non-bypassable verification nodes&lt;/strong&gt;. Make "whether to do an integrity check" a non-choice — because under time pressure, humans will always choose to skip it.&lt;/p&gt;




&lt;h3&gt;
  
  
  Mechanism 2: Socratic Dialogue + Intent Detection
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: Most AI dialogue systems have an inherent tendency — converge to an answer quickly, reach conclusions fast. In the early stages of exploratory research, this is harmful. What researchers actually need is to be guided by better questions, not handed a premature answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The response&lt;/strong&gt;: Deep Research's Socratic mode implements an &lt;strong&gt;intent classification layer&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Intent detection logic (evaluated every 3 turns)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dialogue_history&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;exploratory_signals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m thinking...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What do you think...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Is it possible that...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exploratory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="c1"&gt;# → Disable automatic convergence
&lt;/span&gt;        &lt;span class="c1"&gt;# → Raise max turns to 60
&lt;/span&gt;        &lt;span class="c1"&gt;# → Suppress early-summary prompts
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;goal_oriented_signals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate me...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I need a...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;goal-oriented&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="c1"&gt;# → Normal convergence behavior
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dialogue Health Indicator&lt;/strong&gt; (silent evaluation every 5 turns):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Evaluated dimensions:
  - Is there a persistent agreement pattern?
  - Is conflict being avoided?
  - Is there premature convergence?

If problems detected → Auto-inject challenge questions to break surface harmony
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Insight for Skill designers&lt;/strong&gt;: Distinguish "the user wants to be guided in thinking" from "the user wants a deliverable." These two modes require completely different dialogue strategies. Add intent classification logic to your Skill's frontmatter rather than applying one prompt strategy to every scenario.&lt;/p&gt;




&lt;h3&gt;
  
  
  Mechanism 3: Devil's Advocate Concession Threshold Protocol
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: The author observed a phenomenon he calls &lt;strong&gt;Frame-lock&lt;/strong&gt;: when the user (or another agent) pushes back on the Devil's Advocate's position, the DA concedes within a few turns and begins agreeing. This turns "adversarial review" into theater.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause&lt;/strong&gt;: RLHF training makes models prefer conflict reduction — which in multi-turn dialogue systematically causes position collapse (sycophancy under pushback).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The response&lt;/strong&gt;: A &lt;strong&gt;Concession Threshold Protocol&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;When DA receives pushback from user or other agents:

Step 1: DA internally scores the pushback on a 1–5 scale (not shown to user)
        1–2: Weak argument, appeals to authority only, or bare assertion
        3:   Some merit, but insufficient to overturn core position
        4:   Substantive evidence, warrants reconsideration
        5:   New evidence provided — position should be revised

Step 2: Act based on score
        ≤ 3 → DA maintains position, restates reasoning (no concession)
        ≥ 4 → DA may partially concede (but must explain why it changed)

Step 3: Consecutive concession protection
        Consecutive concessions prohibited (if DA just conceded, it cannot
        concede again the very next turn)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Frame-lock detection&lt;/strong&gt;: After each checkpoint, evaluate whether DA is only attacking arguments without questioning underlying assumptions. If so, automatically trigger "premise examination mode."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Insight for Skill designers&lt;/strong&gt;: In any Skill involving opposing viewpoints (code review, proposal evaluation, risk analysis), explicitly define &lt;strong&gt;concession conditions&lt;/strong&gt; rather than leaving it to the model's judgment. A numerical scoring threshold is the most direct and effective tool against sycophancy.&lt;/p&gt;




&lt;h3&gt;
  
  
  Mechanism 4: Style Calibration and Anti-AI-Pattern Writing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: AI-generated academic text has recognizable "AI tells" — overuse of transitional phrases, formulaic paragraph structures, unnaturally uniform vocabulary distribution. This affects not just readability but may also trigger academic detection tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The response&lt;/strong&gt;: The Academic Paper Skill includes a &lt;strong&gt;style calibration phase&lt;/strong&gt; before writing begins:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: 3–5 papers or articles the user has previously written or published
        ↓
Analysis: Sentence length distribution, paragraph structure preferences,
          common connectives, technical term density,
          active/passive voice ratio
        ↓
Calibration: Generation mimics the user's identified writing style profile
        ↓
Output Check: Writing Quality Check module
              Specifically identifies and reduces AI-pattern features
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Insight for Skill designers&lt;/strong&gt;: In writing-oriented Skills, &lt;strong&gt;style input is a required pre-step, not an option&lt;/strong&gt;. Have the model "learn how the user writes" before it starts writing. This is the difference between output that is genuinely useful and output that is merely functionally complete.&lt;/p&gt;




&lt;h3&gt;
  
  
  Mechanism 5: R&amp;amp;R Traceability Matrix (Revision Traceability)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: The revision stage is where "claiming a change was made without actually making it" most commonly occurs. Reviewers request changes to points A, B, and C. The author's response letter says "addressed." How does an AI agent verify this?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The response&lt;/strong&gt;: The &lt;strong&gt;R&amp;amp;R Traceability Matrix&lt;/strong&gt; (Schema 11):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input:
  - Reviewer comments (including specific change requests)
  - Revised manuscript
  - Author Response Letter
        ↓
Independent Verification:
  - Check each reviewer comment → locate corresponding change in manuscript
  - Check each claim in author response → verify actual change in manuscript
  - Flag items where "claimed addressed" but no corresponding change found
        ↓
Output: Traceability report (Addressed / Partially Addressed / Not Addressed / Claim Unverified)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Insight for Skill designers&lt;/strong&gt;: In any version-comparison workflow (code review, document revision, requirements changes), introduce &lt;strong&gt;claim-to-implementation consistency checking&lt;/strong&gt;. This is more reliable than manual review and provides more semantic judgment than a simple diff.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Links &amp;amp; Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Official Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🌟 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Imbad0202/academic-research-skills" rel="noopener noreferrer"&gt;https://github.com/Imbad0202/academic-research-skills&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔬 &lt;strong&gt;Companion Experiment Agent&lt;/strong&gt;: &lt;a href="https://github.com/Imbad0202/experiment-agent" rel="noopener noreferrer"&gt;Imbad0202/experiment-agent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📦 &lt;strong&gt;Codex Version&lt;/strong&gt;: &lt;a href="https://github.com/Imbad0202/academic-research-skills-codex" rel="noopener noreferrer"&gt;Imbad0202/academic-research-skills-codex&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📖 &lt;strong&gt;Architecture Documentation&lt;/strong&gt;: &lt;code&gt;docs/ARCHITECTURE.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;🚀 &lt;strong&gt;Quick Start Guide&lt;/strong&gt;: &lt;code&gt;QUICKSTART.md&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Target Audience
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Academic researchers&lt;/strong&gt;: Graduate students, PhD candidates, and faculty who want AI assistance without sacrificing academic rigor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Skill designers&lt;/strong&gt;: Anyone interested in implementing anti-sycophancy, anti-hallucination gates, and intent detection in complex workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Academic journal editors&lt;/strong&gt;: Using the reviewer mode to understand current AI-assisted research quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research methods educators&lt;/strong&gt;: Using Socratic mode to guide students in critical thinking&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Functional layer&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Four Skills cover the complete academic workflow: Deep Research (13 agents) + Academic Paper (12 agents) + Reviewer (7 agents) + Pipeline (10-stage orchestration)&lt;/li&gt;
&lt;li&gt;Supports APA 7.0, Chicago, MLA, IEEE, Vancouver citation formats; Markdown/DOCX/PDF output&lt;/li&gt;
&lt;li&gt;Complete pipeline for a 15,000-word paper costs approximately $4–6&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Workflow design layer&lt;/strong&gt; (core insights for Skill designers):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Non-skippable integrity gates&lt;/strong&gt;: Place mandatory verification nodes before high-stakes outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intent detection&lt;/strong&gt;: Distinguish exploratory dialogue from goal-oriented requests; respond with different strategies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concession Threshold Protocol&lt;/strong&gt;: Use numerical scoring thresholds to prevent AI position collapse under conversational pressure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Style calibration&lt;/strong&gt;: A required pre-step in writing Skills — let the model learn how the user writes first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claim-implementation traceability&lt;/strong&gt;: Consistency verification in version-comparison workflows&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  One-Line Review
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Academic Research Skills is not just an academic tool — it is a living reference on how to design responsible AI workflows in high-stakes scenarios.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Find more useful knowledge and interesting products on my &lt;a href="https://home.wonlab.top" rel="noopener noreferrer"&gt;Homepage&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>agents</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>RAG Series (19): Incremental Updates — Keeping the Knowledge Base Fresh</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Mon, 18 May 2026 02:53:57 +0000</pubDate>
      <link>https://forem.com/wonderlab/rag-series-19-incremental-updates-keeping-the-knowledge-base-fresh-596c</link>
      <guid>https://forem.com/wonderlab/rag-series-19-incremental-updates-keeping-the-knowledge-base-fresh-596c</guid>
      <description>&lt;h2&gt;
  
  
  Knowledge Bases Are Not Static
&lt;/h2&gt;

&lt;p&gt;Every article in this series so far has shared one implicit assumption: documents are loaded once, and the index never changes.&lt;/p&gt;

&lt;p&gt;Production doesn't work like that.&lt;/p&gt;

&lt;p&gt;Product documentation updates weekly. Knowledge base articles are added daily. Outdated content gets retired. Every time something changes, you face a choice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A: Full rebuild&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Re-embed every document — including the ones that didn't change — and rebuild the entire vector index from scratch. Simple to implement. Expensive to run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You pay for embedding every document every time&lt;/li&gt;
&lt;li&gt;1000 documents, 5 changed: still 1000 embed calls&lt;/li&gt;
&lt;li&gt;Rebuild time grows proportionally to corpus size, not change size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Option B: Incremental update&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Store a content hash for each indexed document. On the next sync, only process the documents whose hash changed — embed the new ones, replace the modified ones, clean up the deleted ones, skip everything else.&lt;/p&gt;

&lt;p&gt;LangChain's Indexing API implements Option B.&lt;/p&gt;




&lt;h2&gt;
  
  
  How the Indexing API Works
&lt;/h2&gt;

&lt;p&gt;Two components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;SQLRecordManager&lt;/code&gt;&lt;/strong&gt;: A SQLite database that stores a record for each indexed document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;source&lt;/span&gt;         &lt;span class="err"&gt;|&lt;/span&gt;  &lt;span class="k"&gt;content&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;hash&lt;/span&gt;           &lt;span class="err"&gt;|&lt;/span&gt;  &lt;span class="k"&gt;indexed&lt;/span&gt;&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="k"&gt;at&lt;/span&gt;
&lt;span class="k"&gt;rag&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;intro&lt;/span&gt;      &lt;span class="err"&gt;|&lt;/span&gt;  &lt;span class="k"&gt;a&lt;/span&gt;&lt;span class="mf"&gt;3&lt;/span&gt;&lt;span class="k"&gt;f&lt;/span&gt;&lt;span class="mf"&gt;8&lt;/span&gt;&lt;span class="k"&gt;b&lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="err"&gt;..&lt;/span&gt;            &lt;span class="err"&gt;|&lt;/span&gt;  &lt;span class="ld"&gt;2026-05-15&lt;/span&gt; &lt;span class="mf"&gt;10&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;00&lt;/span&gt;
&lt;span class="k"&gt;ragas&lt;/span&gt;          &lt;span class="err"&gt;|&lt;/span&gt;  &lt;span class="k"&gt;d&lt;/span&gt;&lt;span class="mf"&gt;9e2&lt;/span&gt;&lt;span class="k"&gt;f&lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="k"&gt;a&lt;/span&gt;&lt;span class="mf"&gt;4.&lt;/span&gt;&lt;span class="err"&gt;..&lt;/span&gt;            &lt;span class="err"&gt;|&lt;/span&gt;  &lt;span class="ld"&gt;2026-05-15&lt;/span&gt; &lt;span class="mf"&gt;10&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;00&lt;/span&gt;
&lt;span class="k"&gt;vector&lt;/span&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="k"&gt;db&lt;/span&gt;      &lt;span class="err"&gt;|&lt;/span&gt;  &lt;span class="mf"&gt;7&lt;/span&gt;&lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="mf"&gt;4&lt;/span&gt;&lt;span class="k"&gt;b&lt;/span&gt;&lt;span class="mf"&gt;8e3&lt;/span&gt;&lt;span class="k"&gt;f&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;            &lt;span class="err"&gt;|&lt;/span&gt;  &lt;span class="ld"&gt;2026-05-15&lt;/span&gt; &lt;span class="mf"&gt;10&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;00&lt;/span&gt;
&lt;span class="err"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;index()&lt;/code&gt; function&lt;/strong&gt;: Compares the current document batch against the RecordManager and decides what happens to each document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For each document in the batch:
  Hash matches   → skip (num_skipped++)
  Hash differs   → delete old version, insert new (num_deleted++, num_added++)
  First time     → insert (num_added++)

After processing all documents (cleanup="full"):
  In RecordManager but not in batch → delete (num_deleted++)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;cleanup="full"&lt;/code&gt; handles the deletion case. Without it, documents that were removed from your knowledge base continue to live in the vector store and show up in retrieval results — stale content, indefinitely.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  RecordManager Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_classic.indexes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SQLRecordManager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;

&lt;span class="n"&gt;NAMESPACE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chroma/rag_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;record_manager&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SQLRecordManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;NAMESPACE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;db_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sqlite:///record_manager.db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;record_manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_schema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;   &lt;span class="c1"&gt;# create tables on first run
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;NAMESPACE&lt;/code&gt; acts as a partition key. One SQLite file can manage multiple independent knowledge bases without interference.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Sync Function
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sync_knowledge_base&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Incrementally sync a document batch into the vector store.

    - Unchanged documents: skipped (no embedding API call)
    - New / modified documents: embedded and written
    - Removed documents: deleted from the vector store
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;record_manager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;cleanup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;full&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# auto-remove docs not in this batch
&lt;/span&gt;        &lt;span class="n"&gt;source_id_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# metadata["source"] identifies each document
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;source_id_key&lt;/code&gt; is the document identity key. Two documents with the same &lt;code&gt;source&lt;/code&gt; are treated as different versions of the same document. If content changes, the old version is deleted and the new version is added.&lt;/p&gt;

&lt;h3&gt;
  
  
  Documents Must Have a &lt;code&gt;source&lt;/code&gt; Field
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag-intro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;   &lt;span class="c1"&gt;# required for version tracking
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Documents without a &lt;code&gt;source&lt;/code&gt; can't be tracked incrementally — they'll be treated as new every single time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Experiment: Three Sync Rounds
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dataset Design
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;V1 (initial knowledge base, 6 documents):&lt;/strong&gt;&lt;br&gt;
rag-intro, ragas, vector-db, embedding, rerank, chunking&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;V2 (simulated update cycle):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Change Type&lt;/th&gt;
&lt;th&gt;Document&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unchanged&lt;/td&gt;
&lt;td&gt;rag-intro, vector-db, rerank&lt;/td&gt;
&lt;td&gt;Identical content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modified&lt;/td&gt;
&lt;td&gt;ragas&lt;/td&gt;
&lt;td&gt;Added faithfulness explanation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modified&lt;/td&gt;
&lt;td&gt;chunking&lt;/td&gt;
&lt;td&gt;Added contextual retrieval section&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deleted&lt;/td&gt;
&lt;td&gt;embedding&lt;/td&gt;
&lt;td&gt;Not present in V2 batch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Added&lt;/td&gt;
&lt;td&gt;advanced-rag&lt;/td&gt;
&lt;td&gt;New document&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Added&lt;/td&gt;
&lt;td&gt;conv-rag&lt;/td&gt;
&lt;td&gt;New document&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;V1 → V2: 3 unchanged, 2 modified, 1 deleted, 2 added.&lt;/p&gt;
&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;======================================================================
  Scenario 1: Initial Index (V1 — 6 documents)
======================================================================

  [Initial Index]
  ┌─────────────────────────────────────────┐
  │  added:       6  (newly embedded)       │
  │  skipped:     0  (content unchanged)    │
  │  deleted:     0  (removed/replaced)     │
  ├─────────────────────────────────────────┤
  │  embed calls:    6                      │
  │  wall time:   0.913s                    │
  └─────────────────────────────────────────┘

======================================================================
  Scenario 2: Incremental Update (V2)
======================================================================

  [Incremental Update]
  ┌─────────────────────────────────────────┐
  │  added:       4  (newly embedded)       │
  │  skipped:     3  (content unchanged)    │
  │  deleted:     3  (removed/replaced)     │
  ├─────────────────────────────────────────┤
  │  embed calls:    4                      │
  │  wall time:   0.891s                    │
  └─────────────────────────────────────────┘

======================================================================
  Scenario 3: Full Rebuild (V2, record manager wiped)
======================================================================

  [Full Rebuild]
  ┌─────────────────────────────────────────┐
  │  added:       7  (newly embedded)       │
  │  skipped:     0  (content unchanged)    │
  │  deleted:     0  (removed/replaced)     │
  ├─────────────────────────────────────────┤
  │  embed calls:    7                      │
  │  wall time:   0.494s                    │
  └─────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Cost Comparison
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  ┌──────────────────────┬───────────────┬───────────────┐
  │                      │   Incremental │  Full Rebuild │
  ├──────────────────────┼───────────────┼───────────────┤
  │  Documents embedded  │       4       │       7       │
  │  Documents skipped   │       3       │       0       │
  │  Embedding savings   │    42.9%      │     0%        │
  └──────────────────────┴───────────────┴───────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Incremental update triggered 4 embed calls; full rebuild triggered 7. That's 42.9% fewer API calls for the same end state.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  An Honest Look at the Timing Results
&lt;/h2&gt;

&lt;p&gt;Full rebuild was actually faster (0.494s vs 0.891s). This deserves an explanation.&lt;/p&gt;

&lt;p&gt;With 7 small documents, the SQLite hash lookup and comparison overhead costs more than the time saved by skipping 3 embed calls. Embedding calls go out as batched async HTTP requests — latency is dominated by network round-trip. SQLite operations are synchronous local disk I/O. At small scale, the bookkeeping costs more than the savings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This reverses quickly at realistic scale:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: 1,000-document knowledge base, 5% daily change rate (50 docs)

Full rebuild:    1,000 embed calls per day
Incremental:        50 embed calls per day  →  95% reduction

At $0.0001 per embed call (typical for bge-large-zh-v1.5):
  Full rebuild:   ~$100/day (assuming avg 200 tokens/doc)
  Incremental:    ~$5/day

At 10,000 documents:
  Full rebuild:   ~$1,000/day
  Incremental:    ~$50/day
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The time savings at small scale are not meaningful. The API cost savings are real from day one, and both metrics grow with corpus size.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Kinds of Deletion
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;deleted: 3&lt;/code&gt; in the incremental result contains two different things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Replacement deletion (2)&lt;/strong&gt;: ragas and chunking changed content. The old versions are deleted from the vector store; new versions are embedded and inserted. Net document count for these sources: unchanged.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cleanup deletion (1)&lt;/strong&gt;: the &lt;code&gt;embedding&lt;/code&gt; document was not in the V2 batch at all. With &lt;code&gt;cleanup="full"&lt;/code&gt;, after processing all documents in the batch, the indexer checks the RecordManager for any source not seen in this run — finds &lt;code&gt;embedding&lt;/code&gt;, and removes it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you used &lt;code&gt;cleanup=None&lt;/code&gt; instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Not recommended: stale documents accumulate
&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record_manager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cleanup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Recommended: full sync, stale content auto-removed
&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record_manager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cleanup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;full&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_id_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without cleanup, &lt;code&gt;embedding&lt;/code&gt; stays in the vector store indefinitely. Users querying about embedding models would still get answers based on the retired document. This is the "ghost document" problem — one of the more insidious production bugs in RAG systems, because it's invisible until someone notices the answers reference content that no longer exists.&lt;/p&gt;




&lt;h2&gt;
  
  
  Production Integration Pattern
&lt;/h2&gt;

&lt;p&gt;In practice, incremental updates are triggered by a scheduled job or a document change event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_documents_from_dir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Load documents from the filesystem, using file path as source.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;docs_dir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/**/*.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recursive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;

&lt;span class="c1"&gt;# Scheduled job: sync every hour
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hourly_sync&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_documents_from_dir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;record_manager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;cleanup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;full&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;source_id_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sync done: +&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;num_added&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; added  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;num_deleted&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; deleted  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;num_skipped&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; skipped&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;File path as &lt;code&gt;source&lt;/code&gt; is naturally unique. File content changes automatically invalidate the stored hash, triggering re-embedding on the next sync. No extra tracking code needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  RecordManager Persistence
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;SQLRecordManager&lt;/code&gt; persists to disk, so the hash registry survives service restarts. For production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Development / single machine
&lt;/span&gt;&lt;span class="n"&gt;record_manager&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SQLRecordManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;namespace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;db_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sqlite:///record_manager.db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Production / distributed (multiple service instances share one registry)
&lt;/span&gt;&lt;span class="n"&gt;record_manager&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SQLRecordManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;namespace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;db_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql://user:pass@host/dbname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SQLite works for single-instance deployments. Switch to PostgreSQL when multiple instances need to share the same RecordManager — otherwise concurrent writes will corrupt the hash registry.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full Code
&lt;/h2&gt;

&lt;p&gt;Complete code is open-sourced at:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/chendongqi/llm-in-action/tree/main/19-incremental-update" rel="noopener noreferrer"&gt;https://github.com/chendongqi/llm-in-action/tree/main/19-incremental-update&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;incremental_update.py&lt;/code&gt; — three sync scenarios, counting wrapper, cost comparison, query verification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How to run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/chendongqi/llm-in-action
&lt;span class="nb"&gt;cd &lt;/span&gt;19-incremental-update
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python incremental_update.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This article implemented incremental knowledge base updates using LangChain's Indexing API. Key findings:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Content hash tracking is the mechanism&lt;/strong&gt; — RecordManager stores a hash for each document; unchanged → skip, modified → delete old + insert new, deleted → cleanup removes it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;42.9% embedding reduction&lt;/strong&gt; — 7 documents, 3 unchanged: only 4 embed calls instead of 7. The ratio improves as the corpus grows and change rate decreases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wall time savings don't show at small scale&lt;/strong&gt; — SQLite hash lookup overhead dominates at 7 documents; time savings become significant at 1,000+ documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;cleanup="full"&lt;/code&gt; prevents ghost documents&lt;/strong&gt; — without it, deleted documents stay in the vector store indefinitely and keep appearing in retrieval results&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Incremental updates are the step that moves RAG from "demo that works" to "production system that stays correct." A knowledge base is not a one-time artifact — it needs to evolve alongside the business that depends on it.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://python.langchain.com/docs/how_to/indexing/" rel="noopener noreferrer"&gt;LangChain Indexing API Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://python.langchain.com/api_reference/langchain/indexes/langchain.indexes.SQLRecordManager.html" rel="noopener noreferrer"&gt;SQLRecordManager API Reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>ragas</category>
      <category>rag</category>
      <category>langchain</category>
    </item>
  </channel>
</rss>
