<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Zafer Dace</title>
    <description>The latest articles on Forem by Zafer Dace (@zaferdace).</description>
    <link>https://forem.com/zaferdace</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3858391%2Feae2eb24-88da-4c67-b829-ff571b0de4d6.JPG</url>
      <title>Forem: Zafer Dace</title>
      <link>https://forem.com/zaferdace</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/zaferdace"/>
    <language>en</language>
    <item>
      <title>When Your AI Wiki Outgrows the Context Window — A Practical Guide to RAG</title>
      <dc:creator>Zafer Dace</dc:creator>
      <pubDate>Wed, 08 Apr 2026 13:24:33 +0000</pubDate>
      <link>https://forem.com/zaferdace/when-your-ai-wiki-outgrows-the-context-window-a-practical-guide-to-rag-kc2</link>
      <guid>https://forem.com/zaferdace/when-your-ai-wiki-outgrows-the-context-window-a-practical-guide-to-rag-kc2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F158g70ywhxmu7q6iptfg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F158g70ywhxmu7q6iptfg.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Karpathy showed us how to build LLM-powered knowledge bases. But what happens when your wiki gets too big for the context window? Here's the missing piece.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;In a &lt;a href="https://x.com/karpathy/status/2039805659525644595" rel="noopener noreferrer"&gt;recent post&lt;/a&gt;, Andrej Karpathy described a workflow that resonated with thousands of developers: use LLMs to build and maintain personal knowledge bases as markdown wikis. Raw documents go in, the LLM compiles them into structured articles, and you query the wiki like a research assistant.&lt;/p&gt;

&lt;p&gt;He also noted something important:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files and brief summaries... at this ~small scale."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The key phrase is &lt;strong&gt;"at this small scale."&lt;/strong&gt; His wiki is ~100 articles and ~400K words. That fits in a large context window. But what happens when you hit 500 articles? 1,000? 2 million words?&lt;/p&gt;

&lt;p&gt;The context window runs out. Your LLM can't read everything anymore. This is where &lt;strong&gt;RAG&lt;/strong&gt; comes in — and it's simpler than you think.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is RAG?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;RAG (Retrieval Augmented Generation)&lt;/strong&gt; is a three-step pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Retrieve&lt;/strong&gt; — Find the most relevant documents for a given question&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Augment&lt;/strong&gt; — Attach those documents to the prompt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate&lt;/strong&gt; — LLM answers using only the relevant context&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Think of it as an &lt;strong&gt;open-book exam&lt;/strong&gt;. The LLM doesn't memorize your entire wiki — it looks up the right pages before answering.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: "How does attention differ from convolution?"
          ↓
    1. Search vector DB → top 5 relevant articles found
    2. Attach articles to prompt
    3. LLM reads 5 articles (not 500) → generates answer
          ↓
LLM: "Based on your wiki articles on attention mechanisms
      and CNN architectures, the key differences are..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without RAG, you'd need to feed all 500 articles into the context window. With RAG, you feed only 5. Same quality, 100x less tokens.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works Under the Hood
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fce2vzwlqyadlsve1t22o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fce2vzwlqyadlsve1t22o.png" alt=" " width="800" height="328"&gt;&lt;/a&gt;&lt;br&gt;
RAG relies on &lt;strong&gt;vector embeddings&lt;/strong&gt; — turning text into numbers that capture meaning.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Index your wiki
&lt;/h3&gt;

&lt;p&gt;Every article gets converted into a vector (a list of numbers) by an &lt;strong&gt;embedding model&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="s2"&gt;"Attention mechanism"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.68&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-0.12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;"CNN architecture"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.39&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.71&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-0.15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;←&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;similar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;topic,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;close&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;vectors&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;"Cooking recipes"&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;←&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;different&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;topic,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;far&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;apart&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These vectors are stored in a &lt;strong&gt;vector database&lt;/strong&gt; — a specialized database that finds similar vectors fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Query
&lt;/h3&gt;

&lt;p&gt;When you ask a question, the same embedding model converts your question to a vector, then the vector DB finds the closest matches:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"How does self-attention work?"
    → vector → search → top 5 closest articles
    → attention-mechanism.md, transformer-architecture.md, ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Generate
&lt;/h3&gt;

&lt;p&gt;Those articles are injected into the LLM prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System: Answer based on the following context:
[article 1 content]
[article 2 content]
[article 3 content]

User: How does self-attention work?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM now has the right context and generates an accurate, grounded answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Landscape: Existing Tools
&lt;/h2&gt;

&lt;p&gt;Since Karpathy's post, several tools have emerged. Here's a comparison of the most notable ones:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Stack&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/Vasallo94/ObsidianRAG" rel="noopener noreferrer"&gt;ObsidianRAG&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;ChromaDB + Ollama + GraphRAG&lt;/td&gt;
&lt;td&gt;Full-featured local RAG with wikilink-aware search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/proofgeist/obsidian-notes-rag" rel="noopener noreferrer"&gt;obsidian-notes-rag&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;SQLite-vec + MCP server&lt;/td&gt;
&lt;td&gt;Claude Code / AI agent integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/lucasastorian/llmwiki" rel="noopener noreferrer"&gt;llmwiki&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Web UI + Claude&lt;/td&gt;
&lt;td&gt;Non-technical users who want a GUI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/sspaeti/obsidian-note-taking-assistant" rel="noopener noreferrer"&gt;obsidian-note-taking-assistant&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;DuckDB + Web app&lt;/td&gt;
&lt;td&gt;Combined note-taking + RAG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/nicolaischneider/obsidianRAGsody" rel="noopener noreferrer"&gt;obsidianRAGsody&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;CLI + URL clipper&lt;/td&gt;
&lt;td&gt;CLI-first workflow with web scraping&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Which one should you use?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Want everything local + privacy?&lt;/strong&gt; → ObsidianRAG (Ollama + ChromaDB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using Claude Code as your agent?&lt;/strong&gt; → obsidian-notes-rag (MCP server)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Just want to try RAG quickly?&lt;/strong&gt; → obsidianRAGsody (simple CLI)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Makes a Good RAG Pipeline?
&lt;/h2&gt;

&lt;p&gt;A naive RAG (embed → search → generate) works, but production-quality tools like ObsidianRAG go further:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Hybrid Search (Vector + Keyword)&lt;/strong&gt;&lt;br&gt;
Vector search finds semantically similar content ("How do transformers work?" → finds articles about attention). But it can miss exact terms. BM25 keyword search catches those. The best systems combine both — ObsidianRAG uses a 60/40 vector/keyword split.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Reranking&lt;/strong&gt;&lt;br&gt;
Initial retrieval returns ~20 candidates. A &lt;strong&gt;CrossEncoder reranker&lt;/strong&gt; (like &lt;code&gt;bge-reranker-v2-m3&lt;/code&gt;) then scores each candidate against the original query more carefully, keeping only the top 5. This dramatically improves precision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Graph-Aware Expansion&lt;/strong&gt;&lt;br&gt;
If article A is retrieved and it contains &lt;code&gt;[[article B]]&lt;/code&gt; wikilinks, a smart system also pulls in article B. This follows the knowledge graph your LLM already built — exactly how Obsidian's backlinks work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Multilingual Embeddings&lt;/strong&gt;&lt;br&gt;
If your wiki has mixed-language content, use &lt;code&gt;paraphrase-multilingual-mpnet-base-v2&lt;/code&gt; instead of English-only models. It covers 50+ languages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Simple RAG:    Query → Vector Search → Top 5 → LLM
Better RAG:    Query → Hybrid Search → Top 20 → Rerank → Top 5 → Expand Links → LLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Build It Yourself: Minimal RAG in 50 Lines
&lt;/h2&gt;

&lt;p&gt;If you want to understand the core concept, here's a minimal implementation. For production use, consider the tools listed above.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;chromadb sentence-transformers ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Setup
&lt;/span&gt;&lt;span class="n"&gt;embedding_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# local, no API. Use paraphrase-multilingual-mpnet-base-v2 for multilingual wikis
&lt;/span&gt;&lt;span class="n"&gt;chroma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PersistentClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./wiki_vectors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chroma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_or_create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wiki&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Index your wiki
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;index_wiki&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wiki_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;md_files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wiki_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;**/*.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;recursive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;md_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;relpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wiki_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Chunk long articles (simple split by sections)
&lt;/span&gt;        &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;## &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;chunk_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;::chunk_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;metadatas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Indexed &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;md_files&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; files&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Search
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;query_texts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;n_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_results&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadatas&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Ask with RAG
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wiki_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;wiki_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;index_wiki&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wiki_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;---&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Source: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Answer the question based on the following context from my wiki.
Cite your sources.

Context:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Use Ollama for local LLM
&lt;/span&gt;    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3.2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="nf"&gt;index_wiki&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;~/knowledge-base/wiki&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are the key differences between GPT and BERT?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. ~50 lines. Fully local. No API keys. No cloud.&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Use RAG vs. Direct Context
&lt;/h2&gt;

&lt;p&gt;Not everything needs RAG. Here's a simple decision guide:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxqqr8jhm27ujvh74u6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxqqr8jhm27ujvh74u6u.png" alt=" " width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Wiki Size&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 50 articles&lt;/td&gt;
&lt;td&gt;Direct context&lt;/td&gt;
&lt;td&gt;Fits in most context windows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50-200 articles&lt;/td&gt;
&lt;td&gt;Index file + direct&lt;/td&gt;
&lt;td&gt;Karpathy's approach — LLM reads index, then relevant files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;200-1000 articles&lt;/td&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;Too big for context, but RAG handles it easily&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000+ articles&lt;/td&gt;
&lt;td&gt;RAG + hybrid search&lt;/td&gt;
&lt;td&gt;Add keyword search alongside vector search for precision&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The sweet spot for adding RAG is when you notice your LLM starting to miss information that's definitely in your wiki, or when token costs become significant.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tips for Better RAG
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Chunk wisely
&lt;/h3&gt;

&lt;p&gt;Don't index entire articles as single vectors. Split by sections (&lt;code&gt;## headings&lt;/code&gt;). A 5,000-word article as one chunk loses precision — the vector becomes a blur of all topics in that article. Smaller chunks = more precise retrieval.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Keep metadata
&lt;/h3&gt;

&lt;p&gt;Store the source file path, section title, and date with each chunk. This lets you filter results ("only search articles from the last month") and cite sources in answers.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Use hybrid search
&lt;/h3&gt;

&lt;p&gt;Vector search finds semantically similar content. Keyword search finds exact matches. Combine both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vector: "How do transformers handle long sequences?" → finds articles about attention, context windows&lt;/li&gt;
&lt;li&gt;Keyword: "RoPE" → finds the exact article mentioning Rotary Position Embeddings&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Re-index incrementally
&lt;/h3&gt;

&lt;p&gt;Don't rebuild the entire index when you add one article. Use &lt;code&gt;upsert&lt;/code&gt; to add/update only the changed files. Most vector DBs support this natively.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Let the LLM maintain the wiki, RAG maintains the retrieval
&lt;/h3&gt;

&lt;p&gt;Keep Karpathy's workflow intact — the LLM still writes and organizes the wiki. RAG is just the lookup layer. Don't let RAG complexity infect your clean wiki structure.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next: The Compounding Knowledge Loop
&lt;/h2&gt;

&lt;p&gt;The real power emerges when you combine Karpathy's wiki pattern with RAG in a feedback loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raw Sources → LLM compiles wiki → RAG indexes wiki
                    ↑                      ↓
                    └──── You ask questions ─┘
                          Answers filed back
                          into the wiki
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every question you ask, every answer you file back — they compound. The wiki grows smarter. The RAG index gets richer. Six months in, you have a personal research assistant that knows your domain better than any general-purpose LLM ever could.&lt;/p&gt;

&lt;p&gt;And the best part? It all runs on your laptop.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Credit: The LLM knowledge base concept was originally described by &lt;a href="https://x.com/karpathy/status/2039805659525644595" rel="noopener noreferrer"&gt;Andrej Karpathy&lt;/a&gt;. This post explores the RAG extension for scaling beyond context window limits.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you're new to Karpathy's approach, check out my &lt;a href="https://dev.to/zaferdace/build-your-own-ai-powered-knowledge-base-with-llms-and-obsidian-18po"&gt;previous post on building the wiki itself&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Further Reading:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f" rel="noopener noreferrer"&gt;Karpathy's original LLM Wiki gist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Vasallo94/ObsidianRAG" rel="noopener noreferrer"&gt;ObsidianRAG&lt;/a&gt; — Full-featured local Obsidian RAG&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/proofgeist/obsidian-notes-rag" rel="noopener noreferrer"&gt;obsidian-notes-rag&lt;/a&gt; — MCP server for AI agents&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.trychroma.com/" rel="noopener noreferrer"&gt;ChromaDB docs&lt;/a&gt; — Getting started with vector databases&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Build Your Own AI-Powered Knowledge Base with LLMs and Obsidian</title>
      <dc:creator>Zafer Dace</dc:creator>
      <pubDate>Tue, 07 Apr 2026 16:12:40 +0000</pubDate>
      <link>https://forem.com/zaferdace/build-your-own-ai-powered-knowledge-base-with-llms-and-obsidian-18po</link>
      <guid>https://forem.com/zaferdace/build-your-own-ai-powered-knowledge-base-with-llms-and-obsidian-18po</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxmmihu2978uprj6wu9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxmmihu2978uprj6wu9r.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;em&gt;A practical guide to Andrej Karpathy's approach for turning raw research into a living, LLM-maintained wiki.&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;Last week, &lt;a href="https://x.com/karpathy/status/2039805659525644595" rel="noopener noreferrer"&gt;Andrej Karpathy shared a fascinating workflow&lt;/a&gt; on X: instead of using LLMs primarily for code, he's been using them to &lt;strong&gt;build and maintain personal knowledge bases&lt;/strong&gt;. Raw documents go in, and the LLM compiles them into a structured markdown wiki — complete with summaries, backlinks, concept articles, and cross-references.&lt;/p&gt;

&lt;p&gt;The idea is simple but powerful: &lt;strong&gt;you rarely touch the wiki yourself. The LLM writes it, maintains it, and answers questions from it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I loved this concept and decided to build my own version. In this post, I'll walk you through exactly how to set it up using &lt;strong&gt;Obsidian&lt;/strong&gt; as your viewer and &lt;strong&gt;Claude Code&lt;/strong&gt; (or any LLM coding agent) as the engine that manages everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The system has four layers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fta74h45cbrmv7g4eo21w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fta74h45cbrmv7g4eo21w.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's no fancy integration or plugin needed. Obsidian and Claude Code simply share the same directory. Obsidian watches the files and renders them beautifully. Claude Code reads and writes them. That's it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Set Up the Vault
&lt;/h2&gt;

&lt;p&gt;Create a folder structure for your knowledge base:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ~/knowledge-base/&lt;span class="o"&gt;{&lt;/span&gt;raw,wiki/concepts,wiki/topics,output&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/knowledge-base
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a &lt;code&gt;CLAUDE.md&lt;/code&gt; file at the root — this tells Claude Code how to behave in this project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Knowledge Base Instructions&lt;/span&gt;

&lt;span class="gu"&gt;## Structure&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`raw/`&lt;/span&gt; — Source documents (articles, papers, notes). Never modify these.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`wiki/`&lt;/span&gt; — LLM-maintained wiki. All articles are markdown with YAML frontmatter.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`wiki/concepts/`&lt;/span&gt; — Individual concept articles.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`wiki/topics/`&lt;/span&gt; — Broader topic overviews.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`output/`&lt;/span&gt; — Generated outputs (comparisons, slides, charts).
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`_index.md`&lt;/span&gt; — Master index of all wiki articles with one-line summaries.

&lt;span class="gu"&gt;## Article Format&lt;/span&gt;
Every wiki article must have:
&lt;span class="p"&gt;-&lt;/span&gt; YAML frontmatter with: title, tags, sources (list of raw/ files), last_updated
&lt;span class="p"&gt;-&lt;/span&gt; A brief summary (2-3 sentences) at the top
&lt;span class="p"&gt;-&lt;/span&gt; Backlinks to related concepts using [[wiki links]]
&lt;span class="p"&gt;-&lt;/span&gt; Sources section at the bottom linking to raw/ documents

&lt;span class="gu"&gt;## Rules&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Always update &lt;span class="sb"&gt;`_index.md`&lt;/span&gt; when creating or modifying articles.
&lt;span class="p"&gt;-&lt;/span&gt; Use [[double bracket]] links for cross-references.
&lt;span class="p"&gt;-&lt;/span&gt; Never delete or modify files in &lt;span class="sb"&gt;`raw/`&lt;/span&gt;.
&lt;span class="p"&gt;-&lt;/span&gt; When adding new information, cite the source file from &lt;span class="sb"&gt;`raw/`&lt;/span&gt;.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now open this folder as an Obsidian vault:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open Obsidian&lt;/li&gt;
&lt;li&gt;"Open folder as vault" → select &lt;code&gt;~/knowledge-base&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Done — Obsidian is now your viewer&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Step 2: Collect Raw Data
&lt;/h2&gt;

&lt;p&gt;This is the "data ingest" phase. You have several options:&lt;/p&gt;

&lt;h3&gt;
  
  
  Obsidian Web Clipper (Recommended)
&lt;/h3&gt;

&lt;p&gt;Install the &lt;a href="https://obsidian.md/clipper" rel="noopener noreferrer"&gt;Obsidian Web Clipper&lt;/a&gt; browser extension. Configure it to save clipped articles into your &lt;code&gt;raw/&lt;/code&gt; folder. One click saves any web article as clean markdown.&lt;/p&gt;

&lt;h3&gt;
  
  
  Manual Copy
&lt;/h3&gt;

&lt;p&gt;For PDFs, papers, or notes — just drop markdown files into &lt;code&gt;raw/&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Attention&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;All&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;You&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Need"&lt;/span&gt;
&lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://arxiv.org/abs/1706.03762&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;paper&lt;/span&gt;
&lt;span class="na"&gt;date_added&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2025-04-07&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gh"&gt;# Attention Is All You Need&lt;/span&gt;

The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Images
&lt;/h3&gt;

&lt;p&gt;Save related images into &lt;code&gt;raw/images/&lt;/code&gt; and reference them in your markdown. Obsidian renders them inline, and Claude Code can analyze them too.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Compile the Wiki
&lt;/h2&gt;

&lt;p&gt;This is where the magic happens. Open Claude Code in your knowledge base directory and ask it to compile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Read all files in raw/ and compile a wiki:
- Create concept articles in wiki/concepts/ for each key concept
- Create topic overviews in wiki/topics/ for broader themes
- Add backlinks between related articles
- Update _index.md with all articles and one-line summaries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read every document in &lt;code&gt;raw/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Identify key concepts and themes&lt;/li&gt;
&lt;li&gt;Create structured markdown articles with frontmatter&lt;/li&gt;
&lt;li&gt;Cross-link everything with &lt;code&gt;[[wiki links]]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Build a master index&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result looks something like this in Obsidian's graph view — a connected web of knowledge that you never had to organize manually.&lt;/p&gt;

&lt;h3&gt;
  
  
  Incremental Updates
&lt;/h3&gt;

&lt;p&gt;When you add new documents to &lt;code&gt;raw/&lt;/code&gt;, you don't need to rebuild everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I added 3 new articles to raw/. Read them and integrate into the existing wiki.
Update existing articles if there's new info, create new ones if needed,
and update _index.md.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM reads the new sources, figures out what's new vs. what's already covered, and surgically updates the wiki.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Ask Questions
&lt;/h2&gt;

&lt;p&gt;Once your wiki reaches a decent size, you can query it like a research assistant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Based on the wiki, compare the training approaches of GPT-4 and Llama 3.
Write the comparison as output/gpt4-vs-llama3.md with a summary table.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What are the main unsolved problems in RLHF according to our sources?
Write a brief report to output/rlhf-challenges.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Create a Marp slide deck summarizing the key concepts in wiki/topics/
Save as output/overview-slides.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM reads the relevant wiki articles, synthesizes an answer, and writes it as a markdown file — which you immediately see in Obsidian.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pro tip&lt;/strong&gt;: File the best outputs back into the wiki. Your explorations compound over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: Lint and Maintain
&lt;/h2&gt;

&lt;p&gt;As Karpathy mentioned, you can run "health checks" on your wiki:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scan the entire wiki for:
- Inconsistent information between articles
- Missing backlinks (concepts mentioned but not linked)
- Articles that reference deleted or missing sources
- Stub articles that need expansion
Report findings in output/health-check.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Look at the wiki and suggest 5 new article topics that would
fill gaps in our coverage. Explain why each would be valuable.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is surprisingly useful — the LLM often finds connections and gaps you wouldn't notice yourself.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tips and Tricks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Use CLAUDE.md Wisely
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;CLAUDE.md&lt;/code&gt; file is your control plane. As your wiki grows, refine the instructions. Add domain-specific terminology, preferred article structure, or naming conventions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Keep _index.md Updated
&lt;/h3&gt;

&lt;p&gt;This is the LLM's "table of contents." When the wiki gets large (100+ articles), the LLM reads &lt;code&gt;_index.md&lt;/code&gt; first to understand what exists before diving into specific files. Keep it clean and current.&lt;/p&gt;

&lt;h3&gt;
  
  
  Obsidian Graph View
&lt;/h3&gt;

&lt;p&gt;Enable Obsidian's graph view to visualize connections. The &lt;code&gt;[[wiki links]]&lt;/code&gt; that the LLM creates show up as edges in the graph. It's a great way to spot isolated articles or missing connections.&lt;/p&gt;

&lt;h3&gt;
  
  
  Marp for Presentations
&lt;/h3&gt;

&lt;p&gt;Install the &lt;a href="https://github.com/marp-team/marp" rel="noopener noreferrer"&gt;Marp plugin for Obsidian&lt;/a&gt; to render slide decks. Ask Claude Code to generate presentations in Marp format — instant slides from your knowledge base.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scale Considerations
&lt;/h3&gt;

&lt;p&gt;Karpathy reports his wiki works well at ~100 articles and ~400K words without needing RAG. The key is the &lt;code&gt;_index.md&lt;/code&gt; with brief summaries — the LLM reads this first, then dives into relevant articles. At much larger scales, you might need a search tool or embeddings-based retrieval.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Works
&lt;/h2&gt;

&lt;p&gt;The insight behind this approach is subtle: &lt;strong&gt;LLMs are better at maintaining structured knowledge than we are.&lt;/strong&gt; They don't forget to add backlinks. They don't leave articles half-finished (unless you tell them to). They can read 50 articles and produce a consistent summary faster than we can read 5.&lt;/p&gt;

&lt;p&gt;You bring the judgment — which sources to add, which questions to ask, which outputs to keep. The LLM handles the grunt work of organizing, linking, summarizing, and maintaining.&lt;/p&gt;

&lt;p&gt;As Karpathy put it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"You rarely ever write or edit the wiki manually, it's the domain of the LLM. I think there is room here for an incredible new product instead of a hacky collection of scripts."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Until that product exists, Obsidian + Claude Code gets you 90% of the way there — today, for free, with tools you might already have.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Create a folder, add &lt;code&gt;CLAUDE.md&lt;/code&gt; with your wiki rules&lt;/li&gt;
&lt;li&gt;Open it as an Obsidian vault&lt;/li&gt;
&lt;li&gt;Clip or drop 5-10 articles into &lt;code&gt;raw/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;claude&lt;/code&gt; in the folder and ask it to compile&lt;/li&gt;
&lt;li&gt;Explore the result in Obsidian&lt;/li&gt;
&lt;li&gt;Start asking questions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The beauty of this system is that it &lt;strong&gt;compounds&lt;/strong&gt;. Every article you add, every question you ask, every health check you run — they all make the knowledge base richer and more connected. After a few weeks, you'll have a personal research assistant that actually knows your domain.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Credit: This approach was originally described by &lt;a href="https://x.com/karpathy/status/2039805659525644595" rel="noopener noreferrer"&gt;Andrej Karpathy&lt;/a&gt;. This post is a practical implementation guide based on his concept.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Choosing the Right Local LLM for Your Mac: A Developer's Real-World Guide to Parameters, Quantization, and Model Architecture</title>
      <dc:creator>Zafer Dace</dc:creator>
      <pubDate>Sat, 04 Apr 2026 11:37:52 +0000</pubDate>
      <link>https://forem.com/zaferdace/choosing-the-right-local-llm-for-your-mac-a-developers-real-world-guide-to-parameters-2mhk</link>
      <guid>https://forem.com/zaferdace/choosing-the-right-local-llm-for-your-mac-a-developers-real-world-guide-to-parameters-2mhk</guid>
      <description>&lt;p&gt;I tested four local LLMs on my 36GB Apple Silicon Mac with the same Unity/C# prompt, and the results were not what the model names suggested. The fastest model was roughly 10x faster than the slowest. The "code" model refused to write the code. The best answer came from a distilled model that felt smarter in practice than a larger alternative.&lt;/p&gt;

&lt;p&gt;That is why choosing a local model is harder than sorting by parameter count. Architecture, quantization, active parameters, context window, and actual behavior under your prompt matter more than the headline number.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Run LLMs Locally?
&lt;/h2&gt;

&lt;p&gt;I do not think local models replace Claude, GPT, or other frontier cloud systems. I use them as supplements, not substitutes. But they are already useful enough that every Mac developer should understand where they fit.&lt;/p&gt;

&lt;p&gt;The biggest benefit is cost. If I want to iterate on the same task ten times, local inference turns that into a zero-API-cost workflow. Then there is offline capability, IP protection, and freedom from rate limits or daily quotas.&lt;/p&gt;

&lt;p&gt;The tradeoff is also obvious: local models still trail the best cloud systems on reasoning and large-scale architecture work. I use them as part of a stack, not as replacements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Jargon
&lt;/h2&gt;

&lt;p&gt;The local LLM ecosystem is full of terms that make simple tradeoffs sound more mysterious than they are. Here is the practical version.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parameters (7B, 14B, 31B)
&lt;/h3&gt;

&lt;p&gt;When you see &lt;code&gt;7B&lt;/code&gt;, &lt;code&gt;14B&lt;/code&gt;, or &lt;code&gt;31B&lt;/code&gt;, the &lt;code&gt;B&lt;/code&gt; means billion parameters. You can think of parameters as the model's learned internal connections.&lt;/p&gt;

&lt;p&gt;My rough mental model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;7B&lt;/code&gt; = a capable high school student&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;14B&lt;/code&gt; = a university graduate&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;31B&lt;/code&gt; = a specialist&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;400B+&lt;/code&gt; = frontier cloud territory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That analogy is crude but useful. More parameters usually mean better outputs. The cost is more RAM and slower inference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dense vs MoE (Mixture of Experts)
&lt;/h3&gt;

&lt;p&gt;A dense model means the full network participates in every token. I think of it as a 14-person team where everybody works on every question together.&lt;/p&gt;

&lt;p&gt;An MoE model is different. A &lt;code&gt;30B-A3B&lt;/code&gt; model might have 30 billion total parameters, but only 3 billion are active for a given token. That is more like a 30-person office where only three people handle the current task.&lt;/p&gt;

&lt;p&gt;The practical implication is simple: total parameters are not the same as active reasoning depth.&lt;/p&gt;

&lt;p&gt;Real example from my test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Qwen3 Coder &lt;code&gt;30B-A3B&lt;/code&gt; (MoE, 3B active): &lt;code&gt;51.67 tok/s&lt;/code&gt;, but basic architecture output&lt;/li&gt;
&lt;li&gt;Qwen3.5 &lt;code&gt;27B&lt;/code&gt; (dense): &lt;code&gt;8.53 tok/s&lt;/code&gt;, but much stronger modular design and implementation detail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why I do not assume &lt;code&gt;30B&lt;/code&gt; beats &lt;code&gt;14B&lt;/code&gt; or &lt;code&gt;27B&lt;/code&gt;. Active parameters matter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quantization (Q4, Q6, Q8)
&lt;/h3&gt;

&lt;p&gt;Quantization is compression for model weights. The easiest analogy is image compression.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;FP16&lt;/code&gt; = the original full-quality photo&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Q8&lt;/code&gt; = high-quality JPEG, much smaller with minimal visible loss&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Q4&lt;/code&gt; = medium-quality JPEG, smaller again with more noticeable degradation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Q2&lt;/code&gt; = thumbnail-level compression, technically visible but not something you want to rely on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a &lt;code&gt;14B&lt;/code&gt; model, the memory picture looks roughly like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;FP16&lt;/code&gt;: about &lt;code&gt;28GB&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Q8&lt;/code&gt;: about &lt;code&gt;14GB&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Q4&lt;/code&gt;: about &lt;code&gt;8GB&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The exact numbers vary by format and runtime, but the rule is stable. If your RAM allows it, use &lt;code&gt;Q8&lt;/code&gt;. If memory is tight, use &lt;code&gt;Q4&lt;/code&gt;. I avoid &lt;code&gt;Q2&lt;/code&gt; for serious work.&lt;/p&gt;

&lt;h3&gt;
  
  
  KV Cache
&lt;/h3&gt;

&lt;p&gt;Every generated token depends on the tokens that came before it. KV cache stores that attention state so the model does not have to recompute everything from scratch.&lt;/p&gt;

&lt;p&gt;The catch is memory use. Bigger context means more RAM pressure. Roughly speaking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;8K&lt;/code&gt; context can cost around &lt;code&gt;2GB&lt;/code&gt; extra&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;32K&lt;/code&gt; can push toward &lt;code&gt;8GB&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Exact usage depends on the model and backend, but the tradeoff is real. In my setup, TurboQuant+ helped Gemma by compressing KV cache so I could get more practical use out of limited memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Window
&lt;/h3&gt;

&lt;p&gt;Context window is how much text the model can see at one time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;8K&lt;/code&gt; = around 500 lines of code&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;32K&lt;/code&gt; = around 2,000 lines&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;128K&lt;/code&gt; = around 8,000 lines&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;262K&lt;/code&gt; = large multi-file chunks&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;1M&lt;/code&gt; = cloud-model territory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For developers, this matters immediately. An &lt;code&gt;8K&lt;/code&gt; model may be fine for one short file, but it becomes restrictive fast once you include package structure, interfaces, or multiple files.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Test Setup
&lt;/h2&gt;

&lt;p&gt;I wanted a realistic prompt, not a benchmark toy. So I used a Unity/C# request that checks more than raw syntax:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Write a Firebase Analytics tool for Unity using VContainer, UniTask, and MessagePipe. Make it modular for reuse across games. Package it as UPM."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My machine was a 36GB Apple Silicon Mac using unified memory. I ran Qwen models through LM Studio with the MLX backend, and Gemma through a llama.cpp TurboQuant+ fork because that runtime gave me better memory behavior for that particular model.&lt;/p&gt;

&lt;p&gt;This was not a scientific benchmark shootout. It was a practical developer test: same machine, same task, same expectation of usable output.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Model 1: Qwen3 Coder 30B-A3B (MoE)
&lt;/h3&gt;

&lt;p&gt;This was the speed monster.&lt;/p&gt;

&lt;p&gt;It is a &lt;code&gt;30B&lt;/code&gt; MoE model with only &lt;code&gt;3B&lt;/code&gt; active parameters per token, and it showed. I measured &lt;code&gt;51.67 tok/s&lt;/code&gt;, and it felt genuinely responsive. It generated &lt;code&gt;1682&lt;/code&gt; tokens in roughly half a minute.&lt;/p&gt;

&lt;p&gt;The output was decent: solid explanations and a usable class outline, but not production-ready architecture. It left important initialization details to me and stayed at the "good draft" level.&lt;/p&gt;

&lt;p&gt;My conclusion: excellent for quick questions, boilerplate, and fast ideation. Not enough for deep architecture work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model 2: Qwen3.5 27B Claude Distilled (Dense)
&lt;/h3&gt;

&lt;p&gt;This was the clear winner on quality.&lt;/p&gt;

&lt;p&gt;It is a dense &lt;code&gt;27B&lt;/code&gt; model, reportedly distilled from Claude 4.6 Opus behavior, and the output quality difference was obvious. It ran at &lt;code&gt;8.53 tok/s&lt;/code&gt;, much slower than the MoE model, but the answer was in a different class.&lt;/p&gt;

&lt;p&gt;It produced &lt;code&gt;5138&lt;/code&gt; tokens over about three to four minutes, and most of them were useful. The naming was cleaner. The module boundaries made sense. It handled service registration, dependency injection, and reusable package structure with much more confidence.&lt;/p&gt;

&lt;p&gt;This is the model that felt most like a serious coding partner.&lt;/p&gt;

&lt;p&gt;My conclusion: if the task involves architecture, modular design, or reusable package-level code, this is the one worth waiting for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model 3: Qwen 2.5 Coder 14B (Dense, code-specialized)
&lt;/h3&gt;

&lt;p&gt;This was the biggest disappointment.&lt;/p&gt;

&lt;p&gt;On paper, it should have been a strong fit: dense &lt;code&gt;14B&lt;/code&gt;, code-specialized, manageable size. In practice, it refused to do the work. Instead of writing the package, it explained how I could do it. When I pushed further, it said the task was too complex.&lt;/p&gt;

&lt;p&gt;That matters more to me than benchmark scores. A coding model that declines to code on a realistic prompt is not a reliable tool for my workflow.&lt;/p&gt;

&lt;p&gt;My conclusion: probably fine for completions and short snippets, not dependable for larger scoped generation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model 4: Gemma 4 31B (Dense, TurboQuant+)
&lt;/h3&gt;

&lt;p&gt;Gemma 4 &lt;code&gt;31B&lt;/code&gt; was interesting because it felt strong in theory and limited in practice.&lt;/p&gt;

&lt;p&gt;It is a dense &lt;code&gt;31B&lt;/code&gt; model, but the &lt;code&gt;8K&lt;/code&gt; context window was the major bottleneck. Even with TurboQuant+ helping on memory through KV cache compression, I still felt boxed in by the context limit. It ran at &lt;code&gt;5.83 tok/s&lt;/code&gt; and produced &lt;code&gt;2454&lt;/code&gt; tokens in about seven minutes.&lt;/p&gt;

&lt;p&gt;The output quality was decent. I would place it closer to Qwen3 Coder than to Qwen3.5 distilled. It gave useful guidance, but not the modular, production-grade design I wanted.&lt;/p&gt;

&lt;p&gt;My conclusion: capable, but constrained. TurboQuant+ helps it fit and run, but it cannot fix the small context window.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;th&gt;Quality Summary&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3 Coder 30B-A3B&lt;/td&gt;
&lt;td&gt;MoE, &lt;code&gt;30B&lt;/code&gt; total / &lt;code&gt;3B&lt;/code&gt; active&lt;/td&gt;
&lt;td&gt;&lt;code&gt;262K&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;51.67 tok/s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;1682&lt;/code&gt; tokens in ~30s&lt;/td&gt;
&lt;td&gt;Good explanations, basic structure, shallow architecture&lt;/td&gt;
&lt;td&gt;Best for speed, boilerplate, quick questions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5 27B Claude Distilled&lt;/td&gt;
&lt;td&gt;Dense &lt;code&gt;27B&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;262K&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;8.53 tok/s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;5138&lt;/code&gt; tokens in 3-4 min&lt;/td&gt;
&lt;td&gt;Best modularity, DI patterns, naming, package structure&lt;/td&gt;
&lt;td&gt;Best overall code quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 2.5 Coder 14B&lt;/td&gt;
&lt;td&gt;Dense &lt;code&gt;14B&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;32K&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Refused full implementation&lt;/td&gt;
&lt;td&gt;Explained approach instead of coding; failed on complexity&lt;/td&gt;
&lt;td&gt;Disappointing for complex prompts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B&lt;/td&gt;
&lt;td&gt;Dense &lt;code&gt;31B&lt;/code&gt;, TurboQuant+ runtime&lt;/td&gt;
&lt;td&gt;&lt;code&gt;8K&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;5.83 tok/s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;2454&lt;/code&gt; tokens in ~7 min&lt;/td&gt;
&lt;td&gt;Useful guidance, but not detailed enough for the speed&lt;/td&gt;
&lt;td&gt;Limited by context, hard to justify&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  RAM Guide: What to Download for Your Mac
&lt;/h2&gt;

&lt;h3&gt;
  
  
  16GB RAM
&lt;/h3&gt;

&lt;p&gt;At &lt;code&gt;16GB&lt;/code&gt;, I would stay modest and optimize for responsiveness.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Qwen &lt;code&gt;2.5 7B Q8&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Llama &lt;code&gt;3.1 8B Q8&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are good for completions, simple questions, and small utility generation. I would not expect serious architecture work from them.&lt;/p&gt;

&lt;h3&gt;
  
  
  32GB RAM
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Qwen3.5 &lt;code&gt;27B&lt;/code&gt; Claude Distilled &lt;code&gt;Q4&lt;/code&gt; for the best quality&lt;/li&gt;
&lt;li&gt;Qwen &lt;code&gt;2.5 Coder 14B Q8&lt;/code&gt; for fast code-oriented tasks&lt;/li&gt;
&lt;li&gt;Gemma &lt;code&gt;4 31B Q4&lt;/code&gt; via TurboQuant+ if you want to experiment with larger dense models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where local LLMs start becoming genuinely useful. For me, the distilled &lt;code&gt;27B&lt;/code&gt; is the most compelling choice in this tier.&lt;/p&gt;

&lt;h3&gt;
  
  
  64GB+ RAM
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Qwen &lt;code&gt;2.5 Coder 32B Q8&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Llama &lt;code&gt;3.1 70B Q4&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Multiple models loaded simultaneously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the tier where local work becomes much more flexible. You can keep a fast model and a smart model loaded at the same time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools I Actually Found Useful
&lt;/h2&gt;

&lt;p&gt;The tooling matters almost as much as the model choice.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LM Studio&lt;/strong&gt;: the easiest place to start. Drag-and-drop workflow, clean interface, and MLX optimization make it especially friendly on Apple Silicon.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp / TurboQuant+&lt;/strong&gt;: the better choice if you want more control, server mode, and memory optimization tricks like improved KV cache handling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt;: good for quick CLI testing and simple local serving.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;llmfit&lt;/strong&gt; (&lt;code&gt;github.com/AlexsJones/llmfit&lt;/code&gt;): useful for estimating what model and quantization level will actually fit on your hardware before you waste time downloading huge files.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are new to local LLMs on Mac, I would start with LM Studio. Once you care about squeezing more performance or memory efficiency out of your machine, llama.cpp-style runtimes are worth the extra complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Recommendation
&lt;/h2&gt;

&lt;p&gt;For me, the best setup is a multi-model workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud models like Claude or Codex for architecture decisions, complex reasoning, and bigger refactors&lt;/li&gt;
&lt;li&gt;Local Qwen3.5 distilled for offline code generation, iterative package drafting, and zero-cost repetition&lt;/li&gt;
&lt;li&gt;Local Qwen3 Coder MoE for quick questions, boilerplate, and fast turnaround&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I had to recommend one local model from this test for a 32GB-class Mac developer who wants the best coding output, I would choose Qwen3.5 &lt;code&gt;27B&lt;/code&gt; Claude Distilled. If I had to recommend one for speed, I would choose Qwen3 Coder &lt;code&gt;30B-A3B&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Those are different winners, and that is exactly the point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Local LLMs in 2026 are genuinely useful for developers, but only if you understand what the labels do and do not mean. Parameters alone are not enough. Architecture, quantization, context window, runtime, and training all matter.&lt;/p&gt;

&lt;p&gt;The surprising result from my test was how differently the models failed and succeeded on the same prompt. The fastest model was useful but shallow. The code-specialized model failed the assignment. The biggest model was constrained by context. The best answer came from a distilled dense model that balanced capability and usability.&lt;/p&gt;

&lt;p&gt;If your goal is to write better code faster on a Mac, the winning strategy is not "download the largest model." It is to build a local stack that matches your hardware and your actual development loop.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>performance</category>
    </item>
    <item>
      <title>Multi-Model AI Orchestration for Software Development: How I Ship 10x Faster with Claude, Codex, and Gemini</title>
      <dc:creator>Zafer Dace</dc:creator>
      <pubDate>Thu, 02 Apr 2026 22:05:40 +0000</pubDate>
      <link>https://forem.com/zaferdace/multi-model-ai-orchestration-for-software-development-how-i-ship-10x-faster-with-claude-codex-53l3</link>
      <guid>https://forem.com/zaferdace/multi-model-ai-orchestration-for-software-development-how-i-ship-10x-faster-with-claude-codex-53l3</guid>
      <description>&lt;p&gt;I shipped 19 tools across 2 npm packages, got them reviewed, fixed 10 bugs, and published, all in one evening. I did not do it by typing faster. I did it by orchestrating multiple AI models the same way I would coordinate a small development team.&lt;/p&gt;

&lt;p&gt;That shift changed how I use AI for software work. Instead of asking one model to do everything, I assign roles: one model plans, another researches, another writes code, another reviews, and another handles large-scale analysis when the codebase is too broad for everyone else.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Most developers start with a simple pattern: open one chat, paste some code, and keep asking the same model to help with everything. That works for small tasks. It breaks down on real projects.&lt;/p&gt;

&lt;p&gt;The first problem is context pressure. As the conversation grows, the model’s context window fills with stale details, exploratory dead ends, copied logs, and half-finished code. Even when the window is technically large enough, quality often degrades because the model is trying to juggle too many concerns at once.&lt;/p&gt;

&lt;p&gt;The second problem is that modern codebases are not tidy, single-language systems. The projects I work on often span TypeScript, Python, C#, shell scripts, README docs, test suites, CI config, and package metadata. The mental model required to review a TypeScript AST transform is not the same as the one required to inspect Unity C# editor code or write reliable Python tests.&lt;/p&gt;

&lt;p&gt;The third problem is that software development is not one task. It is a bundle of different tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;writing implementation code&lt;/li&gt;
&lt;li&gt;researching project conventions&lt;/li&gt;
&lt;li&gt;reviewing for defects&lt;/li&gt;
&lt;li&gt;running builds and tests&lt;/li&gt;
&lt;li&gt;comparing architectures&lt;/li&gt;
&lt;li&gt;doing large-scale cross-file analysis&lt;/li&gt;
&lt;li&gt;answering quick lookup questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using one model for all of that is like asking one engineer to do product design, coding, testing, documentation, DevOps, and code review at the same time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture: Each Model Has a Role
&lt;/h2&gt;

&lt;p&gt;I now use a multi-model setup where each model has a clear job.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Why This Model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Claude Opus&lt;/strong&gt; (Orchestrator)&lt;/td&gt;
&lt;td&gt;Decision-making, planning, user communication, coordination&lt;/td&gt;
&lt;td&gt;Strongest reasoning, sees the big picture&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Claude Sonnet&lt;/strong&gt; (Subagent)&lt;/td&gt;
&lt;td&gt;Codebase research, file reading, build/test, pattern finding&lt;/td&gt;
&lt;td&gt;Fast, cheap, parallelizable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Codex MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Code writing in sandbox, counter-analysis, code review&lt;/td&gt;
&lt;td&gt;Independent context, can debate with Opus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemini 2.5 Pro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large-scale analysis (10+ files), cross-cutting research&lt;/td&gt;
&lt;td&gt;1M token context for massive codebases&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the important constraint: &lt;strong&gt;Opus almost never reads more than three files directly, and it never writes code spanning more than two files.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Opus is my scarce resource. I want its context window reserved for decisions, tradeoffs, and coordination. If I let it spend tokens reading ten implementation files, parsing test fixtures, or editing code across half the repo, I am wasting the most valuable reasoning surface in the system.&lt;/p&gt;

&lt;p&gt;So I deliberately make Opus act more like a tech lead than a hands-on individual contributor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It decides what needs to be built.&lt;/li&gt;
&lt;li&gt;It asks subagents to gather evidence.&lt;/li&gt;
&lt;li&gt;It synthesizes findings into an implementation spec.&lt;/li&gt;
&lt;li&gt;It asks Codex to challenge that spec.&lt;/li&gt;
&lt;li&gt;It resolves disagreements.&lt;/li&gt;
&lt;li&gt;It sends implementation to the right execution agent.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Core Principle: Preserve the Orchestrator
&lt;/h2&gt;

&lt;p&gt;The best model should not be your file reader, log parser, or bulk code generator.&lt;/p&gt;

&lt;p&gt;If I need to answer questions like these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What conventions does this repo use for new tools?&lt;/li&gt;
&lt;li&gt;Which helper utilities are already available?&lt;/li&gt;
&lt;li&gt;How do existing tests structure edge cases?&lt;/li&gt;
&lt;li&gt;Where does platform-specific formatting happen?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I do not spend Opus on that. I send Sonnet agents to inspect the codebase and return structured findings. If the question spans a huge number of files, I use Gemini for the broad scan and have it summarize patterns, architectural seams, and constraints.&lt;/p&gt;

&lt;p&gt;Then Opus makes the decision with clean inputs instead of raw noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Example 1: Building 4 Platform Mappers in One Session
&lt;/h2&gt;

&lt;p&gt;One of the clearest examples was &lt;strong&gt;figma-spec-mcp&lt;/strong&gt;, an open source MCP server that bridges Figma designs to code platforms. The package already had a React mapper, and I wanted to expand it with React Native, Flutter, and SwiftUI support while preserving shared conventions and reusing the normalized UI AST.&lt;/p&gt;

&lt;p&gt;Instead, I split the work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workflow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;A Sonnet subagent researched the codebase: tool conventions, type patterns, existing React mapper design, shared helpers, and how the normalized AST flowed through the system.&lt;/li&gt;
&lt;li&gt;Opus synthesized those findings into a detailed implementation spec.&lt;/li&gt;
&lt;li&gt;I sent a single Codex prompt: create all three new mappers by reusing the normalized UI AST and following the discovered conventions.&lt;/li&gt;
&lt;li&gt;Codex wrote more than 2,000 lines across the new mapper surfaces.&lt;/li&gt;
&lt;li&gt;In a separate Codex review session, I asked it to review the output like a skeptical senior engineer, not like the original author.&lt;/li&gt;
&lt;li&gt;That review found ten platform-specific bugs.&lt;/li&gt;
&lt;li&gt;Three Sonnet subagents fixed those bugs in parallel.&lt;/li&gt;
&lt;li&gt;The full toolset passed TypeScript, ESLint, Prettier, and &lt;code&gt;publint&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  What the review caught
&lt;/h3&gt;

&lt;p&gt;The review surfaced bugs that were not obvious from a green-looking implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flutter color output used the wrong byte ordering.&lt;/li&gt;
&lt;li&gt;React Native had &lt;code&gt;shadowOffset&lt;/code&gt; represented as a string instead of an object.&lt;/li&gt;
&lt;li&gt;SwiftUI output relied on a missing color initializer.&lt;/li&gt;
&lt;li&gt;A few generated platform props matched one framework’s conventions but not the actual target platform’s API.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;p&gt;I ended that session with four platform mappers, reviewed, fixed, lint-clean, and production-ready in about two hours. The speed came from specialization and parallelism, not from asking one model to “be smarter.”&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Example 2: Contributing to &lt;code&gt;CoplayDev/unity-mcp&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The second example was a series of open source contributions to &lt;strong&gt;CoplayDev/unity-mcp&lt;/strong&gt;, a Unity MCP server with over 1,000 stars. The most significant was adding an &lt;code&gt;execute_code&lt;/code&gt; tool that lets AI agents run arbitrary C# code directly inside the Unity Editor, with in-memory compilation via Roslyn, safety checks, execution history, and replay support.&lt;/p&gt;

&lt;p&gt;The interesting part is how the feature gap was identified. I was already using a different Unity MCP server (AnkleBreaker) for my own projects, and I noticed it had capabilities that CoplayDev lacked. Rather than manually comparing 78 tools against 34, I had AI agents do the comparison systematically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workflow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;I identified the gap myself by working with both MCP servers daily, then used a Sonnet exploration agent to systematically map all tools from AnkleBreaker’s 78-tool set against CoplayDev’s 34 tools. The agent returned a structured comparison table showing exactly which features were missing.&lt;/li&gt;
&lt;li&gt;From that gap analysis, I picked &lt;code&gt;execute_code&lt;/code&gt; as the highest-impact contribution: it unlocks an entire class of workflows where AI agents can inspect live Unity state, run editor automation, and validate assumptions without requiring manual steps.&lt;/li&gt;
&lt;li&gt;A Sonnet agent deep-dived CoplayDev’s dual-codebase conventions (Python MCP server + C# Unity plugin), studying the tool registration pattern, parameter handling, response envelope format, and test structure.&lt;/li&gt;
&lt;li&gt;Opus synthesized the research into a detailed implementation spec covering four actions (&lt;code&gt;execute&lt;/code&gt;, &lt;code&gt;get_history&lt;/code&gt;, &lt;code&gt;replay&lt;/code&gt;, &lt;code&gt;clear_history&lt;/code&gt;), safety checks for dangerous patterns, Roslyn/CSharpCodeProvider fallback, and execution history management.&lt;/li&gt;
&lt;li&gt;Codex wrote the full implementation: &lt;code&gt;ExecuteCode.cs&lt;/code&gt; (C# Unity handler with in-memory compilation), &lt;code&gt;execute_code.py&lt;/code&gt; (Python MCP tool), and &lt;code&gt;test_execute_code.py&lt;/code&gt; (unit tests). Over 1,600 lines of additions.&lt;/li&gt;
&lt;li&gt;Opus reviewed the output and caught issues before the PR went out.&lt;/li&gt;
&lt;li&gt;The PR was merged after reviewer feedback was addressed.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  What the review caught
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Safety check patterns needed tightening for edge cases around &lt;code&gt;System.IO&lt;/code&gt; and &lt;code&gt;Process&lt;/code&gt; usage&lt;/li&gt;
&lt;li&gt;Error line number normalization had to account for the wrapper class offset&lt;/li&gt;
&lt;li&gt;Compiler selection logic needed a cleaner fallback path&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;execute_code&lt;/code&gt; tool became one of the more significant contributions to the project, enabling AI agents to do things like inspect scene hierarchies at runtime, validate component references programmatically, and run editor automation scripts. The contribution was grounded in a real gap analysis rather than guesswork, and the multi-model workflow ensured the implementation matched the project’s conventions across two languages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Example 3: &lt;code&gt;roblox-shipcheck&lt;/code&gt; Shooter Audit Expansion
&lt;/h2&gt;

&lt;p&gt;The third example was &lt;strong&gt;roblox-shipcheck&lt;/strong&gt;, an open source Roblox game audit tool. I wanted to add six shooter-genre-specific tools and expand the package around them with tests, documentation, examples, and release notes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workflow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Background Sonnet agents worked in parallel on the README rewrite, &lt;code&gt;CHANGELOG&lt;/code&gt;, usage examples, and unit tests.&lt;/li&gt;
&lt;li&gt;Codex wrote all six shooter tools: weapon config audit, hitbox audit, scope UI audit, mobile HUD audit, team infrastructure audit, and anti-cheat surface audit.&lt;/li&gt;
&lt;li&gt;In a separate review session, Codex reviewed the generated implementation and found eight issues.&lt;/li&gt;
&lt;li&gt;A Sonnet agent fixed those issues and got 124 tests passing.&lt;/li&gt;
&lt;li&gt;Sourcery AI, acting as an automated reviewer, found three additional issues.&lt;/li&gt;
&lt;li&gt;Another Sonnet agent addressed the review feedback and tightened the remaining edge cases.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  What the review caught
&lt;/h3&gt;

&lt;p&gt;The first review wave found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ESLint violations&lt;/li&gt;
&lt;li&gt;heuristics that were too strict for real-world projects&lt;/li&gt;
&lt;li&gt;false positives for free-for-all game modes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The automated reviewer then found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;opportunities to consolidate shared test helpers&lt;/li&gt;
&lt;li&gt;missing edge cases in the audit suite&lt;/li&gt;
&lt;li&gt;rough spots in the implementation details around reuse and consistency&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;p&gt;The package ended with 49 tools total, 124 passing tests, a cleaner README, updated examples, release notes, and green CI across TypeScript, ESLint, Prettier, and SonarCloud. That is the difference between “I added some code” and “I shipped a maintainable release.”&lt;/p&gt;

&lt;h2&gt;
  
  
  Token Budget Rules: The Key Insight
&lt;/h2&gt;

&lt;p&gt;The most important lesson in all of this is simple: &lt;strong&gt;your orchestrator’s context window is the scarcest resource in the system.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These are the rules I follow now:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Opus reads three files or fewer per task.&lt;/strong&gt; If I need more than that, I delegate the reading to Sonnet or Gemini and ask for a structured summary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opus writes code in two files or fewer.&lt;/strong&gt; If the task spans more than two files, I send it to Codex with a detailed spec.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Before starting any task, I ask: “Can a subagent do this?”&lt;/strong&gt; If the answer is yes, I stop and delegate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Codex reviews everything.&lt;/strong&gt; Even code Codex wrote itself. The review happens in a separate session so it can challenge its own assumptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Independent work gets parallel agents.&lt;/strong&gt; If docs, tests, examples, and changelog updates do not depend on each other, they should run at the same time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is the mental model I use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Opus = scarce strategic bandwidth
Sonnet = cheap parallel investigation
Codex = isolated implementation and review
Gemini = massive-context research pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once I started treating context like a budget instead of an infinite buffer, my sessions became dramatically more reliable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Debate Pattern
&lt;/h2&gt;

&lt;p&gt;One of the most effective techniques in this setup is what I call the &lt;strong&gt;debate pattern&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of asking one model for a solution and immediately implementing it, I force a disagreement phase.&lt;/p&gt;

&lt;h3&gt;
  
  
  The process
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Opus analyzes the problem and proposes a solution.&lt;/li&gt;
&lt;li&gt;Codex receives that analysis and produces counter-analysis: where it agrees, where it disagrees, and what it would change.&lt;/li&gt;
&lt;li&gt;If there are conflicts, I do one follow-up round to resolve them.&lt;/li&gt;
&lt;li&gt;Once there is consensus, I convert that into an implementation plan.&lt;/li&gt;
&lt;li&gt;Codex implements.&lt;/li&gt;
&lt;li&gt;A separate Codex session reviews the result.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This works because disagreement exposes hidden assumptions.&lt;/p&gt;

&lt;p&gt;In one session, that debate caught:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flutter &lt;code&gt;Color&lt;/code&gt; formatting confusion between &lt;code&gt;0xRRGGBBAA&lt;/code&gt; and &lt;code&gt;0xAARRGGBB&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;React Native Paper prop mismatch using &lt;code&gt;mode&lt;/code&gt; where &lt;code&gt;variant&lt;/code&gt; was correct&lt;/li&gt;
&lt;li&gt;a non-existent SwiftUI &lt;code&gt;Color(hex:)&lt;/code&gt; initializer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of those issues were broad architectural failures. They were the kind of platform-specific correctness bugs that burn time after merge if you do not catch them early.&lt;/p&gt;

&lt;p&gt;The debate pattern turns AI assistance from “fast autocomplete” into “adversarial design review plus implementation.”&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;The performance difference is large enough that I now think in terms of orchestration by default.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Single Model&lt;/th&gt;
&lt;th&gt;Multi-Model Orchestration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tools shipped per session&lt;/td&gt;
&lt;td&gt;2-3&lt;/td&gt;
&lt;td&gt;10-15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bugs caught before publish&lt;/td&gt;
&lt;td&gt;~60%&lt;/td&gt;
&lt;td&gt;~95% (Codex review)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel workstreams&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;6+ simultaneous&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context preservation&lt;/td&gt;
&lt;td&gt;Degrades after 3-4 files&lt;/td&gt;
&lt;td&gt;Stays sharp (delegated)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Convention compliance&lt;/td&gt;
&lt;td&gt;Often drifts&lt;/td&gt;
&lt;td&gt;Exact match (research first)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;If you want to try this workflow, start simple. You do not need a huge automation stack on day one. You just need role separation and a few clear rules.&lt;/p&gt;

&lt;h3&gt;
  
  
  My practical setup
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code CLI with Opus as orchestrator&lt;/strong&gt; for planning, decisions, and user-facing coordination&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Codex MCP server&lt;/strong&gt; (&lt;code&gt;npm: codex&lt;/code&gt;) for implementation, sandboxed code changes, and review&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini MCP&lt;/strong&gt; (&lt;code&gt;npm: gemini-mcp-tool&lt;/code&gt;) for large-scale repo analysis and broad research across many files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sonnet subagents via Claude Code’s Agent tool&lt;/strong&gt; for codebase research, builds, tests, pattern extraction, docs, and support work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most important operational detail is to write your rules down in &lt;code&gt;CLAUDE.md&lt;/code&gt;. If the orchestrator has to rediscover your preferences every session, you lose consistency and waste tokens.&lt;/p&gt;

&lt;p&gt;My &lt;code&gt;CLAUDE.md&lt;/code&gt; contains rules like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;-&lt;/span&gt; Opus reads &amp;lt;= 3 files directly
&lt;span class="p"&gt;-&lt;/span&gt; Opus writes &amp;lt;= 2 files directly
&lt;span class="p"&gt;-&lt;/span&gt; Delegate codebase exploration to Sonnet
&lt;span class="p"&gt;-&lt;/span&gt; Use Codex for implementation spanning multiple files
&lt;span class="p"&gt;-&lt;/span&gt; Always run a separate review pass before publish
&lt;span class="p"&gt;-&lt;/span&gt; Prefer parallel subagents for independent tasks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That single file turns ad hoc prompting into a repeatable operating model.&lt;/p&gt;

&lt;h3&gt;
  
  
  A good first workflow
&lt;/h3&gt;

&lt;p&gt;If you want a low-friction way to start, try this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use Sonnet to inspect the repo and summarize conventions.&lt;/li&gt;
&lt;li&gt;Use Opus to write a short implementation spec.&lt;/li&gt;
&lt;li&gt;Use Codex to implement across the affected files.&lt;/li&gt;
&lt;li&gt;Use a fresh Codex session to review for defects.&lt;/li&gt;
&lt;li&gt;Use Sonnet to fix issues and run tests.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Practical Lessons
&lt;/h2&gt;

&lt;p&gt;Three habits made the biggest difference for me.&lt;/p&gt;

&lt;p&gt;First, I stopped treating AI output as a finished artifact and started treating it as a managed workstream. Every meaningful code change has research, implementation, review, and verification phases. Different models are better at different phases.&lt;/p&gt;

&lt;p&gt;Second, I learned that independent context is a feature, not a limitation. When Codex reviews code from a separate session, it does not inherit all the assumptions of the implementation pass. That distance is exactly why it catches bugs.&lt;/p&gt;

&lt;p&gt;Third, I stopped optimizing for “best prompt” and started optimizing for “best routing.” The better question is: which model should spend tokens on this specific task?&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The future of AI-assisted development is not a single omniscient model sitting in one giant chat. It is orchestration: using the right model for the right task, preserving your strongest model’s context for decisions, and letting specialized agents handle research, implementation, review, and verification.&lt;/p&gt;

&lt;p&gt;If you are already using AI in development, my practical advice is simple: stop asking one model to do everything. Give each model a role, protect your orchestrator’s context window, and add a real review pass. That is where the 10x improvement comes from.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>devtools</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
