<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: ryoryp</title>
    <description>The latest articles on Forem by ryoryp (@ryoryp).</description>
    <link>https://forem.com/ryoryp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3855784%2F18883bf3-3e38-4c9f-94dc-a0ffde12a18b.png</url>
      <title>Forem: ryoryp</title>
      <link>https://forem.com/ryoryp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ryoryp"/>
    <language>en</language>
    <item>
      <title>Zero-Cost AI Pair Programming: Mastering 'Aider' with Local LLMs (Ollama)</title>
      <dc:creator>ryoryp</dc:creator>
      <pubDate>Wed, 01 Apr 2026 14:36:37 +0000</pubDate>
      <link>https://forem.com/ryoryp/zero-cost-ai-pair-programming-mastering-aider-with-local-llms-ollama-5ae4</link>
      <guid>https://forem.com/ryoryp/zero-cost-ai-pair-programming-mastering-aider-with-local-llms-ollama-5ae4</guid>
      <description>&lt;p&gt;AI coding assistants are great, but relying on cloud APIs like Claude 3.5 Sonnet or GPT-4o for every single terminal command gets expensive fast. Plus, you might not want to send your proprietary codebase to the cloud.&lt;/p&gt;

&lt;p&gt;Enter &lt;strong&gt;&lt;a href="https://aider.chat/" rel="noopener noreferrer"&gt;Aider&lt;/a&gt;&lt;/strong&gt;. It’s a CLI-based AI pair programmer that actually edits files and commits changes directly to your Git repo. &lt;/p&gt;

&lt;p&gt;While most people use Aider with OpenAI or Anthropic APIs, you can run it completely offline using local models via Ollama. It's the ultimate privacy-first, zero-cost pair programming setup. &lt;/p&gt;

&lt;p&gt;Here is my guide to making Aider work flawlessly with local models (like &lt;code&gt;qwen3.5-coder:14b&lt;/code&gt; or &lt;code&gt;llama3&lt;/code&gt;) without losing context or breaking your code.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Core Setup: Connecting Aider to Ollama
&lt;/h3&gt;

&lt;p&gt;Starting Aider with a local model is straightforward. If you have Ollama running, just point Aider to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aider &lt;span class="nt"&gt;--model&lt;/span&gt; ollama/qwen3.5-coder:14b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
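&lt;p&gt;If Ollama is not listening on its default local address, point Aider at it with the &lt;code&gt;OLLAMA_API_BASE&lt;/code&gt; environment variable (the host and port below are just the Ollama defaults, shown for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export&lt;/span&gt; OLLAMA_API_BASE=http://127.0.0.1:11434
aider &lt;span class="nt"&gt;--model&lt;/span&gt; ollama/qwen3.5-coder:14b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;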



&lt;p&gt;However, running it out of the box often leads to frustration. Local models might forget your instructions halfway through or output broken markdown. Here is how to fix that.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Secret Sauce: &lt;code&gt;.aider.model.settings.yml&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is the most critical step for local LLMs. By default, Ollama loads models with a small context window and silently truncates any prompt that exceeds it, causing "silent data drops" where the AI forgets the beginning of the file.&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;.aider.model.settings.yml&lt;/code&gt; file in your project root to explicitly tell Aider how to handle your specific local model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama/qwen3.5-coder:14b&lt;/span&gt;
  &lt;span class="na"&gt;num_ctx&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;32768&lt;/span&gt; &lt;span class="c1"&gt;# Expand the context window to prevent amnesia&lt;/span&gt;
  &lt;span class="na"&gt;edit_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;whole&lt;/span&gt; &lt;span class="c1"&gt;# Force the model to output the entire file if search/replace fails&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tip: If your local model struggles with partial code edits (diffs), forcing &lt;code&gt;edit_format: whole&lt;/code&gt; ensures you get clean, working files, even if it takes a bit longer to generate.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Preventing "Design Drift"
&lt;/h3&gt;

&lt;p&gt;When pair programming with AI, sometimes the model tries to be too helpful and starts refactoring unrelated files. To maintain control and prevent your repo from drifting into chaos, master these CLI commands:&lt;/p&gt;

&lt;p&gt;🟢 Use &lt;code&gt;/read-only&lt;/code&gt; for Context&lt;br&gt;
Don't add every file to the chat. Only &lt;code&gt;/add&lt;/code&gt; the files you want to edit. For documentation or reference files, use &lt;code&gt;/read-only&lt;/code&gt;. This prevents the AI from accidentally modifying your API docs or configs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; /add app.py
&amp;gt; /read-only api_docs.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🔄 The Magic &lt;code&gt;/undo&lt;/code&gt;&lt;br&gt;
Because Aider is Git-native, every change is committed automatically. If the local model hallucinates or writes garbage code, simply type &lt;code&gt;/undo&lt;/code&gt; to instantly roll back the commit and the file changes.&lt;/p&gt;

&lt;p&gt;🧠 Architect Mode for Complex Refactoring&lt;br&gt;
If you want to use a heavy model (like Claude 3.7 Sonnet) for planning but a local model for writing, use Architect Mode. The "Architect" model creates the blueprint, and your local "Editor" model executes the changes. It's a great way to save API costs while maintaining high code quality.&lt;/p&gt;

&lt;p&gt;Conclusion&lt;br&gt;
Aider isn't just a code generator; it's a Git-aware workflow tool. By pairing it with a strong local model like &lt;code&gt;qwen3.5-coder&lt;/code&gt; or &lt;code&gt;llama3&lt;/code&gt;, you get an autonomous, private, and free pair programmer right in your terminal.&lt;/p&gt;

&lt;p&gt;Give it a try, tweak your context windows, and let me know which local models you find work best with Aider!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Escaping API Quotas: How I Built a Local 14B Multi-Agent Squad for 16GB VRAM (Qwen3.5 &amp; DeepSeek-R1)</title>
      <dc:creator>ryoryp</dc:creator>
      <pubDate>Wed, 01 Apr 2026 14:21:56 +0000</pubDate>
      <link>https://forem.com/ryoryp/escaping-api-quotas-how-i-built-a-local-14b-multi-agent-squad-for-16gb-vram-qwen35-deepseek-r1-1he8</link>
      <guid>https://forem.com/ryoryp/escaping-api-quotas-how-i-built-a-local-14b-multi-agent-squad-for-16gb-vram-qwen35-deepseek-r1-1he8</guid>
      <description>&lt;p&gt;I was building a complex web app prototype using a cloud-based AI IDE. Just as I was getting into the flow, I hit the dreaded wall: &lt;strong&gt;"429 Too Many Requests"&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;I was done dealing with subscription anxiety and 6-day quota limits. I wanted to offload the heavy cognitive work to my local machine. But there was a catch: my rig runs on an AMD Radeon RX 6800 with &lt;strong&gt;16GB of VRAM&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here is how I bypassed the cloud limits and built a fully functional local multi-agent system without melting my GPU.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Goldilocks" Zone: Why 14B?
&lt;/h3&gt;

&lt;p&gt;Running a multi-agent system locally is tricky when you have strict hardware limits. Through trial and error, I quickly realized:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;7B/8B models?&lt;/strong&gt; They are fast, but too prone to hallucination when executing complex MCP (Model Context Protocol) tool calls or strict JSON outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;32B+ models?&lt;/strong&gt; Immediate Out Of Memory (OOM) on 16GB VRAM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I found the absolute sweet spot: &lt;strong&gt;14B models quantized (GGUF Q4/Q6) via Ollama&lt;/strong&gt;. They are smart enough to reliably follow system prompts and handle agentic logic, while leaving just enough memory for a healthy context window.&lt;/p&gt;
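As a rough sanity check on why quantized 14B fits in 16GB, here is a back-of-envelope estimate. The bits-per-weight figures are approximate averages for common GGUF quant levels, and the 10% overhead factor is an assumption, not a measured number:

```python
# Rough VRAM estimate for quantized model weights.
# Assumption: ~10% overhead on top of the raw quantized weights;
# bits-per-weight values are approximate GGUF averages.
def approx_weight_gb(params_billion: float, bits_per_weight: float, overhead: float = 0.10) -> float:
    bytes_total = params_billion * 1e9 * (bits_per_weight / 8) * (1 + overhead)
    return bytes_total / 1e9

q4 = approx_weight_gb(14, 4.5)  # Q4_K_M averages ~4.5 bits/weight
q6 = approx_weight_gb(14, 6.5)  # Q6_K averages ~6.5 bits/weight
print(f"Q4 ~{q4:.1f} GB, Q6 ~{q6:.1f} GB of 16 GB VRAM")
# → Q4 ~8.7 GB, Q6 ~12.5 GB of 16 GB VRAM
```

On this estimate, a Q4 14B model leaves several gigabytes for the KV cache of a 32k context, while Q6 is already tight; a 32B model would blow past the budget on weights alone.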

&lt;h3&gt;
  
  
  Meet &lt;code&gt;hera-crew&lt;/code&gt;: Hybrid Edge-Cloud Resource Allocation
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gwvx36e03xhytdiq3jh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gwvx36e03xhytdiq3jh.png" alt="Hand-drawn architecture diagram of HERA-CREW showing a Cloud AI IDE sending tasks to a local 16GB VRAM GPU. Three 14B agents collaborate, with an autonomous fallback routing back to the cloud via MCP." width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This constraint led me to build &lt;strong&gt;hera-crew&lt;/strong&gt;, a local-first multi-agent framework. It’s not just about running models offline; it’s about intelligent, autonomous routing.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Squad: DeepSeek-R1 &amp;amp; Qwen 3.5-Coder
&lt;/h3&gt;

&lt;p&gt;To maximize efficiency, I assigned specific roles to different 14B models. A single model trying to do everything degrades quality, but a specialized squad works wonders:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Tech Lead / Coder (&lt;code&gt;qwen3.5-coder:14b&lt;/code&gt;)&lt;/strong&gt;: 
Absolutely brilliant at writing Next.js/TypeScript and reliably executing tool calls. It acts as the core engine for generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Critic (&lt;code&gt;deepseek-r1:14b&lt;/code&gt;)&lt;/strong&gt;: 
Takes its time to "think" and review the generated code. It flawlessly catches logic flaws and architectural mistakes that smaller models typically miss.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro-tip:&lt;/strong&gt; Set &lt;code&gt;num_ctx&lt;/code&gt; to &lt;code&gt;32768&lt;/code&gt; (32k) in your Ollama config to keep the multi-agent debate from losing context during long sessions!&lt;/p&gt;
&lt;/blockquote&gt;
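&lt;p&gt;One way to bake that in is a small Ollama Modelfile, which builds a dedicated model tag with the larger context window (the &lt;code&gt;qwen-32k&lt;/code&gt; tag name is just an example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Modelfile: extend the base model with a 32k context window
FROM qwen3.5-coder:14b
PARAMETER num_ctx 32768
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then build it with &lt;code&gt;ollama create qwen-32k -f Modelfile&lt;/code&gt; and point your agents at &lt;code&gt;qwen-32k&lt;/code&gt;.&lt;/p&gt;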

&lt;h3&gt;
  
  
  The Magic: Autonomous Fallback via MCP
&lt;/h3&gt;

&lt;p&gt;The coolest feature of &lt;code&gt;hera-crew&lt;/code&gt; is the autonomous fallback mechanism. &lt;/p&gt;

&lt;p&gt;I gave the crew a highly complex task. Instead of just failing locally when the context gets too heavy or requires external data, the &lt;code&gt;Critic&lt;/code&gt; agent evaluates the subtasks. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standard logic and coding?&lt;/strong&gt; -&amp;gt; Routed to LOCAL (Zero latency, zero cost).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Too complex or requires live infrastructure data?&lt;/strong&gt; -&amp;gt; Routed to FALLBACK (Delegated back to the cloud IDE via an MCP tool).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It minimizes API costs, removes the friction of deciding where each task should run, and handles resource allocation autonomously.&lt;/p&gt;
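The routing decision can be sketched in a few lines. This is illustrative only: the token threshold, the size estimate, and the task fields are hypothetical simplifications, not the actual hera-crew implementation:

```python
# Sketch of LOCAL-vs-FALLBACK routing (hypothetical simplification of hera-crew).
from dataclasses import dataclass

LOCAL_CTX_BUDGET = 32_768  # matches the num_ctx used by the local agents


@dataclass
class Subtask:
    description: str
    est_tokens: int        # rough prompt-size estimate for the subtask
    needs_live_data: bool  # e.g. requires querying live infrastructure


def route(task: Subtask) -> str:
    """Send cheap, self-contained work to the local crew; escalate the rest."""
    if task.needs_live_data or task.est_tokens > LOCAL_CTX_BUDGET:
        return "FALLBACK"  # delegate back to the cloud IDE via an MCP tool
    return "LOCAL"         # zero latency, zero cost


print(route(Subtask("refactor auth middleware", 6_000, False)))    # → LOCAL
print(route(Subtask("audit prod traffic patterns", 4_000, True)))  # → FALLBACK
```

The key design point is that the escalation check runs per subtask, so one oversized step falls back to the cloud while the rest of the plan stays local.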

&lt;h3&gt;
  
  
  Let's Build Together
&lt;/h3&gt;

&lt;p&gt;I’ve open-sourced the project on GitHub because I know I'm not the only one fighting the 16GB VRAM battle:&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;&lt;a href="https://github.com/ryohryp/hera-crew" rel="noopener noreferrer"&gt;GitHub - ryohryp/hera-crew&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I’m still refining the system prompts and trying to squeeze every drop of performance out of this setup. &lt;/p&gt;

&lt;p&gt;Are any of you running similar 14B agent squads on 16GB setups? How do you manage the context lengths and tool-calling latency? I'd genuinely love to hear your thoughts, feedback, or PRs!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>mcp</category>
    </item>
  </channel>
</rss>
