<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: ryoryp</title>
    <description>The latest articles on Forem by ryoryp (@ryoryp).</description>
    <link>https://forem.com/ryoryp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3855784%2F18883bf3-3e38-4c9f-94dc-a0ffde12a18b.png</url>
      <title>Forem: ryoryp</title>
      <link>https://forem.com/ryoryp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ryoryp"/>
    <language>en</language>
    <item>
      <title>Zero-Cost AI Pair Programming: Mastering 'Aider' with Local LLMs (Ollama)</title>
      <dc:creator>ryoryp</dc:creator>
      <pubDate>Wed, 01 Apr 2026 14:36:37 +0000</pubDate>
      <link>https://forem.com/ryoryp/zero-cost-ai-pair-programming-mastering-aider-with-local-llms-ollama-5ae4</link>
      <guid>https://forem.com/ryoryp/zero-cost-ai-pair-programming-mastering-aider-with-local-llms-ollama-5ae4</guid>
      <description>&lt;p&gt;AI coding assistants are great, but relying on cloud APIs like Claude 3.5 Sonnet or GPT-4o for every single terminal command gets expensive fast. Plus, you might not want to send your proprietary codebase to the cloud.&lt;/p&gt;

&lt;p&gt;Enter &lt;strong&gt;&lt;a href="https://aider.chat/" rel="noopener noreferrer"&gt;Aider&lt;/a&gt;&lt;/strong&gt;. It’s a CLI-based AI pair programmer that actually edits files and commits changes directly to your Git repo. &lt;/p&gt;

&lt;p&gt;While most people use Aider with OpenAI or Anthropic APIs, you can run it completely offline using local models via Ollama. It's the ultimate privacy-first, zero-cost pair programming setup. &lt;/p&gt;

&lt;p&gt;Here is my guide to making Aider work flawlessly with local models (like &lt;code&gt;qwen3.5-coder:14b&lt;/code&gt; or &lt;code&gt;llama3&lt;/code&gt;) without losing context or breaking your code.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Core Setup: Connecting Aider to Ollama
&lt;/h3&gt;

&lt;p&gt;Starting Aider with a local model is straightforward. If you have Ollama running, just point Aider to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aider &lt;span class="nt"&gt;--model&lt;/span&gt; ollama/qwen3.5-coder:14b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
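&lt;p&gt;If Ollama is not listening on its default local address, point Aider at it with the &lt;code&gt;OLLAMA_API_BASE&lt;/code&gt; environment variable (the host and port below are just the Ollama defaults, shown for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export&lt;/span&gt; OLLAMA_API_BASE=http://127.0.0.1:11434
aider &lt;span class="nt"&gt;--model&lt;/span&gt; ollama/qwen3.5-coder:14b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;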



&lt;p&gt;However, running it out of the box often leads to frustration. Local models might forget your instructions halfway through or output broken markdown. Here is how to fix that.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Secret Sauce: &lt;code&gt;.aider.model.settings.yml&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is the most critical step for local LLMs. By default, Ollama loads models with a small context window and silently truncates any prompt that exceeds it, causing "silent data drops" where the AI forgets the beginning of the file.&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;.aider.model.settings.yml&lt;/code&gt; file in your project root to explicitly tell Aider how to handle your specific local model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama/qwen3.5-coder:14b&lt;/span&gt;
  &lt;span class="na"&gt;num_ctx&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;32768&lt;/span&gt; &lt;span class="c1"&gt;# Expand the context window to prevent amnesia&lt;/span&gt;
  &lt;span class="na"&gt;edit_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;whole&lt;/span&gt; &lt;span class="c1"&gt;# Force the model to output the entire file if search/replace fails&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tip: If your local model struggles with partial code edits (diffs), forcing &lt;code&gt;edit_format: whole&lt;/code&gt; ensures you get clean, working files, even if it takes a bit longer to generate.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Preventing "Design Drift"
&lt;/h3&gt;

&lt;p&gt;When pair programming with AI, sometimes the model tries to be too helpful and starts refactoring unrelated files. To maintain control and prevent your repo from drifting into chaos, master these CLI commands:&lt;/p&gt;

&lt;p&gt;🟢 Use &lt;code&gt;/read-only&lt;/code&gt; for Context&lt;br&gt;
Don't add every file to the chat. Only &lt;code&gt;/add&lt;/code&gt; the files you want to edit. For documentation or reference files, use &lt;code&gt;/read-only&lt;/code&gt;. This prevents the AI from accidentally modifying your API docs or configs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; /add app.py
&amp;gt; /read-only api_docs.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🔄 The Magic &lt;code&gt;/undo&lt;/code&gt;&lt;br&gt;
Because Aider is Git-native, every change is committed automatically. If the local model hallucinates or writes garbage code, simply type &lt;code&gt;/undo&lt;/code&gt; to instantly roll back the commit and the file changes.&lt;/p&gt;

&lt;p&gt;🧠 Architect Mode for Complex Refactoring&lt;br&gt;
If you want to use a heavy model (like Claude 3.7 Sonnet) for planning but a local model for writing, use Architect Mode. The "Architect" model creates the blueprint, and your local "Editor" model executes the changes. It's a great way to save API costs while maintaining high code quality.&lt;/p&gt;

&lt;p&gt;Conclusion&lt;br&gt;
Aider isn't just a code generator; it's a Git-aware workflow tool. By pairing it with a strong local model like &lt;code&gt;qwen3.5-coder&lt;/code&gt; or &lt;code&gt;llama3&lt;/code&gt;, you get an autonomous, private, and free pair programmer right in your terminal.&lt;/p&gt;

&lt;p&gt;Give it a try, tweak your context windows, and let me know which local models you find work best with Aider!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Escaping API Quotas: How I Built a Local 14B Multi-Agent Squad for 16GB VRAM (Qwen3.5 &amp; DeepSeek-R1)</title>
      <dc:creator>ryoryp</dc:creator>
      <pubDate>Wed, 01 Apr 2026 14:21:56 +0000</pubDate>
      <link>https://forem.com/ryoryp/escaping-api-quotas-how-i-built-a-local-14b-multi-agent-squad-for-16gb-vram-qwen35-deepseek-r1-1he8</link>
      <guid>https://forem.com/ryoryp/escaping-api-quotas-how-i-built-a-local-14b-multi-agent-squad-for-16gb-vram-qwen35-deepseek-r1-1he8</guid>
      <description>&lt;p&gt;I was building a complex web app prototype using a cloud-based AI IDE. Just as I was getting into the flow, I hit the dreaded wall: &lt;strong&gt;"429 Too Many Requests"&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;I was done dealing with subscription anxiety and 6-day quota limits. I wanted to offload the heavy cognitive work to my local machine. But there was a catch: my rig runs on an AMD Radeon RX 6800 with &lt;strong&gt;16GB of VRAM&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here is how I bypassed the cloud limits and built a fully functional local multi-agent system without melting my GPU.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Goldilocks" Zone: Why 14B?
&lt;/h3&gt;

&lt;p&gt;Running a multi-agent system locally is tricky when you have strict hardware limits. Through trial and error, I quickly realized:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;7B/8B models?&lt;/strong&gt; They are fast, but too prone to hallucination when executing complex MCP (Model Context Protocol) tool calls or strict JSON outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;32B+ models?&lt;/strong&gt; Immediate Out Of Memory (OOM) on 16GB VRAM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I found the absolute sweet spot: &lt;strong&gt;14B models quantized (GGUF Q4/Q6) via Ollama&lt;/strong&gt;. They are smart enough to reliably follow system prompts and handle agentic logic, while leaving just enough memory for a healthy context window.&lt;/p&gt;
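As a rough sanity check on why quantized 14B fits in 16GB, here is a back-of-envelope estimate. The bits-per-weight figures are approximate averages for common GGUF quant levels, and the 10% overhead factor is an assumption, not a measured number:

```python
# Rough VRAM estimate for quantized model weights.
# Assumption: ~10% overhead on top of the raw quantized weights;
# bits-per-weight values are approximate GGUF averages.
def approx_weight_gb(params_billion: float, bits_per_weight: float, overhead: float = 0.10) -> float:
    bytes_total = params_billion * 1e9 * (bits_per_weight / 8) * (1 + overhead)
    return bytes_total / 1e9

q4 = approx_weight_gb(14, 4.5)  # Q4_K_M averages ~4.5 bits/weight
q6 = approx_weight_gb(14, 6.5)  # Q6_K averages ~6.5 bits/weight
print(f"Q4 ~{q4:.1f} GB, Q6 ~{q6:.1f} GB of 16 GB VRAM")
# → Q4 ~8.7 GB, Q6 ~12.5 GB of 16 GB VRAM
```

On this estimate, a Q4 14B model leaves several gigabytes for the KV cache of a 32k context, while Q6 is already tight; a 32B model would blow past the budget on weights alone.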

&lt;h3&gt;
  
  
  Meet &lt;code&gt;hera-crew&lt;/code&gt;: Hybrid Edge-Cloud Resource Allocation
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gwvx36e03xhytdiq3jh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gwvx36e03xhytdiq3jh.png" alt="Hand-drawn architecture diagram of HERA-CREW showing a Cloud AI IDE sending tasks to a local 16GB VRAM GPU. Three 14B agents collaborate, with an autonomous fallback routing back to the cloud via MCP." width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This constraint led me to build &lt;strong&gt;hera-crew&lt;/strong&gt;, a local-first multi-agent framework. It’s not just about running models offline; it’s about intelligent, autonomous routing.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Squad: DeepSeek-R1 &amp;amp; Qwen 3.5-Coder
&lt;/h3&gt;

&lt;p&gt;To maximize efficiency, I assigned specific roles to different 14B models. A single model trying to do everything degrades quality, but a specialized squad works wonders:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Tech Lead / Coder (&lt;code&gt;qwen3.5-coder:14b&lt;/code&gt;)&lt;/strong&gt;: 
Absolutely brilliant at writing Next.js/TypeScript and reliably executing tool calls. It acts as the core engine for generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Critic (&lt;code&gt;deepseek-r1:14b&lt;/code&gt;)&lt;/strong&gt;: 
Takes its time to "think" and review the generated code. It flawlessly catches logic flaws and architectural mistakes that smaller models typically miss.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro-tip:&lt;/strong&gt; Set &lt;code&gt;num_ctx&lt;/code&gt; to &lt;code&gt;32768&lt;/code&gt; (32k) in your Ollama config to keep the multi-agent debate from losing context during long sessions!&lt;/p&gt;
&lt;/blockquote&gt;
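&lt;p&gt;One way to bake that in is a small Ollama Modelfile, which builds a dedicated model tag with the larger context window (the &lt;code&gt;qwen-32k&lt;/code&gt; tag name is just an example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Modelfile: extend the base model with a 32k context window
FROM qwen3.5-coder:14b
PARAMETER num_ctx 32768
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then build it with &lt;code&gt;ollama create qwen-32k -f Modelfile&lt;/code&gt; and point your agents at &lt;code&gt;qwen-32k&lt;/code&gt;.&lt;/p&gt;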

&lt;h3&gt;
  
  
  The Magic: Autonomous Fallback via MCP
&lt;/h3&gt;

&lt;p&gt;The coolest feature of &lt;code&gt;hera-crew&lt;/code&gt; is the autonomous fallback mechanism. &lt;/p&gt;

&lt;p&gt;I gave the crew a highly complex task. Instead of just failing locally when the context gets too heavy or requires external data, the &lt;code&gt;Critic&lt;/code&gt; agent evaluates the subtasks. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standard logic and coding?&lt;/strong&gt; -&amp;gt; Routed to LOCAL (Zero latency, zero cost).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Too complex or requires live infrastructure data?&lt;/strong&gt; -&amp;gt; Routed to FALLBACK (Delegated back to the cloud IDE via an MCP tool).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It minimizes API costs, removes the friction of deciding where each task should run, and handles resource allocation autonomously.&lt;/p&gt;
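The routing decision can be sketched in a few lines. This is illustrative only: the token threshold, the size estimate, and the task fields are hypothetical simplifications, not the actual hera-crew implementation:

```python
# Sketch of LOCAL-vs-FALLBACK routing (hypothetical simplification of hera-crew).
from dataclasses import dataclass

LOCAL_CTX_BUDGET = 32_768  # matches the num_ctx used by the local agents


@dataclass
class Subtask:
    description: str
    est_tokens: int        # rough prompt-size estimate for the subtask
    needs_live_data: bool  # e.g. requires querying live infrastructure


def route(task: Subtask) -> str:
    """Send cheap, self-contained work to the local crew; escalate the rest."""
    if task.needs_live_data or task.est_tokens > LOCAL_CTX_BUDGET:
        return "FALLBACK"  # delegate back to the cloud IDE via an MCP tool
    return "LOCAL"         # zero latency, zero cost


print(route(Subtask("refactor auth middleware", 6_000, False)))    # → LOCAL
print(route(Subtask("audit prod traffic patterns", 4_000, True)))  # → FALLBACK
```

The key design point is that the escalation check runs per subtask, so one oversized step falls back to the cloud while the rest of the plan stays local.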

&lt;h3&gt;
  
  
  Let's Build Together
&lt;/h3&gt;

&lt;p&gt;I’ve open-sourced the project on GitHub because I know I'm not the only one fighting the 16GB VRAM battle:&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;&lt;a href="https://github.com/ryohryp/hera-crew" rel="noopener noreferrer"&gt;GitHub - ryohryp/hera-crew&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I’m still refining the system prompts and trying to squeeze every drop of performance out of this setup. &lt;/p&gt;

&lt;p&gt;Are any of you running similar 14B agent squads on 16GB setups? How do you manage the context lengths and tool-calling latency? I'd genuinely love to hear your thoughts, feedback, or PRs!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>mcp</category>
    </item>
  </channel>
</rss>
