<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: martin brice</title>
    <description>The latest articles on Forem by martin brice (@hohyeon_jeon_e0e40b63a316).</description>
    <link>https://forem.com/hohyeon_jeon_e0e40b63a316</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3877892%2F2f4a7813-7f1b-4b3f-8522-72b408e26eb2.png</url>
      <title>Forem: martin brice</title>
      <link>https://forem.com/hohyeon_jeon_e0e40b63a316</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/hohyeon_jeon_e0e40b63a316"/>
    <language>en</language>
    <item>
      <title>How a Non-Developer Finally Understood RAG (And You Can Too)</title>
      <dc:creator>martin brice</dc:creator>
      <pubDate>Wed, 15 Apr 2026 05:33:52 +0000</pubDate>
      <link>https://forem.com/hohyeon_jeon_e0e40b63a316/how-a-non-developer-finally-understood-rag-and-you-can-too-jj7</link>
      <guid>https://forem.com/hohyeon_jeon_e0e40b63a316/how-a-non-developer-finally-understood-rag-and-you-can-too-jj7</guid>
      <description>&lt;p&gt;&lt;em&gt;Tags: RAG, AI, LLM, beginners, machinelearning&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the author&lt;/strong&gt;: Hi, I’m Martin Brice. Not a developer. Just someone who got way too deep into local LLMs and somehow ended up here. haha&lt;/p&gt;




&lt;p&gt;Okay so. I’m not a developer.&lt;/p&gt;

&lt;p&gt;But I’ve been obsessing over local LLMs — VS Code, Continue extension, Ollama, the whole thing. And everyone kept throwing around this word: &lt;strong&gt;RAG&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Like it was obvious. Like I should just &lt;em&gt;know&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I didn’t.&lt;/p&gt;

&lt;p&gt;So I started asking questions. Really basic ones. And somewhere between “wait, embedding is just… turning words into numbers?” and “hold on, this is literally just Homebrew” — it all made sense.&lt;/p&gt;

&lt;p&gt;Here’s how I got there. No CS degree required.&lt;/p&gt;




&lt;h2&gt;
  
  
  First — What Problem Does RAG Even Solve?
&lt;/h2&gt;

&lt;p&gt;Your local LLM is smart. But it doesn’t know &lt;em&gt;your&lt;/em&gt; stuff.&lt;/p&gt;

&lt;p&gt;Not your codebase. Not your internal docs. Not anything you made.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;RAG = giving your LLM the right context before it answers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s literally it. &lt;strong&gt;Retrieval-Augmented Generation&lt;/strong&gt; sounds fancy. It’s not. Find relevant stuff → hand it to the LLM → get a way better answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 3 Parts. In Human.
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🔢 Embedding — I Call It “Numberization”
&lt;/h3&gt;

&lt;p&gt;Before anything can be searched, your text needs to be converted into numbers. Why? Because comparing numbers is &lt;em&gt;way&lt;/em&gt; faster than comparing words.&lt;/p&gt;

&lt;p&gt;I kept calling this &lt;strong&gt;numberization&lt;/strong&gt; in my head and honestly? More accurate than “embedding.”&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="s2"&gt;"memory leak fix"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;"gc.collect usage"&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.79&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.48&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;"today's weather"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar meanings → similar numbers. So when you search, you’re not matching words. You’re matching &lt;em&gt;meaning&lt;/em&gt;. Wild, right?&lt;/p&gt;

&lt;p&gt;And with Ollama, this runs fully local. Your code never leaves your machine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull nomic-embed-text  &lt;span class="c"&gt;# that's it&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
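&lt;p&gt;You can poke at this with the toy vectors above. To be clear: nothing below is a real embedding model, it's plain-Python cosine similarity on those made-up numbers, just to show what "similar meanings → similar numbers" buys you:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    # dot product over the product of lengths: closer to 1.0 means "same direction"
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# the made-up vectors from above, not real embeddings
vectors = {
    "memory leak fix":  [0.2, 0.8, 0.1, 0.5],
    "gc.collect usage": [0.21, 0.79, 0.11, 0.48],
    "today's weather":  [0.9, 0.1, 0.7, 0.2],
}

query = vectors["memory leak fix"]
for text, vec in vectors.items():
    print(text, round(cosine_similarity(query, vec), 3))
```

&lt;p&gt;Run it and "gc.collect usage" scores way higher against "memory leak fix" than "today's weather" does, even though the words barely overlap. Similar meaning, similar numbers.&lt;/p&gt;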






&lt;h3&gt;
  
  
  🗄️ ChromaDB — The Warehouse
&lt;/h3&gt;

&lt;p&gt;So now you’ve got all these numbers. You need somewhere to put them.&lt;/p&gt;

&lt;p&gt;ChromaDB stores them. But here’s the key part — &lt;strong&gt;it doesn’t sort them when they go in&lt;/strong&gt;. It just dumps everything in the warehouse.&lt;/p&gt;

&lt;p&gt;The smart part happens at retrieval:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Question comes in
       ↓
Convert question to numbers (same embedding process)
       ↓
Compare against everything in the warehouse
       ↓
Pull out the closest matches
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Think of it as a warehouse where nothing is organized — but the librarian can instantly find anything similar to what you’re looking for. That’s the vibe.&lt;/p&gt;
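&lt;p&gt;The whole flow fits in a few lines of plain Python. This is a toy warehouse, not the real ChromaDB API, but it shows the dump-everything-in, compare-at-retrieval shape:&lt;/p&gt;

```python
import math

# the warehouse: nothing is sorted on the way in, everything is just
# dumped under an id (vectors are the made-up ones from earlier)
warehouse = {
    "memory leak fix":  [0.2, 0.8, 0.1, 0.5],
    "gc.collect usage": [0.21, 0.79, 0.11, 0.48],
    "today's weather":  [0.9, 0.1, 0.7, 0.2],
}

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query_vec, k=2):
    # the smart part happens here: numberize the question the same way,
    # compare against everything, pull out the closest matches
    ranked = sorted(warehouse, key=lambda doc: distance(query_vec, warehouse[doc]))
    return ranked[:k]

print(retrieve([0.19, 0.81, 0.1, 0.5]))
```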




&lt;h3&gt;
  
  
  ⚖️ Reranker — The Curator
&lt;/h3&gt;

&lt;p&gt;Vector search is great but it casts a wide net. You might get 20 results. Maybe 3 are actually useful.&lt;/p&gt;

&lt;p&gt;That’s what the Reranker is for. It reads each result carefully and re-orders by actual relevance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before Reranker:
1st → "memory concepts overview"   ← sounds related, not helpful
2nd → "memory leak debug code"     ← THIS is what you need
3rd → "memory optimization tips"

After Reranker:
1st → "memory leak debug code"     ✅
2nd → "memory optimization tips"
3rd → "memory concepts overview"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My favorite analogy:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Vector search = casting a fishing net (catch a lot)&lt;br&gt;
Reranker = chef picking only the best catch (keep what matters)&lt;/p&gt;
&lt;/blockquote&gt;
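&lt;p&gt;Here's the curator as a toy sketch. Real rerankers are small models that actually read each result; the word-overlap score below is just a stand-in so you can watch the re-ordering happen:&lt;/p&gt;

```python
def rerank(query, results):
    # toy relevance score: how many query words the result actually contains.
    # a real reranker model reads the text; overlap is a stand-in
    query_words = set(query.lower().split())

    def score(text):
        return len(query_words.intersection(text.lower().split()))

    return sorted(results, key=score, reverse=True)

results = [
    "memory concepts overview",
    "memory leak debug code",
    "memory optimization tips",
]
print(rerank("memory leak debug", results))
```

&lt;p&gt;The wide net catches all three; the curator puts the one you actually need on top.&lt;/p&gt;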




&lt;h2&gt;
  
  
  The “Homebrew Moment” — LangChain
&lt;/h2&gt;

&lt;p&gt;This is when it clicked for me.&lt;/p&gt;

&lt;p&gt;LangChain is a Python library. And just like Homebrew on Mac — it’s not doing the hard work itself. It’s just &lt;em&gt;coordinating&lt;/em&gt; everything else.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LangChain (the general contractor) 🏢
  ├── Chunking    → does this itself
  ├── Embedding   → calls Ollama
  ├── Storage     → calls ChromaDB
  └── Answer      → calls local LLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install the parts separately. Use them all through one interface.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;langchain chromadb   &lt;span class="c"&gt;# like brew install&lt;/span&gt;
ollama pull nomic-embed-text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before LangChain → wire everything manually, 100 lines of code&lt;br&gt;
After LangChain → connect the pieces, done&lt;/p&gt;

&lt;p&gt;It’s a PM that outsources everything and just manages the pipeline. Lightweight. Smart. Kinda lazy in the best way. haha&lt;/p&gt;
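&lt;p&gt;If "general contractor" still feels abstract, here's the whole pipeline with every subcontractor faked in plain Python. None of these names are the LangChain API; this is just the shape of what it coordinates:&lt;/p&gt;

```python
# every subcontractor faked; none of these names are the real LangChain API

def chunk(doc):
    # LangChain does this part itself; here, naively: split on blank lines
    return [p.strip() for p in doc.split("\n\n") if p.strip()]

def toy_embed(text):
    # stand-in for the Ollama embedding model: crude word-presence vector
    vocab = ["memory", "leak", "weather", "fix"]
    words = text.lower().split()
    return [float(w in words) for w in vocab]

class ToyStore:
    # stand-in for ChromaDB
    def __init__(self):
        self.items = []
    def add(self, chunks, vectors):
        self.items.extend(zip(chunks, vectors))
    def search(self, query_vec):
        def dist(vec):
            return sum((a - b) ** 2 for a, b in zip(query_vec, vec))
        return min(self.items, key=lambda item: dist(item[1]))[0]

def toy_llm(prompt):
    # stand-in for the local LLM behind Ollama
    return "Answer based on -> " + prompt

def build_rag_pipeline(documents):
    # the general contractor: chunk, then delegate embed / store / answer
    store = ToyStore()
    for doc in documents:
        chunks = chunk(doc)
        store.add(chunks, [toy_embed(c) for c in chunks])

    def ask(question):
        context = store.search(toy_embed(question))
        return toy_llm("Context: " + context + " | Question: " + question)

    return ask

ask = build_rag_pipeline(["memory leak fix guide\n\ntoday weather report"])
print(ask("how to fix a memory leak"))
```

&lt;p&gt;Swap each toy piece for the real one and you've got the actual stack. That's the whole trick: one interface, everything else outsourced.&lt;/p&gt;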


&lt;h2&gt;
  
  
  The Thing Nobody Explains Well: Chunking
&lt;/h2&gt;

&lt;p&gt;Before any of the above happens, your documents get &lt;strong&gt;cut into pieces&lt;/strong&gt;. This is chunking. And the cutting strategy matters &lt;em&gt;a lot&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For code, the rule is: &lt;strong&gt;cut by function or class, not by character count.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Good chunk ✅ — complete function
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

&lt;span class="c1"&gt;# Bad chunk ❌ — cut mid-function
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;da&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You also want &lt;strong&gt;metadata&lt;/strong&gt; on every chunk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engine_core.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calculate_memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;line&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this, your LLM can say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“The issue is in engine_core.py, line 42, inside calculate_memory”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“Somewhere in your code maybe?”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Big difference. haha&lt;/p&gt;
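&lt;p&gt;Python even ships with the &lt;code&gt;ast&lt;/code&gt; module, so function-level chunking with metadata is genuinely this small (the source string and filename here are made up for the demo):&lt;/p&gt;

```python
import ast

source = '''
def calculate_memory(data):
    result = data * 2
    return result

def release_memory(data):
    del data
'''

def chunk_by_function(code, filename):
    # cut by function, not by character count: each chunk is complete
    # and carries the metadata the LLM needs to point at real lines
    chunks = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.FunctionDef):
            chunks.append({
                "text": ast.get_source_segment(code, node),
                "file": filename,
                "type": "function",
                "name": node.name,
                "line": node.lineno,
            })
    return chunks

for c in chunk_by_function(source, "engine_core.py"):
    print(c["file"], c["line"], c["name"])
```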




&lt;h2&gt;
  
  
  What Gets Chunked vs What Doesn’t
&lt;/h2&gt;

&lt;p&gt;This confused me early on:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Chunk it?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Your codebase&lt;/td&gt;
&lt;td&gt;✅ Yes — by function&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long docs&lt;/td&gt;
&lt;td&gt;✅ Yes — by section&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q&amp;amp;A pairs&lt;/td&gt;
&lt;td&gt;❌ No — already the right size&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Q&amp;amp;A pairs go straight to embedding. Splitting them would destroy their meaning.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Flow (My Actual Setup)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VS Code + Continue
        ↓
Proxy Server intercepts the prompt
        ↓
RAG kicks in:
  → ChromaDB searched
  → Reranker filters best results
  → Context injected into prompt
        ↓
Local LLM (Ollama) answers
        ↓
Q&amp;amp;A pair saved back to ChromaDB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last step is the part I love most. Every question and answer gets stored. The system gets smarter the more you use it — no retraining, no API costs, no data leaving your machine.&lt;/p&gt;
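&lt;p&gt;The save-back step is tiny, too. A toy version (the real setup embeds the pair and adds it to ChromaDB, same idea):&lt;/p&gt;

```python
memory = []

def remember(question, answer):
    # Q&A pairs are already the right size: no chunking, straight to storage
    memory.append("Q: " + question + " A: " + answer)

def recall(keyword):
    # next time around, anything touching the same topic comes back as context
    return [entry for entry in memory if keyword in entry]

remember("Where is the leak?", "engine_core.py line 42, calculate_memory")
print(recall("leak"))
```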




&lt;h2&gt;
  
  
  Plain English Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Fancy term&lt;/th&gt;
&lt;th&gt;What it actually means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Embedding&lt;/td&gt;
&lt;td&gt;Numberization — text → numbers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector DB&lt;/td&gt;
&lt;td&gt;Warehouse that retrieves by number similarity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunking&lt;/td&gt;
&lt;td&gt;Cutting docs into meaningful pieces&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reranker&lt;/td&gt;
&lt;td&gt;Curator that picks best results&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangChain&lt;/td&gt;
&lt;td&gt;PM/contractor coordinating everything&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;Giving LLM the right context before answering&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;The jargon made this feel impossible. Once I stopped trying to understand “vector embeddings” and started thinking about &lt;em&gt;numberization, warehouses, and general contractors&lt;/em&gt; — it took about an hour.&lt;/p&gt;

&lt;p&gt;If you’re a non-developer trying to make sense of this stuff, I hope this saves you that hour. You don’t need to write the code to understand the architecture. And understanding the architecture helps you ask better questions when you &lt;em&gt;do&lt;/em&gt; start building.&lt;/p&gt;

&lt;p&gt;Which is kind of the whole point of RAG, isn’t it? haha&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Stack: VS Code + Continue + Ollama + ChromaDB + LangChain&lt;/em&gt;&lt;br&gt;
&lt;em&gt;All local. No API keys. No data leaving the machine.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;— Martin Brice, a non-developer who got too curious&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>CARE Loop: A Human-Centered Framework for Local LLM Development</title>
      <dc:creator>martin brice</dc:creator>
      <pubDate>Tue, 14 Apr 2026 06:01:31 +0000</pubDate>
      <link>https://forem.com/hohyeon_jeon_e0e40b63a316/care-loop-a-human-centered-framework-for-local-llm-development-2c05</link>
      <guid>https://forem.com/hohyeon_jeon_e0e40b63a316/care-loop-a-human-centered-framework-for-local-llm-development-2c05</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfjz9eyo0kofy0h7e7wz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfjz9eyo0kofy0h7e7wz.jpg" alt=" " width="800" height="1067"&gt;&lt;/a&gt;# 🔄 CARE Loop&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;C&lt;/strong&gt;oding → &lt;strong&gt;A&lt;/strong&gt;udit → &lt;strong&gt;R&lt;/strong&gt;AG → &lt;strong&gt;E&lt;/strong&gt;xit (Reincarnation)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;A human-centered framework for maximizing local LLM performance in software development.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why CARE Loop Exists
&lt;/h2&gt;

&lt;p&gt;LLMs are remarkable. They also have limits.&lt;/p&gt;

&lt;p&gt;They seem to know where to go — until they don't. They appear confident — until they're not. And when they get lost, they rarely admit it. They just keep going in the wrong direction, with the same confident tone.&lt;/p&gt;

&lt;p&gt;That moment — when the AI is stuck but doesn't know it — is where most AI-assisted projects fall apart.&lt;/p&gt;

&lt;p&gt;CARE Loop is built around that moment.&lt;/p&gt;

&lt;p&gt;The insight is simple: when a human recognizes that the AI is lost and gives it the right nudge, the AI's full potential is suddenly unlocked. It stops spinning and starts flying. The human doesn't need to write the code. They just need to see what the AI can't see, and point the way.&lt;/p&gt;

&lt;p&gt;CARE Loop is the system that makes that collaboration reliable, repeatable, and scalable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Anyone who has used an AI coding agent for a serious project has experienced this: the AI starts strong, then gradually begins to hallucinate, contradict itself, forget earlier decisions, and produce increasingly broken code. This is the &lt;strong&gt;context contamination problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The common solution is to throw more money at it — use bigger models, pay for longer context windows, upgrade to the latest API.&lt;/p&gt;

&lt;p&gt;CARE Loop proposes a different answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Insight
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The quality of AI-generated code is not determined by the size of the model. It is determined by the quality of the human operating it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Two things separate good AI-assisted development from bad:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A well-designed blueprint&lt;/strong&gt; — Before writing a single line of code, a human must think clearly about architecture, requirements, and constraints. AI can assist, but the thinking must be human-led.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Curated context&lt;/strong&gt; — Instead of dumping entire documentation or codebases into the AI's context, a human selects and distills &lt;em&gt;only the essential concepts&lt;/em&gt; needed for the current task. Less noise, more signal.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When these two things are done well, a local LLM running on consumer hardware can produce results that rival — or exceed — what most developers get from expensive cloud APIs used carelessly.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is CARE Loop?
&lt;/h2&gt;

&lt;p&gt;CARE Loop is a framework — part philosophy, part system — for structured AI-assisted development.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────┐
│                    CARE LOOP                        │
│                                                     │
│   [Human: Blueprint + Curated Context]              │
│                    ↓                                │
│   C → Coding AI works on the task                  │
│   A → Audit AI reviews the output                  │
│   R → RAG Scribe records progress &amp;amp; decisions      │
│   E → Exit: when context degrades, reset &amp;amp; reborn  │
│                    ↓                                │
│   New AI session inherits RAG memory, not noise    │
└─────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Four Roles
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Coding AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Writes and edits code based on human-defined tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audit AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reviews the output for correctness, consistency, and quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAG Scribe&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Records what was built, why decisions were made, and current state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token Manager&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Monitors context usage; triggers reset before degradation begins&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Reincarnation Principle
&lt;/h3&gt;

&lt;p&gt;When the Token Manager detects the context is approaching its limit (~70-80% capacity), it signals the system to &lt;strong&gt;Exit&lt;/strong&gt;. Before the session ends, the RAG Scribe saves a structured summary of all progress. The next AI session is initialized fresh — no contamination — but immediately given access to the RAG memory. It picks up exactly where the previous session left off, without the accumulated noise.&lt;/p&gt;

&lt;p&gt;The AI "dies" and is "reborn" — with memory, but without fatigue.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Human as Tech Lead
&lt;/h2&gt;

&lt;p&gt;This is where most AI-assisted development breaks down — and where CARE Loop makes the biggest difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  When AI hits a wall
&lt;/h3&gt;

&lt;p&gt;In any non-trivial project, the AI will eventually get stuck. What happens next determines everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI hits a wall
    ↓
Human understands the problem immediately → solved in minutes 🚀
Human doesn't understand the problem     → AI keeps spinning 🌀
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference between these two outcomes is not the AI. It's the human.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Stubbornness Problem
&lt;/h3&gt;

&lt;p&gt;Experienced AI users know this pattern well: the AI becomes convinced its approach is correct, even in the face of clear evidence that it isn't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Human: "This approach is wrong."
AI:    "I understand your concern, but my approach is correct :)"
Human: "Look — here's the error it produces."
AI:    "Interesting. The error is likely caused by something else :)"
Human: "..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The instinct is to get frustrated, force the AI, or give up. None of these work well.&lt;/p&gt;

&lt;p&gt;What works is acting like a good tech lead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Show the evidence calmly and clearly&lt;/li&gt;
&lt;li&gt;Walk through the logic step by step&lt;/li&gt;
&lt;li&gt;Present an alternative direction with reasoning&lt;/li&gt;
&lt;li&gt;Let the AI arrive at the correct conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When done well, the AI shifts from defensive to collaborative — and it starts flying again.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Audit AI as Translator
&lt;/h3&gt;

&lt;p&gt;One reason humans struggle to unblock AI is that the AI doesn't always clearly explain &lt;em&gt;why&lt;/em&gt; it's stuck. It just produces bad output and tries again.&lt;/p&gt;

&lt;p&gt;The Audit AI's job is to bridge this gap. When the Coding AI is going in circles, the Audit AI surfaces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What the actual problem is&lt;/li&gt;
&lt;li&gt;Why the current approach isn't working&lt;/li&gt;
&lt;li&gt;What the two or three viable paths forward look like&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives the human enough structured information to make a decision — even without deep technical expertise. The human doesn't need to know the answer. They just need enough context to point in the right direction.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The human's job is not to code. It is to think, decide, and unblock.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Who Is This For?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Non-developers
&lt;/h3&gt;

&lt;p&gt;CARE Loop makes it possible to build real, working software through AI — without writing code yourself. The human role is not coding; it is &lt;em&gt;directing&lt;/em&gt;: designing the blueprint, curating the context, and reviewing the audit. If you can think clearly and communicate precisely, you can build with CARE Loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Developers
&lt;/h3&gt;

&lt;p&gt;If you already know how to code, CARE Loop is a multiplier. The structured approach — audit at every step, context managed deliberately, RAG preserving institutional memory — means you can tackle projects of a complexity and quality that ad-hoc AI usage simply cannot sustain. Large model or small, the framework scales.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Human Is the System
&lt;/h2&gt;

&lt;p&gt;This is the central philosophy of CARE Loop:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI agents are powerful but stateless and fragile. Humans provide the continuity, judgment, and design that make them useful.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The framework does not try to make AI autonomous. It makes the &lt;em&gt;human + AI collaboration&lt;/em&gt; more reliable, more structured, and more productive — especially under the constraints of local, open-weight models.&lt;/p&gt;

&lt;p&gt;A well-operated CARE Loop with a mid-sized local LLM will consistently outperform an unstructured session with a frontier model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Current Status
&lt;/h2&gt;

&lt;p&gt;🧠 &lt;strong&gt;Concept phase&lt;/strong&gt; — The framework is defined. Implementation is in progress.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Planned components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] File watcher (Scribe): monitors project folder, records changes automatically&lt;/li&gt;
&lt;li&gt;[ ] Token counter: tracks context usage across the session&lt;/li&gt;
&lt;li&gt;[ ] RAG builder: structures session summaries for retrieval&lt;/li&gt;
&lt;li&gt;[ ] Reset trigger: detects degradation threshold and initiates reincarnation&lt;/li&gt;
&lt;li&gt;[ ] Dashboard UI: four-panel view (prompt / code / token status / progress log)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Target stack:&lt;/strong&gt; Python · Ollama · Local LLMs (Gemma 4, Qwen2.5-Coder, etc.) · VS Code + Cline&lt;/p&gt;




&lt;h2&gt;
  
  
  Philosophy in One Sentence
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Give the AI a great blueprint, feed it only what it needs, watch it closely, unblock it when it's stuck, and reset it before it loses its mind.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Contributing
&lt;/h2&gt;

&lt;p&gt;This project is in its earliest stage. If the concept resonates with you — whether you are a developer, a researcher, or a non-technical builder — ideas, feedback, and contributions are welcome.&lt;/p&gt;

&lt;p&gt;The goal is a framework that works for everyone: not just those who can afford the biggest models, but anyone willing to think carefully before they prompt.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Started by a non-developer who got tired of AI going off the rails — and decided to build the reset button.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>softwaredevelopment</category>
    </item>
  </channel>
</rss>
