<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kristoffer Nordström</title>
    <description>The latest articles on Forem by Kristoffer Nordström (@kristoffer_nordstrom).</description>
    <link>https://forem.com/kristoffer_nordstrom</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3807863%2F66cea60d-fd85-4f69-9a1f-847bc6119da3.png</url>
      <title>Forem: Kristoffer Nordström</title>
      <link>https://forem.com/kristoffer_nordstrom</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kristoffer_nordstrom"/>
    <language>en</language>
    <item>
      <title>[Boost]</title>
      <dc:creator>Kristoffer Nordström</dc:creator>
      <pubDate>Fri, 13 Mar 2026 14:17:37 +0000</pubDate>
      <link>https://forem.com/kristoffer_nordstrom/-9ej</link>
      <guid>https://forem.com/kristoffer_nordstrom/-9ej</guid>
      <description>&lt;p&gt;Boosted: &lt;a href="https://dev.to/ujja/the-enablers-who-helped-me-code-forward-cai"&gt;The Enablers Who Helped Me Code Forward&lt;/a&gt; by &lt;a href="https://dev.to/ujja"&gt;ujja&lt;/a&gt; (WeCoded 2026: Echoes of Experience 💜). 4 min read, 43 reactions, 17 comments.&lt;/p&gt;</description>
      <category>devchallenge</category>
      <category>wecoded</category>
      <category>dei</category>
      <category>career</category>
    </item>
    <item>
      <title>Teaching a Knowledge Base to Search</title>
      <dc:creator>Kristoffer Nordström</dc:creator>
      <pubDate>Fri, 13 Mar 2026 11:44:20 +0000</pubDate>
      <link>https://forem.com/kristoffer_nordstrom/teaching-a-knowledge-base-to-search-443m</link>
      <guid>https://forem.com/kristoffer_nordstrom/teaching-a-knowledge-base-to-search-443m</guid>
      <description>&lt;p&gt;I searched for "Fredrik Haard" and got back a conversation about exploratory testing philosophy.&lt;/p&gt;

&lt;p&gt;Not wrong, exactly. Fredrik and I talk about testing a lot. But I wasn't looking for a conversation about testing. I was looking for Fredrik. His contact card, his Slack ID, the email thread where he suggested we move the weekly sync to Mondays. Specific things about a specific person.&lt;/p&gt;

&lt;p&gt;The knowledge base had over a hundred thousand chunks of my digital life indexed. And it couldn't find someone by name.&lt;/p&gt;

&lt;p&gt;That's when I realised that building a search system is its own kind of engineering problem. And it took five distinct phases of "this doesn't work, let me try something else" to get it right. Each phase was driven by a concrete failure. Each fix revealed the next problem.&lt;/p&gt;

&lt;p&gt;This is the story of how I taught my knowledge base to search.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: Pure vectors, pure confidence
&lt;/h2&gt;

&lt;p&gt;When I first built the knowledge base (the full story is in &lt;a href="https://blog.northerntest.se/i-gave-my-ai-a-memory/" rel="noopener noreferrer"&gt;I Gave My AI a Memory&lt;/a&gt;), search was simple. Every document gets turned into a vector embedding, a numerical representation of its meaning. When you search, your query gets embedded the same way, and the system finds documents whose vectors are closest in meaning.&lt;/p&gt;
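&lt;p&gt;The core mechanic, stripped of any real embedding model, looks something like this. A minimal sketch: the toy vectors and the &lt;code&gt;vector_search&lt;/code&gt; helper are mine for illustration, not the actual implementation.&lt;/p&gt;

```python
def cosine_similarity(a, b):
    # Similarity of two embedding vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def vector_search(query_vec, index, top_k=5):
    # index maps doc_id to its embedding; rank every document by
    # how close its vector points to the query's vector.
    ranked = sorted(index, key=lambda d: cosine_similarity(query_vec, index[d]), reverse=True)
    return ranked[:top_k]
```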

&lt;p&gt;Think of it like a librarian who organises books by vibes. You walk in and say "I need something about conference planning" and she knows exactly which shelf to point you to. The topics, the feel of the content, the general subject matter. For broad queries, this worked great. "What do I know about conference planning?" pulled up the right notes. "Testing philosophy discussions" surfaced the right conversations.&lt;/p&gt;

&lt;p&gt;But names aren't semantic. "Fredrik Haard" doesn't mean anything in vector space. It's just two tokens that could land anywhere relative to other tokens. Same problem with project codes, invoice numbers, technical terms. Anything where you need an exact match, not a vibe match. You wouldn't walk up to a librarian and ask for "the feeling of being named Fredrik." You'd want her to check the author index.&lt;/p&gt;

&lt;p&gt;I had multiple Fredriks in the system. Fredrik Haard and Fredrik Thuresson. A pure vector search for "Fredrik" would rank results by how much the surrounding context resembled... what, exactly? The concept of being named Fredrik? It returned whatever document happened to be semantically closest to the query embedding. Sometimes that was the right Fredrik. Sometimes it wasn't.&lt;/p&gt;

&lt;p&gt;I didn't discover this by accident. I'd been building an evaluation framework (a test harness for the knowledge base itself) and one of the tasks was: "What is Fredrik's Slack ID?" Simple question. The system couldn't reliably answer it.&lt;/p&gt;

&lt;p&gt;That was the first lesson. Semantic search is great until you need something specific.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 2: Adding keywords to the mix
&lt;/h2&gt;

&lt;p&gt;The fix was obvious once I saw the problem. Don't replace vector search, add keyword search alongside it. Give the librarian a second tool. She already knows which shelf to point you to by topic. Now she also gets an alphabetical index, so she can look up names, codes, and specific terms directly. BM25 (the algorithm behind most traditional search engines) is that index. It finds "Fredrik Haard" by matching those exact words, weighted by how rare they are in the corpus.&lt;/p&gt;
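&lt;p&gt;Not the production code, but the whole shape of BM25 fits in a dozen lines. This sketch assumes pre-tokenised documents; the parameter defaults are the textbook ones:&lt;/p&gt;

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    # Rare words weigh more (IDF); repeated terms saturate (k1);
    # long documents are normalised toward the average length (b).
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf = doc_terms.count(term)
        norm = k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score
```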

&lt;p&gt;So now the system runs both searches in parallel. Vector search finds documents that are semantically similar to the query. BM25 finds documents that contain the actual words. The results get combined using Reciprocal Rank Fusion, a merging algorithm that combines ranked lists by position rather than raw score. If a document ranks highly in both lists, it surfaces near the top. If it only matches one signal, it still appears but lower.&lt;/p&gt;
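&lt;p&gt;Reciprocal Rank Fusion itself is almost embarrassingly small. A sketch (the &lt;code&gt;k=60&lt;/code&gt; constant is the value from the original RRF paper, not necessarily what my system uses):&lt;/p&gt;

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    # Each document earns 1 / (k + rank) per list it appears in;
    # documents ranked well in several lists rise to the top.
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # semantic neighbours
bm25_hits = ["doc_b", "doc_d"]             # exact word matches
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
# doc_b rises to the top because both signals agree on it
```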

&lt;p&gt;You search for "Fredrik Haard," the alphabetical index puts the right documents at the top. You search for "that conversation about exploratory testing approaches," the topic knowledge handles it even if those exact words never appear in the document. And when both signals agree, you get the most confident results.&lt;/p&gt;

&lt;p&gt;The Fredrik problem was solved. But a new one appeared immediately.&lt;/p&gt;

&lt;p&gt;The system was now returning better results. A lot more of them. And every result was a full document. When I was using Claude Code or a local LLM, the search would return fifty relevant chunks and try to stuff them all into the context window. With local models running 4K or 8K token windows, that blew things up completely. Even with Claude's larger context, loading fifty full documents just to answer a question about one of them was wasteful.&lt;/p&gt;

&lt;p&gt;Better recall, but overwhelming the consumer of the results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 3: The librarian trick
&lt;/h2&gt;

&lt;p&gt;I kept hitting context limits, so I started thinking about how a librarian actually works. They don't read every book in the library to answer your question. They find twenty books by topic (fast, rough), then read each one carefully to decide which three actually answer what you asked (slow, precise). I know that in real life your local librarian won't read every book she pulled for your question about "quantum physics in the '80s", but it's a good analogy, and I stuck with it.&lt;/p&gt;

&lt;p&gt;So I built doc cards. A compressed summary of each document containing its title, key facts, main topics, and mentioned entities. About a tenth the size of the full document. Think of them as small generated index cards with good details about the books (the full documents) that the librarian can now use when picking the first twenty. The search now works in two stages. First, scan the doc cards to find what's relevant. Small enough to fit lots of them in a context window. Second, load the full document only for the ones that actually matter.&lt;/p&gt;
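&lt;p&gt;In code, the two stages might look like this. The &lt;code&gt;DocCard&lt;/code&gt; fields mirror what I just described; the card scoring is a deliberately dumb stand-in for the real first-stage search:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class DocCard:
    # Compressed summary, roughly a tenth the size of the full document.
    doc_id: str
    title: str
    key_facts: list
    entities: list

def two_stage_search(query_terms, cards, load_full_document, top_n=3):
    # Stage 1: cheap scan over the small cards.
    def card_score(card):
        text = " ".join([card.title] + card.key_facts + card.entities).lower()
        return sum(1 for term in query_terms if term.lower() in text)
    shortlist = sorted(cards, key=card_score, reverse=True)[:top_n]
    # Stage 2: pay the cost of loading full text only for the shortlist.
    return [load_full_document(card.doc_id) for card in shortlist]
```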

&lt;p&gt;This was born directly from the eval framework. I had a test task where the model needed to find and synthesize information about a person from scattered sources (emails, meeting notes, Slack conversations). With full documents, the local models choked on context. With doc cards as the first stage, they could scan ten summaries, pick the three most relevant, and then load just those.&lt;/p&gt;

&lt;p&gt;The eval scores jumped. Not because the search was returning different results, but because the consumer (the LLM) could actually use what it got.&lt;/p&gt;

&lt;p&gt;Two-stage retrieval. It sounds like an optimisation trick. It's actually a design philosophy. Give the system just enough context to make a good decision about what to look at more closely.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8fwniwzqgpb9spj1kdf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8fwniwzqgpb9spj1kdf.jpg" alt="Rows of old book spines on library shelves, worn with age and use" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Thomas Bormans on Unsplash&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 4: The expensive second opinion
&lt;/h2&gt;

&lt;p&gt;At this point I had hybrid search producing candidates and doc cards keeping context manageable. But the ranking still felt rough sometimes. The BM25 and vector signals disagreed often enough that the merged results weren't always in the best order.&lt;/p&gt;

&lt;p&gt;So I added a cross-encoder reranker. Back to the librarian. So far she's been finding books two ways (by topic and by index lookup) and scanning their index cards. But she hasn't actually sat down with your question and a book at the same time to judge how well it answers what you asked. That's what a cross-encoder does. It takes the query and a document together as a single input and produces a relevance score. Not "is this document about the right topic?" but "does this specific document answer this specific question?"&lt;/p&gt;
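&lt;p&gt;The reranking step is generic once you have a pair scorer. A sketch: in practice the scorer would be a cross-encoder model (for example one of the sentence-transformers cross-encoders), but here it's just a parameter so the logic stands alone:&lt;/p&gt;

```python
def rerank(query, candidates, score_pair, top_k=5):
    # score_pair(query, doc) reads the query and one document together
    # and returns a relevance score - the expensive, accurate step.
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```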

&lt;p&gt;Much more accurate than the earlier stages. But expensive, because it has to process each query-document pair individually. You can't ask the librarian to sit down and carefully read all 68,000 books against your question every time you ask a new one. That doesn't scale.&lt;/p&gt;

&lt;p&gt;The economics work because of the previous phases. By the time the cross-encoder sees the results, hybrid search and doc cards have already narrowed it down to maybe twenty candidates. Asking the librarian to carefully evaluate twenty books against your question? That's a Tuesday afternoon. Twenty instead of sixty-eight thousand is a completely different cost equation.&lt;/p&gt;

&lt;p&gt;But raw fusion scores aren't the whole picture. Not all books in the library are equally trustworthy. A curated project note I wrote deliberately is more likely to be what I'm looking for than one of 38,000 Gmail threads. So before the cross-encoder even sees the results, I adjust scores by source type. Project docs and area notes get a boost (I wrote those on purpose; they're the reference shelf). Slack messages and emails get a slight penalty (there are tens of thousands of them, and most are noise; they're the periodicals bin). And everything decays over time, because a meeting note from last week is usually more relevant than one from two years ago. Different source types decay at different rates (a project README stays relevant longer than a Slack thread).&lt;/p&gt;

&lt;p&gt;The final score blends 60% cross-encoder with 40% of that domain-adjusted score. The cross-encoder gets the biggest vote because it's the most accurate signal. But the domain adjustments keep it honest about what kind of document you're probably looking for.&lt;/p&gt;
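&lt;p&gt;Put together, the scoring adjustments read something like this. The specific boost and half-life numbers are illustrative placeholders, not my actual tuning; the 60/40 blend is the real ratio:&lt;/p&gt;

```python
def domain_adjusted(fusion_score, source_type, age_days):
    # Hypothetical source priors: curated notes up, bulk chat/email down.
    boost = {"project_doc": 1.2, "area_note": 1.15, "email": 0.9, "slack": 0.85}
    # Hypothetical half-lives in days: a README decays slower than a Slack thread.
    half_life = {"project_doc": 365.0, "area_note": 180.0, "email": 60.0, "slack": 30.0}
    decay = 0.5 ** (age_days / half_life.get(source_type, 90.0))
    return fusion_score * boost.get(source_type, 1.0) * decay

def final_score(cross_encoder_score, adjusted):
    # The reranker gets the biggest vote; domain priors keep it honest.
    return 0.6 * cross_encoder_score + 0.4 * adjusted
```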

&lt;p&gt;The eval framework confirmed it. The task that asked about Fredrik? Cross-encoder reranking moved the right answer from position four to position one. On broader queries, the quality improvement was consistent enough that I could see it across multiple eval runs.&lt;/p&gt;

&lt;p&gt;That's measurable progress. Not vibes. Scores.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72ldrv24p2cfsl8ymbnd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72ldrv24p2cfsl8ymbnd.png" alt="Search debug output showing the three-stage scoring pipeline: hybrid fusion, domain adjustments, and cross-encoder reranking for a query about Fredrik Haard" width="770" height="1006"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 5: The trust problem
&lt;/h2&gt;

&lt;p&gt;And then ChromaDB corrupted my index.&lt;/p&gt;

&lt;p&gt;I'd been using ChromaDB as the vector database from the start. It worked fine for months. Then one day, search results for a topic I knew I'd written about came back empty. Not wrong results. No results. The librarian showed up for work and books were missing from the shelves. The catalog said they existed. They didn't. Chunks had silently disappeared from the index after what looked like a process crash. The Python-FFI bridge between ChromaDB's Python wrapper and its underlying storage had a stability problem that only showed up under certain conditions.&lt;/p&gt;

&lt;p&gt;Memory corruption in a memory system. The irony wasn't lost on me.&lt;/p&gt;

&lt;p&gt;I migrated to LanceDB over that same week. One migration script, 117,000 chunks preserved with all their embeddings intact, zero data loss. LanceDB gave me a Rust core (no more segfaults from a Python bridge), native BM25 support (hybrid search built into the database instead of bolted on), and about a third of the storage footprint (2.2 GB down from the bloated ChromaDB files).&lt;/p&gt;
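&lt;p&gt;The migration itself was conceptually a batch copy loop. A sketch of the shape: the actual script used ChromaDB's and LanceDB's own APIs, but here the two stores are just callables so the logic stands alone:&lt;/p&gt;

```python
def migrate(fetch_batch, write_batch, batch_size=1000):
    # Copy chunks (text, embedding, metadata) from the old store to the
    # new one in batches, returning the total moved so counts can be
    # verified against the source - the "zero data loss" check.
    total = 0
    while True:
        batch = fetch_batch(total, batch_size)
        if not batch:
            break
        write_batch(batch)
        total += len(batch)
    return total
```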

&lt;p&gt;That migration wasn't about search quality. It was about trust. If you can't trust the storage layer, nothing you build on top matters. The fanciest reranking in the world is useless if chunks silently vanish.&lt;/p&gt;

&lt;p&gt;That's a foundation decision. Not a feature decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually learned
&lt;/h2&gt;

&lt;p&gt;Five phases, and each one was driven by a failure I couldn't have predicted in advance. Vector search failed on names. Hybrid search overwhelmed context windows. Full documents were too expensive to process. Initial ranking wasn't precise enough. And the database itself couldn't be trusted.&lt;/p&gt;

&lt;p&gt;I couldn't have designed this system on a whiteboard. I had to build it, test it, watch it fail, and fix the failure. The eval framework was the thing that made each phase's improvement visible and gave me confidence to ship each change. Without those test tasks (find Fredrik's Slack ID, aggregate person context, extract specific invoice numbers), I'd have been guessing about whether the search was actually getting better.&lt;/p&gt;

&lt;p&gt;The system today handles 163,000 chunks from ten data sources. It runs three scoring stages (fusion, domain adjustment, cross-encoder reranking) and supports two-stage retrieval for context-constrained consumers. And it sits on a storage layer I trust.&lt;/p&gt;

&lt;p&gt;But the architecture isn't the point. The point is that every piece exists because something broke. Not because I planned it, not because a best practices guide said so. Because I was using the system daily, hitting real problems, and fixing them one at a time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxasphux0g6nlivv67k7s.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxasphux0g6nlivv67k7s.jpg" alt="A dark library with tall bookshelves and a wooden ladder reaching toward the upper shelves" width="800" height="1204"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Tim van Cleef on Unsplash&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;That's what building a knowledge system actually looks like. Not a design doc. A trail of solved failures.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part of a series about building a personal knowledge base. Previously: &lt;a href="https://blog.northerntest.se/i-gave-my-ai-a-memory/" rel="noopener noreferrer"&gt;I Gave My AI a Memory&lt;/a&gt;, &lt;a href="https://blog.northerntest.se/i-removed-the-friction-that-was-the-problem/" rel="noopener noreferrer"&gt;I Removed the Friction. That Was the Problem.&lt;/a&gt;, and &lt;a href="https://blog.northerntest.se/i-called-them-suggestions-there-was-a-reason/" rel="noopener noreferrer"&gt;I Called Them Suggestions. There Was a Reason.&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>tooling</category>
      <category>programming</category>
    </item>
    <item>
      <title>I Called Them Suggestions</title>
      <dc:creator>Kristoffer Nordström</dc:creator>
      <pubDate>Tue, 10 Mar 2026 10:23:04 +0000</pubDate>
      <link>https://forem.com/kristoffer_nordstrom/i-called-them-suggestions-2c0c</link>
      <guid>https://forem.com/kristoffer_nordstrom/i-called-them-suggestions-2c0c</guid>
      <description>&lt;p&gt;About a year ago I started building a personal assistant - not a chatbot, but a system that watches email, triages inboxes, drafts replies, manages receipts, and maintains task lists. It operates as a background service and communicates through suggestions rather than autonomous actions.&lt;/p&gt;

&lt;p&gt;The system proposes things. I decide. Every morning brings a list of overnight emails, draft replies, and pending tasks. I review, approve, reject, or edit them. Nothing happens until I authorize it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0kh68w4evwdtp7n162z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0kh68w4evwdtp7n162z.png" alt="Morning binder showing overnight email triage, pending suggestions, and GTD tasks" width="800" height="814"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Over-Eager Intern
&lt;/h2&gt;

&lt;p&gt;AI assistants resemble "over-eager junior interns" - enthusiastic, well-meaning, always pushing forward without questioning whether action is appropriate. Yet many people grant these systems email, calendar, and file system access without meaningful constraints.&lt;/p&gt;

&lt;p&gt;Real incidents illustrate the risks: an executive told an AI to "confirm before acting," yet it deleted her entire inbox. At another office, an AI on Slack falsely claimed a fire alarm was a scheduled test. As one observation notes: "When those patterns misfire, there is no gut instinct to hesitate... There is just forward motion."&lt;/p&gt;

&lt;h2&gt;
  
  
  Why We Trust Things We Shouldn't
&lt;/h2&gt;

&lt;p&gt;Decades of deterministic software (where clicking "send" reliably sends, and "delete" reliably removes) have built deep trust. This confidence transfers uncritically to AI agents, which appear functionally similar but operate fundamentally differently.&lt;/p&gt;

&lt;p&gt;LLMs are probabilistic pattern machines predicting likely outputs, not executing verified instructions. The safety guarantees of deterministic systems don't apply. Yet the feeling of safety persists, and that's problematic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding LLM Limitations
&lt;/h2&gt;

&lt;p&gt;Yann LeCun (Turing Award winner) argues that before achieving genuine intelligence, AI must predict action consequences. LLMs instead predict the next token - the statistically probable next element in a sequence. They generate responses without simulating outcomes. They lack models of physical consequences or persistent reasoning about "what happens if."&lt;/p&gt;

&lt;p&gt;An agent deleting emails executes the next sequence step with the same computational mechanism as autocompleting a sentence. Modern AI systems improve at reasoning, but the underlying mechanism remains prediction-based. Permission prompts and approval dialogs in deployed systems reveal that even their creators don't trust unsupervised operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"If the AI can't predict consequences, someone has to. That's the human."&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Red Flag and the Car
&lt;/h2&gt;

&lt;p&gt;Historical precedent supports cautious adoption. Early automobiles required someone walking ahead with a red flag, announcing their arrival. This sounds absurd now, but the technology earned trust incrementally: seatbelts, ABS, airbags, crumple zones, traffic rules, licensing, insurance. Each safeguard enabled the next stage of adoption.&lt;/p&gt;

&lt;p&gt;The parallel to autonomous vehicles reveals uncomfortable truths. Research from the University of Glasgow examines how cyclists could trust driverless cars. Human drivers communicate intent through body language, eye contact, and subtle signals. Remove the human, and those signals vanish. Systems need entirely new communication methods.&lt;/p&gt;

&lt;p&gt;Suggestions function as AI's equivalent of eye contact: "I'm thinking about doing this. Are you comfortable with that?" My system hasn't earned trust for autonomous action, nor was it designed to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcp0volcb57wgzuk5n0j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcp0volcb57wgzuk5n0j.png" alt="Draft email reply in Emacs org-mode, ready to edit before approving" width="800" height="814"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Sliding Scale
&lt;/h2&gt;

&lt;p&gt;Not every AI action requires human approval - that would be exhausting and counterproductive. The question is where to position the line.&lt;/p&gt;

&lt;p&gt;Research from Cummings on automation levels and Nemeth on human roles in automated systems provides a framework. Cummings describes a spectrum from 100% human control to 100% technology control; most designers default toward the technology end. My PA was designed for the human end.&lt;/p&gt;

&lt;p&gt;This is a sliding scale, not binary. Low-stakes, easily reversible actions can be more autonomous. My PA classifies emails independently. But sending replies, deleting content, or dismissing alarms requires explicit approval. Consequence severity should match autonomy level.&lt;/p&gt;
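&lt;p&gt;One way to make that sliding scale concrete is an explicit policy table, with "needs approval" as the safe default for anything unlisted. The action names below are illustrative, not my PA's actual vocabulary:&lt;/p&gt;

```python
# Hypothetical autonomy policy: oversight matches consequence severity.
APPROVAL_POLICY = {
    "classify_email": "autonomous",      # low stakes, easily reversible
    "draft_reply": "autonomous",         # only produces a suggestion
    "send_reply": "needs_approval",      # leaves the system, hard to undo
    "delete_content": "needs_approval",  # destructive
    "dismiss_alarm": "needs_approval",   # safety-relevant
}

def handle(action, execute, suggest):
    # Unknown actions fall through to the human - a safe default.
    if APPROVAL_POLICY.get(action, "needs_approval") == "autonomous":
        return execute(action)
    return suggest(action)
```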

&lt;p&gt;Interface design matters significantly. Draft replies appear as org-mode files in Emacs - my environment. I read, rewrite, adjust tone, add context the AI missed, then approve. That's not thumbs-up/thumbs-down review; that's editorial control in a familiar tool. The interface shapes how seriously review happens.&lt;/p&gt;

&lt;p&gt;Nemeth's research (shared through Isabel Evans) identifies a cognitive shift: handing control to AI transforms your role from decision-maker to monitor. Monitoring is cognitively expensive - attention drifts, mistakes slip through, rubber-stamping happens. The suggestion model preserves humans as decision-makers actively choosing, not passively watching.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Design
&lt;/h2&gt;

&lt;p&gt;After a year of daily use, the naming choice feels vindicated. Building an AI assistant wasn't about making it do things - it was about making it stop and ask.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"That's not a limitation. That's the design."&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>I Removed the Friction. That Was the Problem.</title>
      <dc:creator>Kristoffer Nordström</dc:creator>
      <pubDate>Mon, 09 Mar 2026 15:05:45 +0000</pubDate>
      <link>https://forem.com/kristoffer_nordstrom/i-removed-the-friction-that-was-the-problem-c67</link>
      <guid>https://forem.com/kristoffer_nordstrom/i-removed-the-friction-that-was-the-problem-c67</guid>
      <description>&lt;h2&gt;
  
  
  The Productivity Paradox
&lt;/h2&gt;

&lt;p&gt;Over Christmas break 2025, I built a personal knowledge base. I &lt;a href="https://blog.northerntest.se/i-gave-my-ai-a-memory/" rel="noopener noreferrer"&gt;wrote about that&lt;/a&gt;. What I didn't write about was what happened next.&lt;/p&gt;

&lt;p&gt;I got faster. A lot faster. Context that used to take twenty minutes to reconstruct was available in seconds. Release notes&lt;br&gt;
that took an hour to draft took fifteen minutes. Meeting prep that required digging through Slack, email, and three&lt;br&gt;
different wikis became a single search. Voice transcripts from meetings captured into a dozen indexed points in the&lt;br&gt;
knowledge base, ready for retrieval whenever I needed them.&lt;/p&gt;

&lt;p&gt;So of course I did more.&lt;/p&gt;

&lt;p&gt;Not because anyone asked me to. Not because my job required it. Because doing more suddenly felt cheap. The cost of starting&lt;br&gt;
 something new had dropped to almost nothing, and my brain (which has always been good at juggling contexts) noticed&lt;br&gt;
immediately. By January I was running four or five Claude Code sessions across multiple virtual desktops. One reviewing the&lt;br&gt;
team's current roadmap and priorities, one drafting release notes, one analyzing PR changes for the weekly release, one&lt;br&gt;
building a new feature for the knowledge base itself. All of them working in parallel while I bounced between them.&lt;/p&gt;

&lt;p&gt;Former colleagues used to call me the Duracell bunny. I took pride in that. Still do, actually.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Familiar Pattern
&lt;/h2&gt;

&lt;p&gt;But here's the thing. This isn't new. I've been cycling through this pattern since my first real job at UIQ almost two&lt;br&gt;
decades ago. Overextend, hit the wall, pull back, find a balance, feel good about it. Then slowly ramp up again until the&lt;br&gt;
next wall. I just used to be younger and have what felt like infinite stamina, and on the rare occasion I overextended I'd&lt;br&gt;
bounce back after a night's sleep.&lt;/p&gt;

&lt;p&gt;You could track it by my task management. Easy periods meant task lists on paper. Ramping up meant digital to-do lists in&lt;br&gt;
text files. And whenever it peaked I was usually at something like color coded spreadsheets in Excel to juggle and keep&lt;br&gt;
track of everything. It's been the rhythm of my entire professional life.&lt;/p&gt;

&lt;p&gt;The knowledge base surfaced the receipts. 2024: "Stop working long hours, need better discipline logging off." 2025:&lt;br&gt;
"Sometimes I might be too much all over the place." The system I built to make me more productive had become a witness to a&lt;br&gt;
pattern I keep repeating. And now in 2026, with the PKB making everything faster and cheaper, the cycle didn't slow down. It&lt;br&gt;
 accelerated.&lt;/p&gt;

&lt;p&gt;Earlier I had told my therapist I was going to take the productivity gains and produce the same with less: cash in, take it&lt;br&gt;
easier, coast a little. This time it would work. I saw that while lying in bed at 10pm, tired and happy, researching this&lt;br&gt;
exact topic on my phone using the claude.ai app that I had hooked up to my PKB MCP server downstairs on my main&lt;br&gt;
computer.&lt;/p&gt;

&lt;p&gt;I'm failing badly at that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why can't I just stop?
&lt;/h2&gt;

&lt;p&gt;This is the question that keeps coming back. I'm a grown adult with years of experience managing projects, teams, and&lt;br&gt;
deadlines. I know what sustainable looks like. I've coached others on it.&lt;/p&gt;

&lt;p&gt;And yet.&lt;/p&gt;

&lt;p&gt;The feedback loop is constant. You open a Claude Code session, describe what you want, and watch it work. Iterate on the&lt;br&gt;
results, produce quality work faster than ever, without the cognitive load of loading and storing context between sessions.&lt;br&gt;
There's something deeply satisfying about seeing three terminals producing results in parallel while you start a fourth.&lt;br&gt;
Marking a task DONE feels good. Seeing 5 out of 7 completed at the end of the day feels better. The reward is immediate,&lt;br&gt;
visible, and always available.&lt;/p&gt;

&lt;p&gt;It doesn't help that I work from home full-time. Amazing office downstairs, multiple screens, comfy chair. The commute is&lt;br&gt;
fourteen steps.&lt;/p&gt;

&lt;p&gt;And the knowledge base made it worse. Or better, depending on how you look at it. I built a system that reduces the cost of&lt;br&gt;
context-switching to nearly zero. Research from Tuesday is retrievable on Friday without losing anything. A project I set&lt;br&gt;
down for a week picks up right where I left off. The activation energy for starting new work dropped so low that the only&lt;br&gt;
brake left was sleep.&lt;/p&gt;

&lt;p&gt;Brakes, it turns out, are important.&lt;/p&gt;

&lt;p&gt;The self-awareness was never missing. I've known about this pattern for my entire career. Every time I pulled back and found&lt;br&gt;
 balance, I thought I'd learned the lesson. I hadn't. I'd just run out of fuel temporarily. The PKB didn't create the&lt;br&gt;
problem, but it removed the last natural brake. What I've never had is infrastructure to act on what I already know. Knowing&lt;br&gt;
 you should stop and having a system that helps you stop are very different things.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I built to fight back
&lt;/h2&gt;

&lt;p&gt;So I did what I always do when I hit a problem. I built something. But with the same principle I apply to every AI system I&lt;br&gt;
design: the AI proposes, the human decides. No autonomous action without explicit approval.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.northerntest.se%2Fcontent%2Fimages%2F2026%2F02%2Fdaily-planner-screenshot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.northerntest.se%2Fcontent%2Fimages%2F2026%2F02%2Fdaily-planner-screenshot.png" alt="Daily planner output showing a bounded day with must-do, should-do, and want-to-do&amp;lt;br&amp;gt;
items" width="689" height="743"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The core is a daily planning routine. My personal assistant (a separate system that handles email triage, health checks, and&lt;br&gt;
 morning context gathering) prepares a structured summary every morning: today's calendar, the weekly work cadence, what's&lt;br&gt;
urgent, what's overdue, and crucially, yesterday's activity stats. That last part matters more than everything else. If the&lt;br&gt;
diary shows 54 Claude Code sessions and nearly 3,000 audit events from yesterday, the planner knows it was a heavy day and&lt;br&gt;
biases toward a lighter one today.&lt;/p&gt;

&lt;p&gt;Then I sit down with Claude Code and we negotiate. It proposes a bounded list. I push back or agree. Must-do items first&lt;br&gt;
(work comes first, always), should-do items if there's room, and maybe one want-to-do if the day genuinely allows it. The&lt;br&gt;
total effort can't exceed available hours, and there's always a slack buffer for the unplanned Slack messages and support&lt;br&gt;
cases that inevitably show up.&lt;/p&gt;

&lt;p&gt;The critical rule: if something new comes in, something else moves to tomorrow. No stacking.&lt;/p&gt;
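&lt;p&gt;The planner's arithmetic is simple enough to sketch. This is a minimal stand-in, not the real tool: the names (&lt;code&gt;plan_day&lt;/code&gt;, &lt;code&gt;accept_new&lt;/code&gt;) and the numbers (the 0.8 heavy-day bias, the one-hour slack buffer, the session threshold) are illustrative assumptions.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    hours: float
    tier: str  # "must", "should", or "want"

def plan_day(inbox, available_hours, yesterday_sessions, heavy_threshold=40):
    """Propose a bounded day: must-do first, should-do if there's room,
    at most one want-to-do, and always a slack buffer for the unplanned."""
    # A heavy yesterday (lots of Claude Code sessions) biases today lighter.
    if yesterday_sessions >= heavy_threshold:
        available_hours *= 0.8
    budget = available_hours - 1.0  # slack buffer for surprise Slack/support
    plan, want_taken = [], False
    for tier in ("must", "should", "want"):
        for task in [t for t in inbox if t.tier == tier]:
            if tier == "want" and want_taken:
                continue
            if task.hours <= budget:
                plan.append(task)
                budget -= task.hours
                if tier == "want":
                    want_taken = True
    return plan

def accept_new(plan, new_task, tomorrow):
    """The no-stacking rule: something new comes in, something else moves."""
    for tier in ("want", "should"):      # never displace a must-do
        for task in plan:
            if task.tier == tier:
                plan.remove(task)
                tomorrow.append(task)    # displaced to tomorrow, not dropped
                plan.append(new_task)
                return
    tomorrow.append(new_task)            # day is all must-dos: new work waits
```

&lt;p&gt;The point of &lt;code&gt;accept_new&lt;/code&gt; is that the day's size is conserved: new work displaces old work instead of stacking on top of it.&lt;/p&gt;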

&lt;p&gt;When I confirm the plan, it gets written to my task file. During the day I check things off as I go. Next morning, the cycle&lt;br&gt;
 repeats. Archive what's done, return what isn't to the inbox, start fresh.&lt;/p&gt;

&lt;p&gt;It's not complicated. The daily planner, the weekly planner, the activity diary, the audit logger that captures every tool&lt;br&gt;
call and context switch. They're all just tools feeding off the same context infrastructure. The interesting part isn't the&lt;br&gt;
tools.&lt;/p&gt;

&lt;p&gt;It's why I needed them at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The crown jewel problem
&lt;/h2&gt;

&lt;p&gt;None of this would work without the personal knowledge base. And that's the pattern I keep running into, across every tool&lt;br&gt;
I've built.&lt;/p&gt;

&lt;p&gt;I've been trying to push LLMs into practical daily work since ChatGPT launched in November 2022. Over the years I've had&lt;br&gt;
them draft release notes, write Slack replies, prepare meeting agendas. The results got more impressive as models improved,&lt;br&gt;
but they always fell short. The release notes lacked my voice and institutional context. The Slack replies were plausible&lt;br&gt;
but generic. Not me.&lt;/p&gt;

&lt;p&gt;The problem was never model intelligence. It was always lack of data and context.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.northerntest.se%2Fcontent%2Fimages%2F2026%2F02%2Fkb-context-screenshot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.northerntest.se%2Fcontent%2Fimages%2F2026%2F02%2Fkb-context-screenshot.png" alt="Knowledge base synthesizing a coherent briefing from docs, calendar, email, and&amp;lt;br&amp;gt;
conversations" width="766" height="762"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The PKB changed that. It pulls in Slack conversations, GitHub PRs, emails, calendar events. All the institutional knowledge&lt;br&gt;
that used to live scattered across a dozen tools. I had Claude Code retroactively analyze over 200 release notes I'd written&lt;br&gt;
 across four years to build a writing guide that captures my language, my style, my tone of voice, what to avoid. And&lt;br&gt;
suddenly the output felt right. Not perfect, but recognizably mine. After four years of trying, that was the first time it&lt;br&gt;
actually worked.&lt;/p&gt;

&lt;p&gt;But even then: I review everything. Claude Code writes, suggests, proposes. I approve, reject, tweak. Always. The personal&lt;br&gt;
assistant can read the knowledge base but cannot write back to it. If it could, one bad classification poisons future&lt;br&gt;
decisions. I'm the gatekeeper. That's the boring design principle compared to "fully autonomous AI assistant."&lt;/p&gt;

&lt;p&gt;That's what makes it trustworthy.&lt;/p&gt;
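&lt;p&gt;That boundary is worth enforcing in code rather than by convention. A toy sketch of the idea; the class names are illustrative, not the real system:&lt;/p&gt;

```python
class KnowledgeBase:
    """Toy stand-in for the real org-mode store."""
    def __init__(self):
        self._notes = {}

    def read(self, key):
        return self._notes.get(key)

    def write(self, key, value):
        self._notes[key] = value


class ReadOnlyView:
    """The assistant's handle on the knowledge base: it can read, but any
    write attempt fails loudly. One bad autonomous classification can't
    poison future retrievals."""
    def __init__(self, kb):
        self._kb = kb

    def read(self, key):
        return self._kb.read(key)

    def write(self, key, value):
        raise PermissionError("the assistant proposes; the human writes back")
```

&lt;p&gt;The human path writes to &lt;code&gt;KnowledgeBase&lt;/code&gt; directly; the assistant only ever sees the &lt;code&gt;ReadOnlyView&lt;/code&gt;.&lt;/p&gt;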

&lt;h2&gt;
  
  
  Is it working?
&lt;/h2&gt;

&lt;p&gt;Honestly? Too early to tell.&lt;/p&gt;

&lt;p&gt;The first day the planner ran, it told me I had zero free hours. I immediately added a new implementation task anyway.&lt;br&gt;
Within minutes. The bunny lives.&lt;/p&gt;

&lt;p&gt;But I'm collecting data now. Every day the diary logs what I actually did versus what I planned. The audit logger captures&lt;br&gt;
every session, every context switch, every tool call. All of it local, on my hardware, for my eyes only. I have a&lt;br&gt;
retrospective scheduled where Claude Code will analyze the patterns: am I actually stopping when the list is done? Is scope&lt;br&gt;
creep measurable? Are heavy days followed by lighter ones, or am I just bulldozing through?&lt;/p&gt;
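&lt;p&gt;Those retrospective questions reduce to simple checks over the diary's daily counts. One hedged example, with made-up numbers and a hypothetical &lt;code&gt;heavy_then_lighter&lt;/code&gt; metric:&lt;/p&gt;

```python
def heavy_then_lighter(daily_sessions, heavy_threshold=40):
    """Of the heavy days (session count at or above the threshold), what
    fraction were followed by a lighter day? Near 1.0 means the brakes
    work; near 0 means bulldozing through."""
    followed = [nxt < cur
                for cur, nxt in zip(daily_sessions, daily_sessions[1:])
                if cur >= heavy_threshold]
    return sum(followed) / len(followed) if followed else None

# One hypothetical week of Claude Code sessions per day.
week = [54, 31, 48, 52, 20, 35, 44]
ratio = heavy_then_lighter(week)  # 2 of 3 measurable heavy days eased off
```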

&lt;p&gt;The honest answer might be that I've built a more sophisticated to-do list and spreadsheet, and that the Duracell bunny is now&lt;br&gt;
wearing a watch. But even that would be something. At least I'd know.&lt;/p&gt;

&lt;p&gt;What I do know is this: the tools that make knowledge work faster are also the tools that make it harder to stop. The&lt;br&gt;
friction that used to force natural breaks (the blank page, the context rebuild, the slow feedback loop) is disappearing.&lt;br&gt;
For people wired like me, that's a superpower and a trap in the same package.&lt;/p&gt;

&lt;p&gt;I built the system that removed the friction. Now I'm building the system that puts some brakes back.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.northerntest.se%2Fcontent%2Fimages%2F2026%2F02%2Fwhite-rabbit-closing.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.northerntest.se%2Fcontent%2Fimages%2F2026%2F02%2Fwhite-rabbit-closing.jpg" alt="Tenniel's White Rabbit as court herald, composed and in&amp;lt;br&amp;gt;
control" width="800" height="1091"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The bunny is still running. But at least now it knows what time it is.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>tooling</category>
      <category>programming</category>
    </item>
    <item>
      <title>I Gave My AI a Memory</title>
      <dc:creator>Kristoffer Nordström</dc:creator>
      <pubDate>Fri, 06 Mar 2026 11:12:17 +0000</pubDate>
      <link>https://forem.com/kristoffer_nordstrom/i-gave-my-ai-a-memory-1n9i</link>
      <guid>https://forem.com/kristoffer_nordstrom/i-gave-my-ai-a-memory-1n9i</guid>
      <description>&lt;p&gt;I had a dream.&lt;/p&gt;

&lt;p&gt;Not the inspirational kind. A practical, slightly obsessive dream that had been nagging me for years. I wanted a system that remembers what I know. Not a notes app, not a wiki, not another&lt;br&gt;
tool that promises "second brain" and delivers a folder structure. An actual memory. Something I could ask "what did we discuss about the Zoom integration last month?" and get a real&lt;br&gt;
answer.&lt;/p&gt;

&lt;p&gt;I tried building it. A lot. Every attempt died the same way: too much manual work. The technology wasn't there. LLMs could sort of understand text, but they couldn't search it reliably,&lt;br&gt;
couldn't integrate with my existing workflow, couldn't run locally without eating all my VRAM.&lt;/p&gt;

&lt;p&gt;Then, over Christmas 2025, it worked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pain that started it
&lt;/h2&gt;

&lt;p&gt;The immediate trigger was copy-paste. I'd be in a Claude Code session, working on something, and I'd need context from an email. So I'd switch to Gmail, find the thread, copy the relevant&lt;br&gt;
bits, paste them into the conversation. Five minutes later I'd need a Slack message. Switch, search, copy, paste. Then a calendar event. Then something from a ChatGPT conversation I'd had&lt;br&gt;
two weeks ago.&lt;/p&gt;

&lt;p&gt;Every context switch cost me time and focus. And each one added a little mental resistance, a small reason to not bother looking something up, to work from memory instead of checking.&lt;br&gt;
Death by a thousand paper cuts on my mental discipline. And every new Claude session started from zero. No memory of what we discussed yesterday. No knowledge of my preferences, my&lt;br&gt;
projects, my testing philosophy. Just a blank slate with good language skills.&lt;/p&gt;

&lt;p&gt;I'd been living with that friction for long enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cornerstones
&lt;/h2&gt;

&lt;p&gt;Before writing a line of code, I knew three things that weren't negotiable.&lt;/p&gt;

&lt;p&gt;Local and private. My emails, my calendar, my notes, my conversations. All of it stays on my machine. No cloud service gets to index my life. The embeddings run on my own GPUs. If a&lt;br&gt;
company goes bankrupt or changes their terms of service, nothing happens to my data.&lt;/p&gt;

&lt;p&gt;Plain text. Everything stored as org-mode files, the same format I've used in Emacs for years for my task management. Editable in any text editor, forever. No proprietary format, no&lt;br&gt;
database you can't inspect. If the entire system disappeared tomorrow, my knowledge would still be there as readable text files on my hard drive.&lt;/p&gt;

&lt;p&gt;Human ownership. The system retrieves, I decide. It doesn't auto-organize my notes. It doesn't delete things it thinks are irrelevant. It doesn't act on my behalf. I curate what matters.&lt;/p&gt;

&lt;p&gt;Three principles. They sound obvious. They shaped every technical decision that followed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually built
&lt;/h2&gt;

&lt;p&gt;The knowledge base indexes nine data sources into a single searchable system: Gmail, Google Calendar, Google Drive, Slack threads, WhatsApp messages, ChatGPT conversation exports, GitHub&lt;br&gt;
issues, handwritten org-mode notes, and captured Claude Code session context. Over forty-four thousand files of my digital life back to 2012, a hundred thousand indexed chunks, all&lt;br&gt;
available through a set of tools that plug directly into Claude Code or any other MCP capable LLM via the Model Context Protocol.&lt;/p&gt;

&lt;p&gt;Of course, that sounds clean on paper. In practice, each piece exists because something broke or something was missing.&lt;/p&gt;

&lt;p&gt;I started with ChromaDB for the vector database. Worked fine for a while. Then a memory corruption bug silently wiped out part of my index. I only noticed because search results for a&lt;br&gt;
topic I knew I'd written about came back empty. Memory corruption in a memory system. I migrated to LanceDB the same week. It gave me hybrid search (BM25 keyword matching plus vector&lt;br&gt;
similarity) and a Rust engine I actually trust. That's a trust decision.&lt;/p&gt;

&lt;p&gt;Search was the next problem. Pure vector search is great for "find me things about testing philosophy" but terrible for "find me that email from Fredrik Haard." Names, project codes,&lt;br&gt;
technical terms. They get scattered across the vector space in ways that make exact matches unreliable. BM25 handles those perfectly. So the system runs both in parallel and combines the&lt;br&gt;
results. You search for "Fredrik", you get Fredrik. You search for "that conversation about exploratory testing approaches", you get the right conversations even if they never use those&lt;br&gt;
exact words.&lt;/p&gt;
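&lt;p&gt;Combining two parallel rankings can be done several ways; reciprocal rank fusion is one common recipe. A sketch of the idea with made-up doc-ids - LanceDB's own hybrid query API may differ:&lt;/p&gt;

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked result lists (e.g. BM25 and vector search)
    into one. Each list holds doc-ids ordered best-first; a doc earns
    1/(k + rank) from every list it appears in, so documents both
    searchers agree on rise to the top."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 nails the exact name; vector search finds the semantic neighbours.
bm25_hits = ["email-fredrik-2024", "note-team-roster", "pr-112"]
vector_hits = ["note-exploratory-testing", "email-fredrik-2024", "chat-heuristics"]
merged = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

&lt;p&gt;Here "email-fredrik-2024" wins the merged ranking because both searchers found it, which is exactly the behaviour you want from hybrid search.&lt;/p&gt;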

&lt;p&gt;Then there was the context window problem. I run local LLMs for evaluation testing (a whole other story), and they have tiny context windows. You can't load fifty full documents into 4K&lt;br&gt;
tokens and expect useful answers. So I built two-stage retrieval. First pass: scan compressed doc-card summaries (title, key facts, topics) to find what's relevant at a fraction of the token cost.&lt;br&gt;
Second pass: load the full document only for the hits that matter. Think of a librarian who pulls twenty books off the shelf by topic, then carefully reads each one to decide which three&lt;br&gt;
actually answer your question.&lt;/p&gt;
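&lt;p&gt;The librarian pattern looks roughly like this. &lt;code&gt;card_index&lt;/code&gt; and &lt;code&gt;load_full_doc&lt;/code&gt; are hypothetical stand-ins for the real index and document store, and plain word overlap stands in for the real scorers:&lt;/p&gt;

```python
def two_stage_retrieve(query, card_index, load_full_doc, scan_k=20, final_k=3):
    """Stage 1: rank cheap doc-card summaries (a few tokens each) to build
    a shortlist. Stage 2: load the full text only for the shortlist and
    keep the best."""
    q = set(query.lower().split())

    def overlap(text):
        return len(q & set(text.lower().split()))

    # Stage 1: scan the compressed cards, never the full documents.
    shortlist = sorted(card_index, key=lambda c: overlap(c["summary"]),
                       reverse=True)[:scan_k]
    # Stage 2: the librarian reads each shortlisted book properly.
    scored = sorted(((overlap(load_full_doc(c["id"])), c["id"]) for c in shortlist),
                    reverse=True)
    return [doc_id for _, doc_id in scored[:final_k]]
```

&lt;p&gt;Only &lt;code&gt;scan_k&lt;/code&gt; full documents are ever loaded, which is what keeps the pattern workable inside a 4K-token context window.&lt;/p&gt;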

&lt;p&gt;The task management integration was the easy part, actually. I've been running a Getting Things Done system in Emacs org-mode for years. The knowledge base just symlinks to those files.&lt;br&gt;
Tasks live in one place, searchable from everywhere. Not a new system. An extension of one I already trust.&lt;/p&gt;

&lt;p&gt;And the whole thing is organized using a PARA structure (Projects, Areas, Resources, Archive) because it maps to how I already think about my life. Work stuff in one place, personal areas&lt;br&gt;
in another, reference material separate from active work. The structure exists so the retrieval stays sane.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it changed my work
&lt;/h2&gt;

&lt;p&gt;The best way to describe it: Claude actually knows me now.&lt;/p&gt;

&lt;p&gt;Last week I was preparing for a meeting about an upcoming conference webinar. Instead of hunting through Gmail and Slack, I asked Claude to pull up everything related to the planning. Back&lt;br&gt;
came the email thread, my notes from our last call, the talk abstract, and the calendar invite. All in one response, from four different sources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa39pyu712w9fv574ff85.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa39pyu712w9fv574ff85.png" alt="Knowledge base search pulling context from email, calendar, notes, and a past session" width="747" height="667"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That used to be four apps and fifteen minutes of tab-switching. And death by a thousand paper cuts switching context all the time.&lt;/p&gt;

&lt;p&gt;When I start a morning planning session, the system pulls in my calendar, recent emails, overnight Slack messages, and pending tasks without me having to explain any of it. When I'm about&lt;br&gt;
to test a pull request at work, it already has the project documentation, the domain model, related Slack discussions, and my own testing heuristics loaded. I walk into the exploratory&lt;br&gt;
testing session with context I used to spend twenty minutes assembling by hand. The sapient part (the judgment, the risk sense, the "what if I try this?") is still entirely mine. But now I&lt;br&gt;
arrive prepared instead of cold. That preparation is what makes tools like &lt;a href="https://www.northerntest.se/tools/ettool/" rel="noopener noreferrer"&gt;ettool&lt;/a&gt; effective. The context feeds the judgment.&lt;/p&gt;

&lt;p&gt;Every Claude session used to start with five minutes of "let me explain the background." Now the background is just there.&lt;/p&gt;

&lt;p&gt;And the loop closes. After a session I run a single command that extracts the key decisions and new information from our conversation and files them into the right places in the knowledge&lt;br&gt;
base. Next time anyone (me or the AI) asks about that topic, the latest context is already there.&lt;/p&gt;
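&lt;p&gt;The capture step can be pictured like this. The keyword filter below is a crude stand-in for the LLM extraction the real command performs, and the function name is hypothetical:&lt;/p&gt;

```python
DECISION_MARKERS = ("decided", "agreed", "conclusion", "next step")

def capture_session(transcript, org_file):
    """Append decision-like lines from a session transcript to an org file.
    A real implementation would ask an LLM to extract decisions and new
    facts; this keyword heuristic just illustrates the shape of the loop."""
    decisions = [line.strip() for line in transcript.splitlines()
                 if any(marker in line.lower() for marker in DECISION_MARKERS)]
    if decisions:
        with open(org_file, "a", encoding="utf-8") as f:
            f.write("* Session capture\n")
            for d in decisions:
                f.write(f"- {d}\n")
    return decisions
```

&lt;p&gt;Because the output is appended as plain org-mode text, the next search over the knowledge base picks it up with no extra plumbing.&lt;/p&gt;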

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F55beuc2e167mkak1kp9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F55beuc2e167mkak1kp9b.png" alt="Capturing session context back into the knowledge base" width="747" height="715"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But the real change was subtler. I stopped losing things. Ideas I captured in a ChatGPT conversation six weeks ago are findable. An email from a colleague about a technical approach&lt;br&gt;
surfaces when I'm working on the related project. Meeting notes from three months ago appear when the topic comes up again.&lt;/p&gt;

&lt;p&gt;And it's not just work. I used the system to plan Christmas gifts last year. Tracking what my nephews are into, what my wife mentioned wanting months ago, coordinating several 40106 synth&lt;br&gt;
modules and a banana synth build with my daughter. It sounds silly. But that's the point. It's not a developer productivity tool. It's a memory system for your actual life.&lt;/p&gt;

&lt;p&gt;I'd spent years building habits around "put the right thing in the right folder." Now the right thing finds me when I need it. That shift was bigger than I expected.&lt;/p&gt;

&lt;p&gt;I'm not automating myself away. I'm removing the cost of loading and saving context.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trap nobody warns you about
&lt;/h2&gt;

&lt;p&gt;And then something happened that I didn't anticipate. I started working faster. A lot faster. Context switches disappeared. Research that used to take an hour took minutes. Writing with&lt;br&gt;
full context instead of half-remembered details meant fewer rewrites.&lt;/p&gt;

&lt;p&gt;Sounds great.&lt;/p&gt;

&lt;p&gt;It wasn't entirely.&lt;/p&gt;

&lt;p&gt;A former employer nicknamed me the Duracell bunny. Just keep going, one more thing, you're on a roll. Removing friction turned out to be a trap I didn't see coming. When everything becomes&lt;br&gt;
frictionless, the limiting factor isn't productivity. It's knowing when to stop.&lt;/p&gt;

&lt;p&gt;I'm not great at that. More on this in a future post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters beyond me
&lt;/h2&gt;

&lt;p&gt;Context infrastructure is what makes AI collaboration work. Not better models. Not cleverer prompts. The thing that changed my daily work wasn't switching from GPT-4 to Claude or tweaking&lt;br&gt;
my prompting style. It was giving the AI access to what I know, what I've done, and how I think. But local AI, within my control, secure and private.&lt;/p&gt;

&lt;p&gt;Every indexed document, every curated fact is a piece of context I'll never have to explain again. Systematically moving intelligence from runtime to build-time, while keeping human&lt;br&gt;
judgment as the final authority.&lt;/p&gt;

&lt;p&gt;That's the architecture nobody sees. And it matters more than the model.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is the first in a series about what happened when I gave my AI a memory. Next up: what happens when you remove all the friction from your work, and why I had to deliberately put some&lt;br&gt;
back.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Beyond Vibe Coding</title>
      <dc:creator>Kristoffer Nordström</dc:creator>
      <pubDate>Thu, 05 Mar 2026 11:58:21 +0000</pubDate>
      <link>https://forem.com/kristoffer_nordstrom/beyond-vibe-coding-2l9g</link>
      <guid>https://forem.com/kristoffer_nordstrom/beyond-vibe-coding-2l9g</guid>
      <description>&lt;p&gt;Over Christmas break 2025, I built a personal knowledge base. Not a notes app. Not a RAG app. A memory system with what I call a Data Constitution. &lt;a href="https://www.northerntest.se/local-ai/" rel="noopener noreferrer"&gt;The crown&lt;br&gt;
jewel&lt;/a&gt; of everything else I've built since.&lt;/p&gt;

&lt;p&gt;It syncs Gmail, Calendar, Slack, WhatsApp, ChatGPT conversations, GitHub issues, and handwritten notes into a single searchable index. Eight data sources, all flowing into one place. I&lt;br&gt;
used to joke that ChatGPT had become my extended memory after four years and thousands of conversations. Then I built something that actually deserves that description, running locally,&lt;br&gt;
under my control, with every piece of knowledge traceable back to its source.&lt;/p&gt;

&lt;p&gt;Retrieval uses hybrid search: BM25 keyword matching, vector semantic search, and cross-encoder reranking. Everything lives as plain-text org-mode files. No vendor lock-in. Editable in any&lt;br&gt;
text editor, forever.&lt;/p&gt;

&lt;p&gt;Built in about four weeks with Claude Code. Running every day since. And designed around one principle: the AI assists, I decide.&lt;/p&gt;

&lt;p&gt;If you follow the AI discourse, there's a term for this: "vibe coding." Anthropic themselves used it. The narrative goes something like: you tell the AI what you want, it writes the code,&lt;br&gt;
you ship it. Easy.&lt;/p&gt;

&lt;p&gt;I hate that term.&lt;/p&gt;

&lt;p&gt;Not because it's wrong about the speed. The speed is real. What took weeks now takes hours. But "vibe coding" implies something passive. Like you're along for the ride. Like the hard part&lt;br&gt;
is the prompting. Worse, it erases the human from the story entirely. As if the person behind the keyboard is interchangeable with anyone else who can type a prompt.&lt;/p&gt;

&lt;p&gt;The hard part was never the prompting.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually happened
&lt;/h2&gt;

&lt;p&gt;I chose org-mode as the foundation because I've used Emacs for years and I wanted plain text files I could edit in any tool, forever. Not markdown, not a database, not some proprietary&lt;br&gt;
format. That's a sustainability decision that shapes everything downstream.&lt;/p&gt;

&lt;p&gt;I already had a task management workflow that had been running for years: a Getting Things Done system in Emacs with specific states and transition rules. The new systems didn't replace&lt;br&gt;
that. They were built to integrate with it, to respect an existing way of working rather than imposing a new one.&lt;/p&gt;

&lt;p&gt;I built a two-stage retrieval system because I kept hitting context window limits with local models. When you have over a hundred thousand indexed chunks, you can't just load everything.&lt;br&gt;
First pass: LLM-generated doc-card summaries that compress each document into its key facts and topics. Second pass: load the full document only for the hits that actually matter. That&lt;br&gt;
pattern didn't come from a design document. It came from weeks of research, testing, hitting walls, reading about how other people solved similar problems, and making judgment calls based&lt;br&gt;
on twenty years of building and testing systems.&lt;/p&gt;

&lt;p&gt;The knowledge base became the foundation for everything else. A personal assistant that uses the KB to draft email replies and process receipts overnight. An evaluation framework that&lt;br&gt;
tests whether local open-source models can handle my actual workflows. A training pipeline that does LoRA fine-tuning of local models using data extracted from my own real workflows,&lt;br&gt;
exports to GGUF format for local inference, and validates through the eval framework before anything gets promoted to production. A frozen training corpus, built from my own data.&lt;/p&gt;

&lt;p&gt;Throughout all of this I wasn't just prompting and accepting whatever came back. I use multiple AIs in my workflow: Claude Code as the main driver for implementation, ChatGPT for plan&lt;br&gt;
review and feedback from a different perspective. I sit in the middle as the architect, actively reading, writing my own analysis, bouncing ideas between systems. It usually takes two or&lt;br&gt;
three iterations before a plan is significantly better than what either AI initially suggested. That's not copying and pasting between chatbots. That's engineering.&lt;/p&gt;

&lt;p&gt;And for all of these decisions I had to intimately know the system I was building. How it connects, how it behaves, where the failure modes are. Not every line of code, but the&lt;br&gt;
architecture and its consequences. I pushed back on the AI when it tried to go off the rails. I challenged approaches that felt wrong. This is what testers do as a natural part of our&lt;br&gt;
expertise. We learn systems deeply, even ones we didn't write.&lt;/p&gt;

&lt;p&gt;None of that is vibes. That's active architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ettool weekend
&lt;/h2&gt;

&lt;p&gt;The sharpest example happened in late January. On a Friday afternoon I was testing a pull request at work and got frustrated with my note-taking. The exploratory testing workflow I'd used&lt;br&gt;
for years was: test in the browser, switch to a text file, write what I found, switch back, lose my place, repeat. By the time I left work that evening, I'd captured the idea and a rough&lt;br&gt;
sketch of what a tool could look like.&lt;/p&gt;

&lt;p&gt;Saturday lunch I went downstairs to my computer and worked through the architecture. What should it capture, how should the pieces connect, what constraints matter. Saturday afternoon,&lt;br&gt;
about three to four hours of work, I had a working tool. &lt;a href="https://www.northerntest.se/tools/ettool/" rel="noopener noreferrer"&gt;ettool&lt;/a&gt; records your browser actions and your voice narration simultaneously; in real time&lt;br&gt;
 I mark what's a bug, an issue, a comment and so on via hot-keys, and then an LLM correlates the three streams into structured test findings. After a session you get an org-mode report: a timeline&lt;br&gt;
 of actions, correlated voice observations, and extracted findings categorized by type.&lt;/p&gt;

&lt;p&gt;On Sunday I built an ambitious extension: real-time voice transcription with LLM intent detection. The idea was that the tool would understand what you were doing while you were doing it.&lt;/p&gt;

&lt;p&gt;I tried it. It worked. I hated it. Within minutes I could feel it changing how I tested, and not in a good way. The real-time feature was second-guessing me, and I was losing the sense of&lt;br&gt;
control that makes exploratory testing work. Me, the expert. The tester. The one with the domain knowledge and the judgment for where the risks are. The AI was taking that agency away.&lt;/p&gt;

&lt;p&gt;I removed the feature and documented why in the repo.&lt;/p&gt;

&lt;p&gt;Friday idea. Saturday MVP. Sunday experiment, tested, rejected. That entire cycle happened in one weekend. But here's what "vibe coding" misses about that story: none of it would have been&lt;br&gt;
 possible without the experience I brought to it. Knowing what exploratory testing needs. Knowing how testers think during sessions. Knowing what the browser APIs support, how to structure&lt;br&gt;
 audio processing, when a feature is hurting more than helping. A junior developer could have prompted their way to something that superficially worked. They wouldn't have known to reject&lt;br&gt;
the real-time feature because they wouldn't have felt the cognitive interference the way someone with years of testing experience does.&lt;/p&gt;

&lt;p&gt;ettool is now the tool I use for most of my exploratory testing sessions at work. The feature I rejected in January is still rejected.&lt;/p&gt;

&lt;p&gt;What surprised me wasn't that I could build it fast. It was that I could discard a feature without emotional attachment because it was so cheap to try. When building is expensive, you get&lt;br&gt;
attached to what you've built. When it's cheap, you can afford to be honest about what doesn't work.&lt;/p&gt;

&lt;p&gt;That's not vibes. That's faster epistemology.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture nobody sees
&lt;/h2&gt;

&lt;p&gt;"Vibe coding" glosses over the most important part of building software: the decisions that aren't about code.&lt;/p&gt;

&lt;p&gt;My knowledge base has a Data Constitution. Three validation lanes: CLEAN, REPAIR, QUARANTINE. Corrupted data fails visibly, never silently. That's a governance decision. The personal&lt;br&gt;
assistant has a hard boundary: it drafts, I decide. No external action without explicit human approval. That's a trust boundary. The evaluation framework uses holdout sets and promotion&lt;br&gt;
gates so local models can't regress on tasks they already pass. That's quality engineering.&lt;/p&gt;

&lt;p&gt;These systems are built with the same rigor I bring to my professional work. Automated test harnesses. Git version control. CI pipelines with GitHub Actions. Code review. The kind of&lt;br&gt;
engineering discipline that keeps systems from rotting. That rigor is exactly what separates this from actual vibe coding, because vibe coding does exist. There are plenty of examples of&lt;br&gt;
people and organizations who prompt their way to a solution without preserving institutional knowledge, without knowing which decisions were made or why. It works until something breaks,&lt;br&gt;
and then they're in a real mess because nobody understands the system well enough to fix it.&lt;/p&gt;

&lt;p&gt;I didn't prompt my way to any of those architectural decisions. They came from twenty years of learning what happens when you ship systems without validation gates. The AI wrote the code.&lt;br&gt;
But I researched the problems, I learned the domains, I made the architectural calls, and I rejected the approaches that didn't hold up. I was not a passive passenger. I was an active&lt;br&gt;
architect who happened to have a very fast builder on the team.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd actually call it
&lt;/h2&gt;

&lt;p&gt;I've been trying to find a better term. "Architect-driven AI development" is accurate but clunky. When I think about the difference between me and my friend Fredrik, who is a proper&lt;br&gt;
developer in ways I'll never be, it comes down to this: he's a tool-forger, I'm a tool-user. He wants to understand and validate every implementation. I want to understand the building&lt;br&gt;
without inspecting every brick. He judges systems by correctness and elegance. I judge them by whether they actually change how I work and what value they bring.&lt;/p&gt;

&lt;p&gt;Both are real technical skills. They just operate at different layers.&lt;/p&gt;

&lt;p&gt;The most honest description I've found: I don't need to inspect every brick, but I care deeply about the building.&lt;/p&gt;

&lt;p&gt;Come to think of it, that's how testers have always worked. We don't write the code line by line, but we're still deeply responsible for the system being delivered. We've spent decades&lt;br&gt;
being accountable for things we didn't author. Responsibility without authorship. Maybe testers were built for this era more than we realize.&lt;/p&gt;

&lt;p&gt;If you're building with AI and it feels like more than vibes, it probably is. The prompting is the easy part. Knowing what to ask for, what to reject, and what the system needs to be when&lt;br&gt;
you're not looking at it. That's the work. That's always been the work. And it still requires a human who knows what they're doing.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>tooling</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
