<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Knitli</title>
    <description>The latest articles on Forem by Knitli (@knitli).</description>
    <link>https://forem.com/knitli</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F11638%2F17be09da-9c27-4359-aba9-e9fa88a0e3ca.png</url>
      <title>Forem: Knitli</title>
      <link>https://forem.com/knitli</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/knitli"/>
    <language>en</language>
    <item>
      <title>Context Engineering: How We Work Around the Goldfish Problem</title>
      <dc:creator>Adam Poulemanos</dc:creator>
      <pubDate>Tue, 06 Jan 2026 15:50:43 +0000</pubDate>
      <link>https://forem.com/knitli/context-engineering-how-we-work-around-the-goldfish-problem-252i</link>
      <guid>https://forem.com/knitli/context-engineering-how-we-work-around-the-goldfish-problem-252i</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://blog.knitli.com/context-engineering-how-we-work-around-the-golfish-problem" rel="noopener noreferrer"&gt;blog.knitli.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  tl;dr
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Context engineering is the practice of deciding what information goes into a large language model's (LLM's) context window and when&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The dominant approach today is summarization&lt;/strong&gt;: using an LLM to compress context when the window fills up&lt;/li&gt;
&lt;li&gt;Summarization works well for some tasks but loses critical details in others, forcing agents to re-retrieve the same information repeatedly&lt;/li&gt;
&lt;li&gt;Other approaches like RAG and fine-tuning exist, each with real tradeoffs&lt;/li&gt;
&lt;li&gt;Understanding these tradeoffs helps you choose the right tools and know when to trust them&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  If Context is King, Context Engineering is Kingmaking
&lt;/h2&gt;

&lt;p&gt;In my last post, I explained that LLMs are goldfish. They can only see what fits in their context window, and they forget everything else. I also showed how context poisoning happens when you dump too much irrelevant information into that window, making it harder for the model to find what matters.&lt;/p&gt;

&lt;p&gt;So how do engineers actually deal with this? That's where context engineering comes in.&lt;/p&gt;

&lt;p&gt;Context engineering is the practice of deciding what information an LLM sees, when it sees it, and how much of it gets included. It's the difference between an AI that gives you useful answers and one that hallucinates or misses obvious details.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Summarization Approach (What Most Tools Do Today)
&lt;/h2&gt;

&lt;p&gt;Here's how most AI coding tools handle context limits:&lt;/p&gt;

&lt;p&gt;The agent starts working on your task. It reads files, makes changes, runs commands, and accumulates history. All of this fills the context window.&lt;/p&gt;

&lt;p&gt;When the window approaches its limit—usually around 95% full—the system needs to make room for more.&lt;/p&gt;

&lt;p&gt;The standard solution: call another LLM to summarize everything that's happened so far. The summarization LLM gets the entire conversation history and a prompt like "compress this to save tokens." It produces a shortened version, the system discards the original details, and the agent continues with this compressed summary as its only record of what came before.&lt;/p&gt;

&lt;p&gt;This is called "auto-compact" or "context compression" or "hierarchical summarization," but it's all the same basic idea. Claude Code does it. Cursor does it. Most agent frameworks do it.&lt;/p&gt;

&lt;p&gt;Why is this approach so common? Because it's a reasonable response to a hard constraint. Context windows are finite. Work sessions aren't. Something has to give, and summarization is cheap to implement and works surprisingly well for many tasks.&lt;/p&gt;

&lt;p&gt;But it has real limitations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Summarization Works and Where It Doesn't
&lt;/h2&gt;

&lt;p&gt;Summarization works well when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The task is mostly linear (do step A, then B, then C)&lt;/li&gt;
&lt;li&gt;Earlier details genuinely don't matter once completed&lt;/li&gt;
&lt;li&gt;The agent won't need to revisit specific information from early in the session&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Summarization struggles when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The task involves debugging or iterative refinement&lt;/li&gt;
&lt;li&gt;The agent needs to compare current state to earlier state&lt;/li&gt;
&lt;li&gt;Specific details (exact error messages, variable names, code snippets) matter more than general narrative&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's a concrete example of the second case:&lt;/p&gt;

&lt;p&gt;An agent is debugging a function. It reads the function definition, identifies a bug, makes a fix, tests it, sees a new error, and reads the function again to understand the new error.&lt;/p&gt;

&lt;p&gt;Then the context window fills up. The system summarizes.&lt;/p&gt;

&lt;p&gt;The summary might say: "Fixed bug in calculate_total function, encountered new error."&lt;/p&gt;

&lt;p&gt;But it doesn't include the actual function code, the specific error message, or the change that was made. That detail is gone.&lt;/p&gt;

&lt;p&gt;Two turns later, the agent needs to understand why the new error is happening. It doesn't have the function code anymore—that got summarized away. So it re-reads the file, re-retrieving context it already had.&lt;/p&gt;

&lt;p&gt;This happens often in debugging workflows. Agents spend time and tokens re-reading information they've already seen because summarization discarded the details they need.&lt;/p&gt;

&lt;p&gt;It's like taking notes during a meeting by writing "discussed the budget" and then, when someone asks you what the actual numbers were, having to go back and re-watch the recording.&lt;/p&gt;

&lt;p&gt;The deeper problem: summarization is lossy in unpredictable ways. The LLM doing the compression has to guess what's important. Sometimes it guesses wrong. When that happens, the agent either fails or has to backtrack and reconstruct context from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Approaches and Their Tradeoffs
&lt;/h2&gt;

&lt;p&gt;Summarization dominates because it is easy and often 'good enough.' There are other approaches, each with its own pros and cons:&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG: Retrieval Augmented Generation
&lt;/h3&gt;

&lt;p&gt;RAG treats your codebase (or other data) like a searchable database. It breaks everything into chunks, converts them into numerical representations called embeddings (essentially coordinates in a high-dimensional space where similar content clusters together), and stores them. When the agent needs information, it searches for relevant chunks and adds them to the context.&lt;/p&gt;
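
&lt;p&gt;Here's a toy version of that retrieval step to make the idea concrete. The &lt;code&gt;embed&lt;/code&gt; callable stands in for whatever embedding model a real system uses, and production systems store vectors in a database rather than a Python list:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Naive RAG retrieval sketch: rank stored chunks by cosine similarity
# to the query embedding, then splice the top few into the context.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, chunks, chunk_vectors, embed, top_k=3):
    # embed() is a placeholder for an embedding-model call.
    query_vector = embed(query)
    ranked = sorted(zip(chunks, chunk_vectors),
                    key=lambda pair: cosine(query_vector, pair[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;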

&lt;p&gt;&lt;strong&gt;The appeal:&lt;/strong&gt; RAG lets you work with massive codebases without loading everything at once. You retrieve only what's relevant for each query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tradeoff:&lt;/strong&gt; The quality of RAG depends entirely on how you implement it. Naive implementations use simple similarity matching—essentially asking "which chunks of text sound most like this query?" This works okay for documentation but breaks down for code. A function definition might have low textual similarity to a query about debugging an error that function causes. Dependencies three files away don't "sound like" the immediate problem, even when they're critical to understanding it.&lt;/p&gt;

&lt;p&gt;More sophisticated RAG systems understand code structure: they know about function calls, imports, type definitions, and can traverse these relationships. This makes retrieval much more accurate but is significantly harder to build.&lt;/p&gt;

&lt;p&gt;The practical result: RAG quality varies enormously between tools. When evaluating a tool that uses RAG, the question isn't "does it use RAG" but "how smart is its retrieval?"&lt;/p&gt;

&lt;h3&gt;
  
  
  Caching: Remember What You've Already Seen
&lt;/h3&gt;

&lt;p&gt;Some systems cache frequently-accessed context so they don't have to re-retrieve or re-process it. If an agent reads the same file five times during a session, caching means you only pay the retrieval cost once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The appeal:&lt;/strong&gt; Caching directly addresses the re-retrieval problem that summarization creates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tradeoff:&lt;/strong&gt; Caches take memory. They can become stale if files change. And deciding what to cache (and when to invalidate it) adds complexity.&lt;/p&gt;
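
&lt;p&gt;The simplest version of the idea is a read cache keyed on a file's modification time. This is only a sketch; real tools add size limits and smarter invalidation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal file-read cache with staleness checks (sketch, not a real tool's code).
import os

_cache = {}   # path mapped to (mtime, contents)

def read_cached(path):
    mtime = os.path.getmtime(path)
    hit = _cache.get(path)
    if hit is not None and hit[0] == mtime:
        return hit[1]                      # fresh hit: no re-read, no extra tokens
    with open(path, encoding="utf-8") as f:
        contents = f.read()
    _cache[path] = (mtime, contents)       # store or refresh the entry
    return contents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;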

&lt;h3&gt;
  
  
  Agents: Let the Model Search for Itself
&lt;/h3&gt;

&lt;p&gt;Agent systems give the LLM tools to retrieve its own context. Instead of pre-selecting information, you let the model search files, run commands, or call APIs to find what it needs.&lt;/p&gt;
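
&lt;p&gt;In code, the pattern looks roughly like this. The &lt;code&gt;llm&lt;/code&gt; callable and the two tools are hypothetical stand-ins; real frameworks wire this up through provider-specific tool-calling APIs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Bare-bones agent loop (sketch): the model picks a tool, the system runs it,
# and the result is appended to the context for the next step.
import pathlib
import subprocess

TOOLS = {
    "read_file": lambda path: pathlib.Path(path).read_text(),
    "run_command": lambda cmd: subprocess.run(
        cmd, shell=True, capture_output=True, text=True).stdout,
}

def agent_loop(task, llm, max_steps=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = llm(messages)                    # every step is another inference call
        if reply.get("tool") is None:
            return reply["content"]              # the model decided it is done
        result = TOOLS[reply["tool"]](reply["argument"])
        messages.append({"role": "tool", "content": result})   # retrieved context
    return "step limit reached"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;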

&lt;p&gt;&lt;strong&gt;The appeal:&lt;/strong&gt; Agents can adapt. They search for what they need in the moment and course-correct based on what they find.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tradeoff:&lt;/strong&gt; Agents are slower and more expensive. Every search is another API call (called an "inference call"), which means more tokens—the basic units that AI providers charge you for—and more compute. Agents also make mistakes: they search for the wrong things, miss obvious information, or get stuck in loops. And because the model has to reason about what to retrieve at each step, the whole process uses tokens fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fine-tuning: Bake It Into the Model
&lt;/h3&gt;

&lt;p&gt;Fine-tuning means retraining the model on your specific codebase or domain so it "learns" your patterns and doesn't need them in the context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The appeal:&lt;/strong&gt; Once fine-tuned, the model already "knows" your code. No retrieval needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tradeoff:&lt;/strong&gt; Fine-tuning is expensive and inflexible. You need GPU time, training data, and constant retraining as your codebase changes. Fine-tuned models also aren't great at specific details—they learn general patterns but still hallucinate function names or recent changes. For fast-moving projects, fine-tuning can't keep up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Challenge: Context Engineering is a Hard Problem
&lt;/h2&gt;

&lt;p&gt;Good context engineering for coding tasks requires several things that are genuinely difficult:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding code structure:&lt;/strong&gt; What depends on what? Which files matter for which tasks? How does information flow through the system? This requires parsing and analyzing code, not just treating it as text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic decision-making:&lt;/strong&gt; Different questions need different context. Understanding what a function does requires different information than debugging why it crashes, which requires different information than refactoring it for performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Precision:&lt;/strong&gt; Pulling the right information without including noise. Every irrelevant token makes it harder for the model to find what matters.&lt;/p&gt;

&lt;p&gt;Most tools make pragmatic tradeoffs here. They use approaches that are cheap to implement and work well enough for common cases, even if they break down on complex tasks. That's not incompetence—it's engineering under constraints.&lt;/p&gt;

&lt;p&gt;But it does mean that for complex, real-world work, context engineering is often the limiting factor. Not model capability. Not prompt quality. Whether the model has the right information to work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Costs of Poor Context Engineering
&lt;/h2&gt;

&lt;p&gt;When context engineering breaks down, the costs show up in three places:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Money:&lt;/strong&gt; Every token you process costs money. When you re-retrieve the same information repeatedly, you're paying to process those tokens over and over. When you include irrelevant context "just in case," you're paying for all of it. For teams using AI at scale, this can significantly increase infrastructure costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed:&lt;/strong&gt; Processing large contexts takes time. The more tokens you feed the model, the longer it takes to respond. When agents have to search repeatedly for information they've already seen, tasks stretch out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reliability:&lt;/strong&gt; When the model has to work with lossy summaries or sift through irrelevant information, it makes mistakes. It latches onto the wrong details, misses important nuance, or hallucinates. This is why AI coding tools sometimes confidently suggest fixes that break your code or miss bugs that are obvious if you have the right context.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Can Do About It
&lt;/h2&gt;

&lt;p&gt;If you're using AI coding tools, here are some practical things to keep in mind:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch for re-retrieval patterns.&lt;/strong&gt; If you notice an agent reading the same file multiple times in a session, that's a sign that context is being lost. Some tools handle this better than others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Match tools to tasks.&lt;/strong&gt; Summarization-based tools work fine for straightforward, linear tasks. For debugging or iterative work, look for tools with smarter context management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ask about context strategy.&lt;/strong&gt; When evaluating AI coding tools, ask: How do they handle long sessions? What happens when the context window fills up? Do they use RAG, and if so, how sophisticated is the retrieval?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep sessions focused.&lt;/strong&gt; Shorter, focused sessions are less likely to hit context limits than sprawling multi-hour sessions. If you're doing complex work, sometimes starting fresh with targeted context is more effective than continuing a bloated session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provide explicit context.&lt;/strong&gt; Don't assume the tool will find what it needs. If you know a specific file or function is relevant, mention it directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I'm Trying to Fix It
&lt;/h2&gt;

&lt;p&gt;With Knitli, I'm working on context engineering that understands code structure—tracking dependencies, call graphs, repository patterns, and type relationships so retrieval is precise and adaptive rather than approximate and sweeping. My goal: assemble exactly the context each task needs, avoiding both the re-retrieval problem and context pollution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My first attempt at that is &lt;a href="https://github.com/knitli/codeweaver" rel="noopener noreferrer"&gt;CodeWeaver&lt;/a&gt;, which you can try today.&lt;/strong&gt; It's rough around the edges and doesn't achieve that goal yet, but it's much more capable at attacking the problem than existing tools. It's also fully open source and free.&lt;/p&gt;

&lt;p&gt;I'm not claiming I've solved context engineering. It's a genuinely hard problem. But I think current approaches leave a lot of room for improvement, and I'm focused on closing that gap.&lt;/p&gt;

&lt;p&gt;If you're interested in following along, you can learn more at &lt;a href="https://knitli.com" rel="noopener noreferrer"&gt;knitli.com&lt;/a&gt;, or try &lt;a href="https://github.com/knitli/codeweaver" rel="noopener noreferrer"&gt;CodeWeaver&lt;/a&gt; and get involved in making something better.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What context engineering challenges have you run into with AI coding tools? I'd love to hear your experiences in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>contextengineering</category>
      <category>mcp</category>
      <category>ai</category>
      <category>development</category>
    </item>
    <item>
      <title>Tree-Sitter Grammars Explained: Leveraging Data for Clarity</title>
      <dc:creator>Adam Poulemanos</dc:creator>
      <pubDate>Mon, 06 Oct 2025 14:22:20 +0000</pubDate>
      <link>https://forem.com/knitli/tree-sitter-grammars-explained-leveraging-data-for-clarity-3hgf</link>
      <guid>https://forem.com/knitli/tree-sitter-grammars-explained-leveraging-data-for-clarity-3hgf</guid>
      <description>&lt;h2&gt;
  
  
  How a Week of Jargon and 25 Languages Resulted in Creating the Parser I Needed
&lt;/h2&gt;




&lt;h4&gt;
  
  
  Clarity Engineering
&lt;/h4&gt;




&lt;h2&gt;
  
  
  TL;DR: If You're Here Because Tree-sitter's &lt;code&gt;node-types.json&lt;/code&gt; Makes No Sense
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;You're not alone.&lt;/strong&gt; Tree-sitter's terminology is confusing because it evolved from internal implementation details, not developer clarity.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Core Problems
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"Named" doesn't mean "has a name"&lt;/strong&gt; (everything has a name). It means "corresponds to a named grammar rule" an internal detail that's noise for most use cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Fields" and "children" are both parent-child relationships&lt;/strong&gt; but the distinction is unclear. Fields are semantic ("this node's &lt;em&gt;condition&lt;/em&gt;"), children are positional ("this node's first child").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Everything is a "type"&lt;/strong&gt; : Nodes, edges, and abstract categories all use the same terminology, obscuring the differences that matter.&lt;/li&gt;
&lt;/ul&gt;
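
&lt;p&gt;Here's a small sketch of the fields-versus-children distinction, assuming the py-tree-sitter bindings (0.23-style constructor) and the separately installed &lt;code&gt;tree_sitter_python&lt;/code&gt; grammar package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fields address a child semantically; children/named_children address it by position.
import tree_sitter_python
from tree_sitter import Language, Parser

parser = Parser(Language(tree_sitter_python.language()))
tree = parser.parse(b"if x:\n    pass\n")
if_node = tree.root_node.children[0]

print(if_node.type)                                    # if_statement
print(if_node.child_by_field_name("condition").text)   # b'x'  (semantic: its condition)
print(if_node.named_children[0].text)                  # b'x'  (positional: first named child)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Both calls reach the same identifier here; the field just records &lt;em&gt;why&lt;/em&gt; that child matters.&lt;/p&gt;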




&lt;p&gt;This post uses formatting unsupported by dev.to; please read the rest at &lt;a href="https://blog.knitli.com/tree-sitter-grammars-explained-leveraging-data-for-clarity" rel="noopener noreferrer"&gt;our website&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>clarityengineering</category>
      <category>codeweaver</category>
      <category>astgrep</category>
      <category>contextengineering</category>
    </item>
    <item>
      <title>Context and Context Windows: What You Need to Know</title>
      <dc:creator>Adam Poulemanos</dc:creator>
      <pubDate>Fri, 26 Sep 2025 18:17:18 +0000</pubDate>
      <link>https://forem.com/knitli/context-and-context-windows-what-you-need-to-know-h4k</link>
      <guid>https://forem.com/knitli/context-and-context-windows-what-you-need-to-know-h4k</guid>
      <description>&lt;h2&gt;
  
  
  Why Your AI is a Goldfish
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Part 2 of Knitli's 101 introductions to AI and the economics of AI&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  tl;dr
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Large language models (LLMs) use a fixed-size context window to process input and generate responses, but they don't have memory like humans.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The context window contains all the information the model can consider at once, and when it overflows, older information is lost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;LLMs are trained on outdated data, leading to a preference for older information and potential hallucinations when asked about unknown topics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Context management is crucial, as including too much or irrelevant information can hinder response accuracy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Engineers use techniques like prioritizing recent information and filtering out irrelevant details to manage context effectively.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  LLMs and Their 'Memories'
&lt;/h2&gt;

&lt;p&gt;When people talk about 'AI' today, they usually mean ChatGPT, Claude, or Gemini. These tools all use &lt;strong&gt;large language models (LLMs)&lt;/strong&gt;. LLMs consist of billions of &lt;em&gt;parameters&lt;/em&gt;. Think of each parameter as a number in a massive mathematical equation. The model combines these parameters with your input to generate responses. It's a huge statistical machine: predicting the most &lt;em&gt;likely&lt;/em&gt; output based on its training parameters and the context you gave it (intentionally or otherwise).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Context Window 'Container'
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;LLMs don't remember like humans do (or at all).&lt;/strong&gt; Instead, they work with a fixed-size container called a &lt;strong&gt;context window&lt;/strong&gt;. Everything you send the model (every word, file, or bit of data) &lt;em&gt;and&lt;/em&gt; all of the model's previous responses fill this container. When the container overflows, the oldest information disappears. The model can no longer see it, even if you can still see it on your screen.&lt;/p&gt;

&lt;p&gt;Think of it this way: &lt;strong&gt;the context window contains &lt;em&gt;all&lt;/em&gt; the information the LLM can consider at once.&lt;/strong&gt; Any information not in the window, or that doesn't fit in it, doesn't exist to the model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time Bias: Why LLMs Live in the Past
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The context window is the only way to provide LLMs with recent or specific information.&lt;/strong&gt; Companies train these models on huge datasets, but collecting and processing this data takes years. Most of the training data is 2-3 years old or even older. A model might say it was trained up to a month or two ago, but recent information is only a tiny part of its overall training data. Most of what it knows is outdated.&lt;/p&gt;

&lt;p&gt;For instance, if you ask an LLM about a programming framework released last month, it won't know about it unless you include the documentation in your context window. The model's training data just doesn't have that recent information. This leads to a strong preference for older information, even when newer details might be more important.&lt;/p&gt;

&lt;p&gt;It's also important to note that if you ask LLMs to provide information on something they haven't been trained on and have no context for, they will likely produce &lt;em&gt;hallucinations&lt;/em&gt;. A &lt;em&gt;hallucination&lt;/em&gt; occurs when the LLM generates false or made-up information that can sometimes sound real or nearly true. This happens because you asked it to provide information about something it &lt;em&gt;can't&lt;/em&gt;, so it creates something &lt;em&gt;similar&lt;/em&gt; based on its training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model &lt;em&gt;training&lt;/em&gt; is permanent. &lt;em&gt;Context&lt;/em&gt; is &lt;em&gt;temporary&lt;/em&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Your AI Friend is a Goldfish: It Has No Memory
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Models can't save their context between messages.&lt;/strong&gt; The system that feeds data to the LLM rebuilds and reprocesses the entire context history every single turn. This process of generating output from input combined with the model's training is called &lt;em&gt;inference&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Here's what actually happens: You send "hey there" to ChatGPT. Your buddy ChatGPT replies with a friendly response. When you send your second message, the model doesn't just process that new message: &lt;strong&gt;it processes your first message, its first response, AND your second message all at once&lt;/strong&gt;. This context grows with each exchange.&lt;/p&gt;

&lt;p&gt;The model treats this entire conversation thread as one giant input until the window reaches its limit and forces older turns to drop. That's why responses can change tone or forget earlier details as conversations grow longer.&lt;/p&gt;
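
&lt;p&gt;Seen as code, that statelessness is just a list that gets resent every turn. The &lt;code&gt;llm()&lt;/code&gt; call below is an illustrative stand-in for any chat-completion API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# No memory, only context: the full history is resent on every single turn (sketch).
history = []

def chat(user_text, llm):
    history.append({"role": "user", "content": user_text})
    reply = llm(messages=history)   # the model sees the ENTIRE history, not just this turn
    history.append({"role": "assistant", "content": reply})
    return reply

# chat("hey there", llm) ... chat("what did I just say?", llm)
# By turn two, the model is reprocessing turn one from scratch.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;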

&lt;h2&gt;
  
  
  The Context Window Paradox
&lt;/h2&gt;

&lt;p&gt;Context window sizes have grown dramatically. A few years ago, models handled only a few thousand words (8,192 or 16,384 &lt;a href="https://blog.knitli.com/understanding-tokens-what-they-are-and-why-theyre-important" rel="noopener noreferrer"&gt;tokens&lt;/a&gt;). Today's top models can process 128,000 to 2 million tokens worth of information.&lt;/p&gt;

&lt;p&gt;Bigger windows allow more context, but they create new problems. Fill a window with irrelevant information, and you're giving the model junk data that makes accurate responses harder. Processing large contexts also takes more time and costs more money.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This creates a paradox: any information you exclude might be crucial, but including too much information can poison the model's ability to respond accurately.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Context Poisoning Problem
&lt;/h3&gt;

&lt;p&gt;Most current tools don't handle context well. For coding tasks, many systems add everything that might be relevant into the model's context without careful selection. They might include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Entire codebases when only a few functions are needed&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Outdated documentation along with current specs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Error logs mixed with successful runs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple conflicting examples&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This adds more confusion than clarity, making it difficult for the model to find useful information among irrelevant details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for You
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Understanding context windows helps explain common AI frustrations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Why an AI assistant "forgets" something you mentioned earlier in a long conversation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Why providing too much background information sometimes makes responses worse&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Why the same prompt can give different results depending on what else is in the context (since models are probabilistic, meaning they create output based on statistical likelihood, even the exact same context can lead to different results).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Why AI coding tools sometimes suggest outdated approaches despite having access to current documentation&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Working Around the Limitations
&lt;/h2&gt;

&lt;p&gt;Engineers use several techniques to manage context effectively, like prioritizing recent information, summarizing older exchanges, and filtering out irrelevant details. But these approaches have their own trade-offs and limitations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bottom line: "context is king" for LLMs.&lt;/strong&gt; Feeding the right amount of the right information in the right order matters more than raw context window size. &lt;strong&gt;This makes context management the central engineering challenge for anyone building with LLMs.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Our next post in this series will explore current solutions to these context problems including their strengths, weaknesses, and why even the best approaches today aren't quite "good enough" for complex, long-running tasks.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Visit us at &lt;a href="https://knitli.com" rel="noopener noreferrer"&gt;knitli.com&lt;/a&gt; to learn how we're fixing the context problem, and sign up for our waitlist!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>deeplearning</category>
      <category>learning</category>
    </item>
    <item>
      <title>Understanding Tokens: What They Are and Why They're Important</title>
      <dc:creator>Adam Poulemanos</dc:creator>
      <pubDate>Fri, 26 Sep 2025 02:08:14 +0000</pubDate>
      <link>https://forem.com/knitli/understanding-tokens-what-they-are-and-why-theyre-important-876</link>
      <guid>https://forem.com/knitli/understanding-tokens-what-they-are-and-why-theyre-important-876</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frv5ruwrttr6nayzzcy3f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frv5ruwrttr6nayzzcy3f.png" alt="a graphic showing how tokens are processed" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Part 1 of &lt;a href="https://knitli.com" rel="noopener noreferrer"&gt;Knitli's&lt;/a&gt; 101 introductions to AI and the economics of AI&lt;/p&gt;




&lt;h2&gt;
  
  
  Tokens are &lt;em&gt;Parts&lt;/em&gt; of Words
&lt;/h2&gt;

&lt;p&gt;Most people think AI, like ChatGPT, reads words. It doesn't.&lt;/p&gt;

&lt;p&gt;It reads &lt;strong&gt;tokens&lt;/strong&gt;: invisible chunks of text that power every interaction.&lt;/p&gt;

&lt;p&gt;When you type something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hello, world!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model doesn't see two words. It sees four tokens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;"Hello" → 1 token&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;"," → 1 token&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;" world" (note the leading space) → 1 token&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;"!" → 1 token&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That simple greeting is &lt;strong&gt;4 tokens&lt;/strong&gt;, not &lt;strong&gt;2 words&lt;/strong&gt;. Code fragments break into even more tokens because punctuation, brackets, and symbols all get split up. (Exactly what becomes a token depends on the model's tokenizer, so our example isn't exact.)&lt;/p&gt;
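
&lt;p&gt;If you want to check counts yourself, the &lt;code&gt;tiktoken&lt;/code&gt; library exposes the tokenizer used by several OpenAI models. Other models split text differently, so treat this as one example tokenizer, not a universal rule:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Count tokens with tiktoken (pip install tiktoken); cl100k_base is one common encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Hello, world!")
print(len(ids))                            # 4
print([enc.decode([i]) for i in ids])      # ['Hello', ',', ' world', '!']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;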

&lt;h2&gt;
  
  
  Tokens Aren't Expensive. Processing Them Is.
&lt;/h2&gt;

&lt;p&gt;When you send your tokens to get processed, &lt;em&gt;each one&lt;/em&gt; must be run through &lt;strong&gt;billions of math operations on very expensive GPUs&lt;/strong&gt; every single time. That's where the cost comes from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Power-hungry hardware&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data center space&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cooling&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Staff to maintain and secure it&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More tokens → more GPU time → higher costs.&lt;/p&gt;

&lt;p&gt;Fewer tokens → less GPU time → lower costs.&lt;/p&gt;

&lt;p&gt;Right now, you probably don't see the meter running. You pay a flat subscription; someone else covers the token bill.&lt;/p&gt;

&lt;p&gt;Under the hood, &lt;strong&gt;tokens are the biggest driver of compute costs at every AI company&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tokens are the Foundation for Everything Else
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context windows&lt;/strong&gt;, or how much a model can see at one time, are measured in tokens.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;API pricing&lt;/strong&gt; is per million tokens (API access is when companies or developers access an AI model to provide their own service, like a chatbot on a website, or just for internal use); see the cost sketch after this list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt;, efficiency, and much of prompt engineering are all about how tokens are used.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
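
&lt;p&gt;To see why this adds up, here's some back-of-the-envelope arithmetic. The price below is a made-up example rate; real prices vary widely by provider and model:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical pricing math: re-sending a large context every turn gets expensive.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00      # dollars; illustrative, not a real rate

def input_cost(tokens):
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

# A 50,000-token context resent on each of 40 turns in one session:
total_tokens = 50_000 * 40                 # 2,000,000 input tokens
print(round(input_cost(total_tokens), 2))  # 6.0 (dollars) for that single session
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;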

&lt;p&gt;&lt;strong&gt;If you want to understand AI, how it really works, or why it sometimes &lt;em&gt;costs so much&lt;/em&gt;, you have to start with tokens.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Learn more about how &lt;strong&gt;Knitli&lt;/strong&gt; is tackling the hidden economics of AI at the source: visit us at &lt;a href="http://knitli.com" rel="noopener noreferrer"&gt;&lt;strong&gt;knitli.com&lt;/strong&gt;&lt;/a&gt; and subscribe to our waitlist for updates!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>tutorial</category>
      <category>openai</category>
    </item>
  </channel>
</rss>
