Forem: Serhii Panchyshyn

Why Your AI Agent Costs 7x What It Should

Serhii Panchyshyn — Tue, 21 Apr 2026 17:38:28 +0000

Most AI agents are loops. Call the model. Read the response. Run a tool. Feed the result back. Call again.

That loop is also a billing loop. Every iteration re-sends the entire prompt. And unless you've thought about it carefully, every iteration is paying full price for tokens the provider already saw three calls ago.

I learned this the hard way. I had an agent that called the model five to seven times per user interaction. Screenshots were involved. A single medium-resolution image runs about two thousand tokens. Multiply that by seven passes and you're billing fourteen thousand tokens for an image the model only needed to see once.

The fix took about thirty minutes. It cut my input costs by roughly 80%.

The problem is the loop, not the model

When people optimize LLM costs, they usually start with the model. Can I use a smaller one? Can I cut the system prompt? Can I reduce the context window?

Those are fine. But they're linear improvements. You shave off 20% here, 30% there.

The loop is a multiplier. If your agent runs five iterations and you're re-billing the same stable content on every pass, you're not overpaying by a percentage. You're overpaying by a multiple. Five iterations means 5x. Seven means 7x. That's the gap.

What's actually happening under the hood

Every major LLM provider now offers prompt caching. You pay full price the first time you send a prompt. On subsequent calls that share the same beginning, you pay a fraction of the input cost for the cached portion. OpenAI gives up to a 90% discount on cached tokens. Anthropic's cache reads run about 10% of normal input cost.

The key word is "beginning." These caches are prefix caches. They match your prompt from byte zero forward. The moment the bytes diverge from what's stored, the match ends. Everything after that point is a miss.

Call 1:  [A][B][C][D][E]   → writes prefix to cache
Call 2:  [A][B][C][D][F]   → cache hit through [D], full price for [F]
Call 3:  [A][X][C][D][E]   → cache MISS at position 1, full price for everything

In call 3, the tokens [C], [D], and [E] are identical to call 1. Doesn't matter. The chain broke at position 1. The cache is left-anchored and unforgiving.

This isn't theoretical. It's happening in production right now.

I was looking at LightRAG, a popular open-source RAG framework. Their entity extraction pipeline embeds variable content directly inside the system prompt:

System prompt:
  ---Role---         (static, ~100 tokens)
  ---Instructions--- (static, ~400 tokens)
  ---Examples---     (static, ~800 tokens)
  ---Input Text---
  {input_text}       ← CHANGES FOR EVERY CHUNK

Every chunk produces a completely different system prompt string. There's no shared prefix across chunks because the variable content is baked into the same message as the static instructions. Nothing gets cached.

For a typical indexing run of 8,000 chunks, that's roughly 11.6 million prompt tokens all counted as new. If the static prefix (~1,300 tokens) were separated from the variable input, roughly 10.4 million of those tokens would hit the cache. That's a 45% cost reduction just from moving one variable out of the system message and into the user message.

The fix is three lines of code. Split the template. Put static content in the system message. Put {input_text} in the user message. Done.

System message (cached):
  ---Role---
  ---Instructions---
  ---Examples---
  ---Entity Types---

User message (variable):
  ---Input Text---
  {input_text}

This pattern shows up everywhere. If you're building any pipeline that processes documents in chunks, your prompt is probably structured like LightRAG's. And you're probably paying for it.

The layout that actually works

Once you understand prefix caching, prompt layout stops being cosmetic and starts being economic. The shape you want is:

[  STABLE PREFIX  ][ cache breakpoint ][  GROWING TAIL  ]

Everything that stays the same between calls goes to the left. Everything that changes goes to the right. The breakpoint sits between them.

The Claude Code team at Anthropic shared their exact ordering and it's a good template for any agent:

1. Static system prompt + tool definitions  (globally cached)
2. Project-level context                    (cached within a project)
3. Session context                          (cached within a session)
4. Conversation messages                    (the growing tail)

Each layer is stable relative to the layer below it. System prompts change less than project context. Project context changes less than session context. Session context changes less than conversation messages. The cache hits cascade.

The counterintuitive part

In a single-shot call, the user's message naturally goes at the end. That's correct. But in a loop, the user's message is not the tail. The loop's output is.

Think about it. The user's message doesn't change between iterations. It's the same question on pass one as it is on pass five. What changes is the assistant's responses and tool results that accumulate with each iteration.

So the user's content belongs in the prefix:

[ system prompt ][ user message ][ breakpoint ][ loop state → grows each iteration ]

This looks wrong. The user's message isn't at the end. But the cache doesn't care about narrative order. It cares about byte stability. The user's message is frozen across iterations. The loop output is what moves. Frozen things go left. Moving things go right.

The math on images

Text tokens are cheap enough that sloppy caching is survivable. Images are not.

Image tokens	Passes	Without caching	With caching
2,000	1	2,000	2,000
2,000	5	10,000	~2,400
2,000	7	14,000	~2,600

The cached version writes the image once and reads it at a fraction of full price on every subsequent pass. That's the difference between an agent that's economically viable and one that burns through your API budget in a week.

If your agent processes screenshots, documents, or any visual input inside a loop, this is probably the single highest-leverage optimization available to you.

The silent cache killers

Even with the right layout, caching breaks in quiet ways. Every one of these has bitten me.

Timestamps in the system prompt. "The current time is 2025-04-22 14:23:07." Changes every call. One line and your entire prefix is invalidated. The fix is to pass time updates in the next user message instead. Claude Code does exactly this. They append a <system-reminder> tag in the next turn rather than touching the system prompt.

Adding or removing tools mid-conversation. This is probably the most common mistake I see. It seems logical to only give the model tools it needs right now. But tool definitions are part of the cached prefix. Adding or removing a tool invalidates the cache for the entire conversation history.

The Claude Code team learned this the hard way. Their plan mode initially swapped out tools for read-only versions. Cache broke every time. The fix: keep all tools in the request always. Make plan mode a tool itself (EnterPlanMode, ExitPlanMode). The tool definitions never change. The model calls a tool to change its own behavior instead of you changing the toolset.

If you have many tools and loading all of them is expensive, send lightweight stubs with just the tool name and let the model discover full schemas through a search tool when needed. The stubs are stable. The prefix stays intact.

Switching models mid-session. Prompt caches are model-specific. If you're 100k tokens into a conversation with a large model and want to hand off an easy subtask to a smaller one, you'd have to rebuild the entire cache for the new model. That rebuild often costs more than just letting the original model answer.

If you need multi-model workflows, use subagents. The primary model prepares a focused handoff message for the secondary model. The secondary model works with a short, fresh context. Neither model's cache gets broken.

Unordered data structures. If you build context from a set or unordered dict, iteration order can drift between calls. Sort before serializing.

Whitespace drift. One version of your template has a trailing newline, another doesn't. The bytes don't match.

In-place edits to history. The moment you mutate a past message, every byte after it shifts. Your cache for that whole conversation is gone.

The unifying principle: content that looks identical to a human is not necessarily byte-identical to a hash function. The cache only speaks bytes.

How to verify it's working

Don't trust your layout. Measure it.

Every major provider returns cache metrics in the response. OpenAI includes cached_tokens in usage.prompt_tokens_details. Anthropic returns cache_creation_input_tokens and cache_read_input_tokens.

On the first call, cached tokens should be zero. On every subsequent call, they should climb to match your stable prefix length. If they don't, your prefix isn't stable.

The best debugging step: dump the raw prompt from two consecutive calls and diff them. You'll find the drift immediately.

A habit that's saved me hours: write a test that runs your prompt builder twice with equivalent inputs and asserts the first N bytes are byte-equal. Humans can't eyeball byte stability. Hash functions can.

And if caching is a meaningful part of your cost structure, monitor it like you'd monitor uptime. The Claude Code team runs alerts on their cache hit rate and treats drops as incidents. A few percentage points of cache miss can dramatically change unit economics. It deserves a dashboard, not a gut check.

The reframe

I used to think of prompts as messages. Now I think of them as data structures with cache semantics. Some regions are stable. Some regions grow. The breakpoint is the contract between them.

Every piece of content gets the same triage: does this change between calls? If yes, it goes to the tail. If no, it goes to the prefix. If it's expensive and it belongs to the user's turn, I figure out how to keep it in the prefix anyway. Even if it means putting things somewhere that looks weird.

This reframe changes how I design features. Instead of asking "what tools does the model need right now?" I ask "how do I model this state change without breaking the prefix?" Instead of editing the system prompt to update context, I pass updates through messages. Instead of switching to a cheaper model mid-conversation, I fork a subagent with a clean context.

The model is the expensive part of your system. The shape of what you send it is the part you actually control.

I help engineering teams ship AI features that work in production, not just in demos. If your agents are burning through API budgets or your LLM infrastructure needs a cost audit, let's talk.

The One Mindset Shift That Separates People Who Use AI From People Who Get Left Behind

Serhii Panchyshyn — Sat, 18 Apr 2026 00:11:43 +0000

You take out the garbage every day.

You've done it for years. Maybe decades. It's just a thing you do. Part of the routine. You grab the bag, walk outside, toss it in the bin. Done. Never think about it twice.

But what if you stopped for 10 seconds and asked: "Does it have to be this way?"

What if the garbage could take itself out?

That sounds ridiculous. And that's exactly the point. Because most people never ask the ridiculous question. They never get curious enough to wonder if the thing they've always done could be done differently. Or not done at all.

And right now, in 2026, that lack of curiosity is the single biggest thing holding people back.

The Curiosity Gap Is the New Skills Gap

Traditional businesses sitting on decades of manual processes. The pattern I see over and over is not a technology gap. It's a curiosity gap.

The tools are here. AI can write, analyze, build, automate, reason. It gets better every month. But most people interact with these tools the way they interact with their garbage routine. They accept the default. They don't question the process. They don't get curious about what's underneath.

Adam Grant, organizational psychologist at Wharton, studied what he calls "originals." People who drive creativity and change. His research found something surprising. The biggest difference between originals and everyone else wasn't talent or intelligence. It was that originals were more afraid of not trying than of failing. They generated massive volumes of ideas. Most were bad. But the volume itself created the conditions for breakthroughs.

That's curiosity in action. Not passive wondering. Active experimentation.

Why Your Brain Fights Curiosity

Your brain is actually wired to avoid curiosity.

Psychiatrist Judson Brewer at Brown University has spent over 20 years studying how the brain forms habits. His research shows that our brains run on a reward-based learning loop. Trigger, behavior, reward. See garbage bag full. Pick it up. Take it out. Feel good that the task is done. Loop complete. Brain moves on.

The problem is that this same loop applies to how we think. We encounter a problem. We reach for the familiar solution. We get the small reward of "done." And we never question whether the problem itself was the right one to solve.

Brewer's key insight is that curiosity is actually more powerful than willpower for breaking these loops. When you get genuinely curious about a habit or pattern, your brain's reward system updates. You start seeing the actual results of your default behaviors instead of running on autopilot.

This is why curiosity isn't just nice to have. It's a mechanism for rewiring how you operate.

The Garbage Test

I use something I call the Garbage Test with teams I work with. It's simple.

Pick one thing you do every single day that you've never questioned. Something so routine it's invisible. Now get curious about it. Not "how do I optimize this?" That's efficiency thinking. Instead ask:

"Why does this exist at all?"

When you ask that question about enough things, you start finding entire categories of work that shouldn't exist. Reports nobody reads. Meetings that could be async messages. Manual data entry that an API could handle. Approval workflows that exist because someone got burned once in 2017.

The garbage doesn't need a faster route to the bin. The garbage needs to stop being generated in the first place.

Curiosity Is Not a Personality Trait. It's a Practice.

People tell me they're "not the curious type." That's like saying you're not the breathing type. Curiosity is a human default. Kids ask somewhere around 300 questions a day. By adulthood, that number drops to almost nothing.

What happened? We got trained out of it. Schools rewarded correct answers over good questions. Workplaces rewarded execution over exploration. We learned that asking "why" makes you look like you don't know what you're doing.

Anne-Laure Le Cunff, who spoke at SXSW EDU 2025 on the experimental mindset, put it well. She argues that by middle school, most kids have already shifted from the excitement of discovery to the pressure of getting things right. And we carry that pressure into our careers, our businesses, our relationship with technology.

The fix is not some grand mindset overhaul. It's small experiments.

The Curiosity Protocol

This is what I've used across engagements with teams adopting AI and building new workflows.

1. Pick one friction point. Something that bothers you. Something tedious. Something you complain about but accept. Start there.

2. Shut everything down for 15 minutes. No Slack. No email. No music. Just you and the question: "What is actually happening here? What's underneath this?"

3. Get weird with it. Ask the dumb question. "What if this didn't exist?" "What if I did the opposite?" "What if a five-year-old designed this?" The value isn't in the answer. It's in breaking the default pattern your brain is stuck in.

4. Run one micro-experiment. Don't plan. Don't strategize. Don't build a deck. Just try something. One small test. See what happens. The goal isn't to succeed. The goal is to learn something you didn't know 30 minutes ago.

5. Record what you found. Not a formal report. A single sentence. "I tried X and learned Y." That's it. Stack enough of those sentences and you have a roadmap that no consultant could have built for you.

What This Looks Like With AI

Say you spend 45 minutes every morning reading through Slack messages, emails, and project updates to figure out what needs your attention. You've done this for years. It's just the morning routine.

The Garbage Test: "Why does this exist?"

Because information is scattered. Because there's no single source of truth. Because everyone communicates differently.

The curious question: "What if I didn't do this at all? What if something did it for me?"

The micro-experiment: Spend one hour building a simple AI workflow that summarizes your channels and flags what actually needs you. Not a perfect system. A prototype. A test.

Maybe it works. Maybe it doesn't. But now you've learned something about what AI can do, what your actual information bottlenecks are, and where you should focus next. That's more progress than most people make in a month of "meaning to look into AI."

The Real Competitive Advantage

Research from ISG found that curiosity is becoming one of the most critical organizational capabilities in the AI era. Not because curious people are smarter. But because curious people experiment. And experimentation is the only way to figure out how AI actually fits into your specific context.

No blog post, course, or consultant can tell you exactly how AI will transform your work. That answer only comes from getting curious enough to try things. To break things. To ask the question nobody else is asking.

Google built their innovation culture on this. They gave employees 20% of their time for self-directed projects. Not because they knew what would come out of it. But because they understood that curiosity at scale produces outcomes you can't predict or plan for.

You don't need Google's budget to do this. You need 15 minutes and one dumb question.

The Part Nobody Wants to Do

Curiosity requires something most people avoid: sitting with not knowing.

We live in an era of instant answers. Google it. Ask ChatGPT. Get the solution. Move on. But curiosity isn't about getting answers faster. It's about asking better questions. And better questions come from the discomfort of not knowing. From staying in that space long enough to see what's really there.

Stuart Firestein, a neuroscientist at Columbia, gave a TED Talk called "The Pursuit of Ignorance" where he argued that knowledge actually generates more ignorance, not less. Every answer opens new questions. The people who thrive are the ones who see that as exciting, not threatening.

That's the mindset shift. Not "I need to learn AI." But "I wonder what would happen if..."

Start Today

Don't bookmark this article and forget about it. That's the old pattern. The default loop. Instead, do this:

Before you close this tab, pick one thing in your life or work that you've never questioned. One thing that's "just how it is." Write it down. Then spend 15 minutes getting curious about it.

Not tomorrow. Right now.

The garbage is waiting. But maybe it doesn't have to be.

I help companies figure out where AI actually fits in their business. Not the hype version. The version that makes your team's daily work better. If you're sitting on processes that feel like they shouldn't exist in 2026, let's talk.

Things You're Overengineering in Your AI Agent (The LLM Already Handles Them)

Serhii Panchyshyn — Tue, 14 Apr 2026 20:15:22 +0000

I've been building AI agents in production for the past two years. Not demos. Not weekend projects. Systems that real users talk to every day and get angry at when they break.

And the pattern I keep seeing? Engineers building elaborate machinery around the model. Custom orchestration layers. Hand-rolled retry logic. Massive tool routing systems. All to solve problems the LLM was already solving if you just let it.

Here's what I'd rip out if I could go back.

1. Custom Tool Selection Logic

You built a classifier that decides which tool the agent should use. Maybe a regex-based router. Maybe a whole separate model call just to pick the right function.

Stop.

Modern LLMs are shockingly good at tool selection when you give them well-named, well-described tools. The problem was never the model. It was your tool descriptions.

// Bad: vague tool name, model guesses wrong
{ name: "search", description: "Searches for things" }

// Good: specific name, clear scope, model nails it
{ name: "search_customer_accounts", description: "Search customer accounts by account ID, customer name, or date range. Returns subscription status, plan details, and usage history." }

The fix isn't a smarter router. It's better tool design. Name your tools like you're writing an API for a junior dev who's never seen your codebase. Be embarrassingly specific.

Tool selection metrics can look great while the final answer is still garbage. I've seen this firsthand. The agent picks the right tool 95% of the time but still gives wrong answers because the tool descriptions don't explain what the returned data actually means.

2. Prompt Chains for Multi-Step Reasoning

I used to build 4-5 step prompt chains for anything complex. Break the problem down. Feed output A into prompt B. Parse the result. Feed it into prompt C.

Turns out a single well-structured system prompt with clear instructions handles most of this natively. The model already knows how to decompose problems. You just need to tell it what your constraints are and what good output looks like.

// Instead of chaining 3 prompts:
// 1. "Classify the user intent"
// 2. "Based on intent X, gather context"  
// 3. "Now generate the answer"

// Just do this:
const systemPrompt = `You are a support agent for a SaaS platform.

When a user asks a question:
1. Identify whether they need account info, billing help, or technical support
2. Use the appropriate tool to get the data
3. Answer in plain English with the specific details they asked for

If you're unsure about intent, ask one clarifying question. Never guess.`

The chain approach also creates a hidden problem. Each step is a failure point. And debugging a 4-step chain when something breaks on step 3 is miserable. A single prompt with clear instructions is easier to observe, easier to eval, and fails more gracefully.

3. Retrieval Complexity Before Retrieval Quality

This one hurts because I've done it myself.

You spend two weeks building a hybrid retrieval pipeline. BM25 plus vector search plus re-ranking. Beautiful architecture. Looks great in a diagram.

Then you realize the actual problem is that your knowledge base documents are written in a way the model can't parse. Or your chunking strategy splits the answer across two chunks and neither one makes sense alone.

The retrieval pipeline doesn't matter if the underlying data is messy.

Before you optimize the search algorithm, ask yourself:

If I showed this chunk to a human with no context, would they understand the answer?
Are my documents written for the model or for the original author's brain?
Am I chunking at logical boundaries or just every 500 tokens?

I've seen teams where retrieval "works" but answers are still wrong because the reference data itself contains outdated or incorrect information. That's not a retrieval problem. That's a data quality problem wearing a retrieval costume.

4. Custom Guardrails That Block Legitimate Use

You built a content filter. It catches bad inputs. Great.

Then users start complaining that normal questions get blocked. Someone asks about "terminating a contract" and the guardrail flags "terminating." Someone asks about "explosive growth" in their metrics and that trips another filter.

Rule-based guardrails at scale become a whack-a-mole game you can't win.

The LLM itself is already pretty good at understanding intent and context. Instead of building regex walls around the model, build guardrails INTO the model's instructions. Tell it what topics are off-limits. Tell it what information it should never reveal. Tell it to redirect gracefully instead of stonewalling.

// Instead of: regex filter that blocks "kill", "terminate", "destroy"
// Try this in your system prompt:

`If a user asks about topics outside your domain (account management and billing),
politely redirect them. Never share internal system details, API keys, 
or other customer data. You can decline requests, but always explain why 
and suggest what you CAN help with.`

Guardrails and permissions are product design, not just safety theater. Treat them that way.

5. Agent Memory as a Separate System

You have your agent's database over here. Its memory system over there. A vector store somewhere else. And glue code holding all of it together with prayers and setTimeout.

The real question is simpler than the architecture you built: what does the agent actually need to remember between sessions?

Most agents don't need a sophisticated memory system. They need a well-structured context window. The conversation history plus a few key facts about the user. That's it. The model handles the rest.

When you DO need persistent memory, keep it close to your data. Don't build a separate memory service that has to sync with your database. Store memory where your data lives. Query it with the same tools.

The moment your agent's memory can't see its own database, you've created an integration problem disguised as a feature.

6. Sub-Agent Orchestration for Everything

Multi-agent architectures are seductive. One agent plans. One retrieves. One generates. One validates. They talk to each other through a message bus. It looks amazing on a whiteboard.

In production it's a nightmare to debug. When the answer is wrong, which agent broke? The planner? The retriever? The generator? You end up building observability tooling just to trace what happened across four agents when one would have been fine.

Start with one agent. Push it until it genuinely can't handle the complexity. Only THEN split into specialized sub-agents with clear, narrow responsibilities.

The rule I use: a sub-agent should exist only when the parent agent's context window literally can't hold the information it needs. Not because "separation of concerns" sounds good in a design doc.

Specialized agents make sense for high-context tasks where the prompt would blow up the token budget. General agents handle 80% of use cases with less operational overhead. Know which one you're building and why.

7. Evaluations That Test Happy Paths

This is the one that bites hardest.

You write 50 eval cases. The agent passes 48 of them. Ship it.

Then users find the 200 edge cases you didn't think of. The model hallucinates an account ID. It confidently answers a question it should have said "I don't know" to. It uses data from one customer to answer another customer's question.

Good evals don't test whether the agent CAN answer correctly. They test whether it WILL answer correctly under pressure.

Build evals that target failure modes:

What happens when the tool returns empty results?
What happens when two tools return conflicting information?
What happens when the user asks something slightly outside the agent's domain?
What happens when the context is ambiguous?

The eval suite is the real moat. Not the model. Not the prompts. Not the architecture. The team that can systematically find and fix failure modes ships better agents than the team with the fancier framework.

The Uncomfortable Truth

Most of the complexity in your agent isn't making it smarter. It's making it harder to debug, harder to eval, and harder to change.

The best agent architectures I've built are embarrassingly simple. One model. Clear system prompt. Well-named tools. Good data. Ruthless evals.

Everything else is either premature optimization or an expensive lesson waiting to happen.

What's the most over-engineered thing you've built into an agent that turned out to be unnecessary?

Your Agent Isn't Broken Because of the Prompt. It's Broken Because of What the Model Can See.

Serhii Panchyshyn — Mon, 13 Apr 2026 23:43:02 +0000

I've watched teams spend weeks rewriting the same system prompt.

Different phrasings. More examples. Clearer instructions. The agent still picks the wrong tool. Still hallucinates. Still feels broken.

Then they rename six functions and accuracy jumps 30%.

This pattern shows up constantly across the teams I work with. The model doesn't care how clever your prompt is. It cares about what it can see.

The problem I see everywhere

Teams treat prompts like magic spells. Say the right words, get the right output.

But agents aren't following instructions. They're making predictions based on everything in context. The tool names. The API responses. The error messages. The structure of your data.

That's perception. And it matters way more than your system prompt.

Most teams optimize the wrong layer. They iterate on prompts for weeks while their tool names are handleData and processRequest. The model has no chance.

Here are 10 patterns I've seen work across the past two years of helping teams build production agents 💪

1. Tool names are the real prompt

Bad tool names are invisible to the model.

I audit client codebases and find this constantly:

// ❌ The model has no idea what this does
async function handleRequest(data: unknown) { }

// ✅ Now it knows exactly when to use this
async function createInvoiceFromQuote(quoteId: string) { }

I've seen agents with 30, 40, even 50+ tools defined Half had names like processData or executeAction. The model was guessing.

We renamed a handful of functions. Tool selection accuracy went from 60% to 87%. No prompt changes.

2. Tool descriptions matter more than you think

The model reads descriptions to decide which tool to pick.

I tell clients: write descriptions like you're onboarding a new developer. Because you are.

// ❌ Vague description
const tool = {
  name: "searchRecords",
  description: "Search for records in the system"
}

// ✅ Specific description with constraints
const tool = {
  name: "searchSupportTickets",
  description: "Search support tickets by ticket ID, customer email, priority, or date range. Returns max 50 results. Use filters to narrow results before searching."
}

Specific descriptions reduce wrong tool selection by 30-40% in my experience.

3. Passing everything into context is lazy

I've reviewed architectures where teams dump entire conversation histories into context. 20 turns. 50 tool results. Everything.

The model drowns.

What works:

Last 3 turns by default
Relevant retrieved docs only
Structured summaries instead of raw data

Less context. Better decisions. Faster responses.

One team cut their context by 60% and saw answer quality improve. Counter-intuitive until you realize the model was distracted by noise.

4. Scoped retrieval beats broad retrieval

Early RAG implementations pull from everywhere. The whole knowledge base. 200+ docs. The model has no idea which ones matter.

I push clients toward module-level filtering. If someone asks about billing, only retrieve billing docs.

// ❌ Retrieve from everything
const docs = await retriever.search(query);

// ✅ Scope to relevant module
const docs = await retriever.search(query, { 
  module: detectModule(query),
  maxResults: 5 
});

Recall goes up. Hallucinations go down. Should be the default from day one.

5. Structured outputs prevent downstream chaos

If another agent or system consumes your output, structure it.

// ❌ Free text response
"I found 3 tickets that match. The first one is #12345 from a customer in Chicago..."

// ✅ Structured response
"tickets": [
{ "id": "12345", "customer": "Acme Corp", "status": "escalated" },
{ "id": "12346", "customer": "Globex Inc", "status": "resolved" }
]

Unstructured responses compound errors. Each downstream consumer has to parse and guess. I've seen entire pipelines break because one agent returned prose instead of JSON.

6. Silent failures are invisible failures

The model can't fix what it can't see.

I audit error handling in every client codebase. Same pattern:

// ❌ Silent failure
if (!hasPermission) {
  return null
}

// ✅ Loud failure
if (!hasPermission) {
  return {
    error: "PERMISSION_DENIED",
    message: "User lacks 'tickets.create' permission",
    requiredPermission: "tickets.create",
    suggestedAction: "Request access from workspace admin"
  }
}

Explicit errors let the agent reason about what went wrong. And let you debug faster.

7. Real system state beats assumed state

I've seen agents confidently tell users something was done when it wasn't. Ticket resolved. Payment processed. Account updated. The agent assumed based on patterns instead of checking the actual record.

This happens when teams don't pass real state:

// ❌ Agent has to guess
const context = {
  id: "12345",
}

// ✅ Agent knows the truth
const context = {
 ticket: {
    id: "12345",
    status: "open",
    lastUpdate: "2024-01-15T10:30:00Z",
    assignedAgent: "Sarah K."
  }
}

Agents will make up state if you don't give them real state. Always.

8. Specialized agents beat one generalist

I've seen teams try to build one agent that handles everything. Customer questions. Data entry. Workflow automation. Reports.

It's mediocre at all of them.

The pattern that works:

One agent for customer Q&A using the knowledge base
One agent for data operations with strict schemas
One agent for document parsing with specialized prompts

Each one is easier to eval. Easier to constrain. Easier to improve.

Generalist agents are harder to debug and harder to trust. I push clients toward decomposition early.

9. Guardrails should block bad things, not useful things

"Can you help me set up a webhook?" → BLOCKED (mentions code execution)

"What's the API endpoint for exports?" → BLOCKED (mentions API)

Users stop trusting the product. Not because the AI is bad. Because the guardrails are dumb.

The users stopped trusting the product. Not because the AI was bad. Because the guardrails were dumb.

Narrow guardrails work better. Be specific about what's actually dangerous. Allow everything else.

10. Audit perception before rewriting prompts

When a client tells me their agent is underperforming, I ask these questions first:

Can it see the right tools? Are names and descriptions clear?
Can it see the right context? Or is it drowning in noise?
Can it see real state? Or is it guessing?
Can it see errors? Or do failures happen silently?

Nine times out of ten, the problem is perception. Not the prompt.

The outcome when you get this right

Teams that engineer perception instead of prompts:

Stop the endless prompt iteration cycle
Get measurable accuracy improvements in days, not months
Build agents that actually work in production
Have clear debugging paths when things break

The teams that keep tweaking prompts stay stuck. I've seen it enough times to know.

The mental model shift

Prompt engineering asks: "How do I word this better?"

Perception engineering asks: "What does the agent need to see to make a good decision?"

One has diminishing returns after a few iterations.

The other compounds as your system improves.

Stop rewriting prompts. Start auditing what your agent can perceive.

Rename tools for clarity
Scope your context
Pass real state
Make errors loud
Use specialized agents

Your agent is only as good as what it can see 👀

If you're building agents and want a second set of eyes on your architecture, I help teams get this right. DM me on X or LinkedIn.

No Evals, No Idea. How 40% of RAG Answers Go Wrong.

Serhii Panchyshyn — Mon, 13 Apr 2026 20:58:06 +0000

When I started building production RAG systems, I noticed something: nobody was measuring retrieval quality.

Teams would ship a system, ask users if it "felt good," and move on. No metrics. No baseline. No way to know if changes actually helped.

So I started measuring everything. And the first thing I discovered: most RAG failures aren't LLM failures. They're retrieval failures.

The documents that could answer the question aren't making it into the context window. The LLM is being asked to answer questions without the information it needs. No wonder it hallucinates.

Here's what I've learned about measuring and fixing RAG systems across dozens of client engagements.

The metric that actually matters: Recall@k

Before I measure anything else on a new RAG system, I measure Recall@k.

Recall@k answers a simple question: "Of all the documents that should have been retrieved, what percentage actually made it into the top k results?"

def recall_at_k(retrieved_ids: list, relevant_ids: list, k: int) -> float:
    """What % of relevant docs are in the top k results?"""
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)

    if not relevant:
        return 1.0

    return len(top_k & relevant) / len(relevant)

On systems I've audited, Recall@10 is often around 60%. That means 40% of the time, the document that could answer the question isn't even in the context. The LLM never had a chance.

Here's the math that drives everything:

P(correct answer) ≈ P(correct context retrieved)

If the right chunks aren't retrieved, the LLM can't answer correctly. This is why I always measure retrieval separately from answer quality. Otherwise you're debugging the wrong layer.

You can start measuring today

You don't need production traffic to build evals. Generate synthetic test data from your corpus:

def generate_synthetic_evals(chunks: list) -> list:
    """Generate question-answer pairs from your chunks."""
    eval_pairs = []

    for chunk in chunks:
        response = llm.generate(f"""
Generate 3 questions that this text can answer.
Make them specific. "What is this about?" doesn't test retrieval.

Text:
{chunk.text}

Return JSON: [{{"question": "...", "chunk_id": "{chunk.id}"}}]
""")

        eval_pairs.extend(parse_json(response))

    return eval_pairs

50-100 questions is enough to establish a baseline. Run your retriever, measure Recall@10, write down the number. Now you can actually tell if changes help.

The two fixes that consistently move the needle

I've tried a lot of retrieval improvements across different client systems. Most make marginal differences. Two consistently deliver results.

Fix 1: Hybrid search

Embeddings are great at semantic similarity. "How do I reset my password?" matches "Steps to recover account access" even though they share no keywords.

But embeddings are weak on:

Numbers: They don't understand that 49 is close to 50
Exact match: Product codes, IDs, ticker symbols
Rare terms: Domain jargon not in the training data

BM25 (keyword search) catches what embeddings miss. Combine them:

def hybrid_search(query: str, k: int = 10) -> list:
    """Combine embedding search and BM25 using RRF."""

    embedding_results = embedding_index.search(query, k=20)
    bm25_results = bm25_index.search(query, k=20)

    # Reciprocal Rank Fusion
    scores = {}
    rrf_k = 60

    for rank, doc_id in enumerate(embedding_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (rrf_k + rank + 1)

    for rank, doc_id in enumerate(bm25_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (rrf_k + rank + 1)

    ranked = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
    return ranked[:k]

Typical improvement: 5-15% recall boost depending on query mix.

Fix 2: Add a reranker

Embedding models are bi-encoders. They encode query and documents separately, then compare. Fast, but imprecise.

Cross-encoders (rerankers) look at the query and document together. Slower, but much more accurate. Use them as a second pass:

def search_with_rerank(query: str, k: int = 5) -> list:
    """Retrieve broadly, then rerank precisely."""

    # Cast a wide net
    candidates = hybrid_search(query, k=20)

    # Rerank with cross-encoder
    pairs = [(query, get_content(doc_id)) for doc_id in candidates]
    scores = reranker.score(pairs)

    # Return top k after reranking
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, score in ranked[:k]]

Typical improvement: another 5-10% on top of hybrid search.

Combined, these two fixes often take a system from 60% to 80% recall. That's the difference between "works sometimes" and "works reliably."

Chunking decisions that make or break retrieval

Your chunking strategy matters more than your embedding model choice. A few things I always check when onboarding a new project:

The "it" problem

Chunks that start with "It also supports..." or "This feature allows..." are useless on their own. The word "it" has no meaning without the previous chunk.

Fix: Prepend context to every chunk.

def chunk_with_context(doc) -> list:
    chunks = []

    for section in doc.sections:
        # Prepend document and section info
        context = f"Document: {doc.title}\nSection: {section.header}\n\n"

        for chunk_text in split_section(section.content):
            chunks.append({
                "content": context + chunk_text,
                "metadata": {
                    "doc_title": doc.title,
                    "section": section.header
                }
            })

    return chunks

Other chunking rules I follow

Never split mid-table. A row without headers is meaningless.
10-20% overlap between consecutive chunks.
Test multiple chunk sizes (256, 512, 1024 tokens). Optimal depends on your queries.

The workflow I run on every new RAG engagement

Phase 1-2: Establish baseline

Parse documents (test multiple parsers for PDFs)
Chunk with context headers
Generate 50-100 synthetic eval questions
Build basic retriever
Measure Recall@10
Write down the number

Phase 2-4: Apply standard fixes

Add hybrid search (BM25 + embeddings)
Add reranker
Measure again
Compare to baseline

Phase 4+: Debug specific failures

Break down recall by query type
Find worst-performing segment
Fix that segment
Measure again

The key: measure after every change. If you can't see improvement in numbers, you're guessing.

When to measure answer quality

Only after retrieval is solid.

Once Recall@10 is above 80%, start measuring end-to-end:

def eval_answer(question: str, answer: str, context: list) -> dict:
    """Use LLM-as-judge for answer evaluation."""

    result = llm.generate(f"""
Evaluate this answer. Return JSON:
- correct: true/false (factually accurate)
- grounded: true/false (supported by the context)
- complete: true/false (addresses the full question)

Context: {format_context(context)}
Question: {question}
Answer: {answer}
""")

    return parse_json(result)

But if retrieval is broken, this eval is noise. You're just measuring how well your LLM fills in gaps it shouldn't have to fill.

The takeaway

RAG quality is retrieval quality.

Before you touch your prompts:

Generate synthetic evals from your corpus
Measure Recall@10
Add hybrid search
Add a reranker
Fix your chunking
Measure again

The fixes are straightforward. The impact is not.

This is Part 1 of a series on production AI systems. Next: how to know when to fix your prompts vs. build an evaluator.

About me

I help B2B SaaS companies ship production AI in 6 weeks.