<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Serhii Panchyshyn</title>
    <description>The latest articles on Forem by Serhii Panchyshyn (@serhiip).</description>
    <link>https://forem.com/serhiip</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F138013%2F5b142395-3c3d-49af-8418-515743a4e2fb.JPG</url>
      <title>Forem: Serhii Panchyshyn</title>
      <link>https://forem.com/serhiip</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/serhiip"/>
    <language>en</language>
    <item>
      <title>Things You're Overengineering in Your AI Agent (The LLM Already Handles Them)</title>
      <dc:creator>Serhii Panchyshyn</dc:creator>
      <pubDate>Tue, 14 Apr 2026 20:15:22 +0000</pubDate>
      <link>https://forem.com/serhiip/things-youre-overengineering-in-your-ai-agent-the-llm-already-handles-them-2lop</link>
      <guid>https://forem.com/serhiip/things-youre-overengineering-in-your-ai-agent-the-llm-already-handles-them-2lop</guid>
      <description>&lt;p&gt;I've been building AI agents in production for the past two years. Not demos. Not weekend projects. Systems that real users talk to every day and get angry at when they break.&lt;/p&gt;

&lt;p&gt;And the pattern I keep seeing? Engineers building elaborate machinery around the model. Custom orchestration layers. Hand-rolled retry logic. Massive tool routing systems. All to solve problems the LLM would already solve if you just let it.&lt;/p&gt;

&lt;p&gt;Here's what I'd rip out if I could go back.&lt;/p&gt;




&lt;h2&gt;1. Custom Tool Selection Logic&lt;/h2&gt;

&lt;p&gt;You built a classifier that decides which tool the agent should use. Maybe a regex-based router. Maybe a whole separate model call just to pick the right function.&lt;/p&gt;

&lt;p&gt;Stop.&lt;/p&gt;

&lt;p&gt;Modern LLMs are shockingly good at tool selection when you give them well-named, well-described tools. The problem was never the model. It was your tool descriptions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Bad: vague tool name, model guesses wrong&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;search&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Searches for things&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Good: specific name, clear scope, model nails it&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;search_customer_orders&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Search customer order history by order ID, customer name, or date range. Returns order status, items, and tracking info.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix isn't a smarter router. It's better tool design. Name your tools like you're writing an API for a junior dev who's never seen your codebase. Be embarrassingly specific.&lt;/p&gt;

&lt;p&gt;Tool selection metrics can look great while the final answer is still garbage. I've seen this firsthand. The agent picks the right tool 95% of the time but still gives wrong answers because the tool descriptions don't explain what the returned data actually means.&lt;/p&gt;
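
&lt;p&gt;Here's a sketch of what that looks like in practice. The field names and status values below are invented for illustration; the point is that the description documents what the returned data means, not just when to call the tool.&lt;/p&gt;

```typescript
// Hypothetical tool definition: the description explains not just when to
// call the tool, but what the returned fields actually mean, so the model
// can interpret the data instead of guessing.
const searchCustomerOrders = {
  name: "search_customer_orders",
  description: [
    "Search customer order history by order ID, customer name, or date range.",
    "Returns an array of orders. Field notes:",
    "- status: one of 'pending', 'shipped', 'delivered', 'returned'.",
    "- placedAt: ISO 8601 timestamp in UTC, not the customer's local time.",
    "- total: integer amount in cents, already including tax.",
  ].join("\n"),
  parameters: {
    type: "object",
    properties: {
      orderId: { type: "string", description: "Exact order ID, e.g. 'ORD-1042'" },
      customerName: { type: "string", description: "Full or partial customer name" },
    },
  },
};
```

&lt;p&gt;A model reading that description never has to guess whether &lt;code&gt;total&lt;/code&gt; is dollars or cents. That's the whole trick.&lt;/p&gt;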




&lt;h2&gt;2. Prompt Chains for Multi-Step Reasoning&lt;/h2&gt;

&lt;p&gt;I used to build 4-5 step prompt chains for anything complex. Break the problem down. Feed output A into prompt B. Parse the result. Feed it into prompt C.&lt;/p&gt;

&lt;p&gt;Turns out a single well-structured system prompt with clear instructions handles most of this natively. The model already knows how to decompose problems. You just need to tell it what your constraints are and what good output looks like.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Instead of chaining 3 prompts:&lt;/span&gt;
&lt;span class="c1"&gt;// 1. "Classify the user intent"&lt;/span&gt;
&lt;span class="c1"&gt;// 2. "Based on intent X, gather context"  &lt;/span&gt;
&lt;span class="c1"&gt;// 3. "Now generate the answer"&lt;/span&gt;

&lt;span class="c1"&gt;// Just do this:&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;systemPrompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`You are a support agent for a logistics platform.

When a user asks a question:
1. Identify whether they need order status, account help, or technical support
2. Use the appropriate tool to get the data
3. Answer in plain English with the specific details they asked for

If you're unsure about intent, ask one clarifying question. Never guess.`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The chain approach also creates a hidden problem. Each step is a failure point. And debugging a 4-step chain when something breaks on step 3 is miserable. A single prompt with clear instructions is easier to observe, easier to eval, and fails more gracefully.&lt;/p&gt;




&lt;h2&gt;3. Retrieval Complexity Before Retrieval Quality&lt;/h2&gt;

&lt;p&gt;This one hurts because I've done it myself.&lt;/p&gt;

&lt;p&gt;You spend two weeks building a hybrid retrieval pipeline. BM25 plus vector search plus re-ranking. Beautiful architecture. Looks great in a diagram.&lt;/p&gt;

&lt;p&gt;Then you realize the actual problem is that your knowledge base documents are written in a way the model can't parse. Or your chunking strategy splits the answer across two chunks and neither one makes sense alone.&lt;/p&gt;

&lt;p&gt;The retrieval pipeline doesn't matter if the underlying data is messy.&lt;/p&gt;

&lt;p&gt;Before you optimize the search algorithm, ask yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If I showed this chunk to a human with no context, would they understand the answer?&lt;/li&gt;
&lt;li&gt;Are my documents written for the model or for the original author's brain?&lt;/li&gt;
&lt;li&gt;Am I chunking at logical boundaries or just every 500 tokens?&lt;/li&gt;
&lt;/ul&gt;
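
&lt;p&gt;For the chunking question, here's a minimal sketch of boundary-based chunking: split on headings instead of every N tokens, so each chunk carries its own context. It assumes markdown-style docs and is illustrative, not a production splitter.&lt;/p&gt;

```typescript
// Illustrative only: split markdown-ish docs on "## " headings (logical
// boundaries) instead of a fixed token count, so each chunk stands alone.
function chunkByHeading(doc: string): string[] {
  const lines = doc.split("\n");
  const chunks: string[] = [];
  let current: string[] = [];
  for (const line of lines) {
    if (line.startsWith("## ")) {
      // A new section starts: flush the previous chunk, if any.
      if (current.length > 0) {
        chunks.push(current.join("\n").trim());
        current = [];
      }
    }
    current.push(line);
  }
  if (current.length > 0) chunks.push(current.join("\n").trim());
  return chunks;
}

const doc = "## Refunds\nRefunds take 5 days.\n## Shipping\nShipping is free.";
// Each chunk now carries its own heading, so it makes sense without the other.
console.log(chunkByHeading(doc));
```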

&lt;p&gt;I've seen teams where retrieval "works" but answers are still wrong because the reference data itself contains outdated or incorrect information. That's not a retrieval problem. That's a data quality problem wearing a retrieval costume.&lt;/p&gt;




&lt;h2&gt;4. Custom Guardrails That Block Legitimate Use&lt;/h2&gt;

&lt;p&gt;You built a content filter. It catches bad inputs. Great.&lt;/p&gt;

&lt;p&gt;Then users start complaining that normal questions get blocked. Someone asks about "terminating a contract" and the guardrail flags "terminating." Someone asks about shipping "explosive growth" and that trips another filter.&lt;/p&gt;

&lt;p&gt;Rule-based guardrails at scale become a whack-a-mole game you can't win.&lt;/p&gt;

&lt;p&gt;The LLM itself is already pretty good at understanding intent and context. Instead of building regex walls around the model, build guardrails INTO the model's instructions. Tell it what topics are off-limits. Tell it what information it should never reveal. Tell it to redirect gracefully instead of stonewalling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Instead of: regex filter that blocks "kill", "terminate", "destroy"&lt;/span&gt;
&lt;span class="c1"&gt;// Try this in your system prompt:&lt;/span&gt;

&lt;span class="s2"&gt;`If a user asks about topics outside your domain (logistics and order management), 
politely redirect them. Never share internal system details, API keys, 
or other customer data. You can decline requests, but always explain why 
and suggest what you CAN help with.`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Guardrails and permissions are product design, not just safety theater. Treat them that way.&lt;/p&gt;




&lt;h2&gt;5. Agent Memory as a Separate System&lt;/h2&gt;

&lt;p&gt;You have your agent's database over here. Its memory system over there. A vector store somewhere else. And glue code holding all of it together with prayers and &lt;code&gt;setTimeout&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The real question is simpler than the architecture you built: what does the agent actually need to remember between sessions?&lt;/p&gt;

&lt;p&gt;Most agents don't need a sophisticated memory system. They need a well-structured context window. The conversation history plus a few key facts about the user. That's it. The model handles the rest.&lt;/p&gt;
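
&lt;p&gt;A minimal sketch of that idea, assuming a hypothetical &lt;code&gt;UserFacts&lt;/code&gt; shape. No memory service, no sync jobs. Just assemble the context per request:&lt;/p&gt;

```typescript
// A sketch of "memory" as a well-structured context window: recent
// conversation plus a few key user facts, assembled fresh per request.
// The shapes here (Message, UserFacts) are illustrative, not a real API.
interface Message { role: "user" | "assistant"; content: string; }
interface UserFacts { name: string; plan: string; timezone: string; }

function buildContext(history: Message[], facts: UserFacts, maxMessages = 20): string {
  // Keep only recent turns; old ones age out instead of being "managed".
  const recent = history.slice(-maxMessages);
  const factsBlock = `User: ${facts.name} | Plan: ${facts.plan} | Timezone: ${facts.timezone}`;
  const turns = recent.map((m) => `${m.role}: ${m.content}`).join("\n");
  return `${factsBlock}\n---\n${turns}`;
}
```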

&lt;p&gt;When you DO need persistent memory, keep it close to your data. Don't build a separate memory service that has to sync with your database. Store memory where your data lives. Query it with the same tools.&lt;/p&gt;

&lt;p&gt;The moment your agent's memory can't see its own database, you've created an integration problem disguised as a feature.&lt;/p&gt;




&lt;h2&gt;6. Sub-Agent Orchestration for Everything&lt;/h2&gt;

&lt;p&gt;Multi-agent architectures are seductive. One agent plans. One retrieves. One generates. One validates. They talk to each other through a message bus. It looks amazing on a whiteboard.&lt;/p&gt;

&lt;p&gt;In production it's a nightmare to debug. When the answer is wrong, which agent broke? The planner? The retriever? The generator? You end up building observability tooling just to trace what happened across four agents when one would have been fine.&lt;/p&gt;

&lt;p&gt;Start with one agent. Push it until it genuinely can't handle the complexity. Only THEN split into specialized sub-agents with clear, narrow responsibilities.&lt;/p&gt;

&lt;p&gt;The rule I use: a sub-agent should exist only when the parent agent's context window literally can't hold the information it needs. Not because "separation of concerns" sounds good in a design doc.&lt;/p&gt;
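
&lt;p&gt;You can even make that rule mechanical. The token estimate and the budget below are rough assumptions, not measured numbers, but a guard like this forces the conversation before anyone spawns a sub-agent:&lt;/p&gt;

```typescript
// Rough sketch of the rule above: split into sub-agents only when the
// parent's context genuinely can't fit the task. The budget and the
// 4-chars-per-token estimate are assumptions, not measured values.
const CONTEXT_BUDGET_TOKENS = 128_000;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4); // crude heuristic, fine for a guard
}

function needsSubAgent(systemPrompt: string, toolDocs: string, taskContext: string): boolean {
  const total =
    estimateTokens(systemPrompt) + estimateTokens(toolDocs) + estimateTokens(taskContext);
  // Leave headroom for the model's answer; consider splitting past ~80% of budget.
  return total > CONTEXT_BUDGET_TOKENS * 0.8;
}
```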

&lt;p&gt;Specialized agents make sense for high-context tasks where the prompt would blow up the token budget. General agents handle 80% of use cases with less operational overhead. Know which one you're building and why.&lt;/p&gt;




&lt;h2&gt;7. Evaluations That Test Happy Paths&lt;/h2&gt;

&lt;p&gt;This is the one that bites hardest.&lt;/p&gt;

&lt;p&gt;You write 50 eval cases. The agent passes 48 of them. Ship it.&lt;/p&gt;

&lt;p&gt;Then users find the 200 edge cases you didn't think of. The model hallucinates a tracking number. It confidently answers a question it should have said "I don't know" to. It uses data from one customer to answer another customer's question.&lt;/p&gt;

&lt;p&gt;Good evals don't test whether the agent CAN answer correctly. They test whether it WILL answer correctly under pressure.&lt;/p&gt;

&lt;p&gt;Build evals that target failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What happens when the tool returns empty results?&lt;/li&gt;
&lt;li&gt;What happens when two tools return conflicting information?&lt;/li&gt;
&lt;li&gt;What happens when the user asks something slightly outside the agent's domain?&lt;/li&gt;
&lt;li&gt;What happens when the context is ambiguous?&lt;/li&gt;
&lt;/ul&gt;
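
&lt;p&gt;Here's one way to encode those failure modes as eval cases. The case shape and the string checks are illustrative, not any specific framework's API:&lt;/p&gt;

```typescript
// Sketch of evals that target failure modes rather than happy paths.
interface EvalCase {
  name: string;
  toolResult: unknown;      // what the (stubbed) tool returns
  userQuery: string;
  mustNotContain: string[]; // strings that signal a hallucinated answer
  mustContain: string[];    // strings a safe answer should include
}

const failureModeCases: EvalCase[] = [
  {
    name: "empty tool result should not produce a confident answer",
    toolResult: [],
    userQuery: "Where is order ORD-1042?",
    mustNotContain: ["delivered", "shipped"], // no data, so no status claims
    mustContain: ["couldn't find"],
  },
  {
    name: "out-of-domain question should redirect, not improvise",
    toolResult: null,
    userQuery: "Can you give me legal advice about my lease?",
    mustNotContain: ["lease terminates"],
    mustContain: ["can't help with"],
  },
];

function checkResponse(c: EvalCase, response: string): boolean {
  const lower = response.toLowerCase();
  const noBad = c.mustNotContain.every((s) => !lower.includes(s.toLowerCase()));
  if (!noBad) return false;
  return c.mustContain.every((s) => lower.includes(s.toLowerCase()));
}
```

&lt;p&gt;String matching is a blunt instrument, but even this catches the worst failure: a confident answer built on no data.&lt;/p&gt;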

&lt;p&gt;The eval suite is the real moat. Not the model. Not the prompts. Not the architecture. The team that can systematically find and fix failure modes ships better agents than the team with the fancier framework.&lt;/p&gt;




&lt;h2&gt;The Uncomfortable Truth&lt;/h2&gt;

&lt;p&gt;Most of the complexity in your agent isn't making it smarter. It's making it harder to debug, harder to eval, and harder to change.&lt;/p&gt;

&lt;p&gt;The best agent architectures I've built are embarrassingly simple. One model. Clear system prompt. Well-named tools. Good data. Ruthless evals.&lt;/p&gt;

&lt;p&gt;Everything else is either premature optimization or an expensive lesson waiting to happen.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What's the most over-engineered thing you've built into an agent that turned out to be unnecessary?&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>javascript</category>
    </item>
    <item>
      <title>How to Make a Company AI-Native (Without Building Anything)</title>
      <dc:creator>Serhii Panchyshyn</dc:creator>
      <pubDate>Tue, 14 Apr 2026 14:33:17 +0000</pubDate>
      <link>https://forem.com/serhiip/how-to-make-a-company-ai-native-without-building-anything-2dmk</link>
      <guid>https://forem.com/serhiip/how-to-make-a-company-ai-native-without-building-anything-2dmk</guid>
      <description>&lt;p&gt;I've been helping companies add AI to their products since early 2023.&lt;/p&gt;

&lt;p&gt;Two years doesn't sound like much. But in AI time, it's multiple generations. We've gone from "ChatGPT can write emails" to agents running workflows to AI systems coordinating with other AI systems.&lt;/p&gt;

&lt;p&gt;Through all of that, the failure patterns stayed the same.&lt;/p&gt;

&lt;p&gt;Most teams think becoming AI-native means building AI features. Ship a chatbot. Add a copilot. Sprinkle some RAG on the knowledge base.&lt;/p&gt;

&lt;p&gt;That's not what AI-native means.&lt;/p&gt;

&lt;p&gt;The teams that actually get there change how they operate first. How they learn. How they document. How they measure. How they build trust. The AI features come after.&lt;/p&gt;

&lt;p&gt;Here's what I've learned helping B2B SaaS teams make that shift.&lt;/p&gt;




&lt;h2&gt;AI-native is about operating around ambiguity&lt;/h2&gt;

&lt;p&gt;Here's the thing nobody tells you: AI systems are probabilistic. They're not like traditional software where the same input gives the same output every time.&lt;/p&gt;

&lt;p&gt;That breaks a lot of assumptions.&lt;/p&gt;

&lt;p&gt;Traditional software: "If we ship this feature, it will work the same way for every user."&lt;/p&gt;

&lt;p&gt;AI software: "If we ship this feature, it will probably work most of the time, and we need systems to catch when it doesn't."&lt;/p&gt;

&lt;p&gt;The teams that become AI-native redesign their workflows around this reality. They build feedback loops. They measure obsessively. They treat failures as data, not embarrassments.&lt;/p&gt;

&lt;p&gt;The teams that fail keep expecting AI to behave like traditional software. Then they're surprised when it doesn't.&lt;/p&gt;




&lt;h2&gt;Start with a maturity model, not a feature roadmap&lt;/h2&gt;

&lt;p&gt;Most companies treat AI adoption as a feature checklist. "We need a chatbot. We need RAG. We need agents."&lt;/p&gt;

&lt;p&gt;That's backwards.&lt;/p&gt;

&lt;p&gt;I use a five-stage model when I work with teams:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;What it looks like&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;No AI usage. Maybe some ChatGPT for personal stuff.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Individual experimentation. People trying tools on their own.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Workflow integration. AI embedded in daily tools and processes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Specialized AI for specific domains and jobs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;AI systems coordinating with other AI systems.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The goal isn't to jump to Stage 4. That's how you build impressive demos that break in production.&lt;/p&gt;

&lt;p&gt;The goal is to help everyone move up one stage.&lt;/p&gt;

&lt;p&gt;A support team at Stage 0 needs permission to experiment and a few quick wins. An engineering team at Stage 2 might be ready for domain-specific AI. A data team at Stage 3 might be ready for more automation.&lt;/p&gt;

&lt;p&gt;When I start with a new client, I map where each team actually is. Not where they think they are. Not where leadership wishes they were. Where they actually are today.&lt;/p&gt;

&lt;p&gt;Then we figure out what moving up one stage looks like for each group.&lt;/p&gt;




&lt;h2&gt;Make learning mandatory and social&lt;/h2&gt;

&lt;p&gt;The teams that succeed at AI adoption don't just "allow experimentation." They create structured learning environments.&lt;/p&gt;

&lt;p&gt;One pattern I've seen work: dedicated AI learning weeks. No customer calls. No side meetings. Everyone learns together.&lt;/p&gt;

&lt;p&gt;The details that make it work:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Everyone teaches something.&lt;/strong&gt; Not just a few experts presenting to passive audiences. Each person finds one thing they've figured out and shares it with the team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sessions are leveled.&lt;/strong&gt; Mark each session by audience level and prerequisites. A Stage 1 person shouldn't sit through an advanced automation deep-dive. They'll tune out and feel behind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context packs are provided.&lt;/strong&gt; Don't just demo tools. Give people the actual prompts, templates, and access they need to use the tools during and after the session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's not optional.&lt;/strong&gt; The forcing function matters. "Feel free to experiment" produces nothing. "We're all doing this together for a week" produces adoption.&lt;/p&gt;

&lt;p&gt;I've seen teams go from 20% AI usage to 80%+ in a single quarter using this approach. The structure matters more than the specific tools.&lt;/p&gt;




&lt;h2&gt;Trust is an eval problem, not a hype problem&lt;/h2&gt;

&lt;p&gt;Most teams try to build AI trust through enthusiasm. "Look what it can do! It wrote a whole email!"&lt;/p&gt;

&lt;p&gt;That backfires fast.&lt;/p&gt;

&lt;p&gt;Someone uses it. Hits an obvious failure. Loses trust. Stops using it. Tells their team it doesn't work.&lt;/p&gt;

&lt;p&gt;The teams that build real trust treat it differently. They treat AI like a probabilistic system that must be measured, not believed in.&lt;/p&gt;

&lt;p&gt;What this looks like in practice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measure things separately.&lt;/strong&gt; Don't aggregate everything into one "accuracy" number. Did it find the right information? Did it use that information correctly? Did the user actually accept the output? Each question has a different answer and a different fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inspect failures, not just wins.&lt;/strong&gt; When something breaks, look at what actually happened. What did the AI see? What did it do? Why? Share these learnings openly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Categorize failure modes.&lt;/strong&gt; Not all failures are the same. Missing information. Wrong information. Right information used incorrectly. Each has a different root cause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Share failures openly.&lt;/strong&gt; "It made something up because the information wasn't documented anywhere" builds more trust than hiding the failure. People understand that AI isn't magic once you show them the mechanics.&lt;/p&gt;

&lt;p&gt;Trust increases through validation, verification, and visibility. Not hype.&lt;/p&gt;




&lt;h2&gt;Documentation becomes infrastructure&lt;/h2&gt;

&lt;p&gt;In most companies, documentation is something you write once and forget.&lt;/p&gt;

&lt;p&gt;In AI-native companies, documentation is working infrastructure. It's not optional. It's a prerequisite for useful AI.&lt;/p&gt;

&lt;p&gt;Here's why: when an AI assistant can't answer a question, the root cause is usually that the information isn't documented anywhere. Not that the AI is bad. The knowledge simply doesn't exist in a retrievable form.&lt;/p&gt;

&lt;p&gt;I've seen systems where 30-40% of AI failures traced back to missing documentation. In one case, three concepts simply weren't written down. Three docs got written. Failures dropped.&lt;/p&gt;

&lt;p&gt;This changes how you think about it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing docs are bugs.&lt;/strong&gt; When AI fails because something isn't documented, treat it like any other bug. File it. Fix it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use AI failures as a feedback loop.&lt;/strong&gt; Every failure surfaces a gap. Every fix improves the docs. The AI becomes a quality check on your documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation is no longer just for humans.&lt;/strong&gt; You're writing for people and for AI systems. That changes what "good documentation" means.&lt;/p&gt;

&lt;p&gt;The best teams I've worked with have a direct pipeline from AI failure analysis to documentation improvements. It's not a separate initiative. It's the same workflow.&lt;/p&gt;




&lt;h2&gt;Internal adoption comes before external features&lt;/h2&gt;

&lt;p&gt;I've seen teams ship AI features to customers in week one. Trust gets destroyed. The project gets shelved. Starting over is harder than starting slow.&lt;/p&gt;

&lt;p&gt;The pattern that works:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with 3 internal users.&lt;/strong&gt; Not 30. Three people who will use it for real work and tell you when it breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review what's actually happening.&lt;/strong&gt; For the first week, look at every interaction. What worked? What didn't? Why?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build your understanding from real failures.&lt;/strong&gt; Every failure teaches you something. The failures from week one become the fixes for week two.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expand slowly.&lt;/strong&gt; 3 users → 10 users → one team → all internal teams → external customers.&lt;/p&gt;

&lt;p&gt;Internal users are forgiving. They'll tell you what's broken. They'll help you fix it. External users will just churn.&lt;/p&gt;

&lt;p&gt;The companies that become AI-native make their employees power users first. Then they build for customers.&lt;/p&gt;




&lt;h2&gt;Redesign rituals, not just tools&lt;/h2&gt;

&lt;p&gt;AI-native is not something you bolt onto existing workflows. It changes how people learn, plan, review, and improve.&lt;/p&gt;

&lt;p&gt;Rituals I've seen work:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quarterly AI learning weeks.&lt;/strong&gt; Blocked time. Mandatory attendance. Everyone teaches something.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weekly failure reviews.&lt;/strong&gt; What broke? Why? What did we learn? No blame, just data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation sprints.&lt;/strong&gt; Dedicated time to fill gaps surfaced by AI failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continuous improvement loops.&lt;/strong&gt; AI quality isn't a launch milestone. It's an ongoing process that gets better every week.&lt;/p&gt;

&lt;p&gt;The teams that succeed treat AI adoption as an operating system change. Not a feature installation.&lt;/p&gt;




&lt;h2&gt;What this looks like when it works&lt;/h2&gt;

&lt;p&gt;Teams that follow this playbook:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have 80%+ AI tool adoption across functions&lt;/li&gt;
&lt;li&gt;Catch problems before customers do&lt;/li&gt;
&lt;li&gt;Improve AI quality continuously through feedback loops&lt;/li&gt;
&lt;li&gt;Build trust through measurement, not hype&lt;/li&gt;
&lt;li&gt;Ship AI features faster because the foundation is solid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The teams that skip straight to building features stay stuck. They launch impressive demos that don't survive real usage.&lt;/p&gt;




&lt;h2&gt;The pattern&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Map where each team actually is on the maturity model&lt;/li&gt;
&lt;li&gt;Help everyone move up one stage&lt;/li&gt;
&lt;li&gt;Make AI learning mandatory, structured, and social&lt;/li&gt;
&lt;li&gt;Build trust through measurement, not enthusiasm&lt;/li&gt;
&lt;li&gt;Treat documentation as working infrastructure&lt;/li&gt;
&lt;li&gt;Roll out internally before shipping externally&lt;/li&gt;
&lt;li&gt;Redesign rituals to support continuous improvement&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;AI-native isn't about adding AI features. It's about changing how your team operates around ambiguity.&lt;/p&gt;

&lt;p&gt;The tools are the easy part. The behavior change is the hard part. Start there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>saas</category>
      <category>agents</category>
      <category>startup</category>
    </item>
    <item>
      <title>How to Roll Out an Internal AI Product Without Lying to Yourself</title>
      <dc:creator>Serhii Panchyshyn</dc:creator>
      <pubDate>Mon, 13 Apr 2026 23:53:55 +0000</pubDate>
      <link>https://forem.com/serhiip/how-to-roll-out-an-internal-ai-product-without-lying-to-yourself-3bl2</link>
      <guid>https://forem.com/serhiip/how-to-roll-out-an-internal-ai-product-without-lying-to-yourself-3bl2</guid>
      <description>&lt;p&gt;I've helped teams roll out AI products for the past two years.&lt;/p&gt;

&lt;p&gt;The same failure pattern shows up almost every time.&lt;/p&gt;

&lt;p&gt;They build something that demos well. Leadership gets excited. They ship it to 50 users in week one. Within two weeks, trust is destroyed and the project gets shelved 😅&lt;/p&gt;

&lt;p&gt;The teams that succeed do something different. This is the playbook I walk clients through now.&lt;/p&gt;




&lt;h2&gt;The problem I see everywhere&lt;/h2&gt;

&lt;p&gt;Most teams measure AI rollouts wrong.&lt;/p&gt;

&lt;p&gt;They track one number. "Accuracy" or "user satisfaction" or something equally vague. The number looks good. They ship broadly. Then real users hit edge cases, the agent hallucinates, and suddenly everyone thinks "AI doesn't work for us."&lt;/p&gt;

&lt;p&gt;The issue isn't the AI. The issue is they never built the infrastructure to see what was actually happening.&lt;/p&gt;

&lt;p&gt;You can't improve what you can't observe. And most teams can't observe anything.&lt;/p&gt;




&lt;h2&gt;The rollout framework that works&lt;/h2&gt;

&lt;p&gt;Here's what I advise now. Nine steps, usually 6-8 weeks before external users.&lt;/p&gt;




&lt;h2&gt;Step 1: Start with 3 users, not 30&lt;/h2&gt;

&lt;p&gt;Every team wants to move fast. "Let's get feedback from the whole department!"&lt;/p&gt;

&lt;p&gt;I push back hard on this.&lt;/p&gt;

&lt;p&gt;More users means more noise. You can't inspect every trace. You start pattern-matching on vibes instead of data.&lt;/p&gt;

&lt;p&gt;The right first cohort:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 people who actually need the tool for real work&lt;/li&gt;
&lt;li&gt;Different roles (support, ops, sales)&lt;/li&gt;
&lt;li&gt;Direct channel to the eng team&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One client started with 30 users. Couldn't keep up. Rolled back to 5. Found more bugs in one week than in the previous month.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// What I recommend tracking for each early user&lt;/span&gt;
&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;EarlyUserContext&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;           &lt;span class="c1"&gt;// "support", "ops", "sales"&lt;/span&gt;
  &lt;span class="nl"&gt;primaryUseCase&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// "answer customer questions"&lt;/span&gt;
  &lt;span class="nl"&gt;feedbackChannel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// direct line to eng team&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;Step 2: Instrument everything before anyone touches it&lt;/h2&gt;

&lt;p&gt;This is where most teams cut corners. They want to ship. Observability feels like overhead.&lt;/p&gt;

&lt;p&gt;It's not optional.&lt;/p&gt;

&lt;p&gt;Before the first user session, you need to answer these questions from your traces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What query did the user send?&lt;/li&gt;
&lt;li&gt;What tools did the agent consider?&lt;/li&gt;
&lt;li&gt;Which tool did it pick and why?&lt;/li&gt;
&lt;li&gt;What context was in the window?&lt;/li&gt;
&lt;li&gt;What was the final response?&lt;/li&gt;
&lt;li&gt;Did the user accept, edit, or reject it?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I've seen teams ship without trace logging. They have no idea why things fail. They guess. They tweak prompts randomly. Nothing improves.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Minimum viable trace structure&lt;/span&gt;
&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;AgentTrace&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;runId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;toolsConsidered&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nl"&gt;toolSelected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;contextSummary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;userFeedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;accepted&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;edited&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rejected&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;latencyMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LangSmith, Langfuse, whatever. The tool matters less than having something.&lt;/p&gt;
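&lt;p&gt;If you have nothing yet, an in-memory sink is enough on day one. A minimal sketch (the &lt;code&gt;logTrace&lt;/code&gt; sink and field values are illustrative; &lt;code&gt;Trace&lt;/code&gt; is a trimmed version of the interface above):&lt;/p&gt;

```typescript
// Sketch: collect traces in memory on day one, then swap logTrace for
// LangSmith, Langfuse, or a plain database insert.
// Trace is a trimmed version of the AgentTrace interface above.
interface Trace {
  runId: string;
  query: string;
  toolSelected: string;
  latencyMs: number;
}

const traceStore: Trace[] = [];

function logTrace(trace: Trace): void {
  traceStore.push(trace);
}

logTrace({
  runId: "abc123",
  query: "Where is order 12345?",
  toolSelected: "searchShipments",
  latencyMs: 1840,
});
```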




&lt;h2&gt;
  
  
  Step 3: Review every trace for the first week
&lt;/h2&gt;

&lt;p&gt;Yes, every single one.&lt;/p&gt;

&lt;p&gt;This is where you learn what's actually broken. Not what you assumed was broken.&lt;/p&gt;

&lt;p&gt;I sit with clients and review traces together. The same patterns show up every time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Wrong tool selection&lt;/strong&gt;: Agent picked &lt;code&gt;searchOrders&lt;/code&gt; when it should have picked &lt;code&gt;searchShipments&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing context&lt;/strong&gt;: Agent couldn't answer because the right doc wasn't retrieved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinations&lt;/strong&gt;: Agent made up data that doesn't exist&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Premature stopping&lt;/strong&gt;: Agent gave up too early&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow responses&lt;/strong&gt;: Anything over 10 seconds feels broken&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Create a simple spreadsheet. Log every failure. Categorize them.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run ID&lt;/th&gt;
&lt;th&gt;Failure Type&lt;/th&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;abc123&lt;/td&gt;
&lt;td&gt;Wrong tool&lt;/td&gt;
&lt;td&gt;Vague tool name&lt;/td&gt;
&lt;td&gt;Renamed function&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;def456&lt;/td&gt;
&lt;td&gt;Hallucination&lt;/td&gt;
&lt;td&gt;No source doc&lt;/td&gt;
&lt;td&gt;Added missing doc&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ghi789&lt;/td&gt;
&lt;td&gt;Slow response&lt;/td&gt;
&lt;td&gt;Too much context&lt;/td&gt;
&lt;td&gt;Scoped retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After one week, you'll have a clear picture. This spreadsheet becomes your roadmap.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Fix perception before prompts
&lt;/h2&gt;

&lt;p&gt;Here's the insight that saves teams weeks of wasted effort:&lt;/p&gt;

&lt;p&gt;90% of early failures come from three sources:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bad tool names and descriptions&lt;/li&gt;
&lt;li&gt;Missing or wrong context&lt;/li&gt;
&lt;li&gt;Retrieval pulling irrelevant docs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These aren't prompt problems. They're perception problems.&lt;/p&gt;

&lt;p&gt;I tell clients: the agent can only do the right thing if it can see the right things.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before: I see this constantly&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;handleData&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Handles data operations&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// After: Clear enough for the model to reason about&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;createShipmentFromOrder&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Creates a new shipment record from an existing order. Requires orderId. Returns shipmentId and tracking number.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One client renamed 12 tools in week one. Tool selection accuracy went from 60% to 87%. No prompt changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: Build evals from your failures
&lt;/h2&gt;

&lt;p&gt;Don't build generic evals. Build evals from the specific failures you observed.&lt;/p&gt;

&lt;p&gt;Every row in that failure spreadsheet becomes a test case.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example eval case from a real client failure&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;evalCase&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;shipment-status-check&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;What's the status of order 12345?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;expectedTool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;getShipmentByOrderId&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;expectedBehavior&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Return actual status from database&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;failureWeObserved&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Agent said 'delivered' without checking&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;groundTruth&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;in_transit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One team I worked with had 47 eval cases after two weeks. All from actual user sessions. All testing things that actually broke.&lt;/p&gt;

&lt;p&gt;Generic benchmarks tell you nothing. Failure-driven evals tell you everything.&lt;/p&gt;
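&lt;p&gt;The eval loop itself can stay tiny. A sketch with the agent call stubbed out so the shape is clear (a real &lt;code&gt;runAgent&lt;/code&gt; would call your model; the stub here is a placeholder):&lt;/p&gt;

```typescript
// Sketch of a failure-driven eval loop. runAgent is a hypothetical
// stand-in for your agent; it is stubbed so the loop is runnable.
interface EvalCase {
  id: string;
  query: string;
  expectedTool: string;
}

function runAgent(query: string): { toolSelected: string } {
  // Stub: a real implementation would call the model with tools attached.
  return { toolSelected: "getShipmentByOrderId" };
}

function runEvals(cases: EvalCase[]): { passed: number; failed: string[] } {
  const failed: string[] = [];
  for (const c of cases) {
    const result = runAgent(c.query);
    if (result.toolSelected !== c.expectedTool) {
      failed.push(c.id);
    }
  }
  return { passed: cases.length - failed.length, failed };
}

const report = runEvals([
  {
    id: "shipment-status-check",
    query: "What's the status of order 12345?",
    expectedTool: "getShipmentByOrderId",
  },
]);
```

&lt;p&gt;Every row from the spreadsheet becomes one more entry in that array.&lt;/p&gt;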




&lt;h2&gt;
  
  
  Step 6: Measure the right things separately
&lt;/h2&gt;

&lt;p&gt;This is where most teams lie to themselves.&lt;/p&gt;

&lt;p&gt;They compute one accuracy number. "We're at 85%!" Leadership is happy. But 85% of what?&lt;/p&gt;

&lt;p&gt;I push clients to measure these separately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;AgentMetrics&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Did we pick the right tool?&lt;/span&gt;
  &lt;span class="nl"&gt;toolSelectionAccuracy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Did we retrieve relevant docs?&lt;/span&gt;
  &lt;span class="nl"&gt;retrievalRecall&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Did the final answer match ground truth?&lt;/span&gt;
  &lt;span class="nl"&gt;answerCorrectness&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Did we cite the right sources?&lt;/span&gt;
  &lt;span class="nl"&gt;groundingAccuracy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Did the user accept the response?&lt;/span&gt;
  &lt;span class="nl"&gt;userAcceptanceRate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can have 95% tool selection and 40% answer correctness. That means retrieval or synthesis is broken.&lt;/p&gt;

&lt;p&gt;You can have 90% answer correctness and 60% user acceptance. That means the answer is technically right but useless in practice.&lt;/p&gt;

&lt;p&gt;Separate metrics tell you where to focus. One number tells you nothing.&lt;/p&gt;
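&lt;p&gt;Computing these separately from scored runs is a few lines. A sketch with illustrative field names:&lt;/p&gt;

```typescript
// Sketch: compute separate rates from scored runs instead of one
// blended accuracy number. Field names here are illustrative.
interface ScoredRun {
  toolCorrect: boolean;
  answerCorrect: boolean;
  userAccepted: boolean;
}

function rate(runs: ScoredRun[], pick: (r: ScoredRun) => boolean): number {
  if (runs.length === 0) return 0;
  return runs.filter(pick).length / runs.length;
}

const runs: ScoredRun[] = [
  { toolCorrect: true, answerCorrect: true, userAccepted: true },
  { toolCorrect: true, answerCorrect: false, userAccepted: false },
  { toolCorrect: true, answerCorrect: true, userAccepted: false },
  { toolCorrect: false, answerCorrect: false, userAccepted: false },
];

const metrics = {
  toolSelectionAccuracy: rate(runs, (r) => r.toolCorrect), // 0.75
  answerCorrectness: rate(runs, (r) => r.answerCorrect),   // 0.5
  userAcceptanceRate: rate(runs, (r) => r.userAccepted),   // 0.25
};
```

&lt;p&gt;Three numbers, three different stories. One blended score would hide all of them.&lt;/p&gt;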




&lt;h2&gt;
  
  
  Step 7: Expand slowly with permission gates
&lt;/h2&gt;

&lt;p&gt;After 2 weeks with 3 users, you might be ready for 10.&lt;/p&gt;

&lt;p&gt;Don't flip a switch. Add gates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;canUseAgent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;User&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Phase 1: Named early adopters&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ROLLOUT_PHASE&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;earlyAdopters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Phase 2: Specific teams&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ROLLOUT_PHASE&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;team&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;support&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;team&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ops&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Phase 3: Everyone&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each phase should last at least a week. Each phase needs its own baseline metrics.&lt;/p&gt;

&lt;p&gt;If metrics drop when you expand, you've found a gap. That's good. That's the system working.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 8: Watch for drift
&lt;/h2&gt;

&lt;p&gt;The first week is not representative.&lt;/p&gt;

&lt;p&gt;Early users are curious. They ask simple questions. They're forgiving.&lt;/p&gt;

&lt;p&gt;By week 4, they're using it for real work. Queries get harder. Edge cases appear. Patience drops.&lt;/p&gt;

&lt;p&gt;I tell clients to track metrics weekly, not just at launch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Week 1: 87% tool accuracy, 72% answer correctness
Week 2: 85% tool accuracy, 75% answer correctness  
Week 3: 83% tool accuracy, 71% answer correctness
Week 4: 79% tool accuracy, 68% answer correctness  ← investigate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If metrics drift down, dig into traces. Usually it's new use cases, missing docs, or users learning to ask harder questions.&lt;/p&gt;
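&lt;p&gt;A weekly drift check can be this simple. A sketch using the numbers from the table above; the 5-point threshold is illustrative, pick your own:&lt;/p&gt;

```typescript
// Sketch of a weekly drift check: flag any metric that fell more than
// `threshold` below its launch baseline.
interface WeeklyMetrics {
  toolAccuracy: number;
  answerCorrectness: number;
}

function driftAlerts(
  baseline: WeeklyMetrics,
  current: WeeklyMetrics,
  threshold: number
): string[] {
  const alerts: string[] = [];
  if (baseline.toolAccuracy - current.toolAccuracy > threshold) {
    alerts.push("toolAccuracy");
  }
  if (baseline.answerCorrectness - current.answerCorrectness > threshold) {
    alerts.push("answerCorrectness");
  }
  return alerts;
}

// Week 1 baseline vs week 4, from the table above
const alerts = driftAlerts(
  { toolAccuracy: 0.87, answerCorrectness: 0.72 },
  { toolAccuracy: 0.79, answerCorrectness: 0.68 },
  0.05
);
// alerts: ["toolAccuracy"] because the 8-point drop crosses the threshold
```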




&lt;h2&gt;
  
  
  Step 9: Know when you're actually ready
&lt;/h2&gt;

&lt;p&gt;I've seen teams ship too early and destroy trust. I've also seen teams wait forever and never ship.&lt;/p&gt;

&lt;p&gt;Here's what ready looks like:&lt;/p&gt;

&lt;p&gt;✅ Tool selection accuracy &amp;gt; 90%&lt;br&gt;&lt;br&gt;
✅ Answer correctness &amp;gt; 80%&lt;br&gt;&lt;br&gt;
✅ User acceptance rate &amp;gt; 75%&lt;br&gt;&lt;br&gt;
✅ p95 latency &amp;lt; 8 seconds&lt;br&gt;&lt;br&gt;
✅ No hallucinations in last 100 traces&lt;br&gt;&lt;br&gt;
✅ You've handled the top 10 failure modes  &lt;/p&gt;

&lt;p&gt;Not ready:&lt;/p&gt;

&lt;p&gt;❌ Still finding new failure categories weekly&lt;br&gt;&lt;br&gt;
❌ Metrics vary wildly day to day&lt;br&gt;&lt;br&gt;
❌ Users work around the agent instead of using it&lt;br&gt;&lt;br&gt;
❌ You can't explain why it fails when it fails  &lt;/p&gt;
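&lt;p&gt;The ready checklist collapses into a single gate. A sketch with thresholds taken from the list above (&lt;code&gt;recentHallucinations&lt;/code&gt; counts the last 100 traces; the shape of the input is illustrative):&lt;/p&gt;

```typescript
// Sketch: the readiness checklist as one gate. Thresholds mirror the
// checklist above; recentHallucinations counts the last 100 traces.
interface ReadinessInputs {
  toolSelectionAccuracy: number;
  answerCorrectness: number;
  userAcceptanceRate: number;
  p95LatencyMs: number;
  recentHallucinations: number;
}

function isReadyToShip(m: ReadinessInputs): boolean {
  const checks = [
    m.toolSelectionAccuracy > 0.9,
    m.answerCorrectness > 0.8,
    m.userAcceptanceRate > 0.75,
    8000 > m.p95LatencyMs,        // p95 latency under 8 seconds
    m.recentHallucinations === 0, // none in the last 100 traces
  ];
  return checks.every(Boolean);
}
```

&lt;p&gt;One failing check means you keep iterating. The point is that the decision is mechanical, not a gut call.&lt;/p&gt;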




&lt;h2&gt;
  
  
  The outcome when this works
&lt;/h2&gt;

&lt;p&gt;Teams that follow this playbook:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ship with confidence, not hope&lt;/li&gt;
&lt;li&gt;Have real data to show leadership&lt;/li&gt;
&lt;li&gt;Know exactly where to focus engineering effort&lt;/li&gt;
&lt;li&gt;Build user trust instead of destroying it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The teams that skip steps end up with shelved projects and skeptical users. I've seen it enough times to know.&lt;/p&gt;




&lt;h2&gt;
  
  
  The checklist
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Week 0&lt;/strong&gt;: Instrument everything&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 1&lt;/strong&gt;: 3 users, review every trace, build failure spreadsheet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 2&lt;/strong&gt;: Fix perception issues (tools, context, retrieval)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 3&lt;/strong&gt;: Build evals from failures, establish baselines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 4&lt;/strong&gt;: Expand to 10 users, new roles, new use cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 5&lt;/strong&gt;: Fix new failures, update evals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 6&lt;/strong&gt;: Expand to full internal team&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 7+&lt;/strong&gt;: Monitor drift, harden edge cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When metrics stabilize&lt;/strong&gt;: Consider external rollout&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;The boring work is the real work. Instrument first. Start small. Review everything. Fix perception before prompts. Measure the right things separately. Expand slowly.&lt;/p&gt;

&lt;p&gt;Your agent is only as good as your willingness to watch it fail and fix what you find.&lt;/p&gt;




&lt;p&gt;If you're rolling out an AI product and want a second set of eyes on your approach, I help teams get this right. DM me on X or LinkedIn.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>product</category>
    </item>
    <item>
      <title>Stop Prompting. Start Engineering Perception.</title>
      <dc:creator>Serhii Panchyshyn</dc:creator>
      <pubDate>Mon, 13 Apr 2026 23:43:02 +0000</pubDate>
      <link>https://forem.com/serhiip/stop-prompting-start-engineering-perception-4fh5</link>
      <guid>https://forem.com/serhiip/stop-prompting-start-engineering-perception-4fh5</guid>
      <description>&lt;p&gt;I've watched teams spend weeks rewriting the same system prompt.&lt;/p&gt;

&lt;p&gt;Different phrasings. More examples. Clearer instructions. The agent still picks the wrong tool. Still hallucinates. Still feels broken.&lt;/p&gt;

&lt;p&gt;Then they rename six functions and accuracy jumps 30%.&lt;/p&gt;

&lt;p&gt;This pattern shows up constantly. The model doesn't care how clever your prompt is. It cares about what it can &lt;em&gt;see&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem I see everywhere
&lt;/h2&gt;

&lt;p&gt;Teams treat prompts like magic spells. Say the right words, get the right output.&lt;/p&gt;

&lt;p&gt;But agents aren't following instructions. They're making predictions based on everything in context. The tool names. The API responses. The error messages. The structure of your data.&lt;/p&gt;

&lt;p&gt;That's perception. And it matters way more than your system prompt.&lt;/p&gt;

&lt;p&gt;Most teams optimize the wrong layer. They iterate on prompts for weeks while their tool names are &lt;code&gt;handleData&lt;/code&gt; and &lt;code&gt;processRequest&lt;/code&gt;. The model has no chance.&lt;/p&gt;

&lt;p&gt;Here are 10 patterns I've seen work across the past two years of helping teams build production agents 💪&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Tool names are the real prompt
&lt;/h2&gt;

&lt;p&gt;Bad tool names are invisible to the model.&lt;/p&gt;

&lt;p&gt;I audit client codebases and find this constantly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ The model has no idea what this does&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handleRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ Now it knows exactly when to use this&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;createShipmentFromOrder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;orderId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One client had 47 tools. Half had names like &lt;code&gt;processData&lt;/code&gt; or &lt;code&gt;executeAction&lt;/code&gt;. The model was guessing.&lt;/p&gt;

&lt;p&gt;We renamed 12 functions. Tool selection accuracy went from 60% to 87%. No prompt changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Tool descriptions matter more than you think
&lt;/h2&gt;

&lt;p&gt;The model reads descriptions to decide which tool to pick.&lt;/p&gt;

&lt;p&gt;I tell clients: write descriptions like you're onboarding a new developer. Because you are.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ Vague description&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;searchRecords&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Search for records in the system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ Specific description with constraints&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;searchShipments&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Search shipments by tracking number, origin, destination, or date range. Returns max 50 results. Use filters to narrow results before searching.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Specific descriptions reduce wrong tool selection by 30-40% in my experience.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Passing everything into context is lazy
&lt;/h2&gt;

&lt;p&gt;I've reviewed architectures where teams dump entire conversation histories into context. 20 turns. 50 tool results. Everything.&lt;/p&gt;

&lt;p&gt;The model drowns.&lt;/p&gt;

&lt;p&gt;What works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Last 3 turns by default&lt;/li&gt;
&lt;li&gt;Relevant retrieved docs only&lt;/li&gt;
&lt;li&gt;Structured summaries instead of raw data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Less context. Better decisions. Faster responses.&lt;/p&gt;

&lt;p&gt;One team cut their context by 60% and saw answer quality improve. Counter-intuitive until you realize the model was distracted by noise.&lt;/p&gt;
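&lt;p&gt;Those defaults fit in one function. A sketch with illustrative shapes for turns and docs:&lt;/p&gt;

```typescript
// Sketch of the defaults above: keep only the last three turns and a
// hard cap on retrieved docs. Turn and Doc are illustrative shapes.
interface Turn { role: string; content: string; }
interface Doc { id: string; summary: string; }

function buildContext(history: Turn[], retrieved: Doc[]) {
  return {
    turns: history.slice(-3),    // last 3 turns by default
    docs: retrieved.slice(0, 5), // relevant docs only, hard cap
  };
}

const ctx = buildContext(
  [
    { role: "user", content: "turn 1" },
    { role: "assistant", content: "turn 2" },
    { role: "user", content: "turn 3" },
    { role: "assistant", content: "turn 4" },
  ],
  [{ id: "d1", summary: "shipment policy" }]
);
// ctx.turns keeps turns 2-4; ctx.docs keeps the single retrieved doc
```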




&lt;h2&gt;
  
  
  4. Scoped retrieval beats broad retrieval
&lt;/h2&gt;

&lt;p&gt;Early RAG implementations pull from everywhere. The whole knowledge base. 200+ docs. The model has no idea which ones matter.&lt;/p&gt;

&lt;p&gt;I push clients toward module-level filtering. If someone asks about shipments, only retrieve shipment docs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ Retrieve from everything&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ Scope to relevant module&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
  &lt;span class="na"&gt;module&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;detectModule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;maxResults&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; 
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Recall goes up. Hallucinations go down. Scoped retrieval should be the default from day one.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Structured outputs prevent downstream chaos
&lt;/h2&gt;

&lt;p&gt;If another agent or system consumes your output, structure it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ Free text response&lt;/span&gt;
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;I found 3 shipments that match. The first one is #12345 going to Chicago...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ Structured response&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;shipments&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;12345&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;destination&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Chicago&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;status&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;in_transit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;12346&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;destination&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Denver&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;status&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;delivered&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;total&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;hasMore&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unstructured responses compound errors. Each downstream consumer has to parse and guess. I've seen entire pipelines break because one agent returned prose instead of JSON.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Silent failures are invisible failures
&lt;/h2&gt;

&lt;p&gt;The model can't fix what it can't see.&lt;/p&gt;

&lt;p&gt;I audit error handling in every client codebase. Same pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ Silent failure&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;hasPermission&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ Loud failure&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;hasPermission&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PERMISSION_DENIED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;User lacks 'shipments.create' permission&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;requiredPermission&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;shipments.create&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;suggestedAction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Request access from workspace admin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Explicit errors let the agent reason about what went wrong. And let you debug faster.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Real system state beats assumed state
&lt;/h2&gt;

&lt;p&gt;I watched an agent confidently tell a user their shipment was delivered.&lt;/p&gt;

&lt;p&gt;It wasn't. The agent assumed based on typical timelines. It never checked the actual record.&lt;/p&gt;

&lt;p&gt;This happens when teams don't pass real state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ Agent has to guess&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;orderId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;12345&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ Agent knows the truth&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;shipment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;12345&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;in_transit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// actual current status&lt;/span&gt;
    &lt;span class="na"&gt;lastUpdate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2024-01-15T10:30:00Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;currentLocation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Memphis hub&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agents will make up state if you don't give them real state. Always.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Specialized agents beat one generalist
&lt;/h2&gt;

&lt;p&gt;I've seen teams try to build one agent that handles everything. Customer questions. Data entry. Workflow automation. Reports.&lt;/p&gt;

&lt;p&gt;It's mediocre at all of them.&lt;/p&gt;

&lt;p&gt;The pattern that works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One agent for Q&amp;amp;A&lt;/strong&gt; using org context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One agent for record operations&lt;/strong&gt; with strict schemas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One agent for document extraction&lt;/strong&gt; with specialized prompts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each one is easier to eval. Easier to constrain. Easier to improve.&lt;/p&gt;

&lt;p&gt;Generalist agents are harder to debug and harder to trust. I push clients toward decomposition early.&lt;/p&gt;
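&lt;p&gt;A sketch of what that decomposition can look like in code. This is illustrative Python, not a real framework: the agent names and the keyword-based router are stand-ins, and in practice the routing step is often a cheap model call rather than string matching:&lt;/p&gt;

```python
# Illustrative sketch: route each request to a narrow, specialized agent
# instead of one generalist. Agent names are hypothetical stand-ins.

def classify(message: str) -> str:
    """Crude keyword router; in practice this is often a cheap model call."""
    text = message.lower()
    if "extract" in text or "document" in text:
        return "extraction"
    if any(verb in text for verb in ("create", "update", "delete")):
        return "records"
    return "qa"

AGENTS = {
    "qa": "qa_agent",                  # Q+A over org context
    "records": "records_agent",        # record operations with strict schemas
    "extraction": "extraction_agent",  # document extraction prompts
}

def dispatch(message: str) -> str:
    """Pick the specialized agent that should handle this message."""
    return AGENTS[classify(message)]
```

&lt;p&gt;Each agent behind the dispatcher can then be evaluated, constrained, and improved independently.&lt;/p&gt;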




&lt;h2&gt;
  
  
  9. Guardrails should block bad things, not useful things
&lt;/h2&gt;

&lt;p&gt;I've seen guardrails so aggressive they blocked legitimate business operations.&lt;/p&gt;

&lt;p&gt;"Can you help me set up a webhook?" → BLOCKED (mentions code execution)&lt;/p&gt;

&lt;p&gt;"What's the API endpoint for shipments?" → BLOCKED (mentions API)&lt;/p&gt;

&lt;p&gt;The users stopped trusting the product. Not because the AI was bad. Because the guardrails were dumb.&lt;/p&gt;

&lt;p&gt;Narrow guardrails work better. Be specific about what's actually dangerous. Allow everything else.&lt;/p&gt;
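&lt;p&gt;One way to make that concrete: guard specific actions, not keywords in user text. A minimal sketch, with hypothetical action names:&lt;/p&gt;

```python
# Narrow guardrail sketch: deny a short, explicit list of dangerous actions
# and allow everything else. The action names are hypothetical examples.

BLOCKED_ACTIONS = {
    "delete_all_records",
    "export_customer_pii",
    "run_arbitrary_shell",
}

def is_allowed(action: str) -> bool:
    """Allow by default; block only what is explicitly dangerous."""
    return action not in BLOCKED_ACTIONS
```

&lt;p&gt;Asking about webhooks or APIs is just text. Only actual tool invocations pass through this check, so the legitimate questions above sail through.&lt;/p&gt;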




&lt;h2&gt;
  
  
  10. Audit perception before rewriting prompts
&lt;/h2&gt;

&lt;p&gt;When a client tells me their agent is underperforming, I ask these questions first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can it see the right tools? Are names and descriptions clear?&lt;/li&gt;
&lt;li&gt;Can it see the right context? Or is it drowning in noise?&lt;/li&gt;
&lt;li&gt;Can it see real state? Or is it guessing?&lt;/li&gt;
&lt;li&gt;Can it see errors? Or do failures happen silently?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nine times out of ten, the problem is perception. Not the prompt.&lt;/p&gt;




&lt;h2&gt;
  
  
  The outcome when you get this right
&lt;/h2&gt;

&lt;p&gt;Teams that engineer perception instead of prompts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stop the endless prompt iteration cycle&lt;/li&gt;
&lt;li&gt;Get measurable accuracy improvements in days, not months&lt;/li&gt;
&lt;li&gt;Build agents that actually work in production&lt;/li&gt;
&lt;li&gt;Have clear debugging paths when things break&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The teams that keep tweaking prompts stay stuck. I've seen it enough times to know.&lt;/p&gt;




&lt;h2&gt;
  
  
  The mental model shift
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prompt engineering&lt;/strong&gt; asks: "How do I word this better?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perception engineering&lt;/strong&gt; asks: "What does the agent need to see to make a good decision?"&lt;/p&gt;

&lt;p&gt;One has diminishing returns after a few iterations.&lt;/p&gt;

&lt;p&gt;The other compounds as your system improves.&lt;/p&gt;




&lt;p&gt;Stop rewriting prompts. Start auditing what your agent can perceive.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rename tools for clarity&lt;/li&gt;
&lt;li&gt;Scope your context&lt;/li&gt;
&lt;li&gt;Pass real state&lt;/li&gt;
&lt;li&gt;Make errors loud&lt;/li&gt;
&lt;li&gt;Use specialized agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your agent is only as good as what it can see 👀&lt;/p&gt;




&lt;p&gt;If you're building agents and want a second set of eyes on your architecture, I help teams get this right. DM me on X or LinkedIn.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>webdev</category>
    </item>
    <item>
      <title>My First RAG System Had No Evals. 40% of Answers Were Wrong.</title>
      <dc:creator>Serhii Panchyshyn</dc:creator>
      <pubDate>Mon, 13 Apr 2026 20:58:06 +0000</pubDate>
      <link>https://forem.com/serhiip/my-first-rag-system-had-no-evals-40-of-answers-were-wrong-ab</link>
      <guid>https://forem.com/serhiip/my-first-rag-system-had-no-evals-40-of-answers-were-wrong-ab</guid>
      <description>&lt;p&gt;When I started building production RAG systems, I noticed something: nobody was measuring retrieval quality.&lt;/p&gt;

&lt;p&gt;Teams would ship a system, ask users if it "felt good," and move on. No metrics. No baseline. No way to know if changes actually helped.&lt;/p&gt;

&lt;p&gt;So I started measuring everything. And the first thing I discovered: &lt;strong&gt;most RAG failures aren't LLM failures. They're retrieval failures.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The documents that could answer the question aren't making it into the context window. The LLM is being asked to answer questions without the information it needs. No wonder it hallucinates.&lt;/p&gt;

&lt;p&gt;Here's what I've learned about measuring and fixing RAG systems after building them for B2B SaaS companies.&lt;/p&gt;




&lt;h2&gt;
  
  
  The metric that actually matters: Recall@k
&lt;/h2&gt;

&lt;p&gt;Before I measure anything else on a new RAG system, I measure &lt;strong&gt;Recall@k&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Recall@k answers a simple question: "Of all the documents that &lt;em&gt;should&lt;/em&gt; have been retrieved, what percentage actually made it into the top k results?"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;recall_at_k&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retrieved_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;relevant_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;What % of relevant docs are in the top k results?&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;top_k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retrieved_ids&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;relevant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relevant_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;relevant&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;relevant&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relevant&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;On systems I've audited, Recall@10 is often around 60%. That means for roughly 40% of questions, the document that could answer them isn't even in the context. The LLM never had a chance.&lt;/p&gt;

&lt;p&gt;Here's the math that drives everything:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P(correct answer) ≈ P(correct context retrieved)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the right chunks aren't retrieved, the LLM can't answer correctly. This is why I always measure retrieval separately from answer quality. Otherwise you're debugging the wrong layer.&lt;/p&gt;


&lt;h2&gt;
  
  
  You can start measuring today
&lt;/h2&gt;

&lt;p&gt;You don't need production traffic to build evals. Generate synthetic test data from your corpus:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_synthetic_evals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate question-answer pairs from your chunks.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;eval_pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Generate 3 questions that this text can answer.
Make them specific. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is this about?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t test retrieval.

Text:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Return JSON: [{{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}}]
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;eval_pairs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;parse_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;eval_pairs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;50-100 questions is enough to establish a baseline. Run your retriever, measure Recall@10, write down the number. Now you can actually tell if changes help.&lt;/p&gt;
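&lt;p&gt;Putting the pieces together, the baseline run itself is a short loop. Here &lt;code&gt;retriever&lt;/code&gt; is a stand-in for whatever search function you have, and the recall computation mirrors the function shown earlier:&lt;/p&gt;

```python
# Baseline loop sketch: average Recall@10 across the synthetic eval set.
# `retriever` is a stand-in for your search function.

def recall_at_k(retrieved_ids: list, relevant_ids: list, k: int) -> float:
    """What fraction of relevant docs made it into the top k results?"""
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0
    return len(top_k.intersection(relevant)) / len(relevant)

def measure_baseline(eval_pairs: list, retriever, k: int = 10) -> float:
    """Each eval pair maps a question to the chunk that answers it."""
    scores = [
        recall_at_k(retriever(pair["question"]), [pair["chunk_id"]], k)
        for pair in eval_pairs
    ]
    return sum(scores) / len(scores)
```

&lt;p&gt;Write the resulting number down. Every later change gets compared against it.&lt;/p&gt;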


&lt;h2&gt;
  
  
  The two fixes that consistently move the needle
&lt;/h2&gt;

&lt;p&gt;I've tried a lot of retrieval improvements. Most make marginal differences. Two consistently deliver results.&lt;/p&gt;
&lt;h3&gt;
  
  
  Fix 1: Hybrid search
&lt;/h3&gt;

&lt;p&gt;Embeddings are great at semantic similarity. "How do I reset my password?" matches "Steps to recover account access" even though they share no keywords.&lt;/p&gt;

&lt;p&gt;But embeddings are weak on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Numbers&lt;/strong&gt;: They don't understand that 49 is close to 50&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exact match&lt;/strong&gt;: Product codes, IDs, ticker symbols&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rare terms&lt;/strong&gt;: Domain jargon not in the training data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;BM25 (keyword search) catches what embeddings miss. Combine them:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hybrid_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Combine embedding search and BM25 using RRF.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;embedding_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;bm25_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bm25_index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Reciprocal Rank Fusion
&lt;/span&gt;    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;rrf_k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rrf_k&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bm25_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rrf_k&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Typical improvement: &lt;strong&gt;5-15% recall boost&lt;/strong&gt; depending on query mix.&lt;/p&gt;
&lt;h3&gt;
  
  
  Fix 2: Add a reranker
&lt;/h3&gt;

&lt;p&gt;Embedding models are bi-encoders. They encode query and documents separately, then compare. Fast, but imprecise.&lt;/p&gt;

&lt;p&gt;Cross-encoders (rerankers) look at the query and document together. Slower, but much more accurate. Use them as a second pass:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_with_rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Retrieve broadly, then rerank precisely.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Cast a wide net
&lt;/span&gt;    &lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;hybrid_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Rerank with cross-encoder
&lt;/span&gt;    &lt;span class="n"&gt;pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;get_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reranker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Return top k after reranking
&lt;/span&gt;    &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Typical improvement: &lt;strong&gt;another 5-10%&lt;/strong&gt; on top of hybrid search.&lt;/p&gt;

&lt;p&gt;Combined, these two fixes often take a system from 60% to 80% recall. That's the difference between "works sometimes" and "works reliably."&lt;/p&gt;


&lt;h2&gt;
  
  
  Chunking decisions that make or break retrieval
&lt;/h2&gt;

&lt;p&gt;Your chunking strategy matters more than your embedding model choice. A few things I always check:&lt;/p&gt;
&lt;h3&gt;
  
  
  The "it" problem
&lt;/h3&gt;

&lt;p&gt;Chunks that start with "It also supports..." or "This feature allows..." are useless on their own. The word "it" has no meaning without the previous chunk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix: Prepend context to every chunk.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk_with_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sections&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Prepend document and section info
&lt;/span&gt;        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Document: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Section: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk_text&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;split_section&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;section&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;header&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Other chunking rules I follow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Never split mid-table.&lt;/strong&gt; A row without headers is meaningless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10-20% overlap&lt;/strong&gt; between consecutive chunks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test multiple chunk sizes&lt;/strong&gt; (256, 512, 1024 tokens). Optimal depends on your queries.&lt;/li&gt;
&lt;/ol&gt;
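&lt;p&gt;The overlap arithmetic in rule 2 is easy to get wrong, so here's a minimal sketch of fixed-size splitting with overlap. It counts whitespace-separated words as a stand-in for tokens:&lt;/p&gt;

```python
# Sketch of fixed-size chunking with configurable overlap (rule 2 above).
# Words stand in for tokens to keep the example self-contained.

def split_with_overlap(words: list, size: int = 512,
                       overlap_frac: float = 0.15) -> list:
    """Slide a window of `size` words, stepping so consecutive chunks overlap."""
    step = max(1, int(size * (1 - overlap_frac)))
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + size]
        if chunk:
            chunks.append(chunk)
        if start + size >= len(words):
            break
    return chunks
```

&lt;p&gt;Swap the whitespace split for your embedding model's tokenizer when testing real chunk sizes.&lt;/p&gt;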


&lt;h2&gt;
  
  
  The workflow I use on every RAG project
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Week 1-2: Establish baseline&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parse documents (test multiple parsers for PDFs)&lt;/li&gt;
&lt;li&gt;Chunk with context headers&lt;/li&gt;
&lt;li&gt;Generate 50-100 synthetic eval questions&lt;/li&gt;
&lt;li&gt;Build basic retriever&lt;/li&gt;
&lt;li&gt;Measure Recall@10&lt;/li&gt;
&lt;li&gt;Write down the number&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Week 2-4: Apply standard fixes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add hybrid search (BM25 + embeddings)&lt;/li&gt;
&lt;li&gt;Add reranker&lt;/li&gt;
&lt;li&gt;Measure again&lt;/li&gt;
&lt;li&gt;Compare to baseline&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Week 4+: Debug specific failures&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Break down recall by query type&lt;/li&gt;
&lt;li&gt;Find worst-performing segment&lt;/li&gt;
&lt;li&gt;Fix that segment&lt;/li&gt;
&lt;li&gt;Measure again&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key: measure after every change. If you can't see improvement in numbers, you're guessing.&lt;/p&gt;


&lt;h2&gt;
  
  
  When to measure answer quality
&lt;/h2&gt;

&lt;p&gt;Only after retrieval is solid.&lt;/p&gt;

&lt;p&gt;Once Recall@10 is above 80%, start measuring end-to-end:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;eval_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Use LLM-as-judge for answer evaluation.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Evaluate this answer. Return JSON:
- correct: true/false (factually accurate)
- grounded: true/false (supported by the context)
- complete: true/false (addresses the full question)

Context: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;format_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Answer: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;parse_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;But if retrieval is broken, this eval is noise. You're just measuring how well your LLM fills in gaps it shouldn't have to fill.&lt;/p&gt;


&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;RAG quality is retrieval quality.&lt;/p&gt;

&lt;p&gt;Before you touch your prompts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate synthetic evals from your corpus&lt;/li&gt;
&lt;li&gt;Measure Recall@10&lt;/li&gt;
&lt;li&gt;Add hybrid search&lt;/li&gt;
&lt;li&gt;Add a reranker&lt;/li&gt;
&lt;li&gt;Fix your chunking&lt;/li&gt;
&lt;li&gt;Measure again&lt;/li&gt;
&lt;/ol&gt;
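&lt;p&gt;Steps 2 and 3 above can be sketched in a few lines over toy data. This is an illustration, not a production implementation — &lt;code&gt;retrieve&lt;/code&gt; is a hypothetical stand-in for your real retriever, and the fusion shown is plain Reciprocal Rank Fusion, one common way to do hybrid search:&lt;/p&gt;

```python
# Sketch of "Measure Recall@10" and "Add hybrid search" under toy data.

def recall_at_k(eval_set, retrieve, k=10):
    """Fraction of questions whose gold passage appears in the top-k results."""
    hits = sum(1 for q, gold_id in eval_set if gold_id in retrieve(q)[:k])
    return hits / len(eval_set)

def rrf_fuse(rankings, c=60):
    """Reciprocal Rank Fusion: merge BM25 and vector rankings by rank alone."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# "b" wins the fusion because it sits near the top of both rankings.
fused = rrf_fuse([["a", "b", "c"], ["b", "c", "a"]])
```

&lt;p&gt;Run &lt;code&gt;recall_at_k&lt;/code&gt; before and after each change; if the number doesn't move, the change didn't matter.&lt;/p&gt;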

&lt;p&gt;The fixes are straightforward. The impact is anything but small.&lt;/p&gt;



&lt;p&gt;&lt;em&gt;This is Part 1 of a series on production AI systems. Next: how to know when to fix your prompts vs. build an evaluator.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  About me
&lt;/h2&gt;

&lt;p&gt;I help B2B SaaS companies ship production AI in 6 weeks.&lt;/p&gt;

&lt;p&gt;If you're building RAG and want a second set of eyes, I do free AI Teardowns — a 30-45 min video showing exactly where your pipeline is breaking and how to fix it.&lt;/p&gt;

&lt;p&gt;No pitch. Just clarity.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://animanovalabs.com/" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fanimanovalabs.com%2Fog-image.png" height="420" class="m-0" width="800"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://animanovalabs.com/" rel="noopener noreferrer" class="c-link"&gt;
            AI Implementation for B2B SaaS | AnimaNova Labs
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            Ship production AI features in 6 weeks. For B2B SaaS companies who need AI but can't hire fast enough. No $300K engineer. No 6-month timeline.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fanimanovalabs.com%2Ficon.svg%3Ficon.056_r5p2xm~fh.svg" width="1024" height="1024"&gt;
          animanovalabs.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



</description>
      <category>ai</category>
      <category>rag</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
