<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jaskirat Singh</title>
    <description>The latest articles on Forem by Jaskirat Singh (@jaskirat_singh).</description>
    <link>https://forem.com/jaskirat_singh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3718238%2Fa607d550-7735-4135-9309-f8de6f028d7e.png</url>
      <title>Forem: Jaskirat Singh</title>
      <link>https://forem.com/jaskirat_singh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jaskirat_singh"/>
    <language>en</language>
    <item>
      <title>When Your LLM Starts Bleeding Context (And How I Fixed It)</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Mon, 16 Feb 2026 18:36:11 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/when-your-llm-starts-bleeding-context-and-how-i-fixed-it-4jgl</link>
      <guid>https://forem.com/jaskirat_singh/when-your-llm-starts-bleeding-context-and-how-i-fixed-it-4jgl</guid>
      <description>&lt;p&gt;&lt;strong&gt;LLM batch processing failing?&lt;/strong&gt; Learn how context bleeding between data rows tanks accuracy—plus the linearization technique that dropped our error rate to under 15%.&lt;/p&gt;

&lt;p&gt;Here’s the thing about working with LLMs at scale: they’re incredible until they’re not.&lt;/p&gt;

&lt;p&gt;I learned this the hard way while processing thousands of user feedback entries for sentiment analysis. We’re talking serious volume here — the kind that makes individual processing a non-starter from both a time and cost perspective. So naturally, I went with batch processing.&lt;/p&gt;

&lt;p&gt;Smart move, right?&lt;/p&gt;

&lt;p&gt;Wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Day My Accuracy Tanked
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27ygxwjjr2gi0onwnaus.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27ygxwjjr2gi0onwnaus.png" alt="image1" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three weeks into production, I started noticing something weird. Our false positive and false negative rates were climbing. Not catastrophically, but enough to make me nervous.&lt;/p&gt;

&lt;p&gt;The really frustrating part? When I spot-checked the problematic entries by processing them individually, they came back accurate.&lt;/p&gt;

&lt;p&gt;That’s when I knew we had a problem.&lt;/p&gt;

&lt;p&gt;After digging into the data (because of course I did — my academic research background doesn’t let me quit), I found a pattern. When feedback entries followed a continuous sentiment sequence — such as five positive reviews in a row — a sudden negative review would be misclassified as positive.&lt;/p&gt;

&lt;p&gt;And vice versa.&lt;/p&gt;

&lt;p&gt;The numbers were brutal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;False positives jumped &lt;strong&gt;23%&lt;/strong&gt; when negative feedback followed strings of positive reviews
&lt;/li&gt;
&lt;li&gt;False negatives climbed &lt;strong&gt;18%&lt;/strong&gt; in the reverse scenario
&lt;/li&gt;
&lt;li&gt;Overall accuracy dropped from &lt;strong&gt;91% to 76%&lt;/strong&gt; on edge cases
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The LLM wasn’t treating each entry independently. It was &lt;strong&gt;bleeding context across rows&lt;/strong&gt;, letting previous patterns bias current predictions.&lt;/p&gt;

&lt;p&gt;Research shows that LLMs struggle to maintain strict contextual boundaries when processing sequential rows, leading to performance degradation as input length increases.&lt;/p&gt;

&lt;p&gt;This is what people in the industry call &lt;strong&gt;faithfulness hallucination&lt;/strong&gt; — when the model generates content that diverges from the actual input because it’s confused by the surrounding context.&lt;/p&gt;

&lt;p&gt;Not exactly what you want when you're trying to make data-driven decisions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Fixes (That Actually Worked)
&lt;/h2&gt;

&lt;p&gt;I needed solutions yesterday, so here’s what I tried first:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Batch Size Reduction
&lt;/h3&gt;

&lt;p&gt;I slashed our batch size from 32 entries down to 8. Throughput took a hit, but accuracy improved immediately.&lt;/p&gt;

&lt;p&gt;Studies using models like Llama3-70B show that pushing batch sizes beyond 64 often produces diminishing returns. There’s a sweet spot between efficiency and accuracy.&lt;/p&gt;
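&lt;p&gt;If your pipeline already holds the entries in a list, the batch-size change is a tiny helper. A minimal sketch (the helper name and the default of 8 are mine, not from any particular library):&lt;/p&gt;

```python
def make_batches(entries, batch_size=8):
    """Split entries into consecutive batches of at most batch_size items.

    The last batch may be smaller. Shrinking batch_size trades throughput
    for accuracy by giving the model fewer rows to bleed context across.
    """
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]
```

&lt;p&gt;Dropping from 32 to 8 quadruples the number of API calls, so measure the accuracy gain before committing.&lt;/p&gt;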

&lt;h3&gt;
  
  
  2. Prompt Engineering for Independence
&lt;/h3&gt;

&lt;p&gt;I rewrote our system prompt to explicitly hammer home row independence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are analysing user feedback entries independently. 
Each entry must be evaluated solely on its own content 
without influence from previous entries. 

Treat each review as a completely separate analysis task. 
Do not allow patterns from earlier reviews to bias your 
assessment of subsequent reviews.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Did it help? Yes.&lt;br&gt;&lt;br&gt;
Was it enough? Not even close.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Real Fix: Learning from AutoPK
&lt;/h2&gt;

&lt;p&gt;Here’s where things get interesting.&lt;/p&gt;

&lt;p&gt;I started researching how other teams were handling LLMs with structured data, and I came across the AutoPK framework. AutoPK demonstrates that LLMs often fail to preserve spatial and structural relationships when processing raw tables, requiring transformation to explicit key-value representations.&lt;/p&gt;

&lt;p&gt;And honestly?&lt;/p&gt;

&lt;p&gt;It changed everything.&lt;/p&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gcbkakjj6s9km1ge1w6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gcbkakjj6s9km1ge1w6.png" alt="Image2" width="800" height="248"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 1: Linearize Your Data
&lt;/h2&gt;

&lt;p&gt;Instead of feeding the LLM raw tabular rows, I transformed each entry into an explicit key-value format.&lt;/p&gt;
&lt;h3&gt;
  
  
  Before (What Everyone Does)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Row_ID, Review_Text, Sentiment
1, Great product, positive
2, Terrible service, negative
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  After (What Actually Works)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;ENTRY_1 @ Review: "Great product" | Sentiment: positive&amp;gt;
&amp;lt;ENTRY_2 @ Review: "Terrible service" | Sentiment: negative&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Why This Works
&lt;/h3&gt;

&lt;p&gt;By converting each relevant cell into a key-value pair, the transformation abstracts away layout differences. This enables text-based models to more effectively extract information from tabular data.&lt;/p&gt;

&lt;p&gt;You’re shifting the cognitive load from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Spatial reasoning (which LLMs struggle with)
&lt;/li&gt;
&lt;li&gt;✅ Sequential text processing (which they’re actually good at)&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Step 2: Add Few-Shot Examples
&lt;/h2&gt;

&lt;p&gt;I included five examples in the linearized format before the actual task data. The examples showed the model exactly how to handle each entry independently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: Analyze sentiment for each feedback entry independently.

Examples:
&amp;lt;ENTRY_A @ Review: "The interface is intuitive and fast" | Sentiment: positive&amp;gt;
&amp;lt;ENTRY_B @ Review: "Customer support was unhelpful" | Sentiment: negative&amp;gt;
&amp;lt;ENTRY_C @ Review: "Product works as described" | Sentiment: neutral&amp;gt;
&amp;lt;ENTRY_D @ Review: "Exceeded my expectations completely" | Sentiment: positive&amp;gt;
&amp;lt;ENTRY_E @ Review: "Shipping took three weeks" | Sentiment: negative&amp;gt;

Now analyze the following entries:
[Production data follows]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Research on TabLLM shows that this approach can outperform traditional deep-learning methods, particularly in few-shot scenarios with minimal labelled data.&lt;/p&gt;

&lt;p&gt;We saw error rates drop from &lt;strong&gt;60–95% down to under 15%&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Simplify Your Input
&lt;/h2&gt;

&lt;p&gt;I stripped out every column that wasn’t directly relevant to sentiment analysis.&lt;/p&gt;

&lt;p&gt;No timestamps.&lt;br&gt;&lt;br&gt;
No URLs.&lt;br&gt;&lt;br&gt;
No user agents.  &lt;/p&gt;

&lt;p&gt;Just the entry ID and the feedback text.&lt;/p&gt;
&lt;h3&gt;
  
  
  Original Input (Noisy)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;row_id, timestamp, user_id, session_id, feedback_text, page_url, user_agent, sentiment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Cleaned Input
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;entry_id, feedback_text, sentiment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Research confirms that heterogeneous features — ranging from dense numerical to sparse categorical — can confuse models, especially when columns contain information unrelated to the target task.&lt;/p&gt;

&lt;p&gt;This single change reduced hallucination rates by about &lt;strong&gt;12%&lt;/strong&gt;.&lt;/p&gt;
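&lt;p&gt;The column stripping is one line per record. A minimal sketch, assuming records arrive as dicts (the field names mirror the cleaned schema above):&lt;/p&gt;

```python
RELEVANT_FIELDS = ("entry_id", "feedback_text", "sentiment")

def simplify_record(record):
    """Keep only the fields the sentiment task needs; drop timestamps,
    URLs, user agents, and anything else that can distract the model."""
    return {key: record[key] for key in RELEVANT_FIELDS if key in record}
```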


&lt;h2&gt;
  
  
  Step 4: Control the Output Format
&lt;/h2&gt;

&lt;p&gt;I explicitly instructed the model to return results in a structured format using only the provided identifiers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Return results in the following format, using only the entry identifiers provided:

entry_id,predicted_sentiment,confidence
1,positive,0.92
2,negative,0.87
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents the model from inventing data or introducing information that wasn’t in the input.&lt;/p&gt;

&lt;p&gt;It also makes validation way easier.&lt;/p&gt;
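&lt;p&gt;Because the output format is pinned down, validation reduces to parsing CSV and rejecting any identifier you never sent. A sketch using Python's standard csv module (the function name and the silent-drop policy are my own choices):&lt;/p&gt;

```python
import csv
import io

def parse_results(raw_output, valid_ids):
    """Parse the model's CSV reply, keeping only rows whose entry_id
    was actually in the input batch. Rows with invented IDs are dropped."""
    reader = csv.DictReader(io.StringIO(raw_output.strip()))
    results = {}
    for row in reader:
        if row["entry_id"] in valid_ids:
            results[row["entry_id"]] = (row["predicted_sentiment"], float(row["confidence"]))
    return results
```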




&lt;h2&gt;
  
  
  The Trade-Offs Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Look — these solutions aren’t free.&lt;/p&gt;

&lt;p&gt;Here’s what I learned about the cost-performance balance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smaller batch sizes = better accuracy but worse throughput
&lt;/li&gt;
&lt;li&gt;Linearization + few-shot prompting = higher token consumption per entry
&lt;/li&gt;
&lt;li&gt;Individual processing = guaranteed independence but eye-watering API costs
&lt;/li&gt;
&lt;/ul&gt;
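&lt;p&gt;The token side of this trade-off is easy to put numbers on: the fixed prompt (instructions plus few-shot examples) is amortized across the batch, while each entry pays its own tokens. A back-of-envelope sketch (all token counts are illustrative):&lt;/p&gt;

```python
def tokens_per_entry(fixed_prompt_tokens, entry_tokens, batch_size):
    """Average prompt tokens attributed to one entry: the fixed prompt
    is shared by the whole batch, the entry text is not."""
    return fixed_prompt_tokens / batch_size + entry_tokens
```

&lt;p&gt;With a 500-token fixed prompt and 50-token entries, a batch of 32 costs about 66 tokens per entry versus about 113 for a batch of 8. Smaller batches pay the fixed prompt more often.&lt;/p&gt;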

&lt;p&gt;Anthropic reportedly optimised Claude 3 with continuous batching, increasing throughput from 50 to 450 tokens per second while reducing latency and cutting GPU costs by 40%.&lt;/p&gt;

&lt;p&gt;The point is: you need to know what you’re optimising for.&lt;/p&gt;

&lt;p&gt;For our use case — customer feedback analysis where 85–90% accuracy was acceptable — optimized batch processing with linearization hit the sweet spot.&lt;/p&gt;

&lt;p&gt;Your mileage may vary.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Actually Implement This
&lt;/h2&gt;

&lt;p&gt;Here’s my step-by-step process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Establish a baseline by processing a representative sample individually
&lt;/li&gt;
&lt;li&gt;Convert your pipeline to generate key-value formatted entries
&lt;/li&gt;
&lt;li&gt;Create domain-specific examples that cover your edge cases
&lt;/li&gt;
&lt;li&gt;Experiment with batch sizes to find your accuracy-throughput balance
&lt;/li&gt;
&lt;li&gt;Build validation checks comparing batch results to individual processing
&lt;/li&gt;
&lt;li&gt;Monitor continuously for regression over time
&lt;/li&gt;
&lt;/ol&gt;
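&lt;p&gt;Step 5 can be as simple as sampling a slice of each batch, re-running those entries individually, and comparing labels. A minimal sketch (the function name and the drift-alert idea are my own):&lt;/p&gt;

```python
def agreement_rate(batch_preds, individual_preds):
    """Fraction of shared entry IDs where the batched prediction matches
    the individually processed one. Alert when this drifts downward."""
    shared = [k for k in batch_preds if k in individual_preds]
    if not shared:
        return 0.0
    matches = sum(1 for k in shared if batch_preds[k] == individual_preds[k])
    return matches / len(shared)
```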




&lt;h2&gt;
  
  
  Code: Linearization + Prompt Builder
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd


def linearize_feedback(df: pd.DataFrame, id_column: str, text_column: str):
    """
    Convert tabular feedback data into linearized key-value format.
    """
    linearized = []
    for _, row in df.iterrows():
        entry = f"&amp;lt;ENTRY_{row[id_column]} @ Review: \"{row[text_column]}\"&amp;gt;"
        linearized.append(entry)
    return linearized


def create_few_shot_prompt(examples, task_data):
    """
    Construct prompt with few-shot examples and task data.
    """
    prompt = "Task: Analyze sentiment for each feedback entry independently.\n\n"
    prompt += "Examples:\n"

    for entry, label in examples:
        prompt += f"{entry} | Sentiment: {label}\n"

    prompt += "\nNow analyze the following entries:\n"
    prompt += "\n".join(task_data)
    prompt += "\n\nReturn results as CSV: entry_id,predicted_sentiment,confidence"

    return prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What I Wish Someone Had Told Me
&lt;/h2&gt;

&lt;p&gt;Context bleeding isn’t some edge case that only affects massive enterprise deployments.&lt;/p&gt;

&lt;p&gt;If you’re processing structured data with LLMs — feedback, survey responses, support tickets, whatever — you’re probably experiencing this problem right now.&lt;/p&gt;

&lt;p&gt;You just might not know it yet.&lt;/p&gt;

&lt;p&gt;Context errors can lead to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lost essential information
&lt;/li&gt;
&lt;li&gt;Misinterpreted model output
&lt;/li&gt;
&lt;li&gt;Incorrect downstream actions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AutoPK-inspired pipeline approach fundamentally changed how I think about feeding data to LLMs.&lt;/p&gt;

&lt;p&gt;Converting spatial table relationships into explicit textual representations isn’t just a workaround.&lt;/p&gt;

&lt;p&gt;It’s actually aligning with what these models are good at.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;The research community is exploring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attention mechanisms specifically designed for structured data
&lt;/li&gt;
&lt;li&gt;Architectural modifications that enforce entry boundaries
&lt;/li&gt;
&lt;li&gt;Hybrid approaches combining LLM strengths with traditional ML for tabular data
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Continuous batching combines KV caching, chunked prefill, and ragged batching with dynamic scheduling to maximise throughput — but these optimizations focus on throughput rather than accuracy preservation.&lt;/p&gt;

&lt;p&gt;For now, if you’re dealing with batch processing of structured data:&lt;/p&gt;

&lt;p&gt;Start with &lt;strong&gt;linearization + few-shot prompting&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The results speak for themselves.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Anyscale. "Achieve 23x LLM Inference Throughput &amp;amp; Reduce p50 Latency."&lt;br&gt;&lt;br&gt;
&lt;a href="https://www.anyscale.com/blog/continuous-batching-llm-inference" rel="noopener noreferrer"&gt;https://www.anyscale.com/blog/continuous-batching-llm-inference&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hugging Face. "Continuous Batching from First Principles."&lt;br&gt;&lt;br&gt;
&lt;a href="https://huggingface.co/blog/continuous_batching" rel="noopener noreferrer"&gt;https://huggingface.co/blog/continuous_batching&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Chroma Research. "Context Rot: How Increasing Input Tokens Impacts LLM Performance."&lt;br&gt;&lt;br&gt;
&lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;https://research.trychroma.com/context-rot&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ACM Transactions. "A Survey on Hallucination in Large Language Models."&lt;br&gt;&lt;br&gt;
&lt;a href="https://dl.acm.org/doi/10.1145/3703155" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3703155&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Nature. "Detecting Hallucinations in Large Language Models Using Semantic Entropy."&lt;br&gt;&lt;br&gt;
&lt;a href="https://www.nature.com/articles/s41586-024-07421-0" rel="noopener noreferrer"&gt;https://www.nature.com/articles/s41586-024-07421-0&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;arXiv. "AutoPK: Leveraging LLMs and Hybrid Similarity Metric for Advanced Retrieval of Pharmacokinetic Data."&lt;br&gt;&lt;br&gt;
&lt;a href="https://arxiv.org/html/2510.00039" rel="noopener noreferrer"&gt;https://arxiv.org/html/2510.00039&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;arXiv. "Large Language Models on Tabular Data: Prediction, Generation, and Understanding - A Survey."&lt;br&gt;&lt;br&gt;
&lt;a href="https://arxiv.org/html/2402.17944v2" rel="noopener noreferrer"&gt;https://arxiv.org/html/2402.17944v2&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Predibase. "Maximize Zero-Shot LLM Performance on Tabular Data."&lt;br&gt;&lt;br&gt;
&lt;a href="https://predibase.com/blog/getting-the-best-zero-shot-performance-on-your-tabular-data-with-llms" rel="noopener noreferrer"&gt;https://predibase.com/blog/getting-the-best-zero-shot-performance-on-your-tabular-data-with-llms&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Latitude. "Scaling LLMs with Batch Processing: Ultimate Guide."&lt;br&gt;&lt;br&gt;
&lt;a href="https://latitude-blog.ghost.io/blog/scaling-llms-with-batch-processing-ultimate-guide/" rel="noopener noreferrer"&gt;https://latitude-blog.ghost.io/blog/scaling-llms-with-batch-processing-ultimate-guide/&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>softwaredevelopment</category>
      <category>coding</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Boring Truth About Letting AI Write Your Code</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Mon, 02 Feb 2026 17:10:58 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/the-boring-truth-about-letting-ai-write-your-code-4j98</link>
      <guid>https://forem.com/jaskirat_singh/the-boring-truth-about-letting-ai-write-your-code-4j98</guid>
      <description>&lt;p&gt;AI-assisted coding is incredible. It's also incredibly boring.&lt;/p&gt;

&lt;p&gt;There, I said it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Promise vs. The Reality
&lt;/h2&gt;

&lt;p&gt;We're living in the era of "vibe coding" where you describe what you want, hand it off to an AI agent, and watch it build your app while you sip coffee. Tools like Spec Kit, sudocode, and GitHub Copilot are genuinely powerful. They can take specifications and turn side project ideas into actual reality, saving you from the graveyard of purchased domain names that never became anything.&lt;/p&gt;

&lt;p&gt;The technology works. Sometimes too well.&lt;/p&gt;

&lt;p&gt;But here's what nobody talks about: &lt;strong&gt;watching AI code is the modern equivalent of watching paint dry.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Missing Piece: Joy
&lt;/h2&gt;

&lt;p&gt;I've literally dozed off this week watching agents work. Multiple times. The code appears line by line in the editor, technically correct, functionally sound, and completely devoid of any emotional satisfaction.&lt;/p&gt;

&lt;p&gt;There's no "aha!" moment. No problem-solving rush. No "I AM A GENIUS" feeling when you finally crack that tricky algorithm. Just... code appearing. Like magic, except magic is supposed to be exciting.&lt;/p&gt;

&lt;p&gt;The AI companies pitch this as freedom: "Let the computer do the boring work so you can focus on interesting work." But here's the uncomfortable realization: &lt;strong&gt;coding itself isn't the boring work.&lt;/strong&gt; At least not for those of us who genuinely love it.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Vibe Coding Actually Makes Sense
&lt;/h2&gt;

&lt;p&gt;Don't get me wrong—there's a place for this approach:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use AI when you only care about the output.&lt;/strong&gt; Those utility scripts, personal tools, or simple apps where the tech stack doesn't matter and you just want something functional. The projects sitting in your "someday" pile that are finally getting built because AI removed the activation energy barrier.&lt;/p&gt;

&lt;p&gt;For those projects? Fine. Ship it. Get the result. Move on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But don't use it for what you love.&lt;/strong&gt; For apps you're proud of, projects using interesting tech stacks, or anything where the journey matters as much as the destination—drive the development yourself. The experience, the learning, the satisfaction of building something with your own hands (and brain) is irreplaceable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Risk Nobody Mentions
&lt;/h2&gt;

&lt;p&gt;The danger isn't that AI will replace developers. The danger is that developers will outsource so much that they lose their skills—and worse, lose the joy that made them want to code in the first place.&lt;/p&gt;

&lt;p&gt;Programming is problem-solving. It's creative. It's challenging in ways that feel rewarding. When you hand all of that to an AI, you're left with the least interesting part: project management for a robotic employee.&lt;/p&gt;

&lt;h2&gt;
  
  
  It's Just Another Tool
&lt;/h2&gt;

&lt;p&gt;Vibe coding isn't revolutionary. It's not the future of all development. It's simply another tool in the toolbelt—useful for specific situations, boring for everything else.&lt;/p&gt;

&lt;p&gt;I'll keep using it for projects where I genuinely don't care how they're built. I'll keep my coding skills sharp by building the things that matter with my own hands. And I'll accept that watching an AI agent work is about as thrilling as watching my code compile.&lt;/p&gt;

&lt;p&gt;Which is to say: not at all.&lt;/p&gt;

&lt;p&gt;The future where AI does all the "boring work" has arrived. Turns out, the work was never boring to begin with.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The Takeaway:&lt;/strong&gt; Use AI to eliminate friction on projects you don't care about deeply. But protect the work you love. The satisfaction of solving problems yourself isn't a bug in the development process—it's the entire point.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>web3</category>
      <category>software</category>
    </item>
    <item>
      <title>I Texted 'Clean My Desktop' From My Phone. 30 Seconds Later, It Was Done.</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Sun, 01 Feb 2026 18:07:49 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/i-texted-clean-my-desktop-from-my-phone-30-seconds-later-it-was-done-559j</link>
      <guid>https://forem.com/jaskirat_singh/i-texted-clean-my-desktop-from-my-phone-30-seconds-later-it-was-done-559j</guid>
      <description>&lt;p&gt;ChatGPT tells you what to do. This AI actually does it for you.&lt;/p&gt;

&lt;p&gt;While you're still copy-pasting code snippets and following step-by-step instructions from your AI assistant, there's a new breed of AI that's &lt;strong&gt;executing tasks autonomously on your computer&lt;/strong&gt;: organizing your files, researching content, building websites, and managing your digital life while you send commands from your phone.&lt;/p&gt;

&lt;p&gt;Welcome to the era of AI employees, not AI advisors.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Traditional AI Assistants
&lt;/h2&gt;

&lt;p&gt;Let's be honest: ChatGPT and similar tools are incredibly smart, but fundamentally passive. They give you advice, generate text, explain concepts. Then &lt;em&gt;you&lt;/em&gt; have to do the actual work: open applications, move files, write code, execute commands.&lt;/p&gt;

&lt;p&gt;You're still the employee. The AI is just the consultant.&lt;/p&gt;

&lt;p&gt;But what if your AI could actually &lt;em&gt;do&lt;/em&gt; the work?&lt;/p&gt;

&lt;h2&gt;
  
  
  Meet Claudebot: Your AI Employee That Actually Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Claudebot&lt;/strong&gt; (also known as Moldbot due to copyright reasons) flips the traditional AI model on its head. Instead of running in the cloud and chatting politely, it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runs locally on your computer&lt;/strong&gt;, keeping your data completely private&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Executes tasks autonomously&lt;/strong&gt; without you lifting a finger&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accepts commands remotely&lt;/strong&gt; via WhatsApp, Telegram, Discord, or Slack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remembers everything&lt;/strong&gt; you tell it, so you're not repeating instructions constantly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Builds its own tools&lt;/strong&gt; when it doesn't have what it needs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it as hiring an AI assistant that actually sits at your computer and gets work done while you're on the couch texting instructions from your phone.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Can It Actually Do?
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. Real-world examples:&lt;/p&gt;

&lt;h3&gt;
  
  
  File Organization That Actually Happens
&lt;/h3&gt;

&lt;p&gt;Text from your phone: &lt;em&gt;"Organize my external SSD"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Claudebot scans hundreds of messy files, proposes a logical structure, deletes junk, and reorganizes everything, all while you're nowhere near your computer. No more "I'll clean that up this weekend" promises to yourself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Content Research on Autopilot
&lt;/h3&gt;

&lt;p&gt;Need a video script analyzing AI trends? Here's what happened in one test:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Command sent via Telegram: "Analyze top LinkedIn AI posts, my personal LinkedIn history, YouTube AI videos, and Twitter news, then merge everything into one script"&lt;/li&gt;
&lt;li&gt;Claudebot autonomously opened browsers, scraped data, analyzed patterns, and generated a comprehensive script&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total time: 13 minutes&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Try doing that manually. You'd still be on tab 47 of your browser research session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Website Development While You Sleep
&lt;/h3&gt;

&lt;p&gt;"Redesign my landing page inspired by Apple's style using Replit"&lt;/p&gt;

&lt;p&gt;Claudebot analyzed Apple's website, reviewed the existing site, generated code, and created a working preview. Was it perfect? No: about 80% complete, with some context loss. But that's 80% of the work done autonomously while you did literally nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works: The Technical Magic
&lt;/h2&gt;

&lt;p&gt;Unlike cloud-based AI that sends your data to external servers, Claudebot operates entirely on your machine. You control it through messaging apps you already use daily.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The workflow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install Claudebot on your computer (ideally a secondary machine for safety)&lt;/li&gt;
&lt;li&gt;Connect it to your preferred AI model (Anthropic's Claude recommended)&lt;/li&gt;
&lt;li&gt;Link your messaging app (Telegram, WhatsApp, etc.)&lt;/li&gt;
&lt;li&gt;Send text commands from anywhere&lt;/li&gt;
&lt;li&gt;Claudebot executes tasks in real-time&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The AI doesn't just follow rigid scripts. It adapts, problem-solves, and even creates new capabilities when needed. Can't do something? It builds the tool to do it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Privacy Advantage
&lt;/h2&gt;

&lt;p&gt;Here's something ChatGPT can't offer: &lt;strong&gt;complete data privacy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Because Claudebot runs locally, your files, projects, and sensitive information never leave your machine. You get the power of advanced AI without uploading your entire digital life to the cloud.&lt;/p&gt;

&lt;p&gt;For developers, creators, and anyone handling confidential work, this is game-changing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Catch (Because There's Always One)
&lt;/h2&gt;

&lt;p&gt;Let's be real: this isn't plug-and-play for everyone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You'll need:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Basic command-line familiarity&lt;/li&gt;
&lt;li&gt;An API key setup (through Anthropic or other providers)&lt;/li&gt;
&lt;li&gt;A secondary computer for installation (don't risk your primary machine)&lt;/li&gt;
&lt;li&gt;Patience with occasional context loss during complex tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The risks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This AI has real control over your computer&lt;/li&gt;
&lt;li&gt;Memory limitations can cause it to "forget" mid-task&lt;/li&gt;
&lt;li&gt;Complex automations might need human intervention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's powerful, but with power comes responsibility. This isn't an AI you install on your main work laptop without understanding what it can do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Should Use This?
&lt;/h2&gt;

&lt;p&gt;Claudebot shines for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Content creators&lt;/strong&gt; drowning in research and organization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developers&lt;/strong&gt; automating repetitive file management and code tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remote workers&lt;/strong&gt; who need tasks executed while away from their desk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anyone tired&lt;/strong&gt; of being their AI's assistant instead of the other way around&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're comfortable with technology and want an AI that actually does work rather than describing work, this is your tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future Is Already Here
&lt;/h2&gt;

&lt;p&gt;We've spent years asking AI for advice. Now AI is asking us, "What do you want done?"&lt;/p&gt;

&lt;p&gt;Claudebot represents a fundamental shift: from AI assistants that &lt;em&gt;inform&lt;/em&gt; to AI workers that &lt;em&gt;execute&lt;/em&gt;. It's not perfect, it's not risk-free, and it's definitely not for everyone.&lt;/p&gt;

&lt;p&gt;But for those ready to hand real tasks to AI, not just questions, it's the most practical implementation of autonomous AI assistance available today.&lt;/p&gt;

&lt;p&gt;And it's completely free to use.&lt;/p&gt;

&lt;p&gt;The question isn't whether AI will eventually control our computers. It's whether you're ready to let it start now.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Setup Overview
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Visit the Claudebot/Moldbot website&lt;/li&gt;
&lt;li&gt;Download for your OS (macOS, Linux, or Windows)&lt;/li&gt;
&lt;li&gt;Run the installation script in Terminal&lt;/li&gt;
&lt;li&gt;Choose quick start mode&lt;/li&gt;
&lt;li&gt;Connect your AI provider (Anthropic Claude recommended)&lt;/li&gt;
&lt;li&gt;Link your messaging app (Telegram works great)&lt;/li&gt;
&lt;li&gt;Start sending commands remotely&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Install on a secondary machine first. Test carefully. Then decide if you want this level of AI autonomy in your workflow.&lt;/p&gt;

&lt;p&gt;The future of AI isn't just smarter responses. It's AI that actually does the job.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>agents</category>
      <category>agentaichallenge</category>
    </item>
    <item>
      <title>AI Tools Are Cute. AI Agents Are Dangerous</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Fri, 30 Jan 2026 18:52:00 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/ai-tools-are-cute-ai-agents-are-dangerous-30ip</link>
      <guid>https://forem.com/jaskirat_singh/ai-tools-are-cute-ai-agents-are-dangerous-30ip</guid>
      <description>&lt;p&gt;For the last few years, the internet has been obsessed with AI tools.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tools that write emails&lt;/li&gt;
&lt;li&gt;Tools that generate images&lt;/li&gt;
&lt;li&gt;Tools that autocomplete code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They made us faster — but they didn’t fundamentally change how work gets done.&lt;/p&gt;

&lt;p&gt;2026 marks a different shift.&lt;/p&gt;

&lt;p&gt;This is the year AI stops waiting for instructions and starts &lt;strong&gt;acting on your behalf&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  2025 Was About Discovery. 2026 Is About Delegation.
&lt;/h2&gt;

&lt;p&gt;In 2025, AI agents quietly crossed a threshold.&lt;/p&gt;

&lt;p&gt;Search trends spiked.&lt;br&gt;&lt;br&gt;
  Developer tools exploded.&lt;br&gt;&lt;br&gt;
  Startups pivoted from “AI assistant” to “AI agent.”&lt;/p&gt;

&lt;p&gt;Yet most people missed what was actually happening.&lt;/p&gt;

&lt;p&gt;They treated agents like smarter chatbots.&lt;/p&gt;

&lt;p&gt;That misunderstanding is why many still don’t see what’s coming next.&lt;/p&gt;




&lt;h2&gt;
  
  
  What an AI Agent Really Is (No Buzzwords)
&lt;/h2&gt;

&lt;p&gt;An AI agent is not a chatbot.&lt;/p&gt;

&lt;p&gt;An AI agent is a system that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accept a high-level goal
&lt;/li&gt;
&lt;li&gt;Decompose it into multiple steps
&lt;/li&gt;
&lt;li&gt;Choose tools and actions autonomously
&lt;/li&gt;
&lt;li&gt;Observe outcomes
&lt;/li&gt;
&lt;li&gt;Adjust its strategy
&lt;/li&gt;
&lt;li&gt;Continue until the goal is complete
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don’t guide it step by step.&lt;br&gt;
  You define success.&lt;/p&gt;

&lt;p&gt;Everything else is delegated.&lt;/p&gt;
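&lt;p&gt;The loop above can be sketched in a few lines of Python. Everything here (the planner, the tools, the goal) is a toy stand-in for illustration, not any specific framework’s API:&lt;/p&gt;

```python
# Minimal agent loop: plan the next step, act, observe, repeat until done.
# The planner and tools below are toy stand-ins, not a real agent framework.

def run_agent(goal, tools, max_steps=10):
    """Drive tool calls until the goal's planner reports completion."""
    state = {"log": []}
    for _ in range(max_steps):
        action = goal["plan"](state)            # decompose: what's next?
        if action is None:                      # goal complete
            return state
        outcome = tools[action](state)          # choose and execute a tool
        state["log"].append((action, outcome))  # observe the result
        # adjust: the next plan() call sees the updated state
    return state

# Toy goal: gather facts, then summarize, in order.
def plan(state):
    done = [action for action, _ in state["log"]]
    for needed in ("search", "extract", "summarize"):
        if needed not in done:
            return needed
    return None

tools = {
    "search": lambda s: "found 3 sources",
    "extract": lambda s: "pulled key figures",
    "summarize": lambda s: "wrote summary",
}

result = run_agent({"plan": plan}, tools)
```

&lt;p&gt;A real agent replaces the hard-coded planner with an LLM call, but the shape of the loop is the same: plan, act, observe, repeat.&lt;/p&gt;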




&lt;h2&gt;
  
  
  Tools Assist. Agents Decide.
&lt;/h2&gt;

&lt;p&gt;Traditional AI tools follow a familiar loop:&lt;/p&gt;

&lt;p&gt;You prompt.&lt;br&gt;&lt;br&gt;
  The tool responds.&lt;br&gt;&lt;br&gt;
  You decide what to do next.&lt;/p&gt;

&lt;p&gt;AI agents collapse that loop.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You give a goal.&lt;/li&gt;
&lt;li&gt;The agent plans the workflow.&lt;/li&gt;
&lt;li&gt;The agent executes actions.&lt;/li&gt;
&lt;li&gt;The agent handles failures.&lt;/li&gt;
&lt;li&gt;The agent delivers an outcome.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not convenience.&lt;/p&gt;

&lt;p&gt;This is a new interaction model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why AI Agents Suddenly Work in 2026
&lt;/h2&gt;

&lt;p&gt;Agents existed before — but they failed quietly.&lt;/p&gt;

&lt;p&gt;What changed?&lt;/p&gt;

&lt;h3&gt;
  
  
  Models Became Fast Enough to Think While Acting
&lt;/h3&gt;

&lt;p&gt;Latency dropped.&lt;br&gt;
  Reasoning improved.&lt;br&gt;
  Long-context memory became reliable.&lt;/p&gt;

&lt;p&gt;Agents can now reflect, correct, and continue without constant human intervention.&lt;/p&gt;

&lt;h3&gt;
  
  
  Software Became Actionable
&lt;/h3&gt;

&lt;p&gt;Browsers, calendars, CRMs, file systems, APIs — everything became permissioned and callable.&lt;/p&gt;

&lt;p&gt;Agents no longer simulate actions.&lt;br&gt;
  They perform them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Work Became Workflow-Heavy
&lt;/h3&gt;

&lt;p&gt;Modern work isn’t one task.&lt;/p&gt;

&lt;p&gt;It’s dozens of interconnected micro-tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Research
&lt;/li&gt;
&lt;li&gt;Switching tabs
&lt;/li&gt;
&lt;li&gt;Copying data
&lt;/li&gt;
&lt;li&gt;Formatting outputs
&lt;/li&gt;
&lt;li&gt;Following up
&lt;/li&gt;
&lt;li&gt;Tracking progress
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agents thrive in exactly this environment.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Mental Shift Most People Haven’t Made
&lt;/h2&gt;

&lt;p&gt;Most users still ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What can this AI do?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Advanced users ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What should never require my attention again?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This shift — from capability thinking to delegation thinking —&lt;br&gt;
  separates casual AI users from leverage builders.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where AI Agents Are Already Replacing Human Effort
&lt;/h2&gt;

&lt;p&gt;Not jobs.&lt;/p&gt;

&lt;p&gt;Friction.&lt;/p&gt;

&lt;p&gt;AI agents are already handling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-step web research and data extraction
&lt;/li&gt;
&lt;li&gt;Slide decks and report generation
&lt;/li&gt;
&lt;li&gt;Inbox triage and calendar coordination
&lt;/li&gt;
&lt;li&gt;File organization and cleanup
&lt;/li&gt;
&lt;li&gt;Repetitive browser workflows
&lt;/li&gt;
&lt;li&gt;Monitoring tasks and follow-ups
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The human role moves up the stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Isn’t Traditional Automation
&lt;/h2&gt;

&lt;p&gt;Old automation was brittle.&lt;/p&gt;

&lt;p&gt;One broken selector.&lt;br&gt;
  One changed page.&lt;br&gt;
  One unexpected input.&lt;/p&gt;

&lt;p&gt;Everything failed.&lt;/p&gt;

&lt;p&gt;AI agents adapt.&lt;/p&gt;

&lt;p&gt;They retry.&lt;br&gt;
  They choose alternate paths.&lt;br&gt;
  They ask for clarification.&lt;br&gt;
  They escalate only when needed.&lt;/p&gt;

&lt;p&gt;This isn’t scripting.&lt;/p&gt;

&lt;p&gt;It’s situational decision-making.&lt;/p&gt;
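&lt;p&gt;That adaptive behavior can be sketched as: try a primary path, fall back to alternates, and escalate only when everything fails. The paths here are stand-in callables, not a real automation API:&lt;/p&gt;

```python
# Sketch of adapt-and-escalate behavior: attempt each path in order,
# record failures, and hand off to a human only when all paths fail.
# The "paths" are toy callables standing in for real tool invocations.

def run_with_fallbacks(paths, escalate):
    errors = []
    for name, action in paths:
        try:
            return {"path": name, "result": action()}
        except Exception as exc:         # observe the failure, adjust strategy
            errors.append((name, str(exc)))
    return escalate(errors)              # only when every path has failed

def primary():
    raise RuntimeError("selector changed")   # the brittle-automation case

def alternate():
    return "data fetched via API"

outcome = run_with_fallbacks(
    [("scrape", primary), ("api", alternate)],
    escalate=lambda errs: {"path": "human", "errors": errs},
)
```

&lt;p&gt;Old-school scripting stops at the first broken selector; this shape keeps going and asks for help last, not first.&lt;/p&gt;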




&lt;h2&gt;
  
  
  The Real Value: Cognitive Offloading
&lt;/h2&gt;

&lt;p&gt;Speed is a side effect.&lt;/p&gt;

&lt;p&gt;The real value is mental bandwidth.&lt;/p&gt;

&lt;p&gt;Agents absorb:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context switching
&lt;/li&gt;
&lt;li&gt;Progress tracking
&lt;/li&gt;
&lt;li&gt;Failure recovery
&lt;/li&gt;
&lt;li&gt;Repetitive decisions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Humans regain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus
&lt;/li&gt;
&lt;li&gt;Judgment
&lt;/li&gt;
&lt;li&gt;Creativity
&lt;/li&gt;
&lt;li&gt;Strategic thinking
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why agents feel less like tools&lt;br&gt;
  and more like leverage.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Most People Will Use AI Agents Wrong
&lt;/h2&gt;

&lt;p&gt;At first, people will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Over-control every step
&lt;/li&gt;
&lt;li&gt;Micromanage execution
&lt;/li&gt;
&lt;li&gt;Panic when agents pause
&lt;/li&gt;
&lt;li&gt;Avoid granting permissions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agents don’t fail because they’re weak.&lt;/p&gt;

&lt;p&gt;They fail because humans don’t yet know how to delegate.&lt;/p&gt;

&lt;p&gt;Clear goals.&lt;br&gt;
  Clear constraints.&lt;br&gt;
  Then trust.&lt;/p&gt;




&lt;h2&gt;
  
  
  Careers Are Quietly Being Reshaped
&lt;/h2&gt;

&lt;p&gt;In 2026, advantage won’t come from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Knowing more tools
&lt;/li&gt;
&lt;li&gt;Writing better prompts
&lt;/li&gt;
&lt;li&gt;Memorizing workflows
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It will come from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Designing systems
&lt;/li&gt;
&lt;li&gt;Knowing what to automate
&lt;/li&gt;
&lt;li&gt;Knowing what must remain human
&lt;/li&gt;
&lt;li&gt;Thinking in outcomes, not steps
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a mindset shift, not a skill upgrade.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;AI agents are not replacing humans.&lt;/p&gt;

&lt;p&gt;They are replacing to-do lists.&lt;/p&gt;

&lt;p&gt;In 2026, delegation won’t feel optional.&lt;br&gt;
  It will feel obvious.&lt;/p&gt;

&lt;p&gt;And the people who learn to think in agents —&lt;br&gt;
  not tools —&lt;br&gt;
  will move faster with less effort than everyone else.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>aitools</category>
      <category>programming</category>
    </item>
    <item>
      <title>If Your Code Works on the First Try, There’s a Massive Mistake Somewhere</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Thu, 29 Jan 2026 18:24:25 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/if-your-code-works-on-the-first-try-theres-a-massive-mistake-somewhere-1h8o</link>
      <guid>https://forem.com/jaskirat_singh/if-your-code-works-on-the-first-try-theres-a-massive-mistake-somewhere-1h8o</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckx9qjiikxp9jqv2xakk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckx9qjiikxp9jqv2xakk.png" alt="Meme" width="590" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I read this line somewhere on Reddit and it stuck with me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If your code works on the first try, there is a massive mistake somewhere.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It sounds like a joke, but every experienced developer knows there’s a painful truth hidden inside it.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;When code works on the first run, it usually means you tested only the happy path. Missing edge cases, untested assumptions, environment mismatches, or silent failures are often hiding underneath. First-try success should trigger skepticism, not celebration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Quote Resonates With Developers
&lt;/h2&gt;

&lt;p&gt;Every developer has lived through this moment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code runs perfectly on your machine
&lt;/li&gt;
&lt;li&gt;You push it
&lt;/li&gt;
&lt;li&gt;Production breaks in ways you never imagined
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The quote isn’t saying good engineers write broken code. It’s saying real software lives in messy environments, not in ideal inputs and local setups.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why First-Try Success Is Suspicious
&lt;/h2&gt;

&lt;p&gt;If your code works immediately, one or more of these is likely true:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You only tested the happy path
&lt;/li&gt;
&lt;li&gt;Inputs were overly clean or hard-coded
&lt;/li&gt;
&lt;li&gt;Error handling was never exercised
&lt;/li&gt;
&lt;li&gt;The environment matches your local machine too closely
&lt;/li&gt;
&lt;li&gt;Concurrency and load were never tested
&lt;/li&gt;
&lt;li&gt;Latency and failure scenarios were ignored
&lt;/li&gt;
&lt;li&gt;Success criteria were too shallow
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Passing once proves almost nothing.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Simple Example
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime

def parse_date(s):
    return datetime.strptime(s, "%Y-%m-%d")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This works perfectly for:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026-01-23
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;But what about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;23-01-2026
&lt;/li&gt;
&lt;li&gt;2026/01/23
&lt;/li&gt;
&lt;li&gt;Jan 23 2026
&lt;/li&gt;
&lt;li&gt;2026-02-30
&lt;/li&gt;
&lt;li&gt;Different locales or time zones
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First-run success only proves your input matched your assumption.&lt;/p&gt;
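&lt;p&gt;One way to harden the parser against exactly those inputs (a sketch, one approach among many; the format list is an assumption about what callers actually send):&lt;/p&gt;

```python
from datetime import datetime

# Formats we choose to accept; anything else is rejected, not guessed at.
FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%Y/%m/%d", "%b %d %Y")

def parse_date(s):
    """Try known formats; return None instead of crashing on bad input."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(s.strip(), fmt)
        except ValueError:
            continue
    return None

assert parse_date("2026-01-23") is not None
assert parse_date("23-01-2026") is not None
assert parse_date("2026/01/23") is not None
assert parse_date("Jan 23 2026") is not None
assert parse_date("2026-02-30") is None   # invalid calendar date, rejected
```

&lt;p&gt;Note that the impossible date still fails, but now it fails explicitly, on purpose, instead of blowing up three layers away.&lt;/p&gt;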




&lt;h2&gt;
  
  
  The “Try to Break It” Mindset
&lt;/h2&gt;

&lt;p&gt;When code works immediately, do this next:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add unit tests for the happy path
&lt;/li&gt;
&lt;li&gt;Add tests for invalid and edge inputs
&lt;/li&gt;
&lt;li&gt;Run property-based or fuzz testing
&lt;/li&gt;
&lt;li&gt;Test against real dependencies, not mocks
&lt;/li&gt;
&lt;li&gt;Simulate production-like environments
&lt;/li&gt;
&lt;li&gt;Introduce latency and partial failures
&lt;/li&gt;
&lt;li&gt;Run concurrent requests
&lt;/li&gt;
&lt;li&gt;Verify logging and error reporting
&lt;/li&gt;
&lt;li&gt;Add static analysis and type checks
&lt;/li&gt;
&lt;li&gt;Get a second set of eyes in code review
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you can’t break it, write a test that proves why.&lt;/p&gt;
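&lt;p&gt;Step 3 from the list above, in miniature: a tiny fuzz loop that feeds random strings to a parser and asserts it never raises. Real projects would reach for a property-testing library like Hypothesis; this hand-rolled version just shows the idea:&lt;/p&gt;

```python
import random
import string
from datetime import datetime

def safe_parse(s):
    """The contract under test: never raise, return None on bad input."""
    try:
        return datetime.strptime(s, "%Y-%m-%d")
    except (ValueError, TypeError):
        return None

random.seed(0)   # fixed seed so a failure is reproducible
for _ in range(1000):
    n = random.randint(0, 30)
    garbage = "".join(random.choice(string.printable) for _ in range(n))
    safe_parse(garbage)   # must not raise for any input
```

&lt;p&gt;A thousand random inputs won’t prove correctness, but they surface the crashes a single happy-path run never will.&lt;/p&gt;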




&lt;h2&gt;
  
  
  Why Production Is Where Bugs Are Born
&lt;/h2&gt;

&lt;p&gt;Production introduces things your laptop never will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network delays
&lt;/li&gt;
&lt;li&gt;Partial outages
&lt;/li&gt;
&lt;li&gt;Race conditions
&lt;/li&gt;
&lt;li&gt;Different CPU architectures
&lt;/li&gt;
&lt;li&gt;Locale and encoding differences
&lt;/li&gt;
&lt;li&gt;Unexpected user behavior
&lt;/li&gt;
&lt;li&gt;Scale and concurrency
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Code that works once in isolation hasn’t earned trust yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  When First-Try Success Is Actually Fine
&lt;/h2&gt;

&lt;p&gt;There are exceptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pure deterministic functions
&lt;/li&gt;
&lt;li&gt;Well-specified algorithms
&lt;/li&gt;
&lt;li&gt;One-off scripts
&lt;/li&gt;
&lt;li&gt;Throwaway experiments
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the moment code becomes reusable, shared, or deployed, it needs tests and defensive thinking.&lt;/p&gt;




&lt;h2&gt;
  
  
  Turning First-Run Success Into Confidence
&lt;/h2&gt;

&lt;p&gt;A healthy workflow looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Code works on first run
&lt;/li&gt;
&lt;li&gt;Add tests that assert that behavior
&lt;/li&gt;
&lt;li&gt;Add tests that try to break it
&lt;/li&gt;
&lt;li&gt;Automate those tests in CI
&lt;/li&gt;
&lt;li&gt;Observe behavior in staging
&lt;/li&gt;
&lt;li&gt;Monitor and log in production
&lt;/li&gt;
&lt;li&gt;Iterate when reality disagrees
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Confidence comes from evidence, not luck.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;That Reddit quote isn’t pessimistic — it’s pragmatic.&lt;/p&gt;

&lt;p&gt;When your code works on the first try, don’t assume you’re done. Assume you haven’t looked hard enough yet. The best engineers aren’t the ones whose code works instantly — they’re the ones who expect it to fail and prepare for it anyway.&lt;/p&gt;

&lt;p&gt;If your code survives your attempts to break it, then you can trust it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>coding</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>Why Is Everyone Suddenly Talking About Voice Agents in 2026?</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Wed, 28 Jan 2026 18:20:32 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/why-is-everyone-suddenly-talking-about-voice-agents-in-2026-4fim</link>
      <guid>https://forem.com/jaskirat_singh/why-is-everyone-suddenly-talking-about-voice-agents-in-2026-4fim</guid>
      <description>&lt;p&gt;If you’ve been anywhere near AI conversations in 2026, one thing is impossible to ignore: &lt;strong&gt;voice agents are everywhere&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;From contact centers and sales calls to internal support desks and consumer apps, businesses are racing to deploy AI-powered voice agents. This isn’t just another AI hype cycle. What we’re seeing now is the result of multiple technologies finally maturing at the same time—models, infrastructure, latency, and real-world trust.&lt;/p&gt;

&lt;p&gt;Voice AI has entered its &lt;strong&gt;second phase of evolution&lt;/strong&gt;. And 2026 is shaping up to be the year it becomes mainstream.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR — Why Voice Agents Matter Now
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AI voice agents are no longer robotic scripts; they’re conversational, adaptive, and emotion-aware
&lt;/li&gt;
&lt;li&gt;Real-time personalization and sentiment detection are changing customer experience
&lt;/li&gt;
&lt;li&gt;Omnichannel continuity is becoming the default expectation
&lt;/li&gt;
&lt;li&gt;Businesses are adopting voice AI to reduce costs, scale faster, and improve CSAT
&lt;/li&gt;
&lt;li&gt;2026 marks a maturity point where voice AI is finally production-ready
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Big Shift: From Automation to Conversational Intelligence
&lt;/h2&gt;

&lt;p&gt;For years, voice bots were little more than glorified IVRs. They followed scripts, failed on edge cases, and frustrated users more than they helped.&lt;/p&gt;

&lt;p&gt;That era is over.&lt;/p&gt;

&lt;p&gt;Modern voice agents are powered by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large Language Models (LLMs)&lt;/li&gt;
&lt;li&gt;Advanced speech-to-text and text-to-speech systems&lt;/li&gt;
&lt;li&gt;Context-aware dialogue management&lt;/li&gt;
&lt;li&gt;Real-time analytics and feedback loops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of rigid decision trees, today’s voice agents understand &lt;strong&gt;intent, context, and flow&lt;/strong&gt;. They can handle interruptions, clarify ambiguities, and adapt their responses dynamically—much closer to how humans communicate.&lt;/p&gt;

&lt;p&gt;This shift from scripted automation to conversational intelligence is the core reason voice AI is back in the spotlight.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why 2026 Is the Inflection Point for Voice AI
&lt;/h2&gt;

&lt;p&gt;Several forces converged to make 2026 a turning point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency dropped enough for real-time conversations&lt;/li&gt;
&lt;li&gt;Models became fast, cheap, and accurate enough for voice&lt;/li&gt;
&lt;li&gt;Enterprises demanded measurable ROI from AI investments&lt;/li&gt;
&lt;li&gt;Customer expectations rose dramatically post-chatbot era&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More than 80% of contact center leaders now rank AI-driven productivity as a top priority. According to industry forecasts, AI agents could unlock hundreds of billions of dollars in economic value over the next few years through cost savings and revenue growth.&lt;/p&gt;

&lt;p&gt;Voice AI is no longer experimental. It’s operational.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Rise of Next-Gen AI Voice Agents
&lt;/h2&gt;

&lt;p&gt;Next-generation voice agents are fundamentally different from their predecessors.&lt;/p&gt;

&lt;p&gt;They are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context-aware across long conversations&lt;/li&gt;
&lt;li&gt;Adaptive to user behavior and tone&lt;/li&gt;
&lt;li&gt;Self-improving through learning loops&lt;/li&gt;
&lt;li&gt;Integrated deeply into business systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These agents don’t just answer questions—they &lt;strong&gt;participate in conversations&lt;/strong&gt;. They understand when a user is confused, frustrated, or satisfied, and adjust accordingly.&lt;/p&gt;

&lt;p&gt;This evolution turns voice AI from a support tool into a true engagement partner.&lt;/p&gt;




&lt;h2&gt;
  
  
  From Scripted Bots to Real Conversations
&lt;/h2&gt;

&lt;p&gt;Traditional voice bots relied on predefined flows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If user says X, respond with Y&lt;/li&gt;
&lt;li&gt;If intent unclear, repeat menu options&lt;/li&gt;
&lt;li&gt;Escalate early to human agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern voice agents do the opposite.&lt;/p&gt;

&lt;p&gt;They:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interpret intent probabilistically&lt;/li&gt;
&lt;li&gt;Maintain conversational memory&lt;/li&gt;
&lt;li&gt;Adjust responses in real time&lt;/li&gt;
&lt;li&gt;Handle complex, multi-turn dialogues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This enables more natural, efficient interactions and dramatically reduces call handling time and escalation rates.&lt;/p&gt;




&lt;h2&gt;
  
  
  Emotion and Empathy Are No Longer Optional
&lt;/h2&gt;

&lt;p&gt;One of the biggest breakthroughs in voice AI is &lt;strong&gt;emotion recognition&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;By analyzing tone, pace, pauses, and sentiment, voice agents can infer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frustration&lt;/li&gt;
&lt;li&gt;Confusion&lt;/li&gt;
&lt;li&gt;Urgency&lt;/li&gt;
&lt;li&gt;Satisfaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More importantly, they can respond empathetically—slowing down, changing tone, or escalating when necessary.&lt;/p&gt;

&lt;p&gt;This transforms voice interactions from transactional to human-centric, helping businesses build trust instead of frustration.&lt;/p&gt;
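&lt;p&gt;As a toy illustration, the escalation logic might look like this. The signal names and thresholds are invented for the sketch, not taken from any real voice-AI product:&lt;/p&gt;

```python
# Toy tone-based response selection. Signals are assumed to be scores in
# [0, 1] from an upstream emotion-recognition step; thresholds are made up.

def choose_response(signals):
    if signals.get("frustration", 0) > 0.7:
        return "escalate_to_human"
    if signals.get("confusion", 0) > 0.5:
        return "slow_down_and_rephrase"
    if signals.get("urgency", 0) > 0.5:
        return "skip_pleasantries"
    return "continue_normally"

assert choose_response({"frustration": 0.9}) == "escalate_to_human"
assert choose_response({"confusion": 0.6}) == "slow_down_and_rephrase"
assert choose_response({}) == "continue_normally"
```

&lt;p&gt;Production systems blend these signals continuously rather than applying hard cutoffs, but the principle is the same: the inferred emotion drives the next move.&lt;/p&gt;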




&lt;h2&gt;
  
  
  Multilingual and Accent-Adaptive Voice Agents
&lt;/h2&gt;

&lt;p&gt;Global businesses can no longer afford language barriers.&lt;/p&gt;

&lt;p&gt;Modern voice agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Switch languages mid-call&lt;/li&gt;
&lt;li&gt;Adapt to regional accents&lt;/li&gt;
&lt;li&gt;Understand dialect variations&lt;/li&gt;
&lt;li&gt;Maintain accuracy across geographies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t just about translation—it’s about &lt;strong&gt;inclusive communication&lt;/strong&gt;. Accent-adaptive AI prevents misinterpretation, reduces bias, and creates a more equitable customer experience.&lt;/p&gt;




&lt;h2&gt;
  
  
  5 Voice Agent Trends Defining 2026
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Trend 1: Generative AI for Real-Time Personalization
&lt;/h3&gt;

&lt;p&gt;Voice agents now generate responses dynamically using customer context, history, and intent. This leads to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher first-call resolution (FCR)&lt;/li&gt;
&lt;li&gt;Lower average handle time (AHT)&lt;/li&gt;
&lt;li&gt;More personalized customer journeys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every call becomes unique.&lt;/p&gt;




&lt;h3&gt;
  
  
  Trend 2: Omnichannel Voice Experiences
&lt;/h3&gt;

&lt;p&gt;Customers move seamlessly between voice, chat, and digital channels. Voice agents maintain context across all of them.&lt;/p&gt;

&lt;p&gt;This continuity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improves retention&lt;/li&gt;
&lt;li&gt;Reduces repetition&lt;/li&gt;
&lt;li&gt;Increases operational efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Omnichannel is no longer a feature—it’s an expectation.&lt;/p&gt;




&lt;h3&gt;
  
  
  Trend 3: Voice Analytics and Sentiment Tracking
&lt;/h3&gt;

&lt;p&gt;Every call becomes a data source.&lt;/p&gt;

&lt;p&gt;Voice AI now tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sentiment changes&lt;/li&gt;
&lt;li&gt;Emotional triggers&lt;/li&gt;
&lt;li&gt;Escalation patterns&lt;/li&gt;
&lt;li&gt;Behavioral trends&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These insights feed back into business strategy, enabling proactive CX improvements.&lt;/p&gt;




&lt;h3&gt;
  
  
  Trend 4: Data Privacy and Ethical Voice AI
&lt;/h3&gt;

&lt;p&gt;Trust is becoming a competitive advantage.&lt;/p&gt;

&lt;p&gt;Customers expect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transparency in data usage&lt;/li&gt;
&lt;li&gt;Compliance with global regulations&lt;/li&gt;
&lt;li&gt;Ethical AI behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Privacy-by-design and explainable AI are now mandatory for enterprise adoption.&lt;/p&gt;




&lt;h3&gt;
  
  
  Trend 5: Self-Learning Voice Agents
&lt;/h3&gt;

&lt;p&gt;Voice agents improve with every interaction.&lt;/p&gt;

&lt;p&gt;Using reinforcement learning, they:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handle more complex cases over time&lt;/li&gt;
&lt;li&gt;Reduce human intervention&lt;/li&gt;
&lt;li&gt;Continuously optimize outcomes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This leads to massive operational cost reductions and faster scaling without additional headcount.&lt;/p&gt;




&lt;h2&gt;
  
  
  Business Impact: Why Companies Are Investing Heavily
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Faster Resolution and Higher Satisfaction
&lt;/h3&gt;

&lt;p&gt;Personalized, real-time conversations reduce friction and boost CSAT, NPS, and customer lifetime value.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cost Efficiency and Workforce Optimization
&lt;/h3&gt;

&lt;p&gt;Routine interactions are automated, freeing human agents to focus on complex, emotional, or high-value cases.&lt;/p&gt;

&lt;p&gt;The result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower cost per interaction&lt;/li&gt;
&lt;li&gt;Higher agent productivity&lt;/li&gt;
&lt;li&gt;Better workforce utilization&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Global, 24/7 Availability
&lt;/h3&gt;

&lt;p&gt;Voice AI eliminates time zone limitations, ensuring consistent service availability worldwide.&lt;/p&gt;

&lt;p&gt;This directly translates to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher retention&lt;/li&gt;
&lt;li&gt;Reduced missed opportunities&lt;/li&gt;
&lt;li&gt;Expanded market reach&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Turning Conversations into Insights
&lt;/h3&gt;

&lt;p&gt;Voice AI doesn’t just handle calls—it extracts intelligence.&lt;/p&gt;

&lt;p&gt;Businesses use call data to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improve products&lt;/li&gt;
&lt;li&gt;Refine marketing&lt;/li&gt;
&lt;li&gt;Optimize customer journeys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every conversation becomes a strategic input.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges Ahead for Voice AI
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Accent Bias and Inclusivity
&lt;/h3&gt;

&lt;p&gt;As voice AI scales globally, ensuring fair and accurate recognition across accents and dialects is critical.&lt;/p&gt;

&lt;p&gt;Inclusivity will define the next wave of innovation.&lt;/p&gt;




&lt;h3&gt;
  
  
  Preserving the Human Touch
&lt;/h3&gt;

&lt;p&gt;Automation must enhance—not replace—human empathy.&lt;/p&gt;

&lt;p&gt;The future belongs to hybrid systems where AI and humans collaborate seamlessly.&lt;/p&gt;




&lt;h3&gt;
  
  
  Regulation, Transparency, and Trust
&lt;/h3&gt;

&lt;p&gt;Compliance is just the baseline. Transparency and explainability will differentiate leaders from laggards.&lt;/p&gt;

&lt;p&gt;Ethical voice AI will become a brand value, not just a technical requirement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts: Why Voice Agents Are the Conversation of 2026
&lt;/h2&gt;

&lt;p&gt;Voice agents are no longer a novelty. They are becoming core infrastructure for customer engagement.&lt;/p&gt;

&lt;p&gt;What changed?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The technology matured&lt;/li&gt;
&lt;li&gt;The business case became undeniable&lt;/li&gt;
&lt;li&gt;Customer expectations evolved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In 2026, voice AI isn’t about replacing humans—it’s about &lt;strong&gt;scaling empathy, intelligence, and efficiency&lt;/strong&gt; at the same time.&lt;/p&gt;

&lt;p&gt;And that’s why everyone is talking about it.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>agentaichallenge</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>vLLM Explained: How PagedAttention Makes LLMs Faster and Cheaper</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Mon, 26 Jan 2026 17:37:01 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/vllm-explained-how-pagedattention-makes-llms-faster-and-cheaper-785</link>
      <guid>https://forem.com/jaskirat_singh/vllm-explained-how-pagedattention-makes-llms-faster-and-cheaper-785</guid>
      <description>&lt;p&gt;Picture this: you're firing up a large language model (LLM) for your chatbot app, and bam—your GPU memory is toast. Half of it sits idle because of fragmented key-value (KV) caches from all those user queries piling up. Requests queue up, latency spikes, and you're burning cash on extra hardware just to keep things running.&lt;/p&gt;

&lt;p&gt;Sound familiar?&lt;/p&gt;

&lt;p&gt;That’s the pain of traditional LLM inference, and it’s a headache for developers everywhere.&lt;/p&gt;

&lt;p&gt;Enter &lt;strong&gt;vLLM&lt;/strong&gt;, an open-source serving engine that acts like a smart memory manager for your LLMs. At its heart is &lt;strong&gt;PagedAttention&lt;/strong&gt;, a clever technique that pages memory the same way an operating system does—dramatically reducing waste and boosting throughput.&lt;/p&gt;

&lt;p&gt;In this post, we’ll dive deep into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where vLLM came from&lt;/li&gt;
&lt;li&gt;How its core technology works&lt;/li&gt;
&lt;li&gt;Key features that make it shine&lt;/li&gt;
&lt;li&gt;Real-world wins&lt;/li&gt;
&lt;li&gt;How to get started in minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whether you're building a production API or just experimenting locally, vLLM makes LLM serving &lt;strong&gt;easy, fast, and cheap&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is vLLM?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2jeek1lil40fl0tsbag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2jeek1lil40fl0tsbag.png" alt="What Is vLLM" width="800" height="929"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;vLLM is a high-throughput, open-source library designed specifically for serving LLMs at scale. Think of it as the turbocharged engine under the hood of your AI inference server.&lt;/p&gt;

&lt;p&gt;It handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Concurrent request processing&lt;/li&gt;
&lt;li&gt;GPU memory optimization&lt;/li&gt;
&lt;li&gt;High-throughput token generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At a glance, vLLM provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A production-ready server compatible with OpenAI’s API specification&lt;/li&gt;
&lt;li&gt;A Python inference engine for custom integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real magic comes from &lt;strong&gt;continuous batching&lt;/strong&gt; and &lt;strong&gt;PagedAttention&lt;/strong&gt;, which allow vLLM to pack far more requests onto a GPU than traditional inference engines. No more static batches waiting on slow prompts—requests flow in and out dynamically.&lt;/p&gt;

&lt;p&gt;I’ve used vLLM myself to deploy LLaMA models, and it turned a sluggish single-A100 setup into a system pushing over &lt;strong&gt;100 tokens per second&lt;/strong&gt;. If you’re tired of memory babysitting or horizontal scaling just to serve a handful of users, vLLM is your fix.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Quick History of vLLM
&lt;/h2&gt;

&lt;p&gt;vLLM emerged in 2023 from the UC Berkeley Sky Computing Lab. It was introduced in the paper:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Authored by Woosuk Kwon and team, the paper identified a massive inefficiency in LLM inference: &lt;strong&gt;KV cache fragmentation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;KV caches—temporary stores of attention keys and values—consume huge amounts of GPU memory during autoregressive generation. Traditional serving engines fragment this memory, leading to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Out-of-memory errors&lt;/li&gt;
&lt;li&gt;Underutilized GPUs&lt;/li&gt;
&lt;li&gt;Poor scalability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PagedAttention was introduced as the solution, borrowing ideas from virtual memory systems in operating systems.&lt;/p&gt;

&lt;p&gt;The idea quickly gained traction. GitHub stars skyrocketed, community adoption exploded, and by mid-2023, vLLM was powering production workloads at startups and large tech companies alike.&lt;/p&gt;

&lt;p&gt;By 2026, vLLM supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100+ models (LLaMA 3.1, Mistral, and more)&lt;/li&gt;
&lt;li&gt;Distributed serving with Ray&lt;/li&gt;
&lt;li&gt;Custom kernels and hardware optimizations&lt;/li&gt;
&lt;li&gt;Multi-LoRA and adapter-based serving&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s a rare example of an academic idea scaling cleanly into real-world production.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Technology: PagedAttention and Continuous Batching
&lt;/h2&gt;

&lt;h3&gt;
  
  
  PagedAttention
&lt;/h3&gt;

&lt;p&gt;PagedAttention is a non-contiguous memory management system for KV caches.&lt;/p&gt;

&lt;p&gt;In standard Transformer inference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KV caches grow with sequence length&lt;/li&gt;
&lt;li&gt;Memory becomes fragmented&lt;/li&gt;
&lt;li&gt;New requests fail despite free memory existing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PagedAttention fixes this by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Splitting KV caches into fixed-size pages (e.g., 16 tokens)&lt;/li&gt;
&lt;li&gt;Storing them in a shared memory pool&lt;/li&gt;
&lt;li&gt;Tracking them using a logical-to-physical mapping table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The attention kernel gathers only the required pages at runtime—no massive copies, no fragmentation.&lt;/p&gt;

&lt;p&gt;It’s essentially &lt;strong&gt;virtual memory for GPUs&lt;/strong&gt;.&lt;/p&gt;
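&lt;p&gt;The mechanism is easier to see in code. Here is a toy allocator illustrating the page-table idea: fixed-size pages drawn from a shared pool, mapped per sequence. This is a sketch of the concept only, not vLLM’s actual CUDA implementation; the pool size and sequence ids are invented.&lt;/p&gt;

```python
# Toy paged KV-cache allocator: fixed-size pages from a shared pool,
# with a per-sequence logical-to-physical page table.
PAGE_SIZE = 16  # tokens per page (vLLM uses a similar block size)

class PagedKVCache:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))   # shared physical pool
        self.page_tables = {}                      # seq_id to list of physical pages
        self.token_counts = {}                     # seq_id to tokens stored

    def append_token(self, seq_id):
        # A new physical page is grabbed only when the current one fills up,
        # so memory grows in small fixed steps with no large contiguous blocks.
        # (The real scheduler also handles the pool running empty.)
        count = self.token_counts.get(seq_id, 0)
        if count % PAGE_SIZE == 0:
            self.page_tables.setdefault(seq_id, []).append(self.free_pages.pop())
        self.token_counts[seq_id] = count + 1

    def free_sequence(self, seq_id):
        # Finished sequences return their pages to the pool immediately.
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.token_counts.pop(seq_id, None)

cache = PagedKVCache(num_pages=8)
for _ in range(20):                          # 20 tokens span two 16-token pages
    cache.append_token("req-A")
print(len(cache.page_tables["req-A"]))       # 2 pages mapped
cache.free_sequence("req-A")
print(len(cache.free_pages))                 # all 8 pages free again
```

&lt;p&gt;The attention kernel then gathers a sequence’s pages through its table at runtime, which is exactly the virtual-memory analogy at work.&lt;/p&gt;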

&lt;h3&gt;
  
  
  Continuous Batching
&lt;/h3&gt;

&lt;p&gt;Traditional batching waits for a full batch to complete. If one request is slow, everything stalls.&lt;/p&gt;

&lt;p&gt;vLLM uses &lt;strong&gt;continuous batching&lt;/strong&gt;, where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Completed sequences drop out mid-batch&lt;/li&gt;
&lt;li&gt;New requests immediately take their place&lt;/li&gt;
&lt;li&gt;GPU utilization stays consistently high&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Analogy:&lt;br&gt;&lt;br&gt;
Old batching is fixed seating in a restaurant.&lt;br&gt;&lt;br&gt;
vLLM is flexible seating—tables free up instantly when diners leave.&lt;/p&gt;

&lt;p&gt;The result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory utilization jumps from 30–50% to over 90%&lt;/li&gt;
&lt;li&gt;Throughput improves by 2–4x&lt;/li&gt;
&lt;li&gt;Latency becomes predictable&lt;/li&gt;
&lt;/ul&gt;
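&lt;p&gt;The scheduling idea fits in a few lines. This is a deliberately simplified event loop, not vLLM’s real scheduler (which also accounts for KV-cache pages and token budgets); request names and lengths are made up.&lt;/p&gt;

```python
from collections import deque

# Toy continuous-batching loop: each step decodes one token for every
# active sequence; finished sequences leave and queued ones join at once.
def run(waiting, max_batch):
    active = {}            # request id to tokens still needed
    queue = deque(waiting)
    steps = 0
    while active or queue:
        # Admit new requests the moment a slot frees up, mid-flight
        while queue and len(active) != max_batch:
            rid, remaining = queue.popleft()
            active[rid] = remaining
        # One decode step advances every active sequence by one token
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]    # slot is free on the very next step
        steps += 1
    return steps

# b needs 10 tokens, a and c need 2 each; capacity is 2 sequences.
# Static batching would run [a, b] for 10 steps, then c for 2 (12 total);
# continuous batching slots c into a's freed seat and finishes in 10.
print(run([("a", 2), ("b", 10), ("c", 2)], max_batch=2))   # 10
```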


&lt;h2&gt;
  
  
  Key Features That Make vLLM Stand Out
&lt;/h2&gt;

&lt;p&gt;vLLM isn’t just about PagedAttention—it’s packed with production-grade features.&lt;/p&gt;
&lt;h3&gt;
  
  
  Quantization Support
&lt;/h3&gt;

&lt;p&gt;Supports AWQ, GPTQ, FP8, and more. This reduces memory usage by 2–4x with minimal quality loss.&lt;/p&gt;

&lt;p&gt;I’ve personally run a 70B model on two 40GB GPUs—something that was impossible before.&lt;/p&gt;
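&lt;p&gt;The back-of-envelope math shows why. These figures count weights only; the KV cache and activations add real overhead on top, so treat them as approximations.&lt;/p&gt;

```python
# Approximate weight memory for a 70B-parameter model.
params = 70e9

fp16_gb = params * 2 / 1e9     # FP16: 2 bytes per parameter
int4_gb = params * 0.5 / 1e9   # 4-bit (AWQ/GPTQ-style): half a byte each

print(int(fp16_gb))   # 140 GB of weights: too big for two 40 GB GPUs
print(int(int4_gb))   # 35 GB of weights: fits, with headroom for KV cache
```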
&lt;h3&gt;
  
  
  Tensor Parallelism
&lt;/h3&gt;

&lt;p&gt;Seamlessly shards models across multiple GPUs using tensor parallelism. Scaling is near-linear up to 8+ GPUs.&lt;/p&gt;
&lt;h3&gt;
  
  
  Speculative Decoding
&lt;/h3&gt;

&lt;p&gt;Uses a smaller draft model to propose tokens, which the main model verifies. This can double generation speed for interactive workloads.&lt;/p&gt;
&lt;h3&gt;
  
  
  Prefix Caching
&lt;/h3&gt;

&lt;p&gt;Reuses KV caches for repeated prompts, ideal for chatbots and RAG pipelines with static system prompts.&lt;/p&gt;
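&lt;p&gt;The win is easy to quantify with a toy cache keyed on the prompt prefix. This sketches the idea only; vLLM’s prefix caching works at the level of KV-cache blocks, not strings, and the "expensive step" below is a stand-in for real attention computation.&lt;/p&gt;

```python
# Toy prefix cache: KV computation for a shared system prompt runs once
# and is reused by every request that starts with the same prefix.
cache = {}
computed = []          # records how often the expensive step actually runs

def kv_for_prefix(prefix):
    key = hash(prefix)
    if key not in cache:
        computed.append(prefix)          # stand-in for real attention math
        cache[key] = "kv(%s)" % prefix
    return cache[key]

SYSTEM = "You are a helpful assistant."
for user_msg in ["hi", "why is the sky blue?", "summarize this"]:
    kv = kv_for_prefix(SYSTEM)           # cache hit after the first request

print(len(computed))   # 1: prefix KV computed once across three requests
```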

&lt;p&gt;Additional features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI-compatible API&lt;/li&gt;
&lt;li&gt;Streaming responses&lt;/li&gt;
&lt;li&gt;JSON mode and tool calling&lt;/li&gt;
&lt;li&gt;Vision-language model support (e.g., LLaVA)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These optimizations stack. In benchmarks combining quantization, speculative decoding, and PagedAttention, vLLM exceeds &lt;strong&gt;500 tokens/sec on H100 GPUs&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  Benefits and Use Cases
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Why Use vLLM?
&lt;/h3&gt;

&lt;p&gt;Benchmarks consistently show:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2–4x higher throughput&lt;/li&gt;
&lt;li&gt;24–48% memory savings&lt;/li&gt;
&lt;li&gt;Significantly lower infrastructure costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On ShareGPT-style workloads, vLLM serves over &lt;strong&gt;2x more requests per second&lt;/strong&gt; than standard Hugging Face pipelines.&lt;/p&gt;
&lt;h3&gt;
  
  
  Common Use Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Production APIs with OpenAI compatibility&lt;/li&gt;
&lt;li&gt;Retrieval-Augmented Generation pipelines&lt;/li&gt;
&lt;li&gt;Serving fine-tuned LoRAs and adapters&lt;/li&gt;
&lt;li&gt;On-prem or edge deployments&lt;/li&gt;
&lt;li&gt;Research and large-scale evaluation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A fintech team I know replaced TGI with vLLM for fraud detection. Throughput doubled, costs dropped 40%, and multi-tenancy became trivial.&lt;/p&gt;


&lt;h2&gt;
  
  
  Getting Started in 5 Minutes
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install vllm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;h3&gt;
  
  
  Start the Server
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vllm serve meta-llama/Llama-2-7b-hf --port 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your API is now live at:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://localhost:8000/v1/chat/completions" rel="noopener noreferrer"&gt;http://localhost:8000/v1/chat/completions&lt;/a&gt;&lt;/p&gt;
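&lt;p&gt;Because the server speaks the OpenAI chat-completions schema, any OpenAI-compatible client works by pointing its base URL at &lt;code&gt;http://localhost:8000/v1&lt;/code&gt;. The sketch below only builds the request body (the model name mirrors the serve command above); actually sending it requires the running server.&lt;/p&gt;

```python
import json

# Build an OpenAI-style chat-completions request body for the local server.
def chat_request(model, user_message, max_tokens=100):
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }

body = chat_request("meta-llama/Llama-2-7b-hf", "Why is the sky blue?")
print(json.dumps(body, indent=2))
# POST this JSON to http://localhost:8000/v1/chat/completions
```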

&lt;h3&gt;
  
  
  Python Inference Example
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from vllm import LLM, SamplingParams

# Load the model; vLLM sets up weights, KV-cache paging, and the scheduler
llm = LLM(model="meta-llama/Llama-2-7b-hf")
prompts = ["Hello, world!", "Why is the sky blue?"]

# Sampling settings applied to every prompt in the batch
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=100
)

# generate() batches the prompts internally via continuous batching
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Batching, memory management, and scheduling are all handled automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  vLLM vs Other Serving Engines
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Memory Efficiency&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Ease of Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;vLLM&lt;/td&gt;
&lt;td&gt;High (PagedAttention)&lt;/td&gt;
&lt;td&gt;2–4x faster&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hugging Face TGI&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Standard&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TensorRT-LLM&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Very High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;vLLM offers the best balance of performance, usability, and flexibility—without vendor lock-in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;vLLM transforms LLM deployment from a resource-hungry problem into a scalable and affordable solution.&lt;/p&gt;

&lt;p&gt;With:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PagedAttention&lt;/strong&gt; eliminating memory waste&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continuous batching&lt;/strong&gt; maximizing GPU utilization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced features&lt;/strong&gt; like quantization and speculative decoding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;vLLM is quickly becoming the backbone of modern LLM serving.&lt;/p&gt;

&lt;p&gt;Whether you’re a solo developer or running large-scale AI infrastructure, vLLM makes high-performance inference accessible. Clone the repository, experiment locally, and keep an eye on upcoming releases—this space is moving fast.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
    <item>
      <title>How I Explained LLMs, SLMs &amp; VLMs at Microsoft</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Sun, 25 Jan 2026 06:17:47 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/how-i-explained-llms-slms-vlms-at-microsoft-3d5h</link>
      <guid>https://forem.com/jaskirat_singh/how-i-explained-llms-slms-vlms-at-microsoft-3d5h</guid>
      <description>&lt;h2&gt;
  
  
  Why This Talk Mattered
&lt;/h2&gt;

&lt;p&gt;I recently had the opportunity to present my thoughts on &lt;strong&gt;LLMs, SLMs, and VLMs&lt;/strong&gt; at the Microsoft office during a community event. This wasn’t just another AI talk filled with buzzwords and hype. The goal was simple but powerful: help students and professionals understand &lt;strong&gt;why not all AI models are built the same—and why that’s actually a good thing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This blog is a written walkthrough of that presentation. I’ll be embedding the same slides I used and expanding on the thinking behind them—&lt;strong&gt;what I wanted the audience to feel, question, and take back with them&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 1: Not All AI Models Are Built the Same — And That’s the Point
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6mjzq5khzoj015sgn7k.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6mjzq5khzoj015sgn7k.jpg" alt="Not All AI Models Are Built the Same" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key idea:&lt;/strong&gt; AI diversity is a feature, not a flaw.&lt;/p&gt;

&lt;p&gt;Most AI conversations start with the wrong question:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Which model is the best?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I wanted to flip that narrative early and replace it with a better one:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Which model fits the problem we are actually trying to solve?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That framing sets the foundation for everything that follows.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 2: Who Am I and Why This Perspective Matters
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc46syo59p5c5iu2i6d39.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc46syo59p5c5iu2i6d39.jpg" alt="who i am" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before diving into models, I briefly introduced myself and my background—working as an &lt;strong&gt;AI Data Scientist&lt;/strong&gt;, building &lt;strong&gt;SaaS products&lt;/strong&gt;, publishing research, and deploying &lt;strong&gt;production-grade AI systems&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This mattered because the talk wasn’t theoretical. It was grounded in &lt;strong&gt;real-world AI&lt;/strong&gt;, where &lt;strong&gt;cost, latency, privacy, and infrastructure constraints&lt;/strong&gt; are non-negotiable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 3: What Even Are LLMs?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vhmede1vptvhy64d3gv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vhmede1vptvhy64d3gv.jpg" alt="What Even Are LLMs" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffz977qb7zw7oucwqa1hf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffz977qb7zw7oucwqa1hf.jpg" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Large Language Models (LLMs)&lt;/strong&gt; are neural networks trained on massive datasets. They represent the &lt;strong&gt;most powerful and versatile AI systems available today&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In simple terms, LLMs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contain &lt;strong&gt;billions to trillions of parameters&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;transformer architectures&lt;/strong&gt; with attention mechanisms&lt;/li&gt;
&lt;li&gt;Can &lt;strong&gt;reason, generate text, write code, translate languages, and analyze data&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples include &lt;strong&gt;GPT-style models, Claude, Gemini, and LLaMA-based systems&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 4: Why the AI Landscape Is Not One-Size-Fits-All
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmron5t25q3hgopcytc0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmron5t25q3hgopcytc0.jpg" alt="Why the AI Landscape Is Not One-Size-Fits-All" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I used a &lt;strong&gt;smartphone analogy&lt;/strong&gt; to make this intuitive.&lt;/p&gt;

&lt;p&gt;Just like we have &lt;strong&gt;flagship phones, budget phones, and specialized devices&lt;/strong&gt;, AI models exist for different needs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLMs&lt;/strong&gt; are the heavyweights
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLMs&lt;/strong&gt; are the efficiency experts
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VLMs&lt;/strong&gt; are the multimodal specialists
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different tools exist because &lt;strong&gt;different problems demand different trade-offs&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 5–6: What LLMs Can Do Really Well
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faijznywk4683axuoe46e.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faijznywk4683axuoe46e.jpg" alt="What LLMs Can Do Really Well" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LLMs are incredibly versatile. They can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate &lt;strong&gt;long-form content&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Perform &lt;strong&gt;complex reasoning&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Write and debug &lt;strong&gt;code across multiple languages&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Translate between &lt;strong&gt;dozens of languages&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Hold &lt;strong&gt;natural, human-like conversations&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Assist with &lt;strong&gt;deep research and analysis&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where most of the AI hype comes from—and rightly so.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 7: The LLM Trade-Offs No One Talks About Enough
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rt9hb2n78ez9q4xhs5i.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rt9hb2n78ez9q4xhs5i.jpg" alt="The LLM Trade-Offs No One Talks About Enough" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All that power comes at a cost.&lt;/p&gt;

&lt;p&gt;Running LLMs is like &lt;strong&gt;driving a Ferrari&lt;/strong&gt;. It’s impressive, but not always practical.&lt;/p&gt;

&lt;p&gt;Real-world limitations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;High computational requirements&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Expensive inference costs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Higher latency&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Heavy cloud dependency&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Significant energy consumption&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where many production systems start to struggle.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 8: Enter SLMs — The Efficiency Experts
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkixrih7dx8tlsb0j78f8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkixrih7dx8tlsb0j78f8.jpg" alt="Enter SLMs — The Efficiency Experts" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Small Language Models (SLMs)&lt;/strong&gt; are often underestimated, but they are having their moment.&lt;/p&gt;

&lt;p&gt;SLMs are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Smaller and more focused&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Optimized for &lt;strong&gt;specific tasks&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fast and cost-efficient&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Capable of running on &lt;strong&gt;phones, laptops, and edge devices&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They are designed for &lt;strong&gt;practicality, not bragging rights&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 9: SLMs Are Not Weak, They Are Strategic
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qocq3eor16q74dulcc1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qocq3eor16q74dulcc1.jpg" alt="SLMs Are Not Weak, They Are Strategic" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I highlighted several modern SLMs to make this concrete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Phi-3 (Microsoft)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gemini Nano&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mistral 7B&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TinyLLaMA&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These models prove that &lt;strong&gt;intelligence is not just about size—it’s about optimization&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 10: When Should You Use SLMs?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08wprrl1g6diq50psmse.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08wprrl1g6diq50psmse.jpg" alt="When Should You Use SLMs?" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SLMs shine in scenarios where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Speed matters&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Privacy is critical&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline capability&lt;/strong&gt; is required&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Budgets are limited&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge deployment&lt;/strong&gt; is necessary&lt;/li&gt;
&lt;li&gt;Tasks are &lt;strong&gt;domain-specific&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SLMs are not budget LLMs—they are &lt;strong&gt;the right choice for the right job&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 11: VLMs — When AI Learns to See
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66ve2aj22jb03ljppyit.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66ve2aj22jb03ljppyit.jpg" alt="VLMs — When AI Learns to See" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vision-Language Models (VLMs)&lt;/strong&gt; take things a step further.&lt;/p&gt;

&lt;p&gt;They don’t just read text—they &lt;strong&gt;understand images as well&lt;/strong&gt;. This is where AI becomes truly &lt;strong&gt;multimodal&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;VLMs can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process &lt;strong&gt;images and text together&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Understand &lt;strong&gt;visual context&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Answer questions about &lt;strong&gt;photos&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Generate &lt;strong&gt;descriptions from images&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Slide 12: How VLMs Actually Work
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0s47nmarnbwko4o8zo1r.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0s47nmarnbwko4o8zo1r.jpg" alt="How VLMs Actually Work" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Under the hood, VLMs combine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;vision encoder&lt;/strong&gt; for images&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;language model&lt;/strong&gt; for text&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;fusion layer&lt;/strong&gt; to connect meaning across modalities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows AI systems to &lt;strong&gt;see and reason at the same time&lt;/strong&gt;.&lt;/p&gt;
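&lt;p&gt;At the level of tensor shapes, the pipeline is simple to sketch. The "encoders" and dimensions below are invented stand-ins (no real model is loaded); the point is only how the two modalities end up in one sequence the language model can attend over.&lt;/p&gt;

```python
# Shape-level sketch of VLM fusion: a vision encoder turns an image into
# patch embeddings, a projection maps them into the text embedding space,
# and the language model attends over the concatenated sequence.
D_TEXT = 8   # invented text embedding width

def vision_encoder(image):
    # Stand-in: pretend the image becomes 4 patch embeddings of width 6
    return [[0.0] * 6 for _ in range(4)]

def project(patch_embeddings):
    # Stand-in for the learned fusion/projection layer into text space
    return [[0.0] * D_TEXT for _ in patch_embeddings]

def embed_tokens(tokens):
    return [[0.0] * D_TEXT for _ in tokens]

image_tokens = project(vision_encoder("photo.jpg"))
text_tokens = embed_tokens(["What", "is", "in", "this", "photo", "?"])
fused_sequence = image_tokens + text_tokens   # one sequence, two modalities

print(len(fused_sequence))   # 4 image tokens + 6 text tokens = 10
```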




&lt;h2&gt;
  
  
  Slide 13: VLMs in the Real World
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8te6yy0jz8z8z6cof2k4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8te6yy0jz8z8z6cof2k4.jpg" alt="VLMs in the Real World" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;VLMs are already transforming industries such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Medical imaging and diagnostics&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Autonomous vehicles&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accessibility tools&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Visual search engines&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Content moderation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AR and VR experiences&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multimodal AI is no longer optional—it’s becoming &lt;strong&gt;standard&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 14: Comparing LLMs, SLMs, and VLMs
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtppdo97d70ngeizkvna.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtppdo97d70ngeizkvna.jpg" alt="Comparing LLMs, SLMs, and VLMs" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is &lt;strong&gt;no single best model&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLMs&lt;/strong&gt; excel at reasoning and versatility
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLMs&lt;/strong&gt; excel at efficiency and speed
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VLMs&lt;/strong&gt; excel at multimodal understanding
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The right choice depends entirely on &lt;strong&gt;context&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 15: Speed and Latency Reality Check
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froe4i8w23mneeswhffw6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froe4i8w23mneeswhffw6.jpg" alt="Speed and Latency Reality Check" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Response time matters more than ever.&lt;/p&gt;

&lt;p&gt;Approximate latency expectations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLMs:&lt;/strong&gt; 1–5 seconds
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLMs:&lt;/strong&gt; under 0.5 seconds
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VLMs:&lt;/strong&gt; 2–8 seconds depending on image complexity
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For real-time applications, &lt;strong&gt;speed is not optional&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 16: Choosing the Right Tool
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosqsyjrbxydqf07alm3f.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosqsyjrbxydqf07alm3f.jpg" alt="Choosing the Right Tool" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Choosing a model is like choosing a vehicle. A monster truck is overkill for city driving.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Need &lt;strong&gt;complex reasoning&lt;/strong&gt; → LLM
&lt;/li&gt;
&lt;li&gt;Need &lt;strong&gt;speed and efficiency&lt;/strong&gt; → SLM
&lt;/li&gt;
&lt;li&gt;Need &lt;strong&gt;visual understanding&lt;/strong&gt; → VLM
&lt;/li&gt;
&lt;li&gt;Need &lt;strong&gt;offline capability&lt;/strong&gt; → SLM
&lt;/li&gt;
&lt;/ul&gt;
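&lt;p&gt;The rules above can be written as a tiny routing function. The requirement names and the priority order (vision first, then latency and offline constraints, then reasoning) are my own simplification of the slide, not a formal recommendation.&lt;/p&gt;

```python
# Route a set of requirements to a model class, per the slide's heuristics.
def choose_model(needs):
    if "visual_understanding" in needs:
        return "VLM"
    if "offline" in needs or "speed" in needs:
        return "SLM"    # hard deployment constraints win over raw capability
    if "complex_reasoning" in needs:
        return "LLM"
    return "SLM"        # default to the cheapest option that fits

print(choose_model({"complex_reasoning"}))            # LLM
print(choose_model({"speed", "complex_reasoning"}))   # SLM: latency wins here
print(choose_model({"visual_understanding"}))         # VLM
```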




&lt;h2&gt;
  
  
  Final Takeaways
&lt;/h2&gt;

&lt;p&gt;The most important lessons from this talk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is no universally best AI model&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context beats capability&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Efficiency matters as much as intelligence&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hybrid systems often outperform single-model setups&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choose models like an &lt;strong&gt;engineer&lt;/strong&gt;, not like a fan.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Presenting this at the Microsoft office was special—not because of the venue, but because the audience asked &lt;strong&gt;implementation-focused questions&lt;/strong&gt;, not hype-driven ones.&lt;/p&gt;

&lt;p&gt;If you’re building AI systems today, understanding &lt;strong&gt;LLMs, SLMs, and VLMs&lt;/strong&gt; isn’t optional—it’s foundational.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Continue the Conversation
&lt;/h2&gt;

&lt;p&gt;If this resonated with you, feel free to connect with me on LinkedIn or reach out directly. I’d love to hear how you’re thinking about model selection in your own AI stack.&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/jaskiratai" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/jaskiratai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>microsoft</category>
      <category>nlp</category>
      <category>llm</category>
    </item>
    <item>
      <title>Why Your AI Feels Dumb (And How MCP Fixes It)</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Fri, 23 Jan 2026 18:02:11 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/why-your-ai-feels-dumb-and-how-mcp-fixes-it-3alf</link>
      <guid>https://forem.com/jaskirat_singh/why-your-ai-feels-dumb-and-how-mcp-fixes-it-3alf</guid>
      <description>&lt;p&gt;&lt;strong&gt;Your AI isn’t actually dumb.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It can write code you’d normally Google for. It can explain system design better than most interview prep blogs. It can summarize a 100-page document before your coffee gets cold.&lt;/p&gt;

&lt;p&gt;And yet, the moment you ask it to do something real—like read a file from your app, query a database, or trigger an internal API—it suddenly feels useless.&lt;/p&gt;

&lt;p&gt;That disconnect isn’t a model problem.&lt;br&gt;
It’s a context problem.&lt;/p&gt;

&lt;p&gt;And this is where Model Context Protocol (MCP) quietly changes everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Real Issue: Your AI Lives in a Bubble&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Large Language Models don’t live inside your system. They live next to it.&lt;/p&gt;

&lt;p&gt;Out of the box, an LLM has no idea about your files, your services, your permissions, or your business logic. So when we try to make AI “useful,” we start stuffing all of that information into prompts, tool schemas, and wrapper code.&lt;/p&gt;

&lt;p&gt;At first, it feels like progress. The AI responds. It calls tools. It does something.&lt;/p&gt;

&lt;p&gt;But over time, the cracks start showing.&lt;/p&gt;

&lt;p&gt;Prompts grow massive. Tool logic becomes tightly coupled to specific models. Small changes ripple through the system. Switching models feels like rewriting half your stack.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your AI didn’t get smarter.&lt;/li&gt;
&lt;li&gt;You just buried the complexity deeper.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Life Before MCP: Clever Hacks, Fragile Systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before MCP, every AI app solved the same problem in its own way.&lt;/p&gt;

&lt;p&gt;Some used function calling. Some used agent frameworks. Others built custom JSON protocols that only made sense to their team. Every solution worked—until it didn’t.&lt;/p&gt;

&lt;p&gt;The real issue was that there was no standard contract between AI models and external systems. Each integration was handcrafted. Each prompt carried architectural responsibility it was never meant to handle.&lt;/p&gt;

&lt;p&gt;We were asking language models to manage system design.&lt;/p&gt;

&lt;p&gt;That was never going to scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;MCP, Explained Without the Marketing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Model Context Protocol (MCP) is a standard that defines how AI models interact with tools, data, and context in a structured, predictable way.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It doesn’t make models smarter.&lt;/li&gt;
&lt;li&gt;It makes AI systems usable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MCP introduces a clean separation. Your application exposes what it can do and what data it can share. The AI consumes that information through a well-defined interface, without knowing how things work internally.&lt;/p&gt;

&lt;p&gt;It’s boring in the best way possible. And boring is what production systems need.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Shift MCP Introduces (And Why It Matters)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The biggest change MCP brings is this: context stops leaking into prompts.&lt;/p&gt;

&lt;p&gt;Instead of encoding system behavior into text instructions, your system exposes capabilities directly. The AI no longer needs to guess how to interact with your app. It simply asks what’s available.&lt;/p&gt;

&lt;p&gt;This means prompts become simpler. Logic becomes clearer. Security boundaries become explicit. And your AI stops feeling fragile.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;You move from “prompt-powered hacks” to actual architecture.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;MCP Servers: Teaching Your System How to Talk to AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;An MCP server acts as your system’s official interface for AI.&lt;/p&gt;

&lt;p&gt;Rather than exposing raw internals, it presents a curated view of what the AI is allowed to see and do. This might include access to files, databases, APIs, or workflows—but only through controlled, structured capabilities.&lt;/p&gt;

&lt;p&gt;From the AI’s perspective, the system becomes understandable. From your perspective, the system stays protected.&lt;/p&gt;

&lt;p&gt;That balance is what makes MCP especially powerful in enterprise and production environments.&lt;/p&gt;
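&lt;p&gt;To make the server/client split concrete, here is a toy sketch of the idea in plain Python. This is &lt;em&gt;not&lt;/em&gt; the real MCP SDK (that lives in the official &lt;code&gt;mcp&lt;/code&gt; package); every name below is invented purely to illustrate the contract: the server registers curated capabilities, the client discovers them, and invocation goes through a structured interface rather than prompt text.&lt;/p&gt;

```python
import json

# Illustrative only: a toy "capability registry" showing the MCP idea of a
# server exposing a curated, structured view of what an AI client may do.
# This is NOT the real MCP SDK; names here are invented for the sketch.
class ToyMCPServer:
    def __init__(self):
        self._tools = {}

    def tool(self, name, description):
        """Register a function as a named, described capability."""
        def wrap(fn):
            self._tools[name] = {"description": description, "fn": fn}
            return fn
        return wrap

    def list_tools(self):
        """What a client sees during discovery: names and descriptions only."""
        return [{"name": n, "description": t["description"]}
                for n, t in self._tools.items()]

    def call_tool(self, name, arguments):
        """Controlled invocation: internals stay hidden behind the contract."""
        return self._tools[name]["fn"](**arguments)

server = ToyMCPServer()

@server.tool("read_invoice", "Fetch one invoice record by id")
def read_invoice(invoice_id):
    # The AI never sees this body, only the capability's name and description.
    return {"id": invoice_id, "amount": 4200}

print(json.dumps(server.list_tools()))
print(server.call_tool("read_invoice", {"invoice_id": "INV-7"}))
```

&lt;p&gt;Notice what is absent: no prompt describes how the tool works internally. The model only ever sees the discovery output, which is exactly the separation the protocol standardizes.&lt;/p&gt;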

&lt;h2&gt;
  
  
  &lt;strong&gt;MCP Clients: Where Intelligence Finally Connects to Reality&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The MCP client is the AI-powered application itself. This could be a coding assistant, an internal chatbot, an autonomous agent, or a RAG system with actions.&lt;/p&gt;

&lt;p&gt;The key difference is that the client no longer needs custom logic for every integration. It connects to MCP servers, discovers available capabilities, and uses them consistently—regardless of the underlying implementation.&lt;/p&gt;

&lt;p&gt;This makes AI apps more modular, more flexible, and far easier to evolve over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why MCP Makes Your AI Feel Smarter (Without Changing the Model)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once MCP is in place, something interesting happens.&lt;/p&gt;

&lt;p&gt;Your AI doesn’t hallucinate how to use tools. It doesn’t rely on brittle prompt instructions. It doesn’t break when you refactor your backend.&lt;/p&gt;

&lt;p&gt;It simply operates within a clear, structured environment.&lt;/p&gt;

&lt;p&gt;That’s why MCP doesn’t feel like a flashy feature. It feels like stability. Like things finally make sense.&lt;/p&gt;

&lt;p&gt;And in a world rushing to build agents, workflows, and autonomous systems, that kind of foundation matters more than ever.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thought&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Your AI was never dumb.&lt;/p&gt;

&lt;p&gt;It was just disconnected.&lt;/p&gt;

&lt;p&gt;MCP doesn’t give your model new abilities. It gives your system a language the model can actually understand. And once that connection is clean, reliable, and standardized, everything else becomes easier.&lt;/p&gt;

&lt;p&gt;Smarter behavior isn’t always about better intelligence.&lt;/p&gt;

&lt;p&gt;Sometimes, it’s just about better context.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>llmops</category>
      <category>llm</category>
    </item>
    <item>
      <title>How a RAG Agent Helped My Father's Shoulder Treatment (And Saved ₹30,000).</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Thu, 22 Jan 2026 18:22:39 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/how-a-rag-agent-helped-my-fathers-shoulder-treatment-and-saved-30000-349h</link>
      <guid>https://forem.com/jaskirat_singh/how-a-rag-agent-helped-my-fathers-shoulder-treatment-and-saved-30000-349h</guid>
      <description>&lt;p&gt;My father slipped on wet tiles while watering plants. Left shoulder dislocation. Emergency room. The doctor said he needed treatment, possibly surgery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hospital estimate: ₹30,000-50,000.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We had HDFC insurance for 4 years. But standing there, I had no idea what was covered.&lt;br&gt;
The 84-page policy PDF was useless at that moment.&lt;/p&gt;

&lt;p&gt;So I built an AI system that could answer our questions. It took one night. It saved us ₹30,000.&lt;/p&gt;

&lt;p&gt;Here's exactly how I did it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Is a RAG Agent?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;RAG = Retrieval-Augmented Generation&lt;/strong&gt;&lt;br&gt;
Think of it as ChatGPT that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only reads YOUR documents&lt;/li&gt;
&lt;li&gt;Is far less likely to make up answers, because it must ground them in the text&lt;/li&gt;
&lt;li&gt;Shows you which page it found information on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Perfect for insurance policies, where accuracy matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How It Helped Us&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Question 1&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;"MY Father's shoulder is dislocated. Is it covered?"&lt;/em&gt;&lt;br&gt;
Agent Answer:&lt;br&gt;
✅ YES - Covered under Accidental Injury (Section 4.2)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No waiting period for accidents&lt;/li&gt;
&lt;li&gt;Manipal Hospital is a network hospital - use cashless&lt;/li&gt;
&lt;li&gt;Apply for pre-authorization within 24 hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: Got cashless approval in 18 hours. No upfront payment needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question 2&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;"Is an MRI scan also covered, or need to pay for it?"&lt;/em&gt;&lt;br&gt;
Agent Answer:&lt;br&gt;
✅ YES - Covered under Diagnostic Procedures (Section 5.1)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MRI needs pre-authorization&lt;/li&gt;
&lt;li&gt;Already approved with treatment&lt;/li&gt;
&lt;li&gt;Direct billing available&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: MRI done cashless. Saved ₹18,000 out-of-pocket.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question 3&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;"Physiotherapy sessions are covered?"&lt;/em&gt;&lt;br&gt;
Agent Answer:&lt;br&gt;
✅ YES - 10 sessions covered (Section 6.4)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Up to ₹800 per session&lt;/li&gt;
&lt;li&gt;Must be prescribed by a doctor&lt;/li&gt;
&lt;li&gt;Can be done at empaneled centers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: We had NO idea this was covered. Used all 10 sessions. Saved ₹8,000.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Complete Code&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's the exact system I built. Copy-paste ready.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install Python packages
pip install langchain langchain-community langchain-openai faiss-cpu pypdf python-dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll also need an OpenAI API key from platform.openai.com.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Setup (.env file)&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;OPENAI_API_KEY=your_key_here&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Load Policy PDF&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# load_policy.py
from langchain_community.document_loaders import PyPDFLoader

def load_policy(pdf_path):
    """Load insurance policy PDF"""
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()
    print(f"✅ Loaded {len(documents)} pages")
    return documents

# Usage
policy_docs = load_policy("star_health_policy.pdf")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Create Smart Chunks&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# chunking.py
from langchain.text_splitter import RecursiveCharacterTextSplitter

def create_chunks(documents):
    """Split into searchable pieces"""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=100
    )
    chunks = splitter.split_documents(documents)
    print(f"✅ Created {len(chunks)} chunks")
    return chunks

# Usage
chunks = create_chunks(policy_docs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
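&lt;p&gt;To build intuition for what &lt;code&gt;chunk_size=500&lt;/code&gt; and &lt;code&gt;chunk_overlap=100&lt;/code&gt; mean: consecutive chunks share 100 characters, so each window advances 400 characters. Here is a dependency-free sketch of that sliding window. LangChain's real splitter is smarter, splitting on paragraph and sentence separators first so chunks end at natural boundaries.&lt;/p&gt;

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=100):
    """Naive character-window chunking: each window starts
    chunk_size - chunk_overlap characters after the previous one.
    (Overlap must be smaller than chunk_size so the stride is positive.)"""
    stride = chunk_size - chunk_overlap  # how far the window advances each step
    return [text[start:start + chunk_size] for start in range(0, len(text), stride)]

# 1000 characters of varying content so the overlap is visible
doc = "".join(str(i % 10) for i in range(1000))
parts = sliding_chunks(doc)
print(len(parts))                         # 3 windows, starting at 0, 400, 800
print(parts[0][-100:] == parts[1][:100])  # True: consecutive chunks share 100 chars
```

&lt;p&gt;The overlap exists so a sentence that straddles a chunk boundary still appears whole in at least one chunk, which keeps retrieval from missing clauses that happen to sit at a split point.&lt;/p&gt;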



&lt;p&gt;&lt;strong&gt;Step 4: Create Vector Database&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# vectorstore.py
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

def create_vectorstore(chunks):
    """Create searchable database"""
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(chunks, embeddings)
    vectorstore.save_local("policy_index")
    print("✅ Vector store created")
    return vectorstore

# Load existing
def load_vectorstore():
    embeddings = OpenAIEmbeddings()
    return FAISS.load_local(
        "policy_index", 
        embeddings,
        allow_dangerous_deserialization=True
    )

# Usage
vectorstore = create_vectorstore(chunks)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
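&lt;p&gt;What the vector store is doing conceptually: every chunk becomes a vector, and a question is answered by finding the chunks whose vectors point in the most similar direction. Here is a dependency-free sketch using cosine similarity. The three-dimensional vectors are made up for illustration; real OpenAI embeddings have 1536 dimensions, and FAISS exists to do this search fast at scale.&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" (invented for the sketch; real ones come from the model).
chunk_vectors = {
    "Section 4.2: accidental injury coverage": [0.9, 0.1, 0.0],
    "Section 9.1: premium payment schedule": [0.1, 0.9, 0.2],
}
query_vector = [0.85, 0.15, 0.05]  # pretend embedding of "is a dislocation covered?"

best = max(chunk_vectors,
           key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]))
print(best)  # the accidental-injury chunk wins
```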



&lt;p&gt;&lt;strong&gt;Step 5: Create the RAG System&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# rag_system.py
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

def create_rag_system(vectorstore):
    """Create the question-answering system"""

    # Define prompt
    prompt = PromptTemplate(
        input_variables=["context", "question"],
        template="""You are an insurance policy expert.

Answer using ONLY the context below. If you don't know, say so.

Context:
{context}

Question:
{question}

Answer clearly and cite sections."""
    )

    # Create LLM
    llm = ChatOpenAI(
        model="gpt-4-turbo-preview",
        temperature=0
    )

    # Create retriever
    retriever = vectorstore.as_retriever(
        search_kwargs={"k": 5}
    )

    # Build chain
    chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt}
    )

    return chain

# Usage
rag_chain = create_rag_system(vectorstore)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 6: Ask Questions&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# query.py
def ask_question(chain, question):
    """Ask and get answer with sources"""
    result = chain.invoke({"query": question})

    print(f"\n❓ Question: {question}")
    print(f"\n✅ Answer: {result['result']}")
    print(f"\n📚 Sources: {len(result['source_documents'])} sections")

    for i, doc in enumerate(result['source_documents'], 1):
        print(f"\n{i}. Page {doc.metadata.get('page', '?')}")
        print(f"   {doc.page_content[:200]}...")

    return result

# Usage
ask_question(rag_chain, "Is emergency treatment covered for accidents?")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Complete Script (main.py)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# main.py
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Load API key
load_dotenv()

def setup_rag_system(pdf_path):
    """Complete setup in one function"""

    print("📄 Loading policy...")
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()

    print("✂️ Creating chunks...")
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=100
    )
    chunks = splitter.split_documents(documents)

    print("🧠 Creating embeddings...")
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(chunks, embeddings)

    print("🔗 Building RAG chain...")
    prompt = PromptTemplate(
        input_variables=["context", "question"],
        template="""You are an insurance policy expert.

Answer using ONLY the context below. If you don't know, say so.

Context:
{context}

Question:
{question}

Answer clearly and cite policy sections."""
    )

    llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

    chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt}
    )

    print("✅ System ready!\n")
    return chain

def ask(chain, question):
    """Ask a question"""
    result = chain.invoke({"query": question})
    print(f"\n❓ {question}")
    print(f"✅ {result['result']}\n")
    return result

# Run
if __name__ == "__main__":
    # Setup
    chain = setup_rag_system("your_policy.pdf")

    # Ask questions
    ask(chain, "Is shoulder dislocation covered?")
    ask(chain, "What about physiotherapy sessions?")
    ask(chain, "Are there any waiting periods?")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run It&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;python main.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real Output Example&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;$ python main.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;📄 Loading policy...&lt;br&gt;
✂️ Creating chunks...&lt;br&gt;
🧠 Creating embeddings...&lt;br&gt;
🔗 Building RAG chain...&lt;br&gt;
✅ System ready!&lt;/p&gt;

&lt;p&gt;❓ Is shoulder dislocation covered?&lt;br&gt;
✅ Yes, shoulder dislocation from accidental injury is covered under &lt;br&gt;
Section 4.2 (Accidental Injury Coverage). No waiting period applies &lt;br&gt;
for accident-related injuries. Pre-authorization required within 24 &lt;br&gt;
hours for cashless treatment.&lt;/p&gt;

&lt;p&gt;❓ What about physiotherapy sessions?&lt;br&gt;
✅ Physiotherapy is covered under Section 6.4 for post-treatment &lt;br&gt;
rehabilitation. Maximum 10 sessions covered at ₹800 per session. &lt;br&gt;
Must be prescribed by treating doctor.&lt;/p&gt;

&lt;p&gt;❓ Are there any waiting periods?&lt;br&gt;
✅ Yes, standard waiting periods apply: 30 days for specific ailments, &lt;br&gt;
24 months for pre-existing conditions. However, these are WAIVED for &lt;br&gt;
accidental injuries.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Limitations&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What It Does:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Answers factual coverage questions&lt;/li&gt;
&lt;li&gt;✅ Explains waiting periods&lt;/li&gt;
&lt;li&gt;✅ Finds relevant clauses&lt;/li&gt;
&lt;li&gt;✅ Handles Hindi + English questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What It Doesn't Do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Legal advice&lt;/li&gt;
&lt;li&gt;❌ Guarantee claim approval&lt;/li&gt;
&lt;li&gt;❌ Medical diagnosis&lt;/li&gt;
&lt;li&gt;❌ Replace your insurance company&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Always verify critical decisions with your insurer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real Impact&lt;/strong&gt;&lt;br&gt;
After sharing this, 200+ families asked for help with their policies.&lt;br&gt;
&lt;strong&gt;Common discoveries:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Room rent sub-limits nobody knew about&lt;/li&gt;
&lt;li&gt;Physiotherapy coverage never claimed&lt;/li&gt;
&lt;li&gt;Waiting periods wrongly applied&lt;/li&gt;
&lt;li&gt;Cashless hospitals not used&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total saved by our community: ₹23+ lakhs&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next Steps For You&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get your policy PDF&lt;/li&gt;
&lt;li&gt;Get an OpenAI API key (or use free Ollama)&lt;/li&gt;
&lt;li&gt;Run the code above&lt;/li&gt;
&lt;li&gt;Ask questions about YOUR policy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For Developers&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add more insurers (ICICI, HDFC, Care)&lt;/li&gt;
&lt;li&gt;Build a mobile app&lt;/li&gt;
&lt;li&gt;Add a Hindi voice interface&lt;/li&gt;
&lt;li&gt;Create a comparison tool&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For Startups&lt;/strong&gt;&lt;br&gt;
This is a real problem. Millions of families need this. Build it right, help millions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FAQs&lt;/strong&gt;&lt;br&gt;
Q: Do I need coding knowledge?&lt;br&gt;
A: Basic Python. If you can copy-paste, you can run this.&lt;br&gt;
Q: How long does setup take?&lt;br&gt;
A: 30 minutes first time. 2 minutes after that.&lt;br&gt;
Q: Is my policy data safe?&lt;br&gt;
A: The index lives on your computer, but chunk text is sent to OpenAI's API for embeddings and answers. For full privacy, swap in a local model via Ollama.&lt;br&gt;
Q: Can I use this for other documents?&lt;br&gt;
A: Yes! Works for any PDF - legal docs, manuals, research papers.&lt;br&gt;
Q: What if my policy updates?&lt;br&gt;
A: Rerun the setup script. Takes 2 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Success Story&lt;/strong&gt;&lt;br&gt;
After building this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Father's treatment went smoothly&lt;/li&gt;
&lt;li&gt;No confusion, no panic&lt;/li&gt;
&lt;li&gt;Saved ₹30,000 by understanding coverage&lt;/li&gt;
&lt;li&gt;Discovered benefits we didn't know existed&lt;/li&gt;
&lt;li&gt;Helped 200+ other families&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Most importantly: Peace of mind during a medical emergency.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Insurance shouldn't be a mystery box you open during emergencies.&lt;br&gt;
You pay premiums. You deserve to understand what you bought.&lt;br&gt;
This RAG system gives you that understanding—in seconds, in your language, with sources cited.&lt;br&gt;
Build it. Use it. Share it.&lt;br&gt;
Your family will thank you when it matters most.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vectordatabase</category>
      <category>rag</category>
      <category>insurance</category>
    </item>
    <item>
      <title>If You’re Building in 2026, Start Here 📈</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Wed, 21 Jan 2026 11:21:04 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/if-youre-building-in-2026-start-here-a07</link>
      <guid>https://forem.com/jaskirat_singh/if-youre-building-in-2026-start-here-a07</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;&lt;em&gt;20 Open-Source Tools That Actually Moved the Needle&lt;/em&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;2025 was noisy.&lt;/p&gt;

&lt;p&gt;Every week, a new “must-use” open-source tool popped up on GitHub or X. I personally tried 40+ open-source tools across AI infra, LLM ops, developer productivity, and internal tooling—so you don’t have to.&lt;/p&gt;

&lt;p&gt;Most were impressive demos.&lt;br&gt;
Only a few actually shipped value in real projects.&lt;/p&gt;

&lt;p&gt;This blog is about those 20 tools. The ones that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced engineering friction&lt;/li&gt;
&lt;li&gt;Scaled beyond toy use cases&lt;/li&gt;
&lt;li&gt;Worked in production, not just on launch day&lt;/li&gt;
&lt;li&gt;Fit how teams will realistically build products in 2026&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re a builder, founder, or developer working with AI systems, workflows, or modern SaaS stacks—this list is a solid place to start.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to Read This List&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Each tool below includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What it does&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Why it matters in 2026&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Where it fits in real projects&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;No hype. No paid placements. Just practical tools.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Sourcebot&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvgr5pxbbadwo368chxw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvgr5pxbbadwo368chxw.png" alt="Sourcebot" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fast, self-hosted code understanding &amp;amp; search for massive monorepos&lt;/p&gt;

&lt;p&gt;When codebases cross a certain size, grep and IDE search simply stop scaling. Sourcebot gives you semantic, blazing-fast search across large monorepos—while staying self-hosted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
AI-assisted development only works if your tools actually understand your codebase.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.sourcebot.dev/" rel="noopener noreferrer"&gt;https://www.sourcebot.dev/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. LiteLLM (YC W23)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3oeg3vdpnlxp5urjr4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3oeg3vdpnlxp5urjr4d.png" alt="LiteLLM" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One OpenAI-compatible gateway for 100+ LLMs&lt;/p&gt;

&lt;p&gt;LiteLLM abstracts away vendor lock-in. You switch models, providers, and pricing without rewriting your app—while getting logging, rate limits, and cost controls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
2026 will be multi-model by default.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.litellm.ai/" rel="noopener noreferrer"&gt;https://www.litellm.ai/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Langfuse&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ukhx0h7jo0kubw4qhik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ukhx0h7jo0kubw4qhik.png" alt="Langfuse" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tracing, evals, and prompt management for LLM apps&lt;/p&gt;

&lt;p&gt;Langfuse helps you understand why an LLM behaved the way it did. Traces, evaluations, prompt versions—everything you need once prototypes hit production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
You can’t scale what you can’t observe.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://langfuse.com/" rel="noopener noreferrer"&gt;https://langfuse.com/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Infisical&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F580l8izneuocb0101qki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F580l8izneuocb0101qki.png" alt="Infisical" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Open-source secrets &amp;amp; config management&lt;/p&gt;

&lt;p&gt;Finally, a modern alternative to hard-coded env files and duct-taped secret sharing across teams and CI/CD.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
AI apps touch more credentials than traditional apps ever did.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://infisical.com/" rel="noopener noreferrer"&gt;https://infisical.com/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Ollama&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zrl63sb2r65sxle7yaw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zrl63sb2r65sxle7yaw.png" alt="Ollama" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run LLMs locally with a simple CLI&lt;/p&gt;

&lt;p&gt;Ollama makes local inference approachable—for dev, testing, and privacy-sensitive workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
Not every LLM call should hit the cloud.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;https://ollama.com/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;6. Browser Use&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lehw179clbkpru0fv1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lehw179clbkpru0fv1g.png" alt="Browser Use" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let AI agents interact with real websites&lt;/p&gt;

&lt;p&gt;This unlocks agent workflows that actually work on the real web—not just APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
Many real-world systems still don’t have APIs.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://browser-use.com/" rel="noopener noreferrer"&gt;https://browser-use.com/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;7. Mastra&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffid2z3kcyu7ezw5rht18.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffid2z3kcyu7ezw5rht18.png" alt="Mastra" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;TypeScript-first AI primitives (agents, RAG, workflows)&lt;/p&gt;

&lt;p&gt;Mastra feels like what many of us wish early AI frameworks were—composable, typed, and production-minded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
AI infra is moving closer to standard software patterns.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://mastra.ai/" rel="noopener noreferrer"&gt;https://mastra.ai/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;8. Continue&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5wbc6b2m8mgakq4z3ju.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5wbc6b2m8mgakq4z3ju.png" alt="Continue" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Background agents and continuous coding workflows&lt;/p&gt;

&lt;p&gt;Continue blends into developer workflows instead of interrupting them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
AI copilots should assist, not distract.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.continue.dev/" rel="noopener noreferrer"&gt;https://www.continue.dev/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;9. Firecrawl&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88wormuv3oatlj5rdcbk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88wormuv3oatlj5rdcbk.png" alt="Firecrawl" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Turn websites into clean, LLM-ready data&lt;/p&gt;

&lt;p&gt;Scraping is messy. Firecrawl makes it boring—in the best way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
RAG pipelines are only as good as their input data.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.firecrawl.dev/" rel="noopener noreferrer"&gt;https://www.firecrawl.dev/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;10. Onyx&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92hx3rzdvp8dzmyrbb72.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92hx3rzdvp8dzmyrbb72.png" alt="Onyx" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Self-hostable enterprise chat UI with RAG and agents&lt;/p&gt;

&lt;p&gt;Think “internal ChatGPT,” but actually enterprise-ready.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
Companies want control, not just convenience.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.onyx.app/" rel="noopener noreferrer"&gt;https://www.onyx.app/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;11. Trigger.dev&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F069ap0kfwmxvvhr6o2up.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F069ap0kfwmxvvhr6o2up.png" alt="Trigger.dev" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Long-running, reliable AI workflows in TypeScript&lt;/p&gt;

&lt;p&gt;Perfect for agent pipelines that don’t fit into request-response cycles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
AI workflows are asynchronous by nature.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://trigger.dev/" rel="noopener noreferrer"&gt;https://trigger.dev/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;12. ParadeDB&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5aauup82ki1vf9x6jvcu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5aauup82ki1vf9x6jvcu.png" alt="ParadeDB" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Postgres-native search &amp;amp; analytics&lt;/p&gt;

&lt;p&gt;A serious Elasticsearch alternative without leaving Postgres.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
Operational simplicity wins long term.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.paradedb.com/" rel="noopener noreferrer"&gt;https://www.paradedb.com/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;13. Reflex&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ynfg2pw6c7p0ssomvgc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ynfg2pw6c7p0ssomvgc.png" alt="Reflex" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Build full-stack web apps entirely in Python&lt;/p&gt;

&lt;p&gt;Reflex lowers the barrier for AI engineers to ship full products.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
Builders shouldn’t need to context-switch stacks to ship.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://reflex.dev/" rel="noopener noreferrer"&gt;https://reflex.dev/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;14. Tiptap&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kpq6imk19r6u1oc3tu8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kpq6imk19r6u1oc3tu8.png" alt="Tiptap" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Headless editor for Notion-like experiences&lt;/p&gt;

&lt;p&gt;If you’re building collaborative or AI-assisted writing tools, Tiptap is battle-tested.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://tiptap.dev/" rel="noopener noreferrer"&gt;https://tiptap.dev/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;15. GrowthBook&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00lbhscp3ipbf34ub5vx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00lbhscp3ipbf34ub5vx.png" alt="GrowthBook" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Open-source feature flags and A/B testing&lt;/p&gt;

&lt;p&gt;Experimentation without SaaS lock-in.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.growthbook.io/" rel="noopener noreferrer"&gt;https://www.growthbook.io/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;16. Windmill&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3udxqev3hqbr9rbpn42r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3udxqev3hqbr9rbpn42r.png" alt="Windmill" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Turn scripts into apps and workflows&lt;/p&gt;

&lt;p&gt;Windmill is what internal tooling should feel like in 2026.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.windmill.dev/" rel="noopener noreferrer"&gt;https://www.windmill.dev/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;17. LanceDB&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferp5l8jlltztke5dhpvs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferp5l8jlltztke5dhpvs.png" alt="LanceDB" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;High-performance vector DB for billion-scale search&lt;/p&gt;

&lt;p&gt;Fast, efficient, and built for serious scale.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://lancedb.com/" rel="noopener noreferrer"&gt;https://lancedb.com/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;18. Mattermost&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0em82upp9vm8hgnb6a1n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0em82upp9vm8hgnb6a1n.png" alt="Mattermost" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Secure, self-hosted team communication&lt;/p&gt;

&lt;p&gt;Used by organizations where Slack simply isn’t an option.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://mattermost.com/" rel="noopener noreferrer"&gt;https://mattermost.com/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;19. Tesseral&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjr74o7lyttve9w8ss140.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjr74o7lyttve9w8ss140.png" alt="Tesseral" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Open-source IAM for B2B SaaS&lt;/p&gt;

&lt;p&gt;SSO, SCIM, RBAC, audit logs—without reinventing the wheel.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://tesseral.com/" rel="noopener noreferrer"&gt;https://tesseral.com/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;20. Helicone (YC W23)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvtiv9q4yur1fgrakl5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvtiv9q4yur1fgrakl5p.png" alt="Helicone" width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Metrics, traces, and experiment tooling for LLMs&lt;/p&gt;

&lt;p&gt;If you’re serious about LLM performance, Helicone is hard to ignore.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.helicone.ai/" rel="noopener noreferrer"&gt;https://www.helicone.ai/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open source in 2026 won’t be about more tools.&lt;br&gt;
It will be about fewer, composable, production-ready ones.&lt;/p&gt;

&lt;p&gt;Every tool on this list earned its place by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Solving a real problem&lt;/li&gt;
&lt;li&gt;Working under load&lt;/li&gt;
&lt;li&gt;Respecting developer time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;♻️ If this helped, save it, share it, or pass it to someone building serious systems.&lt;/p&gt;

&lt;p&gt;👉 Which tool should I deep-dive next?&lt;br&gt;
Drop a name, a project, or tag someone who’d love this.&lt;/p&gt;

&lt;p&gt;Connect with me on LinkedIn for more content like this: &lt;a href="https://www.linkedin.com/in/jaskiratai" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/jaskiratai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>startup</category>
      <category>development</category>
    </item>
    <item>
      <title>Detecting LLM Hallucinations Through Vector Geometry: A New Approach</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Tue, 20 Jan 2026 18:41:06 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/detecting-llm-hallucinations-through-vector-geometry-a-new-approach-828</link>
      <guid>https://forem.com/jaskirat_singh/detecting-llm-hallucinations-through-vector-geometry-a-new-approach-828</guid>
      <description>&lt;p&gt;Large language models generate convincing text regardless of factual accuracy. They cite nonexistent research papers, invent legal precedents, and state fabrications with the same confidence as verified facts. Traditional hallucination detection relies on using another LLM as a judge—essentially asking a system prone to hallucination whether it's hallucinating. This circular approach has fundamental limitations.&lt;/p&gt;

&lt;p&gt;Recent research reveals a geometric approach to hallucination detection that examines the mathematical structure of text embeddings rather than relying on another model's judgment. This method identifies when responses deviate from learned patterns by analyzing vector relationships in embedding space.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Core Problem With Current Detection Methods&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Most hallucination detection systems employ an LLM-as-judge architecture. You generate a response, then ask another language model to evaluate its accuracy. The problems are obvious: you're using fallible systems to judge themselves, creating recursive uncertainty. The judge model can hallucinate about whether the original response hallucinated.&lt;/p&gt;

&lt;p&gt;This approach also requires additional API calls, increases latency, and scales poorly. For every response requiring verification, you need a second inference pass with comparable computational cost. Enterprise applications processing millions of requests face multiplied infrastructure expenses.&lt;/p&gt;

&lt;p&gt;The fundamental question becomes: can we detect hallucinations from intrinsic properties of the response itself, without external judgment?&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Understanding Embedding Space Structure&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Modern sentence encoders transform text into numerical vectors—points in high-dimensional space where semantically similar content clusters together. This is fundamental to how semantic search and retrieval systems work. But embeddings encode more than simple similarity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Question-Answer Relationship&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you embed a question and its corresponding answer, they occupy different positions in vector space. The displacement between these positions—the vector pointing from question to answer—has both magnitude and direction. For grounded, factual responses within a specific domain, these displacement vectors exhibit remarkable consistency.&lt;/p&gt;

&lt;p&gt;Consider five different questions about molecular biology with accurate answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What organelle produces ATP?" → "Mitochondria produce ATP through cellular respiration"&lt;/li&gt;
&lt;li&gt;"How does oxidative phosphorylation work?" → "Oxidative phosphorylation generates ATP using the electron transport chain"&lt;/li&gt;
&lt;li&gt;"What is the Krebs cycle?" → "The Krebs cycle is a series of reactions producing electron carriers"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When embedded, the displacement vectors from each question to its answer point in roughly parallel directions. The magnitudes vary—some answers are longer or more detailed—but the directional consistency holds. This represents the "grounded response pattern" for this domain.&lt;/p&gt;
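&lt;p&gt;A minimal sketch of this idea, using small made-up vectors in place of real sentence-encoder outputs (the numbers are purely illustrative): two in-domain question-answer pairs whose displacement vectors point the same way.&lt;/p&gt;

```python
import numpy as np

def displacement(q_vec, a_vec):
    # Vector pointing from a question embedding to its answer embedding.
    return a_vec - q_vec

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-d "embeddings" standing in for real encoder outputs.
# Both grounded answers shift their questions in the same direction.
q1, a1 = np.array([0.1, 0.2, 0.5, 0.1]), np.array([0.8, 0.9, 0.6, 0.2])
q2, a2 = np.array([0.2, 0.1, 0.4, 0.3]), np.array([0.9, 0.8, 0.5, 0.4])

d1, d2 = displacement(q1, a1), displacement(q2, a2)
print(cosine(d1, d2))  # near 1.0: the displacements are parallel
```

&lt;p&gt;A hallucinated answer embedded far off this shared direction would drive the cosine well below 1.&lt;/p&gt;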

&lt;p&gt;&lt;strong&gt;When Hallucination Breaks the Pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now consider a hallucinated response to a biology question:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;"What organelle produces ATP?" → "The Golgi apparatus manufactures ATP molecules through photosynthesis"&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This response is fluent, grammatically correct, and structurally resembles a proper answer. But when embedded, the displacement vector points in a fundamentally different direction than the established pattern. The response has strayed from the geometric structure characterizing grounded answers in this domain.&lt;/p&gt;

&lt;p&gt;This is the key insight: hallucinations don't just contain incorrect information—they occupy anomalous positions in embedding space relative to established truth patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Displacement Consistency: Measuring Geometric Alignment&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feix6sbyfnwohnnt4r5lu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feix6sbyfnwohnnt4r5lu.png" alt="Displacement Consistency" width="800" height="781"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Displacement Consistency (DC) metric formalizes this geometric observation into a practical detection method. The process is straightforward:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building the Reference Set&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, construct a collection of verified question-answer pairs from your target domain. This becomes your geometric baseline—the established pattern against which new responses are measured. For a medical chatbot, use medical Q&amp;amp;A pairs. For legal research, use legal examples.&lt;/p&gt;

&lt;p&gt;The reference set size can be modest—approximately 100 examples suffice for most domains. This is a one-time calibration cost performed offline before deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Computing Displacement Consistency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a new question-answer pair requires verification:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Find Neighboring Questions:&lt;/strong&gt; Identify the K nearest questions in the reference set to your new question (typically K=5-10)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calculate Mean Displacement:&lt;/strong&gt; Compute the average displacement direction from these neighboring questions to their verified answers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure Alignment:&lt;/strong&gt; Calculate the cosine similarity between your new displacement vector and this mean direction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Grounded responses align strongly with the reference pattern—DC scores approach 1.0. Hallucinated responses diverge significantly—DC scores drop toward 0.3 or lower.&lt;/p&gt;
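&lt;p&gt;The three steps above can be sketched in a few lines of Python on synthetic data. This is an illustrative implementation, not the paper's reference code—the function name `dc_score` and the synthetic "+1 per axis" domain pattern are ours.&lt;/p&gt;

```python
import numpy as np

def dc_score(q_vec, a_vec, ref_questions, ref_answers, k=5):
    """Displacement Consistency: cosine between the new question-to-answer
    displacement and the mean displacement of the K nearest reference questions."""
    # Step 1: find the K nearest reference questions (Euclidean distance here).
    dists = np.linalg.norm(ref_questions - q_vec, axis=1)
    nearest = np.argsort(dists)[:k]
    # Step 2: mean displacement direction over those neighbors.
    mean_disp = (ref_answers[nearest] - ref_questions[nearest]).mean(axis=0)
    # Step 3: cosine alignment of the new displacement with that direction.
    new_disp = a_vec - q_vec
    return float(np.dot(new_disp, mean_disp)
                 / (np.linalg.norm(new_disp) * np.linalg.norm(mean_disp)))

# Synthetic reference set: grounded answers shift questions by roughly +1 per axis.
rng = np.random.default_rng(0)
ref_q = rng.normal(size=(100, 8))
ref_a = ref_q + 1.0 + rng.normal(scale=0.05, size=(100, 8))

q = rng.normal(size=8)
grounded = q + 1.0       # follows the domain pattern: DC near 1.0
hallucinated = q - 1.0   # reversed direction: DC strongly negative
print(dc_score(q, grounded, ref_q, ref_a))
print(dc_score(q, hallucinated, ref_q, ref_a))
```

&lt;p&gt;In production you would replace the synthetic arrays with real sentence-encoder embeddings and the brute-force nearest-neighbor search with a vector index.&lt;/p&gt;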

&lt;h2&gt;
  
  
  &lt;strong&gt;Why This Works&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The method exploits how contrastive training shapes embedding space. Models learn to map semantically related content to nearby regions. For question-answer pairs, this creates directional consistency: truthful responses move in predictable directions from their questions within specific domains.&lt;/p&gt;

&lt;p&gt;Hallucinated content, while fluent and confident, doesn't respect these learned geometric relationships. The model generates text matching surface patterns (grammar, style, structure) but fails to maintain the deeper geometric consistency that characterizes grounded responses.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Empirical Performance Across Models&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Testing across architecturally diverse embedding models validates whether DC represents a fundamental property or model-specific artifact. Five models with distinct training approaches were evaluated:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural Diversity:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MPNet-based contrastive fine-tuning (all-mpnet-base-v2)&lt;/li&gt;
&lt;li&gt;Weakly-supervised pre-training (E5-large-v2)&lt;/li&gt;
&lt;li&gt;Instruction-tuned with hard negatives (BGE-large-en-v1.5)&lt;/li&gt;
&lt;li&gt;Encoder-decoder adaptation (GTR-T5-large)&lt;/li&gt;
&lt;li&gt;Efficient long-context architecture (nomic-embed-text-v1.5)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Benchmark Results&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DC achieved near-perfect discrimination across multiple established hallucination datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HaluEval-QA:&lt;/strong&gt; Contains LLM-generated hallucinations designed to be subtle and plausible. DC achieved AUROC 1.0 across all five embedding models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HaluEval-Dialogue:&lt;/strong&gt; Tests responses that deviate from conversational context. DC maintained perfect discrimination.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TruthfulQA:&lt;/strong&gt; Evaluates common misconceptions humans frequently believe. DC continued achieving AUROC 1.0.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Comparative Performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Alternative approaches measuring where responses land relative to queries (position-based rather than direction-based) achieved AUROC around 0.70-0.81. The consistent 0.20 AUROC gap across all models demonstrates DC's superior discriminative power.&lt;/p&gt;

&lt;p&gt;Score distributions reveal clear separation: grounded responses cluster tightly around DC values of 0.9, while hallucinations spread around 0.3 with minimal overlap.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Domain Locality Constraint&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Perfect performance within domains masks an important limitation: DC does not transfer across domains. A reference set from legal Q&amp;amp;A cannot detect hallucinations in medical responses—performance degrades to random chance (AUROC ~0.50).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding the Geometric Structure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwxprxab7q36qgondx3z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwxprxab7q36qgondx3z.png" alt="Understanding the Geometric Structure" width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This domain specificity reveals fundamental properties of how embeddings encode grounding. In geometric terms, embedding space resembles a fiber bundle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base Manifold:&lt;/strong&gt; The surface representing all possible questions across all domains&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fibers:&lt;/strong&gt; At each point on this surface, a direction vector indicating where grounded responses should move&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Within any local region (one specific domain), fibers point in consistent directions—this enables DC's strong local performance. But globally, across different domains, fibers point in different directions. There's no universal "truthfulness direction" spanning all possible content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical Implications&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No Universal Grounding Pattern:&lt;/strong&gt; Each domain develops distinct displacement patterns during training. Legal questions and medical questions establish different geometric structures for grounded responses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calibration Requirements:&lt;/strong&gt; Deploying DC requires domain-matched reference sets. A financial services chatbot needs financial examples; a technical support system needs technical documentation examples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-Time Cost:&lt;/strong&gt; Calibration happens offline before deployment. Once established, the reference set enables real-time detection without additional LLM calls.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This finding challenges assumptions about embedding space universality. Models don't learn a single global representation of truthfulness—they learn domain-specific mappings whose disruption signals hallucination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical Implementation Considerations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Deploying geometric hallucination detection involves several engineering decisions that impact effectiveness and operational cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference Set Construction&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Size Requirements:&lt;/strong&gt; Testing shows 100-200 verified examples per domain provide robust baselines. Larger sets improve boundary case handling but deliver diminishing returns beyond 500 examples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality Over Quantity:&lt;/strong&gt; Reference examples must be verified as factually accurate. One hallucinated example in the reference set contaminates the geometric baseline, degrading detection accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain Matching:&lt;/strong&gt; Reference content should align with production queries. Generic examples from unrelated domains contribute noise rather than signal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Computational Efficiency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Offline Costs:&lt;/strong&gt; Reference set embedding happens once during calibration. This one-time cost doesn't impact production latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Online Costs:&lt;/strong&gt; Real-time detection requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Embedding the new question-answer pair (two embedding calls)&lt;/li&gt;
&lt;li&gt;Finding the K nearest neighbors in the reference set (efficient vector search)&lt;/li&gt;
&lt;li&gt;Computing cosine similarities (simple linear algebra)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern vector databases handle nearest neighbor search at scale with sub-millisecond latency. Total detection overhead remains minimal compared to LLM-as-judge approaches requiring full inference passes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integration Patterns&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Post-Generation Filtering:&lt;/strong&gt; Generate responses normally, then apply DC scoring before returning to users. Responses below threshold trigger flagging, human review, or regeneration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence Scoring:&lt;/strong&gt; Surface DC scores alongside responses, letting downstream systems or users assess reliability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Approaches:&lt;/strong&gt; Combine geometric detection with other signals (retrieval confidence, source citation verification) for comprehensive hallucination mitigation.&lt;/li&gt;
&lt;/ul&gt;
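&lt;p&gt;A post-generation gate along these lines might look like the following sketch. The function name `gate_response`, the 0.6 cutoff, and the stub scorer are hypothetical; in practice the scorer would be a real DC computation over your reference set, and the threshold would be tuned on held-out examples.&lt;/p&gt;

```python
def gate_response(question, answer, dc_scorer, threshold=0.6):
    """Post-generation filter: deliver high-DC responses, flag the rest.

    dc_scorer is any callable returning the Displacement Consistency
    score for a (question, answer) pair.
    """
    score = dc_scorer(question, answer)
    if score >= threshold:
        return {"action": "deliver", "answer": answer, "dc": score}
    return {"action": "flag_for_review", "answer": None, "dc": score}

def demo_scorer(question, answer):
    # Hypothetical stand-in for a real DC computation.
    return 0.9 if "mitochondria" in answer.lower() else 0.3

print(gate_response("What organelle produces ATP?",
                    "Mitochondria produce ATP.", demo_scorer))
```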

&lt;h2&gt;
  
  
  &lt;strong&gt;Advantages Over Alternative Methods&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Geometric hallucination detection offers distinct benefits compared to common alternatives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Versus LLM-as-Judge&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No Recursive Uncertainty:&lt;/strong&gt; DC doesn't rely on another LLM's judgment, eliminating circular reasoning where hallucination-prone systems evaluate hallucination.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lower Latency:&lt;/strong&gt; A single embedding pass, versus full text generation from a judge model, reduces response time significantly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost Efficiency:&lt;/strong&gt; Embedding inference costs far less than generative inference. For high-volume applications, the savings compound substantially.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Versus Source-Based Verification&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No Retrieval Required:&lt;/strong&gt; DC operates on response geometry alone, without needing to fetch and verify source documents at inference time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Works With Reasoning:&lt;/strong&gt; Many LLM applications involve synthesis and reasoning beyond simple retrieval. DC can still detect when reasoning outputs hallucinate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simpler Infrastructure:&lt;/strong&gt; No need for document stores, retrieval systems, or citation parsing—just embeddings and vector similarity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Versus Uncertainty Estimation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No Model Internals:&lt;/strong&gt; DC uses standard embeddings without requiring access to model weights, attention patterns, or logit distributions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model-Agnostic:&lt;/strong&gt; Works across any LLM generating text, as long as embeddings can be computed for outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent Performance:&lt;/strong&gt; Uncertainty estimation quality varies significantly across model families. DC performance remains stable across embedding architectures.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations and Open Questions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While geometric detection shows strong empirical results, several limitations and research questions remain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domain Boundary Definition&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Determining domain boundaries for reference set construction lacks clear guidelines. Are "cardiovascular surgery" and "orthopedic surgery" separate domains requiring distinct calibration? Or can a general "medical procedures" reference set serve both?&lt;/p&gt;

&lt;p&gt;Current practice relies on empirical testing: construct candidate reference sets, measure DC performance on held-out examples, and iterate. More principled approaches to domain scoping would improve deployment efficiency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adversarial Robustness&lt;/strong&gt;&lt;br&gt;
Can models be trained to generate hallucinations that maintain geometric consistency? If LLMs explicitly optimize to preserve displacement patterns while fabricating content, does DC remain effective?&lt;/p&gt;

&lt;p&gt;Early exploration suggests this is difficult—maintaining geometric consistency while hallucinating requires coordinating both semantic content and embedding space positioning. But adversarial settings warrant further investigation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-Lingual Performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Testing has focused on English language content. Do displacement patterns transfer across languages? Can multilingual embedding models enable cross-lingual hallucination detection?&lt;/p&gt;

&lt;p&gt;Preliminary evidence suggests language-specific calibration may be necessary, similar to domain-specific calibration. But unified approaches for multilingual systems remain underexplored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temporal Drift&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As language models evolve and embedding models receive updates, do established reference sets remain valid? How frequently does recalibration become necessary?&lt;/p&gt;

&lt;p&gt;Monitoring DC score distributions over time can detect drift, triggering recalibration when performance degrades. But proactive recalibration schedules remain an open operational question.&lt;/p&gt;
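&lt;p&gt;One simple way to operationalize that monitoring: compare the mean DC score of recent traffic against the calibration-time baseline and trigger recalibration on a sustained drop. The `max_drop` cutoff below is an illustrative choice, not a recommendation from the research.&lt;/p&gt;

```python
import statistics

def detect_drift(baseline_scores, recent_scores, max_drop=0.1):
    # Recalibrate when the mean DC of recent traffic falls more than
    # max_drop below the calibration-time baseline.
    drop = statistics.mean(baseline_scores) - statistics.mean(recent_scores)
    return drop > max_drop

baseline = [0.91, 0.88, 0.93, 0.90]  # DC scores recorded at calibration time
recent = [0.72, 0.70, 0.75, 0.71]    # DC scores from current production traffic
print(detect_drift(baseline, recent))  # True: the distribution has shifted
```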

&lt;p&gt;&lt;strong&gt;Research Directions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Several promising avenues extend geometric hallucination detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automatic Domain Discovery&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rather than manually defining domains and constructing reference sets, can unsupervised clustering automatically identify geometric regions in embedding space corresponding to coherent domains? This would enable automated reference set construction from unlabeled data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Domain Calibration&lt;/strong&gt;&lt;br&gt;
Investigating whether mixtures of domain-specific reference sets can provide broader coverage without requiring exact domain matching. Ensemble approaches combining multiple reference sets might improve robustness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation Generation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When DC flags a response as likely hallucinated, providing explanations beyond a numerical score would aid human review. Identifying which reference examples the response deviates from most strongly could highlight specific inconsistencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integration With Retrieval&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Combining geometric detection with retrieval-augmented generation (RAG) could provide complementary hallucination mitigation. RAG grounds responses in retrieved documents; DC verifies the response respects geometric consistency.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Geometric hallucination detection shows that embedding space encodes structured, domain-specific directions linking questions to grounded answers. When these directions break, hallucination is likely.&lt;/p&gt;

&lt;p&gt;This approach is practical: it requires no LLM judge, adds minimal overhead, and achieves near-perfect discrimination within calibrated domains. While calibration is a one-time cost, it enables efficient real-time detection.&lt;/p&gt;

&lt;p&gt;The findings also reshape how we understand embeddings. There is no universal “truthfulness direction”—only local coherence within domains—challenging common assumptions and opening new research directions.&lt;/p&gt;

&lt;p&gt;For production systems, geometric detection complements retrieval, uncertainty estimates, and human review, improving reliability. The geometry was always there; we’re just learning how to read it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
