<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jaskirat Singh</title>
    <description>The latest articles on Forem by Jaskirat Singh (@jaskirat_singh).</description>
    <link>https://forem.com/jaskirat_singh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3718238%2Fa607d550-7735-4135-9309-f8de6f028d7e.png</url>
      <title>Forem: Jaskirat Singh</title>
      <link>https://forem.com/jaskirat_singh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jaskirat_singh"/>
    <language>en</language>
    <item>
      <title>When Your LLM Starts Bleeding Context (And How I Fixed It)</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Mon, 16 Feb 2026 18:36:11 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/when-your-llm-starts-bleeding-context-and-how-i-fixed-it-4jgl</link>
      <guid>https://forem.com/jaskirat_singh/when-your-llm-starts-bleeding-context-and-how-i-fixed-it-4jgl</guid>
      <description>&lt;p&gt;&lt;strong&gt;LLM batch processing failing?&lt;/strong&gt; Learn how context bleeding between data rows tanks accuracy—plus the linearization technique that dropped our error rate to under 15%.&lt;/p&gt;

&lt;p&gt;Here’s the thing about working with LLMs at scale: they’re incredible until they’re not.&lt;/p&gt;

&lt;p&gt;I learned this the hard way while processing thousands of user feedback entries for sentiment analysis. We’re talking serious volume here — the kind that makes individual processing a non-starter from both a time and cost perspective. So naturally, I went with batch processing.&lt;/p&gt;

&lt;p&gt;Smart move, right?&lt;/p&gt;

&lt;p&gt;Wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Day My Accuracy Tanked
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27ygxwjjr2gi0onwnaus.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27ygxwjjr2gi0onwnaus.png" alt="image1" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three weeks into production, I started noticing something weird. Our false positive and false negative rates were climbing. Not catastrophically, but enough to make me nervous.&lt;/p&gt;

&lt;p&gt;The really frustrating part? When I spot-checked the problematic entries by processing them individually, they came back accurate.&lt;/p&gt;

&lt;p&gt;That’s when I knew we had a problem.&lt;/p&gt;

&lt;p&gt;After digging into the data (because of course I did — my academic research background doesn’t let me quit), I found a pattern. When feedback entries followed a continuous sentiment sequence — such as five positive reviews in a row — a sudden negative review would be misclassified as positive.&lt;/p&gt;

&lt;p&gt;And vice versa.&lt;/p&gt;

&lt;p&gt;The numbers were brutal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;False positives jumped &lt;strong&gt;23%&lt;/strong&gt; when negative feedback followed strings of positive reviews
&lt;/li&gt;
&lt;li&gt;False negatives climbed &lt;strong&gt;18%&lt;/strong&gt; in the reverse scenario
&lt;/li&gt;
&lt;li&gt;Overall accuracy dropped from &lt;strong&gt;91% to 76%&lt;/strong&gt; on edge cases
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The LLM wasn’t treating each entry independently. It was &lt;strong&gt;bleeding context across rows&lt;/strong&gt;, letting previous patterns bias current predictions.&lt;/p&gt;

&lt;p&gt;Research shows that LLMs struggle to maintain strict contextual boundaries when processing sequential rows, leading to performance degradation as input length increases.&lt;/p&gt;

&lt;p&gt;This is what people in the industry call &lt;strong&gt;faithfulness hallucination&lt;/strong&gt; — when the model generates content that diverges from the actual input because it’s confused by the surrounding context.&lt;/p&gt;

&lt;p&gt;Not exactly what you want when you're trying to make data-driven decisions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Fixes (That Actually Worked)
&lt;/h2&gt;

&lt;p&gt;I needed solutions yesterday, so here’s what I tried first:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Batch Size Reduction
&lt;/h3&gt;

&lt;p&gt;I slashed our batch size from 32 entries down to 8. Throughput took a hit, but accuracy improved immediately.&lt;/p&gt;

&lt;p&gt;Studies using models like Llama3-70B show that pushing batch sizes beyond 64 often produces diminishing returns. There’s a sweet spot between efficiency and accuracy.&lt;/p&gt;
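&lt;p&gt;If your pipeline already holds the entries in a list, the batch-size change is a tiny helper. A minimal sketch (the helper name and the default of 8 are mine, not from any particular library):&lt;/p&gt;

```python
def make_batches(entries, batch_size=8):
    """Split entries into consecutive batches of at most batch_size items.

    The last batch may be smaller. Shrinking batch_size trades throughput
    for accuracy by giving the model fewer rows to bleed context across.
    """
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]
```

&lt;p&gt;Dropping from 32 to 8 quadruples the number of API calls, so measure the accuracy gain before committing.&lt;/p&gt;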

&lt;h3&gt;
  
  
  2. Prompt Engineering for Independence
&lt;/h3&gt;

&lt;p&gt;I rewrote our system prompt to explicitly hammer home row independence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are analysing user feedback entries independently. 
Each entry must be evaluated solely on its own content 
without influence from previous entries. 

Treat each review as a completely separate analysis task. 
Do not allow patterns from earlier reviews to bias your 
assessment of subsequent reviews.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Did it help? Yes.&lt;br&gt;&lt;br&gt;
Was it enough? Not even close.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Real Fix: Learning from AutoPK
&lt;/h2&gt;

&lt;p&gt;Here’s where things get interesting.&lt;/p&gt;

&lt;p&gt;I started researching how other teams were handling LLMs with structured data, and I came across the AutoPK framework. AutoPK demonstrates that LLMs often fail to preserve spatial and structural relationships when processing raw tables, requiring transformation to explicit key-value representations.&lt;/p&gt;

&lt;p&gt;And honestly?&lt;/p&gt;

&lt;p&gt;It changed everything.&lt;/p&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gcbkakjj6s9km1ge1w6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gcbkakjj6s9km1ge1w6.png" alt="Image2" width="800" height="248"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 1: Linearize Your Data
&lt;/h2&gt;

&lt;p&gt;Instead of feeding the LLM raw tabular rows, I transformed each entry into an explicit key-value format.&lt;/p&gt;
&lt;h3&gt;
  
  
  Before (What Everyone Does)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Row_ID, Review_Text, Sentiment
1, Great product, positive
2, Terrible service, negative
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  After (What Actually Works)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;ENTRY_1 @ Review: "Great product" | Sentiment: positive&amp;gt;
&amp;lt;ENTRY_2 @ Review: "Terrible service" | Sentiment: negative&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Why This Works
&lt;/h3&gt;

&lt;p&gt;By converting each relevant cell into a key-value pair, the transformation abstracts away layout differences. This enables text-based models to more effectively extract information from tabular data.&lt;/p&gt;

&lt;p&gt;You’re shifting the cognitive load from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Spatial reasoning (which LLMs struggle with)
&lt;/li&gt;
&lt;li&gt;✅ Sequential text processing (which they’re actually good at)&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Step 2: Add Few-Shot Examples
&lt;/h2&gt;

&lt;p&gt;I included five examples in the linearized format before the actual task data. The examples showed the model exactly how to handle each entry independently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: Analyze sentiment for each feedback entry independently.

Examples:
&amp;lt;ENTRY_A @ Review: "The interface is intuitive and fast" | Sentiment: positive&amp;gt;
&amp;lt;ENTRY_B @ Review: "Customer support was unhelpful" | Sentiment: negative&amp;gt;
&amp;lt;ENTRY_C @ Review: "Product works as described" | Sentiment: neutral&amp;gt;
&amp;lt;ENTRY_D @ Review: "Exceeded my expectations completely" | Sentiment: positive&amp;gt;
&amp;lt;ENTRY_E @ Review: "Shipping took three weeks" | Sentiment: negative&amp;gt;

Now analyze the following entries:
[Production data follows]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Research on TabLLM shows that this approach can outperform traditional deep-learning methods, particularly in few-shot scenarios with minimal labelled data.&lt;/p&gt;

&lt;p&gt;We saw error rates drop from &lt;strong&gt;60–95% down to under 15%&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Simplify Your Input
&lt;/h2&gt;

&lt;p&gt;I stripped out every column that wasn’t directly relevant to sentiment analysis.&lt;/p&gt;

&lt;p&gt;No timestamps.&lt;br&gt;&lt;br&gt;
No URLs.&lt;br&gt;&lt;br&gt;
No user agents.  &lt;/p&gt;

&lt;p&gt;Just the entry ID and the feedback text.&lt;/p&gt;
&lt;h3&gt;
  
  
  Original Input (Noisy)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;row_id, timestamp, user_id, session_id, feedback_text, page_url, user_agent, sentiment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Cleaned Input
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;entry_id, feedback_text, sentiment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Research confirms that heterogeneous features — ranging from dense numerical to sparse categorical — can confuse models, especially when columns contain information unrelated to the target task.&lt;/p&gt;

&lt;p&gt;This single change reduced hallucination rates by about &lt;strong&gt;12%&lt;/strong&gt;.&lt;/p&gt;
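&lt;p&gt;The column stripping is one line per record. A minimal sketch, assuming records arrive as dicts (the field names mirror the cleaned schema above):&lt;/p&gt;

```python
RELEVANT_FIELDS = ("entry_id", "feedback_text", "sentiment")

def simplify_record(record):
    """Keep only the fields the sentiment task needs; drop timestamps,
    URLs, user agents, and anything else that can distract the model."""
    return {key: record[key] for key in RELEVANT_FIELDS if key in record}
```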


&lt;h2&gt;
  
  
  Step 4: Control the Output Format
&lt;/h2&gt;

&lt;p&gt;I explicitly instructed the model to return results in a structured format using only the provided identifiers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Return results in the following format, using only the entry identifiers provided:

entry_id,predicted_sentiment,confidence
1,positive,0.92
2,negative,0.87
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents the model from inventing data or introducing information that wasn’t in the input.&lt;/p&gt;

&lt;p&gt;It also makes validation way easier.&lt;/p&gt;
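&lt;p&gt;Because the output format is pinned down, validation reduces to parsing CSV and rejecting any identifier you never sent. A sketch using Python's standard csv module (the function name and the silent-drop policy are my own choices):&lt;/p&gt;

```python
import csv
import io

def parse_results(raw_output, valid_ids):
    """Parse the model's CSV reply, keeping only rows whose entry_id
    was actually in the input batch. Rows with invented IDs are dropped."""
    reader = csv.DictReader(io.StringIO(raw_output.strip()))
    results = {}
    for row in reader:
        if row["entry_id"] in valid_ids:
            results[row["entry_id"]] = (row["predicted_sentiment"], float(row["confidence"]))
    return results
```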




&lt;h2&gt;
  
  
  The Trade-Offs Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Look — these solutions aren’t free.&lt;/p&gt;

&lt;p&gt;Here’s what I learned about the cost-performance balance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smaller batch sizes = better accuracy but worse throughput
&lt;/li&gt;
&lt;li&gt;Linearization + few-shot prompting = higher token consumption per entry
&lt;/li&gt;
&lt;li&gt;Individual processing = guaranteed independence but eye-watering API costs
&lt;/li&gt;
&lt;/ul&gt;
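&lt;p&gt;The token side of this trade-off is easy to put numbers on: the fixed prompt (instructions plus few-shot examples) is amortized across the batch, while each entry pays its own tokens. A back-of-envelope sketch (all token counts are illustrative):&lt;/p&gt;

```python
def tokens_per_entry(fixed_prompt_tokens, entry_tokens, batch_size):
    """Average prompt tokens attributed to one entry: the fixed prompt
    is shared by the whole batch, the entry text is not."""
    return fixed_prompt_tokens / batch_size + entry_tokens
```

&lt;p&gt;With a 500-token fixed prompt and 50-token entries, a batch of 32 costs about 66 tokens per entry versus about 113 for a batch of 8. Smaller batches pay the fixed prompt more often.&lt;/p&gt;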

&lt;p&gt;Anthropic reportedly optimised Claude 3 with continuous batching, increasing throughput from 50 to 450 tokens per second while reducing latency and cutting GPU costs by 40%.&lt;/p&gt;

&lt;p&gt;The point is: you need to know what you’re optimising for.&lt;/p&gt;

&lt;p&gt;For our use case — customer feedback analysis where 85–90% accuracy was acceptable — optimized batch processing with linearization hit the sweet spot.&lt;/p&gt;

&lt;p&gt;Your mileage may vary.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Actually Implement This
&lt;/h2&gt;

&lt;p&gt;Here’s my step-by-step process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Establish a baseline by processing a representative sample individually
&lt;/li&gt;
&lt;li&gt;Convert your pipeline to generate key-value formatted entries
&lt;/li&gt;
&lt;li&gt;Create domain-specific examples that cover your edge cases
&lt;/li&gt;
&lt;li&gt;Experiment with batch sizes to find your accuracy-throughput balance
&lt;/li&gt;
&lt;li&gt;Build validation checks comparing batch results to individual processing
&lt;/li&gt;
&lt;li&gt;Monitor continuously for regression over time
&lt;/li&gt;
&lt;/ol&gt;
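&lt;p&gt;Step 5 can be as simple as sampling a slice of each batch, re-running those entries individually, and comparing labels. A minimal sketch (the function name and the drift-alert idea are my own):&lt;/p&gt;

```python
def agreement_rate(batch_preds, individual_preds):
    """Fraction of shared entry IDs where the batched prediction matches
    the individually processed one. Alert when this drifts downward."""
    shared = [k for k in batch_preds if k in individual_preds]
    if not shared:
        return 0.0
    matches = sum(1 for k in shared if batch_preds[k] == individual_preds[k])
    return matches / len(shared)
```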




&lt;h2&gt;
  
  
  Code: Linearization + Prompt Builder
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd


def linearize_feedback(df: pd.DataFrame, id_column: str, text_column: str):
    """
    Convert tabular feedback data into linearized key-value format.
    """
    linearized = []
    for _, row in df.iterrows():
        entry = f"&amp;lt;ENTRY_{row[id_column]} @ Review: \"{row[text_column]}\"&amp;gt;"
        linearized.append(entry)
    return linearized


def create_few_shot_prompt(examples, task_data):
    """
    Construct prompt with few-shot examples and task data.
    """
    prompt = "Task: Analyze sentiment for each feedback entry independently.\n\n"
    prompt += "Examples:\n"

    for entry, label in examples:
        prompt += f"{entry} | Sentiment: {label}\n"

    prompt += "\nNow analyze the following entries:\n"
    prompt += "\n".join(task_data)
    prompt += "\n\nReturn results as CSV: entry_id,predicted_sentiment,confidence"

    return prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What I Wish Someone Had Told Me
&lt;/h2&gt;

&lt;p&gt;Context bleeding isn’t some edge case that only affects massive enterprise deployments.&lt;/p&gt;

&lt;p&gt;If you’re processing structured data with LLMs — feedback, survey responses, support tickets, whatever — you’re probably experiencing this problem right now.&lt;/p&gt;

&lt;p&gt;You just might not know it yet.&lt;/p&gt;

&lt;p&gt;Context errors can lead to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lost essential information
&lt;/li&gt;
&lt;li&gt;Misinterpreted model output
&lt;/li&gt;
&lt;li&gt;Incorrect downstream actions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AutoPK-inspired pipeline approach fundamentally changed how I think about feeding data to LLMs.&lt;/p&gt;

&lt;p&gt;Converting spatial table relationships into explicit textual representations isn’t just a workaround.&lt;/p&gt;

&lt;p&gt;It’s actually aligning with what these models are good at.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;The research community is exploring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attention mechanisms specifically designed for structured data
&lt;/li&gt;
&lt;li&gt;Architectural modifications that enforce entry boundaries
&lt;/li&gt;
&lt;li&gt;Hybrid approaches combining LLM strengths with traditional ML for tabular data
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Continuous batching combines KV caching, chunked prefill, and ragged batching with dynamic scheduling to maximise throughput — but these optimizations focus on throughput rather than accuracy preservation.&lt;/p&gt;

&lt;p&gt;For now, if you’re dealing with batch processing of structured data:&lt;/p&gt;

&lt;p&gt;Start with &lt;strong&gt;linearization + few-shot prompting&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The results speak for themselves.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Anyscale. "Achieve 23x LLM Inference Throughput &amp;amp; Reduce p50 Latency."&lt;br&gt;&lt;br&gt;
&lt;a href="https://www.anyscale.com/blog/continuous-batching-llm-inference" rel="noopener noreferrer"&gt;https://www.anyscale.com/blog/continuous-batching-llm-inference&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hugging Face. "Continuous Batching from First Principles."&lt;br&gt;&lt;br&gt;
&lt;a href="https://huggingface.co/blog/continuous_batching" rel="noopener noreferrer"&gt;https://huggingface.co/blog/continuous_batching&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Chroma Research. "Context Rot: How Increasing Input Tokens Impacts LLM Performance."&lt;br&gt;&lt;br&gt;
&lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;https://research.trychroma.com/context-rot&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ACM Transactions. "A Survey on Hallucination in Large Language Models."&lt;br&gt;&lt;br&gt;
&lt;a href="https://dl.acm.org/doi/10.1145/3703155" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3703155&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Nature. "Detecting Hallucinations in Large Language Models Using Semantic Entropy."&lt;br&gt;&lt;br&gt;
&lt;a href="https://www.nature.com/articles/s41586-024-07421-0" rel="noopener noreferrer"&gt;https://www.nature.com/articles/s41586-024-07421-0&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;arXiv. "AutoPK: Leveraging LLMs and Hybrid Similarity Metric for Advanced Retrieval of Pharmacokinetic Data."&lt;br&gt;&lt;br&gt;
&lt;a href="https://arxiv.org/html/2510.00039" rel="noopener noreferrer"&gt;https://arxiv.org/html/2510.00039&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;arXiv. "Large Language Models on Tabular Data: Prediction, Generation, and Understanding - A Survey."&lt;br&gt;&lt;br&gt;
&lt;a href="https://arxiv.org/html/2402.17944v2" rel="noopener noreferrer"&gt;https://arxiv.org/html/2402.17944v2&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Predibase. "Maximize Zero-Shot LLM Performance on Tabular Data."&lt;br&gt;&lt;br&gt;
&lt;a href="https://predibase.com/blog/getting-the-best-zero-shot-performance-on-your-tabular-data-with-llms" rel="noopener noreferrer"&gt;https://predibase.com/blog/getting-the-best-zero-shot-performance-on-your-tabular-data-with-llms&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Latitude. "Scaling LLMs with Batch Processing: Ultimate Guide."&lt;br&gt;&lt;br&gt;
&lt;a href="https://latitude-blog.ghost.io/blog/scaling-llms-with-batch-processing-ultimate-guide/" rel="noopener noreferrer"&gt;https://latitude-blog.ghost.io/blog/scaling-llms-with-batch-processing-ultimate-guide/&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>softwaredevelopment</category>
      <category>coding</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Boring Truth About Letting AI Write Your Code</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Mon, 02 Feb 2026 17:10:58 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/the-boring-truth-about-letting-ai-write-your-code-4j98</link>
      <guid>https://forem.com/jaskirat_singh/the-boring-truth-about-letting-ai-write-your-code-4j98</guid>
      <description>&lt;p&gt;AI-assisted coding is incredible. It's also incredibly boring.&lt;/p&gt;

&lt;p&gt;There, I said it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Promise vs. The Reality
&lt;/h2&gt;

&lt;p&gt;We're living in the era of "vibe coding" where you describe what you want, hand it off to an AI agent, and watch it build your app while you sip coffee. Tools like Spec Kit, sudocode, and GitHub Copilot are genuinely powerful. They can take specifications and turn side project ideas into actual reality, saving you from the graveyard of purchased domain names that never became anything.&lt;/p&gt;

&lt;p&gt;The technology works. Sometimes too well.&lt;/p&gt;

&lt;p&gt;But here's what nobody talks about: &lt;strong&gt;watching AI code is the modern equivalent of watching paint dry.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Missing Piece: Joy
&lt;/h2&gt;

&lt;p&gt;I've literally dozed off this week watching agents work. Multiple times. The code appears line by line in the editor, technically correct, functionally sound, and completely devoid of any emotional satisfaction.&lt;/p&gt;

&lt;p&gt;There's no "aha!" moment. No problem-solving rush. No "I AM A GENIUS" feeling when you finally crack that tricky algorithm. Just... code appearing. Like magic, except magic is supposed to be exciting.&lt;/p&gt;

&lt;p&gt;The AI companies pitch this as freedom: "Let the computer do the boring work so you can focus on interesting work." But here's the uncomfortable realization: &lt;strong&gt;coding itself isn't the boring work.&lt;/strong&gt; At least not for those of us who genuinely love it.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Vibe Coding Actually Makes Sense
&lt;/h2&gt;

&lt;p&gt;Don't get me wrong—there's a place for this approach:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use AI when you only care about the output.&lt;/strong&gt; Those utility scripts, personal tools, or simple apps where the tech stack doesn't matter and you just want something functional. The projects sitting in your "someday" pile that are finally getting built because AI removed the activation energy barrier.&lt;/p&gt;

&lt;p&gt;For those projects? Fine. Ship it. Get the result. Move on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But don't use it for what you love.&lt;/strong&gt; For apps you're proud of, projects using interesting tech stacks, or anything where the journey matters as much as the destination—drive the development yourself. The experience, the learning, the satisfaction of building something with your own hands (and brain) is irreplaceable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Risk Nobody Mentions
&lt;/h2&gt;

&lt;p&gt;The danger isn't that AI will replace developers. The danger is that developers will outsource so much that they lose their skills—and worse, lose the joy that made them want to code in the first place.&lt;/p&gt;

&lt;p&gt;Programming is problem-solving. It's creative. It's challenging in ways that feel rewarding. When you hand all of that to an AI, you're left with the least interesting part: project management for a robotic employee.&lt;/p&gt;

&lt;h2&gt;
  
  
  It's Just Another Tool
&lt;/h2&gt;

&lt;p&gt;Vibe coding isn't revolutionary. It's not the future of all development. It's simply another tool in the toolbelt—useful for specific situations, boring for everything else.&lt;/p&gt;

&lt;p&gt;I'll keep using it for projects where I genuinely don't care how they're built. I'll keep my coding skills sharp by building the things that matter with my own hands. And I'll accept that watching an AI agent work is about as thrilling as watching my code compile.&lt;/p&gt;

&lt;p&gt;Which is to say: not at all.&lt;/p&gt;

&lt;p&gt;The future where AI does all the "boring work" has arrived. Turns out, the work was never boring to begin with.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The Takeaway:&lt;/strong&gt; Use AI to eliminate friction on projects you don't care about deeply. But protect the work you love. The satisfaction of solving problems yourself isn't a bug in the development process—it's the entire point.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>web3</category>
      <category>software</category>
    </item>
    <item>
      <title>I Texted 'Clean My Desktop' From My Phone. 30 Seconds Later, It Was Done.</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Sun, 01 Feb 2026 18:07:49 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/i-texted-clean-my-desktop-from-my-phone-30-seconds-later-it-was-done-559j</link>
      <guid>https://forem.com/jaskirat_singh/i-texted-clean-my-desktop-from-my-phone-30-seconds-later-it-was-done-559j</guid>
      <description>&lt;p&gt;ChatGPT tells you what to do. This AI actually does it for you.&lt;/p&gt;

&lt;p&gt;While you're still copy-pasting code snippets and following step-by-step instructions from your AI assistant, there's a new breed of AI that's &lt;strong&gt;executing tasks autonomously on your computer&lt;/strong&gt;: organizing your files, researching content, building websites, and managing your digital life while you send commands from your phone.&lt;/p&gt;

&lt;p&gt;Welcome to the era of AI employees, not AI advisors.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Traditional AI Assistants
&lt;/h2&gt;

&lt;p&gt;Let's be honest: ChatGPT and similar tools are incredibly smart, but fundamentally passive. They give you advice, generate text, explain concepts. Then &lt;em&gt;you&lt;/em&gt; have to do the actual work: open applications, move files, write code, execute commands.&lt;/p&gt;

&lt;p&gt;You're still the employee. The AI is just the consultant.&lt;/p&gt;

&lt;p&gt;But what if your AI could actually &lt;em&gt;do&lt;/em&gt; the work?&lt;/p&gt;

&lt;h2&gt;
  
  
  Meet Claudebot: Your AI Employee That Actually Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Claudebot&lt;/strong&gt; (also known as Moldbot due to copyright reasons) flips the traditional AI model on its head. Instead of running in the cloud and chatting politely, it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runs locally on your computer&lt;/strong&gt;, keeping your data completely private&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Executes tasks autonomously&lt;/strong&gt; without you lifting a finger&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accepts commands remotely&lt;/strong&gt; via WhatsApp, Telegram, Discord, or Slack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remembers everything&lt;/strong&gt; you tell it, so you're not repeating instructions constantly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Builds its own tools&lt;/strong&gt; when it doesn't have what it needs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it as hiring an AI assistant that actually sits at your computer and gets work done while you're on the couch texting instructions from your phone.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Can It Actually Do?
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. Real-world examples:&lt;/p&gt;

&lt;h3&gt;
  
  
  File Organization That Actually Happens
&lt;/h3&gt;

&lt;p&gt;Text from your phone: &lt;em&gt;"Organize my external SSD"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Claudebot scans hundreds of messy files, proposes a logical structure, deletes junk, and reorganizes everything, all while you're nowhere near your computer. No more "I'll clean that up this weekend" promises to yourself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Content Research on Autopilot
&lt;/h3&gt;

&lt;p&gt;Need a video script analyzing AI trends? Here's what happened in one test:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Command sent via Telegram: "Analyze top LinkedIn AI posts, my personal LinkedIn history, YouTube AI videos, and Twitter news, then merge everything into one script"&lt;/li&gt;
&lt;li&gt;Claudebot autonomously opened browsers, scraped data, analyzed patterns, and generated a comprehensive script&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total time: 13 minutes&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Try doing that manually. You'd still be on tab 47 of your browser research session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Website Development While You Sleep
&lt;/h3&gt;

&lt;p&gt;"Redesign my landing page inspired by Apple's style using Replit"&lt;/p&gt;

&lt;p&gt;Claudebot analyzed Apple's website, reviewed the existing site, generated code, and created a working preview. Was it perfect? No: about 80% complete, with some context loss. But that's 80% of the work done autonomously while you did literally nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works: The Technical Magic
&lt;/h2&gt;

&lt;p&gt;Unlike cloud-based AI that sends your data to external servers, Claudebot operates entirely on your machine. You control it through messaging apps you already use daily.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The workflow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install Claudebot on your computer (ideally a secondary machine for safety)&lt;/li&gt;
&lt;li&gt;Connect it to your preferred AI model (Anthropic's Claude recommended)&lt;/li&gt;
&lt;li&gt;Link your messaging app (Telegram, WhatsApp, etc.)&lt;/li&gt;
&lt;li&gt;Send text commands from anywhere&lt;/li&gt;
&lt;li&gt;Claudebot executes tasks in real-time&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The AI doesn't just follow rigid scripts. It adapts, problem-solves, and even creates new capabilities when needed. Can't do something? It builds the tool to do it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Privacy Advantage
&lt;/h2&gt;

&lt;p&gt;Here's something ChatGPT can't offer: &lt;strong&gt;complete data privacy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Because Claudebot runs locally, your files, projects, and sensitive information never leave your machine. You get the power of advanced AI without uploading your entire digital life to the cloud.&lt;/p&gt;

&lt;p&gt;For developers, creators, and anyone handling confidential work, this is game-changing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Catch (Because There's Always One)
&lt;/h2&gt;

&lt;p&gt;Let's be real: this isn't plug-and-play for everyone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You'll need:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Basic command-line familiarity&lt;/li&gt;
&lt;li&gt;An API key setup (through Anthropic or other providers)&lt;/li&gt;
&lt;li&gt;A secondary computer for installation (don't risk your primary machine)&lt;/li&gt;
&lt;li&gt;Patience with occasional context loss during complex tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The risks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This AI has real control over your computer&lt;/li&gt;
&lt;li&gt;Memory limitations can cause it to "forget" mid-task&lt;/li&gt;
&lt;li&gt;Complex automations might need human intervention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's powerful, but with power comes responsibility. This isn't an AI you install on your main work laptop without understanding what it can do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Should Use This?
&lt;/h2&gt;

&lt;p&gt;Claudebot shines for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Content creators&lt;/strong&gt; drowning in research and organization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developers&lt;/strong&gt; automating repetitive file management and code tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remote workers&lt;/strong&gt; who need tasks executed while away from their desk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anyone tired&lt;/strong&gt; of being their AI's assistant instead of the other way around&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're comfortable with technology and want an AI that actually does work rather than describing work, this is your tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future Is Already Here
&lt;/h2&gt;

&lt;p&gt;We've spent years asking AI for advice. Now AI is asking us, "What do you want done?"&lt;/p&gt;

&lt;p&gt;Claudebot represents a fundamental shift: from AI assistants that &lt;em&gt;inform&lt;/em&gt; to AI workers that &lt;em&gt;execute&lt;/em&gt;. It's not perfect, it's not risk-free, and it's definitely not for everyone.&lt;/p&gt;

&lt;p&gt;But for those ready to hand real tasks to AI, not just questions, it's the most practical implementation of autonomous AI assistance available today.&lt;/p&gt;

&lt;p&gt;And it's completely free to use.&lt;/p&gt;

&lt;p&gt;The question isn't whether AI will eventually control our computers. It's whether you're ready to let it start now.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Setup Overview
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Visit the Claudebot/Moldbot website&lt;/li&gt;
&lt;li&gt;Download for your OS (macOS, Linux, or Windows)&lt;/li&gt;
&lt;li&gt;Run the installation script in Terminal&lt;/li&gt;
&lt;li&gt;Choose quick start mode&lt;/li&gt;
&lt;li&gt;Connect your AI provider (Anthropic Claude recommended)&lt;/li&gt;
&lt;li&gt;Link your messaging app (Telegram works great)&lt;/li&gt;
&lt;li&gt;Start sending commands remotely&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Install on a secondary machine first. Test carefully. Then decide if you want this level of AI autonomy in your workflow.&lt;/p&gt;

&lt;p&gt;The future of AI isn't just smarter responses. It's AI that actually does the job.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>agents</category>
      <category>agentaichallenge</category>
    </item>
    <item>
      <title>AI Tools Are Cute. AI Agents Are Dangerous</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Fri, 30 Jan 2026 18:52:00 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/ai-tools-are-cute-ai-agents-are-dangerous-30ip</link>
      <guid>https://forem.com/jaskirat_singh/ai-tools-are-cute-ai-agents-are-dangerous-30ip</guid>
      <description>&lt;p&gt;For the last few years, the internet has been obsessed with AI tools.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tools that write emails&lt;/li&gt;
&lt;li&gt;Tools that generate images&lt;/li&gt;
&lt;li&gt;Tools that autocomplete code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They made us faster — but they didn’t fundamentally change how work gets done.&lt;/p&gt;

&lt;p&gt;2026 marks a different shift.&lt;/p&gt;

&lt;p&gt;This is the year AI stops waiting for instructions and starts &lt;strong&gt;acting on your behalf&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  2025 Was About Discovery. 2026 Is About Delegation.
&lt;/h2&gt;

&lt;p&gt;In 2025, AI agents quietly crossed a threshold.&lt;/p&gt;

&lt;p&gt;Search trends spiked.&lt;br&gt;&lt;br&gt;
  Developer tools exploded.&lt;br&gt;&lt;br&gt;
  Startups pivoted from “AI assistant” to “AI agent.”&lt;/p&gt;

&lt;p&gt;Yet most people missed what was actually happening.&lt;/p&gt;

&lt;p&gt;They treated agents like smarter chatbots.&lt;/p&gt;

&lt;p&gt;That misunderstanding is why many still don’t see what’s coming next.&lt;/p&gt;




&lt;h2&gt;
  
  
  What an AI Agent Really Is (No Buzzwords)
&lt;/h2&gt;

&lt;p&gt;An AI agent is not a chatbot.&lt;/p&gt;

&lt;p&gt;An AI agent is a system that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accept a high-level goal
&lt;/li&gt;
&lt;li&gt;Decompose it into multiple steps
&lt;/li&gt;
&lt;li&gt;Choose tools and actions autonomously
&lt;/li&gt;
&lt;li&gt;Observe outcomes
&lt;/li&gt;
&lt;li&gt;Adjust its strategy
&lt;/li&gt;
&lt;li&gt;Continue until the goal is complete
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don’t guide it step by step.&lt;br&gt;
  You define success.&lt;/p&gt;

&lt;p&gt;Everything else is delegated.&lt;/p&gt;
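&lt;p&gt;The loop above can be sketched in a few lines of Python. Everything here (the planner, the tools, the goal) is a toy stand-in for illustration, not any specific framework’s API:&lt;/p&gt;

```python
# Minimal agent loop: plan the next step, act, observe, repeat until done.
# The planner and tools below are toy stand-ins, not a real agent framework.

def run_agent(goal, tools, max_steps=10):
    """Drive tool calls until the goal's planner reports completion."""
    state = {"log": []}
    for _ in range(max_steps):
        action = goal["plan"](state)            # decompose: what's next?
        if action is None:                      # goal complete
            return state
        outcome = tools[action](state)          # choose and execute a tool
        state["log"].append((action, outcome))  # observe the result
        # adjust: the next plan() call sees the updated state
    return state

# Toy goal: gather facts, then summarize, in order.
def plan(state):
    done = [action for action, _ in state["log"]]
    for needed in ("search", "extract", "summarize"):
        if needed not in done:
            return needed
    return None

tools = {
    "search": lambda s: "found 3 sources",
    "extract": lambda s: "pulled key figures",
    "summarize": lambda s: "wrote summary",
}

result = run_agent({"plan": plan}, tools)
```

&lt;p&gt;A real agent replaces the hard-coded planner with an LLM call, but the shape of the loop is the same: plan, act, observe, repeat.&lt;/p&gt;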




&lt;h2&gt;
  
  
  Tools Assist. Agents Decide.
&lt;/h2&gt;

&lt;p&gt;Traditional AI tools follow a familiar loop:&lt;/p&gt;

&lt;p&gt;You prompt.&lt;br&gt;&lt;br&gt;
  The tool responds.&lt;br&gt;&lt;br&gt;
  You decide what to do next.&lt;/p&gt;

&lt;p&gt;AI agents collapse that loop.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You give a goal.&lt;/li&gt;
&lt;li&gt;The agent plans the workflow.&lt;/li&gt;
&lt;li&gt;The agent executes actions.&lt;/li&gt;
&lt;li&gt;The agent handles failures.&lt;/li&gt;
&lt;li&gt;The agent delivers an outcome.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not convenience.&lt;/p&gt;

&lt;p&gt;This is a new interaction model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why AI Agents Suddenly Work in 2026
&lt;/h2&gt;

&lt;p&gt;Agents existed before — but they failed quietly.&lt;/p&gt;

&lt;p&gt;What changed?&lt;/p&gt;

&lt;h3&gt;
  
  
  Models Became Fast Enough to Think While Acting
&lt;/h3&gt;

&lt;p&gt;Latency dropped.&lt;br&gt;
  Reasoning improved.&lt;br&gt;
  Long-context memory became reliable.&lt;/p&gt;

&lt;p&gt;Agents can now reflect, correct, and continue without constant human intervention.&lt;/p&gt;

&lt;h3&gt;
  
  
  Software Became Actionable
&lt;/h3&gt;

&lt;p&gt;Browsers, calendars, CRMs, file systems, APIs — everything became permissioned and callable.&lt;/p&gt;

&lt;p&gt;Agents no longer simulate actions.&lt;br&gt;
  They perform them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Work Became Workflow-Heavy
&lt;/h3&gt;

&lt;p&gt;Modern work isn’t one task.&lt;/p&gt;

&lt;p&gt;It’s dozens of interconnected micro-tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Research
&lt;/li&gt;
&lt;li&gt;Switching tabs
&lt;/li&gt;
&lt;li&gt;Copying data
&lt;/li&gt;
&lt;li&gt;Formatting outputs
&lt;/li&gt;
&lt;li&gt;Following up
&lt;/li&gt;
&lt;li&gt;Tracking progress
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agents thrive in exactly this environment.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Mental Shift Most People Haven’t Made
&lt;/h2&gt;

&lt;p&gt;Most users still ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What can this AI do?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Advanced users ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What should never require my attention again?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This shift — from capability thinking to delegation thinking —&lt;br&gt;
  separates casual AI users from leverage builders.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where AI Agents Are Already Replacing Human Effort
&lt;/h2&gt;

&lt;p&gt;Not jobs.&lt;/p&gt;

&lt;p&gt;Friction.&lt;/p&gt;

&lt;p&gt;AI agents are already handling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-step web research and data extraction
&lt;/li&gt;
&lt;li&gt;Slide decks and report generation
&lt;/li&gt;
&lt;li&gt;Inbox triage and calendar coordination
&lt;/li&gt;
&lt;li&gt;File organization and cleanup
&lt;/li&gt;
&lt;li&gt;Repetitive browser workflows
&lt;/li&gt;
&lt;li&gt;Monitoring tasks and follow-ups
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The human role moves up the stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Isn’t Traditional Automation
&lt;/h2&gt;

&lt;p&gt;Old automation was brittle.&lt;/p&gt;

&lt;p&gt;One broken selector.&lt;br&gt;
  One changed page.&lt;br&gt;
  One unexpected input.&lt;/p&gt;

&lt;p&gt;Everything failed.&lt;/p&gt;

&lt;p&gt;AI agents adapt.&lt;/p&gt;

&lt;p&gt;They retry.&lt;br&gt;
  They choose alternate paths.&lt;br&gt;
  They ask for clarification.&lt;br&gt;
  They escalate only when needed.&lt;/p&gt;

&lt;p&gt;This isn’t scripting.&lt;/p&gt;

&lt;p&gt;It’s situational decision-making.&lt;/p&gt;
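&lt;p&gt;That adaptive behavior can be sketched as: try a primary path, fall back to alternates, and escalate only when everything fails. The paths here are stand-in callables, not a real automation API:&lt;/p&gt;

```python
# Sketch of adapt-and-escalate behavior: attempt each path in order,
# record failures, and hand off to a human only when all paths fail.
# The "paths" are toy callables standing in for real tool invocations.

def run_with_fallbacks(paths, escalate):
    errors = []
    for name, action in paths:
        try:
            return {"path": name, "result": action()}
        except Exception as exc:         # observe the failure, adjust strategy
            errors.append((name, str(exc)))
    return escalate(errors)              # only when every path has failed

def primary():
    raise RuntimeError("selector changed")   # the brittle-automation case

def alternate():
    return "data fetched via API"

outcome = run_with_fallbacks(
    [("scrape", primary), ("api", alternate)],
    escalate=lambda errs: {"path": "human", "errors": errs},
)
```

&lt;p&gt;Old-school scripting stops at the first broken selector; this shape keeps going and asks for help last, not first.&lt;/p&gt;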




&lt;h2&gt;
  
  
  The Real Value: Cognitive Offloading
&lt;/h2&gt;

&lt;p&gt;Speed is a side effect.&lt;/p&gt;

&lt;p&gt;The real value is mental bandwidth.&lt;/p&gt;

&lt;p&gt;Agents absorb:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context switching
&lt;/li&gt;
&lt;li&gt;Progress tracking
&lt;/li&gt;
&lt;li&gt;Failure recovery
&lt;/li&gt;
&lt;li&gt;Repetitive decisions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Humans regain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus
&lt;/li&gt;
&lt;li&gt;Judgment
&lt;/li&gt;
&lt;li&gt;Creativity
&lt;/li&gt;
&lt;li&gt;Strategic thinking
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why agents feel less like tools&lt;br&gt;
  and more like leverage.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Most People Will Use AI Agents Wrong
&lt;/h2&gt;

&lt;p&gt;At first, people will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Over-control every step
&lt;/li&gt;
&lt;li&gt;Micromanage execution
&lt;/li&gt;
&lt;li&gt;Panic when agents pause
&lt;/li&gt;
&lt;li&gt;Avoid granting permissions
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agents don’t fail because they’re weak.&lt;/p&gt;

&lt;p&gt;They fail because humans don’t yet know how to delegate.&lt;/p&gt;

&lt;p&gt;Clear goals.&lt;br&gt;
  Clear constraints.&lt;br&gt;
  Then trust.&lt;/p&gt;




&lt;h2&gt;
  
  
  Careers Are Quietly Being Reshaped
&lt;/h2&gt;

&lt;p&gt;In 2026, advantage won’t come from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Knowing more tools
&lt;/li&gt;
&lt;li&gt;Writing better prompts
&lt;/li&gt;
&lt;li&gt;Memorizing workflows
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It will come from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Designing systems
&lt;/li&gt;
&lt;li&gt;Knowing what to automate
&lt;/li&gt;
&lt;li&gt;Knowing what must remain human
&lt;/li&gt;
&lt;li&gt;Thinking in outcomes, not steps
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a mindset shift, not a skill upgrade.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;AI agents are not replacing humans.&lt;/p&gt;

&lt;p&gt;They are replacing to-do lists.&lt;/p&gt;

&lt;p&gt;In 2026, delegation won’t feel optional.&lt;br&gt;
  It will feel obvious.&lt;/p&gt;

&lt;p&gt;And the people who learn to think in agents —&lt;br&gt;
  not tools —&lt;br&gt;
  will move faster with less effort than everyone else.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>aitools</category>
      <category>programming</category>
    </item>
    <item>
      <title>If Your Code Works on the First Try, There’s a Massive Mistake Somewhere</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Thu, 29 Jan 2026 18:24:25 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/if-your-code-works-on-the-first-try-theres-a-massive-mistake-somewhere-1h8o</link>
      <guid>https://forem.com/jaskirat_singh/if-your-code-works-on-the-first-try-theres-a-massive-mistake-somewhere-1h8o</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckx9qjiikxp9jqv2xakk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckx9qjiikxp9jqv2xakk.png" alt="Meme" width="590" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I read this line somewhere on Reddit and it stuck with me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If your code works on the first try, there is a massive mistake somewhere.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It sounds like a joke, but every experienced developer knows there’s a painful truth hidden inside it.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;When code works on the first run, it usually means you tested only the happy path. Missing edge cases, untested assumptions, environment mismatches, or silent failures are often hiding underneath. First-try success should trigger skepticism, not celebration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Quote Resonates With Developers
&lt;/h2&gt;

&lt;p&gt;Every developer has lived through this moment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code runs perfectly on your machine
&lt;/li&gt;
&lt;li&gt;You push it
&lt;/li&gt;
&lt;li&gt;Production breaks in ways you never imagined
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The quote isn’t saying good engineers write broken code. It’s saying real software lives in messy environments, not in ideal inputs and local setups.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why First-Try Success Is Suspicious
&lt;/h2&gt;

&lt;p&gt;If your code works immediately, one or more of these is likely true:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You only tested the happy path
&lt;/li&gt;
&lt;li&gt;Inputs were overly clean or hard-coded
&lt;/li&gt;
&lt;li&gt;Error handling was never exercised
&lt;/li&gt;
&lt;li&gt;The environment matches your local machine too closely
&lt;/li&gt;
&lt;li&gt;Concurrency and load were never tested
&lt;/li&gt;
&lt;li&gt;Latency and failure scenarios were ignored
&lt;/li&gt;
&lt;li&gt;Success criteria were too shallow
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Passing once proves almost nothing.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Simple Example
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime

def parse_date(s):
    return datetime.strptime(s, "%Y-%m-%d")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This works perfectly for:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026-01-23
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;But what about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;23-01-2026
&lt;/li&gt;
&lt;li&gt;2026/01/23
&lt;/li&gt;
&lt;li&gt;Jan 23 2026
&lt;/li&gt;
&lt;li&gt;2026-02-30
&lt;/li&gt;
&lt;li&gt;Different locales or time zones
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First-run success only proves your input matched your assumption.&lt;/p&gt;
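&lt;p&gt;One way to harden the parser against exactly those inputs (a sketch, one approach among many; the format list is an assumption about what callers actually send):&lt;/p&gt;

```python
from datetime import datetime

# Formats we choose to accept; anything else is rejected, not guessed at.
FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%Y/%m/%d", "%b %d %Y")

def parse_date(s):
    """Try known formats; return None instead of crashing on bad input."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(s.strip(), fmt)
        except ValueError:
            continue
    return None

assert parse_date("2026-01-23") is not None
assert parse_date("23-01-2026") is not None
assert parse_date("2026/01/23") is not None
assert parse_date("Jan 23 2026") is not None
assert parse_date("2026-02-30") is None   # invalid calendar date, rejected
```

&lt;p&gt;Note that the impossible date still fails, but now it fails explicitly, on purpose, instead of blowing up three layers away.&lt;/p&gt;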




&lt;h2&gt;
  
  
  The “Try to Break It” Mindset
&lt;/h2&gt;

&lt;p&gt;When code works immediately, do this next:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add unit tests for the happy path
&lt;/li&gt;
&lt;li&gt;Add tests for invalid and edge inputs
&lt;/li&gt;
&lt;li&gt;Run property-based or fuzz testing
&lt;/li&gt;
&lt;li&gt;Test against real dependencies, not mocks
&lt;/li&gt;
&lt;li&gt;Simulate production-like environments
&lt;/li&gt;
&lt;li&gt;Introduce latency and partial failures
&lt;/li&gt;
&lt;li&gt;Run concurrent requests
&lt;/li&gt;
&lt;li&gt;Verify logging and error reporting
&lt;/li&gt;
&lt;li&gt;Add static analysis and type checks
&lt;/li&gt;
&lt;li&gt;Get a second set of eyes in code review
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you can’t break it, write a test that proves why.&lt;/p&gt;
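&lt;p&gt;Step 3 from the list above, in miniature: a tiny fuzz loop that feeds random strings to a parser and asserts it never raises. Real projects would reach for a property-testing library like Hypothesis; this hand-rolled version just shows the idea:&lt;/p&gt;

```python
import random
import string
from datetime import datetime

def safe_parse(s):
    """The contract under test: never raise, return None on bad input."""
    try:
        return datetime.strptime(s, "%Y-%m-%d")
    except (ValueError, TypeError):
        return None

random.seed(0)   # fixed seed so a failure is reproducible
for _ in range(1000):
    n = random.randint(0, 30)
    garbage = "".join(random.choice(string.printable) for _ in range(n))
    safe_parse(garbage)   # must not raise for any input
```

&lt;p&gt;A thousand random inputs won’t prove correctness, but they surface the crashes a single happy-path run never will.&lt;/p&gt;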




&lt;h2&gt;
  
  
  Why Production Is Where Bugs Are Born
&lt;/h2&gt;

&lt;p&gt;Production introduces things your laptop never will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network delays
&lt;/li&gt;
&lt;li&gt;Partial outages
&lt;/li&gt;
&lt;li&gt;Race conditions
&lt;/li&gt;
&lt;li&gt;Different CPU architectures
&lt;/li&gt;
&lt;li&gt;Locale and encoding differences
&lt;/li&gt;
&lt;li&gt;Unexpected user behavior
&lt;/li&gt;
&lt;li&gt;Scale and concurrency
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Code that works once in isolation hasn’t earned trust yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  When First-Try Success Is Actually Fine
&lt;/h2&gt;

&lt;p&gt;There are exceptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pure deterministic functions
&lt;/li&gt;
&lt;li&gt;Well-specified algorithms
&lt;/li&gt;
&lt;li&gt;One-off scripts
&lt;/li&gt;
&lt;li&gt;Throwaway experiments
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the moment code becomes reusable, shared, or deployed, it needs tests and defensive thinking.&lt;/p&gt;




&lt;h2&gt;
  
  
  Turning First-Run Success Into Confidence
&lt;/h2&gt;

&lt;p&gt;A healthy workflow looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Code works on first run
&lt;/li&gt;
&lt;li&gt;Add tests that assert that behavior
&lt;/li&gt;
&lt;li&gt;Add tests that try to break it
&lt;/li&gt;
&lt;li&gt;Automate those tests in CI
&lt;/li&gt;
&lt;li&gt;Observe behavior in staging
&lt;/li&gt;
&lt;li&gt;Monitor and log in production
&lt;/li&gt;
&lt;li&gt;Iterate when reality disagrees
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Confidence comes from evidence, not luck.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;That Reddit quote isn’t pessimistic — it’s pragmatic.&lt;/p&gt;

&lt;p&gt;When your code works on the first try, don’t assume you’re done. Assume you haven’t looked hard enough yet. The best engineers aren’t the ones whose code works instantly — they’re the ones who expect it to fail and prepare for it anyway.&lt;/p&gt;

&lt;p&gt;If your code survives your attempts to break it, then you can trust it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>coding</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>Why Is Everyone Suddenly Talking About Voice Agents in 2026?</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Wed, 28 Jan 2026 18:20:32 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/why-is-everyone-suddenly-talking-about-voice-agents-in-2026-4fim</link>
      <guid>https://forem.com/jaskirat_singh/why-is-everyone-suddenly-talking-about-voice-agents-in-2026-4fim</guid>
      <description>&lt;p&gt;If you’ve been anywhere near AI conversations in 2026, one thing is impossible to ignore: &lt;strong&gt;voice agents are everywhere&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;From contact centers and sales calls to internal support desks and consumer apps, businesses are racing to deploy AI-powered voice agents. This isn’t just another AI hype cycle. What we’re seeing now is the result of multiple technologies finally maturing at the same time—models, infrastructure, latency, and real-world trust.&lt;/p&gt;

&lt;p&gt;Voice AI has entered its &lt;strong&gt;second phase of evolution&lt;/strong&gt;. And 2026 is shaping up to be the year it becomes mainstream.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR — Why Voice Agents Matter Now
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AI voice agents are no longer robotic scripts; they’re conversational, adaptive, and emotion-aware
&lt;/li&gt;
&lt;li&gt;Real-time personalization and sentiment detection are changing customer experience
&lt;/li&gt;
&lt;li&gt;Omnichannel continuity is becoming the default expectation
&lt;/li&gt;
&lt;li&gt;Businesses are adopting voice AI to reduce costs, scale faster, and improve CSAT
&lt;/li&gt;
&lt;li&gt;2026 marks a maturity point where voice AI is finally production-ready
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Big Shift: From Automation to Conversational Intelligence
&lt;/h2&gt;

&lt;p&gt;For years, voice bots were little more than glorified IVRs. They followed scripts, failed on edge cases, and frustrated users more than they helped.&lt;/p&gt;

&lt;p&gt;That era is over.&lt;/p&gt;

&lt;p&gt;Modern voice agents are powered by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large Language Models (LLMs)&lt;/li&gt;
&lt;li&gt;Advanced speech-to-text and text-to-speech systems&lt;/li&gt;
&lt;li&gt;Context-aware dialogue management&lt;/li&gt;
&lt;li&gt;Real-time analytics and feedback loops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of rigid decision trees, today’s voice agents understand &lt;strong&gt;intent, context, and flow&lt;/strong&gt;. They can handle interruptions, clarify ambiguities, and adapt their responses dynamically—much closer to how humans communicate.&lt;/p&gt;

&lt;p&gt;This shift from scripted automation to conversational intelligence is the core reason voice AI is back in the spotlight.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why 2026 Is the Inflection Point for Voice AI
&lt;/h2&gt;

&lt;p&gt;Several forces converged to make 2026 a turning point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency dropped enough for real-time conversations&lt;/li&gt;
&lt;li&gt;Models became fast, cheap, and accurate enough for voice&lt;/li&gt;
&lt;li&gt;Enterprises demanded measurable ROI from AI investments&lt;/li&gt;
&lt;li&gt;Customer expectations rose dramatically post-chatbot era&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More than 80% of contact center leaders now rank AI-driven productivity as a top priority. According to industry forecasts, AI agents could unlock hundreds of billions of dollars in economic value over the next few years through cost savings and revenue growth.&lt;/p&gt;

&lt;p&gt;Voice AI is no longer experimental. It’s operational.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Rise of Next-Gen AI Voice Agents
&lt;/h2&gt;

&lt;p&gt;Next-generation voice agents are fundamentally different from their predecessors.&lt;/p&gt;

&lt;p&gt;They are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context-aware across long conversations&lt;/li&gt;
&lt;li&gt;Adaptive to user behavior and tone&lt;/li&gt;
&lt;li&gt;Self-improving through learning loops&lt;/li&gt;
&lt;li&gt;Integrated deeply into business systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These agents don’t just answer questions—they &lt;strong&gt;participate in conversations&lt;/strong&gt;. They understand when a user is confused, frustrated, or satisfied, and adjust accordingly.&lt;/p&gt;

&lt;p&gt;This evolution turns voice AI from a support tool into a true engagement partner.&lt;/p&gt;




&lt;h2&gt;
  
  
  From Scripted Bots to Real Conversations
&lt;/h2&gt;

&lt;p&gt;Traditional voice bots relied on predefined flows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If user says X, respond with Y&lt;/li&gt;
&lt;li&gt;If intent unclear, repeat menu options&lt;/li&gt;
&lt;li&gt;Escalate early to human agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern voice agents do the opposite.&lt;/p&gt;

&lt;p&gt;They:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interpret intent probabilistically&lt;/li&gt;
&lt;li&gt;Maintain conversational memory&lt;/li&gt;
&lt;li&gt;Adjust responses in real time&lt;/li&gt;
&lt;li&gt;Handle complex, multi-turn dialogues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This enables more natural, efficient interactions and dramatically reduces call handling time and escalation rates.&lt;/p&gt;




&lt;h2&gt;
  
  
  Emotion and Empathy Are No Longer Optional
&lt;/h2&gt;

&lt;p&gt;One of the biggest breakthroughs in voice AI is &lt;strong&gt;emotion recognition&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;By analyzing tone, pace, pauses, and sentiment, voice agents can infer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frustration&lt;/li&gt;
&lt;li&gt;Confusion&lt;/li&gt;
&lt;li&gt;Urgency&lt;/li&gt;
&lt;li&gt;Satisfaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More importantly, they can respond empathetically—slowing down, changing tone, or escalating when necessary.&lt;/p&gt;

&lt;p&gt;This transforms voice interactions from transactional to human-centric, helping businesses build trust instead of frustration.&lt;/p&gt;
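&lt;p&gt;As a toy illustration, the escalation logic might look like this. The signal names and thresholds are invented for the sketch, not taken from any real voice-AI product:&lt;/p&gt;

```python
# Toy tone-based response selection. Signals are assumed to be scores in
# [0, 1] from an upstream emotion-recognition step; thresholds are made up.

def choose_response(signals):
    if signals.get("frustration", 0) > 0.7:
        return "escalate_to_human"
    if signals.get("confusion", 0) > 0.5:
        return "slow_down_and_rephrase"
    if signals.get("urgency", 0) > 0.5:
        return "skip_pleasantries"
    return "continue_normally"

assert choose_response({"frustration": 0.9}) == "escalate_to_human"
assert choose_response({"confusion": 0.6}) == "slow_down_and_rephrase"
assert choose_response({}) == "continue_normally"
```

&lt;p&gt;Production systems blend these signals continuously rather than applying hard cutoffs, but the principle is the same: the inferred emotion drives the next move.&lt;/p&gt;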




&lt;h2&gt;
  
  
  Multilingual and Accent-Adaptive Voice Agents
&lt;/h2&gt;

&lt;p&gt;Global businesses can no longer afford language barriers.&lt;/p&gt;

&lt;p&gt;Modern voice agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Switch languages mid-call&lt;/li&gt;
&lt;li&gt;Adapt to regional accents&lt;/li&gt;
&lt;li&gt;Understand dialect variations&lt;/li&gt;
&lt;li&gt;Maintain accuracy across geographies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t just about translation—it’s about &lt;strong&gt;inclusive communication&lt;/strong&gt;. Accent-adaptive AI prevents misinterpretation, reduces bias, and creates a more equitable customer experience.&lt;/p&gt;




&lt;h2&gt;
  
  
  5 Voice Agent Trends Defining 2026
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Trend 1: Generative AI for Real-Time Personalization
&lt;/h3&gt;

&lt;p&gt;Voice agents now generate responses dynamically using customer context, history, and intent. This leads to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher first-call resolution (FCR)&lt;/li&gt;
&lt;li&gt;Lower average handle time (AHT)&lt;/li&gt;
&lt;li&gt;More personalized customer journeys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every call becomes unique.&lt;/p&gt;




&lt;h3&gt;
  
  
  Trend 2: Omnichannel Voice Experiences
&lt;/h3&gt;

&lt;p&gt;Customers move seamlessly between voice, chat, and digital channels. Voice agents maintain context across all of them.&lt;/p&gt;

&lt;p&gt;This continuity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improves retention&lt;/li&gt;
&lt;li&gt;Reduces repetition&lt;/li&gt;
&lt;li&gt;Increases operational efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Omnichannel is no longer a feature—it’s an expectation.&lt;/p&gt;




&lt;h3&gt;
  
  
  Trend 3: Voice Analytics and Sentiment Tracking
&lt;/h3&gt;

&lt;p&gt;Every call becomes a data source.&lt;/p&gt;

&lt;p&gt;Voice AI now tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sentiment changes&lt;/li&gt;
&lt;li&gt;Emotional triggers&lt;/li&gt;
&lt;li&gt;Escalation patterns&lt;/li&gt;
&lt;li&gt;Behavioral trends&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These insights feed back into business strategy, enabling proactive CX improvements.&lt;/p&gt;




&lt;h3&gt;
  
  
  Trend 4: Data Privacy and Ethical Voice AI
&lt;/h3&gt;

&lt;p&gt;Trust is becoming a competitive advantage.&lt;/p&gt;

&lt;p&gt;Customers expect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transparency in data usage&lt;/li&gt;
&lt;li&gt;Compliance with global regulations&lt;/li&gt;
&lt;li&gt;Ethical AI behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Privacy-by-design and explainable AI are now mandatory for enterprise adoption.&lt;/p&gt;




&lt;h3&gt;
  
  
  Trend 5: Self-Learning Voice Agents
&lt;/h3&gt;

&lt;p&gt;Voice agents improve with every interaction.&lt;/p&gt;

&lt;p&gt;Using reinforcement learning, they:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handle more complex cases over time&lt;/li&gt;
&lt;li&gt;Reduce human intervention&lt;/li&gt;
&lt;li&gt;Continuously optimize outcomes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This leads to massive operational cost reductions and faster scaling without additional headcount.&lt;/p&gt;




&lt;h2&gt;
  
  
  Business Impact: Why Companies Are Investing Heavily
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Faster Resolution and Higher Satisfaction
&lt;/h3&gt;

&lt;p&gt;Personalized, real-time conversations reduce friction and boost CSAT, NPS, and customer lifetime value.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cost Efficiency and Workforce Optimization
&lt;/h3&gt;

&lt;p&gt;Routine interactions are automated, freeing human agents to focus on complex, emotional, or high-value cases.&lt;/p&gt;

&lt;p&gt;The result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower cost per interaction&lt;/li&gt;
&lt;li&gt;Higher agent productivity&lt;/li&gt;
&lt;li&gt;Better workforce utilization&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Global, 24/7 Availability
&lt;/h3&gt;

&lt;p&gt;Voice AI eliminates time zone limitations, ensuring consistent service availability worldwide.&lt;/p&gt;

&lt;p&gt;This directly translates to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher retention&lt;/li&gt;
&lt;li&gt;Reduced missed opportunities&lt;/li&gt;
&lt;li&gt;Expanded market reach&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Turning Conversations into Insights
&lt;/h3&gt;

&lt;p&gt;Voice AI doesn’t just handle calls—it extracts intelligence.&lt;/p&gt;

&lt;p&gt;Businesses use call data to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improve products&lt;/li&gt;
&lt;li&gt;Refine marketing&lt;/li&gt;
&lt;li&gt;Optimize customer journeys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every conversation becomes a strategic input.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges Ahead for Voice AI
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Accent Bias and Inclusivity
&lt;/h3&gt;

&lt;p&gt;As voice AI scales globally, ensuring fair and accurate recognition across accents and dialects is critical.&lt;/p&gt;

&lt;p&gt;Inclusivity will define the next wave of innovation.&lt;/p&gt;




&lt;h3&gt;
  
  
  Preserving the Human Touch
&lt;/h3&gt;

&lt;p&gt;Automation must enhance—not replace—human empathy.&lt;/p&gt;

&lt;p&gt;The future belongs to hybrid systems where AI and humans collaborate seamlessly.&lt;/p&gt;




&lt;h3&gt;
  
  
  Regulation, Transparency, and Trust
&lt;/h3&gt;

&lt;p&gt;Compliance is just the baseline. Transparency and explainability will differentiate leaders from laggards.&lt;/p&gt;

&lt;p&gt;Ethical voice AI will become a brand value, not just a technical requirement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts: Why Voice Agents Are the Conversation of 2026
&lt;/h2&gt;

&lt;p&gt;Voice agents are no longer a novelty. They are becoming core infrastructure for customer engagement.&lt;/p&gt;

&lt;p&gt;What changed?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The technology matured&lt;/li&gt;
&lt;li&gt;The business case became undeniable&lt;/li&gt;
&lt;li&gt;Customer expectations evolved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In 2026, voice AI isn’t about replacing humans—it’s about &lt;strong&gt;scaling empathy, intelligence, and efficiency&lt;/strong&gt; at the same time.&lt;/p&gt;

&lt;p&gt;And that’s why everyone is talking about it.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>agentaichallenge</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>vLLM Explained: How PagedAttention Makes LLMs Faster and Cheaper</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Mon, 26 Jan 2026 17:37:01 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/vllm-explained-how-pagedattention-makes-llms-faster-and-cheaper-785</link>
      <guid>https://forem.com/jaskirat_singh/vllm-explained-how-pagedattention-makes-llms-faster-and-cheaper-785</guid>
      <description>&lt;p&gt;Picture this: you're firing up a large language model (LLM) for your chatbot app, and bam—your GPU memory is toast. Half of it sits idle because of fragmented key-value (KV) caches from all those user queries piling up. Requests queue up, latency spikes, and you're burning cash on extra hardware just to keep things running.&lt;/p&gt;

&lt;p&gt;Sound familiar?&lt;/p&gt;

&lt;p&gt;That’s the pain of traditional LLM inference, and it’s a headache for developers everywhere.&lt;/p&gt;

&lt;p&gt;Enter &lt;strong&gt;vLLM&lt;/strong&gt;, an open-source serving engine that acts like a smart memory manager for your LLMs. At its heart is &lt;strong&gt;PagedAttention&lt;/strong&gt;, a clever technique that pages memory the same way an operating system does—dramatically reducing waste and boosting throughput.&lt;/p&gt;

&lt;p&gt;In this post, we’ll dive deep into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where vLLM came from&lt;/li&gt;
&lt;li&gt;How its core technology works&lt;/li&gt;
&lt;li&gt;Key features that make it shine&lt;/li&gt;
&lt;li&gt;Real-world wins&lt;/li&gt;
&lt;li&gt;How to get started in minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whether you're building a production API or just experimenting locally, vLLM makes LLM serving &lt;strong&gt;easy, fast, and cheap&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is vLLM?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2jeek1lil40fl0tsbag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2jeek1lil40fl0tsbag.png" alt="What Is vLLM" width="800" height="929"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;vLLM is a high-throughput, open-source library designed specifically for serving LLMs at scale. Think of it as the turbocharged engine under the hood of your AI inference server.&lt;/p&gt;

&lt;p&gt;It handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Concurrent request processing&lt;/li&gt;
&lt;li&gt;GPU memory optimization&lt;/li&gt;
&lt;li&gt;High-throughput token generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At a glance, vLLM provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A production-ready server compatible with OpenAI’s API specification&lt;/li&gt;
&lt;li&gt;A Python inference engine for custom integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real magic comes from &lt;strong&gt;continuous batching&lt;/strong&gt; and &lt;strong&gt;PagedAttention&lt;/strong&gt;, which allow vLLM to pack far more requests onto a GPU than traditional inference engines. No more static batches waiting on slow prompts—requests flow in and out dynamically.&lt;/p&gt;

&lt;p&gt;I’ve used vLLM myself to deploy LLaMA models, and it turned a sluggish single-A100 setup into a system pushing over &lt;strong&gt;100 tokens per second&lt;/strong&gt;. If you’re tired of memory babysitting or horizontal scaling just to serve a handful of users, vLLM is your fix.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Quick History of vLLM
&lt;/h2&gt;

&lt;p&gt;vLLM emerged in 2023 from the UC Berkeley Sky Computing Lab. It was introduced in the paper:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Authored by Woosuk Kwon and team, the paper identified a massive inefficiency in LLM inference: &lt;strong&gt;KV cache fragmentation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;KV caches—temporary stores of attention keys and values—consume huge amounts of GPU memory during autoregressive generation. Traditional serving engines fragment this memory, leading to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Out-of-memory errors&lt;/li&gt;
&lt;li&gt;Underutilized GPUs&lt;/li&gt;
&lt;li&gt;Poor scalability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PagedAttention was introduced as the solution, borrowing ideas from virtual memory systems in operating systems.&lt;/p&gt;

&lt;p&gt;The idea quickly gained traction. GitHub stars skyrocketed, community adoption exploded, and by mid-2023, vLLM was powering production workloads at startups and large tech companies alike.&lt;/p&gt;

&lt;p&gt;By 2026, vLLM supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100+ models (LLaMA 3.1, Mistral, and more)&lt;/li&gt;
&lt;li&gt;Distributed serving with Ray&lt;/li&gt;
&lt;li&gt;Custom kernels and hardware optimizations&lt;/li&gt;
&lt;li&gt;Multi-LoRA and adapter-based serving&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s a rare example of an academic idea scaling cleanly into real-world production.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Technology: PagedAttention and Continuous Batching
&lt;/h2&gt;

&lt;h3&gt;
  
  
  PagedAttention
&lt;/h3&gt;

&lt;p&gt;PagedAttention is a non-contiguous memory management system for KV caches.&lt;/p&gt;

&lt;p&gt;In standard Transformer inference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KV caches grow with sequence length&lt;/li&gt;
&lt;li&gt;Memory becomes fragmented&lt;/li&gt;
&lt;li&gt;New requests fail despite free memory existing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PagedAttention fixes this by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Splitting KV caches into fixed-size pages (e.g., 16 tokens)&lt;/li&gt;
&lt;li&gt;Storing them in a shared memory pool&lt;/li&gt;
&lt;li&gt;Tracking them using a logical-to-physical mapping table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The attention kernel gathers only the required pages at runtime—no massive copies, no fragmentation.&lt;/p&gt;

&lt;p&gt;It’s essentially &lt;strong&gt;virtual memory for GPUs&lt;/strong&gt;.&lt;/p&gt;
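&lt;p&gt;The mechanism is easier to see in code. Here is a toy allocator illustrating the page-table idea: fixed-size pages drawn from a shared pool, mapped per sequence. This is a sketch of the concept only, not vLLM’s actual CUDA implementation; the pool size and sequence ids are invented.&lt;/p&gt;

```python
# Toy paged KV-cache allocator: fixed-size pages from a shared pool,
# with a per-sequence logical-to-physical page table.
PAGE_SIZE = 16  # tokens per page (vLLM uses a similar block size)

class PagedKVCache:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))   # shared physical pool
        self.page_tables = {}                      # seq_id to list of physical pages
        self.token_counts = {}                     # seq_id to tokens stored

    def append_token(self, seq_id):
        # A new physical page is grabbed only when the current one fills up,
        # so memory grows in small fixed steps with no large contiguous blocks.
        # (The real scheduler also handles the pool running empty.)
        count = self.token_counts.get(seq_id, 0)
        if count % PAGE_SIZE == 0:
            self.page_tables.setdefault(seq_id, []).append(self.free_pages.pop())
        self.token_counts[seq_id] = count + 1

    def free_sequence(self, seq_id):
        # Finished sequences return their pages to the pool immediately.
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.token_counts.pop(seq_id, None)

cache = PagedKVCache(num_pages=8)
for _ in range(20):                          # 20 tokens span two 16-token pages
    cache.append_token("req-A")
print(len(cache.page_tables["req-A"]))       # 2 pages mapped
cache.free_sequence("req-A")
print(len(cache.free_pages))                 # all 8 pages free again
```

&lt;p&gt;The attention kernel then gathers a sequence’s pages through its table at runtime, which is exactly the virtual-memory analogy at work.&lt;/p&gt;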

&lt;h3&gt;
  
  
  Continuous Batching
&lt;/h3&gt;

&lt;p&gt;Traditional batching waits for a full batch to complete. If one request is slow, everything stalls.&lt;/p&gt;

&lt;p&gt;vLLM uses &lt;strong&gt;continuous batching&lt;/strong&gt;, where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Completed sequences drop out mid-batch&lt;/li&gt;
&lt;li&gt;New requests immediately take their place&lt;/li&gt;
&lt;li&gt;GPU utilization stays consistently high&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Analogy:&lt;br&gt;&lt;br&gt;
Old batching is fixed seating in a restaurant.&lt;br&gt;&lt;br&gt;
vLLM is flexible seating—tables free up instantly when diners leave.&lt;/p&gt;

&lt;p&gt;The result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory utilization jumps from 30–50% to over 90%&lt;/li&gt;
&lt;li&gt;Throughput improves by 2–4x&lt;/li&gt;
&lt;li&gt;Latency becomes predictable&lt;/li&gt;
&lt;/ul&gt;
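&lt;p&gt;The scheduling idea fits in a few lines. This is a deliberately simplified event loop, not vLLM’s real scheduler (which also accounts for KV-cache pages and token budgets); request names and lengths are made up.&lt;/p&gt;

```python
from collections import deque

# Toy continuous-batching loop: each step decodes one token for every
# active sequence; finished sequences leave and queued ones join at once.
def run(waiting, max_batch):
    active = {}            # request id to tokens still needed
    queue = deque(waiting)
    steps = 0
    while active or queue:
        # Admit new requests the moment a slot frees up, mid-flight
        while queue and len(active) != max_batch:
            rid, remaining = queue.popleft()
            active[rid] = remaining
        # One decode step advances every active sequence by one token
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]    # slot is free on the very next step
        steps += 1
    return steps

# b needs 10 tokens, a and c need 2 each; capacity is 2 sequences.
# Static batching would run [a, b] for 10 steps, then c for 2 (12 total);
# continuous batching slots c into a's freed seat and finishes in 10.
print(run([("a", 2), ("b", 10), ("c", 2)], max_batch=2))   # 10
```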


&lt;h2&gt;
  
  
  Key Features That Make vLLM Stand Out
&lt;/h2&gt;

&lt;p&gt;vLLM isn’t just about PagedAttention—it’s packed with production-grade features.&lt;/p&gt;
&lt;h3&gt;
  
  
  Quantization Support
&lt;/h3&gt;

&lt;p&gt;Supports AWQ, GPTQ, FP8, and more. This reduces memory usage by 2–4x with minimal quality loss.&lt;/p&gt;

&lt;p&gt;I’ve personally run a 70B model on two 40GB GPUs—something that was impossible before.&lt;/p&gt;
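&lt;p&gt;The back-of-envelope math shows why. These figures count weights only; the KV cache and activations add real overhead on top, so treat them as approximations.&lt;/p&gt;

```python
# Approximate weight memory for a 70B-parameter model.
params = 70e9

fp16_gb = params * 2 / 1e9     # FP16: 2 bytes per parameter
int4_gb = params * 0.5 / 1e9   # 4-bit (AWQ/GPTQ-style): half a byte each

print(int(fp16_gb))   # 140 GB of weights: too big for two 40 GB GPUs
print(int(int4_gb))   # 35 GB of weights: fits, with headroom for KV cache
```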
&lt;h3&gt;
  
  
  Tensor Parallelism
&lt;/h3&gt;

&lt;p&gt;Seamlessly shards models across multiple GPUs using tensor parallelism. Scaling is near-linear up to 8+ GPUs.&lt;/p&gt;
&lt;h3&gt;
  
  
  Speculative Decoding
&lt;/h3&gt;

&lt;p&gt;Uses a smaller draft model to propose tokens, which the main model verifies. This can double generation speed for interactive workloads.&lt;/p&gt;
&lt;h3&gt;
  
  
  Prefix Caching
&lt;/h3&gt;

&lt;p&gt;Reuses KV caches for repeated prompts, ideal for chatbots and RAG pipelines with static system prompts.&lt;/p&gt;
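&lt;p&gt;The win is easy to quantify with a toy cache keyed on the prompt prefix. This sketches the idea only; vLLM’s prefix caching works at the level of KV-cache blocks, not strings, and the "expensive step" below is a stand-in for real attention computation.&lt;/p&gt;

```python
# Toy prefix cache: KV computation for a shared system prompt runs once
# and is reused by every request that starts with the same prefix.
cache = {}
computed = []          # records how often the expensive step actually runs

def kv_for_prefix(prefix):
    key = hash(prefix)
    if key not in cache:
        computed.append(prefix)          # stand-in for real attention math
        cache[key] = "kv(%s)" % prefix
    return cache[key]

SYSTEM = "You are a helpful assistant."
for user_msg in ["hi", "why is the sky blue?", "summarize this"]:
    kv = kv_for_prefix(SYSTEM)           # cache hit after the first request

print(len(computed))   # 1: prefix KV computed once across three requests
```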

&lt;p&gt;Additional features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI-compatible API&lt;/li&gt;
&lt;li&gt;Streaming responses&lt;/li&gt;
&lt;li&gt;JSON mode and tool calling&lt;/li&gt;
&lt;li&gt;Vision-language model support (e.g., LLaVA)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These optimizations stack. In benchmarks combining quantization, speculative decoding, and PagedAttention, vLLM exceeds &lt;strong&gt;500 tokens/sec on H100 GPUs&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  Benefits and Use Cases
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Why Use vLLM?
&lt;/h3&gt;

&lt;p&gt;Benchmarks consistently show:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2–4x higher throughput&lt;/li&gt;
&lt;li&gt;24–48% memory savings&lt;/li&gt;
&lt;li&gt;Significantly lower infrastructure costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On ShareGPT-style workloads, vLLM serves over &lt;strong&gt;2x more requests per second&lt;/strong&gt; than standard Hugging Face pipelines.&lt;/p&gt;
&lt;h3&gt;
  
  
  Common Use Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Production APIs with OpenAI compatibility&lt;/li&gt;
&lt;li&gt;Retrieval-Augmented Generation pipelines&lt;/li&gt;
&lt;li&gt;Serving fine-tuned LoRAs and adapters&lt;/li&gt;
&lt;li&gt;On-prem or edge deployments&lt;/li&gt;
&lt;li&gt;Research and large-scale evaluation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A fintech team I know replaced TGI with vLLM for fraud detection. Throughput doubled, costs dropped 40%, and multi-tenancy became trivial.&lt;/p&gt;


&lt;h2&gt;
  
  
  Getting Started in 5 Minutes
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install vllm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;h3&gt;
  
  
  Start the Server
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vllm serve meta-llama/Llama-2-7b-hf --port 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your API is now live at:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://localhost:8000/v1/chat/completions" rel="noopener noreferrer"&gt;http://localhost:8000/v1/chat/completions&lt;/a&gt;&lt;/p&gt;
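&lt;p&gt;Because the server speaks the OpenAI chat-completions schema, any OpenAI-compatible client works by pointing its base URL at &lt;code&gt;http://localhost:8000/v1&lt;/code&gt;. The sketch below only builds the request body (the model name mirrors the serve command above); actually sending it requires the running server.&lt;/p&gt;

```python
import json

# Build an OpenAI-style chat-completions request body for the local server.
def chat_request(model, user_message, max_tokens=100):
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }

body = chat_request("meta-llama/Llama-2-7b-hf", "Why is the sky blue?")
print(json.dumps(body, indent=2))
# POST this JSON to http://localhost:8000/v1/chat/completions
```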

&lt;h3&gt;
  
  
  Python Inference Example
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from vllm import LLM, SamplingParams

# Load the model; vLLM sets up weights, KV-cache paging, and the scheduler
llm = LLM(model="meta-llama/Llama-2-7b-hf")
prompts = ["Hello, world!", "Why is the sky blue?"]

# Sampling settings applied to every prompt in the batch
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=100
)

# generate() batches the prompts internally via continuous batching
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Batching, memory management, and scheduling are all handled automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  vLLM vs Other Serving Engines
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Memory Efficiency&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Ease of Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;vLLM&lt;/td&gt;
&lt;td&gt;High (PagedAttention)&lt;/td&gt;
&lt;td&gt;2–4x faster&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hugging Face TGI&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Standard&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TensorRT-LLM&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Very High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;vLLM offers the best balance of performance, usability, and flexibility—without vendor lock-in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;vLLM transforms LLM deployment from a resource-hungry problem into a scalable and affordable solution.&lt;/p&gt;

&lt;p&gt;With:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PagedAttention&lt;/strong&gt; eliminating memory waste&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continuous batching&lt;/strong&gt; maximizing GPU utilization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced features&lt;/strong&gt; like quantization and speculative decoding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;vLLM is quickly becoming the backbone of modern LLM serving.&lt;/p&gt;

&lt;p&gt;Whether you’re a solo developer or running large-scale AI infrastructure, vLLM makes high-performance inference accessible. Clone the repository, experiment locally, and keep an eye on upcoming releases—this space is moving fast.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
    <item>
      <title>How I Explained LLMs, SLMs &amp; VLMs at Microsoft</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Sun, 25 Jan 2026 06:17:47 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/how-i-explained-llms-slms-vlms-at-microsoft-3d5h</link>
      <guid>https://forem.com/jaskirat_singh/how-i-explained-llms-slms-vlms-at-microsoft-3d5h</guid>
      <description>&lt;h2&gt;
  
  
  Why This Talk Mattered
&lt;/h2&gt;

&lt;p&gt;I recently had the opportunity to present my thoughts on &lt;strong&gt;LLMs, SLMs, and VLMs&lt;/strong&gt; at the Microsoft office during a community event. This wasn’t just another AI talk filled with buzzwords and hype. The goal was simple but powerful: help students and professionals understand &lt;strong&gt;why not all AI models are built the same—and why that’s actually a good thing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This blog is a written walkthrough of that presentation. I’ll be embedding the same slides I used and expanding on the thinking behind them—&lt;strong&gt;what I wanted the audience to feel, question, and take back with them&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 1: Not All AI Models Are Built the Same — And That’s the Point
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6mjzq5khzoj015sgn7k.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6mjzq5khzoj015sgn7k.jpg" alt="Not All AI Models Are Built the Same" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key idea:&lt;/strong&gt; AI diversity is a feature, not a flaw.&lt;/p&gt;

&lt;p&gt;Most AI conversations start with the wrong question:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Which model is the best?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I wanted to flip that narrative early and replace it with a better one:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Which model fits the problem we are actually trying to solve?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That framing sets the foundation for everything that follows.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 2: Who Am I and Why This Perspective Matters
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc46syo59p5c5iu2i6d39.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc46syo59p5c5iu2i6d39.jpg" alt="who i am" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before diving into models, I briefly introduced myself and my background—working as an &lt;strong&gt;AI Data Scientist&lt;/strong&gt;, building &lt;strong&gt;SaaS products&lt;/strong&gt;, publishing research, and deploying &lt;strong&gt;production-grade AI systems&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This mattered because the talk wasn’t theoretical. It was grounded in &lt;strong&gt;real-world AI&lt;/strong&gt;, where &lt;strong&gt;cost, latency, privacy, and infrastructure constraints&lt;/strong&gt; are non-negotiable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 3: What Even Are LLMs?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vhmede1vptvhy64d3gv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vhmede1vptvhy64d3gv.jpg" alt="What Even Are LLMs" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffz977qb7zw7oucwqa1hf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffz977qb7zw7oucwqa1hf.jpg" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Large Language Models (LLMs)&lt;/strong&gt; are neural networks trained on massive datasets. They represent the &lt;strong&gt;most powerful and versatile AI systems available today&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In simple terms, LLMs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contain &lt;strong&gt;billions to trillions of parameters&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;transformer architectures&lt;/strong&gt; with attention mechanisms&lt;/li&gt;
&lt;li&gt;Can &lt;strong&gt;reason, generate text, write code, translate languages, and analyze data&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples include &lt;strong&gt;GPT-style models, Claude, Gemini, and LLaMA-based systems&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 4: Why the AI Landscape Is Not One-Size-Fits-All
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmron5t25q3hgopcytc0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmron5t25q3hgopcytc0.jpg" alt="Why the AI Landscape Is Not One-Size-Fits-All" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I used a &lt;strong&gt;smartphone analogy&lt;/strong&gt; to make this intuitive.&lt;/p&gt;

&lt;p&gt;Just like we have &lt;strong&gt;flagship phones, budget phones, and specialized devices&lt;/strong&gt;, AI models exist for different needs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLMs&lt;/strong&gt; are the heavyweights
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLMs&lt;/strong&gt; are the efficiency experts
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VLMs&lt;/strong&gt; are the multimodal specialists
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different tools exist because &lt;strong&gt;different problems demand different trade-offs&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 5–6: What LLMs Can Do Really Well
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faijznywk4683axuoe46e.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faijznywk4683axuoe46e.jpg" alt="What LLMs Can Do Really Well" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LLMs are incredibly versatile. They can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate &lt;strong&gt;long-form content&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Perform &lt;strong&gt;complex reasoning&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Write and debug &lt;strong&gt;code across multiple languages&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Translate between &lt;strong&gt;dozens of languages&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Hold &lt;strong&gt;natural, human-like conversations&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Assist with &lt;strong&gt;deep research and analysis&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where most of the AI hype comes from—and rightly so.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 7: The LLM Trade-Offs No One Talks About Enough
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rt9hb2n78ez9q4xhs5i.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rt9hb2n78ez9q4xhs5i.jpg" alt="The LLM Trade-Offs No One Talks About Enough" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All that power comes at a cost.&lt;/p&gt;

&lt;p&gt;Running LLMs is like &lt;strong&gt;driving a Ferrari&lt;/strong&gt;. It’s impressive, but not always practical.&lt;/p&gt;

&lt;p&gt;Real-world limitations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;High computational requirements&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Expensive inference costs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Higher latency&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Heavy cloud dependency&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Significant energy consumption&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where many production systems start to struggle.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 8: Enter SLMs — The Efficiency Experts
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkixrih7dx8tlsb0j78f8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkixrih7dx8tlsb0j78f8.jpg" alt="Enter SLMs — The Efficiency Experts" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Small Language Models (SLMs)&lt;/strong&gt; are often underestimated, but they are having their moment.&lt;/p&gt;

&lt;p&gt;SLMs are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Smaller and more focused&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Optimized for &lt;strong&gt;specific tasks&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fast and cost-efficient&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Capable of running on &lt;strong&gt;phones, laptops, and edge devices&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They are designed for &lt;strong&gt;practicality, not bragging rights&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 9: SLMs Are Not Weak, They Are Strategic
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qocq3eor16q74dulcc1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qocq3eor16q74dulcc1.jpg" alt="SLMs Are Not Weak, They Are Strategic" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I highlighted several modern SLMs to make this concrete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Phi-3 (Microsoft)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gemini Nano&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mistral 7B&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TinyLLaMA&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These models prove that &lt;strong&gt;intelligence is not just about size—it’s about optimization&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 10: When Should You Use SLMs?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08wprrl1g6diq50psmse.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08wprrl1g6diq50psmse.jpg" alt="When Should You Use SLMs?" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SLMs shine in scenarios where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Speed matters&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Privacy is critical&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline capability&lt;/strong&gt; is required&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Budgets are limited&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge deployment&lt;/strong&gt; is necessary&lt;/li&gt;
&lt;li&gt;Tasks are &lt;strong&gt;domain-specific&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SLMs are not budget LLMs—they are &lt;strong&gt;the right choice for the right job&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 11: VLMs — When AI Learns to See
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66ve2aj22jb03ljppyit.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66ve2aj22jb03ljppyit.jpg" alt="VLMs — When AI Learns to See" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vision-Language Models (VLMs)&lt;/strong&gt; take things a step further.&lt;/p&gt;

&lt;p&gt;They don’t just read text—they &lt;strong&gt;understand images as well&lt;/strong&gt;. This is where AI becomes truly &lt;strong&gt;multimodal&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;VLMs can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process &lt;strong&gt;images and text together&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Understand &lt;strong&gt;visual context&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Answer questions about &lt;strong&gt;photos&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Generate &lt;strong&gt;descriptions from images&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Slide 12: How VLMs Actually Work
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0s47nmarnbwko4o8zo1r.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0s47nmarnbwko4o8zo1r.jpg" alt="How VLMs Actually Work" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Under the hood, VLMs combine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;vision encoder&lt;/strong&gt; for images&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;language model&lt;/strong&gt; for text&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;fusion layer&lt;/strong&gt; to connect meaning across modalities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows AI systems to &lt;strong&gt;see and reason at the same time&lt;/strong&gt;.&lt;/p&gt;
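&lt;p&gt;At the level of tensor shapes, the pipeline is simple to sketch. The "encoders" and dimensions below are invented stand-ins (no real model is loaded); the point is only how the two modalities end up in one sequence the language model can attend over.&lt;/p&gt;

```python
# Shape-level sketch of VLM fusion: a vision encoder turns an image into
# patch embeddings, a projection maps them into the text embedding space,
# and the language model attends over the concatenated sequence.
D_TEXT = 8   # invented text embedding width

def vision_encoder(image):
    # Stand-in: pretend the image becomes 4 patch embeddings of width 6
    return [[0.0] * 6 for _ in range(4)]

def project(patch_embeddings):
    # Stand-in for the learned fusion/projection layer into text space
    return [[0.0] * D_TEXT for _ in patch_embeddings]

def embed_tokens(tokens):
    return [[0.0] * D_TEXT for _ in tokens]

image_tokens = project(vision_encoder("photo.jpg"))
text_tokens = embed_tokens(["What", "is", "in", "this", "photo", "?"])
fused_sequence = image_tokens + text_tokens   # one sequence, two modalities

print(len(fused_sequence))   # 4 image tokens + 6 text tokens = 10
```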




&lt;h2&gt;
  
  
  Slide 13: VLMs in the Real World
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8te6yy0jz8z8z6cof2k4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8te6yy0jz8z8z6cof2k4.jpg" alt="VLMs in the Real World" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;VLMs are already transforming industries such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Medical imaging and diagnostics&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Autonomous vehicles&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accessibility tools&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Visual search engines&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Content moderation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AR and VR experiences&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multimodal AI is no longer optional—it’s becoming &lt;strong&gt;standard&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 14: Comparing LLMs, SLMs, and VLMs
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtppdo97d70ngeizkvna.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtppdo97d70ngeizkvna.jpg" alt="Comparing LLMs, SLMs, and VLMs" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is &lt;strong&gt;no single best model&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLMs&lt;/strong&gt; excel at reasoning and versatility
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLMs&lt;/strong&gt; excel at efficiency and speed
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VLMs&lt;/strong&gt; excel at multimodal understanding
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The right choice depends entirely on &lt;strong&gt;context&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 15: Speed and Latency Reality Check
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froe4i8w23mneeswhffw6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froe4i8w23mneeswhffw6.jpg" alt="Speed and Latency Reality Check" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Response time matters more than ever.&lt;/p&gt;

&lt;p&gt;Approximate latency expectations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLMs:&lt;/strong&gt; 1–5 seconds
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLMs:&lt;/strong&gt; under 0.5 seconds
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VLMs:&lt;/strong&gt; 2–8 seconds depending on image complexity
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For real-time applications, &lt;strong&gt;speed is not optional&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide 16: Choosing the Right Tool
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosqsyjrbxydqf07alm3f.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosqsyjrbxydqf07alm3f.jpg" alt="Choosing the Right Tool" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Choosing a model is like choosing a vehicle. A monster truck is overkill for city driving.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Need &lt;strong&gt;complex reasoning&lt;/strong&gt; → LLM
&lt;/li&gt;
&lt;li&gt;Need &lt;strong&gt;speed and efficiency&lt;/strong&gt; → SLM
&lt;/li&gt;
&lt;li&gt;Need &lt;strong&gt;visual understanding&lt;/strong&gt; → VLM
&lt;/li&gt;
&lt;li&gt;Need &lt;strong&gt;offline capability&lt;/strong&gt; → SLM
&lt;/li&gt;
&lt;/ul&gt;
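&lt;p&gt;The rules above can be written as a tiny routing function. The requirement names and the priority order (vision first, then latency and offline constraints, then reasoning) are my own simplification of the slide, not a formal recommendation.&lt;/p&gt;

```python
# Route a set of requirements to a model class, per the slide's heuristics.
def choose_model(needs):
    if "visual_understanding" in needs:
        return "VLM"
    if "offline" in needs or "speed" in needs:
        return "SLM"    # hard deployment constraints win over raw capability
    if "complex_reasoning" in needs:
        return "LLM"
    return "SLM"        # default to the cheapest option that fits

print(choose_model({"complex_reasoning"}))            # LLM
print(choose_model({"speed", "complex_reasoning"}))   # SLM: latency wins here
print(choose_model({"visual_understanding"}))         # VLM
```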




&lt;h2&gt;
  
  
  Final Takeaways
&lt;/h2&gt;

&lt;p&gt;The most important lessons from this talk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is no universally best AI model&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context beats capability&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Efficiency matters as much as intelligence&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hybrid systems often outperform single-model setups&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choose models like an &lt;strong&gt;engineer&lt;/strong&gt;, not like a fan.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Presenting this at the Microsoft office was special—not because of the venue, but because the audience asked &lt;strong&gt;implementation-focused questions&lt;/strong&gt;, not hype-driven ones.&lt;/p&gt;

&lt;p&gt;If you’re building AI systems today, understanding &lt;strong&gt;LLMs, SLMs, and VLMs&lt;/strong&gt; isn’t optional—it’s foundational.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Continue the Conversation
&lt;/h2&gt;

&lt;p&gt;If this resonated with you, feel free to connect with me on LinkedIn or reach out directly. I’d love to hear how you’re thinking about model selection in your own AI stack.&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/jaskiratai" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/jaskiratai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>microsoft</category>
      <category>nlp</category>
      <category>llm</category>
    </item>
    <item>
      <title>Why Your AI Feels Dumb (And How MCP Fixes It)</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Fri, 23 Jan 2026 18:02:11 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/why-your-ai-feels-dumb-and-how-mcp-fixes-it-3alf</link>
      <guid>https://forem.com/jaskirat_singh/why-your-ai-feels-dumb-and-how-mcp-fixes-it-3alf</guid>
      <description>&lt;p&gt;&lt;strong&gt;Your AI isn’t actually dumb.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It can write code you’d normally Google for. It can explain system design better than most interview prep blogs. It can summarize a 100-page document before your coffee gets cold.&lt;/p&gt;

&lt;p&gt;And yet, the moment you ask it to do something real—like read a file from your app, query a database, or trigger an internal API—it suddenly feels useless.&lt;/p&gt;

&lt;p&gt;That disconnect isn’t a model problem.&lt;br&gt;
It’s a context problem.&lt;/p&gt;

&lt;p&gt;And this is where Model Context Protocol (MCP) quietly changes everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Real Issue: Your AI Lives in a Bubble&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Large Language Models don’t live inside your system. They live next to it.&lt;/p&gt;

&lt;p&gt;Out of the box, an LLM has no idea about your files, your services, your permissions, or your business logic. So when we try to make AI “useful,” we start stuffing all of that information into prompts, tool schemas, and wrapper code.&lt;/p&gt;

&lt;p&gt;At first, it feels like progress. The AI responds. It calls tools. It does something.&lt;/p&gt;

&lt;p&gt;But over time, the cracks start showing.&lt;/p&gt;

&lt;p&gt;Prompts grow massive. Tool logic becomes tightly coupled to specific models. Small changes ripple through the system. Switching models feels like rewriting half your stack.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your AI didn’t get smarter.&lt;/li&gt;
&lt;li&gt;You just buried the complexity deeper.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Life Before MCP: Clever Hacks, Fragile Systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before MCP, every AI app solved the same problem in its own way.&lt;/p&gt;

&lt;p&gt;Some used function calling. Some used agent frameworks. Others built custom JSON protocols that only made sense to their team. Every solution worked—until it didn’t.&lt;/p&gt;

&lt;p&gt;The real issue was that there was no standard contract between AI models and external systems. Each integration was handcrafted. Each prompt carried architectural responsibility it was never meant to handle.&lt;/p&gt;

&lt;p&gt;We were asking language models to manage system design.&lt;/p&gt;

&lt;p&gt;That was never going to scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;MCP, Explained Without the Marketing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Model Context Protocol (MCP) is a standard that defines how AI models interact with tools, data, and context in a structured, predictable way.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It doesn’t make models smarter.&lt;/li&gt;
&lt;li&gt;It makes AI systems usable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MCP introduces a clean separation. Your application exposes what it can do and what data it can share. The AI consumes that information through a well-defined interface, without knowing how things work internally.&lt;/p&gt;

&lt;p&gt;It’s boring in the best way possible. And boring is what production systems need.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Shift MCP Introduces (And Why It Matters)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The biggest change MCP brings is this: context stops leaking into prompts.&lt;/p&gt;

&lt;p&gt;Instead of encoding system behavior into text instructions, your system exposes capabilities directly. The AI no longer needs to guess how to interact with your app. It simply asks what’s available.&lt;/p&gt;

&lt;p&gt;This means prompts become simpler. Logic becomes clearer. Security boundaries become explicit. And your AI stops feeling fragile.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;You move from “prompt-powered hacks” to actual architecture.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;MCP Servers: Teaching Your System How to Talk to AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;An MCP server acts as your system’s official interface for AI.&lt;/p&gt;

&lt;p&gt;Rather than exposing raw internals, it presents a curated view of what the AI is allowed to see and do. This might include access to files, databases, APIs, or workflows—but only through controlled, structured capabilities.&lt;/p&gt;

&lt;p&gt;From the AI’s perspective, the system becomes understandable. From your perspective, the system stays protected.&lt;/p&gt;

&lt;p&gt;That balance is what makes MCP especially powerful in enterprise and production environments.&lt;/p&gt;
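&lt;p&gt;To make the server/client split concrete, here is a toy sketch of the idea in plain Python. This is &lt;em&gt;not&lt;/em&gt; the real MCP SDK (that lives in the official &lt;code&gt;mcp&lt;/code&gt; package); every name below is invented purely to illustrate the contract: the server registers curated capabilities, the client discovers them, and invocation goes through a structured interface rather than prompt text.&lt;/p&gt;

```python
import json

# Illustrative only: a toy "capability registry" showing the MCP idea of a
# server exposing a curated, structured view of what an AI client may do.
# This is NOT the real MCP SDK; names here are invented for the sketch.
class ToyMCPServer:
    def __init__(self):
        self._tools = {}

    def tool(self, name, description):
        """Register a function as a named, described capability."""
        def wrap(fn):
            self._tools[name] = {"description": description, "fn": fn}
            return fn
        return wrap

    def list_tools(self):
        """What a client sees during discovery: names and descriptions only."""
        return [{"name": n, "description": t["description"]}
                for n, t in self._tools.items()]

    def call_tool(self, name, arguments):
        """Controlled invocation: internals stay hidden behind the contract."""
        return self._tools[name]["fn"](**arguments)

server = ToyMCPServer()

@server.tool("read_invoice", "Fetch one invoice record by id")
def read_invoice(invoice_id):
    # The AI never sees this body, only the capability's name and description.
    return {"id": invoice_id, "amount": 4200}

print(json.dumps(server.list_tools()))
print(server.call_tool("read_invoice", {"invoice_id": "INV-7"}))
```

&lt;p&gt;Notice what is absent: no prompt describes how the tool works internally. The model only ever sees the discovery output, which is exactly the separation the protocol standardizes.&lt;/p&gt;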

&lt;h2&gt;
  
  
  &lt;strong&gt;MCP Clients: Where Intelligence Finally Connects to Reality&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The MCP client is the AI-powered application itself. This could be a coding assistant, an internal chatbot, an autonomous agent, or a RAG system with actions.&lt;/p&gt;

&lt;p&gt;The key difference is that the client no longer needs custom logic for every integration. It connects to MCP servers, discovers available capabilities, and uses them consistently—regardless of the underlying implementation.&lt;/p&gt;

&lt;p&gt;This makes AI apps more modular, more flexible, and far easier to evolve over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why MCP Makes Your AI Feel Smarter (Without Changing the Model)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once MCP is in place, something interesting happens.&lt;/p&gt;

&lt;p&gt;Your AI doesn’t hallucinate how to use tools. It doesn’t rely on brittle prompt instructions. It doesn’t break when you refactor your backend.&lt;/p&gt;

&lt;p&gt;It simply operates within a clear, structured environment.&lt;/p&gt;

&lt;p&gt;That’s why MCP doesn’t feel like a flashy feature. It feels like stability. Like things finally make sense.&lt;/p&gt;

&lt;p&gt;And in a world rushing to build agents, workflows, and autonomous systems, that kind of foundation matters more than ever.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thought&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Your AI was never dumb.&lt;/p&gt;

&lt;p&gt;It was just disconnected.&lt;/p&gt;

&lt;p&gt;MCP doesn’t give your model new abilities. It gives your system a language the model can actually understand. And once that connection is clean, reliable, and standardized, everything else becomes easier.&lt;/p&gt;

&lt;p&gt;Smarter behavior isn’t always about better intelligence.&lt;/p&gt;

&lt;p&gt;Sometimes, it’s just about better context.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>llmops</category>
      <category>llm</category>
    </item>
    <item>
      <title>How a RAG Agent Helped My Father's Shoulder Treatment (And Saved ₹30,000).</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Thu, 22 Jan 2026 18:22:39 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/how-a-rag-agent-helped-my-fathers-shoulder-treatment-and-saved-30000-349h</link>
      <guid>https://forem.com/jaskirat_singh/how-a-rag-agent-helped-my-fathers-shoulder-treatment-and-saved-30000-349h</guid>
      <description>&lt;p&gt;My father slipped on wet tiles while watering plants. Left shoulder dislocation. Emergency room. The doctor said he needed treatment, possibly surgery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hospital estimate: ₹30,000-50,000.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We had HDFC insurance for 4 years. But standing there, I had no idea what was covered.&lt;br&gt;
The 84-page policy PDF was useless at that moment.&lt;/p&gt;

&lt;p&gt;So I built an AI system that could answer our questions. It took one night. It saved us ₹30,000.&lt;/p&gt;

&lt;p&gt;Here's exactly how I did it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Is a RAG Agent?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;RAG = Retrieval-Augmented Generation&lt;/strong&gt;&lt;br&gt;
Think of it as ChatGPT that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only reads YOUR documents&lt;/li&gt;
&lt;li&gt;Is far less likely to make up answers, because it must ground them in the text&lt;/li&gt;
&lt;li&gt;Shows you which page it found information on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Perfect for insurance policies, where accuracy matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How It Helped Us&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Question 1&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;"MY Father's shoulder is dislocated. Is it covered?"&lt;/em&gt;&lt;br&gt;
Agent Answer:&lt;br&gt;
✅ YES - Covered under Accidental Injury (Section 4.2)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No waiting period for accidents&lt;/li&gt;
&lt;li&gt;Manipal Hospital is a network hospital - use cashless&lt;/li&gt;
&lt;li&gt;Apply for pre-authorization within 24 hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: Got cashless approval in 18 hours. No upfront payment needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question 2&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;"Is an MRI scan also covered, or need to pay for it?"&lt;/em&gt;&lt;br&gt;
Agent Answer:&lt;br&gt;
✅ YES - Covered under Diagnostic Procedures (Section 5.1)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MRI needs pre-authorization&lt;/li&gt;
&lt;li&gt;Already approved with treatment&lt;/li&gt;
&lt;li&gt;Direct billing available&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: MRI done cashless. Saved ₹18,000 out-of-pocket.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question 3&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;"Physiotherapy sessions are covered?"&lt;/em&gt;&lt;br&gt;
Agent Answer:&lt;br&gt;
✅ YES - 10 sessions covered (Section 6.4)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Up to ₹800 per session&lt;/li&gt;
&lt;li&gt;Must be prescribed by a doctor&lt;/li&gt;
&lt;li&gt;Can be done at empaneled centers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: We had NO idea this was covered. Used all 10 sessions. Saved ₹8,000.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Complete Code&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's the exact system I built. Copy-paste ready.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install Python packages
pip install langchain langchain-community langchain-openai faiss-cpu pypdf python-dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll also need an OpenAI API key from platform.openai.com.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Setup (.env file)&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;OPENAI_API_KEY=your_key_here&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Load Policy PDF&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# load_policy.py
from langchain_community.document_loaders import PyPDFLoader

def load_policy(pdf_path):
    """Load insurance policy PDF"""
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()
    print(f"✅ Loaded {len(documents)} pages")
    return documents

# Usage
policy_docs = load_policy("star_health_policy.pdf")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Create Smart Chunks&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# chunking.py
from langchain.text_splitter import RecursiveCharacterTextSplitter

def create_chunks(documents):
    """Split into searchable pieces"""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=100
    )
    chunks = splitter.split_documents(documents)
    print(f"✅ Created {len(chunks)} chunks")
    return chunks

# Usage
chunks = create_chunks(policy_docs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
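&lt;p&gt;To build intuition for what &lt;code&gt;chunk_size=500&lt;/code&gt; and &lt;code&gt;chunk_overlap=100&lt;/code&gt; mean: consecutive chunks share 100 characters, so each window advances 400 characters. Here is a dependency-free sketch of that sliding window. LangChain's real splitter is smarter, splitting on paragraph and sentence separators first so chunks end at natural boundaries.&lt;/p&gt;

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=100):
    """Naive character-window chunking: each window starts
    chunk_size - chunk_overlap characters after the previous one.
    (Overlap must be smaller than chunk_size so the stride is positive.)"""
    stride = chunk_size - chunk_overlap  # how far the window advances each step
    return [text[start:start + chunk_size] for start in range(0, len(text), stride)]

# 1000 characters of varying content so the overlap is visible
doc = "".join(str(i % 10) for i in range(1000))
parts = sliding_chunks(doc)
print(len(parts))                         # 3 windows, starting at 0, 400, 800
print(parts[0][-100:] == parts[1][:100])  # True: consecutive chunks share 100 chars
```

&lt;p&gt;The overlap exists so a sentence that straddles a chunk boundary still appears whole in at least one chunk, which keeps retrieval from missing clauses that happen to sit at a split point.&lt;/p&gt;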



&lt;p&gt;&lt;strong&gt;Step 4: Create Vector Database&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# vectorstore.py
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

def create_vectorstore(chunks):
    """Create searchable database"""
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(chunks, embeddings)
    vectorstore.save_local("policy_index")
    print("✅ Vector store created")
    return vectorstore

# Load existing
def load_vectorstore():
    embeddings = OpenAIEmbeddings()
    return FAISS.load_local(
        "policy_index", 
        embeddings,
        allow_dangerous_deserialization=True
    )

# Usage
vectorstore = create_vectorstore(chunks)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
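&lt;p&gt;What the vector store is doing conceptually: every chunk becomes a vector, and a question is answered by finding the chunks whose vectors point in the most similar direction. Here is a dependency-free sketch using cosine similarity. The three-dimensional vectors are made up for illustration; real OpenAI embeddings have 1536 dimensions, and FAISS exists to do this search fast at scale.&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" (invented for the sketch; real ones come from the model).
chunk_vectors = {
    "Section 4.2: accidental injury coverage": [0.9, 0.1, 0.0],
    "Section 9.1: premium payment schedule": [0.1, 0.9, 0.2],
}
query_vector = [0.85, 0.15, 0.05]  # pretend embedding of "is a dislocation covered?"

best = max(chunk_vectors,
           key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]))
print(best)  # the accidental-injury chunk wins
```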



&lt;p&gt;&lt;strong&gt;Step 5: Create the RAG System&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# rag_system.py
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

def create_rag_system(vectorstore):
    """Create the question-answering system"""

    # Define prompt
    prompt = PromptTemplate(
        input_variables=["context", "question"],
        template="""You are an insurance policy expert.

Answer using ONLY the context below. If you don't know, say so.

Context:
{context}

Question:
{question}

Answer clearly and cite sections."""
    )

    # Create LLM
    llm = ChatOpenAI(
        model="gpt-4-turbo-preview",
        temperature=0
    )

    # Create retriever
    retriever = vectorstore.as_retriever(
        search_kwargs={"k": 5}
    )

    # Build chain
    chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt}
    )

    return chain

# Usage
rag_chain = create_rag_system(vectorstore)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 6: Ask Questions&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# query.py
def ask_question(chain, question):
    """Ask and get answer with sources"""
    result = chain.invoke({"query": question})

    print(f"\n❓ Question: {question}")
    print(f"\n✅ Answer: {result['result']}")
    print(f"\n📚 Sources: {len(result['source_documents'])} sections")

    for i, doc in enumerate(result['source_documents'], 1):
        print(f"\n{i}. Page {doc.metadata.get('page', '?')}")
        print(f"   {doc.page_content[:200]}...")

    return result

# Usage
ask_question(rag_chain, "Is emergency treatment covered for accidents?")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Complete Script (main.py)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# main.py
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Load API key
load_dotenv()

def setup_rag_system(pdf_path):
    """Complete setup in one function"""

    print("📄 Loading policy...")
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()

    print("✂️ Creating chunks...")
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=100
    )
    chunks = splitter.split_documents(documents)

    print("🧠 Creating embeddings...")
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(chunks, embeddings)

    print("🔗 Building RAG chain...")
    prompt = PromptTemplate(
        input_variables=["context", "question"],
        template="""You are an insurance policy expert.

Answer using ONLY the context below. If you don't know, say so.

Context:
{context}

Question:
{question}

Answer clearly and cite policy sections."""
    )

    llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

    chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt}
    )

    print("✅ System ready!\n")
    return chain

def ask(chain, question):
    """Ask a question"""
    result = chain.invoke({"query": question})
    print(f"\n❓ {question}")
    print(f"✅ {result['result']}\n")
    return result

# Run
if __name__ == "__main__":
    # Setup
    chain = setup_rag_system("your_policy.pdf")

    # Ask questions
    ask(chain, "Is shoulder dislocation covered?")
    ask(chain, "What about physiotherapy sessions?")
    ask(chain, "Are there any waiting periods?")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run It&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;python main.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real Output Example&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;$ python main.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;📄 Loading policy...&lt;br&gt;
✂️ Creating chunks...&lt;br&gt;
🧠 Creating embeddings...&lt;br&gt;
🔗 Building RAG chain...&lt;br&gt;
✅ System ready!&lt;/p&gt;

&lt;p&gt;❓ Is shoulder dislocation covered?&lt;br&gt;
✅ Yes, shoulder dislocation from accidental injury is covered under &lt;br&gt;
Section 4.2 (Accidental Injury Coverage). No waiting period applies &lt;br&gt;
for accident-related injuries. Pre-authorization required within 24 &lt;br&gt;
hours for cashless treatment.&lt;/p&gt;

&lt;p&gt;❓ What about physiotherapy sessions?&lt;br&gt;
✅ Physiotherapy is covered under Section 6.4 for post-treatment &lt;br&gt;
rehabilitation. Maximum 10 sessions covered at ₹800 per session. &lt;br&gt;
Must be prescribed by treating doctor.&lt;/p&gt;

&lt;p&gt;❓ Are there any waiting periods?&lt;br&gt;
✅ Yes, standard waiting periods apply: 30 days for specific ailments, &lt;br&gt;
24 months for pre-existing conditions. However, these are WAIVED for &lt;br&gt;
accidental injuries.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Limitations&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What It Does:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Answers factual coverage questions&lt;/li&gt;
&lt;li&gt;✅ Explains waiting periods&lt;/li&gt;
&lt;li&gt;✅ Finds relevant clauses&lt;/li&gt;
&lt;li&gt;✅ Handles Hindi + English questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What It Doesn't Do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Legal advice&lt;/li&gt;
&lt;li&gt;❌ Guarantee claim approval&lt;/li&gt;
&lt;li&gt;❌ Medical diagnosis&lt;/li&gt;
&lt;li&gt;❌ Replace your insurance company&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Always verify critical decisions with your insurer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real Impact&lt;/strong&gt;&lt;br&gt;
After sharing this, 200+ families asked for help with their policies.&lt;br&gt;
&lt;strong&gt;Common discoveries:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Room rent sub-limits nobody knew about&lt;/li&gt;
&lt;li&gt;Physiotherapy coverage never claimed&lt;/li&gt;
&lt;li&gt;Waiting periods wrongly applied&lt;/li&gt;
&lt;li&gt;Cashless hospitals not used&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total saved by our community: ₹23+ lakhs&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next Steps For You&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get your policy PDF&lt;/li&gt;
&lt;li&gt;Get an OpenAI API key (or use free Ollama)&lt;/li&gt;
&lt;li&gt;Run the code above&lt;/li&gt;
&lt;li&gt;Ask questions about YOUR policy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For Developers&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add more insurers (ICICI, HDFC, Care)&lt;/li&gt;
&lt;li&gt;Build a mobile app&lt;/li&gt;
&lt;li&gt;Add a Hindi voice interface&lt;/li&gt;
&lt;li&gt;Create a comparison tool&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For Startups&lt;/strong&gt;&lt;br&gt;
This is a real problem. Millions of families need this. Build it right, help millions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FAQs&lt;/strong&gt;&lt;br&gt;
Q: Do I need coding knowledge?&lt;br&gt;
A: Basic Python. If you can copy-paste, you can run this.&lt;br&gt;
Q: How long does setup take?&lt;br&gt;
A: 30 minutes first time. 2 minutes after that.&lt;br&gt;
Q: Is my policy data safe?&lt;br&gt;
A: The index lives on your computer, but chunk text is sent to OpenAI's API for embeddings and answers. For full privacy, swap in a local model via Ollama.&lt;br&gt;
Q: Can I use this for other documents?&lt;br&gt;
A: Yes! Works for any PDF - legal docs, manuals, research papers.&lt;br&gt;
Q: What if my policy updates?&lt;br&gt;
A: Rerun the setup script. Takes 2 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Success Story&lt;/strong&gt;&lt;br&gt;
After building this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Father's treatment went smoothly&lt;/li&gt;
&lt;li&gt;No confusion, no panic&lt;/li&gt;
&lt;li&gt;Saved ₹30,000 by understanding coverage&lt;/li&gt;
&lt;li&gt;Discovered benefits we didn't know existed&lt;/li&gt;
&lt;li&gt;Helped 200+ other families&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Most importantly: Peace of mind during a medical emergency.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Insurance shouldn't be a mystery box you open during emergencies.&lt;br&gt;
You pay premiums. You deserve to understand what you bought.&lt;br&gt;
This RAG system gives you that understanding—in seconds, in your language, with sources cited.&lt;br&gt;
Build it. Use it. Share it.&lt;br&gt;
Your family will thank you when it matters most.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vectordatabase</category>
      <category>rag</category>
      <category>insurance</category>
    </item>
    <item>
      <title>If You’re Building in 2026, Start Here 📈</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Wed, 21 Jan 2026 11:21:04 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/if-youre-building-in-2026-start-here-a07</link>
      <guid>https://forem.com/jaskirat_singh/if-youre-building-in-2026-start-here-a07</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;&lt;em&gt;20 Open-Source Tools That Actually Moved the Needle&lt;/em&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;2025 was noisy.&lt;/p&gt;

&lt;p&gt;Every week, a new “must-use” open-source tool popped up on GitHub or X. I personally tried 40+ open-source tools across AI infra, LLM ops, developer productivity, and internal tooling—so you don’t have to.&lt;/p&gt;

&lt;p&gt;Most were impressive demos.&lt;br&gt;
Only a few actually shipped value in real projects.&lt;/p&gt;

&lt;p&gt;This blog is about those 20 tools. The ones that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced engineering friction&lt;/li&gt;
&lt;li&gt;Scaled beyond toy use cases&lt;/li&gt;
&lt;li&gt;Worked in production, not just on launch day&lt;/li&gt;
&lt;li&gt;Fit how teams will realistically build products in 2026&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re a builder, founder, or developer working with AI systems, workflows, or modern SaaS stacks—this list is a solid place to start.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to Read This List&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Each tool below includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What it does&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Why it matters in 2026&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Where it fits in real projects&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;No hype. No paid placements. Just practical tools.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Sourcebot&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvgr5pxbbadwo368chxw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvgr5pxbbadwo368chxw.png" alt="Sourcebot" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fast, self-hosted code understanding &amp;amp; search for massive monorepos&lt;/p&gt;

&lt;p&gt;When codebases cross a certain size, grep and IDE search simply stop scaling. Sourcebot gives you semantic, blazing-fast search across large monorepos—while staying self-hosted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
AI-assisted development only works if your tools actually understand your codebase.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.sourcebot.dev/" rel="noopener noreferrer"&gt;https://www.sourcebot.dev/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. LiteLLM (YC W23)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3oeg3vdpnlxp5urjr4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3oeg3vdpnlxp5urjr4d.png" alt="LiteLLM" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One OpenAI-compatible gateway for 100+ LLMs&lt;/p&gt;

&lt;p&gt;LiteLLM abstracts away vendor lock-in. You switch models, providers, and pricing without rewriting your app—while getting logging, rate limits, and cost controls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
2026 will be multi-model by default.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.litellm.ai/" rel="noopener noreferrer"&gt;https://www.litellm.ai/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Langfuse&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ukhx0h7jo0kubw4qhik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ukhx0h7jo0kubw4qhik.png" alt="Langfuse" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tracing, evals, and prompt management for LLM apps&lt;/p&gt;

&lt;p&gt;Langfuse helps you understand why an LLM behaved the way it did. Traces, evaluations, prompt versions—everything you need once prototypes hit production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
You can’t scale what you can’t observe.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://langfuse.com/" rel="noopener noreferrer"&gt;https://langfuse.com/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Infisical&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F580l8izneuocb0101qki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F580l8izneuocb0101qki.png" alt="Infisical" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Open-source secrets &amp;amp; config management&lt;/p&gt;

&lt;p&gt;Finally, a modern alternative to hard-coded env files and duct-taped secret sharing across teams and CI/CD.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
AI apps touch more credentials than traditional apps ever did.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://infisical.com/" rel="noopener noreferrer"&gt;https://infisical.com/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Ollama&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zrl63sb2r65sxle7yaw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zrl63sb2r65sxle7yaw.png" alt="Ollama" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run LLMs locally with a simple CLI&lt;/p&gt;

&lt;p&gt;Ollama makes local inference approachable—for dev, testing, and privacy-sensitive workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
Not every LLM call should hit the cloud.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;https://ollama.com/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;6. Browser Use&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lehw179clbkpru0fv1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lehw179clbkpru0fv1g.png" alt="Browser Use" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let AI agents interact with real websites&lt;/p&gt;

&lt;p&gt;This unlocks agent workflows that actually work on the real web—not just APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
Many real-world systems still don’t have APIs.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://browser-use.com/" rel="noopener noreferrer"&gt;https://browser-use.com/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;7. Mastra&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffid2z3kcyu7ezw5rht18.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffid2z3kcyu7ezw5rht18.png" alt="Mastra" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;TypeScript-first AI primitives (agents, RAG, workflows)&lt;/p&gt;

&lt;p&gt;Mastra feels like what many of us wish early AI frameworks were—composable, typed, and production-minded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
AI infra is moving closer to standard software patterns.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://mastra.ai/" rel="noopener noreferrer"&gt;https://mastra.ai/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;8. Continue&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5wbc6b2m8mgakq4z3ju.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5wbc6b2m8mgakq4z3ju.png" alt="Continue" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Background agents and continuous coding workflows&lt;/p&gt;

&lt;p&gt;Continue blends into developer workflows instead of interrupting them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
AI copilots should assist, not distract.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.continue.dev/" rel="noopener noreferrer"&gt;https://www.continue.dev/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;9. Firecrawl&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88wormuv3oatlj5rdcbk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88wormuv3oatlj5rdcbk.png" alt="Firecrawl" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Turn websites into clean, LLM-ready data&lt;/p&gt;

&lt;p&gt;Scraping is messy. Firecrawl makes it boring—in the best way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
RAG pipelines are only as good as their input data.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.firecrawl.dev/" rel="noopener noreferrer"&gt;https://www.firecrawl.dev/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;10. Onyx&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92hx3rzdvp8dzmyrbb72.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92hx3rzdvp8dzmyrbb72.png" alt="Onyx" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Self-hostable enterprise chat UI with RAG and agents&lt;/p&gt;

&lt;p&gt;Think “internal ChatGPT,” but actually enterprise-ready.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
Companies want control, not just convenience.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.onyx.app/" rel="noopener noreferrer"&gt;https://www.onyx.app/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;11. Trigger.dev&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F069ap0kfwmxvvhr6o2up.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F069ap0kfwmxvvhr6o2up.png" alt="Trigger.dev" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Long-running, reliable AI workflows in TypeScript&lt;/p&gt;

&lt;p&gt;Perfect for agent pipelines that don’t fit into request-response cycles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
AI workflows are asynchronous by nature.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://trigger.dev/" rel="noopener noreferrer"&gt;https://trigger.dev/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;12. ParadeDB&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5aauup82ki1vf9x6jvcu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5aauup82ki1vf9x6jvcu.png" alt="ParadeDB" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Postgres-native search &amp;amp; analytics&lt;/p&gt;

&lt;p&gt;A serious Elasticsearch alternative without leaving Postgres.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
Operational simplicity wins long term.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.paradedb.com/" rel="noopener noreferrer"&gt;https://www.paradedb.com/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;13. Reflex&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ynfg2pw6c7p0ssomvgc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ynfg2pw6c7p0ssomvgc.png" alt="Reflex" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Build full-stack web apps entirely in Python&lt;/p&gt;

&lt;p&gt;Reflex lowers the barrier for AI engineers to ship full products.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;br&gt;
Builders shouldn’t need to context-switch stacks to ship.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://reflex.dev/" rel="noopener noreferrer"&gt;https://reflex.dev/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;14. Tiptap&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kpq6imk19r6u1oc3tu8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kpq6imk19r6u1oc3tu8.png" alt="Tiptap" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Headless editor for Notion-like experiences&lt;/p&gt;

&lt;p&gt;If you’re building collaborative or AI-assisted writing tools, Tiptap is battle-tested.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://tiptap.dev/" rel="noopener noreferrer"&gt;https://tiptap.dev/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;15. GrowthBook&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00lbhscp3ipbf34ub5vx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00lbhscp3ipbf34ub5vx.png" alt="GrowthBook" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Open-source feature flags and A/B testing&lt;/p&gt;

&lt;p&gt;Experimentation without SaaS lock-in.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.growthbook.io/" rel="noopener noreferrer"&gt;https://www.growthbook.io/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;16. Windmill&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3udxqev3hqbr9rbpn42r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3udxqev3hqbr9rbpn42r.png" alt="Windmill" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Turn scripts into apps and workflows&lt;/p&gt;

&lt;p&gt;Windmill is what internal tooling should feel like in 2026.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.windmill.dev/" rel="noopener noreferrer"&gt;https://www.windmill.dev/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;17. LanceDB&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferp5l8jlltztke5dhpvs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferp5l8jlltztke5dhpvs.png" alt="LanceDB" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;High-performance vector DB for billion-scale search&lt;/p&gt;

&lt;p&gt;Fast, efficient, and built for serious scale.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://lancedb.com/" rel="noopener noreferrer"&gt;https://lancedb.com/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;18. Mattermost&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0em82upp9vm8hgnb6a1n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0em82upp9vm8hgnb6a1n.png" alt="Mattermost" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Secure, self-hosted team communication&lt;/p&gt;

&lt;p&gt;Used by organizations where Slack simply isn’t an option.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://mattermost.com/" rel="noopener noreferrer"&gt;https://mattermost.com/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;19. Tesseral&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjr74o7lyttve9w8ss140.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjr74o7lyttve9w8ss140.png" alt="Tesseral" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Open-source IAM for B2B SaaS&lt;/p&gt;

&lt;p&gt;SSO, SCIM, RBAC, audit logs—without reinventing the wheel.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://tesseral.com/" rel="noopener noreferrer"&gt;https://tesseral.com/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;20. Helicone (YC W23)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvtiv9q4yur1fgrakl5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvtiv9q4yur1fgrakl5p.png" alt="Helicone" width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Metrics, traces, and experiment tooling for LLMs&lt;/p&gt;

&lt;p&gt;If you’re serious about LLM performance, Helicone is hard to ignore.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.helicone.ai/" rel="noopener noreferrer"&gt;https://www.helicone.ai/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open source in 2026 won’t be about more tools.&lt;br&gt;
It will be about fewer, composable, production-ready ones.&lt;/p&gt;

&lt;p&gt;Every tool on this list earned its place by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Solving a real problem&lt;/li&gt;
&lt;li&gt;Working under load&lt;/li&gt;
&lt;li&gt;Respecting developer time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;♻️ If this helped, save it, share it, or pass it to someone building serious systems.&lt;/p&gt;

&lt;p&gt;👉 Which tool should I deep-dive next?&lt;br&gt;
Drop a name, a project, or tag someone who’d love this.&lt;/p&gt;

&lt;p&gt;Connect with me on LinkedIn for more content like this: &lt;a href="https://www.linkedin.com/in/jaskiratai" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/jaskiratai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>startup</category>
      <category>development</category>
    </item>
    <item>
      <title>Detecting LLM Hallucinations Through Vector Geometry: A New Approach</title>
      <dc:creator>Jaskirat Singh</dc:creator>
      <pubDate>Tue, 20 Jan 2026 18:41:06 +0000</pubDate>
      <link>https://forem.com/jaskirat_singh/detecting-llm-hallucinations-through-vector-geometry-a-new-approach-828</link>
      <guid>https://forem.com/jaskirat_singh/detecting-llm-hallucinations-through-vector-geometry-a-new-approach-828</guid>
      <description>&lt;p&gt;Large language models generate convincing text regardless of factual accuracy. They cite nonexistent research papers, invent legal precedents, and state fabrications with the same confidence as verified facts. Traditional hallucination detection relies on using another LLM as a judge—essentially asking a system prone to hallucination whether it's hallucinating. This circular approach has fundamental limitations.&lt;/p&gt;

&lt;p&gt;Recent research reveals a geometric approach to hallucination detection that examines the mathematical structure of text embeddings rather than relying on another model's judgment. This method identifies when responses deviate from learned patterns by analyzing vector relationships in embedding space.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Core Problem With Current Detection Methods&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Most hallucination detection systems employ an LLM-as-judge architecture. You generate a response, then ask another language model to evaluate its accuracy. The problems are obvious: you're using fallible systems to judge themselves, creating recursive uncertainty. The judge model can hallucinate about whether the original response hallucinated.&lt;/p&gt;

&lt;p&gt;This approach also requires additional API calls, increases latency, and scales poorly. For every response requiring verification, you need a second inference pass with comparable computational cost. Enterprise applications processing millions of requests face multiplied infrastructure expenses.&lt;/p&gt;

&lt;p&gt;The fundamental question becomes: can we detect hallucinations from intrinsic properties of the response itself, without external judgment?&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Understanding Embedding Space Structure&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Modern sentence encoders transform text into numerical vectors—points in high-dimensional space where semantically similar content clusters together. This is fundamental to how semantic search and retrieval systems work. But embeddings encode more than simple similarity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Question-Answer Relationship&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you embed a question and its corresponding answer, they occupy different positions in vector space. The displacement between these positions—the vector pointing from question to answer—has both magnitude and direction. For grounded, factual responses within a specific domain, these displacement vectors exhibit remarkable consistency.&lt;/p&gt;

&lt;p&gt;Consider five different questions about molecular biology with accurate answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What organelle produces ATP?" → "Mitochondria produce ATP through cellular respiration"&lt;/li&gt;
&lt;li&gt;"How does oxidative phosphorylation work?" → "Oxidative phosphorylation generates ATP using the electron transport chain"&lt;/li&gt;
&lt;li&gt;"What is the Krebs cycle?" → "The Krebs cycle is a series of reactions producing electron carriers"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When embedded, the displacement vectors from each question to its answer point in roughly parallel directions. The magnitudes vary—some answers are longer or more detailed—but the directional consistency holds. This represents the "grounded response pattern" for this domain.&lt;/p&gt;
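&lt;p&gt;A minimal sketch of this idea, using small made-up vectors in place of real sentence-encoder outputs (the numbers are purely illustrative): two in-domain question-answer pairs whose displacement vectors point the same way.&lt;/p&gt;

```python
import numpy as np

def displacement(q_vec, a_vec):
    # Vector pointing from a question embedding to its answer embedding.
    return a_vec - q_vec

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-d "embeddings" standing in for real encoder outputs.
# Both grounded answers shift their questions in the same direction.
q1, a1 = np.array([0.1, 0.2, 0.5, 0.1]), np.array([0.8, 0.9, 0.6, 0.2])
q2, a2 = np.array([0.2, 0.1, 0.4, 0.3]), np.array([0.9, 0.8, 0.5, 0.4])

d1, d2 = displacement(q1, a1), displacement(q2, a2)
print(cosine(d1, d2))  # near 1.0: the displacements are parallel
```

&lt;p&gt;A hallucinated answer embedded far off this shared direction would drive the cosine well below 1.&lt;/p&gt;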

&lt;p&gt;&lt;strong&gt;When Hallucination Breaks the Pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now consider a hallucinated response to a biology question:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;"What organelle produces ATP?" → "The Golgi apparatus manufactures ATP molecules through photosynthesis"&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This response is fluent, grammatically correct, and structurally resembles a proper answer. But when embedded, the displacement vector points in a fundamentally different direction than the established pattern. The response has strayed from the geometric structure characterizing grounded answers in this domain.&lt;/p&gt;

&lt;p&gt;This is the key insight: hallucinations don't just contain incorrect information—they occupy anomalous positions in embedding space relative to established truth patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Displacement Consistency: Measuring Geometric Alignment&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feix6sbyfnwohnnt4r5lu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feix6sbyfnwohnnt4r5lu.png" alt="Displacement Consistency" width="800" height="781"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Displacement Consistency (DC) metric formalizes this geometric observation into a practical detection method. The process is straightforward:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building the Reference Set&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, construct a collection of verified question-answer pairs from your target domain. This becomes your geometric baseline—the established pattern against which new responses are measured. For a medical chatbot, use medical Q&amp;amp;A pairs. For legal research, use legal examples.&lt;/p&gt;

&lt;p&gt;The reference set size can be modest—approximately 100 examples suffice for most domains. This is a one-time calibration cost performed offline before deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Computing Displacement Consistency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a new question-answer pair requires verification:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Find Neighboring Questions:&lt;/strong&gt; Identify the K nearest questions in the reference set to your new question (typically K=5-10)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calculate Mean Displacement:&lt;/strong&gt; Compute the average displacement direction from these neighboring questions to their verified answers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure Alignment:&lt;/strong&gt; Calculate the cosine similarity between your new displacement vector and this mean direction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Grounded responses align strongly with the reference pattern—DC scores approach 1.0. Hallucinated responses diverge significantly—DC scores drop toward 0.3 or lower.&lt;/p&gt;
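&lt;p&gt;The three steps above can be sketched in a few lines of Python on synthetic data. This is an illustrative implementation, not the paper's reference code—the function name `dc_score` and the synthetic "+1 per axis" domain pattern are ours.&lt;/p&gt;

```python
import numpy as np

def dc_score(q_vec, a_vec, ref_questions, ref_answers, k=5):
    """Displacement Consistency: cosine between the new question-to-answer
    displacement and the mean displacement of the K nearest reference questions."""
    # Step 1: find the K nearest reference questions (Euclidean distance here).
    dists = np.linalg.norm(ref_questions - q_vec, axis=1)
    nearest = np.argsort(dists)[:k]
    # Step 2: mean displacement direction over those neighbors.
    mean_disp = (ref_answers[nearest] - ref_questions[nearest]).mean(axis=0)
    # Step 3: cosine alignment of the new displacement with that direction.
    new_disp = a_vec - q_vec
    return float(np.dot(new_disp, mean_disp)
                 / (np.linalg.norm(new_disp) * np.linalg.norm(mean_disp)))

# Synthetic reference set: grounded answers shift questions by roughly +1 per axis.
rng = np.random.default_rng(0)
ref_q = rng.normal(size=(100, 8))
ref_a = ref_q + 1.0 + rng.normal(scale=0.05, size=(100, 8))

q = rng.normal(size=8)
grounded = q + 1.0       # follows the domain pattern: DC near 1.0
hallucinated = q - 1.0   # reversed direction: DC strongly negative
print(dc_score(q, grounded, ref_q, ref_a))
print(dc_score(q, hallucinated, ref_q, ref_a))
```

&lt;p&gt;In production you would replace the synthetic arrays with real sentence-encoder embeddings and the brute-force nearest-neighbor search with a vector index.&lt;/p&gt;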

&lt;h2&gt;
  
  
  &lt;strong&gt;Why This Works&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The method exploits how contrastive training shapes embedding space. Models learn to map semantically related content to nearby regions. For question-answer pairs, this creates directional consistency: truthful responses move in predictable directions from their questions within specific domains.&lt;/p&gt;

&lt;p&gt;Hallucinated content, while fluent and confident, doesn't respect these learned geometric relationships. The model generates text matching surface patterns (grammar, style, structure) but fails to maintain the deeper geometric consistency that characterizes grounded responses.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Empirical Performance Across Models&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Testing across architecturally diverse embedding models validates whether DC represents a fundamental property or model-specific artifact. Five models with distinct training approaches were evaluated:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural Diversity:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MPNet-based contrastive fine-tuning (all-mpnet-base-v2)&lt;/li&gt;
&lt;li&gt;Weakly-supervised pre-training (E5-large-v2)&lt;/li&gt;
&lt;li&gt;Instruction-tuned with hard negatives (BGE-large-en-v1.5)&lt;/li&gt;
&lt;li&gt;Encoder-decoder adaptation (GTR-T5-large)&lt;/li&gt;
&lt;li&gt;Efficient long-context architecture (nomic-embed-text-v1.5)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Benchmark Results&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DC achieved near-perfect discrimination across multiple established hallucination datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HaluEval-QA:&lt;/strong&gt; Contains LLM-generated hallucinations designed to be subtle and plausible. DC achieved AUROC 1.0 across all five embedding models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HaluEval-Dialogue:&lt;/strong&gt; Tests responses that deviate from conversational context. DC maintained perfect discrimination.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TruthfulQA:&lt;/strong&gt; Evaluates common misconceptions humans frequently believe. DC continued achieving AUROC 1.0.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Comparative Performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Alternative approaches measuring where responses land relative to queries (position-based rather than direction-based) achieved AUROC around 0.70-0.81. The consistent 0.20 AUROC gap across all models demonstrates DC's superior discriminative power.&lt;/p&gt;

&lt;p&gt;Score distributions reveal clear separation: grounded responses cluster tightly around DC values of 0.9, while hallucinations spread around 0.3 with minimal overlap.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Domain Locality Constraint&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Perfect performance within domains masks an important limitation: DC does not transfer across domains. A reference set from legal Q&amp;amp;A cannot detect hallucinations in medical responses—performance degrades to random chance (AUROC ~0.50).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding the Geometric Structure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwxprxab7q36qgondx3z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwxprxab7q36qgondx3z.png" alt="Understanding the Geometric Structure" width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This domain specificity reveals fundamental properties of how embeddings encode grounding. In geometric terms, embedding space resembles a fiber bundle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base Manifold:&lt;/strong&gt; The surface representing all possible questions across all domains&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fibers:&lt;/strong&gt; At each point on this surface, a direction vector indicating where grounded responses should move&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Within any local region (one specific domain), fibers point in consistent directions—this enables DC's strong local performance. But globally, across different domains, fibers point in different directions. There's no universal "truthfulness direction" spanning all possible content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical Implications&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No Universal Grounding Pattern:&lt;/strong&gt; Each domain develops distinct displacement patterns during training. Legal questions and medical questions establish different geometric structures for grounded responses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calibration Requirements:&lt;/strong&gt; Deploying DC requires domain-matched reference sets. A financial services chatbot needs financial examples; a technical support system needs technical documentation examples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-Time Cost:&lt;/strong&gt; Calibration happens offline before deployment. Once established, the reference set enables real-time detection without additional LLM calls.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This finding challenges assumptions about embedding space universality. Models don't learn a single global representation of truthfulness—they learn domain-specific mappings whose disruption signals hallucination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical Implementation Considerations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Deploying geometric hallucination detection involves several engineering decisions that impact effectiveness and operational cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference Set Construction&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Size Requirements:&lt;/strong&gt; Testing shows 100-200 verified examples per domain provide robust baselines. Larger sets improve boundary case handling but deliver diminishing returns beyond 500 examples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality Over Quantity:&lt;/strong&gt; Reference examples must be verified as factually accurate. One hallucinated example in the reference set contaminates the geometric baseline, degrading detection accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain Matching:&lt;/strong&gt; Reference content should align with production queries. Generic examples from unrelated domains contribute noise rather than signal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Computational Efficiency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Offline Costs:&lt;/strong&gt; Reference set embedding happens once during calibration. This one-time cost doesn't impact production latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Online Costs:&lt;/strong&gt; Real-time detection requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Embedding the new question-answer pair (two embedding calls)&lt;/li&gt;
&lt;li&gt;Finding the K nearest neighbors in the reference set (efficient vector search)&lt;/li&gt;
&lt;li&gt;Computing cosine similarities (simple linear algebra)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern vector databases handle nearest neighbor search at scale with sub-millisecond latency. Total detection overhead remains minimal compared to LLM-as-judge approaches requiring full inference passes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integration Patterns&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Post-Generation Filtering:&lt;/strong&gt; Generate responses normally, then apply DC scoring before returning to users. Responses below threshold trigger flagging, human review, or regeneration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence Scoring:&lt;/strong&gt; Surface DC scores alongside responses, letting downstream systems or users assess reliability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Approaches:&lt;/strong&gt; Combine geometric detection with other signals (retrieval confidence, source citation verification) for comprehensive hallucination mitigation.&lt;/li&gt;
&lt;/ul&gt;
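&lt;p&gt;A post-generation gate along these lines might look like the following sketch. The function name `gate_response`, the 0.6 cutoff, and the stub scorer are hypothetical; in practice the scorer would be a real DC computation over your reference set, and the threshold would be tuned on held-out examples.&lt;/p&gt;

```python
def gate_response(question, answer, dc_scorer, threshold=0.6):
    """Post-generation filter: deliver high-DC responses, flag the rest.

    dc_scorer is any callable returning the Displacement Consistency
    score for a (question, answer) pair.
    """
    score = dc_scorer(question, answer)
    if score >= threshold:
        return {"action": "deliver", "answer": answer, "dc": score}
    return {"action": "flag_for_review", "answer": None, "dc": score}

def demo_scorer(question, answer):
    # Hypothetical stand-in for a real DC computation.
    return 0.9 if "mitochondria" in answer.lower() else 0.3

print(gate_response("What organelle produces ATP?",
                    "Mitochondria produce ATP.", demo_scorer))
```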

&lt;h2&gt;
  
  
  &lt;strong&gt;Advantages Over Alternative Methods&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Geometric hallucination detection offers distinct benefits compared to common alternatives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Versus LLM-as-Judge&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No Recursive Uncertainty:&lt;/strong&gt; DC doesn't rely on another LLM's judgment, eliminating circular reasoning where hallucination-prone systems evaluate hallucination.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lower Latency:&lt;/strong&gt; A single embedding pass, versus full text generation from a judge model, reduces response time significantly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost Efficiency:&lt;/strong&gt; Embedding inference costs far less than generative inference. For high-volume applications, the savings compound substantially.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Versus Source-Based Verification&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No Retrieval Required:&lt;/strong&gt; DC operates on response geometry alone, without needing to fetch and verify source documents at inference time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Works With Reasoning:&lt;/strong&gt; Many LLM applications involve synthesis and reasoning beyond simple retrieval. DC can still detect when reasoning outputs hallucinate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simpler Infrastructure:&lt;/strong&gt; No need for document stores, retrieval systems, or citation parsing—just embeddings and vector similarity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Versus Uncertainty Estimation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No Model Internals:&lt;/strong&gt; DC uses standard embeddings without requiring access to model weights, attention patterns, or logit distributions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model-Agnostic:&lt;/strong&gt; Works across any LLM generating text, as long as embeddings can be computed for outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent Performance:&lt;/strong&gt; Uncertainty estimation quality varies significantly across model families. DC performance remains stable across embedding architectures.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations and Open Questions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While geometric detection shows strong empirical results, several limitations and research questions remain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domain Boundary Definition&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Determining domain boundaries for reference set construction lacks clear guidelines. Are "cardiovascular surgery" and "orthopedic surgery" separate domains requiring distinct calibration? Or can a general "medical procedures" reference set serve both?&lt;/p&gt;

&lt;p&gt;Current practice relies on empirical testing: construct candidate reference sets, measure DC performance on held-out examples, and iterate. More principled approaches to domain scoping would improve deployment efficiency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adversarial Robustness&lt;/strong&gt;&lt;br&gt;
Can models be trained to generate hallucinations that maintain geometric consistency? If LLMs explicitly optimize to preserve displacement patterns while fabricating content, does DC remain effective?&lt;/p&gt;

&lt;p&gt;Early exploration suggests this is difficult—maintaining geometric consistency while hallucinating requires coordinating both semantic content and embedding space positioning. But adversarial settings warrant further investigation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-Lingual Performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Testing has focused on English language content. Do displacement patterns transfer across languages? Can multilingual embedding models enable cross-lingual hallucination detection?&lt;/p&gt;

&lt;p&gt;Preliminary evidence suggests language-specific calibration may be necessary, similar to domain-specific calibration. But unified approaches for multilingual systems remain underexplored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temporal Drift&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As language models evolve and embedding models receive updates, do established reference sets remain valid? How frequently does recalibration become necessary?&lt;/p&gt;

&lt;p&gt;Monitoring DC score distributions over time can detect drift, triggering recalibration when performance degrades. But proactive recalibration schedules remain an open operational question.&lt;/p&gt;
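&lt;p&gt;One simple way to operationalize that monitoring: compare the mean DC score of recent traffic against the calibration-time baseline and trigger recalibration on a sustained drop. The `max_drop` cutoff below is an illustrative choice, not a recommendation from the research.&lt;/p&gt;

```python
import statistics

def detect_drift(baseline_scores, recent_scores, max_drop=0.1):
    # Recalibrate when the mean DC of recent traffic falls more than
    # max_drop below the calibration-time baseline.
    drop = statistics.mean(baseline_scores) - statistics.mean(recent_scores)
    return drop > max_drop

baseline = [0.91, 0.88, 0.93, 0.90]  # DC scores recorded at calibration time
recent = [0.72, 0.70, 0.75, 0.71]    # DC scores from current production traffic
print(detect_drift(baseline, recent))  # True: the distribution has shifted
```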

&lt;p&gt;&lt;strong&gt;Research Directions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Several promising avenues extend geometric hallucination detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automatic Domain Discovery&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rather than manually defining domains and constructing reference sets, can unsupervised clustering automatically identify geometric regions in embedding space corresponding to coherent domains? This would enable automated reference set construction from unlabeled data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Domain Calibration&lt;/strong&gt;&lt;br&gt;
Investigating whether mixtures of domain-specific reference sets can provide broader coverage without requiring exact domain matching. Ensemble approaches combining multiple reference sets might improve robustness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation Generation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When DC flags a response as likely hallucinated, providing explanations beyond a numerical score would aid human review. Identifying which reference examples the response deviates from most strongly could highlight specific inconsistencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integration With Retrieval&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Combining geometric detection with retrieval-augmented generation (RAG) could provide complementary hallucination mitigation. RAG grounds responses in retrieved documents; DC verifies the response respects geometric consistency.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Geometric hallucination detection shows that embedding space encodes structured, domain-specific directions linking questions to grounded answers. When these directions break, hallucination is likely.&lt;/p&gt;

&lt;p&gt;This approach is practical: it requires no LLM judge, adds minimal overhead, and achieves near-perfect discrimination within calibrated domains. While calibration is a one-time cost, it enables efficient real-time detection.&lt;/p&gt;

&lt;p&gt;The findings also reshape how we understand embeddings. There is no universal “truthfulness direction”—only local coherence within domains—challenging common assumptions and opening new research directions.&lt;/p&gt;

&lt;p&gt;For production systems, geometric detection complements retrieval, uncertainty estimates, and human review, improving reliability. The geometry was always there; we’re just learning how to read it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
