<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Wauldo</title>
    <description>The latest articles on Forem by Wauldo (@wauldo).</description>
    <link>https://forem.com/wauldo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3764891%2Ffad4c813-bda3-4ebe-8e22-0511e02d3a8e.png</url>
      <title>Forem: Wauldo</title>
      <link>https://forem.com/wauldo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/wauldo"/>
    <language>en</language>
    <item>
      <title>Your RAG pipeline doesn't tell you when it's wrong. Here's how to fix that.</title>
      <dc:creator>Wauldo</dc:creator>
      <pubDate>Sun, 12 Apr 2026 19:12:15 +0000</pubDate>
      <link>https://forem.com/wauldo/your-rag-pipeline-doesnt-tell-you-when-its-wrong-heres-how-to-fix-that-3d8p</link>
      <guid>https://forem.com/wauldo/your-rag-pipeline-doesnt-tell-you-when-its-wrong-heres-how-to-fix-that-3d8p</guid>
      <description>&lt;p&gt;Here's something that bugged me for a while: every RAG framework tells you &lt;em&gt;what&lt;/em&gt; the LLM said. None of them tell you &lt;em&gt;if it was true&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;You get &lt;code&gt;confidence: 0.92&lt;/code&gt; from the retriever. Cool. That means the retrieval was good. It says nothing about whether the LLM hallucinated on top of perfectly retrieved documents.&lt;/p&gt;

&lt;p&gt;Your retriever can surface exactly the right chunk; the LLM then reads "14 days" and confidently writes "60 days". Retrieval confidence: high. Answer accuracy: zero.&lt;/p&gt;

&lt;h2&gt;What if every answer came with a trust score?&lt;/h2&gt;

&lt;p&gt;Not retrieval confidence. Not perplexity. A score that compares the actual claims in the answer against the actual text in the sources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;wauldo&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HttpClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HttpClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.wauldo.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The free trial lasts 60 days.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;source_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Free trial period: 14 days. No extensions.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="c1"&gt;# "rejected"
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# 0.0
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_blocked&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# True
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;claims&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# "numerical_mismatch"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trust score is a number between 0 and 1. It's not a probability — it's a factual verification score based on claim-by-claim comparison.&lt;/p&gt;

&lt;h2&gt;What it catches&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Numerical mismatches&lt;/strong&gt; — "60 days" vs "14 days" in the source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Price is $99/month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pricing: $49/month for Pro plan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# verdict: "rejected", reason: "numerical_mismatch"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Correct claims&lt;/strong&gt; — when the answer matches:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Paris is the capital of France&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Paris is the capital of France.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# verdict: "verified", confidence: 1.0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Partial evidence&lt;/strong&gt; — when the source doesn't fully support the claim:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The API supports JSON and XML formats&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All requests must use JSON format.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# verdict: "weak", action: "review"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Plugging it into your existing code&lt;/h2&gt;

&lt;p&gt;Whatever you're using — LangChain, LlamaIndex, Haystack, raw OpenAI — the pattern is the same:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Step 1: generate answer (your existing code)
&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;your_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: verify (3 lines)
&lt;/span&gt;&lt;span class="n"&gt;check&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;retrieved_docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_blocked&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I couldn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t verify this answer against the sources.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No framework migration. No retraining. No prompt engineering.&lt;/p&gt;

&lt;h2&gt;Three modes, pick your tradeoff&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;lexical&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms&lt;/td&gt;
&lt;td&gt;Token overlap matching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hybrid&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~50ms&lt;/td&gt;
&lt;td&gt;Token + semantic embeddings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;semantic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~500ms&lt;/td&gt;
&lt;td&gt;Full embedding comparison&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Default is &lt;code&gt;lexical&lt;/code&gt;. For most production use cases, &amp;lt;1ms verification on every response is the right tradeoff.&lt;/p&gt;
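&lt;p&gt;To make "token overlap matching" concrete, here's a rough, self-contained sketch of the general idea (illustrative only; this is not wauldo's actual &lt;code&gt;lexical&lt;/code&gt; scoring, which we haven't published):&lt;/p&gt;

```python
import re

def token_overlap(answer: str, source: str) -> float:
    """Fraction of the answer's tokens that also appear in the source text."""
    tokenize = lambda s: set(re.findall(r"[a-z0-9$%]+", s.lower()))
    a, s = tokenize(answer), tokenize(source)
    return len(a & s) / len(a) if a else 0.0

# A fabricated "60 days" claim shares few tokens with a "14 days" source:
print(token_overlap("The trial lasts 60 days",
                    "Free trial period: 14 days. No extensions."))  # 0.4
```

&lt;p&gt;Pure set arithmetic on tokens is why this class of check can run in under a millisecond.&lt;/p&gt;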

&lt;h2&gt;Try it right now&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;No signup needed&lt;/strong&gt; — paste any text + source in the &lt;a href="https://wauldo.com/tools/trust-score" rel="noopener noreferrer"&gt;interactive tool&lt;/a&gt; and see the trust score live.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With code&lt;/strong&gt; — install and test locally with the mock (no API key needed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;wauldo&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MockHttpClient&lt;/span&gt;

&lt;span class="n"&gt;mock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MockHttpClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Contradiction → rejected
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;60 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;14 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# "rejected"
&lt;/span&gt;
&lt;span class="c1"&gt;# Match → verified
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;14 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;14 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# "verified"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;SDKs&lt;/strong&gt;: &lt;code&gt;pip install wauldo&lt;/code&gt; · &lt;code&gt;npm install wauldo&lt;/code&gt; · &lt;code&gt;cargo add wauldo&lt;/code&gt; · &lt;a href="https://documenter.getpostman.com/view/53502945/2sBXitDTBS" rel="noopener noreferrer"&gt;API docs (Postman)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free tier&lt;/strong&gt;: 300 requests/month — &lt;a href="https://rapidapi.com/binnewzzin/api/smart-rag-api" rel="noopener noreferrer"&gt;get a key&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;I'm building this because I got tired of shipping RAG pipelines that work on demos and break on real data. If you've solved this differently, I'd genuinely like to hear how.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>webdev</category>
    </item>
    <item>
      <title>How We Achieved 0% Hallucination Rate in Our RAG API (With Benchmarks)</title>
      <dc:creator>Wauldo</dc:creator>
      <pubDate>Sun, 05 Apr 2026 10:02:55 +0000</pubDate>
      <link>https://forem.com/wauldo/how-we-achieved-0-hallucination-rate-in-our-rag-api-with-benchmarks-4g54</link>
      <guid>https://forem.com/wauldo/how-we-achieved-0-hallucination-rate-in-our-rag-api-with-benchmarks-4g54</guid>
      <description>&lt;ul&gt;
&lt;li&gt;0% hallucination rate&lt;/li&gt;
&lt;li&gt;83% accuracy across 61 tasks&lt;/li&gt;
&lt;li&gt;4-layer verification system&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Most RAG APIs generate answers.&lt;br&gt;
We verify them.&lt;/p&gt;

&lt;p&gt;After testing 14 LLMs across 61 evaluation tasks, our pipeline maintains &lt;strong&gt;0% hallucination rate&lt;/strong&gt; at &lt;strong&gt;83% accuracy&lt;/strong&gt; — in production conditions.&lt;/p&gt;

&lt;p&gt;Here’s exactly how we did it.&lt;/p&gt;




&lt;h2&gt;The Problem Nobody Talks About&lt;/h2&gt;

&lt;p&gt;RAG is supposed to reduce hallucinations.&lt;br&gt;
In reality, most implementations just &lt;em&gt;move the problem&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;They retrieve documents…&lt;br&gt;
then blindly trust the model to interpret them correctly.&lt;/p&gt;

&lt;p&gt;The result?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing critical facts&lt;/li&gt;
&lt;li&gt;Conflicting sources ignored&lt;/li&gt;
&lt;li&gt;Confident but wrong answers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And worst of all: &lt;strong&gt;no verification layer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Most RAG systems don’t actually know if their answer is grounded.&lt;br&gt;
They just hope it is.&lt;/p&gt;




&lt;h2&gt;Our Approach: A 4-Layer Defense System&lt;/h2&gt;

&lt;p&gt;We designed our pipeline with one goal:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Make hallucination structurally impossible — not just unlikely.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;Layer 1: Retrieval That Doesn’t Miss&lt;/h2&gt;

&lt;p&gt;We use a hybrid retrieval system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BM25&lt;/strong&gt; → precise keyword matching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector search&lt;/strong&gt; → semantic recall&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the key isn’t hybrid search.&lt;br&gt;
It’s how we handle failure cases.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If retrieval is weak → downstream layers compensate&lt;/li&gt;
&lt;li&gt;If retrieval is strong → we stay fast&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Retrieval is treated as a &lt;strong&gt;signal&lt;/strong&gt;, not a source of truth.&lt;/p&gt;
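&lt;p&gt;The fusion step can be sketched in a few lines (a minimal illustration; the names and the 50/50 weight are assumptions for this post, not our production fusion):&lt;/p&gt;

```python
def fuse_scores(bm25: dict[str, float], vector: dict[str, float],
                alpha: float = 0.5) -> list[str]:
    """Blend keyword (BM25) and semantic (vector) scores after min-max
    normalization, then return doc ids ranked by the fused score."""
    def norm(scores: dict[str, float]) -> dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}
    b, v = norm(bm25), norm(vector)
    fused = {d: alpha * b.get(d, 0.0) + (1 - alpha) * v.get(d, 0.0)
             for d in b.keys() | v.keys()}
    return sorted(fused, key=fused.get, reverse=True)

# "a" wins on keywords, "b" on semantics, but "c" is solid on both,
# so fusion ranks "c" first:
print(fuse_scores({"a": 3.0, "b": 1.0, "c": 2.0},
                  {"a": 0.2, "b": 0.9, "c": 0.8})[0])  # "c"
```

&lt;p&gt;Normalizing before blending matters: raw BM25 and cosine scores live on different scales, and mixing them unnormalized silently lets one retriever dominate.&lt;/p&gt;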




&lt;h2&gt;Layer 2: Slot-Based Critical Chunks&lt;/h2&gt;

&lt;p&gt;Most RAG pipelines rank chunks and pick the top K.&lt;/p&gt;

&lt;p&gt;We don’t.&lt;/p&gt;

&lt;p&gt;We introduced a slot-based system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detect critical query intents (numbers, entities, dates)&lt;/li&gt;
&lt;li&gt;Force-include matching chunks in the context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No critical data is dropped&lt;/li&gt;
&lt;li&gt;No reliance on ranking luck&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 It’s &lt;strong&gt;constraint-based&lt;/strong&gt;, not score-based.&lt;/p&gt;
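&lt;p&gt;Here's what slot-based force-inclusion can look like in miniature (a hedged sketch; the function name and the number-detection regex are ours for illustration, not the production code):&lt;/p&gt;

```python
import re

def build_context(query: str, ranked_chunks: list[str], k: int = 3) -> list[str]:
    """Take the top-k ranked chunks, but force-include any chunk carrying
    a critical token (number, date, percentage) mentioned in the query."""
    critical = set(re.findall(r"\d[\d./%]*", query))
    forced = [c for c in ranked_chunks if any(tok in c for tok in critical)]
    # dedupe while keeping forced chunks ahead of the ranked head
    merged = list(dict.fromkeys(forced + ranked_chunks[:k]))
    return merged[: max(k, len(forced))]

chunks = ["pricing overview", "refund policy", "support hours",
          "Free trial period: 14 days. No extensions."]
# The 14-day chunk ranks last, but "14" in the query forces it in:
print(build_context("How long is the 14 day trial?", chunks, k=2))
```

&lt;p&gt;The point of the sketch: the critical chunk makes it into the context even when the ranker buries it, which is exactly what "no reliance on ranking luck" means.&lt;/p&gt;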




&lt;h2&gt;Layer 3: Deterministic Key Facts Injection&lt;/h2&gt;

&lt;p&gt;Before calling the LLM, we extract key facts directly from the context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Numbers&lt;/li&gt;
&lt;li&gt;Dates&lt;/li&gt;
&lt;li&gt;Percentages&lt;/li&gt;
&lt;li&gt;Identifiers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then inject them into the prompt as &lt;strong&gt;non-negotiable facts&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This removes ambiguity entirely.&lt;/p&gt;

&lt;p&gt;The model doesn’t “guess” values anymore.&lt;br&gt;
It &lt;strong&gt;anchors to verified data&lt;/strong&gt;.&lt;/p&gt;
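&lt;p&gt;A stripped-down version of the extraction step (the regex below is a simplified stand-in; the real extractor handles more formats):&lt;/p&gt;

```python
import re

# ISO dates first, then money amounts / percentages / bare numbers
FACT_PATTERN = r"\d{4}-\d{2}-\d{2}|\$?\d[\d,]*(?:\.\d+)?%?"

def extract_key_facts(context: str) -> list[str]:
    """Pull dates, money amounts, percentages, and numbers out of the context."""
    return re.findall(FACT_PATTERN, context)

def inject_facts(prompt: str, context: str) -> str:
    """Prepend the extracted values to the prompt as non-negotiable facts."""
    facts = ", ".join(extract_key_facts(context))
    return f"Non-negotiable facts from the sources: {facts}\n\n{prompt}"

print(extract_key_facts("Trial: 14 days. Price: $49/month, churn 3.2% since 2024-01-15."))
# ['14', '$49', '3.2%', '2024-01-15']
```

&lt;p&gt;Because the extraction is regex-based rather than model-based, it's deterministic: the same context always yields the same fact list.&lt;/p&gt;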




&lt;h2&gt;Layer 4: Post-Generation Grounding Check&lt;/h2&gt;

&lt;p&gt;This is where most systems stop.&lt;br&gt;
We don’t.&lt;/p&gt;

&lt;p&gt;After generation, we run a &lt;strong&gt;grounding verification step&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extract terms from the answer&lt;/li&gt;
&lt;li&gt;Check if ≥60% exist in the retrieved context&lt;/li&gt;
&lt;li&gt;If not → reject or flag&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates a &lt;strong&gt;closed-loop system&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;No grounded context → no valid answer.&lt;/p&gt;
&lt;/blockquote&gt;
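&lt;p&gt;In miniature, the grounding check is set arithmetic over answer terms (a simplified sketch; the production version does smarter term extraction than this stopword filter):&lt;/p&gt;

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "and", "for"}

def is_grounded(answer: str, context: str, threshold: float = 0.6) -> bool:
    """Accept the answer only if at least `threshold` of its content
    terms literally appear in the retrieved context."""
    terms = [t for t in re.findall(r"[a-z0-9$%]+", answer.lower())
             if t not in STOPWORDS]
    if not terms:
        return False  # nothing checkable: fail closed
    hits = sum(1 for t in terms if t in context.lower())
    return hits / len(terms) >= threshold

source = "Free trial period: 14 days. No extensions."
print(is_grounded("The trial lasts 14 days", source))  # True  (3/4 terms grounded)
print(is_grounded("The trial lasts 60 days", source))  # False (2/4 terms grounded)
```

&lt;p&gt;Failing closed on empty term lists is the detail that makes the loop actually closed: an answer with nothing verifiable in it never passes by default.&lt;/p&gt;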




&lt;h2&gt;Benchmarks (Real Numbers)&lt;/h2&gt;

&lt;p&gt;We evaluated the system across 61 tasks and 14 LLMs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Eval (61 tasks)&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucination rate&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG retrieval&lt;/td&gt;
&lt;td&gt;88%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-doc comparison&lt;/td&gt;
&lt;td&gt;93%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg latency&lt;/td&gt;
&lt;td&gt;1.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You don’t need 100% accuracy to achieve 0% hallucination.&lt;br&gt;
You need &lt;strong&gt;verification&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;What Didn’t Work (And Why It Matters)&lt;/h2&gt;

&lt;p&gt;We tried multiple “obvious” improvements that failed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-step retrieval → added noise, reduced precision&lt;/li&gt;
&lt;li&gt;Header penalties → broke valid top chunks&lt;/li&gt;
&lt;li&gt;Over-aggressive reranking → increased variance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;RAG is a &lt;strong&gt;balanced system&lt;/strong&gt;, not a collection of optimizations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Small changes can silently degrade performance.&lt;/p&gt;




&lt;h2&gt;Why This Approach Works&lt;/h2&gt;

&lt;p&gt;Most systems try to make the model smarter.&lt;/p&gt;

&lt;p&gt;We did the opposite:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce model freedom&lt;/li&gt;
&lt;li&gt;Increase constraints&lt;/li&gt;
&lt;li&gt;Add verification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 The result is not just better answers.&lt;br&gt;
👉 It’s &lt;strong&gt;reliable answers&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;Try It&lt;/h2&gt;

&lt;p&gt;We made the API available publicly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free tier on &lt;a href="https://rapidapi.com/binnewzzin/api/smart-rag-api" rel="noopener noreferrer"&gt;RapidAPI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Docs: &lt;a href="https://wauldo.com/docs" rel="noopener noreferrer"&gt;https://wauldo.com/docs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building with RAG, this will save you months of trial and error.&lt;/p&gt;




&lt;h2&gt;Final Thought&lt;/h2&gt;

&lt;p&gt;Hallucination isn’t a model problem.&lt;/p&gt;

&lt;p&gt;It’s a &lt;strong&gt;system design problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Solve it at the architecture level —&lt;br&gt;
and the model becomes predictable.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
