<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Henry Yan</title>
    <description>The latest articles on Forem by Henry Yan (@yany-henry).</description>
    <link>https://forem.com/yany-henry</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3881362%2Ff0a3bdfe-99c8-4cde-b1b9-fc0476b2dcc4.png</url>
      <title>Forem: Henry Yan</title>
      <link>https://forem.com/yany-henry</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/yany-henry"/>
    <language>en</language>
    <item>
      <title>Why Search-Enabled LLMs Still Get Numbers Wrong</title>
      <dc:creator>Henry Yan</dc:creator>
      <pubDate>Wed, 15 Apr 2026 23:40:54 +0000</pubDate>
      <link>https://forem.com/yany-henry/why-search-enabled-llms-still-get-numbers-wrong-n6n</link>
      <guid>https://forem.com/yany-henry/why-search-enabled-llms-still-get-numbers-wrong-n6n</guid>
      <description>&lt;p&gt;If a chatbot can search the web, shouldn't it become more accurate?&lt;/p&gt;

&lt;p&gt;That sounds intuitive, but my pilot study suggests the answer is: not always.&lt;/p&gt;

&lt;p&gt;I have been working on a small evaluation project for &lt;strong&gt;grounded numeric fact retrieval&lt;/strong&gt;. The task sounds simple: ask a model a question like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What was the birth rate in Angola in 2020?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These questions are nice for evaluation because they are public, structured, and easy to verify against real-world data sources. They also expose a subtle weakness in modern AI systems: finding a page is not the same as giving the right answer.&lt;/p&gt;

&lt;h2&gt;The Core Idea&lt;/h2&gt;

&lt;p&gt;I compare three ways an LLM might answer the same numeric question:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;no-web&lt;/code&gt;: answer from internal model knowledge only&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;web-search&lt;/code&gt;: answer with built-in search or browsing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;agentic&lt;/code&gt;: rewrite the query, search, retrieve sources, and try to validate the answer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The benchmark currently covers 20 indicators across 15 countries, for a total of 300 questions. Evaluated across four models, the existing pilot outputs already give me 1,200 graded examples.&lt;/p&gt;
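
&lt;p&gt;The 300 questions are just the cross product of the indicator and country lists with a target year. A minimal sketch (the indicator and country names below are illustrative stand-ins for my actual lists, and the fixed year of 2020 mirrors the example above rather than the full benchmark):&lt;/p&gt;

```python
from itertools import product

# Illustrative subsets; the real benchmark uses 20 indicators and 15 countries,
# which is how 20 x 15 = 300 questions arise.
indicators = ["birth rate", "life expectancy", "GDP per capita", "urban population share"]
countries = ["Angola", "Brazil", "Vietnam"]
YEAR = 2020

questions = [
    f"What was the {ind} in {country} in {YEAR}?"
    for ind, country in product(indicators, countries)
]

print(len(questions))   # 4 indicators x 3 countries = 12
print(questions[0])     # "What was the birth rate in Angola in 2020?"
```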

&lt;h2&gt;What I Found So Far&lt;/h2&gt;

&lt;p&gt;The interesting result is not just that search can help. It is that &lt;strong&gt;search changes the error profile&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example, in the current pilot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT-4o improves from &lt;strong&gt;31.67%&lt;/strong&gt; &lt;code&gt;A&lt;/code&gt;-level accuracy without search to &lt;strong&gt;44.00%&lt;/strong&gt; with search.&lt;/li&gt;
&lt;li&gt;Gemini improves from &lt;strong&gt;35.00%&lt;/strong&gt; to &lt;strong&gt;44.67%&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Qwen improves from &lt;strong&gt;19.67%&lt;/strong&gt; to &lt;strong&gt;36.33%&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Claude, in this pilot, actually performs worse with search than without it.&lt;/li&gt;
&lt;/ul&gt;
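
&lt;p&gt;For context on the &lt;code&gt;A&lt;/code&gt;-level grades above: each numeric answer is compared against a reference value. A minimal grading sketch, assuming a relative-tolerance rule (the 2% threshold and the reference number are illustrative, not my pilot's exact criterion or data):&lt;/p&gt;

```python
import math

def grade(answer: float, reference: float, rel_tol: float = 0.02) -> str:
    """Assign 'A' if the answer is within rel_tol of the reference value.

    The 2% relative tolerance is an illustrative choice, not the pilot's rule.
    """
    return "A" if math.isclose(answer, reference, rel_tol=rel_tol) else "fail"

# 39.6 births per 1,000 people is an illustrative reference value.
print(grade(39.5, 39.6))   # within 2% of the reference
print(grade(45.0, 39.6))   # off by far more than 2%
```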

&lt;p&gt;So yes, retrieval can help. But it can also hurt.&lt;/p&gt;

&lt;h2&gt;Why This Happens&lt;/h2&gt;

&lt;p&gt;There are several ways a search-enabled model can still fail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It pulls the wrong year.&lt;/li&gt;
&lt;li&gt;It picks a source that uses a different definition.&lt;/li&gt;
&lt;li&gt;It cites a page that does not actually support the answer.&lt;/li&gt;
&lt;li&gt;It combines multiple weak hints into a confident but unsupported number.&lt;/li&gt;
&lt;/ul&gt;
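
&lt;p&gt;These failure modes can be turned into a tagging function over evaluation rows. A minimal sketch (the field names are hypothetical, not my actual schema):&lt;/p&gt;

```python
def tag_failure(record: dict) -> str:
    """Classify a wrong answer into one of the failure modes above.

    `record` is a hypothetical evaluation row; the field names are illustrative.
    """
    if record["cited_year"] != record["target_year"]:
        return "wrong-year"
    if record["source_definition"] != record["reference_definition"]:
        return "definition-mismatch"
    if not record["value_found_in_source"]:
        return "unsupported-citation"
    # Fallback: e.g. a confident number synthesized from multiple weak hints.
    return "other"

row = {
    "cited_year": 2019,
    "target_year": 2020,
    "source_definition": "crude birth rate",
    "reference_definition": "crude birth rate",
    "value_found_in_source": True,
}
print(tag_failure(row))   # wrong-year
```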

&lt;p&gt;In other words, the real question is not "Can the model search?" The real question is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can the model retrieve, interpret, and justify a correct answer from evidence?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Why I Think This Matters&lt;/h2&gt;

&lt;p&gt;Many frontier AI systems are becoming more agentic. They browse, call tools, and retrieve documents before responding. That makes evaluation harder, not easier. If we only reward systems for producing answers with links, we may miss whether those links actually support the answer.&lt;/p&gt;

&lt;p&gt;Numeric facts are a good place to study this problem because they make grounding failures visible. Either the number matches the source and the reference value, or it does not.&lt;/p&gt;

&lt;h2&gt;Where This Project Goes Next&lt;/h2&gt;

&lt;p&gt;The current version is a pilot, not a finished benchmark. The next step is to turn it into a cleaner evaluation framework with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reproducible data merging,&lt;/li&gt;
&lt;li&gt;stronger citation checks,&lt;/li&gt;
&lt;li&gt;a small manually verified subset,&lt;/li&gt;
&lt;li&gt;and a clearer failure taxonomy for search-related errors.&lt;/li&gt;
&lt;/ul&gt;
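
&lt;p&gt;The citation check, for instance, can start as a crude offline test: does the claimed number literally appear in the fetched source text? A sketch (the formatting variants are an assumption about how sources print numbers, and a real check would also need to match the indicator, country, and year near the number):&lt;/p&gt;

```python
def number_supported(page_text: str, value: float) -> bool:
    """Check whether a claimed numeric value literally appears in a source page.

    A crude string-level heuristic, not real citation validation.
    """
    variants = {
        f"{value}",          # e.g. "39.6"
        f"{value:.1f}",      # one decimal place
        f"{value:,.0f}",     # e.g. "1,234" for large values
    }
    return any(v in page_text for v in variants)

page = "Angola recorded a crude birth rate of 39.6 per 1,000 people in 2020."
print(number_supported(page, 39.6))   # True
print(number_supported(page, 41.2))   # False
```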

&lt;p&gt;That is the research direction I want to push further: not just whether LLMs can search, but whether search actually makes them more trustworthy.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>nlp</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
