<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Henry Yan</title>
    <description>The latest articles on Forem by Henry Yan (@yany-henry).</description>
    <link>https://forem.com/yany-henry</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3881362%2Ff0a3bdfe-99c8-4cde-b1b9-fc0476b2dcc4.png</url>
      <title>Forem: Henry Yan</title>
      <link>https://forem.com/yany-henry</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/yany-henry"/>
    <language>en</language>
    <item>
      <title>Why Search-Enabled LLMs Still Get Numbers Wrong</title>
      <dc:creator>Henry Yan</dc:creator>
      <pubDate>Wed, 15 Apr 2026 23:40:54 +0000</pubDate>
      <link>https://forem.com/yany-henry/why-search-enabled-llms-still-get-numbers-wrong-n6n</link>
      <guid>https://forem.com/yany-henry/why-search-enabled-llms-still-get-numbers-wrong-n6n</guid>
      <description>&lt;p&gt;If a chatbot can search the web, shouldn't it become more accurate?&lt;/p&gt;

&lt;p&gt;That sounds intuitive, but my pilot study suggests the answer is: not always.&lt;/p&gt;

&lt;p&gt;I have been working on a small evaluation project for &lt;strong&gt;grounded numeric fact retrieval&lt;/strong&gt;. The task sounds simple: ask a model a question like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What was the birth rate in Angola in 2020?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These questions are nice for evaluation because they are public, structured, and easy to verify against real-world data sources. They also expose a subtle weakness in modern AI systems: finding a page is not the same as giving the right answer.&lt;/p&gt;

&lt;h2&gt;The Core Idea&lt;/h2&gt;

&lt;p&gt;I compare three ways an LLM might answer the same numeric question:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;no-web&lt;/code&gt;: answer from internal model knowledge only&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;web-search&lt;/code&gt;: answer with built-in search or browsing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;agentic&lt;/code&gt;: rewrite the query, search, retrieve sources, and try to validate the answer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The benchmark currently covers 20 indicators across 15 countries, for a total of 300 questions. Evaluated across four models, the existing pilot outputs already give me 1,200 graded examples.&lt;/p&gt;
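
&lt;p&gt;The 300 questions are just the cross product of the indicator and country lists with a target year. A minimal sketch (the indicator and country names below are illustrative stand-ins for my actual lists, and the fixed year of 2020 mirrors the example above rather than the full benchmark):&lt;/p&gt;

```python
from itertools import product

# Illustrative subsets; the real benchmark uses 20 indicators and 15 countries,
# which is how 20 x 15 = 300 questions arise.
indicators = ["birth rate", "life expectancy", "GDP per capita", "urban population share"]
countries = ["Angola", "Brazil", "Vietnam"]
YEAR = 2020

questions = [
    f"What was the {ind} in {country} in {YEAR}?"
    for ind, country in product(indicators, countries)
]

print(len(questions))   # 4 indicators x 3 countries = 12
print(questions[0])     # "What was the birth rate in Angola in 2020?"
```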

&lt;h2&gt;What I Found So Far&lt;/h2&gt;

&lt;p&gt;The interesting result is not just that search can help. It is that &lt;strong&gt;search changes the error profile&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example, in the current pilot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT-4o improves from &lt;strong&gt;31.67%&lt;/strong&gt; &lt;code&gt;A&lt;/code&gt;-level accuracy without search to &lt;strong&gt;44.00%&lt;/strong&gt; with search.&lt;/li&gt;
&lt;li&gt;Gemini improves from &lt;strong&gt;35.00%&lt;/strong&gt; to &lt;strong&gt;44.67%&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Qwen improves from &lt;strong&gt;19.67%&lt;/strong&gt; to &lt;strong&gt;36.33%&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Claude, in this pilot, actually performs worse with search than without it.&lt;/li&gt;
&lt;/ul&gt;
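
&lt;p&gt;For context on the &lt;code&gt;A&lt;/code&gt;-level grades above: each numeric answer is compared against a reference value. A minimal grading sketch, assuming a relative-tolerance rule (the 2% threshold and the reference number are illustrative, not my pilot's exact criterion or data):&lt;/p&gt;

```python
import math

def grade(answer: float, reference: float, rel_tol: float = 0.02) -> str:
    """Assign 'A' if the answer is within rel_tol of the reference value.

    The 2% relative tolerance is an illustrative choice, not the pilot's rule.
    """
    return "A" if math.isclose(answer, reference, rel_tol=rel_tol) else "fail"

# 39.6 births per 1,000 people is an illustrative reference value.
print(grade(39.5, 39.6))   # within 2% of the reference
print(grade(45.0, 39.6))   # off by far more than 2%
```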

&lt;p&gt;So yes, retrieval can help. But it can also hurt.&lt;/p&gt;

&lt;h2&gt;Why This Happens&lt;/h2&gt;

&lt;p&gt;There are several ways a search-enabled model can still fail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It pulls the wrong year.&lt;/li&gt;
&lt;li&gt;It picks a source that uses a different definition.&lt;/li&gt;
&lt;li&gt;It cites a page that does not actually support the answer.&lt;/li&gt;
&lt;li&gt;It combines multiple weak hints into a confident but unsupported number.&lt;/li&gt;
&lt;/ul&gt;
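
&lt;p&gt;These failure modes can be turned into a tagging function over evaluation rows. A minimal sketch (the field names are hypothetical, not my actual schema):&lt;/p&gt;

```python
def tag_failure(record: dict) -> str:
    """Classify a wrong answer into one of the failure modes above.

    `record` is a hypothetical evaluation row; the field names are illustrative.
    """
    if record["cited_year"] != record["target_year"]:
        return "wrong-year"
    if record["source_definition"] != record["reference_definition"]:
        return "definition-mismatch"
    if not record["value_found_in_source"]:
        return "unsupported-citation"
    # Fallback: e.g. a confident number synthesized from multiple weak hints.
    return "other"

row = {
    "cited_year": 2019,
    "target_year": 2020,
    "source_definition": "crude birth rate",
    "reference_definition": "crude birth rate",
    "value_found_in_source": True,
}
print(tag_failure(row))   # wrong-year
```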

&lt;p&gt;In other words, the real question is not "Can the model search?" The real question is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can the model retrieve, interpret, and justify a correct answer from evidence?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Why I Think This Matters&lt;/h2&gt;

&lt;p&gt;Many frontier AI systems are becoming more agentic. They browse, call tools, and retrieve documents before responding. That makes evaluation harder, not easier. If we only reward systems for producing answers with links, we may miss whether those links actually support the answer.&lt;/p&gt;

&lt;p&gt;Numeric facts are a good place to study this problem because they make grounding failures visible. Either the number matches the source and the reference value, or it does not.&lt;/p&gt;

&lt;h2&gt;Where This Project Goes Next&lt;/h2&gt;

&lt;p&gt;The current version is a pilot, not a finished benchmark. The next step is to turn it into a cleaner evaluation framework with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reproducible data merging,&lt;/li&gt;
&lt;li&gt;stronger citation checks,&lt;/li&gt;
&lt;li&gt;a small manually verified subset,&lt;/li&gt;
&lt;li&gt;and a clearer failure taxonomy for search-related errors.&lt;/li&gt;
&lt;/ul&gt;
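
&lt;p&gt;The citation check, for instance, can start as a crude offline test: does the claimed number literally appear in the fetched source text? A sketch (the formatting variants are an assumption about how sources print numbers, and a real check would also need to match the indicator, country, and year near the number):&lt;/p&gt;

```python
def number_supported(page_text: str, value: float) -> bool:
    """Check whether a claimed numeric value literally appears in a source page.

    A crude string-level heuristic, not real citation validation.
    """
    variants = {
        f"{value}",          # e.g. "39.6"
        f"{value:.1f}",      # one decimal place
        f"{value:,.0f}",     # e.g. "1,234" for large values
    }
    return any(v in page_text for v in variants)

page = "Angola recorded a crude birth rate of 39.6 per 1,000 people in 2020."
print(number_supported(page, 39.6))   # True
print(number_supported(page, 41.2))   # False
```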

&lt;p&gt;That is the research direction I want to push further: not just whether LLMs can search, but whether search actually makes them more trustworthy.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>nlp</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
