<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Shuntaro Okuma</title>
    <description>The latest articles on Forem by Shuntaro Okuma (@shuntarookuma).</description>
    <link>https://forem.com/shuntarookuma</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3796661%2Fc4ad13e3-cc4c-4577-80ba-1a147449eddf.png</url>
      <title>Forem: Shuntaro Okuma</title>
      <link>https://forem.com/shuntarookuma</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/shuntarookuma"/>
    <language>en</language>
    <item>
      <title>I Tested 12 LLMs With Few-Shot Examples. The Results Were Not What I Expected.</title>
      <dc:creator>Shuntaro Okuma</dc:creator>
      <pubDate>Thu, 26 Mar 2026 13:56:16 +0000</pubDate>
      <link>https://forem.com/shuntarookuma/i-tested-12-llms-with-few-shot-examples-the-results-were-not-what-i-expected-2de6</link>
      <guid>https://forem.com/shuntarookuma/i-tested-12-llms-with-few-shot-examples-the-results-were-not-what-i-expected-2de6</guid>
      <description>&lt;p&gt;In a &lt;a href="https://dev.to/shuntarookuma/when-more-examples-make-your-llm-worse-discovering-few-shot-collapse-106i"&gt;previous article&lt;/a&gt;, I tested 8 models across 4 tasks and reported on "few-shot collapse" — cases where adding few-shot examples actually degrades LLM performance.&lt;/p&gt;

&lt;p&gt;This time, I expanded the experiment to 12 models (6 cloud + 6 local) and 5 tasks to see whether those findings hold at a larger scale. They do — and I found even more dramatic cases, including a model that dropped from 93% to 30% with more examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I tested
&lt;/h2&gt;

&lt;p&gt;I evaluated 12 models — 6 cloud APIs and 6 local models — across 5 tasks designed to mirror real business scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud models:&lt;/strong&gt; Claude Haiku 4.5, Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3 Flash, GPT-4o-mini, GPT-5.4-mini&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local models:&lt;/strong&gt; Gemma 3 27B, LLaMA 4 Scout (17B active, MoE), GPT-OSS 120B, Qwen 3.5 (35B total / 3B active, MoE), Ministral 3 14B Reasoning, Phi-4 Reasoning Plus&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Classification&lt;/strong&gt; — Categorize customer support inquiries into specific categories (exact match scoring)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Fix&lt;/strong&gt; — Identify and fix bugs in short Python functions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route Optimization&lt;/strong&gt; — Calculate optimal delivery routes with time windows and fuel costs (LLM-as-judge scoring)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sentiment Analysis&lt;/strong&gt; — Classify product reviews as positive/negative/neutral/mixed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summarization&lt;/strong&gt; — Extract key points from news articles into summaries (F1 scoring)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each model-task pair was evaluated at 0, 1, 2, 4, and 8 shots, with 3 trials per configuration and TF-IDF-based dynamic example selection. That's 60 model-task pairs and over 27,000 individual evaluations.&lt;/p&gt;
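
&lt;p&gt;To make the protocol concrete, here is a minimal sketch of the evaluation grid in Python. The helper callables (&lt;code&gt;select_examples&lt;/code&gt;, &lt;code&gt;build_prompt&lt;/code&gt;, &lt;code&gt;query_model&lt;/code&gt;, &lt;code&gt;score_response&lt;/code&gt;) are placeholders for whatever harness you use; this is an illustration of the setup, not AdaptGauge's actual API.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from statistics import mean

# Shot counts and trial count from the setup described above.
SHOT_COUNTS = [0, 1, 2, 4, 8]
TRIALS = 3

def evaluate_grid(models, tasks, select_examples, build_prompt, query_model, score_response):
    """Average score for every (model, task, shots) cell over TRIALS runs.

    select_examples(task, item, k) picks k examples (e.g. by TF-IDF similarity),
    build_prompt(task, item, examples) assembles the few-shot prompt,
    query_model(model, prompt) calls the model, and
    score_response(task, item, answer) returns a score between 0 and 1.
    """
    results = {}
    for model in models:
        for task in tasks:
            for shots in SHOT_COUNTS:
                trial_scores = []
                for _ in range(TRIALS):
                    item_scores = []
                    for item in task["items"]:
                        examples = select_examples(task, item, shots)
                        prompt = build_prompt(task, item, examples)
                        answer = query_model(model, prompt)
                        item_scores.append(score_response(task, item, answer))
                    trial_scores.append(mean(item_scores))
                results[(model, task["name"], shots)] = mean(trial_scores)
    return results
&lt;/code&gt;&lt;/pre&gt;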

&lt;p&gt;I'll describe how to explore the full results later, but here are three patterns that stood out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 1: The zero-shot leader can crash to last place
&lt;/h2&gt;

&lt;p&gt;Gemini 3 Flash scored &lt;strong&gt;93%&lt;/strong&gt; on route optimization at zero-shot — the highest of any model. Then I added examples.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Shots&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;1&lt;/th&gt;
&lt;th&gt;2&lt;/th&gt;
&lt;th&gt;4&lt;/th&gt;
&lt;th&gt;8&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Score&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;93%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;93%&lt;/td&gt;
&lt;td&gt;43%&lt;/td&gt;
&lt;td&gt;53%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 8-shot, it scored 30%. The model that was the best at zero-shot became the worst with examples. A &lt;strong&gt;63-point drop&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77epb7ywhw7rix2p84t8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77epb7ywhw7rix2p84t8.png" alt="Learning curves for route optimization. Gemini 3 Flash (red) collapses from 93% to 30%, while Gemma 3 27B (green, same model family) stays stable around 90%." width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the twist: &lt;strong&gt;Gemma 3 27B — from the same model family — stayed stable around 90% across all shot counts.&lt;/strong&gt; Same architecture lineage, completely different behavior. This isn't a property of the Gemini/Gemma family. It's specific to Gemini 3 Flash on this task.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 2: Most models benefit from few-shot examples
&lt;/h2&gt;

&lt;p&gt;On classification, every model scored between 0% and 20% at zero-shot. They all looked equally bad. Based on a zero-shot benchmark alone, you'd conclude these models can't classify customer support tickets.&lt;/p&gt;

&lt;p&gt;But with examples, performance improved dramatically across the board. The graph is a bit busy, but you can see the overall upward trend from 0-shot to 8-shot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybkclzp59jqzf4l56i4j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybkclzp59jqzf4l56i4j.png" alt="Classification learning curves. All models start at 0-20% at zero-shot but diverge dramatically by 8-shot, spreading from 27% to 80%." width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At 8-shot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Haiku: 80%&lt;/strong&gt; (from 20%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Sonnet: 73%&lt;/strong&gt; (from 20%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-OSS 120B: 73%&lt;/strong&gt; (from 0%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemma 3 27B: 67%&lt;/strong&gt; (from 0%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4o-mini: 33%&lt;/strong&gt; (from 0%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini 2.5 Flash: 27%&lt;/strong&gt; (from 13%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most models improved substantially with examples: Claude Haiku reached 80%, and Claude Sonnet and GPT-OSS 120B also showed strong gains. Gemma 3 27B, which performed well on route optimization in Pattern 1, went from 0% to 67%. On the other hand, GPT-4o-mini and Gemini 2.5 Flash barely improved despite starting just as low.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you pick your model from zero-shot benchmarks, you might choose the wrong one.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 3: Models bad at a task stay bad — even with examples
&lt;/h2&gt;

&lt;p&gt;On summarization, most models improved steadily with more examples. This is the behavior everyone expects from few-shot prompting. The graph is busy again, but the overall upward trend is clearer than with classification:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiykjtigps4ef7pzdgwt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiykjtigps4ef7pzdgwt.png" alt="Summarization learning curves. Most models show steady improvement from 0-shot to 8-shot, with Gemma 3 27B reaching 75%." width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gemma 3 27B — a local model — achieved the highest score at 75%, outperforming all cloud models. Claude Sonnet followed at 73%, then Gemini 3 Flash at 72%. For straightforward tasks, local models can be more than enough.&lt;/p&gt;

&lt;p&gt;However, even on this task, &lt;strong&gt;Phi-4 Reasoning Plus&lt;/strong&gt; and &lt;strong&gt;Ministral 3 14B&lt;/strong&gt; scored poorly. Both are reasoning-specialized models, optimized for expanding and elaborating information — not compressing it as summarization requires. This isn't "collapse" from adding examples; they simply weren't suited for the task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Few-shot prompting works well for most models on most tasks, but models that are fundamentally mismatched with a task won't be saved by more examples.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The 60 model-task combinations fall into three patterns
&lt;/h2&gt;

&lt;p&gt;To summarize the three patterns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Few-shot causes collapse&lt;/strong&gt; — Like Gemini 3 Flash on route optimization in Pattern 1, adding examples dramatically degrades performance. The most notable cases:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Drop&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3 Flash&lt;/td&gt;
&lt;td&gt;Route Optimization&lt;/td&gt;
&lt;td&gt;Gradual decline&lt;/td&gt;
&lt;td&gt;93% → 30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3.5 (3B active)&lt;/td&gt;
&lt;td&gt;Code Fix&lt;/td&gt;
&lt;td&gt;Gradual decline&lt;/td&gt;
&lt;td&gt;56% → 0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ministral 3 14B&lt;/td&gt;
&lt;td&gt;Code Fix&lt;/td&gt;
&lt;td&gt;Peak regression&lt;/td&gt;
&lt;td&gt;44% → 33%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;2. Few-shot works as expected&lt;/strong&gt; — Like summarization for most models, performance improves steadily with more examples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Task-model mismatch&lt;/strong&gt; — As described in Pattern 3, models like Phi-4 Reasoning Plus and Ministral 3 14B scored low on summarization even at zero-shot. Adding examples didn't help — this isn't "collapse" but a fundamental mismatch.&lt;/p&gt;




&lt;p&gt;Additionally, four pairs showed temporary dips that later recovered. The scores came back on their own, but testing at only a single shot count could easily have led to the wrong conclusion:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4-mini&lt;/td&gt;
&lt;td&gt;Classification&lt;/td&gt;
&lt;td&gt;60% at 2-shot → 27% at 4-shot → 60% at 8-shot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS 120B&lt;/td&gt;
&lt;td&gt;Route Optimization&lt;/td&gt;
&lt;td&gt;78% at 0-shot → 58% at 1-shot → 74% at 8-shot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;
&lt;td&gt;Route Optimization&lt;/td&gt;
&lt;td&gt;63% at 2-shot → 52% at 4-shot → 63% at 8-shot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3.5&lt;/td&gt;
&lt;td&gt;Classification&lt;/td&gt;
&lt;td&gt;40% at 2-shot → 20% at 4-shot → 40% at 8-shot&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Testing at multiple shot counts helps catch these, though the issue may be less about shot count itself and more about how the specific examples provided interact with the model's priors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Who performed best?
&lt;/h3&gt;

&lt;p&gt;Measuring adaptation efficiency (area under the learning curve across all tasks):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Avg AUC&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;0.815&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Gemma 3 27B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Local&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.814&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;0.802&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;LLaMA 4 Scout&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Local&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.748&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;GPT-5.4-mini&lt;/td&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;0.730&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;A 27B local model matched Claude Haiku's adaptation efficiency.&lt;/strong&gt; LLaMA 4 Scout, with only 17B active parameters (MoE), outperformed GPT-5.4-mini. Results will vary depending on the evaluation method and tasks, but this suggests that with proper few-shot prompting, local models can achieve performance comparable to cloud APIs.&lt;/p&gt;
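
&lt;p&gt;For reference, a learning-curve AUC like the "Avg AUC" column above can be computed by integrating the curve over the shot axis and normalizing. The sketch below shows the general idea; it is not necessarily AdaptGauge's exact formula.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def learning_curve_auc(scores_by_shot):
    """Normalized area under a learning curve.

    scores_by_shot maps shot count to score, e.g. the Pattern 1 table above:
    {0: 0.93, 1: 0.93, 2: 0.43, 4: 0.53, 8: 0.30}. Trapezoidal integration over
    the shot axis, divided by the total width, so a flat curve at 1.0 scores 1.0.
    """
    shots = sorted(scores_by_shot)
    area = 0.0
    for left, right in zip(shots, shots[1:]):
        width = right - left
        area += width * (scores_by_shot[left] + scores_by_shot[right]) / 2
    return area / (shots[-1] - shots[0])

# Gemini 3 Flash on route optimization (the collapsing curve): roughly 0.53
print(learning_curve_auc({0: 0.93, 1: 0.93, 2: 0.43, 4: 0.53, 8: 0.30}))
&lt;/code&gt;&lt;/pre&gt;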

&lt;h2&gt;
  
  
  Prior research
&lt;/h2&gt;

&lt;p&gt;Few-shot performance degradation has been reported by several independent studies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tang et al. (2025)&lt;/strong&gt; documented &lt;a href="https://arxiv.org/abs/2509.13196" rel="noopener noreferrer"&gt;"over-prompting"&lt;/a&gt; — performance peaks then declines — across GPT-4o, DeepSeek-V3, Gemma-3, LLaMA-3, and Mistral.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lin &amp;amp; Mohaisen (NDSS 2025)&lt;/strong&gt; found that few-shot examples &lt;a href="https://www.ndss-symposium.org/wp-content/uploads/2025-1491-paper.pdf" rel="noopener noreferrer"&gt;degraded vulnerability detection&lt;/a&gt;: Gemma 7B dropped from 78% to 40%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chroma Research (2025)&lt;/strong&gt; showed that simply &lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;adding more tokens&lt;/a&gt; — even irrelevant ones — degrades performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Min et al. (2022)&lt;/strong&gt; found that randomly replacing labels in few-shot examples &lt;a href="https://arxiv.org/abs/2202.12837" rel="noopener noreferrer"&gt;barely hurts performance&lt;/a&gt; — suggesting models aren't learning from examples the way we assume.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The phenomenon is well-documented. This makes it all the more important to evaluate whether few-shot prompting actually works for your specific use case before deploying to production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Don't assume more examples = better.&lt;/strong&gt; It's worth testing at multiple shot counts. The optimal number varies by model and task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't pick models from zero-shot benchmarks alone.&lt;/strong&gt; I found that rankings can change significantly with examples. When referencing benchmarks, check whether they were measured at zero-shot or few-shot — the methodology matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distinguish collapse from task mismatch.&lt;/strong&gt; If scores drop after adding examples, check the zero-shot baseline. Low from the start suggests a model-task compatibility issue. High at zero-shot but dropping with examples points to a few-shot prompting effect (a rough triage sketch follows this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure, don't guess.&lt;/strong&gt; Whether few-shot prompting helps a specific model-task pair can only be determined by actually evaluating it. Tracking the full learning curve ensures you don't miss non-monotonic patterns.&lt;/li&gt;
&lt;/ol&gt;
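
&lt;p&gt;As a rough illustration of takeaway 3, here is one way to triage a drop. The thresholds are arbitrary placeholders chosen for illustration, so treat the function as a starting point rather than a rule.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def triage_drop(zero_shot, eight_shot, peak, low_baseline=0.3, collapse_ratio=0.8):
    """Rough triage of a score drop; the thresholds are illustrative placeholders.

    A low zero-shot baseline points at model-task mismatch; a high baseline that
    degrades as examples are added points at a few-shot prompting effect.
    """
    if max(zero_shot, eight_shot) &amp;lt;= low_baseline:
        return "task mismatch: weak regardless of examples"
    if eight_shot &amp;lt; collapse_ratio * zero_shot:
        return "few-shot collapse: examples are hurting"
    if peak &amp;gt; max(zero_shot, eight_shot):
        return "non-monotonic: peaked at an intermediate shot count"
    return "few-shot prompting is helping (or neutral)"

# Gemini 3 Flash on route optimization: high zero-shot, collapses with examples.
print(triage_drop(zero_shot=0.93, eight_shot=0.30, peak=0.93))
&lt;/code&gt;&lt;/pre&gt;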

&lt;h2&gt;
  
  
  Reproducing these results
&lt;/h2&gt;

&lt;p&gt;The evaluation was run with &lt;a href="https://github.com/ShuntaroOkuma/adapt-gauge-core" rel="noopener noreferrer"&gt;AdaptGauge&lt;/a&gt; (OSS, MIT license), a tool that tracks learning curves, auto-detects collapse, and classifies degradation patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The full results from this 12-model × 5-task experiment are available as default demo data.&lt;/strong&gt; After installation, you can immediately explore the patterns and learning curves discussed in this article — no API keys needed.&lt;/p&gt;

&lt;p&gt;To evaluate your own tasks and models, AdaptGauge supports cloud APIs as well as local models via any OpenAI-compatible API (LM Studio, Ollama, etc.).&lt;/p&gt;
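
&lt;p&gt;If you want to sanity-check that a local server speaks the OpenAI-compatible API before running an evaluation against it, a quick probe with the &lt;code&gt;openai&lt;/code&gt; Python client looks like the sketch below. The base URL assumes LM Studio's default local server (Ollama typically serves an OpenAI-compatible endpoint at http://localhost:11434/v1); this is a generic example, not AdaptGauge configuration.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from openai import OpenAI

# Assumes LM Studio's default local server; adjust base_url and the model name
# to whatever your own setup exposes.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="gemma-3-27b",  # use the model identifier your local server lists
    messages=[
        {"role": "user", "content": "Classify this ticket: 'I was charged twice.'"},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;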

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/ShuntaroOkuma/adapt-gauge-core" rel="noopener noreferrer"&gt;github.com/ShuntaroOkuma/adapt-gauge-core&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chollet (2019) — &lt;a href="https://arxiv.org/abs/1911.01547" rel="noopener noreferrer"&gt;"On the Measure of Intelligence"&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Min et al. (2022) — &lt;a href="https://arxiv.org/abs/2202.12837" rel="noopener noreferrer"&gt;"Rethinking the Role of Demonstrations"&lt;/a&gt;, EMNLP&lt;/li&gt;
&lt;li&gt;Liu et al. (2024) — &lt;a href="https://arxiv.org/abs/2307.03172" rel="noopener noreferrer"&gt;"Lost in the Middle"&lt;/a&gt;, TACL&lt;/li&gt;
&lt;li&gt;Tang et al. (2025) — &lt;a href="https://arxiv.org/abs/2509.13196" rel="noopener noreferrer"&gt;"The Few-shot Dilemma"&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Lin &amp;amp; Mohaisen (2025) — &lt;a href="https://www.ndss-symposium.org/wp-content/uploads/2025-1491-paper.pdf" rel="noopener noreferrer"&gt;"From Large to Mammoth"&lt;/a&gt;, NDSS&lt;/li&gt;
&lt;li&gt;Chroma Research (2025) — &lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;"Context Rot"&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>discuss</category>
    </item>
    <item>
      <title>How I Measure My Dify Chatbot Quality with Scenario Testing</title>
      <dc:creator>Shuntaro Okuma</dc:creator>
      <pubDate>Mon, 23 Mar 2026 15:01:35 +0000</pubDate>
      <link>https://forem.com/shuntarookuma/how-i-measure-my-dify-chatbot-quality-with-scenario-testing-5bl0</link>
      <guid>https://forem.com/shuntarookuma/how-i-measure-my-dify-chatbot-quality-with-scenario-testing-5bl0</guid>
      <description>&lt;h2&gt;
  
  
  What I did
&lt;/h2&gt;

&lt;p&gt;I designed multi-turn conversation scenarios for a Dify chatbot, ran them automatically via the API, and measured response quality quantitatively.&lt;/p&gt;

&lt;p&gt;If you've built chatbots with Dify, you've probably noticed this: single-turn Q&amp;amp;A works fine, but once users get into 3-4 turn conversations, quality drops noticeably. So I built automated tests — multi-turn scenarios with expected responses, fired against Dify's API — to catch these problems before they reach production.&lt;/p&gt;
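
&lt;p&gt;To make "fired against Dify's API" concrete, the sketch below walks a scripted multi-turn scenario through the chat-messages endpoint in blocking mode and carries the conversation_id forward between turns. The endpoint and field names follow Dify's chat API as I understand it (check your instance's API reference), and the sketch illustrates the approach rather than ConvoProbe's implementation.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import requests

DIFY_API_BASE = "https://api.dify.ai/v1"  # or your self-hosted instance
DIFY_API_KEY = "your-dify-app-api-key"

def run_scenario(turns, user="scenario-tester"):
    """Send each scripted user message in order, staying in one conversation."""
    headers = {"Authorization": f"Bearer {DIFY_API_KEY}"}
    conversation_id = ""
    transcript = []
    for user_message in turns:
        payload = {
            "inputs": {},
            "query": user_message,
            "response_mode": "blocking",
            "conversation_id": conversation_id,
            "user": user,
        }
        resp = requests.post(f"{DIFY_API_BASE}/chat-messages", headers=headers, json=payload)
        resp.raise_for_status()
        data = resp.json()
        conversation_id = data["conversation_id"]  # reuse it so turn 4 can reference turn 1
        transcript.append({"user": user_message, "bot": data["answer"]})
    return transcript

turns = [
    "I want to return the headphones I bought last week.",
    "They were a gift, so I do not have the receipt.",
    "What were the return conditions you mentioned at the start?",
]
print(run_scenario(turns))
&lt;/code&gt;&lt;/pre&gt;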




&lt;h2&gt;
  
  
  Background: existing eval tools and the remaining gap
&lt;/h2&gt;

&lt;p&gt;Dify has official integrations with several observability and evaluation tools. These tools aren't just for tracing — &lt;strong&gt;they also have evaluation capabilities&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Evaluation features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LangSmith&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Datasets + Evaluators, LLM-as-Judge, human feedback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Langfuse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Datasets, LLM-as-Judge, human feedback, custom scores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Opik&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM-as-Judge, 8 conversation-specific metrics, dataset evaluation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Arize AX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM-as-Judge, Session Evals, human annotation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Phoenix&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM-as-Judge, Evaluator Hub&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These tools can, for example, run an application against a dataset of &lt;code&gt;{input, expected_output}&lt;/code&gt; pairs and compare scores before and after changes. However, none of them seem to support designing and executing multi-turn conversation scenarios to check quality end-to-end.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I wanted
&lt;/h2&gt;

&lt;p&gt;Here's what I was looking for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate multi-turn conversations&lt;/strong&gt;: Test entire conversation flows (not just single Q&amp;amp;A), including context retention and information consistency across turns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design branching based on bot responses&lt;/strong&gt;: Create scenarios where the user's next question depends on what the bot actually said in the previous turn&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score each turn with LLM-as-Judge&lt;/strong&gt;: After running a scenario, automatically evaluate each turn's response on criteria like semantic accuracy and context retention&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run tests repeatedly and automatically&lt;/strong&gt;: Define scenarios once, run them as many times as needed, so quality issues that single manual tests miss get caught through continuous testing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-generate scenarios from Dify DSL&lt;/strong&gt;: Writing scenarios shouldn't be the bottleneck — just paste a Dify app's flow definition (YAML) and have test scenarios generated from its structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I originally built a tool to do all of this for my own use. After using it heavily, it turned out to be more broadly useful than expected, so I published it as &lt;a href="https://convoprobe.vercel.app" rel="noopener noreferrer"&gt;ConvoProbe&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A note on the Dify community's approach to quality:&lt;/strong&gt;&lt;br&gt;
I searched the &lt;a href="https://forum.dify.ai" rel="noopener noreferrer"&gt;Dify forum&lt;/a&gt; and &lt;a href="https://github.com/langgenius/dify/discussions" rel="noopener noreferrer"&gt;GitHub Discussions&lt;/a&gt; to see how others handle chatbot quality. The results were surprising:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Search&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Forum posts about chatbot quality evaluation&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Discussions about testing/validation&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Issues about regressions after updates&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;211&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Issues about observability/tracing&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;524&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;There's plenty of discussion about observability and regressions, but almost none about systematically evaluating quality.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What ConvoProbe does
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Evaluate multi-turn conversations
&lt;/h3&gt;

&lt;p&gt;ConvoProbe evaluates entire multi-turn conversations, not just individual Q&amp;amp;A pairs.&lt;/p&gt;

&lt;p&gt;Single-turn tests can verify whether individual answers are correct. But in real chatbot usage, problems emerge at turn 3 or 4 — the bot loses context, mixes up information, or contradicts what it said earlier. ConvoProbe lets you verify things like "does the bot at turn 4 correctly reference what it said at turn 1?"&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Design conversation scenarios visually
&lt;/h3&gt;

&lt;p&gt;You build conversation structures in a GUI — much like designing flows in Dify itself. For each turn, you set the user's message and the expected response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figf5zgr4xlt7xfz0g9xe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figf5zgr4xlt7xfz0g9xe.png" alt="Design each turn's user message and expected response in a visual editor" width="800" height="654"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Design dynamic branching based on bot responses
&lt;/h3&gt;

&lt;p&gt;Real conversations aren't linear. What the user asks next depends on what the bot just said.&lt;/p&gt;

&lt;p&gt;ConvoProbe uses an LLM to evaluate the bot's response at runtime and dynamically determines which branch to follow. Static dataset evaluation can't express this kind of "output-dependent branching."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7w7df33kehcfew2h0x31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7w7df33kehcfew2h0x31.png" alt="At runtime, an LLM evaluates the bot's response to determine which branch to follow" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Auto-generate scenarios from Dify DSL
&lt;/h3&gt;

&lt;p&gt;Paste your Dify app's DSL (the YAML flow definition) into ConvoProbe, and it analyzes the flow structure to auto-generate test scenarios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42tfi55zmlp85nc9y03f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42tfi55zmlp85nc9y03f.png" alt="Paste a Dify app's DSL (YAML) to auto-generate test scenarios from the flow structure" width="800" height="689"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No need to design scenarios from scratch. For existing Dify apps, you can start testing immediately. Generated scenarios can be run as-is or edited in the GUI.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Score each turn with LLM-as-Judge
&lt;/h3&gt;

&lt;p&gt;When a scenario runs, each turn's response is automatically scored on the following criteria:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Semantic alignment&lt;/td&gt;
&lt;td&gt;Does the actual response convey the expected meaning and information?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Completeness&lt;/td&gt;
&lt;td&gt;Does the actual response cover all key points from the expected answer?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;Is the information in the actual response factually correct?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Relevance&lt;/td&gt;
&lt;td&gt;Is the actual response directly relevant to the question?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni2tt8g0bvdn9jxwsymh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni2tt8g0bvdn9jxwsymh.png" alt="Each turn is scored on 4 evaluation criteria" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What scenario testing reveals
&lt;/h2&gt;

&lt;p&gt;Running multi-turn scenario tests surfaces quality problems that are otherwise hard to catch:&lt;/p&gt;

&lt;h3&gt;
  
  
  Quality degrades over multiple turns
&lt;/h3&gt;

&lt;p&gt;A chatbot that looks fine on single-turn tests can fall apart after 3-4 turns. RAG-based chatbots are especially prone to this — as conversations progress, the bot's ability to determine which retrieved information is relevant starts to drift.&lt;/p&gt;

&lt;p&gt;If you only test single turns, you'll miss this entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context loss is silent
&lt;/h3&gt;

&lt;p&gt;When a bot "forgets" earlier conversation history, there's no crash or error. It just generates a plausible-sounding but incorrect response.&lt;/p&gt;

&lt;p&gt;To verify whether "turn 4 correctly references turn 1," you need to intentionally design and execute that conversation flow as a test scenario.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workflow updates cause regressions
&lt;/h3&gt;

&lt;p&gt;Updating a Dify workflow — changing a system prompt, adjusting RAG retrieval parameters — can silently break conversation patterns that were working before.&lt;/p&gt;

&lt;p&gt;Running the same scenarios before and after a change lets you catch degradation before it reaches production.&lt;/p&gt;




&lt;h2&gt;
  
  
  How ConvoProbe fits with existing tools
&lt;/h2&gt;

&lt;p&gt;ConvoProbe isn't a replacement for Langfuse or LangSmith — it's complementary.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;During development&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ConvoProbe&lt;/td&gt;
&lt;td&gt;Run scenario tests to verify it's safe to ship&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Before release&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ConvoProbe&lt;/td&gt;
&lt;td&gt;Compare scenario scores before/after changes (regression testing)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;In production&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Langfuse / LangSmith / Opik&lt;/td&gt;
&lt;td&gt;Tracing, cost monitoring, post-hoc evaluation of real conversations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;When issues surface&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ConvoProbe&lt;/td&gt;
&lt;td&gt;Create a scenario that reproduces the problem, fix, re-test&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Langfuse helps you &lt;em&gt;discover&lt;/em&gt; problems. ConvoProbe helps you &lt;em&gt;prevent them from recurring&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;To get started, ConvoProbe needs just a Dify API key and an LLM API key for evaluation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://convoprobe.vercel.app" rel="noopener noreferrer"&gt;https://convoprobe.vercel.app&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>machinelearning</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>When More Examples Make Your LLM Worse: Discovering Few-Shot Collapse</title>
      <dc:creator>Shuntaro Okuma</dc:creator>
      <pubDate>Fri, 27 Feb 2026 16:06:39 +0000</pubDate>
      <link>https://forem.com/shuntarookuma/when-more-examples-make-your-llm-worse-discovering-few-shot-collapse-106i</link>
      <guid>https://forem.com/shuntarookuma/when-more-examples-make-your-llm-worse-discovering-few-shot-collapse-106i</guid>
      <description>&lt;p&gt;Here's something everyone agrees on about few-shot prompting: give the model more examples, it performs better.&lt;/p&gt;

&lt;p&gt;I believed that too. Then I measured it.&lt;/p&gt;

&lt;p&gt;To do that, I built &lt;a href="https://github.com/ShuntaroOkuma/adapt-gauge-core" rel="noopener noreferrer"&gt;AdaptGauge&lt;/a&gt;, an open-source tool that measures how efficiently LLMs learn from few-shot examples.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I tested
&lt;/h2&gt;

&lt;p&gt;I evaluated eight models across four tasks designed to mirror real business scenarios, at shot counts of 0, 1, 2, 4, and 8:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Classification&lt;/strong&gt; — Categorize customer support inquiries into one of 8 categories (billing, technical support, returns, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Fix&lt;/strong&gt; — Identify and fix bugs in short Python functions (off-by-one errors, missing edge cases)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summarization&lt;/strong&gt; — Extract key points from Japanese news articles into bullet-point summaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route Optimization&lt;/strong&gt; — Calculate optimal delivery routes across multiple destinations with time windows and fuel costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Models tested:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud APIs&lt;/strong&gt;: Claude Haiku 4.5, Claude Opus 4.5, Gemini 2.5 Flash, Gemini 3 Flash, Gemini 3 Pro&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local models&lt;/strong&gt;: Gemma 3 27B, GPT-OSS 120B, Qwen3-VL 8B&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each model-task pair, I also compared two example selection strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fixed&lt;/strong&gt; — The same hand-picked examples used for every test input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf" rel="noopener noreferrer"&gt;TF-IDF&lt;/a&gt; dynamic selection&lt;/strong&gt; — For each test input, score all candidate examples by word-overlap similarity and pick the closest matches. The idea: examples that resemble the current input should help the model more. Tang et al. (2025) &lt;a href="https://arxiv.org/abs/2509.13196" rel="noopener noreferrer"&gt;reported&lt;/a&gt; that combining this with stratified sampling achieves better performance with fewer examples.&lt;/li&gt;
&lt;/ul&gt;
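
&lt;p&gt;Here is that TF-IDF selection step sketched with scikit-learn. The candidate pool format and &lt;code&gt;k&lt;/code&gt; are placeholders, and this shows the general technique rather than AdaptGauge's exact implementation.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_examples(test_input, candidates, k):
    """Pick the k candidates whose input text is most similar to the test input.

    candidates is a list of dicts with "input" and "output" keys; similarity is
    plain TF-IDF cosine similarity over the input texts.
    """
    if k == 0:
        return []
    texts = [c["input"] for c in candidates] + [test_input]
    matrix = TfidfVectorizer().fit_transform(texts)
    sims = cosine_similarity(matrix[len(candidates)], matrix[:len(candidates)])[0]
    ranked = sorted(range(len(candidates)), key=lambda i: sims[i], reverse=True)
    return [candidates[i] for i in ranked[:k]]
&lt;/code&gt;&lt;/pre&gt;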

&lt;p&gt;Full task definitions — including prompts, examples, and scoring rubrics — are in the &lt;a href="https://github.com/ShuntaroOkuma/adapt-gauge-core/tree/main/tasks" rel="noopener noreferrer"&gt;demo task pack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Most of the time, the results looked exactly like you'd expect. More examples, better scores. But not always.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three patterns that break the assumption
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pattern 1: The model learns, then unlearns
&lt;/h3&gt;

&lt;p&gt;On a route optimization task, Gemini 3 Flash scored 33% at zero-shot. Adding examples helped — performance climbed to &lt;strong&gt;64% at 4-shot&lt;/strong&gt;. Textbook behavior.&lt;/p&gt;

&lt;p&gt;Then I added more. At 8-shot, the score &lt;strong&gt;crashed back to 33%&lt;/strong&gt;. Right back where it started.&lt;/p&gt;

&lt;p&gt;The model learned, then unlearned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bgc2q4cqjo0fhzq18eq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bgc2q4cqjo0fhzq18eq.png" alt="Learning curves for the route optimization task. Gemini 3 Flash shows a dramatic V-shape (peak regression), while the other four models improve steadily."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Four models improved steadily. One didn't. I call this &lt;strong&gt;peak regression&lt;/strong&gt; — and you can't spot it without tracking the full learning curve.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 2: Rankings flip completely
&lt;/h3&gt;

&lt;p&gt;On a classification task, something even stranger happened. The &lt;strong&gt;model rankings reversed&lt;/strong&gt; between zero-shot and eight-shot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flowtwressd1f2jqqmbfq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flowtwressd1f2jqqmbfq.png" alt="Classification scores at 0-shot vs 8-shot. Gemini 2.5 Flash surges from 20% to 80%, overtaking Gemini 3 Pro which stays flat at 60%."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Look at Gemini 2.5 Flash: it scored just 20% at zero-shot, but climbed to 80% with eight examples — the highest of any model. Meanwhile, Gemini 3 Pro stayed flat at 60% regardless of shot count.&lt;/p&gt;

&lt;p&gt;A "Pro" model isn't necessarily better than a "Flash" model — it depends on how you prompt it. Choosing a model based on public benchmarks alone can lead you to the wrong conclusion.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3: How you pick examples can trigger collapse
&lt;/h3&gt;

&lt;p&gt;I tested two methods for selecting few-shot examples: &lt;strong&gt;fixed&lt;/strong&gt; (hand-picked) and &lt;strong&gt;TF-IDF&lt;/strong&gt; (dynamically selected by text similarity).&lt;/p&gt;

&lt;p&gt;Tang et al.'s &lt;a href="https://arxiv.org/abs/2509.13196" rel="noopener noreferrer"&gt;"The Few-shot Dilemma"&lt;/a&gt; (2025) found that TF-IDF-based selection combined with stratified sampling achieved superior performance with fewer examples. And on most of my tasks, TF-IDF did help.&lt;/p&gt;

&lt;p&gt;But on a route optimization task with GPT-OSS 120B, it &lt;strong&gt;made things dramatically worse&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fooxjt5gxdnjn2bx502wg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fooxjt5gxdnjn2bx502wg.png" alt="Fixed vs TF-IDF example selection for GPT-OSS 120B. TF-IDF collapses to 35% at 2-shot while fixed selection stays above 50%."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With fixed examples, the model stayed above 50%. With TF-IDF, it &lt;strong&gt;collapsed to 35% at 2-shot&lt;/strong&gt; — a 58% relative drop. The method designed to find "better" examples triggered a failure.&lt;/p&gt;




&lt;p&gt;Adding in-context examples — or changing how you select them — can actively degrade model performance. I call this &lt;strong&gt;few-shot collapse&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  I'm not the first to see this
&lt;/h2&gt;

&lt;p&gt;After finding these patterns, I dug into the literature. Turns out researchers have been documenting the same thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The over-prompting problem.&lt;/strong&gt; Tang et al. (2025) showed that LLM performance peaks at a certain number of examples and then &lt;a href="https://arxiv.org/abs/2509.13196" rel="noopener noreferrer"&gt;declines&lt;/a&gt;. LLaMA and Gemma models showed dramatic degradation. GPT models held up better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Catastrophic drops in security tasks.&lt;/strong&gt; An &lt;a href="https://www.ndss-symposium.org/wp-content/uploads/2025-1491-paper.pdf" rel="noopener noreferrer"&gt;NDSS 2025 study&lt;/a&gt; (Lin &amp;amp; Mohaisen) found that few-shot examples dramatically degraded vulnerability &lt;em&gt;type identification&lt;/em&gt;. In terms of AP (accurate-response percentage), Gemma 7B dropped from 77.9% to 39.9%, and LLaMA-2 70B from 68.6% to 21.0%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Labels don't even matter.&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2202.12837" rel="noopener noreferrer"&gt;Min et al. (2022)&lt;/a&gt; found that randomly replacing labels in few-shot examples barely hurts performance. Models aren't learning input-label mappings — they're picking up format and distribution cues. The mechanism behind few-shot "learning" is far more fragile than most people assume.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this happens
&lt;/h2&gt;

&lt;p&gt;A few factors are at play:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More tokens = worse performance.&lt;/strong&gt; Chroma Research's &lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;"Context Rot"&lt;/a&gt; study (2025) showed that simply increasing input tokens — even with irrelevant whitespace — significantly degrades performance across a wide range of models and tasks. More examples means more tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Position matters.&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2307.03172" rel="noopener noreferrer"&gt;Liu et al. (2024)&lt;/a&gt; showed that models struggle with information in the middle of long contexts. When examples push the actual task further down, the model loses track.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-training biases conflict with examples.&lt;/strong&gt; Some models have strong priors. When examples contradict those priors, or introduce patterns the model over-indexes on, the result is worse than no examples at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example selection amplifies or dampens all of this.&lt;/strong&gt; My TF-IDF comparison showed that "textually similar" doesn't always mean "helpful." A relevant example can still confuse the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for you
&lt;/h2&gt;

&lt;p&gt;If you're using LLMs in production:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your prompt "improvements" might be breaking things.&lt;/strong&gt; Adding examples is the default fix when a model underperforms. My data shows it can backfire — and without measurement, you won't know until users complain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Leaderboard rankings don't predict this.&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2402.01781" rel="noopener noreferrer"&gt;Alzahrani et al. (2024)&lt;/a&gt; showed that minor benchmark changes shift rankings by up to 8 positions. My classification results confirm it: the zero-shot leader dropped to third once examples were added.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Different models break on different tasks.&lt;/strong&gt; Gemini 3 Flash collapsed on route optimization but improved on summarization. There's no universal "safe" model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example selection is a variable, not a constant.&lt;/strong&gt; Switching from hand-picked to TF-IDF examples turned a working model into a broken one. This isn't a "set and forget" choice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Detecting it automatically
&lt;/h2&gt;

&lt;p&gt;These findings led me to a framework inspired by Chollet's &lt;a href="https://arxiv.org/abs/1911.01547" rel="noopener noreferrer"&gt;"On the Measure of Intelligence"&lt;/a&gt; (2019):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The intelligence of a system is the measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of "how good is this model?", the question should be "how efficiently does it adapt?" — and critically, &lt;em&gt;does it ever adapt in the wrong direction?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I built this idea into &lt;a href="https://github.com/ShuntaroOkuma/adapt-gauge-core" rel="noopener noreferrer"&gt;AdaptGauge&lt;/a&gt;. For each model-task pair across shot counts (0 through 8), it automatically computes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Learning Curve AUC&lt;/strong&gt; — Overall learning efficiency. Higher means the model learns faster from examples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Few-Shot Collapse&lt;/strong&gt; — Auto-alerts when 8-shot performance drops below 80% of the 0-shot baseline (sketched after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collapse Pattern&lt;/strong&gt; — Classifies each curve as &lt;em&gt;immediate collapse&lt;/em&gt;, &lt;em&gt;gradual decline&lt;/em&gt;, &lt;em&gt;peak regression&lt;/em&gt;, or &lt;em&gt;stable&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resilience Score&lt;/strong&gt; — How well the model holds up as shot count increases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example Selection Comparison&lt;/strong&gt; — Runs fixed vs TF-IDF side-by-side to find what works for each model-task pair.&lt;/li&gt;
&lt;/ul&gt;
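
&lt;p&gt;As a rough sketch of the shape of that logic, the snippet below applies the 80% collapse check and a simplified pattern label to a learning curve. The pattern heuristics here are my own simplifications for illustration, not AdaptGauge's actual classifier.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def detect_collapse(scores_by_shot, threshold=0.8):
    """Flag few-shot collapse: the 8-shot score falls below 80% of the 0-shot baseline."""
    return scores_by_shot[8] &amp;lt; threshold * scores_by_shot[0]

def label_pattern(scores_by_shot, tol=0.05):
    """Very rough curve labelling; a simplification, not AdaptGauge's classifier."""
    zero, eight = scores_by_shot[0], scores_by_shot[8]
    peak = max(scores_by_shot.values())
    if peak - max(zero, eight) &amp;gt; tol:
        return "peak regression"      # improved first, then gave the gains back
    if zero - scores_by_shot[1] &amp;gt; tol and zero - eight &amp;gt; tol:
        return "immediate collapse"   # drops as soon as any example is added
    if zero - eight &amp;gt; tol:
        return "gradual decline"
    return "stable"

# Hypothetical curve for illustration: high at zero-shot, sliding down with shots.
curve = {0: 0.90, 1: 0.85, 2: 0.60, 4: 0.55, 8: 0.40}
print(detect_collapse(curve), label_pattern(curve))
&lt;/code&gt;&lt;/pre&gt;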

&lt;p&gt;AdaptGauge is primarily a CLI tool, but it also includes a simple GUI for reviewing results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fruuj5xzpzp528xtgyw1i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fruuj5xzpzp528xtgyw1i.png" alt="AdaptGauge output example showing learning curves, collapse patterns, resilience scores, and example selection comparison."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my evaluation, it flagged the peak regression in Gemini 3 Flash and the TF-IDF-induced collapse in GPT-OSS 120B automatically. These are patterns that spot-checking would miss entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;AdaptGauge is open-source. Clone the repo, check the pre-computed demo results, or run your own evaluations against any model with an API. For local models, LM Studio makes it easy to test.&lt;/p&gt;

&lt;p&gt;If you've ever added examples to a prompt and wondered whether it actually helped — now you can find out.&lt;/p&gt;

&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/ShuntaroOkuma" rel="noopener noreferrer"&gt;
        ShuntaroOkuma
      &lt;/a&gt; / &lt;a href="https://github.com/ShuntaroOkuma/adapt-gauge-core" rel="noopener noreferrer"&gt;
        adapt-gauge-core
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Measure LLM adaptation efficiency — how fast models learn from few examples
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;adapt-gauge-core&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href="https://opensource.org/licenses/MIT" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fdf2982b9f5d7489dcf44570e714e3a15fce6253e0cc6b5aa61a075aac2ff71b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667" alt="License: MIT"&gt;&lt;/a&gt; &lt;a href="https://www.python.org/downloads/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/36cf3d0f7992a33a063d3833577d62204f8934d82b69874c086390608db4947c/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f707974686f6e2d332e31312b2d626c75652e737667" alt="Python 3.11+"&gt;&lt;/a&gt; &lt;a href="https://github.com/ShuntaroOkuma/adapt-gauge-core/actions/workflows/test.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/ShuntaroOkuma/adapt-gauge-core/actions/workflows/test.yml/badge.svg" alt="Tests"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/ShuntaroOkuma/adapt-gauge-core/README_ja.md" rel="noopener noreferrer"&gt;日本語&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Measure how fast LLMs learn from few-shot examples — and detect when they break.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;adapt-gauge-core is an open-source evaluation harness that measures &lt;strong&gt;Adaptation Efficiency&lt;/strong&gt; — how quickly a language model improves with few-shot examples (0, 1, 2, 4, 8 shots) and whether it suffers from &lt;strong&gt;few-shot collapse&lt;/strong&gt; (performance degradation with more examples).&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why Adaptation Efficiency?&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Standard LLM benchmarks measure accuracy at a single point. But in production, teams often use few-shot prompting to adapt models to specific tasks. Two critical questions arise:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;How many examples does this model need?&lt;/strong&gt; Some models reach peak performance at 2 shots; others need 8.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does adding examples ever hurt?&lt;/strong&gt; For some model-task combinations, performance &lt;em&gt;drops&lt;/em&gt; with more examples — a phenomenon known as &lt;strong&gt;few-shot collapse&lt;/strong&gt; (also called &lt;strong&gt;over-prompting&lt;/strong&gt; in the literature).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;adapt-gauge-core answers both questions automatically.&lt;/p&gt;
&lt;p&gt;In our evaluations, we observed that &lt;strong&gt;leaderboard rankings reverse&lt;/strong&gt; depending on shot count — a model…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/ShuntaroOkuma/adapt-gauge-core" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;







&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chollet (2019) — &lt;a href="https://arxiv.org/abs/1911.01547" rel="noopener noreferrer"&gt;"On the Measure of Intelligence"&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Min et al. (2022) — &lt;a href="https://arxiv.org/abs/2202.12837" rel="noopener noreferrer"&gt;"Rethinking the Role of Demonstrations"&lt;/a&gt;, EMNLP&lt;/li&gt;
&lt;li&gt;Liu et al. (2024) — &lt;a href="https://arxiv.org/abs/2307.03172" rel="noopener noreferrer"&gt;"Lost in the Middle"&lt;/a&gt;, TACL&lt;/li&gt;
&lt;li&gt;Alzahrani et al. (2024) — &lt;a href="https://arxiv.org/abs/2402.01781" rel="noopener noreferrer"&gt;"When Benchmarks are Targets"&lt;/a&gt;, ACL&lt;/li&gt;
&lt;li&gt;Tang et al. (2025) — &lt;a href="https://arxiv.org/abs/2509.13196" rel="noopener noreferrer"&gt;"The Few-shot Dilemma"&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Lin &amp;amp; Mohaisen (2025) — &lt;a href="https://www.ndss-symposium.org/wp-content/uploads/2025-1491-paper.pdf" rel="noopener noreferrer"&gt;"From Large to Mammoth"&lt;/a&gt;, NDSS&lt;/li&gt;
&lt;li&gt;Chroma Research (2025) — &lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;"Context Rot"&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
