<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Shuntaro Okuma</title>
    <description>The latest articles on Forem by Shuntaro Okuma (@shuntarookuma).</description>
    <link>https://forem.com/shuntarookuma</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3796661%2Fc4ad13e3-cc4c-4577-80ba-1a147449eddf.png</url>
      <title>Forem: Shuntaro Okuma</title>
      <link>https://forem.com/shuntarookuma</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/shuntarookuma"/>
    <language>en</language>
    <item>
      <title>I Tested 12 LLMs With Few-Shot Examples. The Results Were Not What I Expected.</title>
      <dc:creator>Shuntaro Okuma</dc:creator>
      <pubDate>Thu, 26 Mar 2026 13:56:16 +0000</pubDate>
      <link>https://forem.com/shuntarookuma/i-tested-12-llms-with-few-shot-examples-the-results-were-not-what-i-expected-2de6</link>
      <guid>https://forem.com/shuntarookuma/i-tested-12-llms-with-few-shot-examples-the-results-were-not-what-i-expected-2de6</guid>
      <description>&lt;p&gt;In a &lt;a href="https://dev.to/shuntarookuma/when-more-examples-make-your-llm-worse-discovering-few-shot-collapse-106i"&gt;previous article&lt;/a&gt;, I tested 8 models across 4 tasks and reported on "few-shot collapse" — cases where adding few-shot examples actually degrades LLM performance.&lt;/p&gt;

&lt;p&gt;This time, I expanded the experiment to 12 models (6 cloud + 6 local) and 5 tasks to see whether those findings hold at a larger scale. They do — and I found even more dramatic cases, including a model that dropped from 93% to 30% with more examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I tested
&lt;/h2&gt;

&lt;p&gt;I evaluated 12 models — 6 cloud APIs and 6 local models — across 5 tasks designed to mirror real business scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud models:&lt;/strong&gt; Claude Haiku 4.5, Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3 Flash, GPT-4o-mini, GPT-5.4-mini&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local models:&lt;/strong&gt; Gemma 3 27B, LLaMA 4 Scout (17B active, MoE), GPT-OSS 120B, Qwen 3.5 (35B total / 3B active, MoE), Ministral 3 14B Reasoning, Phi-4 Reasoning Plus&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tasks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Classification&lt;/strong&gt; — Categorize customer support inquiries into specific categories (exact match scoring)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Fix&lt;/strong&gt; — Identify and fix bugs in short Python functions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route Optimization&lt;/strong&gt; — Calculate optimal delivery routes with time windows and fuel costs (LLM-as-judge scoring)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sentiment Analysis&lt;/strong&gt; — Classify product reviews as positive/negative/neutral/mixed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summarization&lt;/strong&gt; — Extract key points from news articles into summaries (F1 scoring)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each model-task pair was evaluated at 0, 1, 2, 4, and 8 shots, with 3 trials per configuration and TF-IDF-based dynamic example selection. That's 60 model-task pairs and over 27,000 individual evaluations.&lt;/p&gt;
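
&lt;p&gt;To make the protocol concrete, here is a minimal sketch of the evaluation grid in Python. The helper callables (&lt;code&gt;select_examples&lt;/code&gt;, &lt;code&gt;build_prompt&lt;/code&gt;, &lt;code&gt;query_model&lt;/code&gt;, &lt;code&gt;score_response&lt;/code&gt;) are placeholders for whatever harness you use; this is an illustration of the setup, not AdaptGauge's actual API.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from statistics import mean

# Shot counts and trial count from the setup described above.
SHOT_COUNTS = [0, 1, 2, 4, 8]
TRIALS = 3

def evaluate_grid(models, tasks, select_examples, build_prompt, query_model, score_response):
    """Average score for every (model, task, shots) cell over TRIALS runs.

    select_examples(task, item, k) picks k examples (e.g. by TF-IDF similarity),
    build_prompt(task, item, examples) assembles the few-shot prompt,
    query_model(model, prompt) calls the model, and
    score_response(task, item, answer) returns a score between 0 and 1.
    """
    results = {}
    for model in models:
        for task in tasks:
            for shots in SHOT_COUNTS:
                trial_scores = []
                for _ in range(TRIALS):
                    item_scores = []
                    for item in task["items"]:
                        examples = select_examples(task, item, shots)
                        prompt = build_prompt(task, item, examples)
                        answer = query_model(model, prompt)
                        item_scores.append(score_response(task, item, answer))
                    trial_scores.append(mean(item_scores))
                results[(model, task["name"], shots)] = mean(trial_scores)
    return results
&lt;/code&gt;&lt;/pre&gt;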

&lt;p&gt;I'll describe how to explore the full results later, but here are three patterns that stood out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 1: The zero-shot leader can crash to last place
&lt;/h2&gt;

&lt;p&gt;Gemini 3 Flash scored &lt;strong&gt;93%&lt;/strong&gt; on route optimization at zero-shot — the highest of any model. Then I added examples.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Shots&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;1&lt;/th&gt;
&lt;th&gt;2&lt;/th&gt;
&lt;th&gt;4&lt;/th&gt;
&lt;th&gt;8&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Score&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;93%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;93%&lt;/td&gt;
&lt;td&gt;43%&lt;/td&gt;
&lt;td&gt;53%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 8-shot, it scored 30%. The model that was the best at zero-shot became the worst with examples. A &lt;strong&gt;63-point drop&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77epb7ywhw7rix2p84t8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77epb7ywhw7rix2p84t8.png" alt="Learning curves for route optimization. Gemini 3 Flash (red) collapses from 93% to 30%, while Gemma 3 27B (green, same model family) stays stable around 90%." width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the twist: &lt;strong&gt;Gemma 3 27B — from the same model family — stayed stable around 90% across all shot counts.&lt;/strong&gt; Same architecture lineage, completely different behavior. This isn't a property of the Gemini/Gemma family. It's specific to Gemini 3 Flash on this task.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 2: Most models benefit from few-shot examples
&lt;/h2&gt;

&lt;p&gt;On classification, every model scored between 0% and 20% at zero-shot. They all looked equally bad. Based on a zero-shot benchmark alone, you'd conclude these models can't classify customer support tickets.&lt;/p&gt;

&lt;p&gt;But with examples, performance improved dramatically across the board. The graph is a bit busy, but you can see the overall upward trend from 0-shot to 8-shot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybkclzp59jqzf4l56i4j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybkclzp59jqzf4l56i4j.png" alt="Classification learning curves. All models start at 0-20% at zero-shot but diverge dramatically by 8-shot, spreading from 27% to 80%." width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At 8-shot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Haiku: 80%&lt;/strong&gt; (from 20%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Sonnet: 73%&lt;/strong&gt; (from 20%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-OSS 120B: 73%&lt;/strong&gt; (from 0%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemma 3 27B: 67%&lt;/strong&gt; (from 0%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4o-mini: 33%&lt;/strong&gt; (from 0%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini 2.5 Flash: 27%&lt;/strong&gt; (from 13%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most models improved substantially with examples: Claude Haiku reached 80%, and Claude Sonnet and GPT-OSS 120B also showed strong gains. Gemma 3 27B, which performed well on route optimization in Pattern 1, went from 0% to 67%. On the other hand, GPT-4o-mini and Gemini 2.5 Flash barely improved despite starting just as low.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you pick your model from zero-shot benchmarks, you might choose the wrong one.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 3: Models bad at a task stay bad — even with examples
&lt;/h2&gt;

&lt;p&gt;On summarization, most models improved steadily with more examples. This is the behavior everyone expects from few-shot prompting. The graph is busy again, but the overall upward trend is clearer than with classification:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiykjtigps4ef7pzdgwt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiykjtigps4ef7pzdgwt.png" alt="Summarization learning curves. Most models show steady improvement from 0-shot to 8-shot, with Gemma 3 27B reaching 75%." width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gemma 3 27B — a local model — achieved the highest score at 75%, outperforming all cloud models. Claude Sonnet followed at 73%, then Gemini 3 Flash at 72%. For straightforward tasks, local models can be more than enough.&lt;/p&gt;

&lt;p&gt;However, even on this task, &lt;strong&gt;Phi-4 Reasoning Plus&lt;/strong&gt; and &lt;strong&gt;Ministral 3 14B&lt;/strong&gt; scored poorly. Both are reasoning-specialized models, optimized for expanding and elaborating information — not compressing it as summarization requires. This isn't "collapse" from adding examples; they simply weren't suited for the task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Few-shot prompting works well for most models on most tasks, but models that are fundamentally mismatched with a task won't be saved by more examples.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The 60 model-task combinations fall into three patterns
&lt;/h2&gt;

&lt;p&gt;To summarize the three patterns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Few-shot causes collapse&lt;/strong&gt; — Like Gemini 3 Flash on route optimization in Pattern 1, adding examples dramatically degrades performance. The most notable cases:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Drop&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3 Flash&lt;/td&gt;
&lt;td&gt;Route Optimization&lt;/td&gt;
&lt;td&gt;Gradual decline&lt;/td&gt;
&lt;td&gt;93% → 30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3.5 (3B active)&lt;/td&gt;
&lt;td&gt;Code Fix&lt;/td&gt;
&lt;td&gt;Gradual decline&lt;/td&gt;
&lt;td&gt;56% → 0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ministral 3 14B&lt;/td&gt;
&lt;td&gt;Code Fix&lt;/td&gt;
&lt;td&gt;Peak regression&lt;/td&gt;
&lt;td&gt;44% → 33%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;2. Few-shot works as expected&lt;/strong&gt; — Like summarization for most models, performance improves steadily with more examples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Task-model mismatch&lt;/strong&gt; — As described in Pattern 3, models like Phi-4 Reasoning Plus and Ministral 3 14B scored low on summarization even at zero-shot. Adding examples didn't help — this isn't "collapse" but a fundamental mismatch.&lt;/p&gt;




&lt;p&gt;Additionally, four pairs showed temporary dips that later recovered. The scores came back on their own, but testing at only a single shot count could easily have led to the wrong conclusion:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4-mini&lt;/td&gt;
&lt;td&gt;Classification&lt;/td&gt;
&lt;td&gt;60% at 2-shot → 27% at 4-shot → 60% at 8-shot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS 120B&lt;/td&gt;
&lt;td&gt;Route Optimization&lt;/td&gt;
&lt;td&gt;78% at 0-shot → 58% at 1-shot → 74% at 8-shot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;
&lt;td&gt;Route Optimization&lt;/td&gt;
&lt;td&gt;63% at 2-shot → 52% at 4-shot → 63% at 8-shot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3.5&lt;/td&gt;
&lt;td&gt;Classification&lt;/td&gt;
&lt;td&gt;40% at 2-shot → 20% at 4-shot → 40% at 8-shot&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Testing at multiple shot counts helps catch these, though the issue may be less about shot count itself and more about how the specific examples provided interact with the model's priors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Who performed best?
&lt;/h3&gt;

&lt;p&gt;Measuring adaptation efficiency (area under the learning curve across all tasks):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Avg AUC&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;0.815&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Gemma 3 27B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Local&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.814&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;0.802&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;LLaMA 4 Scout&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Local&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.748&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;GPT-5.4-mini&lt;/td&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;0.730&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;A 27B local model matched Claude Haiku's adaptation efficiency.&lt;/strong&gt; LLaMA 4 Scout, with only 17B active parameters (MoE), outperformed GPT-5.4-mini. Results will vary depending on the evaluation method and tasks, but this suggests that with proper few-shot prompting, local models can achieve performance comparable to cloud APIs.&lt;/p&gt;
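
&lt;p&gt;For reference, a learning-curve AUC like the "Avg AUC" column above can be computed by integrating the curve over the shot axis and normalizing. The sketch below shows the general idea; it is not necessarily AdaptGauge's exact formula.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def learning_curve_auc(scores_by_shot):
    """Normalized area under a learning curve.

    scores_by_shot maps shot count to score, e.g. the Pattern 1 table above:
    {0: 0.93, 1: 0.93, 2: 0.43, 4: 0.53, 8: 0.30}. Trapezoidal integration over
    the shot axis, divided by the total width, so a flat curve at 1.0 scores 1.0.
    """
    shots = sorted(scores_by_shot)
    area = 0.0
    for left, right in zip(shots, shots[1:]):
        width = right - left
        area += width * (scores_by_shot[left] + scores_by_shot[right]) / 2
    return area / (shots[-1] - shots[0])

# Gemini 3 Flash on route optimization (the collapsing curve): roughly 0.53
print(learning_curve_auc({0: 0.93, 1: 0.93, 2: 0.43, 4: 0.53, 8: 0.30}))
&lt;/code&gt;&lt;/pre&gt;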

&lt;h2&gt;
  
  
  Prior research
&lt;/h2&gt;

&lt;p&gt;Few-shot performance degradation has been reported by several independent studies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tang et al. (2025)&lt;/strong&gt; documented &lt;a href="https://arxiv.org/abs/2509.13196" rel="noopener noreferrer"&gt;"over-prompting"&lt;/a&gt; — performance peaks then declines — across GPT-4o, DeepSeek-V3, Gemma-3, LLaMA-3, and Mistral.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lin &amp;amp; Mohaisen (NDSS 2025)&lt;/strong&gt; found that few-shot examples &lt;a href="https://www.ndss-symposium.org/wp-content/uploads/2025-1491-paper.pdf" rel="noopener noreferrer"&gt;degraded vulnerability detection&lt;/a&gt;: Gemma 7B dropped from 78% to 40%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chroma Research (2025)&lt;/strong&gt; showed that simply &lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;adding more tokens&lt;/a&gt; — even irrelevant ones — degrades performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Min et al. (2022)&lt;/strong&gt; found that randomly replacing labels in few-shot examples &lt;a href="https://arxiv.org/abs/2202.12837" rel="noopener noreferrer"&gt;barely hurts performance&lt;/a&gt; — suggesting models aren't learning from examples the way we assume.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The phenomenon is well-documented. This makes it all the more important to evaluate whether few-shot prompting actually works for your specific use case before deploying to production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Don't assume more examples = better.&lt;/strong&gt; It's worth testing at multiple shot counts. The optimal number varies by model and task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't pick models from zero-shot benchmarks alone.&lt;/strong&gt; I found that rankings can change significantly with examples. When referencing benchmarks, check whether they were measured at zero-shot or few-shot — the methodology matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distinguish collapse from task mismatch.&lt;/strong&gt; If scores drop after adding examples, check the zero-shot baseline. Low from the start suggests a model-task compatibility issue. High at zero-shot but dropping with examples points to a few-shot prompting effect (a rough triage sketch follows this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure, don't guess.&lt;/strong&gt; Whether few-shot prompting helps a specific model-task pair can only be determined by actually evaluating it. Tracking the full learning curve ensures you don't miss non-monotonic patterns.&lt;/li&gt;
&lt;/ol&gt;
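
&lt;p&gt;As a rough illustration of takeaway 3, here is one way to triage a drop. The thresholds are arbitrary placeholders chosen for illustration, so treat the function as a starting point rather than a rule.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def triage_drop(zero_shot, eight_shot, peak, low_baseline=0.3, collapse_ratio=0.8):
    """Rough triage of a score drop; the thresholds are illustrative placeholders.

    A low zero-shot baseline points at model-task mismatch; a high baseline that
    degrades as examples are added points at a few-shot prompting effect.
    """
    if max(zero_shot, eight_shot) &amp;lt;= low_baseline:
        return "task mismatch: weak regardless of examples"
    if eight_shot &amp;lt; collapse_ratio * zero_shot:
        return "few-shot collapse: examples are hurting"
    if peak &amp;gt; max(zero_shot, eight_shot):
        return "non-monotonic: peaked at an intermediate shot count"
    return "few-shot prompting is helping (or neutral)"

# Gemini 3 Flash on route optimization: high zero-shot, collapses with examples.
print(triage_drop(zero_shot=0.93, eight_shot=0.30, peak=0.93))
&lt;/code&gt;&lt;/pre&gt;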

&lt;h2&gt;
  
  
  Reproducing these results
&lt;/h2&gt;

&lt;p&gt;The evaluation was run with &lt;a href="https://github.com/ShuntaroOkuma/adapt-gauge-core" rel="noopener noreferrer"&gt;AdaptGauge&lt;/a&gt; (OSS, MIT license), a tool that tracks learning curves, auto-detects collapse, and classifies degradation patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The full results from this 12-model × 5-task experiment are available as default demo data.&lt;/strong&gt; After installation, you can immediately explore the patterns and learning curves discussed in this article — no API keys needed.&lt;/p&gt;

&lt;p&gt;To evaluate your own tasks and models, AdaptGauge supports cloud APIs as well as local models via any OpenAI-compatible API (LM Studio, Ollama, etc.).&lt;/p&gt;
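
&lt;p&gt;If you want to sanity-check that a local server speaks the OpenAI-compatible API before running an evaluation against it, a quick probe with the &lt;code&gt;openai&lt;/code&gt; Python client looks like the sketch below. The base URL assumes LM Studio's default local server (Ollama typically serves an OpenAI-compatible endpoint at http://localhost:11434/v1); this is a generic example, not AdaptGauge configuration.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from openai import OpenAI

# Assumes LM Studio's default local server; adjust base_url and the model name
# to whatever your own setup exposes.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="gemma-3-27b",  # use the model identifier your local server lists
    messages=[
        {"role": "user", "content": "Classify this ticket: 'I was charged twice.'"},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;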

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/ShuntaroOkuma/adapt-gauge-core" rel="noopener noreferrer"&gt;github.com/ShuntaroOkuma/adapt-gauge-core&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chollet (2019) — &lt;a href="https://arxiv.org/abs/1911.01547" rel="noopener noreferrer"&gt;"On the Measure of Intelligence"&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Min et al. (2022) — &lt;a href="https://arxiv.org/abs/2202.12837" rel="noopener noreferrer"&gt;"Rethinking the Role of Demonstrations"&lt;/a&gt;, EMNLP&lt;/li&gt;
&lt;li&gt;Liu et al. (2024) — &lt;a href="https://arxiv.org/abs/2307.03172" rel="noopener noreferrer"&gt;"Lost in the Middle"&lt;/a&gt;, TACL&lt;/li&gt;
&lt;li&gt;Tang et al. (2025) — &lt;a href="https://arxiv.org/abs/2509.13196" rel="noopener noreferrer"&gt;"The Few-shot Dilemma"&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Lin &amp;amp; Mohaisen (2025) — &lt;a href="https://www.ndss-symposium.org/wp-content/uploads/2025-1491-paper.pdf" rel="noopener noreferrer"&gt;"From Large to Mammoth"&lt;/a&gt;, NDSS&lt;/li&gt;
&lt;li&gt;Chroma Research (2025) — &lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;"Context Rot"&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>discuss</category>
    </item>
    <item>
      <title>How I Measure My Dify Chatbot Quality with Scenario Testing</title>
      <dc:creator>Shuntaro Okuma</dc:creator>
      <pubDate>Mon, 23 Mar 2026 15:01:35 +0000</pubDate>
      <link>https://forem.com/shuntarookuma/how-i-measure-my-dify-chatbot-quality-with-scenario-testing-5bl0</link>
      <guid>https://forem.com/shuntarookuma/how-i-measure-my-dify-chatbot-quality-with-scenario-testing-5bl0</guid>
      <description>&lt;h2&gt;
  
  
  What I did
&lt;/h2&gt;

&lt;p&gt;I designed multi-turn conversation scenarios for a Dify chatbot, ran them automatically via the API, and measured response quality quantitatively.&lt;/p&gt;

&lt;p&gt;If you've built chatbots with Dify, you've probably noticed this: single-turn Q&amp;amp;A works fine, but once users get into 3-4 turn conversations, quality drops noticeably. So I built automated tests — multi-turn scenarios with expected responses, fired against Dify's API — to catch these problems before they reach production.&lt;/p&gt;
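
&lt;p&gt;To make "fired against Dify's API" concrete, the sketch below walks a scripted multi-turn scenario through the chat-messages endpoint in blocking mode and carries the conversation_id forward between turns. The endpoint and field names follow Dify's chat API as I understand it (check your instance's API reference), and the sketch illustrates the approach rather than ConvoProbe's implementation.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import requests

DIFY_API_BASE = "https://api.dify.ai/v1"  # or your self-hosted instance
DIFY_API_KEY = "your-dify-app-api-key"

def run_scenario(turns, user="scenario-tester"):
    """Send each scripted user message in order, staying in one conversation."""
    headers = {"Authorization": f"Bearer {DIFY_API_KEY}"}
    conversation_id = ""
    transcript = []
    for user_message in turns:
        payload = {
            "inputs": {},
            "query": user_message,
            "response_mode": "blocking",
            "conversation_id": conversation_id,
            "user": user,
        }
        resp = requests.post(f"{DIFY_API_BASE}/chat-messages", headers=headers, json=payload)
        resp.raise_for_status()
        data = resp.json()
        conversation_id = data["conversation_id"]  # reuse it so turn 4 can reference turn 1
        transcript.append({"user": user_message, "bot": data["answer"]})
    return transcript

turns = [
    "I want to return the headphones I bought last week.",
    "They were a gift, so I do not have the receipt.",
    "What were the return conditions you mentioned at the start?",
]
print(run_scenario(turns))
&lt;/code&gt;&lt;/pre&gt;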




&lt;h2&gt;
  
  
  Background: existing eval tools and the remaining gap
&lt;/h2&gt;

&lt;p&gt;Dify has official integrations with several observability and evaluation tools. These tools aren't just for tracing — &lt;strong&gt;they also have evaluation capabilities&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Evaluation features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LangSmith&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Datasets + Evaluators, LLM-as-Judge, human feedback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Langfuse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Datasets, LLM-as-Judge, human feedback, custom scores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Opik&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM-as-Judge, 8 conversation-specific metrics, dataset evaluation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Arize AX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM-as-Judge, Session Evals, human annotation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Phoenix&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM-as-Judge, Evaluator Hub&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These tools can, for example, run an application against a dataset of &lt;code&gt;{input, expected_output}&lt;/code&gt; pairs and compare scores before and after changes. However, none of them seem to support designing and executing multi-turn conversation scenarios to check quality end-to-end.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I wanted
&lt;/h2&gt;

&lt;p&gt;Here's what I was looking for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate multi-turn conversations&lt;/strong&gt;: Test entire conversation flows (not just single Q&amp;amp;A), including context retention and information consistency across turns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design branching based on bot responses&lt;/strong&gt;: Create scenarios where the user's next question depends on what the bot actually said in the previous turn&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score each turn with LLM-as-Judge&lt;/strong&gt;: After running a scenario, automatically evaluate each turn's response on criteria like semantic accuracy and context retention&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run tests repeatedly and automatically&lt;/strong&gt;: Define scenarios once, run them as many times as needed, so quality issues that single manual tests miss get caught through continuous testing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-generate scenarios from Dify DSL&lt;/strong&gt;: Writing scenarios shouldn't be the bottleneck — just paste a Dify app's flow definition (YAML) and have test scenarios generated from its structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I originally built a tool to do all of this for my own use. After using it heavily, it turned out to be more broadly useful than expected, so I published it as &lt;a href="https://convoprobe.vercel.app" rel="noopener noreferrer"&gt;ConvoProbe&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A note on the Dify community's approach to quality:&lt;/strong&gt;&lt;br&gt;
I searched the &lt;a href="https://forum.dify.ai" rel="noopener noreferrer"&gt;Dify forum&lt;/a&gt; and &lt;a href="https://github.com/langgenius/dify/discussions" rel="noopener noreferrer"&gt;GitHub Discussions&lt;/a&gt; to see how others handle chatbot quality. The results were surprising:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Search&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Forum posts about chatbot quality evaluation&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Discussions about testing/validation&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Issues about regressions after updates&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;211&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Issues about observability/tracing&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;524&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;There's plenty of discussion about observability and regressions, but almost none about systematically evaluating quality.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What ConvoProbe does
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Evaluate multi-turn conversations
&lt;/h3&gt;

&lt;p&gt;ConvoProbe evaluates entire multi-turn conversations, not just individual Q&amp;amp;A pairs.&lt;/p&gt;

&lt;p&gt;Single-turn tests can verify whether individual answers are correct. But in real chatbot usage, problems emerge at turn 3 or 4 — the bot loses context, mixes up information, or contradicts what it said earlier. ConvoProbe lets you verify things like "does the bot at turn 4 correctly reference what it said at turn 1?"&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Design conversation scenarios visually
&lt;/h3&gt;

&lt;p&gt;You build conversation structures in a GUI — much like designing flows in Dify itself. For each turn, you set the user's message and the expected response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figf5zgr4xlt7xfz0g9xe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figf5zgr4xlt7xfz0g9xe.png" alt="Design each turn's user message and expected response in a visual editor" width="800" height="654"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Design dynamic branching based on bot responses
&lt;/h3&gt;

&lt;p&gt;Real conversations aren't linear. What the user asks next depends on what the bot just said.&lt;/p&gt;

&lt;p&gt;ConvoProbe uses an LLM to evaluate the bot's response at runtime and dynamically determines which branch to follow. Static dataset evaluation can't express this kind of "output-dependent branching."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7w7df33kehcfew2h0x31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7w7df33kehcfew2h0x31.png" alt="At runtime, an LLM evaluates the bot's response to determine which branch to follow" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Auto-generate scenarios from Dify DSL
&lt;/h3&gt;

&lt;p&gt;Paste your Dify app's DSL (the YAML flow definition) into ConvoProbe, and it analyzes the flow structure to auto-generate test scenarios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42tfi55zmlp85nc9y03f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42tfi55zmlp85nc9y03f.png" alt="Paste a Dify app's DSL (YAML) to auto-generate test scenarios from the flow structure" width="800" height="689"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No need to design scenarios from scratch. For existing Dify apps, you can start testing immediately. Generated scenarios can be run as-is or edited in the GUI.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Score each turn with LLM-as-Judge
&lt;/h3&gt;

&lt;p&gt;When a scenario runs, each turn's response is automatically scored on the following criteria:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Semantic alignment&lt;/td&gt;
&lt;td&gt;Does the actual response convey the expected meaning and information?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Completeness&lt;/td&gt;
&lt;td&gt;Does the actual response cover all key points from the expected answer?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;Is the information in the actual response factually correct?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Relevance&lt;/td&gt;
&lt;td&gt;Is the actual response directly relevant to the question?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni2tt8g0bvdn9jxwsymh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni2tt8g0bvdn9jxwsymh.png" alt="Each turn is scored on 4 evaluation criteria" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What scenario testing reveals
&lt;/h2&gt;

&lt;p&gt;Running multi-turn scenario tests surfaces quality problems that are otherwise hard to catch:&lt;/p&gt;

&lt;h3&gt;
  
  
  Quality degrades over multiple turns
&lt;/h3&gt;

&lt;p&gt;A chatbot that looks fine on single-turn tests can fall apart after 3-4 turns. RAG-based chatbots are especially prone to this — as conversations progress, the bot's ability to determine which retrieved information is relevant starts to drift.&lt;/p&gt;

&lt;p&gt;If you only test single turns, you'll miss this entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context loss is silent
&lt;/h3&gt;

&lt;p&gt;When a bot "forgets" earlier conversation history, there's no crash or error. It just generates a plausible-sounding but incorrect response.&lt;/p&gt;

&lt;p&gt;To verify whether "turn 4 correctly references turn 1," you need to intentionally design and execute that conversation flow as a test scenario.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workflow updates cause regressions
&lt;/h3&gt;

&lt;p&gt;Updating a Dify workflow — changing a system prompt, adjusting RAG retrieval parameters — can silently break conversation patterns that were working before.&lt;/p&gt;

&lt;p&gt;Running the same scenarios before and after a change lets you catch degradation before it reaches production.&lt;/p&gt;




&lt;h2&gt;
  
  
  How ConvoProbe fits with existing tools
&lt;/h2&gt;

&lt;p&gt;ConvoProbe isn't a replacement for Langfuse or LangSmith — it's complementary.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;During development&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ConvoProbe&lt;/td&gt;
&lt;td&gt;Run scenario tests to verify it's safe to ship&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Before release&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ConvoProbe&lt;/td&gt;
&lt;td&gt;Compare scenario scores before/after changes (regression testing)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;In production&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Langfuse / LangSmith / Opik&lt;/td&gt;
&lt;td&gt;Tracing, cost monitoring, post-hoc evaluation of real conversations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;When issues surface&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ConvoProbe&lt;/td&gt;
&lt;td&gt;Create a scenario that reproduces the problem, fix, re-test&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Langfuse helps you &lt;em&gt;discover&lt;/em&gt; problems. ConvoProbe helps you &lt;em&gt;prevent them from recurring&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;To get started, ConvoProbe needs just a Dify API key and an LLM API key for evaluation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://convoprobe.vercel.app" rel="noopener noreferrer"&gt;https://convoprobe.vercel.app&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>machinelearning</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>When More Examples Make Your LLM Worse: Discovering Few-Shot Collapse</title>
      <dc:creator>Shuntaro Okuma</dc:creator>
      <pubDate>Fri, 27 Feb 2026 16:06:39 +0000</pubDate>
      <link>https://forem.com/shuntarookuma/when-more-examples-make-your-llm-worse-discovering-few-shot-collapse-106i</link>
      <guid>https://forem.com/shuntarookuma/when-more-examples-make-your-llm-worse-discovering-few-shot-collapse-106i</guid>
      <description>&lt;p&gt;Here's something everyone agrees on about few-shot prompting: give the model more examples, it performs better.&lt;/p&gt;

&lt;p&gt;I believed that too. Then I measured it.&lt;/p&gt;

&lt;p&gt;To do that, I built &lt;a href="https://github.com/ShuntaroOkuma/adapt-gauge-core" rel="noopener noreferrer"&gt;AdaptGauge&lt;/a&gt;, an open-source tool that measures how efficiently LLMs learn from few-shot examples.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I tested
&lt;/h2&gt;

&lt;p&gt;I evaluated eight models across four tasks designed to mirror real business scenarios, at shot counts of 0, 1, 2, 4, and 8:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Classification&lt;/strong&gt; — Categorize customer support inquiries into one of 8 categories (billing, technical support, returns, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Fix&lt;/strong&gt; — Identify and fix bugs in short Python functions (off-by-one errors, missing edge cases)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summarization&lt;/strong&gt; — Extract key points from Japanese news articles into bullet-point summaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route Optimization&lt;/strong&gt; — Calculate optimal delivery routes across multiple destinations with time windows and fuel costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Models tested:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud APIs&lt;/strong&gt;: Claude Haiku 4.5, Claude Opus 4.5, Gemini 2.5 Flash, Gemini 3 Flash, Gemini 3 Pro&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local models&lt;/strong&gt;: Gemma 3 27B, GPT-OSS 120B, Qwen3-VL 8B&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each model-task pair, I also compared two example selection strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fixed&lt;/strong&gt; — The same hand-picked examples used for every test input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf" rel="noopener noreferrer"&gt;TF-IDF&lt;/a&gt; dynamic selection&lt;/strong&gt; — For each test input, score all candidate examples by word-overlap similarity and pick the closest matches. The idea: examples that resemble the current input should help the model more. Tang et al. (2025) &lt;a href="https://arxiv.org/abs/2509.13196" rel="noopener noreferrer"&gt;reported&lt;/a&gt; that combining this with stratified sampling achieves better performance with fewer examples.&lt;/li&gt;
&lt;/ul&gt;
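
&lt;p&gt;Here is that TF-IDF selection step sketched with scikit-learn. The candidate pool format and &lt;code&gt;k&lt;/code&gt; are placeholders, and this shows the general technique rather than AdaptGauge's exact implementation.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_examples(test_input, candidates, k):
    """Pick the k candidates whose input text is most similar to the test input.

    candidates is a list of dicts with "input" and "output" keys; similarity is
    plain TF-IDF cosine similarity over the input texts.
    """
    if k == 0:
        return []
    texts = [c["input"] for c in candidates] + [test_input]
    matrix = TfidfVectorizer().fit_transform(texts)
    sims = cosine_similarity(matrix[len(candidates)], matrix[:len(candidates)])[0]
    ranked = sorted(range(len(candidates)), key=lambda i: sims[i], reverse=True)
    return [candidates[i] for i in ranked[:k]]
&lt;/code&gt;&lt;/pre&gt;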

&lt;p&gt;Full task definitions — including prompts, examples, and scoring rubrics — are in the &lt;a href="https://github.com/ShuntaroOkuma/adapt-gauge-core/tree/main/tasks" rel="noopener noreferrer"&gt;demo task pack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Most of the time, the results looked exactly like you'd expect. More examples, better scores. But not always.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three patterns that break the assumption
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pattern 1: The model learns, then unlearns
&lt;/h3&gt;

&lt;p&gt;On a route optimization task, Gemini 3 Flash scored 33% at zero-shot. Adding examples helped — performance climbed to &lt;strong&gt;64% at 4-shot&lt;/strong&gt;. Textbook behavior.&lt;/p&gt;

&lt;p&gt;Then I added more. At 8-shot, the score &lt;strong&gt;crashed back to 33%&lt;/strong&gt;. Right back where it started.&lt;/p&gt;

&lt;p&gt;The model learned, then unlearned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bgc2q4cqjo0fhzq18eq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bgc2q4cqjo0fhzq18eq.png" alt="Learning curves for the route optimization task. Gemini 3 Flash shows a dramatic V-shape (peak regression), while the other four models improve steadily."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Four models improved steadily. One didn't. I call this &lt;strong&gt;peak regression&lt;/strong&gt; — and you can't spot it without tracking the full learning curve.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 2: Rankings flip completely
&lt;/h3&gt;

&lt;p&gt;On a classification task, something even stranger happened. The &lt;strong&gt;model rankings reversed&lt;/strong&gt; between zero-shot and eight-shot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flowtwressd1f2jqqmbfq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flowtwressd1f2jqqmbfq.png" alt="Classification scores at 0-shot vs 8-shot. Gemini 2.5 Flash surges from 20% to 80%, overtaking Gemini 3 Pro which stays flat at 60%."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Look at Gemini 2.5 Flash: it scored just 20% at zero-shot, but climbed to 80% with eight examples — the highest of any model. Meanwhile, Gemini 3 Pro stayed flat at 60% regardless of shot count.&lt;/p&gt;

&lt;p&gt;A "Pro" model isn't necessarily better than a "Flash" model — it depends on how you prompt it. Choosing a model based on public benchmarks alone can lead you to the wrong conclusion.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3: How you pick examples can trigger collapse
&lt;/h3&gt;

&lt;p&gt;I tested two methods for selecting few-shot examples: &lt;strong&gt;fixed&lt;/strong&gt; (hand-picked) and &lt;strong&gt;TF-IDF&lt;/strong&gt; (dynamically selected by text similarity).&lt;/p&gt;

&lt;p&gt;Tang et al.'s &lt;a href="https://arxiv.org/abs/2509.13196" rel="noopener noreferrer"&gt;"The Few-shot Dilemma"&lt;/a&gt; (2025) found that TF-IDF-based selection combined with stratified sampling achieved superior performance with fewer examples. And on most of my tasks, TF-IDF did help.&lt;/p&gt;

&lt;p&gt;But on a route optimization task with GPT-OSS 120B, it &lt;strong&gt;made things dramatically worse&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fooxjt5gxdnjn2bx502wg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fooxjt5gxdnjn2bx502wg.png" alt="Fixed vs TF-IDF example selection for GPT-OSS 120B. TF-IDF collapses to 35% at 2-shot while fixed selection stays above 50%."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With fixed examples, the model stayed above 50%. With TF-IDF, it &lt;strong&gt;collapsed to 35% at 2-shot&lt;/strong&gt; — a 58% relative drop. The method designed to find "better" examples triggered a failure.&lt;/p&gt;




&lt;p&gt;Adding in-context examples — or changing how you select them — can actively degrade model performance. I call this &lt;strong&gt;few-shot collapse&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  I'm not the first to see this
&lt;/h2&gt;

&lt;p&gt;After finding these patterns, I dug into the literature. Turns out researchers have been documenting the same thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The over-prompting problem.&lt;/strong&gt; Tang et al. (2025) showed that LLM performance peaks at a certain number of examples and then &lt;a href="https://arxiv.org/abs/2509.13196" rel="noopener noreferrer"&gt;declines&lt;/a&gt;. LLaMA and Gemma models showed dramatic degradation. GPT models held up better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Catastrophic drops in security tasks.&lt;/strong&gt; An &lt;a href="https://www.ndss-symposium.org/wp-content/uploads/2025-1491-paper.pdf" rel="noopener noreferrer"&gt;NDSS 2025 study&lt;/a&gt; (Lin &amp;amp; Mohaisen) found that few-shot examples dramatically degraded vulnerability &lt;em&gt;type identification&lt;/em&gt;. In terms of AP (accurate-response percentage), Gemma 7B dropped from 77.9% to 39.9%, and LLaMA-2 70B from 68.6% to 21.0%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Labels don't even matter.&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2202.12837" rel="noopener noreferrer"&gt;Min et al. (2022)&lt;/a&gt; found that randomly replacing labels in few-shot examples barely hurts performance. Models aren't learning input-label mappings — they're picking up format and distribution cues. The mechanism behind few-shot "learning" is far more fragile than most people assume.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this happens
&lt;/h2&gt;

&lt;p&gt;A few factors are at play:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More tokens = worse performance.&lt;/strong&gt; Chroma Research's &lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;"Context Rot"&lt;/a&gt; study (2025) showed that simply increasing input tokens — even with irrelevant whitespace — significantly degrades performance across a wide range of models and tasks. More examples means more tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Position matters.&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2307.03172" rel="noopener noreferrer"&gt;Liu et al. (2024)&lt;/a&gt; showed that models struggle with information in the middle of long contexts. When examples push the actual task further down, the model loses track.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-training biases conflict with examples.&lt;/strong&gt; Some models have strong priors. When examples contradict those priors, or introduce patterns the model over-indexes on, the result is worse than no examples at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example selection amplifies or dampens all of this.&lt;/strong&gt; My TF-IDF comparison showed that "textually similar" doesn't always mean "helpful." A relevant example can still confuse the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for you
&lt;/h2&gt;

&lt;p&gt;If you're using LLMs in production:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your prompt "improvements" might be breaking things.&lt;/strong&gt; Adding examples is the default fix when a model underperforms. My data shows it can backfire — and without measurement, you won't know until users complain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Leaderboard rankings don't predict this.&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2402.01781" rel="noopener noreferrer"&gt;Alzahrani et al. (2024)&lt;/a&gt; showed that minor benchmark changes shift rankings by up to 8 positions. My classification results confirm it: the zero-shot leader dropped to third once examples were added.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Different models break on different tasks.&lt;/strong&gt; Gemini 3 Flash collapsed on route optimization but improved on summarization. There's no universal "safe" model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example selection is a variable, not a constant.&lt;/strong&gt; Switching from hand-picked to TF-IDF examples turned a working model into a broken one. This isn't a "set and forget" choice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Detecting it automatically
&lt;/h2&gt;

&lt;p&gt;These findings led me to a framework inspired by Chollet's &lt;a href="https://arxiv.org/abs/1911.01547" rel="noopener noreferrer"&gt;"On the Measure of Intelligence"&lt;/a&gt; (2019):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The intelligence of a system is the measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of "how good is this model?", the question should be "how efficiently does it adapt?" — and critically, &lt;em&gt;does it ever adapt in the wrong direction?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I built this idea into &lt;a href="https://github.com/ShuntaroOkuma/adapt-gauge-core" rel="noopener noreferrer"&gt;AdaptGauge&lt;/a&gt;. For each model-task pair across shot counts (0 through 8), it automatically computes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Learning Curve AUC&lt;/strong&gt; — Overall learning efficiency. Higher means the model learns faster from examples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Few-Shot Collapse&lt;/strong&gt; — Auto-alerts when 8-shot performance drops below 80% of the 0-shot baseline (sketched after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collapse Pattern&lt;/strong&gt; — Classifies each curve as &lt;em&gt;immediate collapse&lt;/em&gt;, &lt;em&gt;gradual decline&lt;/em&gt;, &lt;em&gt;peak regression&lt;/em&gt;, or &lt;em&gt;stable&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resilience Score&lt;/strong&gt; — How well the model holds up as shot count increases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example Selection Comparison&lt;/strong&gt; — Runs fixed vs TF-IDF side-by-side to find what works for each model-task pair.&lt;/li&gt;
&lt;/ul&gt;
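
&lt;p&gt;As a rough sketch of the shape of that logic, the snippet below applies the 80% collapse check and a simplified pattern label to a learning curve. The pattern heuristics here are my own simplifications for illustration, not AdaptGauge's actual classifier.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def detect_collapse(scores_by_shot, threshold=0.8):
    """Flag few-shot collapse: the 8-shot score falls below 80% of the 0-shot baseline."""
    return scores_by_shot[8] &amp;lt; threshold * scores_by_shot[0]

def label_pattern(scores_by_shot, tol=0.05):
    """Very rough curve labelling; a simplification, not AdaptGauge's classifier."""
    zero, eight = scores_by_shot[0], scores_by_shot[8]
    peak = max(scores_by_shot.values())
    if peak - max(zero, eight) &amp;gt; tol:
        return "peak regression"      # improved first, then gave the gains back
    if zero - scores_by_shot[1] &amp;gt; tol and zero - eight &amp;gt; tol:
        return "immediate collapse"   # drops as soon as any example is added
    if zero - eight &amp;gt; tol:
        return "gradual decline"
    return "stable"

# Hypothetical curve for illustration: high at zero-shot, sliding down with shots.
curve = {0: 0.90, 1: 0.85, 2: 0.60, 4: 0.55, 8: 0.40}
print(detect_collapse(curve), label_pattern(curve))
&lt;/code&gt;&lt;/pre&gt;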

&lt;p&gt;AdaptGauge is primarily a CLI tool, but it also includes a simple GUI for reviewing results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fruuj5xzpzp528xtgyw1i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fruuj5xzpzp528xtgyw1i.png" alt="AdaptGauge output example showing learning curves, collapse patterns, resilience scores, and example selection comparison."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my evaluation, it flagged the peak regression in Gemini 3 Flash and the TF-IDF-induced collapse in GPT-OSS 120B automatically. These are patterns that spot-checking would miss entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;AdaptGauge is open-source. Clone the repo, check the pre-computed demo results, or run your own evaluations against any model with an API. For local models, LM Studio makes it easy to test.&lt;/p&gt;

&lt;p&gt;If you've ever added examples to a prompt and wondered whether it actually helped — now you can find out.&lt;/p&gt;

&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/ShuntaroOkuma" rel="noopener noreferrer"&gt;
        ShuntaroOkuma
      &lt;/a&gt; / &lt;a href="https://github.com/ShuntaroOkuma/adapt-gauge-core" rel="noopener noreferrer"&gt;
        adapt-gauge-core
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Measure LLM adaptation efficiency — how fast models learn from few examples
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;adapt-gauge-core&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href="https://opensource.org/licenses/MIT" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fdf2982b9f5d7489dcf44570e714e3a15fce6253e0cc6b5aa61a075aac2ff71b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667" alt="License: MIT"&gt;&lt;/a&gt; &lt;a href="https://www.python.org/downloads/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/36cf3d0f7992a33a063d3833577d62204f8934d82b69874c086390608db4947c/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f707974686f6e2d332e31312b2d626c75652e737667" alt="Python 3.11+"&gt;&lt;/a&gt; &lt;a href="https://github.com/ShuntaroOkuma/adapt-gauge-core/actions/workflows/test.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/ShuntaroOkuma/adapt-gauge-core/actions/workflows/test.yml/badge.svg" alt="Tests"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/ShuntaroOkuma/adapt-gauge-core/README_ja.md" rel="noopener noreferrer"&gt;日本語&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Measure how fast LLMs learn from few-shot examples — and detect when they break.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;adapt-gauge-core is an open-source evaluation harness that measures &lt;strong&gt;Adaptation Efficiency&lt;/strong&gt; — how quickly a language model improves with few-shot examples (0, 1, 2, 4, 8 shots) and whether it suffers from &lt;strong&gt;few-shot collapse&lt;/strong&gt; (performance degradation with more examples).&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why Adaptation Efficiency?&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Standard LLM benchmarks measure accuracy at a single point. But in production, teams often use few-shot prompting to adapt models to specific tasks. Two critical questions arise:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;How many examples does this model need?&lt;/strong&gt; Some models reach peak performance at 2 shots; others need 8.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does adding examples ever hurt?&lt;/strong&gt; For some model-task combinations, performance &lt;em&gt;drops&lt;/em&gt; with more examples — a phenomenon known as &lt;strong&gt;few-shot collapse&lt;/strong&gt; (also called &lt;strong&gt;over-prompting&lt;/strong&gt; in the literature).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;adapt-gauge-core answers both questions automatically.&lt;/p&gt;
&lt;p&gt;In our evaluations, we observed that &lt;strong&gt;leaderboard rankings reverse&lt;/strong&gt; depending on shot count — a model…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/ShuntaroOkuma/adapt-gauge-core" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;







&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chollet (2019) — &lt;a href="https://arxiv.org/abs/1911.01547" rel="noopener noreferrer"&gt;"On the Measure of Intelligence"&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Min et al. (2022) — &lt;a href="https://arxiv.org/abs/2202.12837" rel="noopener noreferrer"&gt;"Rethinking the Role of Demonstrations"&lt;/a&gt;, EMNLP&lt;/li&gt;
&lt;li&gt;Liu et al. (2024) — &lt;a href="https://arxiv.org/abs/2307.03172" rel="noopener noreferrer"&gt;"Lost in the Middle"&lt;/a&gt;, TACL&lt;/li&gt;
&lt;li&gt;Alzahrani et al. (2024) — &lt;a href="https://arxiv.org/abs/2402.01781" rel="noopener noreferrer"&gt;"When Benchmarks are Targets"&lt;/a&gt;, ACL&lt;/li&gt;
&lt;li&gt;Tang et al. (2025) — &lt;a href="https://arxiv.org/abs/2509.13196" rel="noopener noreferrer"&gt;"The Few-shot Dilemma"&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Lin &amp;amp; Mohaisen (2025) — &lt;a href="https://www.ndss-symposium.org/wp-content/uploads/2025-1491-paper.pdf" rel="noopener noreferrer"&gt;"From Large to Mammoth"&lt;/a&gt;, NDSS&lt;/li&gt;
&lt;li&gt;Chroma Research (2025) — &lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;"Context Rot"&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
