<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Dechun Wang</title>
    <description>The latest articles on Forem by Dechun Wang (@superorange0707).</description>
    <link>https://forem.com/superorange0707</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3023932%2F02332919-44aa-45f2-8387-c879c55d794f.jpeg</url>
      <title>Forem: Dechun Wang</title>
      <link>https://forem.com/superorange0707</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/superorange0707"/>
    <language>en</language>
    <item>
      <title>Do LLMs Lie? The Real Reason AI Sounds Smart While Making Things Up</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Fri, 27 Feb 2026 01:59:54 +0000</pubDate>
      <link>https://forem.com/superorange0707/do-llms-lie-the-real-reason-ai-sounds-smart-while-making-things-up-4jep</link>
      <guid>https://forem.com/superorange0707/do-llms-lie-the-real-reason-ai-sounds-smart-while-making-things-up-4jep</guid>
      <description>&lt;h1&gt;
  
  
  AI “Lies”? Or Is It Just Doing Exactly What You Asked?
&lt;/h1&gt;

&lt;p&gt;You’ve seen it.&lt;/p&gt;

&lt;p&gt;You ask a model a question. It answers with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a clean structure,&lt;/li&gt;
&lt;li&gt;confident language,&lt;/li&gt;
&lt;li&gt;a few &lt;em&gt;very specific&lt;/em&gt; details,&lt;/li&gt;
&lt;li&gt;maybe even a fake-looking citation for extra authority.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then you Google it.&lt;/p&gt;

&lt;p&gt;Nothing exists.&lt;/p&gt;

&lt;p&gt;So the obvious conclusion is: &lt;strong&gt;“AI is lying.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s the more useful conclusion:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;LLMs optimize for plausibility, not truth.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
They’re not truth engines — they’re &lt;em&gt;text engines&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And once you understand that, hallucination stops being “a mysterious model defect” and becomes an engineering problem you can design around.&lt;/p&gt;

&lt;p&gt;This article is your field guide.&lt;/p&gt;




&lt;h1&gt;
  
  
  1) What Is an AI Hallucination?
&lt;/h1&gt;

&lt;p&gt;In engineering terms, a hallucination is any output that is &lt;strong&gt;not grounded&lt;/strong&gt; in either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;verifiable external reality (facts, sources, measurements), or&lt;/li&gt;
&lt;li&gt;your provided context/instructions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In human terms:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;It’s fluent nonsense with good manners.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Two hallucinations that matter in real products
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1.1 Factual hallucination (the model invents claims)
&lt;/h2&gt;

&lt;p&gt;Example vibe:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Yes, honey stabilizes blood sugar for diabetics because it’s natural.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is &lt;strong&gt;factually incorrect&lt;/strong&gt; and unsafe. The model is using a &lt;em&gt;natural = healthy&lt;/em&gt; pattern and filling in the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  1.2 Faithfulness hallucination (the model drifts away from what you asked)
&lt;/h2&gt;

&lt;p&gt;Example vibe:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You asked: “Can diabetics replace sugar with honey?”&lt;br&gt;&lt;br&gt;
It answers: “Honey contains minerals and antioxidants.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer might be broadly true-ish… but it &lt;strong&gt;didn’t answer the question&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is &lt;em&gt;instruction/context drift&lt;/em&gt;: the model chose a plausible response path instead of your intended one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why hallucinations are so dangerous
&lt;/h2&gt;

&lt;p&gt;Because hallucinations are rarely random garbage.&lt;/p&gt;

&lt;p&gt;They’re often:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;internally consistent,&lt;/li&gt;
&lt;li&gt;nicely written,&lt;/li&gt;
&lt;li&gt;and optimized to &lt;em&gt;sound&lt;/em&gt; helpful.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In high-stakes domains (medicine, law, finance, ops), that’s a failure mode with teeth.&lt;/p&gt;




&lt;h1&gt;
  
  
  2) Why Hallucinations Happen (Mechanics, Not Mysticism)
&lt;/h1&gt;

&lt;p&gt;If you want one equation to explain the whole phenomenon, it’s this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;LLMs learn &lt;strong&gt;P(next token | context)&lt;/strong&gt; —&lt;br&gt;&lt;br&gt;
not &lt;strong&gt;P(true statement | world)&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A model can be &lt;em&gt;excellent&lt;/em&gt; at the first probability while being mediocre at the second.&lt;/p&gt;
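&lt;p&gt;To make that concrete, here is a toy sketch of next-token scoring. The tokens and logit values are invented for illustration; the point is that the scoring function contains no truth term, only plausibility.&lt;/p&gt;

```python
import math

# Toy "model": a next-token score table for the context
# "Honey ___ blood sugar." All numbers are invented.
next_token_logits = {
    "stabilizes": 2.0,   # common in wellness blogs, so it scores high
    "raises": 1.2,       # medically accurate, but less frequent in text
    "teleports": -3.0,   # implausible as language, near-zero probability
}

def softmax(logits):
    """Convert raw scores into a probability distribution over tokens."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

probs = softmax(next_token_logits)
best = max(probs, key=probs.get)  # the most plausible token wins, true or not
```

&lt;p&gt;Greedy decoding picks &lt;code&gt;stabilizes&lt;/code&gt; here because it is the most plausible continuation, not because it is correct.&lt;/p&gt;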

&lt;h2&gt;
  
  
  2.1 The objective function never included “truth”
&lt;/h2&gt;

&lt;p&gt;Training pushes models to generate what looks like “a good continuation” of language.&lt;/p&gt;

&lt;p&gt;If the training data contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;outdated facts,&lt;/li&gt;
&lt;li&gt;biased narratives,&lt;/li&gt;
&lt;li&gt;low-quality posts,&lt;/li&gt;
&lt;li&gt;or incorrect statements repeated many times,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;the model will absorb those patterns.&lt;/p&gt;

&lt;p&gt;Not because it’s dumb — because it’s doing gradient descent on the world’s mess.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.2 The world is out-of-distribution by default
&lt;/h2&gt;

&lt;p&gt;Most user questions are weird combinations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Explain how X affects Y in Z scenario, but for 2026 rules, in the UK, and for a small business.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These compositional queries often don’t exist explicitly in training.&lt;/p&gt;

&lt;p&gt;When an LLM faces an unfamiliar combination, it does what it’s trained to do:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Produce a plausible continuation anyway.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Silence wasn’t rewarded.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.3 Parametric memory is confident by design
&lt;/h2&gt;

&lt;p&gt;LLMs store “knowledge” in weights — not in a database that can be checked or updated.&lt;/p&gt;

&lt;p&gt;That leads to classic failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fuzzy time boundaries (it may speak about “recent changes” without knowing what’s recent),&lt;/li&gt;
&lt;li&gt;invented paper titles,&lt;/li&gt;
&lt;li&gt;confident timelines that never happened.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model often &lt;strong&gt;doesn’t know that it doesn’t know&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.4 Language is ambiguous; instructions are under-specified
&lt;/h2&gt;

&lt;p&gt;“Explain deep learning” could mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;theory, math, history, code, architecture, best practices, business applications…&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you don’t lock scope, the model will choose a generic path.&lt;/p&gt;

&lt;p&gt;That generic path can drift from your real intent — and suddenly you’re in faithfulness hallucination territory.&lt;/p&gt;




&lt;h1&gt;
  
  
  3) “Better Reasoning” Does Not Automatically Mean “More Truth”
&lt;/h1&gt;

&lt;p&gt;Here’s the trap:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If a model can reason better, surely it hallucinates less… right?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sometimes. Not always.&lt;/p&gt;

&lt;p&gt;Reasoning increases &lt;strong&gt;coherence&lt;/strong&gt;. Coherence can &lt;em&gt;hide&lt;/em&gt; errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  3.1 Reasoning helps in constrained tasks
&lt;/h2&gt;

&lt;p&gt;Math, logic puzzles, rule-based workflows — where intermediate states can be checked.&lt;/p&gt;

&lt;p&gt;In these settings, reasoning often reduces mistakes because the task has &lt;strong&gt;hard constraints&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  3.2 Reasoning can amplify hallucinations in open-world facts
&lt;/h2&gt;

&lt;p&gt;In unconstrained factual generation, stronger reasoning can do something scary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;take a wrong premise,&lt;/li&gt;
&lt;li&gt;expand it into a beautiful argument,&lt;/li&gt;
&lt;li&gt;and deliver a conclusion that feels &lt;em&gt;more credible&lt;/em&gt; than a simple wrong answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three common mechanisms:&lt;/p&gt;

&lt;h3&gt;
  
  
  (1) Over-extrapolation
&lt;/h3&gt;

&lt;p&gt;The model extends a pattern beyond evidence (“A usually implies B, so it must here too”).&lt;/p&gt;

&lt;h3&gt;
  
  
  (2) Confidence misalignment
&lt;/h3&gt;

&lt;p&gt;More structured answers sound more certain — even when the underlying claim is shaky.&lt;/p&gt;

&lt;h3&gt;
  
  
  (3) Correct reasoning over false premises
&lt;/h3&gt;

&lt;p&gt;If the premise is wrong, the logic can still be flawless… and the result is still false.&lt;/p&gt;

&lt;p&gt;That’s why “smart-sounding” is not a truth signal.&lt;/p&gt;




&lt;h1&gt;
  
  
  4) A Practical Taxonomy: How to Spot Hallucinations Fast
&lt;/h1&gt;

&lt;p&gt;If you want a quick mental checklist, use &lt;strong&gt;F.A.C.T.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;F&lt;/strong&gt;abricated specifics: suspiciously precise names, dates, numbers, citations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A&lt;/strong&gt;uthority cosplay: “According to a 2023 FDA paper…” with no traceable source&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C&lt;/strong&gt;ontext drift: it answers adjacent questions, not your question&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T&lt;/strong&gt;ime confusion: “recently” / “latest” claims without anchoring time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If two or more show up, you should treat the output as &lt;em&gt;a draft&lt;/em&gt;, not an answer.&lt;/p&gt;
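&lt;p&gt;The checklist can be turned into a crude automatic screen. The regexes below are illustrative guesses, not a validated detector, and note that the &lt;strong&gt;C&lt;/strong&gt; (context drift) signal can't be caught by pattern matching at all; it needs a semantic comparison against the original question.&lt;/p&gt;

```python
import re

# Heuristic F.A.C.T. screen. Patterns are illustrative; tune per domain.
SIGNALS = {
    # suspiciously precise years / page-style citations
    "fabricated_specifics": r"\b(19|20)\d{2}\b|\bvol\.\s*\d+|\bpp?\.\s*\d+",
    # "according to a ... paper/study/report" with no traceable source
    "authority_cosplay": r"\baccording to (a|an|the)\b.*\b(paper|study|report)\b",
    # unanchored time claims
    "time_confusion": r"\b(recently|latest|as of now|currently)\b",
}

def fact_screen(text):
    t = text.lower()
    hits = [name for name, pat in SIGNALS.items() if re.search(pat, t)]
    # Two or more signals -> treat the output as a draft, not an answer.
    return {"signals": hits, "treat_as_draft": len(hits) >= 2}
```

&lt;p&gt;It won't catch everything, but it cheaply flags the outputs that deserve a human second look.&lt;/p&gt;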




&lt;h1&gt;
  
  
  5) What Normal Users Can Do (No PhD Required)
&lt;/h1&gt;

&lt;p&gt;You can’t eliminate hallucinations. But you can lower risk massively with three moves.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.1 Ground it: search / RAG / external evidence
&lt;/h2&gt;

&lt;p&gt;If your question is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;time-sensitive,&lt;/li&gt;
&lt;li&gt;domain-critical,&lt;/li&gt;
&lt;li&gt;or requires citations,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;then the model shouldn’t be your source of truth.&lt;/p&gt;

&lt;p&gt;Use external grounding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;search,&lt;/li&gt;
&lt;li&gt;your documentation,&lt;/li&gt;
&lt;li&gt;a domain KB,&lt;/li&gt;
&lt;li&gt;a RAG pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;force the model to &lt;strong&gt;answer from evidence&lt;/strong&gt;, not from vibes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  5.2 Verify it: two-model review + claim checking
&lt;/h2&gt;

&lt;p&gt;A simple workflow that works shockingly well:&lt;/p&gt;

&lt;p&gt;1) Model A produces an answer&lt;br&gt;&lt;br&gt;
2) Model B audits it like an aggressive reviewer&lt;/p&gt;

&lt;p&gt;Tell Model B to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;list factual claims,&lt;/li&gt;
&lt;li&gt;mark uncertainty,&lt;/li&gt;
&lt;li&gt;flag anything that looks unverified,&lt;/li&gt;
&lt;li&gt;propose how to validate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This doesn’t guarantee correctness — but it exposes weak spots fast.&lt;/p&gt;
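&lt;p&gt;The workflow is a few lines of glue. The model calls below are hypothetical stubs standing in for real API clients, so only the control flow is shown.&lt;/p&gt;

```python
# Two-model review: Model A produces, Model B audits.
AUDIT_PROMPT = (
    "Audit the answer below like an aggressive reviewer.\n"
    "1) List each factual claim.\n"
    "2) Mark uncertainty for each claim.\n"
    "3) Flag anything that looks unverified.\n"
    "4) Propose how to validate it.\n\n"
    "ANSWER:\n{answer}"
)

def ask_model_a(question: str) -> str:
    # stub producer; replace with your LLM client
    return "Honey stabilizes blood sugar for diabetics."

def ask_model_b(prompt: str) -> str:
    # stub auditor; replace with a second, independent model
    return "CLAIM: honey stabilizes blood sugar -> UNVERIFIED, check clinical sources."

def two_model_review(question: str) -> dict:
    answer = ask_model_a(question)
    audit = ask_model_b(AUDIT_PROMPT.format(answer=answer))
    return {"answer": answer, "audit": audit,
            "needs_review": "UNVERIFIED" in audit}
```

&lt;p&gt;Using two different model families as A and B tends to work better than self-review, since they rarely share the exact same blind spots.&lt;/p&gt;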
&lt;h2&gt;
  
  
  5.3 Constrain it: prompts that shrink the imagination space
&lt;/h2&gt;

&lt;p&gt;Most hallucinations happen when the model has too much freedom.&lt;/p&gt;

&lt;p&gt;Give it less.&lt;/p&gt;
&lt;h3&gt;
  
  
  Constraint prompt template (copy/paste)
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Answer only for: &lt;strong&gt;[time range]&lt;/strong&gt;, &lt;strong&gt;[region]&lt;/strong&gt;, &lt;strong&gt;[source types]&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
If you are unsure, say &lt;strong&gt;“I don’t know”&lt;/strong&gt; and list what you would need to verify.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Adversarial prompt template (copy/paste)
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Before answering, list 3 ways your answer could be wrong.&lt;br&gt;&lt;br&gt;
Then answer, clearly separating &lt;strong&gt;facts vs assumptions&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This “self-audit” pattern reduces confident nonsense because it forces the model to expose uncertainty.&lt;/p&gt;
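&lt;p&gt;If you use these templates a lot, it helps to make them functions so the constraints can't be forgotten. The field names here are illustrative; adapt them to your own prompt conventions.&lt;/p&gt;

```python
def constrain(question: str, time_range: str, region: str, source_types: str) -> str:
    """Wrap a question in the constraint template above."""
    return (
        f"{question}\n\n"
        f"Answer only for: {time_range}, {region}, {source_types}.\n"
        "If you are unsure, say \"I don't know\" and list what you "
        "would need to verify."
    )

def adversarial(question: str) -> str:
    """Wrap a question in the adversarial self-audit template above."""
    return (
        "Before answering, list 3 ways your answer could be wrong.\n"
        "Then answer, clearly separating facts vs assumptions.\n\n"
        f"QUESTION: {question}"
    )
```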


&lt;h1&gt;
  
  
  6) For Builders: The Four Defenses That Actually Scale
&lt;/h1&gt;

&lt;p&gt;If you ship LLMs in production, the best mitigations are boring and systematic:&lt;/p&gt;
&lt;h2&gt;
  
  
  6.1 Retrieval grounding (RAG)
&lt;/h2&gt;

&lt;p&gt;Bring in evidence, attach it to the context, and require citations internally (even if you hide them in UX).&lt;/p&gt;
&lt;h2&gt;
  
  
  6.2 Tool-based verification
&lt;/h2&gt;

&lt;p&gt;If a claim can be checked by a tool, check it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compute with a calculator,&lt;/li&gt;
&lt;li&gt;validate with a DB query,&lt;/li&gt;
&lt;li&gt;confirm with a search call,&lt;/li&gt;
&lt;li&gt;cross-check with a rules engine.&lt;/li&gt;
&lt;/ul&gt;
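&lt;p&gt;A minimal sketch of the calculator case: route arithmetic claims to a real evaluator instead of trusting the model's numbers. The AST walker only accepts numeric literals and four operators; everything else is returned as "not checkable by this tool." Function names are illustrative.&lt;/p&gt;

```python
import ast
import operator as op

_OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def _eval(node):
    # Recursively evaluate a safe arithmetic AST; reject anything else.
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")

def check_arithmetic(expr: str, claimed: float):
    """Verify 'expr = claimed' with a calculator, not the model."""
    try:
        return abs(_eval(ast.parse(expr, mode="eval")) - claimed) < 1e-9
    except ValueError:
        return None  # not checkable by this tool; fall back to retrieval / review
```

&lt;p&gt;The same shape generalizes: a DB query checker, a search checker, and a rules-engine checker each take a claim and return supported / unsupported / not-my-job.&lt;/p&gt;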
&lt;h2&gt;
  
  
  6.3 Verifier / critic loop
&lt;/h2&gt;

&lt;p&gt;Run a second pass that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;extracts claims,&lt;/li&gt;
&lt;li&gt;checks them against retrieved sources,&lt;/li&gt;
&lt;li&gt;rejects or rewrites unsupported sentences.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  6.4 Observability + budgets
&lt;/h2&gt;

&lt;p&gt;Hallucination mitigation is a &lt;strong&gt;runtime&lt;/strong&gt; problem.&lt;/p&gt;

&lt;p&gt;You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tool call logs,&lt;/li&gt;
&lt;li&gt;retrieval traces,&lt;/li&gt;
&lt;li&gt;evaluation scores,&lt;/li&gt;
&lt;li&gt;step/time/token budgets,&lt;/li&gt;
&lt;li&gt;and fail-closed behavior for critical tasks.&lt;/li&gt;
&lt;/ul&gt;
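&lt;p&gt;The budget part is small enough to sketch directly. This is a fail-closed guard: when any budget is exhausted, the task aborts instead of free-running. The default numbers are illustrative.&lt;/p&gt;

```python
import time

class BudgetExceeded(RuntimeError):
    """Raised when a task runs past its step or time budget."""

class Budget:
    def __init__(self, max_steps: int = 8, max_seconds: float = 30.0):
        self.max_steps = max_steps
        self.deadline = time.monotonic() + max_seconds
        self.steps = 0

    def charge(self) -> None:
        # Call once per model/tool step; fail closed when exhausted.
        self.steps += 1
        if self.steps > self.max_steps or time.monotonic() > self.deadline:
            raise BudgetExceeded(f"stopped after {self.steps} steps")
```

&lt;p&gt;A token budget slots in the same way: count tokens per call and charge them against a cap.&lt;/p&gt;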


&lt;h1&gt;
  
  
  7) A Tiny “Hallucination Harness” You Can Use Today
&lt;/h1&gt;

&lt;p&gt;This is a simplified testing script you can run against any model.&lt;/p&gt;

&lt;p&gt;It does two things:&lt;/p&gt;

&lt;p&gt;1) forces the model to output &lt;strong&gt;claims as bullets&lt;/strong&gt;,&lt;br&gt;&lt;br&gt;
2) runs a crude “support check” against provided evidence.&lt;/p&gt;

&lt;p&gt;It’s not perfect — but it makes hallucination visible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_claims&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="c1"&gt;# naive: treat bullet lines as claims
&lt;/span&gt;    &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-•  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;splitlines&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
    &lt;span class="n"&gt;claims&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tldr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;note:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assumption:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;claims&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;support_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claim&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# naive lexical overlap as a placeholder for real entailment checks
&lt;/span&gt;    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[a-zA-Z0-9]+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;claim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[a-zA-Z0-9]+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;flag_hallucinations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.18&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;claim&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;extract_claims&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;support_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claim&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;claim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;support_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage
&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;- Honey is safe for diabetics and stabilizes blood sugar.
- It has antioxidants and minerals.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;evidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Honey is a source of sugars (fructose and glucose) and can raise blood glucose.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;flag_hallucinations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What to improve in real systems
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;replace lexical overlap with an entailment model / verifier LLM,&lt;/li&gt;
&lt;li&gt;store evidence chunks with provenance,&lt;/li&gt;
&lt;li&gt;require citations at claim level,&lt;/li&gt;
&lt;li&gt;fail closed in high-risk contexts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But even this toy harness teaches a powerful lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Most hallucinations aren’t hard to spot once you force outputs into claims.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  8) The Grown-Up Way to Use LLMs
&lt;/h1&gt;

&lt;p&gt;Stop asking: “Is the model truthful?”&lt;/p&gt;

&lt;p&gt;Start asking: “What’s my verification strategy?”&lt;/p&gt;

&lt;p&gt;Because hallucinations are not a bug you patch once.&lt;br&gt;
They’re a property you manage forever.&lt;/p&gt;

&lt;p&gt;Treat an LLM like a brilliant intern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fast,&lt;/li&gt;
&lt;li&gt;creative,&lt;/li&gt;
&lt;li&gt;productive…&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…and absolutely capable of inventing a meeting that never happened.&lt;/p&gt;

&lt;p&gt;Your job isn’t to fear it.&lt;/p&gt;

&lt;p&gt;Your job is to build the seatbelts.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>From LLM to Agent: How Memory + Planning Turn a Chatbot Into a Doer</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Fri, 27 Feb 2026 01:59:39 +0000</pubDate>
      <link>https://forem.com/superorange0707/from-llm-to-agent-how-memory-planning-turn-a-chatbot-into-a-doer-35ck</link>
      <guid>https://forem.com/superorange0707/from-llm-to-agent-how-memory-planning-turn-a-chatbot-into-a-doer-35ck</guid>
      <description>&lt;h1&gt;
  
  
  The Day Your LLM Stops Talking and Starts Doing
&lt;/h1&gt;

&lt;p&gt;There’s a moment in every LLM project where you realize the “chat” part is the easy bit.&lt;/p&gt;

&lt;p&gt;The hard part is everything that happens &lt;strong&gt;between&lt;/strong&gt; the user request and the final output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gathering missing facts,&lt;/li&gt;
&lt;li&gt;choosing &lt;em&gt;which&lt;/em&gt; tools to call (and in what order),&lt;/li&gt;
&lt;li&gt;handling failures,&lt;/li&gt;
&lt;li&gt;remembering prior decisions,&lt;/li&gt;
&lt;li&gt;and not spiraling into confident nonsense when the world refuses to match the model’s assumptions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s the moment you’re no longer building “an LLM app.”&lt;/p&gt;

&lt;p&gt;You’re building an &lt;strong&gt;agent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In software terms, an agent is &lt;em&gt;not&lt;/em&gt; a magical model upgrade. It’s a &lt;strong&gt;system design&lt;/strong&gt; pattern:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Agent = LLM + tools + a loop + state&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once you see it this way, “memory” and “planning” stop being buzzwords and become engineering decisions you can reason about, test, and improve.&lt;/p&gt;

&lt;p&gt;Let’s break down how it works.&lt;/p&gt;




&lt;h1&gt;
  
  
  1) What Is an LLM Agent, Actually?
&lt;/h1&gt;

&lt;p&gt;A classic LLM app looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;user_input -&amp;gt; prompt -&amp;gt; model -&amp;gt; answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An agent adds a control loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;user_input
  -&amp;gt; (state) -&amp;gt; model -&amp;gt; action
  -&amp;gt; tool/environment -&amp;gt; observation
  -&amp;gt; (state update) -&amp;gt; model -&amp;gt; action
  -&amp;gt; ... repeat ...
  -&amp;gt; final answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference is subtle but massive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A chatbot &lt;strong&gt;generates&lt;/strong&gt; text.&lt;/li&gt;
&lt;li&gt;An agent &lt;strong&gt;executes&lt;/strong&gt; a policy over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model is the policy engine; the loop is the runtime.&lt;/p&gt;

&lt;p&gt;This means agents are fundamentally about &lt;strong&gt;systems&lt;/strong&gt;: orchestration, state, observability, guardrails, and evaluation.&lt;/p&gt;
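&lt;p&gt;The loop in the diagram fits in a few lines. The policy and the tool below are stubs (a real system would call an LLM and real APIs); only the runtime shape is the point.&lt;/p&gt;

```python
def policy(state: dict):
    # Stub policy: a real system would call an LLM with `state` here.
    if "weather" not in state:
        return ("call_tool", "get_weather")
    return ("finish", f"It is {state['weather']} today.")

TOOLS = {"get_weather": lambda: "sunny"}  # stub environment

def run_agent(user_input: str, max_steps: int = 5) -> str:
    state = {"input": user_input}             # (state)
    for _ in range(max_steps):
        kind, payload = policy(state)         # model -> action
        if kind == "finish":
            return payload                    # final answer
        observation = TOOLS[payload]()        # tool/environment -> observation
        state[payload.removeprefix("get_")] = observation  # state update
    return "stopped: step budget exhausted"   # fail closed, don't loop forever
```

&lt;p&gt;Everything interesting about agents lives in that loop: what goes into &lt;code&gt;state&lt;/code&gt;, which tools exist, and when the loop is allowed to stop.&lt;/p&gt;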




&lt;h1&gt;
  
  
  2) Memory: The Two Buckets You Can’t Avoid
&lt;/h1&gt;

&lt;p&gt;Human-like “memory” in agents usually becomes two concrete buckets:&lt;/p&gt;

&lt;h2&gt;
  
  
  2.1 Short-Term Memory (Working Memory)
&lt;/h2&gt;

&lt;p&gt;Short-term memory is whatever you stuff into the model’s current context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the current conversation (or the relevant slice of it),&lt;/li&gt;
&lt;li&gt;tool results you just fetched,&lt;/li&gt;
&lt;li&gt;intermediate notes (“scratchpad”),&lt;/li&gt;
&lt;li&gt;temporary constraints (deadlines, budgets, requirements).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Engineering reality check:&lt;/strong&gt; short-term memory is limited by your context window &lt;em&gt;and&lt;/em&gt; by model behavior.&lt;/p&gt;

&lt;p&gt;Two classic failure modes show up in production:&lt;/p&gt;

&lt;p&gt;1) &lt;strong&gt;Context trimming:&lt;/strong&gt; you cut earlier messages to save tokens → the agent “forgets” key constraints.&lt;/p&gt;

&lt;p&gt;2) &lt;strong&gt;Recency bias:&lt;/strong&gt; even with long contexts, models over-weight what’s near the end → old-but-important details get ignored.&lt;/p&gt;

&lt;p&gt;If you’ve ever watched an agent re-ask for information it already has, you’ve seen both.&lt;/p&gt;
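&lt;p&gt;One mitigation for the trimming failure is to pin constraints so they survive any cut. A minimal sketch, using word count as a stand-in for a real tokenizer:&lt;/p&gt;

```python
def trim_context(messages: list, budget_tokens: int) -> list:
    """Trim to budget, but never drop messages marked pinned."""
    pinned = [m for m in messages if m.get("pinned")]
    rest = [m for m in messages if not m.get("pinned")]
    kept, used = [], sum(len(m["text"].split()) for m in pinned)
    for m in reversed(rest):                  # prefer the most recent messages
        cost = len(m["text"].split())         # crude token estimate
        if used + cost > budget_tokens:
            break
        kept.append(m)
        used += cost
    return pinned + list(reversed(kept))
```

&lt;p&gt;Pinning doesn't fix recency bias on its own, but it guarantees the deadline or budget the user stated in message one is still physically in the context at step forty.&lt;/p&gt;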

&lt;h2&gt;
  
  
  2.2 Long-Term Memory (Persistent Memory)
&lt;/h2&gt;

&lt;p&gt;Long-term memory is stored outside the model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;vector DB embeddings,&lt;/li&gt;
&lt;li&gt;document stores,&lt;/li&gt;
&lt;li&gt;user profiles/preferences (only if you can do it safely and legally),&lt;/li&gt;
&lt;li&gt;task history and decisions,&lt;/li&gt;
&lt;li&gt;structured records (tickets, orders, logs, CRM entries).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mainstream pattern is: &lt;strong&gt;retrieve → inject → reason&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If that sounds like RAG (Retrieval-Augmented Generation), that’s because it is. Agents just make RAG &lt;em&gt;operational&lt;/em&gt;: retrieval isn’t only for answering questions—it’s for &lt;strong&gt;deciding what to do next&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The part people miss: memory needs structure
&lt;/h3&gt;

&lt;p&gt;A pile of vector chunks is not “memory.” It’s a landfill.&lt;/p&gt;

&lt;p&gt;Practical long-term memory works best when you store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;semantic content&lt;/strong&gt; (embedding),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;metadata&lt;/strong&gt; (timestamp, source, permissions, owner, reliability score),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;a policy&lt;/strong&gt; for when to write/read (what gets saved, what gets ignored),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;decay or TTL&lt;/strong&gt; for things that stop mattering.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you don’t design write/read policies, you’ll build an agent that remembers the wrong things forever.&lt;/p&gt;
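&lt;p&gt;Those four ingredients can be sketched as a record type plus a read policy. The field names and thresholds here are illustrative, not a standard schema:&lt;/p&gt;

```python
from dataclasses import dataclass, field
import time

@dataclass
class MemoryRecord:
    content: str
    source: str                     # provenance
    owner: str                      # permissions boundary
    reliability: float = 0.5        # 0..1, set by your write policy
    created_at: float = field(default_factory=time.time)
    ttl_seconds: float = 30 * 86400  # decay: stale entries stop mattering

    def is_live(self, now: float = None) -> bool:
        now = time.time() if now is None else now
        return now - self.created_at < self.ttl_seconds

def readable(records: list, min_reliability: float = 0.3) -> list:
    """Read policy: only live, sufficiently reliable records get injected."""
    return [r for r in records if r.is_live() and r.reliability >= min_reliability]
```

&lt;p&gt;The write policy is the mirror image: decide at save time whether something is worth remembering at all, and stamp it with source and reliability while you still know them.&lt;/p&gt;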




&lt;h1&gt;
  
  
  3) Planning: From Decomposition to Search
&lt;/h1&gt;

&lt;p&gt;Planning sounds philosophical, but it maps to one question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How does the agent choose the next action?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In real tasks, “next action” is rarely obvious. That’s why we plan: to reduce a big problem into smaller moves with checkpoints.&lt;/p&gt;

&lt;h2&gt;
  
  
  3.1 Task Decomposition: Why It’s Not Optional
&lt;/h2&gt;

&lt;p&gt;When you ask an agent to “plan,” you’re buying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;controllability:&lt;/strong&gt; you can inspect steps and constraints,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;debuggability:&lt;/strong&gt; you can see where it went wrong,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tool alignment:&lt;/strong&gt; each step can map to a tool call,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;lower hallucination risk:&lt;/strong&gt; fewer leaps, more verification.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But planning can be cheap or expensive depending on the technique.&lt;/p&gt;




&lt;h2&gt;
  
  
  3.2 CoT: Linear Reasoning as a Control Interface
&lt;/h2&gt;

&lt;p&gt;Chain-of-Thought style prompting nudges the model to produce intermediate reasoning before the final output.&lt;/p&gt;

&lt;p&gt;From an engineering perspective, the key benefit is &lt;em&gt;not&lt;/em&gt; “the model becomes smarter.”&lt;br&gt;
It’s that the model becomes &lt;strong&gt;more steerable&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it externalizes intermediate state,&lt;/li&gt;
&lt;li&gt;it decomposes implicitly into substeps,&lt;/li&gt;
&lt;li&gt;and you can gate or validate those steps.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  CoT ≠ show-the-user-everything
&lt;/h3&gt;

&lt;p&gt;In production, you often want the opposite: use structured reasoning internally, then output a crisp answer.&lt;/p&gt;

&lt;p&gt;This is both a UX decision (nobody wants a wall of text) and a safety decision (you don’t want to leak internal deliberations, secrets, or tool inputs).&lt;/p&gt;
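One common way to implement "reason internally, answer crisply" is to ask the model for structured output and only surface the answer field. A minimal sketch, assuming a model that returns JSON; `fake_llm_json` is a stand-in for that call.

```python
import json

def fake_llm_json(prompt):
    # Stand-in for a model call that returns structured output:
    # internal reasoning plus a user-facing answer.
    return json.dumps({
        "reasoning": "42 items at 2 each means 42 * 2 = 84.",
        "answer": "Total cost: 84.",
    })

def answer_user(prompt):
    raw = json.loads(fake_llm_json(prompt))
    # Keep reasoning for internal logs/debugging; never show it to the user.
    internal_trace = raw["reasoning"]
    return raw["answer"], internal_trace

visible, trace = answer_user("What do 42 items at 2 each cost?")
```

The trace goes to your logging pipeline; the user sees one line.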


&lt;h2&gt;
  
  
  3.3 ToT: When Reasoning Becomes Search
&lt;/h2&gt;

&lt;p&gt;Linear reasoning fails when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;there are multiple plausible paths,&lt;/li&gt;
&lt;li&gt;early choices are hard to reverse,&lt;/li&gt;
&lt;li&gt;you need lookahead (trade-offs, planning, puzzles, strategy).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tree-of-Thought style reasoning turns “thinking” into &lt;strong&gt;search&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;expand:&lt;/strong&gt; propose multiple candidate thoughts/steps,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;evaluate:&lt;/strong&gt; score candidates (by heuristics, constraints, or another model call),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;select:&lt;/strong&gt; continue exploring the best branches,&lt;/li&gt;
&lt;li&gt;optionally &lt;strong&gt;backtrack&lt;/strong&gt; if a branch collapses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If CoT is “one good route,” ToT is “try a few routes, keep the ones that look promising.”&lt;/p&gt;
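The expand/evaluate/select loop is essentially beam search. Here is a deliberately tiny sketch on a numeric toy problem (reach a target sum in a fixed number of steps); in a real agent, `expand` and `score` would be LLM calls or constraint checkers, not arithmetic.

```python
def expand(path):
    # Real systems: an LLM call proposing candidate next thoughts.
    return [path + [step] for step in (1, 2, 3)]

def score(path, target):
    # Real systems: heuristics, constraint checks, or a critic model.
    return -abs(target - sum(path))  # closer to target = better

def tree_of_thought(target, depth, beam_width=2):
    frontier = [[]]
    for _ in range(depth):
        candidates = [p for path in frontier for p in expand(path)]    # expand
        candidates.sort(key=lambda p: score(p, target), reverse=True)  # evaluate
        frontier = candidates[:beam_width]                             # select
    return frontier[0]

best = tree_of_thought(target=7, depth=3)
```

`beam_width` is the cost dial: widen it and quality may improve, but token burn grows with every branch you keep alive.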
&lt;h3&gt;
  
  
  The cost: token burn
&lt;/h3&gt;

&lt;p&gt;Search is expensive. If you expand branches without discipline, cost grows fast.&lt;/p&gt;

&lt;p&gt;So ToT tends to shine in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;high-value tasks,&lt;/li&gt;
&lt;li&gt;problems with clear evaluation signals,&lt;/li&gt;
&lt;li&gt;situations where being wrong is more expensive than being slow.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  3.4 GoT: The Engineering Upgrade (Reuse, Merge, Backtrack)
&lt;/h2&gt;

&lt;p&gt;Tree search wastes work when branches overlap.&lt;/p&gt;

&lt;p&gt;Graph-of-Thoughts takes a practical step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;treat intermediate reasoning as &lt;strong&gt;states&lt;/strong&gt; in a directed graph,&lt;/li&gt;
&lt;li&gt;allow &lt;strong&gt;merging&lt;/strong&gt; equivalent states (reuse),&lt;/li&gt;
&lt;li&gt;support &lt;strong&gt;backtracking&lt;/strong&gt; to arbitrary nodes,&lt;/li&gt;
&lt;li&gt;apply pruning more aggressively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If ToT is a tree, GoT is a graph with memory: you don’t re-derive what you already know.&lt;/p&gt;

&lt;p&gt;This matters in production where repeated tool calls and repeated reasoning are the real cost drivers.&lt;/p&gt;
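The "don't re-derive what you already know" idea reduces to a canonical state key plus a memo table: two branches that reach the same state merge into one graph node. A sketch, with `expand` standing in for an expensive reasoning or tool step.

```python
def canonical(state):
    # Two thought-paths that reach the same set of facts are the same node.
    return frozenset(state)

def explore(state, expand, memo):
    key = canonical(state)
    if key in memo:            # merge: this node was already expanded, reuse it
        return memo[key]
    result = expand(state)
    memo[key] = result
    return result

memo = {}
calls = []

def expand(state):
    # Stand-in for an expensive reasoning/tool step; we count invocations.
    calls.append(tuple(sorted(state)))
    return sorted(state) + ["derived"]

# Two different branches arrive at the same intermediate state:
a = explore(["fact1", "fact2"], expand, memo)
b = explore(["fact2", "fact1"], expand, memo)  # merged: no second expansion
```

The second call never touches `expand`, which is exactly the cost saving GoT is after.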


&lt;h2&gt;
  
  
  3.5 XoT: “Everything of Thoughts” as Research Direction
&lt;/h2&gt;

&lt;p&gt;XoT-style approaches try to unify thought paradigms and inject external knowledge and search methods (think: MCTS-style exploration + domain guidance).&lt;/p&gt;

&lt;p&gt;It’s promising, but the engineering bar is high:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you need reliable evaluation,&lt;/li&gt;
&lt;li&gt;tight budgets,&lt;/li&gt;
&lt;li&gt;and a clear “why” for using something heavier than a well-designed plan loop.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, many teams implement &lt;strong&gt;a lightweight ToT/GoT hybrid&lt;/strong&gt; without the full research stack.&lt;/p&gt;


&lt;h1&gt;
  
  
  4) ReAct: The Loop That Makes Agents Feel Real
&lt;/h1&gt;

&lt;p&gt;Planning is what the agent &lt;em&gt;intends&lt;/em&gt; to do.&lt;/p&gt;

&lt;p&gt;ReAct is what the agent &lt;em&gt;actually&lt;/em&gt; does:&lt;/p&gt;

&lt;p&gt;1) &lt;strong&gt;Reason&lt;/strong&gt; about what’s missing / what to do next&lt;br&gt;&lt;br&gt;
2) &lt;strong&gt;Act&lt;/strong&gt; by calling a tool&lt;br&gt;&lt;br&gt;
3) &lt;strong&gt;Observe&lt;/strong&gt; the result&lt;br&gt;&lt;br&gt;
4) &lt;strong&gt;Reflect&lt;/strong&gt; and adjust  &lt;/p&gt;

&lt;p&gt;Repeat until done.&lt;/p&gt;

&lt;p&gt;This solves three real problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;incomplete information:&lt;/strong&gt; the agent can fetch what it doesn’t know,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;verification:&lt;/strong&gt; it can check assumptions against reality,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;error recovery:&lt;/strong&gt; it can reroute after failures.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’ve ever debugged a hallucination, you already know why this matters:&lt;br&gt;
a believable explanation isn’t the same thing as a correct answer.&lt;/p&gt;


&lt;h1&gt;
  
  
  5) A Minimal Agent With Memory + Planning (Practical Version)
&lt;/h1&gt;

&lt;p&gt;Below is a deliberately “boring” agent loop. That’s the point.&lt;/p&gt;

&lt;p&gt;Most production agents are not sci-fi. They’re well-instrumented control loops with strict budgets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="c1"&gt;# --- Tools (stubs) ---------------------------------------------------------
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Replace with your search API call + caching.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[search-results for: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Replace with a safe evaluator.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__builtins__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{}},&lt;/span&gt; &lt;span class="p"&gt;{}))&lt;/span&gt;

&lt;span class="c1"&gt;# --- Memory ----------------------------------------------------------------
&lt;/span&gt;
&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MemoryItem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MemoryStore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;short_term&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MemoryItem&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;long_term&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MemoryItem&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# stand-in for vector DB
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remember_short&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;short_term&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MemoryItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remember_long&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;long_term&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MemoryItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_long&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MemoryItem&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# Dummy retrieval: filter by substring.
&lt;/span&gt;        &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;long_term&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# --- Planner (very small ToT-ish idea) ------------------------------------
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;propose_plans&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="c1"&gt;# In reality: this is an LLM call producing multiple plan candidates.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search key facts about: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Break task into steps, then execute step-by-step: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ask a clarifying question if constraints are missing: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score_plan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Heuristic scoring: prefer plans that verify facts.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Break task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="c1"&gt;# --- Agent Loop ------------------------------------------------------------
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MemoryStore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# 1) Retrieve long-term memory if relevant.
&lt;/span&gt;    &lt;span class="n"&gt;recalled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve_long&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;recalled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remember_short&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Recalled: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;long_term&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2) Plan (cheap multi-candidate selection).
&lt;/span&gt;    &lt;span class="n"&gt;plans&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;propose_plans&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plans&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;score_plan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remember_short&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chosen plan: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3) Execute loop.
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# In reality: this is an LLM call that decides "next tool" based on state.
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;obs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remember_short&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Observation: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="c1"&gt;# Example: do a small computation if the task contains a calc hint.
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calculate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;obs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;6 * 7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remember_short&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Observation: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="c1"&gt;# Stop condition (simplified).
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="c1"&gt;# 4) Final answer: summarise short-term state.
&lt;/span&gt;    &lt;span class="n"&gt;notes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;short_term&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;:]])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;What I did:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Final: (produce a user-facing answer here.)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Demo usage:
&lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemoryStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remember_long&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User prefers concise outputs with clear bullets.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;preference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a short guide on LLM agents with memory and planning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What this toy example demonstrates (and why it matters)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory is state, not vibe.&lt;/strong&gt; It’s read/write with policy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Planning can be multi-candidate without going full ToT.&lt;/strong&gt; Generate a few, pick one, move on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool calls are first-class.&lt;/strong&gt; Observations update state, not just the transcript.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budgets exist.&lt;/strong&gt; &lt;code&gt;max_steps&lt;/code&gt; is a real safety and cost control.&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  6) Production Notes: Where Agents Actually Fail
&lt;/h1&gt;

&lt;p&gt;If you want this to work outside demos, you’ll spend most of your time on these five areas.&lt;/p&gt;

&lt;h2&gt;
  
  
  6.1 Tool reliability beats prompt cleverness
&lt;/h2&gt;

&lt;p&gt;Tools fail. Time out. Rate limit. Return weird formats.&lt;/p&gt;

&lt;p&gt;Your agent loop needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retries with backoff,&lt;/li&gt;
&lt;li&gt;strict schemas,&lt;/li&gt;
&lt;li&gt;parsing + validation,&lt;/li&gt;
&lt;li&gt;and fallback strategies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A “smart” agent without robust I/O is just a creative writer with API keys.&lt;/p&gt;
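A sketch of the robust-I/O wrapper described above: retries with exponential backoff plus output validation. `flaky_tool` is a hypothetical stand-in that fails twice before succeeding, to exercise the retry path.

```python
import time

def call_with_retries(tool, arg, attempts=3, base_delay=0.01, validate=None):
    last_err = None
    for attempt in range(attempts):
        try:
            result = tool(arg)
            # Strict validation: a response that parses but fails the schema
            # check counts as a failure, not a success.
            if validate is not None and not validate(result):
                raise ValueError(f"invalid tool output: {result!r}")
            return result
        except Exception as err:
            last_err = err
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"tool failed after {attempts} attempts") from last_err

failures = {"count": 0}

def flaky_tool(query):
    # Stand-in for a real API: times out twice, then succeeds.
    if failures["count"] < 2:
        failures["count"] += 1
        raise TimeoutError("upstream timeout")
    return {"status": "ok", "data": f"results for {query}"}

out = call_with_retries(flaky_tool, "llm agents",
                        validate=lambda r: r.get("status") == "ok")
```

A production version would also distinguish retryable errors (timeouts, 429s) from permanent ones and route the latter to a fallback strategy instead of retrying.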

&lt;h2&gt;
  
  
  6.2 Memory needs permissions and hygiene
&lt;/h2&gt;

&lt;p&gt;If you store user data, you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clear consent and retention rules,&lt;/li&gt;
&lt;li&gt;permission checks at retrieval time,&lt;/li&gt;
&lt;li&gt;deletion pathways,&lt;/li&gt;
&lt;li&gt;and safe defaults.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In regulated environments, long-term memory is often the highest-risk component.&lt;/p&gt;
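"Permission checks at retrieval time" and "deletion pathways" can be sketched concretely: every record carries an owner and an access list, reads filter by requester, and deletion is a first-class operation. The record shapes here are illustrative, not a schema recommendation.

```python
records = [
    {"id": 1, "text": "Order #881 shipped", "owner": "alice",
     "acl": {"alice", "support"}},
    {"id": 2, "text": "Alice's saved card", "owner": "alice",
     "acl": {"alice"}},
]

def retrieve(store, query, requester):
    # Permissions are enforced at read time, not only at write time:
    # a record the requester can't see never reaches the prompt.
    allowed = [r for r in store if requester in r["acl"]]
    return [r for r in allowed if query.lower() in r["text"].lower()]

def delete_user_data(store, owner):
    # Deletion pathway: required for retention rules and user requests.
    return [r for r in store if r["owner"] != owner]

support_view = retrieve(records, "card", requester="support")  # ACL blocks this
alice_view = retrieve(records, "card", requester="alice")
records = delete_user_data(records, owner="alice")
```

The key property: a support agent querying "card" gets nothing, even though the record exists.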

&lt;h2&gt;
  
  
  6.3 Planning needs evaluation signals
&lt;/h2&gt;

&lt;p&gt;Search-based planning is only as good as its scoring.&lt;/p&gt;

&lt;p&gt;You’ll likely need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;constraint checkers,&lt;/li&gt;
&lt;li&gt;unit tests for tool outputs,&lt;/li&gt;
&lt;li&gt;or a separate “critic” model call that can reject bad steps.&lt;/li&gt;
&lt;/ul&gt;
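&lt;p&gt;One minimal shape for that scoring layer (the checker functions are placeholders for your real validators, unit tests, or a critic model call):&lt;/p&gt;

```python
def score_step(step, checkers):
    """Score a candidate plan step against constraint checkers.

    Each checker returns (passed, reason). A step that fails a checker
    whose reason starts with "hard:" is rejected outright; otherwise the
    step is scored by its pass rate. Checkers here are illustrative.
    """
    passed = 0
    for check in checkers:
        ok, reason = check(step)
        if not ok and reason.startswith("hard:"):
            return None               # reject: violates a hard constraint
        if ok:
            passed += 1
    return passed / max(len(checkers), 1)

def pick_best_step(candidates, checkers):
    """Keep the viable candidates, take the best; None means replan."""
    scored = [(score_step(c, checkers), c) for c in candidates]
    viable = [(s, c) for s, c in scored if s is not None]
    if not viable:
        return None
    return max(viable, key=lambda sc: sc[0])[1]
```

&lt;p&gt;Swap the pass-rate score for a critic model’s rating and the search structure stays identical.&lt;/p&gt;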

&lt;h2&gt;
  
  
  6.4 Observability is not optional
&lt;/h2&gt;

&lt;p&gt;If you can’t trace:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which tool was called,&lt;/li&gt;
&lt;li&gt;with what inputs,&lt;/li&gt;
&lt;li&gt;what it returned,&lt;/li&gt;
&lt;li&gt;and how it changed the plan,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;you can’t debug. You also can’t measure improvements.&lt;/p&gt;

&lt;p&gt;Log everything. Then decide what to retain.&lt;/p&gt;
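&lt;p&gt;You don’t need a full observability stack to start; a structured in-memory trace like this sketch (field names are illustrative) already answers all four questions:&lt;/p&gt;

```python
import json
import time

def log_tool_call(trace, tool_name, inputs, output, plan_before, plan_after):
    """Append one structured tool-call record to an in-memory trace.

    A minimal sketch: real systems would ship these records to a log store,
    but these are the fields that matter when debugging an agent loop.
    """
    trace.append({
        "ts": time.time(),
        "tool": tool_name,
        "inputs": inputs,            # what the tool was called with
        "output": output,            # what it returned
        "plan_before": plan_before,  # the plan going in
        "plan_after": plan_after,    # how the observation changed it
    })

def dump_trace(trace):
    """Serialise the trace as JSON lines so a run can be replayed and diffed."""
    return "\n".join(json.dumps(rec, sort_keys=True) for rec in trace)
```

&lt;p&gt;JSON-lines dumps are diffable, which turns “did my change help?” into a file comparison rather than a vibe check.&lt;/p&gt;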

&lt;h2&gt;
  
  
  6.5 Security: agents amplify blast radius
&lt;/h2&gt;

&lt;p&gt;When a model can take actions, mistakes become incidents.&lt;/p&gt;

&lt;p&gt;Guardrails look like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;allowlists (tools, domains, actions),&lt;/li&gt;
&lt;li&gt;spend limits,&lt;/li&gt;
&lt;li&gt;step limits,&lt;/li&gt;
&lt;li&gt;sandboxing,&lt;/li&gt;
&lt;li&gt;and human-in-the-loop gates for high-impact actions.&lt;/li&gt;
&lt;/ul&gt;
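&lt;p&gt;A compact sketch of those guardrails as a single gate in front of every action (tool names, costs, and thresholds are made up for illustration):&lt;/p&gt;

```python
class GuardrailError(Exception):
    pass

def guarded_call(tool_name, args, state, allowlist, max_steps=10, max_spend=5.0):
    """Gate an agent action behind allowlist, step, and spend limits.

    `state` tracks steps taken and money spent; `allowlist` maps permitted
    tool names to their estimated cost. All names here are illustrative.
    """
    if tool_name not in allowlist:
        raise GuardrailError(f"tool not allowlisted: {tool_name}")
    if state["steps"] >= max_steps:
        raise GuardrailError("step budget exhausted")
    cost = allowlist[tool_name]
    if state["spend"] + cost > max_spend:
        raise GuardrailError("spend limit would be exceeded")
    if cost >= 1.0:
        # high-impact action: require an explicit human-in-the-loop approval
        if not state.get("human_approved"):
            raise GuardrailError(f"human approval required for {tool_name}")
    state["steps"] += 1
    state["spend"] += cost
    return {"tool": tool_name, "args": args, "status": "allowed"}
```

&lt;p&gt;Note the ordering: state only mutates after every check passes, so a blocked call leaves no residue.&lt;/p&gt;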




&lt;h1&gt;
  
  
  7) The Real “Agent Upgrade”: A Better Mental Model
&lt;/h1&gt;

&lt;p&gt;If you remember one thing, make it this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An agent is an LLM inside a state machine.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory = state&lt;/li&gt;
&lt;li&gt;Planning = policy shaping&lt;/li&gt;
&lt;li&gt;Tools = actuators&lt;/li&gt;
&lt;li&gt;Observations = state transitions&lt;/li&gt;
&lt;li&gt;Reflection = error-correcting feedback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you build agents this way, you stop chasing “the perfect prompt” and start shipping systems that can survive reality.&lt;/p&gt;

&lt;p&gt;And reality is the only benchmark that matters.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
    </item>
    <item>
      <title>Refactoring Agent Skills: From Context Explosion to a Fast, Reliable Workflow</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Sun, 15 Feb 2026 22:13:15 +0000</pubDate>
      <link>https://forem.com/superorange0707/refactoring-agent-skills-from-context-explosion-to-a-fast-reliable-workflow-5hg6</link>
      <guid>https://forem.com/superorange0707/refactoring-agent-skills-from-context-explosion-to-a-fast-reliable-workflow-5hg6</guid>
      <description>&lt;h1&gt;
  
  
  Refactoring Agent Skills: The Day My Context Window Died
&lt;/h1&gt;

&lt;p&gt;There’s a specific kind of pain you only experience once:&lt;/p&gt;

&lt;p&gt;You’re in Claude Code, you trigger a couple of “helpful” Skills, and suddenly the model is chewing through &lt;strong&gt;thousands of lines&lt;/strong&gt; of markdown + snippets it didn’t ask for.&lt;/p&gt;

&lt;p&gt;Your “AI co-pilot” stops feeling like a co-pilot and starts feeling like a browser tab you can’t close.&lt;/p&gt;

&lt;p&gt;This piece is a practical rewrite (and upgrade) of a popular Claude Code community refactor story: a developer thought “more info = better,” built Skills like mini-wikis, and accidentally created a &lt;strong&gt;context explosion&lt;/strong&gt;. The fix wasn’t a clever prompt. It was an architectural refactor. The result: dramatically leaner initial context and much better token efficiency.&lt;/p&gt;

&lt;p&gt;Let’s steal the playbook.&lt;/p&gt;




&lt;h1&gt;
  
  
  1) The Root Cause: Treating Skills Like Docs
&lt;/h1&gt;

&lt;p&gt;The first trap is incredibly human:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“If I include everything, the model will always have what it needs.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So you create one Skill per tool, and each Skill becomes a documentation dump:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;setup steps&lt;/li&gt;
&lt;li&gt;API references&lt;/li&gt;
&lt;li&gt;exhaustive examples&lt;/li&gt;
&lt;li&gt;“don’t do X” lists&lt;/li&gt;
&lt;li&gt;every edge case since 2017&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then a task like “deploy a serverless function with a small UI” pulls in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;your Cloudflare skill,&lt;/li&gt;
&lt;li&gt;your Docker skill,&lt;/li&gt;
&lt;li&gt;your UI styling skill,&lt;/li&gt;
&lt;li&gt;your web framework skill…&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…and the model starts its job already half-drowned.&lt;/p&gt;

&lt;p&gt;Claude Code’s own docs warn that Skills share the context window with the conversation, the request, and other Skills — which means uncontrolled loading is a direct performance tax. (You feel it as slowness, drift, and “why is it ignoring the obvious part?”)&lt;/p&gt;

&lt;p&gt;So: &lt;strong&gt;your problem isn’t “lack of info.” It’s “too much irrelevant info.”&lt;/strong&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  2) The Fix: Progressive Disclosure (Three Layers)
&lt;/h1&gt;

&lt;p&gt;Claude Code docs explicitly recommend &lt;strong&gt;progressive disclosure&lt;/strong&gt;: keep essential info in &lt;code&gt;SKILL.md&lt;/code&gt;, and store the heavy stuff in separate files that get loaded only when the task requires them.&lt;/p&gt;

&lt;p&gt;This maps cleanly to a three-layer system:&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1 — Metadata (always loaded)
&lt;/h2&gt;

&lt;p&gt;A short YAML frontmatter: name + description + the “routing signal.”&lt;/p&gt;

&lt;p&gt;Think of it like a book cover and blurb. You’re not teaching. You’re helping the model decide whether to open the book.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 2 — Entry point: &lt;code&gt;SKILL.md&lt;/code&gt; (loaded on activation)
&lt;/h2&gt;

&lt;p&gt;Your navigation map:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what the Skill is for&lt;/li&gt;
&lt;li&gt;when to use it&lt;/li&gt;
&lt;li&gt;what steps to follow&lt;/li&gt;
&lt;li&gt;what files to open next&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not a tutorial. Not a wiki.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 3 — References &amp;amp; scripts (loaded &lt;em&gt;only when needed&lt;/em&gt;)
&lt;/h2&gt;

&lt;p&gt;Small, focused files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one topic per file&lt;/li&gt;
&lt;li&gt;200–300 lines per file is a good target&lt;/li&gt;
&lt;li&gt;scripts do deterministic work so the model doesn’t burn tokens “describing” actions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s what that looks like in a real folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.claude/skills/devops/
├── SKILL.md
├── references/
│   ├── serverless-cloudflare.md
│   ├── containers-docker.md
│   └── ci-cd-basics.md
└── scripts/
    ├── validate_env.py
    └── deploy_helper.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  3) The “200-Line Rule”: Brutal, Slightly Arbitrary, Weirdly Effective
&lt;/h1&gt;

&lt;p&gt;In the community refactor story, the author landed on a hard constraint:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Keep &lt;code&gt;SKILL.md&lt;/code&gt; under ~200 lines.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
If you can’t, you’re putting too much in the entry point.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Claude’s own best practices docs recommend keeping the body under a few hundred lines (and splitting content as you approach that limit). But “200 lines” is a sharper knife: it forces you to write a &lt;strong&gt;table of contents&lt;/strong&gt;, not a textbook.&lt;/p&gt;

&lt;p&gt;Why it works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model can scan the entry quickly&lt;/li&gt;
&lt;li&gt;It can decide what reference file to load next&lt;/li&gt;
&lt;li&gt;Total “initial load” stays small enough that the conversation still has room to breathe&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A quick test you can steal
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Start a fresh session (cold start)&lt;/li&gt;
&lt;li&gt;Trigger your Skill&lt;/li&gt;
&lt;li&gt;If your first activation loads &lt;strong&gt;more than ~500 lines&lt;/strong&gt; of content, your design is likely leaking scope&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  4) The Real Mental Shift: From Tool-Centric to Workflow-Centric
&lt;/h1&gt;

&lt;p&gt;This is the part most people miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool-centric Skills&lt;/strong&gt; look like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;cloudflare-skill&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tailwind-skill&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;postgres-skill&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubernetes-skill&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They’re encyclopedias. They don’t compose well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workflow-centric Skills&lt;/strong&gt; look like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;devops&lt;/code&gt; (deploy + environments + CI/CD)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ui-styling&lt;/code&gt; (design rules + component patterns)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;web-frameworks&lt;/code&gt; (routing + project structure + SSR pitfalls)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;databases&lt;/code&gt; (schema design + migrations + query patterns)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They map to what you &lt;em&gt;actually do&lt;/em&gt; during development.&lt;/p&gt;

&lt;p&gt;A workflow Skill answers:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“When I’m in this stage of work, what does the agent need to know to act correctly?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What is everything this tool can do?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That one reframing prevents context blowups almost by itself.&lt;/p&gt;




&lt;h1&gt;
  
  
  5) A Minimal, Production-Grade &lt;code&gt;SKILL.md&lt;/code&gt; (Example)
&lt;/h1&gt;

&lt;p&gt;Here’s a deliberately small entry point you can copy and customise.&lt;br&gt;&lt;br&gt;
Notice what’s missing: long examples, full docs, and “everything you might ever need.”&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ui-styling&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Apply consistent UI styling across the app (Tailwind + component conventions). Use when building or refactoring UI.&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gh"&gt;# UI Styling Skill&lt;/span&gt;

&lt;span class="gu"&gt;## When to use&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; You are building UI components or pages
&lt;span class="p"&gt;-&lt;/span&gt; You need consistent spacing, typography, and responsive behaviour
&lt;span class="p"&gt;-&lt;/span&gt; You need to align with existing design conventions

&lt;span class="gu"&gt;## Workflow&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Identify the UI surface (page/component) and constraints (responsive, dark mode, accessibility)
&lt;span class="p"&gt;2.&lt;/span&gt; Apply styling rules from the references (pick only what you need)
&lt;span class="p"&gt;3.&lt;/span&gt; Validate output against the checklist

&lt;span class="gu"&gt;## References (load only if needed)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`references/design-tokens.md`&lt;/span&gt; — spacing, font scale, colour usage
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`references/tailwind-patterns.md`&lt;/span&gt; — layouts, common utility combos
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`references/accessibility-checklist.md`&lt;/span&gt; — keyboard, focus, contrast

&lt;span class="gu"&gt;## Output contract&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Use UK English in UI strings
&lt;span class="p"&gt;-&lt;/span&gt; Prefer reusable components over copy-paste blocks
&lt;span class="p"&gt;-&lt;/span&gt; Keep className readable (extract when it gets messy)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it.&lt;/p&gt;

&lt;p&gt;The Skill’s job is to &lt;strong&gt;route&lt;/strong&gt; the agent to the right file at the right moment — not to become an on-page encyclopedia.&lt;/p&gt;




&lt;h1&gt;
  
  
  6) Measuring Improvements (Without Lying to Yourself)
&lt;/h1&gt;

&lt;p&gt;If you want repeatable results, track metrics that actually matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initial lines loaded&lt;/strong&gt; on activation
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time to activation&lt;/strong&gt; (roughly: how “snappy” it feels)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relevance ratio&lt;/strong&gt; (how much of the loaded content is used)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context overflow frequency&lt;/strong&gt; (how often long tasks crash)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don’t need a full observability stack. A simple repo audit script helps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tiny Python audit: count lines per Skill
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="n"&gt;skills_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.claude/skills&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_lines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ignore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;skill&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;skills_dir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterdir&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
    &lt;span class="n"&gt;skill_md&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;skill&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SKILL.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;skill_md&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;count_lines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;skill_md&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REFACTOR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;skill&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; lines  -&amp;gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you run this weekly, you’ll catch “documentation creep” before it becomes a crisis.&lt;/p&gt;




&lt;h1&gt;
  
  
  7) Common Failure Modes (And How to Avoid Them)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Failure mode: Claude writes “a doc” instead of “a Skill”
&lt;/h2&gt;

&lt;p&gt;LLMs love expanding markdown into tutorials.&lt;/p&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;explicitly tell it: &lt;strong&gt;this is not documentation&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;remove “beginner” filler&lt;/li&gt;
&lt;li&gt;keep examples short, push detail into references&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Failure mode: Entry point bloats because the Skill scope is too wide
&lt;/h2&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;split the Skill by workflow stage&lt;/li&gt;
&lt;li&gt;or move decision trees into references&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Failure mode: Too many references, still hard to navigate
&lt;/h2&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;put a short “map” section in &lt;code&gt;SKILL.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;keep reference files single-topic and named by intent, not by tool&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  8) A Copyable Refactor Checklist
&lt;/h1&gt;

&lt;p&gt;1) &lt;strong&gt;Audit&lt;/strong&gt;: list Skills + line counts, find any &lt;code&gt;SKILL.md&lt;/code&gt; &amp;gt; 200 lines&lt;br&gt;&lt;br&gt;
2) &lt;strong&gt;Group by workflow&lt;/strong&gt;: merge tool-specific Skills into capability Skills&lt;br&gt;&lt;br&gt;
3) &lt;strong&gt;Create references&lt;/strong&gt;: move detailed info out of &lt;code&gt;SKILL.md&lt;/code&gt;&lt;br&gt;&lt;br&gt;
4) &lt;strong&gt;Enforce entry constraints&lt;/strong&gt;: keep &lt;code&gt;SKILL.md&lt;/code&gt; lean and navigational&lt;br&gt;&lt;br&gt;
5) &lt;strong&gt;Cold start test&lt;/strong&gt;: ensure first activation stays under your chosen budget&lt;br&gt;&lt;br&gt;
6) &lt;strong&gt;Keep scripts deterministic&lt;/strong&gt;: offload “do the thing” to code where possible&lt;br&gt;&lt;br&gt;
7) &lt;strong&gt;Re-check monthly&lt;/strong&gt;: Skills drift over time; treat them like code&lt;/p&gt;




&lt;h1&gt;
  
  
  Final Take: Context Engineering Is “Right Info, Right Time”
&lt;/h1&gt;

&lt;p&gt;The big lesson isn’t “200 lines” or “three layers.”&lt;/p&gt;

&lt;p&gt;It’s this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context is a budget.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
And the best Skill design spends it like an engineer, not like a librarian.&lt;/p&gt;

&lt;p&gt;Don’t load everything. Load what matters — when it matters — and keep the rest one file away.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>mcp</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>Stop Fine-Tuning Blindly: When to Fine-Tune—and When Not to Touch Model Weights</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Sun, 15 Feb 2026 22:12:59 +0000</pubDate>
      <link>https://forem.com/superorange0707/stop-fine-tuning-blindly-when-to-fine-tune-and-when-not-to-touch-model-weights-21j</link>
      <guid>https://forem.com/superorange0707/stop-fine-tuning-blindly-when-to-fine-tune-and-when-not-to-touch-model-weights-21j</guid>
      <description>&lt;h1&gt;
  
  
  Fine-Tuning Is a Knife, Not a Hammer
&lt;/h1&gt;

&lt;p&gt;Fine-tuning has a reputation problem.&lt;/p&gt;

&lt;p&gt;Some people treat it like magic: “Just fine-tune and the model will &lt;em&gt;understand our domain&lt;/em&gt;.”&lt;br&gt;&lt;br&gt;
Others treat it like a sin: “Never touch weights, it’s all prompt engineering now.”&lt;/p&gt;

&lt;p&gt;Both are wrong.&lt;/p&gt;

&lt;p&gt;Fine-tuning is a &lt;strong&gt;precision tool&lt;/strong&gt;. Used well, it turns a generic model into a specialist. Used badly, it burns GPU budgets, bakes in bias, and ships a model that performs &lt;em&gt;worse&lt;/em&gt; than the base.&lt;/p&gt;

&lt;p&gt;This is a field guide: what types of fine-tuning exist, what they cost, how to run them, and the traps that quietly ruin outcomes.&lt;/p&gt;


&lt;h1&gt;
  
  
  1) The Real Taxonomy of Fine-Tuning
&lt;/h1&gt;

&lt;p&gt;There are multiple ways to classify fine-tuning. The cleanest is: &lt;strong&gt;what changes&lt;/strong&gt;, &lt;strong&gt;what signal you train on&lt;/strong&gt;, and &lt;strong&gt;what model type you’re adapting&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  1.1 By training scope: Full FT vs PEFT
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Full fine-tuning (Full FT)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Definition:&lt;/strong&gt; update &lt;em&gt;all&lt;/em&gt; model weights so the model fully adapts to the new task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maximum flexibility, maximum cost&lt;/li&gt;
&lt;li&gt;Requires strong data quality and careful regularisation&lt;/li&gt;
&lt;li&gt;Risk: &lt;strong&gt;catastrophic forgetting&lt;/strong&gt; (the model “forgets” general abilities)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When it makes sense:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have a stable task and a solid dataset (usually 10k–100k+ high-quality samples)&lt;/li&gt;
&lt;li&gt;You can afford experiments and regression testing&lt;/li&gt;
&lt;li&gt;You need deeper behavioural change than PEFT can deliver&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Parameter-Efficient Fine-Tuning (PEFT)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Definition:&lt;/strong&gt; freeze most weights and train small, targeted parameters.&lt;/p&gt;

&lt;p&gt;You get most of the gains with a fraction of the cost.&lt;/p&gt;

&lt;p&gt;PEFT subtypes you’ll actually see in production:&lt;/p&gt;
&lt;h4&gt;
  
  
  (A) Adapters
&lt;/h4&gt;

&lt;p&gt;Insert small modules inside transformer blocks; train only those adapter weights. Typically a few percent of the total parameters.&lt;/p&gt;
&lt;h4&gt;
  
  
  (B) Prompt tuning (soft prompts / prefix tuning)
&lt;/h4&gt;

&lt;p&gt;Train learnable “prompt vectors” (or a prefix) that steer behaviour.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Soft prompts: continuous vectors&lt;/li&gt;
&lt;li&gt;Hard prompts: discrete tokens (rarely “trained” in the same way)&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  (C) LoRA (Low-Rank Adaptation)
&lt;/h4&gt;

&lt;p&gt;LoRA is the workhorse. It decomposes weight updates into low-rank matrices:&lt;/p&gt;

&lt;p&gt;$$&lt;br&gt;
\Delta W = BA, \quad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d,k)&lt;br&gt;
$$&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it wins:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You store only the low-rank factors (B) and (A), which are far smaller than the full (\Delta W)&lt;/li&gt;
&lt;li&gt;Easy to swap adapters per task&lt;/li&gt;
&lt;li&gt;Strong performance per compute&lt;/li&gt;
&lt;/ul&gt;
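&lt;p&gt;The arithmetic behind “small” is worth seeing once. This sketch compares the parameters a full update stores against the LoRA factorisation, using an attention projection size typical of 7B-class models as an example:&lt;/p&gt;

```python
def lora_param_counts(d, k, r):
    """Compare a full weight update against its LoRA factorisation.

    Full fine-tuning stores d*k updated values; LoRA stores B (d x r)
    plus A (r x k), i.e. r*(d + k), which is tiny when r is much
    smaller than d and k.
    """
    full = d * k
    lora = r * (d + k)
    return full, lora, lora / full

# e.g. a 4096x4096 projection with rank r=8: LoRA stores ~0.4% of the full update
full, lora, ratio = lora_param_counts(4096, 4096, 8)
```

&lt;p&gt;That ratio is why swapping one adapter per task is cheap: you keep one frozen base and a drawer of small factor pairs.&lt;/p&gt;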
&lt;h4&gt;
  
  
  (D) QLoRA
&lt;/h4&gt;

&lt;p&gt;QLoRA runs LoRA on a &lt;strong&gt;quantised base model&lt;/strong&gt; (often 4-bit), slashing VRAM requirements and making “big-ish” fine-tuning viable on consumer GPUs.&lt;/p&gt;


&lt;h2&gt;
  
  
  1.2 By learning signal: SFT, RLHF, contrastive (and friends)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Supervised Fine-Tuning (SFT)
&lt;/h3&gt;

&lt;p&gt;Train on labelled input-output pairs. This is the default for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;classification&lt;/li&gt;
&lt;li&gt;extraction&lt;/li&gt;
&lt;li&gt;instruction following (instruction tuning)&lt;/li&gt;
&lt;li&gt;style / tone adaptation&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Preference optimisation (RLHF / DPO / variants)
&lt;/h3&gt;

&lt;p&gt;Classic RLHF pipeline: SFT → reward model → policy optimisation (e.g., PPO).&lt;br&gt;&lt;br&gt;
In practice, many teams now use &lt;strong&gt;direct preference optimisation (DPO)&lt;/strong&gt;-style training because it’s simpler operationally, but the concept is the same: &lt;em&gt;align the model to preferences&lt;/em&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Contrastive fine-tuning
&lt;/h3&gt;

&lt;p&gt;Useful when you care about representations (retrieval, similarity, embedding quality), less common for everyday text generation.&lt;/p&gt;


&lt;h2&gt;
  
  
  1.3 By modality: language, vision, multimodal
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NLP&lt;/strong&gt;: BERT/GPT/T5-style models; instruction tuning and chain-of-thought-style supervision are common&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision&lt;/strong&gt;: ResNet/ViT; progressive unfreezing and strong augmentation matter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal&lt;/strong&gt;: CLIP/BLIP/Flamingo-like; biggest challenge is aligning representations across modalities&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;
  
  
  2) When Fine-Tuning Actually Pays Off
&lt;/h1&gt;

&lt;p&gt;Fine-tuning shines in three situations:&lt;/p&gt;
&lt;h2&gt;
  
  
  2.1 Your domain language is not optional
&lt;/h2&gt;

&lt;p&gt;Example: finance risk text. If the base model misreads terms like “short”, “subprime”, “haircut”, it will miss signals no matter how clever the prompt is.&lt;/p&gt;
&lt;h2&gt;
  
  
  2.2 Your task needs consistent behaviour, not one-off brilliance
&lt;/h2&gt;

&lt;p&gt;A model that produces “sometimes great” answers is a nightmare in production. Fine-tuning can stabilise behaviour and reduce prompt complexity.&lt;/p&gt;
&lt;h2&gt;
  
  
  2.3 Your deployment requires control
&lt;/h2&gt;

&lt;p&gt;On-prem constraints, latency budgets, data residency: self-hosted models + PEFT are often the only workable path.&lt;/p&gt;


&lt;h1&gt;
  
  
  3) When You Should NOT Fine-Tune
&lt;/h1&gt;

&lt;p&gt;Here are the expensive mistakes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;100 labelled samples&lt;/strong&gt;: you’ll overfit or learn noise
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;task changes weekly&lt;/strong&gt;: your fine-tune becomes technical debt
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;you can solve it with retrieval&lt;/strong&gt;: if the problem is “missing knowledge,” do RAG first
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;you can’t evaluate properly&lt;/strong&gt;: if you can’t measure, don’t train&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;
  
  
  4) The Fine-Tuning Workflow That Survives Production
&lt;/h1&gt;

&lt;p&gt;Forget “train.py and vibes.” A real pipeline has repeatable stages.&lt;/p&gt;
&lt;h2&gt;
  
  
  4.1 Environment
&lt;/h2&gt;

&lt;p&gt;Core stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PyTorch&lt;/li&gt;
&lt;li&gt;Transformers + Datasets&lt;/li&gt;
&lt;li&gt;Accelerate&lt;/li&gt;
&lt;li&gt;PEFT&lt;/li&gt;
&lt;li&gt;Experiment tracking (Weights &amp;amp; Biases or MLflow)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  4.2 Data
&lt;/h2&gt;

&lt;p&gt;This is where most projects win or lose.&lt;/p&gt;

&lt;p&gt;Minimum checklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;label consistency (do two annotators agree?)&lt;/li&gt;
&lt;li&gt;balanced distribution (avoid 10:1 class collapse unless you correct for it)&lt;/li&gt;
&lt;li&gt;no leakage (train/val split must be clean)&lt;/li&gt;
&lt;/ul&gt;
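&lt;p&gt;Exact-duplicate leakage is cheap to check before you train. A hash-based sketch (near-duplicate detection, e.g. MinHash, is a further step this omits):&lt;/p&gt;

```python
import hashlib

def leaked_examples(train_texts, val_texts):
    """Find validation examples that also appear in the training set.

    Hashing whitespace-normalised, lowercased text catches exact
    duplicates, which is the most common (and most damaging) leak.
    """
    def fingerprint(text):
        normalised = " ".join(text.lower().split())
        return hashlib.sha256(normalised.encode()).hexdigest()

    train_hashes = {fingerprint(t) for t in train_texts}
    return [t for t in val_texts if fingerprint(t) in train_hashes]
```

&lt;p&gt;Run it once before every training job; a non-empty result means your eval numbers would have been fiction.&lt;/p&gt;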
&lt;h2&gt;
  
  
  4.3 Model config
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;pick base model&lt;/li&gt;
&lt;li&gt;pick tuning method (LoRA vs QLoRA vs full)&lt;/li&gt;
&lt;li&gt;decide what gets trained, what stays frozen&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  4.4 Training loop
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;forward → loss → backward&lt;/li&gt;
&lt;li&gt;gradient clipping&lt;/li&gt;
&lt;li&gt;mixed precision when appropriate&lt;/li&gt;
&lt;li&gt;periodic eval&lt;/li&gt;
&lt;/ul&gt;
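&lt;p&gt;Gradient clipping is the step people most often hand-wave. Here is the global-norm rule in plain Python for clarity (it’s the same rule &lt;code&gt;torch.nn.utils.clip_grad_norm_&lt;/code&gt; applies to real tensors):&lt;/p&gt;

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient vectors so their combined L2 norm stays at or
    below max_norm. `grads` is a list of plain lists standing in for
    per-parameter gradient tensors."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total > max_norm:
        # one shared scale factor preserves the gradient's direction
        scale = max_norm / total
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, total
```

&lt;p&gt;The point to internalise: clipping rescales the &lt;em&gt;whole&lt;/em&gt; gradient uniformly, so the update direction is unchanged; only its magnitude is capped.&lt;/p&gt;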
&lt;h2&gt;
  
  
  4.5 Evaluation + export
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;validate on held-out set&lt;/li&gt;
&lt;li&gt;measure robustness and regression&lt;/li&gt;
&lt;li&gt;export artefacts (base + adapter weights)&lt;/li&gt;
&lt;/ul&gt;
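&lt;p&gt;“Measure regression” deserves to be mechanical, not anecdotal. A tiny sketch (metric names are illustrative) that flags where the tuned model got worse than its base:&lt;/p&gt;

```python
def regression_report(base_metrics, tuned_metrics, tolerance=0.01):
    """Flag metrics where a fine-tuned model regressed versus its base.

    The point: "better on the target task" must always be checked
    alongside "not worse on everything else" (catastrophic forgetting).
    """
    report = {}
    for name, base_val in base_metrics.items():
        tuned_val = tuned_metrics.get(name, 0.0)
        delta = tuned_val - base_val
        report[name] = {
            "base": base_val,
            "tuned": tuned_val,
            "delta": round(delta, 4),
            # regressed means the drop exceeds the tolerance band
            "regressed": -tolerance > delta,
        }
    return report
```

&lt;p&gt;Gate your export step on this report: a target-task win that tanks a general benchmark is usually a ship-blocker, not a footnote.&lt;/p&gt;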


&lt;h1&gt;
  
  
  5) Practical Code: SFT + LoRA (PEFT) with Transformers
&lt;/h1&gt;

&lt;p&gt;Below is a &lt;em&gt;slightly tweaked&lt;/em&gt; version of the standard Hugging Face flow, tuned for clarity and real-world guardrails.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pip install transformers datasets accelerate peft evaluate
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dataset&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoModelForSequenceClassification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TrainingArguments&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Trainer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;peft&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_peft_model&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;imdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bert-base-uncased&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;truncation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tokenised&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batched&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenised&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenised&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remove_columns&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;rename_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;labels&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenised&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;torch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForSequenceClassification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bert-base-uncased&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;lora_cfg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lora_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lora_dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# tweak per model architecture
&lt;/span&gt;    &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SEQ_CLS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lora_cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_pred&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eval_pred&lt;/span&gt;
    &lt;span class="n"&gt;preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;references&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TrainingArguments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./ft_out&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2e-5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;per_device_eval_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;weight_decay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;evaluation_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;epoch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;save_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;epoch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;load_best_model_at_end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric_for_best_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;logging_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fp16&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Trainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokenised&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;eval_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokenised&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;compute_metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;compute_metrics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./ft_out/lora_adapter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What’s different (and why it matters):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;max_length&lt;/code&gt; trimmed to 384 to cut padding waste&lt;/li&gt;
&lt;li&gt;LoRA target modules are explicit (verify the names for your model architecture)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fp16&lt;/code&gt; enabled, batch sizes sized for typical single GPUs&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  6) QLoRA in Practice: When VRAM Is Your Bottleneck
&lt;/h1&gt;

&lt;p&gt;QLoRA is the “I don’t have an A100” option.&lt;/p&gt;

&lt;p&gt;Use it when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;your model is too big to fine-tune in full precision&lt;/li&gt;
&lt;li&gt;you want LoRA-level results with drastically less memory&lt;/li&gt;
&lt;li&gt;you accept slightly more complexity in setup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Operational note:&lt;/strong&gt; QLoRA is sensitive to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;quantisation config&lt;/li&gt;
&lt;li&gt;optimizer choice&lt;/li&gt;
&lt;li&gt;batch size / gradient accumulation&lt;/li&gt;
&lt;/ul&gt;
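&lt;p&gt;To make those knobs concrete, here is a minimal config sketch (requires transformers, peft, and bitsandbytes; the values and &lt;code&gt;target_modules&lt;/code&gt; names are illustrative, so verify them for your architecture):&lt;/p&gt;

```python
# Illustrative QLoRA configs (requires transformers + peft + bitsandbytes).
# target_modules names are LLaMA-style; verify them for your architecture.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 usually beats plain fp4 for weights
    bnb_4bit_use_double_quant=True,         # quantise the quantisation constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype matters for stability
)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
# Typical wiring: load the base model with quantization_config=bnb_cfg,
# run peft's prepare_model_for_kbit_training(model), then get_peft_model(model, lora_cfg).
# Pair small per-device batches with gradient_accumulation_steps to keep the
# effective batch size sensible.
```

&lt;p&gt;The batch-size sensitivity usually shows up as the effective batch size: keep per-device batches small and recover the rest with gradient accumulation.&lt;/p&gt;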




&lt;h1&gt;
  
  
  7) Hardware Planning (The Boring Part That Saves You £££)
&lt;/h1&gt;

&lt;p&gt;A simple rule-of-thumb table (very rough, but directionally useful):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model size&lt;/th&gt;
&lt;th&gt;Practical approach&lt;/th&gt;
&lt;th&gt;GPU class&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt;1B&lt;/td&gt;
&lt;td&gt;Full FT or LoRA&lt;/td&gt;
&lt;td&gt;24GB consumer GPU&lt;/td&gt;
&lt;td&gt;Cheap experiments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1–10B&lt;/td&gt;
&lt;td&gt;LoRA/QLoRA&lt;/td&gt;
&lt;td&gt;40–80GB&lt;/td&gt;
&lt;td&gt;Stable training &amp;amp; eval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt;10B&lt;/td&gt;
&lt;td&gt;QLoRA or multi-GPU&lt;/td&gt;
&lt;td&gt;80GB+ (multi-card)&lt;/td&gt;
&lt;td&gt;Memory + throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
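&lt;p&gt;You can sanity-check the table with back-of-envelope arithmetic. A rough estimator (rules of thumb only; activations and KV cache are extra):&lt;/p&gt;

```python
def estimate_vram_gb(n_params_billion: float, mode: str) -> float:
    """Very rough VRAM estimate (weights + optimiser + grads), excluding activations.

    Rules of thumb:
      full FT, mixed precision + AdamW : ~16 bytes/param
      LoRA (frozen fp16 base)          : ~2 bytes/param + small adapter overhead
      QLoRA (4-bit base)               : ~0.6 bytes/param + adapter overhead
    """
    bytes_per_param = {"full": 16, "lora": 2.1, "qlora": 0.6}[mode]
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

# e.g. a 7B model: full FT ~104 GB, LoRA ~14 GB, QLoRA ~4 GB (before activations)
```

&lt;p&gt;The multipliers are coarse, but they explain the table: full fine-tuning of anything beyond ~1B quickly outgrows a single consumer card, while QLoRA keeps a 7B run inside 24 GB.&lt;/p&gt;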

&lt;p&gt;If your goal is a &lt;strong&gt;production system&lt;/strong&gt;, plan for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;checkpoints (storage balloons fast)&lt;/li&gt;
&lt;li&gt;inference latency testing (p50/p95/p99)&lt;/li&gt;
&lt;li&gt;versioning (base + adapters + configs)&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  8) Monitoring: How to Detect Failure Early
&lt;/h1&gt;

&lt;p&gt;Track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;train vs val loss divergence (overfitting)&lt;/li&gt;
&lt;li&gt;task metric (F1/AUC/accuracy) over time&lt;/li&gt;
&lt;li&gt;gradient norms (explosions or vanishing)&lt;/li&gt;
&lt;li&gt;GPU utilisation + VRAM (to catch bottlenecks)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Early stopping is not optional in small-data regimes.&lt;/p&gt;
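&lt;p&gt;In the Trainer ecosystem, early stopping is one callback away (&lt;code&gt;EarlyStoppingCallback&lt;/code&gt;); the divergence check itself is simple enough to hand-roll. A sketch with illustrative thresholds:&lt;/p&gt;

```python
def overfit_signal(train_losses, val_losses, patience=3, gap_ratio=0.25):
    """Flag likely overfitting: val loss rising for `patience` consecutive evals
    while train loss keeps falling, or the train/val gap exceeding gap_ratio
    of the val loss. Thresholds are illustrative; tune per task."""
    if patience + 1 > len(val_losses):
        return False
    val_rising = all(
        val_losses[-i] > val_losses[-i - 1] for i in range(1, patience + 1)
    )
    train_falling = train_losses[-patience - 1] > train_losses[-1]
    gap = val_losses[-1] - train_losses[-1]
    return (val_rising and train_falling) or gap > gap_ratio * val_losses[-1]
```

&lt;p&gt;Wire something like this into your eval loop and you catch overfitting epochs before you have burned the GPU budget.&lt;/p&gt;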




&lt;h1&gt;
  
  
  9) The Pitfalls That Kill Fine-Tuning Projects
&lt;/h1&gt;

&lt;h2&gt;
  
  
  9.1 Data leakage
&lt;/h2&gt;

&lt;p&gt;Validation looks amazing; the test set collapses.&lt;/p&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;group-aware splits&lt;/li&gt;
&lt;li&gt;time-based splits for temporal data&lt;/li&gt;
&lt;li&gt;deduplicate aggressively&lt;/li&gt;
&lt;/ul&gt;
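&lt;p&gt;A group-aware split is a few lines if you hash the group key. The sketch below assumes records are dicts with a grouping field (user id, document id, whatever near-duplicates share):&lt;/p&gt;

```python
import hashlib

def group_split(records, group_key, val_fraction=0.2):
    """Group-aware split: every record sharing a group lands on the same side,
    so near-duplicates can't leak into validation. Hash-based assignment is
    also stable across reruns."""
    train, val = [], []
    for rec in records:
        h = int(hashlib.sha256(str(rec[group_key]).encode()).hexdigest(), 16)
        bucket = train if h % 1000 >= val_fraction * 1000 else val
        bucket.append(rec)
    return train, val
```

&lt;p&gt;For temporal data, replace the hash with a cutoff date; the invariant is the same: nothing correlated with a validation example may appear in training.&lt;/p&gt;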

&lt;h2&gt;
  
  
  9.2 Class imbalance
&lt;/h2&gt;

&lt;p&gt;Model learns the majority class.&lt;/p&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;weighting&lt;/li&gt;
&lt;li&gt;resampling&lt;/li&gt;
&lt;li&gt;metric choice (F1 &amp;gt; accuracy in many cases)&lt;/li&gt;
&lt;/ul&gt;
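&lt;p&gt;For the weighting option, inverse-frequency weights are the usual starting point (the same formula scikit-learn calls &lt;code&gt;balanced&lt;/code&gt;):&lt;/p&gt;

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).
    Feed into a weighted loss, e.g. CrossEntropyLoss(weight=...)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# 90/10 imbalance: the minority class gets ~9x the weight of the majority
```

&lt;p&gt;Weights average to 1 across samples, so the loss scale (and your learning rate) stays roughly unchanged.&lt;/p&gt;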

&lt;h2&gt;
  
  
  9.3 “Bigger model = better”
&lt;/h2&gt;

&lt;p&gt;On small data, bigger models can overfit harder.&lt;/p&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;match model size to data&lt;/li&gt;
&lt;li&gt;prefer PEFT&lt;/li&gt;
&lt;li&gt;regularise&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  9.4 Ignoring deployment constraints
&lt;/h2&gt;

&lt;p&gt;A model that hits 0.96 AUC but misses latency and memory budgets is a demo, not a product.&lt;/p&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;benchmark early&lt;/li&gt;
&lt;li&gt;export-friendly formats (ONNX/TensorRT) if needed&lt;/li&gt;
&lt;li&gt;distil if latency matters&lt;/li&gt;
&lt;/ul&gt;
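&lt;p&gt;“Benchmark early” can be as simple as a percentile harness around your serving call (a sketch; &lt;code&gt;fn&lt;/code&gt; stands in for your real inference path, tokenisation included):&lt;/p&gt;

```python
import time
import statistics

def latency_profile(fn, *args, warmup=10, runs=100):
    """Measure p50/p95/p99 latency in milliseconds of a callable.
    Benchmark the full serving path, not just the forward pass."""
    for _ in range(warmup):          # warm caches / JIT before timing
        fn(*args)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1000)
    q = statistics.quantiles(samples, n=100)  # 99 cut points
    return {"p50": statistics.median(samples), "p95": q[94], "p99": q[98]}
```

&lt;p&gt;Run this against your latency budget before the final training run, not after; it is much cheaper to pick a smaller model early than to distil one later.&lt;/p&gt;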




&lt;h1&gt;
  
  
  10) A Decision Cheat Sheet
&lt;/h1&gt;

&lt;p&gt;Use this quick chooser:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data &amp;lt; 100&lt;/strong&gt; → prompt + retrieval + synthetic data
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;100–1,000&lt;/strong&gt; → LoRA / adapters
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1,000–10,000&lt;/strong&gt; → LoRA or full FT (small LR)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10,000+&lt;/strong&gt; → full FT can make sense (if eval + regression are solid)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VRAM tight&lt;/strong&gt; → QLoRA
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need preference alignment&lt;/strong&gt; → DPO/RLHF-style preference training
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task changes often&lt;/strong&gt; → avoid weight updates, design workflows instead&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Final Take
&lt;/h1&gt;

&lt;p&gt;Successful fine-tuning isn’t “a training run.”&lt;/p&gt;

&lt;p&gt;It’s a loop:&lt;br&gt;
&lt;strong&gt;data → training → evaluation → deployment constraints → monitoring → back to data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you treat it as an engineering system (not a one-off experiment), PEFT methods like &lt;strong&gt;LoRA/QLoRA&lt;/strong&gt; give you the best tradeoff curve in 2026: strong gains, manageable cost, and deployable artefacts.&lt;/p&gt;

&lt;p&gt;And that’s what you want: not a model that’s “smart in a notebook,” but a model that’s &lt;strong&gt;reliable in production&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>finetuning</category>
    </item>
    <item>
<title>Agent Skills: When Your AI Learns to "Install Plugins"</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Mon, 02 Feb 2026 14:40:23 +0000</pubDate>
      <link>https://forem.com/superorange0707/agent-skills-when-your-ai-learns-to-install-plugins-247j</link>
      <guid>https://forem.com/superorange0707/agent-skills-when-your-ai-learns-to-install-plugins-247j</guid>
      <description>&lt;h1&gt;
  
  
  Agent Skills: The “Plugins” Moment for Everyday AI
&lt;/h1&gt;

&lt;p&gt;There’s a specific kind of disappointment you only get after asking an LLM to “create an Excel report” and receiving… a beautifully formatted description of a spreadsheet that does not exist.&lt;/p&gt;

&lt;p&gt;It’s not the model’s fault. &lt;strong&gt;LLMs are great at language. They’re not inherently great at deterministic, file-producing, structure-preserving operations.&lt;/strong&gt; That’s where &lt;strong&gt;Agent Skills&lt;/strong&gt; come in: Anthropic’s answer (announced October 16, 2025) to the question: &lt;em&gt;“How do we give AI real capabilities without turning every user into a developer?”&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;If MCP is a highway system for AI tooling, &lt;strong&gt;Agent Skills are the roundabouts and on-ramps built right into Claude&lt;/strong&gt;—fast, local, and predictable.&lt;/p&gt;




&lt;h1&gt;
  
  
  1) What Exactly Is an Agent Skill?
&lt;/h1&gt;

&lt;p&gt;An &lt;strong&gt;Agent Skill&lt;/strong&gt; is a &lt;em&gt;packaged capability&lt;/em&gt; Claude can use when it recognises the situation.&lt;/p&gt;

&lt;p&gt;A Skill typically includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt; (name + description) used for routing/selection
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instructions&lt;/strong&gt; (the playbook / workflow)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt; (scripts, templates, helper files) that execute in a sandboxed environment &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think: “a tiny, reusable workflow module” rather than “yet another prompt”.&lt;/p&gt;

&lt;p&gt;Anthropic’s docs describe Skills as organised folders of instructions, scripts, and resources, including pre-built Skills for common document work (PowerPoint, Excel, Word, PDF), plus custom Skills you can write yourself. &lt;/p&gt;




&lt;h1&gt;
  
  
  2) Why This Is A Big Deal: Progressive Disclosure (AKA, Don’t Stuff the Context Window)
&lt;/h1&gt;

&lt;p&gt;The clever part isn’t “Claude can run scripts”—lots of systems can do that.&lt;/p&gt;

&lt;p&gt;The clever part is &lt;strong&gt;how little Claude loads until it needs to&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Claude Code docs describe a three-phase flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Startup:&lt;/strong&gt; load only each Skill’s &lt;code&gt;name&lt;/code&gt; + &lt;code&gt;description&lt;/code&gt; (keeps startup fast)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Activation:&lt;/strong&gt; when relevant, Claude asks to use the Skill and you confirm before full instructions load
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution:&lt;/strong&gt; load resources and run in the execution environment &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In practice, that means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You don’t pay a token tax upfront for 20 Skills you &lt;em&gt;might&lt;/em&gt; need later.&lt;/li&gt;
&lt;li&gt;Claude can route to the right tool without being bloated with detail.&lt;/li&gt;
&lt;li&gt;The user gets a clear “yes/no” moment before full Skill content is injected.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the opposite of the “mega system prompt” era.&lt;/p&gt;




&lt;h1&gt;
  
  
  3) What Skills Are Great At (And Why LLMs Struggle Without Them)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Problem A: “I need a real file, not a bedtime story about a file.”
&lt;/h2&gt;

&lt;p&gt;Without Skills, the model often produces &lt;em&gt;representations&lt;/em&gt; of artefacts—tables in Markdown, pseudo-Excel, fake download links.&lt;/p&gt;

&lt;p&gt;With Skills, Claude can actually &lt;strong&gt;generate the artefact&lt;/strong&gt; (e.g., &lt;code&gt;.xlsx&lt;/code&gt;, &lt;code&gt;.pptx&lt;/code&gt;, &lt;code&gt;.docx&lt;/code&gt;) and hand it back.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem B: Domain best practices are annoying to repeat
&lt;/h2&gt;

&lt;p&gt;In real work, the prompt is rarely the hard part. The hard part is the &lt;em&gt;standards&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Use the company slide template”&lt;/li&gt;
&lt;li&gt;“Always include a pivot table + chart + executive summary”&lt;/li&gt;
&lt;li&gt;“In code review, flag SQL injection risks”&lt;/li&gt;
&lt;li&gt;“For PDFs, preserve table structure and join split rows”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Skills let you bake these standards once—then reuse them consistently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem C: Determinism matters (especially for structured extraction)
&lt;/h2&gt;

&lt;p&gt;When extracting tables from PDFs or producing spreadsheets, you want &lt;strong&gt;repeatable output&lt;/strong&gt;. Skills push more of the job into deterministic tooling rather than hoping the model “describes it correctly”.&lt;/p&gt;




&lt;h1&gt;
  
  
  4) Agent Skills vs MCP: It’s Not Redundant, It’s Layering
&lt;/h1&gt;

&lt;p&gt;People see “Skills” and immediately ask: &lt;em&gt;Wait, isn’t that what MCP is for?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not really.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP in one paragraph
&lt;/h2&gt;

&lt;p&gt;Anthropic introduced &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; in November 2024 as an open standard for building secure, two-way connections between AI tools and external data sources—via MCP clients talking to MCP servers. &lt;/p&gt;

&lt;p&gt;In other words: &lt;strong&gt;MCP is about connecting to outside systems&lt;/strong&gt; (databases, file stores, SaaS tools, internal services).&lt;/p&gt;

&lt;h2&gt;
  
  
  Skills in one paragraph
&lt;/h2&gt;

&lt;p&gt;Agent Skills are about &lt;strong&gt;packaging repeatable workflows and execution logic inside Claude’s ecosystem&lt;/strong&gt;, with progressive loading and sandboxed execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  A useful mental model
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent Skills = built-in “shortcuts” / workflow modules&lt;/strong&gt; (fast, local, standardised)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP = app ecosystem infrastructure&lt;/strong&gt; (powerful, external, programmable, operationally heavier)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or: &lt;strong&gt;Skills optimise “doing the thing”. MCP optimises “reaching the thing”.&lt;/strong&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  5) The Architecture Patterns You’ll Actually Use
&lt;/h1&gt;

&lt;h2&gt;
  
  
  5.1 Skills for document-heavy work (the “office grind” you shouldn’t be doing manually)
&lt;/h2&gt;

&lt;p&gt;Pre-built Skills cover common doc tasks: spreadsheets, slides, PDFs, Word docs. &lt;/p&gt;

&lt;p&gt;A realistic workflow:&lt;/p&gt;

&lt;p&gt;1) User: “Turn this quarterly sales CSV into a management-ready workbook with a pivot table and chart.”&lt;br&gt;&lt;br&gt;
2) Claude uses a spreadsheet Skill to generate a real &lt;code&gt;.xlsx&lt;/code&gt; artefact.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.2 Skills for organisational consistency (the “one team, one way” problem)
&lt;/h2&gt;

&lt;p&gt;A custom Skill can encode your team’s standards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;review checklist&lt;/li&gt;
&lt;li&gt;risk scoring rubric&lt;/li&gt;
&lt;li&gt;writing style guide&lt;/li&gt;
&lt;li&gt;“definition of done”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters because humans forget standards. LLMs… forget them even faster unless you enforce them.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.3 MCP for external systems (the “we need live data” problem)
&lt;/h2&gt;

&lt;p&gt;If you need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;query a database&lt;/li&gt;
&lt;li&gt;call a third-party API&lt;/li&gt;
&lt;li&gt;hit an internal service&lt;/li&gt;
&lt;li&gt;read from a production repository&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…that’s MCP territory—especially because MCP is designed as an open protocol for those client/server connections. &lt;/p&gt;




&lt;h1&gt;
  
  
  6) A Minimal Custom Skill Example (Tweakable, Practical)
&lt;/h1&gt;

&lt;p&gt;Below is a lightweight Skill that turns “messy meeting notes” into a consistent UK-style action log.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;SKILL.md&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;action-tracking&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Turn meeting notes into a UK-style action log with owners, dates, and risks. Use when the user pastes notes or uploads minutes.&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gh"&gt;# Action Tracking Assistant&lt;/span&gt;

&lt;span class="gu"&gt;## When to use&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; The user provides meeting notes, minutes, or a transcript
&lt;span class="p"&gt;-&lt;/span&gt; They want actions, owners, deadlines, and risks in a consistent format

&lt;span class="gu"&gt;## Steps&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Extract decisions (if any) and action items
&lt;span class="p"&gt;2.&lt;/span&gt; Assign each action an owner (use names given; otherwise use "TBC")
&lt;span class="p"&gt;3.&lt;/span&gt; Convert relative dates (e.g. "next Friday") into explicit dates if the date is known; otherwise mark "TBC"
&lt;span class="p"&gt;4.&lt;/span&gt; Flag dependencies and risks

&lt;span class="gu"&gt;## Output format (must follow exactly)&lt;/span&gt;
&lt;span class="gu"&gt;### Decisions&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; ...

&lt;span class="gu"&gt;### Action Log&lt;/span&gt;
| ID | Action | Owner | Due date | Status | Risk/Dependency |
|---:|---|---|---|---|---|
| A1 | ... | ... | ... | Not started | ... |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the &lt;code&gt;description&lt;/code&gt; is written in language users actually type, which improves routing. Anthropic also explicitly recommends paying attention to &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt; because Claude uses them to decide whether to trigger a Skill. &lt;/p&gt;




&lt;h1&gt;
  
  
  7) Safety: The Unsexy Part That Makes Skills Usable
&lt;/h1&gt;

&lt;p&gt;Skills run code and interact with files in an execution environment, so the safety model matters. Claude Code docs frame Skills as folders Claude can navigate and execute within a constrained environment.&lt;/p&gt;

&lt;p&gt;Practical safety rules that hold up in real teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prefer &lt;strong&gt;official Skills&lt;/strong&gt; or &lt;strong&gt;skills you authored and reviewed&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Treat third-party Skills like you’d treat random shell scripts from the internet&lt;/li&gt;
&lt;li&gt;Maintain a small allowlist; remove Skills you don’t use&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  8) The Real Limitation: Ecosystem Lock-In (For Now)
&lt;/h1&gt;

&lt;p&gt;Skills are incredibly pragmatic—but they’re also &lt;strong&gt;ecosystem-bound&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skills are designed around Claude’s tooling and execution model. &lt;/li&gt;
&lt;li&gt;MCP, by contrast, is explicitly positioned as an open protocol for interoperability across tools and platforms. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the trade-off is clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Skills = speed + simplicity&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP = reach + portability&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re building internal workflows today, Skills are the “get it done this afternoon” move. If you’re building tooling that must survive model churn, MCP becomes increasingly attractive.&lt;/p&gt;




&lt;h1&gt;
  
  
  Final Take: Skills Make AI Feel Less Like Chat, More Like Software
&lt;/h1&gt;

&lt;p&gt;The biggest shift here isn’t technical—it’s product-shaped:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Before: “AI answers questions.”
&lt;/li&gt;
&lt;li&gt;Now: “AI executes workflows.”
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agent Skills push Claude closer to what office software has always promised: &lt;strong&gt;less time formatting and moving things around, more time deciding what matters.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And that’s the quiet superpower: when execution gets cheaper, &lt;strong&gt;ambition gets bigger&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>promptengineering</category>
      <category>agents</category>
    </item>
    <item>
      <title>Getting High-Quality Output from 7B Models: A Production-Grade Prompting Playbook</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Mon, 02 Feb 2026 14:39:24 +0000</pubDate>
      <link>https://forem.com/superorange0707/getting-high-quality-output-from-7b-models-a-production-grade-prompting-playbook-21hi</link>
      <guid>https://forem.com/superorange0707/getting-high-quality-output-from-7b-models-a-production-grade-prompting-playbook-21hi</guid>
      <description>&lt;h1&gt;
  
  
  7B Models: Cheap, Fast… and Brutally Honest About Your Prompting
&lt;/h1&gt;

&lt;p&gt;If you’ve deployed a 7B model locally (or on a modest GPU), you already know the trade:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;low cost&lt;/li&gt;
&lt;li&gt;low latency&lt;/li&gt;
&lt;li&gt;easy to self-host&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;patchy world knowledge
&lt;/li&gt;
&lt;li&gt;weaker long-chain reasoning
&lt;/li&gt;
&lt;li&gt;worse instruction-following
&lt;/li&gt;
&lt;li&gt;unstable formatting (“JSON… but not really”)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The biggest mistake is expecting 7B models to behave like frontier models. They won’t.&lt;/p&gt;

&lt;p&gt;But you &lt;em&gt;can&lt;/em&gt; get surprisingly high-quality output if you treat prompting like &lt;strong&gt;systems design&lt;/strong&gt;, not “creative writing.”&lt;/p&gt;

&lt;p&gt;This is the playbook.&lt;/p&gt;




&lt;h1&gt;
  
  
  1) The 7B Pain Points (What You’re Fighting)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1.1 Limited knowledge coverage
&lt;/h2&gt;

&lt;p&gt;7B models often miss niche facts and domain jargon. They’ll bluff or generalise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt implication:&lt;/strong&gt; provide the missing facts up front.&lt;/p&gt;

&lt;h2&gt;
  
  
  1.2 Logic breaks on multi-step tasks
&lt;/h2&gt;

&lt;p&gt;They may skip steps, contradict themselves, or lose track halfway through.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt implication:&lt;/strong&gt; enforce &lt;em&gt;short&lt;/em&gt; steps and validate each step.&lt;/p&gt;

&lt;h2&gt;
  
  
  1.3 Low instruction adherence
&lt;/h2&gt;

&lt;p&gt;Give three requirements, it fulfils one. Give five, it panics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt implication:&lt;/strong&gt; one task per prompt, or a tightly gated checklist.&lt;/p&gt;

&lt;h2&gt;
  
  
  1.4 Format instability
&lt;/h2&gt;

&lt;p&gt;They drift from tables to prose, add commentary, break JSON.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt implication:&lt;/strong&gt; treat output format as a contract and add a repair loop.&lt;/p&gt;
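&lt;p&gt;A repair loop is cheap to implement: validate the output, and on failure re-prompt with the parse error. A sketch where &lt;code&gt;model_call&lt;/code&gt; is a stand-in for your inference client:&lt;/p&gt;

```python
import json

def generate_json(prompt, model_call, max_retries=2):
    """Format-contract loop: request JSON only, validate, and re-prompt with
    the parse error on failure. Returns None if every attempt fails."""
    msg = prompt
    for _ in range(max_retries + 1):
        raw = model_call(msg)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as e:
            msg = (f"{prompt}\n\nYour previous output was not valid JSON "
                   f"({e.msg}). Output ONLY the corrected JSON.")
    return None
```

&lt;p&gt;With 7B models this loop fires often enough that it belongs in the pipeline, not in your patience.&lt;/p&gt;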




&lt;h1&gt;
  
  
  2) The Four High-Leverage Prompt Tactics
&lt;/h1&gt;

&lt;h2&gt;
  
  
  2.1 Simplify the instruction and focus the target
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; one prompt, one job.&lt;/p&gt;

&lt;p&gt;Bad:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Write a review with features, scenarios, advice, and a conclusion, plus SEO keywords.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Better:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Generate &lt;strong&gt;3 key features&lt;/strong&gt; (bullets). No intro. No conclusion.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then chain steps: features → scenarios → buying advice → final assembly.&lt;/p&gt;
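&lt;p&gt;The chain can be sketched in a few lines, assuming a hypothetical &lt;code&gt;call&lt;/code&gt; function that wraps your model API:&lt;/p&gt;

```python
def chain(call, product: str) -> str:
    """Run the micro-tasks in order, feeding each result into the next prompt."""
    features = call(f"Generate 3 key features (bullets) for: {product}. No intro. No conclusion.")
    scenarios = call(f"Given these features:\n{features}\nList 2 usage scenarios (bullets).")
    advice = call(f"Given:\n{features}\n{scenarios}\nWrite one line of buying advice.")
    # Final assembly happens in code, not in the model: cheaper and more reliable.
    return "\n\n".join([features, scenarios, advice])
```

&lt;p&gt;Each call stays a single, small job, which is exactly what a 7B model handles best.&lt;/p&gt;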

&lt;h3&gt;
  
  
  A reusable “micro-task” skeleton
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ROLE: You are a helpful assistant.
TASK: &amp;lt;single concrete output&amp;gt;
CONSTRAINTS:
- length:
- tone:
- must include:
FORMAT:
- output as:
INPUT:
&amp;lt;your data&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;7B models love rigid scaffolding.&lt;/p&gt;




&lt;h2&gt;
  
  
  2.2 Inject missing knowledge (don’t make the model guess)
&lt;/h2&gt;

&lt;p&gt;If accuracy matters, give the model the facts it needs, like a tiny knowledge base.&lt;/p&gt;

&lt;p&gt;Use “context injection” blocks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FACTS (use only these):
- Battery cycles: ...
- Fast charging causes: ...
- Definition of ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then ask the question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hidden benefit:&lt;/strong&gt; this reduces hallucination because you’ve narrowed the search space.&lt;/p&gt;




&lt;h2&gt;
  
  
  2.3 Few-shot + strict formats (make parsing easy)
&lt;/h2&gt;

&lt;p&gt;7B models learn best by imitation. One good example beats three paragraphs of explanation.&lt;/p&gt;

&lt;h3&gt;
  
  
  The “format contract” trick
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Tell it: “Output JSON only.”&lt;/li&gt;
&lt;li&gt;Define keys.&lt;/li&gt;
&lt;li&gt;Provide a small example.&lt;/li&gt;
&lt;li&gt;Add failure behaviour: “If missing info, output &lt;code&gt;INSUFFICIENT_DATA&lt;/code&gt;.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This reduces the “creative drift” that kills batch workflows.&lt;/p&gt;
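&lt;p&gt;A contract is only useful if you check it. Here is a minimal validator sketch (names like &lt;code&gt;REQUIRED_KEYS&lt;/code&gt; are illustrative assumptions, not a fixed API):&lt;/p&gt;

```python
import json

REQUIRED_KEYS = {"title", "bullets"}  # whatever keys your contract defines

def validate_contract(raw: str):
    """Return the parsed dict, the INSUFFICIENT_DATA sentinel, or None on violation."""
    text = raw.strip()
    if text == "INSUFFICIENT_DATA":
        return text  # the agreed failure behaviour, not an error
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None  # not JSON at all -> trigger a repair prompt
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None  # keys missing -> contract broken
    return data
```

&lt;p&gt;A &lt;code&gt;None&lt;/code&gt; result is your signal to fire a repair prompt instead of shipping garbage downstream.&lt;/p&gt;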




&lt;h2&gt;
  
  
  2.4 Step-by-step + multi-turn repair (stop expecting perfection first try)
&lt;/h2&gt;

&lt;p&gt;Small models benefit massively from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;step decomposition&lt;/li&gt;
&lt;li&gt;targeted corrections (“you missed field X”)&lt;/li&gt;
&lt;li&gt;re-generation of only the broken section&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it as &lt;strong&gt;unit tests for text&lt;/strong&gt;.&lt;/p&gt;
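&lt;p&gt;A minimal sketch of those “unit tests”, assuming your prompt carries a MUST list and a length cap (both names here are illustrative):&lt;/p&gt;

```python
def check_output(text: str, must_include: list[str], max_words: int) -> list[str]:
    """Run checks on generated text and return a list of failure messages."""
    failures = [f"Missing: {item}" for item in must_include
                if item.lower() not in text.lower()]
    if len(text.split()) > max_words:
        failures.append(f"Constraint violation: longer than {max_words} words")
    return failures

def build_repair_prompt(failures: list[str]) -> str:
    """Target only the broken parts instead of regenerating everything."""
    issues = "\n".join(f"- {f}" for f in failures)
    return ("You did not follow the contract.\n"
            f"Fix ONLY the following issues:\n{issues}\n"
            "Return the corrected output only.")
```

&lt;p&gt;Empty failure list: ship it. Non-empty: send the repair prompt and re-check, rather than regenerating from scratch.&lt;/p&gt;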




&lt;h1&gt;
  
  
  3) Real Scenarios: Before/After Prompts That Boost Output Quality
&lt;/h1&gt;

&lt;h2&gt;
  
  
  3.1 Content creation: product promo copy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❌ Before (too vague)
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;“Write a fun promo for a portable wireless power bank for young people.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  ✅ After (facts + tone + example)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TASK: Write ONE promo paragraph (90–120 words) for a portable wireless power bank.
AUDIENCE: young commuters in the UK.
TONE: lively, casual, not cringe.
MUST INCLUDE (in any order):
- 180g weight and phone-sized body
- 22.5W fast charging: ~60% in 30 minutes
- 10,000mAh: 2–3 charges
AVOID:
- technical jargon, long specs tables

STYLE EXAMPLE (imitate the vibe, not the product):
"This mini speaker is unreal — pocket-sized, loud, and perfect for the commute."

OUTPUT: one paragraph only. No title. No emojis.

Product:
Portable wireless power bank
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works on 7B:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;facts are injected&lt;/li&gt;
&lt;li&gt;task is singular&lt;/li&gt;
&lt;li&gt;style anchor reduces tone randomness&lt;/li&gt;
&lt;li&gt;strict output scope prevents rambling&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3.2 Coding: pandas data cleaning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❌ Before
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;“Write Python code to clean user spending data.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  ✅ After (step contract + no guessing)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a Python developer. Output code only.

Goal: Clean a CSV file and save the result.

Input file: customer_spend.csv
Columns:
- age (numeric)
- consumption_amount (numeric)

Steps (must follow in order):
1) Read the CSV into `df`
2) Fill missing `age` with the mean of age
3) Fill missing `consumption_amount` with 0
4) Cap `consumption_amount` at 10000 (values &amp;gt;10000 become 10000)
5) Save to clean_customer_spend.csv with index=False
6) Print a single success message

Constraints:
- Use pandas only
- Do not invent extra columns
- Include basic comments
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;7B bonus:&lt;/strong&gt; by forcing explicit steps, you reduce the chance it “forgets” a requirement.&lt;/p&gt;




&lt;h2&gt;
  
  
  3.3 Data analysis: report &lt;em&gt;without&lt;/em&gt; computing numbers
&lt;/h2&gt;

&lt;p&gt;This is where small models often hallucinate numbers. So don’t let them.&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ Prompt that forbids fabrication
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TASK: Write a short analysis report framework for 2024 monthly sales trends.
DATA SOURCE: 2024_monthly_sales.xlsx with columns:
- Month (Jan..Dec)
- Units sold
- Revenue (£k)

IMPORTANT RULE:
- You MUST NOT invent any numbers.
- Use placeholders like [X month], [X], [Y] where values are unknown.

Report structure (must match):
# 2024 Product Sales Trend Report
## 1. Sales overview
## 2. Peak &amp;amp; trough months
## 3. Overall trend summary

Analysis requirements:
- Define how to find peak/low months
- Give 2–3 plausible reasons for peaks and troughs (seasonality, promo, stock issues)
- Summarise likely overall trend patterns (up, down, volatile, U-shape)

Output: the full report with placeholders only.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; it turns the model into a “framework generator,” which is a sweet spot for 7B.&lt;/p&gt;




&lt;h1&gt;
  
  
  4) Quality Evaluation: How to Measure Improvements (Not Just “Feels Better”)
&lt;/h1&gt;

&lt;p&gt;If you can’t measure it, you’ll end up “prompt-shopping” forever. For 7B models, I recommend an evaluation loop that’s &lt;strong&gt;cheap, repeatable, and brutally honest&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  4.1 A lightweight scorecard (run 10–20 samples)
&lt;/h2&gt;

&lt;p&gt;Pick a small test set (10–20 prompts is enough to start; 50 gives steadier numbers) and record:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Adherence&lt;/strong&gt;: did it satisfy every MUST requirement? (hit-rate %)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Factuality vs context&lt;/strong&gt;: count statements that contradict your provided facts (lower is better)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Format pass-rate&lt;/strong&gt;: does the output parse 100% of the time? (JSON/schema/table)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variance&lt;/strong&gt;: do “key decisions” change across runs? (stability %)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: avg tokens + avg latency (P50/P95)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A simple rubric:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Green&lt;/em&gt;: ≥95% adherence + ≥95% format pass-rate
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Yellow&lt;/em&gt;: 80–95% (acceptable for drafts)
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Red&lt;/em&gt;: &amp;lt;80% (you’re still guessing / under-specifying)&lt;/li&gt;
&lt;/ul&gt;
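&lt;p&gt;The scorecard is easy to automate. A sketch, assuming each sample pairs a raw output with its MUST list (JSON parsing stands in for whatever format check you actually use):&lt;/p&gt;

```python
import json

def score_run(outputs: list[str], must_lists: list[list[str]]) -> dict:
    """Compute adherence %, format pass-rate %, and the rubric band."""
    adhered = parsed = 0
    for out, musts in zip(outputs, must_lists):
        if all(m in out for m in musts):
            adhered += 1
        try:
            json.loads(out)   # swap in your own schema/table check here
            parsed += 1
        except json.JSONDecodeError:
            pass
    n = len(outputs)
    adherence, fmt = 100 * adhered / n, 100 * parsed / n
    if adherence >= 95 and fmt >= 95:
        band = "green"
    elif adherence >= 80 and fmt >= 80:
        band = "yellow"
    else:
        band = "red"
    return {"adherence": adherence, "format_pass": fmt, "band": band}
```

&lt;p&gt;Run it on the same frozen test set before and after every prompt change, and the “feels better” debate disappears.&lt;/p&gt;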

&lt;h2&gt;
  
  
  4.2 Common failure modes (and what to change)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If it invents facts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;inject missing context&lt;/li&gt;
&lt;li&gt;add a “no fabrication” rule&lt;/li&gt;
&lt;li&gt;require citations &lt;em&gt;from the provided context only&lt;/em&gt; (not web links)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If it ignores constraints&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;move requirements into a short MUST list&lt;/li&gt;
&lt;li&gt;reduce optional wording (“try”, “maybe”, “if possible”)&lt;/li&gt;
&lt;li&gt;cap output length explicitly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If the format drifts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;add a schema + a “format contract”&lt;/li&gt;
&lt;li&gt;include a single example that matches the schema&lt;/li&gt;
&lt;li&gt;set a stop sequence (e.g., stop after the closing brace)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4.3 Iteration loop that doesn’t waste time
&lt;/h2&gt;

&lt;p&gt;1) Freeze your test prompts&lt;br&gt;&lt;br&gt;
2) Change &lt;strong&gt;one&lt;/strong&gt; thing (constraints, example, context, or step plan)&lt;br&gt;&lt;br&gt;
3) Re-run the scorecard&lt;br&gt;&lt;br&gt;
4) Keep changes that improve adherence/format &lt;strong&gt;without&lt;/strong&gt; increasing cost too much&lt;br&gt;&lt;br&gt;
5) Only then generalise to new tasks&lt;/p&gt;




&lt;h1&gt;
  
  
  5) Iteration Methods That Work on 7B
&lt;/h1&gt;

&lt;h2&gt;
  
  
  5.1 Prompt iteration loop
&lt;/h2&gt;

&lt;p&gt;1) Run prompt&lt;br&gt;
2) Compare output to checklist&lt;br&gt;
3) Issue a repair prompt targeting &lt;em&gt;only the broken parts&lt;/em&gt;&lt;br&gt;
4) Save the “winning” prompt as a template&lt;/p&gt;

&lt;h3&gt;
  
  
  A reusable repair prompt
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You did not follow the contract.

Fix ONLY the following issues:
- Missing: &amp;lt;field&amp;gt;
- Format error: &amp;lt;what broke&amp;gt;
- Constraint violation: &amp;lt;too long / invented numbers / wrong tone&amp;gt;

Return the corrected output only.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5.2 Few-shot tuning (but keep it small)
&lt;/h2&gt;

&lt;p&gt;For 7B models: &lt;strong&gt;1–3 examples&lt;/strong&gt; usually beat 5–10 (too much context = distraction).&lt;/p&gt;

&lt;h2&gt;
  
  
  5.3 Format simplification
&lt;/h2&gt;

&lt;p&gt;If strict JSON fails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;use a Markdown table with fixed columns&lt;/li&gt;
&lt;li&gt;or “Key: Value” lines, then parse them with regex&lt;/li&gt;
&lt;/ul&gt;
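&lt;p&gt;The “Key: Value” fallback parses in a few lines of regex (a sketch; tighten the pattern to your own key names):&lt;/p&gt;

```python
import re

# One "Key: Value" pair per line; commentary lines without a colon are ignored.
LINE_RE = re.compile(r"^\s*([A-Za-z_ ]+):\s*(.+)$")

def parse_key_values(text: str) -> dict[str, str]:
    """Extract Key: Value pairs from loosely formatted model output."""
    result = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            result[m.group(1).strip().lower()] = m.group(2).strip()
    return result
```

&lt;p&gt;This tolerates the preamble chatter (“Sure! Here you go:”) that so often breaks strict JSON parsing on small models.&lt;/p&gt;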




&lt;h1&gt;
  
  
  6) Deployment Notes: Hardware, Quantisation, and Inference Choices
&lt;/h1&gt;

&lt;p&gt;7B models can run on consumer hardware, but choices matter:&lt;/p&gt;

&lt;h2&gt;
  
  
  6.1 Hardware baseline
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;CPU-only: workable, but slower; more RAM helps (16GB+)&lt;/li&gt;
&lt;li&gt;GPU: smoother UX; 3090/4090 class GPUs are comfortable for many 7B setups&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6.2 Quantisation
&lt;/h2&gt;

&lt;p&gt;INT8/INT4 reduces memory and speeds up inference, but can slightly degrade accuracy.&lt;br&gt;
Common approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPTQ / AWQ&lt;/li&gt;
&lt;li&gt;4-bit quant + LoRA adapters for domain tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6.3 Inference frameworks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llama.cpp&lt;/code&gt; for fast local CPU/GPU setups&lt;/li&gt;
&lt;li&gt;vLLM for server-style throughput&lt;/li&gt;
&lt;li&gt;Transformers.js for lightweight client-side experiments&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Final Take
&lt;/h1&gt;

&lt;p&gt;7B models don’t reward “clever prompts.”&lt;br&gt;&lt;br&gt;
They reward &lt;strong&gt;clear contracts&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;keep tasks small,&lt;/li&gt;
&lt;li&gt;inject missing facts,&lt;/li&gt;
&lt;li&gt;enforce formats,&lt;/li&gt;
&lt;li&gt;and use repair loops,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;you can make a 7B model deliver output that feels &lt;em&gt;shockingly close&lt;/em&gt; to much larger systems—especially for structured work like copy templates, code scaffolding, and report frameworks.&lt;/p&gt;

&lt;p&gt;That’s the point of low-resource LLMs: not to replace frontier models, but to own the “cheap, fast, good enough” layer of your stack.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>promptengineering</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Prompt Rate Limits &amp; Batching: How to Stop Your LLM API From Melting Down</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Mon, 26 Jan 2026 10:53:16 +0000</pubDate>
      <link>https://forem.com/superorange0707/prompt-rate-limits-batching-how-to-stop-your-llm-api-from-melting-down-56e1</link>
      <guid>https://forem.com/superorange0707/prompt-rate-limits-batching-how-to-stop-your-llm-api-from-melting-down-56e1</guid>
      <description>&lt;h1&gt;
  
  
  Prompt Rate Limits &amp;amp; Batching: Your LLM API Has a Speed Limit (Even If Your Product Doesn’t)
&lt;/h1&gt;

&lt;p&gt;You ship a feature, your traffic spikes, and suddenly your LLM layer starts returning &lt;strong&gt;429s&lt;/strong&gt; like it’s handing out parking tickets.&lt;/p&gt;

&lt;p&gt;The bad news: rate limits are inevitable.&lt;/p&gt;

&lt;p&gt;The good news: &lt;strong&gt;most LLM “rate limit incidents” are self-inflicted&lt;/strong&gt;—usually by oversized prompts, bursty traffic, and output formats that are impossible to parse at scale.&lt;/p&gt;

&lt;p&gt;This article is a practical playbook for:&lt;/p&gt;

&lt;p&gt;1) understanding prompt-related throttles,&lt;br&gt;&lt;br&gt;
2) avoiding the common failure modes, and&lt;br&gt;&lt;br&gt;
3) batching requests without turning your responses into soup.&lt;/p&gt;


&lt;h1&gt;
  
  
  1) The Three Limits You Actually Hit (And What They Mean)
&lt;/h1&gt;

&lt;p&gt;Different providers name things differently, but the mechanics are consistent:&lt;/p&gt;
&lt;h2&gt;
  
  
  1.1 Context window (max tokens per request)
&lt;/h2&gt;

&lt;p&gt;If your &lt;strong&gt;input + output&lt;/strong&gt; exceeds the model context window, the request fails immediately.&lt;/p&gt;

&lt;p&gt;Symptoms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Maximum context length exceeded”&lt;/li&gt;
&lt;li&gt;“Your messages resulted in X tokens…”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;shorten, summarise, or chunk data.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  1.2 RPM (Requests Per Minute)
&lt;/h2&gt;

&lt;p&gt;You can be under token limits and still get throttled if you burst too many calls. Gemini explicitly documents RPM as a core dimension. &lt;/p&gt;

&lt;p&gt;Symptoms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Rate limit reached for requests per minute”&lt;/li&gt;
&lt;li&gt;HTTP 429&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;client-side pacing, queues, and backoff.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  1.3 TPM / Token throughput limits
&lt;/h2&gt;

&lt;p&gt;Anthropic measures rate limits in &lt;strong&gt;RPM + input tokens/minute + output tokens/minute&lt;/strong&gt; (ITPM/OTPM).&lt;br&gt;&lt;br&gt;
Gemini similarly describes tokens per minute as a key dimension. &lt;/p&gt;

&lt;p&gt;Symptoms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Rate limit reached for token usage per minute”&lt;/li&gt;
&lt;li&gt;429 + Retry-After header (Anthropic calls this out) &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reduce tokens, batch efficiently, or request higher quota.&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;
  
  
  2) The Most Common “Prompt Limit” Failure Patterns
&lt;/h1&gt;
&lt;h2&gt;
  
  
  2.1 The “one prompt to rule them all” anti-pattern
&lt;/h2&gt;

&lt;p&gt;You ask for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;extraction
&lt;/li&gt;
&lt;li&gt;classification
&lt;/li&gt;
&lt;li&gt;rewriting
&lt;/li&gt;
&lt;li&gt;validation
&lt;/li&gt;
&lt;li&gt;formatting
&lt;/li&gt;
&lt;li&gt;business logic
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…in a single request, and then you wonder why token usage spikes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Split the workflow&lt;/strong&gt;. If you need multi-step logic, use &lt;strong&gt;Prompt Chaining&lt;/strong&gt; (small prompts with structured intermediate outputs).&lt;/p&gt;
&lt;h2&gt;
  
  
  2.2 Bursty traffic (the silent RPM killer)
&lt;/h2&gt;

&lt;p&gt;Production traffic is spiky. Cron jobs, retries, user clicks, webhook bursts—everything aligns in the worst possible minute.&lt;/p&gt;

&lt;p&gt;If your client sends requests like a machine gun, your provider will respond like a bouncer.&lt;/p&gt;
&lt;h2&gt;
  
  
  2.3 Unstructured output = expensive parsing
&lt;/h2&gt;

&lt;p&gt;If your output is “kinda JSON-ish”, your parser becomes a full-time therapist.&lt;/p&gt;

&lt;p&gt;Make the model output &lt;strong&gt;strict JSON&lt;/strong&gt; or a fixed table. Treat format as a contract.&lt;/p&gt;


&lt;h1&gt;
  
  
  3) Rate Limit Survival Kit (Compliant, Practical, Boring)
&lt;/h1&gt;
&lt;h2&gt;
  
  
  3.1 Prompt-side: shrink tokens without losing signal
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Delete marketing fluff&lt;/strong&gt; (models don’t need your company origin story).&lt;/li&gt;
&lt;li&gt;Convert repeated boilerplate into a short “policy block” and reuse it.&lt;/li&gt;
&lt;li&gt;Prefer &lt;strong&gt;fields&lt;/strong&gt; over prose (“material=316 stainless steel” beats a paragraph).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  A tiny prompt rewrite that usually saves 30–50%
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before (chatty):&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“We’re a smart home brand founded in 2010… please write 3 marketing lines…”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;After (dense + precise):&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Write 3 UK e-commerce lines. Product: smart bulb. Material=PC flame-retardant. Feature=3 colour temperatures. Audience=living room.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  3.2 Request-side: backoff like an adult
&lt;/h2&gt;

&lt;p&gt;If the provider returns &lt;strong&gt;Retry-After&lt;/strong&gt;, respect it. Anthropic explicitly returns Retry-After on 429s. &lt;/p&gt;

&lt;p&gt;Use exponential backoff + jitter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;attempt 1: 1s
&lt;/li&gt;
&lt;li&gt;attempt 2: 2–3s
&lt;/li&gt;
&lt;li&gt;attempt 3: 4–6s
&lt;/li&gt;
&lt;li&gt;then fail gracefully&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  3.3 System-side: queue + concurrency caps
&lt;/h2&gt;

&lt;p&gt;If your account supports 10 concurrent requests, do not run 200 coroutines and “hope”.&lt;/p&gt;

&lt;p&gt;Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a &lt;strong&gt;work queue&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;semaphore&lt;/strong&gt; for concurrency&lt;/li&gt;
&lt;li&gt;and a &lt;strong&gt;rate limiter&lt;/strong&gt; for RPM/TPM&lt;/li&gt;
&lt;/ul&gt;
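&lt;p&gt;In Python, the queue-plus-semaphore pattern is a few lines of &lt;code&gt;asyncio&lt;/code&gt; (a sketch; &lt;code&gt;MAX_CONCURRENCY&lt;/code&gt; and &lt;code&gt;RPM_LIMIT&lt;/code&gt; are placeholders for your account's real quotas):&lt;/p&gt;

```python
import asyncio

MAX_CONCURRENCY = 10  # never exceed your account's concurrent-request cap
RPM_LIMIT = 60        # requests-per-minute budget

async def run_all(call, jobs):
    """Cap concurrency with a semaphore and pace calls to stay under RPM."""
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def one(job):
        async with sem:
            result = await call(job)
            # Crude pacing: spread RPM_LIMIT calls evenly across each minute.
            await asyncio.sleep(60 / RPM_LIMIT)
            return result

    return await asyncio.gather(*(one(j) for j in jobs))
```

&lt;p&gt;The semaphore bounds in-flight requests; the sleep smooths bursts so cron jobs and webhook storms stop landing in the same minute.&lt;/p&gt;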


&lt;h1&gt;
  
  
  4) Batching: The Fastest Way to Cut Calls, Cost, and 429s
&lt;/h1&gt;

&lt;p&gt;Batching means: &lt;strong&gt;one API request handles multiple independent tasks&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It works best when tasks are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;same type (e.g., 20 product blurbs)&lt;/li&gt;
&lt;li&gt;independent (no step depends on another)&lt;/li&gt;
&lt;li&gt;same output schema&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Why it helps
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;fewer network round-trips&lt;/li&gt;
&lt;li&gt;fewer requests → lower RPM pressure&lt;/li&gt;
&lt;li&gt;more predictable throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also: OpenAI’s pricing pages explicitly include a “Batch API price” column for several models. &lt;br&gt;
(That doesn’t mean “batching is free”, but it’s a strong hint the ecosystem expects this pattern.)&lt;/p&gt;


&lt;h1&gt;
  
  
  5) The Batching Prompt Template That Doesn’t Fall Apart
&lt;/h1&gt;

&lt;p&gt;Here’s a format that stays parseable under pressure.&lt;/p&gt;
&lt;h2&gt;
  
  
  5.1 Use task blocks + a strict JSON response schema
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SYSTEM: You output valid JSON only. No Markdown. No commentary.

USER:
You will process multiple tasks. 
Return a JSON array. Each item must be:
{
  "task_id": &amp;lt;int&amp;gt;,
  "title": &amp;lt;string&amp;gt;,
  "bullets": [&amp;lt;string&amp;gt;, &amp;lt;string&amp;gt;, &amp;lt;string&amp;gt;]
}

Rules:
- UK English spelling
- Title ≤ 12 words
- 3 bullets, each ≤ 18 words
- If input is missing: set title="INSUFFICIENT_DATA" and bullets=[]

TASKS:
### TASK 1
product_name: Insulated smart mug
material: 316 stainless steel
features: temperature alert, 7-day battery
audience: commuters

### TASK 2
product_name: Wireless earbuds
material: ABS shock-resistant
features: ANC, 24-hour battery
audience: students
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That “INSUFFICIENT_DATA” clause is your lifesaver. One broken task shouldn’t poison the whole batch.&lt;/p&gt;


&lt;h1&gt;
  
  
  6) Python Implementation: Batch → Call → Parse (With Guardrails)
&lt;/h1&gt;

&lt;p&gt;Below is a modern-ish pattern you can adapt (provider SDKs vary, so treat it as &lt;strong&gt;structure&lt;/strong&gt;, not a copy‑paste guarantee).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;

&lt;span class="n"&gt;MAX_RETRIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;backoff_sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry_after&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;retry_after&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retry_after&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;
    &lt;span class="n"&gt;jitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;jitter&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_batch_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You output valid JSON only. No Markdown. No commentary.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Return a JSON array. Each item must be:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;task_id&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &amp;lt;int&amp;gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &amp;lt;string&amp;gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;bullets&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: [&amp;lt;string&amp;gt;, &amp;lt;string&amp;gt;, &amp;lt;string&amp;gt;]&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rules:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- UK English spelling&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- Title ≤ 12 words&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- 3 bullets, each ≤ 18 words&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- If input is missing: set title=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;INSUFFICIENT_DATA&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt; and bullets=[]&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TASKS:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;blocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;blocks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;### TASK &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_name: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;material: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;material&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;features: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;features&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audience: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;audience&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blocks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_json_strict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="c1"&gt;# Hard fail if it's not JSON. This is intentional.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return (text, retry_after_seconds). Replace with your provider call.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nb"&gt;NotImplementedError&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_batch_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MAX_RETRIES&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;raw_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry_after&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;parse_json_strict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Ask the model to repair formatting in a second pass (or log + retry)
&lt;/span&gt;            &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fix the output into valid JSON only. Preserve meaning.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAD_OUTPUT:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;raw_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;backoff_sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# If your SDK exposes HTTP status + retry-after, use it here
&lt;/span&gt;            &lt;span class="nf"&gt;backoff_sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;last_error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;

    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Batch failed after retries: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;last_error&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What changed vs “classic” snippets?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;We treat JSON as a &lt;strong&gt;hard contract&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;We handle &lt;strong&gt;format repair&lt;/strong&gt; explicitly (and keep it cheap).&lt;/li&gt;
&lt;li&gt;We centralise backoff logic so every call behaves the same way.&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  7) How to Choose Batch Size (The Rule Everyone Learns the Hard Way)
&lt;/h1&gt;

&lt;p&gt;Batch size is constrained by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;context window (max tokens per request)&lt;/li&gt;
&lt;li&gt;TPM throughput&lt;/li&gt;
&lt;li&gt;response parsing stability&lt;/li&gt;
&lt;li&gt;your business tolerance for “one batch failed”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A practical heuristic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;start with &lt;strong&gt;10–20 items per batch&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;measure token usage&lt;/li&gt;
&lt;li&gt;increase until you see:

&lt;ul&gt;
&lt;li&gt;output format drift, or
&lt;/li&gt;
&lt;li&gt;timeouts / latency spikes, or
&lt;/li&gt;
&lt;li&gt;context overflow risk&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;And always keep a &lt;strong&gt;max batch token budget&lt;/strong&gt;.&lt;/p&gt;
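&lt;p&gt;A minimal sketch of that budget in code. The limits and the 4-characters-per-token estimate below are illustrative assumptions, not provider guidance:&lt;/p&gt;

```python
# Illustrative budgets; tune against your model's context window and TPM limits.
MAX_BATCH_ITEMS = 20
MAX_BATCH_TOKENS = 8000

def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English prose.
    # Swap in your provider's tokenizer for real counts.
    return max(1, len(text) // 4)

def split_into_batches(tasks):
    """Greedily pack tasks into batches under both item and token budgets."""
    batches, current, current_tokens = [], [], 0
    for task in tasks:
        cost = estimate_tokens(task)
        if current and (len(current) >= MAX_BATCH_ITEMS
                        or current_tokens + cost > MAX_BATCH_TOKENS):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(task)
        current_tokens += cost
    if current:
        batches.append(current)
    return batches
```

&lt;p&gt;Each resulting batch then goes through the retry/repair loop above as its own request, so one oversized input can never sink the whole job.&lt;/p&gt;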




&lt;h1&gt;
  
  
  8) “Cost Math” Without Fantasy Numbers
&lt;/h1&gt;

&lt;p&gt;Pricing changes. Tiers change. Models change.&lt;/p&gt;

&lt;p&gt;So instead of hard-coding ancient per-1K token values, calculate cost using the provider’s current pricing page.&lt;/p&gt;

&lt;p&gt;OpenAI publishes per‑token pricing on its API pricing pages.&lt;br&gt;&lt;br&gt;
Anthropic also publishes pricing and documents rate limit tiers. &lt;/p&gt;

&lt;p&gt;A useful cost estimator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cost ≈ (input_tokens * input_price + output_tokens * output_price) / 1,000,000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
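&lt;p&gt;The same estimator as a small function. The example prices are placeholders; read current values off your provider’s pricing page:&lt;/p&gt;

```python
def estimate_cost(input_tokens, output_tokens, input_price, output_price):
    """USD cost of one call, with prices quoted per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Placeholder prices: 2,000 input + 500 output tokens at $0.15 / $0.60 per 1M.
cost = estimate_cost(2000, 500, 0.15, 0.60)  # ~$0.0006
```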



&lt;p&gt;Then optimise the variables you control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;shrink input tokens
&lt;/li&gt;
&lt;li&gt;constrain output tokens
&lt;/li&gt;
&lt;li&gt;reduce number of calls (batch)
&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  9) Risks of Batching (And How to Not Get Burnt)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Risk 1: one bad item ruins the batch
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; “INSUFFICIENT_DATA” fallback per task.&lt;/p&gt;

&lt;h2&gt;
  
  
  Risk 2: output format drift breaks parsing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; strict JSON, repair step, and logging.&lt;/p&gt;

&lt;h2&gt;
  
  
  Risk 3: batch too big → context overflow
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; token budgeting + auto-splitting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Risk 4: “creative” attempts to bypass quotas
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; don’t. If you need more capacity, request higher limits and follow provider terms.&lt;/p&gt;




&lt;h1&gt;
  
  
  Final Take
&lt;/h1&gt;

&lt;p&gt;Rate limits aren’t the enemy. They’re your early warning system that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prompts are too long,
&lt;/li&gt;
&lt;li&gt;traffic is too bursty,
&lt;/li&gt;
&lt;li&gt;or your architecture assumes “infinite throughput”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you treat prompts like payloads (not prose), add pacing, and batch like a grown-up, you’ll get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fewer 429s
&lt;/li&gt;
&lt;li&gt;lower cost
&lt;/li&gt;
&lt;li&gt;and a system that scales without drama&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s the whole game.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>api</category>
      <category>python</category>
    </item>
    <item>
      <title>Choosing an LLM in 2026: The Practical Comparison Table (Specs, Cost, Latency, Compatibility)</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Mon, 26 Jan 2026 10:52:49 +0000</pubDate>
      <link>https://forem.com/superorange0707/choosing-an-llm-in-2026-the-practical-comparison-table-specs-cost-latency-compatibility-354g</link>
      <guid>https://forem.com/superorange0707/choosing-an-llm-in-2026-the-practical-comparison-table-specs-cost-latency-compatibility-354g</guid>
      <description>&lt;h1&gt;
  
  
  The uncomfortable truth: “model choice” is half your prompt engineering
&lt;/h1&gt;

&lt;p&gt;If your prompt is a recipe, the model is your kitchen.&lt;/p&gt;

&lt;p&gt;A great recipe doesn’t help if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the oven is tiny (context window),&lt;/li&gt;
&lt;li&gt;the ingredients are expensive (token price),&lt;/li&gt;
&lt;li&gt;the chef is slow (latency),&lt;/li&gt;
&lt;li&gt;or your tools don’t fit (function calling / JSON / SDK / ecosystem).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So here’s a &lt;strong&gt;practical&lt;/strong&gt; comparison you can actually use.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note on “parameters”:&lt;/strong&gt; for many frontier models, parameter counts are not publicly disclosed. In practice, context window + pricing + tool features predict “fit” better than guessing parameter scale.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  1) Quick comparison: what you should care about first
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1.1 The “four knobs” that matter
&lt;/h2&gt;

&lt;p&gt;1) &lt;strong&gt;Context&lt;/strong&gt;: can you fit the job in one request?&lt;br&gt;&lt;br&gt;
2) &lt;strong&gt;Cost&lt;/strong&gt;: can you afford volume?&lt;br&gt;&lt;br&gt;
3) &lt;strong&gt;Latency&lt;/strong&gt;: does your UX tolerate the wait?&lt;br&gt;&lt;br&gt;
4) &lt;strong&gt;Compatibility&lt;/strong&gt;: will your stack integrate cleanly?&lt;/p&gt;

&lt;p&gt;Everything else is second order.&lt;/p&gt;




&lt;h1&gt;
  
  
  2) Model spec table (context + positioning)
&lt;/h1&gt;

&lt;p&gt;This table focuses on what’s stable: &lt;strong&gt;family, positioning, and context expectations&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Model family (examples)&lt;/th&gt;
&lt;th&gt;Typical positioning&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;GPT family (e.g., &lt;code&gt;gpt-4o&lt;/code&gt;, &lt;code&gt;gpt-4.1&lt;/code&gt;, &lt;code&gt;gpt-5*&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;General-purpose, strong tooling ecosystem&lt;/td&gt;
&lt;td&gt;Pricing + cached input are clearly published.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;“o” reasoning family (e.g., &lt;code&gt;o3&lt;/code&gt;, &lt;code&gt;o1&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Deep reasoning / harder planning&lt;/td&gt;
&lt;td&gt;Often higher cost; use selectively.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Claude family (e.g., Haiku / Sonnet tiers)&lt;/td&gt;
&lt;td&gt;Strong writing + safety posture; clean docs&lt;/td&gt;
&lt;td&gt;Pricing table includes multiple rate dimensions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;td&gt;Gemini family (Flash / Pro tiers)&lt;/td&gt;
&lt;td&gt;Multimodal + Google ecosystem + caching/grounding options&lt;/td&gt;
&lt;td&gt;Pricing page explicitly covers caching + grounding.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek&lt;/td&gt;
&lt;td&gt;DeepSeek chat + reasoning models&lt;/td&gt;
&lt;td&gt;Aggressive price/perf, popular for scale&lt;/td&gt;
&lt;td&gt;Official pricing docs available.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;Llama / Qwen / Mistral etc.&lt;/td&gt;
&lt;td&gt;Self-host for privacy/control&lt;/td&gt;
&lt;td&gt;Context depends on model; Llama 3.1 supports 128K.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  3) Pricing table (the part your CFO actually reads)
&lt;/h1&gt;

&lt;p&gt;Below are &lt;strong&gt;public list prices&lt;/strong&gt; from official docs (USD per &lt;strong&gt;1M tokens&lt;/strong&gt;).&lt;br&gt;&lt;br&gt;
Use this as a baseline, then apply: caching, batch discounts, and your real output length.&lt;/p&gt;

&lt;h2&gt;
  
  
  3.1 OpenAI (selected highlights)
&lt;/h2&gt;

&lt;p&gt;OpenAI publishes input, cached input, and output prices per 1M tokens.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input / 1M&lt;/th&gt;
&lt;th&gt;Cached input / 1M&lt;/th&gt;
&lt;th&gt;Output / 1M&lt;/th&gt;
&lt;th&gt;When to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gpt-4.1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;$8.00&lt;/td&gt;
&lt;td&gt;High-quality general reasoning with sane cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gpt-4o&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;Multimodal-ish “workhorse” if you need it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gpt-4o-mini&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;td&gt;$0.075&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;td&gt;High-throughput chat, extraction, tagging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;o3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;$8.00&lt;/td&gt;
&lt;td&gt;Reasoning-heavy tasks without the top-end pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;o1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;$7.50&lt;/td&gt;
&lt;td&gt;$60.00&lt;/td&gt;
&lt;td&gt;“Use sparingly”: hard reasoning where mistakes are expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;If you’re building a product: you’ll often run &lt;strong&gt;80–95%&lt;/strong&gt; of calls on a cheaper model (mini/fast tier), and escalate only the hard cases.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  3.2 Anthropic (Claude)
&lt;/h2&gt;

&lt;p&gt;Anthropic publishes a model pricing table in Claude docs. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input / MTok&lt;/th&gt;
&lt;th&gt;Output / MTok&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;$1.00&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;Fast, budget-friendly tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 3.5&lt;/td&gt;
&lt;td&gt;$0.80&lt;/td&gt;
&lt;td&gt;$4.00&lt;/td&gt;
&lt;td&gt;Even cheaper tier option&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 3.7 (deprecated)&lt;/td&gt;
&lt;td&gt;$3.75&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;Listed as deprecated on pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 3 (deprecated)&lt;/td&gt;
&lt;td&gt;$18.75&lt;/td&gt;
&lt;td&gt;$75.00&lt;/td&gt;
&lt;td&gt;Premium, but marked deprecated&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; model availability changes. Treat the pricing table as the authoritative “what exists right now.” &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  3.3 Google Gemini (Developer API)
&lt;/h2&gt;

&lt;p&gt;Gemini pricing varies by tier and includes context caching + grounding pricing.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier (example rows from pricing page)&lt;/th&gt;
&lt;th&gt;Input / 1M (text/image/video)&lt;/th&gt;
&lt;th&gt;Output / 1M&lt;/th&gt;
&lt;th&gt;Notable extras&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini tier (row example)&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;Context caching + grounding options&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini Flash-style row example&lt;/td&gt;
&lt;td&gt;$0.10&lt;/td&gt;
&lt;td&gt;$0.40&lt;/td&gt;
&lt;td&gt;Very low output cost; good for high volume&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Gemini’s pricing page also lists:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;context caching prices&lt;/strong&gt;, and
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;grounding with Google Search&lt;/strong&gt; pricing/limits. &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3.4 DeepSeek (API)
&lt;/h2&gt;

&lt;p&gt;DeepSeek publishes pricing in its API docs and on its pricing page. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model family (per DeepSeek pricing pages)&lt;/th&gt;
&lt;th&gt;What to expect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V3 / “chat” tier&lt;/td&gt;
&lt;td&gt;Very low per-token pricing compared to many frontier models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-R1 reasoning tier&lt;/td&gt;
&lt;td&gt;Higher than chat tier, still aggressively priced&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  4) Latency: don’t use fake “average seconds” tables
&lt;/h1&gt;

&lt;p&gt;Most blog latency tables are either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;measured on one day, one region, one payload, then recycled forever, or&lt;/li&gt;
&lt;li&gt;pure fiction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, use &lt;strong&gt;two metrics you can actually observe&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;1) &lt;strong&gt;TTFT (time to first token)&lt;/strong&gt; — how fast streaming starts&lt;br&gt;&lt;br&gt;
2) &lt;strong&gt;Tokens/sec&lt;/strong&gt; — how fast output arrives once it starts&lt;/p&gt;

&lt;h2&gt;
  
  
  4.1 Practical latency expectations (directional)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;“Mini/Flash” tiers usually win TTFT and throughput for chat-style workloads.&lt;/li&gt;
&lt;li&gt;“Reasoning” tiers typically have slower TTFT and may output more tokens (more thinking), so perceived latency increases.&lt;/li&gt;
&lt;li&gt;Long context inputs increase latency &lt;em&gt;everywhere&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4.2 How to benchmark for your own product (a 15-minute method)
&lt;/h2&gt;

&lt;p&gt;Create a small benchmark script that sends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the same prompt (e.g., 400–800 tokens),&lt;/li&gt;
&lt;li&gt;fixed max output (e.g., 300 tokens),&lt;/li&gt;
&lt;li&gt;in your target region,&lt;/li&gt;
&lt;li&gt;for 30–50 runs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Record:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;p50 / p95 TTFT,&lt;/li&gt;
&lt;li&gt;p50 / p95 total time,&lt;/li&gt;
&lt;li&gt;tokens/sec.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then make the decision with data, not vibes.&lt;/p&gt;
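&lt;p&gt;A sketch of that harness. The &lt;code&gt;fake_stream&lt;/code&gt; stub is a stand-in so the code runs on its own; swap in your SDK’s streaming call:&lt;/p&gt;

```python
import statistics
import time

def percentile(values, p):
    """Nearest-rank percentile; good enough for a 30-50 run benchmark."""
    ordered = sorted(values)
    k = min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))
    return ordered[k]

def benchmark(stream_fn, runs=30):
    """Measure TTFT and tokens/sec over repeated runs of a token stream."""
    ttfts, totals, rates = [], [], []
    for _ in range(runs):
        start = time.perf_counter()
        first = None
        n_tokens = 0
        for _token in stream_fn():
            if first is None:
                first = time.perf_counter() - start  # time to first token
            n_tokens += 1
        total = time.perf_counter() - start
        ttfts.append(first)
        totals.append(total)
        rates.append(n_tokens / total if total else 0.0)
    return {
        "p50_ttft": percentile(ttfts, 50),
        "p95_ttft": percentile(ttfts, 95),
        "p50_total": percentile(totals, 50),
        "p95_total": percentile(totals, 95),
        "tokens_per_sec": statistics.mean(rates),
    }

# Stub stream so the harness runs standalone; replace with a real API stream.
def fake_stream():
    for _ in range(300):
        yield "tok"
```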




&lt;h1&gt;
  
  
  5) Compatibility: why “tooling fit” beats raw model quality
&lt;/h1&gt;

&lt;p&gt;A model that’s 5% “smarter” but breaks your stack is a net loss.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.1 Prompt + API surface compatibility (what breaks when you switch models)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;OpenAI&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;th&gt;Open-source (self-host)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Strong “system instruction” control&lt;/td&gt;
&lt;td&gt;Yes (explicit system role)&lt;/td&gt;
&lt;td&gt;Yes (instructions patterns supported)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Depends on serving stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool / function calling&lt;/td&gt;
&lt;td&gt;Widely used in ecosystem&lt;/td&gt;
&lt;td&gt;Supported via tools patterns (provider-specific)&lt;/td&gt;
&lt;td&gt;Supports tools + grounding options&lt;/td&gt;
&lt;td&gt;Often “prompt it to emit JSON”, no native tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured output reliability&lt;/td&gt;
&lt;td&gt;Strong with constraints&lt;/td&gt;
&lt;td&gt;Strong, especially on long text&lt;/td&gt;
&lt;td&gt;Strong with explicit schemas&lt;/td&gt;
&lt;td&gt;Varies a lot; needs examples + validators&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Caching / batch primitives&lt;/td&gt;
&lt;td&gt;Cached input pricing published&lt;/td&gt;
&lt;td&gt;Provider features vary&lt;/td&gt;
&lt;td&gt;Context caching explicitly priced&lt;/td&gt;
&lt;td&gt;You implement caching yourself&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  5.2 Ecosystem fit (a.k.a. “what do you already use?”)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;If you live in &lt;strong&gt;Google Workspace / Vertex-style workflows&lt;/strong&gt;, Gemini integration + grounding options can be a natural fit. &lt;/li&gt;
&lt;li&gt;If you rely on a broad third-party automation ecosystem, OpenAI + Claude both have mature SDK + tooling coverage (LangChain etc.).
&lt;/li&gt;
&lt;li&gt;If you need &lt;strong&gt;data residency / on-prem&lt;/strong&gt;, open-source models (Llama/Qwen) let you keep data inside your boundary, but you pay in MLOps. &lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  6) The decision checklist: pick models like an engineer
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Step 1 — classify the task
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High volume / low stakes&lt;/strong&gt;: tagging, rewrite, FAQ, extraction
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium stakes&lt;/strong&gt;: customer support replies, internal reporting
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High stakes&lt;/strong&gt;: legal, finance, security, medical-like domains (be careful)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 2 — decide your stack (the “2–3 model rule”)
&lt;/h2&gt;

&lt;p&gt;A common setup:&lt;/p&gt;

&lt;p&gt;1) &lt;strong&gt;Fast cheap tier&lt;/strong&gt; for most requests&lt;br&gt;&lt;br&gt;
2) &lt;strong&gt;Strong tier&lt;/strong&gt; for hard prompts, long context, tricky reasoning&lt;br&gt;&lt;br&gt;
3) Optional: &lt;strong&gt;realtime&lt;/strong&gt; or &lt;strong&gt;deep reasoning&lt;/strong&gt; tier for specific UX/features&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3 — cost control strategy (before you ship)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;enforce output length limits
&lt;/li&gt;
&lt;li&gt;cache repeated system/context
&lt;/li&gt;
&lt;li&gt;batch homogeneous jobs
&lt;/li&gt;
&lt;li&gt;add escalation rules (don’t send everything to your most expensive model)&lt;/li&gt;
&lt;/ul&gt;
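&lt;p&gt;The escalation rule can be a few lines of code. The model names and the prompt-length heuristic below are illustrative assumptions, not provider guidance:&lt;/p&gt;

```python
# Illustrative tiers; substitute whatever models your stack actually uses.
CHEAP_MODEL = "gpt-4o-mini"
STRONG_MODEL = "gpt-4.1"

def pick_model(prompt, prior_failures=0):
    """Default to the cheap tier; escalate long or previously-failed prompts."""
    long_prompt = len(prompt) > 4000  # crude proxy for context pressure
    if prior_failures > 0 or long_prompt:
        return STRONG_MODEL
    return CHEAP_MODEL
```

&lt;p&gt;In practice you would add domain signals (task class, confidence from a validator) to the escalation condition, but the shape stays the same: cheap by default, expensive by exception.&lt;/p&gt;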




&lt;h1&gt;
  
  
  7) A practical comparison table you can paste into a PRD
&lt;/h1&gt;

&lt;p&gt;Here’s a short “copy/paste” table for stakeholders.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Default pick&lt;/th&gt;
&lt;th&gt;Escalate to&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Customer support chatbot&lt;/td&gt;
&lt;td&gt;Latency + cost&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;gpt-4o-mini&lt;/code&gt; (or Gemini Flash-tier)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;gpt-4.1&lt;/code&gt; / Claude higher tier&lt;/td&gt;
&lt;td&gt;Cheap 80–90%, escalate only ambiguous cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long document synthesis&lt;/td&gt;
&lt;td&gt;Context + format stability&lt;/td&gt;
&lt;td&gt;Claude tier with strong long-form behaviour&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gpt-4.1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Long prompts + structured output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coding helper in IDE&lt;/td&gt;
&lt;td&gt;Tooling + correctness&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;gpt-4.1&lt;/code&gt; or equivalent&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;o3&lt;/code&gt; / &lt;code&gt;o1&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Deep reasoning for tricky bugs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privacy-sensitive internal assistant&lt;/td&gt;
&lt;td&gt;Data boundary&lt;/td&gt;
&lt;td&gt;Self-host Llama/Qwen&lt;/td&gt;
&lt;td&gt;Cloud model for non-sensitive output&lt;/td&gt;
&lt;td&gt;Keep raw data in-house&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  Final take
&lt;/h1&gt;

&lt;p&gt;“Best model” is not a thing.&lt;/p&gt;

&lt;p&gt;There’s only &lt;strong&gt;best model for this prompt, this latency budget, this cost envelope, and this ecosystem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you ship with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a measured benchmark,&lt;/li&gt;
&lt;li&gt;a 2–3 model stack,&lt;/li&gt;
&lt;li&gt;strict output constraints,&lt;/li&gt;
&lt;li&gt;and caching/batching,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…you’ll outperform teams who chase the newest model every month.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>promptengineering</category>
      <category>openai</category>
    </item>
    <item>
      <title>When AI Can Make “Perfect Decisions”: Why Dynamic Contracts Are the Real Safety Layer</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Fri, 16 Jan 2026 17:28:48 +0000</pubDate>
      <link>https://forem.com/superorange0707/when-ai-can-make-perfect-decisions-why-dynamic-contracts-are-the-real-safety-layer-390</link>
      <guid>https://forem.com/superorange0707/when-ai-can-make-perfect-decisions-why-dynamic-contracts-are-the-real-safety-layer-390</guid>
      <description>&lt;h1&gt;
  
  
  The “Perfect Decision” Trap
&lt;/h1&gt;

&lt;p&gt;We’re entering the era where AI doesn’t just answer questions — it &lt;strong&gt;selects actions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Supply chain routing. Credit risk. Fraud detection. Treatment planning. Portfolio optimisation.&lt;br&gt;&lt;br&gt;
The pitch is always the same:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Give the model data and objectives, and it will find the best move.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And in a narrow, mathematical sense, it can.&lt;/p&gt;

&lt;p&gt;But here’s the catch: optimisation is a superpower &lt;strong&gt;and&lt;/strong&gt; a liability.&lt;/p&gt;

&lt;p&gt;Because if a system can optimise &lt;em&gt;perfectly&lt;/em&gt;, it can also optimise &lt;strong&gt;perfectly for the wrong thing&lt;/strong&gt; — quietly, consistently, at scale.&lt;/p&gt;

&lt;p&gt;That’s why the most important design problem isn’t “make the AI smarter.”&lt;br&gt;&lt;br&gt;
It’s “make the relationship between humans and AI &lt;em&gt;adaptive, observable, and enforceable&lt;/em&gt;.”&lt;/p&gt;

&lt;p&gt;Call that relationship a &lt;strong&gt;dynamic contract&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  1) Why “Perfect” AI Decisions Are a Double-Edged Sword
&lt;/h1&gt;

&lt;p&gt;AI’s “perfection” is usually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;statistical&lt;/strong&gt; (best expected value given assumptions),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;objective-driven&lt;/strong&gt; (maximise what you told it to maximise),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;context-blind&lt;/strong&gt; (it doesn’t feel the consequences).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A model can deliver the highest-return portfolio while ignoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reputational risk,&lt;/li&gt;
&lt;li&gt;regulatory risk,&lt;/li&gt;
&lt;li&gt;long-term trust erosion,&lt;/li&gt;
&lt;li&gt;human welfare.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A model can produce the fastest medical plan while ignoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;quality of life,&lt;/li&gt;
&lt;li&gt;patient preferences,&lt;/li&gt;
&lt;li&gt;risk tolerance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI can optimise the &lt;em&gt;map&lt;/em&gt; while humans live on the &lt;em&gt;territory&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The problem is not malice.&lt;br&gt;&lt;br&gt;
It’s that &lt;strong&gt;objectives are incomplete&lt;/strong&gt;, and the world changes faster than your policy doc.&lt;/p&gt;




&lt;h1&gt;
  
  
  2) Static Rules vs Dynamic Contracts
&lt;/h1&gt;

&lt;p&gt;Static rules are how we’ve governed software for decades:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Do X, don’t do Y.”&lt;/li&gt;
&lt;li&gt;“If this, then that.”&lt;/li&gt;
&lt;li&gt;“Hard limits.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They’re easy to explain, test, and audit — until they meet reality.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.1 The limits of static rules
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1) The world changes, your rules don’t
&lt;/h3&gt;

&lt;p&gt;Market regimes shift. User behaviour shifts. Regulations shift. Data pipelines shift.&lt;br&gt;&lt;br&gt;
Static rules drift from reality, and “optimal” actions start producing weird harm.&lt;/p&gt;

&lt;h3&gt;
  
  
  2) Objective–value mismatch grows over time
&lt;/h3&gt;

&lt;p&gt;A fixed objective function (“maximise conversion”, “minimise cost”) slowly detaches from what you &lt;em&gt;mean&lt;/em&gt; (“healthy growth”, “fair treatment”, “sustainable outcomes”).&lt;/p&gt;

&lt;h3&gt;
  
  
  3) Risk accumulates silently
&lt;/h3&gt;

&lt;p&gt;When the system makes thousands of decisions per hour, small misalignments compound.&lt;br&gt;&lt;br&gt;
Static constraints become a &lt;strong&gt;thin fence&lt;/strong&gt; around a fast-moving machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.2 Dynamic contracts (the upgrade)
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;dynamic contract&lt;/strong&gt; is not “no rules.” It’s &lt;strong&gt;rules with a control system&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;goals can be updated,&lt;/li&gt;
&lt;li&gt;constraints can be tightened or relaxed,&lt;/li&gt;
&lt;li&gt;the system is monitored continuously,&lt;/li&gt;
&lt;li&gt;humans can intervene,&lt;/li&gt;
&lt;li&gt;accountability is explicit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think: not a fence — a &lt;strong&gt;safety harness with sensors, alarms, and a manual brake&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  3) What a Dynamic Contract Actually Looks Like
&lt;/h1&gt;

&lt;p&gt;A dynamic contract has four components. Miss one, and you’re back to vibes.&lt;/p&gt;

&lt;h2&gt;
  
  
  3.1 Continuous adjustment (rules are living, not laminated)
&lt;/h2&gt;

&lt;p&gt;A dynamic contract assumes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;objectives evolve,&lt;/li&gt;
&lt;li&gt;risk tolerances evolve,&lt;/li&gt;
&lt;li&gt;incentives evolve.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the system must support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;updating thresholds,&lt;/li&gt;
&lt;li&gt;changing objective weights,&lt;/li&gt;
&lt;li&gt;enabling/disabling actions by context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not “moving goalposts.”&lt;br&gt;&lt;br&gt;
It’s acknowledging that &lt;strong&gt;the goalposts move whether you admit it or not&lt;/strong&gt;.&lt;/p&gt;
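&lt;p&gt;One way to sketch a “living” contract: an immutable config object where every change produces a new, auditable version. All names here are illustrative:&lt;/p&gt;

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Contract:
    version: int
    risk_threshold: float
    objective_weights: dict
    allowed_actions: frozenset

    def updated(self, **changes):
        """Every change produces a new, auditable version."""
        return replace(self, version=self.version + 1, **changes)

v1 = Contract(
    version=1,
    risk_threshold=0.2,
    objective_weights={"return": 0.7, "stability": 0.3},
    allowed_actions=frozenset({"rebalance", "hold"}),
)
# Tighten risk and disable an action class without redeploying the model.
v2 = v1.updated(risk_threshold=0.1,
                allowed_actions=v1.allowed_actions - {"rebalance"})
```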

&lt;h2&gt;
  
  
  3.2 Real-time observability (decisions must be inspectable)
&lt;/h2&gt;

&lt;p&gt;If the system can’t show:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what it decided,&lt;/li&gt;
&lt;li&gt;why it decided,&lt;/li&gt;
&lt;li&gt;what data it used,&lt;/li&gt;
&lt;li&gt;what constraints were active,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…then you don’t have governance. You have hope.&lt;/p&gt;

&lt;p&gt;Observability means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;decision logs,&lt;/li&gt;
&lt;li&gt;feature/inputs snapshots,&lt;/li&gt;
&lt;li&gt;model version and prompt version tracking,&lt;/li&gt;
&lt;li&gt;anomaly alerts (distribution shift, rising error rates, unusual outputs).&lt;/li&gt;
&lt;/ul&gt;
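&lt;p&gt;Concretely, a decision log entry can be one structured record per decision. The field names below are illustrative:&lt;/p&gt;

```python
import json
import time
import uuid

def decision_record(action, inputs, constraints, model_version, prompt_version):
    """One inspectable decision, serialised as a JSON line for your log store."""
    return json.dumps({
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "action": action,                # what it decided
        "inputs": inputs,                # snapshot of the data it used
        "constraints": constraints,      # which rules were active
        "model_version": model_version,  # plus prompt version, for replayability
        "prompt_version": prompt_version,
    })

line = decision_record("approve_credit", {"score": 0.91},
                       ["limit_usd: 5000"], "m-2026.01", "p-17")
```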

&lt;h2&gt;
  
  
  3.3 Human override (intervention must be executable)
&lt;/h2&gt;

&lt;p&gt;A contract without an override is a ceremony.&lt;/p&gt;

&lt;p&gt;You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;pause switches&lt;/strong&gt; (kill switch / degrade mode),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;policy overrides&lt;/strong&gt; (block a class of actions),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;manual approvals&lt;/strong&gt; for high-risk actions,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;rollback&lt;/strong&gt; to a previous safe configuration.&lt;/li&gt;
&lt;/ul&gt;
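&lt;p&gt;A sketch of how those overrides compose. The class and method names (&lt;code&gt;Mode&lt;/code&gt;, &lt;code&gt;OverrideController&lt;/code&gt;) are assumptions for illustration:&lt;/p&gt;

```python
from enum import Enum

class Mode(Enum):
    NORMAL = "normal"      # autonomous within the contract
    DEGRADED = "degraded"  # smaller scope, more human review
    PAUSED = "paused"      # kill switch: no automated actions

class OverrideController:
    def __init__(self):
        self.mode = Mode.NORMAL
        self.blocked_actions = set()  # policy overrides: banned action classes

    def pause(self):
        self.mode = Mode.PAUSED

    def degrade(self):
        self.mode = Mode.DEGRADED

    def block(self, action_class):
        self.blocked_actions.add(action_class)

    def allows(self, action_class, high_risk=False):
        if self.mode is Mode.PAUSED:
            return False
        if action_class in self.blocked_actions:
            return False
        if high_risk or self.mode is Mode.DEGRADED:
            return "needs_human_approval"  # route to manual approval
        return True

ctl = OverrideController()
ctl.block("bulk_refund")
print(ctl.allows("reroute"))      # True
print(ctl.allows("bulk_refund"))  # False
```

&lt;p&gt;Rollback is then just redeploying a previous contract version through the same controller.&lt;/p&gt;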

&lt;h2&gt;
  
  
  3.4 Responsibility chain (power and risk must align)
&lt;/h2&gt;

&lt;p&gt;If AI makes decisions, who owns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;outcomes?&lt;/li&gt;
&lt;li&gt;regressions?&lt;/li&gt;
&lt;li&gt;incidents?&lt;/li&gt;
&lt;li&gt;compliance?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dynamic contracts require a clear chain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;who approves contract changes,&lt;/li&gt;
&lt;li&gt;who monitors alerts,&lt;/li&gt;
&lt;li&gt;who signs off on high-risk domains,&lt;/li&gt;
&lt;li&gt;how you do post-incident review.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is less “ethics theatre,” more &lt;strong&gt;on-call rotation for decision systems&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  4) Dynamic Contracts as a Control Loop (Not a Buzzword)
&lt;/h1&gt;

&lt;p&gt;At a systems level, this is a closed loop:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mzqpedfi3v34cc1y6ux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mzqpedfi3v34cc1y6ux.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This loop is the difference between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“it worked in staging” and&lt;/li&gt;
&lt;li&gt;“it survives the real world.”&lt;/li&gt;
&lt;/ul&gt;
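&lt;p&gt;The loop can be sketched in a few lines. This is a toy, assuming a contract dict with adjustable fields (&lt;code&gt;risk_limit&lt;/code&gt;, &lt;code&gt;human_review&lt;/code&gt;) and an arbitrary alert threshold:&lt;/p&gt;

```python
# Toy closed loop: observe outcomes, compare against a tolerance,
# feed adjustments back into the contract. Names are assumptions.
def run_control_loop(decisions, contract, alert_threshold=0.2):
    """Tighten the contract when the observed error rate drifts too high."""
    errors = sum(1 for d in decisions if d["outcome"] == "error")
    error_rate = errors / len(decisions)
    if error_rate > alert_threshold:
        contract["risk_limit"] *= 0.5    # tighten risk limits
        contract["human_review"] = True  # pull humans back into the loop
    return error_rate, contract

contract = {"risk_limit": 1000.0, "human_review": False}
decisions = [{"outcome": "ok"}] * 7 + [{"outcome": "error"}] * 3
rate, contract = run_control_loop(decisions, contract)
print(rate, contract["human_review"])  # 0.3 True
```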




&lt;h1&gt;
  
  
  5) Three Real-World Patterns Where Dynamic Contracts Matter
&lt;/h1&gt;

&lt;h2&gt;
  
  
  5.1 Supply chain: “lowest cost” vs “lowest risk”
&lt;/h2&gt;

&lt;p&gt;A routing model might optimise purely for cost.&lt;br&gt;&lt;br&gt;
But real operations have constraints that appear mid-flight:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;strike actions,&lt;/li&gt;
&lt;li&gt;supplier delays,&lt;/li&gt;
&lt;li&gt;customs bottlenecks,&lt;/li&gt;
&lt;li&gt;seasonal demand spikes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Dynamic contract move:&lt;/strong&gt; temporarily reweight objectives toward reliability, tighten risk limits, trigger manual approval for reroutes above a threshold.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.2 Finance: “best return” vs “acceptable behaviour”
&lt;/h2&gt;

&lt;p&gt;A portfolio optimiser can deliver higher returns by exploiting correlations that become fragile under stress — or by concentrating in ethically questionable exposure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic contract move:&lt;/strong&gt; enforce shifting exposure caps, add human approval gates when volatility spikes, record decision provenance for audit.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.3 Healthcare: “fastest recovery” vs “patient values”
&lt;/h2&gt;

&lt;p&gt;AI can recommend the most statistically effective treatment, but “best” depends on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;side effects tolerance,&lt;/li&gt;
&lt;li&gt;personal priorities,&lt;/li&gt;
&lt;li&gt;comorbidities,&lt;/li&gt;
&lt;li&gt;informed consent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Dynamic contract move:&lt;/strong&gt; require preference capture, enforce explainability, and make clinician override first-class, not an afterthought.&lt;/p&gt;




&lt;h1&gt;
  
  
  6) How to Implement Dynamic Contracts (Without Building a Religion)
&lt;/h1&gt;

&lt;p&gt;Here’s the pragmatic blueprint.&lt;/p&gt;

&lt;h2&gt;
  
  
  6.1 Start with a contract schema
&lt;/h2&gt;

&lt;p&gt;Define the contract in machine-readable form (YAML/JSON), e.g.:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;objective weights&lt;/li&gt;
&lt;li&gt;hard constraints&lt;/li&gt;
&lt;li&gt;approval thresholds&lt;/li&gt;
&lt;li&gt;escalation rules&lt;/li&gt;
&lt;li&gt;logging requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Treat it like code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;version it&lt;/li&gt;
&lt;li&gt;review it&lt;/li&gt;
&lt;li&gt;deploy it&lt;/li&gt;
&lt;li&gt;roll it back&lt;/li&gt;
&lt;/ul&gt;
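&lt;p&gt;A minimal sketch of such a schema as a Python dict, plus a pre-deployment check. The section names are assumptions, not a standard:&lt;/p&gt;

```python
# A machine-readable contract, treated like code: versioned, reviewed,
# deployable, rollbackable. Section names are illustrative.
contract_v2 = {
    "version": 2,
    "objective_weights": {"cost": 0.6, "reliability": 0.4},
    "hard_constraints": ["respect_supplier_blocklist", "cap_reroute_cost_500"],
    "approval_thresholds": {"reroute_cost": 300},
    "escalation": {"on_anomaly": "page_oncall"},
    "logging": {"decision_log": True, "inputs_snapshot": True},
}

REQUIRED_SECTIONS = {"version", "objective_weights", "hard_constraints",
                     "approval_thresholds", "escalation", "logging"}

def validate_contract(contract):
    """Reject a contract missing required sections before it can deploy."""
    missing = REQUIRED_SECTIONS - contract.keys()
    if missing:
        raise ValueError(f"contract missing sections: {sorted(missing)}")
    if abs(sum(contract["objective_weights"].values()) - 1.0) > 1e-9:
        raise ValueError("objective weights must sum to 1")
    return True

print(validate_contract(contract_v2))  # True
```

&lt;p&gt;Validation in CI is what makes “review it, deploy it, roll it back” more than a slogan.&lt;/p&gt;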

&lt;h2&gt;
  
  
  6.2 Add a “policy engine” layer
&lt;/h2&gt;

&lt;p&gt;Your model shouldn’t directly execute actions.&lt;br&gt;&lt;br&gt;
It should propose actions that pass through a policy layer.&lt;/p&gt;

&lt;p&gt;Policy layer responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;enforce constraints,&lt;/li&gt;
&lt;li&gt;require approvals,&lt;/li&gt;
&lt;li&gt;route to safe fallbacks,&lt;/li&gt;
&lt;li&gt;attach provenance metadata.&lt;/li&gt;
&lt;/ul&gt;
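&lt;p&gt;A sketch of that gatekeeping logic. The function name and contract fields are assumptions; the point is that the model proposes and the policy layer disposes:&lt;/p&gt;

```python
def decide(proposal, contract):
    """Gate a proposed action: execute, needs_approval, or rejected."""
    # 1) Enforce hard constraints
    if proposal["action"] in contract["blocked_actions"]:
        return {"status": "rejected", "reason": "blocked action class"}
    # 2) Require human approval above the risk threshold
    if proposal["risk_score"] > contract["approval_threshold"]:
        return {"status": "needs_approval", "reason": "risk above threshold"}
    # 3) Attach provenance metadata so every execution is auditable
    return {
        "status": "execute",
        "provenance": {
            "contract_version": contract["version"],
            "model_version": proposal["model_version"],
        },
    }

contract = {"version": 12, "blocked_actions": {"mass_email"},
            "approval_threshold": 0.7}
result = decide({"action": "reroute", "risk_score": 0.4,
                 "model_version": "m7"}, contract)
print(result["status"])  # execute
```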

&lt;h2&gt;
  
  
  6.3 Add monitoring that’s tied to actions, not dashboards
&lt;/h2&gt;

&lt;p&gt;Dashboards are passive. You need &lt;strong&gt;alerts&lt;/strong&gt; linked to contract changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“False positive rate increased after contract v12”&lt;/li&gt;
&lt;li&gt;“Decision distribution drifted post-update”&lt;/li&gt;
&lt;li&gt;“High-risk actions exceeded threshold”&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6.4 Build the incident playbook now, not after the incident
&lt;/h2&gt;

&lt;p&gt;At minimum:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stop-the-world switch&lt;/li&gt;
&lt;li&gt;degrade mode (smaller scope, higher human review)&lt;/li&gt;
&lt;li&gt;rollback to last safe contract&lt;/li&gt;
&lt;li&gt;postmortem template (what changed, what broke, how detected, how prevented)&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  7) A Quick Checklist: Are You Actually Running a Dynamic Contract?
&lt;/h1&gt;

&lt;p&gt;If you answer “no” to any of these, you’re still on static rules.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can we update objectives without redeploying the model?&lt;/li&gt;
&lt;li&gt;Can we see why each decision happened (inputs + policy + version)?&lt;/li&gt;
&lt;li&gt;Do we have a kill switch and rollback that works in minutes?&lt;/li&gt;
&lt;li&gt;Do we have approval gates for high-risk actions?&lt;/li&gt;
&lt;li&gt;Can we audit who changed what and when?&lt;/li&gt;
&lt;li&gt;Do we measure harm, not just performance?&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Final Take
&lt;/h1&gt;

&lt;p&gt;AI will keep getting better at optimisation.&lt;br&gt;&lt;br&gt;
That’s not the scary part.&lt;/p&gt;

&lt;p&gt;The scary part is that &lt;strong&gt;our objectives will remain incomplete&lt;/strong&gt;, and our environments will keep changing.&lt;/p&gt;

&lt;p&gt;So the only sane way forward is to treat AI decision-making as a governed system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;not static rules,&lt;/li&gt;
&lt;li&gt;not blind trust,&lt;/li&gt;
&lt;li&gt;but a &lt;strong&gt;dynamic contract&lt;/strong&gt; — living guardrails with observability, override, and accountability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the future isn’t “AI makes decisions.”&lt;br&gt;&lt;br&gt;
It’s “humans and AI co-manage a decision system — continuously.”&lt;/p&gt;

&lt;p&gt;That’s how you get “perfect decisions” without perfect disasters.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>governance</category>
      <category>agents</category>
      <category>product</category>
    </item>
    <item>
      <title>AI Slop in 2025: The Year the Internet Started Eating Itself</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Fri, 16 Jan 2026 17:28:23 +0000</pubDate>
      <link>https://forem.com/superorange0707/ai-slop-in-2025-the-year-the-internet-started-eating-itself-4ifa</link>
      <guid>https://forem.com/superorange0707/ai-slop-in-2025-the-year-the-internet-started-eating-itself-4ifa</guid>
      <description>&lt;h1&gt;
  
  
  The Year the Internet Started Eating Itself
&lt;/h1&gt;

&lt;p&gt;If “the internet is cooked” had a corporate deck, 2025 would be slide one.&lt;/p&gt;

&lt;p&gt;The word &lt;strong&gt;slop&lt;/strong&gt;—once basically “wet leftovers for animals”—became a serious label for something we all recognise: &lt;strong&gt;cheap, mass-produced AI content that looks like information but behaves like noise.&lt;/strong&gt; Merriam-Webster didn’t just nod at the trend; it crowned “slop” Word of the Year and formalised the definition as AI-produced low-quality digital content made in quantity. &lt;/p&gt;

&lt;p&gt;Here’s the uncomfortable part: AI slop isn’t just a “quality problem.”&lt;br&gt;&lt;br&gt;
It’s a &lt;strong&gt;systems problem&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;incentives (traffic, ad money, affiliate payouts)
&lt;/li&gt;
&lt;li&gt;distribution (recommendation engines)
&lt;/li&gt;
&lt;li&gt;low marginal cost (content is basically free to generate)
&lt;/li&gt;
&lt;li&gt;weak provenance (hard to tell what’s real, or &lt;em&gt;who&lt;/em&gt; made it)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So this isn’t a rant. It’s a field report: &lt;strong&gt;scale, impact, and governance paths&lt;/strong&gt; that don’t rely on wishful thinking.&lt;/p&gt;




&lt;h1&gt;
  
  
  1) What Counts as “AI Slop” (And What Doesn’t)
&lt;/h1&gt;

&lt;p&gt;Let’s separate terms that get mixed together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deepfakes&lt;/strong&gt;: synthetic media meant to &lt;em&gt;deceive&lt;/em&gt; (often identity/voice/video).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinations&lt;/strong&gt;: model errors (it makes up citations or facts).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI slop&lt;/strong&gt;: the broader category—content churned out cheaply at scale, with low effort and low information value.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s why the Merriam-Webster definition matters. It anchors the term in two critical properties:&lt;/p&gt;

&lt;p&gt;1) &lt;strong&gt;produced usually in quantity&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
2) &lt;strong&gt;low quality&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Intent can be malicious or merely lazy. Slop doesn’t need a conspiracy; it only needs a CPM.&lt;/p&gt;




&lt;h1&gt;
  
  
  2) The Scale: Are We Actually Past the Tipping Point?
&lt;/h1&gt;

&lt;p&gt;A lot of “AI dominates the internet” claims are vibes dressed as statistics.&lt;/p&gt;

&lt;p&gt;The cleaner way to talk about scale is: &lt;strong&gt;in specific measurable slices of the web, AI output is now comparable to human output&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Graphite analysed a dataset of online English-language articles over 2020–2025 and Axios visualised the result: by &lt;strong&gt;May 2025&lt;/strong&gt;, human-written articles were about &lt;strong&gt;52%&lt;/strong&gt;, while AI-written articles were about &lt;strong&gt;48%&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;That’s not “the whole internet” (and the methodology matters), but it’s still a cultural threshold:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;search results become easier to manipulate
&lt;/li&gt;
&lt;li&gt;content farms become economically viable
&lt;/li&gt;
&lt;li&gt;“average quality” gets pulled down by volume
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also: even Wikipedia’s “AI slop” entry frames the term around mass production and monetisation dynamics. &lt;/p&gt;




&lt;h1&gt;
  
  
  3) Why 2025 Made Slop Inevitable: The Incentive Stack
&lt;/h1&gt;

&lt;p&gt;Three forces made 2025 feel different:&lt;/p&gt;

&lt;h2&gt;
  
  
  3.1 The marginal cost of content went to ~zero
&lt;/h2&gt;

&lt;p&gt;If you can generate 1,000 “SEO articles” in an afternoon, content becomes a &lt;em&gt;commodity&lt;/em&gt;.&lt;br&gt;&lt;br&gt;
And commodities flood markets.&lt;/p&gt;

&lt;h2&gt;
  
  
  3.2 Distribution is automated
&lt;/h2&gt;

&lt;p&gt;Recommendation algorithms don’t ask “is this meaningful?”&lt;br&gt;&lt;br&gt;
They ask “will someone pause for 1.2 seconds?”&lt;/p&gt;

&lt;h2&gt;
  
  
  3.3 Monetisation doesn’t care who wrote it
&lt;/h2&gt;

&lt;p&gt;Ad networks pay impressions. Affiliate programmes pay conversions.&lt;br&gt;&lt;br&gt;
That’s a slop subsidy.&lt;/p&gt;

&lt;p&gt;This is why slop is not only on social media. It’s also in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;search spam
&lt;/li&gt;
&lt;li&gt;product reviews
&lt;/li&gt;
&lt;li&gt;workplace docs
&lt;/li&gt;
&lt;li&gt;low-effort “expert” explainers
&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  4) The Damage: It’s Not Just Annoying
&lt;/h1&gt;

&lt;h2&gt;
  
  
  4.1 Workslop: the workplace version of slop
&lt;/h2&gt;

&lt;p&gt;BetterUp Labs + Stanford Social Media Lab describe “workslop” as AI-generated content that looks polished but lacks substance, creating an “invisible tax” where colleagues spend time cleaning it up. &lt;/p&gt;

&lt;p&gt;The key takeaway isn’t the exact dollar figure. It’s the dynamic:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AI doesn’t always remove work; sometimes it &lt;em&gt;moves&lt;/em&gt; work downstream to the next person.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  4.2 Marketplace trust is getting sandblasted
&lt;/h2&gt;

&lt;p&gt;There’s measurable evidence of likely AI-written reviews climbing on major fast-fashion / low-cost e-commerce platforms.&lt;/p&gt;

&lt;p&gt;Originality.ai reports that likely AI-written reviews of &lt;strong&gt;Temu&lt;/strong&gt; on Trustpilot rose to &lt;strong&gt;10.90%&lt;/strong&gt; in 2025, up sharply from earlier years.&lt;br&gt;
Cybernews reports similar figures for Temu and Shein, framing this as a surge in AI-generated reviews.&lt;/p&gt;

&lt;p&gt;When reviews become synthetic, “social proof” collapses.&lt;br&gt;&lt;br&gt;
And platforms pay the cost in customer support, refunds, and reputation.&lt;/p&gt;

&lt;h2&gt;
  
  
  4.3 Elections: deepfakes as the sharpest edge of slop
&lt;/h2&gt;

&lt;p&gt;Not all slop involves deepfakes, but deepfakes are the category of slop that gets people hurt.&lt;/p&gt;

&lt;p&gt;During Ireland’s presidential race, multiple outlets reported an AI-generated video falsely claiming candidate Catherine Connolly had quit, prompting platform removals and legal/political scrutiny. &lt;/p&gt;

&lt;p&gt;Even if the election outcome doesn’t flip, the trust damage compounds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“I saw it with my own eyes” becomes unreliable
&lt;/li&gt;
&lt;li&gt;rebuttals are slower than virality
&lt;/li&gt;
&lt;li&gt;cynicism rises (“everything is fake anyway”)
&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  5) The Governance Reality: The 3-Layer Strategy That Might Work
&lt;/h1&gt;

&lt;p&gt;We don’t “solve” slop with one law or one detector.&lt;br&gt;&lt;br&gt;
We reduce it by attacking three layers: &lt;strong&gt;provenance&lt;/strong&gt;, &lt;strong&gt;distribution&lt;/strong&gt;, and &lt;strong&gt;economics&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.1 Provenance: labels, metadata, and traceability
&lt;/h2&gt;

&lt;p&gt;The Cyberspace Administration of China (CAC) published the &lt;em&gt;Measures for the Labeling of AI-Generated Synthetic Content&lt;/em&gt;, with an effective date of &lt;strong&gt;September 1, 2025&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The directional lesson isn’t “copy China.” It’s that &lt;strong&gt;labelling becomes part of infrastructure&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;visible labels for humans (“AI-generated”)&lt;/li&gt;
&lt;li&gt;machine-readable metadata for platforms and auditors&lt;/li&gt;
&lt;li&gt;rules against stripping labels&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5.2 Distribution: platform friction beats perfect detection
&lt;/h2&gt;

&lt;p&gt;Detection is hard—and getting harder.&lt;/p&gt;

&lt;p&gt;Research on “faithfulness hallucination” detection is improving (e.g., FaithLens proposes joint detection + explanation).&lt;br&gt;&lt;br&gt;
But even great detectors won’t catch everything, and false positives create backlash.&lt;/p&gt;

&lt;p&gt;So platforms will need &lt;em&gt;friction&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;throttle low-reputation bulk posters
&lt;/li&gt;
&lt;li&gt;rate-limit new domains that behave like content farms
&lt;/li&gt;
&lt;li&gt;reduce reach for unverified media in high-risk categories (health, elections)
&lt;/li&gt;
&lt;li&gt;“are you sure?” prompts before resharing unverified clips
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Friction is less glamorous than AI-vs-AI, but it works because it changes behaviour.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.3 Economics: stop paying for junk
&lt;/h2&gt;

&lt;p&gt;If ad networks and affiliate programmes keep paying for slop, the slop machine will keep running.&lt;/p&gt;

&lt;p&gt;A practical governance move:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;require provenance and publisher reputation for premium ads
&lt;/li&gt;
&lt;li&gt;penalise repeat slop farms (not just single pages)
&lt;/li&gt;
&lt;li&gt;increase the cost of automation (verification, KYC, rate caps)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the “follow the money” layer, and it’s the most neglected.&lt;/p&gt;




&lt;h1&gt;
  
  
  6) The Weird Twist: Even “Words of the Year” Became a Signal
&lt;/h1&gt;

&lt;p&gt;You’ll see sloppy claims floating around, like “The Economist picked slop as word of the year.”&lt;/p&gt;

&lt;p&gt;It didn’t. The Economist’s official Word of the Year piece for 2025 chose &lt;strong&gt;TACO&lt;/strong&gt; (“Trump Always Chickens Out”). &lt;/p&gt;

&lt;p&gt;But the broader cultural signal is real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Merriam-Webster picked &lt;strong&gt;slop&lt;/strong&gt;. &lt;/li&gt;
&lt;li&gt;Oxford chose &lt;strong&gt;rage bait&lt;/strong&gt;—a cousin phenomenon about engagement-driven outrage loops. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, they map a coherent 2025 mood:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;we’re drowning in synthetic content and engineered emotion.&lt;/strong&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  7) A Practical Playbook for Teams (Not Governments)
&lt;/h1&gt;

&lt;p&gt;If you’re not a regulator, here’s what you can do inside a company/product:&lt;/p&gt;

&lt;h2&gt;
  
  
  7.1 For content teams
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;make “human-reviewed” a product feature
&lt;/li&gt;
&lt;li&gt;measure &lt;strong&gt;information density&lt;/strong&gt; (how much substance per token)
&lt;/li&gt;
&lt;li&gt;require sources (or ban claims without citations)&lt;/li&gt;
&lt;/ul&gt;
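&lt;p&gt;One crude way to approximate information density: unique content words per token. The stopword list and cleaning rules below are assumptions to tune for your corpus, not a vetted metric:&lt;/p&gt;

```python
# Crude proxy: unique content words per token. Higher means more substance.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
             "that", "this", "it", "for", "on", "with", "as", "very", "really"}

def information_density(text):
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    # Distinct non-stopword tokens carry the "substance"
    content = {t for t in tokens if t not in STOPWORDS}
    return len(content) / len(tokens)

dense = "Retry with exponential backoff: base 200ms, factor 2, cap 30s."
fluffy = "This is a very very important thing that is really really important."
print(information_density(dense) > information_density(fluffy))  # True
```

&lt;p&gt;It won’t catch sophisticated slop, but it flags the padding-heavy kind cheaply.&lt;/p&gt;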

&lt;h2&gt;
  
  
  7.2 For product + platform teams
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;treat provenance as a first-class signal
&lt;/li&gt;
&lt;li&gt;add friction for reposting and bulk upload
&lt;/li&gt;
&lt;li&gt;build reputation systems that can’t be trivially gamed by bots&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  7.3 For engineering teams shipping AI features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;default to “assist, don’t autoplay”
&lt;/li&gt;
&lt;li&gt;add “verification steps” in workflows (citations, checks, validation prompts)
&lt;/li&gt;
&lt;li&gt;log and audit generated content downstream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is not “no AI.”&lt;br&gt;&lt;br&gt;
The goal is &lt;strong&gt;no cheap amplification of low-value output.&lt;/strong&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Final Take: Slop Is an Incentive Failure Wearing a Tech Costume
&lt;/h1&gt;

&lt;p&gt;AI slop exists because it’s profitable to produce and cheap to distribute.&lt;/p&gt;

&lt;p&gt;So the fix is unsexy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;provenance infrastructure
&lt;/li&gt;
&lt;li&gt;platform friction
&lt;/li&gt;
&lt;li&gt;economic disincentives
&lt;/li&gt;
&lt;li&gt;and literacy (people learning what “synthetic” looks like)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we do nothing, the web doesn’t “end.”&lt;br&gt;&lt;br&gt;
It just becomes a place where &lt;strong&gt;signal costs money&lt;/strong&gt; and &lt;strong&gt;noise is free&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And that is how a civilisation ends up eating content troughs for breakfast.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aigc</category>
      <category>promptengineering</category>
      <category>governance</category>
    </item>
    <item>
      <title>Instruction Tuning and Custom Instruction Libraries: Your Model’s Real ‘Operating Manual’</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Wed, 31 Dec 2025 14:02:17 +0000</pubDate>
      <link>https://forem.com/superorange0707/instruction-tuning-and-custom-instruction-libraries-your-models-real-operating-manual-1lcd</link>
      <guid>https://forem.com/superorange0707/instruction-tuning-and-custom-instruction-libraries-your-models-real-operating-manual-1lcd</guid>
      <description>&lt;h1&gt;
  
  
  Prompt Tricks Don’t Scale. Instruction Tuning Does.
&lt;/h1&gt;

&lt;p&gt;If you’ve ever shipped an LLM feature, you know the pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You craft a gorgeous prompt.&lt;/li&gt;
&lt;li&gt;It works… &lt;em&gt;until&lt;/em&gt; real users show up.&lt;/li&gt;
&lt;li&gt;Suddenly your “polite customer support bot” becomes a poetic philosopher who forgets the refund policy.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s the moment you realise: &lt;strong&gt;prompting is configuration; Instruction Tuning is installation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instruction Tuning is how you teach a model to treat your requirements like default behaviour—not a suggestion it can “creatively interpret”.&lt;/p&gt;




&lt;h1&gt;
  
  
  What Is Instruction Tuning, Really?
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Definition
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Instruction Tuning&lt;/strong&gt; is a post-training technique that trains a language model on &lt;strong&gt;Instruction–Response pairs&lt;/strong&gt; so it learns to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;understand the intent behind a task (“summarise”, “classify”, “extract”, “fix code”)&lt;/li&gt;
&lt;li&gt;produce output that matches &lt;strong&gt;format, tone, and constraints&lt;/strong&gt; reliably&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, you’re moving from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Generate coherent text”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Execute tasks like a dependable system component.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  A quick intuition
&lt;/h3&gt;

&lt;p&gt;A base model may respond to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Summarise this document”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;with something long, vague, and slightly dramatic.&lt;/p&gt;

&lt;p&gt;A tuned model learns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;what counts as a summary&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;what “key points” actually means&lt;/li&gt;
&lt;li&gt;how to keep it short, structured, and consistent&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Instruction Tuning vs Prompt Tuning: The Difference That Matters
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Instruction Tuning&lt;/th&gt;
&lt;th&gt;Traditional Prompt Tuning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Where it acts&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Model weights&lt;/strong&gt; (behaviour changes)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Input text&lt;/strong&gt; only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data need&lt;/td&gt;
&lt;td&gt;Needs &lt;strong&gt;many&lt;/strong&gt; labelled examples&lt;/td&gt;
&lt;td&gt;Needs few examples&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Long-term, repeated tasks&lt;/strong&gt; (support, compliance, extraction)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Ad-hoc tasks&lt;/strong&gt; (one-off writing, translation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistence&lt;/td&gt;
&lt;td&gt;Behaviour sticks after training&lt;/td&gt;
&lt;td&gt;You redesign prompts repeatedly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Barrier&lt;/td&gt;
&lt;td&gt;Higher (data + training setup)&lt;/td&gt;
&lt;td&gt;Lower (just write prompts)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you’re running the same workflows every day, &lt;strong&gt;Instruction Tuning pays off fast&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Custom Instruction Library: Your “Ammo Depot”
&lt;/h1&gt;

&lt;p&gt;Instruction Tuning is the strategy.&lt;br&gt;&lt;br&gt;
A &lt;strong&gt;custom instruction library&lt;/strong&gt; is the ammunition.&lt;/p&gt;

&lt;p&gt;It’s a curated dataset that encodes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what tasks you care about&lt;/li&gt;
&lt;li&gt;how they should be handled&lt;/li&gt;
&lt;li&gt;how outputs must look and sound&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it as your model’s &lt;strong&gt;operating manual&lt;/strong&gt;, written in training data.&lt;/p&gt;


&lt;h1&gt;
  
  
  What a High-Quality Instruction Pair Must Contain
&lt;/h1&gt;

&lt;p&gt;Every example should have four parts. No shortcuts.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Task type&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
e.g., summarisation, translation, extraction, code repair, sentiment analysis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Instruction&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Clear, unambiguous, with output constraints (format, length, tone)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Input&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The raw material: document, conversation log, code snippet, medical notes, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reference response&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The &lt;em&gt;ideal&lt;/em&gt; output: correct, complete, consistent, and formatted&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  A simple schema (JSONL)
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"task"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"refund_support"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"instruction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Reply in friendly UK English. Confirm refund status and estimate payout time. Avoid jargon."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Order #A91822. I requested a refund yesterday due to a faulty item. What's happening and when will I get the money?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Hi! I can see refund request for order A91822 is approved and currently processing. You should receive the funds within 1–3 working days (depending on your bank). You don’t need to do anything else — we’ll notify you once it’s completed."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Notice what’s happening: &lt;strong&gt;tone, region, and format are all part of the spec&lt;/strong&gt;.&lt;/p&gt;
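&lt;p&gt;A small loader can enforce the four-field schema above before any example reaches training. The filename and checks here are illustrative:&lt;/p&gt;

```python
import json
import tempfile

REQUIRED_FIELDS = ("task", "instruction", "input", "output")

def load_instruction_pairs(path):
    """Load a JSONL instruction library, rejecting incomplete examples."""
    pairs = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [k for k in REQUIRED_FIELDS if not record.get(k)]
            if missing:
                raise ValueError(f"line {lineno}: missing fields {missing}")
            pairs.append(record)
    return pairs

# Quick demo with a throwaway file
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write(json.dumps({"task": "refund_support",
                        "instruction": "Reply in friendly UK English.",
                        "input": "Order #A91822 ...",
                        "output": "Hi! ..."}) + "\n")
    demo_path = f.name

print(len(load_instruction_pairs(demo_path)))  # 1
```

&lt;p&gt;Failing fast on incomplete pairs is cheaper than debugging a model that learned from them.&lt;/p&gt;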


&lt;h1&gt;
  
  
  Design Principles That Actually Move the Needle
&lt;/h1&gt;
&lt;h2&gt;
  
  
  1) Coverage: hit the long tail, not just the happy path
&lt;/h2&gt;

&lt;p&gt;If you’re tuning for e-commerce support, don’t only include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Where’s my parcel?”&lt;/li&gt;
&lt;li&gt;“I want a refund.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also include the messy real world:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;partial refunds&lt;/li&gt;
&lt;li&gt;missing items&lt;/li&gt;
&lt;li&gt;“Royal Mail says delivered but it isn’t”&lt;/li&gt;
&lt;li&gt;chargebacks&lt;/li&gt;
&lt;li&gt;angry customers who won’t provide an order number&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A model trained on only “clean” scenarios will panic the first time the input isn’t.&lt;/p&gt;
&lt;h2&gt;
  
  
  2) Precision: remove ambiguity from your instructions
&lt;/h2&gt;

&lt;p&gt;Bad instruction:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Handle this user request.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Better instruction:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Classify the sentiment as Positive/Neutral/Negative, then give a one-sentence reason.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Best instruction:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Return JSON exactly: {\"label\": \"Positive|Neutral|Negative\", \"reason\": \"...\"}. No extra text.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  3) Diversity: vary inputs aggressively
&lt;/h2&gt;

&lt;p&gt;Include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;short vs long inputs&lt;/li&gt;
&lt;li&gt;slang, typos, mixed languages&lt;/li&gt;
&lt;li&gt;formal tickets vs WhatsApp-style messages&lt;/li&gt;
&lt;li&gt;different difficulty levels&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your production users are a chaos generator. Train accordingly.&lt;/p&gt;
&lt;h2&gt;
  
  
  4) Consistency: standardise output formats like you mean it
&lt;/h2&gt;

&lt;p&gt;If 200 “Order status” answers are formatted differently, the model will learn inconsistency.&lt;/p&gt;

&lt;p&gt;Pick a template and enforce it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fixed headings&lt;/li&gt;
&lt;li&gt;stable phrasing&lt;/li&gt;
&lt;li&gt;stable field order&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;
  
  
  Three Domain Examples (With Slightly More Realism)
&lt;/h1&gt;
&lt;h2&gt;
  
  
  1) Healthcare: extract case notes into a structured table (with safety)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Task:&lt;/strong&gt; clinical note extraction&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Instruction:&lt;/strong&gt; extract fields, present as a table, avoid inventing missing values&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input (example):&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Patient: male, 45. Complains of cough and phlegm for 8 weeks, worsened with chest tightness for 1 week. History: hypertension x5 years, on nifedipine. CT: inflammation in both lower lobes. WBC: 12.5×10^9/L (high). Preliminary diagnosis: community-acquired pneumonia; hypertension grade 2.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Output (example):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Demographics&lt;/td&gt;
&lt;td&gt;Age: 45; Sex: Male&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Presenting complaint&lt;/td&gt;
&lt;td&gt;Cough and phlegm for 8 weeks; worse with chest tightness for 1 week&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;History&lt;/td&gt;
&lt;td&gt;Hypertension for 5 years; nifedipine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tests&lt;/td&gt;
&lt;td&gt;CT: bilateral lower-lobe inflammation; WBC: 12.5×10^9/L (elevated)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Preliminary diagnosis&lt;/td&gt;
&lt;td&gt;Community-acquired pneumonia; Hypertension (grade 2)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: In medical settings, you typically add safety rules (“don’t diagnose beyond provided evidence”) directly into the instruction library.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  2) Programming: fix broken Python (and explain why)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Input (broken code):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nums&lt;/span&gt;
        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output (fixed + explanation):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# missing colon fixed
&lt;/span&gt;        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Explanation: &lt;code&gt;for&lt;/code&gt; loops in Python require a colon (&lt;code&gt;:&lt;/code&gt;). Without it, the interpreter can’t parse the loop body.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(A general tip while you’re here: avoid naming a variable &lt;code&gt;sum&lt;/code&gt;; it shadows Python’s built-in &lt;code&gt;sum()&lt;/code&gt;.)&lt;/em&gt;&lt;/p&gt;
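
&lt;p&gt;Beyond the syntax fix, the function above still crashes with &lt;code&gt;ZeroDivisionError&lt;/code&gt; on an empty list. A minimal defensive sketch (returning &lt;code&gt;None&lt;/code&gt; for empty input is a design choice for illustration, not part of the original example):&lt;/p&gt;

```python
def calculate_average(nums):
    """Return the arithmetic mean of nums, or None for an empty input."""
    if not nums:
        return None  # avoid ZeroDivisionError on empty input
    total = 0
    for n in nums:
        total += n
    return total / len(nums)
```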

&lt;h2&gt;
  
  
  3) Education: generate a Year 4 maths word problem (UK flavour)
&lt;/h2&gt;

&lt;p&gt;Instruction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate &lt;strong&gt;one&lt;/strong&gt; word problem&lt;/li&gt;
&lt;li&gt;Topic: two-digit × one-digit multiplication&lt;/li&gt;
&lt;li&gt;Include: scenario, known facts, question&lt;/li&gt;
&lt;li&gt;Use £ and UK context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; School fair&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Known facts:&lt;/strong&gt; Each ticket costs &lt;strong&gt;£24&lt;/strong&gt;. A parent buys &lt;strong&gt;3&lt;/strong&gt; tickets.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Question:&lt;/strong&gt; How much do they pay in total?&lt;/p&gt;




&lt;h1&gt;
  
  
  Implementation Workflow: From Library to Tuned Model
&lt;/h1&gt;

&lt;p&gt;Instruction Tuning is a pipeline. If you skip steps, you pay later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Build the dataset
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Sources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Internal logs:&lt;/strong&gt; support tickets, agent replies, summaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Public datasets:&lt;/strong&gt; Alpaca, FLAN, ShareGPT (filter aggressively)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human labelling:&lt;/strong&gt; for tasks where correctness matters&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cleaning checklist
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;remove vague instructions (“handle this”, “do the thing”)&lt;/li&gt;
&lt;li&gt;remove wrong/incomplete outputs&lt;/li&gt;
&lt;li&gt;deduplicate near-identical samples&lt;/li&gt;
&lt;li&gt;standardise formatting and field names&lt;/li&gt;
&lt;/ul&gt;
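
&lt;p&gt;The cleaning checklist above can be sketched as a small filter pass. The &lt;code&gt;instruction&lt;/code&gt;/&lt;code&gt;output&lt;/code&gt; field names and the length threshold are assumptions for illustration; real pipelines usually add smarter near-duplicate detection:&lt;/p&gt;

```python
def clean_pairs(pairs, min_len=15):
    """Drop vague/empty instruction pairs and deduplicate near-identical ones.

    Each pair is a dict with 'instruction' and 'output' keys (assumed schema).
    """
    seen = set()
    cleaned = []
    for p in pairs:
        inst = p.get("instruction", "").strip()
        out = p.get("output", "").strip()
        if len(inst) < min_len or not out:   # remove vague or incomplete entries
            continue
        key = (inst.lower(), out.lower())    # crude case-insensitive dedup key
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"instruction": inst, "output": out})
    return cleaned
```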

&lt;h3&gt;
  
  
  Split
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Train: 70–80%&lt;/li&gt;
&lt;li&gt;Validation: 10–15%&lt;/li&gt;
&lt;li&gt;Test: 10–15%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No leakage. No overlap.&lt;/p&gt;
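
&lt;p&gt;A shuffled split with those proportions, sketched in a few lines (the fixed seed keeps the split reproducible; deduplicate &lt;em&gt;before&lt;/em&gt; splitting or near-identical samples can still leak across sets):&lt;/p&gt;

```python
import random

def split_dataset(pairs, train=0.8, val=0.1, seed=42):
    """Shuffle and split into train/validation/test with no overlap."""
    rng = random.Random(seed)
    shuffled = pairs[:]          # copy so we don't mutate the caller's list
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```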




&lt;h2&gt;
  
  
  Step 2: Choose the base model (with reality constraints)
&lt;/h2&gt;

&lt;p&gt;Pick based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;task complexity&lt;/li&gt;
&lt;li&gt;deployment constraints&lt;/li&gt;
&lt;li&gt;compute budget&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A practical rule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;7B–13B&lt;/strong&gt; models + LoRA work well for many enterprise workflows&lt;/li&gt;
&lt;li&gt;bigger models help with harder reasoning and richer generation&lt;/li&gt;
&lt;li&gt;if you need on-device or edge, plan for quantisation and smaller footprints&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 3: Fine-tune strategy and hyperparameters
&lt;/h2&gt;

&lt;p&gt;Typical starting points (LoRA):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;learning rate: &lt;strong&gt;1e-4 to 5e-5&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;batch size: &lt;strong&gt;8–32&lt;/strong&gt; (use gradient accumulation if needed)&lt;/li&gt;
&lt;li&gt;epochs: &lt;strong&gt;3–5&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;early stopping if validation stops improving&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LoRA is popular because it’s efficient: you train small adapter matrices instead of all weights.&lt;/p&gt;
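
&lt;p&gt;To see why the adapters are so cheap, compare parameter counts: a full &lt;code&gt;d_out × d_in&lt;/code&gt; update has &lt;code&gt;d_out·d_in&lt;/code&gt; weights, while LoRA’s two low-rank factors B (&lt;code&gt;d_out × r&lt;/code&gt;) and A (&lt;code&gt;r × d_in&lt;/code&gt;) have only &lt;code&gt;r·(d_out + d_in)&lt;/code&gt;. A quick back-of-envelope check (the 4096 dimension is just an illustrative projection size):&lt;/p&gt;

```python
def lora_param_counts(d_out, d_in, r):
    """Trainable parameters: full weight update vs. LoRA adapters B and A."""
    full = d_out * d_in            # full-rank delta-W
    lora = r * (d_out + d_in)      # B (d_out x r) plus A (r x d_in)
    return full, lora

# For a 4096x4096 projection at rank 8, the adapter is under 0.4% of the full matrix.
full, lora = lora_param_counts(4096, 4096, 8)
```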




&lt;h2&gt;
  
  
  Step 4: Evaluate like you’re going to ship it
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Quantitative (depends on task)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;classification accuracy (for extract/classify)&lt;/li&gt;
&lt;li&gt;BLEU/ROUGE (for summarise/translate — imperfect but useful)&lt;/li&gt;
&lt;li&gt;perplexity (language quality proxy)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Qualitative (the “would you trust this?” test)
&lt;/h3&gt;

&lt;p&gt;Get 3–5 reviewers to score:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;instruction adherence&lt;/li&gt;
&lt;li&gt;completeness&lt;/li&gt;
&lt;li&gt;format correctness&lt;/li&gt;
&lt;li&gt;domain alignment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also run &lt;strong&gt;scenario tests&lt;/strong&gt;: 10–20 realistic edge cases.&lt;/p&gt;
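
&lt;p&gt;Scenario tests are easy to automate. A minimal harness sketch, where &lt;code&gt;model_fn&lt;/code&gt; and the &lt;code&gt;(input, check)&lt;/code&gt; scenario shape are assumptions for illustration (the stub below stands in for a real model call):&lt;/p&gt;

```python
def run_scenario_tests(model_fn, scenarios):
    """Run edge-case inputs through model_fn; return the inputs whose check failed.

    Each scenario is (input_text, check) where check(output) -> bool.
    """
    failures = []
    for text, check in scenarios:
        output = model_fn(text)
        if not check(output):
            failures.append(text)
    return failures

# A stub "model" plus two format checks; the second deliberately fails.
stub = lambda text: "Status: shipped"
scenarios = [
    ("Where is order 123?", lambda out: out.startswith("Status:")),
    ("Refund order 456", lambda out: "Refund" in out),
]
failures = run_scenario_tests(stub, scenarios)
```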




&lt;h2&gt;
  
  
  Step 5: Deploy + keep tuning
&lt;/h2&gt;

&lt;p&gt;After deployment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;log response times, failures, “format drift”&lt;/li&gt;
&lt;li&gt;collect bad outputs and user feedback&lt;/li&gt;
&lt;li&gt;convert them into new instruction pairs&lt;/li&gt;
&lt;li&gt;periodically re-tune&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instruction libraries are living assets.&lt;/p&gt;




&lt;h1&gt;
  
  
  A Practical UK-Focused Case Study: E‑Commerce Support
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Goal
&lt;/h3&gt;

&lt;p&gt;Teach a model to handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;order status&lt;/li&gt;
&lt;li&gt;refunds&lt;/li&gt;
&lt;li&gt;product recommendations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;consistent format&lt;/li&gt;
&lt;li&gt;friendly UK English&lt;/li&gt;
&lt;li&gt;accurate, policy-compliant answers&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Dataset (example proportions)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;300 order status&lt;/li&gt;
&lt;li&gt;400 refunds&lt;/li&gt;
&lt;li&gt;300 recommendations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Training setup (example)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;base model: a chat-tuned 7B model&lt;/li&gt;
&lt;li&gt;LoRA rank: 8&lt;/li&gt;
&lt;li&gt;LoRA dropout: 0.05&lt;/li&gt;
&lt;li&gt;epochs: 3&lt;/li&gt;
&lt;li&gt;a single consumer-class GPU machine (or a cloud instance)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Deployment trick
&lt;/h3&gt;

&lt;p&gt;Quantise to 4-bit / 8-bit for serving efficiency, then integrate with your order/refund systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model drafts the response&lt;/li&gt;
&lt;li&gt;the system injects verified order facts&lt;/li&gt;
&lt;li&gt;the final message is generated with hard constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This hybrid approach dramatically reduces hallucinations, because order facts come from your systems rather than the model’s memory.&lt;/p&gt;
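
&lt;p&gt;The “system injects verified order facts” step can be as simple as templating over data fetched from the order system, so the model never gets a chance to invent an order number or ETA. A sketch, where the field names and template are illustrative assumptions:&lt;/p&gt;

```python
def draft_with_verified_facts(order, template):
    """Fill a response template with facts from the order system.

    The model can draft or rank templates, but the specifics (ID, status, ETA)
    are always injected from verified data — never generated.
    """
    return template.format(
        order_id=order["id"],
        status=order["status"],
        eta=order["eta"],
    )

order = {"id": "UK-10042", "status": "dispatched", "eta": "Tuesday"}
template = ("Hi! Your order {order_id} has been {status} "
            "and should arrive by {eta}.")
message = draft_with_verified_facts(order, template)
```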




&lt;h1&gt;
  
  
  Common Failure Modes (And Fixes)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1) “We don’t have enough data”
&lt;/h2&gt;

&lt;p&gt;If you’re below ~500 high-quality pairs, results can be shaky.&lt;/p&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generate candidate pairs with a strong model&lt;/li&gt;
&lt;li&gt;have humans review and correct&lt;/li&gt;
&lt;li&gt;do light data augmentation (rephrase, swap names, vary order numbers)&lt;/li&gt;
&lt;/ul&gt;
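
&lt;p&gt;The “swap names, vary order numbers” augmentation can be done with simple templating. A sketch; the &lt;code&gt;{name}&lt;/code&gt;/&lt;code&gt;{order}&lt;/code&gt; placeholder convention is an assumption about how you’d parameterise your seed pairs:&lt;/p&gt;

```python
import random

def augment_pair(pair, names, rng=None):
    """Produce one variant of a templated instruction pair by swapping the
    customer name and generating a fresh order number."""
    rng = rng or random.Random(0)   # fixed seed only for reproducible demos
    name = rng.choice(names)
    order = f"UK-{rng.randint(10000, 99999)}"
    return {
        "instruction": pair["instruction"].format(name=name, order=order),
        "output": pair["output"].format(name=name, order=order),
    }

base = {"instruction": "Where is {name}'s order {order}?",
        "output": "Order {order} for {name} is in transit."}
variant = augment_pair(base, ["Aisha", "Tom", "Priya"])
```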

&lt;h2&gt;
  
  
  2) Overfitting: great on train, bad on test
&lt;/h2&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reduce epochs&lt;/li&gt;
&lt;li&gt;add diversity&lt;/li&gt;
&lt;li&gt;add dropout/weight decay&lt;/li&gt;
&lt;li&gt;use early stopping based on validation performance&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3) Domain terms confuse the model
&lt;/h2&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;add “term explanation” examples&lt;/li&gt;
&lt;li&gt;include lightweight domain Q&amp;amp;A pairs so the model builds vocabulary&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4) Output format keeps drifting
&lt;/h2&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;standardise reference outputs&lt;/li&gt;
&lt;li&gt;make formatting requirements explicit&lt;/li&gt;
&lt;li&gt;add negative examples (“do not output extra text”)&lt;/li&gt;
&lt;/ul&gt;
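
&lt;p&gt;Format drift is also cheap to detect at serving time. If your reference outputs are JSON-only, a guard like this sketch catches the classic “Sure! Here is the JSON: …” drift before it reaches users:&lt;/p&gt;

```python
import json

def check_json_only(output):
    """Return True only if the model emitted a single JSON object and nothing
    else — a cheap guard against extra prose wrapping the payload."""
    stripped = output.strip()
    if not (stripped.startswith("{") and stripped.endswith("}")):
        return False
    try:
        json.loads(stripped)
        return True
    except json.JSONDecodeError:
        return False
```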




&lt;h1&gt;
  
  
  The Future: Auto-Instructions, Multimodal, and Edge Fine-Tuning
&lt;/h1&gt;

&lt;p&gt;Where this is heading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auto-instruction generation&lt;/strong&gt;: models produce draft datasets, humans curate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;multimodal tuning&lt;/strong&gt;: text + images + audio instructions become normal (e.g., “analyse this product photo and write a listing”)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;lighter tuning on edge devices&lt;/strong&gt;: smaller models + efficient adapters + on-device updates&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Final Take
&lt;/h1&gt;

&lt;p&gt;Prompting is a great way to &lt;em&gt;ask&lt;/em&gt; a model to behave.&lt;br&gt;&lt;br&gt;
Instruction Tuning is how you &lt;em&gt;teach&lt;/em&gt; it to behave.&lt;/p&gt;

&lt;p&gt;If you want reliable outputs across many tasks, stop writing prompts like spells—and start building a &lt;strong&gt;custom instruction library&lt;/strong&gt; like a real product asset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;comprehensive coverage&lt;/li&gt;
&lt;li&gt;precise instructions&lt;/li&gt;
&lt;li&gt;diverse inputs&lt;/li&gt;
&lt;li&gt;consistent outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s how you get “one fine-tune, many tasks” without babysitting the model forever.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>promptengineering</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Stop Fine-Tuning Everything: Inject Knowledge with Few‑Shot In‑Context Learning</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Wed, 31 Dec 2025 13:50:57 +0000</pubDate>
      <link>https://forem.com/superorange0707/stop-fine-tuning-everything-inject-knowledge-with-few-shot-in-context-learning-40gb</link>
      <guid>https://forem.com/superorange0707/stop-fine-tuning-everything-inject-knowledge-with-few-shot-in-context-learning-40gb</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;You don’t always need a new model. Most of the time, you just need better examples.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Large language models aren’t “smart” in the human sense; they’re &lt;em&gt;pattern addicts&lt;/em&gt;.&lt;br&gt;&lt;br&gt;
Give them the right patterns in the prompt, and they start behaving like a domain expert you never trained.&lt;/p&gt;

&lt;p&gt;This is basically what &lt;strong&gt;Few‑Shot In‑Context Learning (ICL)&lt;/strong&gt; is about:&lt;br&gt;&lt;br&gt;
instead of retraining or fine‑tuning the model, you &lt;strong&gt;inject knowledge directly through carefully crafted examples in the prompt&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this piece, we’ll walk through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What “knowledge injection” really means in practice&lt;/li&gt;
&lt;li&gt;How Few‑Shot‑in‑Context sits between “do nothing” and “full fine‑tune”&lt;/li&gt;
&lt;li&gt;A concrete mental model of how LLMs learn from examples&lt;/li&gt;
&lt;li&gt;Five battle‑tested design principles for good few‑shot prompts&lt;/li&gt;
&lt;li&gt;Cross‑industry examples (medical, finance, programming, education, law)&lt;/li&gt;
&lt;li&gt;Six common failure modes — and how to debug them&lt;/li&gt;
&lt;li&gt;Where this technique is going next (RAG, multi‑modal, personal knowledge, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re tired of hearing “we need to fine‑tune a custom model” every time someone wants to add new knowledge… this article is for you.&lt;/p&gt;


&lt;h2&gt;
  
  
  1. Why “knowledge injection” matters (and where Few‑Shot ICL fits)
&lt;/h2&gt;

&lt;p&gt;Most real‑world LLM problems boil down to one painful fact:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The model doesn’t actually know what you care about.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are two big gaps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fresh knowledge&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Anything that happened &lt;em&gt;after&lt;/em&gt; the model’s training cutoff:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2024–2025 policy changes
&lt;/li&gt;
&lt;li&gt;New internal processes
&lt;/li&gt;
&lt;li&gt;Product updates, pricing changes, API deprecations…&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Niche / long‑tail knowledge&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Things that were never common enough to appear in large quantities online:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A tiny vertical’s industry standard
&lt;/li&gt;
&lt;li&gt;Rare diseases’ treatment guidelines
&lt;/li&gt;
&lt;li&gt;Your company’s weird internal naming conventions&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The classic response is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Let’s fine‑tune a model.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which sounds cool until you realise what that implies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Curate + clean + label a dataset&lt;/li&gt;
&lt;li&gt;Pay for training (GPU hours, infra, dev time)&lt;/li&gt;
&lt;li&gt;Wait days or weeks&lt;/li&gt;
&lt;li&gt;Repeat whenever something changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For many teams (and almost all solo builders), that’s overkill.&lt;/p&gt;
&lt;h3&gt;
  
  
  Enter Few‑Shot‑in‑Context
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Few‑Shot‑in‑Context Learning&lt;/strong&gt; = you don’t touch model weights.&lt;br&gt;&lt;br&gt;
You just:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick 3–5 good examples&lt;/strong&gt; that:

&lt;ul&gt;
&lt;li&gt;Embed the knowledge you care about&lt;/li&gt;
&lt;li&gt;Match the task format you want&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stick them into the prompt&lt;/strong&gt; before the actual query&lt;/li&gt;
&lt;li&gt;Let the model &lt;strong&gt;infer the pattern&lt;/strong&gt; and apply it to new inputs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You’ve just &lt;em&gt;“fine‑tuned”&lt;/em&gt; the model’s behaviour… without training anything.&lt;/p&gt;

&lt;p&gt;Example: suppose you want the model to answer questions about a &lt;strong&gt;2025 EV subsidy policy&lt;/strong&gt; that it’s never seen.&lt;/p&gt;

&lt;p&gt;Instead of fine‑tuning, you can drop something like this into the prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: Decide if a car model qualifies for the 2025 EV subsidy and explain why.

Example 1
Input: Battery EV, 500km range
Output: Eligible. Reason: battery electric vehicle with range ≥ 500km gets a 15,000 GBP subsidy.

Example 2
Input: Plug‑in hybrid, 520km range
Output: Eligible. Reason: plug‑in hybrid with range ≥ 500km also gets 15,000 GBP.

Example 3
Input: Battery EV, 480km range
Output: Not eligible. Reason: 480km &amp;lt; 500km threshold, so no subsidy.

Now solve this:
Input: Battery EV, 530km range
Output:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From these 3 examples, the model can infer the rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“Type ∈ {BEV, PHEV} AND range ≥ 500km → 15k subsidy; otherwise 0.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s &lt;strong&gt;knowledge injection via examples&lt;/strong&gt;.&lt;/p&gt;
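
&lt;p&gt;In code, that prompt is just string assembly. A minimal builder sketch that keeps the &lt;code&gt;Task → Examples → Query&lt;/code&gt; layout consistent (the example strings mirror the subsidy illustration above):&lt;/p&gt;

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a Task -> Example 1..N -> query prompt in one consistent format."""
    parts = [f"Task: {task}", ""]
    for i, (inp, out) in enumerate(examples, 1):
        parts += [f"Example {i}", f"Input: {inp}", f"Output: {out}", ""]
    parts += ["Now solve this:", f"Input: {query}", "Output:"]
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    "Decide if a car model qualifies for the 2025 EV subsidy and explain why.",
    [("Battery EV, 500km range",
      "Eligible. Reason: range >= 500km gets a 15,000 GBP subsidy."),
     ("Battery EV, 480km range",
      "Not eligible. Reason: 480km < 500km threshold, so no subsidy.")],
    "Battery EV, 530km range",
)
```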




&lt;h2&gt;
  
  
  2. How Few‑Shot‑in‑Context actually works (mental model)
&lt;/h2&gt;

&lt;p&gt;To understand why this works, forget about gradient descent and transformers for a second and think in more human terms.&lt;/p&gt;

&lt;p&gt;When you few‑shot an LLM, it roughly goes through three internal phases:&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Example parsing: “What’s going on here?”
&lt;/h3&gt;

&lt;p&gt;The model scans your prompt and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detects the &lt;strong&gt;task pattern&lt;/strong&gt;:
“Ah, it’s always &lt;code&gt;Task → Input → Output&lt;/code&gt;.”&lt;/li&gt;
&lt;li&gt;Extracts &lt;strong&gt;knowledge bits&lt;/strong&gt;:
“EV subsidy depends on:

&lt;ul&gt;
&lt;li&gt;type: BEV / PHEV vs fuel car&lt;/li&gt;
&lt;li&gt;range: threshold 500km&lt;/li&gt;
&lt;li&gt;standard subsidy amount: 15k”&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;It’s not memorising the car models; it’s capturing the &lt;em&gt;relationship&lt;/em&gt; between input features and output labels.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Pattern induction: “What’s the general rule?”
&lt;/h3&gt;

&lt;p&gt;Given multiple examples, the model tries to generalise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Example A: BEV, 500km → 15k
&lt;/li&gt;
&lt;li&gt;Example B: PHEV, 520km → 15k
&lt;/li&gt;
&lt;li&gt;Example C: BEV, 480km → 0
&lt;/li&gt;
&lt;li&gt;Example D: Fuel car, 600km → 0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It can then infer something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Subsidy depends on being a new energy vehicle AND hitting the range threshold. Fuel cars never qualify.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the “few‑shot learning” part: very little data, but high‑signal structure.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.3 Task transfer: “Apply that rule over here”
&lt;/h3&gt;

&lt;p&gt;When the user sends a new query — say:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: PHEV, 510km
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model doesn’t go “I’ve seen this exact string before”.&lt;br&gt;&lt;br&gt;
It goes: “This matches the pattern → I know the rule → apply → 15k subsidy.”&lt;/p&gt;

&lt;p&gt;This is the crucial distinction:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Good few‑shot prompts teach the model the &lt;em&gt;logic&lt;/em&gt;, not just the &lt;em&gt;answer&lt;/em&gt;.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If your examples are just random Q&amp;amp;A with no visible structure, the model may end up parroting instead of reasoning. The rest of this article is about avoiding that.&lt;/p&gt;


&lt;h2&gt;
  
  
  3. Five design principles for Few‑Shot knowledge injection
&lt;/h2&gt;

&lt;p&gt;You can’t just throw examples at the model and hope for the best.&lt;br&gt;&lt;br&gt;
The &lt;em&gt;quality&lt;/em&gt; of your examples is 90% of the game.&lt;/p&gt;

&lt;p&gt;Here are five principles I keep coming back to.&lt;/p&gt;


&lt;h3&gt;
  
  
  3.1 Cover both &lt;strong&gt;core dimensions&lt;/strong&gt; and &lt;strong&gt;edge cases&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Real rules always have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normal cases (most inputs)&lt;/li&gt;
&lt;li&gt;Edge cases (boundary conditions, exclusions, weird situations)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your examples should cover both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bad pattern (only “happy paths”):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example 1: BEV, 500km → subsidy
Example 2: PHEV, 520km → subsidy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model might incorrectly infer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“If it’s an EV and the number looks big-ish, say ‘subsidy’.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Better pattern (explicit edges):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example 1: BEV, 500km → eligible (base case)
Example 2: PHEV, 520km → eligible (second base case)
Example 3: BEV, 480km → not eligible (range too low)
Example 4: Fuel car, 600km → not eligible (wrong type)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the model has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dimensions: drive type, range&lt;/li&gt;
&lt;li&gt;Edges:

&lt;ul&gt;
&lt;li&gt;Range &amp;lt; threshold&lt;/li&gt;
&lt;li&gt;Non‑EV types&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Design rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For any rule you’re encoding, ask:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;“What are the obvious ‘no’ cases? Did I show at least one of each?”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  3.2 Keep the &lt;strong&gt;format&lt;/strong&gt; perfectly consistent
&lt;/h3&gt;

&lt;p&gt;Models are surprisingly picky about &lt;em&gt;format&lt;/em&gt;.&lt;br&gt;&lt;br&gt;
If your examples look one way and your “real” task looks another, performance tanks.&lt;/p&gt;

&lt;p&gt;You need consistency at three levels:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Structure&lt;/strong&gt; — same high‑level layout&lt;br&gt;&lt;br&gt;
e.g. always &lt;code&gt;Task → Input → Output&lt;/code&gt;, in that order.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Terminology&lt;/strong&gt; — same words for the same concepts&lt;br&gt;&lt;br&gt;
Don’t alternate between &lt;code&gt;EV&lt;/code&gt;, &lt;code&gt;electric car&lt;/code&gt;, &lt;code&gt;new energy car&lt;/code&gt; unless you really have to.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Output shape&lt;/strong&gt; — same style of answer&lt;br&gt;&lt;br&gt;
e.g. always “Yes/No + Reason”, or always a JSON object, or always a 3‑bullet explanation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Inconsistent example (what &lt;em&gt;not&lt;/em&gt; to do):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example 1: “2025 EV subsidy: BEV A, 500km → 15k.”   (terse)
Example 2: “According to the 2025 regulation, model B qualifies for…”
Task: “Explain whether model C qualifies and justify your reasoning.”
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model has to guess which style to imitate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistent example (what you want):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: Decide if the car qualifies for subsidy and explain why.

Example 1
Input: Type=BEV, Range=500km
Output: Eligible.
Reason: Battery EV with range ≥ 500km meets the threshold; base subsidy is 15,000 GBP.

Example 2
Input: Type=PHEV, Range=490km
Output: Not eligible.
Reason: Plug‑in hybrid but range 490km &amp;lt; 500km threshold, so no subsidy.

Now solve this:

Input: Type=BEV, Range=530km
Output:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything lines up, so the model can just &lt;strong&gt;continue the pattern&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  3.3 Sample count: 3–5 examples is usually enough
&lt;/h3&gt;

&lt;p&gt;This one is empirical but holds up annoyingly well:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Less than 3 examples&lt;/strong&gt; → shaky generalisation, easy overfitting&lt;br&gt;&lt;br&gt;
&lt;strong&gt;More than 5–6 examples&lt;/strong&gt; → diminishing returns + wasted context&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Why not 20 examples? Because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLMs have a &lt;strong&gt;context window&lt;/strong&gt;; you pay in tokens&lt;/li&gt;
&lt;li&gt;Extra examples can actually &lt;strong&gt;dilute&lt;/strong&gt; the pattern if they’re noisy&lt;/li&gt;
&lt;li&gt;You often don’t &lt;em&gt;have&lt;/em&gt; 20 high‑quality labelled examples for a new rule&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The sweet spot for most tasks is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3–4 examples&lt;/strong&gt; for simple rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4–5 examples&lt;/strong&gt; when you need to show edge cases + 1–2 tricky combos&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you feel you need 12 examples, it’s often a smell that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You haven’t factored the rule cleanly, or&lt;/li&gt;
&lt;li&gt;You’re mixing multiple tasks into one prompt&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3.4 Your examples must be &lt;strong&gt;factually correct&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This sounds obvious, but in practice… we all copy‑paste in a hurry.&lt;/p&gt;

&lt;p&gt;The model &lt;strong&gt;assumes&lt;/strong&gt; your examples are ground truth.&lt;br&gt;&lt;br&gt;
If you smuggle in a bug, it’ll confidently reproduce it everywhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example of hidden poison:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example 1: Range ≥ 500km → subsidy 15,000
Example 2: Range 520km → subsidy 20,000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the model has to guess whether:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You changed the policy halfway through, or&lt;/li&gt;
&lt;li&gt;You made a mistake&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There’s no way for it to “correct” you; it’s not cross‑checking with the internet.&lt;/p&gt;

&lt;p&gt;Treat your few‑shot block as &lt;strong&gt;production code&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Double‑check numbers &amp;amp; thresholds&lt;/li&gt;
&lt;li&gt;Make sure all examples obey the &lt;em&gt;same rule&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;If you’re encoding law / medicine / finance, verify against the source document&lt;/li&gt;
&lt;/ul&gt;
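
&lt;p&gt;One way to make “all examples obey the same rule” mechanical: encode the rule once as a function, then assert every few-shot example against it before the prompt ships. A sketch using this article’s illustrative subsidy rule:&lt;/p&gt;

```python
def subsidy(vehicle_type, range_km):
    """The article's illustrative rule: BEV/PHEV with range >= 500km -> 15,000; else 0."""
    if vehicle_type in {"BEV", "PHEV"} and range_km >= 500:
        return 15000
    return 0

# Each few-shot example as (type, range_km, expected subsidy).
examples = [("BEV", 500, 15000), ("PHEV", 520, 15000),
            ("BEV", 480, 0), ("Fuel", 600, 0)]

# Any mismatch here means a poisoned example made it into the prompt.
bad = [(t, r) for t, r, expected in examples if subsidy(t, r) != expected]
```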




&lt;h3&gt;
  
  
  3.5 Order your examples from &lt;strong&gt;simple → complex&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Humans don’t like learning by being dropped straight into edge cases; neither do LLMs.&lt;/p&gt;

&lt;p&gt;If your first example already mixes three special conditions, the model might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Over‑weight that weird case&lt;/li&gt;
&lt;li&gt;Miss the simpler underlying rule&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A nicer pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Base case A
&lt;/li&gt;
&lt;li&gt;Base case B
&lt;/li&gt;
&lt;li&gt;Clear negative / boundary
&lt;/li&gt;
&lt;li&gt;Then &lt;em&gt;maybe&lt;/em&gt; one tricky combo&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example in subsidy land:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example 1: Simple positive (type OK + range OK)
Example 2: Simple negative (type OK + range too low)
Example 3: Simple negative (wrong type)
Example 4: Complex (meets base rule + qualifies for extra top‑up)
Example 5: Complex (meets base rule but not the extra condition)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By the time you show examples 4 and 5, the model already &lt;em&gt;knows&lt;/em&gt; the base rule and can now learn the extra dimension cleanly.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Cross‑industry case studies
&lt;/h2&gt;

&lt;p&gt;Let’s look at how this plays out in different domains.&lt;br&gt;&lt;br&gt;
Same technique, very different flavours.&lt;/p&gt;

&lt;p&gt;We’ll keep the prompts compact but expressive; you can expand them in your own stack.&lt;/p&gt;


&lt;h3&gt;
  
  
  4.1 Medical: encoding a rare disease guideline (HAE)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You want the model to assist clinicians with a &lt;strong&gt;2024 guideline&lt;/strong&gt; on a rare disease:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Hereditary Angioedema (HAE)&lt;br&gt;&lt;br&gt;
– recurrent non‑itchy swelling&lt;br&gt;&lt;br&gt;
– C1‑INH deficiency&lt;br&gt;&lt;br&gt;
– often misdiagnosed as allergy&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The base model either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has never seen the latest guideline, or&lt;/li&gt;
&lt;li&gt;Hallucinates based on generic “allergy” patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of fine‑tuning, you inject the guideline logic as examples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Few‑shot pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You define a task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: Based on the 2024 HAE guideline, decide:
1) Does this case meet the diagnostic criteria?
2) What is the reasoning?
3) What is an appropriate initial management plan?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you show 3–4 cases that span:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Typical family case&lt;/li&gt;
&lt;li&gt;Clear non‑HAE allergic case&lt;/li&gt;
&lt;li&gt;“Sporadic” case with no family history but positive lab markers&lt;/li&gt;
&lt;li&gt;Another non‑HAE case with itchy rash&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each example is shaped like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example 1
Input (summary):
- Recurrent facial swelling, non‑itchy
- Mother with similar episodes
- C1‑INH activity 20% (low)

Output:
1) Diagnosis: Consistent with HAE.
2) Reasoning: Non‑itchy angioedema + positive family history + low C1‑INH.
3) Management: …
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then your real query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Now analyse this patient:

- 35‑year‑old male
- Recurrent eyelid swelling for 2 years
- No itch, no urticaria
- Father with similar symptoms
- C1‑INH activity 25% (low), C4 reduced

Return the same 3‑part structure.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model doesn’t need to “know” HAE from training — it learns the diagnostic logic on the fly from your few examples.&lt;/p&gt;

&lt;p&gt;You still need a human in the loop (this is medicine), but the model stops hallucinating generic “take some antihistamines” advice.&lt;/p&gt;




&lt;h3&gt;
  
  
  4.2 Finance: injecting a brand‑new IPO regulation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You’re building an assistant for investment bankers analysing &lt;strong&gt;2025 IPO rules&lt;/strong&gt; for a specific exchange (say, a revised science‑tech board in the UK).&lt;/p&gt;

&lt;p&gt;The rulebook changed in March 2025:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New &lt;strong&gt;market‑cap + revenue + profit&lt;/strong&gt; combos&lt;/li&gt;
&lt;li&gt;Special rules for &lt;strong&gt;red‑chip / dual‑class&lt;/strong&gt; structures&lt;/li&gt;
&lt;li&gt;Stricter restrictions on &lt;strong&gt;use of proceeds&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your base model has never seen this document.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Few‑shot pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You declare a task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: Based on the 2025 revised IPO rules, decide whether a company qualifies for listing on the sci‑tech board. Explain your reasoning by:
- Market cap &amp;amp; financial metrics
- Tech / “hard‑tech” profile
- Use of proceeds
- Any special structure (red‑chip, VIE, dual‑class)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you show three archetypal examples:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Clean domestic hard‑tech issuer&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;R&amp;amp;D‑heavy AI chip company
&lt;/li&gt;
&lt;li&gt;Market cap ≥ 50B, revenue ≥ 10B, net profit ≥ 2B
&lt;/li&gt;
&lt;li&gt;Proceeds used to build new chip fab
&lt;/li&gt;
&lt;li&gt;Result: qualifies&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Red‑chip, loss‑making biotech&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cayman / Hong Kong structure
&lt;/li&gt;
&lt;li&gt;Market cap ≥ 150B, loss‑making but high R&amp;amp;D ratio
&lt;/li&gt;
&lt;li&gt;Proceeds for overseas R&amp;amp;D centres
&lt;/li&gt;
&lt;li&gt;Result: qualifies under special red‑chip route&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Non‑tech traditional business misusing funds&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clothing manufacturer
&lt;/li&gt;
&lt;li&gt;Market cap and profits below threshold
&lt;/li&gt;
&lt;li&gt;Wants to use proceeds to buy wealth‑management products
&lt;/li&gt;
&lt;li&gt;Result: does &lt;em&gt;not&lt;/em&gt; qualify (fails both “hard‑tech” positioning and funds‑use rule)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now when you feed in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Company D:
- Hong Kong‑registered red‑chip
- Quantum computing hardware
- Projected post‑IPO market cap: 180B
- Still loss‑making; R&amp;amp;D / revenue ratio 30%
- No VIE structure
- Proceeds used to hire researchers and develop quantum algorithms

Question: Does it qualify? Explain by the same dimensions as the examples.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model can map this onto the red‑chip example and apply the rule: “large market cap + genuine hard‑tech + loss‑making allowed under route X + proceeds used for R&amp;amp;D” → qualifies.&lt;/p&gt;

&lt;p&gt;No regulation fine‑tuning required.&lt;/p&gt;
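&lt;p&gt;If you build this prompt programmatically, the whole pattern is just a task string plus a list of input/output pairs. A minimal sketch (the example texts are condensed from the three archetypes; the helper is illustrative, not a specific SDK):&lt;/p&gt;

```python
# Sketch: assembling the few-shot block from structured examples.
# The example strings are condensed from the archetypes above.

EXAMPLES = [
    ("Company A: domestic AI chip maker, market cap 50B, revenue 10B, "
     "net profit 2B, proceeds fund a new chip fab.",
     "Qualifies: meets the financial thresholds, hard-tech, proper funds use."),
    ("Company B: red-chip loss-making biotech, market cap 150B, high R&D "
     "ratio, proceeds fund overseas R&D centres.",
     "Qualifies under the special red-chip route."),
    ("Company C: clothing maker below thresholds, proceeds buy "
     "wealth-management products.",
     "Does not qualify: fails hard-tech positioning and the funds-use rule."),
]

TASK = ("Task: Based on the 2025 revised IPO rules, decide whether a company "
        "qualifies for listing on the sci-tech board.")

def build_prompt(query: str) -> str:
    """Join the task statement, the worked examples, and the new case."""
    shots = "\n\n".join(f"Input: {q}\nOutput: {a}" for q, a in EXAMPLES)
    return f"{TASK}\n\n{shots}\n\nInput: {query}\nOutput:"

prompt = build_prompt(
    "Company D: Hong Kong-registered red-chip, quantum computing hardware, "
    "projected market cap 180B, loss-making, R&D ratio 30%, proceeds fund R&D."
)
```

&lt;p&gt;The same helper then serves every domain in this section: only &lt;code&gt;TASK&lt;/code&gt; and &lt;code&gt;EXAMPLES&lt;/code&gt; change.&lt;/p&gt;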




&lt;h3&gt;
  
  
  4.3 Programming: teaching Python 3.12 type parameter syntax
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You’re generating code, but the base model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Was trained mostly on Python ≤ 3.11&lt;/li&gt;
&lt;li&gt;Keeps suggesting &lt;code&gt;TypeVar&lt;/code&gt; boilerplate&lt;/li&gt;
&lt;li&gt;Doesn’t use &lt;strong&gt;Python 3.12’s new generic syntax&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You want the model to write idiomatic 3.12 code, today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key new idea&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;

&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;identity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python 3.12 lets you write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;identity&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And similarly for classes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Few‑shot pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Define the task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: You are coding in Python 3.12.
- Prefer the new type parameter syntax: def func[T](...) -&amp;gt; ...
- Prefer built‑in generics (list[T], dict[str, T]) over typing.List / typing.Dict.
- When you see older TypeVar patterns, refactor them into the new style.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then show a few before/after examples.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example 1 — basic function
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before (pre‑3.12)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TypeVar&lt;/span&gt;
&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;echo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After (Python 3.12)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;echo&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Example 2 — list generics
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;
&lt;span class="n"&gt;U&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;U&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Example 3 — generic class
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;
&lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;K&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;V&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;V&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SimpleCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SimpleCache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Example 4 — dictionary helper (the “target” pattern)
&lt;/h4&gt;

&lt;p&gt;Now you show a case very close to what you actually want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;
&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_value&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when you ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Write a generic &lt;code&gt;lookup&lt;/code&gt; function in Python 3.12 that looks up a key in a &lt;code&gt;dict[str, T]&lt;/code&gt; and raises &lt;code&gt;KeyError&lt;/code&gt; if missing, using the new type parameter syntax.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;…you’ll get something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lookup&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;KeyError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exactly the behaviour you want — without touching the model weights.&lt;/p&gt;
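&lt;p&gt;If you want to sanity‑check that behaviour on an interpreter older than 3.12, the same function in the pre‑3.12 &lt;code&gt;TypeVar&lt;/code&gt; form runs identically:&lt;/p&gt;

```python
# The generated `lookup`, written in the pre-3.12 TypeVar form so it also
# runs on older interpreters. Runtime behaviour is identical; only the
# annotation syntax differs.
from typing import TypeVar

T = TypeVar("T")

def lookup(store: dict[str, T], key: str) -> T:
    if key not in store:
        raise KeyError(key)
    return store[key]

assert lookup({"answer": 42}, "answer") == 42
```

&lt;p&gt;On 3.12+ both versions type‑check the same way; the new syntax just removes the module‑level &lt;code&gt;TypeVar&lt;/code&gt; boilerplate.&lt;/p&gt;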




&lt;h3&gt;
  
  
  4.4 Education: encoding a new curriculum standard
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In 2025, the middle‑school math curriculum adds a new unit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;“Data &amp;amp; algorithms basics”&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Simple visualisation (e.g. box plots)&lt;/li&gt;
&lt;li&gt;Basic algorithm concepts (e.g. bubble sort, described in natural language)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;You’re building a teacher‑assistant tool that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reviews lesson plans&lt;/li&gt;
&lt;li&gt;Flags whether they align with the new standard&lt;/li&gt;
&lt;li&gt;Explains &lt;em&gt;why&lt;/em&gt; in teacher‑friendly language&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Few‑shot pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Define the task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: Given a lesson description for grades 7–9, decide:
1) Does it align with the 2025 “Data &amp;amp; Algorithms Basics” standard?
2) Why or why not?
3) If not, how could it be adjusted?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Show three archetypes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Good “box plot” lesson&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Students compute quartiles for real exam‑score data
&lt;/li&gt;
&lt;li&gt;Draw box plots
&lt;/li&gt;
&lt;li&gt;Discuss concentration and spread
→ Meets “data visualisation” requirement&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Good “bubble sort” lesson&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Teacher explains algorithm with diagrams
&lt;/li&gt;
&lt;li&gt;Students describe steps in natural language
&lt;/li&gt;
&lt;li&gt;No Python coding required
→ Matches “understand algorithm idea”, not “implement deep learning”&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Overkill AI lesson&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Teacher asks 9th graders to code a neural network in Python
&lt;/li&gt;
&lt;li&gt;Predict exam scores from features
→ Way beyond the standard; flagged as misaligned&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Then test case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Lesson: Grade 8 “Data &amp;amp; Algorithms Basics”.
- Students design a simple random sampling plan to pick 50 students out of 2,000.
- They collect “weekly exercise time” data.
- They draw a box plot of exercise time and discuss the results.

Question: Aligned or not? Analyse using the three points as in the examples.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model can now conclude:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Random sampling + box plots + interpretation is squarely within the new standard’s scope
&lt;/li&gt;
&lt;li&gt;And it can explain why in the same structure as the examples.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  4.5 Law: encoding new labour‑law amendments
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You have a legal assistant for HR that needs to know about &lt;strong&gt;2025 labour‑law amendments&lt;/strong&gt;, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Occupational injury insurance for platform workers / gig economy&lt;/li&gt;
&lt;li&gt;How to measure working time under remote work&lt;/li&gt;
&lt;li&gt;Minimum compensation for non‑compete agreements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The base model doesn’t know about the 2025 changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Few‑shot pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Define the task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: Given a short labour dispute case, do:
1) Legal assessment under the 2025 amendments
2) Cite the relevant new article(s) in plain language
3) Suggest next steps for the worker and/or employer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Show three examples:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Gig worker injury&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Platform courier, no formal contract, injured on delivery
&lt;/li&gt;
&lt;li&gt;Platform refuses compensation
→ New rule: de facto employment relationship + the platform must pay into injury insurance
→ Suggest: apply for recognition of the employment relationship, then file an injury claim&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Remote work time tracking&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Employer arbitrarily deducts pay by claiming “low efficiency”
&lt;/li&gt;
&lt;li&gt;No proper time‑tracking system
→ New rule: use “effective working time” and require documented tracking
→ Suggest: ask employer for evidence; if none, pay must be corrected&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Non‑compete with absurdly low pay&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2‑year non‑compete, but only 500 GBP/month compensation
→ New rule: at least 30% of the average salary, and no lower than the minimum wage
→ Worker can refuse to comply or demand revised compensation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Then you ask about a design worker who:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Worked fully remotely&lt;/li&gt;
&lt;li&gt;Got only 3,000 GBP in June, while their historical average is 8,000&lt;/li&gt;
&lt;li&gt;Employer never tracked “effective working time”&lt;/li&gt;
&lt;li&gt;Employer claims “your efficiency was low”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model can now chain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This matches the &lt;strong&gt;remote work&lt;/strong&gt; example
&lt;/li&gt;
&lt;li&gt;Employer failed their duty to track time
&lt;/li&gt;
&lt;li&gt;Worker can demand full pay based on historical average, minus what’s been paid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Again: the law text itself lives &lt;em&gt;outside&lt;/em&gt; the model.&lt;br&gt;&lt;br&gt;
The &lt;em&gt;usable logic&lt;/em&gt; gets distilled into 3–4 examples.&lt;/p&gt;
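&lt;p&gt;Where an amendment boils down to arithmetic, you can even mirror the rule in plain code and cross‑check the model’s answer. A sketch of two of the rules above (the 30% figure and the back‑pay logic follow this section’s hypothetical amendments, not real statute):&lt;/p&gt;

```python
# Two of the example rules as plain functions. The 30% floor and the
# back-pay logic mirror the hypothetical 2025 amendments above;
# they are illustrative, not real statute.

def noncompete_floor(avg_monthly_salary: float, minimum_wage: float) -> float:
    """Lowest lawful monthly non-compete compensation under example 3's rule."""
    return max(0.30 * avg_monthly_salary, minimum_wage)

def remote_back_pay(historical_avg: float, paid: float,
                    employer_tracked_time: bool) -> float:
    """If the employer never tracked effective working time, the deduction
    lacks evidence and pay reverts to the historical average."""
    if employer_tracked_time:
        return 0.0  # a documented dispute; outside this sketch
    return max(historical_avg - paid, 0.0)

# The design worker: 3,000 GBP paid vs an 8,000 GBP average, no tracking.
assert remote_back_pay(8000, 3000, employer_tracked_time=False) == 5000
# The 500 GBP/month non-compete offer against an 8,000 GBP average:
assert noncompete_floor(8000, minimum_wage=1200) == 2400
```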


&lt;h2&gt;
  
  
  5. Six common failure modes (and how to fix them)
&lt;/h2&gt;

&lt;p&gt;Even with the right idea, few‑shot prompts can fail in subtle ways.&lt;br&gt;&lt;br&gt;
Let’s debug some typical issues.&lt;/p&gt;


&lt;h3&gt;
  
  
  5.1 The model just parrots your examples
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Works fine on seen cases&lt;/li&gt;
&lt;li&gt;On new input, it copies values from examples or repeats entire example answers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Likely causes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Examples are &lt;em&gt;purely concrete&lt;/em&gt; — lots of “case A / case B” but no visible rule&lt;/li&gt;
&lt;li&gt;Task wording doesn’t force generalisation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Explicitly encode the rule once&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Instead of only:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example: EV A, 500km → 15k
Example: EV B, 520km → 15k
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add a line like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Rule: For all BEVs and PHEVs with range ≥ 500km, subsidy is 15,000 GBP.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Align features&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If your examples all use “Car: A, range 500, type BEV”, and your real task says “Model E, long‑range battery, SUV body style”…&lt;br&gt;&lt;br&gt;
the model may not spot that “long‑range battery” = “range feature”.&lt;/p&gt;

&lt;p&gt;Be boring and explicit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Always show the same fields: &lt;code&gt;Type=…&lt;/code&gt;, &lt;code&gt;Range=…&lt;/code&gt;, &lt;code&gt;BodyStyle=…&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
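&lt;p&gt;Both fixes are easy to see in code: state the rule once, and normalise every case into the same fields before it enters the prompt. A sketch using the illustrative EV‑subsidy numbers above:&lt;/p&gt;

```python
# Sketch: one explicit rule + one canonical field format, so "long-range
# battery" always becomes an explicit Range value before prompting.
# Field names and the 500km / 15,000 GBP rule are the illustrative ones above.

def format_case(vehicle_type: str, range_km: int, body_style: str) -> str:
    """Render every vehicle with the same fields, in the same order."""
    return f"Type={vehicle_type}, Range={range_km}km, BodyStyle={body_style}"

def subsidy(vehicle_type: str, range_km: int) -> int:
    """Rule stated once: BEV/PHEV with range >= 500km gets 15,000 GBP."""
    return 15_000 if vehicle_type in {"BEV", "PHEV"} and range_km >= 500 else 0

assert format_case("BEV", 520, "SUV") == "Type=BEV, Range=520km, BodyStyle=SUV"
assert subsidy("BEV", 520) == 15_000
```

&lt;p&gt;Feeding &lt;code&gt;format_case(...)&lt;/code&gt; output into every example (and into the query) removes the guesswork about which phrase carries the range feature.&lt;/p&gt;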




&lt;h3&gt;
  
  
  5.2 The model ignores edge cases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handles normal cases well&lt;/li&gt;
&lt;li&gt;Fails on boundary or special cases that &lt;em&gt;were&lt;/em&gt; in your examples&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: you showed one “HAE with no family history” case, but the model still insists “no family history → no HAE”.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Likely causes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only a single edge example vs many normal examples&lt;/li&gt;
&lt;li&gt;Edge case buried at the end with no emphasis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add &lt;strong&gt;at least two&lt;/strong&gt; edge‑case examples&lt;/li&gt;
&lt;li&gt;Label them clearly, e.g.:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Boundary case] Patient C: no family history, but low C1‑INH and typical symptoms → still HAE.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Explicit tags like &lt;code&gt;[Boundary case]&lt;/code&gt; or &lt;code&gt;[Special exception]&lt;/code&gt; help the model treat them as part of the rule, not noise.&lt;/p&gt;




&lt;h3&gt;
  
  
  5.3 Examples are too long (signal drowned in noise)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt looks like a wall of text&lt;/li&gt;
&lt;li&gt;Model latches onto random parts (e.g. story flavour) instead of the core logic&lt;/li&gt;
&lt;li&gt;Sometimes it even ignores later examples because the context is overloaded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Likely causes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You copy‑pasted entire documents instead of minimal supervised examples&lt;/li&gt;
&lt;li&gt;You included narrative fluff, version history, citations, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trim each example to: &lt;code&gt;Input → Output → Short explanation&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Put long policy / guideline text in a &lt;em&gt;separate&lt;/em&gt; section above, and &lt;strong&gt;summarise&lt;/strong&gt; the operative rule in the example&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Try to keep:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each example ≤ ~100 tokens if possible&lt;/li&gt;
&lt;li&gt;All examples + instructions under ~80% of your context window, leaving room for the actual query and the model’s reasoning&lt;/li&gt;
&lt;/ul&gt;
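&lt;p&gt;A rough way to enforce that budget before sending anything. Real counts depend on the model’s tokeniser (e.g. &lt;code&gt;tiktoken&lt;/code&gt;); &lt;code&gt;len(text) // 4&lt;/code&gt; is only a crude approximation for English prose:&lt;/p&gt;

```python
# Crude budget check: approximate tokens at ~4 characters each and keep
# instructions + examples under ~80% of the context window.

def approx_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic; use the real tokeniser in prod

def fits_budget(instructions: str, examples: list[str],
                context_window: int = 8192, reserve: float = 0.20) -> bool:
    """True if instructions + examples leave `reserve` of the window free."""
    used = approx_tokens(instructions) + sum(map(approx_tokens, examples))
    return used <= context_window * (1 - reserve)

assert fits_budget("Task: classify.", ["Input: a\nOutput: b"] * 10)
```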




&lt;h3&gt;
  
  
  5.4 Terminology collision across domains
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You re‑use generic terms like “subsidy”, “margin”, “policy” in multiple domains&lt;/li&gt;
&lt;li&gt;The model mixes up meanings (e.g. subsidy in finance vs subsidy in automotive)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Likely causes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No domain qualification: “subsidy” means different things in 3 prompts&lt;/li&gt;
&lt;li&gt;Inconsistent phrasing: “post‑IPO market cap” vs “total value after listing” vs “size”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Define terms&lt;/strong&gt; in your instructions:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;In this task, “market cap” always means “projected post‑IPO market cap” (shares × offering price).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;domain‑specific names&lt;/strong&gt; where possible:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ev_subsidy&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ipo_post_listing_market_cap&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hae_lab_marker&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Consistency beats cleverness.&lt;/p&gt;




&lt;h3&gt;
  
  
  5.5 Small models can’t juggle too many conditions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A big model (e.g. GPT‑4‑class) handles your prompt perfectly&lt;/li&gt;
&lt;li&gt;A smaller one (7B / 13B) seems to ignore half the conditions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: in HAE diagnosis, the small model only looks at symptoms but ignores lab values and family history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Likely causes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You’re asking it to learn a 3–4‑dimensional rule in one shot&lt;/li&gt;
&lt;li&gt;Each example mixes many conditions at once&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Break the problem into &lt;strong&gt;layers of examples&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Single‑dimension examples&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only symptoms: itchy vs non‑itchy swelling&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Two‑dimension examples&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Symptoms + lab values&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Three‑dimension examples&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Symptoms + lab values + family history&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The smaller model can first lock in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Non‑itchy swelling” vs “allergic swelling”
&lt;/li&gt;
&lt;li&gt;Then “low C1‑INH” vs “normal”
&lt;/li&gt;
&lt;li&gt;Then “family history strengthens the case”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of being hit with all three dimensions at once.&lt;/p&gt;
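&lt;p&gt;The layering is easy to automate: keep each layer as its own list of examples and concatenate them in order, simplest first. A minimal sketch with illustrative (not clinical) example text:&lt;/p&gt;

```python
# Layered few-shot block: single-dimension examples first, then
# two-dimension, then three-dimension, so a small model can lock in
# one distinction at a time. Example text is illustrative only.
LAYERS = [
    # 1. symptoms only
    ["Case: non-itchy swelling, no hives -> consider HAE",
     "Case: itchy swelling with hives -> consider allergy"],
    # 2. symptoms + lab values
    ["Case: non-itchy swelling, low C1-INH -> HAE likely"],
    # 3. symptoms + labs + family history
    ["Case: non-itchy swelling, low C1-INH, affected parent -> HAE strongly suspected"],
]

def layered_prompt(layers: list[list[str]]) -> str:
    parts = []
    for i, examples in enumerate(layers, start=1):
        parts.append(f"# Layer {i}")
        parts.extend(examples)
    return "\n".join(parts)

print(layered_prompt(LAYERS))
```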




&lt;h3&gt;
  
  
  5.6 Silent errors in your examples poison everything
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Everything &lt;em&gt;feels&lt;/em&gt; coherent&lt;/li&gt;
&lt;li&gt;But the model’s answers are consistently off from the real rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you go back and re‑read your few‑shot block, you find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One threshold is wrong
&lt;/li&gt;
&lt;li&gt;Or two examples contradict each other&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Treat prompt debugging like code review:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Validate against the source of truth&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Policy document
&lt;/li&gt;
&lt;li&gt;Official guideline
&lt;/li&gt;
&lt;li&gt;Legal article&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check internal consistency&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same thresholds everywhere
&lt;/li&gt;
&lt;li&gt;Same units (km vs miles, %)
&lt;/li&gt;
&lt;li&gt;No “≥ 450km” in one example and “≥ 500km” in the next&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
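
&lt;p&gt;Part of that review can be automated. A small lint sketch that extracts every “≥ N km” threshold from a few-shot block and flags disagreement — the regex and single-unit handling are simplifying assumptions, not a general solution:&lt;/p&gt;

```python
import re

def threshold_values(examples: list[str], unit: str = "km") -> set[str]:
    # Collect every numeric threshold written as "≥ N <unit>".
    # More than one distinct value means the examples contradict each other.
    values = set()
    for ex in examples:
        values.update(re.findall(rf"≥\s*(\d+)\s*{unit}", ex))
    return values

good = ["Range ≥ 450km -> eligible", "Range ≥ 450km -> eligible"]
bad = ["Range ≥ 450km -> eligible", "Range ≥ 500km -> eligible"]
print(threshold_values(good))  # one value: consistent
print(threshold_values(bad))   # two values: fix before shipping
```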

&lt;p&gt;If you fix the examples and the model’s behaviour changes, you just proved that your few‑shot block &lt;em&gt;is&lt;/em&gt; acting like a miniature “training set in context” — which is exactly the point.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Where this is going next
&lt;/h2&gt;

&lt;p&gt;Few‑Shot‑in‑Context is not a temporary hack; it’s becoming part of the core &lt;strong&gt;LLM engineering toolbox&lt;/strong&gt;, especially when combined with other techniques.&lt;/p&gt;

&lt;p&gt;A few directions that are already practical today:&lt;/p&gt;

&lt;h3&gt;
  
  
  6.1 Few‑shot + RAG = dynamic knowledge injection
&lt;/h3&gt;

&lt;p&gt;Instead of hard‑coding your examples into the prompt, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store policy snippets / typical cases in a vector DB&lt;/li&gt;
&lt;li&gt;At query time:

&lt;ul&gt;
&lt;li&gt;Retrieve the most relevant 3–5 items&lt;/li&gt;
&lt;li&gt;Format them as few‑shot examples&lt;/li&gt;
&lt;li&gt;Feed them to the model&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;You get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Up‑to‑date knowledge (change the DB, not the model)&lt;/li&gt;
&lt;li&gt;Domain‑specific behaviour&lt;/li&gt;
&lt;li&gt;No retraining loops&lt;/li&gt;
&lt;/ul&gt;
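
&lt;p&gt;The retrieve-then-format loop can be sketched in a few lines. Similarity here is naive keyword overlap purely for illustration; a real system would use embeddings and a vector DB, but the shape of the pipeline is the same:&lt;/p&gt;

```python
# Retrieve the k most relevant stored cases for a query, then splice
# them into the prompt as few-shot examples. Toy similarity metric:
# shared lowercase tokens.
def overlap(a: str, b: str) -> int:
    return len(set(a.lower().split()).intersection(b.lower().split()))

def retrieve(query: str, store: list[dict], k: int = 3) -> list[dict]:
    return sorted(store, key=lambda c: overlap(query, c["input"]),
                  reverse=True)[:k]

def to_few_shot(cases: list[dict]) -> str:
    return "\n\n".join(f"Input: {c['input']}\nOutput: {c['output']}"
                       for c in cases)

store = [
    {"input": "EV range 520km price 280k", "output": "eligible"},
    {"input": "EV range 400km price 150k", "output": "not eligible"},
    {"input": "IPO shares 100M price 12", "output": "market cap 1.2B"},
]
prompt_examples = to_few_shot(retrieve("EV range 480km", store, k=2))
print(prompt_examples)
```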

&lt;h3&gt;
  
  
  6.2 Multi‑modal few‑shot
&lt;/h3&gt;

&lt;p&gt;With vision‑capable models, examples don’t have to be text‑only.&lt;/p&gt;

&lt;p&gt;You can show:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;strong&gt;image of a box plot&lt;/strong&gt; + text interpretation → teach data‑viz reading&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;scan of a legal clause&lt;/strong&gt; + structured summary → teach contract analysis&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;medical image&lt;/strong&gt; + diagnosis → teach pattern recognition frameworks (with human oversight)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The principle is the same: a tiny set of &lt;strong&gt;high‑quality, well‑structured examples&lt;/strong&gt; sets the behaviour.&lt;/p&gt;

&lt;h3&gt;
  
  
  6.3 Personal and team‑level “knowledge presets”
&lt;/h3&gt;

&lt;p&gt;For individuals and small teams, we’ll likely see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tools that help you &lt;strong&gt;generate few‑shot blocks&lt;/strong&gt; from your own notes&lt;/li&gt;
&lt;li&gt;“Profiles” that encode:

&lt;ul&gt;
&lt;li&gt;Your coding style&lt;/li&gt;
&lt;li&gt;Your company’s policy interpretations&lt;/li&gt;
&lt;li&gt;Your favourite data formats&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Think of it as a “soft fine‑tune” you can edit in a text editor.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Takeaways
&lt;/h2&gt;

&lt;p&gt;If you remember only three things from this article, make them these:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fine‑tuning is rarely your first move.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Often, you can get 80–90% of the value by injecting knowledge with 3–5 carefully chosen examples.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Good few‑shot prompts encode logic, not trivia.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Cover both normal and edge cases, be brutally consistent in format, and keep examples factual and compact.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Few‑shot is a bridge between raw models and real products.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
It lets you adapt general‑purpose LLMs to fast‑moving, niche, or private knowledge — on your laptop, in minutes, without a GPU farm.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once you start thinking of prompts as tiny, editable “on‑the‑fly training sets”, you stop reaching for fine‑tuning by default — and start shipping faster.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>llm</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
