<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Dechun Wang</title>
    <description>The latest articles on Forem by Dechun Wang (@superorange0707).</description>
    <link>https://forem.com/superorange0707</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3023932%2F02332919-44aa-45f2-8387-c879c55d794f.jpeg</url>
      <title>Forem: Dechun Wang</title>
      <link>https://forem.com/superorange0707</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/superorange0707"/>
    <language>en</language>
    <item>
      <title>Do LLMs Lie? The Real Reason AI Sounds Smart While Making Things Up</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Fri, 27 Feb 2026 01:59:54 +0000</pubDate>
      <link>https://forem.com/superorange0707/do-llms-lie-the-real-reason-ai-sounds-smart-while-making-things-up-4jep</link>
      <guid>https://forem.com/superorange0707/do-llms-lie-the-real-reason-ai-sounds-smart-while-making-things-up-4jep</guid>
      <description>&lt;h1&gt;
  
  
  AI “Lies”? Or Is It Just Doing Exactly What You Asked?
&lt;/h1&gt;

&lt;p&gt;You’ve seen it.&lt;/p&gt;

&lt;p&gt;You ask a model a question. It answers with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a clean structure,&lt;/li&gt;
&lt;li&gt;confident language,&lt;/li&gt;
&lt;li&gt;a few &lt;em&gt;very specific&lt;/em&gt; details,&lt;/li&gt;
&lt;li&gt;maybe even a fake-looking citation for extra authority.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then you Google it.&lt;/p&gt;

&lt;p&gt;Nothing exists.&lt;/p&gt;

&lt;p&gt;So the obvious conclusion is: &lt;strong&gt;“AI is lying.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s the more useful conclusion:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;LLMs optimize for plausibility, not truth.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
They’re not truth engines — they’re &lt;em&gt;text engines&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And once you understand that, hallucination stops being “a mysterious model defect” and becomes an engineering problem you can design around.&lt;/p&gt;

&lt;p&gt;This article is your field guide.&lt;/p&gt;




&lt;h1&gt;
  
  
  1) What Is an AI Hallucination?
&lt;/h1&gt;

&lt;p&gt;In engineering terms, a hallucination is any output that is &lt;strong&gt;not grounded&lt;/strong&gt; in either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;verifiable external reality (facts, sources, measurements), or&lt;/li&gt;
&lt;li&gt;your provided context/instructions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In human terms:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;It’s fluent nonsense with good manners.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Two hallucinations that matter in real products
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1.1 Factual hallucination (the model invents claims)
&lt;/h2&gt;

&lt;p&gt;Example vibe:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Yes, honey stabilizes blood sugar for diabetics because it’s natural.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is &lt;strong&gt;factually incorrect&lt;/strong&gt; and unsafe. The model is using a &lt;em&gt;natural = healthy&lt;/em&gt; pattern and filling in the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  1.2 Faithfulness hallucination (the model drifts away from what you asked)
&lt;/h2&gt;

&lt;p&gt;Example vibe:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You asked: “Can diabetics replace sugar with honey?”&lt;br&gt;&lt;br&gt;
It answers: “Honey contains minerals and antioxidants.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer might be broadly true-ish… but it &lt;strong&gt;didn’t answer the question&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is &lt;em&gt;instruction/context drift&lt;/em&gt;: the model chose a plausible response path instead of your intended one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why hallucinations are so dangerous
&lt;/h2&gt;

&lt;p&gt;Because hallucinations are rarely random garbage.&lt;/p&gt;

&lt;p&gt;They’re often:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;internally consistent,&lt;/li&gt;
&lt;li&gt;nicely written,&lt;/li&gt;
&lt;li&gt;and optimized to &lt;em&gt;sound&lt;/em&gt; helpful.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In high-stakes domains (medicine, law, finance, ops), that’s a failure mode with teeth.&lt;/p&gt;




&lt;h1&gt;
  
  
  2) Why Hallucinations Happen (Mechanics, Not Mysticism)
&lt;/h1&gt;

&lt;p&gt;If you want one equation to explain the whole phenomenon, it’s this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;LLMs learn &lt;strong&gt;P(next token | context)&lt;/strong&gt; —&lt;br&gt;&lt;br&gt;
not &lt;strong&gt;P(true statement | world)&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A model can be &lt;em&gt;excellent&lt;/em&gt; at the first probability while being mediocre at the second.&lt;/p&gt;
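&lt;p&gt;To make that concrete, here is a toy sketch of next-token scoring. The tokens and logit values are invented for illustration; the point is that the scoring function contains no truth term, only plausibility.&lt;/p&gt;

```python
import math

# Toy "model": a next-token score table for the context
# "Honey ___ blood sugar." All numbers are invented.
next_token_logits = {
    "stabilizes": 2.0,   # common in wellness blogs, so it scores high
    "raises": 1.2,       # medically accurate, but less frequent in text
    "teleports": -3.0,   # implausible as language, near-zero probability
}

def softmax(logits):
    """Convert raw scores into a probability distribution over tokens."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

probs = softmax(next_token_logits)
best = max(probs, key=probs.get)  # the most plausible token wins, true or not
```

&lt;p&gt;Greedy decoding picks &lt;code&gt;stabilizes&lt;/code&gt; here because it is the most plausible continuation, not because it is correct.&lt;/p&gt;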

&lt;h2&gt;
  
  
  2.1 The objective function never included “truth”
&lt;/h2&gt;

&lt;p&gt;Training pushes models to generate what looks like “a good continuation” of language.&lt;/p&gt;

&lt;p&gt;If the training data contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;outdated facts,&lt;/li&gt;
&lt;li&gt;biased narratives,&lt;/li&gt;
&lt;li&gt;low-quality posts,&lt;/li&gt;
&lt;li&gt;or incorrect statements repeated many times,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;the model will absorb those patterns.&lt;/p&gt;

&lt;p&gt;Not because it’s dumb — because it’s doing gradient descent on the world’s mess.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.2 The world is out-of-distribution by default
&lt;/h2&gt;

&lt;p&gt;Most user questions are weird combinations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Explain how X affects Y in Z scenario, but for 2026 rules, in the UK, and for a small business.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These compositional queries often don’t exist explicitly in training.&lt;/p&gt;

&lt;p&gt;When an LLM faces an unfamiliar combination, it does what it’s trained to do:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Produce a plausible continuation anyway.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Silence wasn’t rewarded.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.3 Parametric memory is confident by design
&lt;/h2&gt;

&lt;p&gt;LLMs store “knowledge” in weights — not in a database that can be checked or updated.&lt;/p&gt;

&lt;p&gt;That leads to classic failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fuzzy time boundaries (it may speak about “recent changes” without knowing what’s recent),&lt;/li&gt;
&lt;li&gt;invented paper titles,&lt;/li&gt;
&lt;li&gt;confident timelines that never happened.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model often &lt;strong&gt;doesn’t know that it doesn’t know&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.4 Language is ambiguous; instructions are under-specified
&lt;/h2&gt;

&lt;p&gt;“Explain deep learning” could mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;theory, math, history, code, architecture, best practices, business applications…&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you don’t lock scope, the model will choose a generic path.&lt;/p&gt;

&lt;p&gt;That generic path can drift from your real intent — and suddenly you’re in faithfulness hallucination territory.&lt;/p&gt;




&lt;h1&gt;
  
  
  3) “Better Reasoning” Does Not Automatically Mean “More Truth”
&lt;/h1&gt;

&lt;p&gt;Here’s the trap:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If a model can reason better, surely it hallucinates less… right?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sometimes. Not always.&lt;/p&gt;

&lt;p&gt;Reasoning increases &lt;strong&gt;coherence&lt;/strong&gt;. Coherence can &lt;em&gt;hide&lt;/em&gt; errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  3.1 Reasoning helps in constrained tasks
&lt;/h2&gt;

&lt;p&gt;Math, logic puzzles, rule-based workflows — where intermediate states can be checked.&lt;/p&gt;

&lt;p&gt;In these settings, reasoning often reduces mistakes because the task has &lt;strong&gt;hard constraints&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  3.2 Reasoning can amplify hallucinations in open-world facts
&lt;/h2&gt;

&lt;p&gt;In unconstrained factual generation, stronger reasoning can do something scary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;take a wrong premise,&lt;/li&gt;
&lt;li&gt;expand it into a beautiful argument,&lt;/li&gt;
&lt;li&gt;and deliver a conclusion that feels &lt;em&gt;more credible&lt;/em&gt; than a simple wrong answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three common mechanisms:&lt;/p&gt;

&lt;h3&gt;
  
  
  (1) Over-extrapolation
&lt;/h3&gt;

&lt;p&gt;The model extends a pattern beyond evidence (“A usually implies B, so it must here too”).&lt;/p&gt;

&lt;h3&gt;
  
  
  (2) Confidence misalignment
&lt;/h3&gt;

&lt;p&gt;More structured answers sound more certain — even when the underlying claim is shaky.&lt;/p&gt;

&lt;h3&gt;
  
  
  (3) Correct reasoning over false premises
&lt;/h3&gt;

&lt;p&gt;If the premise is wrong, the logic can still be flawless… and the result is still false.&lt;/p&gt;

&lt;p&gt;That’s why “smart-sounding” is not a truth signal.&lt;/p&gt;




&lt;h1&gt;
  
  
  4) A Practical Taxonomy: How to Spot Hallucinations Fast
&lt;/h1&gt;

&lt;p&gt;If you want a quick mental checklist, use &lt;strong&gt;F.A.C.T.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;F&lt;/strong&gt;abricated specifics: suspiciously precise names, dates, numbers, citations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A&lt;/strong&gt;uthority cosplay: “According to a 2023 FDA paper…” with no traceable source&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C&lt;/strong&gt;ontext drift: it answers adjacent questions, not your question&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T&lt;/strong&gt;ime confusion: “recently” / “latest” claims without anchoring time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If two or more show up, you should treat the output as &lt;em&gt;a draft&lt;/em&gt;, not an answer.&lt;/p&gt;
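&lt;p&gt;The checklist can be turned into a crude automatic screen. The regexes below are illustrative guesses, not a validated detector, and note that the &lt;strong&gt;C&lt;/strong&gt; (context drift) signal can't be caught by pattern matching at all; it needs a semantic comparison against the original question.&lt;/p&gt;

```python
import re

# Heuristic F.A.C.T. screen. Patterns are illustrative; tune per domain.
SIGNALS = {
    # suspiciously precise years / page-style citations
    "fabricated_specifics": r"\b(19|20)\d{2}\b|\bvol\.\s*\d+|\bpp?\.\s*\d+",
    # "according to a ... paper/study/report" with no traceable source
    "authority_cosplay": r"\baccording to (a|an|the)\b.*\b(paper|study|report)\b",
    # unanchored time claims
    "time_confusion": r"\b(recently|latest|as of now|currently)\b",
}

def fact_screen(text):
    t = text.lower()
    hits = [name for name, pat in SIGNALS.items() if re.search(pat, t)]
    # Two or more signals -> treat the output as a draft, not an answer.
    return {"signals": hits, "treat_as_draft": len(hits) >= 2}
```

&lt;p&gt;It won't catch everything, but it cheaply flags the outputs that deserve a human second look.&lt;/p&gt;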




&lt;h1&gt;
  
  
  5) What Normal Users Can Do (No PhD Required)
&lt;/h1&gt;

&lt;p&gt;You can’t eliminate hallucinations. But you can lower risk massively with three moves.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.1 Ground it: search / RAG / external evidence
&lt;/h2&gt;

&lt;p&gt;If your question is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;time-sensitive,&lt;/li&gt;
&lt;li&gt;domain-critical,&lt;/li&gt;
&lt;li&gt;or requires citations,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;then the model shouldn’t be your source of truth.&lt;/p&gt;

&lt;p&gt;Use external grounding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;search,&lt;/li&gt;
&lt;li&gt;your documentation,&lt;/li&gt;
&lt;li&gt;a domain KB,&lt;/li&gt;
&lt;li&gt;a RAG pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;force the model to &lt;strong&gt;answer from evidence&lt;/strong&gt;, not from vibes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  5.2 Verify it: two-model review + claim checking
&lt;/h2&gt;

&lt;p&gt;A simple workflow that works shockingly well:&lt;/p&gt;

&lt;p&gt;1) Model A produces an answer&lt;br&gt;&lt;br&gt;
2) Model B audits it like an aggressive reviewer&lt;/p&gt;

&lt;p&gt;Tell Model B to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;list factual claims,&lt;/li&gt;
&lt;li&gt;mark uncertainty,&lt;/li&gt;
&lt;li&gt;flag anything that looks unverified,&lt;/li&gt;
&lt;li&gt;propose how to validate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This doesn’t guarantee correctness — but it exposes weak spots fast.&lt;/p&gt;
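&lt;p&gt;The workflow is a few lines of glue. The model calls below are hypothetical stubs standing in for real API clients, so only the control flow is shown.&lt;/p&gt;

```python
# Two-model review: Model A produces, Model B audits.
AUDIT_PROMPT = (
    "Audit the answer below like an aggressive reviewer.\n"
    "1) List each factual claim.\n"
    "2) Mark uncertainty for each claim.\n"
    "3) Flag anything that looks unverified.\n"
    "4) Propose how to validate it.\n\n"
    "ANSWER:\n{answer}"
)

def ask_model_a(question: str) -> str:
    # stub producer; replace with your LLM client
    return "Honey stabilizes blood sugar for diabetics."

def ask_model_b(prompt: str) -> str:
    # stub auditor; replace with a second, independent model
    return "CLAIM: honey stabilizes blood sugar -> UNVERIFIED, check clinical sources."

def two_model_review(question: str) -> dict:
    answer = ask_model_a(question)
    audit = ask_model_b(AUDIT_PROMPT.format(answer=answer))
    return {"answer": answer, "audit": audit,
            "needs_review": "UNVERIFIED" in audit}
```

&lt;p&gt;Using two different model families as A and B tends to work better than self-review, since they rarely share the exact same blind spots.&lt;/p&gt;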
&lt;h2&gt;
  
  
  5.3 Constrain it: prompts that shrink the imagination space
&lt;/h2&gt;

&lt;p&gt;Most hallucinations happen when the model has too much freedom.&lt;/p&gt;

&lt;p&gt;Give it less.&lt;/p&gt;
&lt;h3&gt;
  
  
  Constraint prompt template (copy/paste)
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Answer only for: &lt;strong&gt;[time range]&lt;/strong&gt;, &lt;strong&gt;[region]&lt;/strong&gt;, &lt;strong&gt;[source types]&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
If you are unsure, say &lt;strong&gt;“I don’t know”&lt;/strong&gt; and list what you would need to verify.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Adversarial prompt template (copy/paste)
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Before answering, list 3 ways your answer could be wrong.&lt;br&gt;&lt;br&gt;
Then answer, clearly separating &lt;strong&gt;facts vs assumptions&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This “self-audit” pattern reduces confident nonsense because it forces the model to expose uncertainty.&lt;/p&gt;
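&lt;p&gt;If you use these templates a lot, it helps to make them functions so the constraints can't be forgotten. The field names here are illustrative; adapt them to your own prompt conventions.&lt;/p&gt;

```python
def constrain(question: str, time_range: str, region: str, source_types: str) -> str:
    """Wrap a question in the constraint template above."""
    return (
        f"{question}\n\n"
        f"Answer only for: {time_range}, {region}, {source_types}.\n"
        "If you are unsure, say \"I don't know\" and list what you "
        "would need to verify."
    )

def adversarial(question: str) -> str:
    """Wrap a question in the adversarial self-audit template above."""
    return (
        "Before answering, list 3 ways your answer could be wrong.\n"
        "Then answer, clearly separating facts vs assumptions.\n\n"
        f"QUESTION: {question}"
    )
```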


&lt;h1&gt;
  
  
  6) For Builders: The Four Defenses That Actually Scale
&lt;/h1&gt;

&lt;p&gt;If you ship LLMs in production, the best mitigations are boring and systematic:&lt;/p&gt;
&lt;h2&gt;
  
  
  6.1 Retrieval grounding (RAG)
&lt;/h2&gt;

&lt;p&gt;Bring in evidence, attach it to the context, and require citations internally (even if you hide them in UX).&lt;/p&gt;
&lt;h2&gt;
  
  
  6.2 Tool-based verification
&lt;/h2&gt;

&lt;p&gt;If a claim can be checked by a tool, check it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compute with a calculator,&lt;/li&gt;
&lt;li&gt;validate with a DB query,&lt;/li&gt;
&lt;li&gt;confirm with a search call,&lt;/li&gt;
&lt;li&gt;cross-check with a rules engine.&lt;/li&gt;
&lt;/ul&gt;
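&lt;p&gt;A minimal sketch of the calculator case: route arithmetic claims to a real evaluator instead of trusting the model's numbers. The AST walker only accepts numeric literals and four operators; everything else is returned as "not checkable by this tool." Function names are illustrative.&lt;/p&gt;

```python
import ast
import operator as op

_OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def _eval(node):
    # Recursively evaluate a safe arithmetic AST; reject anything else.
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")

def check_arithmetic(expr: str, claimed: float):
    """Verify 'expr = claimed' with a calculator, not the model."""
    try:
        return abs(_eval(ast.parse(expr, mode="eval")) - claimed) < 1e-9
    except ValueError:
        return None  # not checkable by this tool; fall back to retrieval / review
```

&lt;p&gt;The same shape generalizes: a DB query checker, a search checker, and a rules-engine checker each take a claim and return supported / unsupported / not-my-job.&lt;/p&gt;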
&lt;h2&gt;
  
  
  6.3 Verifier / critic loop
&lt;/h2&gt;

&lt;p&gt;Run a second pass that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;extracts claims,&lt;/li&gt;
&lt;li&gt;checks them against retrieved sources,&lt;/li&gt;
&lt;li&gt;rejects or rewrites unsupported sentences.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  6.4 Observability + budgets
&lt;/h2&gt;

&lt;p&gt;Hallucination mitigation is a &lt;strong&gt;runtime&lt;/strong&gt; problem.&lt;/p&gt;

&lt;p&gt;You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tool call logs,&lt;/li&gt;
&lt;li&gt;retrieval traces,&lt;/li&gt;
&lt;li&gt;evaluation scores,&lt;/li&gt;
&lt;li&gt;step/time/token budgets,&lt;/li&gt;
&lt;li&gt;and fail-closed behavior for critical tasks.&lt;/li&gt;
&lt;/ul&gt;
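&lt;p&gt;The budget part is small enough to sketch directly. This is a fail-closed guard: when any budget is exhausted, the task aborts instead of free-running. The default numbers are illustrative.&lt;/p&gt;

```python
import time

class BudgetExceeded(RuntimeError):
    """Raised when a task runs past its step or time budget."""

class Budget:
    def __init__(self, max_steps: int = 8, max_seconds: float = 30.0):
        self.max_steps = max_steps
        self.deadline = time.monotonic() + max_seconds
        self.steps = 0

    def charge(self) -> None:
        # Call once per model/tool step; fail closed when exhausted.
        self.steps += 1
        if self.steps > self.max_steps or time.monotonic() > self.deadline:
            raise BudgetExceeded(f"stopped after {self.steps} steps")
```

&lt;p&gt;A token budget slots in the same way: count tokens per call and charge them against a cap.&lt;/p&gt;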


&lt;h1&gt;
  
  
  7) A Tiny “Hallucination Harness” You Can Use Today
&lt;/h1&gt;

&lt;p&gt;This is a simplified testing script you can run against any model.&lt;/p&gt;

&lt;p&gt;It does two things:&lt;/p&gt;

&lt;p&gt;1) forces the model to output &lt;strong&gt;claims as bullets&lt;/strong&gt;,&lt;br&gt;&lt;br&gt;
2) runs a crude “support check” against provided evidence.&lt;/p&gt;

&lt;p&gt;It’s not perfect — but it makes hallucination visible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_claims&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="c1"&gt;# naive: treat bullet lines as claims
&lt;/span&gt;    &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-•  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;splitlines&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
    &lt;span class="n"&gt;claims&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tldr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;note:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assumption:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;claims&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;support_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claim&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# naive lexical overlap as a placeholder for real entailment checks
&lt;/span&gt;    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[a-zA-Z0-9]+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;claim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[a-zA-Z0-9]+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;flag_hallucinations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.18&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;claim&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;extract_claims&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;support_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claim&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;claim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;support_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage
&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;- Honey is safe for diabetics and stabilizes blood sugar.
- It has antioxidants and minerals.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;evidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Honey is a source of sugars (fructose and glucose) and can raise blood glucose.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;flag_hallucinations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What to improve in real systems
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;replace lexical overlap with an entailment model / verifier LLM,&lt;/li&gt;
&lt;li&gt;store evidence chunks with provenance,&lt;/li&gt;
&lt;li&gt;require citations at claim level,&lt;/li&gt;
&lt;li&gt;fail closed in high-risk contexts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But even this toy harness teaches a powerful lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Most hallucinations aren’t hard to spot once you force outputs into claims.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  8) The Grown-Up Way to Use LLMs
&lt;/h1&gt;

&lt;p&gt;Stop asking: “Is the model truthful?”&lt;/p&gt;

&lt;p&gt;Start asking: “What’s my verification strategy?”&lt;/p&gt;

&lt;p&gt;Because hallucinations are not a bug you patch once.&lt;br&gt;
They’re a property you manage forever.&lt;/p&gt;

&lt;p&gt;Treat an LLM like a brilliant intern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fast,&lt;/li&gt;
&lt;li&gt;creative,&lt;/li&gt;
&lt;li&gt;productive…&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…and absolutely capable of inventing a meeting that never happened.&lt;/p&gt;

&lt;p&gt;Your job isn’t to fear it.&lt;/p&gt;

&lt;p&gt;Your job is to build the seatbelts.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>From LLM to Agent: How Memory + Planning Turn a Chatbot Into a Doer</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Fri, 27 Feb 2026 01:59:39 +0000</pubDate>
      <link>https://forem.com/superorange0707/from-llm-to-agent-how-memory-planning-turn-a-chatbot-into-a-doer-35ck</link>
      <guid>https://forem.com/superorange0707/from-llm-to-agent-how-memory-planning-turn-a-chatbot-into-a-doer-35ck</guid>
      <description>&lt;h1&gt;
  
  
  The Day Your LLM Stops Talking and Starts Doing
&lt;/h1&gt;

&lt;p&gt;There’s a moment in every LLM project where you realize the “chat” part is the easy bit.&lt;/p&gt;

&lt;p&gt;The hard part is everything that happens &lt;strong&gt;between&lt;/strong&gt; the user request and the final output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gathering missing facts,&lt;/li&gt;
&lt;li&gt;choosing &lt;em&gt;which&lt;/em&gt; tools to call (and in what order),&lt;/li&gt;
&lt;li&gt;handling failures,&lt;/li&gt;
&lt;li&gt;remembering prior decisions,&lt;/li&gt;
&lt;li&gt;and not spiraling into confident nonsense when the world refuses to match the model’s assumptions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s the moment you’re no longer building “an LLM app.”&lt;/p&gt;

&lt;p&gt;You’re building an &lt;strong&gt;agent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In software terms, an agent is &lt;em&gt;not&lt;/em&gt; a magical model upgrade. It’s a &lt;strong&gt;system design&lt;/strong&gt; pattern:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Agent = LLM + tools + a loop + state&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once you see it this way, “memory” and “planning” stop being buzzwords and become engineering decisions you can reason about, test, and improve.&lt;/p&gt;

&lt;p&gt;Let’s break down how it works.&lt;/p&gt;




&lt;h1&gt;
  
  
  1) What Is an LLM Agent, Actually?
&lt;/h1&gt;

&lt;p&gt;A classic LLM app looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;user_input -&amp;gt; prompt -&amp;gt; model -&amp;gt; answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An agent adds a control loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;user_input
  -&amp;gt; (state) -&amp;gt; model -&amp;gt; action
  -&amp;gt; tool/environment -&amp;gt; observation
  -&amp;gt; (state update) -&amp;gt; model -&amp;gt; action
  -&amp;gt; ... repeat ...
  -&amp;gt; final answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference is subtle but massive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A chatbot &lt;strong&gt;generates&lt;/strong&gt; text.&lt;/li&gt;
&lt;li&gt;An agent &lt;strong&gt;executes&lt;/strong&gt; a policy over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model is the policy engine; the loop is the runtime.&lt;/p&gt;

&lt;p&gt;This means agents are fundamentally about &lt;strong&gt;systems&lt;/strong&gt;: orchestration, state, observability, guardrails, and evaluation.&lt;/p&gt;
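&lt;p&gt;The loop in the diagram fits in a few lines. The policy and the tool below are stubs (a real system would call an LLM and real APIs); only the runtime shape is the point.&lt;/p&gt;

```python
def policy(state: dict):
    # Stub policy: a real system would call an LLM with `state` here.
    if "weather" not in state:
        return ("call_tool", "get_weather")
    return ("finish", f"It is {state['weather']} today.")

TOOLS = {"get_weather": lambda: "sunny"}  # stub environment

def run_agent(user_input: str, max_steps: int = 5) -> str:
    state = {"input": user_input}             # (state)
    for _ in range(max_steps):
        kind, payload = policy(state)         # model -> action
        if kind == "finish":
            return payload                    # final answer
        observation = TOOLS[payload]()        # tool/environment -> observation
        state[payload.removeprefix("get_")] = observation  # state update
    return "stopped: step budget exhausted"   # fail closed, don't loop forever
```

&lt;p&gt;Everything interesting about agents lives in that loop: what goes into &lt;code&gt;state&lt;/code&gt;, which tools exist, and when the loop is allowed to stop.&lt;/p&gt;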




&lt;h1&gt;
  
  
  2) Memory: The Two Buckets You Can’t Avoid
&lt;/h1&gt;

&lt;p&gt;Human-like “memory” in agents usually becomes two concrete buckets:&lt;/p&gt;

&lt;h2&gt;
  
  
  2.1 Short-Term Memory (Working Memory)
&lt;/h2&gt;

&lt;p&gt;Short-term memory is whatever you stuff into the model’s current context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the current conversation (or the relevant slice of it),&lt;/li&gt;
&lt;li&gt;tool results you just fetched,&lt;/li&gt;
&lt;li&gt;intermediate notes (“scratchpad”),&lt;/li&gt;
&lt;li&gt;temporary constraints (deadlines, budgets, requirements).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Engineering reality check:&lt;/strong&gt; short-term memory is limited by your context window &lt;em&gt;and&lt;/em&gt; by model behavior.&lt;/p&gt;

&lt;p&gt;Two classic failure modes show up in production:&lt;/p&gt;

&lt;p&gt;1) &lt;strong&gt;Context trimming:&lt;/strong&gt; you cut earlier messages to save tokens → the agent “forgets” key constraints.&lt;/p&gt;

&lt;p&gt;2) &lt;strong&gt;Recency bias:&lt;/strong&gt; even with long contexts, models over-weight what’s near the end → old-but-important details get ignored.&lt;/p&gt;

&lt;p&gt;If you’ve ever watched an agent re-ask for information it already has, you’ve seen both.&lt;/p&gt;
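&lt;p&gt;One mitigation for the trimming failure is to pin constraints so they survive any cut. A minimal sketch, using word count as a stand-in for a real tokenizer:&lt;/p&gt;

```python
def trim_context(messages: list, budget_tokens: int) -> list:
    """Trim to budget, but never drop messages marked pinned."""
    pinned = [m for m in messages if m.get("pinned")]
    rest = [m for m in messages if not m.get("pinned")]
    kept, used = [], sum(len(m["text"].split()) for m in pinned)
    for m in reversed(rest):                  # prefer the most recent messages
        cost = len(m["text"].split())         # crude token estimate
        if used + cost > budget_tokens:
            break
        kept.append(m)
        used += cost
    return pinned + list(reversed(kept))
```

&lt;p&gt;Pinning doesn't fix recency bias on its own, but it guarantees the deadline or budget the user stated in message one is still physically in the context at step forty.&lt;/p&gt;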

&lt;h2&gt;
  
  
  2.2 Long-Term Memory (Persistent Memory)
&lt;/h2&gt;

&lt;p&gt;Long-term memory is stored outside the model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;vector DB embeddings,&lt;/li&gt;
&lt;li&gt;document stores,&lt;/li&gt;
&lt;li&gt;user profiles/preferences (only if you can do it safely and legally),&lt;/li&gt;
&lt;li&gt;task history and decisions,&lt;/li&gt;
&lt;li&gt;structured records (tickets, orders, logs, CRM entries).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mainstream pattern is: &lt;strong&gt;retrieve → inject → reason&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If that sounds like RAG (Retrieval-Augmented Generation), that’s because it is. Agents just make RAG &lt;em&gt;operational&lt;/em&gt;: retrieval isn’t only for answering questions—it’s for &lt;strong&gt;deciding what to do next&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The part people miss: memory needs structure
&lt;/h3&gt;

&lt;p&gt;A pile of vector chunks is not “memory.” It’s a landfill.&lt;/p&gt;

&lt;p&gt;Practical long-term memory works best when you store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;semantic content&lt;/strong&gt; (embedding),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;metadata&lt;/strong&gt; (timestamp, source, permissions, owner, reliability score),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;a policy&lt;/strong&gt; for when to write/read (what gets saved, what gets ignored),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;decay or TTL&lt;/strong&gt; for things that stop mattering.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you don’t design write/read policies, you’ll build an agent that remembers the wrong things forever.&lt;/p&gt;
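&lt;p&gt;Those four ingredients can be sketched as a record type plus a read policy. The field names and thresholds here are illustrative, not a standard schema:&lt;/p&gt;

```python
from dataclasses import dataclass, field
import time

@dataclass
class MemoryRecord:
    content: str
    source: str                     # provenance
    owner: str                      # permissions boundary
    reliability: float = 0.5        # 0..1, set by your write policy
    created_at: float = field(default_factory=time.time)
    ttl_seconds: float = 30 * 86400  # decay: stale entries stop mattering

    def is_live(self, now: float = None) -> bool:
        now = time.time() if now is None else now
        return now - self.created_at < self.ttl_seconds

def readable(records: list, min_reliability: float = 0.3) -> list:
    """Read policy: only live, sufficiently reliable records get injected."""
    return [r for r in records if r.is_live() and r.reliability >= min_reliability]
```

&lt;p&gt;The write policy is the mirror image: decide at save time whether something is worth remembering at all, and stamp it with source and reliability while you still know them.&lt;/p&gt;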




&lt;h1&gt;
  
  
  3) Planning: From Decomposition to Search
&lt;/h1&gt;

&lt;p&gt;Planning sounds philosophical, but it maps to one question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How does the agent choose the next action?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In real tasks, “next action” is rarely obvious. That’s why we plan: to reduce a big problem into smaller moves with checkpoints.&lt;/p&gt;

&lt;h2&gt;
  
  
  3.1 Task Decomposition: Why It’s Not Optional
&lt;/h2&gt;

&lt;p&gt;When you ask an agent to “plan,” you’re buying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;controllability:&lt;/strong&gt; you can inspect steps and constraints,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;debuggability:&lt;/strong&gt; you can see where it went wrong,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tool alignment:&lt;/strong&gt; each step can map to a tool call,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;lower hallucination risk:&lt;/strong&gt; fewer leaps, more verification.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But planning can be cheap or expensive depending on the technique.&lt;/p&gt;




&lt;h2&gt;
  
  
  3.2 CoT: Linear Reasoning as a Control Interface
&lt;/h2&gt;

&lt;p&gt;Chain-of-Thought style prompting nudges the model to produce intermediate reasoning before the final output.&lt;/p&gt;

&lt;p&gt;From an engineering perspective, the key benefit is &lt;em&gt;not&lt;/em&gt; “the model becomes smarter.”&lt;br&gt;
It’s that the model becomes &lt;strong&gt;more steerable&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it externalizes intermediate state,&lt;/li&gt;
&lt;li&gt;it decomposes implicitly into substeps,&lt;/li&gt;
&lt;li&gt;and you can gate or validate those steps.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  CoT ≠ show-the-user-everything
&lt;/h3&gt;

&lt;p&gt;In production, you often want the opposite: use structured reasoning internally, then output a crisp answer.&lt;/p&gt;

&lt;p&gt;This is both a UX decision (nobody wants a wall of text) and a safety decision (you don’t want to leak internal deliberations, secrets, or tool inputs).&lt;/p&gt;
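One common way to implement "reason internally, answer crisply" is to ask the model for structured output and only surface the answer field. A minimal sketch, assuming a model that returns JSON; `fake_llm_json` is a stand-in for that call.

```python
import json

def fake_llm_json(prompt):
    # Stand-in for a model call that returns structured output:
    # internal reasoning plus a user-facing answer.
    return json.dumps({
        "reasoning": "42 items at 2 each means 42 * 2 = 84.",
        "answer": "Total cost: 84.",
    })

def answer_user(prompt):
    raw = json.loads(fake_llm_json(prompt))
    # Keep reasoning for internal logs/debugging; never show it to the user.
    internal_trace = raw["reasoning"]
    return raw["answer"], internal_trace

visible, trace = answer_user("What do 42 items at 2 each cost?")
```

The trace goes to your logging pipeline; the user sees one line.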


&lt;h2&gt;
  
  
  3.3 ToT: When Reasoning Becomes Search
&lt;/h2&gt;

&lt;p&gt;Linear reasoning fails when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;there are multiple plausible paths,&lt;/li&gt;
&lt;li&gt;early choices are hard to reverse,&lt;/li&gt;
&lt;li&gt;you need lookahead (trade-offs, planning, puzzles, strategy).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tree-of-Thought style reasoning turns “thinking” into &lt;strong&gt;search&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;expand:&lt;/strong&gt; propose multiple candidate thoughts/steps,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;evaluate:&lt;/strong&gt; score candidates (by heuristics, constraints, or another model call),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;select:&lt;/strong&gt; continue exploring the best branches,&lt;/li&gt;
&lt;li&gt;optionally &lt;strong&gt;backtrack&lt;/strong&gt; if a branch collapses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If CoT is “one good route,” ToT is “try a few routes, keep the ones that look promising.”&lt;/p&gt;
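The expand/evaluate/select loop is essentially beam search. Here is a deliberately tiny sketch on a numeric toy problem (reach a target sum in a fixed number of steps); in a real agent, `expand` and `score` would be LLM calls or constraint checkers, not arithmetic.

```python
def expand(path):
    # Real systems: an LLM call proposing candidate next thoughts.
    return [path + [step] for step in (1, 2, 3)]

def score(path, target):
    # Real systems: heuristics, constraint checks, or a critic model.
    return -abs(target - sum(path))  # closer to target = better

def tree_of_thought(target, depth, beam_width=2):
    frontier = [[]]
    for _ in range(depth):
        candidates = [p for path in frontier for p in expand(path)]    # expand
        candidates.sort(key=lambda p: score(p, target), reverse=True)  # evaluate
        frontier = candidates[:beam_width]                             # select
    return frontier[0]

best = tree_of_thought(target=7, depth=3)
```

`beam_width` is the cost dial: widen it and quality may improve, but token burn grows with every branch you keep alive.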
&lt;h3&gt;
  
  
  The cost: token burn
&lt;/h3&gt;

&lt;p&gt;Search is expensive. If you expand branches without discipline, cost grows fast.&lt;/p&gt;

&lt;p&gt;So ToT tends to shine in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;high-value tasks,&lt;/li&gt;
&lt;li&gt;problems with clear evaluation signals,&lt;/li&gt;
&lt;li&gt;situations where being wrong is more expensive than being slow.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  3.4 GoT: The Engineering Upgrade (Reuse, Merge, Backtrack)
&lt;/h2&gt;

&lt;p&gt;Tree search wastes work when branches overlap.&lt;/p&gt;

&lt;p&gt;Graph-of-Thoughts takes a practical step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;treat intermediate reasoning as &lt;strong&gt;states&lt;/strong&gt; in a directed graph,&lt;/li&gt;
&lt;li&gt;allow &lt;strong&gt;merging&lt;/strong&gt; equivalent states (reuse),&lt;/li&gt;
&lt;li&gt;support &lt;strong&gt;backtracking&lt;/strong&gt; to arbitrary nodes,&lt;/li&gt;
&lt;li&gt;apply pruning more aggressively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If ToT is a tree, GoT is a graph with memory: you don’t re-derive what you already know.&lt;/p&gt;

&lt;p&gt;This matters in production where repeated tool calls and repeated reasoning are the real cost drivers.&lt;/p&gt;
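The "don't re-derive what you already know" idea reduces to a canonical state key plus a memo table: two branches that reach the same state merge into one graph node. A sketch, with `expand` standing in for an expensive reasoning or tool step.

```python
def canonical(state):
    # Two thought-paths that reach the same set of facts are the same node.
    return frozenset(state)

def explore(state, expand, memo):
    key = canonical(state)
    if key in memo:            # merge: this node was already expanded, reuse it
        return memo[key]
    result = expand(state)
    memo[key] = result
    return result

memo = {}
calls = []

def expand(state):
    # Stand-in for an expensive reasoning/tool step; we count invocations.
    calls.append(tuple(sorted(state)))
    return sorted(state) + ["derived"]

# Two different branches arrive at the same intermediate state:
a = explore(["fact1", "fact2"], expand, memo)
b = explore(["fact2", "fact1"], expand, memo)  # merged: no second expansion
```

The second call never touches `expand`, which is exactly the cost saving GoT is after.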


&lt;h2&gt;
  
  
  3.5 XoT: “Everything of Thoughts” as Research Direction
&lt;/h2&gt;

&lt;p&gt;XoT-style approaches try to unify thought paradigms and inject external knowledge and search methods (think: MCTS-style exploration + domain guidance).&lt;/p&gt;

&lt;p&gt;It’s promising, but the engineering bar is high:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you need reliable evaluation,&lt;/li&gt;
&lt;li&gt;tight budgets,&lt;/li&gt;
&lt;li&gt;and a clear “why” for using something heavier than a well-designed plan loop.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, many teams implement &lt;strong&gt;a lightweight ToT/GoT hybrid&lt;/strong&gt; without the full research stack.&lt;/p&gt;


&lt;h1&gt;
  
  
  4) ReAct: The Loop That Makes Agents Feel Real
&lt;/h1&gt;

&lt;p&gt;Planning is what the agent &lt;em&gt;intends&lt;/em&gt; to do.&lt;/p&gt;

&lt;p&gt;ReAct is what the agent &lt;em&gt;actually&lt;/em&gt; does:&lt;/p&gt;

&lt;p&gt;1) &lt;strong&gt;Reason&lt;/strong&gt; about what’s missing / what to do next&lt;br&gt;&lt;br&gt;
2) &lt;strong&gt;Act&lt;/strong&gt; by calling a tool&lt;br&gt;&lt;br&gt;
3) &lt;strong&gt;Observe&lt;/strong&gt; the result&lt;br&gt;&lt;br&gt;
4) &lt;strong&gt;Reflect&lt;/strong&gt; and adjust  &lt;/p&gt;

&lt;p&gt;Repeat until done.&lt;/p&gt;

&lt;p&gt;This solves three real problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;incomplete information:&lt;/strong&gt; the agent can fetch what it doesn’t know,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;verification:&lt;/strong&gt; it can check assumptions against reality,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;error recovery:&lt;/strong&gt; it can reroute after failures.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’ve ever debugged a hallucination, you already know why this matters:&lt;br&gt;
a believable explanation isn’t the same thing as a correct answer.&lt;/p&gt;


&lt;h1&gt;
  
  
  5) A Minimal Agent With Memory + Planning (Practical Version)
&lt;/h1&gt;

&lt;p&gt;Below is a deliberately “boring” agent loop. That’s the point.&lt;/p&gt;

&lt;p&gt;Most production agents are not sci-fi. They’re well-instrumented control loops with strict budgets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="c1"&gt;# --- Tools (stubs) ---------------------------------------------------------
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Replace with your search API call + caching.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[search-results for: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Replace with a safe evaluator.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__builtins__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{}},&lt;/span&gt; &lt;span class="p"&gt;{}))&lt;/span&gt;

&lt;span class="c1"&gt;# --- Memory ----------------------------------------------------------------
&lt;/span&gt;
&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MemoryItem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MemoryStore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;short_term&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MemoryItem&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;long_term&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MemoryItem&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# stand-in for vector DB
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remember_short&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;short_term&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MemoryItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remember_long&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;long_term&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MemoryItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_long&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MemoryItem&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# Dummy retrieval: filter by substring.
&lt;/span&gt;        &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;long_term&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# --- Planner (very small ToT-ish idea) ------------------------------------
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;propose_plans&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="c1"&gt;# In reality: this is an LLM call producing multiple plan candidates.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search key facts about: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Break task into steps, then execute step-by-step: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ask a clarifying question if constraints are missing: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score_plan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Heuristic scoring: prefer plans that verify facts.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Break task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="c1"&gt;# --- Agent Loop ------------------------------------------------------------
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MemoryStore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# 1) Retrieve long-term memory if relevant.
&lt;/span&gt;    &lt;span class="n"&gt;recalled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve_long&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;recalled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remember_short&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Recalled: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;long_term&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2) Plan (cheap multi-candidate selection).
&lt;/span&gt;    &lt;span class="n"&gt;plans&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;propose_plans&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plans&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;score_plan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remember_short&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chosen plan: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3) Execute loop.
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# In reality: this is an LLM call that decides "next tool" based on state.
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;obs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remember_short&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Observation: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="c1"&gt;# Example: do a small computation if the task contains a calc hint.
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calculate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;obs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;6 * 7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remember_short&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Observation: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="c1"&gt;# Stop condition (simplified).
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="c1"&gt;# 4) Final answer: summarise short-term state.
&lt;/span&gt;    &lt;span class="n"&gt;notes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;short_term&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;:]])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;What I did:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Final: (produce a user-facing answer here.)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Demo usage:
&lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemoryStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remember_long&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User prefers concise outputs with clear bullets.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;preference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a short guide on LLM agents with memory and planning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What this toy example demonstrates (and why it matters)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory is state, not vibe.&lt;/strong&gt; It’s read/write with policy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Planning can be multi-candidate without going full ToT.&lt;/strong&gt; Generate a few, pick one, move on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool calls are first-class.&lt;/strong&gt; Observations update state, not just the transcript.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budgets exist.&lt;/strong&gt; &lt;code&gt;max_steps&lt;/code&gt; is a real safety and cost control.&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  6) Production Notes: Where Agents Actually Fail
&lt;/h1&gt;

&lt;p&gt;If you want this to work outside demos, you’ll spend most of your time on these five areas.&lt;/p&gt;

&lt;h2&gt;
  
  
  6.1 Tool reliability beats prompt cleverness
&lt;/h2&gt;

&lt;p&gt;Tools fail. Time out. Rate limit. Return weird formats.&lt;/p&gt;

&lt;p&gt;Your agent loop needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retries with backoff,&lt;/li&gt;
&lt;li&gt;strict schemas,&lt;/li&gt;
&lt;li&gt;parsing + validation,&lt;/li&gt;
&lt;li&gt;and fallback strategies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A “smart” agent without robust I/O is just a creative writer with API keys.&lt;/p&gt;
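A sketch of the robust-I/O wrapper described above: retries with exponential backoff plus output validation. `flaky_tool` is a hypothetical stand-in that fails twice before succeeding, to exercise the retry path.

```python
import time

def call_with_retries(tool, arg, attempts=3, base_delay=0.01, validate=None):
    last_err = None
    for attempt in range(attempts):
        try:
            result = tool(arg)
            # Strict validation: a response that parses but fails the schema
            # check counts as a failure, not a success.
            if validate is not None and not validate(result):
                raise ValueError(f"invalid tool output: {result!r}")
            return result
        except Exception as err:
            last_err = err
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"tool failed after {attempts} attempts") from last_err

failures = {"count": 0}

def flaky_tool(query):
    # Stand-in for a real API: times out twice, then succeeds.
    if failures["count"] < 2:
        failures["count"] += 1
        raise TimeoutError("upstream timeout")
    return {"status": "ok", "data": f"results for {query}"}

out = call_with_retries(flaky_tool, "llm agents",
                        validate=lambda r: r.get("status") == "ok")
```

A production version would also distinguish retryable errors (timeouts, 429s) from permanent ones and route the latter to a fallback strategy instead of retrying.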

&lt;h2&gt;
  
  
  6.2 Memory needs permissions and hygiene
&lt;/h2&gt;

&lt;p&gt;If you store user data, you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clear consent and retention rules,&lt;/li&gt;
&lt;li&gt;permission checks at retrieval time,&lt;/li&gt;
&lt;li&gt;deletion pathways,&lt;/li&gt;
&lt;li&gt;and safe defaults.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In regulated environments, long-term memory is often the highest-risk component.&lt;/p&gt;
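"Permission checks at retrieval time" and "deletion pathways" can be sketched concretely: every record carries an owner and an access list, reads filter by requester, and deletion is a first-class operation. The record shapes here are illustrative, not a schema recommendation.

```python
records = [
    {"id": 1, "text": "Order #881 shipped", "owner": "alice",
     "acl": {"alice", "support"}},
    {"id": 2, "text": "Alice's saved card", "owner": "alice",
     "acl": {"alice"}},
]

def retrieve(store, query, requester):
    # Permissions are enforced at read time, not only at write time:
    # a record the requester can't see never reaches the prompt.
    allowed = [r for r in store if requester in r["acl"]]
    return [r for r in allowed if query.lower() in r["text"].lower()]

def delete_user_data(store, owner):
    # Deletion pathway: required for retention rules and user requests.
    return [r for r in store if r["owner"] != owner]

support_view = retrieve(records, "card", requester="support")  # ACL blocks this
alice_view = retrieve(records, "card", requester="alice")
records = delete_user_data(records, owner="alice")
```

The key property: a support agent querying "card" gets nothing, even though the record exists.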

&lt;h2&gt;
  
  
  6.3 Planning needs evaluation signals
&lt;/h2&gt;

&lt;p&gt;Search-based planning is only as good as its scoring.&lt;/p&gt;

&lt;p&gt;You’ll likely need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;constraint checkers,&lt;/li&gt;
&lt;li&gt;unit tests for tool outputs,&lt;/li&gt;
&lt;li&gt;or a separate “critic” model call that can reject bad steps.&lt;/li&gt;
&lt;/ul&gt;
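&lt;p&gt;One minimal shape for that scoring layer (the checker functions are placeholders for your real validators, unit tests, or a critic model call):&lt;/p&gt;

```python
def score_step(step, checkers):
    """Score a candidate plan step against constraint checkers.

    Each checker returns (passed, reason). A step that fails a checker
    whose reason starts with "hard:" is rejected outright; otherwise the
    step is scored by its pass rate. Checkers here are illustrative.
    """
    passed = 0
    for check in checkers:
        ok, reason = check(step)
        if not ok and reason.startswith("hard:"):
            return None               # reject: violates a hard constraint
        if ok:
            passed += 1
    return passed / max(len(checkers), 1)

def pick_best_step(candidates, checkers):
    """Keep the viable candidates, take the best; None means replan."""
    scored = [(score_step(c, checkers), c) for c in candidates]
    viable = [(s, c) for s, c in scored if s is not None]
    if not viable:
        return None
    return max(viable, key=lambda sc: sc[0])[1]
```

&lt;p&gt;Swap the pass-rate score for a critic model’s rating and the search structure stays identical.&lt;/p&gt;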

&lt;h2&gt;
  
  
  6.4 Observability is not optional
&lt;/h2&gt;

&lt;p&gt;If you can’t trace:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which tool was called,&lt;/li&gt;
&lt;li&gt;with what inputs,&lt;/li&gt;
&lt;li&gt;what it returned,&lt;/li&gt;
&lt;li&gt;and how it changed the plan,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;you can’t debug. You also can’t measure improvements.&lt;/p&gt;

&lt;p&gt;Log everything. Then decide what to retain.&lt;/p&gt;
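&lt;p&gt;You don’t need a full observability stack to start; a structured in-memory trace like this sketch (field names are illustrative) already answers all four questions:&lt;/p&gt;

```python
import json
import time

def log_tool_call(trace, tool_name, inputs, output, plan_before, plan_after):
    """Append one structured tool-call record to an in-memory trace.

    A minimal sketch: real systems would ship these records to a log store,
    but these are the fields that matter when debugging an agent loop.
    """
    trace.append({
        "ts": time.time(),
        "tool": tool_name,
        "inputs": inputs,            # what the tool was called with
        "output": output,            # what it returned
        "plan_before": plan_before,  # the plan going in
        "plan_after": plan_after,    # how the observation changed it
    })

def dump_trace(trace):
    """Serialise the trace as JSON lines so a run can be replayed and diffed."""
    return "\n".join(json.dumps(rec, sort_keys=True) for rec in trace)
```

&lt;p&gt;JSON-lines dumps are diffable, which turns “did my change help?” into a file comparison rather than a vibe check.&lt;/p&gt;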

&lt;h2&gt;
  
  
  6.5 Security: agents amplify blast radius
&lt;/h2&gt;

&lt;p&gt;When a model can take actions, mistakes become incidents.&lt;/p&gt;

&lt;p&gt;Guardrails look like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;allowlists (tools, domains, actions),&lt;/li&gt;
&lt;li&gt;spend limits,&lt;/li&gt;
&lt;li&gt;step limits,&lt;/li&gt;
&lt;li&gt;sandboxing,&lt;/li&gt;
&lt;li&gt;and human-in-the-loop gates for high-impact actions.&lt;/li&gt;
&lt;/ul&gt;
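&lt;p&gt;A compact sketch of those guardrails as a single gate in front of every action (tool names, costs, and thresholds are made up for illustration):&lt;/p&gt;

```python
class GuardrailError(Exception):
    pass

def guarded_call(tool_name, args, state, allowlist, max_steps=10, max_spend=5.0):
    """Gate an agent action behind allowlist, step, and spend limits.

    `state` tracks steps taken and money spent; `allowlist` maps permitted
    tool names to their estimated cost. All names here are illustrative.
    """
    if tool_name not in allowlist:
        raise GuardrailError(f"tool not allowlisted: {tool_name}")
    if state["steps"] >= max_steps:
        raise GuardrailError("step budget exhausted")
    cost = allowlist[tool_name]
    if state["spend"] + cost > max_spend:
        raise GuardrailError("spend limit would be exceeded")
    if cost >= 1.0:
        # high-impact action: require an explicit human-in-the-loop approval
        if not state.get("human_approved"):
            raise GuardrailError(f"human approval required for {tool_name}")
    state["steps"] += 1
    state["spend"] += cost
    return {"tool": tool_name, "args": args, "status": "allowed"}
```

&lt;p&gt;Note the ordering: state only mutates after every check passes, so a blocked call leaves no residue.&lt;/p&gt;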




&lt;h1&gt;
  
  
  7) The Real “Agent Upgrade”: A Better Mental Model
&lt;/h1&gt;

&lt;p&gt;If you remember one thing, make it this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An agent is an LLM inside a state machine.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory = state&lt;/li&gt;
&lt;li&gt;Planning = policy shaping&lt;/li&gt;
&lt;li&gt;Tools = actuators&lt;/li&gt;
&lt;li&gt;Observations = state transitions&lt;/li&gt;
&lt;li&gt;Reflection = error-correcting feedback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you build agents this way, you stop chasing “the perfect prompt” and start shipping systems that can survive reality.&lt;/p&gt;

&lt;p&gt;And reality is the only benchmark that matters.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
    </item>
    <item>
      <title>Refactoring Agent Skills: From Context Explosion to a Fast, Reliable Workflow</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Sun, 15 Feb 2026 22:13:15 +0000</pubDate>
      <link>https://forem.com/superorange0707/refactoring-agent-skills-from-context-explosion-to-a-fast-reliable-workflow-5hg6</link>
      <guid>https://forem.com/superorange0707/refactoring-agent-skills-from-context-explosion-to-a-fast-reliable-workflow-5hg6</guid>
      <description>&lt;h1&gt;
  
  
  Refactoring Agent Skills: The Day My Context Window Died
&lt;/h1&gt;

&lt;p&gt;There’s a specific kind of pain you only experience once:&lt;/p&gt;

&lt;p&gt;You’re in Claude Code, you trigger a couple of “helpful” Skills, and suddenly the model is chewing through &lt;strong&gt;thousands of lines&lt;/strong&gt; of markdown + snippets it didn’t ask for.&lt;/p&gt;

&lt;p&gt;Your “AI co-pilot” stops feeling like a co-pilot and starts feeling like a browser tab you can’t close.&lt;/p&gt;

&lt;p&gt;This piece is a practical rewrite (and upgrade) of a popular Claude Code community refactor story: a developer thought “more info = better,” built Skills like mini-wikis, and accidentally created a &lt;strong&gt;context explosion&lt;/strong&gt;. The fix wasn’t a clever prompt. It was an architectural refactor. The result: dramatically leaner initial context and much better token efficiency.&lt;/p&gt;

&lt;p&gt;Let’s steal the playbook.&lt;/p&gt;




&lt;h1&gt;
  
  
  1) The Root Cause: Treating Skills Like Docs
&lt;/h1&gt;

&lt;p&gt;The first trap is incredibly human:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“If I include everything, the model will always have what it needs.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So you create one Skill per tool, and each Skill becomes a documentation dump:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;setup steps&lt;/li&gt;
&lt;li&gt;API references&lt;/li&gt;
&lt;li&gt;exhaustive examples&lt;/li&gt;
&lt;li&gt;“don’t do X” lists&lt;/li&gt;
&lt;li&gt;every edge case since 2017&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then a task like “deploy a serverless function with a small UI” pulls in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;your Cloudflare skill,&lt;/li&gt;
&lt;li&gt;your Docker skill,&lt;/li&gt;
&lt;li&gt;your UI styling skill,&lt;/li&gt;
&lt;li&gt;your web framework skill…&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…and the model starts its job already half-drowned.&lt;/p&gt;

&lt;p&gt;Claude Code’s own docs warn that Skills share the context window with the conversation, the request, and other Skills — which means uncontrolled loading is a direct performance tax. (You feel it as slowness, drift, and “why is it ignoring the obvious part?”)&lt;/p&gt;

&lt;p&gt;So: &lt;strong&gt;your problem isn’t “lack of info.” It’s “too much irrelevant info.”&lt;/strong&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  2) The Fix: Progressive Disclosure (Three Layers)
&lt;/h1&gt;

&lt;p&gt;Claude Code docs explicitly recommend &lt;strong&gt;progressive disclosure&lt;/strong&gt;: keep essential info in &lt;code&gt;SKILL.md&lt;/code&gt;, and store the heavy stuff in separate files that get loaded only when the task requires them.&lt;/p&gt;

&lt;p&gt;This maps cleanly to a three-layer system:&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1 — Metadata (always loaded)
&lt;/h2&gt;

&lt;p&gt;A short YAML frontmatter: name + description + the “routing signal.”&lt;/p&gt;

&lt;p&gt;Think of it like a book cover and blurb. You’re not teaching. You’re helping the model decide whether to open the book.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 2 — Entry point: &lt;code&gt;SKILL.md&lt;/code&gt; (loaded on activation)
&lt;/h2&gt;

&lt;p&gt;Your navigation map:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what the Skill is for&lt;/li&gt;
&lt;li&gt;when to use it&lt;/li&gt;
&lt;li&gt;what steps to follow&lt;/li&gt;
&lt;li&gt;what files to open next&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not a tutorial. Not a wiki.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 3 — References &amp;amp; scripts (loaded &lt;em&gt;only when needed&lt;/em&gt;)
&lt;/h2&gt;

&lt;p&gt;Small, focused files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one topic per file&lt;/li&gt;
&lt;li&gt;200–300 lines per file is a good target&lt;/li&gt;
&lt;li&gt;scripts do deterministic work so the model doesn’t burn tokens “describing” actions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s what that looks like in a real folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.claude/skills/devops/
├── SKILL.md
├── references/
│   ├── serverless-cloudflare.md
│   ├── containers-docker.md
│   └── ci-cd-basics.md
└── scripts/
    ├── validate_env.py
    └── deploy_helper.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  3) The “200-Line Rule”: Brutal, Slightly Arbitrary, Weirdly Effective
&lt;/h1&gt;

&lt;p&gt;In the community refactor story, the author landed on a hard constraint:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Keep &lt;code&gt;SKILL.md&lt;/code&gt; under ~200 lines.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
If you can’t, you’re putting too much in the entry point.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Claude’s own best practices docs recommend keeping the body under a few hundred lines (and splitting content as you approach that limit). But “200 lines” is a sharper knife: it forces you to write a &lt;strong&gt;table of contents&lt;/strong&gt;, not a textbook.&lt;/p&gt;

&lt;p&gt;Why it works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model can scan the entry quickly&lt;/li&gt;
&lt;li&gt;It can decide what reference file to load next&lt;/li&gt;
&lt;li&gt;Total “initial load” stays small enough that the conversation still has room to breathe&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A quick test you can steal
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Start a fresh session (cold start)&lt;/li&gt;
&lt;li&gt;Trigger your Skill&lt;/li&gt;
&lt;li&gt;If your first activation loads &lt;strong&gt;more than ~500 lines&lt;/strong&gt; of content, your design is likely leaking scope&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  4) The Real Mental Shift: From Tool-Centric to Workflow-Centric
&lt;/h1&gt;

&lt;p&gt;This is the part most people miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool-centric Skills&lt;/strong&gt; look like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;cloudflare-skill&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tailwind-skill&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;postgres-skill&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubernetes-skill&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They’re encyclopedias. They don’t compose well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workflow-centric Skills&lt;/strong&gt; look like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;devops&lt;/code&gt; (deploy + environments + CI/CD)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ui-styling&lt;/code&gt; (design rules + component patterns)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;web-frameworks&lt;/code&gt; (routing + project structure + SSR pitfalls)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;databases&lt;/code&gt; (schema design + migrations + query patterns)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They map to what you &lt;em&gt;actually do&lt;/em&gt; during development.&lt;/p&gt;

&lt;p&gt;A workflow Skill answers:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“When I’m in this stage of work, what does the agent need to know to act correctly?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What is everything this tool can do?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That one reframing prevents context blowups almost by itself.&lt;/p&gt;




&lt;h1&gt;
  
  
  5) A Minimal, Production-Grade &lt;code&gt;SKILL.md&lt;/code&gt; (Example)
&lt;/h1&gt;

&lt;p&gt;Here’s a deliberately small entry point you can copy and customise.&lt;br&gt;&lt;br&gt;
Notice what’s missing: long examples, full docs, and “everything you might ever need.”&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ui-styling&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Apply consistent UI styling across the app (Tailwind + component conventions). Use when building or refactoring UI.&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gh"&gt;# UI Styling Skill&lt;/span&gt;

&lt;span class="gu"&gt;## When to use&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; You are building UI components or pages
&lt;span class="p"&gt;-&lt;/span&gt; You need consistent spacing, typography, and responsive behaviour
&lt;span class="p"&gt;-&lt;/span&gt; You need to align with existing design conventions

&lt;span class="gu"&gt;## Workflow&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Identify the UI surface (page/component) and constraints (responsive, dark mode, accessibility)
&lt;span class="p"&gt;2.&lt;/span&gt; Apply styling rules from the references (pick only what you need)
&lt;span class="p"&gt;3.&lt;/span&gt; Validate output against the checklist

&lt;span class="gu"&gt;## References (load only if needed)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`references/design-tokens.md`&lt;/span&gt; — spacing, font scale, colour usage
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`references/tailwind-patterns.md`&lt;/span&gt; — layouts, common utility combos
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`references/accessibility-checklist.md`&lt;/span&gt; — keyboard, focus, contrast

&lt;span class="gu"&gt;## Output contract&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Use UK English in UI strings
&lt;span class="p"&gt;-&lt;/span&gt; Prefer reusable components over copy-paste blocks
&lt;span class="p"&gt;-&lt;/span&gt; Keep className readable (extract when it gets messy)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it.&lt;/p&gt;

&lt;p&gt;The Skill’s job is to &lt;strong&gt;route&lt;/strong&gt; the agent to the right file at the right moment — not to become an on-page encyclopedia.&lt;/p&gt;




&lt;h1&gt;
  
  
  6) Measuring Improvements (Without Lying to Yourself)
&lt;/h1&gt;

&lt;p&gt;If you want repeatable results, track metrics that actually matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initial lines loaded&lt;/strong&gt; on activation
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time to activation&lt;/strong&gt; (roughly: how “snappy” it feels)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relevance ratio&lt;/strong&gt; (how much of the loaded content is used)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context overflow frequency&lt;/strong&gt; (how often long tasks crash)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don’t need a full observability stack. A simple repo audit script helps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tiny Python audit: count lines per Skill
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="n"&gt;skills_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.claude/skills&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_lines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ignore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;skill&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;skills_dir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterdir&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
    &lt;span class="n"&gt;skill_md&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;skill&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SKILL.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;skill_md&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;count_lines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;skill_md&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REFACTOR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;skill&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; lines  -&amp;gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you run this weekly, you’ll catch “documentation creep” before it becomes a crisis.&lt;/p&gt;




&lt;h1&gt;
  
  
  7) Common Failure Modes (And How to Avoid Them)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Failure mode: Claude writes “a doc” instead of “a Skill”
&lt;/h2&gt;

&lt;p&gt;LLMs love expanding markdown into tutorials.&lt;/p&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;explicitly tell it: &lt;strong&gt;this is not documentation&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;remove “beginner” filler&lt;/li&gt;
&lt;li&gt;keep examples short, push detail into references&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Failure mode: Entry point bloats because the Skill scope is too wide
&lt;/h2&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;split the Skill by workflow stage&lt;/li&gt;
&lt;li&gt;or move decision trees into references&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Failure mode: Too many references, still hard to navigate
&lt;/h2&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;put a short “map” section in &lt;code&gt;SKILL.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;keep reference files single-topic and named by intent, not by tool&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  8) A Copyable Refactor Checklist
&lt;/h1&gt;

&lt;p&gt;1) &lt;strong&gt;Audit&lt;/strong&gt;: list Skills + line counts, find any &lt;code&gt;SKILL.md&lt;/code&gt; &amp;gt; 200 lines&lt;br&gt;&lt;br&gt;
2) &lt;strong&gt;Group by workflow&lt;/strong&gt;: merge tool-specific Skills into capability Skills&lt;br&gt;&lt;br&gt;
3) &lt;strong&gt;Create references&lt;/strong&gt;: move detailed info out of &lt;code&gt;SKILL.md&lt;/code&gt;&lt;br&gt;&lt;br&gt;
4) &lt;strong&gt;Enforce entry constraints&lt;/strong&gt;: keep &lt;code&gt;SKILL.md&lt;/code&gt; lean and navigational&lt;br&gt;&lt;br&gt;
5) &lt;strong&gt;Cold start test&lt;/strong&gt;: ensure first activation stays under your chosen budget&lt;br&gt;&lt;br&gt;
6) &lt;strong&gt;Keep scripts deterministic&lt;/strong&gt;: offload “do the thing” to code where possible&lt;br&gt;&lt;br&gt;
7) &lt;strong&gt;Re-check monthly&lt;/strong&gt;: Skills drift over time; treat them like code&lt;/p&gt;




&lt;h1&gt;
  
  
  Final Take: Context Engineering Is “Right Info, Right Time”
&lt;/h1&gt;

&lt;p&gt;The big lesson isn’t “200 lines” or “three layers.”&lt;/p&gt;

&lt;p&gt;It’s this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context is a budget.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
And the best Skill design spends it like an engineer, not like a librarian.&lt;/p&gt;

&lt;p&gt;Don’t load everything. Load what matters — when it matters — and keep the rest one file away.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>mcp</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>Stop Fine-Tuning Blindly: When to Fine-Tune—and When Not to Touch Model Weights</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Sun, 15 Feb 2026 22:12:59 +0000</pubDate>
      <link>https://forem.com/superorange0707/stop-fine-tuning-blindly-when-to-fine-tune-and-when-not-to-touch-model-weights-21j</link>
      <guid>https://forem.com/superorange0707/stop-fine-tuning-blindly-when-to-fine-tune-and-when-not-to-touch-model-weights-21j</guid>
      <description>&lt;h1&gt;
  
  
  Fine-Tuning Is a Knife, Not a Hammer
&lt;/h1&gt;

&lt;p&gt;Fine-tuning has a reputation problem.&lt;/p&gt;

&lt;p&gt;Some people treat it like magic: “Just fine-tune and the model will &lt;em&gt;understand our domain&lt;/em&gt;.”&lt;br&gt;&lt;br&gt;
Others treat it like a sin: “Never touch weights, it’s all prompt engineering now.”&lt;/p&gt;

&lt;p&gt;Both are wrong.&lt;/p&gt;

&lt;p&gt;Fine-tuning is a &lt;strong&gt;precision tool&lt;/strong&gt;. Used well, it turns a generic model into a specialist. Used badly, it burns GPU budgets, bakes in bias, and ships a model that performs &lt;em&gt;worse&lt;/em&gt; than the base.&lt;/p&gt;

&lt;p&gt;This is a field guide: what types of fine-tuning exist, what they cost, how to run them, and the traps that quietly ruin outcomes.&lt;/p&gt;


&lt;h1&gt;
  
  
  1) The Real Taxonomy of Fine-Tuning
&lt;/h1&gt;

&lt;p&gt;There are multiple ways to classify fine-tuning. The cleanest is: &lt;strong&gt;what changes&lt;/strong&gt;, &lt;strong&gt;what signal you train on&lt;/strong&gt;, and &lt;strong&gt;what model type you’re adapting&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  1.1 By training scope: Full FT vs PEFT
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Full fine-tuning (Full FT)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Definition:&lt;/strong&gt; update &lt;em&gt;all&lt;/em&gt; model weights so the model fully adapts to the new task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maximum flexibility, maximum cost&lt;/li&gt;
&lt;li&gt;Requires strong data quality and careful regularisation&lt;/li&gt;
&lt;li&gt;Risk: &lt;strong&gt;catastrophic forgetting&lt;/strong&gt; (the model “forgets” general abilities)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When it makes sense:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have a stable task and a solid dataset (usually 10k–100k+ high-quality samples)&lt;/li&gt;
&lt;li&gt;You can afford experiments and regression testing&lt;/li&gt;
&lt;li&gt;You need deeper behavioural change than PEFT can deliver&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Parameter-Efficient Fine-Tuning (PEFT)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Definition:&lt;/strong&gt; freeze most weights and train small, targeted parameters.&lt;/p&gt;

&lt;p&gt;You get most of the gains with a fraction of the cost.&lt;/p&gt;

&lt;p&gt;PEFT subtypes you’ll actually see in production:&lt;/p&gt;
&lt;h4&gt;
  
  
  (A) Adapters
&lt;/h4&gt;

&lt;p&gt;Insert small modules inside transformer blocks; train only those adapter weights. Typically a few percent of the total parameters.&lt;/p&gt;
&lt;h4&gt;
  
  
  (B) Prompt tuning (soft prompts / prefix tuning)
&lt;/h4&gt;

&lt;p&gt;Train learnable “prompt vectors” (or a prefix) that steer behaviour.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Soft prompts: continuous vectors&lt;/li&gt;
&lt;li&gt;Hard prompts: discrete tokens (rarely “trained” in the same way)&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  (C) LoRA (Low-Rank Adaptation)
&lt;/h4&gt;

&lt;p&gt;LoRA is the workhorse. It decomposes weight updates into low-rank matrices:&lt;/p&gt;

&lt;p&gt;$$&lt;br&gt;
\Delta W = BA, \quad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d,k)&lt;br&gt;
$$&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it wins:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You store only the low-rank factors (B) and (A), which are far smaller than the full (\Delta W)&lt;/li&gt;
&lt;li&gt;Easy to swap adapters per task&lt;/li&gt;
&lt;li&gt;Strong performance per compute&lt;/li&gt;
&lt;/ul&gt;
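&lt;p&gt;The arithmetic behind “small” is worth seeing once. This sketch compares the parameters a full update stores against the LoRA factorisation, using an attention projection size typical of 7B-class models as an example:&lt;/p&gt;

```python
def lora_param_counts(d, k, r):
    """Compare a full weight update against its LoRA factorisation.

    Full fine-tuning stores d*k updated values; LoRA stores B (d x r)
    plus A (r x k), i.e. r*(d + k), which is tiny when r is much
    smaller than d and k.
    """
    full = d * k
    lora = r * (d + k)
    return full, lora, lora / full

# e.g. a 4096x4096 projection with rank r=8: LoRA stores ~0.4% of the full update
full, lora, ratio = lora_param_counts(4096, 4096, 8)
```

&lt;p&gt;That ratio is why swapping one adapter per task is cheap: you keep one frozen base and a drawer of small factor pairs.&lt;/p&gt;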
&lt;h4&gt;
  
  
  (D) QLoRA
&lt;/h4&gt;

&lt;p&gt;QLoRA runs LoRA on a &lt;strong&gt;quantised base model&lt;/strong&gt; (often 4-bit), slashing VRAM requirements and making “big-ish” fine-tuning viable on consumer GPUs.&lt;/p&gt;


&lt;h2&gt;
  
  
  1.2 By learning signal: SFT, RLHF, contrastive (and friends)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Supervised Fine-Tuning (SFT)
&lt;/h3&gt;

&lt;p&gt;Train on labelled input-output pairs. This is the default for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;classification&lt;/li&gt;
&lt;li&gt;extraction&lt;/li&gt;
&lt;li&gt;instruction following (instruction tuning)&lt;/li&gt;
&lt;li&gt;style / tone adaptation&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Preference optimisation (RLHF / DPO / variants)
&lt;/h3&gt;

&lt;p&gt;Classic RLHF pipeline: SFT → reward model → policy optimisation (e.g., PPO).&lt;br&gt;&lt;br&gt;
In practice, many teams now use &lt;strong&gt;direct preference optimisation (DPO)&lt;/strong&gt;-style training because it’s simpler operationally, but the concept is the same: &lt;em&gt;align the model to preferences&lt;/em&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Contrastive fine-tuning
&lt;/h3&gt;

&lt;p&gt;Useful when you care about representations (retrieval, similarity, embedding quality), less common for everyday text generation.&lt;/p&gt;


&lt;h2&gt;
  
  
  1.3 By modality: language, vision, multimodal
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NLP&lt;/strong&gt;: BERT/GPT/T5-style models; instruction tuning and chain-of-thought-style supervision are common&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision&lt;/strong&gt;: ResNet/ViT; progressive unfreezing and strong augmentation matter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal&lt;/strong&gt;: CLIP/BLIP/Flamingo-like; biggest challenge is aligning representations across modalities&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;
  
  
  2) When Fine-Tuning Actually Pays Off
&lt;/h1&gt;

&lt;p&gt;Fine-tuning shines in three situations:&lt;/p&gt;
&lt;h2&gt;
  
  
  2.1 Your domain language is not optional
&lt;/h2&gt;

&lt;p&gt;Example: finance risk text. If the base model misreads terms like “short”, “subprime”, “haircut”, it will miss signals no matter how clever the prompt is.&lt;/p&gt;
&lt;h2&gt;
  
  
  2.2 Your task needs consistent behaviour, not one-off brilliance
&lt;/h2&gt;

&lt;p&gt;A model that produces “sometimes great” answers is a nightmare in production. Fine-tuning can stabilise behaviour and reduce prompt complexity.&lt;/p&gt;
&lt;h2&gt;
  
  
  2.3 Your deployment requires control
&lt;/h2&gt;

&lt;p&gt;On-prem constraints, latency budgets, data residency: self-hosted models + PEFT are often the only workable path.&lt;/p&gt;


&lt;h1&gt;
  
  
  3) When You Should NOT Fine-Tune
&lt;/h1&gt;

&lt;p&gt;Here are the expensive mistakes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;100 labelled samples&lt;/strong&gt;: you’ll overfit or learn noise
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;task changes weekly&lt;/strong&gt;: your fine-tune becomes technical debt
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;you can solve it with retrieval&lt;/strong&gt;: if the problem is “missing knowledge,” do RAG first
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;you can’t evaluate properly&lt;/strong&gt;: if you can’t measure, don’t train&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;
  
  
  4) The Fine-Tuning Workflow That Survives Production
&lt;/h1&gt;

&lt;p&gt;Forget “train.py and vibes.” A real pipeline has repeatable stages.&lt;/p&gt;
&lt;h2&gt;
  
  
  4.1 Environment
&lt;/h2&gt;

&lt;p&gt;Core stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PyTorch&lt;/li&gt;
&lt;li&gt;Transformers + Datasets&lt;/li&gt;
&lt;li&gt;Accelerate&lt;/li&gt;
&lt;li&gt;PEFT&lt;/li&gt;
&lt;li&gt;Experiment tracking (Weights &amp;amp; Biases or MLflow)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  4.2 Data
&lt;/h2&gt;

&lt;p&gt;This is where most projects win or lose.&lt;/p&gt;

&lt;p&gt;Minimum checklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;label consistency (do two annotators agree?)&lt;/li&gt;
&lt;li&gt;balanced distribution (avoid 10:1 class collapse unless you correct for it)&lt;/li&gt;
&lt;li&gt;no leakage (train/val split must be clean)&lt;/li&gt;
&lt;/ul&gt;
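&lt;p&gt;Exact-duplicate leakage is cheap to check before you train. A hash-based sketch (near-duplicate detection, e.g. MinHash, is a further step this omits):&lt;/p&gt;

```python
import hashlib

def leaked_examples(train_texts, val_texts):
    """Find validation examples that also appear in the training set.

    Hashing whitespace-normalised, lowercased text catches exact
    duplicates, which is the most common (and most damaging) leak.
    """
    def fingerprint(text):
        normalised = " ".join(text.lower().split())
        return hashlib.sha256(normalised.encode()).hexdigest()

    train_hashes = {fingerprint(t) for t in train_texts}
    return [t for t in val_texts if fingerprint(t) in train_hashes]
```

&lt;p&gt;Run it once before every training job; a non-empty result means your eval numbers would have been fiction.&lt;/p&gt;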
&lt;h2&gt;
  
  
  4.3 Model config
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;pick base model&lt;/li&gt;
&lt;li&gt;pick tuning method (LoRA vs QLoRA vs full)&lt;/li&gt;
&lt;li&gt;decide what gets trained, what stays frozen&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  4.4 Training loop
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;forward → loss → backward&lt;/li&gt;
&lt;li&gt;gradient clipping&lt;/li&gt;
&lt;li&gt;mixed precision when appropriate&lt;/li&gt;
&lt;li&gt;periodic eval&lt;/li&gt;
&lt;/ul&gt;
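&lt;p&gt;Gradient clipping is the step people most often hand-wave. Here is the global-norm rule in plain Python for clarity (it’s the same rule &lt;code&gt;torch.nn.utils.clip_grad_norm_&lt;/code&gt; applies to real tensors):&lt;/p&gt;

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient vectors so their combined L2 norm stays at or
    below max_norm. `grads` is a list of plain lists standing in for
    per-parameter gradient tensors."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total > max_norm:
        # one shared scale factor preserves the gradient's direction
        scale = max_norm / total
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, total
```

&lt;p&gt;The point to internalise: clipping rescales the &lt;em&gt;whole&lt;/em&gt; gradient uniformly, so the update direction is unchanged; only its magnitude is capped.&lt;/p&gt;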
&lt;h2&gt;
  
  
  4.5 Evaluation + export
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;validate on held-out set&lt;/li&gt;
&lt;li&gt;measure robustness and regression&lt;/li&gt;
&lt;li&gt;export artefacts (base + adapter weights)&lt;/li&gt;
&lt;/ul&gt;
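&lt;p&gt;“Measure regression” deserves to be mechanical, not anecdotal. A tiny sketch (metric names are illustrative) that flags where the tuned model got worse than its base:&lt;/p&gt;

```python
def regression_report(base_metrics, tuned_metrics, tolerance=0.01):
    """Flag metrics where a fine-tuned model regressed versus its base.

    The point: "better on the target task" must always be checked
    alongside "not worse on everything else" (catastrophic forgetting).
    """
    report = {}
    for name, base_val in base_metrics.items():
        tuned_val = tuned_metrics.get(name, 0.0)
        delta = tuned_val - base_val
        report[name] = {
            "base": base_val,
            "tuned": tuned_val,
            "delta": round(delta, 4),
            # regressed means the drop exceeds the tolerance band
            "regressed": -tolerance > delta,
        }
    return report
```

&lt;p&gt;Gate your export step on this report: a target-task win that tanks a general benchmark is usually a ship-blocker, not a footnote.&lt;/p&gt;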


&lt;h1&gt;
  
  
  5) Practical Code: SFT + LoRA (PEFT) with Transformers
&lt;/h1&gt;

&lt;p&gt;Below is a &lt;em&gt;slightly tweaked&lt;/em&gt; version of the standard Hugging Face flow, tuned for clarity and real-world guardrails.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pip install transformers datasets accelerate peft evaluate
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dataset&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoModelForSequenceClassification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TrainingArguments&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Trainer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;peft&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_peft_model&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;imdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bert-base-uncased&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;truncation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tokenised&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batched&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenised&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenised&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remove_columns&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;rename_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;labels&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenised&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;torch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForSequenceClassification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bert-base-uncased&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;lora_cfg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lora_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lora_dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# tweak per model architecture
&lt;/span&gt;    &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SEQ_CLS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lora_cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_pred&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eval_pred&lt;/span&gt;
    &lt;span class="n"&gt;preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;references&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TrainingArguments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./ft_out&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2e-5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;per_device_eval_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;weight_decay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;evaluation_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;epoch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;save_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;epoch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;load_best_model_at_end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric_for_best_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;logging_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fp16&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Trainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokenised&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;eval_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokenised&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;compute_metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;compute_metrics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./ft_out/lora_adapter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What’s different (and why it matters):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;max_length&lt;/code&gt; trimmed to 384 to cut padding waste&lt;/li&gt;
&lt;li&gt;LoRA target modules are explicit (verify the names for your model architecture)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fp16&lt;/code&gt; enabled, batch sizes sized for typical single GPUs&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  6) QLoRA in Practice: When VRAM Is Your Bottleneck
&lt;/h1&gt;

&lt;p&gt;QLoRA is the “I don’t have an A100” option.&lt;/p&gt;

&lt;p&gt;Use it when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;your model is too big to fine-tune in full precision&lt;/li&gt;
&lt;li&gt;you want LoRA-level results with drastically less memory&lt;/li&gt;
&lt;li&gt;you accept slightly more complexity in setup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Operational note:&lt;/strong&gt; QLoRA is sensitive to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;quantisation config&lt;/li&gt;
&lt;li&gt;optimizer choice&lt;/li&gt;
&lt;li&gt;batch size / gradient accumulation&lt;/li&gt;
&lt;/ul&gt;
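&lt;p&gt;To make those knobs concrete, here is a minimal config sketch (requires transformers, peft, and bitsandbytes; the values and &lt;code&gt;target_modules&lt;/code&gt; names are illustrative, so verify them for your architecture):&lt;/p&gt;

```python
# Illustrative QLoRA configs (requires transformers + peft + bitsandbytes).
# target_modules names are LLaMA-style; verify them for your architecture.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 usually beats plain fp4 for weights
    bnb_4bit_use_double_quant=True,         # quantise the quantisation constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype matters for stability
)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
# Typical wiring: load the base model with quantization_config=bnb_cfg,
# run peft's prepare_model_for_kbit_training(model), then get_peft_model(model, lora_cfg).
# Pair small per-device batches with gradient_accumulation_steps to keep the
# effective batch size sensible.
```

&lt;p&gt;The batch-size sensitivity usually shows up as the effective batch size: keep per-device batches small and recover the rest with gradient accumulation.&lt;/p&gt;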




&lt;h1&gt;
  
  
  7) Hardware Planning (The Boring Part That Saves You £££)
&lt;/h1&gt;

&lt;p&gt;A simple rule-of-thumb table (very rough, but directionally useful):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model size&lt;/th&gt;
&lt;th&gt;Practical approach&lt;/th&gt;
&lt;th&gt;GPU class&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt;1B&lt;/td&gt;
&lt;td&gt;Full FT or LoRA&lt;/td&gt;
&lt;td&gt;24GB consumer GPU&lt;/td&gt;
&lt;td&gt;Cheap experiments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1–10B&lt;/td&gt;
&lt;td&gt;LoRA/QLoRA&lt;/td&gt;
&lt;td&gt;40–80GB&lt;/td&gt;
&lt;td&gt;Stable training &amp;amp; eval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt;10B&lt;/td&gt;
&lt;td&gt;QLoRA or multi-GPU&lt;/td&gt;
&lt;td&gt;80GB+ (multi-card)&lt;/td&gt;
&lt;td&gt;Memory + throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
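&lt;p&gt;You can sanity-check the table with back-of-envelope arithmetic. A rough estimator (rules of thumb only; activations and KV cache are extra):&lt;/p&gt;

```python
def estimate_vram_gb(n_params_billion: float, mode: str) -> float:
    """Very rough VRAM estimate (weights + optimiser + grads), excluding activations.

    Rules of thumb:
      full FT, mixed precision + AdamW : ~16 bytes/param
      LoRA (frozen fp16 base)          : ~2 bytes/param + small adapter overhead
      QLoRA (4-bit base)               : ~0.6 bytes/param + adapter overhead
    """
    bytes_per_param = {"full": 16, "lora": 2.1, "qlora": 0.6}[mode]
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

# e.g. a 7B model: full FT ~104 GB, LoRA ~14 GB, QLoRA ~4 GB (before activations)
```

&lt;p&gt;The multipliers are coarse, but they explain the table: full fine-tuning of anything beyond ~1B quickly outgrows a single consumer card, while QLoRA keeps a 7B run inside 24 GB.&lt;/p&gt;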

&lt;p&gt;If your goal is a &lt;strong&gt;production system&lt;/strong&gt;, plan for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;checkpoints (storage balloons fast)&lt;/li&gt;
&lt;li&gt;inference latency testing (p50/p95/p99)&lt;/li&gt;
&lt;li&gt;versioning (base + adapters + configs)&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  8) Monitoring: How to Detect Failure Early
&lt;/h1&gt;

&lt;p&gt;Track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;train vs val loss divergence (overfitting)&lt;/li&gt;
&lt;li&gt;task metric (F1/AUC/accuracy) over time&lt;/li&gt;
&lt;li&gt;gradient norms (explosions or vanishing)&lt;/li&gt;
&lt;li&gt;GPU utilisation + VRAM (to catch bottlenecks)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Early stopping is not optional in small-data regimes.&lt;/p&gt;
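&lt;p&gt;In the Trainer ecosystem, early stopping is one callback away (&lt;code&gt;EarlyStoppingCallback&lt;/code&gt;); the divergence check itself is simple enough to hand-roll. A sketch with illustrative thresholds:&lt;/p&gt;

```python
def overfit_signal(train_losses, val_losses, patience=3, gap_ratio=0.25):
    """Flag likely overfitting: val loss rising for `patience` consecutive evals
    while train loss keeps falling, or the train/val gap exceeding gap_ratio
    of the val loss. Thresholds are illustrative; tune per task."""
    if patience + 1 > len(val_losses):
        return False
    val_rising = all(
        val_losses[-i] > val_losses[-i - 1] for i in range(1, patience + 1)
    )
    train_falling = train_losses[-patience - 1] > train_losses[-1]
    gap = val_losses[-1] - train_losses[-1]
    return (val_rising and train_falling) or gap > gap_ratio * val_losses[-1]
```

&lt;p&gt;Wire something like this into your eval loop and you catch overfitting epochs before you have burned the GPU budget.&lt;/p&gt;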




&lt;h1&gt;
  
  
  9) The Pitfalls That Kill Fine-Tuning Projects
&lt;/h1&gt;

&lt;h2&gt;
  
  
  9.1 Data leakage
&lt;/h2&gt;

&lt;p&gt;Validation looks amazing; the test set collapses.&lt;/p&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;group-aware splits&lt;/li&gt;
&lt;li&gt;time-based splits for temporal data&lt;/li&gt;
&lt;li&gt;deduplicate aggressively&lt;/li&gt;
&lt;/ul&gt;
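&lt;p&gt;A group-aware split is a few lines if you hash the group key. The sketch below assumes records are dicts with a grouping field (user id, document id, whatever near-duplicates share):&lt;/p&gt;

```python
import hashlib

def group_split(records, group_key, val_fraction=0.2):
    """Group-aware split: every record sharing a group lands on the same side,
    so near-duplicates can't leak into validation. Hash-based assignment is
    also stable across reruns."""
    train, val = [], []
    for rec in records:
        h = int(hashlib.sha256(str(rec[group_key]).encode()).hexdigest(), 16)
        bucket = train if h % 1000 >= val_fraction * 1000 else val
        bucket.append(rec)
    return train, val
```

&lt;p&gt;For temporal data, replace the hash with a cutoff date; the invariant is the same: nothing correlated with a validation example may appear in training.&lt;/p&gt;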

&lt;h2&gt;
  
  
  9.2 Class imbalance
&lt;/h2&gt;

&lt;p&gt;Model learns the majority class.&lt;/p&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;weighting&lt;/li&gt;
&lt;li&gt;resampling&lt;/li&gt;
&lt;li&gt;metric choice (F1 &amp;gt; accuracy in many cases)&lt;/li&gt;
&lt;/ul&gt;
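&lt;p&gt;For the weighting option, inverse-frequency weights are the usual starting point (the same formula scikit-learn calls &lt;code&gt;balanced&lt;/code&gt;):&lt;/p&gt;

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).
    Feed into a weighted loss, e.g. CrossEntropyLoss(weight=...)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# 90/10 imbalance: the minority class gets ~9x the weight of the majority
```

&lt;p&gt;Weights average to 1 across samples, so the loss scale (and your learning rate) stays roughly unchanged.&lt;/p&gt;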

&lt;h2&gt;
  
  
  9.3 “Bigger model = better”
&lt;/h2&gt;

&lt;p&gt;On small data, bigger models can overfit harder.&lt;/p&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;match model size to data&lt;/li&gt;
&lt;li&gt;prefer PEFT&lt;/li&gt;
&lt;li&gt;regularise&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  9.4 Ignoring deployment constraints
&lt;/h2&gt;

&lt;p&gt;A model that hits 0.96 AUC but misses latency and memory budgets is a demo, not a product.&lt;/p&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;benchmark early&lt;/li&gt;
&lt;li&gt;export-friendly formats (ONNX/TensorRT) if needed&lt;/li&gt;
&lt;li&gt;distil if latency matters&lt;/li&gt;
&lt;/ul&gt;
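&lt;p&gt;“Benchmark early” can be as simple as a percentile harness around your serving call (a sketch; &lt;code&gt;fn&lt;/code&gt; stands in for your real inference path, tokenisation included):&lt;/p&gt;

```python
import time
import statistics

def latency_profile(fn, *args, warmup=10, runs=100):
    """Measure p50/p95/p99 latency in milliseconds of a callable.
    Benchmark the full serving path, not just the forward pass."""
    for _ in range(warmup):          # warm caches / JIT before timing
        fn(*args)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1000)
    q = statistics.quantiles(samples, n=100)  # 99 cut points
    return {"p50": statistics.median(samples), "p95": q[94], "p99": q[98]}
```

&lt;p&gt;Run this against your latency budget before the final training run, not after; it is much cheaper to pick a smaller model early than to distil one later.&lt;/p&gt;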




&lt;h1&gt;
  
  
  10) A Decision Cheat Sheet
&lt;/h1&gt;

&lt;p&gt;Use this quick chooser:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data &amp;lt; 100&lt;/strong&gt; → prompt + retrieval + synthetic data
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;100–1,000&lt;/strong&gt; → LoRA / adapters
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1,000–10,000&lt;/strong&gt; → LoRA or full FT (small LR)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10,000+&lt;/strong&gt; → full FT can make sense (if eval + regression are solid)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VRAM tight&lt;/strong&gt; → QLoRA
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need preference alignment&lt;/strong&gt; → DPO/RLHF-style preference training
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task changes often&lt;/strong&gt; → avoid weight updates, design workflows instead&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Final Take
&lt;/h1&gt;

&lt;p&gt;Successful fine-tuning isn’t “a training run.”&lt;/p&gt;

&lt;p&gt;It’s a loop:&lt;br&gt;
&lt;strong&gt;data → training → evaluation → deployment constraints → monitoring → back to data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you treat it as an engineering system (not a one-off experiment), PEFT methods like &lt;strong&gt;LoRA/QLoRA&lt;/strong&gt; give you the best tradeoff curve in 2026: strong gains, manageable cost, and deployable artefacts.&lt;/p&gt;

&lt;p&gt;And that’s what you want: not a model that’s “smart in a notebook,” but a model that’s &lt;strong&gt;reliable in production&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>finetuning</category>
    </item>
    <item>
<title>Agent Skills: When Your AI Learns to "Install Plugins"</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Mon, 02 Feb 2026 14:40:23 +0000</pubDate>
      <link>https://forem.com/superorange0707/agent-skills-when-your-ai-learns-to-install-plugins-247j</link>
      <guid>https://forem.com/superorange0707/agent-skills-when-your-ai-learns-to-install-plugins-247j</guid>
      <description>&lt;h1&gt;
  
  
  Agent Skills: The “Plugins” Moment for Everyday AI
&lt;/h1&gt;

&lt;p&gt;There’s a specific kind of disappointment you only get after asking an LLM to “create an Excel report” and receiving… a beautifully formatted description of a spreadsheet that does not exist.&lt;/p&gt;

&lt;p&gt;It’s not the model’s fault. &lt;strong&gt;LLMs are great at language. They’re not inherently great at deterministic, file-producing, structure-preserving operations.&lt;/strong&gt; That’s where &lt;strong&gt;Agent Skills&lt;/strong&gt; come in: Anthropic’s answer (announced October 16, 2025) to the question: &lt;em&gt;“How do we give AI real capabilities without turning every user into a developer?”&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;If MCP is a highway system for AI tooling, &lt;strong&gt;Agent Skills are the roundabouts and on-ramps built right into Claude&lt;/strong&gt;—fast, local, and predictable.&lt;/p&gt;




&lt;h1&gt;
  
  
  1) What Exactly Is an Agent Skill?
&lt;/h1&gt;

&lt;p&gt;An &lt;strong&gt;Agent Skill&lt;/strong&gt; is a &lt;em&gt;packaged capability&lt;/em&gt; Claude can use when it recognises the situation.&lt;/p&gt;

&lt;p&gt;A Skill typically includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt; (name + description) used for routing/selection
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instructions&lt;/strong&gt; (the playbook / workflow)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt; (scripts, templates, helper files) that execute in a sandboxed environment &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think: “a tiny, reusable workflow module” rather than “yet another prompt”.&lt;/p&gt;

&lt;p&gt;Anthropic’s docs describe Skills as organised folders of instructions, scripts, and resources, including pre-built Skills for common document work (PowerPoint, Excel, Word, PDF), plus custom Skills you can write yourself. &lt;/p&gt;




&lt;h1&gt;
  
  
  2) Why This Is A Big Deal: Progressive Disclosure (AKA, Don’t Stuff the Context Window)
&lt;/h1&gt;

&lt;p&gt;The clever part isn’t “Claude can run scripts”—lots of systems can do that.&lt;/p&gt;

&lt;p&gt;The clever part is &lt;strong&gt;how little Claude loads until it needs to&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Claude Code docs describe a three-phase flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Startup:&lt;/strong&gt; load only each Skill’s &lt;code&gt;name&lt;/code&gt; + &lt;code&gt;description&lt;/code&gt; (keeps startup fast)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Activation:&lt;/strong&gt; when relevant, Claude asks to use the Skill and you confirm before full instructions load
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution:&lt;/strong&gt; load resources and run in the execution environment &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In practice, that means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You don’t pay a token tax upfront for 20 Skills you &lt;em&gt;might&lt;/em&gt; need later.&lt;/li&gt;
&lt;li&gt;Claude can route to the right tool without being bloated with detail.&lt;/li&gt;
&lt;li&gt;The user gets a clear “yes/no” moment before full Skill content is injected.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the opposite of the “mega system prompt” era.&lt;/p&gt;




&lt;h1&gt;
  
  
  3) What Skills Are Great At (And Why LLMs Struggle Without Them)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Problem A: “I need a real file, not a bedtime story about a file.”
&lt;/h2&gt;

&lt;p&gt;Without Skills, the model often produces &lt;em&gt;representations&lt;/em&gt; of artefacts—tables in Markdown, pseudo-Excel, fake download links.&lt;/p&gt;

&lt;p&gt;With Skills, Claude can actually &lt;strong&gt;generate the artefact&lt;/strong&gt; (e.g., &lt;code&gt;.xlsx&lt;/code&gt;, &lt;code&gt;.pptx&lt;/code&gt;, &lt;code&gt;.docx&lt;/code&gt;) and hand it back.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem B: Domain best practices are annoying to repeat
&lt;/h2&gt;

&lt;p&gt;In real work, the prompt is rarely the hard part. The hard part is the &lt;em&gt;standards&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Use the company slide template”&lt;/li&gt;
&lt;li&gt;“Always include a pivot table + chart + executive summary”&lt;/li&gt;
&lt;li&gt;“In code review, flag SQL injection risks”&lt;/li&gt;
&lt;li&gt;“For PDFs, preserve table structure and join split rows”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Skills let you bake these standards once—then reuse them consistently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem C: Determinism matters (especially for structured extraction)
&lt;/h2&gt;

&lt;p&gt;When extracting tables from PDFs or producing spreadsheets, you want &lt;strong&gt;repeatable output&lt;/strong&gt;. Skills push more of the job into deterministic tooling rather than hoping the model “describes it correctly”.&lt;/p&gt;




&lt;h1&gt;
  
  
  4) Agent Skills vs MCP: It’s Not Redundant, It’s Layering
&lt;/h1&gt;

&lt;p&gt;People see “Skills” and immediately ask: &lt;em&gt;Wait, isn’t that what MCP is for?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not really.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP in one paragraph
&lt;/h2&gt;

&lt;p&gt;Anthropic introduced &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; in November 2024 as an open standard for building secure, two-way connections between AI tools and external data sources—via MCP clients talking to MCP servers. &lt;/p&gt;

&lt;p&gt;In other words: &lt;strong&gt;MCP is about connecting to outside systems&lt;/strong&gt; (databases, file stores, SaaS tools, internal services).&lt;/p&gt;

&lt;h2&gt;
  
  
  Skills in one paragraph
&lt;/h2&gt;

&lt;p&gt;Agent Skills are about &lt;strong&gt;packaging repeatable workflows and execution logic inside Claude’s ecosystem&lt;/strong&gt;, with progressive loading and sandboxed execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  A useful mental model
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent Skills = built-in “shortcuts” / workflow modules&lt;/strong&gt; (fast, local, standardised)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP = app ecosystem infrastructure&lt;/strong&gt; (powerful, external, programmable, operationally heavier)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or: &lt;strong&gt;Skills optimise “doing the thing”. MCP optimises “reaching the thing”.&lt;/strong&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  5) The Architecture Patterns You’ll Actually Use
&lt;/h1&gt;

&lt;h2&gt;
  
  
  5.1 Skills for document-heavy work (the “office grind” you shouldn’t be doing manually)
&lt;/h2&gt;

&lt;p&gt;Pre-built Skills cover common doc tasks: spreadsheets, slides, PDFs, Word docs. &lt;/p&gt;

&lt;p&gt;A realistic workflow:&lt;/p&gt;

&lt;p&gt;1) User: “Turn this quarterly sales CSV into a management-ready workbook with a pivot table and chart.”&lt;br&gt;&lt;br&gt;
2) Claude uses a spreadsheet Skill to generate a real &lt;code&gt;.xlsx&lt;/code&gt; artefact.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.2 Skills for organisational consistency (the “one team, one way” problem)
&lt;/h2&gt;

&lt;p&gt;A custom Skill can encode your team’s standards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;review checklist&lt;/li&gt;
&lt;li&gt;risk scoring rubric&lt;/li&gt;
&lt;li&gt;writing style guide&lt;/li&gt;
&lt;li&gt;“definition of done”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters because humans forget standards. LLMs… forget them even faster unless you enforce them.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.3 MCP for external systems (the “we need live data” problem)
&lt;/h2&gt;

&lt;p&gt;If you need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;query a database&lt;/li&gt;
&lt;li&gt;call a third-party API&lt;/li&gt;
&lt;li&gt;hit an internal service&lt;/li&gt;
&lt;li&gt;read from a production repository&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…that’s MCP territory—especially because MCP is designed as an open protocol for those client/server connections. &lt;/p&gt;




&lt;h1&gt;
  
  
  6) A Minimal Custom Skill Example (Tweakable, Practical)
&lt;/h1&gt;

&lt;p&gt;Below is a lightweight Skill that turns “messy meeting notes” into a consistent UK-style action log.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;SKILL.md&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;action-tracking&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Turn meeting notes into a UK-style action log with owners, dates, and risks. Use when the user pastes notes or uploads minutes.&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gh"&gt;# Action Tracking Assistant&lt;/span&gt;

&lt;span class="gu"&gt;## When to use&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; The user provides meeting notes, minutes, or a transcript
&lt;span class="p"&gt;-&lt;/span&gt; They want actions, owners, deadlines, and risks in a consistent format

&lt;span class="gu"&gt;## Steps&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Extract decisions (if any) and action items
&lt;span class="p"&gt;2.&lt;/span&gt; Assign each action an owner (use names given; otherwise use "TBC")
&lt;span class="p"&gt;3.&lt;/span&gt; Convert relative dates (e.g. "next Friday") into explicit dates if the date is known; otherwise mark "TBC"
&lt;span class="p"&gt;4.&lt;/span&gt; Flag dependencies and risks

&lt;span class="gu"&gt;## Output format (must follow exactly)&lt;/span&gt;
&lt;span class="gu"&gt;### Decisions&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; ...

&lt;span class="gu"&gt;### Action Log&lt;/span&gt;
| ID | Action | Owner | Due date | Status | Risk/Dependency |
|---:|---|---|---|---|---|
| A1 | ... | ... | ... | Not started | ... |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the &lt;code&gt;description&lt;/code&gt; is written in language users actually type, which improves routing. Anthropic also explicitly recommends paying attention to &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt; because Claude uses them to decide whether to trigger a Skill. &lt;/p&gt;




&lt;h1&gt;
  
  
  7) Safety: The Unsexy Part That Makes Skills Usable
&lt;/h1&gt;

&lt;p&gt;Skills run code and interact with files in an execution environment, so the safety model matters. Claude Code docs frame Skills as folders Claude can navigate and execute within a constrained environment.&lt;/p&gt;

&lt;p&gt;Practical safety rules that hold up in real teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prefer &lt;strong&gt;official Skills&lt;/strong&gt; or &lt;strong&gt;skills you authored and reviewed&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Treat third-party Skills like you’d treat random shell scripts from the internet&lt;/li&gt;
&lt;li&gt;Maintain a small allowlist; remove Skills you don’t use&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  8) The Real Limitation: Ecosystem Lock-In (For Now)
&lt;/h1&gt;

&lt;p&gt;Skills are incredibly pragmatic—but they’re also &lt;strong&gt;ecosystem-bound&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skills are designed around Claude’s tooling and execution model. &lt;/li&gt;
&lt;li&gt;MCP, by contrast, is explicitly positioned as an open protocol for interoperability across tools and platforms. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the trade-off is clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Skills = speed + simplicity&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP = reach + portability&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re building internal workflows today, Skills are the “get it done this afternoon” move. If you’re building tooling that must survive model churn, MCP becomes increasingly attractive.&lt;/p&gt;




&lt;h1&gt;
  
  
  Final Take: Skills Make AI Feel Less Like Chat, More Like Software
&lt;/h1&gt;

&lt;p&gt;The biggest shift here isn’t technical—it’s product-shaped:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Before: “AI answers questions.”
&lt;/li&gt;
&lt;li&gt;Now: “AI executes workflows.”
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agent Skills push Claude closer to what office software has always promised: &lt;strong&gt;less time formatting and moving things around, more time deciding what matters.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And that’s the quiet superpower: when execution gets cheaper, &lt;strong&gt;ambition gets bigger&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>promptengineering</category>
      <category>agents</category>
    </item>
    <item>
      <title>Getting High-Quality Output from 7B Models: A Production-Grade Prompting Playbook</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Mon, 02 Feb 2026 14:39:24 +0000</pubDate>
      <link>https://forem.com/superorange0707/getting-high-quality-output-from-7b-models-a-production-grade-prompting-playbook-21hi</link>
      <guid>https://forem.com/superorange0707/getting-high-quality-output-from-7b-models-a-production-grade-prompting-playbook-21hi</guid>
      <description>&lt;h1&gt;
  
  
  7B Models: Cheap, Fast… and Brutally Honest About Your Prompting
&lt;/h1&gt;

&lt;p&gt;If you’ve deployed a 7B model locally (or on a modest GPU), you already know the trade:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;low cost&lt;/li&gt;
&lt;li&gt;low latency&lt;/li&gt;
&lt;li&gt;easy to self-host&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;patchy world knowledge
&lt;/li&gt;
&lt;li&gt;weaker long-chain reasoning
&lt;/li&gt;
&lt;li&gt;worse instruction-following
&lt;/li&gt;
&lt;li&gt;unstable formatting (“JSON… but not really”)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The biggest mistake is expecting 7B models to behave like frontier models. They won’t.&lt;/p&gt;

&lt;p&gt;But you &lt;em&gt;can&lt;/em&gt; get surprisingly high-quality output if you treat prompting like &lt;strong&gt;systems design&lt;/strong&gt;, not “creative writing.”&lt;/p&gt;

&lt;p&gt;This is the playbook.&lt;/p&gt;




&lt;h1&gt;
  
  
  1) The 7B Pain Points (What You’re Fighting)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1.1 Limited knowledge coverage
&lt;/h2&gt;

&lt;p&gt;7B models often miss niche facts and domain jargon. They’ll bluff or generalise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt implication:&lt;/strong&gt; provide the missing facts up front.&lt;/p&gt;

&lt;h2&gt;
  
  
  1.2 Logic breaks on multi-step tasks
&lt;/h2&gt;

&lt;p&gt;They may skip steps, contradict themselves, or lose track halfway through.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt implication:&lt;/strong&gt; enforce &lt;em&gt;short&lt;/em&gt; steps and validate each step.&lt;/p&gt;

&lt;h2&gt;
  
  
  1.3 Low instruction adherence
&lt;/h2&gt;

&lt;p&gt;Give three requirements, it fulfils one. Give five, it panics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt implication:&lt;/strong&gt; one task per prompt, or a tightly gated checklist.&lt;/p&gt;

&lt;h2&gt;
  
  
  1.4 Format instability
&lt;/h2&gt;

&lt;p&gt;They drift from tables to prose, add commentary, break JSON.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt implication:&lt;/strong&gt; treat output format as a contract and add a repair loop.&lt;/p&gt;
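&lt;p&gt;A repair loop is cheap to implement: validate the output, and on failure re-prompt with the parse error. A sketch where &lt;code&gt;model_call&lt;/code&gt; is a stand-in for your inference client:&lt;/p&gt;

```python
import json

def generate_json(prompt, model_call, max_retries=2):
    """Format-contract loop: request JSON only, validate, and re-prompt with
    the parse error on failure. Returns None if every attempt fails."""
    msg = prompt
    for _ in range(max_retries + 1):
        raw = model_call(msg)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as e:
            msg = (f"{prompt}\n\nYour previous output was not valid JSON "
                   f"({e.msg}). Output ONLY the corrected JSON.")
    return None
```

&lt;p&gt;With 7B models this loop fires often enough that it belongs in the pipeline, not in your patience.&lt;/p&gt;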




&lt;h1&gt;
  
  
  2) The Four High-Leverage Prompt Tactics
&lt;/h1&gt;

&lt;h2&gt;
  
  
  2.1 Simplify the instruction and focus the target
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; one prompt, one job.&lt;/p&gt;

&lt;p&gt;Bad:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Write a review with features, scenarios, advice, and a conclusion, plus SEO keywords.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Better:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Generate &lt;strong&gt;3 key features&lt;/strong&gt; (bullets). No intro. No conclusion.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then chain steps: features → scenarios → buying advice → final assembly.&lt;/p&gt;
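&lt;p&gt;The chain can be sketched in a few lines, assuming a hypothetical &lt;code&gt;call&lt;/code&gt; function that wraps your model API:&lt;/p&gt;

```python
def chain(call, product: str) -> str:
    """Run the micro-tasks in order, feeding each result into the next prompt."""
    features = call(f"Generate 3 key features (bullets) for: {product}. No intro. No conclusion.")
    scenarios = call(f"Given these features:\n{features}\nList 2 usage scenarios (bullets).")
    advice = call(f"Given:\n{features}\n{scenarios}\nWrite one line of buying advice.")
    # Final assembly happens in code, not in the model: cheaper and more reliable.
    return "\n\n".join([features, scenarios, advice])
```

&lt;p&gt;Each call stays a single, small job, which is exactly what a 7B model handles best.&lt;/p&gt;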

&lt;h3&gt;
  
  
  A reusable “micro-task” skeleton
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ROLE: You are a helpful assistant.
TASK: &amp;lt;single concrete output&amp;gt;
CONSTRAINTS:
- length:
- tone:
- must include:
FORMAT:
- output as:
INPUT:
&amp;lt;your data&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;7B models love rigid scaffolding.&lt;/p&gt;




&lt;h2&gt;
  
  
  2.2 Inject missing knowledge (don’t make the model guess)
&lt;/h2&gt;

&lt;p&gt;If accuracy matters, give the model the facts it needs, like a tiny knowledge base.&lt;/p&gt;

&lt;p&gt;Use “context injection” blocks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FACTS (use only these):
- Battery cycles: ...
- Fast charging causes: ...
- Definition of ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then ask the question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hidden benefit:&lt;/strong&gt; this reduces hallucination because you’ve narrowed the search space.&lt;/p&gt;




&lt;h2&gt;
  
  
  2.3 Few-shot + strict formats (make parsing easy)
&lt;/h2&gt;

&lt;p&gt;7B models learn best by imitation. One good example beats three paragraphs of explanation.&lt;/p&gt;

&lt;h3&gt;
  
  
  The “format contract” trick
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Tell it: “Output JSON only.”&lt;/li&gt;
&lt;li&gt;Define keys.&lt;/li&gt;
&lt;li&gt;Provide a small example.&lt;/li&gt;
&lt;li&gt;Add failure behaviour: “If missing info, output &lt;code&gt;INSUFFICIENT_DATA&lt;/code&gt;.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This reduces the “creative drift” that kills batch workflows.&lt;/p&gt;
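&lt;p&gt;A contract is only useful if you check it. Here is a minimal validator sketch (names like &lt;code&gt;REQUIRED_KEYS&lt;/code&gt; are illustrative assumptions, not a fixed API):&lt;/p&gt;

```python
import json

REQUIRED_KEYS = {"title", "bullets"}  # whatever keys your contract defines

def validate_contract(raw: str):
    """Return the parsed dict, the INSUFFICIENT_DATA sentinel, or None on violation."""
    text = raw.strip()
    if text == "INSUFFICIENT_DATA":
        return text  # the agreed failure behaviour, not an error
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None  # not JSON at all -> trigger a repair prompt
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None  # keys missing -> contract broken
    return data
```

&lt;p&gt;A &lt;code&gt;None&lt;/code&gt; result is your signal to fire a repair prompt instead of shipping garbage downstream.&lt;/p&gt;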




&lt;h2&gt;
  
  
  2.4 Step-by-step + multi-turn repair (stop expecting perfection first try)
&lt;/h2&gt;

&lt;p&gt;Small models benefit massively from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;step decomposition&lt;/li&gt;
&lt;li&gt;targeted corrections (“you missed field X”)&lt;/li&gt;
&lt;li&gt;re-generation of only the broken section&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it as &lt;strong&gt;unit tests for text&lt;/strong&gt;.&lt;/p&gt;
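&lt;p&gt;A minimal sketch of those “unit tests”, assuming your prompt carries a MUST list and a length cap (both names here are illustrative):&lt;/p&gt;

```python
def check_output(text: str, must_include: list[str], max_words: int) -> list[str]:
    """Run checks on generated text and return a list of failure messages."""
    failures = [f"Missing: {item}" for item in must_include
                if item.lower() not in text.lower()]
    if len(text.split()) > max_words:
        failures.append(f"Constraint violation: longer than {max_words} words")
    return failures

def build_repair_prompt(failures: list[str]) -> str:
    """Target only the broken parts instead of regenerating everything."""
    issues = "\n".join(f"- {f}" for f in failures)
    return ("You did not follow the contract.\n"
            f"Fix ONLY the following issues:\n{issues}\n"
            "Return the corrected output only.")
```

&lt;p&gt;Empty failure list: ship it. Non-empty: send the repair prompt and re-check, rather than regenerating from scratch.&lt;/p&gt;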




&lt;h1&gt;
  
  
  3) Real Scenarios: Before/After Prompts That Boost Output Quality
&lt;/h1&gt;

&lt;h2&gt;
  
  
  3.1 Content creation: product promo copy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❌ Before (too vague)
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;“Write a fun promo for a portable wireless power bank for young people.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  ✅ After (facts + tone + example)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TASK: Write ONE promo paragraph (90–120 words) for a portable wireless power bank.
AUDIENCE: young commuters in the UK.
TONE: lively, casual, not cringe.
MUST INCLUDE (in any order):
- 180g weight and phone-sized body
- 22.5W fast charging: ~60% in 30 minutes
- 10,000mAh: 2–3 charges
AVOID:
- technical jargon, long specs tables

STYLE EXAMPLE (imitate the vibe, not the product):
"This mini speaker is unreal — pocket-sized, loud, and perfect for the commute."

OUTPUT: one paragraph only. No title. No emojis.

Product:
Portable wireless power bank
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works on 7B:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;facts are injected&lt;/li&gt;
&lt;li&gt;task is singular&lt;/li&gt;
&lt;li&gt;style anchor reduces tone randomness&lt;/li&gt;
&lt;li&gt;strict output scope prevents rambling&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3.2 Coding: pandas data cleaning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❌ Before
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;“Write Python code to clean user spending data.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  ✅ After (step contract + no guessing)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a Python developer. Output code only.

Goal: Clean a CSV file and save the result.

Input file: customer_spend.csv
Columns:
- age (numeric)
- consumption_amount (numeric)

Steps (must follow in order):
1) Read the CSV into `df`
2) Fill missing `age` with the mean of age
3) Fill missing `consumption_amount` with 0
4) Cap `consumption_amount` at 10000 (values &amp;gt;10000 become 10000)
5) Save to clean_customer_spend.csv with index=False
6) Print a single success message

Constraints:
- Use pandas only
- Do not invent extra columns
- Include basic comments
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;7B bonus:&lt;/strong&gt; by forcing explicit steps, you reduce the chance it “forgets” a requirement.&lt;/p&gt;




&lt;h2&gt;
  
  
  3.3 Data analysis: report &lt;em&gt;without&lt;/em&gt; computing numbers
&lt;/h2&gt;

&lt;p&gt;This is where small models often hallucinate numbers. So don’t let them.&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ Prompt that forbids fabrication
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TASK: Write a short analysis report framework for 2024 monthly sales trends.
DATA SOURCE: 2024_monthly_sales.xlsx with columns:
- Month (Jan..Dec)
- Units sold
- Revenue (£k)

IMPORTANT RULE:
- You MUST NOT invent any numbers.
- Use placeholders like [X month], [X], [Y] where values are unknown.

Report structure (must match):
# 2024 Product Sales Trend Report
## 1. Sales overview
## 2. Peak &amp;amp; trough months
## 3. Overall trend summary

Analysis requirements:
- Define how to find peak/low months
- Give 2–3 plausible reasons for peaks and troughs (seasonality, promo, stock issues)
- Summarise likely overall trend patterns (up, down, volatile, U-shape)

Output: the full report with placeholders only.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; it turns the model into a “framework generator,” which is a sweet spot for 7B.&lt;/p&gt;




&lt;h1&gt;
  
  
  4) Quality Evaluation: How to Measure Improvements (Not Just “Feels Better”)
&lt;/h1&gt;

&lt;p&gt;If you can’t measure it, you’ll end up “prompt-shopping” forever. For 7B models, I recommend an evaluation loop that’s &lt;strong&gt;cheap, repeatable, and brutally honest&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  4.1 A lightweight scorecard (run 10–20 samples)
&lt;/h2&gt;

&lt;p&gt;Pick a small test set (10–20 prompts is enough to start; 50 gives steadier numbers) and record:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Adherence&lt;/strong&gt;: did it satisfy every MUST requirement? (hit-rate %)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Factuality vs context&lt;/strong&gt;: count statements that contradict your provided facts (lower is better)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Format pass-rate&lt;/strong&gt;: does the output parse 100% of the time? (JSON/schema/table)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variance&lt;/strong&gt;: do “key decisions” change across runs? (stability %)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: avg tokens + avg latency (P50/P95)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A simple rubric:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Green&lt;/em&gt;: ≥95% adherence + ≥95% format pass-rate
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Yellow&lt;/em&gt;: 80–95% (acceptable for drafts)
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Red&lt;/em&gt;: &amp;lt;80% (you’re still guessing / under-specifying)&lt;/li&gt;
&lt;/ul&gt;
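&lt;p&gt;The scorecard is easy to automate. A sketch, assuming each sample pairs a raw output with its MUST list (JSON parsing stands in for whatever format check you actually use):&lt;/p&gt;

```python
import json

def score_run(outputs: list[str], must_lists: list[list[str]]) -> dict:
    """Compute adherence %, format pass-rate %, and the rubric band."""
    adhered = parsed = 0
    for out, musts in zip(outputs, must_lists):
        if all(m in out for m in musts):
            adhered += 1
        try:
            json.loads(out)   # swap in your own schema/table check here
            parsed += 1
        except json.JSONDecodeError:
            pass
    n = len(outputs)
    adherence, fmt = 100 * adhered / n, 100 * parsed / n
    if adherence >= 95 and fmt >= 95:
        band = "green"
    elif adherence >= 80 and fmt >= 80:
        band = "yellow"
    else:
        band = "red"
    return {"adherence": adherence, "format_pass": fmt, "band": band}
```

&lt;p&gt;Run it on the same frozen test set before and after every prompt change, and the “feels better” debate disappears.&lt;/p&gt;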

&lt;h2&gt;
  
  
  4.2 Common failure modes (and what to change)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If it invents facts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;inject missing context&lt;/li&gt;
&lt;li&gt;add a “no fabrication” rule&lt;/li&gt;
&lt;li&gt;require citations &lt;em&gt;from the provided context only&lt;/em&gt; (not web links)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If it ignores constraints&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;move requirements into a short MUST list&lt;/li&gt;
&lt;li&gt;reduce optional wording (“try”, “maybe”, “if possible”)&lt;/li&gt;
&lt;li&gt;cap output length explicitly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If the format drifts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;add a schema + a “format contract”&lt;/li&gt;
&lt;li&gt;include a single example that matches the schema&lt;/li&gt;
&lt;li&gt;set a stop sequence (e.g., stop after the closing brace)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4.3 Iteration loop that doesn’t waste time
&lt;/h2&gt;

&lt;p&gt;1) Freeze your test prompts&lt;br&gt;&lt;br&gt;
2) Change &lt;strong&gt;one&lt;/strong&gt; thing (constraints, example, context, or step plan)&lt;br&gt;&lt;br&gt;
3) Re-run the scorecard&lt;br&gt;&lt;br&gt;
4) Keep changes that improve adherence/format &lt;strong&gt;without&lt;/strong&gt; increasing cost too much&lt;br&gt;&lt;br&gt;
5) Only then generalise to new tasks&lt;/p&gt;




&lt;h1&gt;
  
  
  5) Iteration Methods That Work on 7B
&lt;/h1&gt;

&lt;h2&gt;
  
  
  5.1 Prompt iteration loop
&lt;/h2&gt;

&lt;p&gt;1) Run prompt&lt;br&gt;
2) Compare output to checklist&lt;br&gt;
3) Issue a repair prompt targeting &lt;em&gt;only the broken parts&lt;/em&gt;&lt;br&gt;
4) Save the “winning” prompt as a template&lt;/p&gt;

&lt;h3&gt;
  
  
  A reusable repair prompt
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You did not follow the contract.

Fix ONLY the following issues:
- Missing: &amp;lt;field&amp;gt;
- Format error: &amp;lt;what broke&amp;gt;
- Constraint violation: &amp;lt;too long / invented numbers / wrong tone&amp;gt;

Return the corrected output only.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5.2 Few-shot tuning (but keep it small)
&lt;/h2&gt;

&lt;p&gt;For 7B models: &lt;strong&gt;1–3 examples&lt;/strong&gt; usually beat 5–10 (too much context = distraction).&lt;/p&gt;

&lt;h2&gt;
  
  
  5.3 Format simplification
&lt;/h2&gt;

&lt;p&gt;If strict JSON fails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;use a Markdown table with fixed columns&lt;/li&gt;
&lt;li&gt;or “Key: Value” lines, then parse them with regex&lt;/li&gt;
&lt;/ul&gt;
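&lt;p&gt;The “Key: Value” fallback parses in a few lines of regex (a sketch; tighten the pattern to your own key names):&lt;/p&gt;

```python
import re

# One "Key: Value" pair per line; commentary lines without a colon are ignored.
LINE_RE = re.compile(r"^\s*([A-Za-z_ ]+):\s*(.+)$")

def parse_key_values(text: str) -> dict[str, str]:
    """Extract Key: Value pairs from loosely formatted model output."""
    result = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            result[m.group(1).strip().lower()] = m.group(2).strip()
    return result
```

&lt;p&gt;This tolerates the preamble chatter (“Sure! Here you go:”) that so often breaks strict JSON parsing on small models.&lt;/p&gt;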




&lt;h1&gt;
  
  
  6) Deployment Notes: Hardware, Quantisation, and Inference Choices
&lt;/h1&gt;

&lt;p&gt;7B models can run on consumer hardware, but choices matter:&lt;/p&gt;

&lt;h2&gt;
  
  
  6.1 Hardware baseline
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;CPU-only: workable, but slower; more RAM helps (16GB+)&lt;/li&gt;
&lt;li&gt;GPU: smoother UX; 3090/4090 class GPUs are comfortable for many 7B setups&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6.2 Quantisation
&lt;/h2&gt;

&lt;p&gt;INT8/INT4 reduces memory and speeds up inference, but can slightly degrade accuracy.&lt;br&gt;
Common approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPTQ / AWQ&lt;/li&gt;
&lt;li&gt;4-bit quant + LoRA adapters for domain tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6.3 Inference frameworks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llama.cpp&lt;/code&gt; for fast local CPU/GPU setups&lt;/li&gt;
&lt;li&gt;vLLM for server-style throughput&lt;/li&gt;
&lt;li&gt;Transformers.js for lightweight client-side experiments&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Final Take
&lt;/h1&gt;

&lt;p&gt;7B models don’t reward “clever prompts.”&lt;br&gt;&lt;br&gt;
They reward &lt;strong&gt;clear contracts&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;keep tasks small,&lt;/li&gt;
&lt;li&gt;inject missing facts,&lt;/li&gt;
&lt;li&gt;enforce formats,&lt;/li&gt;
&lt;li&gt;and use repair loops,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;you can make a 7B model deliver output that feels &lt;em&gt;shockingly close&lt;/em&gt; to much larger systems—especially for structured work like copy templates, code scaffolding, and report frameworks.&lt;/p&gt;

&lt;p&gt;That’s the point of low-resource LLMs: not to replace frontier models, but to own the “cheap, fast, good enough” layer of your stack.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>promptengineering</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Prompt Rate Limits &amp; Batching: How to Stop Your LLM API From Melting Down</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Mon, 26 Jan 2026 10:53:16 +0000</pubDate>
      <link>https://forem.com/superorange0707/prompt-rate-limits-batching-how-to-stop-your-llm-api-from-melting-down-56e1</link>
      <guid>https://forem.com/superorange0707/prompt-rate-limits-batching-how-to-stop-your-llm-api-from-melting-down-56e1</guid>
      <description>&lt;h1&gt;
  
  
  Prompt Rate Limits &amp;amp; Batching: Your LLM API Has a Speed Limit (Even If Your Product Doesn’t)
&lt;/h1&gt;

&lt;p&gt;You ship a feature, your traffic spikes, and suddenly your LLM layer starts returning &lt;strong&gt;429s&lt;/strong&gt; like it’s handing out parking tickets.&lt;/p&gt;

&lt;p&gt;The bad news: rate limits are inevitable.&lt;/p&gt;

&lt;p&gt;The good news: &lt;strong&gt;most LLM “rate limit incidents” are self-inflicted&lt;/strong&gt;—usually by oversized prompts, bursty traffic, and output formats that are impossible to parse at scale.&lt;/p&gt;

&lt;p&gt;This article is a practical playbook for:&lt;/p&gt;

&lt;p&gt;1) understanding prompt-related throttles,&lt;br&gt;&lt;br&gt;
2) avoiding the common failure modes, and&lt;br&gt;&lt;br&gt;
3) batching requests without turning your responses into soup.&lt;/p&gt;


&lt;h1&gt;
  
  
  1) The Three Limits You Actually Hit (And What They Mean)
&lt;/h1&gt;

&lt;p&gt;Different providers name things differently, but the mechanics are consistent:&lt;/p&gt;
&lt;h2&gt;
  
  
  1.1 Context window (max tokens per request)
&lt;/h2&gt;

&lt;p&gt;If your &lt;strong&gt;input + output&lt;/strong&gt; exceeds the model context window, the request fails immediately.&lt;/p&gt;

&lt;p&gt;Symptoms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Maximum context length exceeded”&lt;/li&gt;
&lt;li&gt;“Your messages resulted in X tokens…”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;shorten, summarise, or chunk data.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  1.2 RPM (Requests Per Minute)
&lt;/h2&gt;

&lt;p&gt;You can be under token limits and still get throttled if you burst too many calls. Gemini explicitly documents RPM as a core dimension. &lt;/p&gt;

&lt;p&gt;Symptoms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Rate limit reached for requests per minute”&lt;/li&gt;
&lt;li&gt;HTTP 429&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;client-side pacing, queues, and backoff.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  1.3 TPM / Token throughput limits
&lt;/h2&gt;

&lt;p&gt;Anthropic measures rate limits in &lt;strong&gt;RPM + input tokens/minute + output tokens/minute&lt;/strong&gt; (ITPM/OTPM).&lt;br&gt;&lt;br&gt;
Gemini similarly describes tokens per minute as a key dimension. &lt;/p&gt;

&lt;p&gt;Symptoms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Rate limit reached for token usage per minute”&lt;/li&gt;
&lt;li&gt;429 + Retry-After header (Anthropic calls this out) &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reduce tokens, batch efficiently, or request higher quota.&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;
  
  
  2) The Most Common “Prompt Limit” Failure Patterns
&lt;/h1&gt;
&lt;h2&gt;
  
  
  2.1 The “one prompt to rule them all” anti-pattern
&lt;/h2&gt;

&lt;p&gt;You ask for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;extraction
&lt;/li&gt;
&lt;li&gt;classification
&lt;/li&gt;
&lt;li&gt;rewriting
&lt;/li&gt;
&lt;li&gt;validation
&lt;/li&gt;
&lt;li&gt;formatting
&lt;/li&gt;
&lt;li&gt;business logic
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…in a single request, and then you wonder why token usage spikes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Split the workflow&lt;/strong&gt;. If you need multi-step logic, use &lt;strong&gt;Prompt Chaining&lt;/strong&gt; (small prompts with structured intermediate outputs).&lt;/p&gt;
&lt;h2&gt;
  
  
  2.2 Bursty traffic (the silent RPM killer)
&lt;/h2&gt;

&lt;p&gt;Production traffic is spiky. Cron jobs, retries, user clicks, webhook bursts—everything aligns in the worst possible minute.&lt;/p&gt;

&lt;p&gt;If your client sends requests like a machine gun, your provider will respond like a bouncer.&lt;/p&gt;
&lt;h2&gt;
  
  
  2.3 Unstructured output = expensive parsing
&lt;/h2&gt;

&lt;p&gt;If your output is “kinda JSON-ish”, your parser becomes a full-time therapist.&lt;/p&gt;

&lt;p&gt;Make the model output &lt;strong&gt;strict JSON&lt;/strong&gt; or a fixed table. Treat format as a contract.&lt;/p&gt;


&lt;h1&gt;
  
  
  3) Rate Limit Survival Kit (Compliant, Practical, Boring)
&lt;/h1&gt;
&lt;h2&gt;
  
  
  3.1 Prompt-side: shrink tokens without losing signal
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Delete marketing fluff&lt;/strong&gt; (models don’t need your company origin story).&lt;/li&gt;
&lt;li&gt;Convert repeated boilerplate into a short “policy block” and reuse it.&lt;/li&gt;
&lt;li&gt;Prefer &lt;strong&gt;fields&lt;/strong&gt; over prose (“material=316 stainless steel” beats a paragraph).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  A tiny prompt rewrite that usually saves 30–50%
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before (chatty):&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“We’re a smart home brand founded in 2010… please write 3 marketing lines…”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;After (dense + precise):&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Write 3 UK e-commerce lines. Product: smart bulb. Material=PC flame-retardant. Feature=3 colour temperatures. Audience=living room.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  3.2 Request-side: backoff like an adult
&lt;/h2&gt;

&lt;p&gt;If the provider returns &lt;strong&gt;Retry-After&lt;/strong&gt;, respect it. Anthropic explicitly returns Retry-After on 429s. &lt;/p&gt;

&lt;p&gt;Use exponential backoff + jitter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;attempt 1: 1s
&lt;/li&gt;
&lt;li&gt;attempt 2: 2–3s
&lt;/li&gt;
&lt;li&gt;attempt 3: 4–6s
&lt;/li&gt;
&lt;li&gt;then fail gracefully&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  3.3 System-side: queue + concurrency caps
&lt;/h2&gt;

&lt;p&gt;If your account supports 10 concurrent requests, do not run 200 coroutines and “hope”.&lt;/p&gt;

&lt;p&gt;Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a &lt;strong&gt;work queue&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;semaphore&lt;/strong&gt; for concurrency&lt;/li&gt;
&lt;li&gt;and a &lt;strong&gt;rate limiter&lt;/strong&gt; for RPM/TPM&lt;/li&gt;
&lt;/ul&gt;
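&lt;p&gt;In Python, the queue-plus-semaphore pattern is a few lines of &lt;code&gt;asyncio&lt;/code&gt; (a sketch; &lt;code&gt;MAX_CONCURRENCY&lt;/code&gt; and &lt;code&gt;RPM_LIMIT&lt;/code&gt; are placeholders for your account's real quotas):&lt;/p&gt;

```python
import asyncio

MAX_CONCURRENCY = 10  # never exceed your account's concurrent-request cap
RPM_LIMIT = 60        # requests-per-minute budget

async def run_all(call, jobs):
    """Cap concurrency with a semaphore and pace calls to stay under RPM."""
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def one(job):
        async with sem:
            result = await call(job)
            # Crude pacing: spread RPM_LIMIT calls evenly across each minute.
            await asyncio.sleep(60 / RPM_LIMIT)
            return result

    return await asyncio.gather(*(one(j) for j in jobs))
```

&lt;p&gt;The semaphore bounds in-flight requests; the sleep smooths bursts so cron jobs and webhook storms stop landing in the same minute.&lt;/p&gt;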


&lt;h1&gt;
  
  
  4) Batching: The Fastest Way to Cut Calls, Cost, and 429s
&lt;/h1&gt;

&lt;p&gt;Batching means: &lt;strong&gt;one API request handles multiple independent tasks&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It works best when tasks are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;same type (e.g., 20 product blurbs)&lt;/li&gt;
&lt;li&gt;independent (no step depends on another)&lt;/li&gt;
&lt;li&gt;same output schema&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Why it helps
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;fewer network round-trips&lt;/li&gt;
&lt;li&gt;fewer requests → lower RPM pressure&lt;/li&gt;
&lt;li&gt;more predictable throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also: OpenAI’s pricing pages explicitly include a “Batch API price” column for several models. &lt;br&gt;
(That doesn’t mean “batching is free”, but it’s a strong hint the ecosystem expects this pattern.)&lt;/p&gt;


&lt;h1&gt;
  
  
  5) The Batching Prompt Template That Doesn’t Fall Apart
&lt;/h1&gt;

&lt;p&gt;Here’s a format that stays parseable under pressure.&lt;/p&gt;
&lt;h2&gt;
  
  
  5.1 Use task blocks + a strict JSON response schema
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SYSTEM: You output valid JSON only. No Markdown. No commentary.

USER:
You will process multiple tasks. 
Return a JSON array. Each item must be:
{
  "task_id": &amp;lt;int&amp;gt;,
  "title": &amp;lt;string&amp;gt;,
  "bullets": [&amp;lt;string&amp;gt;, &amp;lt;string&amp;gt;, &amp;lt;string&amp;gt;]
}

Rules:
- UK English spelling
- Title ≤ 12 words
- 3 bullets, each ≤ 18 words
- If input is missing: set title="INSUFFICIENT_DATA" and bullets=[]

TASKS:
### TASK 1
product_name: Insulated smart mug
material: 316 stainless steel
features: temperature alert, 7-day battery
audience: commuters

### TASK 2
product_name: Wireless earbuds
material: ABS shock-resistant
features: ANC, 24-hour battery
audience: students
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That “INSUFFICIENT_DATA” clause is your lifesaver. One broken task shouldn’t poison the whole batch.&lt;/p&gt;


&lt;h1&gt;
  
  
  6) Python Implementation: Batch → Call → Parse (With Guardrails)
&lt;/h1&gt;

&lt;p&gt;Below is a modern-ish pattern you can adapt (provider SDKs vary, so treat it as &lt;strong&gt;structure&lt;/strong&gt;, not a copy‑paste guarantee).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;

&lt;span class="n"&gt;MAX_RETRIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;backoff_sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry_after&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;retry_after&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retry_after&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;
    &lt;span class="n"&gt;jitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;jitter&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_batch_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You output valid JSON only. No Markdown. No commentary.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Return a JSON array. Each item must be:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;task_id&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &amp;lt;int&amp;gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &amp;lt;string&amp;gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;bullets&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: [&amp;lt;string&amp;gt;, &amp;lt;string&amp;gt;, &amp;lt;string&amp;gt;]&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rules:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- UK English spelling&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- Title ≤ 12 words&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- 3 bullets, each ≤ 18 words&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- If input is missing: set title=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;INSUFFICIENT_DATA&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt; and bullets=[]&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TASKS:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;blocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;blocks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;### TASK &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_name: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;material: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;material&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;features: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;features&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audience: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;audience&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blocks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_json_strict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="c1"&gt;# Hard fail if it's not JSON. This is intentional.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return (text, retry_after_seconds). Replace with your provider call.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nb"&gt;NotImplementedError&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_batch_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MAX_RETRIES&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;raw_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry_after&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;parse_json_strict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Ask the model to repair formatting in a second pass (or log + retry)
&lt;/span&gt;            &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fix the output into valid JSON only. Preserve meaning.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAD_OUTPUT:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;raw_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;backoff_sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# If your SDK exposes HTTP status + retry-after, use it here
&lt;/span&gt;            &lt;span class="nf"&gt;backoff_sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;last_error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;

    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Batch failed after retries: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;last_error&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What changed vs “classic” snippets?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;We treat JSON as a &lt;strong&gt;hard contract&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;We handle &lt;strong&gt;format repair&lt;/strong&gt; explicitly (and keep it cheap).&lt;/li&gt;
&lt;li&gt;We centralise backoff logic so every call behaves the same way.&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  7) How to Choose Batch Size (The Rule Everyone Learns the Hard Way)
&lt;/h1&gt;

&lt;p&gt;Batch size is constrained by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;context window (max tokens per request)&lt;/li&gt;
&lt;li&gt;TPM throughput&lt;/li&gt;
&lt;li&gt;response parsing stability&lt;/li&gt;
&lt;li&gt;your business tolerance for “one batch failed”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A practical heuristic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;start with &lt;strong&gt;10–20 items per batch&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;measure token usage&lt;/li&gt;
&lt;li&gt;increase until you see:

&lt;ul&gt;
&lt;li&gt;output format drift, or
&lt;/li&gt;
&lt;li&gt;timeouts / latency spikes, or
&lt;/li&gt;
&lt;li&gt;context overflow risk&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;And always keep a &lt;strong&gt;max batch token budget&lt;/strong&gt;.&lt;/p&gt;
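&lt;p&gt;A minimal sketch of that budget in code. The limits and the 4-characters-per-token estimate below are illustrative assumptions, not provider guidance:&lt;/p&gt;

```python
# Illustrative budgets; tune against your model's context window and TPM limits.
MAX_BATCH_ITEMS = 20
MAX_BATCH_TOKENS = 8000

def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English prose.
    # Swap in your provider's tokenizer for real counts.
    return max(1, len(text) // 4)

def split_into_batches(tasks):
    """Greedily pack tasks into batches under both item and token budgets."""
    batches, current, current_tokens = [], [], 0
    for task in tasks:
        cost = estimate_tokens(task)
        if current and (len(current) >= MAX_BATCH_ITEMS
                        or current_tokens + cost > MAX_BATCH_TOKENS):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(task)
        current_tokens += cost
    if current:
        batches.append(current)
    return batches
```

&lt;p&gt;Each resulting batch then goes through the retry/repair loop above as its own request, so one oversized input can never sink the whole job.&lt;/p&gt;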




&lt;h1&gt;
  
  
  8) “Cost Math” Without Fantasy Numbers
&lt;/h1&gt;

&lt;p&gt;Pricing changes. Tiers change. Models change.&lt;/p&gt;

&lt;p&gt;So instead of hard-coding ancient per-1K token values, calculate cost using the provider’s current pricing page.&lt;/p&gt;

&lt;p&gt;OpenAI publishes per‑token pricing on its API pricing pages.&lt;br&gt;&lt;br&gt;
Anthropic also publishes pricing and documents rate limit tiers. &lt;/p&gt;

&lt;p&gt;A useful cost estimator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cost ≈ (input_tokens * input_price + output_tokens * output_price) / 1,000,000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
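&lt;p&gt;The same estimator as a small function. The example prices are placeholders; read current values off your provider’s pricing page:&lt;/p&gt;

```python
def estimate_cost(input_tokens, output_tokens, input_price, output_price):
    """USD cost of one call, with prices quoted per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Placeholder prices: 2,000 input + 500 output tokens at $0.15 / $0.60 per 1M.
cost = estimate_cost(2000, 500, 0.15, 0.60)  # ~$0.0006
```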



&lt;p&gt;Then optimise the variables you control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;shrink input tokens
&lt;/li&gt;
&lt;li&gt;constrain output tokens
&lt;/li&gt;
&lt;li&gt;reduce number of calls (batch)
&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  9) Risks of Batching (And How to Not Get Burnt)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Risk 1: one bad item ruins the batch
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; “INSUFFICIENT_DATA” fallback per task.&lt;/p&gt;

&lt;h2&gt;
  
  
  Risk 2: output format drift breaks parsing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; strict JSON, repair step, and logging.&lt;/p&gt;

&lt;h2&gt;
  
  
  Risk 3: batch too big → context overflow
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; token budgeting + auto-splitting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Risk 4: “creative” attempts to bypass quotas
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; don’t. If you need more capacity, request higher limits and follow provider terms.&lt;/p&gt;




&lt;h1&gt;
  
  
  Final Take
&lt;/h1&gt;

&lt;p&gt;Rate limits aren’t the enemy. They’re your early warning system that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prompts are too long,
&lt;/li&gt;
&lt;li&gt;traffic is too bursty,
&lt;/li&gt;
&lt;li&gt;or your architecture assumes “infinite throughput”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you treat prompts like payloads (not prose), add pacing, and batch like a grown-up, you’ll get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fewer 429s
&lt;/li&gt;
&lt;li&gt;lower cost
&lt;/li&gt;
&lt;li&gt;and a system that scales without drama&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s the whole game.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>api</category>
      <category>python</category>
    </item>
    <item>
      <title>Choosing an LLM in 2026: The Practical Comparison Table (Specs, Cost, Latency, Compatibility)</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Mon, 26 Jan 2026 10:52:49 +0000</pubDate>
      <link>https://forem.com/superorange0707/choosing-an-llm-in-2026-the-practical-comparison-table-specs-cost-latency-compatibility-354g</link>
      <guid>https://forem.com/superorange0707/choosing-an-llm-in-2026-the-practical-comparison-table-specs-cost-latency-compatibility-354g</guid>
      <description>&lt;h1&gt;
  
  
  The uncomfortable truth: “model choice” is half your prompt engineering
&lt;/h1&gt;

&lt;p&gt;If your prompt is a recipe, the model is your kitchen.&lt;/p&gt;

&lt;p&gt;A great recipe doesn’t help if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the oven is tiny (context window),&lt;/li&gt;
&lt;li&gt;the ingredients are expensive (token price),&lt;/li&gt;
&lt;li&gt;the chef is slow (latency),&lt;/li&gt;
&lt;li&gt;or your tools don’t fit (function calling / JSON / SDK / ecosystem).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So here’s a &lt;strong&gt;practical&lt;/strong&gt; comparison you can actually use.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note on “parameters”:&lt;/strong&gt; for many frontier models, parameter counts are not publicly disclosed. In practice, context window + pricing + tool features predict “fit” better than guessing parameter scale.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  1) Quick comparison: what you should care about first
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1.1 The “four knobs” that matter
&lt;/h2&gt;

&lt;p&gt;1) &lt;strong&gt;Context&lt;/strong&gt;: can you fit the job in one request?&lt;br&gt;&lt;br&gt;
2) &lt;strong&gt;Cost&lt;/strong&gt;: can you afford volume?&lt;br&gt;&lt;br&gt;
3) &lt;strong&gt;Latency&lt;/strong&gt;: does your UX tolerate the wait?&lt;br&gt;&lt;br&gt;
4) &lt;strong&gt;Compatibility&lt;/strong&gt;: will your stack integrate cleanly?&lt;/p&gt;

&lt;p&gt;Everything else is second order.&lt;/p&gt;




&lt;h1&gt;
  
  
  2) Model spec table (context + positioning)
&lt;/h1&gt;

&lt;p&gt;This table focuses on what’s stable: &lt;strong&gt;family, positioning, and context expectations&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Model family (examples)&lt;/th&gt;
&lt;th&gt;Typical positioning&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;GPT family (e.g., &lt;code&gt;gpt-4o&lt;/code&gt;, &lt;code&gt;gpt-4.1&lt;/code&gt;, &lt;code&gt;gpt-5*&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;General-purpose, strong tooling ecosystem&lt;/td&gt;
&lt;td&gt;Pricing + cached input are clearly published.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;“o” reasoning family (e.g., &lt;code&gt;o3&lt;/code&gt;, &lt;code&gt;o1&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Deep reasoning / harder planning&lt;/td&gt;
&lt;td&gt;Often higher cost; use selectively.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Claude family (e.g., Haiku / Sonnet tiers)&lt;/td&gt;
&lt;td&gt;Strong writing + safety posture; clean docs&lt;/td&gt;
&lt;td&gt;Pricing table includes multiple rate dimensions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;td&gt;Gemini family (Flash / Pro tiers)&lt;/td&gt;
&lt;td&gt;Multimodal + Google ecosystem + caching/grounding options&lt;/td&gt;
&lt;td&gt;Pricing page explicitly covers caching + grounding.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek&lt;/td&gt;
&lt;td&gt;DeepSeek chat + reasoning models&lt;/td&gt;
&lt;td&gt;Aggressive price/perf, popular for scale&lt;/td&gt;
&lt;td&gt;Official pricing docs available.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;Llama / Qwen / Mistral etc.&lt;/td&gt;
&lt;td&gt;Self-host for privacy/control&lt;/td&gt;
&lt;td&gt;Context depends on model; Llama 3.1 supports 128K.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  3) Pricing table (the part your CFO actually reads)
&lt;/h1&gt;

&lt;p&gt;Below are &lt;strong&gt;public list prices&lt;/strong&gt; from official docs (USD per &lt;strong&gt;1M tokens&lt;/strong&gt;).&lt;br&gt;&lt;br&gt;
Use this as a baseline, then apply: caching, batch discounts, and your real output length.&lt;/p&gt;

&lt;h2&gt;
  
  
  3.1 OpenAI (selected highlights)
&lt;/h2&gt;

&lt;p&gt;OpenAI publishes input, cached input, and output prices per 1M tokens.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input / 1M&lt;/th&gt;
&lt;th&gt;Cached input / 1M&lt;/th&gt;
&lt;th&gt;Output / 1M&lt;/th&gt;
&lt;th&gt;When to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gpt-4.1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;$8.00&lt;/td&gt;
&lt;td&gt;High-quality general reasoning with sane cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gpt-4o&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;Multimodal-ish “workhorse” if you need it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gpt-4o-mini&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;td&gt;$0.075&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;td&gt;High-throughput chat, extraction, tagging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;o3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;$8.00&lt;/td&gt;
&lt;td&gt;Reasoning-heavy tasks without the top-end pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;o1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;$7.50&lt;/td&gt;
&lt;td&gt;$60.00&lt;/td&gt;
&lt;td&gt;“Use sparingly”: hard reasoning where mistakes are expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;If you’re building a product: you’ll often run &lt;strong&gt;80–95%&lt;/strong&gt; of calls on a cheaper model (mini/fast tier), and escalate only the hard cases.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  3.2 Anthropic (Claude)
&lt;/h2&gt;

&lt;p&gt;Anthropic publishes a model pricing table in Claude docs. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input / MTok&lt;/th&gt;
&lt;th&gt;Output / MTok&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;$1.00&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;Fast, budget-friendly tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 3.5&lt;/td&gt;
&lt;td&gt;$0.80&lt;/td&gt;
&lt;td&gt;$4.00&lt;/td&gt;
&lt;td&gt;Even cheaper tier option&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 3.7 (deprecated)&lt;/td&gt;
&lt;td&gt;$3.75&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;Listed as deprecated on pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 3 (deprecated)&lt;/td&gt;
&lt;td&gt;$18.75&lt;/td&gt;
&lt;td&gt;$75.00&lt;/td&gt;
&lt;td&gt;Premium, but marked deprecated&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; model availability changes. Treat the pricing table as the authoritative “what exists right now.” &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  3.3 Google Gemini (Developer API)
&lt;/h2&gt;

&lt;p&gt;Gemini pricing varies by tier and includes context caching + grounding pricing.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier (example rows from pricing page)&lt;/th&gt;
&lt;th&gt;Input / 1M (text/image/video)&lt;/th&gt;
&lt;th&gt;Output / 1M&lt;/th&gt;
&lt;th&gt;Notable extras&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini tier (row example)&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;Context caching + grounding options&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini Flash-style row example&lt;/td&gt;
&lt;td&gt;$0.10&lt;/td&gt;
&lt;td&gt;$0.40&lt;/td&gt;
&lt;td&gt;Very low output cost; good for high volume&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Gemini’s pricing page also lists:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;context caching prices&lt;/strong&gt;, and
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;grounding with Google Search&lt;/strong&gt; pricing/limits. &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3.4 DeepSeek (API)
&lt;/h2&gt;

&lt;p&gt;DeepSeek publishes pricing in its API docs and on its pricing page. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model family (per DeepSeek pricing pages)&lt;/th&gt;
&lt;th&gt;What to expect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V3 / “chat” tier&lt;/td&gt;
&lt;td&gt;Very low per-token pricing compared to many frontier models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-R1 reasoning tier&lt;/td&gt;
&lt;td&gt;Higher than chat tier, still aggressively priced&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  4) Latency: don’t use fake “average seconds” tables
&lt;/h1&gt;

&lt;p&gt;Most blog latency tables are either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;measured on one day, one region, one payload, then recycled forever, or&lt;/li&gt;
&lt;li&gt;pure fiction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, use &lt;strong&gt;two metrics you can actually observe&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;1) &lt;strong&gt;TTFT (time to first token)&lt;/strong&gt; — how fast streaming starts&lt;br&gt;&lt;br&gt;
2) &lt;strong&gt;Tokens/sec&lt;/strong&gt; — how fast output arrives once it starts&lt;/p&gt;

&lt;h2&gt;
  
  
  4.1 Practical latency expectations (directional)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;“Mini/Flash” tiers usually win TTFT and throughput for chat-style workloads.&lt;/li&gt;
&lt;li&gt;“Reasoning” tiers typically have slower TTFT and may output more tokens (more thinking), so perceived latency increases.&lt;/li&gt;
&lt;li&gt;Long context inputs increase latency &lt;em&gt;everywhere&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4.2 How to benchmark for your own product (a 15-minute method)
&lt;/h2&gt;

&lt;p&gt;Create a small benchmark script that sends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the same prompt (e.g., 400–800 tokens),&lt;/li&gt;
&lt;li&gt;fixed max output (e.g., 300 tokens),&lt;/li&gt;
&lt;li&gt;in your target region,&lt;/li&gt;
&lt;li&gt;for 30–50 runs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Record:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;p50 / p95 TTFT,&lt;/li&gt;
&lt;li&gt;p50 / p95 total time,&lt;/li&gt;
&lt;li&gt;tokens/sec.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then make the decision with data, not vibes.&lt;/p&gt;
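&lt;p&gt;A sketch of that harness. The &lt;code&gt;fake_stream&lt;/code&gt; stub is a stand-in so the code runs on its own; swap in your SDK’s streaming call:&lt;/p&gt;

```python
import statistics
import time

def percentile(values, p):
    """Nearest-rank percentile; good enough for a 30-50 run benchmark."""
    ordered = sorted(values)
    k = min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))
    return ordered[k]

def benchmark(stream_fn, runs=30):
    """Measure TTFT and tokens/sec over repeated runs of a token stream."""
    ttfts, totals, rates = [], [], []
    for _ in range(runs):
        start = time.perf_counter()
        first = None
        n_tokens = 0
        for _token in stream_fn():
            if first is None:
                first = time.perf_counter() - start  # time to first token
            n_tokens += 1
        total = time.perf_counter() - start
        ttfts.append(first)
        totals.append(total)
        rates.append(n_tokens / total if total else 0.0)
    return {
        "p50_ttft": percentile(ttfts, 50),
        "p95_ttft": percentile(ttfts, 95),
        "p50_total": percentile(totals, 50),
        "p95_total": percentile(totals, 95),
        "tokens_per_sec": statistics.mean(rates),
    }

# Stub stream so the harness runs standalone; replace with a real API stream.
def fake_stream():
    for _ in range(300):
        yield "tok"
```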




&lt;h1&gt;
  
  
  5) Compatibility: why “tooling fit” beats raw model quality
&lt;/h1&gt;

&lt;p&gt;A model that’s 5% “smarter” but breaks your stack is a net loss.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.1 Prompt + API surface compatibility (what breaks when you switch models)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;OpenAI&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;th&gt;Open-source (self-host)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Strong “system instruction” control&lt;/td&gt;
&lt;td&gt;Yes (explicit system role)&lt;/td&gt;
&lt;td&gt;Yes (instructions patterns supported)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Depends on serving stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool / function calling&lt;/td&gt;
&lt;td&gt;Widely used in ecosystem&lt;/td&gt;
&lt;td&gt;Supported via tools patterns (provider-specific)&lt;/td&gt;
&lt;td&gt;Supports tools + grounding options&lt;/td&gt;
&lt;td&gt;Often “prompt it to emit JSON”, no native tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured output reliability&lt;/td&gt;
&lt;td&gt;Strong with constraints&lt;/td&gt;
&lt;td&gt;Strong, especially on long text&lt;/td&gt;
&lt;td&gt;Strong with explicit schemas&lt;/td&gt;
&lt;td&gt;Varies a lot; needs examples + validators&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Caching / batch primitives&lt;/td&gt;
&lt;td&gt;Cached input pricing published&lt;/td&gt;
&lt;td&gt;Provider features vary&lt;/td&gt;
&lt;td&gt;Context caching explicitly priced&lt;/td&gt;
&lt;td&gt;You implement caching yourself&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  5.2 Ecosystem fit (a.k.a. “what do you already use?”)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;If you live in &lt;strong&gt;Google Workspace / Vertex-style workflows&lt;/strong&gt;, Gemini integration + grounding options can be a natural fit. &lt;/li&gt;
&lt;li&gt;If you rely on a broad third-party automation ecosystem, OpenAI + Claude both have mature SDK + tooling coverage (LangChain etc.).
&lt;/li&gt;
&lt;li&gt;If you need &lt;strong&gt;data residency / on-prem&lt;/strong&gt;, open-source models (Llama/Qwen) let you keep data inside your boundary, but you pay in MLOps. &lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  6) The decision checklist: pick models like an engineer
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Step 1 — classify the task
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High volume / low stakes&lt;/strong&gt;: tagging, rewrite, FAQ, extraction
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium stakes&lt;/strong&gt;: customer support replies, internal reporting
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High stakes&lt;/strong&gt;: legal, finance, security, medical-like domains (be careful)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 2 — decide your stack (the “2–3 model rule”)
&lt;/h2&gt;

&lt;p&gt;A common setup:&lt;/p&gt;

&lt;p&gt;1) &lt;strong&gt;Fast cheap tier&lt;/strong&gt; for most requests&lt;br&gt;&lt;br&gt;
2) &lt;strong&gt;Strong tier&lt;/strong&gt; for hard prompts, long context, tricky reasoning&lt;br&gt;&lt;br&gt;
3) Optional: &lt;strong&gt;realtime&lt;/strong&gt; or &lt;strong&gt;deep reasoning&lt;/strong&gt; tier for specific UX/features&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3 — cost control strategy (before you ship)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;enforce output length limits
&lt;/li&gt;
&lt;li&gt;cache repeated system/context
&lt;/li&gt;
&lt;li&gt;batch homogeneous jobs
&lt;/li&gt;
&lt;li&gt;add escalation rules (don’t send everything to your most expensive model)&lt;/li&gt;
&lt;/ul&gt;
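&lt;p&gt;The escalation rule can be a few lines of code. The model names and the prompt-length heuristic below are illustrative assumptions, not provider guidance:&lt;/p&gt;

```python
# Illustrative tiers; substitute whatever models your stack actually uses.
CHEAP_MODEL = "gpt-4o-mini"
STRONG_MODEL = "gpt-4.1"

def pick_model(prompt, prior_failures=0):
    """Default to the cheap tier; escalate long or previously-failed prompts."""
    long_prompt = len(prompt) > 4000  # crude proxy for context pressure
    if prior_failures > 0 or long_prompt:
        return STRONG_MODEL
    return CHEAP_MODEL
```

&lt;p&gt;In practice you would add domain signals (task class, confidence from a validator) to the escalation condition, but the shape stays the same: cheap by default, expensive by exception.&lt;/p&gt;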




&lt;h1&gt;
  
  
  7) A practical comparison table you can paste into a PRD
&lt;/h1&gt;

&lt;p&gt;Here’s a short “copy/paste” table for stakeholders.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Default pick&lt;/th&gt;
&lt;th&gt;Escalate to&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Customer support chatbot&lt;/td&gt;
&lt;td&gt;Latency + cost&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;gpt-4o-mini&lt;/code&gt; (or Gemini Flash-tier)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;gpt-4.1&lt;/code&gt; / Claude higher tier&lt;/td&gt;
&lt;td&gt;Cheap 80–90%, escalate only ambiguous cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long document synthesis&lt;/td&gt;
&lt;td&gt;Context + format stability&lt;/td&gt;
&lt;td&gt;Claude tier with strong long-form behaviour&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gpt-4.1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Long prompts + structured output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coding helper in IDE&lt;/td&gt;
&lt;td&gt;Tooling + correctness&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;gpt-4.1&lt;/code&gt; or equivalent&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;o3&lt;/code&gt; / &lt;code&gt;o1&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Deep reasoning for tricky bugs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privacy-sensitive internal assistant&lt;/td&gt;
&lt;td&gt;Data boundary&lt;/td&gt;
&lt;td&gt;Self-host Llama/Qwen&lt;/td&gt;
&lt;td&gt;Cloud model for non-sensitive output&lt;/td&gt;
&lt;td&gt;Keep raw data in-house&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  Final take
&lt;/h1&gt;

&lt;p&gt;“Best model” is not a thing.&lt;/p&gt;

&lt;p&gt;There’s only &lt;strong&gt;best model for this prompt, this latency budget, this cost envelope, and this ecosystem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you ship with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a measured benchmark,&lt;/li&gt;
&lt;li&gt;a 2–3 model stack,&lt;/li&gt;
&lt;li&gt;strict output constraints,&lt;/li&gt;
&lt;li&gt;and caching/batching,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…you’ll outperform teams who chase the newest model every month.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>promptengineering</category>
      <category>openai</category>
    </item>
    <item>
      <title>When AI Can Make “Perfect Decisions”: Why Dynamic Contracts Are the Real Safety Layer</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Fri, 16 Jan 2026 17:28:48 +0000</pubDate>
      <link>https://forem.com/superorange0707/when-ai-can-make-perfect-decisions-why-dynamic-contracts-are-the-real-safety-layer-390</link>
      <guid>https://forem.com/superorange0707/when-ai-can-make-perfect-decisions-why-dynamic-contracts-are-the-real-safety-layer-390</guid>
      <description>&lt;h1&gt;
  
  
  The “Perfect Decision” Trap
&lt;/h1&gt;

&lt;p&gt;We’re entering the era where AI doesn’t just answer questions — it &lt;strong&gt;selects actions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Supply chain routing. Credit risk. Fraud detection. Treatment planning. Portfolio optimisation.&lt;br&gt;&lt;br&gt;
The pitch is always the same:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Give the model data and objectives, and it will find the best move.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And in a narrow, mathematical sense, it can.&lt;/p&gt;

&lt;p&gt;But here’s the catch: optimisation is a superpower &lt;strong&gt;and&lt;/strong&gt; a liability.&lt;/p&gt;

&lt;p&gt;Because if a system can optimise &lt;em&gt;perfectly&lt;/em&gt;, it can also optimise &lt;strong&gt;perfectly for the wrong thing&lt;/strong&gt; — quietly, consistently, at scale.&lt;/p&gt;

&lt;p&gt;That’s why the most important design problem isn’t “make the AI smarter.”&lt;br&gt;&lt;br&gt;
It’s “make the relationship between humans and AI &lt;em&gt;adaptive, observable, and enforceable&lt;/em&gt;.”&lt;/p&gt;

&lt;p&gt;Call that relationship a &lt;strong&gt;dynamic contract&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  1) Why “Perfect” AI Decisions Are a Double-Edged Sword
&lt;/h1&gt;

&lt;p&gt;AI’s “perfection” is usually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;statistical&lt;/strong&gt; (best expected value given assumptions),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;objective-driven&lt;/strong&gt; (maximise what you told it to maximise),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;context-blind&lt;/strong&gt; (it doesn’t feel the consequences).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A model can deliver the highest-return portfolio while ignoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reputational risk,&lt;/li&gt;
&lt;li&gt;regulatory risk,&lt;/li&gt;
&lt;li&gt;long-term trust erosion,&lt;/li&gt;
&lt;li&gt;human welfare.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A model can produce the fastest medical plan while ignoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;quality of life,&lt;/li&gt;
&lt;li&gt;patient preferences,&lt;/li&gt;
&lt;li&gt;risk tolerance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI can optimise the &lt;em&gt;map&lt;/em&gt; while humans live on the &lt;em&gt;territory&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The problem is not malice.&lt;br&gt;&lt;br&gt;
It’s that &lt;strong&gt;objectives are incomplete&lt;/strong&gt;, and the world changes faster than your policy doc.&lt;/p&gt;




&lt;h1&gt;
  
  
  2) Static Rules vs Dynamic Contracts
&lt;/h1&gt;

&lt;p&gt;Static rules are how we’ve governed software for decades:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Do X, don’t do Y.”&lt;/li&gt;
&lt;li&gt;“If this, then that.”&lt;/li&gt;
&lt;li&gt;“Hard limits.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They’re easy to explain, test, and audit — until they meet reality.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.1 The limits of static rules
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1) The world changes, your rules don’t
&lt;/h3&gt;

&lt;p&gt;Market regimes shift. User behaviour shifts. Regulations shift. Data pipelines shift.&lt;br&gt;&lt;br&gt;
Static rules drift from reality, and “optimal” actions start producing weird harm.&lt;/p&gt;

&lt;h3&gt;
  
  
  2) Objective–value mismatch grows over time
&lt;/h3&gt;

&lt;p&gt;A fixed objective function (“maximise conversion”, “minimise cost”) slowly detaches from what you &lt;em&gt;mean&lt;/em&gt; (“healthy growth”, “fair treatment”, “sustainable outcomes”).&lt;/p&gt;

&lt;h3&gt;
  
  
  3) Risk accumulates silently
&lt;/h3&gt;

&lt;p&gt;When the system makes thousands of decisions per hour, small misalignments compound.&lt;br&gt;&lt;br&gt;
Static constraints become a &lt;strong&gt;thin fence&lt;/strong&gt; around a fast-moving machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.2 Dynamic contracts (the upgrade)
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;dynamic contract&lt;/strong&gt; is not “no rules.” It’s &lt;strong&gt;rules with a control system&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;goals can be updated,&lt;/li&gt;
&lt;li&gt;constraints can be tightened or relaxed,&lt;/li&gt;
&lt;li&gt;the system is monitored continuously,&lt;/li&gt;
&lt;li&gt;humans can intervene,&lt;/li&gt;
&lt;li&gt;accountability is explicit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think: not a fence — a &lt;strong&gt;safety harness with sensors, alarms, and a manual brake&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  3) What a Dynamic Contract Actually Looks Like
&lt;/h1&gt;

&lt;p&gt;A dynamic contract has four components. Miss one, and you’re back to vibes.&lt;/p&gt;

&lt;h2&gt;
  
  
  3.1 Continuous adjustment (rules are living, not laminated)
&lt;/h2&gt;

&lt;p&gt;A dynamic contract assumes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;objectives evolve,&lt;/li&gt;
&lt;li&gt;risk tolerances evolve,&lt;/li&gt;
&lt;li&gt;incentives evolve.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the system must support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;updating thresholds,&lt;/li&gt;
&lt;li&gt;changing objective weights,&lt;/li&gt;
&lt;li&gt;enabling/disabling actions by context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not “moving goalposts.”&lt;br&gt;&lt;br&gt;
It’s acknowledging that &lt;strong&gt;the goalposts move whether you admit it or not&lt;/strong&gt;.&lt;/p&gt;
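&lt;p&gt;One way to sketch a “living” contract: an immutable config object where every change produces a new, auditable version. All names here are illustrative:&lt;/p&gt;

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Contract:
    version: int
    risk_threshold: float
    objective_weights: dict
    allowed_actions: frozenset

    def updated(self, **changes):
        """Every change produces a new, auditable version."""
        return replace(self, version=self.version + 1, **changes)

v1 = Contract(
    version=1,
    risk_threshold=0.2,
    objective_weights={"return": 0.7, "stability": 0.3},
    allowed_actions=frozenset({"rebalance", "hold"}),
)
# Tighten risk and disable an action class without redeploying the model.
v2 = v1.updated(risk_threshold=0.1,
                allowed_actions=v1.allowed_actions - {"rebalance"})
```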

&lt;h2&gt;
  
  
  3.2 Real-time observability (decisions must be inspectable)
&lt;/h2&gt;

&lt;p&gt;If the system can’t show:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what it decided,&lt;/li&gt;
&lt;li&gt;why it decided,&lt;/li&gt;
&lt;li&gt;what data it used,&lt;/li&gt;
&lt;li&gt;what constraints were active,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…then you don’t have governance. You have hope.&lt;/p&gt;

&lt;p&gt;Observability means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;decision logs,&lt;/li&gt;
&lt;li&gt;feature/inputs snapshots,&lt;/li&gt;
&lt;li&gt;model version and prompt version tracking,&lt;/li&gt;
&lt;li&gt;anomaly alerts (distribution shift, rising error rates, unusual outputs).&lt;/li&gt;
&lt;/ul&gt;
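&lt;p&gt;Concretely, a decision log entry can be one structured record per decision. The field names below are illustrative:&lt;/p&gt;

```python
import json
import time
import uuid

def decision_record(action, inputs, constraints, model_version, prompt_version):
    """One inspectable decision, serialised as a JSON line for your log store."""
    return json.dumps({
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "action": action,                # what it decided
        "inputs": inputs,                # snapshot of the data it used
        "constraints": constraints,      # which rules were active
        "model_version": model_version,  # plus prompt version, for replayability
        "prompt_version": prompt_version,
    })

line = decision_record("approve_credit", {"score": 0.91},
                       ["limit_usd: 5000"], "m-2026.01", "p-17")
```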

&lt;h2&gt;
  
  
  3.3 Human override (intervention must be executable)
&lt;/h2&gt;

&lt;p&gt;A contract without an override is a ceremony.&lt;/p&gt;

&lt;p&gt;You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;pause switches&lt;/strong&gt; (kill switch / degrade mode),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;policy overrides&lt;/strong&gt; (block a class of actions),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;manual approvals&lt;/strong&gt; for high-risk actions,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;rollback&lt;/strong&gt; to a previous safe configuration.&lt;/li&gt;
&lt;/ul&gt;
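&lt;p&gt;A sketch of how those overrides compose. The class and method names (&lt;code&gt;Mode&lt;/code&gt;, &lt;code&gt;OverrideController&lt;/code&gt;) are assumptions for illustration:&lt;/p&gt;

```python
from enum import Enum

class Mode(Enum):
    NORMAL = "normal"      # autonomous within the contract
    DEGRADED = "degraded"  # smaller scope, more human review
    PAUSED = "paused"      # kill switch: no automated actions

class OverrideController:
    def __init__(self):
        self.mode = Mode.NORMAL
        self.blocked_actions = set()  # policy overrides: banned action classes

    def pause(self):
        self.mode = Mode.PAUSED

    def degrade(self):
        self.mode = Mode.DEGRADED

    def block(self, action_class):
        self.blocked_actions.add(action_class)

    def allows(self, action_class, high_risk=False):
        if self.mode is Mode.PAUSED:
            return False
        if action_class in self.blocked_actions:
            return False
        if high_risk or self.mode is Mode.DEGRADED:
            return "needs_human_approval"  # route to manual approval
        return True

ctl = OverrideController()
ctl.block("bulk_refund")
print(ctl.allows("reroute"))      # True
print(ctl.allows("bulk_refund"))  # False
```

&lt;p&gt;Rollback is then just redeploying a previous contract version through the same controller.&lt;/p&gt;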

&lt;h2&gt;
  
  
  3.4 Responsibility chain (power and risk must align)
&lt;/h2&gt;

&lt;p&gt;If AI makes decisions, who owns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;outcomes?&lt;/li&gt;
&lt;li&gt;regressions?&lt;/li&gt;
&lt;li&gt;incidents?&lt;/li&gt;
&lt;li&gt;compliance?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dynamic contracts require a clear chain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;who approves contract changes,&lt;/li&gt;
&lt;li&gt;who monitors alerts,&lt;/li&gt;
&lt;li&gt;who signs off on high-risk domains,&lt;/li&gt;
&lt;li&gt;how you do post-incident review.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is less “ethics theatre,” more &lt;strong&gt;on-call rotation for decision systems&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  4) Dynamic Contracts as a Control Loop (Not a Buzzword)
&lt;/h1&gt;

&lt;p&gt;At a systems level, this is a closed loop:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mzqpedfi3v34cc1y6ux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mzqpedfi3v34cc1y6ux.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This loop is the difference between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“it worked in staging” and&lt;/li&gt;
&lt;li&gt;“it survives the real world.”&lt;/li&gt;
&lt;/ul&gt;
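&lt;p&gt;The loop can be sketched in a few lines. This is a toy, assuming a contract dict with adjustable fields (&lt;code&gt;risk_limit&lt;/code&gt;, &lt;code&gt;human_review&lt;/code&gt;) and an arbitrary alert threshold:&lt;/p&gt;

```python
# Toy closed loop: observe outcomes, compare against a tolerance,
# feed adjustments back into the contract. Names are assumptions.
def run_control_loop(decisions, contract, alert_threshold=0.2):
    """Tighten the contract when the observed error rate drifts too high."""
    errors = sum(1 for d in decisions if d["outcome"] == "error")
    error_rate = errors / len(decisions)
    if error_rate > alert_threshold:
        contract["risk_limit"] *= 0.5    # tighten risk limits
        contract["human_review"] = True  # pull humans back into the loop
    return error_rate, contract

contract = {"risk_limit": 1000.0, "human_review": False}
decisions = [{"outcome": "ok"}] * 7 + [{"outcome": "error"}] * 3
rate, contract = run_control_loop(decisions, contract)
print(rate, contract["human_review"])  # 0.3 True
```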




&lt;h1&gt;
  
  
  5) Three Real-World Patterns Where Dynamic Contracts Matter
&lt;/h1&gt;

&lt;h2&gt;
  
  
  5.1 Supply chain: “lowest cost” vs “lowest risk”
&lt;/h2&gt;

&lt;p&gt;A routing model might optimise purely for cost.&lt;br&gt;&lt;br&gt;
But real operations have constraints that appear mid-flight:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;strike actions,&lt;/li&gt;
&lt;li&gt;supplier delays,&lt;/li&gt;
&lt;li&gt;customs bottlenecks,&lt;/li&gt;
&lt;li&gt;seasonal demand spikes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Dynamic contract move:&lt;/strong&gt; temporarily reweight objectives toward reliability, tighten risk limits, trigger manual approval for reroutes above a threshold.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.2 Finance: “best return” vs “acceptable behaviour”
&lt;/h2&gt;

&lt;p&gt;A portfolio optimiser can deliver higher returns by exploiting correlations that become fragile under stress — or by concentrating in ethically questionable exposure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic contract move:&lt;/strong&gt; enforce shifting exposure caps, add human approval gates when volatility spikes, record decision provenance for audit.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.3 Healthcare: “fastest recovery” vs “patient values”
&lt;/h2&gt;

&lt;p&gt;AI can recommend the most statistically effective treatment, but “best” depends on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;side effects tolerance,&lt;/li&gt;
&lt;li&gt;personal priorities,&lt;/li&gt;
&lt;li&gt;comorbidities,&lt;/li&gt;
&lt;li&gt;informed consent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Dynamic contract move:&lt;/strong&gt; require preference capture, enforce explainability, and make clinician override first-class, not an afterthought.&lt;/p&gt;




&lt;h1&gt;
  
  
  6) How to Implement Dynamic Contracts (Without Building a Religion)
&lt;/h1&gt;

&lt;p&gt;Here’s the pragmatic blueprint.&lt;/p&gt;

&lt;h2&gt;
  
  
  6.1 Start with a contract schema
&lt;/h2&gt;

&lt;p&gt;Define the contract in machine-readable form (YAML/JSON), e.g.:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;objective weights&lt;/li&gt;
&lt;li&gt;hard constraints&lt;/li&gt;
&lt;li&gt;approval thresholds&lt;/li&gt;
&lt;li&gt;escalation rules&lt;/li&gt;
&lt;li&gt;logging requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Treat it like code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;version it&lt;/li&gt;
&lt;li&gt;review it&lt;/li&gt;
&lt;li&gt;deploy it&lt;/li&gt;
&lt;li&gt;roll it back&lt;/li&gt;
&lt;/ul&gt;
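&lt;p&gt;A minimal sketch of such a schema as a Python dict, plus a pre-deployment check. The section names are assumptions, not a standard:&lt;/p&gt;

```python
# A machine-readable contract, treated like code: versioned, reviewed,
# deployable, rollbackable. Section names are illustrative.
contract_v2 = {
    "version": 2,
    "objective_weights": {"cost": 0.6, "reliability": 0.4},
    "hard_constraints": ["respect_supplier_blocklist", "cap_reroute_cost_500"],
    "approval_thresholds": {"reroute_cost": 300},
    "escalation": {"on_anomaly": "page_oncall"},
    "logging": {"decision_log": True, "inputs_snapshot": True},
}

REQUIRED_SECTIONS = {"version", "objective_weights", "hard_constraints",
                     "approval_thresholds", "escalation", "logging"}

def validate_contract(contract):
    """Reject a contract missing required sections before it can deploy."""
    missing = REQUIRED_SECTIONS - contract.keys()
    if missing:
        raise ValueError(f"contract missing sections: {sorted(missing)}")
    if abs(sum(contract["objective_weights"].values()) - 1.0) > 1e-9:
        raise ValueError("objective weights must sum to 1")
    return True

print(validate_contract(contract_v2))  # True
```

&lt;p&gt;Validation in CI is what makes “review it, deploy it, roll it back” more than a slogan.&lt;/p&gt;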

&lt;h2&gt;
  
  
  6.2 Add a “policy engine” layer
&lt;/h2&gt;

&lt;p&gt;Your model shouldn’t directly execute actions.&lt;br&gt;&lt;br&gt;
It should propose actions that pass through a policy layer.&lt;/p&gt;

&lt;p&gt;Policy layer responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;enforce constraints,&lt;/li&gt;
&lt;li&gt;require approvals,&lt;/li&gt;
&lt;li&gt;route to safe fallbacks,&lt;/li&gt;
&lt;li&gt;attach provenance metadata.&lt;/li&gt;
&lt;/ul&gt;
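&lt;p&gt;A sketch of that gatekeeping logic. The function name and contract fields are assumptions; the point is that the model proposes and the policy layer disposes:&lt;/p&gt;

```python
def decide(proposal, contract):
    """Gate a proposed action: execute, needs_approval, or rejected."""
    # 1) Enforce hard constraints
    if proposal["action"] in contract["blocked_actions"]:
        return {"status": "rejected", "reason": "blocked action class"}
    # 2) Require human approval above the risk threshold
    if proposal["risk_score"] > contract["approval_threshold"]:
        return {"status": "needs_approval", "reason": "risk above threshold"}
    # 3) Attach provenance metadata so every execution is auditable
    return {
        "status": "execute",
        "provenance": {
            "contract_version": contract["version"],
            "model_version": proposal["model_version"],
        },
    }

contract = {"version": 12, "blocked_actions": {"mass_email"},
            "approval_threshold": 0.7}
result = decide({"action": "reroute", "risk_score": 0.4,
                 "model_version": "m7"}, contract)
print(result["status"])  # execute
```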

&lt;h2&gt;
  
  
  6.3 Add monitoring that’s tied to actions, not dashboards
&lt;/h2&gt;

&lt;p&gt;Dashboards are passive. You need &lt;strong&gt;alerts&lt;/strong&gt; linked to contract changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“False positive rate increased after contract v12”&lt;/li&gt;
&lt;li&gt;“Decision distribution drifted post-update”&lt;/li&gt;
&lt;li&gt;“High-risk actions exceeded threshold”&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6.4 Build the incident playbook now, not after the incident
&lt;/h2&gt;

&lt;p&gt;At minimum:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stop-the-world switch&lt;/li&gt;
&lt;li&gt;degrade mode (smaller scope, higher human review)&lt;/li&gt;
&lt;li&gt;rollback to last safe contract&lt;/li&gt;
&lt;li&gt;postmortem template (what changed, what broke, how detected, how prevented)&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  7) A Quick Checklist: Are You Actually Running a Dynamic Contract?
&lt;/h1&gt;

&lt;p&gt;If you answer “no” to any of these, you’re still on static rules.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can we update objectives without redeploying the model?&lt;/li&gt;
&lt;li&gt;Can we see why each decision happened (inputs + policy + version)?&lt;/li&gt;
&lt;li&gt;Do we have a kill switch and rollback that works in minutes?&lt;/li&gt;
&lt;li&gt;Do we have approval gates for high-risk actions?&lt;/li&gt;
&lt;li&gt;Can we audit who changed what and when?&lt;/li&gt;
&lt;li&gt;Do we measure harm, not just performance?&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Final Take
&lt;/h1&gt;

&lt;p&gt;AI will keep getting better at optimisation.&lt;br&gt;&lt;br&gt;
That’s not the scary part.&lt;/p&gt;

&lt;p&gt;The scary part is that &lt;strong&gt;our objectives will remain incomplete&lt;/strong&gt;, and our environments will keep changing.&lt;/p&gt;

&lt;p&gt;So the only sane way forward is to treat AI decision-making as a governed system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;not static rules,&lt;/li&gt;
&lt;li&gt;not blind trust,&lt;/li&gt;
&lt;li&gt;but a &lt;strong&gt;dynamic contract&lt;/strong&gt; — living guardrails with observability, override, and accountability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the future isn’t “AI makes decisions.”&lt;br&gt;&lt;br&gt;
It’s “humans and AI co-manage a decision system — continuously.”&lt;/p&gt;

&lt;p&gt;That’s how you get “perfect decisions” without perfect disasters.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>governance</category>
      <category>agents</category>
      <category>product</category>
    </item>
    <item>
      <title>AI Slop in 2025: The Year the Internet Started Eating Itself</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Fri, 16 Jan 2026 17:28:23 +0000</pubDate>
      <link>https://forem.com/superorange0707/ai-slop-in-2025-the-year-the-internet-started-eating-itself-4ifa</link>
      <guid>https://forem.com/superorange0707/ai-slop-in-2025-the-year-the-internet-started-eating-itself-4ifa</guid>
      <description>&lt;h1&gt;
  
  
  The Year the Internet Started Eating Itself
&lt;/h1&gt;

&lt;p&gt;If “the internet is cooked” had a corporate deck, 2025 would be slide one.&lt;/p&gt;

&lt;p&gt;The word &lt;strong&gt;slop&lt;/strong&gt;—once basically “wet leftovers for animals”—became a serious label for something we all recognise: &lt;strong&gt;cheap, mass-produced AI content that looks like information but behaves like noise.&lt;/strong&gt; Merriam-Webster didn’t just nod at the trend; it crowned “slop” Word of the Year and formalised the definition as AI-produced low-quality digital content made in quantity. &lt;/p&gt;

&lt;p&gt;Here’s the uncomfortable part: AI slop isn’t just a “quality problem.”&lt;br&gt;&lt;br&gt;
It’s a &lt;strong&gt;systems problem&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;incentives (traffic, ad money, affiliate payouts)
&lt;/li&gt;
&lt;li&gt;distribution (recommendation engines)
&lt;/li&gt;
&lt;li&gt;low marginal cost (content is basically free to generate)
&lt;/li&gt;
&lt;li&gt;weak provenance (hard to tell what’s real, or &lt;em&gt;who&lt;/em&gt; made it)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So this isn’t a rant. It’s a field report: &lt;strong&gt;scale, impact, and governance paths&lt;/strong&gt; that don’t rely on wishful thinking.&lt;/p&gt;




&lt;h1&gt;
  
  
  1) What Counts as “AI Slop” (And What Doesn’t)
&lt;/h1&gt;

&lt;p&gt;Let’s separate terms that get mixed together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deepfakes&lt;/strong&gt;: synthetic media meant to &lt;em&gt;deceive&lt;/em&gt; (often identity/voice/video).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinations&lt;/strong&gt;: model errors (it makes up citations or facts).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI slop&lt;/strong&gt;: the broader category—content churned out cheaply at scale, with low effort and low information value.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s why the Merriam-Webster definition matters. It anchors the term in two critical properties:&lt;/p&gt;

&lt;p&gt;1) &lt;strong&gt;produced usually in quantity&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
2) &lt;strong&gt;low quality&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Intent can be malicious or merely lazy. Slop doesn’t need a conspiracy; it only needs a CPM.&lt;/p&gt;




&lt;h1&gt;
  
  
  2) The Scale: Are We Actually Past the Tipping Point?
&lt;/h1&gt;

&lt;p&gt;A lot of “AI dominates the internet” claims are vibes dressed as statistics.&lt;/p&gt;

&lt;p&gt;The cleaner way to talk about scale is: &lt;strong&gt;in specific measurable slices of the web, AI output is now comparable to human output&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Graphite analysed a dataset of online English-language articles over 2020–2025 and Axios visualised the result: by &lt;strong&gt;May 2025&lt;/strong&gt;, human-written articles were about &lt;strong&gt;52%&lt;/strong&gt;, while AI-written articles were about &lt;strong&gt;48%&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;That’s not “the whole internet” (and the methodology matters), but it’s still a cultural threshold:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;search results become easier to manipulate
&lt;/li&gt;
&lt;li&gt;content farms become economically viable
&lt;/li&gt;
&lt;li&gt;“average quality” gets pulled down by volume
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also: even Wikipedia’s “AI slop” entry frames the term around mass production and monetisation dynamics. &lt;/p&gt;




&lt;h1&gt;
  
  
  3) Why 2025 Made Slop Inevitable: The Incentive Stack
&lt;/h1&gt;

&lt;p&gt;Three forces made 2025 feel different:&lt;/p&gt;

&lt;h2&gt;
  
  
  3.1 The marginal cost of content went to ~zero
&lt;/h2&gt;

&lt;p&gt;If you can generate 1,000 “SEO articles” in an afternoon, content becomes a &lt;em&gt;commodity&lt;/em&gt;.&lt;br&gt;&lt;br&gt;
And commodities flood markets.&lt;/p&gt;

&lt;h2&gt;
  
  
  3.2 Distribution is automated
&lt;/h2&gt;

&lt;p&gt;Recommendation algorithms don’t ask “is this meaningful?”&lt;br&gt;&lt;br&gt;
They ask “will someone pause for 1.2 seconds?”&lt;/p&gt;

&lt;h2&gt;
  
  
  3.3 Monetisation doesn’t care who wrote it
&lt;/h2&gt;

&lt;p&gt;Ad networks pay impressions. Affiliate programmes pay conversions.&lt;br&gt;&lt;br&gt;
That’s a slop subsidy.&lt;/p&gt;

&lt;p&gt;This is why slop is not only on social media. It’s also in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;search spam
&lt;/li&gt;
&lt;li&gt;product reviews
&lt;/li&gt;
&lt;li&gt;workplace docs
&lt;/li&gt;
&lt;li&gt;low-effort “expert” explainers
&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  4) The Damage: It’s Not Just Annoying
&lt;/h1&gt;

&lt;h2&gt;
  
  
  4.1 Workslop: the workplace version of slop
&lt;/h2&gt;

&lt;p&gt;BetterUp Labs + Stanford Social Media Lab describe “workslop” as AI-generated content that looks polished but lacks substance, creating an “invisible tax” where colleagues spend time cleaning it up. &lt;/p&gt;

&lt;p&gt;The key takeaway isn’t the exact dollar figure. It’s the dynamic:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AI doesn’t always remove work; sometimes it &lt;em&gt;moves&lt;/em&gt; work downstream to the next person.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  4.2 Marketplace trust is getting sandblasted
&lt;/h2&gt;

&lt;p&gt;There’s measurable evidence of likely AI-written reviews climbing on major fast-fashion / low-cost e-commerce platforms.&lt;/p&gt;

&lt;p&gt;Originality.ai reports that likely AI-written reviews of &lt;strong&gt;Temu&lt;/strong&gt; on Trustpilot rose to &lt;strong&gt;10.90%&lt;/strong&gt; in 2025, up sharply from earlier years.&lt;br&gt;
Cybernews reports similar figures for Temu and Shein, framing this as a surge in AI-generated reviews.&lt;/p&gt;

&lt;p&gt;When reviews become synthetic, “social proof” collapses.&lt;br&gt;&lt;br&gt;
And platforms pay the cost in customer support, refunds, and reputation.&lt;/p&gt;

&lt;h2&gt;
  
  
  4.3 Elections: deepfakes as the sharpest edge of slop
&lt;/h2&gt;

&lt;p&gt;Not all slop involves deepfakes, but deepfakes are the category of slop that gets people hurt.&lt;/p&gt;

&lt;p&gt;During Ireland’s presidential race, multiple outlets reported an AI-generated video falsely claiming candidate Catherine Connolly had quit, prompting platform removals and legal/political scrutiny. &lt;/p&gt;

&lt;p&gt;Even if the election outcome doesn’t flip, the trust damage compounds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“I saw it with my own eyes” becomes unreliable
&lt;/li&gt;
&lt;li&gt;rebuttals are slower than virality
&lt;/li&gt;
&lt;li&gt;cynicism rises (“everything is fake anyway”)
&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  5) The Governance Reality: The 3-Layer Strategy That Might Work
&lt;/h1&gt;

&lt;p&gt;We don’t “solve” slop with one law or one detector.&lt;br&gt;&lt;br&gt;
We reduce it by attacking three layers: &lt;strong&gt;provenance&lt;/strong&gt;, &lt;strong&gt;distribution&lt;/strong&gt;, and &lt;strong&gt;economics&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.1 Provenance: labels, metadata, and traceability
&lt;/h2&gt;

&lt;p&gt;The Cyberspace Administration of China (CAC) published the &lt;em&gt;Measures for the Labeling of AI-Generated Synthetic Content&lt;/em&gt;, with an effective date of &lt;strong&gt;September 1, 2025&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The directional lesson isn’t “copy China.” It’s that &lt;strong&gt;labelling becomes part of infrastructure&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;visible labels for humans (“AI-generated”)&lt;/li&gt;
&lt;li&gt;machine-readable metadata for platforms and auditors&lt;/li&gt;
&lt;li&gt;rules against stripping labels&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5.2 Distribution: platform friction beats perfect detection
&lt;/h2&gt;

&lt;p&gt;Detection is hard—and getting harder.&lt;/p&gt;

&lt;p&gt;Research on “faithfulness hallucination” detection is improving (e.g., FaithLens proposes joint detection + explanation).&lt;br&gt;&lt;br&gt;
But even great detectors won’t catch everything, and false positives create backlash.&lt;/p&gt;

&lt;p&gt;So platforms will need &lt;em&gt;friction&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;throttle low-reputation bulk posters
&lt;/li&gt;
&lt;li&gt;rate-limit new domains that behave like content farms
&lt;/li&gt;
&lt;li&gt;reduce reach for unverified media in high-risk categories (health, elections)
&lt;/li&gt;
&lt;li&gt;“are you sure?” prompts before resharing unverified clips
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Friction is less glamorous than AI-vs-AI, but it works because it changes behaviour.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.3 Economics: stop paying for junk
&lt;/h2&gt;

&lt;p&gt;If ad networks and affiliate programmes keep paying for slop, the slop machine will keep running.&lt;/p&gt;

&lt;p&gt;A practical governance move:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;require provenance and publisher reputation for premium ads
&lt;/li&gt;
&lt;li&gt;penalise repeat slop farms (not just single pages)
&lt;/li&gt;
&lt;li&gt;increase the cost of automation (verification, KYC, rate caps)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the “follow the money” layer, and it’s the most neglected.&lt;/p&gt;




&lt;h1&gt;
  
  
  6) The Weird Twist: Even “Words of the Year” Became a Signal
&lt;/h1&gt;

&lt;p&gt;You’ll see sloppy claims floating around, like “The Economist picked slop as word of the year.”&lt;/p&gt;

&lt;p&gt;It didn’t. The Economist’s official Word of the Year piece for 2025 chose &lt;strong&gt;TACO&lt;/strong&gt; (“Trump Always Chickens Out”). &lt;/p&gt;

&lt;p&gt;But the broader cultural signal is real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Merriam-Webster picked &lt;strong&gt;slop&lt;/strong&gt;. &lt;/li&gt;
&lt;li&gt;Oxford chose &lt;strong&gt;rage bait&lt;/strong&gt;—a cousin phenomenon about engagement-driven outrage loops. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, they map a coherent 2025 mood:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;we’re drowning in synthetic content and engineered emotion.&lt;/strong&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  7) A Practical Playbook for Teams (Not Governments)
&lt;/h1&gt;

&lt;p&gt;If you’re not a regulator, here’s what you can do inside a company/product:&lt;/p&gt;

&lt;h2&gt;
  
  
  7.1 For content teams
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;make “human-reviewed” a product feature
&lt;/li&gt;
&lt;li&gt;measure &lt;strong&gt;information density&lt;/strong&gt; (how much substance per token)
&lt;/li&gt;
&lt;li&gt;require sources (or ban claims without citations)&lt;/li&gt;
&lt;/ul&gt;
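&lt;p&gt;One crude way to approximate information density: unique content words per token. The stopword list and cleaning rules below are assumptions to tune for your corpus, not a vetted metric:&lt;/p&gt;

```python
# Crude proxy: unique content words per token. Higher means more substance.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
             "that", "this", "it", "for", "on", "with", "as", "very", "really"}

def information_density(text):
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    # Distinct non-stopword tokens carry the "substance"
    content = {t for t in tokens if t not in STOPWORDS}
    return len(content) / len(tokens)

dense = "Retry with exponential backoff: base 200ms, factor 2, cap 30s."
fluffy = "This is a very very important thing that is really really important."
print(information_density(dense) > information_density(fluffy))  # True
```

&lt;p&gt;It won’t catch sophisticated slop, but it flags the padding-heavy kind cheaply.&lt;/p&gt;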

&lt;h2&gt;
  
  
  7.2 For product + platform teams
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;treat provenance as a first-class signal
&lt;/li&gt;
&lt;li&gt;add friction for reposting and bulk upload
&lt;/li&gt;
&lt;li&gt;build reputation systems that can’t be trivially gamed by bots&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  7.3 For engineering teams shipping AI features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;default to “assist, don’t autoplay”
&lt;/li&gt;
&lt;li&gt;add “verification steps” in workflows (citations, checks, validation prompts)
&lt;/li&gt;
&lt;li&gt;log and audit generated content downstream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is not “no AI.”&lt;br&gt;&lt;br&gt;
The goal is &lt;strong&gt;no cheap amplification of low-value output.&lt;/strong&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Final Take: Slop Is an Incentive Failure Wearing a Tech Costume
&lt;/h1&gt;

&lt;p&gt;AI slop exists because it’s profitable to produce and cheap to distribute.&lt;/p&gt;

&lt;p&gt;So the fix is unsexy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;provenance infrastructure
&lt;/li&gt;
&lt;li&gt;platform friction
&lt;/li&gt;
&lt;li&gt;economic disincentives
&lt;/li&gt;
&lt;li&gt;and literacy (people learning what “synthetic” looks like)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we do nothing, the web doesn’t “end.”&lt;br&gt;&lt;br&gt;
It just becomes a place where &lt;strong&gt;signal costs money&lt;/strong&gt; and &lt;strong&gt;noise is free&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And that is how a civilisation ends up eating content troughs for breakfast.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aigc</category>
      <category>promptengineering</category>
      <category>governance</category>
    </item>
    <item>
      <title>Instruction Tuning and Custom Instruction Libraries: Your Model’s Real ‘Operating Manual’</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Wed, 31 Dec 2025 14:02:17 +0000</pubDate>
      <link>https://forem.com/superorange0707/instruction-tuning-and-custom-instruction-libraries-your-models-real-operating-manual-1lcd</link>
      <guid>https://forem.com/superorange0707/instruction-tuning-and-custom-instruction-libraries-your-models-real-operating-manual-1lcd</guid>
      <description>&lt;h1&gt;
  
  
  Prompt Tricks Don’t Scale. Instruction Tuning Does.
&lt;/h1&gt;

&lt;p&gt;If you’ve ever shipped an LLM feature, you know the pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You craft a gorgeous prompt.&lt;/li&gt;
&lt;li&gt;It works… &lt;em&gt;until&lt;/em&gt; real users show up.&lt;/li&gt;
&lt;li&gt;Suddenly your “polite customer support bot” becomes a poetic philosopher who forgets the refund policy.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s the moment you realise: &lt;strong&gt;prompting is configuration; Instruction Tuning is installation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instruction Tuning is how you teach a model to treat your requirements like default behaviour—not a suggestion it can “creatively interpret”.&lt;/p&gt;




&lt;h1&gt;
  
  
  What Is Instruction Tuning, Really?
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Definition
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Instruction Tuning&lt;/strong&gt; is a post-training technique that trains a language model on &lt;strong&gt;Instruction–Response pairs&lt;/strong&gt; so it learns to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;understand the intent behind a task (“summarise”, “classify”, “extract”, “fix code”)&lt;/li&gt;
&lt;li&gt;produce output that matches &lt;strong&gt;format, tone, and constraints&lt;/strong&gt; reliably&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, you’re moving from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Generate coherent text”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Execute tasks like a dependable system component.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  A quick intuition
&lt;/h3&gt;

&lt;p&gt;A base model may respond to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Summarise this document”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;with something long, vague, and slightly dramatic.&lt;/p&gt;

&lt;p&gt;A tuned model learns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;what counts as a summary&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;what “key points” actually means&lt;/li&gt;
&lt;li&gt;how to keep it short, structured, and consistent&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Instruction Tuning vs Prompt Tuning: The Difference That Matters
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Instruction Tuning&lt;/th&gt;
&lt;th&gt;Traditional Prompt Tuning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Where it acts&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Model weights&lt;/strong&gt; (behaviour changes)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Input text&lt;/strong&gt; only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data need&lt;/td&gt;
&lt;td&gt;Needs &lt;strong&gt;many&lt;/strong&gt; labelled examples&lt;/td&gt;
&lt;td&gt;Needs few examples&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Long-term, repeated tasks&lt;/strong&gt; (support, compliance, extraction)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Ad-hoc tasks&lt;/strong&gt; (one-off writing, translation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistence&lt;/td&gt;
&lt;td&gt;Behaviour sticks after training&lt;/td&gt;
&lt;td&gt;You redesign prompts repeatedly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Barrier&lt;/td&gt;
&lt;td&gt;Higher (data + training setup)&lt;/td&gt;
&lt;td&gt;Lower (just write prompts)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you’re running the same workflows every day, &lt;strong&gt;Instruction Tuning pays off fast&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Custom Instruction Library: Your “Ammo Depot”
&lt;/h1&gt;

&lt;p&gt;Instruction Tuning is the strategy.&lt;br&gt;&lt;br&gt;
A &lt;strong&gt;custom instruction library&lt;/strong&gt; is the ammunition.&lt;/p&gt;

&lt;p&gt;It’s a curated dataset that encodes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what tasks you care about&lt;/li&gt;
&lt;li&gt;how they should be handled&lt;/li&gt;
&lt;li&gt;how outputs must look and sound&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it as your model’s &lt;strong&gt;operating manual&lt;/strong&gt;, written in training data.&lt;/p&gt;


&lt;h1&gt;
  
  
  What a High-Quality Instruction Pair Must Contain
&lt;/h1&gt;

&lt;p&gt;Every example should have four parts. No shortcuts.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Task type&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
e.g., summarisation, translation, extraction, code repair, sentiment analysis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Instruction&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Clear, unambiguous, with output constraints (format, length, tone)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Input&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The raw material: document, conversation log, code snippet, medical notes, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reference response&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The &lt;em&gt;ideal&lt;/em&gt; output: correct, complete, consistent, and formatted&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  A simple schema (JSONL)
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"task"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"refund_support"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"instruction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Reply in friendly UK English. Confirm refund status and estimate payout time. Avoid jargon."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Order #A91822. I requested a refund yesterday due to a faulty item. What's happening and when will I get the money?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Hi! I can see refund request for order A91822 is approved and currently processing. You should receive the funds within 1–3 working days (depending on your bank). You don’t need to do anything else — we’ll notify you once it’s completed."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Notice what’s happening: &lt;strong&gt;tone, region, and format are all part of the spec&lt;/strong&gt;.&lt;/p&gt;
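&lt;p&gt;A small loader can enforce the four-field schema above before any example reaches training. The filename and checks here are illustrative:&lt;/p&gt;

```python
import json
import tempfile

REQUIRED_FIELDS = ("task", "instruction", "input", "output")

def load_instruction_pairs(path):
    """Load a JSONL instruction library, rejecting incomplete examples."""
    pairs = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [k for k in REQUIRED_FIELDS if not record.get(k)]
            if missing:
                raise ValueError(f"line {lineno}: missing fields {missing}")
            pairs.append(record)
    return pairs

# Quick demo with a throwaway file
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write(json.dumps({"task": "refund_support",
                        "instruction": "Reply in friendly UK English.",
                        "input": "Order #A91822 ...",
                        "output": "Hi! ..."}) + "\n")
    demo_path = f.name

print(len(load_instruction_pairs(demo_path)))  # 1
```

&lt;p&gt;Failing fast on incomplete pairs is cheaper than debugging a model that learned from them.&lt;/p&gt;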


&lt;h1&gt;
  
  
  Design Principles That Actually Move the Needle
&lt;/h1&gt;
&lt;h2&gt;
  
  
  1) Coverage: hit the long tail, not just the happy path
&lt;/h2&gt;

&lt;p&gt;If you’re tuning for e-commerce support, don’t only include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Where’s my parcel?”&lt;/li&gt;
&lt;li&gt;“I want a refund.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also include the messy real world:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;partial refunds&lt;/li&gt;
&lt;li&gt;missing items&lt;/li&gt;
&lt;li&gt;“Royal Mail says delivered but it isn’t”&lt;/li&gt;
&lt;li&gt;chargebacks&lt;/li&gt;
&lt;li&gt;angry customers who won’t provide an order number&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A model trained on only “clean” scenarios will panic the first time the input isn’t.&lt;/p&gt;
&lt;h2&gt;
  
  
  2) Precision: remove ambiguity from your instructions
&lt;/h2&gt;

&lt;p&gt;Bad instruction:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Handle this user request.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Better instruction:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Classify the sentiment as Positive/Neutral/Negative, then give a one-sentence reason.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Best instruction:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Return JSON exactly: {\"label\": \"Positive|Neutral|Negative\", \"reason\": \"...\"}. No extra text.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  3) Diversity: vary inputs aggressively
&lt;/h2&gt;

&lt;p&gt;Include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;short vs long inputs&lt;/li&gt;
&lt;li&gt;slang, typos, mixed languages&lt;/li&gt;
&lt;li&gt;formal tickets vs WhatsApp-style messages&lt;/li&gt;
&lt;li&gt;different difficulty levels&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your production users are a chaos generator. Train accordingly.&lt;/p&gt;
&lt;h2&gt;
  
  
  4) Consistency: standardise output formats like you mean it
&lt;/h2&gt;

&lt;p&gt;If 200 “Order status” answers are formatted differently, the model will learn inconsistency.&lt;/p&gt;

&lt;p&gt;Pick a template and enforce it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fixed headings&lt;/li&gt;
&lt;li&gt;stable phrasing&lt;/li&gt;
&lt;li&gt;stable field order&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;
  
  
  Three Domain Examples (With Slightly More Realism)
&lt;/h1&gt;
&lt;h2&gt;
  
  
  1) Healthcare: extract case notes into a structured table (with safety)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Task:&lt;/strong&gt; clinical note extraction&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Instruction:&lt;/strong&gt; extract fields, present as a table, avoid inventing missing values&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input (example):&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Patient: male, 45. Complains of cough and phlegm for 8 weeks, worsened with chest tightness for 1 week. History: hypertension x5 years, on nifedipine. CT: inflammation in both lower lobes. WBC: 12.5×10^9/L (high). Preliminary diagnosis: community-acquired pneumonia; hypertension grade 2.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Output (example):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Demographics&lt;/td&gt;
&lt;td&gt;Age: 45; Sex: Male&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Presenting complaint&lt;/td&gt;
&lt;td&gt;Cough and phlegm for 8 weeks; worse with chest tightness for 1 week&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;History&lt;/td&gt;
&lt;td&gt;Hypertension for 5 years; nifedipine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tests&lt;/td&gt;
&lt;td&gt;CT: bilateral lower-lobe inflammation; WBC: 12.5×10^9/L (elevated)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Preliminary diagnosis&lt;/td&gt;
&lt;td&gt;Community-acquired pneumonia; Hypertension (grade 2)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: In medical settings, you typically add safety rules (“don’t diagnose beyond provided evidence”) directly into the instruction library.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  2) Programming: fix broken Python (and explain why)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Input (broken code):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nums&lt;/span&gt;
        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output (fixed + explanation):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# missing colon fixed
&lt;/span&gt;        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Explanation: &lt;code&gt;for&lt;/code&gt; loops in Python require a colon (&lt;code&gt;:&lt;/code&gt;). Without it, the interpreter can’t parse the loop body.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(A general tip while you’re here: avoid naming a variable &lt;code&gt;sum&lt;/code&gt;; it shadows Python’s built-in &lt;code&gt;sum()&lt;/code&gt;.)&lt;/em&gt;&lt;/p&gt;
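
&lt;p&gt;Beyond the syntax fix, the function above still crashes with &lt;code&gt;ZeroDivisionError&lt;/code&gt; on an empty list. A minimal defensive sketch (returning &lt;code&gt;None&lt;/code&gt; for empty input is a design choice for illustration, not part of the original example):&lt;/p&gt;

```python
def calculate_average(nums):
    """Return the arithmetic mean of nums, or None for an empty input."""
    if not nums:
        return None  # avoid ZeroDivisionError on empty input
    total = 0
    for n in nums:
        total += n
    return total / len(nums)
```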

&lt;h2&gt;
  
  
  3) Education: generate a Year 4 maths word problem (UK flavour)
&lt;/h2&gt;

&lt;p&gt;Instruction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate &lt;strong&gt;one&lt;/strong&gt; word problem&lt;/li&gt;
&lt;li&gt;Topic: two-digit × one-digit multiplication&lt;/li&gt;
&lt;li&gt;Include: scenario, known facts, question&lt;/li&gt;
&lt;li&gt;Use £ and UK context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; School fair&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Known facts:&lt;/strong&gt; Each ticket costs &lt;strong&gt;£24&lt;/strong&gt;. A parent buys &lt;strong&gt;3&lt;/strong&gt; tickets.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Question:&lt;/strong&gt; How much do they pay in total?&lt;/p&gt;




&lt;h1&gt;
  
  
  Implementation Workflow: From Library to Tuned Model
&lt;/h1&gt;

&lt;p&gt;Instruction Tuning is a pipeline. If you skip steps, you pay later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Build the dataset
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Sources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Internal logs:&lt;/strong&gt; support tickets, agent replies, summaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Public datasets:&lt;/strong&gt; Alpaca, FLAN, ShareGPT (filter aggressively)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human labelling:&lt;/strong&gt; for tasks where correctness matters&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cleaning checklist
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;remove vague instructions (“handle this”, “do the thing”)&lt;/li&gt;
&lt;li&gt;remove wrong/incomplete outputs&lt;/li&gt;
&lt;li&gt;deduplicate near-identical samples&lt;/li&gt;
&lt;li&gt;standardise formatting and field names&lt;/li&gt;
&lt;/ul&gt;
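
&lt;p&gt;The cleaning checklist above can be sketched as a small filter pass. The &lt;code&gt;instruction&lt;/code&gt;/&lt;code&gt;output&lt;/code&gt; field names and the length threshold are assumptions for illustration; real pipelines usually add smarter near-duplicate detection:&lt;/p&gt;

```python
def clean_pairs(pairs, min_len=15):
    """Drop vague/empty instruction pairs and deduplicate near-identical ones.

    Each pair is a dict with 'instruction' and 'output' keys (assumed schema).
    """
    seen = set()
    cleaned = []
    for p in pairs:
        inst = p.get("instruction", "").strip()
        out = p.get("output", "").strip()
        if len(inst) < min_len or not out:   # remove vague or incomplete entries
            continue
        key = (inst.lower(), out.lower())    # crude case-insensitive dedup key
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"instruction": inst, "output": out})
    return cleaned
```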

&lt;h3&gt;
  
  
  Split
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Train: 70–80%&lt;/li&gt;
&lt;li&gt;Validation: 10–15%&lt;/li&gt;
&lt;li&gt;Test: 10–15%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No leakage. No overlap.&lt;/p&gt;
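
&lt;p&gt;A shuffled split with those proportions, sketched in a few lines (the fixed seed keeps the split reproducible; deduplicate &lt;em&gt;before&lt;/em&gt; splitting or near-identical samples can still leak across sets):&lt;/p&gt;

```python
import random

def split_dataset(pairs, train=0.8, val=0.1, seed=42):
    """Shuffle and split into train/validation/test with no overlap."""
    rng = random.Random(seed)
    shuffled = pairs[:]          # copy so we don't mutate the caller's list
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```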




&lt;h2&gt;
  
  
  Step 2: Choose the base model (with reality constraints)
&lt;/h2&gt;

&lt;p&gt;Pick based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;task complexity&lt;/li&gt;
&lt;li&gt;deployment constraints&lt;/li&gt;
&lt;li&gt;compute budget&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A practical rule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;7B–13B&lt;/strong&gt; models + LoRA work well for many enterprise workflows&lt;/li&gt;
&lt;li&gt;bigger models help with harder reasoning and richer generation&lt;/li&gt;
&lt;li&gt;if you need on-device or edge, plan for quantisation and smaller footprints&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 3: Fine-tune strategy and hyperparameters
&lt;/h2&gt;

&lt;p&gt;Typical starting points (LoRA):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;learning rate: &lt;strong&gt;1e-4 to 5e-5&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;batch size: &lt;strong&gt;8–32&lt;/strong&gt; (use gradient accumulation if needed)&lt;/li&gt;
&lt;li&gt;epochs: &lt;strong&gt;3–5&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;early stopping if validation stops improving&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LoRA is popular because it’s efficient: you train small adapter matrices instead of all weights.&lt;/p&gt;
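
&lt;p&gt;To see why the adapters are so cheap, compare parameter counts: a full &lt;code&gt;d_out × d_in&lt;/code&gt; update has &lt;code&gt;d_out·d_in&lt;/code&gt; weights, while LoRA’s two low-rank factors B (&lt;code&gt;d_out × r&lt;/code&gt;) and A (&lt;code&gt;r × d_in&lt;/code&gt;) have only &lt;code&gt;r·(d_out + d_in)&lt;/code&gt;. A quick back-of-envelope check (the 4096 dimension is just an illustrative projection size):&lt;/p&gt;

```python
def lora_param_counts(d_out, d_in, r):
    """Trainable parameters: full weight update vs. LoRA adapters B and A."""
    full = d_out * d_in            # full-rank delta-W
    lora = r * (d_out + d_in)      # B (d_out x r) plus A (r x d_in)
    return full, lora

# For a 4096x4096 projection at rank 8, the adapter is under 0.4% of the full matrix.
full, lora = lora_param_counts(4096, 4096, 8)
```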




&lt;h2&gt;
  
  
  Step 4: Evaluate like you’re going to ship it
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Quantitative (depends on task)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;classification accuracy (for extract/classify)&lt;/li&gt;
&lt;li&gt;BLEU/ROUGE (for summarise/translate — imperfect but useful)&lt;/li&gt;
&lt;li&gt;perplexity (language quality proxy)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Qualitative (the “would you trust this?” test)
&lt;/h3&gt;

&lt;p&gt;Get 3–5 reviewers to score:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;instruction adherence&lt;/li&gt;
&lt;li&gt;completeness&lt;/li&gt;
&lt;li&gt;format correctness&lt;/li&gt;
&lt;li&gt;domain alignment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also run &lt;strong&gt;scenario tests&lt;/strong&gt;: 10–20 realistic edge cases.&lt;/p&gt;
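
&lt;p&gt;Scenario tests are easy to automate. A minimal harness sketch, where &lt;code&gt;model_fn&lt;/code&gt; and the &lt;code&gt;(input, check)&lt;/code&gt; scenario shape are assumptions for illustration (the stub below stands in for a real model call):&lt;/p&gt;

```python
def run_scenario_tests(model_fn, scenarios):
    """Run edge-case inputs through model_fn; return the inputs whose check failed.

    Each scenario is (input_text, check) where check(output) -> bool.
    """
    failures = []
    for text, check in scenarios:
        output = model_fn(text)
        if not check(output):
            failures.append(text)
    return failures

# A stub "model" plus two format checks; the second deliberately fails.
stub = lambda text: "Status: shipped"
scenarios = [
    ("Where is order 123?", lambda out: out.startswith("Status:")),
    ("Refund order 456", lambda out: "Refund" in out),
]
failures = run_scenario_tests(stub, scenarios)
```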




&lt;h2&gt;
  
  
  Step 5: Deploy + keep tuning
&lt;/h2&gt;

&lt;p&gt;After deployment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;log response times, failures, “format drift”&lt;/li&gt;
&lt;li&gt;collect bad outputs and user feedback&lt;/li&gt;
&lt;li&gt;convert them into new instruction pairs&lt;/li&gt;
&lt;li&gt;periodically re-tune&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instruction libraries are living assets.&lt;/p&gt;




&lt;h1&gt;
  
  
  A Practical UK-Focused Case Study: E‑Commerce Support
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Goal
&lt;/h3&gt;

&lt;p&gt;Teach a model to handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;order status&lt;/li&gt;
&lt;li&gt;refunds&lt;/li&gt;
&lt;li&gt;product recommendations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;consistent format&lt;/li&gt;
&lt;li&gt;friendly UK English&lt;/li&gt;
&lt;li&gt;accurate, policy-compliant answers&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Dataset (example proportions)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;300 order status&lt;/li&gt;
&lt;li&gt;400 refunds&lt;/li&gt;
&lt;li&gt;300 recommendations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Training setup (example)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;base model: a chat-tuned 7B model&lt;/li&gt;
&lt;li&gt;LoRA rank: 8&lt;/li&gt;
&lt;li&gt;LoRA dropout: 0.05&lt;/li&gt;
&lt;li&gt;epochs: 3&lt;/li&gt;
&lt;li&gt;a single consumer-class GPU machine (or a cloud instance)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Deployment trick
&lt;/h3&gt;

&lt;p&gt;Quantise to 4-bit / 8-bit for serving efficiency, then integrate with your order/refund systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model drafts the response&lt;/li&gt;
&lt;li&gt;the system injects verified order facts&lt;/li&gt;
&lt;li&gt;the final message is generated with hard constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This hybrid approach dramatically reduces hallucinations, because order facts come from your systems rather than the model’s memory.&lt;/p&gt;
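
&lt;p&gt;The “system injects verified order facts” step can be as simple as templating over data fetched from the order system, so the model never gets a chance to invent an order number or ETA. A sketch, where the field names and template are illustrative assumptions:&lt;/p&gt;

```python
def draft_with_verified_facts(order, template):
    """Fill a response template with facts from the order system.

    The model can draft or rank templates, but the specifics (ID, status, ETA)
    are always injected from verified data — never generated.
    """
    return template.format(
        order_id=order["id"],
        status=order["status"],
        eta=order["eta"],
    )

order = {"id": "UK-10042", "status": "dispatched", "eta": "Tuesday"}
template = ("Hi! Your order {order_id} has been {status} "
            "and should arrive by {eta}.")
message = draft_with_verified_facts(order, template)
```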




&lt;h1&gt;
  
  
  Common Failure Modes (And Fixes)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1) “We don’t have enough data”
&lt;/h2&gt;

&lt;p&gt;If you’re below ~500 high-quality pairs, results can be shaky.&lt;/p&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generate candidate pairs with a strong model&lt;/li&gt;
&lt;li&gt;have humans review and correct&lt;/li&gt;
&lt;li&gt;do light data augmentation (rephrase, swap names, vary order numbers)&lt;/li&gt;
&lt;/ul&gt;
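
&lt;p&gt;The “swap names, vary order numbers” augmentation can be done with simple templating. A sketch; the &lt;code&gt;{name}&lt;/code&gt;/&lt;code&gt;{order}&lt;/code&gt; placeholder convention is an assumption about how you’d parameterise your seed pairs:&lt;/p&gt;

```python
import random

def augment_pair(pair, names, rng=None):
    """Produce one variant of a templated instruction pair by swapping the
    customer name and generating a fresh order number."""
    rng = rng or random.Random(0)   # fixed seed only for reproducible demos
    name = rng.choice(names)
    order = f"UK-{rng.randint(10000, 99999)}"
    return {
        "instruction": pair["instruction"].format(name=name, order=order),
        "output": pair["output"].format(name=name, order=order),
    }

base = {"instruction": "Where is {name}'s order {order}?",
        "output": "Order {order} for {name} is in transit."}
variant = augment_pair(base, ["Aisha", "Tom", "Priya"])
```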

&lt;h2&gt;
  
  
  2) Overfitting: great on train, bad on test
&lt;/h2&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reduce epochs&lt;/li&gt;
&lt;li&gt;add diversity&lt;/li&gt;
&lt;li&gt;add dropout/weight decay&lt;/li&gt;
&lt;li&gt;use early stopping based on validation performance&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3) Domain terms confuse the model
&lt;/h2&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;add “term explanation” examples&lt;/li&gt;
&lt;li&gt;include lightweight domain Q&amp;amp;A pairs so the model builds vocabulary&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4) Output format keeps drifting
&lt;/h2&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;standardise reference outputs&lt;/li&gt;
&lt;li&gt;make formatting requirements explicit&lt;/li&gt;
&lt;li&gt;add negative examples (“do not output extra text”)&lt;/li&gt;
&lt;/ul&gt;
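
&lt;p&gt;Format drift is also cheap to detect at serving time. If your reference outputs are JSON-only, a guard like this sketch catches the classic “Sure! Here is the JSON: …” drift before it reaches users:&lt;/p&gt;

```python
import json

def check_json_only(output):
    """Return True only if the model emitted a single JSON object and nothing
    else — a cheap guard against extra prose wrapping the payload."""
    stripped = output.strip()
    if not (stripped.startswith("{") and stripped.endswith("}")):
        return False
    try:
        json.loads(stripped)
        return True
    except json.JSONDecodeError:
        return False
```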




&lt;h1&gt;
  
  
  The Future: Auto-Instructions, Multimodal, and Edge Fine-Tuning
&lt;/h1&gt;

&lt;p&gt;Where this is heading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auto-instruction generation&lt;/strong&gt;: models produce draft datasets, humans curate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;multimodal tuning&lt;/strong&gt;: text + images + audio instructions become normal (e.g., “analyse this product photo and write a listing”)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;lighter tuning on edge devices&lt;/strong&gt;: smaller models + efficient adapters + on-device updates&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Final Take
&lt;/h1&gt;

&lt;p&gt;Prompting is a great way to &lt;em&gt;ask&lt;/em&gt; a model to behave.&lt;br&gt;&lt;br&gt;
Instruction Tuning is how you &lt;em&gt;teach&lt;/em&gt; it to behave.&lt;/p&gt;

&lt;p&gt;If you want reliable outputs across many tasks, stop writing prompts like spells—and start building a &lt;strong&gt;custom instruction library&lt;/strong&gt; like a real product asset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;comprehensive coverage&lt;/li&gt;
&lt;li&gt;precise instructions&lt;/li&gt;
&lt;li&gt;diverse inputs&lt;/li&gt;
&lt;li&gt;consistent outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s how you get “one fine-tune, many tasks” without babysitting the model forever.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>promptengineering</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Stop Fine-Tuning Everything: Inject Knowledge with Few‑Shot In‑Context Learning</title>
      <dc:creator>Dechun Wang</dc:creator>
      <pubDate>Wed, 31 Dec 2025 13:50:57 +0000</pubDate>
      <link>https://forem.com/superorange0707/stop-fine-tuning-everything-inject-knowledge-with-few-shot-in-context-learning-40gb</link>
      <guid>https://forem.com/superorange0707/stop-fine-tuning-everything-inject-knowledge-with-few-shot-in-context-learning-40gb</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;You don’t always need a new model. Most of the time, you just need better examples.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Large language models aren’t “smart” in the human sense; they’re &lt;em&gt;pattern addicts&lt;/em&gt;.&lt;br&gt;&lt;br&gt;
Give them the right patterns in the prompt, and they start behaving like a domain expert you never trained.&lt;/p&gt;

&lt;p&gt;This is basically what &lt;strong&gt;Few‑Shot In‑Context Learning (ICL)&lt;/strong&gt; is about:&lt;br&gt;&lt;br&gt;
instead of retraining or fine‑tuning the model, you &lt;strong&gt;inject knowledge directly through carefully crafted examples in the prompt&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this piece, we’ll walk through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What “knowledge injection” really means in practice&lt;/li&gt;
&lt;li&gt;How Few‑Shot‑in‑Context sits between “do nothing” and “full fine‑tune”&lt;/li&gt;
&lt;li&gt;A concrete mental model of how LLMs learn from examples&lt;/li&gt;
&lt;li&gt;Five battle‑tested design principles for good few‑shot prompts&lt;/li&gt;
&lt;li&gt;Cross‑industry examples (medical, finance, programming, education, law)&lt;/li&gt;
&lt;li&gt;Six common failure modes — and how to debug them&lt;/li&gt;
&lt;li&gt;Where this technique is going next (RAG, multi‑modal, personal knowledge, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re tired of hearing “we need to fine‑tune a custom model” every time someone wants to add new knowledge… this article is for you.&lt;/p&gt;


&lt;h2&gt;
  
  
  1. Why “knowledge injection” matters (and where Few‑Shot ICL fits)
&lt;/h2&gt;

&lt;p&gt;Most real‑world LLM problems boil down to one painful fact:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The model doesn’t actually know what you care about.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are two big gaps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fresh knowledge&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Anything that happened &lt;em&gt;after&lt;/em&gt; the model’s training cutoff:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2024–2025 policy changes
&lt;/li&gt;
&lt;li&gt;New internal processes
&lt;/li&gt;
&lt;li&gt;Product updates, pricing changes, API deprecations…&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Niche / long‑tail knowledge&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Things that were never common enough to appear in large quantities online:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A tiny vertical’s industry standard
&lt;/li&gt;
&lt;li&gt;Rare diseases’ treatment guidelines
&lt;/li&gt;
&lt;li&gt;Your company’s weird internal naming conventions&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The classic response is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Let’s fine‑tune a model.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which sounds cool until you realise what that implies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Curate + clean + label a dataset&lt;/li&gt;
&lt;li&gt;Pay for training (GPU hours, infra, dev time)&lt;/li&gt;
&lt;li&gt;Wait days or weeks&lt;/li&gt;
&lt;li&gt;Repeat whenever something changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For many teams (and almost all solo builders), that’s overkill.&lt;/p&gt;
&lt;h3&gt;
  
  
  Enter Few‑Shot‑in‑Context
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Few‑Shot‑in‑Context Learning&lt;/strong&gt; = you don’t touch model weights.&lt;br&gt;&lt;br&gt;
You just:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick 3–5 good examples&lt;/strong&gt; that:

&lt;ul&gt;
&lt;li&gt;Embed the knowledge you care about&lt;/li&gt;
&lt;li&gt;Match the task format you want&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stick them into the prompt&lt;/strong&gt; before the actual query&lt;/li&gt;
&lt;li&gt;Let the model &lt;strong&gt;infer the pattern&lt;/strong&gt; and apply it to new inputs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You’ve just &lt;em&gt;“fine‑tuned”&lt;/em&gt; the model’s behaviour… without training anything.&lt;/p&gt;

&lt;p&gt;Example: suppose you want the model to answer questions about a &lt;strong&gt;2025 EV subsidy policy&lt;/strong&gt; that it’s never seen.&lt;/p&gt;

&lt;p&gt;Instead of fine‑tuning, you can drop something like this into the prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: Decide if a car model qualifies for the 2025 EV subsidy and explain why.

Example 1
Input: Battery EV, 500km range
Output: Eligible. Reason: battery electric vehicle with range ≥ 500km gets a 15,000 GBP subsidy.

Example 2
Input: Plug‑in hybrid, 520km range
Output: Eligible. Reason: plug‑in hybrid with range ≥ 500km also gets 15,000 GBP.

Example 3
Input: Battery EV, 480km range
Output: Not eligible. Reason: 480km &amp;lt; 500km threshold, so no subsidy.

Now solve this:
Input: Battery EV, 530km range
Output:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From these 3 examples, the model can infer the rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“Type ∈ {BEV, PHEV} AND range ≥ 500km → 15k subsidy; otherwise 0.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s &lt;strong&gt;knowledge injection via examples&lt;/strong&gt;.&lt;/p&gt;
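
&lt;p&gt;In code, that prompt is just string assembly. A minimal builder sketch that keeps the &lt;code&gt;Task → Examples → Query&lt;/code&gt; layout consistent (the example strings mirror the subsidy illustration above):&lt;/p&gt;

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a Task -> Example 1..N -> query prompt in one consistent format."""
    parts = [f"Task: {task}", ""]
    for i, (inp, out) in enumerate(examples, 1):
        parts += [f"Example {i}", f"Input: {inp}", f"Output: {out}", ""]
    parts += ["Now solve this:", f"Input: {query}", "Output:"]
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    "Decide if a car model qualifies for the 2025 EV subsidy and explain why.",
    [("Battery EV, 500km range",
      "Eligible. Reason: range >= 500km gets a 15,000 GBP subsidy."),
     ("Battery EV, 480km range",
      "Not eligible. Reason: 480km < 500km threshold, so no subsidy.")],
    "Battery EV, 530km range",
)
```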




&lt;h2&gt;
  
  
  2. How Few‑Shot‑in‑Context actually works (mental model)
&lt;/h2&gt;

&lt;p&gt;To understand why this works, forget about gradient descent and transformers for a second and think in more human terms.&lt;/p&gt;

&lt;p&gt;When you few‑shot an LLM, it roughly goes through three internal phases:&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Example parsing: “What’s going on here?”
&lt;/h3&gt;

&lt;p&gt;The model scans your prompt and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detects the &lt;strong&gt;task pattern&lt;/strong&gt;:
“Ah, it’s always &lt;code&gt;Task → Input → Output&lt;/code&gt;.”&lt;/li&gt;
&lt;li&gt;Extracts &lt;strong&gt;knowledge bits&lt;/strong&gt;:
“EV subsidy depends on:

&lt;ul&gt;
&lt;li&gt;type: BEV / PHEV vs fuel car&lt;/li&gt;
&lt;li&gt;range: threshold 500km&lt;/li&gt;
&lt;li&gt;standard subsidy amount: 15k”&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;It’s not memorising the car models; it’s capturing the &lt;em&gt;relationship&lt;/em&gt; between input features and output labels.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Pattern induction: “What’s the general rule?”
&lt;/h3&gt;

&lt;p&gt;Given multiple examples, the model tries to generalise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Example A: BEV, 500km → 15k
&lt;/li&gt;
&lt;li&gt;Example B: PHEV, 520km → 15k
&lt;/li&gt;
&lt;li&gt;Example C: BEV, 480km → 0
&lt;/li&gt;
&lt;li&gt;Example D: Fuel car, 600km → 0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It can then infer something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Subsidy depends on being a new energy vehicle AND hitting the range threshold. Fuel cars never qualify.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the “few‑shot learning” part: very little data, but high‑signal structure.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.3 Task transfer: “Apply that rule over here”
&lt;/h3&gt;

&lt;p&gt;When the user sends a new query — say:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: PHEV, 510km
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model doesn’t go “I’ve seen this exact string before”.&lt;br&gt;&lt;br&gt;
It goes: “This matches the pattern → I know the rule → apply → 15k subsidy.”&lt;/p&gt;

&lt;p&gt;This is the crucial distinction:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Good few‑shot prompts teach the model the &lt;em&gt;logic&lt;/em&gt;, not just the &lt;em&gt;answer&lt;/em&gt;.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If your examples are just random Q&amp;amp;A with no visible structure, the model may end up parroting instead of reasoning. The rest of this article is about avoiding that.&lt;/p&gt;


&lt;h2&gt;
  
  
  3. Five design principles for Few‑Shot knowledge injection
&lt;/h2&gt;

&lt;p&gt;You can’t just throw examples at the model and hope for the best.&lt;br&gt;&lt;br&gt;
The &lt;em&gt;quality&lt;/em&gt; of your examples is 90% of the game.&lt;/p&gt;

&lt;p&gt;Here are five principles I keep coming back to.&lt;/p&gt;


&lt;h3&gt;
  
  
  3.1 Cover both &lt;strong&gt;core dimensions&lt;/strong&gt; and &lt;strong&gt;edge cases&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Real rules always have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normal cases (most inputs)&lt;/li&gt;
&lt;li&gt;Edge cases (boundary conditions, exclusions, weird situations)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your examples should cover both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bad pattern (only “happy paths”):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example 1: BEV, 500km → subsidy
Example 2: PHEV, 520km → subsidy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model might incorrectly infer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“If it’s an EV and the number looks big-ish, say ‘subsidy’.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Better pattern (explicit edges):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example 1: BEV, 500km → eligible (base case)
Example 2: PHEV, 520km → eligible (second base case)
Example 3: BEV, 480km → not eligible (range too low)
Example 4: Fuel car, 600km → not eligible (wrong type)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the model has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dimensions: drive type, range&lt;/li&gt;
&lt;li&gt;Edges:

&lt;ul&gt;
&lt;li&gt;Range &amp;lt; threshold&lt;/li&gt;
&lt;li&gt;Non‑EV types&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Design rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For any rule you’re encoding, ask:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;“What are the obvious ‘no’ cases? Did I show at least one of each?”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  3.2 Keep the &lt;strong&gt;format&lt;/strong&gt; perfectly consistent
&lt;/h3&gt;

&lt;p&gt;Models are surprisingly picky about &lt;em&gt;format&lt;/em&gt;.&lt;br&gt;&lt;br&gt;
If your examples look one way and your “real” task looks another, performance tanks.&lt;/p&gt;

&lt;p&gt;You need consistency at three levels:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Structure&lt;/strong&gt; — same high‑level layout&lt;br&gt;&lt;br&gt;
e.g. always &lt;code&gt;Task → Input → Output&lt;/code&gt;, in that order.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Terminology&lt;/strong&gt; — same words for the same concepts&lt;br&gt;&lt;br&gt;
Don’t alternate between &lt;code&gt;EV&lt;/code&gt;, &lt;code&gt;electric car&lt;/code&gt;, &lt;code&gt;new energy car&lt;/code&gt; unless you really have to.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Output shape&lt;/strong&gt; — same style of answer&lt;br&gt;&lt;br&gt;
e.g. always “Yes/No + Reason”, or always a JSON object, or always a 3‑bullet explanation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Inconsistent example (what &lt;em&gt;not&lt;/em&gt; to do):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example 1: “2025 EV subsidy: BEV A, 500km → 15k.”   (terse)
Example 2: “According to the 2025 regulation, model B qualifies for…”
Task: “Explain whether model C qualifies and justify your reasoning.”
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model has to guess which style to imitate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistent example (what you want):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: Decide if the car qualifies for subsidy and explain why.

Example 1
Input: Type=BEV, Range=500km
Output: Eligible.
Reason: Battery EV with range ≥ 500km meets the threshold; base subsidy is 15,000 GBP.

Example 2
Input: Type=PHEV, Range=490km
Output: Not eligible.
Reason: Plug‑in hybrid but range 490km &amp;lt; 500km threshold, so no subsidy.

Now solve this:

Input: Type=BEV, Range=530km
Output:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything lines up, so the model can just &lt;strong&gt;continue the pattern&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  3.3 Sample count: 3–5 examples is usually enough
&lt;/h3&gt;

&lt;p&gt;This one is empirical but holds up annoyingly well:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Less than 3 examples&lt;/strong&gt; → shaky generalisation, easy overfitting&lt;br&gt;&lt;br&gt;
&lt;strong&gt;More than 5–6 examples&lt;/strong&gt; → diminishing returns + wasted context&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Why not 20 examples? Because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLMs have a &lt;strong&gt;context window&lt;/strong&gt;; you pay in tokens&lt;/li&gt;
&lt;li&gt;Extra examples can actually &lt;strong&gt;dilute&lt;/strong&gt; the pattern if they’re noisy&lt;/li&gt;
&lt;li&gt;You often don’t &lt;em&gt;have&lt;/em&gt; 20 high‑quality labelled examples for a new rule&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The sweet spot for most tasks is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3–4 examples&lt;/strong&gt; for simple rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4–5 examples&lt;/strong&gt; when you need to show edge cases + 1–2 tricky combos&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you feel you need 12 examples, it’s often a smell that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You haven’t factored the rule cleanly, or&lt;/li&gt;
&lt;li&gt;You’re mixing multiple tasks into one prompt&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3.4 Your examples must be &lt;strong&gt;factually correct&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This sounds obvious, but in practice… we all copy‑paste in a hurry.&lt;/p&gt;

&lt;p&gt;The model &lt;strong&gt;assumes&lt;/strong&gt; your examples are ground truth.&lt;br&gt;&lt;br&gt;
If you smuggle in a bug, it’ll confidently reproduce it everywhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example of hidden poison:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example 1: Range ≥ 500km → subsidy 15,000
Example 2: Range 520km → subsidy 20,000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the model has to guess whether:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You changed the policy halfway through, or&lt;/li&gt;
&lt;li&gt;You made a mistake&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There’s no way for it to “correct” you; it’s not cross‑checking with the internet.&lt;/p&gt;

&lt;p&gt;Treat your few‑shot block as &lt;strong&gt;production code&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Double‑check numbers &amp;amp; thresholds&lt;/li&gt;
&lt;li&gt;Make sure all examples obey the &lt;em&gt;same rule&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;If you’re encoding law / medicine / finance, verify against the source document&lt;/li&gt;
&lt;/ul&gt;
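
&lt;p&gt;One way to make “all examples obey the same rule” mechanical: encode the rule once as a function, then assert every few-shot example against it before the prompt ships. A sketch using this article’s illustrative subsidy rule:&lt;/p&gt;

```python
def subsidy(vehicle_type, range_km):
    """The article's illustrative rule: BEV/PHEV with range >= 500km -> 15,000; else 0."""
    if vehicle_type in {"BEV", "PHEV"} and range_km >= 500:
        return 15000
    return 0

# Each few-shot example as (type, range_km, expected subsidy).
examples = [("BEV", 500, 15000), ("PHEV", 520, 15000),
            ("BEV", 480, 0), ("Fuel", 600, 0)]

# Any mismatch here means a poisoned example made it into the prompt.
bad = [(t, r) for t, r, expected in examples if subsidy(t, r) != expected]
```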




&lt;h3&gt;
  
  
  3.5 Order your examples from &lt;strong&gt;simple → complex&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Humans don’t like learning by being dropped straight into edge cases; neither do LLMs.&lt;/p&gt;

&lt;p&gt;If your first example already mixes three special conditions, the model might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Over‑weight that weird case&lt;/li&gt;
&lt;li&gt;Miss the simpler underlying rule&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A nicer pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Base case A
&lt;/li&gt;
&lt;li&gt;Base case B
&lt;/li&gt;
&lt;li&gt;Clear negative / boundary
&lt;/li&gt;
&lt;li&gt;Then &lt;em&gt;maybe&lt;/em&gt; one tricky combo&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example in subsidy land:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example 1: Simple positive (type OK + range OK)
Example 2: Simple negative (type OK + range too low)
Example 3: Simple negative (wrong type)
Example 4: Complex (meets base rule + qualifies for extra top‑up)
Example 5: Complex (meets base rule but not the extra condition)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By the time you show examples 4 and 5, the model already &lt;em&gt;knows&lt;/em&gt; the base rule and can now learn the extra dimension cleanly.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Cross‑industry case studies
&lt;/h2&gt;

&lt;p&gt;Let’s look at how this plays out in different domains.&lt;br&gt;&lt;br&gt;
Same technique, very different flavours.&lt;/p&gt;

&lt;p&gt;We’ll keep the prompts compact but expressive; you can expand them in your own stack.&lt;/p&gt;


&lt;h3&gt;
  
  
  4.1 Medical: encoding a rare disease guideline (HAE)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You want the model to assist clinicians with a &lt;strong&gt;2024 guideline&lt;/strong&gt; on a rare disease:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Hereditary Angioedema (HAE)&lt;br&gt;&lt;br&gt;
– recurrent non‑itchy swelling&lt;br&gt;&lt;br&gt;
– C1‑INH deficiency&lt;br&gt;&lt;br&gt;
– often misdiagnosed as allergy&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The base model either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has never seen the latest guideline, or&lt;/li&gt;
&lt;li&gt;Hallucinates based on generic “allergy” patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of fine‑tuning, you inject the guideline logic as examples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Few‑shot pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You define a task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: Based on the 2024 HAE guideline, decide:
1) Does this case meet the diagnostic criteria?
2) What is the reasoning?
3) What is an appropriate initial management plan?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you show 3–4 cases that span:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Typical family case&lt;/li&gt;
&lt;li&gt;Clear non‑HAE allergic case&lt;/li&gt;
&lt;li&gt;“Sporadic” case with no family history but positive lab markers&lt;/li&gt;
&lt;li&gt;Another non‑HAE case with itchy rash&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each example is shaped like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example 1
Input (summary):
- Recurrent facial swelling, non‑itchy
- Mother with similar episodes
- C1‑INH activity 20% (low)

Output:
1) Diagnosis: Consistent with HAE.
2) Reasoning: Non‑itchy angioedema + positive family history + low C1‑INH.
3) Management: …
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then your real query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Now analyse this patient:

- 35‑year‑old male
- Recurrent eyelid swelling for 2 years
- No itch, no urticaria
- Father with similar symptoms
- C1‑INH activity 25% (low), C4 reduced

Return the same 3‑part structure.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model doesn’t need to “know” HAE from training — it learns the diagnostic logic on the fly from your few examples.&lt;/p&gt;

&lt;p&gt;You still need a human in the loop (this is medicine), but the model stops hallucinating generic “take some antihistamines” advice.&lt;/p&gt;




&lt;h3&gt;
  
  
  4.2 Finance: injecting a brand‑new IPO regulation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You’re building an assistant for investment bankers analysing &lt;strong&gt;2025 IPO rules&lt;/strong&gt; for a specific exchange (say, a revised science‑tech board in the UK).&lt;/p&gt;

&lt;p&gt;The rulebook changed in March 2025:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New &lt;strong&gt;market‑cap + revenue + profit&lt;/strong&gt; combos&lt;/li&gt;
&lt;li&gt;Special rules for &lt;strong&gt;red‑chip / dual‑class&lt;/strong&gt; structures&lt;/li&gt;
&lt;li&gt;Stricter restrictions on &lt;strong&gt;use of proceeds&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your base model has never seen this document.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Few‑shot pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You declare a task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: Based on the 2025 revised IPO rules, decide whether a company qualifies for listing on the sci‑tech board. Explain your reasoning by:
- Market cap &amp;amp; financial metrics
- Tech / “hard‑tech” profile
- Use of proceeds
- Any special structure (red‑chip, VIE, dual‑class)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you show three archetypal examples:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Clean domestic hard‑tech issuer&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;R&amp;amp;D‑heavy AI chip company
&lt;/li&gt;
&lt;li&gt;Market cap ≥ 50B, revenue ≥ 10B, net profit ≥ 2B
&lt;/li&gt;
&lt;li&gt;Proceeds used to build new chip fab
&lt;/li&gt;
&lt;li&gt;Result: qualifies&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Red‑chip, loss‑making biotech&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cayman / Hong Kong structure
&lt;/li&gt;
&lt;li&gt;Market cap ≥ 150B, loss‑making but high R&amp;amp;D ratio
&lt;/li&gt;
&lt;li&gt;Proceeds for overseas R&amp;amp;D centres
&lt;/li&gt;
&lt;li&gt;Result: qualifies under special red‑chip route&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Non‑tech traditional business misusing funds&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clothing manufacturer
&lt;/li&gt;
&lt;li&gt;Market cap and profits below threshold
&lt;/li&gt;
&lt;li&gt;Wants to use proceeds to buy wealth‑management products
&lt;/li&gt;
&lt;li&gt;Result: does &lt;em&gt;not&lt;/em&gt; qualify (fails both “hard‑tech” positioning and funds‑use rule)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now when you feed in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Company D:
- Hong Kong‑registered red‑chip
- Quantum computing hardware
- Projected post‑IPO market cap: 180B
- Still loss‑making; R&amp;amp;D / revenue ratio 30%
- No VIE structure
- Proceeds used to hire researchers and develop quantum algorithms

Question: Does it qualify? Explain by the same dimensions as the examples.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model can map this onto the red‑chip example and apply the rule: “large market cap + genuine hard‑tech + loss‑making allowed under route X + proceeds used for R&amp;amp;D” → qualifies.&lt;/p&gt;

&lt;p&gt;No regulation fine‑tuning required.&lt;/p&gt;
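&lt;p&gt;If you build this prompt programmatically, the whole pattern is just a task string plus a list of input/output pairs. A minimal sketch (the example texts are condensed from the three archetypes; the helper is illustrative, not a specific SDK):&lt;/p&gt;

```python
# Sketch: assembling the few-shot block from structured examples.
# The example strings are condensed from the archetypes above.

EXAMPLES = [
    ("Company A: domestic AI chip maker, market cap 50B, revenue 10B, "
     "net profit 2B, proceeds fund a new chip fab.",
     "Qualifies: meets the financial thresholds, hard-tech, proper funds use."),
    ("Company B: red-chip loss-making biotech, market cap 150B, high R&D "
     "ratio, proceeds fund overseas R&D centres.",
     "Qualifies under the special red-chip route."),
    ("Company C: clothing maker below thresholds, proceeds buy "
     "wealth-management products.",
     "Does not qualify: fails hard-tech positioning and the funds-use rule."),
]

TASK = ("Task: Based on the 2025 revised IPO rules, decide whether a company "
        "qualifies for listing on the sci-tech board.")

def build_prompt(query: str) -> str:
    """Join the task statement, the worked examples, and the new case."""
    shots = "\n\n".join(f"Input: {q}\nOutput: {a}" for q, a in EXAMPLES)
    return f"{TASK}\n\n{shots}\n\nInput: {query}\nOutput:"

prompt = build_prompt(
    "Company D: Hong Kong-registered red-chip, quantum computing hardware, "
    "projected market cap 180B, loss-making, R&D ratio 30%, proceeds fund R&D."
)
```

&lt;p&gt;The same helper then serves every domain in this section: only &lt;code&gt;TASK&lt;/code&gt; and &lt;code&gt;EXAMPLES&lt;/code&gt; change.&lt;/p&gt;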




&lt;h3&gt;
  
  
  4.3 Programming: teaching Python 3.12 type parameter syntax
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You’re generating code, but the base model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Was trained mostly on Python ≤ 3.11&lt;/li&gt;
&lt;li&gt;Keeps suggesting &lt;code&gt;TypeVar&lt;/code&gt; boilerplate&lt;/li&gt;
&lt;li&gt;Doesn’t use &lt;strong&gt;Python 3.12’s new generic syntax&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You want the model to write idiomatic 3.12 code, today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key new idea&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;

&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;identity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python 3.12 lets you write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;identity&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And similarly for classes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Few‑shot pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Define the task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: You are coding in Python 3.12.
- Prefer the new type parameter syntax: def func[T](...) -&amp;gt; ...
- Prefer built‑in generics (list[T], dict[str, T]) over typing.List / typing.Dict.
- When you see older TypeVar patterns, refactor them into the new style.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then show a few before/after examples.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example 1 — basic function
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before (pre‑3.12)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TypeVar&lt;/span&gt;
&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;echo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After (Python 3.12)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;echo&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Example 2 — list generics
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;
&lt;span class="n"&gt;U&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;U&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Example 3 — generic class
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;
&lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;K&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;V&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;V&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SimpleCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SimpleCache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Example 4 — dictionary helper (the “target” pattern)
&lt;/h4&gt;

&lt;p&gt;Now you show a case very close to what you actually want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;
&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_value&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when you ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Write a generic &lt;code&gt;lookup&lt;/code&gt; function in Python 3.12 that looks up a key in a &lt;code&gt;dict[str, T]&lt;/code&gt; and raises &lt;code&gt;KeyError&lt;/code&gt; if missing, using the new type parameter syntax.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;…you’ll get something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lookup&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;KeyError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exactly the behaviour you want — without touching the model weights.&lt;/p&gt;
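&lt;p&gt;If you want to sanity‑check that behaviour on an interpreter older than 3.12, the same function in the pre‑3.12 &lt;code&gt;TypeVar&lt;/code&gt; form runs identically:&lt;/p&gt;

```python
# The generated `lookup`, written in the pre-3.12 TypeVar form so it also
# runs on older interpreters. Runtime behaviour is identical; only the
# annotation syntax differs.
from typing import TypeVar

T = TypeVar("T")

def lookup(store: dict[str, T], key: str) -> T:
    if key not in store:
        raise KeyError(key)
    return store[key]

assert lookup({"answer": 42}, "answer") == 42
```

&lt;p&gt;On 3.12+ both versions type‑check the same way; the new syntax just removes the module‑level &lt;code&gt;TypeVar&lt;/code&gt; boilerplate.&lt;/p&gt;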




&lt;h3&gt;
  
  
  4.4 Education: encoding a new curriculum standard
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In 2025, the middle‑school math curriculum adds a new unit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;“Data &amp;amp; algorithms basics”&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Simple visualisation (e.g. box plots)&lt;/li&gt;
&lt;li&gt;Basic algorithm concepts (e.g. bubble sort, described in natural language)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;You’re building a teacher‑assistant tool that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reviews lesson plans&lt;/li&gt;
&lt;li&gt;Flags whether they align with the new standard&lt;/li&gt;
&lt;li&gt;Explains &lt;em&gt;why&lt;/em&gt; in teacher‑friendly language&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Few‑shot pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Define the task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: Given a lesson description for grades 7–9, decide:
1) Does it align with the 2025 “Data &amp;amp; Algorithms Basics” standard?
2) Why or why not?
3) If not, how could it be adjusted?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Show three archetypes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Good “box plot” lesson&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Students compute quartiles for real exam‑score data
&lt;/li&gt;
&lt;li&gt;Draw box plots
&lt;/li&gt;
&lt;li&gt;Discuss concentration and spread
→ Meets “data visualisation” requirement&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Good “bubble sort” lesson&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Teacher explains algorithm with diagrams
&lt;/li&gt;
&lt;li&gt;Students describe steps in natural language
&lt;/li&gt;
&lt;li&gt;No Python coding required
→ Matches “understand algorithm idea”, not “implement deep learning”&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Overkill AI lesson&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Teacher asks 9th graders to code a neural network in Python
&lt;/li&gt;
&lt;li&gt;Predict exam scores from features
→ Way beyond the standard; flagged as misaligned&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Then test case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Lesson: Grade 8 “Data &amp;amp; Algorithms Basics”.
- Students design a simple random sampling plan to pick 50 students out of 2,000.
- They collect “weekly exercise time” data.
- They draw a box plot of exercise time and discuss the results.

Question: Aligned or not? Analyse using the three points as in the examples.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model can now conclude:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Random sampling + box plots + interpretation is squarely within the new standard’s scope
&lt;/li&gt;
&lt;li&gt;And it can explain why in the same structure as the examples.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  4.5 Law: encoding new labour‑law amendments
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You have a legal assistant for HR that needs to know about &lt;strong&gt;2025 labour‑law amendments&lt;/strong&gt;, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Occupational injury insurance for platform workers / gig economy&lt;/li&gt;
&lt;li&gt;How to measure working time under remote work&lt;/li&gt;
&lt;li&gt;Minimum compensation for non‑compete agreements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The base model doesn’t know about the 2025 changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Few‑shot pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Define the task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: Given a short labour dispute case, do:
1) Legal assessment under the 2025 amendments
2) Cite the relevant new article(s) in plain language
3) Suggest next steps for the worker and/or employer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Show three examples:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Gig worker injury&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Platform courier, no formal contract, injured on delivery
&lt;/li&gt;
&lt;li&gt;Platform refuses compensation
→ New rule: de facto employment relationship + the platform must pay into injury insurance
→ Suggest: apply for recognition of the employment relationship, then file an injury claim&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Remote work time tracking&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Employer arbitrarily deducts pay by claiming “low efficiency”
&lt;/li&gt;
&lt;li&gt;No proper time‑tracking system
→ New rule: use “effective working time” and require documented tracking
→ Suggest: ask employer for evidence; if none, pay must be corrected&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Non‑compete with absurdly low pay&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2‑year non‑compete, but only 500 GBP/month compensation
→ New rule: at least 30% of the average salary, and no lower than the minimum wage
→ Worker can refuse to comply or demand revised compensation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Then you ask about a design worker who:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Worked fully remotely&lt;/li&gt;
&lt;li&gt;Got only 3,000 GBP in June, while their historical average is 8,000&lt;/li&gt;
&lt;li&gt;Employer never tracked “effective working time”&lt;/li&gt;
&lt;li&gt;Employer claims “your efficiency was low”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model can now chain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This matches the &lt;strong&gt;remote work&lt;/strong&gt; example
&lt;/li&gt;
&lt;li&gt;Employer failed their duty to track time
&lt;/li&gt;
&lt;li&gt;Worker can demand full pay based on historical average, minus what’s been paid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Again: the law text itself lives &lt;em&gt;outside&lt;/em&gt; the model.&lt;br&gt;&lt;br&gt;
The &lt;em&gt;usable logic&lt;/em&gt; gets distilled into 3–4 examples.&lt;/p&gt;
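&lt;p&gt;Where an amendment boils down to arithmetic, you can even mirror the rule in plain code and cross‑check the model’s answer. A sketch of two of the rules above (the 30% figure and the back‑pay logic follow this section’s hypothetical amendments, not real statute):&lt;/p&gt;

```python
# Two of the example rules as plain functions. The 30% floor and the
# back-pay logic mirror the hypothetical 2025 amendments above;
# they are illustrative, not real statute.

def noncompete_floor(avg_monthly_salary: float, minimum_wage: float) -> float:
    """Lowest lawful monthly non-compete compensation under example 3's rule."""
    return max(0.30 * avg_monthly_salary, minimum_wage)

def remote_back_pay(historical_avg: float, paid: float,
                    employer_tracked_time: bool) -> float:
    """If the employer never tracked effective working time, the deduction
    lacks evidence and pay reverts to the historical average."""
    if employer_tracked_time:
        return 0.0  # a documented dispute; outside this sketch
    return max(historical_avg - paid, 0.0)

# The design worker: 3,000 GBP paid vs an 8,000 GBP average, no tracking.
assert remote_back_pay(8000, 3000, employer_tracked_time=False) == 5000
# The 500 GBP/month non-compete offer against an 8,000 GBP average:
assert noncompete_floor(8000, minimum_wage=1200) == 2400
```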


&lt;h2&gt;
  
  
  5. Six common failure modes (and how to fix them)
&lt;/h2&gt;

&lt;p&gt;Even with the right idea, few‑shot prompts can fail in subtle ways.&lt;br&gt;&lt;br&gt;
Let’s debug some typical issues.&lt;/p&gt;


&lt;h3&gt;
  
  
  5.1 The model just parrots your examples
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Works fine on seen cases&lt;/li&gt;
&lt;li&gt;On new input, it copies values from examples or repeats entire example answers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Likely causes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Examples are &lt;em&gt;purely concrete&lt;/em&gt; — lots of “case A / case B” but no visible rule&lt;/li&gt;
&lt;li&gt;Task wording doesn’t force generalisation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Explicitly encode the rule once&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Instead of only:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example: EV A, 500km → 15k
Example: EV B, 520km → 15k
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add a line like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Rule: For all BEVs and PHEVs with range ≥ 500km, subsidy is 15,000 GBP.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Align features&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If your examples all use “Car: A, range 500, type BEV”, and your real task says “Model E, long‑range battery, SUV body style”…&lt;br&gt;&lt;br&gt;
the model may not spot that “long‑range battery” = “range feature”.&lt;/p&gt;

&lt;p&gt;Be boring and explicit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Always show the same fields: &lt;code&gt;Type=…&lt;/code&gt;, &lt;code&gt;Range=…&lt;/code&gt;, &lt;code&gt;BodyStyle=…&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
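&lt;p&gt;Both fixes are easy to see in code: state the rule once, and normalise every case into the same fields before it enters the prompt. A sketch using the illustrative EV‑subsidy numbers above:&lt;/p&gt;

```python
# Sketch: one explicit rule + one canonical field format, so "long-range
# battery" always becomes an explicit Range value before prompting.
# Field names and the 500km / 15,000 GBP rule are the illustrative ones above.

def format_case(vehicle_type: str, range_km: int, body_style: str) -> str:
    """Render every vehicle with the same fields, in the same order."""
    return f"Type={vehicle_type}, Range={range_km}km, BodyStyle={body_style}"

def subsidy(vehicle_type: str, range_km: int) -> int:
    """Rule stated once: BEV/PHEV with range >= 500km gets 15,000 GBP."""
    return 15_000 if vehicle_type in {"BEV", "PHEV"} and range_km >= 500 else 0

assert format_case("BEV", 520, "SUV") == "Type=BEV, Range=520km, BodyStyle=SUV"
assert subsidy("BEV", 520) == 15_000
```

&lt;p&gt;Feeding &lt;code&gt;format_case(...)&lt;/code&gt; output into every example (and into the query) removes the guesswork about which phrase carries the range feature.&lt;/p&gt;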




&lt;h3&gt;
  
  
  5.2 The model ignores edge cases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handles normal cases well&lt;/li&gt;
&lt;li&gt;Fails on boundary or special cases that &lt;em&gt;were&lt;/em&gt; in your examples&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: you showed one “HAE with no family history” case, but the model still insists “no family history → no HAE”.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Likely causes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only a single edge example vs many normal examples&lt;/li&gt;
&lt;li&gt;Edge case buried at the end with no emphasis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add &lt;strong&gt;at least two&lt;/strong&gt; edge‑case examples&lt;/li&gt;
&lt;li&gt;Label them clearly, e.g.:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Boundary case] Patient C: no family history, but low C1‑INH and typical symptoms → still HAE.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Explicit tags like &lt;code&gt;[Boundary case]&lt;/code&gt; or &lt;code&gt;[Special exception]&lt;/code&gt; help the model treat them as part of the rule, not noise.&lt;/p&gt;




&lt;h3&gt;
  
  
  5.3 Examples are too long (signal drowned in noise)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt looks like a wall of text&lt;/li&gt;
&lt;li&gt;Model latches onto random parts (e.g. story flavour) instead of the core logic&lt;/li&gt;
&lt;li&gt;Sometimes it even ignores later examples because the context is overloaded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Likely causes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You copy‑pasted entire documents instead of minimal supervised examples&lt;/li&gt;
&lt;li&gt;You included narrative fluff, version history, citations, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trim each example to: &lt;code&gt;Input → Output → Short explanation&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Put long policy / guideline text in a &lt;em&gt;separate&lt;/em&gt; section above, and &lt;strong&gt;summarise&lt;/strong&gt; the operative rule in the example&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Try to keep:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each example ≤ ~100 tokens if possible&lt;/li&gt;
&lt;li&gt;All examples + instructions under ~80% of your context window, leaving room for the actual query and the model’s reasoning&lt;/li&gt;
&lt;/ul&gt;
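&lt;p&gt;A rough way to enforce that budget before sending anything. Real counts depend on the model’s tokeniser (e.g. &lt;code&gt;tiktoken&lt;/code&gt;); &lt;code&gt;len(text) // 4&lt;/code&gt; is only a crude approximation for English prose:&lt;/p&gt;

```python
# Crude budget check: approximate tokens at ~4 characters each and keep
# instructions + examples under ~80% of the context window.

def approx_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic; use the real tokeniser in prod

def fits_budget(instructions: str, examples: list[str],
                context_window: int = 8192, reserve: float = 0.20) -> bool:
    """True if instructions + examples leave `reserve` of the window free."""
    used = approx_tokens(instructions) + sum(map(approx_tokens, examples))
    return used <= context_window * (1 - reserve)

assert fits_budget("Task: classify.", ["Input: a\nOutput: b"] * 10)
```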




&lt;h3&gt;
  
  
  5.4 Terminology collision across domains
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You re‑use generic terms like “subsidy”, “margin”, “policy” in multiple domains&lt;/li&gt;
&lt;li&gt;The model mixes up meanings (e.g. subsidy in finance vs subsidy in automotive)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Likely causes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No domain qualification: “subsidy” means different things in 3 prompts&lt;/li&gt;
&lt;li&gt;Inconsistent phrasing: “post‑IPO market cap” vs “total value after listing” vs “size”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Define terms&lt;/strong&gt; in your instructions:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;In this task, “market cap” always means “projected post‑IPO market cap” (shares × offering price).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;domain‑specific names&lt;/strong&gt; where possible:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ev_subsidy&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ipo_post_listing_market_cap&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hae_lab_marker&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Consistency beats cleverness.&lt;/p&gt;




&lt;h3&gt;
  
  
  5.5 Small models can’t juggle too many conditions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A big model (e.g. GPT‑4‑class) handles your prompt perfectly&lt;/li&gt;
&lt;li&gt;A smaller one (7B / 13B) seems to ignore half the conditions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: in HAE diagnosis, the small model only looks at symptoms but ignores lab values and family history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Likely causes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You’re asking it to learn a 3–4‑dimensional rule in one shot&lt;/li&gt;
&lt;li&gt;Each example mixes many conditions at once&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Break the problem into &lt;strong&gt;layers of examples&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Single‑dimension examples&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only symptoms: itchy vs non‑itchy swelling&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Two‑dimension examples&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Symptoms + lab values&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Three‑dimension examples&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Symptoms + lab values + family history&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The smaller model can first lock in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Non‑itchy swelling” vs “allergic swelling”
&lt;/li&gt;
&lt;li&gt;Then “low C1‑INH” vs “normal”
&lt;/li&gt;
&lt;li&gt;Then “family history strengthens the case”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of being hit with all three dimensions at once.&lt;/p&gt;
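&lt;p&gt;The layering is easy to automate: keep each layer as its own list of examples and concatenate them in order, simplest first. A minimal sketch with illustrative (not clinical) example text:&lt;/p&gt;

```python
# Layered few-shot block: single-dimension examples first, then
# two-dimension, then three-dimension, so a small model can lock in
# one distinction at a time. Example text is illustrative only.
LAYERS = [
    # 1. symptoms only
    ["Case: non-itchy swelling, no hives -> consider HAE",
     "Case: itchy swelling with hives -> consider allergy"],
    # 2. symptoms + lab values
    ["Case: non-itchy swelling, low C1-INH -> HAE likely"],
    # 3. symptoms + labs + family history
    ["Case: non-itchy swelling, low C1-INH, affected parent -> HAE strongly suspected"],
]

def layered_prompt(layers: list[list[str]]) -> str:
    parts = []
    for i, examples in enumerate(layers, start=1):
        parts.append(f"# Layer {i}")
        parts.extend(examples)
    return "\n".join(parts)

print(layered_prompt(LAYERS))
```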




&lt;h3&gt;
  
  
  5.6 Silent errors in your examples poison everything
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Everything &lt;em&gt;feels&lt;/em&gt; coherent&lt;/li&gt;
&lt;li&gt;But the model’s answers are consistently off from the real rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you go back and re‑read your few‑shot block, you find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One threshold is wrong
&lt;/li&gt;
&lt;li&gt;Or two examples contradict each other&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Treat prompt debugging like code review:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Validate against the source of truth&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Policy document
&lt;/li&gt;
&lt;li&gt;Official guideline
&lt;/li&gt;
&lt;li&gt;Legal article&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check internal consistency&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same thresholds everywhere
&lt;/li&gt;
&lt;li&gt;Same units (km vs miles, %)
&lt;/li&gt;
&lt;li&gt;No “≥ 450km” in one example and “≥ 500km” in the next&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
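
&lt;p&gt;Part of that review can be automated. A small lint sketch that extracts every “≥ N km” threshold from a few-shot block and flags disagreement — the regex and single-unit handling are simplifying assumptions, not a general solution:&lt;/p&gt;

```python
import re

def threshold_values(examples: list[str], unit: str = "km") -> set[str]:
    # Collect every numeric threshold written as "≥ N <unit>".
    # More than one distinct value means the examples contradict each other.
    values = set()
    for ex in examples:
        values.update(re.findall(rf"≥\s*(\d+)\s*{unit}", ex))
    return values

good = ["Range ≥ 450km -> eligible", "Range ≥ 450km -> eligible"]
bad = ["Range ≥ 450km -> eligible", "Range ≥ 500km -> eligible"]
print(threshold_values(good))  # one value: consistent
print(threshold_values(bad))   # two values: fix before shipping
```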

&lt;p&gt;If you fix the examples and the model’s behaviour changes, you just proved that your few‑shot block &lt;em&gt;is&lt;/em&gt; acting like a miniature “training set in context” — which is exactly the point.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Where this is going next
&lt;/h2&gt;

&lt;p&gt;Few‑Shot‑in‑Context is not a temporary hack; it’s becoming part of the core &lt;strong&gt;LLM engineering toolbox&lt;/strong&gt;, especially when combined with other techniques.&lt;/p&gt;

&lt;p&gt;A few directions that are already practical today:&lt;/p&gt;

&lt;h3&gt;
  
  
  6.1 Few‑shot + RAG = dynamic knowledge injection
&lt;/h3&gt;

&lt;p&gt;Instead of hard‑coding your examples into the prompt, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store policy snippets / typical cases in a vector DB&lt;/li&gt;
&lt;li&gt;At query time:

&lt;ul&gt;
&lt;li&gt;Retrieve the most relevant 3–5 items&lt;/li&gt;
&lt;li&gt;Format them as few‑shot examples&lt;/li&gt;
&lt;li&gt;Feed them to the model&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;You get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Up‑to‑date knowledge (change the DB, not the model)&lt;/li&gt;
&lt;li&gt;Domain‑specific behaviour&lt;/li&gt;
&lt;li&gt;No retraining loops&lt;/li&gt;
&lt;/ul&gt;
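
&lt;p&gt;The retrieve-then-format loop can be sketched in a few lines. Similarity here is naive keyword overlap purely for illustration; a real system would use embeddings and a vector DB, but the shape of the pipeline is the same:&lt;/p&gt;

```python
# Retrieve the k most relevant stored cases for a query, then splice
# them into the prompt as few-shot examples. Toy similarity metric:
# shared lowercase tokens.
def overlap(a: str, b: str) -> int:
    return len(set(a.lower().split()).intersection(b.lower().split()))

def retrieve(query: str, store: list[dict], k: int = 3) -> list[dict]:
    return sorted(store, key=lambda c: overlap(query, c["input"]),
                  reverse=True)[:k]

def to_few_shot(cases: list[dict]) -> str:
    return "\n\n".join(f"Input: {c['input']}\nOutput: {c['output']}"
                       for c in cases)

store = [
    {"input": "EV range 520km price 280k", "output": "eligible"},
    {"input": "EV range 400km price 150k", "output": "not eligible"},
    {"input": "IPO shares 100M price 12", "output": "market cap 1.2B"},
]
prompt_examples = to_few_shot(retrieve("EV range 480km", store, k=2))
print(prompt_examples)
```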

&lt;h3&gt;
  
  
  6.2 Multi‑modal few‑shot
&lt;/h3&gt;

&lt;p&gt;With vision‑capable models, examples don’t have to be text‑only.&lt;/p&gt;

&lt;p&gt;You can show:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;strong&gt;image of a box plot&lt;/strong&gt; + text interpretation → teach data‑viz reading&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;scan of a legal clause&lt;/strong&gt; + structured summary → teach contract analysis&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;medical image&lt;/strong&gt; + diagnosis → teach pattern recognition frameworks (with human oversight)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The principle is the same: a tiny set of &lt;strong&gt;high‑quality, well‑structured examples&lt;/strong&gt; sets the behaviour.&lt;/p&gt;

&lt;h3&gt;
  
  
  6.3 Personal and team‑level “knowledge presets”
&lt;/h3&gt;

&lt;p&gt;For individuals and small teams, we’ll likely see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tools that help you &lt;strong&gt;generate few‑shot blocks&lt;/strong&gt; from your own notes&lt;/li&gt;
&lt;li&gt;“Profiles” that encode:

&lt;ul&gt;
&lt;li&gt;Your coding style&lt;/li&gt;
&lt;li&gt;Your company’s policy interpretations&lt;/li&gt;
&lt;li&gt;Your favourite data formats&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Think of it as a “soft fine‑tune” you can edit in a text editor.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Takeaways
&lt;/h2&gt;

&lt;p&gt;If you remember only three things from this article, make them these:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fine‑tuning is rarely your first move.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Often, you can get 80–90% of the value by injecting knowledge with 3–5 carefully chosen examples.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Good few‑shot prompts encode logic, not trivia.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Cover both normal and edge cases, be brutally consistent in format, and keep examples factual and compact.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Few‑shot is a bridge between raw models and real products.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
It lets you adapt general‑purpose LLMs to fast‑moving, niche, or private knowledge — on your laptop, in minutes, without a GPU farm.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once you start thinking of prompts as tiny, editable “on‑the‑fly training sets”, you stop reaching for fine‑tuning by default — and start shipping faster.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>llm</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
