<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: René Zander</title>
    <description>The latest articles on Forem by René Zander (@reneza).</description>
    <link>https://forem.com/reneza</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1138713%2Fa7d8635c-22db-4dec-b156-1fb07de64a8d.jpeg</url>
      <title>Forem: René Zander</title>
      <link>https://forem.com/reneza</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/reneza"/>
    <language>en</language>
    <item>
      <title>Browser-Use Is Solving the Wrong Half of the Problem</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 19 May 2026 14:57:13 +0000</pubDate>
      <link>https://forem.com/reneza/browser-use-is-solving-the-wrong-half-of-the-problem-37pa</link>
      <guid>https://forem.com/reneza/browser-use-is-solving-the-wrong-half-of-the-problem-37pa</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR — when to use browserground (and when to use UI-TARS-MLX instead)
&lt;/h2&gt;

&lt;p&gt;If you're on Apple Silicon with ≥16 GB RAM and you need &lt;strong&gt;generic, max-accuracy UI grounding&lt;/strong&gt;, use &lt;strong&gt;&lt;a href="https://huggingface.co/mlx-community/UI-TARS-1.5-7B-4bit" rel="noopener noreferrer"&gt;mlx-community/UI-TARS-1.5-7B-4bit&lt;/a&gt;&lt;/strong&gt;. It's the obvious default — ~94% on ScreenSpot-v2, MLX-native, drops into &lt;code&gt;mlx-vlm&lt;/code&gt; directly. ByteDance research-lab compute, you couldn't reproduce it on a budget. I'm not the right pick for that workload.&lt;/p&gt;

&lt;p&gt;browserground is for two narrower jobs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The recipe for &lt;em&gt;your product's&lt;/em&gt; custom UI grounder.&lt;/strong&gt; UI-TARS is a finished model — closed pipeline, proprietary data, hard to extend. browserground is the opposite: a  reproducible template. Open base (Qwen3-VL-2B), open training scripts, open data mix (26k records, OS-Atlas + wave-ui). Swap in your dashboard's screenshots / your customer app / your internal tooling → ship a domain-trained UI grounder over a weekend. The 60% generic ScreenSpot-v2 score isn't the deliverable; the &lt;em&gt;recipe&lt;/em&gt; is. A 60-point baseline on generic screens becomes 85-95% on your own product's narrow surface because the test distribution finally matches the training distribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The smallest viable slot in a multi-model stack.&lt;/strong&gt; browserground 4-bit MLX is ~1 GB on disk / ~2 GB RAM. UI-TARS-1.5-7B-MLX is ~4 GB / ~5-6 GB RAM. The difference matters on 8 GB Macs and in agent stacks that already run a 7B planner + an OCR model + an embedding model in the same RAM budget. Plus strict JSON output (100% parseable, no regex on prose) — small win, but real.&lt;/p&gt;

&lt;p&gt;A direct head-to-head benchmark of browserground vs UI-TARS-1.5-7B-MLX on the same Apple Silicon hardware is forthcoming.&lt;/p&gt;




&lt;h2&gt;
  
  
  The broader argument — why a parser-stage specialist matters at all
&lt;/h2&gt;

&lt;h2&gt;
  
  
  And if you're new to the hybrid pattern — why this exists at all
&lt;/h2&gt;

&lt;p&gt;Everyone's posting browser-agent demos this week. Click here, scroll there, fill that form. Most break by click seven.&lt;/p&gt;

&lt;p&gt;Mine broke too. The submit button on a checkout form that the frontier vision model literally couldn't see. Billed at $0.01-0.05 per call, called 20-50 times per agent run, the model was burning reasoning capacity on parsing pixel coordinates. A 2B specialist I trained for $5 hits that same button &lt;strong&gt;3.3x more reliably&lt;/strong&gt; on ScreenSpot-v2 (60.0% vs GPT-4o's 18.3%).&lt;/p&gt;

&lt;p&gt;The architecture is the bug, not the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Jobs, One Forward Pass
&lt;/h2&gt;

&lt;p&gt;Browser-agent stacks send a screenshot to a frontier vision model and ask it for both the next decision and the click coordinates in one call. Splitting that into two calls, a local 2B grounding model that emits JSON followed by a frontier model that reasons over the JSON, drops vision token spend and raises click accuracy.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;browser-use&lt;/code&gt; (94k stars), Skyvern (22k stars), Claude Computer Use, OpenAI Operator. Same pattern. Same compound question every step:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Given this page, what should the agent do next, and where exactly does it click?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Two jobs welded together. Reasoning ("what next") is a probabilistic problem worth a frontier model. Grounding ("where exactly") is a structured-output problem with a tight schema: clickable elements, bounding boxes, accessible labels.&lt;/p&gt;

&lt;p&gt;You're paying frontier-tier rates for the second job. Per screenshot. Every step of the loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grounding Is a Parser Problem
&lt;/h2&gt;

&lt;p&gt;Once you name it as a parser problem, the right tool changes. You don't need 200 billion parameters to emit a JSON list of clickable elements. You need a model that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has seen enough UI screenshots to recognize buttons, inputs, links with sub-50-pixel precision&lt;/li&gt;
&lt;li&gt;Outputs strict JSON without hallucinating bounding boxes&lt;/li&gt;
&lt;li&gt;Runs locally so the per-step cost is zero&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 2B specialist trained on screen-parsing data. Not a frontier model.&lt;/p&gt;

&lt;p&gt;I trained one. Total cost: &lt;strong&gt;~$5&lt;/strong&gt; of RunPod compute on a single A6000 GPU. The result, &lt;code&gt;browserground&lt;/code&gt;, hits &lt;strong&gt;60.0% on ScreenSpot-v2&lt;/strong&gt; vs GPT-4o's 18.3% — a 3.3x beat at the click-grounding job. More telling: it &lt;strong&gt;beats SeeClick (9.6B params, 55.1%) at 4.8x smaller&lt;/strong&gt;. A drop-in for any agent loop currently handing screenshots to a frontier API. Today the CLI runs via &lt;code&gt;transformers&lt;/code&gt; on Apple Silicon (~14 s/call); MLX-native build coming for the ~1.5 s path.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reasoning Model Gets Its Reasoning Capacity Back
&lt;/h2&gt;

&lt;p&gt;When you split the call, the frontier model stops seeing pixels. It sees:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"elements"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"e7"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Submit order"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"button"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"bbox"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;344&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;478&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;658&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"e8"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Edit cart"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"link"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nl"&gt;"bbox"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the frontier model does the job it's good at: deciding &lt;code&gt;e7&lt;/code&gt; vs &lt;code&gt;e8&lt;/code&gt; given the agent's goal. A reasoning question over structured input. Cheap. Reliable. Auditable.&lt;/p&gt;

&lt;p&gt;Three things change at once. Per-step token spend on vision collapses, because the grounding step runs locally. JSON validity hits 100% (the specialist learned the output convention with 35M LoRA parameters on a Qwen3-VL-2B base). Agent traces become debuggable. You read the structured grounding output before the reasoning step ever runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Anthropic and OpenAI Ship Next
&lt;/h2&gt;

&lt;p&gt;The frontier providers will absorb grounding into their own small models. Within twelve months, "fast vision" or "tool vision" tiers will appear in both Anthropic and OpenAI billing at a fraction of frontier rates. The economics demand it. Nobody can justify charging GPT-5 prices for a parser, and Hugging Face downloads already prove the demand: SeeClick, UI-TARS, and ShowUI pull ~300k category downloads a month between them.&lt;/p&gt;

&lt;p&gt;When that ships, stack owners who already split grounding from reasoning have three things the wait-and-see crowd doesn't. A local fallback if the provider has an outage. An auditable structured-grounding trace in every log line. An exit option to a different reasoning provider without re-validating click behavior, because the grounding step belongs to them.&lt;/p&gt;

&lt;p&gt;Stack owners who didn't split will find their grounding step has quietly become someone else's API. Same vendor billing the reasoning calls. Same vendor setting the price. Same vendor's deprecation calendar.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Diagnostic
&lt;/h2&gt;

&lt;p&gt;Pull up your last failed agent trace. Three numbers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Total tokens spent on vision calls per agent step.&lt;/li&gt;
&lt;li&gt;Fraction of those tokens spent on grounding (parsing pixel coordinates) vs reasoning (deciding actions).&lt;/li&gt;
&lt;li&gt;Per-run vision cost at your current API rates.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If grounding dominates the first two numbers, and in most stacks it does, your stack has the split wrong.&lt;/p&gt;

&lt;p&gt;Grounding is plumbing. Reasoning is cognition. Stop paying cognition rates for plumbing.&lt;/p&gt;




&lt;p&gt;I build the split layer. &lt;code&gt;browserground&lt;/code&gt; is the open-source reference for the local grounding half. v0.3 ships three packagings so it drops into any stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;npm CLI&lt;/strong&gt; (daemon, HTTP REST server, batch, confidence, eval): &lt;code&gt;npm install -g browserground&lt;/code&gt; → &lt;code&gt;browserground parse &amp;lt;img&amp;gt; --target "..."&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt; (no Node required, MLX or transformers): &lt;code&gt;pip install "browserground[mlx]"&lt;/code&gt; → &lt;code&gt;from browserground import click_xy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; (cross-platform, GGUF Q4_K_M + f16 mmproj): &lt;code&gt;ollama run renezander030/browserground&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Adapters land in the repo for &lt;strong&gt;browser-use&lt;/strong&gt; (drop-in &lt;code&gt;Controller&lt;/code&gt; action) and &lt;strong&gt;Skyvern&lt;/strong&gt; (&lt;code&gt;ground_with_fallback&lt;/code&gt; for local-first + cloud-fallback). Model: &lt;a href="https://huggingface.co/renezander030/browserground" rel="noopener noreferrer"&gt;huggingface.co/renezander030/browserground&lt;/a&gt;. MLX 4-bit: &lt;a href="https://huggingface.co/renezander030/browserground-mlx" rel="noopener noreferrer"&gt;browserground-mlx&lt;/a&gt;. GGUF: &lt;a href="https://huggingface.co/renezander030/browserground-gguf" rel="noopener noreferrer"&gt;browserground-gguf&lt;/a&gt;. Source: &lt;a href="https://github.com/renezander030/browserground" rel="noopener noreferrer"&gt;github.com/renezander030/browserground&lt;/a&gt;. Apache-2.0. v0.2 LoRA trained on 26k mixed-domain examples (macOS + Android + UIBert + web). PRs welcome, especially eval cases where it fails.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Three Files I Renamed Last Month That Fixed My AI Agent</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 12 May 2026 06:59:17 +0000</pubDate>
      <link>https://forem.com/reneza/three-files-i-renamed-last-month-that-fixed-my-ai-agent-kgi</link>
      <guid>https://forem.com/reneza/three-files-i-renamed-last-month-that-fixed-my-ai-agent-kgi</guid>
      <description>&lt;p&gt;A new job title showed up in my feed last week. Context engineer.&lt;/p&gt;

&lt;p&gt;I clicked. Twice. Three articles, all definitional. None of them showed code. Every definition was a paraphrase of "name things well."&lt;/p&gt;

&lt;p&gt;Then I opened my own repo and grepped for &lt;code&gt;utils.py&lt;/code&gt;. Eleven hits. Six folders deep, in a project I have shipped to production. The same project where I had spent two hours that morning fighting an agent that kept loading the wrong helper file.&lt;/p&gt;

&lt;p&gt;The fix took thirty seconds. I renamed the file.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pitch checks out
&lt;/h2&gt;

&lt;p&gt;The current pitch goes like this. LLMs perform better with the right context. So we need a new role to design that context. Curate the files, structure the prompts, build the retrieval system, manage the embeddings.&lt;/p&gt;

&lt;p&gt;Each claim is accurate. None of them describes a new problem.&lt;/p&gt;

&lt;p&gt;You have been doing this since you wrote your first config file. Every time you renamed &lt;code&gt;helpers.js&lt;/code&gt; to &lt;code&gt;payment-validation.js&lt;/code&gt;, you were context engineering. Every time you split a 400-line file into three named pieces, you were context engineering. The audience was always the next developer reading the file. Now there is one more reader.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actual job
&lt;/h2&gt;

&lt;p&gt;A senior engineer joining a codebase does three things in the first week. They read the folder structure. They read the most-named-things in the imports. They follow the breadcrumb from filename to function to line.&lt;/p&gt;

&lt;p&gt;Agents do the same thing. The retrieval layer in your agent loop is grep with extra steps. It pulls files whose names match the task. It pulls functions whose docstrings match the intent. It pulls comments that say what the code does.&lt;/p&gt;

&lt;p&gt;If your filenames are vague, the retrieval is vague. If your function names lie about what the function does, the agent loads the wrong one. If your folder names group code by file type instead of by concern, the agent loads &lt;code&gt;models/&lt;/code&gt; and gets nothing useful.&lt;/p&gt;

&lt;p&gt;The skill we have been failing to enforce for thirty years became load-bearing the day a probabilistic reader joined the loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three things I renamed last month
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;api.py&lt;/code&gt; became &lt;code&gt;payment-webhook-handler.py&lt;/code&gt;. The agent stopped loading it for unrelated payment questions. One rename, one less failure mode.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;utils/&lt;/code&gt; got deleted. The five files inside moved next to the code that called them, with names that said what they did. &lt;code&gt;format-currency.py&lt;/code&gt;, &lt;code&gt;parse-iso-date.py&lt;/code&gt;, &lt;code&gt;redact-pii.py&lt;/code&gt;. The agent now loads them only when the task mentions currency, dates, or PII.&lt;/p&gt;

&lt;p&gt;A 600-line &lt;code&gt;process.py&lt;/code&gt; split into four files. &lt;code&gt;validate-input.py&lt;/code&gt;, &lt;code&gt;dedupe-rows.py&lt;/code&gt;, &lt;code&gt;enrich-from-cache.py&lt;/code&gt;, &lt;code&gt;write-to-warehouse.py&lt;/code&gt;. The agent stopped trying to read the whole pipeline to answer questions about a single step.&lt;/p&gt;

&lt;p&gt;Call it context engineering if the buzzword helps. The work is renaming.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the model has to guess
&lt;/h2&gt;

&lt;p&gt;Walk your repo right now. Count the files named &lt;code&gt;utils.py&lt;/code&gt;, &lt;code&gt;helpers.js&lt;/code&gt;, &lt;code&gt;common.go&lt;/code&gt;, &lt;code&gt;lib/&lt;/code&gt;, &lt;code&gt;services/&lt;/code&gt;, &lt;code&gt;manager/&lt;/code&gt;. Count functions named &lt;code&gt;process&lt;/code&gt;, &lt;code&gt;handle&lt;/code&gt;, &lt;code&gt;run&lt;/code&gt;, &lt;code&gt;execute&lt;/code&gt;, &lt;code&gt;do&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Each one is a place where the model has to guess. Each one is a place where you have to write a longer prompt to compensate. Each one is a paragraph you will never have to write again if you rename it now.&lt;/p&gt;

&lt;p&gt;The reason context engineering feels hard is that you are trying to solve at retrieval time what should have been solved at naming time. You cannot grep your way out of a vague folder structure. You cannot embedding-search your way out of &lt;code&gt;process()&lt;/code&gt;. The model is asking the same question the senior engineer asks in week one. The answer was always going to be: name things by what they do, not by what they are.&lt;/p&gt;

&lt;h2&gt;
  
  
  The role that already existed
&lt;/h2&gt;

&lt;p&gt;There is a real version of context engineering. Chunking strategy, embedding choice, retrieval rerankers, evaluation harnesses. That work is real and hard.&lt;/p&gt;

&lt;p&gt;Most of what gets called context engineering this month is the rename you skipped in the original PR.&lt;/p&gt;

&lt;p&gt;A context engineer is a developer who finally names things.&lt;/p&gt;

&lt;p&gt;What is the file in your codebase that the agent keeps loading wrong?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Your AI Workflow Doesn't Need Better Prompts. It Needs Less AI.</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 05 May 2026 06:33:55 +0000</pubDate>
      <link>https://forem.com/reneza/your-ai-workflow-doesnt-need-better-prompts-it-needs-less-ai-1cfp</link>
      <guid>https://forem.com/reneza/your-ai-workflow-doesnt-need-better-prompts-it-needs-less-ai-1cfp</guid>
      <description>&lt;p&gt;The first stage of AI work is prompting.&lt;/p&gt;

&lt;p&gt;The last stage is removing the model from most of the workflow.&lt;/p&gt;

&lt;p&gt;That sounds backwards.&lt;/p&gt;

&lt;p&gt;It is not.&lt;/p&gt;

&lt;p&gt;When a workflow is new, the LLM is useful because the work is still ambiguous. You are discovering what good looks like. You try a prompt, read the output, adjust the examples, change the tone, add constraints, and run it again.&lt;/p&gt;

&lt;p&gt;That is a good use of AI.&lt;/p&gt;

&lt;p&gt;But if the same workflow keeps coming back, and you are still explaining it to the model every time, you are not building capability. You are repeating yourself with a better interface.&lt;/p&gt;

&lt;p&gt;The mature workflow is not one where the LLM does everything.&lt;/p&gt;

&lt;p&gt;The mature workflow is one where the LLM only handles the part where ambiguity is useful.&lt;/p&gt;

&lt;p&gt;Everything else becomes process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompting Is Discovery
&lt;/h2&gt;

&lt;p&gt;Prompting is where most people start because it is the fastest way to get feedback.&lt;/p&gt;

&lt;p&gt;You ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Write this article."&lt;/li&gt;
&lt;li&gt;"Make it sound less generic."&lt;/li&gt;
&lt;li&gt;"Use my style."&lt;/li&gt;
&lt;li&gt;"Add examples."&lt;/li&gt;
&lt;li&gt;"Make the intro stronger."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this stage, the model is helping you figure out the shape of the task.&lt;/p&gt;

&lt;p&gt;You do not fully know the target yet. You are exploring. You are testing whether the idea even works. You are using the model as a thinking partner, a drafter, a critic, and sometimes a mirror.&lt;/p&gt;

&lt;p&gt;That is fine.&lt;/p&gt;

&lt;p&gt;The mistake is treating this as the final form.&lt;/p&gt;

&lt;p&gt;If you have to keep saying the same thing, you do not have a workflow. You have a recurring conversation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Better Prompts Still Have a Ceiling
&lt;/h2&gt;

&lt;p&gt;The next stage is usually better prompting.&lt;/p&gt;

&lt;p&gt;You add examples. You add constraints. You add a target audience. You tell the model what to avoid. You define the output format. You write a longer system prompt.&lt;/p&gt;

&lt;p&gt;The output improves.&lt;/p&gt;

&lt;p&gt;For a while, this feels like progress.&lt;/p&gt;

&lt;p&gt;But longer prompts have a hidden failure mode: the model still has to remember and balance everything at once.&lt;/p&gt;

&lt;p&gt;Style rules. Factual constraints. Tone. Audience. Platform conventions. Examples. Edge cases. Forbidden phrases. Review criteria.&lt;/p&gt;

&lt;p&gt;All of it goes into one big instruction block.&lt;/p&gt;

&lt;p&gt;The prompt becomes a pile of expectations, and the model is still the one deciding which expectations matter in the moment.&lt;/p&gt;

&lt;p&gt;That is fragile.&lt;/p&gt;

&lt;p&gt;At some point, "make the prompt better" stops being the right move.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skills Are Repetition
&lt;/h2&gt;

&lt;p&gt;The next level is turning repeated prompting into a skill.&lt;/p&gt;

&lt;p&gt;A skill packages the context and process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what files or sources to read&lt;/li&gt;
&lt;li&gt;what examples matter&lt;/li&gt;
&lt;li&gt;what tone to use&lt;/li&gt;
&lt;li&gt;what scripts to run&lt;/li&gt;
&lt;li&gt;what output format is expected&lt;/li&gt;
&lt;li&gt;what review criteria should be applied&lt;/li&gt;
&lt;li&gt;what fallback path to use when something breaks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a real improvement.&lt;/p&gt;

&lt;p&gt;The workflow becomes portable. You stop explaining everything from scratch. The model gets the right context faster. You are no longer relying on whatever happens to be in the current chat thread.&lt;/p&gt;

&lt;p&gt;For many AI workflows, this is the first serious productivity jump.&lt;/p&gt;

&lt;p&gt;But skills have their own failure mode.&lt;/p&gt;

&lt;p&gt;They can become too large.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Skills Become Prompt Monsters
&lt;/h2&gt;

&lt;p&gt;A skill can start as a clean reusable process and slowly turn into another giant prompt.&lt;/p&gt;

&lt;p&gt;More instructions.&lt;/p&gt;

&lt;p&gt;More exceptions.&lt;/p&gt;

&lt;p&gt;More "always do this."&lt;/p&gt;

&lt;p&gt;More "never do that."&lt;/p&gt;

&lt;p&gt;More examples.&lt;/p&gt;

&lt;p&gt;More scripts.&lt;/p&gt;

&lt;p&gt;More personal preferences.&lt;/p&gt;

&lt;p&gt;At some point, the skill is not making the workflow reliable. It is just giving the model more things to interpret.&lt;/p&gt;

&lt;p&gt;The model still has to decide what matters.&lt;/p&gt;

&lt;p&gt;The model still has to judge whether the output is good enough.&lt;/p&gt;

&lt;p&gt;The model is still checking its own homework.&lt;/p&gt;

&lt;p&gt;That is the point where the workflow needs to move outside the LLM.&lt;/p&gt;

&lt;p&gt;Not all of it.&lt;/p&gt;

&lt;p&gt;The stable parts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Model Can Write the Code. It Does Not Get to Decide Whether the Code Passes.
&lt;/h2&gt;

&lt;p&gt;Developers already understand this when we talk about code.&lt;/p&gt;

&lt;p&gt;We do not ask a developer, human or AI:&lt;/p&gt;

&lt;p&gt;"Does this code look correct?"&lt;/p&gt;

&lt;p&gt;We run the formatter.&lt;/p&gt;

&lt;p&gt;We run the linter.&lt;/p&gt;

&lt;p&gt;We run the tests.&lt;/p&gt;

&lt;p&gt;We run the type checker.&lt;/p&gt;

&lt;p&gt;We run CI.&lt;/p&gt;

&lt;p&gt;We use pre-commit hooks.&lt;/p&gt;

&lt;p&gt;The model can generate code, but it does not get to decide whether the code passes.&lt;/p&gt;

&lt;p&gt;The gate decides.&lt;/p&gt;

&lt;p&gt;That is the important shift.&lt;/p&gt;

&lt;p&gt;If a rule can be checked deterministically, it should not live only inside a prompt.&lt;/p&gt;

&lt;p&gt;For code, the gates are obvious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;formatting&lt;/li&gt;
&lt;li&gt;linting&lt;/li&gt;
&lt;li&gt;type checks&lt;/li&gt;
&lt;li&gt;unit tests&lt;/li&gt;
&lt;li&gt;integration tests&lt;/li&gt;
&lt;li&gt;golden files&lt;/li&gt;
&lt;li&gt;JSON schema validation&lt;/li&gt;
&lt;li&gt;pre-commit hooks&lt;/li&gt;
&lt;li&gt;CI checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent can still do useful work. It can draft the patch, explain a tradeoff, write a test, debug a failure, or propose a smaller path.&lt;/p&gt;

&lt;p&gt;But the agent should not be the final authority on whether the work satisfies the standard.&lt;/p&gt;

&lt;p&gt;That authority should be outside the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  This Applies to Content Too
&lt;/h2&gt;

&lt;p&gt;Content feels less deterministic than code, so people keep more of the workflow inside the prompt.&lt;/p&gt;

&lt;p&gt;But the same principle applies.&lt;/p&gt;

&lt;p&gt;If you have found a content formula that works, do not just ask the model to remember it.&lt;/p&gt;

&lt;p&gt;Turn it into gates.&lt;/p&gt;

&lt;p&gt;For example, before publishing an article:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the title match the actual promise of the article?&lt;/li&gt;
&lt;li&gt;Does the first section create tension quickly?&lt;/li&gt;
&lt;li&gt;Is the reader obvious?&lt;/li&gt;
&lt;li&gt;Is there one clear argument?&lt;/li&gt;
&lt;li&gt;Does every section move the argument forward?&lt;/li&gt;
&lt;li&gt;Are there generic AI phrases that should be removed?&lt;/li&gt;
&lt;li&gt;Does the article contain a real observation or only recycled advice?&lt;/li&gt;
&lt;li&gt;Is there a concrete next action for the reader?&lt;/li&gt;
&lt;li&gt;Does the tag strategy match the article, not just the broad topic?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These checks are not perfect.&lt;/p&gt;

&lt;p&gt;But they are better than "make it good."&lt;/p&gt;

&lt;p&gt;"Make it good" is a vibe.&lt;/p&gt;

&lt;p&gt;A gate is a standard.&lt;/p&gt;

&lt;p&gt;The more often a workflow matters, the more it deserves standards that do not depend on the model's mood.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 20% LLM Workflow
&lt;/h2&gt;

&lt;p&gt;The mature version of an AI workflow is not:&lt;/p&gt;

&lt;p&gt;"The LLM does everything."&lt;/p&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;p&gt;"The LLM does the part where ambiguity is useful. The system handles the rest."&lt;/p&gt;

&lt;p&gt;That might mean the model only does 20% of the workflow.&lt;/p&gt;

&lt;p&gt;And that is a good thing.&lt;/p&gt;

&lt;p&gt;For a content workflow, the LLM might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;propose angles&lt;/li&gt;
&lt;li&gt;compare possible hooks&lt;/li&gt;
&lt;li&gt;draft sections&lt;/li&gt;
&lt;li&gt;rewrite unclear paragraphs&lt;/li&gt;
&lt;li&gt;find contradictions&lt;/li&gt;
&lt;li&gt;suggest examples&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the system should handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;loading the source notes&lt;/li&gt;
&lt;li&gt;selecting the target platform&lt;/li&gt;
&lt;li&gt;applying the tag strategy&lt;/li&gt;
&lt;li&gt;checking title/promise match&lt;/li&gt;
&lt;li&gt;scanning for banned phrases&lt;/li&gt;
&lt;li&gt;verifying links&lt;/li&gt;
&lt;li&gt;enforcing formatting&lt;/li&gt;
&lt;li&gt;creating the publishing task&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a coding workflow, the LLM might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;inspect the codebase&lt;/li&gt;
&lt;li&gt;implement the change&lt;/li&gt;
&lt;li&gt;write tests&lt;/li&gt;
&lt;li&gt;explain a failure&lt;/li&gt;
&lt;li&gt;reduce a patch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the system should handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;formatting&lt;/li&gt;
&lt;li&gt;linting&lt;/li&gt;
&lt;li&gt;type checking&lt;/li&gt;
&lt;li&gt;test execution&lt;/li&gt;
&lt;li&gt;schema validation&lt;/li&gt;
&lt;li&gt;contract checks&lt;/li&gt;
&lt;li&gt;CI gates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not using AI less because AI is weak.&lt;/p&gt;

&lt;p&gt;It is using AI less because the workflow is becoming stronger.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LLM Should Be the Escalation Path, Not the Reflex
&lt;/h2&gt;

&lt;p&gt;There is a simple rule I keep coming back to:&lt;/p&gt;

&lt;p&gt;If code can check it, do not ask the model to remember it.&lt;/p&gt;

&lt;p&gt;If a script can clean it, do not spend tokens reasoning about it.&lt;/p&gt;

&lt;p&gt;If a test can catch it, do not rely on a sentence in a prompt.&lt;/p&gt;

&lt;p&gt;The LLM should be used when ambiguity remains.&lt;/p&gt;

&lt;p&gt;It should not be the first line of defense for things that are already measurable.&lt;/p&gt;

&lt;p&gt;This is where many AI workflows waste effort. They ask the model to handle everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;classify the task&lt;/li&gt;
&lt;li&gt;remember the rules&lt;/li&gt;
&lt;li&gt;generate the output&lt;/li&gt;
&lt;li&gt;check the output&lt;/li&gt;
&lt;li&gt;decide whether the output is done&lt;/li&gt;
&lt;li&gt;explain why it is done&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is too much responsibility in one probabilistic step.&lt;/p&gt;

&lt;p&gt;Split the work.&lt;/p&gt;

&lt;p&gt;Let the model handle the ambiguous part.&lt;/p&gt;

&lt;p&gt;Let deterministic systems handle the stable part.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Self-Check for Your Own AI Workflow
&lt;/h2&gt;

&lt;p&gt;If you want to know where your workflow is immature, ask these questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What do I keep explaining to the model again and again?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That probably belongs in a skill.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What does the model keep judging by itself?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That probably belongs in a gate.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What failure would be obvious to a script, linter, test, schema, or checklist?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That should not live only in the prompt.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If I removed the LLM tomorrow, which parts of the workflow would still be clear?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Those parts are real process.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Which parts only work because the model is being generous?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Those parts are risk.&lt;/p&gt;

&lt;p&gt;This self-check is uncomfortable because it reveals how much "automation" is just trust in a model call.&lt;/p&gt;

&lt;p&gt;But that is the point.&lt;/p&gt;

&lt;p&gt;Capability is not how much work you can hand to the model.&lt;/p&gt;

&lt;p&gt;Capability is how much of the workflow still holds when the model is only doing the part it is actually good at.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Maturity Curve
&lt;/h2&gt;

&lt;p&gt;The pattern looks like this:&lt;/p&gt;

&lt;p&gt;Prompt -&amp;gt; Skill -&amp;gt; Gate -&amp;gt; System&lt;/p&gt;

&lt;p&gt;Or:&lt;/p&gt;

&lt;p&gt;Ask -&amp;gt; Package -&amp;gt; Validate -&amp;gt; Automate&lt;/p&gt;

&lt;p&gt;When the task is new, prompt.&lt;/p&gt;

&lt;p&gt;When the task repeats, create a skill.&lt;/p&gt;

&lt;p&gt;When the skill succeeds, move the stable checks into gates.&lt;/p&gt;

&lt;p&gt;When the gates are stable, reduce the LLM's responsibility.&lt;/p&gt;

&lt;p&gt;This is the path from beginner prompting to a 20% LLM workflow.&lt;/p&gt;

&lt;p&gt;It does not make the model irrelevant.&lt;/p&gt;

&lt;p&gt;It puts the model in the right place.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Wrong Goal Is Better Prompting Forever
&lt;/h2&gt;

&lt;p&gt;There is a lot of advice about better prompts.&lt;/p&gt;

&lt;p&gt;Some of it is useful.&lt;/p&gt;

&lt;p&gt;But better prompting is not the destination.&lt;/p&gt;

&lt;p&gt;Better prompting helps you discover the workflow.&lt;/p&gt;

&lt;p&gt;Skills help you repeat the workflow.&lt;/p&gt;

&lt;p&gt;Gates help you trust the workflow.&lt;/p&gt;

&lt;p&gt;Systems help you scale the workflow.&lt;/p&gt;

&lt;p&gt;If you stop at prompting, every task stays a negotiation.&lt;/p&gt;

&lt;p&gt;If you stop at skills, every process still depends on the model interpreting the skill correctly.&lt;/p&gt;

&lt;p&gt;If you add gates, the model has something it must pass.&lt;/p&gt;

&lt;p&gt;That is the difference between a helpful assistant and a reliable workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Useful Question
&lt;/h2&gt;

&lt;p&gt;The useful question is no longer:&lt;/p&gt;

&lt;p&gt;"How do I write a better prompt?"&lt;/p&gt;

&lt;p&gt;The useful question is:&lt;/p&gt;

&lt;p&gt;"Which part of this should stop being a prompt?"&lt;/p&gt;

&lt;p&gt;If it is repeated context, make it a skill.&lt;/p&gt;

&lt;p&gt;If it is a stable rule, make it a checklist.&lt;/p&gt;

&lt;p&gt;If it is measurable, make it a test.&lt;/p&gt;

&lt;p&gt;If it is non-negotiable, make it a gate.&lt;/p&gt;

&lt;p&gt;Let the LLM handle ambiguity.&lt;/p&gt;

&lt;p&gt;Make the system handle standards.&lt;/p&gt;

&lt;p&gt;Use prompts to discover.&lt;/p&gt;

&lt;p&gt;Use skills to repeat.&lt;/p&gt;

&lt;p&gt;Use gates to scale.&lt;/p&gt;

&lt;p&gt;And when the workflow is mature, let the LLM do less.&lt;/p&gt;

&lt;p&gt;That is not a downgrade.&lt;/p&gt;

&lt;p&gt;That is capability becoming real.&lt;/p&gt;

&lt;p&gt;Which part of your AI workflow should stop being a prompt?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>testing</category>
    </item>
    <item>
      <title>Pure semantic search missed 4 of 5 of my agent queries. Hybrid + parallel fan-out fixed it.</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Sat, 02 May 2026 14:13:04 +0000</pubDate>
      <link>https://forem.com/reneza/agentic-knowledge-base-karpathys-llm-wiki-with-adapters-593n</link>
      <guid>https://forem.com/reneza/agentic-knowledge-base-karpathys-llm-wiki-with-adapters-593n</guid>
      <description>&lt;p&gt;When Karpathy's &lt;a href="https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f" rel="noopener noreferrer"&gt;LLM Wiki&lt;/a&gt; post landed, I already had semantic search over my TickTick — qdrant for the vector store, nomic-embed-text via ollama for embeddings, a daily cron to keep the index fresh, the works. The agent-side retrieval wasn't the missing piece.&lt;/p&gt;

&lt;p&gt;What was missing was the &lt;em&gt;structure&lt;/em&gt;. Karpathy's framing — designate a wiki, write notes for an LLM reader, lean on retrieval instead of taxonomy — surfaced the parts of my setup that didn't have shape yet: where durable knowledge lives versus ephemeral tasks, how agents pull structured data out of notes humans wrote, why my existing semantic search sometimes returned the right answer and sometimes returned nothing useful.&lt;/p&gt;

&lt;p&gt;I almost migrated to plain markdown anyway. Thousands of durable notes — production playbooks, API quirks, decisions I want to survive next month's task list — already live in TickTick. They sync to my phone. Capture friction is zero. Migrating breaks all of that.&lt;/p&gt;

&lt;p&gt;So I built the wiki structure on top of TickTick, and made the storage layer swappable. &lt;strong&gt;The retrieval, the wiki conventions, the agent-data note pattern, the bench harness — none of those are TickTick-specific.&lt;/strong&gt; They're a small framework. You point it at TickTick / Notion / Obsidian / Things / a folder of markdown / whatever you've already invested years of capture habit into.&lt;/p&gt;

&lt;p&gt;I'm calling it &lt;strong&gt;Agentic Knowledge Base&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The framework, in one diagram
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  ┌───────────────────────────────────┐
                  │  Agent (Claude / scripts / cron)  │
                  └─────────────────┬─────────────────┘
                                    │ akb find / get / url / links
                                    ▼
                  ┌───────────────────────────────────┐
                  │  AKB Core                         │
                  │  • parallel retrieval + RRF       │
                  │  • corpus cache (5 min TTL)       │
                  │  • bench harness                  │
                  │  • usage logging                  │
                  └─────────────────┬─────────────────┘
                                    │ adapter interface
                  ┌─────────────────┼──────────────────┐
                  ▼                 ▼                  ▼
       ┌────────────────┐  ┌────────────────┐  ┌────────────────┐
       │ adapter-       │  │ adapter-       │  │ adapter-       │
       │ ticktick       │  │ obsidian       │  │ notion         │
       │  (reference)   │  │  (filesystem)  │  │  (your turn)   │
       └────────────────┘  └────────────────┘  └────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Core is storage-agnostic. The retrieval, the cache, the bench, the usage logger — none of them know what TickTick is. They call a small adapter interface (~6 methods).&lt;/p&gt;

&lt;p&gt;Karpathy's setup, in this framing, is the &lt;strong&gt;filesystem adapter&lt;/strong&gt; of a broader pattern. Mine is the &lt;strong&gt;TickTick adapter&lt;/strong&gt;. Yours might be the Notion or Obsidian one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The adapter interface
&lt;/h2&gt;

&lt;p&gt;Six methods. Two payload shapes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;KnowledgeAdapter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;listProjects&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Project&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nf"&gt;listTasksInProject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nf"&gt;getTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Task&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nf"&gt;createTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TaskInput&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Task&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nf"&gt;updateTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TaskPatch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Task&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nf"&gt;urlFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;taskId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;  &lt;span class="c1"&gt;// deep-link string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Project&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tasks&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;notes&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Task&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="nx"&gt;dueDate&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;modifiedTime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Anything you can list, get, and link to — task systems, note apps, plain folders — can be an adapter.&lt;/p&gt;

&lt;p&gt;If your storage exposes a native search endpoint, your adapter can implement an optional &lt;code&gt;searchByQuery(query)&lt;/code&gt; and the core will use it as one branch of the parallel retrieval. If not, the core falls back to its own keyword scan against the corpus.&lt;/p&gt;

&lt;p&gt;That's the whole interface. Everything interesting is in the Core.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two patterns the Core implements (worth stealing)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Agent-data notes
&lt;/h3&gt;

&lt;p&gt;A regular note whose body has a fenced JSON (or YAML) block. Humans read the prose at the top. Agents extract the JSON via the adapter.&lt;/p&gt;

&lt;p&gt;The note's content looks like this — prose first, then a single fenced JSON block:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Type:&lt;/strong&gt; agent-data&lt;br&gt;
&lt;strong&gt;Consumed by:&lt;/strong&gt; EOD triage cron, capture-time relevance enrichment&lt;/p&gt;

&lt;p&gt;A "trunk" is an active project the user cares about. Edit this list when projects launch, finish, or shift focus.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"trunks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"release-engineering"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"desc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"shipping cadence, deployment rituals, on-call rotation"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"writing-projects"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"desc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"drafts and edits across personal and client channels"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read it from any cron or agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;akb get &lt;span class="s2"&gt;"Trunk Catalog"&lt;/span&gt; &lt;span class="nt"&gt;--extract&lt;/span&gt; json | jq &lt;span class="s1"&gt;'.trunks[].name'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The benefit: one note, mobile-editable in your existing app, consumed by agents as structured data. &lt;strong&gt;Single source of truth, no schema migration.&lt;/strong&gt; This pattern works for anything an agent needs programmatically and a human needs to edit on the move: prompt templates, character locks for video projects, recurring queries, cron config.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Parallel retrieval with provenance
&lt;/h3&gt;

&lt;p&gt;Three retrievers run in parallel against a shared cached corpus, results are RRF-fused, and the top-K come back tagged with which retrievers agreed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid&lt;/strong&gt; — dense cosine (qdrant + nomic-embed) + sparse keyword, internally RRF'd&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keyword&lt;/strong&gt; — substring match on title + content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notes-find&lt;/strong&gt; — title-fuzzy on a designated wiki project&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a query like &lt;code&gt;openrouter api key&lt;/code&gt;, all three retrievers return the same gold note. The fused result tags it &lt;code&gt;sources: [hybrid, keyword, notes_find]&lt;/code&gt; — three independent signals agreeing means high confidence. Lower-ranked results have only one source — look at them with skepticism.&lt;/p&gt;

&lt;p&gt;For a query like &lt;code&gt;ffmpeg commands&lt;/code&gt;, the keyword tool misses (the literal phrase isn't in any document). Pure semantic misses too (nomic-embed underweights short titles like &lt;code&gt;ffmpeg&lt;/code&gt;). Hybrid catches it. The fan-out gracefully handles the asymmetry — different queries lean on different retrievers, and the core doesn't pretend any single algorithm is universally best.&lt;/p&gt;

&lt;p&gt;A 5-min disk-backed corpus cache means warm queries are sub-100ms. The first call after a cold start fetches your full task/note list (one batch — adapters that support it use a single API call; adapters that don't fall back to per-project iteration). Within a working session, retrieval is essentially free.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bench
&lt;/h2&gt;

&lt;p&gt;I built a small harness in &lt;code&gt;bench/&lt;/code&gt;. Questions paired with gold answers (the task or note that actually contains the answer). Each retriever runs against the same questions, results scored by hit@1 / recall@5 / MRR.&lt;/p&gt;

&lt;p&gt;Five agent-issued queries (the rephrased version Opus 4.7 actually generates, not the natural-language form a human types):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;hit@1&lt;/th&gt;
&lt;th&gt;recall@5&lt;/th&gt;
&lt;th&gt;MRR&lt;/th&gt;
&lt;th&gt;warm latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;keyword (substring)&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;0.20&lt;/td&gt;
&lt;td&gt;&amp;lt;100ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;semantic (dense only)&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;td&gt;~300ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hybrid (dense + sparse RRF)&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;0.70&lt;/td&gt;
&lt;td&gt;~500ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;find&lt;/strong&gt; (parallel + cache)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;60%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.70&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~93ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The headline number for agent retrieval is &lt;code&gt;recall@5 = 80%&lt;/code&gt; — the right doc lands in the top five 4 times out of 5. Agents read top-K, not just rank 1, so recall@5 is the metric that actually predicts whether the agent gets the context it needs. Top-1 (60%) is a stricter cut and a leading indicator for "did the first guess work" — useful but not the bar. The benchmark won't generalize from five questions — grow it as confidence in a particular adapter accumulates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I optimized for the model, not for me
&lt;/h2&gt;

&lt;p&gt;There's a subtle reframe that took an embarrassing number of iterations to land.&lt;/p&gt;

&lt;p&gt;When &lt;em&gt;I&lt;/em&gt; use search, I type a single word: &lt;code&gt;ffmpeg&lt;/code&gt;. The keyword tool returns the right note instantly.&lt;/p&gt;

&lt;p&gt;When &lt;em&gt;Claude&lt;/em&gt; uses search on my behalf — "where did I document my ffmpeg workflow?" — it issues something like &lt;code&gt;find "What ffmpeg commands do I have notes on?"&lt;/code&gt;. Different shape entirely. The model writes longer queries. It uses question phrasing. It includes scope words.&lt;/p&gt;

&lt;p&gt;Optimizing for human queries was the wrong objective. The user (me) wasn't using these tools — Claude was. Every retrieval test had to be written in the form Opus 4.7 actually generates, not how I'd type it. That changes which retriever wins.&lt;/p&gt;

&lt;p&gt;Tomorrow's model writes queries differently. The benchmark needs to track &lt;em&gt;the model in use&lt;/em&gt;, not a fixed assumption about query shape. The bench file is short and dated; re-tune when the model changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I deliberately didn't build (yet)
&lt;/h2&gt;

&lt;p&gt;Karpathy's wiki post mentions periodically updating notes when facts change — propagating new information across the knowledge base. Useful at scale; auto-rewriting notes is high-blast-radius and needs an approval ramp before it's trustworthy. I sketched it: a weekly cron that semantic-searches for affected notes, drafts updates, queues them for my approval, applies the approved ones. Deferred.&lt;/p&gt;

&lt;p&gt;Same call on a "lint the wiki" pass (Karpathy idea: agent reads every note weekly, flags missing summaries, dangling references, contradictions). Useful at scale; premature when the wiki itself is still under construction.&lt;/p&gt;

&lt;p&gt;Both will live in Core when they ship — adapter-agnostic by design.&lt;/p&gt;

&lt;h2&gt;
  
  
  A daily flow (my setup, your tools optional)
&lt;/h2&gt;

&lt;p&gt;This is what runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capture (mobile, manual).&lt;/strong&gt; I add a task or note in my storage app. No CLI involved. The friction has to be zero.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capture-time relevance prompt (when in a Claude session).&lt;/strong&gt; &lt;code&gt;akb create "..." --relevance&lt;/code&gt; appends a small instruction block to the result. Active Claude reads it, picks a project trunk, calls &lt;code&gt;akb update&lt;/code&gt; to append a &lt;code&gt;why: &amp;lt;trunk&amp;gt; — &amp;lt;reason&amp;gt;&lt;/code&gt; line. Five seconds of LLM-side reasoning makes that task much more retrievable later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EOD triage (cron, daily morning).&lt;/strong&gt; Pulls yesterday's completed tasks, scores them 0–3 against the trunks (read live from a &lt;code&gt;Trunk Catalog&lt;/code&gt; agent-data note), sends a Telegram message with keepers grouped by trunk. I read it on my phone with breakfast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval (during work, all surfaces).&lt;/strong&gt; When Claude needs context — &lt;code&gt;akb find &amp;lt;query&amp;gt;&lt;/code&gt; returns top-K with provenance. Cached, parallel, sub-100ms warm.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Swap "TickTick app" for "Notion / Obsidian / Things" and the flow is identical. The adapter changes, the daily ritual doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Roadmap
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v0.1&lt;/strong&gt; — Core + reference TickTick adapter + bench (where I am today)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.2&lt;/strong&gt; — Filesystem adapter (Karpathy-style local markdown). Probably one weekend's work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.3&lt;/strong&gt; — Notion adapter (community contribution most likely)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.4&lt;/strong&gt; — Lint pass + fact-propagation queue with approval gate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.5&lt;/strong&gt; — Adapter for Apple Notes / Things / iA Writer (Mac-native captures)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Repo: &lt;a href="https://github.com/renezander030/agentic-knowledge-base" rel="noopener noreferrer"&gt;github.com/renezander030/agentic-knowledge-base&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;One-page summary: &lt;a href="https://gist.github.com/renezander030/c7bd6d5c4088e24d3add043720284453" rel="noopener noreferrer"&gt;gist.github.com/renezander030/c7bd6d5c4088e24d3add043720284453&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Karpathy's wiki idea is right. The implementation that fits an existing system isn't a folder of markdown — it's the agent-side primitives that turn whatever you already have into something the model can reason over.&lt;/p&gt;

&lt;p&gt;If you write your own adapter, I want to see it.&lt;/p&gt;

&lt;p&gt;—&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Posted from &lt;a href="https://renezander.com/blog/agentic-knowledge-base/" rel="noopener noreferrer"&gt;https://renezander.com/blog/agentic-knowledge-base/&lt;/a&gt;. Source at &lt;a href="https://github.com/renezander030/agentic-knowledge-base" rel="noopener noreferrer"&gt;https://github.com/renezander030/agentic-knowledge-base&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>rag</category>
      <category>agents</category>
      <category>ai</category>
    </item>
    <item>
      <title>What Anthropic's April 23 Postmortem Reveals About Your Agent Harness</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Thu, 30 Apr 2026 14:19:56 +0000</pubDate>
      <link>https://forem.com/reneza/what-anthropics-april-23-postmortem-reveals-about-your-agent-harness-4e5i</link>
      <guid>https://forem.com/reneza/what-anthropics-april-23-postmortem-reveals-about-your-agent-harness-4e5i</guid>
      <description>&lt;p&gt;The April 23 Claude Code postmortem dropped last week. Three bugs, two months of degraded output, one usage-limit reset for every Pro subscriber.&lt;/p&gt;

&lt;p&gt;I read it twice. The second time I started writing notes for my own agent harness.&lt;/p&gt;

&lt;p&gt;It is unusually candid for a company at this scale, and it reads like a checklist of failure modes any team running production AI agents will eventually hit. Worth treating as a free engineering review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defaults that nobody can see
&lt;/h2&gt;

&lt;p&gt;On March 4, the default reasoning effort dropped from high to medium. The reason was real. High mode was freezing the UI for some users. The fix was reasonable. The interesting bit: it shipped without an operator-visible knob, and quality regressed for a month before users complained loud enough.&lt;/p&gt;

&lt;p&gt;Open question for your harness: how many silent defaults does it have? Temperature 0.7 because that was the framework default in 2024. Top-p 1.0 because nobody touched it. Max tokens 4096 because somebody picked the number once. Each of these is a quality lever. Which ones are worth surfacing in your dashboard?&lt;/p&gt;

&lt;p&gt;A line worth saving from the postmortem: "users told us they'd prefer higher intelligence and opt into lower effort for simple tasks." Defaults can optimize for quality, with cost concerns as opt-in rather than opt-out.&lt;/p&gt;

&lt;h2&gt;
  
  
  A cache rule that ate the working memory
&lt;/h2&gt;

&lt;p&gt;On March 26 they shipped a thinking-cache clearing rule. Intent: clear reasoning history once after a session sits idle for more than an hour. Bug: it cleared on every turn for the rest of the session. Sessions felt forgetful. Tool choices got weird. Usage limits depleted faster because the model was rebuilding context every turn.&lt;/p&gt;

&lt;p&gt;I have shipped this exact bug. Different system, same shape. A "small optimization" to a caching layer that turned every cache lookup into a miss. Cost went up 4x for two days before alerting caught it.&lt;/p&gt;

&lt;p&gt;Useful question to bring to your team: do our caching tests cover multi-turn behavior, or only single-call hit/miss? Most teams I have asked answer "single call only". Surfacing that gap costs an afternoon and saves a quarter.&lt;/p&gt;

&lt;h2&gt;
  
  
  A 25-word cap that cost 3% intelligence
&lt;/h2&gt;

&lt;p&gt;On April 16 they added a system prompt: limit text between tool calls to 25 words, final responses to 100. The intent was cleaning up verbose narration. After ablation testing, they measured a 3% intelligence drop on coding tasks and reverted four days later.&lt;/p&gt;

&lt;p&gt;Three percent doesn't sound like much, which is the part that stays with me. A prompt change hurting quality by 3% is invisible to anyone not running ablations. How many of us are? The honest answer in most rooms I sit in: not many.&lt;/p&gt;

&lt;p&gt;Worth asking out loud: if you change a system prompt today, what catches a 3% regression?&lt;/p&gt;

&lt;h2&gt;
  
  
  What two of three tells you
&lt;/h2&gt;

&lt;p&gt;Of the three bugs, two were silent until users yelled. The third was visible only after dedicated ablation testing. That ratio is the most interesting line in the whole postmortem.&lt;/p&gt;

&lt;p&gt;I run six production agents. I have eval coverage on three. The other three I monitor with output sampling and gut feel. That setup is probably close to median for the industry.&lt;/p&gt;

&lt;p&gt;The postmortem hands you a free checklist anyway. Default knobs visible to operators. Cache-hit rate tracked across multi-turn conversations. System prompts gated by eval ablations. Three failure modes, each one a useful question to ask your own setup.&lt;/p&gt;

&lt;p&gt;Did you check your harness this week?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>agents</category>
      <category>production</category>
      <category>ai</category>
    </item>
    <item>
      <title>Claude Code with Local LLMs and ANTHROPIC_BASE_URL: Ollama, LM Studio, llama.cpp, vLLM</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Wed, 29 Apr 2026 05:53:48 +0000</pubDate>
      <link>https://forem.com/reneza/claude-code-with-local-llms-and-anthropicbaseurl-ollama-lm-studio-llamacpp-vllm-1g6j</link>
      <guid>https://forem.com/reneza/claude-code-with-local-llms-and-anthropicbaseurl-ollama-lm-studio-llamacpp-vllm-1g6j</guid>
      <description>&lt;p&gt;&lt;em&gt;Native Anthropic endpoints, tool-call compatibility, and context-window sizing for local Claude Code.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Last tested: April 2026. See Changelog at the bottom.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR cheat sheet
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Air&lt;/td&gt;
&lt;td&gt;Gemma 4 26B-A4B Q4, &lt;strong&gt;32K context&lt;/strong&gt;, LM Studio or Ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Pro&lt;/td&gt;
&lt;td&gt;Gemma 4 26B-A4B Q4 / UD-Q4, &lt;strong&gt;64K context&lt;/strong&gt;, llama.cpp or LM Studio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code minimum&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;32K context&lt;/strong&gt; (anything below is a chat demo)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best local backend&lt;/td&gt;
&lt;td&gt;LM Studio or Ollama first; llama.cpp for advanced; vLLM for servers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avoid&lt;/td&gt;
&lt;td&gt;8K / 16K context, dense 31B Gemma 4 on 32 GB machines, old llama.cpp builds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The local-Claude-Code rule of thumb
&lt;/h2&gt;

&lt;p&gt;Three things decide whether a local Claude Code session works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model quality&lt;/strong&gt; decides whether the answer is smart.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool-call formatting&lt;/strong&gt; decides whether Claude Code can act on the answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context length&lt;/strong&gt; decides whether the session survives past the first few edits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For local coding agents: &lt;strong&gt;32K is the floor. 64K is the sweet spot.&lt;/strong&gt; Anything below 32K is a chat demo, not Claude Code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommended setup
&lt;/h2&gt;

&lt;p&gt;Use this first. Don't shop the buffet of alternatives until you've tried this one.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backend:&lt;/strong&gt; LM Studio (≥ 0.4.1) or Ollama (≥ v0.14.0) — both expose a native &lt;strong&gt;Anthropic compatible local endpoint&lt;/strong&gt;, no proxy needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model:&lt;/strong&gt; &lt;code&gt;gemma4:26b-a4b&lt;/code&gt; (Gemma 4 26B-A4B-it, Q4 quant). MoE active-param ≈ 3.88 B → laptop-friendly latency, tool-use trained directly into the model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context:&lt;/strong&gt; &lt;strong&gt;32K context&lt;/strong&gt; on a MacBook Air, &lt;strong&gt;64K context&lt;/strong&gt; on a MacBook Pro M5 Pro/Max with 48 GB+ RAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine:&lt;/strong&gt; 32 GB+ RAM strongly preferred. 24 GB works at 24K–32K with care.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you don't have Anthropic-compatible mode and only have an &lt;strong&gt;OpenAI compatible local endpoint&lt;/strong&gt; running, run LiteLLM in front (see section on LiteLLM).&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Environment variables Claude Code reads
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Where Claude Code POSTs requests. Default: https://api.anthropic.com&lt;/span&gt;
&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:11434

&lt;span class="c"&gt;# Sent as auth. Local servers usually accept any non-empty value.&lt;/span&gt;
&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ollama

&lt;span class="c"&gt;# Map Claude Code's "claude-opus-X-Y" / "claude-sonnet-X-Y" / "claude-haiku-X-Y"&lt;/span&gt;
&lt;span class="c"&gt;# to model names your local backend serves.&lt;/span&gt;
&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_OPUS_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemma4:26b-a4b
&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_SONNET_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemma4:26b-a4b
&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_HAIKU_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gpt-oss:20b

claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or override per-invocation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--model&lt;/span&gt; gemma4:26b-a4b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; is set but the URL doesn't respond with the right shape, Claude Code does not fall back to the cloud. It errors out.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Context length: the hidden failure mode
&lt;/h2&gt;

&lt;p&gt;Claude Code is not a chat prompt. Before your actual request, the backend sees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Code's system prompt (~6–10K tokens by itself)&lt;/li&gt;
&lt;li&gt;tool definitions for &lt;code&gt;Read&lt;/code&gt; / &lt;code&gt;Edit&lt;/code&gt; / &lt;code&gt;Bash&lt;/code&gt; / &lt;code&gt;Grep&lt;/code&gt; / &lt;code&gt;Glob&lt;/code&gt; / &lt;code&gt;TodoWrite&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;conversation history&lt;/li&gt;
&lt;li&gt;file excerpts and full reads&lt;/li&gt;
&lt;li&gt;diffs&lt;/li&gt;
&lt;li&gt;command output&lt;/li&gt;
&lt;li&gt;retry/error messages from failed tool calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means &lt;strong&gt;8K and 16K contexts are misleading tests.&lt;/strong&gt; They may answer a chat question, but they are not enough for reliable agentic coding. The session survives a handful of turns, then silently degrades — file edits truncate, tool calls drop arguments, the loop gets confused.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical context tiers
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;th&gt;What happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;Broken for Claude Code&lt;/td&gt;
&lt;td&gt;System prompt + tools eat the window before your code arrives. Chat-only.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16K&lt;/td&gt;
&lt;td&gt;Demo only&lt;/td&gt;
&lt;td&gt;Tiny edits, short sessions. Not a real test of any model.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25K&lt;/td&gt;
&lt;td&gt;LM Studio's stated minimum&lt;/td&gt;
&lt;td&gt;Good enough for small tasks if tool calls are reliable.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;32K&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Real minimum (32K context).&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ollama recommends this floor. Use as your default.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;64K&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Sweet spot (64K context).&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best balance on 32GB+ machines. Handles medium repos and multi-file edits.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128K+&lt;/td&gt;
&lt;td&gt;Diminishing returns&lt;/td&gt;
&lt;td&gt;Prefill latency and KV-cache memory rise hard. Worth it only on high-memory servers, and only for repo-wide reads.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Apple Silicon context presets
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;Recommended context&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Air M5, 16 GB&lt;/td&gt;
&lt;td&gt;16K–24K&lt;/td&gt;
&lt;td&gt;Use smaller models (≤8B). 26B-A4B is tight.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Air M5, 24 GB&lt;/td&gt;
&lt;td&gt;24K–32K&lt;/td&gt;
&lt;td&gt;32K is the target; keep other apps light.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Air M5, 32 GB&lt;/td&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;Best Air setup. Higher rarely beats thermal throttling.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Pro M5 Pro, 24 GB&lt;/td&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;Better sustained perf than Air at the same context.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Pro M5 Pro, 48/64 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;64K&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sweet spot for serious local coding.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Pro M5 Max, 64/128 GB&lt;/td&gt;
&lt;td&gt;64K default, 128K experimental&lt;/td&gt;
&lt;td&gt;Use 128K for repo-wide analysis, not every edit loop.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: backend docs differ — LM Studio says "start at 25K, increase for better results," Ollama recommends 32K. &lt;strong&gt;Use 32K as the cross-backend baseline.&lt;/strong&gt; Reading "25K" as "25K is enough" is the most common mistake.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  3. Claude Code Ollama setup (native, v0.14.0+)
&lt;/h2&gt;

&lt;p&gt;Ollama announced Anthropic Messages API compatibility on 2026-01-16. No proxy, no LiteLLM, no nothing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set context length first — this is the most important knob&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OLLAMA_CONTEXT_LENGTH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;32768   &lt;span class="c"&gt;# 65536 on a Pro&lt;/span&gt;

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ollama
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:11434

claude &lt;span class="nt"&gt;--model&lt;/span&gt; gemma4:26b-a4b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cloud-hosted Ollama models work too:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--model&lt;/span&gt; glm-4.7:cloud
claude &lt;span class="nt"&gt;--model&lt;/span&gt; minimax-m2.1:cloud
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Two known limits of Ollama's Anthropic-compat layer (April 2026):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No prompt caching.&lt;/strong&gt; Anthropic's &lt;code&gt;cache_control&lt;/code&gt; doesn't apply — every Claude Code request re-processes the system prompt and conversation history from scratch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;tool_choice&lt;/code&gt;.&lt;/strong&gt; Claude Code occasionally uses &lt;code&gt;tool_choice&lt;/code&gt; to force a specific tool call. Ollama's compat layer ignores it. When it matters, Claude Code may pick the wrong tool and get stuck in a loop.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  4. Claude Code LM Studio setup (native, 0.4.1+)
&lt;/h2&gt;

&lt;p&gt;LM Studio added the Anthropic-compatible &lt;code&gt;/v1/messages&lt;/code&gt; endpoint on 2026-01-30. Streaming, tool calls, and message-shape are all supported natively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set context to at least 32K in the LM Studio UI (or higher; see section 2)&lt;/span&gt;
lms server start &lt;span class="nt"&gt;--port&lt;/span&gt; 1234

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:1234
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lmstudio

claude &lt;span class="nt"&gt;--model&lt;/span&gt; openai/gpt-oss-20b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For VS Code with the Claude Code extension (env vars from your shell are NOT inherited by VS Code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json-doc"&gt;&lt;code&gt;&lt;span class="c1"&gt;// .vscode/settings.json&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"claudeCode.environmentVariables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ANTHROPIC_BASE_URL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:1234"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ANTHROPIC_AUTH_TOKEN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lmstudio"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LM Studio's docs say "at least 25K." Set &lt;strong&gt;32K&lt;/strong&gt;. See section 2.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Claude Code llama.cpp setup (Apple Silicon fast path for Gemma 4 26B-A4B)
&lt;/h2&gt;

&lt;p&gt;If you're on Apple Silicon and want the absolute lowest overhead with Gemma 4 26B-A4B, llama.cpp's server is faster per-token than Ollama or LM Studio. You need a recent build (one that supports &lt;code&gt;-hf&lt;/code&gt; for HuggingFace pulls and &lt;code&gt;--jinja&lt;/code&gt; for chat templates).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-hf&lt;/span&gt; ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 127.0.0.1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 65536 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--jinja&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;llama-cpp
claude &lt;span class="nt"&gt;--model&lt;/span&gt; gemma-4-26B-A4B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flags that matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-c 65536&lt;/code&gt; sets &lt;strong&gt;64K context&lt;/strong&gt; (drop to &lt;code&gt;-c 32768&lt;/code&gt; on tighter machines).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-ngl 99&lt;/code&gt; offloads all layers to Metal/GPU.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--jinja&lt;/code&gt; is required for Gemma 4's chat template to render correctly. Without it, tool calls won't format and you'll see &lt;code&gt;&amp;lt;unused24&amp;gt;&lt;/code&gt; / &lt;code&gt;&amp;lt;unused49&amp;gt;&lt;/code&gt; tokens leaking into output.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M&lt;/code&gt; pulls the GGUF straight from HuggingFace.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Caveat:&lt;/strong&gt; llama.cpp's Anthropic-compat is &lt;strong&gt;partial.&lt;/strong&gt; Works for chat and basic tool calling. Streaming-shape and some Anthropic-specific request fields are rougher than Ollama or LM Studio. If something breaks weirdly, fall back to Ollama. llama.cpp is the speed play, not the compatibility play.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Claude Code vLLM setup (native + tool parser)
&lt;/h2&gt;

&lt;p&gt;vLLM ships an official Claude Code integration. Three things at server start: a tool-calling-capable model, &lt;code&gt;--enable-auto-tool-choice&lt;/code&gt;, and the right &lt;code&gt;--tool-call-parser&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve openai/gpt-oss-120b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--served-model-name&lt;/span&gt; my-model &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-auto-tool-choice&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tool-call-parser&lt;/span&gt; openai &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8000
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dummy
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dummy
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_OPUS_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-model
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_SONNET_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-model
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_HAIKU_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-model

claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--tool-call-parser&lt;/code&gt; value depends on the model family — &lt;code&gt;openai&lt;/code&gt; for the gpt-oss family, &lt;code&gt;llama3_json&lt;/code&gt; for Llama 3.x, &lt;code&gt;hermes&lt;/code&gt; for Hermes. Wrong parser → tool calls return as plain text and Claude Code's edit/grep/bash tools silently no-op.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. LiteLLM — for fallbacks, not for translation
&lt;/h2&gt;

&lt;p&gt;With Ollama, LM Studio, llama.cpp, and vLLM all speaking native Anthropic now, LiteLLM's role changes. It's no longer "the translator" — it's the router for &lt;strong&gt;fallbacks, request logging, per-tenant keys, and rate limits.&lt;/strong&gt; Also the right answer if your only local option is an &lt;strong&gt;OpenAI compatible local endpoint&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# litellm-config.yaml&lt;/span&gt;
&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/my-vllm-model&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://vllm:8000/v1&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama/gemma4:26b-a4b&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://ollama:11434&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-haiku-4-5&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-haiku-4-5&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/ANTHROPIC_API_KEY&lt;/span&gt;

&lt;span class="na"&gt;router_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-haiku-4-5"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# local fail → cloud Haiku&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The single biggest win: when a local tool call silently fails, LiteLLM falls back to cloud Haiku transparently. Claude Code keeps working.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Common failures (the error strings developers google)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;tool_use parse error&lt;/code&gt; / &lt;code&gt;invalid tool call&lt;/code&gt; / &lt;code&gt;tool_use is not supported&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Three different symptoms, one root cause: the model is not emitting Anthropic-format &lt;code&gt;tool_use&lt;/code&gt; content blocks.&lt;/p&gt;

&lt;p&gt;The most deceptive symptom is the silent one — Claude Code starts, prints the model's plain-prose answer ("I would change the file like this..."), and &lt;em&gt;nothing happens&lt;/em&gt;. No file edit, no error.&lt;/p&gt;

&lt;p&gt;Common causes (April 2026):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;vLLM:&lt;/strong&gt; missing &lt;code&gt;--enable-auto-tool-choice&lt;/code&gt; or wrong &lt;code&gt;--tool-call-parser&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama:&lt;/strong&gt; model that wasn't trained for tool calling (avoid stock &lt;code&gt;llama3.x&lt;/code&gt; instruct).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp:&lt;/strong&gt; missing &lt;code&gt;--jinja&lt;/code&gt;. The chat template renders incorrectly and you see literal &lt;code&gt;&amp;lt;unused24&amp;gt;&lt;/code&gt; / &lt;code&gt;&amp;lt;unused49&amp;gt;&lt;/code&gt; tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LM Studio:&lt;/strong&gt; model file is fine but the loaded preset uses the wrong template.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;context length exceeded&lt;/code&gt; / model stopped mid-edit
&lt;/h3&gt;

&lt;p&gt;Claude Code's prompts overflow the configured window. The session may finish a single turn, then truncate the next file edit silently. &lt;strong&gt;Fix: raise context to at least 32K.&lt;/strong&gt; If you're already at 32K and still hitting this, the model is reading too aggressively — drop to fewer tools or shorter file reads.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;empty assistant response&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Backend returned &lt;code&gt;200 OK&lt;/code&gt; with an empty content array. Causes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streaming SSE format mismatch (mostly llama.cpp).&lt;/li&gt;
&lt;li&gt;Tool-call parser swallowed the message because it couldn't parse it.&lt;/li&gt;
&lt;li&gt;Model emitted only a &lt;code&gt;&amp;lt;unused24&amp;gt;&lt;/code&gt; / &lt;code&gt;&amp;lt;unused49&amp;gt;&lt;/code&gt; token and the parser dropped the rest.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fix: switch backend (Ollama or LM Studio if you were on llama.cpp), or upgrade llama.cpp to a build with the patched Gemma 4 chat template.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;model not found&lt;/code&gt; / &lt;code&gt;404 the model X does not exist&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Claude Code asked for &lt;code&gt;claude-opus-4-7&lt;/code&gt; but the backend serves &lt;code&gt;gpt-oss:20b&lt;/code&gt; or &lt;code&gt;gemma4:26b-a4b&lt;/code&gt;. Fixes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set &lt;code&gt;ANTHROPIC_DEFAULT_OPUS_MODEL&lt;/code&gt; (plus &lt;code&gt;_SONNET_&lt;/code&gt; and &lt;code&gt;_HAIKU_&lt;/code&gt;) to the backend's actual model name.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;claude --model &amp;lt;backend-name&amp;gt;&lt;/code&gt; per call.&lt;/li&gt;
&lt;li&gt;Map the names in LiteLLM (the &lt;code&gt;model_name:&lt;/code&gt; field is what Claude Code asks for; &lt;code&gt;model:&lt;/code&gt; is what gets served).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;messages: Extra inputs are not permitted&lt;/code&gt; (HTTP 422)
&lt;/h3&gt;

&lt;p&gt;Some backends are stricter than Anthropic's own. They reject Anthropic-specific fields (&lt;code&gt;cache_control&lt;/code&gt;, &lt;code&gt;thinking&lt;/code&gt;, &lt;code&gt;tools[].input_schema&lt;/code&gt;, &lt;code&gt;metadata.user_id&lt;/code&gt;). Fix: upgrade the backend, or run a small middleware proxy that strips the unsupported fields before forwarding.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; ignored / Claude Code still calls the real API
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Env var was set in &lt;code&gt;.zshrc&lt;/code&gt; &lt;em&gt;after&lt;/em&gt; the shell session started — restart the terminal.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;~/.config/claude/config.json&lt;/code&gt; or a &lt;code&gt;--api-key&lt;/code&gt; flag is overriding the env var.&lt;/li&gt;
&lt;li&gt;VS Code: env vars from your shell are NOT inherited. Use &lt;code&gt;claudeCode.environmentVariables&lt;/code&gt; in workspace settings (section 4).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;echo $ANTHROPIC_BASE_URL&lt;/code&gt; inside the same shell that runs &lt;code&gt;claude&lt;/code&gt;. If empty, you have a sourcing problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Debug flow
&lt;/h2&gt;

&lt;p&gt;When something breaks, walk this tree before swapping backends:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Did the model load?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;No → check quant size vs RAM. 26B-A4B Q4 needs ~16 GB free; bigger quants need more.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the context at least 32K?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;No → raise to 32K (Air) or 64K (Pro). See section 2.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Are tool calls malformed?&lt;/strong&gt; (Look for &lt;code&gt;&amp;lt;unused24&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;unused49&amp;gt;&lt;/code&gt;, plain prose where you expected an edit.)

&lt;ul&gt;
&lt;li&gt;Yes → switch to native Anthropic mode (Ollama/LM Studio), or for vLLM verify &lt;code&gt;--tool-call-parser&lt;/code&gt;, or for llama.cpp add &lt;code&gt;--jinja&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does Claude Code stop mid-edit?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Yes → context exhaustion. Lower context targets in your tools, or use a faster quant so the model finishes turns before the window reuse cycle.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the model hallucinating files that don't exist?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Yes → the model isn't calling &lt;code&gt;Read&lt;/code&gt; before &lt;code&gt;Edit&lt;/code&gt;. Add a CLAUDE.md rule that requires reading before editing, or use a tool-finer model (Gemma 4 26B-A4B is solid here).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  10. Smoke test
&lt;/h2&gt;

&lt;p&gt;Verify your setup with one prompt. Ask Claude Code:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Create a small FastAPI app with one &lt;code&gt;/health&lt;/code&gt; endpoint, add a pytest test for it, run pytest, and fix any failures.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Passes if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It reads/writes files correctly (no hallucinated paths).&lt;/li&gt;
&lt;li&gt;It runs the test command (you see real &lt;code&gt;pytest&lt;/code&gt; output).&lt;/li&gt;
&lt;li&gt;It patches a failure (e.g. missing dependency) without losing context.&lt;/li&gt;
&lt;li&gt;It does not lose tool-call format (no &lt;code&gt;&amp;lt;unused24&amp;gt;&lt;/code&gt; / &lt;code&gt;&amp;lt;unused49&amp;gt;&lt;/code&gt; leakage).&lt;/li&gt;
&lt;li&gt;It does not truncate after the first edit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Expected terminal feel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✓ model loaded     (gemma4:26b-a4b, Q4_K_M)
✓ context: 32768
✓ tool call parsed (Edit)
✓ edited file      (app.py)
✓ tool call parsed (Bash)
✓ tests passed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you don't see all five, walk the debug flow above.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Compatibility matrix (April 2026)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Native Anthropic API&lt;/th&gt;
&lt;th&gt;Tool calls&lt;/th&gt;
&lt;th&gt;Context floor&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ollama (≥ v0.14.0)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Depends on model&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;32K context&lt;/strong&gt; (cross-backend baseline)&lt;/td&gt;
&lt;td&gt;Easiest setup. No prompt caching, no &lt;code&gt;tool_choice&lt;/code&gt; (see section 3).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LM Studio (≥ 0.4.1)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (out of the box)&lt;/td&gt;
&lt;td&gt;Stated 25K, &lt;strong&gt;use 32K&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Streaming + &lt;code&gt;tool_use&lt;/code&gt; blocks supported natively. VS Code extension takes workspace env vars.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;llama.cpp server&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Yes with &lt;code&gt;--jinja&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;32K&lt;/strong&gt;, &lt;strong&gt;64K context&lt;/strong&gt; on Pro&lt;/td&gt;
&lt;td&gt;Lowest overhead on Apple Silicon. Rougher Anthropic-compat. Best path for Gemma 4 26B-A4B.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vLLM&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes with &lt;code&gt;--enable-auto-tool-choice&lt;/code&gt; + correct parser&lt;/td&gt;
&lt;td&gt;Model-dependent&lt;/td&gt;
&lt;td&gt;Best throughput. Requires correct parser per model family.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiteLLM&lt;/td&gt;
&lt;td&gt;Routes to any backend&lt;/td&gt;
&lt;td&gt;Whatever the backend supports&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;Use for fallbacks and logging, or to wrap an OpenAI compatible local endpoint as Anthropic.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Direct Ollama &amp;lt; v0.14.0&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;Upgrade.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  12. Hardware × model × context × backend (the cheat-sheet table)
&lt;/h2&gt;

&lt;p&gt;A developer should not have to infer what to use:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Air M5, 16 GB&lt;/td&gt;
&lt;td&gt;Gemma 4 E4B&lt;/td&gt;
&lt;td&gt;16K–24K&lt;/td&gt;
&lt;td&gt;LM Studio&lt;/td&gt;
&lt;td&gt;usable for small tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Air M5, 24 GB&lt;/td&gt;
&lt;td&gt;Gemma 4 26B-A4B Q4&lt;/td&gt;
&lt;td&gt;24K–32K&lt;/td&gt;
&lt;td&gt;Ollama / LM Studio&lt;/td&gt;
&lt;td&gt;good&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Air M5, 32 GB&lt;/td&gt;
&lt;td&gt;Gemma 4 26B-A4B Q4&lt;/td&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;Ollama / LM Studio&lt;/td&gt;
&lt;td&gt;best Air setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Pro M5 Pro, 48 GB&lt;/td&gt;
&lt;td&gt;Gemma 4 26B-A4B Q4/UD-Q4&lt;/td&gt;
&lt;td&gt;64K&lt;/td&gt;
&lt;td&gt;llama.cpp / LM Studio&lt;/td&gt;
&lt;td&gt;sweet spot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Pro M5 Max, 64 GB+&lt;/td&gt;
&lt;td&gt;Gemma 4 26B-A4B or 31B&lt;/td&gt;
&lt;td&gt;64K–128K&lt;/td&gt;
&lt;td&gt;llama.cpp / vLLM&lt;/td&gt;
&lt;td&gt;best local&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the single most copied table in this gist. Bookmark it.&lt;/p&gt;




&lt;h2&gt;
  
  
  13. Gemma 4 26B-A4B: the Apple Silicon sweet spot
&lt;/h2&gt;

&lt;p&gt;For Mac local Claude Code, the standout Gemma 4 variant is &lt;strong&gt;26B-A4B-it&lt;/strong&gt;, not the dense 31B. Reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google trained tool-use directly into Gemma 4 (not bolted on as a fine-tune). It works on the first try, not after three retries.&lt;/li&gt;
&lt;li&gt;The 26B MoE activates only ~3.88 B params per inference, so latency is in the 4 B-model range — around 300 tok/sec on M2 Ultra.&lt;/li&gt;
&lt;li&gt;Strong tool-use behavior, good enough coding quality for private/local workflows.&lt;/li&gt;
&lt;li&gt;Fits at useful context sizes on high-memory MacBooks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why 26B-A4B instead of 31B?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Faster tool calls — every Claude Code turn is bottlenecked by tool-call latency, not single-shot quality.&lt;/li&gt;
&lt;li&gt;Lower active-parameter count keeps prefill cheap.&lt;/li&gt;
&lt;li&gt;Better fit for laptops — 31B dense needs more RAM and more thermal headroom.&lt;/li&gt;
&lt;li&gt;Enough quality for iterative coding; the agent loop matters more than peak IQ.&lt;/li&gt;
&lt;li&gt;31B may be better for single-shot answers — but Claude Code is many small turns, not one big answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For &lt;strong&gt;Gemma 4 local coding&lt;/strong&gt; specifically: pick 26B-A4B unless you're on a 64 GB+ Pro and you've measured that 31B Q4 actually finishes turns faster on your hardware.&lt;/p&gt;




&lt;h2&gt;
  
  
  14. Other model picks for Claude Code (April 2026)
&lt;/h2&gt;

&lt;p&gt;If Gemma 4 isn't available or you want to compare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;gpt-oss:20b&lt;/code&gt;&lt;/strong&gt; — easy starting point. Tool calling reliable, runs on a single decent GPU. Recommended in Ollama's and LM Studio's official Claude Code blog posts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;gpt-oss:120b&lt;/code&gt;&lt;/strong&gt; — much smarter on real codebases. The vLLM Claude Code integration page uses this as the example. Needs serious VRAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;qwen3-coder&lt;/code&gt;&lt;/strong&gt; — purpose-built for coding. Strong tool-call performance on Ollama. Frequently called the strongest local pick for Claude Code in March/April 2026 community threads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;qwen3.5&lt;/code&gt; family&lt;/strong&gt; — the 35B MoE variants are reported as the strongest agentic-coding open models in this size class. Verify tool-call support per quant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;glm-4.7-flash&lt;/code&gt; / &lt;code&gt;glm-4.7:cloud&lt;/code&gt;&lt;/strong&gt; — strong agentic coder. Available as an Ollama cloud model (no local GPU needed).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;minimax-m2.1:cloud&lt;/code&gt;&lt;/strong&gt; — newer Ollama cloud option, agentic-tuned.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What to avoid: stock &lt;code&gt;llama3.x&lt;/code&gt; instruct models without tool fine-tuning. They will look like they work, then silently fail on file edits.&lt;/p&gt;




&lt;h2&gt;
  
  
  15. Setups I would avoid
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;8K context.&lt;/strong&gt; Too small for Claude Code. The system prompt eats it before your code arrives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;16K context.&lt;/strong&gt; Demos only. Don't judge a model by 16K behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Old llama.cpp builds with Gemma 4.&lt;/strong&gt; No &lt;code&gt;--jinja&lt;/code&gt; or no patched chat template → &lt;code&gt;&amp;lt;unused24&amp;gt;&lt;/code&gt; / &lt;code&gt;&amp;lt;unused49&amp;gt;&lt;/code&gt; token leakage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;128K context on a 32 GB laptop.&lt;/strong&gt; KV cache + prefill latency tax &amp;gt; the benefit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Judging model quality before tool calls are stable.&lt;/strong&gt; Fix the parser/template first, then evaluate the model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routing through LiteLLM when the backend is already native Anthropic.&lt;/strong&gt; Adds a hop for nothing — only use LiteLLM for fallbacks or when wrapping an OpenAI compatible local endpoint.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  16. Reusable startup script
&lt;/h2&gt;

&lt;p&gt;Drop this in &lt;code&gt;start-claude-code-local.sh&lt;/code&gt; and &lt;code&gt;chmod +x&lt;/code&gt;. Default 32K context, override via env.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OLLAMA_CONTEXT_LENGTH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OLLAMA_CONTEXT_LENGTH&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;32768&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;http&lt;/span&gt;://localhost:11434&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;ollama&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_OPUS_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_OPUS_MODEL&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;gemma4&lt;/span&gt;:26b-a4b&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_SONNET_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_SONNET_MODEL&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;gemma4&lt;/span&gt;:26b-a4b&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_HAIKU_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_HAIKU_MODEL&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;gpt&lt;/span&gt;&lt;span class="p"&gt;-oss&lt;/span&gt;:20b&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Starting Ollama with context=&lt;/span&gt;&lt;span class="nv"&gt;$OLLAMA_CONTEXT_LENGTH&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
ollama serve &amp;amp;
&lt;span class="nv"&gt;OLLAMA_PID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$!&lt;/span&gt;

&lt;span class="c"&gt;# Wait for Ollama to be ready&lt;/span&gt;
&lt;span class="k"&gt;until &lt;/span&gt;curl &lt;span class="nt"&gt;-sf&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="s2"&gt;/api/version"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;0.5
&lt;span class="k"&gt;done

&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Launching Claude Code → &lt;/span&gt;&lt;span class="nv"&gt;$ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Model: &lt;/span&gt;&lt;span class="nv"&gt;$ANTHROPIC_DEFAULT_OPUS_MODEL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

claude

&lt;span class="nb"&gt;kill&lt;/span&gt; &lt;span class="nv"&gt;$OLLAMA_PID&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For LM Studio, swap &lt;code&gt;ollama serve&lt;/code&gt; for &lt;code&gt;lms server start --port 1234&lt;/code&gt; and update the env vars accordingly.&lt;/p&gt;

&lt;p&gt;This script (and additions for other backends as they ship) lives in the companion repo:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://github.com/renezander030/local-ai-coding-stack" rel="noopener noreferrer"&gt;github.com/renezander030/local-ai-coding-stack&lt;/a&gt; — &lt;code&gt;git clone&lt;/code&gt;, &lt;code&gt;chmod +x scripts/start-claude-code-local.sh&lt;/code&gt;, run.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  17. Production recommendation
&lt;/h2&gt;

&lt;p&gt;For real work, do not let Claude Code talk directly to a single local endpoint without a fallback path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code
   │  ANTHROPIC_BASE_URL
   ▼
LiteLLM (router + logger)
   │  primary
   ▼
Ollama / LM Studio / llama.cpp / vLLM (local)
   │  on tool-call failure or 5xx
   ▼
Cloud Claude Haiku (fallback)
   │
   ▼
Audit log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Model swaps without restarting Claude Code; transparent fallback when local tool calling silently fails; request logs you can grep when something goes wrong. Same five-contract pattern from &lt;a href="https://github.com/renezander030/agent-approval-gate" rel="noopener noreferrer"&gt;agent-approval-gate&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  18. When local models are the wrong choice
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repo-wide refactors.&lt;/strong&gt; Multi-step tool flows compound silent tool-call failures. Local fine-tunes drop accuracy fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security-sensitive edits without an approval gate.&lt;/strong&gt; Use &lt;a href="https://github.com/renezander030/agent-approval-gate" rel="noopener noreferrer"&gt;agent-approval-gate&lt;/a&gt; and the local-vs-cloud question becomes secondary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool-heavy sessions&lt;/strong&gt; (50+ tool calls). Every silent failure compounds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anything billed by your time.&lt;/strong&gt; A failed local tool call costs your time; a successful Haiku call is roughly $0.001.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Local Claude Code is a fit for: chat-only assist on private code, classification/summarization sub-steps, air-gapped environments.&lt;/p&gt;




&lt;h2&gt;
  
  
  Series
&lt;/h2&gt;

&lt;p&gt;This gist is part of &lt;strong&gt;Production AI Automation Notes&lt;/strong&gt; — a running set of repos and gists on shipping AI agents outside demos. Other entries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/renezander030/agent-approval-gate" rel="noopener noreferrer"&gt;agent-approval-gate&lt;/a&gt; — production-safe approval pattern. Drop in front of any local-model agent that touches real systems.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gist.github.com/renezander030/9069db775e494ffd2cdd5a09adf83add" rel="noopener noreferrer"&gt;Production AI Automation Notes #1: Agent Approval Gates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gist.github.com/renezander030/2898eb5f0100688f4197b5e493e156a2" rel="noopener noreferrer"&gt;CLAUDE.md — 10 rules for Claude Code, edit-time and runtime&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gist.github.com/renezander030/83ad49aeffa5f8749325a2b19617823f" rel="noopener noreferrer"&gt;Context7 v2 — enterprise GraphQL MCP pattern&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ollama.com/blog/claude" rel="noopener noreferrer"&gt;Ollama — Claude Code with Anthropic API compatibility (2026-01-16)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lmstudio.ai/blog/claudecode" rel="noopener noreferrer"&gt;LM Studio — Use your LM Studio Models in Claude Code (2026-01-30)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.vllm.ai/en/stable/serving/integrations/claude_code/" rel="noopener noreferrer"&gt;vLLM — Claude Code integration docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.claude.com/en/docs/claude-code/overview" rel="noopener noreferrer"&gt;Anthropic Claude Code documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.claude.com/en/api/messages" rel="noopener noreferrer"&gt;Anthropic Messages API reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.litellm.ai/docs/anthropic_unified" rel="noopener noreferrer"&gt;LiteLLM Anthropic-compatible route docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/7178" rel="noopener noreferrer"&gt;Claude Code GitHub issue #7178 — local/self-hosted model support&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Reader contributions
&lt;/h2&gt;

&lt;p&gt;If you get this working on a different Mac/RAM/model combo, comment with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;machine&lt;/li&gt;
&lt;li&gt;RAM&lt;/li&gt;
&lt;li&gt;backend&lt;/li&gt;
&lt;li&gt;model + quant&lt;/li&gt;
&lt;li&gt;context length&lt;/li&gt;
&lt;li&gt;what worked / what failed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The compatibility matrix and hardware table are updated weekly from these reports.&lt;/p&gt;

&lt;h2&gt;
  
  
  Changelog
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2026-04-28
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Added TL;DR cheat sheet, Recommended setup section, smoke test, debug flow, reusable startup script, hardware × model × context × backend table.&lt;/li&gt;
&lt;li&gt;Expanded error-string section to include &lt;code&gt;&amp;lt;unused24&amp;gt;&lt;/code&gt; / &lt;code&gt;&amp;lt;unused49&amp;gt;&lt;/code&gt; template-leak symptoms.&lt;/li&gt;
&lt;li&gt;Added 26B-A4B vs 31B comparison bullets.&lt;/li&gt;
&lt;li&gt;Added "Setups I would avoid."&lt;/li&gt;
&lt;li&gt;Renamed Update log → Changelog.&lt;/li&gt;
&lt;li&gt;Added Gemma 4 26B-A4B context recommendations.&lt;/li&gt;
&lt;li&gt;Added MacBook Air vs Pro presets.&lt;/li&gt;
&lt;li&gt;Added 32K / 64K Claude Code guidance.&lt;/li&gt;
&lt;li&gt;Backend coverage rewritten: Ollama, LM Studio, vLLM all native Anthropic; llama.cpp added as Apple Silicon fast path.&lt;/li&gt;
&lt;li&gt;LiteLLM repositioned as fallback router (and OpenAI-compat wrapper), not translator.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2026-04-22
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Initial publish.&lt;/li&gt;
&lt;/ul&gt;







&lt;p&gt;&lt;strong&gt;Get the next field note in your inbox.&lt;/strong&gt; A new mini case from real AI builds every two weeks. No theory, no pitches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://renezander.com/blog/?utm_source=dev-to&amp;amp;utm_campaign=claude-code-with-local-llms-and-anthropicbaseurl-ollama-lm-studio-llamacpp-vllm" rel="noopener noreferrer"&gt;Subscribe at renezander.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Voice AI in Production: From RunPod to Hosted Kubernetes</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Thu, 23 Apr 2026 13:10:51 +0000</pubDate>
      <link>https://forem.com/reneza/voice-ai-in-production-from-runpod-to-hosted-kubernetes-7gg</link>
      <guid>https://forem.com/reneza/voice-ai-in-production-from-runpod-to-hosted-kubernetes-7gg</guid>
      <description>&lt;p&gt;Your voice model works in a demo. The same model in production stalls under concurrent load. The model file is identical. So is the GPU card. Only the deployment changed.&lt;/p&gt;

&lt;p&gt;If your TTS service runs on a single RunPod pod, you've already met this wall. You handle one request per GPU at a time. A crash costs ninety seconds to reload the model. Failover isn't in the setup. Your marketing page says "generate narration instantly." Your infrastructure says "please form an orderly queue."&lt;/p&gt;

&lt;p&gt;The gap between prototype and product sits in the infrastructure layer. The voice AI companies asking me for help want hosted Kubernetes because their engineering hours are going into pod management when they should be going into the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Single Pod Stops Working Around Four Concurrent Users
&lt;/h2&gt;

&lt;p&gt;A voice model like Qwen3-TTS loads into GPU memory once. Each inference holds that memory plus a working buffer. On an H100 you fit the model plus maybe four to eight concurrent generations before latency goes off a cliff. On a 4090, less.&lt;/p&gt;

&lt;p&gt;That number is the ceiling of your business on a single pod. You can buy a bigger GPU. You can't buy a second one attached to the same pod. The moment you need more than one machine, you're in distributed-systems territory whether you planned for it or not.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Breaks First
&lt;/h2&gt;

&lt;p&gt;Cold starts are the obvious one. A pod that dies takes ninety seconds to reload the model into VRAM, and during those ninety seconds your users hit 502s. Kubernetes with a warm pool absorbs it.&lt;/p&gt;

&lt;p&gt;Voice profile storage gets worse the moment you scale. On one pod a user's cloned voice sits on local disk. Spread that across ten pods and you need shared storage plus replication on every node that might serve that user. Miss one and the next request uses the wrong voice or errors out.&lt;/p&gt;

&lt;p&gt;Then there's the cost trap. You rent preemptible GPUs at a third the price, and one afternoon the cloud provider takes them back with two minutes' warning. A single pod goes dark. A K8s cluster with a warm replica serves the next request from a different node and nobody sees the eviction.&lt;/p&gt;

&lt;p&gt;Fine-tuning is the one that forces the decision. The moment you offer custom voice creation, you need training runs that don't block inference. That means another queue, another GPU pool, and priority rules that don't collide with live inference. A single pod can't multiplex that, and bolting it on later costs more than designing for it up front.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the K8s Layer Actually Buys You
&lt;/h2&gt;

&lt;p&gt;Keep model weights on the node, where they outlive any single pod. New pods scheduled to that node get a warm cache and start in under ten seconds instead of ninety.&lt;/p&gt;

&lt;p&gt;Not every request needs an H100. Real-time low-latency responses can run on a 4090 nodepool, premium batch generations go to H100. Nodepool labels and taints handle the routing without the application code caring.&lt;/p&gt;

&lt;p&gt;Pick queue depth as your autoscale signal. CPU metrics are useless here. GPU utilization also lies when the model is streaming. The number that maps to user-visible latency is requests waiting in the queue.&lt;/p&gt;

&lt;p&gt;Show the queue depth back to the caller. "You're number four, about forty seconds" keeps users on the line. A thirty-second timeout with no feedback teaches them your service is broken.&lt;/p&gt;

&lt;p&gt;None of this is visible in a Voicebox README.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hosted K8s Is the Service
&lt;/h2&gt;

&lt;p&gt;Voice AI companies keep asking for this because it's the gap between a model that works and a product that holds up under paying users. You can learn Kubernetes while trying to ship, but most founders can't afford both learning curves at once. Hiring a team is slow. Handing the layer off gets your engineering hours back on the model.&lt;/p&gt;

&lt;p&gt;If your voice AI product is past the demo and breaking under real traffic, I run the K8s layer so your team stays on the model. &lt;a href="https://renezander.com/#contact" rel="noopener noreferrer"&gt;Contact on the blog&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your Model Is the Value. Your Pod Isn't.
&lt;/h2&gt;

&lt;p&gt;Are your engineering hours going into the model or into the pod that serves it? If the answer is the pod, you're paying to solve the wrong problem twice. Handle the infrastructure properly or hand it off. A half-built version while your competitor ships isn't a strategy.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://renezander.com/blog/voice-ai-production-kubernetes/" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Where is your engineering time actually going right now: into the model or into the pod that serves it?&lt;/p&gt;







&lt;p&gt;&lt;strong&gt;Get the next field note in your inbox.&lt;/strong&gt; A new mini case from real AI builds every two weeks. No theory, no pitches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://renezander.com/blog/?utm_source=dev-to&amp;amp;utm_campaign=voice-ai-in-production-from-runpod-to-hosted-kubernetes" rel="noopener noreferrer"&gt;Subscribe at renezander.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Ten CLAUDE.md rules for Claude Code - four edit-time, six runtime</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Thu, 23 Apr 2026 04:06:42 +0000</pubDate>
      <link>https://forem.com/reneza/ten-claudemd-rules-for-claude-code-four-edit-time-six-runtime-210g</link>
      <guid>https://forem.com/reneza/ten-claudemd-rules-for-claude-code-four-edit-time-six-runtime-210g</guid>
      <description>&lt;p&gt;Forrestchang's &lt;a href="https://github.com/forrestchang/andrej-karpathy-skills" rel="noopener noreferrer"&gt;andrej-karpathy-skills&lt;/a&gt; CLAUDE.md is four rules aimed at the moment Claude is &lt;strong&gt;writing code&lt;/strong&gt;. They work. What they don't cover is the moment Claude is &lt;strong&gt;running&lt;/strong&gt;. Once a Claude-driven pipeline goes to production, a different failure mode shows up: confident outputs, silent budget overruns, destructive side-effects, prompt injection via user input.&lt;/p&gt;

&lt;p&gt;These six extension rules are what I shipped into &lt;a href="https://github.com/renezander030/fixclaw" rel="noopener noreferrer"&gt;fixclaw&lt;/a&gt; — a Go pipeline engine where Claude drafts, classifies, and summarizes, but &lt;em&gt;never&lt;/em&gt; executes. Deterministic code does. The rules below are what made that claim stick.&lt;/p&gt;

&lt;p&gt;Merge with your own project rules. Tradeoff: these bias toward caution over autonomy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Forrestchang's four (edit-time) — unchanged
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Think Before Coding&lt;/strong&gt; — state assumptions, surface tradeoffs, ask when unclear.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity First&lt;/strong&gt; — minimum code, no speculative abstractions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surgical Changes&lt;/strong&gt; — touch only what the task requires.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal-Driven Execution&lt;/strong&gt; — define success criteria, loop until verified.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;(Full text: forrestchang/andrej-karpathy-skills/CLAUDE.md.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Six runtime rules — lessons from fixclaw
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5. Deterministic First
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Claude is for judgment calls. Plain code does everything else.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fetching, filtering, routing, persisting, dispatching — none of it is a language task. Don't ask the model to "decide if we should retry" when a status code already answers. Use the model for: classification, drafting, summarization, extraction from unstructured text. That's the whole list.&lt;/p&gt;

&lt;p&gt;The failure mode without this rule: the model makes a routing decision one week, a different routing decision the next, and you've reinvented flaky if-else at $0.003/token.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Declare Budgets, Halt On Breach
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;No silent overruns. Ever.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every AI step runs under a token budget: per-step, per-pipeline, per-day. Exceeding any of the three halts the pipeline immediately, logs the breach, and surfaces it to the operator. Budgets live in config, not in prompts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;per_step_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2048&lt;/span&gt;
  &lt;span class="na"&gt;per_pipeline_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt;
  &lt;span class="na"&gt;per_day_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The failure mode without this rule: a runaway loop burns $40 overnight and you find out from the invoice.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Human-In-The-Loop Is A First-Class Step Type
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Label destructive actions. Require approval. No exceptions via flags.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Anything touching the outside world — sending an email, updating a CRM, posting a message — is an &lt;code&gt;approval&lt;/code&gt; step, not an &lt;code&gt;ai&lt;/code&gt; step. The approval is routed to an operator channel (Slack, Telegram, whatever) with approve/edit/reject controls. The pipeline blocks until a decision is recorded.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;approve-send&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;approval&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hitl&lt;/span&gt;
  &lt;span class="na"&gt;channel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;telegram&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The failure mode without this rule: a hallucinated follow-up email goes to a real customer.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Validate AI Output Against A Schema
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Unstructured strings don't belong in deterministic downstream code.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every AI step declares an output schema. The runtime rejects anything that doesn't match — missing fields, wrong types, out-of-range numbers. Rejected outputs trigger a retry (under budget) or halt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;output_schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;object&lt;/span&gt;
  &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;score&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;boolean&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;string&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;maxLength&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;280&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;score&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;integer&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;minimum&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;0&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;maximum&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;100&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The failure mode without this rule: a boolean comes back as the string &lt;code&gt;"maybe"&lt;/code&gt; and a downstream &lt;code&gt;if&lt;/code&gt; branches the wrong way.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Sanitize Operator Input Before It Reaches A Prompt
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;User-supplied text is not trusted.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before any operator or external input enters a prompt, strip role markers (&lt;code&gt;system:&lt;/code&gt;, &lt;code&gt;assistant:&lt;/code&gt;, &lt;code&gt;&amp;lt;|im_start|&amp;gt;&lt;/code&gt; variants), enforce length limits, and normalize markdown so formatting can't break prompt boundaries. This is prompt-injection defense, not input validation — the goal is to stop an attacker from pivoting the model mid-run.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. Log Rejections Silently
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Don't narrate to the attacker.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When input is rejected for sanitization or schema violations, log internally — never echo the rejection reason back to the source. A detailed error message is a free signal that tells the attacker which pattern to try next.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "working if" test
&lt;/h2&gt;

&lt;p&gt;The full ten rules are working if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Diffs are smaller and more targeted (rules 1–4).&lt;/li&gt;
&lt;li&gt;Pipeline runs have predictable token costs (rule 6).&lt;/li&gt;
&lt;li&gt;No AI output ever reaches a production side-effect without a human approval record (rule 7).&lt;/li&gt;
&lt;li&gt;Downstream code never branches on a malformed AI response (rule 8).&lt;/li&gt;
&lt;li&gt;Operator-channel logs show silent rejections rather than echoed errors (rules 9–10).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If even one of those is failing, the rule isn't enforced — it's aspirational.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published as a gist: &lt;a href="https://gist.github.com/renezander030/2898eb5f0100688f4197b5e493e156a2" rel="noopener noreferrer"&gt;https://gist.github.com/renezander030/2898eb5f0100688f4197b5e493e156a2&lt;/a&gt; — weekly gists on Claude Code, MCP, and automation at &lt;a href="https://github.com/renezander030" rel="noopener noreferrer"&gt;@renezander030&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;







&lt;p&gt;&lt;strong&gt;Get the next field note in your inbox.&lt;/strong&gt; A new mini case from real AI builds every two weeks. No theory, no pitches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://renezander.com/blog/?utm_source=dev-to&amp;amp;utm_campaign=ten-claudemd-rules-for-claude-code-four-edit-time-six-runtime" rel="noopener noreferrer"&gt;Subscribe at renezander.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>95% of PII Redaction Doesn't Need an LLM. The Other 5% Is Where Your Masker Leaks.</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 21 Apr 2026 11:43:05 +0000</pubDate>
      <link>https://forem.com/reneza/95-of-pii-redaction-doesnt-need-an-llm-the-other-5-is-where-your-masker-leaks-13pp</link>
      <guid>https://forem.com/reneza/95-of-pii-redaction-doesnt-need-an-llm-the-other-5-is-where-your-masker-leaks-13pp</guid>
      <description>&lt;p&gt;A VP at an SAP shop told me recently: "Every time we copy production to our lower environments, PII leaks. And no, we're not throwing an LLM at it. That's a thousand times the compute of what we already run."&lt;/p&gt;

&lt;p&gt;He's right.&lt;/p&gt;

&lt;p&gt;Most of the PII redaction problem in enterprise data isn't a neural network problem. It's a lookup table problem. And the incumbents already solve it. SAP TDMS, Delphix, Informatica, IBM InfoSphere Optim. All schema-aware. All row-level. All deterministic.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 95% Where Deterministic Wins
&lt;/h2&gt;

&lt;p&gt;In a SAP production database, the schema tells you almost everything. &lt;code&gt;KNA1-NAME1&lt;/code&gt; is a customer name. &lt;code&gt;BSEG-IBAN&lt;/code&gt; is a bank account. &lt;code&gt;USR02-BNAME&lt;/code&gt; is a user ID. A YAML rule says: "for this column type, replace with this pattern." Done.&lt;/p&gt;

&lt;p&gt;The math is brutal. A regex plus a lookup table costs microseconds per row. A 1.5B-parameter model costs 10 to 50 milliseconds per row, even on a GPU. That's three to five orders of magnitude. A nightly batch copy that finishes by morning with TDMS would take weeks with an LLM in the loop.&lt;/p&gt;

&lt;p&gt;Compute isn't even the main argument.&lt;/p&gt;

&lt;p&gt;Referential integrity is. "Anna Müller" has to become "Person_47" consistently across 200 tables. &lt;code&gt;KNA1&lt;/code&gt;, &lt;code&gt;VBAK&lt;/code&gt;, &lt;code&gt;VBKD&lt;/code&gt;, &lt;code&gt;BSEG&lt;/code&gt;, wherever the customer ID travels. Deterministic pseudonymization with an HMAC and a scoped salt gives you that for free. Neural outputs drift.&lt;/p&gt;

&lt;p&gt;Auditability is. A regulator asks: "show me the rule that masked this column." A YAML rule is defensible. A model output is not.&lt;/p&gt;

&lt;p&gt;So for any SAP field with a known schema type, deterministic masking wins. Full stop. Don't let anyone sell you a neural-network-powered "modernization" of that layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where a Fine-Tuned Model Earns Its Compute
&lt;/h2&gt;

&lt;p&gt;Here's what TDMS, Delphix, and their peers silently miss.&lt;/p&gt;

&lt;p&gt;Free-text columns. &lt;code&gt;BSEG-SGTXT&lt;/code&gt;, the long-text field where someone typed "Ansprechpartner Anna Müller, Tel +49-170-...". Ticket descriptions from ServiceNow mirrored into dev. Email bodies stored as CLOBs. ADRC annotations. The column type is "text." The content is gold-mine PII.&lt;/p&gt;

&lt;p&gt;Unstructured attachments. PDFs, scanned invoices, OCR'd contracts pulled into dev via ArchiveLink. Names and IBANs mid-prose, not in a column.&lt;/p&gt;

&lt;p&gt;Schema drift. Consultants add Z-tables. The data steward hasn't classified them yet. Deterministic tools don't know the column holds PII. They pass the data through untouched.&lt;/p&gt;

&lt;p&gt;On these, rule-based tools do one of two things. They wipe the whole column, destroying test fidelity, so the dev team can't debug against realistic data. Or they miss the PII entirely, and you get a compliance incident.&lt;/p&gt;

&lt;p&gt;A German-specialized redactor earns its keep here because the alternative isn't "faster regex." It's "no coverage at all."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hybrid Architecture
&lt;/h2&gt;

&lt;p&gt;This is the part that actually ships.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A classifier pass on the SAP copy. Cheap heuristics (column-name keywords, column type, sample-value regex) flag each column as &lt;code&gt;structured_pii&lt;/code&gt;, &lt;code&gt;free_text&lt;/code&gt;, or &lt;code&gt;safe&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Deterministic masker handles &lt;code&gt;structured_pii&lt;/code&gt;. TDMS or whatever you already run.&lt;/li&gt;
&lt;li&gt;Fine-tuned LLM redactor runs &lt;em&gt;only&lt;/em&gt; on &lt;code&gt;free_text&lt;/code&gt;, attachments, and unclassified Z-columns.&lt;/li&gt;
&lt;li&gt;A consistency bridge. Both paths share a pseudonym table keyed by &lt;code&gt;HMAC(value, tenant_salt)&lt;/code&gt;. "Anna Müller" becomes "Person_47" whether she was caught by regex or by the model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Compute budget: the LLM runs on maybe 1 to 5 percent of the cells. Total cost is still dominated by the deterministic layer. You're not replacing TDMS. You're covering its blind spots.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Won't Claim
&lt;/h2&gt;

&lt;p&gt;Three things I won't sell you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The LLM is cheaper than a regex. It isn't. Ever.&lt;/li&gt;
&lt;li&gt;It replaces your incumbent masking vendor. It doesn't.&lt;/li&gt;
&lt;li&gt;A benchmark against TDMS on structured columns is meaningful. You lose that benchmark. Benchmark on free-text and attachments, where deterministic tools score near zero.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The honest pitch to the VP was this. "You're right. For the 90% structured case, keep TDMS. The model is the long-tail layer. It runs only over the free-text fields and attachments your current tools silently leak. Small job. Different problem."&lt;/p&gt;

&lt;p&gt;That's the conversation that lands. Not "replace your stack." Not "AI-powered everything."&lt;/p&gt;

&lt;p&gt;Regex for the schema. LLM for the shadows.&lt;/p&gt;

&lt;p&gt;I reserve my audits for teams ready to take action on the results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.cal.eu/reneza/30min" rel="noopener noreferrer"&gt;Book a 30-min call →&lt;/a&gt;&lt;/p&gt;







&lt;p&gt;&lt;strong&gt;Get the next field note in your inbox.&lt;/strong&gt; A new mini case from real AI builds every two weeks. No theory, no pitches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://renezander.com/blog/?utm_source=dev-to&amp;amp;utm_campaign=95-of-pii-redaction-doesnt-need-an-llm-the-other-5-is-where-your-masker-leaks" rel="noopener noreferrer"&gt;Subscribe at renezander.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gdpr</category>
      <category>dsvgo</category>
      <category>pii</category>
    </item>
    <item>
      <title>What llama.cpp's Pace Tells You About On-Prem LLM Readiness</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 14 Apr 2026 08:42:06 +0000</pubDate>
      <link>https://forem.com/reneza/what-llamacpps-pace-tells-you-about-on-prem-llm-readiness-eh1</link>
      <guid>https://forem.com/reneza/what-llamacpps-pace-tells-you-about-on-prem-llm-readiness-eh1</guid>
      <description>&lt;p&gt;Your team asked for GPU budget for self-hosted inference. You said "not yet" because last time you checked, the tooling wasn't production-grade. That was true 18 months ago. It's not true now, and the delay is costing you leverage you don't know you're losing.&lt;/p&gt;

&lt;p&gt;I'm writing this because most decision-makers I talk to are still running on an outdated mental model of what self-hosted LLM infrastructure looks like. The software moved. The org didn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Team That Celebrated Too Early
&lt;/h2&gt;

&lt;p&gt;I watched a team spin up on-prem inference, celebrate for a week, then watch it rot because nobody owned it. Six months later they were back on the API, having spent the budget anyway.&lt;/p&gt;

&lt;p&gt;This is the failure mode nobody talks about. The software works. It's been working for a while now. The problem is everything around it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nobody owns the stack.&lt;/strong&gt; Running self-hosted inference in production means someone on your team owns model updates, hardware failures, quantization tradeoffs, and latency tuning. That's a different job than calling an API. If you don't staff it, the deployment decays.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Procurement kills momentum.&lt;/strong&gt; GPU capacity is a capital expenditure conversation, not a software download. If you don't already have data center access or cloud-GPU contracts, the blocker isn't the code. It's a procurement cycle that takes months. By the time the hardware arrives, the team that asked for it has moved on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model selection is real work.&lt;/strong&gt; The quantized model that runs great for summarization falls apart on code generation. There is no default. Every use case needs evaluation, and evaluation takes time nobody budgets for.&lt;/p&gt;

&lt;p&gt;These are solvable problems. But teams that skip them end up with on-prem deployments that nobody trusts, and leadership that says "see, I told you it wasn't ready" when the real issue was organizational, not technical.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed While You Were Waiting
&lt;/h2&gt;

&lt;p&gt;A year ago, I would have told you to hold off. Not anymore.&lt;/p&gt;

&lt;p&gt;You can now split inference across multiple GPUs without patching anything yourself. The server mode handles concurrent requests behind a load balancer. 1-bit quantization means models that needed high-end hardware run on modest configs without catastrophic quality loss.&lt;/p&gt;

&lt;p&gt;Multi-modal support landed. Speculative decoding shipped, cutting latency on long outputs. The API compatibility layer means your existing code that talks to cloud providers works against a self-hosted endpoint with a URL change.&lt;/p&gt;

&lt;p&gt;I deployed a quantized model on a client's on-prem GPU last month. Set up the server, pointed the app at it, ran inference. It worked. First try. That sentence would have been fiction two years ago.&lt;/p&gt;

&lt;p&gt;The gap between "experimental" and "production-ready" closed while most orgs were waiting for someone else to go first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Decision You're Actually Making
&lt;/h2&gt;

&lt;p&gt;This isn't a permanent binary. It's a portfolio allocation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Move workloads on-prem when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your inference volume is high enough that API costs became a material line item.&lt;/li&gt;
&lt;li&gt;You need predictable latency without network variability.&lt;/li&gt;
&lt;li&gt;Compliance or data residency requirements mandate it. But verify this. Many teams assume they need on-prem when they don't.&lt;/li&gt;
&lt;li&gt;You have an engineer who wants to own the stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stay on the API when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're prototyping or usage is unpredictable.&lt;/li&gt;
&lt;li&gt;You need frontier models not available as open weights.&lt;/li&gt;
&lt;li&gt;Nobody on your team can own the ops burden.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mistake I see most often: treating this as all-or-nothing. Start with API. Move specific workloads to self-hosted when economics or data constraints force the conversation. The infrastructure to do it properly exists now. It didn't two years ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question for Your Next Planning Cycle
&lt;/h2&gt;

&lt;p&gt;The software is ready. The open-weight models are good enough for most production use cases. The tooling matured past the point where "not ready yet" is a defensible position.&lt;/p&gt;

&lt;p&gt;The real question isn't whether the technology works. It's whether your org is set up to operate it. That's a staffing decision and a procurement decision, not a technology bet.&lt;/p&gt;

&lt;p&gt;If you're still saying "not yet," make sure you're saying it because of an actual blocker, not because of a mental model that expired a year ago.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related reading:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://renezander.com/guides/self-hosted-llm-vs-api/" rel="noopener noreferrer"&gt;Self-Hosted LLM vs API: when the math actually works&lt;/a&gt; — the decision framework I use with clients (&lt;a href="https://renezander.com/de/guides/self-hosted-llm-vs-api/" rel="noopener noreferrer"&gt;deutsche Version&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://renezander.com/guides/llm-api-comparison/" rel="noopener noreferrer"&gt;LLM API comparison 2026&lt;/a&gt; — Claude vs GPT vs Gemini vs Mistral vs DeepSeek for production (&lt;a href="https://renezander.com/de/guides/llm-api-comparison/" rel="noopener noreferrer"&gt;deutsche Version&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I help teams navigate this decision. If your org is evaluating self-hosted inference and you want an honest assessment of readiness, reach out.&lt;/p&gt;







&lt;p&gt;&lt;strong&gt;Get the next field note in your inbox.&lt;/strong&gt; A new mini case from real AI builds every two weeks. No theory, no pitches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://renezander.com/blog/?utm_source=dev-to&amp;amp;utm_campaign=what-llamacpps-pace-tells-you-about-on-prem-llm-readiness" rel="noopener noreferrer"&gt;Subscribe at renezander.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Your AI Content Tool Knows Your Strategy. Do You Know Where It Goes?</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 07 Apr 2026 06:59:24 +0000</pubDate>
      <link>https://forem.com/reneza/your-ai-content-tool-knows-your-strategy-do-you-know-where-it-goes-23fg</link>
      <guid>https://forem.com/reneza/your-ai-content-tool-knows-your-strategy-do-you-know-where-it-goes-23fg</guid>
      <description>&lt;p&gt;Your team is using AI for content. Everybody is. LinkedIn posts, blog drafts, internal comms, maybe some customer-facing copy too.&lt;/p&gt;

&lt;p&gt;And it works. The output is decent, the speed is real, nobody wants to go back to writing everything from scratch.&lt;/p&gt;

&lt;p&gt;But have you thought about what you are actually pasting into these tools?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Prompt Is the Product
&lt;/h2&gt;

&lt;p&gt;Every time someone on your team writes a prompt, they are feeding context into a system they do not control. Brand voice guidelines. Competitive positioning notes. Messaging frameworks. That internal strategy deck someone summarized into a prompt last Tuesday.&lt;/p&gt;

&lt;p&gt;This is not hypothetical. This is what good prompts look like. The more context you give, the better the output. So people give more context. They paste in the brief. They paste in the competitor analysis. They paste in the draft that legal has not approved yet.&lt;/p&gt;

&lt;p&gt;The tool gets better because your data is better. And your data is sitting on someone else's infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trust Model Is the Problem
&lt;/h2&gt;

&lt;p&gt;Most AI content tools handle your data the same way: they promise not to train on it. That is the entire security model. A policy page. Maybe an enterprise agreement with a data processing addendum.&lt;/p&gt;

&lt;p&gt;Your data still gets processed on shared infrastructure. It still passes through systems you cannot inspect. You are trusting that the vendor's internal controls work perfectly, that no employee has access they should not have, and that every subprocessor in the chain follows the same rules.&lt;/p&gt;

&lt;p&gt;For most companies, this never becomes a visible problem. The data does not leak in a way anyone notices. The risk stays theoretical.&lt;/p&gt;

&lt;p&gt;Until it does not.&lt;/p&gt;

&lt;p&gt;A client asks where their data goes during your AI-assisted content process. Legal needs to document compliance for an audit. A competitor publishes something that looks suspiciously familiar. A new regulation drops that requires you to prove where personal data was processed, not just promise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Technology Already Exists
&lt;/h2&gt;

&lt;p&gt;Here is what most people in the content space do not realize: the technology to solve this is not theoretical. It is production-ready. It has been running in cloud infrastructure for years. It just has not reached the content tooling layer yet.&lt;/p&gt;

&lt;p&gt;Three capabilities change the game:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Client-side encryption.&lt;/strong&gt; Your data gets encrypted before it leaves your browser. The server never sees plaintext. It processes encrypted inputs and returns encrypted outputs. The key stays with you. Not with the vendor. Not in their key management system. With you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confidential computing.&lt;/strong&gt; Instead of shared servers where your workload runs alongside everyone else's, your data gets processed in an isolated hardware enclave. The cloud provider cannot see inside it. The vendor cannot see inside it. The operating system cannot see inside it. Your data exists in cleartext only inside a hardware boundary that nobody else can access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attestation.&lt;/strong&gt; Cryptographic proof of what code is running in that enclave. Not a vendor's word that they are running the right version. A hardware-signed certificate that you can independently verify. You know exactly what software touched your data because the hardware tells you, not the vendor.&lt;/p&gt;

&lt;p&gt;These are not research papers. AWS Nitro Enclaves, Azure Confidential VMs, and GCP Confidential Computing have been generally available for years. The infrastructure is there. The content tools just have not caught up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters Now
&lt;/h2&gt;

&lt;p&gt;Two things are converging.&lt;/p&gt;

&lt;p&gt;First, AI adoption in content workflows is no longer experimental. Teams are building real pipelines. They are feeding in real business data, not just test prompts. The volume and sensitivity of data flowing through AI tools is growing every quarter.&lt;/p&gt;

&lt;p&gt;Second, regulation is catching up. GDPR already requires you to document where personal data is processed. The EU AI Act adds requirements around transparency and risk management for AI systems. Industry-specific regulations in finance, healthcare, and legal services are getting more specific about AI data handling. "We have a DPA" is becoming insufficient.&lt;/p&gt;

&lt;p&gt;The companies that figure out verifiable AI data handling now will not be scrambling when their clients, their board, or their regulator asks how their AI content pipeline handles sensitive data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Ask Your Vendors
&lt;/h2&gt;

&lt;p&gt;You do not need to become a cryptography expert. But you should be asking three questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where does my data exist in plaintext?&lt;/strong&gt; If the answer is "on our servers," you are in the trust model. If the answer is "only inside a hardware enclave that we cannot access," you are in the proof model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I verify what code processes my data?&lt;/strong&gt; If the answer requires trusting the vendor's word, that is trust. If the answer involves a hardware attestation you can independently check, that is proof.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who holds the encryption keys?&lt;/strong&gt; If the vendor holds them, they can decrypt your data whenever they want, regardless of what the policy says. If you hold them, the vendor literally cannot access your plaintext data even if they tried.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Shift from Trust to Proof
&lt;/h2&gt;

&lt;p&gt;The content industry is going to go through the same transition that payments, healthcare, and financial services already went through. The question will shift from "do you promise to protect our data?" to "can you prove it?"&lt;/p&gt;

&lt;p&gt;Right now, almost nobody in the AI content space is building with these guarantees. That gap will not last.&lt;/p&gt;

&lt;p&gt;I am building &lt;a href="https://teedian.com" rel="noopener noreferrer"&gt;Teedian&lt;/a&gt;, an AI content tool that uses exactly this architecture. Client-side encryption, confidential computing, attestation. Not as a roadmap item, but as the foundation.&lt;/p&gt;

&lt;p&gt;If you work in a regulated industry, or you handle client data in your content workflows, or you want to understand what cryptographic privacy looks like in practice, I put together a &lt;a href="https://teedian.com/#brief" rel="noopener noreferrer"&gt;short brief on teedian.com&lt;/a&gt; that walks through the architecture. Plain language, no jargon, 3 pages.&lt;/p&gt;







&lt;p&gt;&lt;strong&gt;Get the next field note in your inbox.&lt;/strong&gt; A new mini case from real AI builds every two weeks. No theory, no pitches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://renezander.com/blog/?utm_source=dev-to&amp;amp;utm_campaign=your-ai-content-tool-knows-your-strategy-do-you-know-where-it-goes" rel="noopener noreferrer"&gt;Subscribe at renezander.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>programming</category>
      <category>privacy</category>
    </item>
    <item>
      <title>Spend Your Human Thinking Tokens Where They Compound</title>
      <dc:creator>René Zander</dc:creator>
      <pubDate>Tue, 31 Mar 2026 08:32:22 +0000</pubDate>
      <link>https://forem.com/reneza/spend-your-human-thinking-tokens-where-they-compound-pf1</link>
      <guid>https://forem.com/reneza/spend-your-human-thinking-tokens-where-they-compound-pf1</guid>
      <description>&lt;p&gt;More automations running. More agents deployed. More pipelines humming in the background.&lt;/p&gt;

&lt;p&gt;I run about a dozen automated jobs. Daily briefings, proposal generation, content pipelines, data syncing, monitoring alerts. They handle a lot.&lt;/p&gt;

&lt;p&gt;But the biggest improvement to my workflow this year wasn't adding more automation. It was getting honest about where my thinking actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  You Have a Token Budget Too
&lt;/h2&gt;

&lt;p&gt;LLMs have context windows. Feed in too much noise and the signal degrades. The output gets worse even though you gave it more to work with.&lt;/p&gt;

&lt;p&gt;Human attention works the same way. I have maybe 4 good hours of focused thinking per day. When I spend those hours reviewing cron output or formatting documents or triaging alerts that resolve themselves, I'm burning tokens on low-value work.&lt;/p&gt;

&lt;p&gt;The quality of my actual decisions goes down. Not because the decisions got harder, but because I already used up my thinking budget on stuff that didn't need me.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I Stopped Spending
&lt;/h2&gt;

&lt;p&gt;I used to review my morning briefing line by line. Check every data point, verify every summary. Then I realized: if the briefing is wrong, I'll notice when the information doesn't match reality later that day. The cost of a slightly wrong briefing at 6:30 is near zero. The cost of spending 20 minutes checking it every morning is real.&lt;/p&gt;

&lt;p&gt;Same with monitoring. I had alerts for everything. Cache refreshes, API response times, sync completions. Most of them were informational, not actionable. I stripped it down to alerts that require a decision: something broke, something is about to expire, something needs my approval before it touches an external system.&lt;/p&gt;

&lt;p&gt;Data syncing runs on a schedule. If it fails, I get one alert. I don't watch it run. I don't check the logs unless the alert fires.&lt;/p&gt;

&lt;p&gt;First drafts of anything. Cover letters, content outlines, research summaries. The AI produces a version. Sometimes it's good enough. Sometimes I rewrite half of it. But I never start from a blank page anymore, and that alone saves the hardest type of thinking: getting started.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I Still Spend Every Token
&lt;/h2&gt;

&lt;p&gt;Scoping client work. An AI can research a company, summarize a job posting, draft a proposal. But deciding whether the project is actually worth pursuing? Whether the client's problem is what they say it is? That's pattern recognition built from years of seeing projects go sideways. No automation for that.&lt;/p&gt;

&lt;p&gt;Choosing what to build next. I have a backlog of 50 things I could automate, improve, or ship. The AI can't tell me which one moves the needle this week. That decision depends on context it doesn't have: what conversations I had yesterday, what I'm optimizing for this month, what feels right.&lt;/p&gt;

&lt;p&gt;Anything with my name on it that reaches another person. Proposals get edited. Posts get rewritten. Client messages get reviewed word by word. The AI drafts. I decide what actually represents me.&lt;/p&gt;

&lt;p&gt;System design decisions. Where to draw the boundary between automatic and manual. What gets a human checkpoint and what runs unsupervised. These are the highest-leverage decisions in any AI system, and they're entirely human.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Ratio
&lt;/h2&gt;

&lt;p&gt;Maybe 20% of my working hours involve focused, high-stakes thinking. The rest is execution, coordination, and maintenance.&lt;/p&gt;

&lt;p&gt;Before I built these systems, that ratio was reversed. 80% thinking, 20% execution, and half the thinking was on tasks that didn't deserve it.&lt;/p&gt;

&lt;p&gt;The goal was never "automate everything." It was "protect the 20% that matters and make sure I'm not exhausted when I get there."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Shift
&lt;/h2&gt;

&lt;p&gt;This isn't about working less. I work the same hours. But the distribution changed.&lt;/p&gt;

&lt;p&gt;I spend less time on decisions that don't compound. I spend more time on the ones that do. Client relationships, system architecture, strategic bets. The stuff where being sharp at 10 in the morning instead of burned out from triaging alerts actually changes the outcome.&lt;/p&gt;

&lt;p&gt;The question isn't how much your AI can do. It's whether you're spending your own thinking tokens on the right things.&lt;/p&gt;

&lt;p&gt;Where are you still spending attention that you probably shouldn't?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I help teams figure out where AI should run unsupervised and where humans still need to be in the loop. If that's a question your team is working through, let's talk: &lt;a href="https://cal.eu/reneza" rel="noopener noreferrer"&gt;cal.eu/reneza&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;







&lt;p&gt;&lt;strong&gt;Get the next field note in your inbox.&lt;/strong&gt; A new mini case from real AI builds every two weeks. No theory, no pitches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://renezander.com/blog/?utm_source=dev-to&amp;amp;utm_campaign=spend-your-human-thinking-tokens-where-they-compound" rel="noopener noreferrer"&gt;Subscribe at renezander.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at &lt;a href="https://renezander.com" rel="noopener noreferrer"&gt;renezander.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
