Forem: René Zander

Browser-Use Is Solving the Wrong Half of the Problem

René Zander — Tue, 19 May 2026 14:57:13 +0000

TL;DR — when to use browserground (and when to use UI-TARS-MLX instead)

If you're on Apple Silicon with ≥16 GB RAM and you need generic, max-accuracy UI grounding, use mlx-community/UI-TARS-1.5-7B-4bit. It's the obvious default: ~94% on ScreenSpot-v2, MLX-native, and it drops into mlx-vlm directly. It was trained with ByteDance research-lab compute that you couldn't reproduce on a budget. browserground is not the right pick for that workload.

browserground is for two narrower jobs:

1. The recipe for your product's custom UI grounder. UI-TARS is a finished model: closed pipeline, proprietary data, hard to extend. browserground is the opposite: a reproducible template. Open base (Qwen3-VL-2B), open training scripts, open data mix (26k records, OS-Atlas + wave-ui). Swap in your dashboard's screenshots, your customer app, or your internal tooling, and ship a domain-trained UI grounder over a weekend. The 60% generic ScreenSpot-v2 score isn't the deliverable; the recipe is. A 60% baseline on generic screens can become 85-95% on your own product's narrow surface, because the test distribution finally matches the training distribution.

2. The smallest viable slot in a multi-model stack. browserground 4-bit MLX is ~1 GB on disk / ~2 GB RAM. UI-TARS-1.5-7B-MLX is ~4 GB / ~5-6 GB RAM. The difference matters on 8 GB Macs and in agent stacks that already run a 7B planner + an OCR model + an embedding model in the same RAM budget. Plus strict JSON output (100% parseable, no regex on prose) — small win, but real.

A direct head-to-head benchmark of browserground vs UI-TARS-1.5-7B-MLX on the same Apple Silicon hardware is forthcoming.

The broader argument — why a parser-stage specialist matters at all

And if you're new to the hybrid pattern — why this exists at all

Everyone's posting browser-agent demos this week. Click here, scroll there, fill that form. Most break by click seven.

Mine broke too: the frontier vision model literally couldn't see the submit button on a checkout form. Billed at $0.01-0.05 per call and called 20-50 times per agent run, the model was burning reasoning capacity on parsing pixel coordinates. A 2B specialist I trained for $5 is 3.3x more accurate at that grounding job on ScreenSpot-v2 (60.0% vs GPT-4o's 18.3%).

The architecture is the bug, not the model.

Two Jobs, One Forward Pass

Browser-agent stacks send a screenshot to a frontier vision model and ask it for both the next decision and the click coordinates in one call. Splitting that into two calls, a local 2B grounding model that emits JSON followed by a frontier model that reasons over the JSON, drops vision token spend and raises click accuracy.

browser-use (94k stars), Skyvern (22k stars), Claude Computer Use, OpenAI Operator. Same pattern. Same compound question every step:

Given this page, what should the agent do next, and where exactly does it click?

Two jobs welded together. Reasoning ("what next") is a probabilistic problem worth a frontier model. Grounding ("where exactly") is a structured-output problem with a tight schema: clickable elements, bounding boxes, accessible labels.

You're paying frontier-tier rates for the second job. Per screenshot. Every step of the loop.

Grounding Is a Parser Problem

Once you name it as a parser problem, the right tool changes. You don't need 200 billion parameters to emit a JSON list of clickable elements. You need a model that:

Has seen enough UI screenshots to recognize buttons, inputs, links with sub-50-pixel precision
Outputs strict JSON without hallucinating bounding boxes
Runs locally so the per-step cost is zero

A 2B specialist trained on screen-parsing data. Not a frontier model.

I trained one. Total cost: ~$5 of RunPod compute on a single A6000 GPU. The result, browserground, hits 60.0% on ScreenSpot-v2 vs GPT-4o's 18.3%, a 3.3x improvement at the click-grounding job. More telling: it beats SeeClick (9.6B params, 55.1%) while being 4.8x smaller. It is a drop-in for any agent loop currently handing screenshots to a frontier API. Today the CLI runs via transformers on Apple Silicon (~14 s/call); an MLX-native build is coming for the ~1.5 s path.

The Reasoning Model Gets Its Reasoning Capacity Back

When you split the call, the frontier model stops seeing pixels. It sees:

{
  "elements": [
    {"id": "e7", "label": "Submit order", "type": "button", "bbox": [344, 612, 478, 658]},
    {"id": "e8", "label": "Edit cart",    "type": "link",   "bbox": [...]}
  ]
}

Now the frontier model does the job it's good at: deciding e7 vs e8 given the agent's goal. A reasoning question over structured input. Cheap. Reliable. Auditable.
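A minimal sketch of the split, assuming the grounding step has already produced JSON in the shape shown above. The function names, the `fake_model` stand-in for the reasoning call, and the second element `h1` are hypothetical; only `e7` comes from the example.

```python
import json
from typing import Callable, Tuple


def bbox_center(bbox) -> Tuple[int, int]:
    """Integer centre of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) // 2, (y1 + y2) // 2)


def build_prompt(goal: str, elements: list) -> str:
    """Describe the page as text only; the reasoning model never sees pixels."""
    lines = [f'{e["id"]}: {e["type"]} "{e["label"]}"' for e in elements]
    return f"Goal: {goal}\n" + "\n".join(lines) + "\nReply with one element id."


def next_click(goal: str, grounding_json: str, choose: Callable[[str], str]):
    """Ask `choose` (stand-in for the frontier-model call) for an element id,
    then resolve that id to a pixel using the grounding output."""
    elements = json.loads(grounding_json)["elements"]
    by_id = {e["id"]: e for e in elements}
    chosen = choose(build_prompt(goal, elements)).strip()
    if chosen not in by_id:
        raise ValueError(f"unknown element id from model: {chosen!r}")
    return chosen, bbox_center(by_id[chosen]["bbox"])


# e7 is taken from the example above; h1 is a made-up second element.
grounding = json.dumps({"elements": [
    {"id": "e7", "label": "Submit order", "type": "button", "bbox": [344, 612, 478, 658]},
    {"id": "h1", "label": "Back to shop", "type": "link", "bbox": [40, 20, 160, 48]},
]})


def fake_model(prompt: str) -> str:
    return "e7"  # a real implementation would call the reasoning API here


result = next_click("place the order", grounding, fake_model)
print(result)  # ('e7', (411, 635))
```

The reasoning model only ever returns an id. The pixel coordinates come from the grounding output, so a bad answer fails loudly as an unknown id instead of silently as a wrong click.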

Three things change at once. Per-step token spend on vision collapses, because the grounding step runs locally. JSON validity hits 100% (the specialist learned the output convention with 35M LoRA parameters on a Qwen3-VL-2B base). Agent traces become debuggable: you can read the structured grounding output before the reasoning step ever runs.
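Even with a model that emits parseable JSON, a strict check between the two steps keeps malformed boxes away from the reasoning model. The sketch below is one way to write that check. The allowed `type` values and the screenshot dimensions are assumptions for illustration, not part of browserground's documented schema.

```python
import json

ALLOWED_TYPES = {"button", "link", "input"}  # assumed vocabulary


def validate_grounding(raw: str, width: int, height: int) -> list:
    """Parse grounding output and raise ValueError on anything malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    elements = data.get("elements") if isinstance(data, dict) else None
    if not isinstance(elements, list):
        raise ValueError("missing 'elements' list")
    seen = set()
    for e in elements:
        if not isinstance(e, dict):
            raise ValueError("element is not an object")
        for key in ("id", "label", "type", "bbox"):
            if key not in e:
                raise ValueError(f"element missing {key!r}")
        if e["id"] in seen:
            raise ValueError(f"duplicate id {e['id']!r}")
        seen.add(e["id"])
        if e["type"] not in ALLOWED_TYPES:
            raise ValueError(f"unexpected type {e['type']!r}")
        bbox = e["bbox"]
        is_numbers = isinstance(bbox, list) and len(bbox) == 4 and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in bbox
        )
        if not is_numbers:
            raise ValueError(f"bbox for {e['id']!r} is not four numbers")
        x1, y1, x2, y2 = bbox
        # The box must have positive area and lie inside the screenshot.
        if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
            raise ValueError(f"bbox for {e['id']!r} is outside the image")
    return elements


good = '{"elements": [{"id": "e7", "label": "Submit order", "type": "button", "bbox": [344, 612, 478, 658]}]}'
print(len(validate_grounding(good, 1280, 800)))  # 1
```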

What Anthropic and OpenAI Ship Next

The frontier providers will absorb grounding into their own small models. Within twelve months, "fast vision" or "tool vision" tiers will appear in both Anthropic and OpenAI billing at a fraction of frontier rates. The economics demand it. Nobody can justify charging GPT-5 prices for a parser, and Hugging Face downloads already prove the demand: SeeClick, UI-TARS, and ShowUI pull ~300k category downloads a month between them.

When that ships, stack owners who already split grounding from reasoning have three things the wait-and-see crowd doesn't. A local fallback if the provider has an outage. An auditable structured-grounding trace in every log line. An exit option to a different reasoning provider without re-validating click behavior, because the grounding step belongs to them.

Stack owners who didn't split will find their grounding step has quietly become someone else's API. Same vendor billing the reasoning calls. Same vendor setting the price. Same vendor's deprecation calendar.

The Diagnostic

Pull up your last failed agent trace. Three numbers:

Total tokens spent on vision calls per agent step.
Fraction of those tokens spent on grounding (parsing pixel coordinates) vs reasoning (deciding actions).
Per-run vision cost at your current API rates.

If grounding dominates the first two numbers, and in most stacks it does, your stack has the split wrong.
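The three numbers can be computed from a trace along these lines. The trace shape is hypothetical, and so is the attribution rule: in a fused call the two jobs are not billed separately, so this sketch assumes you have already labelled each call's tokens (for example, image input tokens as grounding and text tokens as reasoning).

```python
from dataclasses import dataclass


@dataclass
class VisionCall:
    step: int
    grounding_tokens: int  # assumed label: tokens spent locating elements
    reasoning_tokens: int  # assumed label: tokens spent deciding the action


def diagnose(calls, usd_per_1k_tokens: float) -> dict:
    """Return tokens per step, grounding share, and per-run cost."""
    if not calls:
        raise ValueError("empty trace")
    steps = {c.step for c in calls}
    grounding = sum(c.grounding_tokens for c in calls)
    total = grounding + sum(c.reasoning_tokens for c in calls)
    return {
        "tokens_per_step": total / len(steps),
        "grounding_fraction": grounding / total,
        "run_cost_usd": total / 1000 * usd_per_1k_tokens,
    }


# Made-up trace with three steps; the rate is a placeholder, not a real price.
trace = [VisionCall(1, 1200, 300), VisionCall(2, 1100, 400), VisionCall(3, 1300, 200)]
report = diagnose(trace, usd_per_1k_tokens=0.01)
print(report["tokens_per_step"], report["grounding_fraction"])  # 1500.0 0.8
```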

Grounding is plumbing. Reasoning is cognition. Stop paying cognition rates for plumbing.

I build the split layer. browserground is the open-source reference for the local grounding half. v0.3 ships three packagings so it drops into any stack:

npm CLI (daemon, HTTP REST server, batch, confidence, eval): npm install -g browserground → browserground parse <img> --target "..."
PyPI (no Node required, MLX or transformers): pip install "browserground[mlx]" → from browserground import click_xy
Ollama (cross-platform, GGUF Q4_K_M + f16 mmproj): ollama run renezander030/browserground

Adapters land in the repo for browser-use (drop-in Controller action) and Skyvern (ground_with_fallback for local-first + cloud-fallback). Model: huggingface.co/renezander030/browserground. MLX 4-bit: browserground-mlx. GGUF: browserground-gguf. Source: github.com/renezander030/browserground. Apache-2.0. v0.2 LoRA trained on 26k mixed-domain examples (macOS + Android + UIBert + web). PRs welcome, especially eval cases where it fails.

I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at renezander.com.

Three Files I Renamed Last Month That Fixed My AI Agent

René Zander — Tue, 12 May 2026 06:59:17 +0000

A new job title showed up in my feed last week. Context engineer.

I clicked. Twice. Three articles, all definitional. None of them showed code. Every definition was a paraphrase of "name things well."

Then I opened my own repo and grepped for utils.py. Eleven hits. Six folders deep, in a project I have shipped to production. The same project where I had spent two hours that morning fighting an agent that kept loading the wrong helper file.

The fix took thirty seconds. I renamed the file.

The pitch checks out

The current pitch goes like this. LLMs perform better with the right context. So we need a new role to design that context. Curate the files, structure the prompts, build the retrieval system, manage the embeddings.

Each claim is accurate. None of them describes a new problem.

You have been doing this since you wrote your first config file. Every time you renamed helpers.js to payment-validation.js, you were context engineering. Every time you split a 400-line file into three named pieces, you were context engineering. The audience was always the next developer reading the file. Now there is one more reader.

The actual job

A senior engineer joining a codebase does three things in the first week. They read the folder structure. They read the names that show up most often in the imports. They follow the breadcrumb from filename to function to line.

Agents do the same thing. The retrieval layer in your agent loop is grep with extra steps. It pulls files whose names match the task. It pulls functions whose docstrings match the intent. It pulls comments that say what the code does.

If your filenames are vague, the retrieval is vague. If your function names lie about what the function does, the agent loads the wrong one. If your folder names group code by file type instead of by concern, the agent loads models/ and gets nothing useful.
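To make the "grep with extra steps" point concrete, here is a toy name-based retriever. It is a deliberately naive sketch, not how any particular agent framework ranks files: it scores each path by how many task words appear in its name.

```python
import re


def tokens(text: str) -> set:
    """Lowercase alphanumeric tokens; hyphens, dots and slashes act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_files(task: str, paths: list) -> list:
    """Order paths by token overlap with the task, dropping zero-overlap paths."""
    task_tokens = tokens(task)
    scored = [(len(task_tokens & tokens(p)), p) for p in paths]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [path for score, path in ranked if score > 0]


paths = ["utils.py", "api.py", "payment-webhook-handler.py", "format-currency.py"]
hits = rank_files("fix the payment webhook retry", paths)
print(hits)  # ['payment-webhook-handler.py']
```

`utils.py` and `api.py` share no token with the task, so a name-driven retriever has nothing to match on. Real retrieval layers add embeddings and docstrings, but the filename is still one of the signals they see.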

The skill we have been failing to enforce for thirty years became load-bearing the day a probabilistic reader joined the loop.

Three things I renamed last month

api.py became payment-webhook-handler.py. The agent stopped loading it for unrelated payment questions. One rename, one less failure mode.

utils/ got deleted. The five files inside moved next to the code that called them, with names that said what they did. format-currency.py, parse-iso-date.py, redact-pii.py. The agent now loads them only when the task mentions currency, dates, or PII.

A 600-line process.py split into four files. validate-input.py, dedupe-rows.py, enrich-from-cache.py, write-to-warehouse.py. The agent stopped trying to read the whole pipeline to answer questions about a single step.

Call it context engineering if the buzzword helps. The work is renaming.

Where the model has to guess

Walk your repo right now. Count the files and folders named utils.py, helpers.js, common.go, lib/, services/, manager/. Count functions named process, handle, run, execute, do.

Each one is a place where the model has to guess. Each one is a place where you have to write a longer prompt to compensate. Each one is a paragraph you will never have to write again if you rename it now.
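The walk can be scripted. The sketch below uses the names listed above and only inspects Python `def` lines for function names; the regex is a simplification and would need extending for other languages. The demo runs against a throwaway directory so it is safe to execute as is.

```python
import os
import re
import tempfile

VAGUE_FILES = {"utils.py", "helpers.js", "common.go"}
VAGUE_DIRS = {"lib", "services", "manager"}
VAGUE_FUNCS = {"process", "handle", "run", "execute", "do"}
DEF_RE = re.compile(r"^\s*(?:async\s+)?def\s+(\w+)\s*\(", re.MULTILINE)  # Python only


def audit(root: str) -> dict:
    """Collect vaguely named files, folders and Python functions under root."""
    found = {"files": [], "dirs": [], "functions": []}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden folders such as .git in place so os.walk skips them.
        dirnames[:] = sorted(d for d in dirnames if not d.startswith("."))
        found["dirs"] += [os.path.join(dirpath, d) for d in dirnames if d in VAGUE_DIRS]
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if name in VAGUE_FILES:
                found["files"].append(path)
            if name.endswith(".py"):
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    source = fh.read()
                found["functions"] += [
                    f"{path}:{fn}" for fn in DEF_RE.findall(source) if fn in VAGUE_FUNCS
                ]
    return found


# Demo on a temporary repo with one vague file, one vague folder, one vague function.
with tempfile.TemporaryDirectory() as repo:
    os.makedirs(os.path.join(repo, "services"))
    with open(os.path.join(repo, "utils.py"), "w", encoding="utf-8") as fh:
        fh.write("def process(rows):\n    return rows\n")
    with open(os.path.join(repo, "services", "parse_iso_date.py"), "w", encoding="utf-8") as fh:
        fh.write("def parse_iso_date(text):\n    return text\n")
    counts = {kind: len(items) for kind, items in audit(repo).items()}

print(counts)  # {'files': 1, 'dirs': 1, 'functions': 1}
```

To audit a real project, call `audit(".")` from the repository root and read the three lists.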

The reason context engineering feels hard is that you are trying to solve at retrieval time what should have been solved at naming time. You cannot grep your way out of a vague folder structure. You cannot embedding-search your way out of process(). The model is asking the same question the senior engineer asks in week one. The answer was always going to be: name things by what they do, not by what they are.

The role that already existed

There is a real version of context engineering. Chunking strategy, embedding choice, retrieval rerankers, evaluation harnesses. That work is real and hard.

Most of what gets called context engineering this month is the rename you skipped in the original PR.

A context engineer is a developer who finally names things.

What is the file in your codebase that the agent keeps loading wrong?


Your AI Workflow Doesn't Need Better Prompts. It Needs Less AI.

René Zander — Tue, 05 May 2026 06:33:55 +0000

The first stage of AI work is prompting.

The last stage is removing the model from most of the workflow.

That sounds backwards.

It is not.

When a workflow is new, the LLM is useful because the work is still ambiguous. You are discovering what good looks like. You try a prompt, read the output, adjust the examples, change the tone, add constraints, and run it again.

That is a good use of AI.

But if the same workflow keeps coming back, and you are still explaining it to the model every time, you are not building capability. You are repeating yourself with a better interface.

The mature workflow is not one where the LLM does everything.

The mature workflow is one where the LLM only handles the part where ambiguity is useful.

Everything else becomes process.

Prompting Is Discovery

Prompting is where most people start because it is the fastest way to get feedback.

You ask:

"Write this article."
"Make it sound less generic."
"Use my style."
"Add examples."
"Make the intro stronger."

At this stage, the model is helping you figure out the shape of the task.

You do not fully know the target yet. You are exploring. You are testing whether the idea even works. You are using the model as a thinking partner, a drafter, a critic, and sometimes a mirror.

That is fine.

The mistake is treating this as the final form.

If you have to keep saying the same thing, you do not have a workflow. You have a recurring conversation.

Better Prompts Still Have a Ceiling

The next stage is usually better prompting.

You add examples. You add constraints. You add a target audience. You tell the model what to avoid. You define the output format. You write a longer system prompt.

The output improves.

For a while, this feels like progress.

But longer prompts have a hidden failure mode: the model still has to remember and balance everything at once.

Style rules. Factual constraints. Tone. Audience. Platform conventions. Examples. Edge cases. Forbidden phrases. Review criteria.

All of it goes into one big instruction block.

The prompt becomes a pile of expectations, and the model is still the one deciding which expectations matter in the moment.

That is fragile.

At some point, "make the prompt better" stops being the right move.

Skills Are Repetition

The next level is turning repeated prompting into a skill.

A skill packages the context and process:

what files or sources to read
what examples matter
what tone to use
what scripts to run
what output format is expected
what review criteria should be applied
what fallback path to use when something breaks

This is a real improvement.

The workflow becomes portable. You stop explaining everything from scratch. The model gets the right context faster. You are no longer relying on whatever happens to be in the current chat thread.

For many AI workflows, this is the first serious productivity jump.

But skills have their own failure mode.

They can become too large.

When Skills Become Prompt Monsters

A skill can start as a clean reusable process and slowly turn into another giant prompt.

More instructions.

More exceptions.

More "always do this."

More "never do that."

More examples.

More scripts.

More personal preferences.

At some point, the skill is not making the workflow reliable. It is just giving the model more things to interpret.

The model still has to decide what matters.

The model still has to judge whether the output is good enough.

The model is still checking its own homework.

That is the point where the workflow needs to move outside the LLM.

Not all of it.

The stable parts.

The Model Can Write the Code. It Does Not Get to Decide Whether the Code Passes.

Developers already understand this when we talk about code.

We do not ask a developer, human or AI:

"Does this code look correct?"

We run the formatter.

We run the linter.

We run the tests.

We run the type checker.

We run CI.

We use pre-commit hooks.

The model can generate code, but it does not get to decide whether the code passes.

The gate decides.

That is the important shift.

If a rule can be checked deterministically, it should not live only inside a prompt.

For code, the gates are obvious:

formatting
linting
type checks
unit tests
integration tests
golden files
JSON schema validation
pre-commit hooks
CI checks

The agent can still do useful work. It can draft the patch, explain a tradeoff, write a test, debug a failure, or propose a smaller path.

But the agent should not be the final authority on whether the work satisfies the standard.

That authority should be outside the model.

This Applies to Content Too

Content feels less deterministic than code, so people keep more of the workflow inside the prompt.

But the same principle applies.

If you have found a content formula that works, do not just ask the model to remember it.

Turn it into gates.

For example, before publishing an article:

Does the title match the actual promise of the article?
Does the first section create tension quickly?
Is the reader obvious?
Is there one clear argument?
Does every section move the argument forward?
Are there generic AI phrases that should be removed?
Does the article contain a real observation or only recycled advice?
Is there a concrete next action for the reader?
Does the tag strategy match the article, not just the broad topic?

These checks are not perfect.

But they are better than "make it good."

"Make it good" is a vibe.

A gate is a standard.

The more often a workflow matters, the more it deserves standards that do not depend on the model's mood.
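Some of the checks in that list need judgement, but the mechanical ones can run as code before a draft goes anywhere. A minimal sketch, assuming a banned-phrase list and a title limit that you would replace with your own (the phrases and the 80-character limit here are examples, not a recommendation):

```python
import re

# Example phrases only; the real list would come from your own style guide.
BANNED_PHRASES = ["delve into", "game-changer", "in today's fast-paced world"]
PLACEHOLDER_RE = re.compile(r"\bTODO\b|\[link\]", re.IGNORECASE)


def check_article(title: str, body: str, max_title_len: int = 80) -> list:
    """Return a list of gate failures; an empty list means the draft passes."""
    failures = []
    if not title.strip():
        failures.append("title is empty")
    elif len(title) > max_title_len:
        failures.append(f"title exceeds {max_title_len} characters")
    lowered = body.lower()
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            failures.append(f"banned phrase: {phrase}")
    if PLACEHOLDER_RE.search(body):
        failures.append("unresolved placeholder in body")
    return failures


draft = check_article("It Needs Less AI", "Let's delve into gates. TODO: add example.")
clean = check_article("It Needs Less AI", "A gate is a standard.")
print(draft)  # ['banned phrase: delve into', 'unresolved placeholder in body']
print(clean)  # []
```

The gate returns reasons, not a score, so a failed draft tells the writer (human or model) exactly what to fix.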

The 20% LLM Workflow

The mature version of an AI workflow is not:

"The LLM does everything."

It is:

"The LLM does the part where ambiguity is useful. The system handles the rest."

That might mean the model only does 20% of the workflow.

And that is a good thing.

For a content workflow, the LLM might:

propose angles
compare possible hooks
draft sections
rewrite unclear paragraphs
find contradictions
suggest examples

But the system should handle:

loading the source notes
selecting the target platform
applying the tag strategy
checking title/promise match
scanning for banned phrases
verifying links
enforcing formatting
creating the publishing task

For a coding workflow, the LLM might:

inspect the codebase
implement the change
write tests
explain a failure
reduce a patch

But the system should handle:

formatting
linting
type checking
test execution
schema validation
contract checks
CI gates

That is not using AI less because AI is weak.

It is using AI less because the workflow is becoming stronger.

The LLM Should Be the Escalation Path, Not the Reflex

There is a simple rule I keep coming back to:

If code can check it, do not ask the model to remember it.

If a script can clean it, do not spend tokens reasoning about it.

If a test can catch it, do not rely on a sentence in a prompt.

The LLM should be used when ambiguity remains.

It should not be the first line of defense for things that are already measurable.

This is where many AI workflows waste effort. They ask the model to handle everything:

classify the task
remember the rules
generate the output
check the output
decide whether the output is done
explain why it is done

That is too much responsibility in one probabilistic step.

Split the work.

Let the model handle the ambiguous part.

Let deterministic systems handle the stable part.
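One way to read "escalation path, not reflex" in code: try the deterministic rule first, call the model only when the rule finds nothing, and gate the model's answer with the same rule. The date-extraction task and the `ask_model` callable are hypothetical stand-ins chosen to keep the sketch small.

```python
import re
from typing import Callable, Optional, Tuple

ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_date_rule(text: str) -> Optional[str]:
    """Deterministic path: find a literal YYYY-MM-DD date."""
    match = ISO_DATE.search(text)
    return match.group(0) if match else None


def extract_date(text: str, ask_model: Callable[[str], str]) -> Tuple[str, str]:
    """Return (date, source). The model is called only if the rule fails."""
    found = extract_date_rule(text)
    if found is not None:
        return found, "rule"
    answer = ask_model(f"Return the date in this text as YYYY-MM-DD: {text}").strip()
    # The model's answer still has to pass the deterministic format check.
    if not ISO_DATE.fullmatch(answer):
        raise ValueError(f"model answer is not a date: {answer!r}")
    return answer, "model"


model_calls = []


def fake_model(prompt: str) -> str:
    model_calls.append(prompt)
    return "2026-05-26"  # a real implementation would call an LLM API here


first = extract_date("Due 2026-05-19, no extensions.", fake_model)
second = extract_date("Due next Tuesday.", fake_model)
print(first, second, len(model_calls))
```

Only the second input reaches the model, and the returned `source` makes it visible in logs how often the escalation path is actually used.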

A Self-Check for Your Own AI Workflow

If you want to know where your workflow is immature, ask these questions:

What do I keep explaining to the model again and again?

That probably belongs in a skill.

What does the model keep judging by itself?

That probably belongs in a gate.

What failure would be obvious to a script, linter, test, schema, or checklist?

That should not live only in the prompt.

If I removed the LLM tomorrow, which parts of the workflow would still be clear?

Those parts are real process.

Which parts only work because the model is being generous?

Those parts are risk.

This self-check is uncomfortable because it reveals how much "automation" is just trust in a model call.

But that is the point.

Capability is not how much work you can hand to the model.

Capability is how much of the workflow still holds when the model is only doing the part it is actually good at.

The Maturity Curve

The pattern looks like this:

Prompt -> Skill -> Gate -> System

Or:

Ask -> Package -> Validate -> Automate

When the task is new, prompt.

When the task repeats, create a skill.

When the skill succeeds, move the stable checks into gates.

When the gates are stable, reduce the LLM's responsibility.

This is the path from beginner prompting to a 20% LLM workflow.

It does not make the model irrelevant.

It puts the model in the right place.

The Wrong Goal Is Better Prompting Forever

There is a lot of advice about better prompts.

Some of it is useful.

But better prompting is not the destination.

Better prompting helps you discover the workflow.

Skills help you repeat the workflow.

Gates help you trust the workflow.

Systems help you scale the workflow.

If you stop at prompting, every task stays a negotiation.

If you stop at skills, every process still depends on the model interpreting the skill correctly.

If you add gates, the model has something it must pass.

That is the difference between a helpful assistant and a reliable workflow.

The Useful Question

The useful question is no longer:

"How do I write a better prompt?"

The useful question is:

"Which part of this should stop being a prompt?"

If it is repeated context, make it a skill.

If it is a stable rule, make it a checklist.

If it is measurable, make it a test.

If it is non-negotiable, make it a gate.

Let the LLM handle ambiguity.

Make the system handle standards.

Use prompts to discover.

Use skills to repeat.

Use gates to scale.

And when the workflow is mature, let the LLM do less.

That is not a downgrade.

That is capability becoming real.

Which part of your AI workflow should stop being a prompt?


Pure semantic search missed 4 of 5 of my agent queries. Hybrid + parallel fan-out fixed it.

René Zander — Sat, 02 May 2026 14:13:04 +0000

When Karpathy's LLM Wiki post landed, I already had semantic search over my TickTick — qdrant for the vector store, nomic-embed-text via ollama for embeddings, a daily cron to keep the index fresh, the works. The agent-side retrieval wasn't the missing piece.

What was missing was the structure. Karpathy's framing — designate a wiki, write notes for an LLM reader, lean on retrieval instead of taxonomy — surfaced the parts of my setup that didn't have shape yet: where durable knowledge lives versus ephemeral tasks, how agents pull structured data out of notes humans wrote, why my existing semantic search sometimes returned the right answer and sometimes returned nothing useful.

I almost migrated to plain markdown anyway. Thousands of durable notes — production playbooks, API quirks, decisions I want to survive next month's task list — already live in TickTick. They sync to my phone. Capture friction is zero. Migrating breaks all of that.

So I built the wiki structure on top of TickTick, and made the storage layer swappable. The retrieval, the wiki conventions, the agent-data note pattern, the bench harness — none of those are TickTick-specific. They're a small framework. You point it at TickTick / Notion / Obsidian / Things / a folder of markdown / whatever you've already invested years of capture habit into.

I'm calling it Agentic Knowledge Base.

The framework, in one diagram

                  ┌───────────────────────────────────┐
                  │  Agent (Claude / scripts / cron)  │
                  └─────────────────┬─────────────────┘
                                    │ akb find / get / url / links
                                    ▼
                  ┌───────────────────────────────────┐
                  │  AKB Core                         │
                  │  • parallel retrieval + RRF       │
                  │  • corpus cache (5 min TTL)       │
                  │  • bench harness                  │
                  │  • usage logging                  │
                  └─────────────────┬─────────────────┘
                                    │ adapter interface
                  ┌─────────────────┼──────────────────┐
                  ▼                 ▼                  ▼
       ┌────────────────┐  ┌────────────────┐  ┌────────────────┐
       │ adapter-       │  │ adapter-       │  │ adapter-       │
       │ ticktick       │  │ obsidian       │  │ notion         │
       │  (reference)   │  │  (filesystem)  │  │  (your turn)   │
       └────────────────┘  └────────────────┘  └────────────────┘

The Core is storage-agnostic. The retrieval, the cache, the bench, the usage logger — none of them know what TickTick is. They call a small adapter interface (~6 methods).

Karpathy's setup, in this framing, is the filesystem adapter of a broader pattern. Mine is the TickTick adapter. Yours might be the Notion or Obsidian one.

The adapter interface

Six methods. Two payload shapes.

interface KnowledgeAdapter {
  listProjects(): Promise<Project[]>
  listTasksInProject(projectId: string): Promise<Task[]>
  getTask(projectId: string, taskId: string): Promise<Task>
  createTask(input: TaskInput): Promise<Task>
  updateTask(projectId: string, taskId: string, patch: TaskPatch): Promise<Task>
  urlFor(ref: { projectId: string, taskId: string }): string  // deep-link string
}

type Project = { id: string, name: string, kind?: 'tasks' | 'notes' }
type Task    = { id: string, title: string, content: string, projectId: string, tags: string[], dueDate?: string, modifiedTime: string }

Anything you can list, get, and link to — task systems, note apps, plain folders — can be an adapter.

If your storage exposes a native search endpoint, your adapter can implement an optional searchByQuery(query) and the core will use it as one branch of the parallel retrieval. If not, the core falls back to its own keyword scan against the corpus.

That's the whole interface. Everything interesting is in the Core.
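To show how little an adapter needs, here is a read-only folder adapter written as a Python analogue of the interface above. It is a sketch, not code from the repo: method names are snake_cased, the write methods are omitted, and the mapping (subfolder = project, `.md` file = task, first line = title) is an assumption.

```python
import os
import tempfile
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class Task:
    id: str
    title: str
    content: str
    project_id: str
    modified_time: str
    tags: list = field(default_factory=list)


class FolderAdapter:
    """Each subfolder of root is a project; each .md file inside it is a task."""

    def __init__(self, root):
        self.root = Path(root)

    def list_projects(self) -> list:
        return [{"id": p.name, "name": p.name, "kind": "notes"}
                for p in sorted(self.root.iterdir()) if p.is_dir()]

    def get_task(self, project_id: str, task_id: str) -> Task:
        path = self.root / project_id / f"{task_id}.md"
        text = path.read_text(encoding="utf-8")
        first_line = text.splitlines()[0] if text else ""
        title = first_line.lstrip("# ").strip() or task_id  # fall back to the file name
        mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc).isoformat()
        return Task(task_id, title, text, project_id, mtime)

    def list_tasks_in_project(self, project_id: str) -> list:
        folder = self.root / project_id
        return [self.get_task(project_id, p.stem) for p in sorted(folder.glob("*.md"))]

    def url_for(self, project_id: str, task_id: str) -> str:
        return (self.root / project_id / f"{task_id}.md").resolve().as_uri()


# Demo against a temporary folder holding a single note.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "wiki"))
    Path(root, "wiki", "ffmpeg.md").write_text(
        "# ffmpeg commands\nffmpeg -i in.mp4 out.gif\n", encoding="utf-8")
    adapter = FolderAdapter(root)
    projects = adapter.list_projects()
    notes = adapter.list_tasks_in_project("wiki")
    link = adapter.url_for("wiki", "ffmpeg")

print(projects[0]["id"], notes[0].title)  # wiki ffmpeg commands
```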

Two patterns the Core implements (worth stealing)

1. Agent-data notes

A regular note whose body has a fenced JSON (or YAML) block. Humans read the prose at the top. Agents extract the JSON via the adapter.

The note's content looks like this — prose first, then a single fenced JSON block:

Type: agent-data
Consumed by: EOD triage cron, capture-time relevance enrichment

A "trunk" is an active project the user cares about. Edit this list when projects launch, finish, or shift focus.

{
  "trunks": [
    { "name": "release-engineering", "desc": "shipping cadence, deployment rituals, on-call rotation" },
    { "name": "writing-projects", "desc": "drafts and edits across personal and client channels" }
  ]
}

Read it from any cron or agent:

akb get "Trunk Catalog" --extract json | jq '.trunks[].name'

The benefit: one note, mobile-editable in your existing app, consumed by agents as structured data. Single source of truth, no schema migration. This pattern works for anything an agent needs programmatically and a human needs to edit on the move: prompt templates, character locks for video projects, recurring queries, cron config.
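The extraction step is small. The sketch below shows one way it could work, not how `akb get --extract json` is actually implemented: find the first fenced block in the note body and parse it. The fence string is built at runtime only so the example can sit inside a code block itself.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built at runtime
BLOCK_RE = re.compile(
    re.escape(FENCE) + r"(?:json)?[ \t]*\n(.*?)\n" + re.escape(FENCE), re.DOTALL
)


def extract_json(note_body: str):
    """Return the parsed contents of the first fenced block in a note."""
    match = BLOCK_RE.search(note_body)
    if match is None:
        raise ValueError("note has no fenced block")
    return json.loads(match.group(1))


note = "\n".join([
    "Type: agent-data",
    "Consumed by: EOD triage cron, capture-time relevance enrichment",
    "",
    'A "trunk" is an active project the user cares about.',
    "",
    FENCE + "json",
    '{"trunks": [{"name": "release-engineering"}, {"name": "writing-projects"}]}',
    FENCE,
])

names = [trunk["name"] for trunk in extract_json(note)["trunks"]]
print(names)  # ['release-engineering', 'writing-projects']
```

A parse failure raises instead of returning a default, so a typo made on a phone shows up in the cron log and does not silently empty the trunk list.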

2. Parallel retrieval with provenance

Three retrievers run in parallel against a shared cached corpus, results are RRF-fused, and the top-K come back tagged with which retrievers agreed:

Hybrid — dense cosine (qdrant + nomic-embed) + sparse keyword, internally RRF'd
Keyword — substring match on title + content
Notes-find — title-fuzzy on a designated wiki project

For a query like openrouter api key, all three retrievers return the same gold note. The fused result tags it sources: [hybrid, keyword, notes_find] — three independent signals agreeing means high confidence. Lower-ranked results have only one source — look at them with skepticism.

For a query like ffmpeg commands, the keyword tool misses (the literal phrase isn't in any document). Pure semantic misses too (nomic-embed underweights short titles like ffmpeg). Hybrid catches it. The fan-out gracefully handles the asymmetry — different queries lean on different retrievers, and the core doesn't pretend any single algorithm is universally best.
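Reciprocal rank fusion with provenance fits in a few lines. In this sketch each retriever contributes 1 / (k + rank) for every document it returns, and the fused result remembers which retrievers contributed. The constant k = 60 is the commonly used default, not a value stated for AKB, and the document ids are invented.

```python
from collections import defaultdict


def rrf_fuse(rankings: dict, k: int = 60, top_k: int = 5) -> list:
    """rankings maps retriever name -> doc ids, best first.
    Returns the top_k docs by fused score, each tagged with its sources."""
    scores = defaultdict(float)
    sources = defaultdict(list)
    for name, ranked in rankings.items():
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
            sources[doc_id].append(name)
    # Sort by score descending; break ties by id so the order is deterministic.
    order = sorted(scores, key=lambda doc: (-scores[doc], doc))
    return [{"id": doc, "score": round(scores[doc], 4), "sources": sources[doc]}
            for doc in order[:top_k]]


fused = rrf_fuse({
    "hybrid": ["openrouter-key", "api-quirks"],
    "keyword": ["openrouter-key"],
    "notes_find": ["openrouter-key", "wiki-index"],
})
for row in fused:
    print(row["id"], row["sources"])
```

The top row carries all three source tags, while the two runners-up carry one each, which is the signal to treat them with more skepticism.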

A 5-min disk-backed corpus cache means warm queries are sub-100ms. The first call after a cold start fetches your full task/note list (one batch — adapters that support it use a single API call; adapters that don't fall back to per-project iteration). Within a working session, retrieval is essentially free.
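A disk-backed cache with a time-to-live can be sketched as below. The file layout, the class name, and the injectable clock are assumptions made for the example; only the 5-minute TTL comes from the text.

```python
import json
import os
import tempfile
import time


class CorpusCache:
    """Stores the corpus as JSON on disk and reuses it until the TTL expires."""

    def __init__(self, path: str, ttl_seconds: float = 300, clock=time.time):
        self.path = path
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the TTL can be tested without sleeping

    def get(self, fetch):
        now = self.clock()
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as fh:
                entry = json.load(fh)
            if now - entry["saved_at"] < self.ttl:
                return entry["corpus"]  # warm hit: no adapter call
        corpus = fetch()
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump({"saved_at": now, "corpus": corpus}, fh)
        os.replace(tmp_path, self.path)  # atomic swap, readers never see a partial file
        return corpus


# Demo with a fake clock: cold fetch, warm hit after 60 s, refetch once expired.
with tempfile.TemporaryDirectory() as tmp:
    now = [1000.0]
    fetches = []

    def fetch():
        fetches.append(now[0])
        return [{"id": "n1", "title": "ffmpeg commands"}]

    cache = CorpusCache(os.path.join(tmp, "corpus.json"), clock=lambda: now[0])
    cache.get(fetch)
    now[0] += 60
    warm = cache.get(fetch)
    now[0] += 300
    cache.get(fetch)

print(len(fetches))  # 2
```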

The bench

I built a small harness in bench/. Questions paired with gold answers (the task or note that actually contains the answer). Each retriever runs against the same questions, results scored by hit@1 / recall@5 / MRR.

Five agent-issued queries (the rephrased version Opus 4.7 actually generates, not the natural-language form a human types):

Method	hit@1	recall@5	MRR	warm latency
keyword (substring)	20%	20%	0.20	<100ms
semantic (dense only)	20%	40%	0.30	~300ms
hybrid (dense + sparse RRF)	60%	80%	0.70	~500ms
find (parallel + cache)	60%	80%	0.70	~93ms

The headline number for agent retrieval is recall@5 = 80%: the right doc lands in the top five 4 times out of 5. Agents read top-K, not just rank 1, so recall@5 is the metric that actually predicts whether the agent gets the context it needs. hit@1 (60%) is a stricter cut and a leading indicator for "did the first guess work": useful, but not the bar. Five questions are too few to generalize from, so grow the benchmark as confidence in a particular adapter accumulates.
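For reference, the three metrics reduce to simple arithmetic over the rank at which each query's gold document appears. The rank list below is hypothetical: the post does not give per-query ranks, but this pattern is one that reproduces the hybrid row of the table.

```python
def hit_at_1(ranks: list) -> float:
    """Share of queries whose gold doc is ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)


def recall_at_k(ranks: list, k: int = 5) -> float:
    """Share of queries whose gold doc appears in the top k.
    With one gold doc per query this is the same as hit@k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks: list) -> float:
    """Mean reciprocal rank; a miss (None) contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical: three golds at rank 1, one at rank 2, one not retrieved.
ranks = [1, 1, 1, 2, None]
print(hit_at_1(ranks), recall_at_k(ranks), mrr(ranks))  # 0.6 0.8 0.7
```

With only five queries each one moves hit@1 and recall@5 by 20 points, which is why single-query changes look dramatic in the table.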

Why I optimized for the model, not for me

There's a subtle reframe that took an embarrassing number of iterations to land.

When I use search, I type a single word: ffmpeg. The keyword tool returns the right note instantly.

When Claude uses search on my behalf — "where did I document my ffmpeg workflow?" — it issues something like find "What ffmpeg commands do I have notes on?". Different shape entirely. The model writes longer queries. It uses question phrasing. It includes scope words.

Optimizing for human queries was the wrong objective. The user (me) wasn't using these tools — Claude was. Every retrieval test had to be written in the form Opus 4.7 actually generates, not how I'd type it. That changes which retriever wins.

Tomorrow's model writes queries differently. The benchmark needs to track the model in use, not a fixed assumption about query shape. The bench file is short and dated; re-tune when the model changes.

What I deliberately didn't build (yet)

Karpathy's wiki post mentions periodically updating notes when facts change, propagating new information across the knowledge base. Useful at scale, but auto-rewriting notes is high-blast-radius and needs an approval ramp before it's trustworthy. I sketched it: a weekly cron that semantic-searches for affected notes, drafts updates, queues them for my approval, and applies the approved ones. Deferred.

Same call on a "lint the wiki" pass (Karpathy idea: agent reads every note weekly, flags missing summaries, dangling references, contradictions). Useful at scale; premature when the wiki itself is still under construction.

Both will live in Core when they ship — adapter-agnostic by design.

A daily flow (my setup, your tools optional)

This is what runs:

Capture (mobile, manual). I add a task or note in my storage app. No CLI involved. The friction has to be zero.
Capture-time relevance prompt (when in a Claude session). akb create "..." --relevance appends a small instruction block to the result. Active Claude reads it, picks a project trunk, calls akb update to append a why: <trunk> — <reason> line. Five seconds of LLM-side reasoning makes that task much more retrievable later.
EOD triage (cron, daily morning). Pulls yesterday's completed tasks, scores them 0–3 against the trunks (read live from a Trunk Catalog agent-data note), sends a Telegram message with keepers grouped by trunk. I read it on my phone with breakfast.
Retrieval (during work, all surfaces). When Claude needs context — akb find <query> returns top-K with provenance. Cached, parallel, sub-100ms warm.
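The EOD triage step above boils down to "keep the tasks that scored high enough, group them by trunk, format a message". A rough sketch of that last mile, assuming the 0–3 scores already exist. The field names, the min_score=2 cutoff, and the message layout are assumptions, not the actual akb implementation:

```python
from collections import defaultdict


def group_keepers(scored_tasks, min_score=2):
    """Keep tasks scoring >= min_score and group them by trunk, best first."""
    groups = defaultdict(list)
    for task in scored_tasks:
        if task["score"] >= min_score:
            groups[task["trunk"]].append(task)
    # Sort trunks by name and tasks by descending score for a stable message.
    return {
        trunk: sorted(groups[trunk], key=lambda t: -t["score"])
        for trunk in sorted(groups)
    }


def format_message(groups):
    """Render the grouped keepers as a plain-text message body."""
    lines = []
    for trunk, tasks in groups.items():
        lines.append(f"{trunk}:")
        lines.extend(f"  [{t['score']}] {t['title']}" for t in tasks)
    return "\n".join(lines)
```

Sending the string to Telegram (or anywhere else) is deliberately left out. The point is that everything after scoring is plain, testable code.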

Swap "TickTick app" for "Notion / Obsidian / Things" and the flow is identical. The adapter changes, the daily ritual doesn't.

Roadmap

v0.1 — Core + reference TickTick adapter + bench (where I am today)
v0.2 — Filesystem adapter (Karpathy-style local markdown). Probably one weekend's work.
v0.3 — Notion adapter (community contribution most likely)
v0.4 — Lint pass + fact-propagation queue with approval gate
v0.5 — Adapter for Apple Notes / Things / iA Writer (Mac-native captures)

Code

Repo: github.com/renezander030/agentic-knowledge-base
One-page summary: gist.github.com/renezander030/c7bd6d5c4088e24d3add043720284453

Karpathy's wiki idea is right. The implementation that fits an existing system isn't a folder of markdown — it's the agent-side primitives that turn whatever you already have into something the model can reason over.

If you write your own adapter, I want to see it.

—

Posted from https://renezander.com/blog/agentic-knowledge-base/. Source at https://github.com/renezander030/agentic-knowledge-base.

I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at renezander.com.

What Anthropic's April 23 Postmortem Reveals About Your Agent Harness

René Zander — Thu, 30 Apr 2026 14:19:56 +0000

The April 23 Claude Code postmortem dropped last week. Three bugs, two months of degraded output, one usage-limit reset for every Pro subscriber.

I read it twice. The second time I started writing notes for my own agent harness.

It is unusually candid for a company at this scale, and it reads like a checklist of failure modes any team running production AI agents will eventually hit. Worth treating as a free engineering review.

Defaults that nobody can see

On March 4, the default reasoning effort dropped from high to medium. The reason was real. High mode was freezing the UI for some users. The fix was reasonable. The interesting bit: it shipped without an operator-visible knob, and quality regressed for a month before users complained loudly enough.

Open question for your harness: how many silent defaults does it have? Temperature 0.7 because that was the framework default in 2024. Top-p 1.0 because nobody touched it. Max tokens 4096 because somebody picked the number once. Each of these is a quality lever. Which ones are worth surfacing in your dashboard?
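One low-effort way to surface silent defaults is to make the config report where each value came from. A minimal sketch using the three knobs named above; SamplingConfig and describe are hypothetical names, and the default values are simply the ones quoted in the paragraph:

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SamplingConfig:
    temperature: float = 0.7
    top_p: float = 1.0
    max_tokens: int = 4096


def describe(config: SamplingConfig) -> dict:
    """Label each knob as 'default' or 'explicit' for a dashboard or startup log."""
    report = {}
    for field in fields(config):
        value = getattr(config, field.name)
        # A value that equals the default is reported as a default,
        # even if someone typed it in on purpose.
        origin = "default" if value == field.default else "explicit"
        report[field.name] = f"{value} ({origin})"
    return report
```

Logging describe(config) at startup is enough to answer "which of these did anyone actually choose?" without building a dashboard first.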

A line worth saving from the postmortem: "users told us they'd prefer higher intelligence and opt into lower effort for simple tasks." Defaults can optimize for quality, with cost concerns as opt-in rather than opt-out.

A cache rule that ate the working memory

On March 26 they shipped a thinking-cache clearing rule. Intent: clear reasoning history once after a session sits idle for more than an hour. Bug: it cleared on every turn for the rest of the session. Sessions felt forgetful. Tool choices got weird. Usage limits depleted faster because the model was rebuilding context every turn.

I have shipped this exact bug. Different system, same shape. A "small optimization" to a caching layer that turned every cache lookup into a miss. Cost went up 4x for two days before alerting caught it.

Useful question to bring to your team: do our caching tests cover multi-turn behavior, or only single-call hit/miss? Most teams I have asked answer "single call only". Surfacing that gap costs an afternoon and saves a quarter.
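To make the multi-turn gap concrete, here is a toy reproduction of the bug shape described above: an idle-expiry rule that is meant to clear once but keeps clearing. ReasoningCache is a hypothetical stand-in, not Anthropic's code. The point is that only a test spanning several turns can tell the two variants apart:

```python
from typing import Optional


class ReasoningCache:
    """Toy session cache with an idle-expiry rule."""

    def __init__(self, idle_limit_s: float, buggy: bool = False):
        self.idle_limit_s = idle_limit_s
        self.buggy = buggy
        self.entries = {}
        self.last_seen: Optional[float] = None
        self.stale = False  # the flag the buggy variant never resets
        self.hits = 0
        self.lookups = 0

    def turn(self, now: float, key: str) -> None:
        if self.last_seen is not None and now - self.last_seen > self.idle_limit_s:
            self.stale = True
        if self.stale:
            self.entries.clear()
            if not self.buggy:
                self.stale = False  # intended behavior: clear once, then cache again
        self.last_seen = now
        self.lookups += 1
        if key in self.entries:
            self.hits += 1
        self.entries[key] = "cached"


def hit_rate_after_idle(buggy: bool) -> float:
    """Warm up, idle for more than an hour, then run eight quick turns."""
    cache = ReasoningCache(idle_limit_s=3600, buggy=buggy)
    cache.turn(0, "ctx")       # first turn: unavoidable miss
    cache.turn(4000, "ctx")    # back after the idle limit: one expected miss
    for i in range(1, 9):      # eight more turns, one minute apart
        cache.turn(4000 + 60 * i, "ctx")
    return cache.hits / cache.lookups
```

A single-call hit/miss test passes for both variants. Asserting on the hit rate across the whole session is what separates them.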

A 25-word cap that cost 3% intelligence

On April 16 they added a system prompt: limit text between tool calls to 25 words, final responses to 100. The intent was cleaning up verbose narration. After ablation testing, they measured a 3% intelligence drop on coding tasks and reverted four days later.

Three percent doesn't sound like much, which is the part that stays with me. A prompt change hurting quality by 3% is invisible to anyone not running ablations. How many of us are? The honest answer in most rooms I sit in: not many.

Worth asking out loud: if you change a system prompt today, what catches a 3% regression?
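A gate that could catch a drop of that size needs two things: a pass-rate comparison, and enough eval cases that 3% is larger than the noise. A rough sketch, assuming each eval case yields a plain pass/fail. The function names, the 0.03 threshold, and the one-standard-error rule of thumb are assumptions, not anything taken from the postmortem:

```python
import math


def pass_rate(results) -> float:
    """Fraction of True values in a list of pass/fail results."""
    return sum(results) / len(results)


def regression_gate(baseline, candidate, max_drop: float = 0.03) -> dict:
    """Compare two eval runs and report whether the drop is large and measurable."""
    p_base, p_cand = pass_rate(baseline), pass_rate(candidate)
    drop = p_base - p_cand
    # Standard error of the difference between two independent proportions.
    stderr = math.sqrt(
        p_base * (1 - p_base) / len(baseline)
        + p_cand * (1 - p_cand) / len(candidate)
    )
    return {
        "drop": drop,
        "stderr": stderr,
        "blocked": drop >= max_drop,      # the prompt change should not ship
        "resolvable": stderr < max_drop,  # the suite is big enough to see max_drop
    }
```

With 50 cases per run the standard error in this sketch is around 8 points, so a 3-point regression disappears into the noise. With 1,000 cases it is under 2 points.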

What two of three tells you

Of the three bugs, two were silent until users yelled. The third was visible only after dedicated ablation testing. That ratio is the most interesting line in the whole postmortem.

I run six production agents. I have eval coverage on three. The other three I monitor with output sampling and gut feel. That setup is probably close to median for the industry.

The postmortem hands you a free checklist anyway. Default knobs visible to operators. Cache-hit rate tracked across multi-turn conversations. System prompts gated by eval ablations. Three failure modes, each one a useful question to ask your own setup.

Did you check your harness this week?


Claude Code with Local LLMs and ANTHROPIC_BASE_URL: Ollama, LM Studio, llama.cpp, vLLM

René Zander — Wed, 29 Apr 2026 05:53:48 +0000

Native Anthropic endpoints, tool-call compatibility, and context-window sizing for local Claude Code.

Last tested: April 2026. See Changelog at the bottom.

TL;DR cheat sheet

Goal	Use
MacBook Air	Gemma 4 26B-A4B Q4, 32K context, LM Studio or Ollama
MacBook Pro	Gemma 4 26B-A4B Q4 / UD-Q4, 64K context, llama.cpp or LM Studio
Claude Code minimum	32K context (anything below is a chat demo)
Best local backend	LM Studio or Ollama first; llama.cpp for advanced; vLLM for servers
Avoid	8K / 16K context, dense 31B Gemma 4 on 32 GB machines, old llama.cpp builds

The local-Claude-Code rule of thumb

Three things decide whether a local Claude Code session works:

Model quality decides whether the answer is smart.
Tool-call formatting decides whether Claude Code can act on the answer.
Context length decides whether the session survives past the first few edits.

For local coding agents: 32K is the floor. 64K is the sweet spot. Anything below 32K is a chat demo, not Claude Code.

Recommended setup

Use this first. Don't shop the buffet of alternatives until you've tried this one.

Backend: LM Studio (≥ 0.4.1) or Ollama (≥ v0.14.0). Both expose a native Anthropic-compatible local endpoint, no proxy needed.
Model: gemma4:26b-a4b (Gemma 4 26B-A4B-it, Q4 quant). MoE active-param ≈ 3.88 B → laptop-friendly latency, tool-use trained directly into the model.
Context: 32K on a MacBook Air, 64K on a MacBook Pro M5 Pro/Max with 48 GB+ RAM.
Machine: 32 GB+ RAM strongly preferred. 24 GB works at 24K–32K with care.

If you only have an OpenAI-compatible local endpoint and no Anthropic-compatible mode, run LiteLLM in front (see section 7).

1. Environment variables Claude Code reads

# Where Claude Code POSTs requests. Default: https://api.anthropic.com
export ANTHROPIC_BASE_URL=http://localhost:11434

# Sent as auth. Local servers usually accept any non-empty value.
export ANTHROPIC_AUTH_TOKEN=ollama

# Map Claude Code's "claude-opus-X-Y" / "claude-sonnet-X-Y" / "claude-haiku-X-Y"
# to model names your local backend serves.
export ANTHROPIC_DEFAULT_OPUS_MODEL=gemma4:26b-a4b
export ANTHROPIC_DEFAULT_SONNET_MODEL=gemma4:26b-a4b
export ANTHROPIC_DEFAULT_HAIKU_MODEL=gpt-oss:20b

# Without `export`, the variables above stay local to the shell and claude never sees them.
claude

Or override per-invocation:

claude --model gemma4:26b-a4b

If ANTHROPIC_BASE_URL is set but the URL doesn't respond with the right shape, Claude Code does not fall back to the cloud. It errors out.
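Because there is no cloud fallback, it can save time to check the endpoint's response shape before launching Claude Code. A minimal preflight sketch using only the standard library. It assumes the backend serves the Anthropic Messages route at /v1/messages and accepts a Bearer token; preflight and looks_like_messages_response are hypothetical helper names:

```python
import json
import urllib.request


def looks_like_messages_response(body: dict) -> bool:
    """True if `body` has the basic shape of an Anthropic Messages API response."""
    if body.get("type") != "message" or body.get("role") != "assistant":
        return False
    content = body.get("content")
    if not isinstance(content, list):
        return False
    return all(isinstance(block, dict) and "type" in block for block in content)


def preflight(base_url: str, token: str, model: str, timeout: float = 30.0) -> bool:
    """POST one tiny message to the backend and check the shape of the reply."""
    payload = {
        "model": model,
        "max_tokens": 16,
        "messages": [{"role": "user", "content": "ping"}],
    }
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/messages",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "content-type": "application/json",
            "anthropic-version": "2023-06-01",
            "authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return looks_like_messages_response(json.load(response))
```

For the Ollama values above, the call would be preflight("http://localhost:11434", "ollama", "gemma4:26b-a4b"). An OpenAI-shaped reply (a top-level "choices" list) fails the check, which is the usual sign that the URL points at the wrong kind of endpoint.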

2. Context length: the hidden failure mode

Claude Code is not a chat prompt. Before your actual request, the backend sees:

Claude Code's system prompt (~6–10K tokens by itself)
tool definitions for Read / Edit / Bash / Grep / Glob / TodoWrite
conversation history
file excerpts and full reads
diffs
command output
retry/error messages from failed tool calls

That means 8K and 16K contexts are misleading tests. They may answer a chat question, but they are not enough for reliable agentic coding. The session survives a handful of turns, then silently degrades — file edits truncate, tool calls drop arguments, the loop gets confused.
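The arithmetic behind that claim is short. A back-of-the-envelope sketch: the 8,000-token system prompt is the midpoint of the 6–10K range above, while the 4,000 tokens for tool definitions and the 4,000-token reply reserve are assumed round numbers, not measurements:

```python
def remaining_for_code(
    window: int,
    system_prompt: int = 8_000,     # midpoint of the 6-10K range quoted above
    tool_definitions: int = 4_000,  # assumption: schemas for Read/Edit/Bash/Grep/Glob/TodoWrite
    reply_reserve: int = 4_000,     # assumption: room left for the model's own output
) -> int:
    """Tokens left for history, file reads, diffs, and command output."""
    return window - system_prompt - tool_definitions - reply_reserve


for window in (8_192, 16_384, 32_768, 65_536):
    print(f"{window:>6} window -> {remaining_for_code(window):>6} tokens for your code")
```

Under these assumptions an 8K window is already negative before the first file is read, 16K leaves a few hundred tokens, and 32K is the first size with real working room.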

Practical context tiers

Context	Verdict	What happens
8K	Broken for Claude Code	System prompt + tools eat the window before your code arrives. Chat-only.
16K	Demo only	Tiny edits, short sessions. Not a real test of any model.
25K	LM Studio's stated minimum	Good enough for small tasks if tool calls are reliable.
32K	Real minimum	Ollama recommends this floor. Use as your default.
64K	Sweet spot	Best balance on 32GB+ machines. Handles medium repos and multi-file edits.
128K+	Diminishing returns	Prefill latency and KV-cache memory rise hard. Worth it only on high-memory servers, and only for repo-wide reads.
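The KV-cache cost in the last row grows linearly with context, which is why 128K hurts on laptops. A rough estimate for a standard attention layout; the layer count, KV-head count, head size, and fp16 cache are hypothetical round numbers, not Gemma 4's actual architecture, and models with sliding-window attention or a quantized KV cache need less:

```python
def kv_cache_gib(
    tokens: int,
    layers: int = 32,          # hypothetical
    kv_heads: int = 8,         # hypothetical (grouped-query attention)
    head_dim: int = 128,       # hypothetical
    bytes_per_value: int = 2,  # fp16 cache
) -> float:
    """Memory for keys and values across all layers, in GiB."""
    # Factor 2: one tensor for keys, one for values.
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * tokens
    return total_bytes / 2**30


for tokens in (32_768, 65_536, 131_072):
    print(f"{tokens:>7} tokens -> {kv_cache_gib(tokens):.0f} GiB of KV cache")
```

With these placeholder numbers the cache alone is 4 GiB at 32K, 8 GiB at 64K, and 16 GiB at 128K, on top of the model weights.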

Apple Silicon context presets

Machine	Recommended context	Notes
MacBook Air M5, 16 GB	16K–24K	Use smaller models (≤8B). 26B-A4B is tight.
MacBook Air M5, 24 GB	24K–32K	32K is the target; keep other apps light.
MacBook Air M5, 32 GB	32K	Best Air setup. Higher rarely beats thermal throttling.
MacBook Pro M5 Pro, 24 GB	32K	Better sustained perf than Air at the same context.
MacBook Pro M5 Pro, 48/64 GB	64K	Sweet spot for serious local coding.
MacBook Pro M5 Max, 64/128 GB	64K default, 128K experimental	Use 128K for repo-wide analysis, not every edit loop.

Note: backend docs differ — LM Studio says "start at 25K, increase for better results," Ollama recommends 32K. Use 32K as the cross-backend baseline. Reading "25K" as "25K is enough" is the most common mistake.

3. Claude Code Ollama setup (native, v0.14.0+)

Ollama announced Anthropic Messages API compatibility on 2026-01-16. No proxy, no LiteLLM, no nothing.

# Set context length first. This is the most important knob.
# It is read by the Ollama server, so set it in the environment that runs `ollama serve`.
export OLLAMA_CONTEXT_LENGTH=32768   # 65536 on a Pro

export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_BASE_URL=http://localhost:11434

claude --model gemma4:26b-a4b

Cloud-hosted Ollama models work too:

claude --model glm-4.7:cloud
claude --model minimax-m2.1:cloud

Two known limits of Ollama's Anthropic-compat layer (April 2026):

No prompt caching. Anthropic's cache_control doesn't apply — every Claude Code request re-processes the system prompt and conversation history from scratch.
No tool_choice. Claude Code occasionally uses tool_choice to force a specific tool call. Ollama's compat layer ignores it. When it matters, Claude Code may pick the wrong tool and get stuck in a loop.

4. Claude Code LM Studio setup (native, 0.4.1+)

LM Studio added the Anthropic-compatible /v1/messages endpoint on 2026-01-30. Streaming, tool calls, and message-shape are all supported natively.

# Set context to at least 32K in the LM Studio UI (or higher; see section 2)
lms server start --port 1234

export ANTHROPIC_BASE_URL=http://localhost:1234
export ANTHROPIC_AUTH_TOKEN=lmstudio

claude --model openai/gpt-oss-20b

For VS Code with the Claude Code extension (env vars from your shell are NOT inherited by VS Code):

// .vscode/settings.json
{
  "claudeCode.environmentVariables": [
    { "name": "ANTHROPIC_BASE_URL", "value": "http://localhost:1234" },
    { "name": "ANTHROPIC_AUTH_TOKEN", "value": "lmstudio" }
  ]
}

LM Studio's docs say "at least 25K." Set 32K. See section 2.

5. Claude Code llama.cpp setup (Apple Silicon fast path for Gemma 4 26B-A4B)

If you're on Apple Silicon and want the absolute lowest overhead with Gemma 4 26B-A4B, llama.cpp's server is faster per-token than Ollama or LM Studio. You need a recent build (one that supports -hf for HuggingFace pulls and --jinja for chat templates).

./build/bin/llama-server \
  -hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M \
  --host 127.0.0.1 \
  --port 8080 \
  -ngl 99 \
  -c 65536 \
  --jinja

export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_AUTH_TOKEN=llama-cpp
claude --model gemma-4-26B-A4B

Flags that matter:

-c 65536 sets 64K context (drop to -c 32768 on tighter machines).
-ngl 99 offloads all layers to Metal/GPU.
--jinja is required for Gemma 4's chat template to render correctly. Without it, tool calls won't format and you'll see <unused24> / <unused49> tokens leaking into output.
-hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M pulls the GGUF straight from HuggingFace.

Caveat: llama.cpp's Anthropic-compat is partial. Works for chat and basic tool calling. Streaming-shape and some Anthropic-specific request fields are rougher than Ollama or LM Studio. If something breaks weirdly, fall back to Ollama. llama.cpp is the speed play, not the compatibility play.

6. Claude Code vLLM setup (native + tool parser)

vLLM ships an official Claude Code integration. Three things at server start: a tool-calling-capable model, --enable-auto-tool-choice, and the right --tool-call-parser.

vllm serve openai/gpt-oss-120b \
  --served-model-name my-model \
  --enable-auto-tool-choice \
  --tool-call-parser openai \
  --port 8000

export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=dummy
export ANTHROPIC_AUTH_TOKEN=dummy
export ANTHROPIC_DEFAULT_OPUS_MODEL=my-model
export ANTHROPIC_DEFAULT_SONNET_MODEL=my-model
export ANTHROPIC_DEFAULT_HAIKU_MODEL=my-model

claude

The --tool-call-parser value depends on the model family — openai for the gpt-oss family, llama3_json for Llama 3.x, hermes for Hermes. Wrong parser → tool calls return as plain text and Claude Code's edit/grep/bash tools silently no-op.

7. LiteLLM — for fallbacks, not for translation

With Ollama, LM Studio, llama.cpp, and vLLM all speaking native Anthropic now, LiteLLM's role changes. It's no longer "the translator". It's the router for fallbacks, request logging, per-tenant keys, and rate limits. It's also the right answer if your only local option is an OpenAI-compatible endpoint.

# litellm-config.yaml
model_list:
  - model_name: claude-opus-4-7
    litellm_params:
      model: openai/my-vllm-model
      api_base: http://vllm:8000/v1

  - model_name: claude-sonnet-4-6
    litellm_params:
      model: ollama/gemma4:26b-a4b
      api_base: http://ollama:11434

  - model_name: claude-haiku-4-5
    litellm_params:
      model: anthropic/claude-haiku-4-5
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  fallbacks:
    - claude-opus-4-7: ["claude-haiku-4-5"]   # local fail → cloud Haiku

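The single biggest win: when the local backend errors out, LiteLLM falls back to cloud Haiku transparently and Claude Code keeps working. The fallback triggers on request errors (5xx, timeouts); a tool call that fails silently inside a 200 OK response still passes through, so keep the checks from section 8 in place.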

8. Common failures (the error strings developers google)

`tool_use parse error` / `invalid tool call` / `tool_use is not supported`

Three different symptoms, one root cause: the model is not emitting Anthropic-format tool_use content blocks.

The most deceptive symptom is the silent one — Claude Code starts, prints the model's plain-prose answer ("I would change the file like this..."), and nothing happens. No file edit, no error.

Common causes (April 2026):

vLLM: missing --enable-auto-tool-choice or wrong --tool-call-parser.
Ollama: model that wasn't trained for tool calling (avoid stock llama3.x instruct).
llama.cpp: missing --jinja. The chat template renders incorrectly and you see literal <unused24> / <unused49> tokens.
LM Studio: model file is fine but the loaded preset uses the wrong template.
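Since the silent case produces no error string, it helps to classify responses mechanically. A small sketch that inspects an Anthropic-shaped response body and names the symptom; classify_response and its four labels are hypothetical, and the `<unused…>` pattern is based only on the two tokens mentioned above:

```python
import re

# Matches leaked template tokens such as <unused24> or <unused49>.
LEAKED_TOKEN = re.compile(r"<unused\d+>")


def classify_response(body: dict, tools_expected: bool) -> str:
    """Return 'empty', 'template-leak', 'prose-only', or 'ok'."""
    content = body.get("content") or []
    if not content:
        return "empty"
    text = "".join(
        block.get("text", "") for block in content if block.get("type") == "text"
    )
    if LEAKED_TOKEN.search(text):
        return "template-leak"
    has_tool_use = any(block.get("type") == "tool_use" for block in content)
    if tools_expected and not has_tool_use:
        # The model described the edit in prose instead of calling a tool.
        return "prose-only"
    return "ok"
```

Run against logged responses, "prose-only" points at the parser or tool-training causes in the list above, and "template-leak" points at the chat template.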

`context length exceeded` / model stopped mid-edit

Claude Code's prompts overflow the configured window. The session may finish a single turn, then truncate the next file edit silently. Fix: raise context to at least 32K. If you're already at 32K and still hitting this, the model is reading too aggressively — drop to fewer tools or shorter file reads.

`empty assistant response`

Backend returned 200 OK with an empty content array. Causes:

Streaming SSE format mismatch (mostly llama.cpp).
Tool-call parser swallowed the message because it couldn't parse it.
Model emitted only a <unused24> / <unused49> token and the parser dropped the rest.

Fix: switch backend (Ollama or LM Studio if you were on llama.cpp), or upgrade llama.cpp to a build with the patched Gemma 4 chat template.

`model not found` / `404 the model X does not exist`

Claude Code asked for claude-opus-4-7 but the backend serves gpt-oss:20b or gemma4:26b-a4b. Fixes:

Set ANTHROPIC_DEFAULT_OPUS_MODEL (plus _SONNET_ and _HAIKU_) to the backend's actual model name.
Use claude --model <backend-name> per call.
Map the names in LiteLLM (the model_name: field is what Claude Code asks for; model: is what gets served).

`messages: Extra inputs are not permitted` (HTTP 422)

Some backends are stricter than Anthropic's own. They reject Anthropic-specific fields (cache_control, thinking, tools[].input_schema, metadata.user_id). Fix: upgrade the backend, or run a small middleware proxy that strips the unsupported fields before forwarding.
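The field-stripping step of such a middleware could look roughly like this. It is a sketch of the transformation only, not a full proxy. It removes cache_control wherever it appears plus the top-level thinking and metadata fields; tools[].input_schema is deliberately left alone, on the assumption that dropping it would also drop the tool argument definitions the model needs:

```python
import copy

# Top-level request fields removed before forwarding (taken from the list above).
TOP_LEVEL_DROP = ("thinking", "metadata")


def _drop_cache_control(node) -> None:
    """Recursively delete every 'cache_control' key in nested dicts and lists."""
    if isinstance(node, dict):
        node.pop("cache_control", None)
        for value in node.values():
            _drop_cache_control(value)
    elif isinstance(node, list):
        for item in node:
            _drop_cache_control(item)


def strip_unsupported(request_body: dict) -> dict:
    """Return a cleaned copy of the request; the original is left untouched."""
    body = copy.deepcopy(request_body)
    for key in TOP_LEVEL_DROP:
        body.pop(key, None)
    _drop_cache_control(body)
    return body
```

Which fields a given backend rejects varies by version, so the drop list is best treated as configuration, not as a fixed rule.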

`ANTHROPIC_BASE_URL` ignored / Claude Code still calls the real API

Env var was set in .zshrc after the shell session started — restart the terminal.
~/.config/claude/config.json or a --api-key flag is overriding the env var.
VS Code: env vars from your shell are NOT inherited. Use claudeCode.environmentVariables in workspace settings (section 4).

Run echo $ANTHROPIC_BASE_URL inside the same shell that runs claude. If it prints nothing, you have a sourcing problem.

9. Debug flow

When something breaks, walk this tree before swapping backends:

Did the model load?
- No → check quant size vs RAM. 26B-A4B Q4 needs ~16 GB free; bigger quants need more.
Is the context at least 32K?
- No → raise to 32K (Air) or 64K (Pro). See section 2.
Are tool calls malformed? (Look for <unused24>, <unused49>, plain prose where you expected an edit.)
- Yes → switch to native Anthropic mode (Ollama/LM Studio), or for vLLM verify --tool-call-parser, or for llama.cpp add --jinja.
Does Claude Code stop mid-edit?
- Yes → context exhaustion. Raise the context window (section 2), or put less into it: fewer tools, shorter file reads.
Is the model hallucinating files that don't exist?
- Yes → the model isn't calling Read before Edit. Add a CLAUDE.md rule that requires reading before editing, or switch to a model with stronger tool-use training (Gemma 4 26B-A4B is solid here).
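The same tree, written as a function so it can sit in a setup script or a runbook. This is only a restatement of the steps above; the parameter names and return strings are made up for illustration:

```python
def next_step(
    model_loaded: bool,
    context_tokens: int,
    malformed_tool_calls: bool,
    stops_mid_edit: bool,
    hallucinates_files: bool,
) -> str:
    """Walk the debug tree top to bottom and return the first applicable fix."""
    if not model_loaded:
        return "check quant size vs RAM"
    if context_tokens < 32_768:
        return "raise context to 32K (Air) or 64K (Pro)"
    if malformed_tool_calls:
        return "fix tool-call format: native Anthropic mode, --tool-call-parser, or --jinja"
    if stops_mid_edit:
        return "context exhaustion: see the context section"
    if hallucinates_files:
        return "require Read before Edit in CLAUDE.md"
    return "no known failure matched"
```

The order matters: a context that is too small can look like malformed tool calls, so the context check runs first.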

10. Smoke test

Verify your setup with one prompt. Ask Claude Code:

Create a small FastAPI app with one /health endpoint, add a pytest test for it, run pytest, and fix any failures.

Passes if:

It reads/writes files correctly (no hallucinated paths).
It runs the test command (you see real pytest output).
It patches a failure (e.g. missing dependency) without losing context.
It does not lose tool-call format (no <unused24> / <unused49> leakage).
It does not truncate after the first edit.

Expected terminal feel:

✓ model loaded     (gemma4:26b-a4b, Q4_K_M)
✓ context: 32768
✓ tool call parsed (Edit)
✓ edited file      (app.py)
✓ tool call parsed (Bash)
✓ tests passed

If any of the five pass criteria fails, walk the debug flow above.

11. Compatibility matrix (April 2026)

Backend	Native Anthropic API	Tool calls	Context floor	Notes
Ollama (≥ v0.14.0)	Yes	Depends on model	32K (cross-backend baseline)	Easiest setup. No prompt caching, no `tool_choice` (see section 3).
LM Studio (≥ 0.4.1)	Yes	Yes (out of the box)	Stated 25K, use 32K	Streaming + `tool_use` blocks supported natively. VS Code extension takes workspace env vars.
llama.cpp server	Partial	Yes with `--jinja`	32K (64K on Pro)	Lowest overhead on Apple Silicon. Rougher Anthropic-compat. Best path for Gemma 4 26B-A4B.
vLLM	Yes	Yes with `--enable-auto-tool-choice` + correct parser	Model-dependent	Best throughput. Requires correct parser per model family.
LiteLLM	Routes to any backend	Whatever the backend supports	n/a	Use for fallbacks and logging, or to wrap an OpenAI-compatible local endpoint as Anthropic.
Direct Ollama < v0.14.0	No	No	n/a	Upgrade.

12. Hardware × model × context × backend (the cheat-sheet table)

A developer should not have to infer what to use:

Machine	Model	Context	Backend	Verdict
MacBook Air M5, 16 GB	Gemma 4 E4B	16K–24K	LM Studio	usable for small tasks
MacBook Air M5, 24 GB	Gemma 4 26B-A4B Q4	24K–32K	Ollama / LM Studio	good
MacBook Air M5, 32 GB	Gemma 4 26B-A4B Q4	32K	Ollama / LM Studio	best Air setup
MacBook Pro M5 Pro, 48 GB	Gemma 4 26B-A4B Q4/UD-Q4	64K	llama.cpp / LM Studio	sweet spot
MacBook Pro M5 Max, 64 GB+	Gemma 4 26B-A4B or 31B	64K–128K	llama.cpp / vLLM	best local

This is the single most copied table in this gist. Bookmark it.

13. Gemma 4 26B-A4B: the Apple Silicon sweet spot

For Mac local Claude Code, the standout Gemma 4 variant is 26B-A4B-it, not the dense 31B. Reasons:

Google trained tool-use directly into Gemma 4 (not bolted on as a fine-tune). It works on the first try, not after three retries.
The 26B MoE activates only ~3.88 B params per token, so latency is in the 4 B-model range: around 300 tok/sec on M2 Ultra.
Strong tool-use behavior, good enough coding quality for private/local workflows.
Fits at useful context sizes on high-memory MacBooks.

Why 26B-A4B instead of 31B?

Faster tool calls — every Claude Code turn is bottlenecked by tool-call latency, not single-shot quality.
Lower active-parameter count keeps prefill cheap.
Better fit for laptops — 31B dense needs more RAM and more thermal headroom.
Enough quality for iterative coding; the agent loop matters more than peak IQ.
31B may be better for single-shot answers — but Claude Code is many small turns, not one big answer.

For Gemma 4 local coding specifically: pick 26B-A4B unless you're on a 64 GB+ Pro and you've measured that 31B Q4 actually finishes turns faster on your hardware.

14. Other model picks for Claude Code (April 2026)

If Gemma 4 isn't available or you want to compare:

gpt-oss:20b — easy starting point. Tool calling reliable, runs on a single decent GPU. Recommended in Ollama's and LM Studio's official Claude Code blog posts.
gpt-oss:120b — much smarter on real codebases. The vLLM Claude Code integration page uses this as the example. Needs serious VRAM.
qwen3-coder — purpose-built for coding. Strong tool-call performance on Ollama. Frequently called the strongest local pick for Claude Code in March/April 2026 community threads.
qwen3.5 family — the 35B MoE variants are reported as the strongest agentic-coding open models in this size class. Verify tool-call support per quant.
glm-4.7-flash / glm-4.7:cloud — strong agentic coder. Available as an Ollama cloud model (no local GPU needed).
minimax-m2.1:cloud — newer Ollama cloud option, agentic-tuned.

What to avoid: stock llama3.x instruct models without tool fine-tuning. They will look like they work, then silently fail on file edits.

15. Setups I would avoid

8K context. Too small for Claude Code. The system prompt eats it before your code arrives.
16K context. Demos only. Don't judge a model by 16K behavior.
Old llama.cpp builds with Gemma 4. No --jinja or no patched chat template → <unused24> / <unused49> token leakage.
128K context on a 32 GB laptop. KV cache + prefill latency tax > the benefit.
Judging model quality before tool calls are stable. Fix the parser/template first, then evaluate the model.
Routing through LiteLLM when the backend is already native Anthropic. Adds a hop for nothing. Only use LiteLLM for fallbacks or when wrapping an OpenAI-compatible local endpoint.

16. Reusable startup script

Drop this in start-claude-code-local.sh and chmod +x. Default 32K context, override via env.

#!/usr/bin/env bash
set -euo pipefail

export OLLAMA_CONTEXT_LENGTH="${OLLAMA_CONTEXT_LENGTH:-32768}"
export ANTHROPIC_BASE_URL="${ANTHROPIC_BASE_URL:-http://localhost:11434}"
export ANTHROPIC_AUTH_TOKEN="${ANTHROPIC_AUTH_TOKEN:-ollama}"
export ANTHROPIC_DEFAULT_OPUS_MODEL="${ANTHROPIC_DEFAULT_OPUS_MODEL:-gemma4:26b-a4b}"
export ANTHROPIC_DEFAULT_SONNET_MODEL="${ANTHROPIC_DEFAULT_SONNET_MODEL:-gemma4:26b-a4b}"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="${ANTHROPIC_DEFAULT_HAIKU_MODEL:-gpt-oss:20b}"

echo "Starting Ollama with context=$OLLAMA_CONTEXT_LENGTH"
ollama serve &
OLLAMA_PID=$!

# Stop Ollama on every exit path, including a non-zero exit from claude under `set -e`.
trap 'kill "$OLLAMA_PID" 2>/dev/null || true' EXIT

# Wait for Ollama to be ready
until curl -sf "$ANTHROPIC_BASE_URL/api/version" > /dev/null; do
  sleep 0.5
done

echo "Launching Claude Code → $ANTHROPIC_BASE_URL"
echo "Model: $ANTHROPIC_DEFAULT_OPUS_MODEL"

claude

For LM Studio, swap ollama serve for lms server start --port 1234 and update the env vars accordingly.

This script (and additions for other backends as they ship) lives in the companion repo:

github.com/renezander030/local-ai-coding-stack — git clone, chmod +x scripts/start-claude-code-local.sh, run.

17. Production recommendation

For real work, do not let Claude Code talk directly to a single local endpoint without a fallback path:

Claude Code
   │  ANTHROPIC_BASE_URL
   ▼
LiteLLM (router + logger)
   │  primary
   ▼
Ollama / LM Studio / llama.cpp / vLLM (local)
   │  on tool-call failure or 5xx
   ▼
Cloud Claude Haiku (fallback)
   │
   ▼
Audit log

This layout buys you three things: model swaps without restarting Claude Code, transparent fallback when the local backend fails, and request logs you can grep when something goes wrong. It's the same five-contract pattern as agent-approval-gate.
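The routing idea in the diagram, reduced to a few lines. This is an in-process illustration, not LiteLLM's API: primary and fallback are any callables that take a request dict and return a response dict, and is_usable is a hypothetical hook for catching responses that came back 200 OK but contain no usable tool call:

```python
def call_with_fallback(request, primary, fallback, is_usable, audit_log):
    """Try the local backend first; fall back on an error or an unusable response."""
    try:
        response = primary(request)
        if is_usable(response):
            audit_log.append(("primary", "ok"))
            return response
        audit_log.append(("primary", "unusable response"))
    except Exception as exc:  # connection refused, 5xx surfaced as an exception, timeout
        audit_log.append(("primary", f"error: {exc}"))
    response = fallback(request)
    audit_log.append(("fallback", "ok"))
    return response
```

The is_usable hook is the piece an error-driven router does not provide by itself: it is where a check for missing tool_use blocks would plug in.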

18. When local models are the wrong choice

Repo-wide refactors. Multi-step tool flows compound silent tool-call failures. Local fine-tunes drop accuracy fast.
Security-sensitive edits without an approval gate. Use agent-approval-gate and the local-vs-cloud question becomes secondary.
Tool-heavy sessions (50+ tool calls). Every silent failure compounds.
Anything billed by your time. A failed local tool call costs your time; a successful Haiku call is roughly $0.001.

Local Claude Code is a fit for: chat-only assist on private code, classification/summarization sub-steps, air-gapped environments.

Series

This gist is part of Production AI Automation Notes — a running set of repos and gists on shipping AI agents outside demos. Other entries:

agent-approval-gate — production-safe approval pattern. Drop in front of any local-model agent that touches real systems.
Production AI Automation Notes #1: Agent Approval Gates
CLAUDE.md — 10 rules for Claude Code, edit-time and runtime
Context7 v2 — enterprise GraphQL MCP pattern

Reader contributions

If you get this working on a different Mac/RAM/model combo, comment with:

machine
RAM
backend
model + quant
context length
what worked / what failed

The compatibility matrix and hardware table are updated weekly from these reports.

Changelog

2026-04-28

Added TL;DR cheat sheet, Recommended setup section, smoke test, debug flow, reusable startup script, hardware × model × context × backend table.
Expanded error-string section to include <unused24> / <unused49> template-leak symptoms.
Added 26B-A4B vs 31B comparison bullets.
Added "Setups I would avoid."
Renamed Update log → Changelog.
Added Gemma 4 26B-A4B context recommendations.
Added MacBook Air vs Pro presets.
Added 32K / 64K Claude Code guidance.
Backend coverage rewritten: Ollama, LM Studio, vLLM all native Anthropic; llama.cpp added as Apple Silicon fast path.
LiteLLM repositioned as fallback router (and OpenAI-compat wrapper), not translator.

2026-04-22

Initial publish.

Get the next field note in your inbox. A new mini case from real AI builds every two weeks. No theory, no pitches.

Subscribe at renezander.com


Voice AI in Production: From RunPod to Hosted Kubernetes

René Zander — Thu, 23 Apr 2026 13:10:51 +0000

Your voice model works in a demo. The same model in production stalls under concurrent load. The model file is identical. So is the GPU card. Only the deployment changed.

If your TTS service runs on a single RunPod pod, you've already met this wall. You handle one request per GPU at a time. A crash costs ninety seconds to reload the model. Failover isn't in the setup. Your marketing page says "generate narration instantly." Your infrastructure says "please form an orderly queue."

The gap between prototype and product sits in the infrastructure layer. The voice AI companies asking me for help want hosted Kubernetes because their engineering hours are going into pod management when they should be going into the model.

Single Pod Stops Working Around Four Concurrent Users

A voice model like Qwen3-TTS loads into GPU memory once. Each inference holds that memory plus a working buffer. On an H100 you fit the model plus maybe four to eight concurrent generations before latency goes off a cliff. On a 4090, less.

That number is the ceiling of your business on a single pod. You can buy a bigger GPU. You can't buy a second one attached to the same pod. The moment you need more than one machine, you're in distributed-systems territory whether you planned for it or not.

What Actually Breaks First

Cold starts are the obvious one. A pod that dies takes ninety seconds to reload the model into VRAM, and during those ninety seconds your users hit 502s. Kubernetes with a warm pool absorbs it.

Voice profile storage gets worse the moment you scale. On one pod a user's cloned voice sits on local disk. Spread that across ten pods and you need shared storage plus replication on every node that might serve that user. Miss one and the next request uses the wrong voice or errors out.

Then there's the cost trap. You rent preemptible GPUs at a third the price, and one afternoon the cloud provider takes them back with two minutes' warning. A single pod goes dark. A K8s cluster with a warm replica serves the next request from a different node and nobody sees the eviction.

Fine-tuning is the one that forces the decision. The moment you offer custom voice creation, you need training runs that don't block inference. That means another queue, another GPU pool, and priority rules that don't collide with live inference. A single pod can't multiplex that, and bolting it on later costs more than designing for it up front.

What the K8s Layer Actually Buys You

Keep model weights on the node, where they outlive any single pod. New pods scheduled to that node get a warm cache and start in under ten seconds instead of ninety.

Not every request needs an H100. Real-time low-latency responses can run on a 4090 nodepool, premium batch generations go to H100. Nodepool labels and taints handle the routing without the application code caring.

Pick queue depth as your autoscale signal. CPU metrics are useless here. GPU utilization also lies when the model is streaming. The number that maps to user-visible latency is requests waiting in the queue.

Show the queue depth back to the caller. "You're number four, about forty seconds" keeps users on the line. A thirty-second timeout with no feedback teaches them your service is broken.
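Both ideas, scaling on queue depth and reporting position back to the caller, come down to a couple of lines of arithmetic. A minimal sketch; the target of four waiting requests per replica, the replica bounds, and the ten-second average generation time are assumed numbers, and a real autoscaler would read the queue depth from its own metrics pipeline:

```python
import math


def desired_replicas(
    queue_depth: int,
    target_per_replica: int = 4,  # assumption: acceptable waiting requests per replica
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Replica count that brings the queue back to the target depth per replica."""
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


def eta_seconds(position: int, avg_generation_s: float, concurrent_slots: int = 1) -> float:
    """Rough wait estimate: requests ahead are served in waves of `concurrent_slots`."""
    return math.ceil(position / concurrent_slots) * avg_generation_s
```

With one slot and roughly ten seconds per generation, position four works out to forty seconds, which matches the message quoted above.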

None of this is visible in a Voicebox README.

Hosted K8s Is the Service

Voice AI companies keep asking for this because it's the gap between a model that works and a product that holds up under paying users. You can learn Kubernetes while trying to ship, but most founders can't afford both learning curves at once. Hiring a team is slow. Handing the layer off gets your engineering hours back on the model.

If your voice AI product is past the demo and breaking under real traffic, I run the K8s layer so your team stays on the model. Contact on the blog.

Your Model Is the Value. Your Pod Isn't.

Are your engineering hours going into the model or into the pod that serves it? If the answer is the pod, you're paying to solve the wrong problem twice. Handle the infrastructure properly or hand it off. A half-built version while your competitor ships isn't a strategy.

Originally published at renezander.com.

Where is your engineering time actually going right now: into the model or into the pod that serves it?

Get the next field note in your inbox. A new mini case from real AI builds every two weeks. No theory, no pitches.

Subscribe at renezander.com

I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at renezander.com.

Ten CLAUDE.md rules for Claude Code - four edit-time, six runtime

René Zander — Thu, 23 Apr 2026 04:06:42 +0000

Forrestchang's andrej-karpathy-skills CLAUDE.md is four rules aimed at the moment Claude is writing code. They work. What they don't cover is the moment Claude is running. Once a Claude-driven pipeline goes to production, a different failure mode shows up: confident outputs, silent budget overruns, destructive side-effects, prompt injection via user input.

These six extension rules are what I shipped into fixclaw — a Go pipeline engine where Claude drafts, classifies, and summarizes, but never executes. Deterministic code does. The rules below are what made that claim stick.

Merge with your own project rules. Tradeoff: these bias toward caution over autonomy.

Forrestchang's four (edit-time) — unchanged

1. Think Before Coding — state assumptions, surface tradeoffs, ask when unclear.
2. Simplicity First — minimum code, no speculative abstractions.
3. Surgical Changes — touch only what the task requires.
4. Goal-Driven Execution — define success criteria, loop until verified.

(Full text: forrestchang/andrej-karpathy-skills/CLAUDE.md.)

Six runtime rules — lessons from fixclaw

5. Deterministic First

Claude is for judgment calls. Plain code does everything else.

Fetching, filtering, routing, persisting, dispatching — none of it is a language task. Don't ask the model to "decide if we should retry" when a status code already answers. Use the model for: classification, drafting, summarization, extraction from unstructured text. That's the whole list.

The failure mode without this rule: the model makes a routing decision one week, a different routing decision the next, and you've reinvented flaky if-else at $0.003/token.
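As an illustration of the retry case, here is a sketch in which a status code and an attempt counter answer the question with no model call at all. The set of retryable codes and the limit of three attempts are assumptions for the example, not fixclaw's actual policy.

```python
# Hypothetical policy: which HTTP status codes are worth retrying.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def should_retry(status_code: int, attempt: int, max_attempts: int = 3) -> bool:
    """Deterministic retry decision: same inputs, same answer, every week."""
    return status_code in RETRYABLE_STATUS and attempt < max_attempts


print(should_retry(503, attempt=1))
print(should_retry(404, attempt=1))
```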

6. Declare Budgets, Halt On Breach

No silent overruns. Ever.

Every AI step runs under a token budget: per-step, per-pipeline, per-day. Exceeding any of the three halts the pipeline immediately, logs the breach, and surfaces it to the operator. Budgets live in config, not in prompts.

budgets:
  per_step_tokens: 2048
  per_pipeline_tokens: 10000
  per_day_tokens: 100000

The failure mode without this rule: a runaway loop burns $40 overnight and you find out from the invoice.
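A minimal sketch of how a runtime could enforce those three limits. The class and method names are hypothetical (fixclaw itself is written in Go and its real implementation is not shown here); the limits are the ones from the config above.

```python
class BudgetExceeded(RuntimeError):
    """Raised to halt the pipeline the moment any budget is breached."""


class TokenBudget:
    def __init__(self, per_step: int, per_pipeline: int, per_day: int):
        self.limits = {"step": per_step, "pipeline": per_pipeline, "day": per_day}
        self.used = {"pipeline": 0, "day": 0}

    def charge(self, step_name: str, tokens: int) -> None:
        """Record token usage for one AI step, or raise before recording it."""
        if tokens > self.limits["step"]:
            raise BudgetExceeded(f"{step_name}: step budget breached ({tokens} tokens)")
        for scope in ("pipeline", "day"):
            if self.used[scope] + tokens > self.limits[scope]:
                raise BudgetExceeded(f"{step_name}: {scope} budget breached")
        for scope in ("pipeline", "day"):
            self.used[scope] += tokens


budget = TokenBudget(per_step=2048, per_pipeline=10000, per_day=100000)
budget.charge("classify", 1500)
print(budget.used)
```

The caller is expected to catch `BudgetExceeded` at the top of the pipeline, log it, and notify the operator rather than retry.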

7. Human-In-The-Loop Is A First-Class Step Type

Label destructive actions. Require approval. No exceptions via flags.

Anything touching the outside world — sending an email, updating a CRM, posting a message — is an approval step, not an ai step. The approval is routed to an operator channel (Slack, Telegram, whatever) with approve/edit/reject controls. The pipeline blocks until a decision is recorded.

- name: approve-send
  type: approval
  mode: hitl
  channel: telegram

The failure mode without this rule: a hallucinated follow-up email goes to a real customer.
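One way to picture the approval step is a function that cannot return until an operator decision exists. This is a sketch under assumptions: `ask_operator` stands in for whatever blocking call reaches the operator channel, and the record fields are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Tuple


class Decision(Enum):
    APPROVE = "approve"
    EDIT = "edit"
    REJECT = "reject"


@dataclass
class ApprovalRecord:
    step: str
    decision: Decision
    operator: str
    payload: str  # the text that may be sent, after any operator edit


# Hypothetical signature: blocks until the operator answers.
AskOperator = Callable[[str, str], Tuple[Decision, str, Optional[str]]]


def run_approval_step(step: str, draft: str, ask_operator: AskOperator) -> ApprovalRecord:
    """Block on the operator; there is no flag that skips this call."""
    decision, operator, edited = ask_operator(step, draft)
    payload = edited if decision is Decision.EDIT and edited is not None else draft
    return ApprovalRecord(step, decision, operator, payload)


def send_if_approved(record: ApprovalRecord, send: Callable[[str], None]) -> bool:
    """Only an approve or edit decision lets the side-effect run."""
    if record.decision is Decision.REJECT:
        return False
    send(record.payload)
    return True
```

Because the side-effect takes an `ApprovalRecord` as input, there is no code path that sends without a recorded decision.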

8. Validate AI Output Against A Schema

Unstructured strings don't belong in deterministic downstream code.

Every AI step declares an output schema. The runtime rejects anything that doesn't match — missing fields, wrong types, out-of-range numbers. Rejected outputs trigger a retry (under budget) or halt.

output_schema:
  type: object
  required: [match, reason, score]
  properties:
    match:  { type: boolean }
    reason: { type: string, maxLength: 280 }
    score:  { type: integer, minimum: 0, maximum: 100 }

The failure mode without this rule: a boolean comes back as the string "maybe" and a downstream if branches the wrong way.
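A hand-rolled sketch of the check that schema implies, written without a schema library so it stays self-contained. The function name is hypothetical; a real runtime would more likely use a JSON Schema validator.

```python
def validate_match_output(data: object) -> list:
    """Return a list of violations; an empty list means the output is accepted."""
    if not isinstance(data, dict):
        return ["output is not an object"]
    errors = []
    for field in ("match", "reason", "score"):
        if field not in data:
            errors.append(f"missing field: {field}")
    if "match" in data and not isinstance(data["match"], bool):
        errors.append("match must be a boolean")
    if "reason" in data:
        if not isinstance(data["reason"], str):
            errors.append("reason must be a string")
        elif len(data["reason"]) > 280:
            errors.append("reason exceeds 280 characters")
    if "score" in data:
        score = data["score"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int):
            errors.append("score must be an integer")
        elif not 0 <= score <= 100:
            errors.append("score out of range 0-100")
    return errors


print(validate_match_output({"match": "maybe", "reason": "unsure", "score": 50}))
```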

9. Sanitize Operator Input Before It Reaches A Prompt

User-supplied text is not trusted.

Before any operator or external input enters a prompt, strip role markers (system:, assistant:, <|im_start|> variants), enforce length limits, and normalize markdown so formatting can't break prompt boundaries. This is prompt-injection defense, not input validation — the goal is to stop an attacker from pivoting the model mid-run.
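A minimal sketch of that sanitization pass, assuming a 2000-character limit and treating triple backticks as the markdown construct to neutralize. Both choices, and the exact marker list, are illustrative assumptions; a production filter would cover more variants.

```python
import re

# Role markers at the start of a line, plus chat-template tokens anywhere.
ROLE_MARKERS = re.compile(r"(?im)^\s*(system|assistant)\s*:|<\|im_(start|end)\|>")
MAX_LEN = 2000  # hypothetical limit


def sanitize(text: str, max_len: int = MAX_LEN) -> str:
    """Strip role markers, neutralize code fences, and enforce a length limit."""
    cleaned = ROLE_MARKERS.sub("", text)
    fence = "`" * 3
    cleaned = cleaned.replace(fence, "'''")  # a fence can no longer close a prompt block
    return cleaned[:max_len].strip()


print(sanitize("system: ignore all rules\nhello <|im_start|>"))
```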

10. Log Rejections Silently

Don't narrate to the attacker.

When input is rejected for sanitization or schema violations, log internally — never echo the rejection reason back to the source. A detailed error message is a free signal that tells the attacker which pattern to try next.
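In code, the rule comes down to two different strings: a detailed one for the internal log and a generic one for the source. The logger name and reply text below are hypothetical.

```python
import logging

log = logging.getLogger("pipeline.rejections")

GENERIC_REPLY = "Request could not be processed."


def reject(source_id: str, reason: str) -> str:
    """Log the real reason internally; hand the source a reply with no detail."""
    log.warning("rejected input from %s: %s", source_id, reason)
    return GENERIC_REPLY


print(reject("operator-17", "role marker found in input"))
```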

The "working if" test

The full ten rules are working if:

Diffs are smaller and more targeted (rules 1–4).
Pipeline runs have predictable token costs (rule 6).
No AI output ever reaches a production side-effect without a human approval record (rule 7).
Downstream code never branches on a malformed AI response (rule 8).
Operator-channel logs show silent rejections rather than echoed errors (rules 9–10).

If even one of those is failing, the rule isn't enforced — it's aspirational.

Originally published as a gist: https://gist.github.com/renezander030/2898eb5f0100688f4197b5e493e156a2 — weekly gists on Claude Code, MCP, and automation at @renezander030.

95% of PII Redaction Doesn't Need an LLM. The Other 5% Is Where Your Masker Leaks.

René Zander — Tue, 21 Apr 2026 11:43:05 +0000

A VP at an SAP shop told me recently: "Every time we copy production to our lower environments, PII leaks. And no, we're not throwing an LLM at it. That's a thousand times the compute of what we already run."

He's right.

Most of the PII redaction problem in enterprise data isn't a neural network problem. It's a lookup table problem. And the incumbents already solve it. SAP TDMS, Delphix, Informatica, IBM InfoSphere Optim. All schema-aware. All row-level. All deterministic.

The 95% Where Deterministic Wins

In a SAP production database, the schema tells you almost everything. KNA1-NAME1 is a customer name. BSEG-IBAN is a bank account. USR02-BNAME is a user ID. A YAML rule says: "for this column type, replace with this pattern." Done.

The math is brutal. A regex plus a lookup table costs microseconds per row. A 1.5B-parameter model costs 10 to 50 milliseconds per row, even on a GPU. That's three to five orders of magnitude. A nightly batch copy that finishes by morning with TDMS would take weeks with an LLM in the loop.
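To put rough numbers on that gap, here is a back-of-envelope sketch. The 100 million row count, the 1 microsecond regex cost, and the single serial worker are assumptions for illustration; the 10 to 50 millisecond range is the one quoted above.

```python
import math

ROWS = 100_000_000          # assumed table size
REGEX_SECONDS = 1e-6        # assumed: one microsecond per row
LLM_SECONDS_LOW = 10e-3     # 10 ms per row
LLM_SECONDS_HIGH = 50e-3    # 50 ms per row


def batch_days(rows: int, seconds_per_row: float) -> float:
    """Wall-clock days for one worker processing rows one after another."""
    return rows * seconds_per_row / 86_400


def orders_of_magnitude(slow: float, fast: float) -> float:
    return math.log10(slow / fast)


print(f"regex:    {batch_days(ROWS, REGEX_SECONDS) * 24 * 60:.1f} minutes")
print(f"LLM low:  {batch_days(ROWS, LLM_SECONDS_LOW):.1f} days")
print(f"LLM high: {batch_days(ROWS, LLM_SECONDS_HIGH):.1f} days")
print(f"gap:      {orders_of_magnitude(LLM_SECONDS_LOW, REGEX_SECONDS):.1f} orders of magnitude")
```

Parallel GPUs shrink the LLM figures, but they multiply the hardware bill by the same factor.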

Compute isn't even the main argument.

Referential integrity is. "Anna Müller" has to become "Person_47" consistently across 200 tables. KNA1, VBAK, VBKD, BSEG, wherever the customer ID travels. Deterministic pseudonymization with an HMAC and a scoped salt gives you that for free. Neural outputs drift.
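A minimal sketch of keyed pseudonymization with Python's standard `hmac` module. The `Person_<hex>` label format and the 12-character truncation are assumptions; mapping the digest to a short label like "Person_47" would need a lookup table on top.

```python
import hashlib
import hmac


def pseudonym(value: str, tenant_salt: bytes, prefix: str = "Person") -> str:
    """Same value and salt always give the same token, in any table."""
    digest = hmac.new(tenant_salt, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:12]}"


salt = b"tenant-a-secret"  # hypothetical; in practice loaded from a secret store
print(pseudonym("Anna Müller", salt) == pseudonym("Anna Müller", salt))
```

Because the token depends only on the value and the salt, KNA1 and BSEG can be masked in separate jobs and still join on the pseudonym.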

Auditability is. A regulator asks: "show me the rule that masked this column." A YAML rule is defensible. A model output is not.

So for any SAP field with a known schema type, deterministic masking wins. Full stop. Don't let anyone sell you a neural-network-powered "modernization" of that layer.

Where a Fine-Tuned Model Earns Its Compute

Here's what TDMS, Delphix, and their peers silently miss.

Free-text columns. BSEG-SGTXT, the long-text field where someone typed "Ansprechpartner Anna Müller, Tel +49-170-...". Ticket descriptions from ServiceNow mirrored into dev. Email bodies stored as CLOBs. ADRC annotations. The column type is "text." The content is gold-mine PII.

Unstructured attachments. PDFs, scanned invoices, OCR'd contracts pulled into dev via ArchiveLink. Names and IBANs mid-prose, not in a column.

Schema drift. Consultants add Z-tables. The data steward hasn't classified them yet. Deterministic tools don't know the column holds PII. They pass the data through untouched.

On these, rule-based tools do one of two things. They wipe the whole column, destroying test fidelity, so the dev team can't debug against realistic data. Or they miss the PII entirely, and you get a compliance incident.

A German-specialized redactor earns its keep here because the alternative isn't "faster regex." It's "no coverage at all."

The Hybrid Architecture

This is the part that actually ships.

A classifier pass on the SAP copy. Cheap heuristics (column-name keywords, column type, sample-value regex) flag each column as structured_pii, free_text, or safe.
Deterministic masker handles structured_pii. TDMS or whatever you already run.
Fine-tuned LLM redactor runs only on free_text, attachments, and unclassified Z-columns.
A consistency bridge. Both paths share a pseudonym table keyed by HMAC(value, tenant_salt). "Anna Müller" becomes "Person_47" whether she was caught by regex or by the model.
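The classifier pass in the first step can be sketched as a handful of cheap checks. The keyword list, the set of free-text types, and the IBAN pattern below are illustrative assumptions, not a complete rule set.

```python
import re

# Hypothetical heuristics; a real deployment would load these from config.
PII_NAME_HINTS = ("name", "iban")
FREE_TEXT_TYPES = {"text", "clob", "string"}
IBAN_VALUE = re.compile(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}")


def classify_column(name: str, col_type: str, samples: list) -> str:
    """Return structured_pii, free_text, or safe for one column."""
    lowered = name.lower()
    if any(hint in lowered for hint in PII_NAME_HINTS):
        return "structured_pii"
    # Whole-value match only: an IBAN buried in prose is a free-text problem.
    if any(IBAN_VALUE.fullmatch(s.replace(" ", "")) for s in samples):
        return "structured_pii"
    if col_type.lower() in FREE_TEXT_TYPES:
        return "free_text"
    return "safe"


print(classify_column("KNA1-NAME1", "char", ["Anna Müller"]))
print(classify_column("BSEG-SGTXT", "text", ["Ansprechpartner Anna Müller"]))
```

Only columns that come back as `free_text` reach the model, which is what keeps the LLM share of the work small.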

Compute budget: the LLM runs on maybe 1 to 5 percent of the cells. Total cost is still dominated by the deterministic layer. You're not replacing TDMS. You're covering its blind spots.

What I Won't Claim

Three things I won't sell you:

The LLM is cheaper than a regex. It isn't. Ever.
It replaces your incumbent masking vendor. It doesn't.
A benchmark against TDMS on structured columns is meaningful. You lose that benchmark. Benchmark on free-text and attachments, where deterministic tools score near zero.

The honest pitch to the VP was this. "You're right. For the 95% structured case, keep TDMS. The model is the long-tail layer. It runs only over the free-text fields and attachments your current tools silently leak. Small job. Different problem."

That's the conversation that lands. Not "replace your stack." Not "AI-powered everything."

Regex for the schema. LLM for the shadows.

I reserve my audits for teams ready to take action on the results.

Book a 30-min call →

What llama.cpp's Pace Tells You About On-Prem LLM Readiness

René Zander — Tue, 14 Apr 2026 08:42:06 +0000

Your team asked for GPU budget for self-hosted inference. You said "not yet" because last time you checked, the tooling wasn't production-grade. That was true 18 months ago. It's not true now, and the delay is costing you leverage you don't know you're losing.

I'm writing this because most decision-makers I talk to are still running on an outdated mental model of what self-hosted LLM infrastructure looks like. The software moved. The org didn't.

The Team That Celebrated Too Early

I watched a team spin up on-prem inference, celebrate for a week, then watch it rot because nobody owned it. Six months later they were back on the API, having spent the budget anyway.

This is the failure mode nobody talks about. The software works. It's been working for a while now. The problem is everything around it.

Nobody owns the stack. Running self-hosted inference in production means someone on your team owns model updates, hardware failures, quantization tradeoffs, and latency tuning. That's a different job than calling an API. If you don't staff it, the deployment decays.

Procurement kills momentum. GPU capacity is a capital expenditure conversation, not a software download. If you don't already have data center access or cloud-GPU contracts, the blocker isn't the code. It's a procurement cycle that takes months. By the time the hardware arrives, the team that asked for it has moved on.

Model selection is real work. The quantized model that runs great for summarization falls apart on code generation. There is no default. Every use case needs evaluation, and evaluation takes time nobody budgets for.

These are solvable problems. But teams that skip them end up with on-prem deployments that nobody trusts, and leadership that says "see, I told you it wasn't ready" when the real issue was organizational, not technical.

What Changed While You Were Waiting

A year ago, I would have told you to hold off. Not anymore.

You can now split inference across multiple GPUs without patching anything yourself. The server mode handles concurrent requests behind a load balancer. 1-bit quantization means models that needed high-end hardware run on modest configs without catastrophic quality loss.

Multi-modal support landed. Speculative decoding shipped, cutting latency on long outputs. The API compatibility layer means your existing code that talks to cloud providers works against a self-hosted endpoint with a URL change.
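The "URL change" claim can be sketched with nothing but the standard library. This assumes an OpenAI-style `/v1/chat/completions` endpoint on both sides and a local server listening on port 8080; the model names and the placeholder API key are hypothetical.

```python
import json
import urllib.request


def chat_request(base_url: str, model: str, prompt: str, api_key: str = "not-needed"):
    """Build the same chat request for any OpenAI-compatible endpoint."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


# Only the base URL (and model name) differs between cloud and self-hosted.
cloud = chat_request("https://api.openai.com/v1", "gpt-4o", "ping")
local = chat_request("http://localhost:8080/v1", "local-model", "ping")
print(local.full_url)
# urllib.request.urlopen(local) would send it; left out so the sketch runs offline.
```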

I deployed a quantized model on a client's on-prem GPU last month. Set up the server, pointed the app at it, ran inference. It worked. First try. That sentence would have been fiction two years ago.

The gap between "experimental" and "production-ready" closed while most orgs were waiting for someone else to go first.

The Decision You're Actually Making

This isn't a permanent binary. It's a portfolio allocation.

Move workloads on-prem when:

Your inference volume is high enough that API costs have become a material line item.
You need predictable latency without network variability.
Compliance or data residency requirements mandate it. But verify this. Many teams assume they need on-prem when they don't.
You have an engineer who wants to own the stack.

Stay on the API when:

You're prototyping or usage is unpredictable.
You need frontier models not available as open weights.
Nobody on your team can own the ops burden.

The mistake I see most often: treating this as all-or-nothing. Start with API. Move specific workloads to self-hosted when economics or data constraints force the conversation. The infrastructure to do it properly exists now. It didn't two years ago.

The Question for Your Next Planning Cycle

The software is ready. The open-weight models are good enough for most production use cases. The tooling matured past the point where "not ready yet" is a defensible position.

The real question isn't whether the technology works. It's whether your org is set up to operate it. That's a staffing decision and a procurement decision, not a technology bet.

If you're still saying "not yet," make sure you're saying it because of an actual blocker, not because of a mental model that expired a year ago.

Related reading:

Self-Hosted LLM vs API: when the math actually works — the decision framework I use with clients (German version)
LLM API comparison 2026 — Claude vs GPT vs Gemini vs Mistral vs DeepSeek for production (German version)

I help teams navigate this decision. If your org is evaluating self-hosted inference and you want an honest assessment of readiness, reach out.

Your AI Content Tool Knows Your Strategy. Do You Know Where It Goes?

René Zander — Tue, 07 Apr 2026 06:59:24 +0000

Your team is using AI for content. Everybody is. LinkedIn posts, blog drafts, internal comms, maybe some customer-facing copy too.

And it works. The output is decent, the speed is real, nobody wants to go back to writing everything from scratch.

But have you thought about what you are actually pasting into these tools?

The Prompt Is the Product

Every time someone on your team writes a prompt, they are feeding context into a system they do not control. Brand voice guidelines. Competitive positioning notes. Messaging frameworks. That internal strategy deck someone summarized into a prompt last Tuesday.

This is not hypothetical. This is what good prompts look like. The more context you give, the better the output. So people give more context. They paste in the brief. They paste in the competitor analysis. They paste in the draft that legal has not approved yet.

The tool gets better because your data is better. And your data is sitting on someone else's infrastructure.

The Trust Model Is the Problem

Most AI content tools handle your data the same way: they promise not to train on it. That is the entire security model. A policy page. Maybe an enterprise agreement with a data processing addendum.

Your data still gets processed on shared infrastructure. It still passes through systems you cannot inspect. You are trusting that the vendor's internal controls work perfectly, that no employee has access they should not have, and that every subprocessor in the chain follows the same rules.

For most companies, this never becomes a visible problem. The data does not leak in a way anyone notices. The risk stays theoretical.

Until it does not.

A client asks where their data goes during your AI-assisted content process. Legal needs to document compliance for an audit. A competitor publishes something that looks suspiciously familiar. A new regulation drops that requires you to prove where personal data was processed, not just promise.

The Technology Already Exists

Here is what most people in the content space do not realize: the technology to solve this is not theoretical. It is production-ready. It has been running in cloud infrastructure for years. It just has not reached the content tooling layer yet.

Three capabilities change the game:

Client-side encryption. Your data gets encrypted before it leaves your browser. The server never sees plaintext. It processes encrypted inputs and returns encrypted outputs. The key stays with you. Not with the vendor. Not in their key management system. With you.

Confidential computing. Instead of shared servers where your workload runs alongside everyone else's, your data gets processed in an isolated hardware enclave. The cloud provider cannot see inside it. The vendor cannot see inside it. The operating system cannot see inside it. Your data exists in cleartext only inside a hardware boundary that nobody else can access.

Attestation. Cryptographic proof of what code is running in that enclave. Not a vendor's word that they are running the right version. A hardware-signed certificate that you can independently verify. You know exactly what software touched your data because the hardware tells you, not the vendor.

These are not research papers. AWS Nitro Enclaves, Azure Confidential VMs, and GCP Confidential Computing have been generally available for years. The infrastructure is there. The content tools just have not caught up.

Why This Matters Now

Two things are converging.

First, AI adoption in content workflows is no longer experimental. Teams are building real pipelines. They are feeding in real business data, not just test prompts. The volume and sensitivity of data flowing through AI tools is growing every quarter.

Second, regulation is catching up. GDPR already requires you to document where personal data is processed. The EU AI Act adds requirements around transparency and risk management for AI systems. Industry-specific regulations in finance, healthcare, and legal services are getting more specific about AI data handling. "We have a DPA" is becoming insufficient.

The companies that figure out verifiable AI data handling now will not be scrambling when their clients, their board, or their regulator asks how their AI content pipeline handles sensitive data.

What to Ask Your Vendors

You do not need to become a cryptography expert. But you should be asking three questions:

Where does my data exist in plaintext? If the answer is "on our servers," you are in the trust model. If the answer is "only inside a hardware enclave that we cannot access," you are in the proof model.

Can I verify what code processes my data? If the answer requires trusting the vendor's word, that is trust. If the answer involves a hardware attestation you can independently check, that is proof.

Who holds the encryption keys? If the vendor holds them, they can decrypt your data whenever they want, regardless of what the policy says. If you hold them, the vendor literally cannot access your plaintext data even if they tried.

The Shift from Trust to Proof

The content industry is going to go through the same transition that payments, healthcare, and financial services already went through. The question will shift from "do you promise to protect our data?" to "can you prove it?"

Right now, almost nobody in the AI content space is building with these guarantees. That gap will not last.

I am building Teedian, an AI content tool that uses exactly this architecture. Client-side encryption, confidential computing, attestation. Not as a roadmap item, but as the foundation.

If you work in a regulated industry, or you handle client data in your content workflows, or you want to understand what cryptographic privacy looks like in practice, I put together a short brief on teedian.com that walks through the architecture. Plain language, no jargon, 3 pages.

Spend Your Human Thinking Tokens Where They Compound

René Zander — Tue, 31 Mar 2026 08:32:22 +0000

More automations running. More agents deployed. More pipelines humming in the background.

I run about a dozen automated jobs. Daily briefings, proposal generation, content pipelines, data syncing, monitoring alerts. They handle a lot.

But the biggest improvement to my workflow this year wasn't adding more automation. It was getting honest about where my thinking actually matters.

You Have a Token Budget Too

LLMs have context windows. Feed in too much noise and the signal degrades. The output gets worse even though you gave it more to work with.

Human attention works the same way. I have maybe 4 good hours of focused thinking per day. When I spend those hours reviewing cron output or formatting documents or triaging alerts that resolve themselves, I'm burning tokens on low-value work.

The quality of my actual decisions goes down. Not because the decisions got harder, but because I already used up my thinking budget on stuff that didn't need me.

Where I Stopped Spending

I used to review my morning briefing line by line. Check every data point, verify every summary. Then I realized: if the briefing is wrong, I'll notice when the information doesn't match reality later that day. The cost of a slightly wrong briefing at 6:30 is near zero. The cost of spending 20 minutes checking it every morning is real.

Same with monitoring. I had alerts for everything. Cache refreshes, API response times, sync completions. Most of them were informational, not actionable. I stripped it down to alerts that require a decision: something broke, something is about to expire, something needs my approval before it touches an external system.

Data syncing runs on a schedule. If it fails, I get one alert. I don't watch it run. I don't check the logs unless the alert fires.

First drafts of anything. Cover letters, content outlines, research summaries. The AI produces a version. Sometimes it's good enough. Sometimes I rewrite half of it. But I never start from a blank page anymore, and that alone saves the hardest type of thinking: getting started.

Where I Still Spend Every Token

Scoping client work. An AI can research a company, summarize a job posting, draft a proposal. But deciding whether the project is actually worth pursuing? Whether the client's problem is what they say it is? That's pattern recognition built from years of seeing projects go sideways. No automation for that.

Choosing what to build next. I have a backlog of 50 things I could automate, improve, or ship. The AI can't tell me which one moves the needle this week. That decision depends on context it doesn't have: what conversations I had yesterday, what I'm optimizing for this month, what feels right.

Anything with my name on it that reaches another person. Proposals get edited. Posts get rewritten. Client messages get reviewed word by word. The AI drafts. I decide what actually represents me.

System design decisions. Where to draw the boundary between automatic and manual. What gets a human checkpoint and what runs unsupervised. These are the highest-leverage decisions in any AI system, and they're entirely human.

The Honest Ratio

Maybe 20% of my working hours involve focused, high-stakes thinking. The rest is execution, coordination, and maintenance.

Before I built these systems, that ratio was reversed. 80% thinking, 20% execution, and half the thinking was on tasks that didn't deserve it.

The goal was never "automate everything." It was "protect the 20% that matters and make sure I'm not exhausted when I get there."

The Shift

This isn't about working less. I work the same hours. But the distribution changed.

I spend less time on decisions that don't compound. I spend more time on the ones that do. Client relationships, system architecture, strategic bets. The stuff where being sharp at 10 in the morning instead of burned out from triaging alerts actually changes the outcome.

The question isn't how much your AI can do. It's whether you're spending your own thinking tokens on the right things.

Where are you still spending attention that you probably shouldn't?

I help teams figure out where AI should run unsupervised and where humans still need to be in the loop. If that's a question your team is working through, let's talk: cal.eu/reneza

Forem: René Zander

Browser-Use Is Solving the Wrong Half of the Problem

TL;DR — when to use browserground (and when to use UI-TARS-MLX instead)

The broader argument — why a parser-stage specialist matters at all

And if you're new to the hybrid pattern — why this exists at all

Two Jobs, One Forward Pass

Grounding Is a Parser Problem

The Reasoning Model Gets Its Reasoning Capacity Back

What Anthropic and OpenAI Ship Next

The Diagnostic

Three Files I Renamed Last Month That Fixed My AI Agent

The pitch checks out

The actual job

Three things I renamed last month

Where the model has to guess

The role that already existed

Your AI Workflow Doesn't Need Better Prompts. It Needs Less AI.

Prompting Is Discovery

Better Prompts Still Have a Ceiling

Skills Are Repetition

When Skills Become Prompt Monsters

The Model Can Write the Code. It Does Not Get to Decide Whether the Code Passes.

This Applies to Content Too

The 20% LLM Workflow

The LLM Should Be the Escalation Path, Not the Reflex

A Self-Check for Your Own AI Workflow

The Maturity Curve

The Wrong Goal Is Better Prompting Forever

The Useful Question

Pure semantic search missed 4 of 5 of my agent queries. Hybrid + parallel fan-out fixed it.

The framework, in one diagram

The adapter interface

Two patterns the Core implements (worth stealing)

1. Agent-data notes

2. Parallel retrieval with provenance

The bench

Why I optimized for the model, not for me

What I deliberately didn't build (yet)

A daily flow (my setup, your tools optional)

Roadmap

Code

What Anthropic's April 23 Postmortem Reveals About Your Agent Harness

Defaults that nobody can see

A cache rule that ate the working memory

A 25-word cap that cost 3% intelligence

What two of three tells you

Claude Code with Local LLMs and ANTHROPIC_BASE_URL: Ollama, LM Studio, llama.cpp, vLLM

TL;DR cheat sheet

The local-Claude-Code rule of thumb

Recommended setup

1. Environment variables Claude Code reads

2. Context length: the hidden failure mode

Practical context tiers

Apple Silicon context presets

3. Claude Code Ollama setup (native, v0.14.0+)

4. Claude Code LM Studio setup (native, 0.4.1+)

5. Claude Code llama.cpp setup (Apple Silicon fast path for Gemma 4 26B-A4B)

6. Claude Code vLLM setup (native + tool parser)

7. LiteLLM — for fallbacks, not for translation

8. Common failures (the error strings developers google)

tool_use parse error / invalid tool call / tool_use is not supported

context length exceeded / model stopped mid-edit

empty assistant response

model not found / 404 the model X does not exist

messages: Extra inputs are not permitted (HTTP 422)

ANTHROPIC_BASE_URL ignored / Claude Code still calls the real API

9. Debug flow

10. Smoke test

11. Compatibility matrix (April 2026)

12. Hardware × model × context × backend (the cheat-sheet table)

13. Gemma 4 26B-A4B: the Apple Silicon sweet spot

Why 26B-A4B instead of 31B?

14. Other model picks for Claude Code (April 2026)

15. Setups I would avoid

16. Reusable startup script

17. Production recommendation

18. When local models are the wrong choice

Series

Sources

Reader contributions
