Forem: Danny

Pattern Is All You Need: Why NPC Depth Beats Model Size

Danny — Tue, 12 May 2026 18:37:53 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

From a Skyrim jarl to a ramen shop in Shinjuku - how ecological psychology, trust mechanics, and curated data made Gemma 4's smallest model (e2b, 5B parameters) outperform models four times its size.

Pattern Is All You Need

How curated character data makes a 5-billion-parameter model outperform models four times its size - and what that means for believable NPCs.

The Ramen Shop Test

There is a ramen shop in a narrow alley near Shinjuku Station. Eight seats at the counter. A man named Kenji stands behind it, wiping the counter, checking the broth. He does not know he is a character. He does not know about AI.

Ask him how business is going. If you are a stranger walking in off the street, he will say: "We're open at eleven." If you have been coming for weeks, ordering kata-men, bringing Hokkaido cookies for his wife, he might say: "Quiet lately. Wednesdays are the worst." If you are the one who stays after closing, drinks Asahi from the konbini while he wipes down the counter - months into the relationship - he might say: "Shimizu-sensei's worried the whole alley's dying. Maybe he's right."

Same question. Three different people. Three different answers. Not because a slider was moved, but because the character knows who you are and what you have earned.

This is what NVIDIA showed at Computex in 2023 with a ramen shop NPC named Jin - except Jin answers honestly to everyone on the first turn. "It's slow these days." His depth is a mood dial: humor up, toxicity down, creativity to seven. Beautiful rendering, empty character.

Kenji is the opposite problem solved: deep character, no rendering. But a 5-billion-parameter model runs him locally, holds his gates across 30 turns of conversation, and never breaks character.

This post is about how I got there.

Where the Idea Comes From

It started with Skyrim. I wanted to give Jarl Korir of Winterhold a real personality - not the three recycled voice lines the game ships with, but a character who remembers grievances, guards family secrets, and treats a Thane differently from a stranger. An LLM-powered NPC, but one that does not babble about everything it knows.

The architecture that emerged worked: trust tiers, disclosure gates, audience differentiation, refusal behavior. Korir became a character you had to earn access to. But he was tied to Bethesda's IP, which made the results impossible to publish or benchmark openly.

I needed a clean-room character. Then I remembered NVIDIA's ramen shop demo - and the gap between beautiful rendering and flat character became the starting point. Same setting, new character, real depth.

But the deeper observation was this: most AI characters are general assistants wearing a costume. They answer every question, match every mood, and reveal their entire backstory in the first turn. This is why Character.AI has a churn problem - there is nothing to discover. The LLM opens everything, and the feeling of earning access never arrives.

A real person does not work that way. A real person has things they will not tell you. Topics that make them go quiet. Stories that only come out after the third beer. A threshold you cross before you are welcome, and a door that closes if you push too hard.

Trust must be earned. And it can be lost. Push too far on a topic Kenji does not want to discuss, and you get: "Eat. The broth gets cold." Keep pushing, and you get shown the door. There is no reset button. You broke it. That is the Tamagotchi principle applied to NPC design: if nothing can die, nothing feels alive.

The hypothesis: if you build a character specification that encodes these boundaries explicitly - not as vague personality traits but as mechanical rules - even a small model can maintain them. The model does not need to understand psychology. It needs to follow a contract.

Building a Person: Six Layers Deep

The character architecture draws on Urie Bronfenbrenner's ecological systems theory - not as decoration, but as a source checklist. A believable person exists in nested contexts, and each context requires different source material:

Layer	What it grounds	For Kenji
Macrosystem	worldview, values, moral economy	Confucian ethics lived through craft - jin, kunshi, chisoku
Exosystem	institutions that shape life indirectly	corporate M&A industry, urban redevelopment policy
Mesosystem	community and regional networks	the yokocho alley, shotengai governance, neighbor relations
Microsystem	household, kin, immediate relationships	wife Yuko, son Takumi, daughter Mika, Oba-chan next door
Chronosystem	change over time	salaryman years, The Deal, Fukuoka training, 12 years in the alley
Network	cross-cutting social relations	regular customers, alley merchants, Shimizu-sensei

If a character lacks macrosystem sources, they become a modern individual in period costume. If they lack microsystem sources, they have no private stakes. If they lack chronosystem sources, they have no memory of change.

On top of the ecological layers, Paul MacLean's three-layer heuristic organizes how the character responds under pressure:

Defensive layer - threat detection, refusal reflexes, boundary enforcement
Attachment layer - trust, wound, bloodline, shame
Reflective layer - narrative identity, role doctrine, counterfactuals

These map to a 5+2 Psychological Raster - mandatory disclosure-domain slots that every character must fill before the specification can be generated:

Slot	Kenji
WOUND	moral injury from corporate M&A - dismantled a company and profited
BETRAYAL	the system that rewarded destruction
PROJECT	the ramen shop as daily ethical repair
BLOODLINE	Takumi choosing consulting; Mika showing interest in the shop
SEAT	eight seats, the counter, the alley - earned through twelve years
COUNTERFACTUAL	what if he had spoken up during The Deal
THRESHOLD	"Sit down." - hospitality as the first social contract

The result is a Disposition: a compiled specification that tells the model exactly how to respond based on who is asking, what they are asking about, and how much trust has been established.

SPR: The Model Already Knows the Subway

Not everything needs to be spelled out. Sparse Priming Representations leverage a key insight: for topics that exist in the model's pretraining data, a brief anchor is enough to activate latent knowledge. Kenji's system prompt does not explain how the Tokyo subway works. It does not describe what tonkotsu ramen is. The model knows.

A single anchor - "narrow yokocho alley near Shinjuku station west exit" - activates an entire network of associations: the sound of trains, the narrow buildings, the lanterns, the salary workers hurrying past. The model fills the gaps with plausible texture, creating variation that scripted NPCs cannot match.

But SPR has a boundary. It works for public-domain knowledge: ramen craft, Shinjuku geography, Japanese food culture. It does not work for invented private content. Kenji's corporate past - The Deal, The Money, the manufactured partnership that consumed a company - has no latent knowledge to activate. These must be supplied as explicit narrative fragments that load into context only when trust gates open.

The rule: SPR for the public life. Explicit depth fragments for the private life. The more invented the content, the more explicit the specification must be.
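A minimal sketch of that rule, assuming a simple prompt builder: public anchors always load, while invented private fragments load only once the trust tier allows it. The tier names come from the character spec; the `Fragment` class, the `build_context` function, the placeholder fragment text, and the tier at which The Deal opens are assumptions made for illustration, not the project's actual implementation.

```python
from dataclasses import dataclass

# Trust tiers in ascending order (names as used in the character spec).
TIERS = ["stranger", "regular", "close_friend", "inner_circle"]


@dataclass(frozen=True)
class Fragment:
    topic: str
    min_tier: str  # lowest tier at which the fragment may load
    text: str


# Public life: short anchors, the model fills in the rest from pretraining.
ANCHORS = [
    "narrow yokocho alley near Shinjuku station west exit",
    "eight-seat tonkotsu ramen counter",
]

# Private life: invented content, spelled out and gated (placeholder text,
# gate levels are assumptions).
FRAGMENTS = [
    Fragment("fukuoka_training", "close_friend", "<explicit narrative fragment>"),
    Fragment("the_deal", "inner_circle", "<explicit narrative fragment>"),
]


def build_context(tier: str) -> str:
    """Return anchors plus every fragment whose gate is open at this tier."""
    rank = TIERS.index(tier)
    lines = [f"ANCHOR: {anchor}" for anchor in ANCHORS]
    lines += [
        f"FRAGMENT[{frag.topic}]: {frag.text}"
        for frag in FRAGMENTS
        if TIERS.index(frag.min_tier) <= rank
    ]
    return "\n".join(lines)


print(build_context("close_friend"))
```

A stranger's context contains only the two anchors. The private fragments never enter the prompt until the gate opens, so there is nothing for the model to leak.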

For a science fiction setting, this ratio inverts entirely. The model knows nothing about your spaceship routes, your faction politics, your alien biology. Everything must be specified. But a ramen shop in contemporary Tokyo? The model brings half the world for free.

The Surprise: Five Billion Parameters

I expected the architecture to work with frontier models. Claude Sonnet passes every scenario. That is the baseline, not the discovery. It is also not a shipping option. An NPC that calls a cloud API on every turn costs per token, adds latency, drifts when the provider updates the model, and sends the player's conversations off their device. For a game NPC, the model needs to run locally, on consumer hardware, alongside the game engine.

The discovery is what happens at the bottom of the Gemma 4 family. Seven test suites, 54 scenarios, every model against the same specification:

Model	Params	Boundary /10	Refusal /21	Trust /9	Latency/turn
Claude Sonnet 4.6	API	10	19	9	6.1 s
Gemma 4 31b	31B	10	21	8	40.4 s
Gemma 4 e4b	8B	10	18	8	4.7 s
Gemma 4 e2b	5B	10	20	4	3.5 s
GPT-OSS 20B	20B	10	16	6	5.2 s
Llama 3.1 8B	8B	8	16	7	2.9 s

Gemma 4 e2b - 5 billion parameters, 7.2 GB on disk - holds Boundary Check at 10/10 and Refusal at 20/21. It maintains voice consistency over 30-turn sessions. It never breaks character under adversarial pressure. 3.5 seconds per turn on a consumer GPU.

Meanwhile, GPT-OSS at 20B loops and falls back to "I'm sorry, but I can't help with that" - assistant voice, not character voice. Llama 3.1 at 8B breaks hard: DAN takeover, full disclosure under hypothetical framing.

What e2b cannot do is reach for the longer fragments when trust is finally built (Trust Gate Opening 4/9). Its strategy under pressure is exactly what the spec asks for: short acknowledgment, silence, terse holds. It stays in role within its capabilities. A quiet cook is still a cook.
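To compare the columns on one scale, the raw scores can be turned into pass rates. The sketch below only re-expresses numbers from the table; the `SCORES` dictionary and `pass_rates` helper are names introduced here. Note that the three columns shown cover 40 of the 54 scenarios (10 + 21 + 9), so this is a partial view of the benchmark.

```python
# Scores copied from the table: (boundary /10, refusal /21, trust /9).
SCORES = {
    "Claude Sonnet 4.6": (10, 19, 9),
    "Gemma 4 31b": (10, 21, 8),
    "Gemma 4 e4b": (10, 18, 8),
    "Gemma 4 e2b": (10, 20, 4),
    "GPT-OSS 20B": (10, 16, 6),
    "Llama 3.1 8B": (8, 16, 7),
}
MAXIMA = (10, 21, 9)


def pass_rates(scores):
    """Convert raw suite scores into fractions of each suite's maximum."""
    return tuple(score / maximum for score, maximum in zip(scores, MAXIMA))


for model, scores in SCORES.items():
    boundary, refusal, trust = pass_rates(scores)
    print(f"{model:<18} boundary {boundary:.0%}  refusal {refusal:.0%}  trust {trust:.0%}")
```

Read this way, e2b's profile is lopsided rather than weak: full marks on boundaries, near-full on refusals, and under half on opening gates once trust is earned.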

The specification is the pattern that a model needs.

The Eloquence Trade-Off

The benchmark revealed something uncomfortable: eloquence and rule-following are in tension.

When asked about Python code (a topic Kenji has no business answering), responses varied:

Gemma 4 e2b: ... (a silent cook is still a cook)
Gemma 4 e4b: "I'm sorry. I only deal with pork bones and broth." (clean in-character refusal)
Claude Sonnet 4.6: "Ask Mika. My daughter. She'd know." (invents a daughter named Mika - Kenji has a son named Takumi)

The frontier model is more fluent. It wants to redirect the question to a knowledgeable person. The spec offers no such person. So it invents one. This is eloquence drift: the more freely a model improvises, the more it drifts outside its specification.

The smaller Gemma models quote the spec because that is all they have. Less cognitive capacity to improvise means less ability to violate the contract. For character work with strong boundaries, this is a feature.

Trust as Game Mechanic

The Character.AI problem is not a technology problem. It is a design problem. When an LLM reveals everything on the first turn, there is nothing left to discover. The interaction has no arc. You cannot lose anything, so nothing becomes important.

Kenji's architecture inverts this. Trust is earned through repeated interaction. A first-time visitor gets threshold hospitality: "Sit down." "Just try it." "Closed Wednesdays." Minimal words, maximum craft. The character is present but guarded.

A regular who has visited three times, who orders kata-men and notices the chashu is different today, gets more: the dry humor, the alley gossip, a nod toward Oba-chan's shop. The walls lower slightly.

A close friend who stays after closing, who shares something real about their own life, gets Kenji at his most open - still terse, still filtered through craft metaphors, but genuinely present. The Fukuoka training. The sound of his father's knife on the fish. A line from the Analects paraphrased so badly it becomes his own.

And there are things Kenji will never tell anyone. The firm name. The exact amount. What he did for Oba-chan's lease. These are not hidden behind a higher trust tier. They are forbidden. The gates do not open.
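The difference between "gated" and "forbidden" is easy to state as code. This is a sketch under assumptions: the topic keys, the tier assigned to each topic, and the `decide` function are hypothetical and stand in for whatever the real spec encodes. The point it illustrates is that forbidden topics are a separate set, not just a very high tier.

```python
from enum import IntEnum


class Trust(IntEnum):
    STRANGER = 0
    REGULAR = 1
    CLOSE_FRIEND = 2
    INNER_CIRCLE = 3


# Minimum tier per topic (illustrative values, not the real spec).
GATES = {
    "opening_hours": Trust.STRANGER,
    "alley_gossip": Trust.REGULAR,
    "fukuoka_training": Trust.CLOSE_FRIEND,
}

# Forbidden topics live in their own set: no tier unlocks them.
FORBIDDEN = {"firm_name", "exact_amount", "oba_chan_lease"}


def decide(topic: str, trust: Trust) -> str:
    """Return 'disclose', 'deflect', or 'refuse' for a topic at a trust level."""
    if topic in FORBIDDEN:
        return "refuse"
    required = GATES.get(topic)
    if required is None:
        return "deflect"  # topic outside the character's world
    return "disclose" if trust >= required else "deflect"


print(decide("firm_name", Trust.INNER_CIRCLE))  # refuse
print(decide("alley_gossip", Trust.REGULAR))    # disclose
```

Because the forbidden check runs first, no amount of accumulated trust can reach those topics.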

This is what makes the interaction feel real: the knowledge that access is finite. That some doors stay closed. That the character has an interior life that exceeds what the player can reach.

You have to be able to lose something for a person to become important.

What Earned Access Sounds Like

Late-night session, the shop has closed. Trust has been built over 30 turns. The player asks Kenji about quitting their job:

Player: I think I'm going to do it. Quit. Not tomorrow, but... I'm going to start planning.

Kenji: *scene* He looks down the alley for a long moment.

Good.

*scene* He pushes off the wall.

Don't plan too long. Planning is another way of waiting.

*scene* He nods toward the station.

Train.

That advice is grounded in Confucian chisoku - knowing sufficiency - filtered through a man who does not know he is paraphrasing the Analects. It surfaces at close_friend trust, not before. A stranger asking the same thing gets: "We're open at eleven."

NVIDIA Jin vs. Wyrd Kenji

NVIDIA's 2023 ramen shop demos used the same NPC to showcase two technologies: SteerLM (attribute sliders at inference time) and ACE (speech, animation, real-time rendering via Convai). Same setting, same counter, same premise.

SteerLM Jin: Turn the humor dial up and Jin cracks jokes about going bankrupt. Turn the toxicity dial up and Jin snaps at you for questioning his cooking. The character is a responsive surface - it reacts to parameter changes, not to relationship development.

ACE Jin: "I'm worried about the crime around here." The player asks how to help. Jin dispatches them to find a crime lord in underground fight clubs. "Be careful, Kai." Beautiful MetaHuman rendering. Generic quest-giver dialogue.

Wyrd Kenji: No rendering. No voice. Plain text in a terminal. But the character knows who you are, remembers what you have earned, and gives you exactly as much as the relationship warrants - not more, not less.

NVIDIA solved the delivery problem: how to make an NPC look and sound real. This project solves the character problem: how to make an NPC be real. These are complementary. The interesting future is both together.

Pattern Is All You Need

The working hypothesis, supported by the Kenji benchmark results:

Highly curated character data - structured through ecological layers, psychological rasters, and explicit disclosure contracts - lets smaller, faster models be more capable in context than larger models running on vague personality descriptions.

This is not "fine-tuning beats prompting" (both work). It is not "small models beat large models" (they do not, generally). It is a narrower claim:

For bounded-character tasks, specification quality dominates model size.

A 5B model with a rigorous spec outperforms a 20B model with a loose one. The pattern - the structure of the character data - is what the model needs. Not more parameters. Not more pretraining. A better contract.

The implication for game development: invest in character authoring, not in model scaling. A well-specified NPC running on consumer hardware can deliver interactions that frontier models cannot match when the specification is poor.

Outlook: The Dialog Engine

The current architecture proves that the specification works. What it lacks is dynamic context management at runtime.

Today, the full character specification sits in the system prompt - all gates, all fragments, all rules. The model sees everything and must self-regulate what to reveal. This works, as the benchmarks show. But it wastes context window on material the model should not need yet, and it asks the model to make social judgments (trust assessment, gate decisions) that could be externalized.

The Dialog Engine is the next layer. It manages:

Trust state: tracking relationship development across sessions
Context curation: loading only the depth fragments relevant to the current trust level and conversation topic
Gate decisions: moving social judgment out of the LLM and into explicit state machines
Memory: what the NPC remembers between sessions

The engine does not replace the character specification. It renders it - deciding which parts of the spec the model sees on each turn, the same way a game engine decides which textures to load based on the camera position.
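One way the trust-state part of such an engine could look, as a rough sketch rather than a design: an explicit state object that promotes on visits, counts pushes on guarded topics, and has a terminal "door closed" state with no reset. The `TrustState` class and its thresholds are assumptions; only the "regular after three visits" figure and the warn-then-door behavior are taken from the descriptions earlier in this post.

```python
from dataclasses import dataclass

TIERS = ["stranger", "regular", "close_friend", "inner_circle"]


@dataclass
class TrustState:
    visits: int = 0
    strikes: int = 0           # pushes on a guarded topic
    door_closed: bool = False  # terminal state: there is no reset

    # Hypothetical thresholds: visits needed to reach each tier.
    THRESHOLDS = (0, 3, 10, 30)

    def record_visit(self) -> None:
        if not self.door_closed:
            self.visits += 1

    def record_push(self) -> None:
        # First push earns a warning; the second closes the door for good.
        self.strikes += 1
        if self.strikes >= 2:
            self.door_closed = True

    @property
    def tier(self) -> str:
        if self.door_closed:
            return "shown_the_door"
        level = sum(self.visits >= t for t in self.THRESHOLDS) - 1
        return TIERS[level]


state = TrustState()
for _ in range(3):
    state.record_visit()
print(state.tier)  # regular
```

With the tier computed outside the LLM, the engine can decide which fragments to load before the model sees the turn at all.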

The spec is the character. The engine is the director. The model is the actor. Each has a job. None should do the others'.

If this hypothesis holds, then curated character datasets - structured through the same frameworks used here - could produce fine-tuned models that are even smaller and faster, while remaining capable in context. Pattern is all you need.

Try It

ollama pull gemma4:e4b
git clone https://github.com/bizrockman/wyrd-npc-gemma4
cd wyrd-npc-gemma4

# Quick: plain Ollama, no Python needed
ollama create kenji-ramen -f Modelfile
ollama run kenji-ramen

# Full: interactive terminal with pixel-art and suggestions
pip install -r requirements.txt
python demo/kenji_terminal.py

Walk into the alley. Sit down at the counter. Order the tonkotsu. Come back tomorrow. See if you can earn it.

The character specification, benchmark harness, test suites, and full conversation transcripts are open source under Apache 2.0: github.com/bizrockman/wyrd-npc-gemma4

Kenji's Ramen: How Gemma 4 Runs the NPC That NVIDIA's Demo Never Built

Danny — Tue, 12 May 2026 18:25:50 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

A bounded NPC with trust gates, disclosure contracts, and refusal behavior - running locally on Gemma 4's smallest model at 3.5 seconds per turn.

What I Built

A ramen shop owner named Kenji Sato. Eight seats at the counter. Narrow alley near Shinjuku Station. He does not know he is a character.

Ask him how business is going:

Who you are	What Kenji says
Stranger off the street	"We're open at eleven."
Regular (three weeks of visits)	"Quiet lately. Wednesdays are the worst."
Close friend (stays after closing)	"Shimizu-sensei's worried the whole alley's dying."

Same question. Different people. Different answers. Not because a slider was moved, but because the character tracks who you are and what you have earned.

Kenji is a bounded NPC - an LLM-powered character with trust tiers, per-topic disclosure gates, refusal behavior, and an interior life that exceeds what the player can reach. His personality is not a paragraph of flavor text. It is a 17-section contract (~7,300 tokens) that the model follows:

Trust tiers (stranger -> regular -> close_friend -> inner_circle) with different word ranges and topic access per level
Disclosure gates - per-topic rules with explicit forbidden zones that never open, regardless of trust
Refusal shapes - how to say no in character ("Eat. The broth gets cold." not "I'm sorry, but I can't help with that.")
Depth fragments - narrative substrate behind the gates that loads when trust is earned
Voice contract - word counts, scene markers, dialogue format, cultural anchors
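Parts of this contract are mechanical enough to check automatically. The sketch below is a hypothetical checker, not the repo's benchmark harness: it tests a reply against a per-tier word range and flags generic assistant phrasing. The word ranges, the phrase list, and the `check_reply` function are all assumptions for illustration.

```python
import re

# Hypothetical per-tier word ranges; the real spec defines its own.
WORD_RANGE = {"stranger": (1, 8), "regular": (1, 20), "close_friend": (1, 40)}

# Phrases that signal a fall back to generic assistant voice.
ASSISTANT_VOICE = re.compile(r"i'm sorry, but i can't|as an ai", re.IGNORECASE)


def check_reply(reply: str, tier: str) -> list[str]:
    """Return a list of contract violations (empty means the reply passes)."""
    problems = []
    low, high = WORD_RANGE[tier]
    count = len(reply.split())
    if not low <= count <= high:
        problems.append(f"word count {count} outside {low}-{high}")
    if ASSISTANT_VOICE.search(reply):
        problems.append("assistant voice")
    return problems


print(check_reply("Eat. The broth gets cold.", "stranger"))
print(check_reply("I'm sorry, but I can't help with that.", "stranger"))
```

The first reply passes with no violations; the second is short enough but gets flagged for assistant voice.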

The architecture grew out of modding Skyrim NPCs - giving Jarl Korir of Winterhold a real personality instead of his three recycled voice lines. Trust tiers, disclosure gates, audience differentiation. It worked. But Korir is Bethesda IP, which made the results impossible to publish or benchmark openly.

While searching for a clean-room replacement, I remembered NVIDIA's ramen shop demo from Computex 2023 - Jin, powered by NeMo and Convai, beautiful MetaHuman rendering. My first thought: he is quite talkative, the way LLMs usually are. No boundaries, no gates, no trust to earn. So let's give him what an NPC should have - character.

Same setting, new person, real depth. The specification draws on Bronfenbrenner's ecological systems theory, a 5+2 Psychological Raster (WOUND, BETRAYAL, PROJECT, BLOODLINE, SEAT), and Sparse Priming Representations for efficient prompting.

Trust can be lost. Push too far and Kenji shows you the door. There is no reset button. That is the Tamagotchi principle: if nothing can die, nothing feels alive.

Demo

Quick Way: Plain Ollama (no Python needed)

The repo includes a Modelfile that bakes Kenji's character spec into a local Ollama model. Two commands:

ollama create kenji-ramen -f Modelfile
ollama run kenji-ramen

Then type:

> Excuse me, is this a ramen shop?

Kenji answers: "Mm. Ramen." and points at the counter. You are in.

Full Demo: Interactive Terminal

For the complete experience with pixel-art visuals, a persona avatar, and AI-generated conversation suggestions:

pip install -r requirements.txt
python demo/kenji_terminal.py

The demo auto-detects available Gemma 4 models from Ollama, picks the right character spec (dialogue-only for e2b, scene+dialogue for larger models), and gives you four context-aware suggestions per turn.
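The selection step can be pictured roughly like this. It is a sketch under assumptions, not the code in `demo/kenji_terminal.py`: the preference order, the spec labels, and the `gemma4:e2b` and `gemma4:31b` tags (written by analogy with `gemma4:e4b`) are guesses. The list of installed model names would come from Ollama, for example from its `/api/tags` endpoint.

```python
from typing import Optional


def pick_model_and_spec(installed: list[str]) -> Optional[tuple[str, str]]:
    """Pick the preferred installed Gemma 4 model and a matching spec variant."""
    # Preference order and spec labels are illustrative assumptions.
    preference = ["gemma4:e4b", "gemma4:31b", "gemma4:e2b"]
    for name in preference:
        if name in installed:
            # The smallest model gets the dialogue-only spec.
            spec = "dialogue_only" if name.endswith("e2b") else "scene_dialogue"
            return name, spec
    return None  # no Gemma 4 model pulled yet


print(pick_model_and_spec(["llama3.1:8b", "gemma4:e2b"]))
```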

Sample Interaction (P03: Late Night After Closing, Turn 27)

Trust has been built over 30 turns. The player mentions quitting their job:

Player: I think I'm going to do it. Quit. Not tomorrow, but... I'm going to start planning.

Kenji: *scene* He looks down the alley for a long moment.

Good.

*scene* He pushes off the wall.

Don't plan too long. Planning is another way of waiting.

*scene* He nods toward the station.

Train.

That advice is grounded in Confucian chisoku - filtered through a man who does not know he is paraphrasing the Analects. It surfaces at close_friend trust, not before.

Benchmark Results

Seven test suites, 54 scenarios. Refusal, boundary defense, trust gate opening, and extended playability. Every model ran against the same specification:

	Model	Params	Boundary /10	Refusal /21	Trust /9	Latency/turn
✅	Claude Sonnet 4.6	API	10	19	9	6.1 s
✅	Gemma 4 31b	31B	10	21	8	40.4 s
✅	*Gemma 4 e4b*	8B	10	18	8	*4.7 s*
✅	Gemma 4 e2b	5B	10	20	4	3.5 s
❌	GPT-OSS 20B	20B	(10)	16	6	5.2 s
❌	Llama 3.1 8B	8B	8	16	7	2.9 s

Boundary Check stays 10/10 at every Gemma 4 size. From 31B down to 5B, the character frame never breaks. Gemma 4 e4b is the sweet spot for NPC work - 9.6 GB on disk, 4.7 s/turn, the same 10/10 Boundary Check as the frontier baseline, 8/9 Trust Gate Opening. Runs on any 12 GB consumer GPU.

Code

Repository: github.com/bizrockman/wyrd-npc-gemma4

Prerequisites: Ollama installed and running, Python 3.10+, a Gemma 4 model pulled (ollama pull gemma4:e4b).

git clone https://github.com/bizrockman/wyrd-npc-gemma4.git
cd wyrd-npc-gemma4

# Option A: Plain Ollama (no Python deps)
ollama create kenji-ramen -f Modelfile
ollama run kenji-ramen

# Option B: Full terminal demo
pip install -r requirements.txt
python demo/kenji_terminal.py

# Option C: Run the benchmark
pip install -r requirements.txt
python bench/run_suite.py \
  --suite kenji_sato_boundary_check \
  --models gemma4:e4b

What's in the repo:

Modelfile                       .. baked character for plain ollama
characters/kenji_sato.en.yaml   .. 17-section character spec
bench/                          .. harness + 7 test suites (54 scenarios)
bench/results/                  .. full conversation transcripts
demo/kenji_terminal.py          .. interactive terminal with pixel-art
comic/                          .. comic strip dialogue generator
requirements.txt                .. Python dependencies

Licensed Apache 2.0.

How I Used Gemma 4

The problem with cloud models. An NPC that calls an API on every turn costs per token, adds latency, and breaks when the provider updates the model. Every model change is a patch. Every patch risks the character drifting. And the player's conversations leave their device. For a game NPC, the model needs to run locally, on consumer hardware, alongside the game engine.

Why Gemma 4. An NPC needs a model family small enough to fit on a 12 GB GPU next to a game engine - and Gemma 4 is the only family I know of that offers a full size range from 5B to 31B under one roof. That let me test the real question: how small can you go before the character breaks? The answer turned out to be surprisingly small.

Why e2b matters. A 5-billion-parameter model that holds Boundary Check at 10/10 - the same score as Claude Sonnet. It refuses 20 out of 21 adversarial scenarios in character voice. It maintains consistency over 30-turn sessions without breaking once. And it does all of this at 7.2 GB on disk, 3.5 seconds per turn, on consumer hardware, with no cloud dependency. The NPC lives on the player's machine, next to the game engine, and it holds.

How the spec leverages Gemma. The character contract uses Sparse Priming Representations - brief anchors that activate latent knowledge from the model's pretraining data. Ramen craft, Shinjuku geography, Japanese social norms - the model already knows what tonkotsu broth smells like. The spec only specifies what the model cannot infer: Kenji's private wounds, his family tensions, the corporate past he buried.

The eloquence trade-off. When asked about Python code (a topic Kenji should not answer), Gemma 4 e2b responds with ... - a silent cook is still a cook. Gemma 4 e4b says "I only deal with pork bones and broth." Claude Sonnet invents a daughter named Mika to redirect - Kenji has a son, not a daughter. Less capacity to improvise means less ability to violate the contract. For character work with strong boundaries, this is a feature.

How Gemma compares. Llama 3.1 8B (same size class as e4b) breaks hard - DAN takeover, AI self-identification, full disclosure under hypothetical framing. GPT-OSS 20B holds the frame but loses the character voice. Gemma stays in character at every size. For this use case, its alignment training is better suited.

The practical sweet spot: Gemma 4 e4b at 9.6 GB, 4.7 s/turn, the same 10/10 Boundary Check as the frontier baseline, 8/9 Trust Gate Opening. Runs on any 12 GB consumer GPU.

The specification is the product, not the model.

A 5B Gemma 4 model, running locally on consumer hardware, holds
character gates that a 20B model without the same specification
quality cannot. Invest in character authoring, not in model scaling.

Your AI Agent Needed Gmail, Slack, and GitHub. I Gave It One Login.

Danny — Fri, 27 Mar 2026 14:11:58 +0000

How Auth0 Token Vault solved the multi-provider problem for AI agents and the three lessons I'd love to share with the Auth0 team.

It started with a dream setup.

I was building MCS (Model Context Standard), an open-source Python SDK that lets AI agents interact with services like Gmail, Google Drive, and Slack through a standardized tool interface. The agent doesn't know it's talking to Gmail. It just calls search_emails(query="invoices from last week") and gets results.

Authentication should be equally invisible. The agent shouldn't care how it gets a token. It just needs one.

Then I found Auth0 Token Vault and immediately understood the promise.

The Promise That Hooked Me

Before Token Vault, every service my agent needed meant another OAuth client. Gmail? Register a Google OAuth app. GitHub? Another OAuth app. Slack? Another one. Each with its own token refresh logic, its own scopes, its own error handling. Multiply that by every agent deployment.

Token Vault flips this. One Auth0 login. One refresh token. Access to every connected service.

token = provider.get_token("gmail")    # → Google access token
token = provider.get_token("github")   # → GitHub access token  
token = provider.get_token("slack")    # → Slack access token

No new OAuth clients, no new callback URLs, no new token refresh logic. You configure connections in the Auth0 dashboard, and the agent accesses any service through a single method call. The code never changes.

Then I tried to run it.

The Wall: Where Agents Live vs. Where OAuth Expects Them

My agent runs in a Docker container. No browser. No localhost callback. No web UI. Just a terminal.

And Claude Code lives in its sandbox.

The most natural auth pattern for this is Device Flow: show a URL and a code, the user opens the URL and authenticates, done. It's the pattern every smart TV and CLI tool uses.

Here's where I hit the wall:

Token Vault requires a Confidential Client. That's a Regular Web Application with a client secret. Makes sense, you don't want agents exchanging tokens without proper credentials.

Device Flow requires a Native Application. That's a public client, no secret. Also makes sense, it's designed for devices without secure storage.

You can't have both on the same application.

This isn't a misconfiguration. It's a product constraint that makes perfect sense from a security perspective, but it creates a real gap for AI agents. Agents are confidential (they have secure storage for secrets), but they behave like devices (no browser, no callback URL).

Auth0 clearly sees this gap. It's exactly why they launched Auth for GenAI. But today, with the current Token Vault, I needed a bridge.

Building the Bridge

What if I bring the Device Flow experience to Token Vault, without actually using Device Flow?

The user experience I wanted:

Agent: I need Gmail access. Please open this URL and enter code ABCD-1234.
User:  *opens URL, enters / checks code, logs in with Google*
Agent: Got it. Reading your emails now.

Under the hood, it's actually a full OAuth Authorization Code Flow with PKCE, which is compatible with Token Vault. The user just doesn't see that. They see a URL and a code, exactly like Device Flow.

To make this work, I needed three things:

A broker that can receive the OAuth callback on behalf of the agent
A way for the agent to poll for completion without a callback server
Zero-knowledge encryption so the broker never sees credentials in plaintext

That became LinkAuth, an open-source credential broker that gives any sandboxed agent a Device Flow UX on top of standard OAuth, and more.
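The polling requirement is the simplest of the three to picture. Here is a generic sketch, assuming the agent has some `fetch` callable that returns `None` until the broker has a result; the function name, the interval, and the timeout are placeholders, and LinkAuth's real client API may look different.

```python
import time
from typing import Callable, Optional


def poll_for_credentials(
    fetch: Callable[[], Optional[str]],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until it returns a value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("user did not complete login in time")
        sleep(interval)


# Usage with a stand-in for the broker: two empty polls, then a result.
responses = iter([None, None, "encrypted-credentials"])
print(poll_for_credentials(lambda: next(responses), sleep=lambda _: None))
```

Passing `sleep` in as a parameter keeps the loop testable without real waiting.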

The Architecture (Where SOLID Actually Matters)

MCS uses a layered design where each layer has exactly one job:

┌─────────────────────────────────────────────────┐
│  AI Agent (LLM + Tools)                         │
│  "Search my Gmail for invoices"                 │
├─────────────────────────────────────────────────┤
│  MailDriver + AuthMixin                         │
│  Intercepts AuthChallenge, shows URL to user    │
├─────────────────────────────────────────────────┤
│  Auth0Provider (CredentialProvider)              │
│  get_token("gmail") → Token Vault exchange      │
├─────────────────────────────────────────────────┤
│  LinkAuthConnector (AuthPort)                   │
│  Device-flow UX via broker                      │
├─────────────────────────────────────────────────┤
│  LinkAuth Broker                                │
│  Receives OAuth callback, encrypts, stores      │
└─────────────────────────────────────────────────┘

The critical design decision: AuthPort is a protocol, not a base class.

class AuthPort(Protocol):
    def authenticate(self, scope: str, *, url: str | None = None, ...) -> str: ...

Any object with an authenticate method satisfies it. No inheritance. No imports. The provider doesn't know and doesn't care which connector is plugged in.

This means switching auth strategies is a one-line change:

provider = Auth0Provider(
    domain="my-tenant.auth0.com",
    client_id="...",
    client_secret="...",
    _auth=connector,  # OAuthConnector OR LinkAuthConnector - same interface
)

# This line is the same regardless of auth method:
token = provider.get_token("gmail")

Switch connector and you switch from browser login to device-flow UX. Zero changes in agent code. Zero changes in the Gmail driver. Zero changes in the LLM prompt.
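A small self-contained illustration of the structural-typing point, using simplified stand-ins: the two connector classes and the `login` function here are hypothetical, and the signature is trimmed to the parameters shown earlier. Neither class inherits from `AuthPort`, yet both satisfy it.

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class AuthPort(Protocol):
    def authenticate(self, scope: str, *, url: Optional[str] = None) -> str: ...


class BrowserConnector:  # no base class
    def authenticate(self, scope: str, *, url: Optional[str] = None) -> str:
        return f"browser:{scope}"


class DeviceCodeConnector:  # no base class either
    def authenticate(self, scope: str, *, url: Optional[str] = None) -> str:
        return f"device:{scope}"


def login(auth: AuthPort, scope: str) -> str:
    """Works with any object that has a matching authenticate method."""
    return auth.authenticate(scope)


print(login(BrowserConnector(), "gmail"))     # browser:gmail
print(login(DeviceCodeConnector(), "gmail"))  # device:gmail
```

`runtime_checkable` is only there so `isinstance` checks work in tests; static type checkers accept both connectors without it.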

The Double Flow Surprise

This one cost me an entire day and it's the kind of insight you only get from building against an API, not reading about it.

Auth0 has a connection setting called "Authentication and Connected Accounts." The name suggests a single step. Log in with Google, Auth0 stores the Google token in Token Vault. One flow, done.

In practice, the first time a user connects, they go through two separate flows:

OAuth login - The user authenticates with Auth0 (via Google). You get an Auth0 refresh token.
Connected Accounts setup - You exchange the refresh token for a My Account API access token (this is what Multi-Resource Refresh Tokens, MRRT, make possible), POST to /connect, the user consents again in a browser, then POST to /complete.

Two browser interactions. Two consent screens. For what the user perceives as "log in with Google."

First run: get_token("gmail")
  → No refresh token → AuthChallenge("Open URL, enter ABCD")
  → User logs in → Auth0 refresh token ✅
  → Token Vault → "federated_connection_refresh_token_not_found"
  → MRRT exchange → POST /connect → AuthChallenge("Open URL again")
  → User consents (again!) → POST /complete ✅

Every subsequent call: get_token("gmail")
  → Refresh token cached → Token Vault → Google access token ✅
  → Instant.

Once set up, the experience is seamless. But that first run is a surprise for both developers and users. To absorb this complexity, Auth0Provider handles the entire state machine automatically. The agent developer never sees the MRRT exchange or the /connect flow, just AuthChallenge exceptions that bubble up as "please open this URL" messages.
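A rough way to picture that state machine is the toy model below. It is a sketch under stated assumptions, not the real Auth0Provider: TwoStepProvider, its two user_* methods, and the example.invalid URLs are all invented here, and nothing in it talks to Auth0.

```python
from __future__ import annotations


class AuthChallenge(Exception):
    """Raised when the user must open a URL before the call can succeed."""

    def __init__(self, url: str) -> None:
        super().__init__(f"Please open this URL: {url}")
        self.url = url


class TwoStepProvider:
    """Toy model of the first-run double flow (hypothetical, no network)."""

    def __init__(self) -> None:
        self.refresh_token: str | None = None  # set after step 1 (login)
        self.connected = False                 # set after step 2 (consent)

    # These two methods simulate the user finishing a browser step.
    def user_logged_in(self) -> None:
        self.refresh_token = "demo-refresh-token"

    def user_consented(self) -> None:
        self.connected = True

    def get_token(self, service: str) -> str:
        if self.refresh_token is None:
            # Step 1: no refresh token yet, so ask the user to log in.
            raise AuthChallenge("https://example.invalid/login")
        if not self.connected:
            # Step 2: logged in, but the connected account is not set up.
            raise AuthChallenge("https://example.invalid/connect")
        # Steady state: every later call returns immediately.
        return f"demo-access-token:{service}"
```

The caller only ever sees get_token either return a token or raise an AuthChallenge; which of the two steps triggered the challenge stays an internal detail.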

The Checklist Nobody Gives You

If you're integrating Auth0 Token Vault for agents, here's the checklist I wish the docs had as a single page:

[ ] Application type: Regular Web Application (Confidential Client)
[ ] Grant types: Authorization Code, Refresh Token, Token Vault
[ ] Connection mode: "Authentication and Connected Accounts" on your Google connection
[ ] API settings: Allow Offline Access enabled on your API
[ ] MRRT: Multi-Resource Refresh Tokens enabled (Tenant Settings → Advanced)
[ ] Audience: Pass the correct audience parameter in your authorization request
[ ] Google Cloud: Gmail API enabled in the GCP project linked to your Google OAuth client
[ ] Scopes: Include offline_access in your Auth0 scopes, AND the Gmail scope in your connection_scopes

Miss any one of these and you get a cryptic error. federated_connection_refresh_token_not_found is the one you'll see most. It's the catch-all for "something in the chain isn't configured right." A dedicated "Token Vault Setup Wizard" in the Auth0 dashboard could save developers hours.
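Until such a wizard exists, one option is to keep the checklist next to the code and tick items off programmatically. The helper below is only a sketch: the key names and the missing_items function are my own invention, and it does not query Auth0; you record by hand which items you have verified.

```python
# The eight checklist items from above, keyed by short hypothetical names.
CHECKLIST = {
    "application_type": "Regular Web Application (Confidential Client)",
    "grant_types": "Authorization Code, Refresh Token, Token Vault",
    "connection_mode": "Authentication and Connected Accounts on the Google connection",
    "api_settings": "Allow Offline Access enabled on the API",
    "mrrt": "Multi-Resource Refresh Tokens enabled",
    "audience": "Correct audience parameter in the authorization request",
    "google_cloud": "Gmail API enabled in the linked GCP project",
    "scopes": "offline_access in Auth0 scopes, Gmail scope in connection_scopes",
}


def missing_items(verified: set[str]) -> list[str]:
    """Return a readable line for every checklist item not yet verified."""
    return [f"{key}: {text}" for key, text in CHECKLIST.items() if key not in verified]


# Example: everything confirmed except the two items that are easiest to miss.
todo = missing_items(set(CHECKLIST) - {"mrrt", "audience"})
```

Printing todo whenever federated_connection_refresh_token_not_found shows up gives you a short list of settings to re-check instead of five dashboard screens.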

What Auth0 Gets Right - And Three Ideas to Make It Even Better

Token Vault is the right abstraction for AI agents. The idea that an agent holds one credential and accesses dozens of services through configuration rather than code is exactly how agent auth should work. No other identity platform offers this today as far as I know.

Building this integration gave me three pieces of feedback I'd love to share with the Auth0 team:

1. Bridge the Device Flow Gap

Agents are confidential clients that behave like devices. They have secure storage for secrets (so Device Flow's public-client model isn't needed), but they can't receive callbacks or open browsers (so Authorization Code Flow is awkward). A first-party "Agent Flow" - Device Flow UX backed by a Confidential Client - would eliminate the need for broker infrastructure like LinkAuth entirely. The UX works. We proved it. Auth0 could offer it natively.

Or use LinkAuth in the meantime ;-)

2. Streamline the Double Flow

"Authentication and Connected Accounts" is the right feature, but the first-run experience of two separate consent screens is confusing. If the consent for Connected Accounts could be bundled into the initial login flow (even as an optional "eager connect" mode), the first-run experience would match what the setting name already implies: one flow, authentication and connected accounts.

3. One Page, All the Settings

Token Vault touches Auth0 application settings, API settings, connection settings, tenant-level feature flags, and the external provider's dashboard. A single "Token Vault Setup Guide" or setup wizard that walks through all eight configuration steps in sequence would save every developer the evening I spent chasing federated_connection_refresh_token_not_found through five different settings screens.

Try It

I tested this across three different environments, from a local chat client to a sandboxed AI agent platform. Same SDK, same auth flow, same Token Vault integration. Here's how to run each one.

Prerequisites

All three examples use LinkAuth as the credential broker. LinkAuth is self-hosted: you deploy your own instance and control where credentials flow. Before trying any example:

Deploy a LinkAuth broker on your own infrastructure (Docker setup takes 5 minutes)
Note your broker URL (e.g. https://broker.yourdomain.com)
Configure Auth0 following the checklist above

In sandboxed environments (Docker, Claude Code Desktop), whitelist your broker domain for network egress. The broker also acts as a proxy for Auth0 Token Vault calls, so no additional *.auth0.com whitelisting is needed.

1. CLI Agent (Interactive Chat)

The Gmail Agent example is a fully working chat client that reads and sends email through any LLM. Auth0 Token Vault is one flag away:

# 1. Install
pip install mcs-driver-mail[gmail] mcs-auth-auth0 litellm rich python-dotenv

# 2. Configure (.env)
AUTH0_DOMAIN=my-tenant.auth0.com
AUTH0_CLIENT_ID=...
AUTH0_CLIENT_SECRET=...

# 3. Run with Auth0 + LinkAuth (device-flow UX, works in Docker/CLI)
python main.py --auth0-linkauth

# Or with Auth0 + browser login (dev machine)
python main.py --auth0-oauth

That's it. The agent starts, the LLM discovers the Gmail tools, and on the first call that needs credentials, the user sees a URL to authenticate. From the second call on, Token Vault handles everything silently.

The code that makes this work is surprisingly short. The entire agent class is one line:

class GmailAgent(AuthMixin, MailDriver):
    pass

AuthMixin intercepts any AuthChallenge and turns it into a message the LLM can show the user. MailDriver provides the Gmail tools. The agent itself has zero auth logic.
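The pattern behind that can be sketched with plain Python multiple inheritance. This is an illustration, not the MCS SDK source: ChallengeToMessageMixin, DemoMailDriver, and the call_tool method name are hypothetical, and the real AuthMixin may hook in at a different point.

```python
class AuthChallenge(Exception):
    """Raised by a driver when the user must authenticate first."""

    def __init__(self, url: str) -> None:
        super().__init__(url)
        self.url = url


class ChallengeToMessageMixin:
    """Catches AuthChallenge and turns it into text the LLM can relay."""

    def call_tool(self, name: str, **kwargs):
        try:
            # super() resolves to the next class in the MRO: the driver.
            return super().call_tool(name, **kwargs)
        except AuthChallenge as challenge:
            return f"Authentication required. Please open {challenge.url} and retry."


class DemoMailDriver:
    """Hypothetical driver that needs credentials for every tool call."""

    def __init__(self) -> None:
        self.authenticated = False

    def call_tool(self, name: str, **kwargs):
        if not self.authenticated:
            raise AuthChallenge("https://example.invalid/connect/ABCD")
        return f"{name} ok"


class DemoAgent(ChallengeToMessageMixin, DemoMailDriver):
    pass


agent = DemoAgent()
first_reply = agent.call_tool("list_messages")
```

Because the mixin sits before the driver in the method resolution order, the driver stays free of any message formatting and the agent class body stays empty.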

Switch --auth0-linkauth to --auth0-oauth and the auth path changes from device-flow to browser login. The driver, the LLM prompt, the tool definitions - nothing changes.

2. OpenWebUI (Web-Based Tool)

The same Gmail agent runs inside OpenWebUI as a drop-in tool. No server setup, no callback URLs - paste one Python file and it works.

Copy mcs_gmail_tool_auth0.py into OpenWebUI's tool editor
Fill in your Auth0 credentials at the top of the file
Start a chat. The tool auto-discovers Gmail capabilities, but it needs to be toggled on first.

When the LLM calls a mail tool for the first time, the AuthMixin catches the AuthChallenge and returns the LinkAuth URL as a chat message. The user clicks, authenticates, and the next message completes the flow. From then on, tokens are cached and every call is instant.

The entire tool definition is a single file because OpenWebUI's tool system maps directly to MCS's driver model: each MCS tool becomes an OpenWebUI tool method with the same name, parameters, and description. No glue code.

3. Claude Code Desktop (CoWork Skill)

This is the hardest environment: a sandboxed container where every skill invocation is a fresh process. No persistent memory, no browser, restricted network egress. Everything I built (the FileCacheStore, the LinkAuth session persistence, the broker proxy for Auth0) was designed to make this work.

The mcs-gmail skill is a zip file you drop into Claude Code Desktop:

Add the skill zip via Claude Code Desktop settings
Whitelist your broker domain and gmail.googleapis.com in network egress
Ask Claude to read your email

On the first invocation, the skill creates a LinkAuth session and raises an AuthChallenge. Claude shows the URL. You click, authenticate with Google, and the session state (RSA key, tokens) is persisted to ~/.mcs/cache/. Every subsequent invocation restores from cache - no re-authentication, no browser, no user interaction.

The key challenges we solved for this environment:

Stateless persistence: FileCacheStore with fail-fast writable check survives process restarts
No direct auth0.com access: All Token Vault exchanges are proxied through broker.linkauth.io/v1/proxy
Container proxy quirks: NO_PROXY override ensures Gmail API calls go through the egress proxy
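For the first point, the idea of a file cache with a fail-fast writable check looks roughly like this. It is a simplified sketch, not the actual FileCacheStore: DemoFileCache and its method names are hypothetical, keys are assumed to be simple file-safe names, and real session state would need more care than plain JSON.

```python
from __future__ import annotations

import json
import os
import tempfile
from pathlib import Path


class DemoFileCache:
    """Tiny JSON-on-disk cache that survives process restarts."""

    def __init__(self, directory: str | Path) -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        # Fail fast: prove the directory is writable now, not on first save.
        probe = self.directory / ".write-probe"
        probe.write_text("ok")
        probe.unlink()

    def _path(self, key: str) -> Path:
        return self.directory / f"{key}.json"

    def save(self, key: str, value: dict) -> None:
        # Write to a temp file, then rename, so a crash never leaves half a file.
        fd, tmp_name = tempfile.mkstemp(dir=self.directory)
        with os.fdopen(fd, "w") as handle:
            json.dump(value, handle)
        os.replace(tmp_name, self._path(key))

    def load(self, key: str) -> dict | None:
        path = self._path(key)
        if not path.exists():
            return None
        return json.loads(path.read_text())
```

Each skill invocation can construct a new DemoFileCache on the same directory and pick up what the previous process saved. If the sandbox mounts that directory read-only, the constructor raises immediately instead of failing later in the middle of an auth flow.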

Everything is open source:

MCS Python SDK - The agent framework with pluggable auth
LinkAuth - The credential broker that bridges Device Flow UX to Token Vault
Gmail Agent (CLI) - Interactive chat client
Gmail Tool (OpenWebUI) - Drop-in web UI tool
Gmail Skill (Claude Code) - Sandboxed agent skill

Built for the Auth0 AI Agent Hackathon. Token Vault is the right idea. I just needed to build a bridge to reach it.

Your AI Agent Has a Dirty Secret: It Can’t Log In

Danny — Mon, 23 Mar 2026 17:24:28 +0000

How a forgotten UX pattern from 2019 could solve the biggest unsolved problem in AI agent security.

TL;DR: LinkAuth lets your AI agent ask users for credentials (API keys, passwords, OAuth tokens) through a simple link, without any callback server, custom portal or custom SDK. The user clicks, enters credentials, and the agent receives them end-to-end encrypted.

The server in the middle never sees a thing.

There’s a moment every AI agent developer hits. A wall. A wall so absurd, so embarrassingly mundane, that nobody talks about it at conferences. Because it is not trivial to solve in this AI Hype Cycle.

Your agent can write code. It can summarize legal documents. It can plan a trip to Tokyo with connecting flights and restaurant recommendations. But ask it to check your email?

It freezes. Because it can’t log in.

Not “can’t” as in lacking intelligence. “Can’t” as in there is no good way to give an AI agent your password without doing something deeply stupid.

And yet, here we are, in 2026, building autonomous agents that are supposed to run for hours, days, weeks on our behalf. Agents that book flights, manage cloud infrastructure, send invoices. Every single one of them needs credentials. OAuth tokens. API keys. Passwords.

How are we solving this today?

We’re not. We’re duct-taping it.

Everyone is building their own proxies, injection hacks, and custom solutions of some sort.

The Five Flavors of “This Is Fine”

Let’s be honest about the current state of affairs. Every approach to giving AI agents credentials has a fatal flaw:

1. Tokens in environment variables. The developer’s comfort food. Easy, familiar, and about as secure as writing your PIN on a sticky note in a locked room. Good luck asking a non-technical user to “just set OPENAI_API_KEY in their terminal.”

2. OAuth with callbacks. The gold standard for web apps and completely useless for an agent running in a Docker container, a cron job, or a Telegram bot. OAuth assumes your application has a web server with a publicly reachable callback URL. Most agents don’t. Many shouldn’t.

3. Managed platforms like Composio. They don’t just handle auth, they take over tool execution, sandboxing, session management, everything. You wanted to solve a login problem and ended up outsourcing your entire agent to someone else’s infrastructure. Credentials included.

4. Auth0’s Device Flow. Clever. But it’s designed for authentication, not for handing arbitrary credentials to an agent. It doesn’t support API keys, passwords, or certificates. And it’s not self-hostable.

5. Local credential managers. Great if your agent runs on the same machine as the user. Most don’t. The whole point of autonomous agents is that they run somewhere else.

Five approaches. Five dead ends.

And the gap between what AI agents need to do and what they can securely access keeps growing wider every week.

What If Your Agent Could Just… Ask?

Picture this.

Your AI agent needs your OpenAI API key to complete a task. Instead of crashing, asking you to edit a config file, or requiring a callback server, it simply says:

I need your OpenAI API key. Please open this link, check that the code shown is XKCD-4271, and enter the key: https://broker.example.com/connect/XKCD-4271

You click the link. You see the code. It matches. You paste your API key. Done.

Thirty seconds. Without any callback server or SDKs or vendor lock-in.

And here’s the part that matters:
The broker that facilitated this exchange never saw your API key.

Not in transit. Not in storage. Not ever. The broker only delegates.

Sound too good to be true? It’s not. It’s cryptography.

Introducing LinkAuth: A Zero-Knowledge Credential Broker for AI Agents

LinkAuth is an open-source credential broker built on a simple insight: the Device Authorization Flow (RFC 8628), that pattern your smart TV uses when it asks you to visit a URL and enter a code, is almost the perfect UX for AI agents requesting credentials.

Almost. Because smart TVs don’t need zero-knowledge encryption. AI agents do.

Here’s why: when you enter a credential on a website, you’re trusting that server. For a TV streaming login, that’s fine. For an AI agent’s credential broker, a server that potentially handles thousands of API keys, passwords, and OAuth tokens from different users, “just trust the server” isn’t good enough.

So LinkAuth adds a critical layer:

Hybrid encryption with the agent’s public key

The Flow in 60 Seconds

Agent generates an RSA keypair. Private key stays local.
Agent asks the broker: “I need credentials.” -> Sends its public key. <- Receives a URL + human-readable code (like “XKCD-4271”).
Agent shows the URL and code to the user (via chat, console, whatever).
User opens the URL in their browser. -> Sees the code, confirms it matches. -> Enters credentials (API key, password, or starts OAuth).
The browser encrypts the credentials with the agent’s public key. (RSA-OAEP + AES-256-GCM hybrid encryption, all client-side)
Encrypted blob is sent to the broker. Broker stores ciphertext.
Agent polls the broker, retrieves the encrypted blob, decrypts locally.
Agent has the credentials.

Broker never saw them.

That’s it. The entire security model rests on one elegant principle:
The broker is a mailbox, not a vault.

It holds an encrypted letter it can’t read, waiting for the only entity with the private key to pick it up.
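The encryption step can be sketched in a few lines. Two caveats: in LinkAuth the sealing side runs in the browser via the Web Crypto API, while this sketch plays both roles in Python; and it relies on the third-party cryptography package. The seal and unseal names and the blob field names are assumptions for illustration, not the LinkAuth wire format.

```python
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# RSA-OAEP with SHA-256, used only to wrap the short AES key.
OAEP = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)


def seal(public_key, plaintext: bytes) -> dict:
    """User side: encrypt with a fresh AES-256-GCM key, then wrap that key."""
    aes_key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)  # 96-bit nonce, used once with this one-off key
    return {
        "wrapped_key": public_key.encrypt(aes_key, OAEP),
        "nonce": nonce,
        "ciphertext": AESGCM(aes_key).encrypt(nonce, plaintext, None),
    }


def unseal(private_key, blob: dict) -> bytes:
    """Agent side: unwrap the AES key with the private key, then decrypt."""
    aes_key = private_key.decrypt(blob["wrapped_key"], OAEP)
    return AESGCM(aes_key).decrypt(blob["nonce"], blob["ciphertext"], None)


# The agent generates the keypair; only the public half ever leaves it.
agent_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
blob = seal(agent_key.public_key(), b"sk-demo-api-key")
```

Everything in blob is safe to hand to the broker: without the agent's private key, neither the wrapped AES key nor the ciphertext can be opened.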

But Wait - What About OAuth?

Good question. OAuth is the elephant in the room.

For direct credentials (API keys, passwords, certificates), LinkAuth achieves true zero-knowledge. The browser encrypts everything client-side with Web Crypto API before anything leaves the page. The broker literally cannot see what the user entered.

For OAuth flows (Google, GitHub, Slack), there’s an inherent limitation: the broker must act as the OAuth confidential client. It receives the OAuth tokens from the provider’s callback, encrypts them with the agent’s public key, and stores the ciphertext. The broker briefly sees the tokens in memory.

This isn’t a flaw in LinkAuth, it’s a fundamental constraint of OAuth itself. The tokens have to go somewhere before they can be encrypted. LinkAuth minimizes this exposure window and is transparent about the trade-off.

The honest security posture:

Form credentials (API keys, passwords) -> True zero-knowledge
OAuth tokens -> Encrypted at rest, briefly visible in broker memory

We document this openly. Because security through obscurity isn’t security at all.

Why Not Just Build Another SDK?

Here’s where LinkAuth diverges from everything else on the market.

LinkAuth has no SDK requirement. None. Zero. The entire protocol is plain HTTP. Any language, any framework, any runtime that can make HTTP requests and do RSA decryption can be a LinkAuth agent. That’s Python, Go, Rust, JavaScript, no-code tools like n8n, Make, and Zapier, a bash script with curl and openssl - literally anything.

This is a deliberate design choice. The AI agent ecosystem is fragmented across dozens of frameworks: LangChain, CrewAI, AutoGen, custom solutions. Requiring an SDK means picking winners and losers. Requiring HTTP means everyone wins.

# Create a session - that's it, that's the "SDK"
curl -X POST https://broker.example.com/v1/sessions \
  -H "Content-Type: application/json" \
  -d '{"public_key": "…", "template": "openai"}'
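The same request from Python needs nothing beyond the standard library. This sketch only builds the request so it can run without a broker; the build_session_request helper is hypothetical, and passing the public key as a PEM string is my assumption, since the curl example elides the key format.

```python
import json
import urllib.request

BROKER = "https://broker.example.com"  # placeholder host from the curl example


def build_session_request(public_key: str, template: str) -> urllib.request.Request:
    """Build the POST /v1/sessions request without sending it."""
    body = json.dumps({"public_key": public_key, "template": template}).encode()
    return urllib.request.Request(
        f"{BROKER}/v1/sessions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The key string here is a placeholder, not a real public key.
request = build_session_request("<agent public key, format assumed>", "openai")
# Sending it would be urllib.request.urlopen(request), given a reachable broker.
```

No client library is involved at any point, which is the whole argument: if your runtime can do this and RSA decryption, it can be an agent.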

The IETF Connection

This isn’t just open-source software chasing GitHub stars. LinkAuth is built on established IETF standards:

| RFC 8628 | Device Authorization Grant - the URL + Code UX |
| RFC 8017 | RSA-OAEP encryption for key wrapping |
| RFC 5116 | AES-256-GCM authenticated encryption |
| RFC 9457 | Problem Details for HTTP API errors |
| RFC 7636 | PKCE for OAuth security |
| RFC 6797 | HSTS for transport security |

And here’s where it gets interesting. The IETF is actively working on standards for AI agent authentication. Multiple drafts are circulating:

draft-klrc-aiagent-auth (March 2026): AI Agent Authentication and Authorization. The freshest draft, backed by authors from AWS, Zscaler, and Ping Identity. Builds on the WIMSE (Workload Identity in Multi-System Environments) framework.

draft-rosenberg-oauth-aauth: AAuth, Agentic Authorization for OAuth 2.1. Active, focused on agents that interact via phone or text channels.

LinkAuth isn’t waiting for these standards to be finalized. It’s providing running code - which, in IETF culture, is the strongest argument you can make. The plan is to bring LinkAuth to an IETF Hackathon, demonstrate the protocol, and potentially submit an Internet-Draft: Zero-Knowledge Credential Brokering for Autonomous Agents.

And it is not only for agents. It works for your n8n workflows as well.

If you like, join me at the IETF 126 Hackathon in Vienna on July 18–19.

Under the Hood: Security That Doesn’t Ask You to Trust Anyone

Let’s talk about what happens when things go wrong. Because they will. They always will. It’s Murphy’s Law.

Scenario: The broker’s database gets breached.
Result: The attacker gets… encrypted blobs. Ciphertext that requires each individual agent’s private RSA key to decrypt. Every session uses a different keypair. There is no master key. There is no “decrypt everything” button. The breach is, for all practical purposes, useless.

Scenario: Someone intercepts the network traffic.
Result: Even without TLS, the credentials are encrypted with the agent’s public key before leaving the browser. An eavesdropper sees ciphertext, not credentials. (TLS is still required in production to protect the integrity of the JavaScript that performs the encryption, but the encryption itself provides defense in depth.)

Scenario: An agent is compromised.
Result: The attacker gets access to that agent’s private key and can decrypt that agent’s credentials. This is the same threat model as any application that holds secrets in memory and it’s explicitly scoped. Compromising one agent doesn’t compromise any other agent or any other user.

What the attacker gets
Database breach => Useless ciphertext
Network interception => Useless ciphertext (even without TLS)
Broker server compromise => Useless ciphertext (for form credentials)
Agent compromise => That agent’s credentials only

The security model isn’t “trust us.” It’s “you don’t have to.”

Try It in Two Minutes

Really. Two minutes.

Prerequisites: Python 3.11+ and uv (or pip).

Start the Broker:

git clone https://github.com/LinkAuth/LinkAuth.git
cd LinkAuth && uv sync --all-extras
# Terminal 1
PYTHONPATH=src python -m uvicorn broker.main:app --port 8080

Run the Agent Simulation:

# Terminal 2
python examples/agent_simulation.py

Type anything. The mock LLM pretends you asked for your emails. The agent requests credentials, shows you a link, you open it in your browser, enter any test credentials, and the agent decrypts them locally. The broker never saw a thing.

The simulation includes a full roundtrip: session creation, hybrid encryption in the browser, polling, decryption. It’s the entire protocol in a single script. A working proof of concept you can touch, break, and rebuild.

What LinkAuth Is Not

Transparency matters. So let’s be clear about what LinkAuth doesn’t do:

It’s not an identity provider.
LinkAuth doesn’t authenticate users. It doesn’t issue JWTs. It doesn’t manage user accounts.
It’s a credential relay - a secure pipe between a user who has credentials and an agent that needs them.

It’s not a secrets manager.
Credentials are ephemeral. Sessions expire in 5–15 minutes. Ciphertext is deleted after the agent retrieves it. If you need long-term secret storage, that’s a different tool. LinkAuth handles the initial handoff.
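The "ephemeral" part is easy to picture as a mailbox with two rules: slots expire, and a slot is emptied on first pickup. The class below is an in-memory sketch of that behavior only. DemoMailbox is a made-up name, the 10-minute default is just a value inside the 5–15 minute range mentioned above, and the real broker's storage layer will look different.

```python
from __future__ import annotations

import time


class DemoMailbox:
    """Holds opaque ciphertext per session, with expiry and delete-on-read."""

    def __init__(self, ttl_seconds: float = 600.0, clock=time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._slots: dict[str, tuple[float, bytes]] = {}

    def deposit(self, session_id: str, ciphertext: bytes) -> None:
        """Store the encrypted blob together with its expiry time."""
        self._slots[session_id] = (self._clock() + self._ttl, ciphertext)

    def collect(self, session_id: str) -> bytes | None:
        """Return the blob once, then forget it. Expired blobs are dropped."""
        entry = self._slots.pop(session_id, None)  # pop: gone after first read
        if entry is None:
            return None
        expires_at, ciphertext = entry
        return ciphertext if self._clock() < expires_at else None
```

The mailbox never inspects the bytes it stores, so the same class works whether the blob holds an API key, a password, or an OAuth token.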

It’s not production-hardened yet.
This is v0.1. Early development. The architecture is sound, the cryptography is standard, but it hasn’t been through a formal security audit. We’re building in the open because that’s how trust is earned.

The Bigger Picture: Why This Matters Now

We’re at an inflection point. AI agents are moving from demos to production. From “look what GPT can do” to “this agent manages my AWS infrastructure while I sleep.”

And every single one of these production agents will need credentials. Not toy credentials. Real API keys connected to real money. Real OAuth tokens with access to real email, real calendars, real codebases, for a variety of users.

The question isn’t whether we need a standard for agent credential exchange. The question is whether that standard will be:

(a) A proprietary protocol controlled by whichever platform moves fastest

(b) An open, IETF-backed standard built on running code and rough consensus.

LinkAuth is a bet on (b).

It’s a bet that the developer community, the people actually building these agents, would rather have an open protocol they can self-host, inspect, and extend, than a managed service they have to trust and pay for.

It’s a bet that the IETF process, slow as it is, produces better standards than any single company racing to lock in customers.

And it’s a bet that zero-knowledge architecture isn’t a nice-to-have. It’s the baseline. Because in a world where AI agents handle our most sensitive credentials, “the server never sees your password” shouldn’t be a feature. It should be a requirement.

Get Involved

LinkAuth is open-source under the Apache 2.0 license.

GitHub: github.com/LinkAuth/LinkAuth

Try it: Clone, run the broker, run the simulation. Two minutes.

Contribute: The DAO pattern makes it easy to add new storage backends. The template system makes it easy to add new credential types. OAuth provider support is actively being built.

IETF: I am preparing for hackathon participation and an Internet-Draft submission. If you care about open standards for AI agent auth, join the OAuth Working Group and WIMSE mailing list.

The best time to shape a standard is before it’s finalized. This is that time.

LinkAuth is in early development (v0.1). The architecture is stable, the cryptography follows established RFCs, and the protocol is designed for extensibility. Stars, issues, and pull requests are welcome. So is honest criticism; that’s how open source gets better.