<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Thiago V.</title>
    <description>The latest articles on Forem by Thiago V. (@tverney_77).</description>
    <link>https://forem.com/tverney_77</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3852303%2Fc28fe7f4-a933-407f-8c6b-742b24f97742.jpeg</url>
      <title>Forem: Thiago V.</title>
      <link>https://forem.com/tverney_77</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tverney_77"/>
    <language>en</language>
    <item>
      <title>Kiro Forgets Everything Every Session. So I've Built It a Memory.</title>
      <dc:creator>Thiago V.</dc:creator>
      <pubDate>Mon, 27 Apr 2026 00:29:07 +0000</pubDate>
      <link>https://forem.com/tverney_77/kiro-forgets-everything-every-session-so-ive-built-it-a-memory-1e86</link>
      <guid>https://forem.com/tverney_77/kiro-forgets-everything-every-session-so-ive-built-it-a-memory-1e86</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally posted on &lt;a href="https://builder.aws.com/content/3CuvRVmbvmB0YYX2ekY1BCcFRJ5/give-kiro-cli-memory-that-survives-sessions" rel="noopener noreferrer"&gt;AWS Builder&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I work with an AI every day. It's smart. It writes decent code. And every single morning, it forgets who I am.&lt;/p&gt;

&lt;p&gt;I open &lt;code&gt;kiro-cli chat&lt;/code&gt;, and the first 10 minutes are the same tax I paid yesterday:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Yes, we use pnpm. No, not npm. Yes, Vitest. No, not Jest. The main entry is &lt;code&gt;src/cli.ts&lt;/code&gt;. We already decided to use &lt;code&gt;Result&amp;lt;T, E&amp;gt;&lt;/code&gt; at the CLI boundary. You told me that last week. I told you that last week. We had this exact conversation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My teammate calls it the &lt;strong&gt;project re-discovery tax&lt;/strong&gt;. Every session, you pay it. Every. Session.&lt;/p&gt;

&lt;p&gt;I got tired of paying it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the obvious fixes didn't work
&lt;/h2&gt;

&lt;p&gt;I tried the obvious things first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Just use steering files."&lt;/strong&gt; Steering files are great for &lt;em&gt;what is this project&lt;/em&gt;. They're static markdown you maintain by hand. They don't capture &lt;em&gt;what the AI figured out during a session&lt;/em&gt;. The whole point of working with an AI is that it learns things with you. Steering files can't capture that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Tell the agent to call a &lt;code&gt;remember()&lt;/code&gt; tool when it learns something."&lt;/strong&gt; I tried this. Claude is inconsistent about when to call it. GPT is inconsistent. Kiro is inconsistent. Every model is inconsistent, because memory management is a side-quest to whatever task you're actually doing. The agent forgets to remember. Turtles all the way down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Use a SQLite knowledge graph MCP server."&lt;/strong&gt; Same problem. Fancier storage, same failure mode. The agent still has to decide when to store.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Wait for Kiro to ship it."&lt;/strong&gt; There's a proposal floating around for &lt;code&gt;.kiro/tasks/*.md&lt;/code&gt; with auto-read/auto-write. No ETA. I had work to do this week.&lt;/p&gt;




&lt;h2&gt;
  
  
  The insight that actually fixed it
&lt;/h2&gt;

&lt;p&gt;Here's what clicked for me, and I'll give credit where it's due — it came from a design doc by a coworker; I just productized it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agent should be a reader of memory, not a writer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Writing memory is a &lt;em&gt;different job&lt;/em&gt; from using memory. They should not share a context window. The writer can be slow, deliberate, even expensive. The reader needs to be fast, cheap, and running on every session start.&lt;/p&gt;

&lt;p&gt;So I split them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────┐     MCP/stdio     ┌────────────────────┐     filesystem      ┌────────────────────────┐
│   Kiro CLI   │ ◄───────────────► │ mcp-agent-memory   │ ◄─────────────────► │ agent-memory-daemon    │
│ (the reader) │                   │  (MCP server)      │   ~/.agent-memory/   │   (the writer)         │
└──────────────┘                   └────────────────────┘                     └────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Kiro reads.&lt;/strong&gt; The MCP server gives it three tools: &lt;code&gt;memory_read&lt;/code&gt;, &lt;code&gt;memory_append_session&lt;/code&gt;, &lt;code&gt;memory_search&lt;/code&gt;. That's it. Nothing fancy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A background daemon writes.&lt;/strong&gt; It watches the sessions directory, reads session summaries on a cadence, runs them through an LLM to extract durable facts, and updates markdown files in &lt;code&gt;~/.agent-memory/memory/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;They never talk directly. The filesystem is the contract. &lt;code&gt;~/.agent-memory/&lt;/code&gt; is all they share.&lt;/p&gt;

&lt;p&gt;Kiro burns zero tokens on memory management. The heavy lifting happens async, outside the chat.&lt;/p&gt;
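
&lt;p&gt;To make the split concrete, here's a rough sketch of the writer's loop. It isn't the daemon's actual source, just the shape of the job: the &lt;code&gt;~/.agent-memory/sessions/&lt;/code&gt; path and &lt;code&gt;extract_facts()&lt;/code&gt; are stand-ins.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the writer side (illustrative, not the daemon's real code).
# Assumed layout: sessions in ~/.agent-memory/sessions/, memory in ~/.agent-memory/memory/.
import time
from pathlib import Path

BASE = Path.home() / ".agent-memory"
SESSIONS = BASE / "sessions"
MEMORY = BASE / "memory"

def extract_facts(session_text):
    """Placeholder for the LLM pass that pulls durable facts out of a session."""
    raise NotImplementedError

seen = {}  # session filename mapped to the mtime we last processed

while True:
    for session in SESSIONS.glob("*.md"):
        mtime = session.stat().st_mtime
        if seen.get(session.name) == mtime:
            continue  # nothing new since the last pass
        facts = extract_facts(session.read_text())
        # The real daemon dedupes and merges; this sketch just writes one file.
        (MEMORY / "project-preferences.md").write_text(facts)
        seen[session.name] = mtime
    time.sleep(300)  # slow and deliberate, entirely outside the chat context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;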




&lt;h2&gt;
  
  
  What it looks like now
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Monday:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kiro-cli chat
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; We use pnpm. Never suggest npm. Vitest not Jest. Main entry is src/cli.ts.
  I prefer explicit &lt;span class="k"&gt;return &lt;/span&gt;types.

&lt;span class="o"&gt;[&lt;/span&gt;Kiro does work &lt;span class="k"&gt;for &lt;/span&gt;20 minutes]

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Great, call memory_append_session with a summary of what we agreed on.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terminal closes. Life moves on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tuesday:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kiro-cli chat
&lt;span class="o"&gt;[&lt;/span&gt;Kiro automatically calls memory_read per my steering rule]

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; I see we use pnpm, Vitest &lt;span class="o"&gt;(&lt;/span&gt;not Jest&lt;span class="o"&gt;)&lt;/span&gt;, src/cli.ts as the main entry, and
  you prefer explicit &lt;span class="k"&gt;return &lt;/span&gt;types. What are we working on today?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;No re-explaining. No pasted summary.&lt;/strong&gt; The AI just remembers.&lt;/p&gt;

&lt;p&gt;Between sessions, the daemon woke up, read Monday's session file, extracted the durable facts, deduplicated them against what it already knew, and updated &lt;code&gt;~/.agent-memory/memory/project-preferences.md&lt;/code&gt;. I didn't lift a finger.&lt;/p&gt;




&lt;h2&gt;
  
  
  The part I'm most proud of: it costs almost nothing
&lt;/h2&gt;

&lt;p&gt;The daemon runs an LLM to do the extraction. LLMs cost money. I didn't want this tool to quietly drain my Bedrock bill.&lt;/p&gt;

&lt;p&gt;So I added a &lt;strong&gt;Kiro backend&lt;/strong&gt;. Instead of calling Bedrock or OpenAI, the daemon shells out to &lt;code&gt;kiro-cli&lt;/code&gt; itself using your existing Kiro credits. Paired with a lean consolidation agent config (ships with the package), each extraction pass costs about &lt;strong&gt;0.01 Kiro credits&lt;/strong&gt;. Default agent would have been ~0.07. That 7× savings is the difference between "nice-to-have" and "forgot it was running."&lt;/p&gt;

&lt;p&gt;You can still pick Bedrock or OpenAI if that's your stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; mcp-agent-memory
mcp-agent-memory &lt;span class="nt"&gt;--setup&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The wizard walks you through picking a backend, registering with Kiro (and Claude Desktop and Cursor if you want), and installing the daemon as a LaunchAgent on macOS.&lt;/p&gt;

&lt;p&gt;Add this one-line steering rule at &lt;code&gt;~/.kiro/steering/memory.md&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;At the start of every session, call memory_read (no arguments) to load my
memory index. When you learn something durable about me, my projects, or
my preferences, call memory_append_session with a concise markdown summary.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart &lt;code&gt;kiro-cli&lt;/code&gt;. That's it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reading the memory yourself
&lt;/h2&gt;

&lt;p&gt;The memory isn't a black box. It's just markdown files in &lt;code&gt;~/.agent-memory/memory/&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; ~/.agent-memory/memory/
MEMORY.md  cli-architecture.md  project-preferences.md  team-processes.md

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; ~/.agent-memory/memory/project-preferences.md
&lt;span class="c"&gt;# Project Preferences&lt;/span&gt;
- Package manager: pnpm &lt;span class="o"&gt;(&lt;/span&gt;never npm&lt;span class="o"&gt;)&lt;/span&gt;
- Testing framework: Vitest &lt;span class="o"&gt;(&lt;/span&gt;not Jest&lt;span class="o"&gt;)&lt;/span&gt;
- Main entry: src/cli.ts
- Return types: explicit, not inferred
...

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s2"&gt;"Vitest"&lt;/span&gt; ~/.agent-memory/memory/
project-preferences.md:- Testing framework: Vitest &lt;span class="o"&gt;(&lt;/span&gt;not Jest&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;cat&lt;/code&gt; works. &lt;code&gt;grep&lt;/code&gt; works. &lt;code&gt;git&lt;/code&gt; works. If you hate what it stored, delete the file.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this isn't
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not&lt;/strong&gt; a knowledge graph with vector search. If you want that, &lt;a href="https://github.com/Auriti-Labs/kiro-memory" rel="noopener noreferrer"&gt;totalrecallai&lt;/a&gt; does it beautifully — SQLite, embeddings, web dashboard, the works.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not&lt;/strong&gt; AgentCore Memory. That's a managed Bedrock service. This runs on your laptop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not&lt;/strong&gt; a replacement for steering files. Steering is "what is this project." Memory is "what have we learned together." Use both.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why this flavor exists
&lt;/h2&gt;

&lt;p&gt;If I were going to pay for a heavyweight memory system, I probably wouldn't have built this. But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I wanted memory as &lt;strong&gt;plain markdown&lt;/strong&gt; I could read, grep, and version-control.&lt;/li&gt;
&lt;li&gt;I wanted &lt;strong&gt;Kiro CLI support&lt;/strong&gt; specifically (totalrecallai doesn't list it — targets Claude Code, Cursor, Windsurf, Cline).&lt;/li&gt;
&lt;li&gt;I wanted &lt;strong&gt;near-zero ongoing cost&lt;/strong&gt; via the Kiro backend.&lt;/li&gt;
&lt;li&gt;I wanted the MCP server's surface area to be &lt;strong&gt;tiny&lt;/strong&gt; — 3 tools, no dashboard, no SDK.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If that set of constraints sounds right for you, this is your tool. If you want the database-backed semantic-search dashboard experience, try totalrecallai — it's genuinely great at what it does.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;npm:&lt;/strong&gt; &lt;a href="https://www.npmjs.com/package/mcp-agent-memory" rel="noopener noreferrer"&gt;mcp-agent-memory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/tverney/mcp-agent-memory" rel="noopener noreferrer"&gt;tverney/mcp-agent-memory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The async daemon:&lt;/strong&gt; &lt;a href="https://builder.aws.com/content/3C29ijZpMg6xOxI9Rddl73OMaCX/memory-consolidation-daemon-for-ai-agents-with-bedrock" rel="noopener noreferrer"&gt;previous post&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP spec:&lt;/strong&gt; &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you try it and something breaks, file an issue. If you've got a pattern for what &lt;em&gt;should&lt;/em&gt; be memorable vs. forgettable, drop it in the comments — that's the next hard problem I don't have a great answer for yet.&lt;/p&gt;

&lt;p&gt;Tomorrow morning, Kiro will remember who I am. I'm not a stranger to it anymore.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kiro</category>
      <category>mcp</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Bedrock prompt caching only caches a stable prefix. If you inject memory into user content, you pay full price every turn. Here's what I learned wiring up persistent memory for OpenClaw on AgentCore Runtime. #openclawchallenge 🦞</title>
      <dc:creator>Thiago V.</dc:creator>
      <pubDate>Fri, 24 Apr 2026 22:28:23 +0000</pubDate>
      <link>https://forem.com/tverney_77/bedrock-prompt-caching-only-caches-a-stable-prefix-if-you-inject-memory-into-user-content-you-pay-3e3l</link>
      <guid>https://forem.com/tverney_77/bedrock-prompt-caching-only-caches-a-stable-prefix-if-you-inject-memory-into-user-content-you-pay-3e3l</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/tverney_77/persistent-memory-for-openclaw-on-bedrock-getting-prompt-caching-right-3o6c" class="crayons-story__hidden-navigation-link"&gt;Memory Daemon for OpenClaw: How I Got Bedrock Prompt Caching Right&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
      &lt;a href="https://dev.to/tverney_77/persistent-memory-for-openclaw-on-bedrock-getting-prompt-caching-right-3o6c" class="crayons-article__context-note crayons-article__context-note__feed"&gt;&lt;p&gt;OpenClaw Challenge Submission 🦞&lt;/p&gt;

&lt;/a&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/tverney_77" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3852303%2Fc28fe7f4-a933-407f-8c6b-742b24f97742.jpeg" alt="tverney_77 profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/tverney_77" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Thiago V.
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Thiago V.
                
              
              &lt;div id="story-author-preview-content-3547709" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/tverney_77" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3852303%2Fc28fe7f4-a933-407f-8c6b-742b24f97742.jpeg" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Thiago V.&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/tverney_77/persistent-memory-for-openclaw-on-bedrock-getting-prompt-caching-right-3o6c" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 24&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/tverney_77/persistent-memory-for-openclaw-on-bedrock-getting-prompt-caching-right-3o6c" id="article-link-3547709"&gt;
          Memory Daemon for OpenClaw: How I Got Bedrock Prompt Caching Right
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/openclawchallenge"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;openclawchallenge&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/openclaw"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;openclaw&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/aws"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;aws&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/bedrock"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;bedrock&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/tverney_77/persistent-memory-for-openclaw-on-bedrock-getting-prompt-caching-right-3o6c" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/exploding-head-daceb38d627e6ae9b730f36a1e390fca556a4289d5a41abb2c35068ad3e2c4b5.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/multi-unicorn-b44d6f8c23cdd00964192bedc38af3e82463978aa611b4365bd33a0f1f4f3e97.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;5&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/tverney_77/persistent-memory-for-openclaw-on-bedrock-getting-prompt-caching-right-3o6c#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              1&lt;span class="hidden s:inline"&gt; comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            6 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>aws</category>
      <category>devchallenge</category>
      <category>llm</category>
      <category>openclaw</category>
    </item>
    <item>
      <title>Memory Daemon for OpenClaw: How I Got Bedrock Prompt Caching Right</title>
      <dc:creator>Thiago V.</dc:creator>
      <pubDate>Fri, 24 Apr 2026 22:23:29 +0000</pubDate>
      <link>https://forem.com/tverney_77/persistent-memory-for-openclaw-on-bedrock-getting-prompt-caching-right-3o6c</link>
      <guid>https://forem.com/tverney_77/persistent-memory-for-openclaw-on-bedrock-getting-prompt-caching-right-3o6c</guid>
      <description>&lt;p&gt;If you're running an AI agent on Amazon Bedrock and injecting persistent memory into every conversation, where you put that memory in the request matters a lot — both for how well the agent uses it and for what it costs you.&lt;/p&gt;

&lt;p&gt;I learned this the direct way while connecting &lt;a href="https://github.com/tverney/agent-memory-daemon" rel="noopener noreferrer"&gt;agent-memory-daemon&lt;/a&gt; to &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; running on Amazon Bedrock AgentCore Runtime. The setup works beautifully. &lt;/p&gt;

&lt;p&gt;My agent now remembers my preferences, my projects, and the weird Bedrock timeout I debugged three weeks ago. &lt;/p&gt;

&lt;p&gt;But along the way I hit a subtle interaction between memory injection and prompt caching that's worth documenting.&lt;/p&gt;

&lt;p&gt;This post walks through the architecture, the Bedrock prompt caching rule that tripped me up, and the one-line fix that cut my cache-related costs dramatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup: persistent memory for a serverless agent
&lt;/h2&gt;

&lt;p&gt;OpenClaw lives in a container on AgentCore Runtime. &lt;/p&gt;

&lt;p&gt;AgentCore freezes the container when idle, which is great for cost (zero idle spend) but hostile to long-term memory (every wake is a blank slate). &lt;code&gt;agent-memory-daemon&lt;/code&gt; solves this by running as a background process in the same container, doing two things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extraction&lt;/strong&gt; — watches the session transcript directory and pulls out facts, decisions, and preferences worth remembering. Writes them as individual markdown files with YAML frontmatter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consolidation&lt;/strong&gt; — periodically reorganizes the memory directory: merges duplicates, resolves contradictions, prunes stale content, and maintains a concise &lt;code&gt;MEMORY.md&lt;/code&gt; index under a strict size budget.&lt;/p&gt;

&lt;p&gt;Memory is synced to S3 between invocations. When a new conversation starts, the container restores the memory directory and reads &lt;code&gt;MEMORY.md&lt;/code&gt; to bring the agent up to speed.&lt;/p&gt;
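
&lt;p&gt;For reference, the sync step is only a few lines of boto3. This is a sketch rather than the actual &lt;code&gt;server.py&lt;/code&gt; glue; the bucket name, key prefix, and &lt;code&gt;/app/memory&lt;/code&gt; path are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the S3 restore/persist step (placeholder bucket, prefix, and local path).
import boto3
from pathlib import Path

s3 = boto3.client("s3")
BUCKET = "my-agent-memory-bucket"   # placeholder
PREFIX = "openclaw/memory/"         # placeholder
MEMORY_DIR = Path("/app/memory")    # placeholder for the container's memory directory

def restore_memory():
    """Pull the curated memory down before the conversation starts."""
    MEMORY_DIR.mkdir(parents=True, exist_ok=True)
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    for obj in resp.get("Contents", []):
        name = obj["Key"].removeprefix(PREFIX)
        if name:
            s3.download_file(BUCKET, obj["Key"], str(MEMORY_DIR / name))
    index = MEMORY_DIR / "MEMORY.md"
    return index.read_text() if index.exists() else ""

def persist_memory():
    """Push whatever the daemon wrote back up after the invocation."""
    for path in MEMORY_DIR.glob("*.md"):
        s3.upload_file(str(path), BUCKET, PREFIX + path.name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;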

&lt;p&gt;The daemon itself is cheap. &lt;/p&gt;

&lt;p&gt;It makes a few Haiku calls per day — my config targets about $0.25/month for the daemon's own LLM usage. The magic happens in what it produces: a curated, size-budgeted &lt;code&gt;MEMORY.md&lt;/code&gt; that's always ~18KB regardless of how many sessions the agent has had.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Discord → EC2 bot → AgentCore Runtime → container
                                            ├── openclaw (the agent)
                                            ├── agent-memory-daemon (curator)
                                            └── server.py (HTTP glue + S3 sync)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The daemon writes files. The agent reads files. The filesystem is the interface. No SDK, no API, no coupling.&lt;/p&gt;
&lt;h2&gt;
  
  
  Injecting the memory
&lt;/h2&gt;

&lt;p&gt;On every invocation, I load &lt;code&gt;MEMORY.md&lt;/code&gt; from S3 and pass it to OpenClaw as context. My first version looked like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;memory_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_memory_from_s3&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# ~18KB of curated memory
&lt;/span&gt;
&lt;span class="n"&gt;effective_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;memory_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;effective_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[LONG-TERM MEMORY - persisted memory from previous sessions]&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;memory_context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[END OF MEMORY]&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User message: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;effective_message&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;I stuffed the memory into the user message. The agent saw it. It remembered my preferences. Everything worked.&lt;/p&gt;

&lt;p&gt;I also had Bedrock prompt caching enabled through OpenClaw's config:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"agents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"defaults"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"amazon-bedrock/...claude-haiku-4-5..."&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"cacheRetention"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"short"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Claude Haiku 4.5 supports prompt caching with a 5-minute TTL on the "short" retention mode. &lt;/p&gt;

&lt;p&gt;Cache reads are billed at ~10% of the regular input rate. On paper, my 18KB memory (~4,500 tokens) should have been getting served from cache at roughly a tenth of the price on every turn after the first.&lt;/p&gt;
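
&lt;p&gt;Back-of-the-envelope, using those same numbers and ignoring the cache-write premium, caching the memory should cut its share of a multi-turn day by roughly 9×:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough math with the numbers above; rates are relative, not dollar prices,
# and the cache-write premium on the first turn is ignored.
memory_tokens = 4_500        # the ~18KB MEMORY.md
turns = 100                  # a multi-turn day
cache_read_rate = 0.10       # cache reads bill at ~10% of the regular input rate

uncached = memory_tokens * turns                                  # what I was paying
cached = memory_tokens * (1 + (turns - 1) * cache_read_rate)      # what caching should cost

print(round(uncached / cached, 1))   # ~9.2x cheaper for the memory portion
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;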

&lt;p&gt;Then I looked at Cost Explorer.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Bedrock actually caches
&lt;/h2&gt;

&lt;p&gt;Three days of usage, broken down by token type:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Line item&lt;/th&gt;
&lt;th&gt;Tokens (millions)&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cache Read&lt;/td&gt;
&lt;td&gt;12.69&lt;/td&gt;
&lt;td&gt;$1.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache Write&lt;/td&gt;
&lt;td&gt;7.09&lt;/td&gt;
&lt;td&gt;$9.75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input (uncached)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;31.91&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$35.10&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;4.72&lt;/td&gt;
&lt;td&gt;$25.96&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The "Input (uncached)" line is the one that doesn't make sense if caching is working. I had 12.69M cache reads, which meant &lt;em&gt;something&lt;/em&gt; was being cached — OpenClaw's internal system prompt was getting cached fine. But 31.91M tokens were paying full input price. Where were they coming from?&lt;/p&gt;

&lt;p&gt;Here's the rule that trips people up: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bedrock prompt caching caches a stable prefix.&lt;/strong&gt; It looks at the beginning of the request, finds the longest chunk that's identical to a previously-cached request, and serves that from cache. Everything after the divergence point is recomputed and billed as regular input.&lt;/p&gt;

&lt;p&gt;Now look at my code again:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;effective_message&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;effective_message&lt;/code&gt; is &lt;code&gt;"[LONG-TERM MEMORY]...18KB of memory...User message: {message}"&lt;/code&gt;. The user's actual question is appended at the end.&lt;/p&gt;

&lt;p&gt;What Bedrock sees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Turn 1: &lt;code&gt;messages[0].content = "[MEMORY]...same 18KB...User message: what time is it?"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Turn 2: &lt;code&gt;messages[0].content = "[MEMORY]...same 18KB...User message: tell me a joke"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those two strings share a stable 18KB prefix of memory content, but they're both in &lt;code&gt;messages[0].content&lt;/code&gt;. The cacheable prefix is actually the &lt;em&gt;system prompt that OpenClaw builds on top&lt;/em&gt; — OpenClaw's own system content, its tool definitions, its skill metadata. &lt;/p&gt;

&lt;p&gt;Once the request stream reaches the user message, Bedrock sees variance (the actual user question) and stops caching.&lt;/p&gt;

&lt;p&gt;So the memory was sitting in a position where it couldn't be cached. Every turn paid full price for those 4,500 tokens.&lt;/p&gt;
&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;The change is small. Move the memory to a &lt;code&gt;system&lt;/code&gt; message, before the user message:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;memory_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You have access to long-term memory from previous sessions. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Use this to answer questions about the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s preferences and history.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;memory_context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now the memory is part of the stable system prefix. It sits alongside OpenClaw's own system prompt, tool definitions, and skills — the stuff that genuinely doesn't change between turns. Bedrock sees the same system block on every request and serves it from cache at 10% of the regular rate.&lt;/p&gt;

&lt;p&gt;A one-line architectural change. A 90% discount on the biggest line item in the bill.&lt;/p&gt;
&lt;h2&gt;
  
  
  Verifying it worked
&lt;/h2&gt;

&lt;p&gt;After deploying, I asked OpenClaw for its usage stats via the &lt;code&gt;/usage full&lt;/code&gt; chat command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🦞 OpenClaw 2026.2.26
🧮 Tokens: 9 in / 516 out
🗄️ Cache: 99% hit · 67k cached, 715 new
📚 Context: 34k/200k (17%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;67K tokens served from cache, only 715 new tokens computed. Before the fix, the 4,500-token memory injection was in the "new" bucket every turn. Now it's in the 67K cached bucket.&lt;/p&gt;

&lt;p&gt;The change to Cost Explorer followed. The "Input (uncached)" line dropped, and the "Cache Read" line absorbed that traffic at a tenth of the price.&lt;/p&gt;
&lt;h2&gt;
  
  
  Three takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Prompt caching only caches a stable prefix.&lt;/strong&gt; Everything up to the first point of variance between requests is cacheable. Everything after is not. If you're injecting repeated context, put it early in the request — system prompt, tool definitions, or the first message of a consistent message sequence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. User content is almost always the wrong place for stable context.&lt;/strong&gt; The user's actual question varies every turn. Anything you concatenate with it inherits that variance and becomes uncacheable. Pull it out into a system message.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Watch cache writes in your bill.&lt;/strong&gt; Cache writes cost &lt;em&gt;more&lt;/em&gt; than regular input (1.25x on Haiku 4.5). If you see high cache writes, it means your TTL is expiring between requests and the cache is being rewritten each time. Keep the cache warm — for &lt;code&gt;cacheRetention: "short"&lt;/code&gt; (5-min TTL), a heartbeat every ~4 minutes avoids cold-cache rewrites.&lt;/p&gt;
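
&lt;p&gt;If you do want a heartbeat, the simplest version is a scheduled no-op invocation. This is a sketch of the idea only; &lt;code&gt;invoke_agent()&lt;/code&gt; is a placeholder for however you actually reach your agent, and whether the heartbeat pays for itself depends on your traffic pattern.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Cache-warming heartbeat sketch: one trivial request every ~4 minutes keeps the
# 5-minute "short" retention TTL from expiring between real conversations.
import time

HEARTBEAT_SECONDS = 240  # comfortably under the 5-minute TTL

def invoke_agent(message):
    raise NotImplementedError  # placeholder: AgentCore endpoint, HTTP glue, etc.

while True:
    invoke_agent("heartbeat: keep the prompt cache warm, reply with OK")
    time.sleep(HEARTBEAT_SECONDS)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;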
&lt;h2&gt;
  
  
  The daemon, revisited
&lt;/h2&gt;

&lt;p&gt;None of this is a critique of &lt;code&gt;agent-memory-daemon&lt;/code&gt; — the daemon did exactly what it was supposed to do. It produced a stable, size-budgeted 18KB memory file. &lt;/p&gt;

&lt;p&gt;The integration code I wrote around it was putting that output in the wrong place.&lt;/p&gt;

&lt;p&gt;In fact, the daemon's design (stable output size, consistent content, regular regeneration rhythm) is &lt;em&gt;ideal&lt;/em&gt; for prompt caching. As long as you feed it into a system message, Bedrock can cache the whole thing for the TTL window, and the daemon's periodic consolidation doesn't bust the cache more often than necessary.&lt;/p&gt;

&lt;p&gt;If you're running OpenClaw or any agent on Bedrock and want persistent memory without a managed memory service, the pattern works well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run &lt;a href="https://github.com/tverney/agent-memory-daemon" rel="noopener noreferrer"&gt;agent-memory-daemon&lt;/a&gt; alongside your agent&lt;/li&gt;
&lt;li&gt;Sync the memory directory to S3 between sessions (or use a mounted filesystem if available)&lt;/li&gt;
&lt;li&gt;Load the curated &lt;code&gt;MEMORY.md&lt;/code&gt; at the start of each conversation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inject it as a system message&lt;/strong&gt;, not user content&lt;/li&gt;
&lt;li&gt;Enable &lt;code&gt;cacheRetention&lt;/code&gt; on your model config&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The daemon handles the hard part (curating memories without bloat). Bedrock handles the cheap part (caching the stable prefix). You just have to put the memory in the right place.&lt;/p&gt;
&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/tverney" rel="noopener noreferrer"&gt;
        tverney
      &lt;/a&gt; / &lt;a href="https://github.com/tverney/agent-memory-daemon" rel="noopener noreferrer"&gt;
        agent-memory-daemon
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Open-source memory manager daemon for AI agents. Filesystem-native, LLM-pluggable, framework-agnostic. Works with OpenClaw, Strands, LangChain, AgentCore Runtime or any agent that can write a file.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Open-source memory manager daemon for AI agents&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;Open-source memory consolidation and extraction daemon for AI agents. Filesystem-native, LLM-pluggable, framework-agnostic.&lt;/p&gt;

&lt;p&gt;Agents feed it raw observations as markdown files; the daemon runs two complementary modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consolidation&lt;/strong&gt; — periodically reorganizes, deduplicates, and prunes existing memory files via a four-phase pass (orient → gather → consolidate → prune)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extraction&lt;/strong&gt; — watches for new session content and runs an LLM pass to identify facts, decisions, preferences, and error corrections worth remembering, writing them as individual memory files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The filesystem is the interface — no SDK, no API, no MCP required. The LLM backend is pluggable (OpenAI, Amazon Bedrock, or anything with a chat API).&lt;/p&gt;

&lt;p&gt;memconsolidate is a standalone, agent-agnostic daemon — available to anyone building with OpenClaw, Strands, LangChain, or any custom agent framework.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;How it works&lt;/h2&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Consolidation (reorganize existing memories)&lt;/h3&gt;
&lt;/div&gt;


&lt;ol&gt;

&lt;li&gt;Agents write markdown memory files (with YAML frontmatter) to a watched directory&lt;/li&gt;

&lt;li&gt;A three-gate…&lt;/li&gt;

&lt;/ol&gt;
&lt;/div&gt;
&lt;br&gt;
  &lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/tverney/agent-memory-daemon" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;



&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/tverney" rel="noopener noreferrer"&gt;
        tverney
      &lt;/a&gt; / &lt;a href="https://github.com/tverney/openclaw-agentcore-personal" rel="noopener noreferrer"&gt;
        openclaw-agentcore-personal
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Deploy your own personal OpenClaw on AWS Bedrock AgentCore — serverless, ~$9/mo, one-click CloudFormation, Discord/WhatsApp/Telegram 🦞
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Deploy Your Personal OpenClaw on AWS AgentCore — Serverless, ~$9/month&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href="https://github.com/tverney/openclaw-agentcore-personal/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fdf2982b9f5d7489dcf44570e714e3a15fce6253e0cc6b5aa61a075aac2ff71b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667" alt="License: MIT"&gt;&lt;/a&gt;
&lt;a href="https://github.com/tverney/openclaw-agentcore-personal/openclaw-simplified.yaml" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/8612cbc353da890d8c8059074a18d758fe2c3ef3e533eee1ec311668c6946992/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4157532d436c6f7564466f726d6174696f6e2d6f72616e67653f6c6f676f3d616d617a6f6e617773" alt="AWS CloudFormation"&gt;&lt;/a&gt;
&lt;a href="https://python.org" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/b9b0db6ad652ca22a00b92c5440c9f22222c10d3a77c798c712947b517dc3329/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507974686f6e2d332e31302b2d626c75653f6c6f676f3d707974686f6e" alt="Python 3.10+"&gt;&lt;/a&gt;
&lt;a href="https://kiro.dev" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/bdab0209aac85f542446afcb9641e10e2cf0eb5a3c4d328d640e2d069a2532e5/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4275696c74253230776974682d4b69726f2d626c756576696f6c6574" alt="Built with Kiro"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Cost-optimized &lt;a href="https://github.com/aws-samples/sample-OpenClaw-on-AWS-with-Bedrock" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; deployment using AWS Bedrock AgentCore Runtime. Connect via Discord, WhatsApp, Telegram, or Slack. ~$9-15/month infrastructure.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://console.aws.amazon.com/cloudformation/home#/stacks/new?stackName=openclaw-personal&amp;amp;templateURL=https://raw.githubusercontent.com/tverney/openclaw-personal-agentcore/main/openclaw-simplified.yaml" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/6a813feb9856a2908ddb0ec6d67734b3b0792e8941544cf0ec6d16e625754869/68747470733a2f2f73332e616d617a6f6e6177732e636f6d2f636c6f7564666f726d6174696f6e2d6578616d706c65732f636c6f7564666f726d6174696f6e2d6c61756e63682d737461636b2e706e67" alt="Launch Stack"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What Is This?&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;A single-user, serverless deployment of OpenClaw on AWS. Instead of running an EC2 instance 24/7, the AI runs on-demand via AgentCore Runtime — the container freezes between invocations, so you only pay when you use it.&lt;/p&gt;
&lt;p&gt;All messaging plugins (WhatsApp, Telegram, Discord, Slack) are pre-installed in OpenClaw. This template includes a Discord bot by default, but you can connect any platform directly through the OpenClaw Web UI.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Architecture&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;
&lt;pre class="notranslate"&gt;&lt;code&gt;You (Discord / WhatsApp / Telegram / Slack)
  │
  ▼
┌──────────────────────────────────────────────────────────┐
│  AWS Cloud                                               │
│                                                          │
│  EC2 t4g.nano ──invoke──▶  AgentCore Runtime             │
│  (Discord bot)             (OpenClaw container)          │
│                                │                         │
│                            IAM Role                      │
│                                │                         │
│                            Bedrock                       │
│                          (Haiku/Sonnet/Nova)             │
│                                                          │
│  ┌─────────┐  ┌──────────┐  ┌─────────┐&lt;/code&gt;&lt;/pre&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/tverney/openclaw-agentcore-personal" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;The repo above is the full AgentCore deployment, including the system-message fix and the S3 sync layer.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of the &lt;a href="https://dev.to/t/openclawchallenge"&gt;OpenClaw Challenge&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openclawchallenge</category>
      <category>openclaw</category>
      <category>aws</category>
      <category>bedrock</category>
    </item>
    <item>
      <title>Your AI Agent Forgets Everything — Here's a Daemon That Fixes That</title>
      <dc:creator>Thiago V.</dc:creator>
      <pubDate>Tue, 07 Apr 2026 16:08:59 +0000</pubDate>
      <link>https://forem.com/tverney_77/your-ai-agent-forgets-everything-heres-a-daemon-that-fixes-that-5ao0</link>
      <guid>https://forem.com/tverney_77/your-ai-agent-forgets-everything-heres-a-daemon-that-fixes-that-5ao0</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally posted on &lt;a href="https://builder.aws.com/content/3C29ijZpMg6xOxI9Rddl73OMaCX/memory-consolidation-daemon-for-ai-agents-with-bedrock" rel="noopener noreferrer"&gt;AWS Builder&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You spend an hour teaching your agent your project structure, your coding preferences, the weird Bedrock timeout issue you debugged last Tuesday. Next session? Gone. You're back to explaining that you prefer single quotes and that the CI pipeline needs &lt;code&gt;--run&lt;/code&gt; to avoid watch mode.&lt;/p&gt;

&lt;p&gt;Some frameworks have memory plugins. They work — sort of. But they're coupled to one framework, they accumulate junk over time, and nobody's cleaning up the contradictions from three weeks ago when you changed your mind about the database.&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://github.com/tverney/agent-memory-daemon" rel="noopener noreferrer"&gt;agent-memory-daemon&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;It's a background daemon that runs alongside your agent — any agent. It watches a directory of session files and does two things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extraction&lt;/strong&gt; — scans new session transcripts and pulls out facts, decisions, preferences, and error corrections. Writes each one as a structured markdown file with YAML frontmatter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consolidation&lt;/strong&gt; — periodically reviews the entire memory directory. Merges duplicates, converts relative dates to absolute, removes contradicted facts, prunes stale content, and keeps a concise &lt;code&gt;MEMORY.md&lt;/code&gt; index under a size budget.&lt;/p&gt;

&lt;p&gt;The filesystem is the interface. Your agent writes markdown files to a directory. The daemon reads them, thinks about them, and writes organized memories back. No SDK, no API, no MCP server. If your agent can write a file, it works.&lt;/p&gt;
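
&lt;p&gt;Concretely, "if your agent can write a file, it works" looks like this on the agent side. A sketch, assuming the &lt;code&gt;./sessions&lt;/code&gt; directory from the config below; the filename scheme is my own.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# The entire integration surface, from the agent's side: append the session
# transcript to a file in the watched directory. The filename scheme is illustrative.
from datetime import date
from pathlib import Path

SESSIONS = Path("./sessions")
SESSIONS.mkdir(exist_ok=True)

def log_session(summary_markdown):
    session_file = SESSIONS / f"{date.today().isoformat()}.md"
    with session_file.open("a") as f:
        f.write(summary_markdown + "\n\n")

log_session("## Decisions\n- We use pnpm, not npm\n- Tests run with Vitest")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;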

&lt;h2&gt;
  
  
  The "aha" moment
&lt;/h2&gt;

&lt;p&gt;I was running an agent that had accumulated 40+ memory files over a few weeks. Half of them were duplicates with slightly different wording. Three of them contradicted each other about which AWS region we were using. The &lt;code&gt;MEMORY.md&lt;/code&gt; index was 800 lines long and the agent was spending half its context window just reading its own memories.&lt;/p&gt;

&lt;p&gt;That's when I realized: agents need a janitor. Not just a place to store memories, but something that actively curates them.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Extraction (discovering new memories)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Session file modified
        ↓
Cursor check: is this new content?
        ↓
Build prompt: memory manifest + session content
        ↓
LLM identifies facts, decisions, preferences
        ↓
Write structured memory files
        ↓
Advance cursor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The daemon tracks a &lt;code&gt;.extraction-cursor&lt;/code&gt; file — a per-session offset map so it only processes genuinely new content. If a session file gets appended to, it picks up where it left off instead of reprocessing the whole thing.&lt;/p&gt;
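
&lt;p&gt;Conceptually the cursor is nothing more than an offset map: look up where you stopped, read from there, advance. A simplified sketch of the idea, not the daemon's actual format:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Simplified illustration of the cursor idea: a per-session byte offset,
# consulted before each pass and advanced afterwards. Not the real cursor format.
import json
from pathlib import Path

CURSOR = Path("./memory/.extraction-cursor")

def read_new_content(session_path):
    offsets = json.loads(CURSOR.read_text()) if CURSOR.exists() else {}
    start = offsets.get(str(session_path), 0)
    data = session_path.read_bytes()[start:]
    offsets[str(session_path)] = start + len(data)   # advance the cursor
    CURSOR.write_text(json.dumps(offsets))
    return data.decode("utf-8", errors="replace")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;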

&lt;h3&gt;
  
  
  Consolidation (organizing existing memories)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Three-gate trigger: time elapsed + session count + lock
        ↓
Four-phase pass: orient → gather → consolidate → prune
        ↓
Merge duplicates, resolve contradictions
        ↓
Update MEMORY.md index (200 lines / 25KB budget)
        ↓
Release lock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both modes share a PID-based lock and never run concurrently. Consolidation takes priority — if both triggers fire on the same tick, consolidation runs first.&lt;/p&gt;
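
&lt;p&gt;The lock itself is nothing exotic. A simplified sketch of the PID-lock idea, not the daemon's exact code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Simplified PID-lock sketch: one lock file holds the owner's PID, and a stale
# lock from a dead process is safe to take over.
import os
from pathlib import Path

LOCK = Path("./memory/.daemon.lock")

def acquire_lock():
    if LOCK.exists():
        try:
            pid = int(LOCK.read_text().strip())
            os.kill(pid, 0)          # signal 0 only checks that the process exists
            return False             # another pass is still running
        except (ValueError, ProcessLookupError):
            pass                     # empty or stale lock; take it over
    LOCK.write_text(str(os.getpid()))
    return True

def release_lock():
    LOCK.unlink(missing_ok=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;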

&lt;h2&gt;
  
  
  Quick start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx agent-memory-daemon init    &lt;span class="c"&gt;# generates memconsolidate.toml&lt;/span&gt;
npx agent-memory-daemon start   &lt;span class="c"&gt;# starts the daemon&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The config is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="py"&gt;memory_directory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"./memory"&lt;/span&gt;
&lt;span class="py"&gt;session_directory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"./sessions"&lt;/span&gt;

&lt;span class="py"&gt;extraction_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;extraction_interval_ms&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt;

&lt;span class="nn"&gt;[llm_backend]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"bedrock"&lt;/span&gt;
&lt;span class="py"&gt;region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"us-east-1"&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"us.anthropic.claude-sonnet-4-20250514-v1:0"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use OpenAI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[llm_backend]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai"&lt;/span&gt;
&lt;span class="py"&gt;api_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"${OPENAI_API_KEY}"&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"gpt-4o"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What a memory file looks like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bedrock&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;configuration"&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Default&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;SDK&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;too&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;short&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;large&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;prompts"&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;reference&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
The AWS SDK's default request timeout causes ECONNABORTED errors
on prompts over 30K characters. Set requestTimeout to 300000 (5 min)
via NodeHttpHandler when using BedrockRuntimeClient.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each file has a type: &lt;code&gt;user&lt;/code&gt; (preferences), &lt;code&gt;feedback&lt;/code&gt; (lessons learned), &lt;code&gt;project&lt;/code&gt; (architecture decisions), or &lt;code&gt;reference&lt;/code&gt; (technical facts). The daemon classifies them automatically during extraction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Framework-agnostic by design
&lt;/h2&gt;

&lt;p&gt;The integration pattern is the same regardless of what you're building with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strands / LangChain&lt;/strong&gt;: after each agent run, dump a session summary to the sessions directory. At startup, read &lt;code&gt;MEMORY.md&lt;/code&gt; into the system prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenClaw&lt;/strong&gt;: point &lt;code&gt;session_directory&lt;/code&gt; at your workspace's transcript directory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom agents&lt;/strong&gt;: same pattern — write files, read the index.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No plugin system, no adapter layer. The filesystem is the API.&lt;/p&gt;
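
&lt;p&gt;Concretely, here's a minimal sketch of that pattern in Python. It assumes the daemon's consolidated &lt;code&gt;MEMORY.md&lt;/code&gt; index sits in the memory directory and that session summaries are plain markdown files; the paths and file naming are placeholders, so adjust them to your setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path
from datetime import datetime, timezone

MEMORY_DIR = Path("./memory")      # matches memory_directory in memconsolidate.toml
SESSIONS_DIR = Path("./sessions")  # matches session_directory

def system_prompt_with_memory(base_prompt: str) -&gt; str:
    """At startup, prepend the consolidated memory index to the system prompt."""
    index = MEMORY_DIR / "MEMORY.md"
    if not index.exists():
        return base_prompt
    return base_prompt + "\n\nProject memory:\n" + index.read_text()

def dump_session_summary(summary: str) -&gt; None:
    """After each agent run, drop a summary where the daemon will pick it up."""
    SESSIONS_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    (SESSIONS_DIR / f"session-{stamp}.md").write_text(summary)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;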

&lt;h2&gt;
  
  
  Guardrails
&lt;/h2&gt;

&lt;p&gt;One thing I learned the hard way: without limits, the extraction mode creates files exponentially. Each pass sees the new files from the last pass, prompts the LLM with a bigger manifest, and the LLM creates even more files.&lt;/p&gt;

&lt;p&gt;So there are guardrails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;max_memory_files&lt;/code&gt; — hard cap on total files in the directory (default: 50)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_files_per_batch&lt;/code&gt; — cap on creates per extraction pass (default: 10)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_prompt_chars&lt;/code&gt; — budget enforcement with progressive truncation&lt;/li&gt;
&lt;li&gt;Per-session cursor — prevents reprocessing already-extracted content&lt;/li&gt;
&lt;/ul&gt;
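
&lt;p&gt;In &lt;code&gt;memconsolidate.toml&lt;/code&gt; that looks roughly like this (50 and 10 are the defaults mentioned above; the &lt;code&gt;max_prompt_chars&lt;/code&gt; value is just an illustrative budget, and the exact section placement may differ in the generated config):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;max_memory_files = 50       # hard cap on total files in the memory directory
max_files_per_batch = 10    # cap on creates per extraction pass
max_prompt_chars = 100000   # illustrative budget; enforced with progressive truncation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;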

&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;Every operation emits structured JSON logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2026-04-07T14:23:01.234Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"info"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"event"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"extraction:complete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"created"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"updated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"durationMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4521&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"promptLength"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;39102&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"operationsRequested"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"operationsApplied"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"operationsSkipped"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get duration, prompt size, operation counts, and skip reasons. Pipe it to CloudWatch, Datadog, or just &lt;code&gt;jq&lt;/code&gt;.&lt;/p&gt;
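
&lt;p&gt;For example, assuming you redirect the daemon's output to a log file, a one-line &lt;code&gt;jq&lt;/code&gt; filter gives you extraction latency at a glance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# assuming the structured JSON logs end up in daemon.log
jq -r 'select(.event == "extraction:complete") | "\(.timestamp)  \(.data.durationMs)ms  created=\(.data.created)"' daemon.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;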

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Vector similarity search for memory recall (right now it's manifest-based)&lt;/li&gt;
&lt;li&gt;Multi-agent support (shared memory directories with conflict resolution)&lt;/li&gt;
&lt;li&gt;A web UI for browsing and editing memories&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project is MIT-licensed and on &lt;a href="https://github.com/tverney/agent-memory-daemon" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Issues, PRs, and feedback are welcome.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;agent-memory-daemon
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your agent keeps forgetting things, give it a daemon with a good memory.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>opensource</category>
      <category>typescript</category>
    </item>
    <item>
      <title>The Irony of Language Models That Don't Speak Your Language</title>
      <dc:creator>Thiago V.</dc:creator>
      <pubDate>Mon, 30 Mar 2026 20:53:42 +0000</pubDate>
      <link>https://forem.com/tverney_77/the-irony-of-language-models-that-dont-speak-your-language-5b5</link>
      <guid>https://forem.com/tverney_77/the-irony-of-language-models-that-dont-speak-your-language-5b5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This is a personal project and article. The opinions expressed here are my own and do not reflect the opinions of AWS or Amazon. This project is not an AWS product and is not endorsed by or affiliated with AWS.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AI is plugged into everything now, and it's genuinely a breakthrough technology.&lt;/p&gt;

&lt;p&gt;However, there's an elephant in the room that rarely makes the headlines: LLMs are fundamentally centered on high-resource languages, and on English above all.&lt;/p&gt;

&lt;p&gt;The only publicly disclosed training data breakdown — GPT-3 (Brown et al., 2020) — showed over 92% English tokens. Newer models don't publish exact ratios, but the picture has evolved: Llama 3 remains heavily English, GPT-4o highlights improved multilingual performance as a key feature, and models like Qwen and Aya have invested significantly more in non-English data. The gap is narrowing for high-resource languages, but for the thousands of low-resource languages, the structural imbalance remains.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0kx394w7v2kwjxajsg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0kx394w7v2kwjxajsg3.png" alt=" " width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The remaining languages — spoken by billions of people — are either poorly represented through low-quality machine-translated English content, or absent entirely.&lt;/p&gt;

&lt;p&gt;This means that when a Thai farmer asks about crop subsidies, when a Nigerian mother searches for vaccination schedules in Yoruba, or when a Brazilian citizen navigates tax forms in Portuguese, the AI they're interacting with is operating at a fraction of its true capability.&lt;/p&gt;

&lt;p&gt;Not because the intelligence isn't there, but because the model was never properly taught to listen in their language. The industry celebrates "human-level performance" on benchmarks, but those benchmarks are overwhelmingly English. For most of the world, the AI revolution hasn't arrived yet — it's still stuck at customs, waiting for a translator.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Ancient Myth
&lt;/h2&gt;

&lt;p&gt;Around 4,000 years ago, Babylon was the most cosmopolitan city on Earth.&lt;/p&gt;

&lt;p&gt;Situated at the crossroads of ancient trade routes in modern-day Iraq, it was a place where Akkadian, Sumerian, Aramaic, Elamite, and dozens of other languages collided daily. Merchants, scholars, and diplomats from across Mesopotamia converged there, and the city thrived precisely because it found ways to bridge those languages — through scribes, translators, and the world's first multilingual libraries.&lt;/p&gt;

&lt;p&gt;The biblical story of the Tower of Babel, set in Babylon, tells it differently: God scattered humanity across the earth and confused their languages so they could no longer understand each other. It's a story about the fracturing of communication — the moment when a shared project became impossible because people could no longer speak the same language.&lt;/p&gt;

&lt;p&gt;We're living in a strange echo of that story. We've built the most powerful reasoning machines in human history — LLMs that can write poetry, prove theorems, and generate working code. But these machines think in English. When the rest of the world tries to speak to them, the tower crumbles. Not because the intelligence isn't there, but because the language barrier corrupts the signal before it reaches the model's reasoning core.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Illusion of Multilingual AI
&lt;/h2&gt;

&lt;p&gt;Ask any frontier LLM a question in English, and you'll get a polished, accurate, well-reasoned response. Now ask the same question in Thai. Or Amharic. Or even Portuguese.&lt;/p&gt;

&lt;p&gt;Suddenly, the magic fades.&lt;/p&gt;

&lt;p&gt;The response might be shorter, vaguer, or riddled with English fragments leaking through. In some cases, it's outright gibberish. And here's the part nobody talks about: you're paying more for that worse response.&lt;/p&gt;

&lt;p&gt;While the industry celebrates benchmark after benchmark showing LLMs reaching "human-level performance," there's a massive asterisk: in English. For the 6,950 other languages spoken on this planet, AI remains broken, expensive, and in some cases, unreliable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers Don't Lie
&lt;/h2&gt;

&lt;p&gt;Most leading LLMs allocate approximately 92% of their training tokens to English (&lt;a href="https://arxiv.org/abs/2005.14165" rel="noopener noreferrer"&gt;Brown et al., "Language Models are Few-Shot Learners", NeurIPS 2020&lt;/a&gt;). Of the approximately 7,000 spoken languages globally, most models only cover about 50 high-resource ones (&lt;a href="https://www.frontiersin.org/research-topics/77716/language-models-for-low-resource-languages" rel="noopener noreferrer"&gt;Frontiers Research Topic: Language Models for Low-Resource Languages&lt;/a&gt;). The remaining languages lack both the digital data and quality resources to benefit from recent AI advancements — creating barriers to education, healthcare, financial access, and employment for the communities that speak them.&lt;/p&gt;

&lt;p&gt;But the problem goes deeper than just quality. It's about money.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Language Tax
&lt;/h2&gt;

&lt;p&gt;LLMs use tokenizers to break text into chunks before processing. These tokenizers were designed primarily for English. When you feed them Thai, Japanese, Arabic, or Korean text, the same semantic content gets split into 2-4x more tokens.&lt;/p&gt;
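
&lt;p&gt;You can see the effect yourself in a few lines of Python, using OpenAI's &lt;code&gt;tiktoken&lt;/code&gt; encoder as a stand-in (every provider's tokenizer differs, so exact counts will vary, but the ratio is the point):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "Explain the concept of recursion in programming"
thai = "อธิบายแนวคิดของ recursion ในการเขียนโปรแกรม"  # the same question in Thai

for label, text in [("English", english), ("Thai", thai)]:
    print(f"{label}: {len(enc.encode(text))} tokens")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;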

&lt;p&gt;I built a proxy called &lt;a href="https://github.com/tverney/llm-proxy-babylon" rel="noopener noreferrer"&gt;LLM Proxy Babylon&lt;/a&gt; to measure this. Here's what I found with a real Thai prompt about sorting algorithms:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Direct Thai&lt;/th&gt;
&lt;th&gt;Optimized (English)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt tokens&lt;/td&gt;
&lt;td&gt;~166&lt;/td&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token savings&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;70% fewer input tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality score&lt;/td&gt;
&lt;td&gt;0.456&lt;/td&gt;
&lt;td&gt;~0.949 (English-level)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's 3.4x fewer input tokens and 2x better quality. At Amazon Nova Lite pricing on Bedrock ($0.06/1M input tokens), sending 1,000 Thai prompts of this size would cost ~$0.01 directly vs ~$0.003 through the optimizer — and the optimized path delivers dramatically better responses.&lt;/p&gt;

&lt;p&gt;The savings scale dramatically with premium models. At Claude Opus 4 pricing on Bedrock ($15/1M input tokens), the same 1,000 Thai prompts would cost $2.49 directly vs $0.74 through the optimizer — a $1.75 saving per thousand requests on input tokens alone, roughly $1,750 per million, with better quality on every response.&lt;/p&gt;
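
&lt;p&gt;The arithmetic behind those figures is easy to reproduce from the token counts and list prices above (a quick sanity-check script, nothing more):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def input_cost_usd(prompts: int, tokens_per_prompt: int, price_per_million: float) -&gt; float:
    return prompts * tokens_per_prompt * price_per_million / 1_000_000

for model, price in [("Nova Lite", 0.06), ("Claude Opus 4", 15.0)]:
    direct = input_cost_usd(1_000, 166, price)    # Thai sent as-is
    optimized = input_cost_usd(1_000, 49, price)  # translated to English first
    print(f"{model}: ${direct:.3f} direct vs ${optimized:.3f} optimized per 1,000 prompts")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;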

&lt;p&gt;Every company running a multilingual chatbot is silently paying this tax. Their English-speaking users get fast, cheap, high-quality responses. Their Thai-speaking users get slower, more expensive, lower-quality responses — for the same product, same subscription price.&lt;/p&gt;

&lt;p&gt;And it compounds. Chatbots send the full conversation history with every request. A 10-message conversation in Thai accumulates tokens 3x faster than the same conversation in English. By turn 10, you're sending massive context windows that cost a fortune and may even overflow the model's limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Government Chatbots Can't Serve Their Own Citizens
&lt;/h2&gt;

&lt;p&gt;Now imagine this problem at the scale of a government.&lt;/p&gt;

&lt;p&gt;Countries across Southeast Asia, Africa, the Middle East, and South America are deploying AI-powered chatbots to help citizens access healthcare information, navigate tax systems, apply for social programs, and find emergency services. These are critical services that directly impact people's lives.&lt;/p&gt;

&lt;p&gt;But here's the catch: the LLMs powering these chatbots were trained on English. When a farmer in rural Thailand asks about crop subsidies in Thai, the model's reasoning capability drops by nearly 50%. When a mother in Nigeria asks about childhood vaccination schedules in Yoruba, the model might not even understand the question properly.&lt;/p&gt;

&lt;p&gt;The irony is painful: governments invest in AI to serve their citizens better, but the AI itself delivers unequal quality across languages. Not intentionally — but structurally, through training data imbalance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Safety Gap Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;It gets worse. Research shows that low-resource languages exhibit about three times the likelihood of encountering harmful content compared to high-resource languages — and in intentional attack scenarios, unsafe output rates can reach over 80% (&lt;a href="https://arxiv.org/abs/2310.06474" rel="noopener noreferrer"&gt;Deng et al., "Multilingual Jailbreak Challenges in Large Language Models", 2023&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;LLM safety guardrails — the filters that prevent models from generating harmful content — were primarily trained on English data.&lt;/p&gt;

&lt;p&gt;This means a prompt injection attack that would be caught instantly in English can sail right through in Amharic or Lao. The model simply doesn't recognize the harmful intent in languages it barely understands.&lt;/p&gt;

&lt;p&gt;For any organization deploying AI in production — especially in healthcare, finance, or government — this isn't just a quality issue. It's a liability.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Different Approach: Don't Fix the Model, Route Around It
&lt;/h2&gt;

&lt;p&gt;The conventional wisdom says: "Just train better multilingual models." And yes, that's happening. But it's slow, expensive, and may never fully close the gap for the thousands of low-resource languages that lack sufficient training data.&lt;/p&gt;

&lt;p&gt;What if we could get English-level quality from any language, today, without retraining a single model?&lt;/p&gt;

&lt;p&gt;That's the idea behind &lt;a href="https://github.com/tverney/llm-proxy-babylon" rel="noopener noreferrer"&gt;LLM Proxy Babylon&lt;/a&gt; — an open-source proxy I built that sits between your application and any LLM API.&lt;/p&gt;

&lt;p&gt;It detects the input language, decides whether translating to English would improve results, and if so, translates the prompt before sending it to the model. Then it appends a simple instruction: "Please respond in Thai since the original question was asked in Thai."&lt;/p&gt;

&lt;p&gt;LLM Proxy Babylon is named for the city, not the curse. It's an attempt to do what ancient Babylon did: sit at the crossroads of languages and make sure everyone gets understood.&lt;/p&gt;

&lt;p&gt;The key insight: the real performance gap is in understanding non-English prompts, so translating the input is what restores the model's reasoning quality. Generating output in a specified language is something most models already handle well, so the response is normally left to the model itself; for low-resource languages where generation is also lossy, the proxy can optionally translate the output as well. In short: translate the input (where the model is weak) and let the model handle the output (where it's usually strong).&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Results
&lt;/h2&gt;

&lt;p&gt;I tested this with Mistral 7B on a Thai prompt about bubble sort complexity. The results were dramatic:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without the optimizer (direct Thai):&lt;/strong&gt; The model produced garbled output mixing English fragments into Thai text ("วงจirkle", "sorteering technique"), with confused, repetitive reasoning. 1,749 tokens of mostly noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With the optimizer (translated to English first):&lt;/strong&gt; The same model produced a clean, structured response correctly explaining O(n²) vs O(n log n) complexity, listing Merge Sort, Quick Sort, and Heap Sort with accurate Big-O analysis — all returned in Thai. 1,446 tokens of useful content.&lt;/p&gt;

&lt;p&gt;The model's reasoning capability was there all along. It just couldn't access it through Thai input.&lt;/p&gt;

&lt;p&gt;I also benchmarked Amazon Nova Lite across multiple languages using the built-in evaluation harness:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Quality Score&lt;/th&gt;
&lt;th&gt;Delta from English&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;English (baseline)&lt;/td&gt;
&lt;td&gt;0.949&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Portuguese&lt;/td&gt;
&lt;td&gt;0.763&lt;/td&gt;
&lt;td&gt;-0.19&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Korean&lt;/td&gt;
&lt;td&gt;0.663&lt;/td&gt;
&lt;td&gt;-0.29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Japanese&lt;/td&gt;
&lt;td&gt;0.595&lt;/td&gt;
&lt;td&gt;-0.35&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Thai&lt;/td&gt;
&lt;td&gt;0.456&lt;/td&gt;
&lt;td&gt;-0.49&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern maps exactly to language resource availability. Portuguese (high-resource) takes the smallest hit. Thai (low-resource) loses nearly half the quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The proxy exposes an OpenAI-compatible API, so it works as a drop-in replacement with any framework — LangChain, Strands Agents, or any OpenAI SDK client. Just change the base URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.models.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIModel&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;base_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-needed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.amazon.nova-lite-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;อธิบายแนวคิดของ recursion ในการเขียนโปรแกรม&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, each request flows through a pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Detect&lt;/strong&gt; the language (using franc for BCP-47 identification)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parse&lt;/strong&gt; mixed content (preserve code blocks, URLs, JSON — only translate natural language)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classify&lt;/strong&gt; the task type (reasoning, math, code-generation, culturally-specific)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route&lt;/strong&gt; — decide whether to translate, skip, or use hybrid mode&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Translate&lt;/strong&gt; the prompt to English (if beneficial)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inject&lt;/strong&gt; a language instruction ("Please respond in Thai...")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forward&lt;/strong&gt; to the LLM (supports AWS Bedrock and OpenAI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Return&lt;/strong&gt; the response to the client&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The routing engine is smart about when NOT to translate. Culturally-specific questions ("What's good tonight in Paris?") skip translation because the model needs cultural context, not English reasoning. English prompts skip entirely. The system only translates when it expects a quality improvement.&lt;/p&gt;
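
&lt;p&gt;As an illustration only (this is a simplified sketch, not the proxy's actual code), the routing decision boils down to something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def route(prompt: str, language: str, task_type: str, translate) -&gt; str:
    """Return the prompt that actually gets forwarded to the LLM."""
    if language == "English" or task_type == "culturally-specific":
        return prompt  # skip: nothing to gain, or cultural context would be lost
    english = translate(prompt)  # any backend: Amazon Translate, LibreTranslate, ...
    return english + f"\n\nPlease respond in {language}, since the original question was asked in {language}."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;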

&lt;h3&gt;
  
  
  Built on AWS
&lt;/h3&gt;

&lt;p&gt;The proxy supports AWS Bedrock natively via the Converse API. Authentication is handled automatically through the AWS SDK — no API keys needed in requests. I tested with Amazon Nova Lite and Mistral 7B, both available on Bedrock.&lt;/p&gt;

&lt;p&gt;For translation, it supports Amazon Translate ($15/1M characters, high quality for proper nouns and technical content) and LibreTranslate (self-hosted, free) out of the box, with a pluggable interface for DeepL or Google Translate. Just set &lt;code&gt;TRANSLATOR_BACKEND=amazon-translate&lt;/code&gt; to switch; it uses your existing AWS credentials.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Conversation Cache: Solving the Multi-Turn Problem
&lt;/h2&gt;

&lt;p&gt;Multi-turn conversations are where the token tax really hurts. Every request sends the full history, and that history is in the user's language — eating 2-4x more tokens per turn.&lt;/p&gt;

&lt;p&gt;The proxy includes a conversation translation cache. Pass an &lt;code&gt;X-Conversation-Id&lt;/code&gt; header and previously translated messages are pulled from cache instead of being re-translated. By turn 10, you get 9 cache hits and only 1 miss per request — 9 translation API calls saved, and the LLM always sees a lean English context window.&lt;/p&gt;
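
&lt;p&gt;With the OpenAI Python SDK that's just an extra header per request (same base URL as the earlier example; the conversation ID is any stable string your application generates):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="us.amazon.nova-lite-v1:0",
    messages=[{"role": "user", "content": "ขั้นตอนถัดไปคืออะไร"}],  # "What's the next step?" in Thai
    extra_headers={"X-Conversation-Id": "user-42-session-7"},  # lets the proxy reuse cached translations
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;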

&lt;h2&gt;
  
  
  Beyond Quality: Safety as a Side Effect
&lt;/h2&gt;

&lt;p&gt;By translating low-resource language prompts to English before sending them to the LLM, the optimizer routes every prompt through the model's strongest safety alignment. A harmful prompt in Thai or Amharic gets evaluated by English-trained guardrails operating at full strength, rather than the weaker low-resource language alignment.&lt;/p&gt;

&lt;p&gt;This isn't a complete safety solution — but for the common case, it significantly narrows the 3x safety gap between high-resource and low-resource languages identified by Deng et al.&lt;/p&gt;

&lt;h2&gt;
  
  
  But What If Models Get Better at Multilingual?
&lt;/h2&gt;

&lt;p&gt;They will — and the optimizer is designed for that.&lt;/p&gt;

&lt;p&gt;The token cost problem is structural, not a training problem. BPE tokenizers will always split Thai, Arabic, and Korean into 2-4x more tokens than semantically equivalent English. Unless providers fundamentally redesign their tokenizers and retrain everything, the cost disparity persists regardless of how multilingual the model becomes.&lt;/p&gt;

&lt;p&gt;Conversation history compounding doesn't go away either. Even a perfectly multilingual model still charges per token. A 10-turn Thai conversation still accumulates tokens 3x faster than English. The conversation translation cache solves this at the infrastructure level.&lt;/p&gt;

&lt;p&gt;RAG retrieval is an embedding problem, not an LLM problem. Vector embeddings are English-centric. Translating queries to English before retrieval improves recall regardless of how good the LLM itself is at understanding Thai.&lt;/p&gt;

&lt;p&gt;Fine-tuning ROI is permanent. Companies fine-tune on English domain data. A perfectly multilingual base model still won't have that domain-specific knowledge accessible through non-English prompts unless the fine-tuning data was also multilingual — which it almost never is.&lt;/p&gt;

&lt;p&gt;Safety alignment will always lag for low-resource languages. Even as models improve, safety training data will remain English-heavy. Routing through English for safety filtering is a defense-in-depth strategy that stays relevant.&lt;/p&gt;

&lt;p&gt;And the adaptive router handles the transition gracefully. As models get better at specific languages, the shadow evaluator detects that translation no longer helps, and the router automatically switches to skip. The proxy doesn't fight against model improvements — it adapts to them. For a language where the model reaches English parity, the proxy becomes a transparent pass-through with zero overhead.&lt;/p&gt;

&lt;p&gt;Today the proxy is primarily about quality. As models improve, it becomes primarily about cost optimization, safety, and RAG. The architecture already supports that transition because routing decisions are data-driven, not hardcoded assumptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This is an open-source project and there's a lot more to explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAG improvement&lt;/strong&gt; — translate queries to English before vector retrieval for better recall (current architecture supports it)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning ROI&lt;/strong&gt; — ensure non-English users benefit from English-only fine-tuned models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dialect detection&lt;/strong&gt; — handle Egyptian Arabic vs Modern Standard Arabic, European vs Brazilian Portuguese&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Question We Should Be Asking
&lt;/h2&gt;

&lt;p&gt;The next time someone says "all LLMs are the same," ask them: &lt;strong&gt;in which language?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI won't truly be intelligent until it understands every language, every culture. Until then, tools like LLM Proxy Babylon can bridge the gap — giving every user English-level quality, regardless of what language they think in.&lt;/p&gt;

&lt;p&gt;The code is open source: &lt;a href="https://github.com/tverney/llm-proxy-babylon" rel="noopener noreferrer"&gt;github.com/tverney/llm-proxy-babylon&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;273 property-based tests. Real benchmarks. Ready to deploy.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://builder.aws.com/content/3BfRX8ILgQnT0aO1vmYWvgVCHKT/the-irony-of-language-models-that-dont-speak-your-language" rel="noopener noreferrer"&gt;AWS Builder Center&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>i18n</category>
      <category>llm</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
