<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Thiago V.</title>
    <description>The latest articles on Forem by Thiago V. (@tverney_77).</description>
    <link>https://forem.com/tverney_77</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3852303%2Fc28fe7f4-a933-407f-8c6b-742b24f97742.jpeg</url>
      <title>Forem: Thiago V.</title>
      <link>https://forem.com/tverney_77</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tverney_77"/>
    <language>en</language>
    <item>
      <title>Kiro Forgets Everything Every Session. So I've Built It a Memory.</title>
      <dc:creator>Thiago V.</dc:creator>
      <pubDate>Mon, 27 Apr 2026 00:29:07 +0000</pubDate>
      <link>https://forem.com/tverney_77/kiro-forgets-everything-every-session-so-ive-built-it-a-memory-1e86</link>
      <guid>https://forem.com/tverney_77/kiro-forgets-everything-every-session-so-ive-built-it-a-memory-1e86</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally posted on &lt;a href="https://builder.aws.com/content/3CuvRVmbvmB0YYX2ekY1BCcFRJ5/give-kiro-cli-memory-that-survives-sessions" rel="noopener noreferrer"&gt;AWS Builder&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I work with an AI every day. It's smart. It writes decent code. And every single morning, it forgets who I am.&lt;/p&gt;

&lt;p&gt;I open &lt;code&gt;kiro-cli chat&lt;/code&gt;, and the first 10 minutes are the same tax I paid yesterday:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Yes, we use pnpm. No, not npm. Yes, Vitest. No, not Jest. The main entry is &lt;code&gt;src/cli.ts&lt;/code&gt;. We already decided to use &lt;code&gt;Result&amp;lt;T, E&amp;gt;&lt;/code&gt; at the CLI boundary. You told me that last week. I told you that last week. We had this exact conversation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My teammate calls it the &lt;strong&gt;project re-discovery tax&lt;/strong&gt;. Every session, you pay it. Every. Session.&lt;/p&gt;

&lt;p&gt;I got tired of paying it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the obvious fixes didn't work
&lt;/h2&gt;

&lt;p&gt;I tried the obvious things first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Just use steering files."&lt;/strong&gt; Steering files are great for &lt;em&gt;what is this project&lt;/em&gt;. They're static markdown you maintain by hand. They don't capture &lt;em&gt;what the AI figured out during a session&lt;/em&gt;. The whole point of working with an AI is that it learns things with you. Steering files can't capture that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Tell the agent to call a &lt;code&gt;remember()&lt;/code&gt; tool when it learns something."&lt;/strong&gt; I tried this. Claude is inconsistent about when to call it. GPT is inconsistent. Kiro is inconsistent. Every model is inconsistent, because memory management is a side-quest to whatever task you're actually doing. The agent forgets to remember. Turtles all the way down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Use a SQLite knowledge graph MCP server."&lt;/strong&gt; Same problem. Fancier storage, same failure mode. The agent still has to decide when to store.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Wait for Kiro to ship it."&lt;/strong&gt; There's a proposal floating around for &lt;code&gt;.kiro/tasks/*.md&lt;/code&gt; with auto-read/auto-write. No ETA. I had work to do this week.&lt;/p&gt;




&lt;h2&gt;
  
  
  The insight that actually fixed it
&lt;/h2&gt;

&lt;p&gt;Here's what clicked for me, and I'll give credit where it's due — it came from a design doc by a coworker; I just productized it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agent should be a reader of memory, not a writer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Writing memory is a &lt;em&gt;different job&lt;/em&gt; from using memory. They should not share a context window. The writer can be slow, deliberate, even expensive. The reader needs to be fast, cheap, and running on every session start.&lt;/p&gt;

&lt;p&gt;So I split them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────┐     MCP/stdio     ┌────────────────────┐     filesystem      ┌────────────────────────┐
│   Kiro CLI   │ ◄───────────────► │ mcp-agent-memory   │ ◄─────────────────► │ agent-memory-daemon    │
│ (the reader) │                   │  (MCP server)      │   ~/.agent-memory/   │   (the writer)         │
└──────────────┘                   └────────────────────┘                     └────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Kiro reads.&lt;/strong&gt; The MCP server gives it three tools: &lt;code&gt;memory_read&lt;/code&gt;, &lt;code&gt;memory_append_session&lt;/code&gt;, &lt;code&gt;memory_search&lt;/code&gt;. That's it. Nothing fancy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A background daemon writes.&lt;/strong&gt; It watches the sessions directory, reads session summaries on a cadence, runs them through an LLM to extract durable facts, and updates markdown files in &lt;code&gt;~/.agent-memory/memory/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;They never talk directly. The filesystem is the contract. &lt;code&gt;~/.agent-memory/&lt;/code&gt; is all they share.&lt;/p&gt;

&lt;p&gt;Kiro burns zero tokens on memory management. The heavy lifting happens async, outside the chat.&lt;/p&gt;
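
&lt;p&gt;To make the split concrete, here's a rough sketch of the writer's loop. It isn't the daemon's actual source, just the shape of the job: the &lt;code&gt;~/.agent-memory/sessions/&lt;/code&gt; path and &lt;code&gt;extract_facts()&lt;/code&gt; are stand-ins.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the writer side (illustrative, not the daemon's real code).
# Assumed layout: sessions in ~/.agent-memory/sessions/, memory in ~/.agent-memory/memory/.
import time
from pathlib import Path

BASE = Path.home() / ".agent-memory"
SESSIONS = BASE / "sessions"
MEMORY = BASE / "memory"

def extract_facts(session_text):
    """Placeholder for the LLM pass that pulls durable facts out of a session."""
    raise NotImplementedError

seen = {}  # session filename mapped to the mtime we last processed

while True:
    for session in SESSIONS.glob("*.md"):
        mtime = session.stat().st_mtime
        if seen.get(session.name) == mtime:
            continue  # nothing new since the last pass
        facts = extract_facts(session.read_text())
        # The real daemon dedupes and merges; this sketch just writes one file.
        (MEMORY / "project-preferences.md").write_text(facts)
        seen[session.name] = mtime
    time.sleep(300)  # slow and deliberate, entirely outside the chat context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;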




&lt;h2&gt;
  
  
  What it looks like now
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Monday:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kiro-cli chat
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; We use pnpm. Never suggest npm. Vitest not Jest. Main entry is src/cli.ts.
  I prefer explicit &lt;span class="k"&gt;return &lt;/span&gt;types.

&lt;span class="o"&gt;[&lt;/span&gt;Kiro does work &lt;span class="k"&gt;for &lt;/span&gt;20 minutes]

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Great, call memory_append_session with a summary of what we agreed on.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terminal closes. Life moves on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tuesday:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kiro-cli chat
&lt;span class="o"&gt;[&lt;/span&gt;Kiro automatically calls memory_read per my steering rule]

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; I see we use pnpm, Vitest &lt;span class="o"&gt;(&lt;/span&gt;not Jest&lt;span class="o"&gt;)&lt;/span&gt;, src/cli.ts as the main entry, and
  you prefer explicit &lt;span class="k"&gt;return &lt;/span&gt;types. What are we working on today?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;No re-explaining. No pasted summary.&lt;/strong&gt; The AI just remembers.&lt;/p&gt;

&lt;p&gt;Between sessions, the daemon woke up, read Monday's session file, extracted the durable facts, deduplicated them against what it already knew, and updated &lt;code&gt;~/.agent-memory/memory/project-preferences.md&lt;/code&gt;. I didn't lift a finger.&lt;/p&gt;




&lt;h2&gt;
  
  
  The part I'm most proud of: it costs almost nothing
&lt;/h2&gt;

&lt;p&gt;The daemon runs an LLM to do the extraction. LLMs cost money. I didn't want this tool to quietly drain my Bedrock bill.&lt;/p&gt;

&lt;p&gt;So I added a &lt;strong&gt;Kiro backend&lt;/strong&gt;. Instead of calling Bedrock or OpenAI, the daemon shells out to &lt;code&gt;kiro-cli&lt;/code&gt; itself using your existing Kiro credits. Paired with a lean consolidation agent config (ships with the package), each extraction pass costs about &lt;strong&gt;0.01 Kiro credits&lt;/strong&gt;. Default agent would have been ~0.07. That 7× savings is the difference between "nice-to-have" and "forgot it was running."&lt;/p&gt;

&lt;p&gt;You can still pick Bedrock or OpenAI if that's your stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; mcp-agent-memory
mcp-agent-memory &lt;span class="nt"&gt;--setup&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The wizard walks you through picking a backend, registering with Kiro (and Claude Desktop and Cursor if you want), and installing the daemon as a LaunchAgent on macOS.&lt;/p&gt;

&lt;p&gt;Add this one-line steering rule at &lt;code&gt;~/.kiro/steering/memory.md&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;At the start of every session, call memory_read (no arguments) to load my
memory index. When you learn something durable about me, my projects, or
my preferences, call memory_append_session with a concise markdown summary.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart &lt;code&gt;kiro-cli&lt;/code&gt;. That's it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reading the memory yourself
&lt;/h2&gt;

&lt;p&gt;The memory isn't a black box. It's just markdown files in &lt;code&gt;~/.agent-memory/memory/&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; ~/.agent-memory/memory/
MEMORY.md  cli-architecture.md  project-preferences.md  team-processes.md

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; ~/.agent-memory/memory/project-preferences.md
&lt;span class="c"&gt;# Project Preferences&lt;/span&gt;
- Package manager: pnpm &lt;span class="o"&gt;(&lt;/span&gt;never npm&lt;span class="o"&gt;)&lt;/span&gt;
- Testing framework: Vitest &lt;span class="o"&gt;(&lt;/span&gt;not Jest&lt;span class="o"&gt;)&lt;/span&gt;
- Main entry: src/cli.ts
- Return types: explicit, not inferred
...

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s2"&gt;"Vitest"&lt;/span&gt; ~/.agent-memory/memory/
project-preferences.md:- Testing framework: Vitest &lt;span class="o"&gt;(&lt;/span&gt;not Jest&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;cat&lt;/code&gt; works. &lt;code&gt;grep&lt;/code&gt; works. &lt;code&gt;git&lt;/code&gt; works. If you hate what it stored, delete the file.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this isn't
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not&lt;/strong&gt; a knowledge graph with vector search. If you want that, &lt;a href="https://github.com/Auriti-Labs/kiro-memory" rel="noopener noreferrer"&gt;totalrecallai&lt;/a&gt; does it beautifully — SQLite, embeddings, web dashboard, the works.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not&lt;/strong&gt; AgentCore Memory. That's a managed Bedrock service. This runs on your laptop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not&lt;/strong&gt; a replacement for steering files. Steering is "what is this project." Memory is "what have we learned together." Use both.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why this flavor exists
&lt;/h2&gt;

&lt;p&gt;If I were going to pay for a heavyweight memory system, I probably wouldn't have built this. But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I wanted memory as &lt;strong&gt;plain markdown&lt;/strong&gt; I could read, grep, and version-control.&lt;/li&gt;
&lt;li&gt;I wanted &lt;strong&gt;Kiro CLI support&lt;/strong&gt; specifically (totalrecallai doesn't list it — targets Claude Code, Cursor, Windsurf, Cline).&lt;/li&gt;
&lt;li&gt;I wanted &lt;strong&gt;near-zero ongoing cost&lt;/strong&gt; via the Kiro backend.&lt;/li&gt;
&lt;li&gt;I wanted the MCP server's surface area to be &lt;strong&gt;tiny&lt;/strong&gt; — 3 tools, no dashboard, no SDK.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If that set of constraints sounds right for you, this is your tool. If you want the database-backed semantic-search dashboard experience, try totalrecallai — it's genuinely great at what it does.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;npm:&lt;/strong&gt; &lt;a href="https://www.npmjs.com/package/mcp-agent-memory" rel="noopener noreferrer"&gt;mcp-agent-memory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/tverney/mcp-agent-memory" rel="noopener noreferrer"&gt;tverney/mcp-agent-memory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The async daemon:&lt;/strong&gt; &lt;a href="https://builder.aws.com/content/3C29ijZpMg6xOxI9Rddl73OMaCX/memory-consolidation-daemon-for-ai-agents-with-bedrock" rel="noopener noreferrer"&gt;previous post&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP spec:&lt;/strong&gt; &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you try it and something breaks, file an issue. If you've got a pattern for what &lt;em&gt;should&lt;/em&gt; be memorable vs. forgettable, drop it in the comments — that's the next hard problem I don't have a great answer for yet.&lt;/p&gt;

&lt;p&gt;Tomorrow morning, Kiro will remember who I am. I'm not a stranger to it anymore.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kiro</category>
      <category>mcp</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Bedrock prompt caching only caches a stable prefix. If you inject memory into user content, you pay full price every turn. Here's what I learned wiring up persistent memory for OpenClaw on AgentCore Runtime. #openclawchallenge 🦞</title>
      <dc:creator>Thiago V.</dc:creator>
      <pubDate>Fri, 24 Apr 2026 22:28:23 +0000</pubDate>
      <link>https://forem.com/tverney_77/bedrock-prompt-caching-only-caches-a-stable-prefix-if-you-inject-memory-into-user-content-you-pay-3e3l</link>
      <guid>https://forem.com/tverney_77/bedrock-prompt-caching-only-caches-a-stable-prefix-if-you-inject-memory-into-user-content-you-pay-3e3l</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/tverney_77/persistent-memory-for-openclaw-on-bedrock-getting-prompt-caching-right-3o6c" class="crayons-story__hidden-navigation-link"&gt;Memory Daemon for OpenClaw: How I Got Bedrock Prompt Caching Right&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
      &lt;a href="https://dev.to/tverney_77/persistent-memory-for-openclaw-on-bedrock-getting-prompt-caching-right-3o6c" class="crayons-article__context-note crayons-article__context-note__feed"&gt;&lt;p&gt;OpenClaw Challenge Submission 🦞&lt;/p&gt;

&lt;/a&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/tverney_77" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3852303%2Fc28fe7f4-a933-407f-8c6b-742b24f97742.jpeg" alt="tverney_77 profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/tverney_77" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Thiago V.
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Thiago V.
                
              
              &lt;div id="story-author-preview-content-3547709" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/tverney_77" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3852303%2Fc28fe7f4-a933-407f-8c6b-742b24f97742.jpeg" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Thiago V.&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/tverney_77/persistent-memory-for-openclaw-on-bedrock-getting-prompt-caching-right-3o6c" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 24&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/tverney_77/persistent-memory-for-openclaw-on-bedrock-getting-prompt-caching-right-3o6c" id="article-link-3547709"&gt;
          Memory Daemon for OpenClaw: How I Got Bedrock Prompt Caching Right
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/openclawchallenge"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;openclawchallenge&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/openclaw"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;openclaw&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/aws"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;aws&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/bedrock"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;bedrock&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/tverney_77/persistent-memory-for-openclaw-on-bedrock-getting-prompt-caching-right-3o6c" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/exploding-head-daceb38d627e6ae9b730f36a1e390fca556a4289d5a41abb2c35068ad3e2c4b5.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/multi-unicorn-b44d6f8c23cdd00964192bedc38af3e82463978aa611b4365bd33a0f1f4f3e97.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;5&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/tverney_77/persistent-memory-for-openclaw-on-bedrock-getting-prompt-caching-right-3o6c#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              1&lt;span class="hidden s:inline"&gt; comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            6 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>aws</category>
      <category>devchallenge</category>
      <category>llm</category>
      <category>openclaw</category>
    </item>
    <item>
      <title>Memory Daemon for OpenClaw: How I Got Bedrock Prompt Caching Right</title>
      <dc:creator>Thiago V.</dc:creator>
      <pubDate>Fri, 24 Apr 2026 22:23:29 +0000</pubDate>
      <link>https://forem.com/tverney_77/persistent-memory-for-openclaw-on-bedrock-getting-prompt-caching-right-3o6c</link>
      <guid>https://forem.com/tverney_77/persistent-memory-for-openclaw-on-bedrock-getting-prompt-caching-right-3o6c</guid>
      <description>&lt;p&gt;If you're running an AI agent on Amazon Bedrock and injecting persistent memory into every conversation, where you put that memory in the request matters a lot — both for how well the agent uses it and for what it costs you.&lt;/p&gt;

&lt;p&gt;I learned this the direct way while connecting &lt;a href="https://github.com/tverney/agent-memory-daemon" rel="noopener noreferrer"&gt;agent-memory-daemon&lt;/a&gt; to &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; running on Amazon Bedrock AgentCore Runtime. The setup works beautifully. &lt;/p&gt;

&lt;p&gt;My agent now remembers my preferences, my projects, and the weird Bedrock timeout I debugged three weeks ago. &lt;/p&gt;

&lt;p&gt;But along the way I hit a subtle interaction between memory injection and prompt caching that's worth documenting.&lt;/p&gt;

&lt;p&gt;This post walks through the architecture, the Bedrock prompt caching rule that tripped me up, and the one-line fix that cut my cache-related costs dramatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup: persistent memory for a serverless agent
&lt;/h2&gt;

&lt;p&gt;OpenClaw lives in a container on AgentCore Runtime. &lt;/p&gt;

&lt;p&gt;AgentCore freezes the container when idle, which is great for cost (zero idle spend) but hostile to long-term memory (every wake is a blank slate). &lt;code&gt;agent-memory-daemon&lt;/code&gt; solves this by running as a background process in the same container, doing two things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extraction&lt;/strong&gt; — watches the session transcript directory and pulls out facts, decisions, and preferences worth remembering. Writes them as individual markdown files with YAML frontmatter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consolidation&lt;/strong&gt; — periodically reorganizes the memory directory: merges duplicates, resolves contradictions, prunes stale content, and maintains a concise &lt;code&gt;MEMORY.md&lt;/code&gt; index under a strict size budget.&lt;/p&gt;

&lt;p&gt;Memory is synced to S3 between invocations. When a new conversation starts, the container restores the memory directory and reads &lt;code&gt;MEMORY.md&lt;/code&gt; to bring the agent up to speed.&lt;/p&gt;
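
&lt;p&gt;For reference, the sync step is only a few lines of boto3. This is a sketch rather than the actual &lt;code&gt;server.py&lt;/code&gt; glue; the bucket name, key prefix, and &lt;code&gt;/app/memory&lt;/code&gt; path are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the S3 restore/persist step (placeholder bucket, prefix, and local path).
import boto3
from pathlib import Path

s3 = boto3.client("s3")
BUCKET = "my-agent-memory-bucket"   # placeholder
PREFIX = "openclaw/memory/"         # placeholder
MEMORY_DIR = Path("/app/memory")    # placeholder for the container's memory directory

def restore_memory():
    """Pull the curated memory down before the conversation starts."""
    MEMORY_DIR.mkdir(parents=True, exist_ok=True)
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    for obj in resp.get("Contents", []):
        name = obj["Key"].removeprefix(PREFIX)
        if name:
            s3.download_file(BUCKET, obj["Key"], str(MEMORY_DIR / name))
    index = MEMORY_DIR / "MEMORY.md"
    return index.read_text() if index.exists() else ""

def persist_memory():
    """Push whatever the daemon wrote back up after the invocation."""
    for path in MEMORY_DIR.glob("*.md"):
        s3.upload_file(str(path), BUCKET, PREFIX + path.name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;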

&lt;p&gt;The daemon itself is cheap. &lt;/p&gt;

&lt;p&gt;It makes a few Haiku calls per day — my config targets about $0.25/month for the daemon's own LLM usage. The magic happens in what it produces: a curated, size-budgeted &lt;code&gt;MEMORY.md&lt;/code&gt; that's always ~18KB regardless of how many sessions the agent has had.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Discord → EC2 bot → AgentCore Runtime → container
                                            ├── openclaw (the agent)
                                            ├── agent-memory-daemon (curator)
                                            └── server.py (HTTP glue + S3 sync)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The daemon writes files. The agent reads files. The filesystem is the interface. No SDK, no API, no coupling.&lt;/p&gt;
&lt;h2&gt;
  
  
  Injecting the memory
&lt;/h2&gt;

&lt;p&gt;On every invocation, I load &lt;code&gt;MEMORY.md&lt;/code&gt; from S3 and pass it to OpenClaw as context. My first version looked like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;memory_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_memory_from_s3&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# ~18KB of curated memory
&lt;/span&gt;
&lt;span class="n"&gt;effective_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;memory_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;effective_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[LONG-TERM MEMORY - persisted memory from previous sessions]&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;memory_context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[END OF MEMORY]&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User message: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;effective_message&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;I stuffed the memory into the user message. The agent saw it. It remembered my preferences. Everything worked.&lt;/p&gt;

&lt;p&gt;I also had Bedrock prompt caching enabled through OpenClaw's config:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"agents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"defaults"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"amazon-bedrock/...claude-haiku-4-5..."&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"cacheRetention"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"short"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Claude Haiku 4.5 supports prompt caching with a 5-minute TTL on the "short" retention mode. &lt;/p&gt;

&lt;p&gt;Cache reads are billed at ~10% of the regular input rate. On paper, my 18KB memory (~4,500 tokens) should have been getting served from cache at roughly a tenth of the price on every turn after the first.&lt;/p&gt;
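
&lt;p&gt;Back-of-the-envelope, using those same numbers and ignoring the cache-write premium, caching the memory should cut its share of a multi-turn day by roughly 9×:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough math with the numbers above; rates are relative, not dollar prices,
# and the cache-write premium on the first turn is ignored.
memory_tokens = 4_500        # the ~18KB MEMORY.md
turns = 100                  # a multi-turn day
cache_read_rate = 0.10       # cache reads bill at ~10% of the regular input rate

uncached = memory_tokens * turns                                  # what I was paying
cached = memory_tokens * (1 + (turns - 1) * cache_read_rate)      # what caching should cost

print(round(uncached / cached, 1))   # ~9.2x cheaper for the memory portion
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;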

&lt;p&gt;Then I looked at Cost Explorer.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Bedrock actually caches
&lt;/h2&gt;

&lt;p&gt;Three days of usage, broken down by token type:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Line item&lt;/th&gt;
&lt;th&gt;Tokens (millions)&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cache Read&lt;/td&gt;
&lt;td&gt;12.69&lt;/td&gt;
&lt;td&gt;$1.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache Write&lt;/td&gt;
&lt;td&gt;7.09&lt;/td&gt;
&lt;td&gt;$9.75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input (uncached)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;31.91&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$35.10&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;4.72&lt;/td&gt;
&lt;td&gt;$25.96&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The "Input (uncached)" line is the one that doesn't make sense if caching is working. I had 12.69M cache reads, which meant &lt;em&gt;something&lt;/em&gt; was being cached — OpenClaw's internal system prompt was getting cached fine. But 31.91M tokens were paying full input price. Where were they coming from?&lt;/p&gt;

&lt;p&gt;Here's the rule that trips people up: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bedrock prompt caching caches a stable prefix.&lt;/strong&gt; It looks at the beginning of the request, finds the longest chunk that's identical to a previously-cached request, and serves that from cache. Everything after the divergence point is recomputed and billed as regular input.&lt;/p&gt;

&lt;p&gt;Now look at my code again:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;effective_message&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;effective_message&lt;/code&gt; is &lt;code&gt;"[LONG-TERM MEMORY]...18KB of memory...User message: {message}"&lt;/code&gt;. The user's actual question is appended at the end.&lt;/p&gt;

&lt;p&gt;What Bedrock sees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Turn 1: &lt;code&gt;messages[0].content = "[MEMORY]...same 18KB...User message: what time is it?"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Turn 2: &lt;code&gt;messages[0].content = "[MEMORY]...same 18KB...User message: tell me a joke"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those two strings share a stable 18KB prefix of memory content, but they're both in &lt;code&gt;messages[0].content&lt;/code&gt;. The cacheable prefix is actually the &lt;em&gt;system prompt that OpenClaw builds on top&lt;/em&gt; — OpenClaw's own system content, its tool definitions, its skill metadata. &lt;/p&gt;

&lt;p&gt;Once the request stream reaches the user message, Bedrock sees variance (the actual user question) and stops caching.&lt;/p&gt;

&lt;p&gt;So the memory was sitting in a position where it couldn't be cached. Every turn paid full price for those 4,500 tokens.&lt;/p&gt;
&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;The change is small. Move the memory to a &lt;code&gt;system&lt;/code&gt; message, before the user message:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;memory_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You have access to long-term memory from previous sessions. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Use this to answer questions about the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s preferences and history.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;memory_context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now the memory is part of the stable system prefix. It sits alongside OpenClaw's own system prompt, tool definitions, and skills — the stuff that genuinely doesn't change between turns. Bedrock sees the same system block on every request and serves it from cache at 10% of the regular rate.&lt;/p&gt;

&lt;p&gt;A one-line architectural change. A 90% discount on the biggest line item in the bill.&lt;/p&gt;
&lt;h2&gt;
  
  
  Verifying it worked
&lt;/h2&gt;

&lt;p&gt;After deploying, I asked OpenClaw for its usage stats via the &lt;code&gt;/usage full&lt;/code&gt; chat command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🦞 OpenClaw 2026.2.26
🧮 Tokens: 9 in / 516 out
🗄️ Cache: 99% hit · 67k cached, 715 new
📚 Context: 34k/200k (17%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;67K tokens served from cache, only 715 new tokens computed. Before the fix, the 4,500-token memory injection was in the "new" bucket every turn. Now it's in the 67K cached bucket.&lt;/p&gt;

&lt;p&gt;The change to Cost Explorer followed. The "Input (uncached)" line dropped, and the "Cache Read" line absorbed that traffic at a tenth of the price.&lt;/p&gt;
&lt;h2&gt;
  
  
  Three takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Prompt caching only caches a stable prefix.&lt;/strong&gt; Everything up to the first point of variance between requests is cacheable. Everything after is not. If you're injecting repeated context, put it early in the request — system prompt, tool definitions, or the first message of a consistent message sequence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. User content is almost always the wrong place for stable context.&lt;/strong&gt; The user's actual question varies every turn. Anything you concatenate with it inherits that variance and becomes uncacheable. Pull it out into a system message.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Watch cache writes in your bill.&lt;/strong&gt; Cache writes cost &lt;em&gt;more&lt;/em&gt; than regular input (1.25x on Haiku 4.5). If you see high cache writes, it means your TTL is expiring between requests and the cache is being rewritten each time. Keep the cache warm — for &lt;code&gt;cacheRetention: "short"&lt;/code&gt; (5-min TTL), a heartbeat every ~4 minutes avoids cold-cache rewrites.&lt;/p&gt;
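
&lt;p&gt;If you do want a heartbeat, the simplest version is a scheduled no-op invocation. This is a sketch of the idea only; &lt;code&gt;invoke_agent()&lt;/code&gt; is a placeholder for however you actually reach your agent, and whether the heartbeat pays for itself depends on your traffic pattern.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Cache-warming heartbeat sketch: one trivial request every ~4 minutes keeps the
# 5-minute "short" retention TTL from expiring between real conversations.
import time

HEARTBEAT_SECONDS = 240  # comfortably under the 5-minute TTL

def invoke_agent(message):
    raise NotImplementedError  # placeholder: AgentCore endpoint, HTTP glue, etc.

while True:
    invoke_agent("heartbeat: keep the prompt cache warm, reply with OK")
    time.sleep(HEARTBEAT_SECONDS)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;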
&lt;h2&gt;
  
  
  The daemon, revisited
&lt;/h2&gt;

&lt;p&gt;None of this is a critique of &lt;code&gt;agent-memory-daemon&lt;/code&gt; — the daemon did exactly what it was supposed to do. It produced a stable, size-budgeted 18KB memory file. &lt;/p&gt;

&lt;p&gt;The integration code I wrote around it was putting that output in the wrong place.&lt;/p&gt;

&lt;p&gt;In fact, the daemon's design (stable output size, consistent content, regular regeneration rhythm) is &lt;em&gt;ideal&lt;/em&gt; for prompt caching. As long as you feed it into a system message, Bedrock can cache the whole thing for the TTL window, and the daemon's periodic consolidation doesn't bust the cache more often than necessary.&lt;/p&gt;

&lt;p&gt;If you're running OpenClaw or any agent on Bedrock and want persistent memory without a managed memory service, the pattern works well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run &lt;a href="https://github.com/tverney/agent-memory-daemon" rel="noopener noreferrer"&gt;agent-memory-daemon&lt;/a&gt; alongside your agent&lt;/li&gt;
&lt;li&gt;Sync the memory directory to S3 between sessions (or use a mounted filesystem if available)&lt;/li&gt;
&lt;li&gt;Load the curated &lt;code&gt;MEMORY.md&lt;/code&gt; at the start of each conversation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inject it as a system message&lt;/strong&gt;, not user content&lt;/li&gt;
&lt;li&gt;Enable &lt;code&gt;cacheRetention&lt;/code&gt; on your model config&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The daemon handles the hard part (curating memories without bloat). Bedrock handles the cheap part (caching the stable prefix). You just have to put the memory in the right place.&lt;/p&gt;
&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/tverney" rel="noopener noreferrer"&gt;
        tverney
      &lt;/a&gt; / &lt;a href="https://github.com/tverney/agent-memory-daemon" rel="noopener noreferrer"&gt;
        agent-memory-daemon
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Open-source memory manager daemon for AI agents. Filesystem-native, LLM-pluggable, framework-agnostic. Works with OpenClaw, Strands, LangChain, AgentCore Runtime or any agent that can write a file.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Open-source memory manager daemon for AI agents&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;Open-source memory consolidation and extraction daemon for AI agents. Filesystem-native, LLM-pluggable, framework-agnostic.&lt;/p&gt;

&lt;p&gt;Agents feed it raw observations as markdown files; the daemon runs two complementary modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consolidation&lt;/strong&gt; — periodically reorganizes, deduplicates, and prunes existing memory files via a four-phase pass (orient → gather → consolidate → prune)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extraction&lt;/strong&gt; — watches for new session content and runs an LLM pass to identify facts, decisions, preferences, and error corrections worth remembering, writing them as individual memory files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The filesystem is the interface — no SDK, no API, no MCP required. The LLM backend is pluggable (OpenAI, Amazon Bedrock, or anything with a chat API).&lt;/p&gt;

&lt;p&gt;memconsolidate is a standalone, agent-agnostic daemon — available to anyone building with OpenClaw, Strands, LangChain, or any custom agent framework.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;How it works&lt;/h2&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Consolidation (reorganize existing memories)&lt;/h3&gt;
&lt;/div&gt;


&lt;ol&gt;

&lt;li&gt;Agents write markdown memory files (with YAML frontmatter) to a watched directory&lt;/li&gt;

&lt;li&gt;A three-gate…&lt;/li&gt;

&lt;/ol&gt;
&lt;/div&gt;
&lt;br&gt;
  &lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/tverney/agent-memory-daemon" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;



&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/tverney" rel="noopener noreferrer"&gt;
        tverney
      &lt;/a&gt; / &lt;a href="https://github.com/tverney/openclaw-agentcore-personal" rel="noopener noreferrer"&gt;
        openclaw-agentcore-personal
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Deploy your own personal OpenClaw on AWS Bedrock AgentCore — serverless, ~$9/mo, one-click CloudFormation, Discord/WhatsApp/Telegram 🦞
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Deploy Your Personal OpenClaw on AWS AgentCore — Serverless, ~$9/month&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href="https://github.com/tverney/openclaw-agentcore-personal/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fdf2982b9f5d7489dcf44570e714e3a15fce6253e0cc6b5aa61a075aac2ff71b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667" alt="License: MIT"&gt;&lt;/a&gt;
&lt;a href="https://github.com/tverney/openclaw-agentcore-personal/openclaw-simplified.yaml" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/8612cbc353da890d8c8059074a18d758fe2c3ef3e533eee1ec311668c6946992/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4157532d436c6f7564466f726d6174696f6e2d6f72616e67653f6c6f676f3d616d617a6f6e617773" alt="AWS CloudFormation"&gt;&lt;/a&gt;
&lt;a href="https://python.org" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/b9b0db6ad652ca22a00b92c5440c9f22222c10d3a77c798c712947b517dc3329/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507974686f6e2d332e31302b2d626c75653f6c6f676f3d707974686f6e" alt="Python 3.10+"&gt;&lt;/a&gt;
&lt;a href="https://kiro.dev" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/bdab0209aac85f542446afcb9641e10e2cf0eb5a3c4d328d640e2d069a2532e5/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4275696c74253230776974682d4b69726f2d626c756576696f6c6574" alt="Built with Kiro"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Cost-optimized &lt;a href="https://github.com/aws-samples/sample-OpenClaw-on-AWS-with-Bedrock" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; deployment using AWS Bedrock AgentCore Runtime. Connect via Discord, WhatsApp, Telegram, or Slack. ~$9-15/month infrastructure.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://console.aws.amazon.com/cloudformation/home#/stacks/new?stackName=openclaw-personal&amp;amp;templateURL=https://raw.githubusercontent.com/tverney/openclaw-personal-agentcore/main/openclaw-simplified.yaml" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/6a813feb9856a2908ddb0ec6d67734b3b0792e8941544cf0ec6d16e625754869/68747470733a2f2f73332e616d617a6f6e6177732e636f6d2f636c6f7564666f726d6174696f6e2d6578616d706c65732f636c6f7564666f726d6174696f6e2d6c61756e63682d737461636b2e706e67" alt="Launch Stack"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What Is This?&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;A single-user, serverless deployment of OpenClaw on AWS. Instead of running an EC2 instance 24/7, the AI runs on-demand via AgentCore Runtime — the container freezes between invocations, so you only pay when you use it.&lt;/p&gt;
&lt;p&gt;All messaging plugins (WhatsApp, Telegram, Discord, Slack) are pre-installed in OpenClaw. This template includes a Discord bot by default, but you can connect any platform directly through the OpenClaw Web UI.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Architecture&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;
&lt;pre class="notranslate"&gt;&lt;code&gt;You (Discord / WhatsApp / Telegram / Slack)
  │
  ▼
┌──────────────────────────────────────────────────────────┐
│  AWS Cloud                                               │
│                                                          │
│  EC2 t4g.nano ──invoke──▶  AgentCore Runtime             │
│  (Discord bot)             (OpenClaw container)          │
│                                │                         │
│                            IAM Role                      │
│                                │                         │
│                            Bedrock                       │
│                          (Haiku/Sonnet/Nova)             │
│                                                          │
│  ┌─────────┐  ┌──────────┐  ┌─────────┐&lt;/code&gt;&lt;/pre&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/tverney/openclaw-agentcore-personal" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;The repo above is the full AgentCore deployment, including the system-message fix and the S3 sync layer.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of the &lt;a href="https://dev.to/t/openclawchallenge"&gt;OpenClaw Challenge&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openclawchallenge</category>
      <category>openclaw</category>
      <category>aws</category>
      <category>bedrock</category>
    </item>
    <item>
      <title>Your AI Agent Forgets Everything — Here's a Daemon That Fixes That</title>
      <dc:creator>Thiago V.</dc:creator>
      <pubDate>Tue, 07 Apr 2026 16:08:59 +0000</pubDate>
      <link>https://forem.com/tverney_77/your-ai-agent-forgets-everything-heres-a-daemon-that-fixes-that-5ao0</link>
      <guid>https://forem.com/tverney_77/your-ai-agent-forgets-everything-heres-a-daemon-that-fixes-that-5ao0</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally posted on &lt;a href="https://builder.aws.com/content/3C29ijZpMg6xOxI9Rddl73OMaCX/memory-consolidation-daemon-for-ai-agents-with-bedrock" rel="noopener noreferrer"&gt;AWS Builder&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You spend an hour teaching your agent your project structure, your coding preferences, the weird Bedrock timeout issue you debugged last Tuesday. Next session? Gone. You're back to explaining that you prefer single quotes and that the CI pipeline needs &lt;code&gt;--run&lt;/code&gt; to avoid watch mode.&lt;/p&gt;

&lt;p&gt;Some frameworks have memory plugins. They work — sort of. But they're coupled to one framework, they accumulate junk over time, and nobody's cleaning up the contradictions from three weeks ago when you changed your mind about the database.&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://github.com/tverney/agent-memory-daemon" rel="noopener noreferrer"&gt;agent-memory-daemon&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;It's a background daemon that runs alongside your agent — any agent. It watches a directory of session files and does two things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extraction&lt;/strong&gt; — scans new session transcripts and pulls out facts, decisions, preferences, and error corrections. Writes each one as a structured markdown file with YAML frontmatter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consolidation&lt;/strong&gt; — periodically reviews the entire memory directory. Merges duplicates, converts relative dates to absolute, removes contradicted facts, prunes stale content, and keeps a concise &lt;code&gt;MEMORY.md&lt;/code&gt; index under a size budget.&lt;/p&gt;

&lt;p&gt;The filesystem is the interface. Your agent writes markdown files to a directory. The daemon reads them, thinks about them, and writes organized memories back. No SDK, no API, no MCP server. If your agent can write a file, it works.&lt;/p&gt;
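
&lt;p&gt;Concretely, "if your agent can write a file, it works" looks like this on the agent side. A sketch, assuming the &lt;code&gt;./sessions&lt;/code&gt; directory from the config below; the filename scheme is my own.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# The entire integration surface, from the agent's side: append the session
# transcript to a file in the watched directory. The filename scheme is illustrative.
from datetime import date
from pathlib import Path

SESSIONS = Path("./sessions")
SESSIONS.mkdir(exist_ok=True)

def log_session(summary_markdown):
    session_file = SESSIONS / f"{date.today().isoformat()}.md"
    with session_file.open("a") as f:
        f.write(summary_markdown + "\n\n")

log_session("## Decisions\n- We use pnpm, not npm\n- Tests run with Vitest")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;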

&lt;h2&gt;
  
  
  The "aha" moment
&lt;/h2&gt;

&lt;p&gt;I was running an agent that had accumulated 40+ memory files over a few weeks. Half of them were duplicates with slightly different wording. Three of them contradicted each other about which AWS region we were using. The &lt;code&gt;MEMORY.md&lt;/code&gt; index was 800 lines long and the agent was spending half its context window just reading its own memories.&lt;/p&gt;

&lt;p&gt;That's when I realized: agents need a janitor. Not just a place to store memories, but something that actively curates them.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Extraction (discovering new memories)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Session file modified
        ↓
Cursor check: is this new content?
        ↓
Build prompt: memory manifest + session content
        ↓
LLM identifies facts, decisions, preferences
        ↓
Write structured memory files
        ↓
Advance cursor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The daemon tracks a &lt;code&gt;.extraction-cursor&lt;/code&gt; file — a per-session offset map so it only processes genuinely new content. If a session file gets appended to, it picks up where it left off instead of reprocessing the whole thing.&lt;/p&gt;
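
&lt;p&gt;Conceptually the cursor is nothing more than an offset map: look up where you stopped, read from there, advance. A simplified sketch of the idea, not the daemon's actual format:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Simplified illustration of the cursor idea: a per-session byte offset,
# consulted before each pass and advanced afterwards. Not the real cursor format.
import json
from pathlib import Path

CURSOR = Path("./memory/.extraction-cursor")

def read_new_content(session_path):
    offsets = json.loads(CURSOR.read_text()) if CURSOR.exists() else {}
    start = offsets.get(str(session_path), 0)
    data = session_path.read_bytes()[start:]
    offsets[str(session_path)] = start + len(data)   # advance the cursor
    CURSOR.write_text(json.dumps(offsets))
    return data.decode("utf-8", errors="replace")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;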

&lt;h3&gt;
  
  
  Consolidation (organizing existing memories)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Three-gate trigger: time elapsed + session count + lock
        ↓
Four-phase pass: orient → gather → consolidate → prune
        ↓
Merge duplicates, resolve contradictions
        ↓
Update MEMORY.md index (200 lines / 25KB budget)
        ↓
Release lock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both modes share a PID-based lock and never run concurrently. Consolidation takes priority — if both triggers fire on the same tick, consolidation runs first.&lt;/p&gt;
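
&lt;p&gt;The lock itself is nothing exotic. A simplified sketch of the PID-lock idea, not the daemon's exact code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Simplified PID-lock sketch: one lock file holds the owner's PID, and a stale
# lock from a dead process is safe to take over.
import os
from pathlib import Path

LOCK = Path("./memory/.daemon.lock")

def acquire_lock():
    if LOCK.exists():
        try:
            pid = int(LOCK.read_text().strip())
            os.kill(pid, 0)          # signal 0 only checks that the process exists
            return False             # another pass is still running
        except (ValueError, ProcessLookupError):
            pass                     # empty or stale lock; take it over
    LOCK.write_text(str(os.getpid()))
    return True

def release_lock():
    LOCK.unlink(missing_ok=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;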

&lt;h2&gt;
  
  
  Quick start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx agent-memory-daemon init    &lt;span class="c"&gt;# generates memconsolidate.toml&lt;/span&gt;
npx agent-memory-daemon start   &lt;span class="c"&gt;# starts the daemon&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The config is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="py"&gt;memory_directory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"./memory"&lt;/span&gt;
&lt;span class="py"&gt;session_directory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"./sessions"&lt;/span&gt;

&lt;span class="py"&gt;extraction_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;extraction_interval_ms&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt;

&lt;span class="nn"&gt;[llm_backend]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"bedrock"&lt;/span&gt;
&lt;span class="py"&gt;region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"us-east-1"&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"us.anthropic.claude-sonnet-4-20250514-v1:0"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use OpenAI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[llm_backend]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai"&lt;/span&gt;
&lt;span class="py"&gt;api_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"${OPENAI_API_KEY}"&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"gpt-4o"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What a memory file looks like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bedrock&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;configuration"&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Default&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;SDK&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;too&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;short&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;large&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;prompts"&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;reference&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
The AWS SDK's default request timeout causes ECONNABORTED errors
on prompts over 30K characters. Set requestTimeout to 300000 (5 min)
via NodeHttpHandler when using BedrockRuntimeClient.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each file has a type: &lt;code&gt;user&lt;/code&gt; (preferences), &lt;code&gt;feedback&lt;/code&gt; (lessons learned), &lt;code&gt;project&lt;/code&gt; (architecture decisions), or &lt;code&gt;reference&lt;/code&gt; (technical facts). The daemon classifies them automatically during extraction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Framework-agnostic by design
&lt;/h2&gt;

&lt;p&gt;The integration pattern is the same regardless of what you're building with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strands / LangChain&lt;/strong&gt;: after each agent run, dump a session summary to the sessions directory. At startup, read &lt;code&gt;MEMORY.md&lt;/code&gt; into the system prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenClaw&lt;/strong&gt;: point &lt;code&gt;session_directory&lt;/code&gt; at your workspace's transcript directory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom agents&lt;/strong&gt;: same pattern — write files, read the index.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No plugin system, no adapter layer. The filesystem is the API.&lt;/p&gt;
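
&lt;p&gt;Concretely, here's a minimal sketch of that pattern in Python. It assumes the daemon's consolidated &lt;code&gt;MEMORY.md&lt;/code&gt; index sits in the memory directory and that session summaries are plain markdown files; the paths and file naming are placeholders, so adjust them to your setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path
from datetime import datetime, timezone

MEMORY_DIR = Path("./memory")      # matches memory_directory in memconsolidate.toml
SESSIONS_DIR = Path("./sessions")  # matches session_directory

def system_prompt_with_memory(base_prompt: str) -&gt; str:
    """At startup, prepend the consolidated memory index to the system prompt."""
    index = MEMORY_DIR / "MEMORY.md"
    if not index.exists():
        return base_prompt
    return base_prompt + "\n\nProject memory:\n" + index.read_text()

def dump_session_summary(summary: str) -&gt; None:
    """After each agent run, drop a summary where the daemon will pick it up."""
    SESSIONS_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    (SESSIONS_DIR / f"session-{stamp}.md").write_text(summary)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;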

&lt;h2&gt;
  
  
  Guardrails
&lt;/h2&gt;

&lt;p&gt;One thing I learned the hard way: without limits, the extraction mode creates files exponentially. Each pass sees the new files from the last pass, prompts the LLM with a bigger manifest, and the LLM creates even more files.&lt;/p&gt;

&lt;p&gt;So there are guardrails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;max_memory_files&lt;/code&gt; — hard cap on total files in the directory (default: 50)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_files_per_batch&lt;/code&gt; — cap on creates per extraction pass (default: 10)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_prompt_chars&lt;/code&gt; — budget enforcement with progressive truncation&lt;/li&gt;
&lt;li&gt;Per-session cursor — prevents reprocessing already-extracted content&lt;/li&gt;
&lt;/ul&gt;
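
&lt;p&gt;In &lt;code&gt;memconsolidate.toml&lt;/code&gt; that looks roughly like this (50 and 10 are the defaults mentioned above; the &lt;code&gt;max_prompt_chars&lt;/code&gt; value is just an illustrative budget, and the exact section placement may differ in the generated config):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;max_memory_files = 50       # hard cap on total files in the memory directory
max_files_per_batch = 10    # cap on creates per extraction pass
max_prompt_chars = 100000   # illustrative budget; enforced with progressive truncation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;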

&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;Every operation emits structured JSON logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2026-04-07T14:23:01.234Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"info"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"event"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"extraction:complete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"created"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"updated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"durationMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4521&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"promptLength"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;39102&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"operationsRequested"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"operationsApplied"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"operationsSkipped"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get duration, prompt size, operation counts, and skip reasons. Pipe it to CloudWatch, Datadog, or just &lt;code&gt;jq&lt;/code&gt;.&lt;/p&gt;
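
&lt;p&gt;For example, assuming you redirect the daemon's output to a log file, a one-line &lt;code&gt;jq&lt;/code&gt; filter gives you extraction latency at a glance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# assuming the structured JSON logs end up in daemon.log
jq -r 'select(.event == "extraction:complete") | "\(.timestamp)  \(.data.durationMs)ms  created=\(.data.created)"' daemon.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;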

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Vector similarity search for memory recall (right now it's manifest-based)&lt;/li&gt;
&lt;li&gt;Multi-agent support (shared memory directories with conflict resolution)&lt;/li&gt;
&lt;li&gt;A web UI for browsing and editing memories&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project is MIT-licensed and on &lt;a href="https://github.com/tverney/agent-memory-daemon" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Issues, PRs, and feedback are welcome.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;agent-memory-daemon
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your agent keeps forgetting things, give it a daemon with a good memory.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>opensource</category>
      <category>typescript</category>
    </item>
    <item>
      <title>The Irony of Language Models That Don't Speak Your Language</title>
      <dc:creator>Thiago V.</dc:creator>
      <pubDate>Mon, 30 Mar 2026 20:53:42 +0000</pubDate>
      <link>https://forem.com/tverney_77/the-irony-of-language-models-that-dont-speak-your-language-5b5</link>
      <guid>https://forem.com/tverney_77/the-irony-of-language-models-that-dont-speak-your-language-5b5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This is a personal project and article. The opinions expressed here are my own and do not reflect the opinions of AWS or Amazon. This project is not an AWS product and is not endorsed by or affiliated with AWS.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AI is plugged into everything now, and it's genuinely a breakthrough technology.&lt;/p&gt;

&lt;p&gt;However, there's an elephant in the room that rarely makes the headlines: LLMs are fundamentally centered on high-resource languages, and on English above all.&lt;/p&gt;

&lt;p&gt;The only publicly disclosed training data breakdown — GPT-3 (Brown et al., 2020) — showed over 92% English tokens. Newer models don't publish exact ratios, but the picture has evolved: Llama 3 remains heavily English, GPT-4o highlights improved multilingual performance as a key feature, and models like Qwen and Aya have invested significantly more in non-English data. The gap is narrowing for high-resource languages, but for the thousands of low-resource languages, the structural imbalance remains.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0kx394w7v2kwjxajsg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0kx394w7v2kwjxajsg3.png" alt=" " width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The remaining languages — spoken by billions of people — are either poorly represented through low-quality machine-translated English content, or absent entirely.&lt;/p&gt;

&lt;p&gt;This means that when a Thai farmer asks about crop subsidies, when a Nigerian mother searches for vaccination schedules in Yoruba, or when a Brazilian citizen navigates tax forms in Portuguese, the AI they're interacting with is operating at a fraction of its true capability.&lt;/p&gt;

&lt;p&gt;Not because the intelligence isn't there, but because the model was never properly taught to listen in their language. The industry celebrates "human-level performance" on benchmarks, but those benchmarks are overwhelmingly English. For most of the world, the AI revolution hasn't arrived yet — it's still stuck at customs, waiting for a translator.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Ancient Myth
&lt;/h2&gt;

&lt;p&gt;Around 4,000 years ago, Babylon was the most cosmopolitan city on Earth.&lt;/p&gt;

&lt;p&gt;Situated at the crossroads of ancient trade routes in modern-day Iraq, it was a place where Akkadian, Sumerian, Aramaic, Elamite, and dozens of other languages collided daily. Merchants, scholars, and diplomats from across Mesopotamia converged there, and the city thrived precisely because it found ways to bridge those languages — through scribes, translators, and the world's first multilingual libraries.&lt;/p&gt;

&lt;p&gt;The biblical story of the Tower of Babel, set in Babylon, tells it differently: God scattered humanity across the earth and confused their languages so they could no longer understand each other. It's a story about the fracturing of communication — the moment when a shared project became impossible because people could no longer speak the same language.&lt;/p&gt;

&lt;p&gt;We're living in a strange echo of that story. We've built the most powerful reasoning machines in human history — LLMs that can write poetry, prove theorems, and generate working code. But these machines think in English. When the rest of the world tries to speak to them, the tower crumbles. Not because the intelligence isn't there, but because the language barrier corrupts the signal before it reaches the model's reasoning core.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Illusion of Multilingual AI
&lt;/h2&gt;

&lt;p&gt;Ask any frontier LLM a question in English, and you'll get a polished, accurate, well-reasoned response. Now ask the same question in Thai. Or Amharic. Or even Portuguese.&lt;/p&gt;

&lt;p&gt;Suddenly, the magic fades.&lt;/p&gt;

&lt;p&gt;The response might be shorter, vaguer, or riddled with English fragments leaking through. In some cases, it's outright gibberish. And here's the part nobody talks about: you're paying more for that worse response.&lt;/p&gt;

&lt;p&gt;While the industry celebrates benchmark after benchmark showing LLMs reaching "human-level performance," there's a massive asterisk: in English. For the 6,950 other languages spoken on this planet, AI remains broken, expensive, and in some cases, unreliable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers Don't Lie
&lt;/h2&gt;

&lt;p&gt;Most leading LLMs allocate approximately 92% of their training tokens to English (&lt;a href="https://arxiv.org/abs/2005.14165" rel="noopener noreferrer"&gt;Brown et al., "Language Models are Few-Shot Learners", NeurIPS 2020&lt;/a&gt;). Of the approximately 7,000 spoken languages globally, most models only cover about 50 high-resource ones (&lt;a href="https://www.frontiersin.org/research-topics/77716/language-models-for-low-resource-languages" rel="noopener noreferrer"&gt;Frontiers Research Topic: Language Models for Low-Resource Languages&lt;/a&gt;). The remaining languages lack both the digital data and quality resources to benefit from recent AI advancements — creating barriers to education, healthcare, financial access, and employment for the communities that speak them.&lt;/p&gt;

&lt;p&gt;But the problem goes deeper than just quality. It's about money.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Language Tax
&lt;/h2&gt;

&lt;p&gt;LLMs use tokenizers to break text into chunks before processing. These tokenizers were designed primarily for English. When you feed them Thai, Japanese, Arabic, or Korean text, the same semantic content gets split into 2-4x more tokens.&lt;/p&gt;
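
&lt;p&gt;You can see the effect yourself in a few lines of Python, using OpenAI's &lt;code&gt;tiktoken&lt;/code&gt; encoder as a stand-in (every provider's tokenizer differs, so exact counts will vary, but the ratio is the point):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "Explain the concept of recursion in programming"
thai = "อธิบายแนวคิดของ recursion ในการเขียนโปรแกรม"  # the same question in Thai

for label, text in [("English", english), ("Thai", thai)]:
    print(f"{label}: {len(enc.encode(text))} tokens")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;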

&lt;p&gt;I built a proxy called &lt;a href="https://github.com/tverney/llm-proxy-babylon" rel="noopener noreferrer"&gt;LLM Proxy Babylon&lt;/a&gt; to measure this. Here's what I found with a real Thai prompt about sorting algorithms:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Direct Thai&lt;/th&gt;
&lt;th&gt;Optimized (English)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt tokens&lt;/td&gt;
&lt;td&gt;~166&lt;/td&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token savings&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;70% fewer input tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality score&lt;/td&gt;
&lt;td&gt;0.456&lt;/td&gt;
&lt;td&gt;~0.949 (English-level)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's 3.4x fewer input tokens and 2x better quality. At Amazon Nova Lite pricing on Bedrock ($0.06/1M input tokens), sending 1,000 Thai prompts of this size would cost ~$0.01 directly vs ~$0.003 through the optimizer — and the optimized path delivers dramatically better responses.&lt;/p&gt;

&lt;p&gt;The savings scale dramatically with premium models. At Claude Opus 4 pricing on Bedrock ($15/1M input tokens), the same 1,000 Thai prompts would cost $2.49 directly vs $0.74 through the optimizer — a $1.75 saving per thousand requests on input tokens alone, roughly $1,750 per million, with better quality on every response.&lt;/p&gt;
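
&lt;p&gt;The arithmetic behind those figures is easy to reproduce from the token counts and list prices above (a quick sanity-check script, nothing more):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def input_cost_usd(prompts: int, tokens_per_prompt: int, price_per_million: float) -&gt; float:
    return prompts * tokens_per_prompt * price_per_million / 1_000_000

for model, price in [("Nova Lite", 0.06), ("Claude Opus 4", 15.0)]:
    direct = input_cost_usd(1_000, 166, price)    # Thai sent as-is
    optimized = input_cost_usd(1_000, 49, price)  # translated to English first
    print(f"{model}: ${direct:.3f} direct vs ${optimized:.3f} optimized per 1,000 prompts")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;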

&lt;p&gt;Every company running a multilingual chatbot is silently paying this tax. Their English-speaking users get fast, cheap, high-quality responses. Their Thai-speaking users get slower, more expensive, lower-quality responses — for the same product, same subscription price.&lt;/p&gt;

&lt;p&gt;And it compounds. Chatbots send the full conversation history with every request. A 10-message conversation in Thai accumulates tokens 3x faster than the same conversation in English. By turn 10, you're sending massive context windows that cost a fortune and may even overflow the model's limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Government Chatbots Can't Serve Their Own Citizens
&lt;/h2&gt;

&lt;p&gt;Now imagine this problem at the scale of a government.&lt;/p&gt;

&lt;p&gt;Countries across Southeast Asia, Africa, the Middle East, and South America are deploying AI-powered chatbots to help citizens access healthcare information, navigate tax systems, apply for social programs, and find emergency services. These are critical services that directly impact people's lives.&lt;/p&gt;

&lt;p&gt;But here's the catch: the LLMs powering these chatbots were trained on English. When a farmer in rural Thailand asks about crop subsidies in Thai, the model's reasoning capability drops by nearly 50%. When a mother in Nigeria asks about childhood vaccination schedules in Yoruba, the model might not even understand the question properly.&lt;/p&gt;

&lt;p&gt;The irony is painful: governments invest in AI to serve their citizens better, but the AI itself delivers unequal quality across languages. Not intentionally — but structurally, through training data imbalance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Safety Gap Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;It gets worse. Research shows that low-resource languages exhibit about three times the likelihood of encountering harmful content compared to high-resource languages — and in intentional attack scenarios, unsafe output rates can reach over 80% (&lt;a href="https://arxiv.org/abs/2310.06474" rel="noopener noreferrer"&gt;Deng et al., "Multilingual Jailbreak Challenges in Large Language Models", 2023&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;LLM safety guardrails — the filters that prevent models from generating harmful content — were primarily trained on English data.&lt;/p&gt;

&lt;p&gt;This means a prompt injection attack that would be caught instantly in English can sail right through in Amharic or Lao. The model simply doesn't recognize the harmful intent in languages it barely understands.&lt;/p&gt;

&lt;p&gt;For any organization deploying AI in production — especially in healthcare, finance, or government — this isn't just a quality issue. It's a liability.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Different Approach: Don't Fix the Model, Route Around It
&lt;/h2&gt;

&lt;p&gt;The conventional wisdom says: "Just train better multilingual models." And yes, that's happening. But it's slow, expensive, and may never fully close the gap for the thousands of low-resource languages that lack sufficient training data.&lt;/p&gt;

&lt;p&gt;What if we could get English-level quality from any language, today, without retraining a single model?&lt;/p&gt;

&lt;p&gt;That's the idea behind &lt;a href="https://github.com/tverney/llm-proxy-babylon" rel="noopener noreferrer"&gt;LLM Proxy Babylon&lt;/a&gt; — an open-source proxy I built that sits between your application and any LLM API.&lt;/p&gt;

&lt;p&gt;It detects the input language, decides whether translating to English would improve results, and if so, translates the prompt before sending it to the model. Then it appends a simple instruction: "Please respond in Thai since the original question was asked in Thai."&lt;/p&gt;

&lt;p&gt;LLM Proxy Babylon is named for the city, not the curse. It's an attempt to do what ancient Babylon did: sit at the crossroads of languages and make sure everyone gets understood.&lt;/p&gt;

&lt;p&gt;The key insight: the real performance gap is in understanding non-English prompts, so translating the input is what restores the model's reasoning quality. Generating output in a specified language is something most models already handle well, so the response is normally left to the model itself; for low-resource languages where generation is also lossy, the proxy can optionally translate the output as well. In short: translate the input (where the model is weak) and let the model handle the output (where it's usually strong).&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Results
&lt;/h2&gt;

&lt;p&gt;I tested this with Mistral 7B on a Thai prompt about bubble sort complexity. The results were dramatic:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without the optimizer (direct Thai):&lt;/strong&gt; The model produced garbled output mixing English fragments into Thai text ("วงจirkle", "sorteering technique"), with confused, repetitive reasoning. 1,749 tokens of mostly noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With the optimizer (translated to English first):&lt;/strong&gt; The same model produced a clean, structured response correctly explaining O(n²) vs O(n log n) complexity, listing Merge Sort, Quick Sort, and Heap Sort with accurate Big-O analysis — all returned in Thai. 1,446 tokens of useful content.&lt;/p&gt;

&lt;p&gt;The model's reasoning capability was there all along. It just couldn't access it through Thai input.&lt;/p&gt;

&lt;p&gt;I also benchmarked Amazon Nova Lite across multiple languages using the built-in evaluation harness:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Quality Score&lt;/th&gt;
&lt;th&gt;Delta from English&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;English (baseline)&lt;/td&gt;
&lt;td&gt;0.949&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Portuguese&lt;/td&gt;
&lt;td&gt;0.763&lt;/td&gt;
&lt;td&gt;-0.19&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Korean&lt;/td&gt;
&lt;td&gt;0.663&lt;/td&gt;
&lt;td&gt;-0.29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Japanese&lt;/td&gt;
&lt;td&gt;0.595&lt;/td&gt;
&lt;td&gt;-0.35&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Thai&lt;/td&gt;
&lt;td&gt;0.456&lt;/td&gt;
&lt;td&gt;-0.49&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern maps exactly to language resource availability. Portuguese (high-resource) takes the smallest hit. Thai (low-resource) loses nearly half the quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The proxy exposes an OpenAI-compatible API, so it works as a drop-in replacement with any framework — LangChain, Strands Agents, or any OpenAI SDK client. Just change the base URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.models.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIModel&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;base_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-needed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.amazon.nova-lite-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;อธิบายแนวคิดของ recursion ในการเขียนโปรแกรม&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, each request flows through a pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Detect&lt;/strong&gt; the language (using franc for BCP-47 identification)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parse&lt;/strong&gt; mixed content (preserve code blocks, URLs, JSON — only translate natural language)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classify&lt;/strong&gt; the task type (reasoning, math, code-generation, culturally-specific)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route&lt;/strong&gt; — decide whether to translate, skip, or use hybrid mode&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Translate&lt;/strong&gt; the prompt to English (if beneficial)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inject&lt;/strong&gt; a language instruction ("Please respond in Thai...")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forward&lt;/strong&gt; to the LLM (supports AWS Bedrock and OpenAI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Return&lt;/strong&gt; the response to the client&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The routing engine is smart about when NOT to translate. Culturally-specific questions ("What's good tonight in Paris?") skip translation because the model needs cultural context, not English reasoning. English prompts skip entirely. The system only translates when it expects a quality improvement.&lt;/p&gt;
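
&lt;p&gt;As an illustration only (this is a simplified sketch, not the proxy's actual code), the routing decision boils down to something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def route(prompt: str, language: str, task_type: str, translate) -&gt; str:
    """Return the prompt that actually gets forwarded to the LLM."""
    if language == "English" or task_type == "culturally-specific":
        return prompt  # skip: nothing to gain, or cultural context would be lost
    english = translate(prompt)  # any backend: Amazon Translate, LibreTranslate, ...
    return english + f"\n\nPlease respond in {language}, since the original question was asked in {language}."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;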

&lt;h3&gt;
  
  
  Built on AWS
&lt;/h3&gt;

&lt;p&gt;The proxy supports AWS Bedrock natively via the Converse API. Authentication is handled automatically through the AWS SDK — no API keys needed in requests. I tested with Amazon Nova Lite and Mistral 7B, both available on Bedrock.&lt;/p&gt;

&lt;p&gt;For translation, it supports Amazon Translate ($15/1M characters, high quality for proper nouns and technical content) and LibreTranslate (self-hosted, free) out of the box, with a pluggable interface for DeepL or Google Translate. Just set &lt;code&gt;TRANSLATOR_BACKEND=amazon-translate&lt;/code&gt; to switch; it uses your existing AWS credentials.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Conversation Cache: Solving the Multi-Turn Problem
&lt;/h2&gt;

&lt;p&gt;Multi-turn conversations are where the token tax really hurts. Every request sends the full history, and that history is in the user's language — eating 2-4x more tokens per turn.&lt;/p&gt;

&lt;p&gt;The proxy includes a conversation translation cache. Pass an &lt;code&gt;X-Conversation-Id&lt;/code&gt; header and previously translated messages are pulled from cache instead of being re-translated. By turn 10, you get 9 cache hits and only 1 miss per request — 9 translation API calls saved, and the LLM always sees a lean English context window.&lt;/p&gt;
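
&lt;p&gt;With the OpenAI Python SDK that's just an extra header per request (same base URL as the earlier example; the conversation ID is any stable string your application generates):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="us.amazon.nova-lite-v1:0",
    messages=[{"role": "user", "content": "ขั้นตอนถัดไปคืออะไร"}],  # "What's the next step?" in Thai
    extra_headers={"X-Conversation-Id": "user-42-session-7"},  # lets the proxy reuse cached translations
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;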

&lt;h2&gt;
  
  
  Beyond Quality: Safety as a Side Effect
&lt;/h2&gt;

&lt;p&gt;By translating low-resource language prompts to English before sending them to the LLM, the optimizer routes every prompt through the model's strongest safety alignment. A harmful prompt in Thai or Amharic gets evaluated by English-trained guardrails operating at full strength, rather than the weaker low-resource language alignment.&lt;/p&gt;

&lt;p&gt;This isn't a complete safety solution — but for the common case, it significantly narrows the 3x safety gap between high-resource and low-resource languages identified by Deng et al.&lt;/p&gt;

&lt;h2&gt;
  
  
  But What If Models Get Better at Multilingual?
&lt;/h2&gt;

&lt;p&gt;They will — and the optimizer is designed for that.&lt;/p&gt;

&lt;p&gt;The token cost problem is structural, not a training problem. BPE tokenizers will always split Thai, Arabic, and Korean into 2-4x more tokens than semantically equivalent English. Unless providers fundamentally redesign their tokenizers and retrain everything, the cost disparity persists regardless of how multilingual the model becomes.&lt;/p&gt;

&lt;p&gt;Conversation history compounding doesn't go away either. Even a perfectly multilingual model still charges per token. A 10-turn Thai conversation still accumulates tokens 3x faster than English. The conversation translation cache solves this at the infrastructure level.&lt;/p&gt;

&lt;p&gt;RAG retrieval is an embedding problem, not an LLM problem. Vector embeddings are English-centric. Translating queries to English before retrieval improves recall regardless of how good the LLM itself is at understanding Thai.&lt;/p&gt;

&lt;p&gt;Fine-tuning ROI is permanent. Companies fine-tune on English domain data. A perfectly multilingual base model still won't have that domain-specific knowledge accessible through non-English prompts unless the fine-tuning data was also multilingual — which it almost never is.&lt;/p&gt;

&lt;p&gt;Safety alignment will always lag for low-resource languages. Even as models improve, safety training data will remain English-heavy. Routing through English for safety filtering is a defense-in-depth strategy that stays relevant.&lt;/p&gt;

&lt;p&gt;And the adaptive router handles the transition gracefully. As models get better at specific languages, the shadow evaluator detects that translation no longer helps, and the router automatically switches to skip. The proxy doesn't fight against model improvements — it adapts to them. For a language where the model reaches English parity, the proxy becomes a transparent pass-through with zero overhead.&lt;/p&gt;

&lt;p&gt;Today the proxy is primarily about quality. As models improve, it becomes primarily about cost optimization, safety, and RAG. The architecture already supports that transition because routing decisions are data-driven, not hardcoded assumptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This is an open-source project and there's a lot more to explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAG improvement&lt;/strong&gt; — translate queries to English before vector retrieval for better recall (current architecture supports it)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning ROI&lt;/strong&gt; — ensure non-English users benefit from English-only fine-tuned models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dialect detection&lt;/strong&gt; — handle Egyptian Arabic vs Modern Standard Arabic, European vs Brazilian Portuguese&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Question We Should Be Asking
&lt;/h2&gt;

&lt;p&gt;The next time someone says "all LLMs are the same," ask them: &lt;strong&gt;in which language?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI won't truly be intelligent until it understands every language, every culture. Until then, tools like LLM Proxy Babylon can bridge the gap — giving every user English-level quality, regardless of what language they think in.&lt;/p&gt;

&lt;p&gt;The code is open source: &lt;a href="https://github.com/tverney/llm-proxy-babylon" rel="noopener noreferrer"&gt;github.com/tverney/llm-proxy-babylon&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;273 property-based tests. Real benchmarks. Ready to deploy.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://builder.aws.com/content/3BfRX8ILgQnT0aO1vmYWvgVCHKT/the-irony-of-language-models-that-dont-speak-your-language" rel="noopener noreferrer"&gt;AWS Builder Center&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>i18n</category>
      <category>llm</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
