<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: BAOFUFAN</title>
    <description>The latest articles on Forem by BAOFUFAN (@_eb7f2a654e97a60ae9f96e).</description>
    <link>https://forem.com/_eb7f2a654e97a60ae9f96e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3903614%2F88f4214a-aed8-4e71-a7f1-a6aca8cfe579.jpg</url>
      <title>Forem: BAOFUFAN</title>
      <link>https://forem.com/_eb7f2a654e97a60ae9f96e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/_eb7f2a654e97a60ae9f96e"/>
    <language>en</language>
    <item>
      <title>Building RAG with LangChain &amp; Chroma: Two Hidden Pitfalls That Cost Me 6 Hours</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Wed, 06 May 2026 01:08:16 +0000</pubDate>
      <link>https://forem.com/_eb7f2a654e97a60ae9f96e/building-rag-with-langchain-chroma-two-hidden-pitfalls-that-cost-me-6-hours-1flc</link>
      <guid>https://forem.com/_eb7f2a654e97a60ae9f96e/building-rag-with-langchain-chroma-two-hidden-pitfalls-that-cost-me-6-hours-1flc</guid>
      <description>&lt;p&gt;At 4 PM, my product manager dropped 200 PDFs in my lap: “We need to demo an internal knowledge base Q&amp;amp;A for the boss tomorrow morning—super urgent.” I thought, “RAG? I know this; LangChain plus a vector database, done in minutes.” I started coding right away and barely finished by 10 PM—not because the pipeline didn’t run, but because two subtle traps dragged the accuracy below 40% and had me debugging for six straight hours. In this article, I’ll walk you through the full RAG system build and pull the two pitfalls out by their roots.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why you can’t just dump documents into GPT
&lt;/h2&gt;

&lt;p&gt;The simplest idea for a system that answers questions like “What is the company holiday policy?” or “What were the conclusions of project X’s retrospective?” is to concatenate all the documents into one giant prompt and send it to GPT. Reality hits fast: 200 PDFs add up to over 800,000 characters, roughly 200,000 tokens at the usual four characters per token. Even GPT-4’s 128K context window chokes, and the per‑call cost will make your finance team come after you. Fine‑tuning is even less realistic—the documents change daily, and you’re not going to burn thousands of dollars every time they do.&lt;/p&gt;

&lt;p&gt;That leaves retrieval‑augmented generation (RAG) as the only viable path: split documents into small chunks, embed each chunk with an embedding model and store the vectors in a vector database. At query time, retrieve the most relevant chunks, stuff them into the prompt as context, and let the LLM generate an answer. The pattern looks simple, but every step—“how to split,” “how to store,” “how to search”—has its own sharp edges. The two that wrecked me were buried deep in the interaction between LangChain and Chroma.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design choices: why Chroma over Pinecone or FAISS
&lt;/h2&gt;

&lt;p&gt;Before picking a vector store, I asked myself three questions: does it cost money, does it support metadata filtering, and can I start/stop it locally with a single command?&lt;/p&gt;

&lt;p&gt;Pinecone costs money and requires data to go to the cloud—immediately ruled out for internal documents. Weaviate is powerful, but deploying it means at least 30 minutes of Docker tinkering, a non‑starter when the demo is “tomorrow morning.” FAISS is blazing fast, but it doesn’t support metadata filtering (like filtering by document type or date range)—a feature we’d need as soon as the business side piles on more requirements. I landed on Chroma: it runs locally, installs with a single &lt;code&gt;pip install chromadb&lt;/code&gt;, and has persistence, metadata filtering, and similarity search built right in. It also integrates with LangChain more smoothly than any other option.&lt;/p&gt;
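
&lt;p&gt;For a sense of what that filtering looks like, here is a minimal sketch against the &lt;code&gt;vectordb&lt;/code&gt; built in Script 1 below; the &lt;code&gt;source&lt;/code&gt; filename is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch: Chroma-side metadata filtering, the feature FAISS lacks.
# PyPDFDirectoryLoader stamps each chunk's metadata with a "source" path.
hits = vectordb.similarity_search(
    "annual leave policy",
    k=4,
    filter={"source": "docs/hr_handbook.pdf"}   # only search chunks from this file
)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;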

&lt;p&gt;The overall architecture is straightforward: &lt;strong&gt;load documents → split text → generate embeddings → write to Chroma → when a user asks, retrieve top‑k chunks → stuff into a prompt → LLM generates an answer&lt;/strong&gt;. LangChain chains these steps together; you just need to manage the parameters and edge cases for each stage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core implementation: two scripts to run the full RAG pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Script 1’s job&lt;/strong&gt;: turn scattered PDFs into searchable vector chunks and persist them in Chroma so you don’t have to re‑index everything on the next run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.document_loaders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PyPDFDirectoryLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.vectorstores&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Chroma&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Load PDF directory
&lt;/span&gt;&lt;span class="n"&gt;loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PyPDFDirectoryLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# Auto-scan all PDFs
&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loaded documents: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; pages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Split: chunk_size and overlap are the source of two major pitfalls, detailed later
&lt;/span&gt;&lt;span class="n"&gt;text_splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;# Max characters per chunk
&lt;/span&gt;    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# Overlap to avoid cutting key info across boundaries
&lt;/span&gt;    &lt;span class="n"&gt;separators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;。&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;，&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Priority: paragraphs first
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text_splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total chunks: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Generate embeddings and store in Chroma (auto-persist to local dir)
&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;                  &lt;span class="c1"&gt;# Default: text-embedding-ada-002
&lt;/span&gt;&lt;span class="n"&gt;vectordb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Chroma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;persist_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./chroma_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;              &lt;span class="c1"&gt;# Reusable after restart, saves re-embedding cost
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vectordb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Vector store built and persisted to ./chroma_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Script 2’s job&lt;/strong&gt;: using the stored vector store, build the full “ask → retrieve → generate” chain and force the LLM to answer strictly from the provided documents—no hallucinations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.chains&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RetrievalQA&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.prompts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PromptTemplate&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Custom prompt: force LLM to base answers only on the given context
&lt;/span&gt;&lt;span class="n"&gt;prompt_template&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a rigorous internal knowledge base assistant. Answer the question strictly based on the context below.
If the answer cannot be found in the context, simply say &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No relevant information found.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; Do not make anything up.

Context:
{context}

Question: {question}
Answer:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PromptTemplate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt_template&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;input_variables&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Load the persisted vector store, connecting to the same embedding model
&lt;/span&gt;&lt;span class="n"&gt;vectordb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Chroma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;persist_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./chroma_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Create the QA chain: retrieves top-4 chunks by default, using our custom prompt
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# temperature=0 for deterministic results
&lt;/span&gt;&lt;span class="n"&gt;qa_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RetrievalQA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_chain_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chain_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stuff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                               &lt;span class="c1"&gt;# Stuffs retrieved chunks directly into the prompt
&lt;/span&gt;    &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vectordb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;as_r&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>How I Cut LLM Memory Bug Diagnosis from 2 Hours to 5 Minutes with Playwright &amp; Allure</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Tue, 05 May 2026 12:17:50 +0000</pubDate>
      <link>https://forem.com/_eb7f2a654e97a60ae9f96e/how-i-cut-llm-memory-bug-diagnosis-from-2-hours-to-5-minutes-with-playwright-allure-2162</link>
      <guid>https://forem.com/_eb7f2a654e97a60ae9f96e/how-i-cut-llm-memory-bug-diagnosis-from-2-hours-to-5-minutes-with-playwright-allure-2162</guid>
      <description>&lt;p&gt;At 2 a.m., the customer group chat erupted: "Your AI assistant completely forgot the client background I provided last week and even fabricated new details!" I rolled out of bed, opened the logs, and stared at tens of thousands of conversation lines—like finding a needle in a haystack. That night, I spent nearly two hours tracking down the cause: when conversations exceeded 24 turns, our history truncation logic silently dropped the system prompts in the middle, causing memory to break completely. The next day, I decided this torture had to end with automation. If you're building LLM applications and are tormented by "occasional amnesia" or "intermittent hallucinations," this article will hand you a ready-to-use end-to-end memory testing solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem Breakdown: LLM "Amnesia" Is Scarier Than Hallucination
&lt;/h2&gt;

&lt;p&gt;Once an LLM application is live, users assume it will remember conversation history. The reality is: any link in the chain involving context window management, RAG retrieval, or multi-agent state passing can cause "amnesia"—needed information fails to reach the model, or gets truncated and overwritten along the way. The risk with such bugs is that they don't fail every time; they typically surface only under specific conversation lengths or message sequences, making manual reproduction extremely hard.&lt;/p&gt;

&lt;p&gt;Conventional testing methods fall into two buckets: (1) unit tests against the model API, which verify single-turn input/output and never touch memory logic; (2) manually clicking through dozens of conversation turns in the UI, which is time-consuming, exhausting, and impossible to replicate precisely. One forgotten step by the tester and the bug slips away. What we truly lack is an automated approach that can converse with the system like a real user and clearly record what it remembers and at which turn it starts forgetting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution Design: Why Playwright + Allure?
&lt;/h2&gt;

&lt;p&gt;Facing this requirement, I evaluated several paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct API calls:&lt;/strong&gt; Fast, but bypass front-end logic like message assembly, timestamps, and user identity injection, missing the full real-world chain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selenium:&lt;/strong&gt; Mature ecosystem, but async waiting and Shadow DOM handling in modern front-end frameworks are painful, and the community has clearly shifted toward Playwright.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cypress:&lt;/strong&gt; Limited to JS/TS ecosystem. Our backend and model services are Python-based, making stack unification too costly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reasons for ultimately choosing Playwright are straightforward: native sync and async APIs, an auto-wait mechanism that drastically reduces flaky tests, and full control over Chromium for capturing screenshots at every critical step. Allure was chosen because its reports come with built-in step trees, attachment display, and failure highlighting, letting non-technical stakeholders (like product managers) instantly see at which turn the memory broke.&lt;/p&gt;

&lt;p&gt;The architecture is simple: &lt;code&gt;pytest + playwright + allure-pytest&lt;/code&gt;. Test cases use Playwright to operate the live front-end, execute multiple conversation turns, and assert each turn that key information is still remembered by the model. Allure packages the conversation content, model responses, assertion results, and page screenshots for each round into a single HTML report, making the reproduction path crystal clear.&lt;/p&gt;
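
&lt;p&gt;To make that concrete, here is a sketch of one multi-turn memory case. The URL, selectors, and the &lt;code&gt;send()&lt;/code&gt; helper are illustrative, and &lt;code&gt;page&lt;/code&gt; is assumed to come from a Playwright pytest fixture like the ones shown below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of one multi-turn memory case (URL, selectors, and send() are illustrative)
import allure

FACT = "Client is ACME Corp, budget is 2 million"  # planted in turn 1, checked at the end

def send(page, text):
    """Hypothetical helper: submit a message and return the latest reply text."""
    page.fill("#chat-input", text)
    page.click("#send-btn")
    page.wait_for_selector(".reply:last-child")
    return page.inner_text(".reply:last-child")

@allure.title("Planted fact survives 30 conversation turns")
def test_long_conversation_memory(page):
    page.goto("https://chat.example.internal")     # illustrative URL
    send(page, f"Please remember: {FACT}")
    for turn in range(1, 31):
        with allure.step(f"Turn {turn}: filler small talk"):
            send(page, f"Filler question #{turn}")
            allure.attach(page.screenshot(), name=f"turn-{turn}",
                          attachment_type=allure.attachment_type.PNG)
    with allure.step("Recall check"):
        reply = send(page, "Which client did I mention at the start?")
        allure.attach(reply, name="model reply")
        assert "ACME" in reply, "memory broke somewhere before turn 30"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;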

&lt;h2&gt;
  
  
  Core Implementation: From Opening the Browser to Automated Memory Assertions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First Code Block: Solving Browser Reuse and Login State&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Tests often slow to a crawl because of login flows: scanning QR codes or entering verification codes from scratch on every run. Instead, we save the logged-in &lt;code&gt;storage_state&lt;/code&gt; once and reuse it.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
python
# conftest.py - 复用登录态，避免每次测试都登录
import pytest
from playwright.sync_api import sync_playwright
from pathlib import Path

STATE_FILE = Path(__file__).parent / "auth_state.json"

@pytest.fixture(scope="session")
def browser():
    with sync_playwright() as p:
        # 如果想看到执行过程可设为 False
        browser = p.chromium.launch(headless=True, slow_m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>A 2 AM Serialization Bug in LangChain Memory — And How pytest Stopped It Forever</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Tue, 05 May 2026 01:07:40 +0000</pubDate>
      <link>https://forem.com/_eb7f2a654e97a60ae9f96e/a-2-am-serialization-bug-in-langchain-memory-and-how-pytest-stopped-it-forever-jgj</link>
      <guid>https://forem.com/_eb7f2a654e97a60ae9f96e/a-2-am-serialization-bug-in-langchain-memory-and-how-pytest-stopped-it-forever-jgj</guid>
      <description>&lt;p&gt;At 2:17 AM, my monitoring alert yanked me out of sleep: the customer service bot had suddenly lost its memory. Users were asking “Where is my order?” three times in a row, and it kept asking for their phone number as if they were complete strangers. I opened the logs and saw that &lt;code&gt;ConversationBufferMemory&lt;/code&gt; was loading empty message lists. The key was still there in Redis, but somehow deserialization had silently swallowed the data. I rolled back the code from my bed and spent three hours tracing the root cause — &lt;strong&gt;a LangChain upgrade had introduced a pickle deserialization incompatibility that dropped entire conversation histories.&lt;/strong&gt; Manual testing had never covered version upgrade scenarios. The next morning I made a decision: automate the integrity and performance checks for memory storage with pytest, and never let a serialization regression slip through again. Since then, regressions that used to take 30 minutes to verify now finish in 8 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Breakdown
&lt;/h2&gt;

&lt;p&gt;Our architecture was straightforward: &lt;code&gt;ConversationBufferMemory&lt;/code&gt; + &lt;code&gt;RedisChatMessageHistory&lt;/code&gt; persisting user sessions to Redis. Under the hood, LangChain used &lt;code&gt;pickle&lt;/code&gt; to dump the message list into bytes and stored them under a &lt;code&gt;{session_id}&lt;/code&gt; key — reloading it later with a simple load.&lt;/p&gt;
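
&lt;p&gt;A simplified sketch of that write/read path (not LangChain’s exact internals; the key format and stand-in messages are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the storage pattern described above (illustrative, not LangChain internals)
import pickle
import redis

session_id = "user-42"
messages = [("human", "Where is my order?"), ("ai", "Let me check.")]  # stand-in objects

r = redis.Redis.from_url("redis://localhost:6379")
r.set(f"message_store:{session_id}", pickle.dumps(messages))      # write: dump list to bytes
restored = pickle.loads(r.get(f"message_store:{session_id}"))     # read: loads() raises if the
                                                                  # pickled class paths moved
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;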

&lt;p&gt;The problem hit during a version upgrade: we moved from &lt;code&gt;langchain==0.0.352&lt;/code&gt; to &lt;code&gt;0.1.0&lt;/code&gt;, and the fully qualified class names of &lt;code&gt;HumanMessage&lt;/code&gt; and &lt;code&gt;AIMessage&lt;/code&gt; changed. When the old pickle payload was loaded, it threw an &lt;code&gt;AttributeError&lt;/code&gt;. Even worse, the &lt;code&gt;messages&lt;/code&gt; property of &lt;code&gt;RedisChatMessageHistory&lt;/code&gt; was catching that exception and silently returning an empty list — making it look like an innocent empty conversation with no errors anywhere. These kinds of bugs have two nasty traits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Delayed impact&lt;/strong&gt;: the blow-up doesn’t happen at upgrade time, but only when a user actually reads or writes memory again — monitoring can barely spot it in the first place.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Untestable by hand&lt;/strong&gt;: before an upgrade, QA only validates “can we store and read” with the current version; nobody intentionally seeds old serialized data to check backward compatibility.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Conventional click-and-hope manual testing stands no chance against regressions like this. What we needed was an automated, repeatable integration suite covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write → restart → read integrity&lt;/li&gt;
&lt;li&gt;Multi-version serialized data compatibility&lt;/li&gt;
&lt;li&gt;Performance under large message volumes&lt;/li&gt;
&lt;li&gt;Correctness under concurrent reads and writes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Plan
&lt;/h2&gt;

&lt;p&gt;I chose &lt;strong&gt;pytest&lt;/strong&gt; as the test framework. It wasn’t that &lt;code&gt;unittest&lt;/code&gt; couldn’t do the job — but the things I needed were just too painful there:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Isolation&lt;/strong&gt;: I wanted a painless Redis substitute. &lt;code&gt;fakeredis&lt;/code&gt; perfectly simulates Redis commands, and combined with pytest fixtures it gives zero external dependencies during testing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parametrization&lt;/strong&gt;: &lt;code&gt;@pytest.mark.parametrize&lt;/code&gt; can cover 10, 100, or 1000 messages in a single line — no manual loops required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance benchmarks&lt;/strong&gt;: the &lt;code&gt;pytest-benchmark&lt;/code&gt; plugin directly measures average and max latency, much more reliable than me sprinkling &lt;code&gt;time.perf_counter()&lt;/code&gt; around.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency simulation&lt;/strong&gt;: writing fixtures with &lt;code&gt;threading&lt;/code&gt; or &lt;code&gt;asyncio&lt;/code&gt; is far more intuitive than &lt;code&gt;unittest&lt;/code&gt;’s &lt;code&gt;subTest&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The overall approach: define a &lt;code&gt;FakeRedis&lt;/code&gt; fixture in &lt;code&gt;conftest.py&lt;/code&gt; and monkeypatch &lt;code&gt;redis.Redis.from_url&lt;/code&gt; so that every LangChain Redis call hits the in-memory implementation transparently. Then split the tests into modules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;test_integrity.py&lt;/code&gt;: verify store / retrieve consistency and cross-instance loading&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;test_compatibility.py&lt;/code&gt;: simulate old serialized payloads and test migration / downgrade logic&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;test_performance.py&lt;/code&gt;: use pytest-benchmark to measure read/write ceilings&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;test_concurrency.py&lt;/code&gt;: multiple threads appending to the same memory, checking for data loss&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why not use a real Redis? A real instance is essential for CI smoke tests, but hitting one on every push during development is slow and messy. FakeRedis lets the whole suite run in just a few hundred milliseconds — and that &lt;strong&gt;zero friction&lt;/strong&gt; is exactly what makes the team actually want to write tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. conftest: Hijacking LangChain’s Redis connection with FakeRedis
&lt;/h3&gt;

&lt;p&gt;The central idea is: every test shares one in-memory Redis, completely transparent to LangChain. We monkeypatch both &lt;code&gt;redis.Redis.from_url&lt;/code&gt; and the direct constructor, so no matter how &lt;code&gt;RedisChatMessageHistory&lt;/code&gt; creates a client, it always lands on the same FakeRedis instance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# conftest.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fakeredis&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FakeRedis&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fake_redis&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;每个测试函数独立的 FakeRedis 实例&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;FakeRedis&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;autouse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;patch_redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monkeypatch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fake_redis&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;将所有对 Redis 的调用劫持到 FakeRedis，实现零外部依赖&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# 劫持 from_url 方法，LangChain 内部用这个创建连接
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_fake_from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fake_redis&lt;/span&gt;

    &lt;span class="n"&gt;monkeypatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;from_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_fake_from_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 如果有地方直接 redis.Redis(...)，也一并拦截
&lt;/span&gt;    &lt;span class="n"&gt;monkeypatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Redis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;fake_redis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fake_redis&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this in place, any test that uses the &lt;code&gt;patch_redis&lt;/code&gt; fixture automatically forces LangChain to read and write my isolated FakeRedis — the database is always pristine.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Testing integrity: write → reload must match bit for bit
&lt;/h3&gt;

&lt;p&gt;Below is &lt;code&gt;test_integrity.py&lt;/code&gt;. It verifies the most fundamental contract: whatever I store for a session must be returned exactly the same when loaded later. I parametrized it to cover single messages, medium-sized conversations, and massive message batches.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_integrity.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConversationBufferMemory&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.memory.chat_message_histories&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RedisChatMessageHistory&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.schema&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AIMessage&lt;/span&gt;

&lt;span class="c1"&gt;# ... test functions covering write/read integrity
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
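
&lt;p&gt;As a concrete example, here is what one such parametrized case can look like. This sketch leans on the autouse &lt;code&gt;patch_redis&lt;/code&gt; fixture above; the message counts and session id are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# sketch: round-trip integrity, parametrized over conversation size
import pytest
from langchain.memory.chat_message_histories import RedisChatMessageHistory

@pytest.mark.parametrize("n", [1, 100, 1000])
def test_roundtrip_matches(n):
    history = RedisChatMessageHistory(session_id="s-42")
    for i in range(n):
        history.add_user_message(f"question {i}")
        history.add_ai_message(f"answer {i}")

    # A brand-new instance on the same session must see identical data
    reloaded = RedisChatMessageHistory(session_id="s-42")
    assert len(reloaded.messages) == 2 * n
    assert reloaded.messages[0].content == "question 0"
    assert reloaded.messages[-1].content == f"answer {n - 1}"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;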


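&lt;h3&gt;
  
  
  3. Testing compatibility: old payloads must fail loudly, never silently
&lt;/h3&gt;

&lt;p&gt;The compatibility module is where that 2 AM bug gets pinned down. Here is a sketch of the core idea: build a pickle whose recorded class path no longer resolves (mimicking a library relocating its classes between versions), then assert that the failure surfaces instead of degrading into an empty list. The module name and key format are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# test_compatibility.py - sketch of the stale-payload scenario (illustrative names)
import pickle
import sys
import types

import pytest

# Register a throwaway module, pickle an instance, then remove the class:
# to the unpickler this looks exactly like a class that moved between versions.
_legacy = types.ModuleType("legacy_langchain_schema")
class _HumanMessage:
    def __init__(self, content):
        self.content = content
_HumanMessage.__module__ = "legacy_langchain_schema"
_HumanMessage.__qualname__ = "HumanMessage"
_legacy.HumanMessage = _HumanMessage
sys.modules["legacy_langchain_schema"] = _legacy

STALE_PAYLOAD = pickle.dumps(_HumanMessage("Where is my order?"))
del _legacy.HumanMessage  # the class "moved away" in the new version

def test_stale_payload_fails_loudly(fake_redis):
    fake_redis.set("message_store:session_1", STALE_PAYLOAD)
    # The regression we guard against is this error being swallowed into []
    with pytest.raises(AttributeError):
        pickle.loads(fake_redis.get("message_store:session_1"))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;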

&lt;p&gt;The full suite (including the compatibility, performance, and concurrency modules) now lives in our CI pipeline. FakeRedis lets us run everything instantly, and the moment anyone bumps a LangChain version we catch serialization regressions before they ever reach production. Since that 2 AM wake-up call, we haven’t lost a single conversation to a silent pickle bug again.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>"How a Refresh Wiped Out 237 Drafts — and How We Used Playwright to Stop It Forever"</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Mon, 04 May 2026 12:08:35 +0000</pubDate>
      <link>https://forem.com/_eb7f2a654e97a60ae9f96e/how-a-refresh-wiped-out-237-drafts-and-how-we-used-playwright-to-stop-it-forever-1ncm</link>
      <guid>https://forem.com/_eb7f2a654e97a60ae9f96e/how-a-refresh-wiped-out-237-drafts-and-how-we-used-playwright-to-stop-it-forever-1ncm</guid>
      <description>&lt;p&gt;At 2 AM, I was jolted awake by a call from operations. Our user community was on fire: someone had spent half an hour filling out a complex form, accidentally hit refresh, and all their drafts vanished. A backend check showed 237 drafts reduced to just three. The backend wasn't to blame — the database never received a single request. The culprit was our frontend memory storage, and we had &lt;strong&gt;zero automated tests covering it&lt;/strong&gt;. Later, we added automated tests for localStorage and IndexedDB persistence using Playwright, and the same kind of incident never happened again. In this article, I'll walk you through the complete "memory storage health check" blueprint — code included.&lt;/p&gt;




&lt;h2&gt;
  
  
  Breaking Down the Problem: Why Memory Storage Fails Silently
&lt;/h2&gt;

&lt;p&gt;Frontend memory storage broadly refers to using &lt;code&gt;localStorage&lt;/code&gt;, &lt;code&gt;sessionStorage&lt;/code&gt;, &lt;code&gt;IndexedDB&lt;/code&gt;, or state management persistence (like Pinia persist plugins) to cache user input so data isn't lost when the page is refreshed or closed. It’s what lets you hit F5 by accident and still see what you were typing — a baseline experience for any modern web app.&lt;/p&gt;

&lt;p&gt;But its fragility is often underestimated. In our incident, the root cause was simple: a code refactor changed the serialization key for drafts. The old key was no longer read, so after a refresh the app assumed “no draft exists” and wrote an empty state, wiping everything. Conventional manual testing never covered this path because testers always fill forms from scratch — they don't deliberately refresh a half-filled form and then check for restoration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why common approaches fell short:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unit tests&lt;/strong&gt;: mocking &lt;code&gt;localStorage&lt;/code&gt; can’t replicate real browser storage behavior, storage quotas, or serialization quirks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;E2E tests (Cypress/Playwright)&lt;/strong&gt;: they typically follow the “happy path” — fill from a blank state and submit — without intentionally triggering refresh or crash-recovery scenarios.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual verification&lt;/strong&gt;: you can’t guarantee someone clears storage, fills half a form, refreshes, and validates restoration on every regression cycle. It’s too costly and easy to miss.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We needed an &lt;strong&gt;automated, repeatable test suite that could assert storage contents&lt;/strong&gt; — something purpose-built to guard memory storage reliability.&lt;/p&gt;




&lt;h2&gt;
  
  
  Designing the Solution: Why Playwright for Memory Storage Testing
&lt;/h2&gt;

&lt;p&gt;We evaluated three approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Browser extensions + script injection&lt;/strong&gt;: too hacky, impossible to integrate into CI, and can’t accurately simulate real user journeys.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cypress APIs like &lt;code&gt;cy.clearLocalStorage()&lt;/code&gt;&lt;/strong&gt;: they can manipulate storage, but IndexedDB support is weaker, and the execution model doesn’t produce a truly “native” refresh scenario.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playwright&lt;/strong&gt;: native support for multiple browsers, isolated contexts, &lt;code&gt;page.evaluate()&lt;/code&gt; to read &lt;code&gt;localStorage&lt;/code&gt;/&lt;code&gt;IndexedDB&lt;/code&gt; directly, and &lt;code&gt;page.reload()&lt;/code&gt; that triggers a true page refresh. Plus, the test scripts are plain Node.js, seamlessly pluggable into existing CI pipelines.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our final decision: &lt;strong&gt;write dedicated “memory storage regression cases” with Playwright, simulating the core path “fill → refresh → verify restoration” and running them automatically on every test pass&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The architecture is straightforward:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each test case creates an isolated browser context (&lt;code&gt;browser.newContext()&lt;/code&gt;) to guarantee a clean storage environment.
&lt;/li&gt;
&lt;li&gt;Each case follows three steps: &lt;strong&gt;write (simulate user input triggering auto-save) → refresh → read (assert that stored drafts match the form values after reload)&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;We designed reusable assertion helpers for both &lt;code&gt;localStorage&lt;/code&gt; and &lt;code&gt;IndexedDB&lt;/code&gt; (sketched right after this list).&lt;/li&gt;
&lt;/ul&gt;
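
&lt;p&gt;A sketch of what those helpers can look like (file and function names are illustrative; both read storage from inside the page via &lt;code&gt;page.evaluate()&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// helpers/storage.ts - reusable assertion helpers (names are illustrative)
import { Page, expect } from '@playwright/test';

// Assert that a localStorage key exists and its parsed JSON equals `expected`
export async function expectLocalStorage(page: Page, key: string, expected: unknown) {
  const raw = await page.evaluate((k) =&amp;gt; localStorage.getItem(k), key);
  expect(raw, `localStorage["${key}"] should exist`).not.toBeNull();
  expect(JSON.parse(raw as string)).toEqual(expected);
}

// Read a single record out of IndexedDB from inside the page context
export function readIndexedDb(page: Page, db: string, store: string, key: string) {
  return page.evaluate(
    ([d, s, k]) =&amp;gt;
      new Promise((resolve, reject) =&amp;gt; {
        const req = indexedDB.open(d);
        req.onerror = () =&amp;gt; reject(req.error);
        req.onsuccess = () =&amp;gt; {
          const get = req.result.transaction(s).objectStore(s).get(k);
          get.onsuccess = () =&amp;gt; resolve(get.result);
          get.onerror = () =&amp;gt; reject(get.error);
        };
      }),
    [db, store, key]
  );
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;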




&lt;h2&gt;
  
  
  Core Implementation: A Full-Body Checkup for Memory Storage
&lt;/h2&gt;

&lt;p&gt;Let’s build a runnable memory storage test suite with Playwright. &lt;strong&gt;The core problem it solves: verifying that after a user fills out a form and it’s auto-saved to localStorage, the draft survives a page refresh intact.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Simulated Page Logic
&lt;/h3&gt;

&lt;p&gt;To make the test runnable, here’s a minimal HTML page with auto-save to &lt;code&gt;localStorage&lt;/code&gt; and restoration on load. Save it as &lt;code&gt;test-app/index.html&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;!DOCTYPE html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;form&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"draftForm"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"title"&lt;/span&gt; &lt;span class="na"&gt;placeholder=&lt;/span&gt;&lt;span class="s"&gt;"标题"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;textarea&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"content"&lt;/span&gt; &lt;span class="na"&gt;placeholder=&lt;/span&gt;&lt;span class="s"&gt;"内容"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/textarea&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/form&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;form&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getElementById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;draftForm&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getElementById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;title&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getElementById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;content&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;STORAGE_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;draft_v1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// 页面加载时恢复草稿&lt;/span&gt;
    &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;restore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;saved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;localStorage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;STORAGE_KEY&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;saved&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;draft&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;saved&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
          &lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;draft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
          &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;draft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// 输入变化自动保存&lt;/span&gt;
    &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;autoSave&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;draft&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
      &lt;span class="nx"&gt;localStorage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;STORAGE_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;draft&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;input&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;autoSave&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;input&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;autoSave&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;restore&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// 立刻恢复&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Playwright Test: Verifying Drafts Survive a Refresh
&lt;/h3&gt;

&lt;p&gt;Install Playwright: &lt;code&gt;npm i -D playwright @playwright/test&lt;/code&gt;&lt;br&gt;&lt;br&gt;
Create a &lt;code&gt;playwright.config.ts&lt;/code&gt; that points to the test directory and the base URL.&lt;/p&gt;

&lt;p&gt;The test below directly verifies refresh recovery — &lt;strong&gt;if this case fails, your live memory storage is broken&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// tests/memory-storage.spec.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;DRAFT_TEXT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;这是一段重要的草稿内容，不能丢&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;记忆存储 - 草稿恢复&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;beforeEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>I Spent 6 Hours Fixing LangChain's ConversationBufferMemory — Here's the Automated Test You Need</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Mon, 04 May 2026 01:07:14 +0000</pubDate>
      <link>https://forem.com/_eb7f2a654e97a60ae9f96e/i-spent-6-hours-fixing-langchains-conversationbuffermemory-heres-the-automated-test-you-need-16m1</link>
      <guid>https://forem.com/_eb7f2a654e97a60ae9f96e/i-spent-6-hours-fixing-langchains-conversationbuffermemory-heres-the-automated-test-you-need-16m1</guid>
      <description>&lt;p&gt;At 4:59 PM on a Friday, I was about to close my laptop and sneak out when the QA colleague's icon flashed on DingTalk: "Come check this out. The support bot remembers I'm Zhang San, but when I ask for my order number, it insists it belongs to Li Si." I pulled up the logs and saw LangChain's &lt;code&gt;ConversationBufferMemory&lt;/code&gt; behaving like it had severe amnesia — Session A was mixing up chat history from Session B. In that moment, I knew that unless I built an automated test suite to lock down the accuracy and consistency of memory storage, the next blow-up would definitely happen at 2 AM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Breaking Down the Problem
&lt;/h2&gt;

&lt;p&gt;In LLM-powered chat products, the memory module is responsible for remembering context across multiple turns — so that when the user said earlier "I live in Beijing," the weather query later can automatically include "Beijing." Sounds easy, but things get messy once you land in LangChain: &lt;code&gt;ConversationBufferMemory&lt;/code&gt; stores all conversations in plain text. It works fine as long as the memory fits in RAM, but switch to Redis or a database for persistence, and a whole bunch of issues bubble up — serialization/deserialization, concurrent reads/writes, and trimming old messages.&lt;/p&gt;

&lt;p&gt;In our production scenario, a customer service bot handled hundreds of concurrent users. Each user session was independent but they all shared a common Redis instance. When we first launched, QA manually tested a dozen typical conversation paths and found no cross-session memory leaks at all. That was no comfort: manual testing simply can't cover race conditions under high concurrency, nor reproduce edge cases where &lt;code&gt;trim_messages&lt;/code&gt; mixes up adjacent sessions when a Redis connection blinks out. Once real traffic hit, bugs popped up like whack-a-mole — you fixed one, another sprang out. We desperately needed a set of regression tests that could directly verify memory read/write accuracy and cross-session isolation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing the Solution
&lt;/h2&gt;

&lt;p&gt;The goal was clear: run through the core logic of the memory module right in our local CI, without a real LLM or a real Redis instance, and catch issues before any code landed.&lt;/p&gt;

&lt;p&gt;Framework choice was a no-brainer — Pytest. Its fixture capabilities are perfect for assembling different memory instances. LangChain's memory abstraction is fairly clean: &lt;code&gt;BaseChatMemory&lt;/code&gt; provides uniform &lt;code&gt;save_context&lt;/code&gt; and &lt;code&gt;load_memory_variables&lt;/code&gt; interfaces, so we could write the same set of tests against different memory backends. A real Redis is too heavy, so we chose &lt;code&gt;fakeredis&lt;/code&gt; to simulate a Redis instance in memory — quick to spin up and zero side effects. All LLM calls were banished with &lt;code&gt;unittest.mock&lt;/code&gt;, because we were testing memory, not the LLM.&lt;/p&gt;

&lt;p&gt;Why not use the built-in &lt;code&gt;langchain.tests&lt;/code&gt;? They only cover the shallowest of interfaces, none of the hard-won scenarios like message type conversion or multi-session isolation. We also didn't want to run Redis in a Docker container — our CI resources are already stretched thin; adding one more container would jam the build queue by an extra 3 minutes.&lt;/p&gt;

&lt;p&gt;The overall architecture: define a &lt;code&gt;fake_redis_memory&lt;/code&gt; fixture inside Pytest's &lt;code&gt;conftest.py&lt;/code&gt;, use it to construct different Memory subclasses (&lt;code&gt;ConversationBufferMemory&lt;/code&gt;, &lt;code&gt;ConversationSummaryMemory&lt;/code&gt;), simulate multi-turn conversations with helper functions, and then assert that the history returned by &lt;code&gt;load_memory_variables&lt;/code&gt; is both complete and free of cross-session contamination.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Building a Zero-Dependency Test Harness
&lt;/h3&gt;

&lt;p&gt;This snippet packages fakeredis, mock LLM, and Memory instantiation into a fixture. All subsequent test cases run on top of it. The non-negotiable requirement: zero network requests, and any single test completes in under 0.3 seconds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# conftest.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;unittest.mock&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MagicMock&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConversationBufferMemory&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.chat_message_histories&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RedisChatMessageHistory&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fakeredis&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FakeRedis&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fake_redis_memory&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# 用 fakeredis 构建一个假 Redis 客户端
&lt;/span&gt;    &lt;span class="n"&gt;fake_redis_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FakeRedis&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_create_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# 注入伪造的 Redis，保证每次测试的 session 隔离
&lt;/span&gt;        &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RedisChatMessageHistory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fake_redis_client&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# ConversationBufferMemory 默认 return_messages=True 时，会返回 Message 对象
&lt;/span&gt;        &lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversationBufferMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;chat_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;return_messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# 关键：确保拿到结构化消息，方便断言
&lt;/span&gt;        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_create_memory&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Testing Accuracy: Every Message Written Must Come Back
&lt;/h3&gt;

&lt;p&gt;This test simulates two rounds of conversation and verifies that the history returned by &lt;code&gt;load_memory_variables&lt;/code&gt; has the exact length and content we expect. It puts an end to the mysterious "I stored two lines but only got one back" bug.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
python
# test_memory_accuracy.py
from langchain.schema import HumanMessage, AIMessage

def test_buffer_memory_keeps_all_messages(fake_redis_memory):
    memory = fake_redis_memory("session_1202")

    # 模拟第一轮对话
    memory.save_context(
        {"input": "我叫张三"},
        {"output": "你好张三"}
    )
    # 模拟第二轮对话
    memory.save_context(
        {"input": "我的订单号是多少"},
        {"output": "你的订单号是 #1123"}
    )

    variables = memory.load_memory_variables({})
    history = variables.get("history", [])

    # 断言：总共应该有 4 条消息（两问两答）
    assert len(history) == 4
    assert isin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
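
&lt;p&gt;The other hard-won scenario from earlier, multi-session isolation, gets its own case. A minimal sketch; the session IDs are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# test_memory_isolation.py
def test_sessions_do_not_leak(fake_redis_memory):
    mem_a = fake_redis_memory("session_a")
    mem_b = fake_redis_memory("session_b")

    # Write only into session A
    mem_a.save_context({"input": "A's secret"}, {"output": "noted"})

    # Session B must come back empty: no cross-session contamination
    history_b = mem_b.load_memory_variables({}).get("history", [])
    assert history_b == []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;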

</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Scaling Rate Limiting from Single‑Node to a Distributed Go+Redis Token Bucket — 10x Throughput Under Load (with Degradation Strategy)</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Sun, 03 May 2026 12:08:09 +0000</pubDate>
      <link>https://forem.com/_eb7f2a654e97a60ae9f96e/scaling-rate-limiting-from-single-node-to-a-distributed-goredis-token-bucket-10x-throughput-ffg</link>
      <guid>https://forem.com/_eb7f2a654e97a60ae9f96e/scaling-rate-limiting-from-single-node-to-a-distributed-goredis-token-bucket-10x-throughput-ffg</guid>
      <description>&lt;p&gt;At 2 AM, an alert pulled me out of bed — the database connection pool of our order service was exhausted, and most requests were returning 504. It turned out a marketing campaign was driving triple the usual traffic. Our in‑memory per‑instance token bucket rate limiter, deployed across three replicas, operated in isolation; global rate limiting was effectively non‑existent. That moment I realized: &lt;strong&gt;if the state is not shared, rate limiting is just an illusion.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Breaking Down the Problem
&lt;/h2&gt;

&lt;p&gt;This is distressingly common in microservices. To protect downstream services, teams often set a limit like “max 200 QPS per instance”. Deploy three instances, and you might assume global traffic will be capped at 600 QPS. In reality, load balancing is rarely perfectly even: one instance exhausts its 200 QPS quota and starts rejecting legitimate traffic while the other two still have headroom. Worse, token buckets permit bursts up to bucket capacity, and those bursts stack across replicas, so the instantaneous peak hitting the downstream can easily exceed 900 QPS. This is the fatal flaw of per‑instance rate limiting at scale: &lt;strong&gt;the limiting logic is chopped up by instance boundaries, becoming “paper‑only” rate limiting from a global perspective.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The root cause is simple: the token bucket’s &lt;strong&gt;current token count and last refill timestamp&lt;/strong&gt; live purely in memory and are not shared across instances. A typical Redis fixed‑window counter (INCR + EXPIRE) can share state, but it suffers from boundary spikes — the last 100 ms of one second and the first 100 ms of the next can overlap to produce a burst of twice the allowed rate, still dangerous for downstream systems. We needed a solution that shares state &lt;em&gt;and&lt;/em&gt; smooths traffic — a distributed token bucket.&lt;/p&gt;
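
&lt;p&gt;To make that boundary spike concrete, here is a minimal, dependency-free simulation (sketched in Python for brevity; the production implementation below is Go):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# fixed_window_demo.py -- why INCR-style fixed windows burst at boundaries
LIMIT = 100  # allowed requests per 1-second window

def allowed(counters, now_ms):
    window = now_ms // 1000  # window key = the current second
    counters[window] = counters.get(window, 0) + 1
    return counters[window] &lt;= LIMIT

counters = {}
# 100 requests in the last 100 ms of second 0, 100 more in the first 100 ms of second 1
burst = [900] * 100 + [1100] * 100
passed = sum(allowed(counters, t) for t in burst)
print(passed)  # 200: twice the per-second limit squeezed into ~200 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;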

&lt;h2&gt;
  
  
  Design
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Choice: Go + Redis + Lua script for a distributed token bucket.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why not the other options?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nginx/gateway‑level rate limiting&lt;/strong&gt;: Adds a proxy hop and sits outside the business logic, making fine‑grained controls (e.g., mixed limiting by user and API) hard to implement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pure Redis sliding window&lt;/strong&gt;: Doable with sorted sets, but you must constantly evict expired members, incurring memory and CPU overhead, and the algorithmic complexity often introduces performance bottlenecks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Go distributed rate‑limiting libraries&lt;/strong&gt;: Many are unmaintained or only support simple fixed‑window counters, lacking the flexibility of a token bucket.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The final architecture is straightforward: move the token bucket’s core state (&lt;code&gt;tokens&lt;/code&gt;, &lt;code&gt;last_refill_time&lt;/code&gt;) into Redis, and use a Lua script to atomically calculate and update them. Thanks to Redis’s single‑threaded execution model, concurrent requests are serialized safely no matter how many instances issue them. The application side wraps this in a &lt;code&gt;DistributedTokenBucket&lt;/code&gt; struct that integrates a built‑in degradation strategy: &lt;strong&gt;when Redis is unavailable (timeout, disconnection), it automatically falls back to a local &lt;code&gt;golang.org/x/time/rate&lt;/code&gt; token bucket.&lt;/strong&gt; Even if Redis completely goes down, downstream services are not overwhelmed: we degrade to single‑instance rate limiting, preserving the fundamental protection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Implementation
&lt;/h2&gt;

&lt;p&gt;The following Lua script handles the atomic “token generation + consumption check” step. It accepts the timestamp as an argument to avoid relying on potentially inconsistent system clocks across instances. (Using &lt;code&gt;redis.call('TIME')&lt;/code&gt; is also an option, depending on your consistency paranoia.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// 这段代码解决：如何用一段 Lua 保证“计算新增令牌 -&amp;gt; 判断是否足够 -&amp;gt; 扣减”的原子性&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;tokenBucketLua&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;`
local key       = KEYS[1]              -- token bucket key
local rate      = tonumber(ARGV[1])    -- tokens minted per second
local capacity  = tonumber(ARGV[2])    -- bucket capacity
local now       = tonumber(ARGV[3])    -- current timestamp (milliseconds)
local requested = tonumber(ARGV[4])    -- tokens requested

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])

if tokens == nil then
    -- first access: initialize the token bucket
    tokens = capacity
    last_refill = now
end

-- compute elapsed time and newly minted tokens
local delta = math.max(0, now - last_refill)
local new_tokens = math.floor(delta * rate / 1000)
tokens = math.min(capacity, tokens + new_tokens)

local allowed = 0
if tokens &amp;gt;= requested then
    tokens = tokens - requested
    allowed = 1
end

-- update the state in Redis and set a sane TTL so cold keys don't linger
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 60)

return {allowed, tokens}
`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, the Go struct and the core &lt;code&gt;Take&lt;/code&gt; method. Its responsibility is to execute the Lua script, handle Redis errors, and trigger the fallback path when Redis is not healthy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// 这段代码解决：封装 Redis 调用，提供限流入口，并在 Redis 不可用时降级到本地令牌桶&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"context"&lt;/span&gt;
    &lt;span class="s"&gt;"errors"&lt;/span&gt;
    &lt;span class="s"&gt;"time"&lt;/span&gt;

    &lt;span class="s"&gt;"github.com/redis/go-redis/v9"&lt;/span&gt;
    &lt;span class="s"&gt;"golang.org/x/time/rate"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;DistributedTokenBucket&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;rdb&lt;/span&gt;        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;
    &lt;span class="n"&gt;script&lt;/span&gt;     &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Script&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt;        &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;rate&lt;/span&gt;       &lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="c"&gt;// 令牌/秒&lt;/span&gt;
    &lt;span class="n"&gt;capacity&lt;/span&gt;   &lt;span class="kt"&gt;int&lt;/span&gt;     &lt;span class="c"&gt;// 桶容量&lt;/span&gt;
    &lt;span class="n"&gt;fallback&lt;/span&gt;   &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Limiter&lt;/span&gt; &lt;span class="c"&gt;// 本地降级限流器&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;NewDistributedTokenBucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rdb&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ratePerSec&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capacity&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;DistributedTokenBucket&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// 本地降级器：容量和速率取全局值的一部分，保护下游&lt;/span&gt;
    &lt;span class="n"&gt;fallbackLimiter&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewLimiter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ratePerSec&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;capacity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;DistributedTokenBucket&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;rdb&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="n"&gt;rdb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;script&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewScript&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenBucketLua&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;ratePerSec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;capacity&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;capacity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;fallbackLimiter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;DistributedTokenBucket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnixMilli&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;script&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rdb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;capacity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// Redis 不可用时，降级为本地令牌桶&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Allow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Allow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Allow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;allowed&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This design keeps the happy path fully distributed and cooperative, while the unhappy path keeps the system alive: a Redis outage degrades us to per-instance limiting, but it never strips away &lt;em&gt;all&lt;/em&gt; protection.&lt;/p&gt;

&lt;p&gt;In our load tests, replacing the old per‑instance token bucket with this distributed implementation allowed us to safely absorb a 10x increase in global QPS without crashing the downstream. The fallback kicked in seamlessly during Redis failover, proving that the “paper‑only” days were over.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Slash Multi-Level Cache Debugging Time by 90% with Pytest Parametrization</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Sun, 03 May 2026 01:08:38 +0000</pubDate>
      <link>https://forem.com/_eb7f2a654e97a60ae9f96e/slash-multi-level-cache-debugging-time-by-90-with-pytest-parametrization-28kh</link>
      <guid>https://forem.com/_eb7f2a654e97a60ae9f96e/slash-multi-level-cache-debugging-time-by-90-with-pytest-parametrization-28kh</guid>
      <description>&lt;p&gt;The winter in Hangzhou is miserably damp. At 1:47 AM, I was jolted awake by an alert SMS — “User profile page returning mixed values, user A is seeing user B’s orders.” My gut told me it was cache corruption again. After digging around for a while, I found that the invalidation logic between the local &lt;code&gt;lru_cache&lt;/code&gt; and Redis had missed a single &lt;code&gt;delete&lt;/code&gt; in one branch. I had to manually run dozens of test cases just to reproduce it. The next day, I refactored these tests using Pytest parameterization, turning “manual brain exhaustion” into “automated machine exhaustion.” I’ve never lost sleep over this issue since. This article is about &lt;strong&gt;how to use Pytest parameterization to achieve zero-blind-spot testing for multi‑level cache (local + Redis) consistency verification&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Manually Testing Multi‑Level Caches Is a Bottomless Pit
&lt;/h2&gt;

&lt;p&gt;Multi‑level caching is a common pattern: read requests first check a local store (e.g., &lt;code&gt;lru_cache&lt;/code&gt; or &lt;code&gt;cachetools&lt;/code&gt;); on a miss, they hit Redis and then backfill the local cache. Writes update Redis and &lt;strong&gt;selectively invalidate&lt;/strong&gt; the local cache. That "selective invalidation" is a hotbed for bugs — you often skip clearing the local cache on certain update paths for performance reasons, and then a path you thought was safe suddenly breaks.&lt;/p&gt;

&lt;p&gt;For example, an endpoint that changes a username deletes only the Redis key &lt;code&gt;user:{id}&lt;/code&gt;, but the local cache key happens to be &lt;code&gt;user_profile:{id}&lt;/code&gt;, so the delete never touches the local entry. More subtly, the local TTL is very short: during peak hours, high QPS constantly rebuilds the cache and masks the inconsistency; it’s only exposed late at night when traffic drops. Behavior in the test environment and in production looks completely different.&lt;/p&gt;

&lt;p&gt;Typical manual testing needs to cover: multi‑key mappings, reads after concurrent updates, backfill on cache miss, TTL expiration boundaries, in‑process mutual exclusion, and more. A human brain can enumerate maybe 20 combinations and still often falls short. Pytest parametrization automates this entire process, and &lt;strong&gt;test cases double as documentation&lt;/strong&gt;, so even newcomers understand them in seconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  Design: Use &lt;code&gt;@pytest.mark.parametrize&lt;/code&gt; to Build a “Scenario Matrix”
&lt;/h2&gt;

&lt;p&gt;My goal wasn’t to test the caching middleware itself but to &lt;strong&gt;verify that the business logic’s composition is correct&lt;/strong&gt;. So I adopted a layered testing approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fake Redis&lt;/strong&gt; (using the &lt;code&gt;fakeredis&lt;/code&gt; library) to eliminate external dependencies and let tests run directly in CI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The system under test is a &lt;code&gt;CacheManager&lt;/code&gt; class&lt;/strong&gt; that encapsulates the strategy: “local read → Redis read → local backfill” as well as “write Redis + local cleanup.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test cases are generated via parameterization&lt;/strong&gt;, covering: whether a key hits in local store, whether it hits Redis, whether backfill occurs, whether the local cache is correctly deleted after a write, and whether dirty reads happen under concurrent access.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Why not use integration tests against a real Redis? &lt;strong&gt;Speed&lt;/strong&gt;. These parameterized cases will eventually cover hundreds of combinations; a unit test must complete in milliseconds, otherwise nobody will run them frequently. And no Docker dependency means what‑you‑see‑is‑what‑you‑get.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Implementation: Multi‑Level Cache Class + Pytest Parameterized Tests
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The &lt;code&gt;CacheManager&lt;/code&gt; under test (ready to run)
&lt;/h3&gt;

&lt;p&gt;This code implements the read path (“local first, then remote”) and the write path (“remote first, then clear local”).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# cache_manager.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;functools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;lru_cache&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;redis_lib&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CacheManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;本地(LRU) + Redis 两级缓存管理器&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;redis_lib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;local_ttl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;local_ttl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;local_ttl&lt;/span&gt;
        &lt;span class="c1"&gt;# 本地缓存，最多存 128 个 key，用于实际业务限制内存
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_local_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_local_get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;从本地字典读，并检查过期时间&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_local_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;local_ttl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_local_store&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_local_set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_local_store&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_local_delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_local_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# 1. 先查本地
&lt;/span&gt;        &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_local_get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. 再查 Redis
&lt;/span&gt;        &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# 3. 回填本地缓存，注意解码
&lt;/span&gt;            &lt;span class="n"&gt;decoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_local_set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decoded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;decoded&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# 先写远程，再清本地，保证下次本地读强一致
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# 这里故意只清本地，依赖下次 get 回填
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_local_delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Pytest parameterized tests – covering read‑write combinations
&lt;/h3&gt;

&lt;p&gt;The code below solves the problem of &lt;strong&gt;exhaustively iterating all permutations: “local hit/miss × Redis hit/miss × read‑after‑write,”&lt;/strong&gt; verifying both the correctness of the returned values and the backfill logic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_cache_consistency.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;redis_lib&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fakeredis&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FakeRedis&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cache_manager&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CacheManager&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fake_redis&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
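
&lt;p&gt;Continuing &lt;code&gt;test_cache_consistency.py&lt;/code&gt;, a condensed sketch of the read-path matrix described above (the scenario values are illustrative, not the full suite):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;@pytest.mark.parametrize(
    "preload_local, preload_redis, expected",
    [
        (True,  True,  "local_value"),   # local hit wins; Redis never consulted
        (False, True,  "redis_value"),   # local miss -&gt; Redis hit -&gt; backfill
        (False, False, None),            # miss everywhere
    ],
)
def test_read_path(fake_redis, preload_local, preload_redis, expected):
    mgr = CacheManager(fake_redis)
    if preload_redis:
        fake_redis.set("user:1", "redis_value")
    if preload_local:
        mgr._local_set("user:1", "local_value")

    assert mgr.get("user:1") == expected
    if not preload_local and preload_redis:
        # Backfill happened: the second read must be served locally
        assert mgr._local_get("user:1") == "redis_value"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;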



</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>From 800 Lines of Shell to 30 Lines of Pytest: 10x Redis Persistence Testing Efficiency</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Sat, 02 May 2026 12:08:06 +0000</pubDate>
      <link>https://forem.com/_eb7f2a654e97a60ae9f96e/from-800-lines-of-shell-to-30-lines-of-pytest-10x-redis-persistence-testing-efficiency-5e9k</link>
      <guid>https://forem.com/_eb7f2a654e97a60ae9f96e/from-800-lines-of-shell-to-30-lines-of-pytest-10x-redis-persistence-testing-efficiency-5e9k</guid>
      <description>&lt;p&gt;It was 2 a.m. when I got jolted awake by an alerting call—all user points data had rolled back by three hours. After digging for ages, I found that ops had tweaked the &lt;code&gt;save&lt;/code&gt; parameter in &lt;code&gt;redis.conf&lt;/code&gt;, changing the RDB snapshot interval from 5 minutes to 3 hours. When the node restarted, a massive amount of hot data simply evaporated. What made it worse: this configuration change had been “tested manually”. A colleague restarted Redis, saw that the keys were still there, and called it good. I cursed at the screen: “What’s the point of testing if you test like this?”&lt;/p&gt;

&lt;p&gt;The next day, I tore down the entire persistence verification setup and rebuilt it with &lt;strong&gt;pytest + Docker&lt;/strong&gt; as an automated test suite. &lt;strong&gt;What used to take 800 lines of Shell and 2 hours of environment tweaking now runs in a few minutes with 30 lines of pytest.&lt;/strong&gt; Best of all, any reckless change to the persistence configuration can be proven within 10 seconds—did we lose data or not?&lt;/p&gt;

&lt;h2&gt;
  
  
  Breaking down the problem: why manual Shell/Docker persistence tests are basically useless
&lt;/h2&gt;

&lt;p&gt;Redis persistence comes in three flavors: RDB, AOF, and a mix of both, plus a jungle of parameters like &lt;code&gt;save&lt;/code&gt;, &lt;code&gt;appendfsync&lt;/code&gt;, &lt;code&gt;aof-use-rdb-preamble&lt;/code&gt;, and many more—combinatorial explosion. Most teams verify persistence in one of two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Manually starting and stopping Docker containers&lt;/strong&gt;, writing a few items with &lt;code&gt;redis-cli&lt;/code&gt;, doing &lt;code&gt;docker restart&lt;/code&gt;, then running &lt;code&gt;KEYS *&lt;/code&gt;—which only proves “it can start”, not “how many seconds of data disappeared”.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writing a pile of Shell scripts&lt;/strong&gt; that use &lt;code&gt;docker exec&lt;/code&gt; to drive &lt;code&gt;redis-cli&lt;/code&gt; and then &lt;code&gt;diff&lt;/code&gt; the data—scripts that get bloated and are brittle because the environment changes every time: &lt;code&gt;docker stop&lt;/code&gt; wait time, file cleanup policies—even a minor change makes results unpredictable.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The root cause is clear: &lt;strong&gt;Redis persistence is the product of a time window, system signals, and filesystem flushing. Manual operation simply can’t control these precisely.&lt;/strong&gt; For example, &lt;code&gt;docker stop&lt;/code&gt; sends SIGTERM to the container by default; when Redis receives it, it tries to perform an RDB save. But how long does that save take? Will it be cut off by SIGKILL? A Shell script has no ability to simulate fault scenarios like “how much data is lost at the moment of a crash.” Even more importantly, &lt;strong&gt;consistency verification lacks repeatable assertions&lt;/strong&gt;—manual testing only gives you a gut feeling that “probably nothing was lost.” That’s a landmine for production.&lt;/p&gt;
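
&lt;p&gt;This is exactly the control that &lt;code&gt;docker-py&lt;/code&gt; gives you programmatically. A minimal sketch (the container name is hypothetical): SIGKILL drops the process with no chance to write a final RDB, which is the crash we want to simulate, whereas &lt;code&gt;docker stop&lt;/code&gt;'s SIGTERM gives Redis time to save.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import docker

client = docker.from_env()
container = client.containers.get("redis-rdb-test")  # hypothetical name

# SIGKILL: the process dies instantly; no shutdown-time RDB save happens
container.kill(signal="SIGKILL")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;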

&lt;h2&gt;
  
  
  Solution design: why pytest + Docker, not Testcontainers or a K8s Job?
&lt;/h2&gt;

&lt;p&gt;I wanted a &lt;strong&gt;programmable, assertable, reproducible&lt;/strong&gt; test framework with these core requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Precisely control Redis startup parameters and persistence configuration&lt;/li&gt;
&lt;li&gt;Simulate real-world failures: &lt;code&gt;kill -9&lt;/code&gt;, power-off-style shutdown, AOF file truncation, etc.&lt;/li&gt;
&lt;li&gt;Automatically clean up the environment after a run—no leftover garbage&lt;/li&gt;
&lt;li&gt;Run in CI/CD, but also instantly on a dev machine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Technology comparison:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Shell + docker-compose&lt;/td&gt;
&lt;td&gt;Team familiarity&lt;/td&gt;
&lt;td&gt;Weak assertions, unable to precisely control restarts and signals, shell script maintenance nightmare&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Testcontainers (Python)&lt;/td&gt;
&lt;td&gt;Native pytest integration, good lifecycle management&lt;/td&gt;
&lt;td&gt;Parameters can only be tweaked through &lt;code&gt;redis-cli&lt;/code&gt; after startup, so dynamic config changes (e.g., toggling AOF) need yet another wrapper; the extra abstraction layer also drives up debugging cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes Job&lt;/td&gt;
&lt;td&gt;Production-grade&lt;/td&gt;
&lt;td&gt;Too heavy, can’t run locally, CI needs a K8s cluster – using a sledgehammer to crack a nut&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;docker-py + pytest&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lightweight, programmable container lifecycle control, native Python assertions&lt;/td&gt;
&lt;td&gt;This is the one I chose. Use the &lt;code&gt;docker&lt;/code&gt; SDK to start/stop containers and manage volumes, &lt;code&gt;redis-py&lt;/code&gt; for data read/write, pytest fixtures for environment injection. The whole solution is under 500 lines of Python, and on CI it only depends on a Docker daemon.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Architecturally, I split the tests into three layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure layer&lt;/strong&gt;: &lt;code&gt;docker-py&lt;/code&gt; creates Redis containers, mounts temporary volumes for RDB/AOF files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operation layer&lt;/strong&gt;: &lt;code&gt;redis-py&lt;/code&gt; writes, reads, issues &lt;code&gt;CONFIG SET&lt;/code&gt;, &lt;code&gt;BGSAVE&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assertion layer&lt;/strong&gt;: pytest asserts whether data exists, whether files were created, whether the AOF contains the last write.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This layering lets test cases focus only on “write data → how it dies → is the data correct after restart,” without caring about how the container starts or what mount paths are used.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core implementation: ready-to-run test code
&lt;/h2&gt;

&lt;p&gt;The following code addresses one problem: &lt;strong&gt;verify that after a Redis process is killed with &lt;code&gt;kill -9&lt;/code&gt;, all data written after the last BGSAVE is lost as expected—and no extra loss occurs&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. conftest.py: managing the Redis container lifecycle with a fixture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
python
# conftest.py
import pytest
import docker
import redis
import time
import os

REDIS_IMAGE = "redis:7.2"  # 固定版本，避免 CI 上拉取 latest 导致不一致

@pytest.fixture(scope="function")
def rdb_container(tmp_path):
    """
    启动一个配置了 RDB 持久化的 Redis 容器，数据文件写入临时目录。
    tmp_path 是 pytest 提供的临时路径，每个测试函数独立，互不干扰。
    """
    client = docker.from_env()
    data_dir = tmp_path / "data"
    data_dir.mkdir()

    container = client.containers.run(
        image=REDIS_IMAGE,
        name=f"redis-rdb-test-{os.getpid()}",  # 避免容器重名
        command=[
            "redis-server",
            "--save 900 1",        # 900秒内至少1次修改则保存，这里故意设大，手动控制BGSAVE
            "--save 300 10",
            "--save 60 10000",
            "--dir /data",
            "--dbfilename dump.rdb"
        ],
        volumes={str(data_dir): {"bind": "/data", "mode": "rw"}},
        ports={"6379/tcp": None},  # 让 Docker 分配随机端口
        detach=True,
        remov
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
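
&lt;p&gt;With that fixture in place, the &lt;code&gt;kill -9&lt;/code&gt; case stays small. A hedged sketch: it assumes the fixture yields &lt;code&gt;(container, conn, data_dir)&lt;/code&gt; as completed above, and the readiness wait is deliberately crude:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# test_rdb_kill9.py
import time
import docker
import redis

def test_kill9_loses_only_post_bgsave_writes(rdb_container):
    container, conn, data_dir = rdb_container

    conn.set("before", "1")
    conn.bgsave()  # snapshot now: "before" lands in dump.rdb
    while conn.info("persistence")["rdb_bgsave_in_progress"]:
        time.sleep(0.05)

    conn.set("after", "1")            # lives only in memory, newer than the snapshot
    container.kill(signal="SIGKILL")  # crash with no shutdown-time save

    # Bring Redis back up on the same /data volume
    client = docker.from_env()
    revived = client.containers.run(
        "redis:7.2", command=["redis-server", "--dir", "/data"],
        volumes={str(data_dir): {"bind": "/data", "mode": "rw"}},
        ports={"6379/tcp": None}, detach=True, remove=True,
    )
    revived.reload()
    conn2 = redis.Redis(port=int(revived.ports["6379/tcp"][0]["HostPort"]))
    time.sleep(0.5)  # crude readiness wait, fine for a sketch

    assert conn2.get("before") == b"1"  # snapshotted data survives
    assert conn2.get("after") is None   # post-BGSAVE write is lost, as expected
    revived.kill()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;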

</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Uncovering 8% IndexedDB Data Loss After Browser Crashes with Playwright</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Sat, 02 May 2026 01:07:53 +0000</pubDate>
      <link>https://forem.com/_eb7f2a654e97a60ae9f96e/uncovering-8-indexeddb-data-loss-after-browser-crashes-with-playwright-3j2m</link>
      <guid>https://forem.com/_eb7f2a654e97a60ae9f96e/uncovering-8-indexeddb-data-loss-after-browser-crashes-with-playwright-3j2m</guid>
      <description>&lt;p&gt;At 2 a.m., our user group exploded — people were saying data had just vanished, as if the browser had “eaten” it. Our frontend stores application state in IndexedDB, which is supposed to be far more reliable than localStorage. How could it disappear without a trace? I spent two hours digging through logs and backend records before zeroing in on a dark secret of browser storage: when disk space gets tight, Chrome will silently delete IndexedDB data without any notification. Worse, you can’t reproduce it by hand because you’re not running on the “chosen” hard drive. I decided to write an automated test with Playwright that simulates browser crashes and storage pressure — and expose IndexedDB’s real behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Breaking down the problem
&lt;/h2&gt;

&lt;p&gt;IndexedDB was designed to be a client-side persistent storage, and the W3C spec even says “data should be kept as long as possible”. But a spec is one thing; what browser vendors actually implement is another. Chrome has a mechanism called &lt;strong&gt;“Storage Pressure Eviction”&lt;/strong&gt;: when the user’s disk space drops below a certain threshold, the browser evicts data from less “important” origins using an LRU policy. By default, IndexedDB does not request a persistent-storage permission (&lt;code&gt;navigator.storage.persist()&lt;/code&gt;), so it’s very easy to get kicked out. If you haven’t applied for persistent storage permission in a PWA, your database is about as sturdy as a camping tent.&lt;/p&gt;
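
&lt;p&gt;The first line of defense is simply asking for durability. With a Playwright &lt;code&gt;page&lt;/code&gt; in hand, requesting it is a one-liner (a sketch; Chromium may still refuse based on its own heuristics):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Ask the browser not to evict this origin's storage under disk pressure
granted = page.evaluate("() =&gt; navigator.storage.persist()")
print("persistent storage granted:", granted)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;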

&lt;p&gt;Why don’t normal testing approaches work? Because manual testing only covers “normal reads and writes” — it can’t simulate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A sudden browser process crash (kill, power loss)&lt;/li&gt;
&lt;li&gt;The context being unexpectedly destroyed and then restarted (user closing a tab and reopening it)&lt;/li&gt;
&lt;li&gt;The internal cleanup triggered by a disk-space warning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These scenarios require a controlled environment where you can repeatedly run a fast write → destroy → rebuild → verify loop automatically. That’s exactly what Playwright’s Browser Context isolation and its rich CDP (Chrome DevTools Protocol) capabilities are built for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution design
&lt;/h2&gt;

&lt;p&gt;I didn’t choose Selenium because it’s too heavy and context management feels unnatural. I skipped Puppeteer because Playwright natively supports multiple browsers and multiple contexts with a more modern API. Most importantly, each context created by Playwright’s &lt;code&gt;browser.new_context()&lt;/code&gt; has its own independent storage sandbox — closing that context is equivalent to destroying the entire session’s IndexedDB, perfectly simulating the “user closes browser / tab” action.&lt;/p&gt;

&lt;p&gt;The architecture is a straightforward “brutal loop validation”:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use Playwright to create a persistent context (so it won’t be automatically cleaned up).&lt;/li&gt;
&lt;li&gt;Open the page and inject a script that writes a record with a unique ID and a checksum into IndexedDB, then explicitly call &lt;code&gt;navigator.storage.persist()&lt;/code&gt; to request persistence.&lt;/li&gt;
&lt;li&gt;Actively close that context to simulate a browser close or crash.&lt;/li&gt;
&lt;li&gt;Create a new context, open the same page, read from IndexedDB, and check both data integrity and the number of records.&lt;/li&gt;
&lt;li&gt;Repeat N times, each time writing data of random sizes and occasionally using CDP commands to simulate storage-pressure events.&lt;/li&gt;
&lt;li&gt;Count the number of data-loss events and inconsistencies, then generate a report.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Why not use incognito mode for this? Because IndexedDB in incognito is designed to be wiped on close — testing persistence there would be pure performance art.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core implementation
&lt;/h2&gt;

&lt;p&gt;First, install Playwright and pytest. Then you can run the following three pieces of code directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code 1: IndexedDB utility functions — solving “how to reliably write and make sure it’s actually flushed to disk”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the foundation. Inside &lt;code&gt;page.evaluate()&lt;/code&gt; we wrap the entire IndexedDB transaction lifecycle in a Promise, ensuring the data is committed before returning.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# idb_helpers.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.sync_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Page&lt;/span&gt;

&lt;span class="n"&gt;IDB_WRITE_SCRIPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
async ({ dbName, storeName, key, value }) =&amp;gt; {
    return new Promise((resolve, reject) =&amp;gt; {
        const request = indexedDB.open(dbName, 1);
        request.onupgradeneeded = (event) =&amp;gt; {
            const db = event.target.result;
            if (!db.objectStoreNames.contains(storeName)) {
                db.createObjectStore(storeName, { keyPath: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; });
            }
        };
        request.onsuccess = (event) =&amp;gt; {
            const db = event.target.result;
            // The transaction scope must include storeName, otherwise the write won&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t go through
            const tx = db.transaction(storeName, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;readwrite&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;);
            const store = tx.objectStore(storeName);
            // Store a CRC field inside value to verify consistency later
            store.put({ id: key, data: value, checksum: simpleChecksum(value) });
            tx.oncomplete = () =&amp;gt; resolve(true);
            tx.onerror = (e) =&amp;gt; reject(e);
        };
        request.onerror = (e) =&amp;gt; reject(e);

        function simpleChecksum(str) {
            let hash = 0;
            for (let i = 0; i &amp;lt; str.length; i++) {
                hash = ((hash &amp;lt;&amp;lt; 5) - hash) + str.charCodeAt(i);
                hash |= 0; // Convert to 32bit integer
            }
            return hash;
        }
    });
}
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_indexeddb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;store_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IDB_WRITE_SCRIPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;store_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why did we add a checksum here? Because we need to know not just that a record still exists after a crash, but that its contents came back intact. The checksum lets the read-back step detect corruption, which is an even nastier failure mode than plain disappearance.&lt;/p&gt;
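
&lt;p&gt;The read side mirrors the write helper. A sketch of the counterpart (&lt;code&gt;IDB_READ_SCRIPT&lt;/code&gt; and &lt;code&gt;read_indexeddb&lt;/code&gt; are names introduced here, following the same single-argument &lt;code&gt;evaluate&lt;/code&gt; convention):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# idb_helpers.py (continued): hypothetical read-back counterpart
IDB_READ_SCRIPT = """
async ({ dbName, storeName, key }) =&gt; {
    return new Promise((resolve, reject) =&gt; {
        const request = indexedDB.open(dbName, 1);
        request.onupgradeneeded = (event) =&gt; {
            // Landing here means the DB was evicted; recreate the store so
            // the transaction below doesn't throw and the read returns null
            event.target.result.createObjectStore(storeName, { keyPath: 'id' });
        };
        request.onsuccess = (event) =&gt; {
            const db = event.target.result;
            const tx = db.transaction(storeName, 'readonly');
            const getReq = tx.objectStore(storeName).get(key);
            getReq.onsuccess = () =&gt; resolve(getReq.result || null);
            getReq.onerror = (e) =&gt; reject(e);
        };
        request.onerror = (e) =&gt; reject(e);
    });
}
"""

def read_indexeddb(page, db_name, store_name, key):
    # Returns the stored record (with its checksum) or None if evicted/lost
    return page.evaluate(IDB_READ_SCRIPT, {"dbName": db_name, "storeName": store_name, "key": key})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;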

</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>3 Asyncio Pitfalls That Took Me 3 Hours to Debug and Almost Crashed Production</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Fri, 01 May 2026 20:25:25 +0000</pubDate>
      <link>https://forem.com/_eb7f2a654e97a60ae9f96e/3-asyncio-pitfalls-that-took-me-3-hours-to-debug-and-almost-crashed-production-1fdm</link>
      <guid>https://forem.com/_eb7f2a654e97a60ae9f96e/3-asyncio-pitfalls-that-took-me-3-hours-to-debug-and-almost-crashed-production-1fdm</guid>
      <description>&lt;p&gt;Here’s the story: last week my lead asked me to optimize a data aggregation service that calls 20 downstream APIs. The serial version took around 18 seconds — users were ready to throw their keyboards. Obvious IO-bound job, right? I thought I’d slap on asyncio, ship it in half a day, and look like a hero. Instead, I spent three hours falling into every rabbit hole asyncio had to offer, and nearly took down production. This post walks through the three biggest pitfalls I hit and how to write async code that actually works in the real world.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Your Concepts Straight First
&lt;/h2&gt;

&lt;p&gt;At its core, asyncio is a &lt;strong&gt;single-threaded event loop&lt;/strong&gt; — a master scheduler that lines up coroutines. When one coroutine is waiting on IO, the loop politely tells it to step aside and runs whichever coroutine is ready instead. You only need two keywords: &lt;code&gt;async def&lt;/code&gt; to define a coroutine function, and &lt;code&gt;await&lt;/code&gt; to yield control, telling the event loop “I’ll be waiting here, go do something else.”&lt;/p&gt;

&lt;p&gt;Most tutorials show you this perfect‑world example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# simulate network IO
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clean, elegant, 5 requests in 1 second. But the moment you drop this into a real project, things get messy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pitfall 1: &lt;code&gt;await&lt;/code&gt; Inside a Sync Function — And Boom, Errors
&lt;/h2&gt;

&lt;p&gt;I naively added &lt;code&gt;await fetch()&lt;/code&gt; right inside an existing Flask route function. Immediate &lt;code&gt;SyntaxError: 'await' outside async function&lt;/code&gt;. Alright, I’ll just change the route to &lt;code&gt;async def&lt;/code&gt;. Request comes in — &lt;code&gt;RuntimeError: There is no current event loop in thread 'Thread-1'&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here’s why: Flask handles requests on a pool of worker threads, and a worker thread has no event loop of its own, which is exactly where the “no current event loop” error comes from. And once a loop &lt;em&gt;is&lt;/em&gt; running in a thread, you can’t start another one on top of it. My view ended up calling &lt;code&gt;asyncio.run(main())&lt;/code&gt; from inside a running loop and triggered a cascade of “event loop already running” errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you should do:&lt;/strong&gt; If you can, switch to an async‑native framework like Quart or FastAPI. If you’re stuck with Flask, create a global event loop at startup and schedule work with &lt;code&gt;loop.run_until_complete()&lt;/code&gt;. Or, even simpler: spin up a background asyncio thread and communicate with the web thread via a queue.&lt;/p&gt;
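
&lt;p&gt;Here’s a minimal sketch of that last option, assuming a long-running Flask app; &lt;code&gt;aggregate()&lt;/code&gt; stands in for whatever async entry point you have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio
import threading

# one long-lived event loop in a daemon thread, started once at app startup
background_loop = asyncio.new_event_loop()
threading.Thread(target=background_loop.run_forever, daemon=True).start()

def view():
    # hand the coroutine to the background loop; only this worker thread blocks
    future = asyncio.run_coroutine_threadsafe(aggregate(), background_loop)
    return future.result(timeout=30)  # timeout is a placeholder, tune to taste
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
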

&lt;h2&gt;
  
  
  Pitfall 2: Blocking Calls Inside a Coroutine — Performance Tanks
&lt;/h2&gt;

&lt;p&gt;Feeling clever, I wrote:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;call_api_blocking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total time? Still ~18 seconds. Logging showed each task finishing one after another, no concurrency at all. The culprit: &lt;code&gt;call_api_blocking&lt;/code&gt; used &lt;code&gt;requests.get()&lt;/code&gt;, a synchronous blocking call. &lt;code&gt;await&lt;/code&gt; is useless here — while the first &lt;code&gt;requests.get&lt;/code&gt; sits there, the whole thread is frozen and no other coroutine gets a chance to run.&lt;/p&gt;

&lt;p&gt;Asyncio only plays nice with its own async IO primitives. When you have a blocking call, &lt;strong&gt;you must ship it to a thread pool&lt;/strong&gt; with &lt;code&gt;loop.run_in_executor()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_api_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_running_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_in_executor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the blocking happens in a separate thread and the event loop can immediately switch to another coroutine. Later I replaced &lt;code&gt;requests&lt;/code&gt; with &lt;code&gt;aiohttp&lt;/code&gt; entirely, and performance really took off. &lt;strong&gt;The golden rule: async is all-or-nothing. Don’t mix in blocking calls that hijack your thread.&lt;/strong&gt;&lt;/p&gt;
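
&lt;p&gt;On Python 3.9+, &lt;code&gt;asyncio.to_thread()&lt;/code&gt; wraps the same executor dance in a single call; a sketch of the equivalent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio
import requests

async def call_api_async(url):
    # runs requests.get in the default thread pool, keeping the event loop free
    return await asyncio.to_thread(requests.get, url)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
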

&lt;h2&gt;
  
  
  Pitfall 3: Orphaned Tasks — Memory Climbs, Then OOM
&lt;/h2&gt;

&lt;p&gt;After performance looked good, I rolled it out. Two days later, the pod was OOMKilled. Memory kept growing slowly, and the GC wasn’t collecting objects. After digging, I found the culprit. To “flexibly control concurrency” I had written something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks fine, right? But inside &lt;code&gt;process(url)&lt;/code&gt; some branches returned early, and a few exceptions weren’t handled properly. This left tasks stuck in &lt;code&gt;PENDING&lt;/code&gt; or &lt;code&gt;CANCELLED&lt;/code&gt; state while still referenced by the &lt;code&gt;tasks&lt;/code&gt; list. Those tasks held onto large response payloads, and since the references were never released, the GC could never reclaim them: a classic memory leak.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Use &lt;code&gt;asyncio.TaskGroup&lt;/code&gt; (Python 3.11+) to manage lifetimes automatically. If any task fails, all others are cancelled and resources are cleaned up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TaskGroup&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;tg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you’re on an older Python version, be diligent about cancelling pending tasks in a &lt;code&gt;finally&lt;/code&gt; block and clearing references.&lt;/p&gt;
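
&lt;p&gt;Roughly like this, assuming the same &lt;code&gt;process()&lt;/code&gt; coroutine as above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;tasks = [asyncio.create_task(process(url)) for url in urls]
try:
    results = await asyncio.gather(*tasks, return_exceptions=True)
finally:
    for t in tasks:
        if not t.done():
            t.cancel()    # stop anything still pending
    tasks.clear()         # drop the references so the GC can reclaim the payloads
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
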

&lt;h2&gt;
  
  
  The Production‑Ready Version
&lt;/h2&gt;

&lt;p&gt;Here’s the core skeleton I ended up with — concurrency controlled via semaphore, a reused &lt;code&gt;aiohttp&lt;/code&gt; session, isolated exceptions, and timeouts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AsyncFetcher&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;concurrency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Semaphore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;concurrency&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# limit concurrency to avoid hammering downstream
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClientTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>python</category>
      <category>async programming</category>
      <category>asyncio</category>
      <category>performance optimization</category>
    </item>
    <item>
      <title>I Rewrote Our Crawler with asyncio and Got a 15x Performance Boost</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Fri, 01 May 2026 02:00:40 +0000</pubDate>
      <link>https://forem.com/_eb7f2a654e97a60ae9f96e/i-rewrote-our-crawler-with-asyncio-and-got-a-15x-performance-boost-199j</link>
      <guid>https://forem.com/_eb7f2a654e97a60ae9f96e/i-rewrote-our-crawler-with-asyncio-and-got-a-15x-performance-boost-199j</guid>
      <description>&lt;p&gt;Last week, I finally snapped. Our “legacy” news aggregator was crawling 200 sites in &lt;strong&gt;8 minutes&lt;/strong&gt;, with two database timeouts along the way. Ops complained it was “slower than a tortoise,” the product manager asked, “Can we get it under 1 minute?” I said: give me half a day, and I’ll rewrite it with asyncio.&lt;/p&gt;

&lt;p&gt;The result? &lt;strong&gt;Total time dropped from 487 seconds to 32 seconds — a 15x speedup.&lt;/strong&gt; My boss walked past my desk, glanced at the screen, and literally said, “Whoa, now &lt;em&gt;that’s&lt;/em&gt; the speed it should be.” Today I’ll walk you through that refactor — no textbook fluff, just real, battle‑tested takeaways.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why asyncio, not threading?
&lt;/h2&gt;

&lt;p&gt;When faced with I/O‑bound tasks, many folks reach for &lt;code&gt;concurrent.futures&lt;/code&gt; and thread pools. But threads come with GIL overhead, context‑switching costs, and let’s be honest — a crawler spends 99% of its time waiting for network responses. Using OS threads to “wait for I/O” is like hiring a fleet of drivers just to have them sit in their cars.&lt;/p&gt;

&lt;p&gt;asyncio takes a different approach: &lt;strong&gt;single thread + event loop&lt;/strong&gt;. When a coroutine is waiting for a network response, it voluntarily yields control (&lt;code&gt;await&lt;/code&gt;), and the event loop immediately switches to another coroutine that’s ready to run. No thread‑switching overhead, no lock contention, minimal memory footprint.&lt;/p&gt;

&lt;p&gt;Three core ingredients:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Event loop&lt;/strong&gt; – the scheduler; it runs whatever is ready.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coroutines&lt;/strong&gt; – &lt;code&gt;async def&lt;/code&gt; functions that suspend with &lt;code&gt;await&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Futures/Tasks&lt;/strong&gt; – wrappers around coroutines that let you wait for results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s a completely different mindset from synchronous code — you have to get comfortable thinking concurrently.&lt;/p&gt;
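
&lt;p&gt;To make the three pieces concrete, here’s a minimal sketch: &lt;code&gt;work()&lt;/code&gt; is the coroutine, &lt;code&gt;create_task()&lt;/code&gt; wraps it into a Task, and &lt;code&gt;asyncio.run()&lt;/code&gt; spins up the event loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

async def work(n):
    await asyncio.sleep(0.1)  # suspend; the loop runs whatever else is ready
    return n * 2

async def main():
    tasks = [asyncio.create_task(work(i)) for i in range(3)]  # Tasks are scheduled immediately
    return await asyncio.gather(*tasks)                       # wait for every result

print(asyncio.run(main()))  # the event loop drives it all; prints [0, 2, 4]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
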

&lt;h2&gt;
  
  
  The refactor: from blocking sync to async concurrency
&lt;/h2&gt;

&lt;p&gt;Let’s start with the synchronous crawler I inherited (simplified core logic):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;URLS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://httpbin.org/delay/1?id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# each request blocks for 1 second (simulating network I/O)
&lt;/span&gt;    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;fetch_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;URLS&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;同步耗时: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s, 结果数: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Output: 同步耗时: 10.12s, 结果数: 10
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ten requests, each taking 1 second, executed one after another — naturally that’s 10 seconds. Who can put up with that?&lt;/p&gt;

&lt;p&gt;Converting to asyncio boils down to two steps: swap the I/O function for its async counterpart, then schedule everything concurrently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;URLS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://httpbin.org/delay/1?id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# aiohttp async request — await yields control
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;fetch_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;URLS&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# run all coroutines concurrently
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;

&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;异步耗时: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s, 结果数: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Output: 异步耗时: 1.05s, 结果数: 10
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;asyncio.gather()&lt;/code&gt; fires off all 10 coroutines at once, so the total time is roughly that of the slowest request (1 second) instead of the sum. That’s the magic of the event loop: while coroutine 1 is waiting on I/O, the loop is already running coroutine 2, 3, … until a response arrives and control is handed back.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going deeper: semaphores and error handling — don’t let async become chaos
&lt;/h2&gt;

&lt;p&gt;If you think the snippet above is production‑ready, you’re probably in for a rude awakening. The first pitfall I hit was &lt;strong&gt;unlimited concurrency&lt;/strong&gt;. When the URL list grew from 10 to 2,000, the target server instantly banned my IP — because I had opened 2,000 TCP connections at once.&lt;/p&gt;

&lt;p&gt;The fix: &lt;code&gt;asyncio.Semaphore&lt;/code&gt;, to cap the number of simultaneous coroutines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_with_limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;sem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# semaphore controls how many coroutines run at once
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HTTP &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;请求失败: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, 错误: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
                &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# exponential backoff
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main_with_limit&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;sem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Semaphore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# max 50 concurrent requests
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;fetch_with_limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;fo&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>python</category>
      <category>async programming</category>
      <category>asyncio</category>
      <category>web crawling in practice</category>
    </item>
    <item>
      <title>asyncio Pitfalls: The Mistake That Cost Me 3 Hours</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Fri, 01 May 2026 01:58:14 +0000</pubDate>
      <link>https://forem.com/_eb7f2a654e97a60ae9f96e/asyncio-pitfalls-the-mistake-that-cost-me-3-hours-4o58</link>
      <guid>https://forem.com/_eb7f2a654e97a60ae9f96e/asyncio-pitfalls-the-mistake-that-cost-me-3-hours-4o58</guid>
      <description>&lt;p&gt;Here’s the story: last week my boss threw a “simple” task at me — pull data from 120 internal APIs simultaneously and compile a report. I thought, “This is just I/O-bound work. I know asyncio like the back of my hand.” So I cranked out the first version in 10 minutes. To my disbelief, it ran even slower than a serial approach, and some endpoints never returned any data. That afternoon, I stared at the terminal output, tweaking and cursing for three full hours — until I spotted one innocuous function call. Then it all clicked.&lt;/p&gt;

&lt;p&gt;If you’re doing concurrency with asyncio, the following pitfalls might make you question your life choices.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Culprit: Synchronous Blocking Call Inside a Coroutine
&lt;/h2&gt;

&lt;p&gt;Here’s my first naive implementation — can you spot the problem right away?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;  &lt;span class="c1"&gt;# 注意：经典的同步库
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;协程函数：获取 API 数据&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 模拟获取数据 —— 这里埋了一颗大雷
&lt;/span&gt;    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 同步阻塞调用！
&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Finished &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://httpbin.org/delay/1?req=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;fetch_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10 请求耗时: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result left me dumbfounded: 10 requests took over 10 seconds — exactly like a serial run. The reason is painfully simple: &lt;code&gt;requests.get()&lt;/code&gt; is a &lt;strong&gt;synchronous blocking&lt;/strong&gt; call. While waiting for the network, it completely holds the thread hostage, so the event loop never gets a chance to switch to another coroutine. Mixing synchronous code into an &lt;code&gt;async def&lt;/code&gt; is like stuffing a tractor engine into a sports car. The golden rule of asyncio is: &lt;strong&gt;every I/O operation must be asynchronous&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Two ways to fix it: swap to an async HTTP library (like &lt;code&gt;aiohttp&lt;/code&gt;), or offload the blocking call with &lt;code&gt;loop.run_in_executor&lt;/code&gt;. I recommend the former:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClientTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://httpbin.org/delay/1?req=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;fetch_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10 请求耗时: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the switch, 10 requests finished in about 1.5 seconds. My boss’s frown finally relaxed.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Forgetting to &lt;code&gt;await&lt;/code&gt; — The Coroutine That Never Ran
&lt;/h2&gt;

&lt;p&gt;This trap has bitten me more times than I’d like to admit. Check out this classic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;say_hello&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# 事故现场：创建协程对象，但忘了 await
&lt;/span&gt;    &lt;span class="nf"&gt;say_hello&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;          &lt;span class="c1"&gt;# 只会返回一个 coroutine object，不会执行
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;End&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you run it, the terminal only prints &lt;code&gt;End&lt;/code&gt;. The &lt;code&gt;Hello&lt;/code&gt; never appears. Python doesn’t raise an error; at most it emits a &lt;code&gt;RuntimeWarning: coroutine 'say_hello' was never awaited&lt;/code&gt; when the orphaned coroutine object is garbage-collected. The correct approach is &lt;code&gt;await say_hello()&lt;/code&gt;, or wrap it with &lt;code&gt;asyncio.create_task(say_hello())&lt;/code&gt; so the event loop manages it. My personal habit: &lt;strong&gt;whenever I call an &lt;code&gt;async def&lt;/code&gt; function, I either put &lt;code&gt;await&lt;/code&gt; in front of it or wrap it with &lt;code&gt;create_task&lt;/code&gt;. I never leave a coroutine naked.&lt;/strong&gt;&lt;/p&gt;
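
&lt;p&gt;For completeness, the corrected &lt;code&gt;main()&lt;/code&gt; would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;async def main():
    task = asyncio.create_task(say_hello())  # scheduled on the event loop right away
    await asyncio.sleep(2)
    await task     # make sure it actually finished before moving on
    print("End")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
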




&lt;h2&gt;
  
  
  3. Exception Handling in &lt;code&gt;gather&lt;/code&gt; — One Rotten Task Spoils the Whole Bunch
&lt;/h2&gt;

&lt;p&gt;When I took on that 120‑endpoint task, a few APIs occasionally timed out or returned 500. I used &lt;code&gt;asyncio.gather&lt;/code&gt; and quickly learned that with the default &lt;code&gt;return_exceptions=False&lt;/code&gt;, &lt;strong&gt;the first exception propagates immediately and the results of every other task are thrown away&lt;/strong&gt; (the remaining tasks keep running, but you never see their return values), leaving me with zero usable data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 错误示范：一个炸，全家炸
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bad_request&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;接口挂了&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;good_request&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;正常数据&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;bad_request&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nf"&gt;good_request&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;捕获异常，但 good_request 的结果也丢了&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix is simple — add &lt;code&gt;return_exceptions=True&lt;/code&gt; to &lt;code&gt;gather&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt;
    &lt;span class="n"&gt;return_exceptions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;log_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      &lt;span class="c1"&gt;# 单独处理异常
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this pattern, you can gracefully handle partial failures — log the errors and still process all the valid responses. No more wasting 3 hours staring at the terminal!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;These pitfalls are sneaky, but once you understand the underlying mechanics, asyncio becomes a powerful ally. Hope this saves you from the same debugging rabbit hole I fell into.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>asyncio</category>
      <category>web crawler</category>
      <category>performance optimization</category>
    </item>
  </channel>
</rss>
