Forem: sm1ck

IP-Adapter + LoRA for product catalog rendering — putting shop items on AI characters

sm1ck — Sat, 25 Apr 2026 02:35:59 +0000

📦 Runnable workflow: github.com/sm1ck/honeychat/tree/main/tutorial/04-ipadapter — a ComfyUI workflow.json (with <tune> placeholders for IP-Adapter weight/end_at) plus a stdlib Python client that posts it to your ComfyUI instance and saves the output.

In the previous post I argued that LoRA per character is often the strongest fit for visual identity. But what happens when you want to render that character wearing a specific item — a shop product, a user-uploaded outfit, a gift from another user?

LoRA helps stabilize the character. To also preserve an arbitrary reference image, IP-Adapter is a common fit. Those two techniques can compete unless you configure them carefully.

TL;DR

LoRA stabilizes the character's face. IP-Adapter pulls features from a reference image. If both are too strong late in sampling, the face can drift toward the reference.
Balance: moderate IP-Adapter weight (lower half of 0–1) with early handoff (IP-Adapter releases control before the final denoising steps). The final steps belong to the LoRA.
A useful node order: Checkpoint → LoRA → FreeU → IP-Adapter → KSampler. Feeding IP-Adapter into the model conditioning after LoRA lets LoRA reassert on late steps.

Render your first outfit preview

This section walks you from clone to a generated image in under ten minutes.

1. Prereqs

A running ComfyUI instance (local GPU, rented box, or a friend's)
ComfyUI_IPAdapter_plus installed in it
ip-adapter-plus_sdxl_vit-h.safetensors in models/ipadapter/
CLIP-ViT-H-14-laion2B-s32B-b79K.safetensors in models/clip_vision/
Your own SDXL base checkpoint
A character LoRA — if you don't have one, go through the previous article first

2. Clone and install the client

git clone https://github.com/sm1ck/honeychat
cd honeychat/tutorial/04-ipadapter
pip install -e .

3. Put your outfit reference next to the client

Anything flat-lay, clean-background works best. ./my-dress.png for this example.

4. Run — start at the middle of both tuning ranges

export COMFY_URL=http://localhost:8188
export REFERENCE_IMAGE=./my-dress.png
export CHECKPOINT=your-sdxl-base.safetensors
export LORA=your-character-v1.safetensors
export IPADAPTER_WEIGHT=0.4      # lower half of 0–1
export IPADAPTER_END_AT=0.8      # upper half of 0–1

python client.py

Output lands in ./out/outfit_preview_<n>.png. First run should usually show your character wearing something that resembles the reference dress.

5. Tune

Inspect the output. Two failure modes tell you how to adjust:

Face drifted → lower IPADAPTER_WEIGHT or lower IPADAPTER_END_AT by 0.05 and re-run.
Item doesn't resemble the reference → raise IPADAPTER_WEIGHT by 0.05, or raise IPADAPTER_END_AT slightly.

Sweep in 0.05 steps, not 0.1. The usable range can be narrower than expected, and a new base model may take several tuning sweeps before the balance feels stable.

6. Validate the workflow JSON with pytest

pip install -e ".[dev]"
pytest -v

Five tests make sure workflow.json stays valid JSON, every node class is still referenced, and <tune> placeholders haven't been accidentally committed with real values.

The problem

You have a character (Anna) stabilized by a custom LoRA. She appears reasonably consistent across generations. Now the user buys a specific dress in your shop. The dress is a reference image. You want:

Anna's face — unchanged.
This specific dress — rendered faithfully on Anna.

Prompt engineering usually can't guarantee this. "Anna wearing a red silk dress with a white collar" generates a red silk dress, not necessarily this red silk dress. SKU-level fidelity needs the reference image in the generation path.

Why naive IP-Adapter breaks the character

IP-Adapter pulls features from a reference image into the model's cross-attention. If you set it too high, it can preserve the reference image aggressively — including its face, if there is one. Even if the reference is an unworn product shot, IP-Adapter can pull in lighting, backdrop, and styling from the reference photo.

At high weight: Anna's face may start looking more like whoever (or whatever) is in the reference. Lighting and pose can bias toward the reference.

At low weight: The character is fine. The dress is approximately the right color and cut but not recognizable as this dress. Your product catalog becomes decorative rather than accurate.

The balance: moderate weight + early handoff

The two knobs that matter are weight and end_at.

Weight — the multiplier on IP-Adapter's contribution to cross-attention. Below the lower-middle of the 0–1 range, the reference is a "mood" more than a fact. Above the upper-middle, the reference dominates. Somewhere in the lower half is where you find the range that preserves item identity without killing face identity.

end_at — the fraction of denoising steps during which IP-Adapter is active. If it runs through all steps, it has a say in the final face details. If it ends earlier (say 70–90% of the way through), the last steps belong to the rest of the pipeline, and LoRA face features reassert.

In rough terms: the item gets baked in during the middle of denoising, the face re-sharpens at the end.

Workflow node order (ComfyUI)

[Checkpoint Loader]
  → [LoRA Loader: character_lora]
    → [FreeU: quality touch-up]
      → [IPAdapter Advanced: reference, weight=W, end_at=E]
        → [KSampler]
          → [VAE Decode]

Two things about this order:

LoRA comes before IP-Adapter in the chain. The LoRA modifies the checkpoint weights; IP-Adapter modifies cross-attention during sampling. When IP-Adapter ends at step end_at, the remaining steps operate on the LoRA-modified weights without IP-Adapter influence — this is what lets the face reassert.
FreeU is optional. It's a noise rebalance that improves quality without adding compute.

The tutorial client takes the base workflow.json, rewrites the <tune> placeholders with env-supplied values, uploads the reference image to ComfyUI, and queues the prompt:

def rewrite_workflow(wf: dict[str, Any], args: argparse.Namespace, ref_filename: str) -> dict[str, Any]:
    """Fill in the `<tune>` and `<path>` placeholders with actual values."""
    wf = json.loads(json.dumps(wf))  # deep copy

    if args.checkpoint:
        wf["1"]["inputs"]["ckpt_name"] = args.checkpoint
    if args.lora:
        wf["2"]["inputs"]["lora_name"] = args.lora
    wf["2"]["inputs"]["strength_model"] = args.lora_strength
    wf["2"]["inputs"]["strength_clip"]  = args.lora_strength
    wf["5"]["inputs"]["image"] = ref_filename
    wf["6"]["inputs"]["weight"] = args.weight
    wf["6"]["inputs"]["end_at"] = args.end_at
    wf["7"]["inputs"]["text"] = args.prompt
    wf["10"]["inputs"]["seed"] = int(time.time()) & 0xFFFFFFFF
    return wf

→ full source

The full workflow.json in the tutorial folder ships with <tune> placeholders on every field you should touch. The test suite asserts those placeholders stay in the template — a safety net against accidentally committing your tuned production values.

Weight tuning loop

The practical process:

Pick a reference item with a clean product photo.
Pick a character with a strong LoRA.
Render around weight=0.3, end_at=0.8. Check face, check item.
Face drifts → lower weight or lower end_at.
Item doesn't resemble the reference → raise weight carefully, or leave weight and raise end_at.
Sweep in 0.05 increments, not 0.1. The usable range is narrower than you'd expect.

Several tuning sweeps on realistic and anime bases usually land you on a working pair.

Production integration

Outfit catalog as reference images. Each shop item has a reference image stored in object storage. At generation time, pass the reference URL to the GPU worker, which downloads it once and caches.

Catalog pre-rendering for previews. When a user browses the shop, they see a preview of each item rendered on their active character. These previews don't need to happen on every page load — generate them asynchronously (Celery worker), store in S3, serve from cache.

Consistency across image and video. The same IP-Adapter + LoRA pair used for images can often drive the start-frame of video generation (e.g., Kling). Tune the still-image path first, then reuse it carefully.

Fallback when the item isn't visual. Some "items" in a shop are stats buffs, relationship flags, or dialogue unlocks — things without a visual. Gate the IP-Adapter pathway to items flagged as visual-only.

Production issues that came up

Face drifted on a noticeable slice of catalog previews. Running IP-Adapter weight too high "for stronger outfit adherence." Rolled back to the lower-half range after face-drift complaints spiked. Lesson: tune one variable at a time, even when it feels slow.

Cached reference URLs expired. Shop items in S3 had time-limited presigned URLs. Generation workers fetched the URL at queue-time, but the URL expired before ComfyUI actually downloaded it. Fix: pre-fetch on the worker side, pass the ComfyUI-side filename instead of the external URL.

IP-Adapter model version mismatch with SDXL base. IP-Adapter Plus ships multiple weights keyed to specific SDXL base models. Mixing can produce worse output without an obvious runtime error — just lower fidelity. Pin the IP-Adapter version to the base in your deployment config.

Non-visual shop items crashed the workflow. The API tried to render "stat boost" items through the image pipeline. Fix: a visual: true|false flag on catalog entries, checked at the API boundary before queuing.

What I'd change if starting over

Start with a clean catalog. Reference images with consistent backgrounds, consistent lighting, no model already wearing the item if possible.
Version the tuning. When you move base models, your IP-Adapter weight/end_at values probably move too. Treat them as part of the deployment, not as constants.
Cache the pre-rendered previews aggressively. A character × item grid grows multiplicatively. Pre-render on character creation and on new item add.

Where this lives

HoneyChat's shop renders outfits, accessories, and gifts on active characters using IP-Adapter Plus layered over per-character LoRA. Public architecture doc: github.com/sm1ck/honeychat/blob/main/docs/architecture.md.

References

If you've shipped an IP-Adapter + LoRA combo in production, I'm curious what weight / end_at pairs you landed on and for which base. The sweet spot seems to shift meaningfully between anime and realistic bases.

Character consistency in AI image generation — where prompts break down and LoRA helps

sm1ck — Wed, 22 Apr 2026 12:22:02 +0000

📦 Training template: github.com/sm1ck/honeychat/tree/main/tutorial/03-lora — a generic Kohya SDXL config with <tune> placeholders and a dataset curation guide. No docker-compose (LoRA training is GPU-heavy) — you bring your own GPU or rent one.

Here's a failure mode many AI companion apps run into on launch day: users send two requests in a row for the same character, get two different faces, and conclude the product is broken. They're not wrong to feel that way. Character identity is part of the product.

This post is about why that happens, why the obvious fixes (seed-pinning, more prompt detail, reference images) often don't fully solve it, and what class of solution works better.

TL;DR

Identical seed + identical prompt + different batch size = different face. Seeds only help within the same sampler run.
Prompt detail plateaus fast. Past a certain tag count, the model interpolates anyway.
Reference image (IP-Adapter) works but can bleed stylistic features — outfit, lighting, background — into generations where you only wanted identity.
Custom LoRA per character makes identity much more stable by encoding it at the weights level instead of relying only on prompt text.

Train your own character LoRA — the short walkthrough

LoRA training is GPU-heavy and doesn't belong in a docker-compose, so the tutorial folder at tutorial/03-lora ships the config template and recipe. You bring the GPU.

1. Get a GPU

24 GB VRAM (RTX 3090/4090) fits SDXL LoRA at batch size 2–4 comfortably. Don't own one? Rent a spot — Vast.ai, RunPod, Modal, Paperspace, Lambda. A full training run costs a few dollars.

2. Install Kohya_ss

git clone https://github.com/bmaltais/kohya_ss ~/kohya_ss
cd ~/kohya_ss && ./setup.sh

3. Grab the template

git clone https://github.com/sm1ck/honeychat
cp -r honeychat/tutorial/03-lora ./my-character-lora
cd my-character-lora

4. Prepare your dataset

Drop 15–30 varied images of your subject into dataset/train/5_character/ (the 5_ is the repeat count). For each image, create a same-named .txt caption describing the scene — not the character. See dataset/README.md for the full curation checklist.

5. Fill the <tune> slots in kohya-config.toml

Every hyperparameter is a placeholder you pick based on your dataset and base model. Read the inline comments, then replace each <tune> with a real value. The safety check in train.sh will refuse to run if any placeholder remains.

6. Train

export KOHYA_DIR=~/kohya_ss
bash train.sh

The checkpoint lands at ./output/<your-character>.safetensors. Load it into ComfyUI or Diffusers like any other SDXL LoRA. Generate a test grid, iterate, retrain if needed.

Why "same prompt, same face" doesn't hold

Users naturally assume this works:

"anime girl, long silver hair, green eyes, Arknights operator outfit"
+ seed=12345
→ Anna, always. Or so it seems.

Not reliably. Three reasons.

Batch size changes the output. In most Stable Diffusion runs, batch_size=1 and batch_size=4 with the same seed produce different images for position 0. The RNG state depends on batch dimension.

Provider-side sampler drift. If you're calling a managed API (fal.ai, Replicate, Together), provider-side changes — model updates, sampler tweaks, default parameter shifts — can produce visually different outputs across weeks. Your "locked" character can drift.

Prompt detail saturates. At some point, adding more tags ("sharp nose, high cheekbones, narrow eyes, specific mole position") stops helping much. The model has a rough template and interpolates within it.

The in-between fix that doesn't quite work: IP-Adapter

IP-Adapter lets you pass a reference image alongside the prompt. The model bakes the reference's features into the cross-attention. For product photography, excellent.

For character identity, it has a practical drawback: IP-Adapter can carry stylistic baggage. A reference photo with specific lighting, pose, outfit, and background can bleed those into the generated image. You can turn the weight down, but then identity may weaken; turn it up, and the reference can dominate.

IP-Adapter is a good fit when the reference is what you want preserved (e.g., rendering a shop item on a character — next post in the series). It's usually a poor fit when what you want preserved is only the face.

The solution: custom LoRA per character

A LoRA (Low-Rank Adaptation) is a small set of additional weights layered on top of a base model. A character-specific LoRA trained on a curated dataset — consistent face, varied pose/outfit/lighting — encodes the identity into the weights themselves, not into the prompt.

Inference pipeline:

workflow = [
    "Checkpoint",           # base SDXL model
    f"LoRA: {char.lora}",   # the character's custom LoRA
    "FreeU",                # quality touch-up
    "KSampler",             # actual diffusion
]

Now Anna is much more likely to stay Anna across pose, outfit, and lighting changes. The face is represented in the weights, not only in the words.

Training a character LoRA (public-friendly template)

The conceptual shape of the training job using the publicly available Kohya_ss SDXL trainer:

# Kohya_ss SDXL LoRA training config — generic template
# Replace every <tune> value based on your dataset and base model.
# See Kohya docs for the full parameter reference.

[model_arguments]
pretrained_model_name_or_path = "<path/to/sdxl-base-or-finetune.safetensors>"

[dataset_arguments]
train_data_dir = "./dataset/train"
resolution     = "1024,1024"
caption_extension = ".txt"

[training_arguments]
output_dir      = "./output"
output_name     = "<your_character_v1>"
save_model_as   = "safetensors"

# Training steps and batch — VRAM-bound. Tune for your hardware.
learning_rate    = "<tune>"
max_train_steps  = "<tune>"
train_batch_size = "<tune>"

[network_arguments]
network_module = "networks.lora"
network_dim    = "<tune>"
network_alpha  = "<tune>"

→ full template on GitHub

The parameters that matter — LR, step count, rank, alpha, dataset size — are subject-dependent. Anime faces converge differently than realistic faces. There is no universal "best" setting.

What to actually optimize for:

Dataset quality over dataset size. 20 clean, varied, captioned images beat 100 messy ones.
Varied pose and lighting, constant face. Same angle 30 times teaches "this angle," not "this character."
Clean captions — describe the scene, not the character. "Woman standing in a garden" is better than "Anna standing in a garden" because you want the model to learn the face from context, not from the token.
Dedicated rank for face detail. Lower ranks underfit the identity; higher ranks overfit and kill flexibility.

Marginal cost: usually manageable

If you're running inference on a rented or owned GPU, training one character LoRA is a one-time cost usually measured in minutes to hours of GPU time, depending on dataset and settings. Inference with the LoRA attached often adds little overhead compared with the base generation. At scale, the per-character cost is dominated by dataset curation, not just training compute.

This is why a LoRA-per-character pipeline can be viable for products with many characters: once the pipeline exists, adding a new character is mostly a dataset and QA exercise, not a research project.

Production concerns

LoRA hot-swapping. Load the base checkpoint once, swap LoRAs per request. ComfyUI and Diffusers both support this natively.

Dataset hygiene. LoRAs memorize whatever's in the dataset. Enforce licensing upstream — the LoRA is downstream of the decision.

Storage at scale. LoRA file size depends on base model and rank; expect anything from a few MB to much larger checkpoints. Object storage + hot-LoRA pinning on inference workers keeps latency down.

Face ≠ body. A LoRA trained on face crops will not lock body proportions. Include full-body shots in the dataset if you need full-body consistency.

What I'd change if starting over

Ship the LoRA pipeline from day 1, even for three characters. Inconsistent visuals in the free tier can hurt activation before users ever see the stronger parts of the product.
Curate datasets manually, don't scrape. Five iterations of a hand-picked set of 20 images beat a scraped 200.
Store base-model version with each LoRA. When you update the base, you need to know which LoRAs need retraining.
Version LoRAs (v1, v2) and keep old versions live. If v2 ships with a regression, roll back per-character without reverting a whole release.

Where this lives

HoneyChat uses custom LoRA per character for visual identity in image and video generation. Public architecture: github.com/sm1ck/honeychat.

Previous: LLM routing per tier via OpenRouter.
Next: IP-Adapter Plus for a product catalog — how to put arbitrary shop items on a character while keeping the character's face locked.

References

If you've trained character LoRAs in production and have opinions on rank selection or caption strategy, I'd love to hear them in the comments. There's very little public writing on this outside the anime generation community.

LLM routing per tier via OpenRouter — when one model doesn't fit all

sm1ck — Tue, 21 Apr 2026 23:50:29 +0000

📦 Full runnable example: github.com/sm1ck/honeychat/tree/main/tutorial/02-routing — docker compose up exposes POST /complete on localhost:8000. Every snippet below is pulled from that repo.

Most introductory "chat with AI" tutorials pick one model and call it a day. That works in a toy. It stops being enough in production, where users have different price sensitivity, different conversation styles, and different expectations for what the product should allow.

Here's how to route LLM calls across a handful of providers via OpenRouter, how that routing handles finish_reason=content_filter empty-completion edge cases, and the fallback chain pattern that keeps replies flowing.

TL;DR

Route by tier (price elasticity) and by content mode (what kind of turn this is). A single default model can't do both.
Some reasoning/model-provider combinations can return finish_reason=content_filter with empty content on borderline content. A retry policy that only catches HTTP errors can miss this.
The working pattern: primary → different-provider fallback → specialized last resort, with retries triggered by both error responses and suspicious empty completions.

Run it yourself in 3 minutes

1. Clone and configure

git clone https://github.com/sm1ck/honeychat
cd honeychat/tutorial/02-routing
cp .env.example .env

Open .env, paste your OPENROUTER_API_KEY (get one here). The three default model slots all point to free-tier OpenRouter models so you can experiment without spending.

2. Start the service

docker compose up --build -d
curl http://localhost:8000/health   # {"ok":true}

3. Send a normal turn — primary answers

curl -X POST http://localhost:8000/complete \
  -H 'content-type: application/json' \
  -d '{"messages":[{"role":"user","content":"Name three cold-climate fruits."}]}' \
  | jq

Expected response:

{
  "content": "Apples, pears, and cloudberries...",
  "model": "meta-llama/llama-3.1-8b-instruct:free",
  "attempt": 0,
  "used_fallback": false
}

attempt: 0 means the primary model answered. used_fallback: false means no retry was needed.

4. Force a fallback

Override the primary to point at a model you know tends to refuse — or any bogus model name — and watch the chain kick in:

curl -X POST http://localhost:8000/complete \
  -H 'content-type: application/json' \
  -d '{"messages":[{"role":"user","content":"Say hi"}],"primary":"this/model-does-not-exist"}' \
  | jq '.model, .attempt, .used_fallback'

attempt: 1 (or 2) — the next rung answered. In production, log this metric: a rising fallback rate on a class of content means it's time to move the content to a different primary, not to tweak retry logic.

5. Run the unit tests

pip install -e ".[dev]"
pytest -v

Seven tests cover the failure modes in this chain — content_filter=empty, transient 5xx, non-transient 4xx, all-models-fail.

With the service running and the tests green, the rest of this post explains why the chain is shaped this way.

Why one model doesn't fit all

Three distinct pressures push against a single-model setup:

Price elasticity by tier. A free user generating 20 messages a day at flagship-model prices can burn cash every month per active user for zero revenue. A paying top-tier user sending the same 20 messages may reasonably expect higher quality. The unit economics do not agree.

Content mode. Mainstream-aligned models can refuse content that some legitimate companion/roleplay products allow on paid tiers. Conversely, less-restrictive models can have weaker long-context coherence. The right model depends on the turn.

Latency vs. depth. Instant conversational turns need sub-3-second responses. Long scene-writing turns can tolerate 10+ seconds for better prose. Hardcoding a single model optimizes for one and sacrifices the other.

The reasoning-model empty-completion edge case

This is the one that cost me a full afternoon to diagnose.

Some reasoning-class model/provider combinations do server-side moderation or filtering before returning a final answer. On borderline turns, they may not return an HTTP error. Instead, they can return a valid response with:

{
  "choices": [{
    "finish_reason": "content_filter",
    "message": { "content": "" }
  }]
}

Empty string. No exception. No status code to check. If you don't guard for it, your user sees a blank reply.

If your retry logic only triggers on httpx.HTTPStatusError, this can pass through.

The guard

The whole failure mode is caught by a tiny function:

def _is_silent_refusal(choice: dict) -> bool:
    """
    The whole point of this post: reasoning models can return a successful
    HTTP response with finish_reason=content_filter AND an empty content.
    If you only check HTTP status, you ship blank replies to users.
    """
    reason = choice.get("finish_reason")
    content = choice.get("message", {}).get("content") or ""
    return reason in ("content_filter", "length") and not content.strip()

→ full source

Resilient fallback chain

async def complete(
    messages: list[dict],
    *,
    primary: str | None = None,
    chain: Iterable[str] | None = None,
) -> CompletionResult:
    """Run the fallback chain. Return the first usable response."""
    models = list(chain) if chain is not None else _build_chain(primary)

    async with httpx.AsyncClient() as client:
        for attempt, model in enumerate(models):
            try:
                data = await _call(client, model, messages)
            except httpx.HTTPStatusError as e:
                if e.response.status_code in TRANSIENT_CODES:
                    continue
                raise
            except (httpx.ReadTimeout, httpx.ConnectError):
                continue

            choice = (data.get("choices") or [{}])[0]
            if _is_silent_refusal(choice):
                continue

            content = choice.get("message", {}).get("content") or ""
            if not content.strip():
                continue

            return CompletionResult(content=content, model=model, attempt=attempt)

    raise AllModelsFailedError(f"no model returned usable content; tried {models}")

→ full source

Two details worth calling out:

Empty content check is separate from the finish reason. Some models can return finish_reason=stop with empty content when they refuse. Always check not content.strip().
Track which model ultimately answered. Log attempt > 0 as a fallback event. If your primary fails 10% of the time on a class of content, that's a routing decision, not a retry problem — move that content to a different primary.

Picking the fallback order

For a permissive roleplay mode, the shape looks like this:

content-mode primary   → first model for this type of turn
  ↓ (on failure / empty)
diff-provider fallback → avoids the same upstream failure mode
  ↓
specialized last resort
  ↓
abort — ask the user to try a shorter or clearer prompt

The ordering rule: different-provider fallbacks. If the primary is hosted on provider A and fails for a provider-side reason, prefer a fallback hosted on provider B. Same-provider fallbacks can fail on the same content because the provider's moderation layer may be upstream of the model. OpenRouter makes this easier because each model's provider metadata is visible.

Content-level gating happens before the LLM, not after

The fallback chain handles model-level refusals. But if the user's intent is clearly above your product's content ceiling, retrying on a more permissive model just burns extra tokens before the user hits the real limit. Gate the content level in your system prompt assembly — don't rely on the model to enforce policy.

Keep the tier-level policy simple: the escalation class (detected from user intent) must be ≤ the user's plan ceiling. If over, the character responds in-character and the bot sends the upsell. The LLM does not need to know the tier exists — it just gets a system prompt with the right constraints for this turn.

Instrumentation that matters

Log three things per LLM call:

Model that answered (primary or fallback index)
Time to first token vs total time — tells you whether latency was model-side or network-side
Token cost (input + output) per message, bucketed by tier

Costs track in Redis counters with short TTL — daily sum, per-user daily sum. A global daily ceiling blocks new generations if spend crosses a configured threshold (fail-closed: if the counter is unreachable, block, don't pass). This helped cap a runaway generation loop at a known ceiling.

What I'd change if starting over

Route by content mode from day 1, not as an afterthought. Retrofitting the split into an existing handler is painful.
Instrument the silent-refusal rate. It may be rare, but you won't know unless you measure it specifically.
Don't share a single OpenRouter key across environments. Rate limits are per-key and dev noise eats prod quota.
Publish the tier → model map in your public docs. Users comparing products care. Competitors already know. Keeping the docs in sync with the code forces alignment.

Where this lives

HoneyChat's LLM router sits behind the chat handler on both the Telegram bot and the web app. Public architecture: github.com/sm1ck/honeychat/blob/main/docs/architecture.md.

Previous in the series: dual-layer memory with Redis + ChromaDB.
Next: character consistency with custom LoRA.

References

Curious how others have solved the silent-refusal pattern. If you've hit it on a different provider, drop a comment — I want to know which models ship which behavior.

Building an AI companion with persistent memory — Redis + ChromaDB

sm1ck — Mon, 20 Apr 2026 12:16:42 +0000

📦 Full runnable example: github.com/sm1ck/honeychat/tree/main/tutorial/01-memory — clone, docker compose up, chat with the demo bot on Telegram. Every code snippet below is pulled from that repo.

Most AI chatbots still struggle with reliable, queryable long-term recall. Character.AI has pinned and chat memories, but unpinned details can still fall out of the active conversation context. Replika remembers profile facts, preferences, and generated memories, but that is not the same as semantic recall over the full conversation. Even ChatGPT's Memory is built for useful preferences and details, not verbatim replay of long sessions.

I wanted a chat companion with practical persistent memory — not just the current conversation, but older facts and events surfaced when they matter. Here's the architecture that worked well for this use case.

TL;DR

Hot layer (Redis) — recent messages per conversation, short TTL, low-latency reads.
Cold layer (ChromaDB) holds summaries of chunks, not individual messages. Every N bot turns, a background task summarizes that window via a cheap LLM and stores the summary as a document. Keeps the vector index tiny, queries fast.
On every user message, three retrieval paths fire in parallel via asyncio.gather: recent buffer, latest summary, top-K semantic search. All three get assembled into the system prompt.
Result: substantially fewer tokens than full-history replay, while still making old context retrievable weeks later.

Run it yourself in 5 minutes

Before the architectural deep-dive, boot the demo so you can poke the memory layers live.

1. Clone and enter the folder

git clone https://github.com/sm1ck/honeychat
cd honeychat/tutorial/01-memory

2. Configure two tokens

cp .env.example .env

Open .env and fill:

TELEGRAM_BOT_TOKEN — get it from @BotFather (30 seconds: /newbot, pick a name, copy the token)
OPENROUTER_API_KEY — from openrouter.ai/keys. The default LLM_MODEL is a free-tier Llama 3.1 8B so you don't spend a cent.

3. Start the stack

docker compose up --build -d
docker compose logs -f bot       # watch the bot come alive

Four containers: redis, chromadb, api (FastAPI inspector on localhost:8000), bot (your Telegram bot polling).

4. Talk to your bot

Open it on Telegram, hit /start, chat for 10–20 turns. Tell it things about yourself. Come back later and reference something you said earlier — it'll pull it from ChromaDB.

5. Peek at what each layer holds

# Replace 12345 with your own Telegram user ID (ask @userinfobot)
curl http://localhost:8000/memory/12345/demo/recent  | jq
curl http://localhost:8000/memory/12345/demo/summary | jq

recent shows the raw Redis buffer. summary shows the latest ChromaDB document.

With the demo running, the rest of this post explains what you just booted.

Why rolling summaries alone don't work

A common pattern for chatbot memory is a rolling summary — every N messages, regenerate a compressed version of older context. It's cheap. It's also lossy in a very specific way: nuance dies in repeated compression.

Walk it through three regenerations:

Turn 1: "She said she hates her boss because he takes credit for her work"
Turn 2 summary: "User mentioned workplace frustration with manager"
Turn 3 summary: "User has job-related stress"
Turn 4 summary: "User has a job"

By turn 4, the reason is gone. A companion bot starts sounding generic. The fix used here: keep raw recent messages verbatim and only summarize chunks that are genuinely old, while being able to semantically retrieve any summary from the full history when the current conversation calls back.

Architecture

Two independent layers. Writes to Redis are synchronous on every turn; writes to ChromaDB are asynchronous, batched. Reads from both happen in parallel on every message.

The hot layer — Redis

Each (user_id, character_id) conversation is stored as a bounded Redis list:

async def save_message(user_id: int, char_id: str, role: str, content: str) -> None:
    r = get_redis()
    key = f"chat:{user_id}:{char_id}:messages"
    msg = json.dumps({
        "role": role,
        "content": content,
        "ts": datetime.now(timezone.utc).isoformat(),
    })
    pipe = r.pipeline()
    pipe.rpush(key, msg)
    pipe.ltrim(key, -HOT_BUFFER_SIZE, -1)
    pipe.expire(key, 86400 * HOT_BUFFER_TTL_DAYS)
    await pipe.execute()

→ full source on GitHub

Three things matter here:

ltrim on every write. The list is bounded. Memory per user is O(1), not O(conversation length).
TTL extended on every write. Inactive users' history evicts automatically. Configure Redis with allkeys-lru so overflow evicts instead of refusing writes — noeviction is the default and it's a footgun.
Pipelined writes. rpush + ltrim + expire in one round trip.

The cold layer — ChromaDB with summaries, not messages

A tempting implementation is to embed every message and run semantic search over them. Two problems: the index grows linearly with conversation volume, and individual messages are often too short or context-free to retrieve meaningfully ("yeah" returns a lot of "yeah" matches).

Instead: embed LLM-generated summaries of chunks. Every N bot turns, compress the window via a cheap LLM and write it as one document to a per-(user, character) ChromaDB collection. Ten weeks of active conversation is maybe 30–50 documents per collection, not tens of thousands.

Retrieval — three paths in parallel

On every user message, the chat handler fires three reads in parallel via asyncio.gather:

async def build_prompt_context(user_id: int, char_id: str, user_query: str) -> dict:
    """Parallel fire the three reads. Returns everything the handler needs."""
    recent, summary, memories = await asyncio.gather(
        get_recent(user_id, char_id),
        get_latest_summary(user_id, char_id),
        get_relevant_memories(user_id, char_id, user_query),
    )
    return {"recent": recent, "summary": summary, "memories": memories}

→ full source

The fast path for the summary hits Redis. The slower path queries ChromaDB only when the Redis cache expired, then writes back so the next call is hot again.

Production issues that came up

Double-summarize race. Two concurrent messages for the same pair both trigger summarization, writing overlapping summaries. Fix: per-key task tracking, cancel the pending task if a new one fires.

User clears history mid-summarize. A user hits "reset chat" while a summary is in flight. The summary then writes to a collection that just got deleted. Fix: re-check r.exists(key) before writing; bail if the list is gone.

Empty summaries cached. LLM rate-limited, returned empty content — and I was caching the empty string with a 3-day TTL. Fix: if summary: guard before setex.

ChromaDB collection doesn't exist for new users. col.query raises on a non-existent collection. Wrap in try/except and return empty — normal for a user's first few messages.

What I'd change if starting over

Skip pgvector for this shape of workload. Two weeks on it first; for my short-query summaries, recall was worse than ChromaDB and reindexing pain wasn't worth it.
Don't embed per message. Index exploded, recall didn't improve. Summary-level is the right granularity.
Summarize fixed-size windows, not time-based batches. Daily summaries are useless for users who chatted 500 times in one day.
Build the cancellation pattern from day 1. Race conditions around user actions (clear history, switch character) became one of the top sources of production bugs.

Where this lives

HoneyChat — an AI companion that runs both as a Telegram bot and a web app on the same backend. The architecture above is in production. Try it: @HoneyChatAIBot on Telegram or honeychat.bot in the browser.

Public docs: github.com/sm1ck/honeychat — service topology, API surface, major flows.

Next in the series: LLM routing per tier — why one model doesn't fit all, and how to handle content_filter errors from reasoning models.

References

If you're building something similar and have questions about the memory layout or the summarization pipeline, drop a comment. Especially curious how others handle race conditions around user-initiated state resets.