<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: sm1ck</title>
    <description>The latest articles on Forem by sm1ck (@sm1ck).</description>
    <link>https://forem.com/sm1ck</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3035707%2F79914038-15a3-4a68-9e07-803c587c48a8.png</url>
      <title>Forem: sm1ck</title>
      <link>https://forem.com/sm1ck</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sm1ck"/>
    <language>en</language>
    <item>
      <title>Character consistency in AI image generation — where prompts break down and LoRA helps</title>
      <dc:creator>sm1ck</dc:creator>
      <pubDate>Wed, 22 Apr 2026 12:22:02 +0000</pubDate>
      <link>https://forem.com/sm1ck/character-consistency-in-ai-image-generation-where-prompts-break-down-and-lora-helps-320b</link>
      <guid>https://forem.com/sm1ck/character-consistency-in-ai-image-generation-where-prompts-break-down-and-lora-helps-320b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📦 Training template:&lt;/strong&gt; &lt;a href="https://github.com/sm1ck/honeychat/tree/main/tutorial/03-lora" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat/tree/main/tutorial/03-lora&lt;/a&gt; — a generic Kohya SDXL config with &lt;code&gt;&amp;lt;tune&amp;gt;&lt;/code&gt; placeholders and a dataset curation guide. No docker-compose (LoRA training is GPU-heavy) — you bring your own GPU or rent one.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's a failure mode many AI companion apps run into on launch day: users send two requests in a row for the same character, get two different faces, and conclude the product is broken. They're not wrong to feel that way. Character identity is part of the product.&lt;/p&gt;

&lt;p&gt;This post is about why that happens, why the obvious fixes (seed-pinning, more prompt detail, reference images) often don't fully solve it, and what class of solution works better.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identical seed + identical prompt + different batch size = different face.&lt;/strong&gt; Seeds only help within the same sampler run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt detail plateaus fast.&lt;/strong&gt; Past a certain tag count, the model interpolates anyway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reference image (IP-Adapter) works but can bleed stylistic features&lt;/strong&gt; — outfit, lighting, background — into generations where you only wanted identity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom LoRA per character makes identity much more stable&lt;/strong&gt; by encoding it at the weights level instead of relying only on prompt text.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Train your own character LoRA — the short walkthrough
&lt;/h2&gt;

&lt;p&gt;LoRA training is GPU-heavy and doesn't belong in a docker-compose, so the tutorial folder at &lt;a href="https://github.com/sm1ck/honeychat/tree/main/tutorial/03-lora" rel="noopener noreferrer"&gt;tutorial/03-lora&lt;/a&gt; ships the &lt;strong&gt;config template and recipe&lt;/strong&gt;. You bring the GPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Get a GPU&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;24 GB VRAM (RTX 3090/4090) fits SDXL LoRA at batch size 2–4 comfortably. Don't own one? Rent a spot instance — Vast.ai, RunPod, Modal, Paperspace, and Lambda all offer them. A full training run costs a few dollars.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Install Kohya_ss&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/bmaltais/kohya_ss ~/kohya_ss
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/kohya_ss &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; ./setup.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Grab the template&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/sm1ck/honeychat
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; honeychat/tutorial/03-lora ./my-character-lora
&lt;span class="nb"&gt;cd &lt;/span&gt;my-character-lora
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Prepare your dataset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Drop 15–30 varied images of your subject into &lt;code&gt;dataset/train/5_character/&lt;/code&gt; (the &lt;code&gt;5_&lt;/code&gt; is the repeat count). For each image, create a same-named &lt;code&gt;.txt&lt;/code&gt; caption describing the &lt;em&gt;scene&lt;/em&gt; — not the character. See &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/03-lora/dataset/README.md" rel="noopener noreferrer"&gt;dataset/README.md&lt;/a&gt; for the full curation checklist.&lt;/p&gt;
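&lt;p&gt;A quick way to catch a half-captioned folder before burning GPU time: a small stdlib check (a hypothetical helper, not shipped in the tutorial) that flags images without a matching caption file.&lt;/p&gt;

```python
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}

def missing_captions(filenames: list[str]) -> list[str]:
    """Return image files that lack a same-named .txt caption."""
    names = set(filenames)
    gaps = []
    for f in sorted(names):
        p = Path(f)
        if p.suffix.lower() in IMAGE_EXTS and str(p.with_suffix(".txt")) not in names:
            gaps.append(f)
    return gaps

def check_dataset(folder: str) -> list[str]:
    # Thin filesystem wrapper around the pure check above.
    return missing_captions([p.name for p in Path(folder).iterdir()])
```

Run &lt;code&gt;check_dataset("dataset/train/5_character")&lt;/code&gt; before kicking off a run; an empty list means every image has a caption.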

&lt;p&gt;&lt;strong&gt;5. Fill the &lt;code&gt;&amp;lt;tune&amp;gt;&lt;/code&gt; slots in &lt;code&gt;kohya-config.toml&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every hyperparameter is a placeholder you pick based on your dataset and base model. Read the inline comments, then replace each &lt;code&gt;&amp;lt;tune&amp;gt;&lt;/code&gt; with a real value. The safety check in &lt;code&gt;train.sh&lt;/code&gt; will refuse to run if any placeholder remains.&lt;/p&gt;
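&lt;p&gt;The same guard is easy to replicate locally before renting a GPU. A Python sketch of the idea (the real check lives in &lt;code&gt;train.sh&lt;/code&gt;; &lt;code&gt;unfilled_lines&lt;/code&gt; is a hypothetical name):&lt;/p&gt;

```python
import re

# \x3c and \x3e are the literal angle brackets wrapping each tune slot
# in kohya-config.toml.
PLACEHOLDER = re.compile(r"\x3ctune\x3e")

def unfilled_lines(config_text: str) -> list[int]:
    """1-based line numbers still carrying a tune placeholder — the same
    condition the train.sh safety check refuses to run on."""
    return [i + 1 for i, line in enumerate(config_text.splitlines())
            if PLACEHOLDER.search(line)]
```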

&lt;p&gt;&lt;strong&gt;6. Train&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;KOHYA_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;~/kohya_ss
bash train.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The checkpoint lands at &lt;code&gt;./output/&amp;lt;your-character&amp;gt;.safetensors&lt;/code&gt;. Load it into ComfyUI or Diffusers like any other SDXL LoRA. Generate a test grid, iterate, retrain if needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why "same prompt, same face" doesn't hold
&lt;/h2&gt;

&lt;p&gt;Users naturally assume this works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"anime girl, long silver hair, green eyes, Arknights operator outfit"
+ seed=12345
→ Anna, always. Or so it seems.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not reliably. Three reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch size changes the output.&lt;/strong&gt; In most Stable Diffusion runs, &lt;code&gt;batch_size=1&lt;/code&gt; and &lt;code&gt;batch_size=4&lt;/code&gt; with the same seed produce &lt;em&gt;different&lt;/em&gt; images for position 0. The RNG state depends on the batch dimension.&lt;/p&gt;
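&lt;p&gt;A toy illustration of the mechanism, using Python's &lt;code&gt;random&lt;/code&gt; module rather than the actual sampler: when one seeded stream serves the whole batch, everything drawn after the initial latents sits at a different stream offset depending on batch size.&lt;/p&gt;

```python
import random

def noise_after_latents(batch_size: int, latent_len: int = 4) -> float:
    # One seeded stream per run, as in a typical sampler: the initial
    # latents consume batch_size * latent_len draws, then the next draw
    # stands in for later per-step sampler noise.
    rng = random.Random(12345)
    for _ in range(batch_size * latent_len):
        rng.gauss(0.0, 1.0)
    return rng.gauss(0.0, 1.0)

# Same seed, different batch size: the post-latent stream has diverged.
assert noise_after_latents(1) != noise_after_latents(4)
```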

&lt;p&gt;&lt;strong&gt;Provider-side sampler drift.&lt;/strong&gt; If you're calling a managed API (fal.ai, Replicate, Together), provider-side changes — model updates, sampler tweaks, default parameter shifts — can produce visually different outputs across weeks. Your "locked" character can drift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt detail saturates.&lt;/strong&gt; At some point, adding more tags ("sharp nose, high cheekbones, narrow eyes, specific mole position") stops helping much. The model has a rough template and interpolates within it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The in-between fix that doesn't quite work: IP-Adapter
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/tencent-ailab/IP-Adapter" rel="noopener noreferrer"&gt;IP-Adapter&lt;/a&gt; lets you pass a reference image alongside the prompt. The model bakes the reference's features into the cross-attention. For product photography, excellent.&lt;/p&gt;

&lt;p&gt;For character identity, it has a practical drawback: &lt;strong&gt;IP-Adapter can carry stylistic baggage&lt;/strong&gt;. A reference photo with specific lighting, pose, outfit, and background can bleed those into the generated image. You can turn the weight down, but then identity may weaken; turn it up, and the reference can dominate.&lt;/p&gt;

&lt;p&gt;IP-Adapter is a good fit when the &lt;em&gt;reference&lt;/em&gt; is what you want preserved (e.g., rendering a shop item on a character — next post in the series). It's usually a poor fit when what you want preserved is only the face.&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution: custom LoRA per character
&lt;/h2&gt;

&lt;p&gt;A LoRA (Low-Rank Adaptation) is a small set of additional weights layered on top of a base model. A character-specific LoRA trained on a curated dataset — consistent face, varied pose/outfit/lighting — encodes the identity into the weights themselves, not into the prompt.&lt;/p&gt;
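&lt;p&gt;Why the adapter stays small: a rank-r update to a d×k weight matrix stores two thin factors instead of a full delta. Back-of-envelope arithmetic with illustrative dimensions (not SDXL's real layer shapes):&lt;/p&gt;

```python
def lora_params(d: int, k: int, r: int) -> tuple[int, int]:
    """Parameter count for a full d-by-k weight delta vs. a rank-r LoRA
    (factors B: d-by-r and A: r-by-k)."""
    return d * k, r * (d + k)

full, lora = lora_params(d=4096, k=4096, r=16)
print(full, lora, full / lora)  # rank 16 stores 128x fewer parameters here
```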

&lt;p&gt;Inference pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;workflow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Checkpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# base SDXL model
&lt;/span&gt;    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LoRA: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;char&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# the character's custom LoRA
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FreeU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c1"&gt;# quality touch-up
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KSampler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="c1"&gt;# actual diffusion
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now Anna is much more likely to stay Anna across pose, outfit, and lighting changes. The face is represented in the weights, not only in the words.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training a character LoRA (public-friendly template)
&lt;/h3&gt;

&lt;p&gt;The conceptual shape of the training job using the publicly available &lt;a href="https://github.com/bmaltais/kohya_ss" rel="noopener noreferrer"&gt;Kohya_ss SDXL trainer&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# Kohya_ss SDXL LoRA training config — generic template&lt;/span&gt;
&lt;span class="c"&gt;# Replace every &amp;lt;tune&amp;gt; value based on your dataset and base model.&lt;/span&gt;
&lt;span class="c"&gt;# See Kohya docs for the full parameter reference.&lt;/span&gt;

&lt;span class="nn"&gt;[model_arguments]&lt;/span&gt;
&lt;span class="py"&gt;pretrained_model_name_or_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;path/to/sdxl-base-or-finetune.safetensors&amp;gt;"&lt;/span&gt;

&lt;span class="nn"&gt;[dataset_arguments]&lt;/span&gt;
&lt;span class="py"&gt;train_data_dir&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"./dataset/train"&lt;/span&gt;
&lt;span class="py"&gt;resolution&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"1024,1024"&lt;/span&gt;
&lt;span class="py"&gt;caption_extension&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;".txt"&lt;/span&gt;

&lt;span class="nn"&gt;[training_arguments]&lt;/span&gt;
&lt;span class="py"&gt;output_dir&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"./output"&lt;/span&gt;
&lt;span class="py"&gt;output_name&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;your_character_v1&amp;gt;"&lt;/span&gt;
&lt;span class="py"&gt;save_model_as&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"safetensors"&lt;/span&gt;

&lt;span class="c"&gt;# Training steps and batch — VRAM-bound. Tune for your hardware.&lt;/span&gt;
&lt;span class="py"&gt;learning_rate&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;tune&amp;gt;"&lt;/span&gt;
&lt;span class="py"&gt;max_train_steps&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;tune&amp;gt;"&lt;/span&gt;
&lt;span class="py"&gt;train_batch_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;tune&amp;gt;"&lt;/span&gt;

&lt;span class="nn"&gt;[network_arguments]&lt;/span&gt;
&lt;span class="py"&gt;network_module&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"networks.lora"&lt;/span&gt;
&lt;span class="py"&gt;network_dim&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;tune&amp;gt;"&lt;/span&gt;
&lt;span class="py"&gt;network_alpha&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;tune&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/03-lora/kohya-config.toml" rel="noopener noreferrer"&gt;full template on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The parameters that matter — LR, step count, rank, alpha, dataset size — are subject-dependent. Anime faces converge differently than realistic faces. There is no universal "best" setting.&lt;/p&gt;

&lt;p&gt;What to actually optimize for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dataset quality over dataset size.&lt;/strong&gt; 20 clean, varied, captioned images beat 100 messy ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Varied pose and lighting&lt;/strong&gt;, constant face. Same angle 30 times teaches "this angle," not "this character."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean captions&lt;/strong&gt; — describe the scene, not the character. "Woman standing in a garden" is better than "Anna standing in a garden" because you want the model to learn the face from context, not from the token.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right-sized rank for face detail.&lt;/strong&gt; Too low a rank underfits the identity; too high overfits and kills flexibility.&lt;/li&gt;
&lt;/ul&gt;
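&lt;p&gt;The caption rule is mechanical enough to lint. A hypothetical helper (not part of the tutorial repo) that flags identity leaks in caption files:&lt;/p&gt;

```python
def caption_leaks_identity(caption: str, identity_tokens: tuple[str, ...]) -> bool:
    """Flag captions that name the character. Per the checklist above,
    captions should describe the scene, so the model attributes the face
    to the LoRA weights rather than to a token."""
    lowered = caption.lower()
    return any(tok.lower() in lowered for tok in identity_tokens)

assert caption_leaks_identity("Anna standing in a garden", ("anna",))
assert not caption_leaks_identity("woman standing in a garden", ("anna",))
```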

&lt;h2&gt;
  
  
  Marginal cost: usually manageable
&lt;/h2&gt;

&lt;p&gt;If you're running inference on a rented or owned GPU, training one character LoRA is a one-time cost usually measured in minutes to hours of GPU time, depending on dataset and settings. Inference with the LoRA attached often adds little overhead compared with the base generation. At scale, the per-character cost is dominated by dataset curation, not just training compute.&lt;/p&gt;

&lt;p&gt;This is why a LoRA-per-character pipeline can be viable for products with many characters: once the pipeline exists, adding a new character is mostly a dataset and QA exercise, not a research project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production concerns
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;LoRA hot-swapping.&lt;/strong&gt; Load the base checkpoint once, swap LoRAs per request. ComfyUI and Diffusers both support this natively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset hygiene.&lt;/strong&gt; LoRAs memorize whatever's in the dataset. Enforce licensing upstream — the LoRA is downstream of the decision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage at scale.&lt;/strong&gt; LoRA file size depends on base model and rank; expect anything from a few MB to much larger checkpoints. Object storage + hot-LoRA pinning on inference workers keeps latency down.&lt;/p&gt;
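&lt;p&gt;One way to sketch the hot-pinning idea: an LRU cache of loaded LoRAs on each inference worker, with object storage as the cold tier. Pure-Python sketch; &lt;code&gt;load_from_store&lt;/code&gt; is a stand-in for your download-plus-attach step, not a real API.&lt;/p&gt;

```python
from collections import OrderedDict
from typing import Callable

class LoraCache:
    """Keep the N most recently used character LoRAs resident on a worker."""

    def __init__(self, load_from_store: Callable[[str], object], capacity: int = 8):
        self._load = load_from_store     # cold path: fetch from object storage
        self._cap = capacity
        self._hot: OrderedDict[str, object] = OrderedDict()

    def get(self, character: str) -> object:
        if character in self._hot:
            self._hot.move_to_end(character)   # refresh recency on a hit
            return self._hot[character]
        lora = self._load(character)           # cache miss: cold fetch
        self._hot[character] = lora
        if len(self._hot) > self._cap:
            self._hot.popitem(last=False)      # evict least recently used
        return lora
```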

&lt;p&gt;&lt;strong&gt;Face ≠ body.&lt;/strong&gt; A LoRA trained on face crops will not lock body proportions. Include full-body shots in the dataset if you need full-body consistency.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd change if starting over
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ship the LoRA pipeline from day 1&lt;/strong&gt;, even for three characters. Inconsistent visuals in the free tier can hurt activation before users ever see the stronger parts of the product.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Curate datasets manually, don't scrape.&lt;/strong&gt; Five iterations of a hand-picked set of 20 images beat a scraped 200.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store base-model version with each LoRA.&lt;/strong&gt; When you update the base, you need to know which LoRAs need retraining.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version LoRAs (v1, v2) and keep old versions live.&lt;/strong&gt; If v2 ships with a regression, roll back per-character without reverting a whole release.&lt;/li&gt;
&lt;/ul&gt;
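&lt;p&gt;The last two bullets combine naturally into one small piece of state: a per-character record of LoRA versions, the base model each was trained against, and which version is live. A hypothetical sketch (class and field names are mine, not the repo's):&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class CharacterLoras:
    """Per-character LoRA registry: old versions stay live, so a rollback
    is a pointer flip rather than a release revert."""
    name: str
    versions: dict[str, dict] = field(default_factory=dict)  # "v1" -> metadata
    active: str = ""

    def publish(self, version: str, path: str, base_model: str) -> None:
        self.versions[version] = {"path": path, "base_model": base_model}
        self.active = version

    def rollback(self, version: str) -> None:
        if version not in self.versions:
            raise KeyError(f"{self.name}: unknown LoRA version {version}")
        self.active = version

    def needs_retrain(self, current_base: str) -> list[str]:
        # Versions trained against a different base checkpoint.
        return [v for v, m in self.versions.items() if m["base_model"] != current_base]
```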

&lt;h2&gt;
  
  
  Where this lives
&lt;/h2&gt;

&lt;p&gt;HoneyChat uses custom LoRA per character for visual identity in image and video generation. Public architecture: &lt;a href="https://github.com/sm1ck/honeychat" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Previous: &lt;a href="https://dev.to/sm1ck/llm-routing-per-tier-via-openrouter-when-one-model-doesnt-fit-all-3ami"&gt;LLM routing per tier via OpenRouter&lt;/a&gt;.&lt;br&gt;
Next: &lt;strong&gt;IP-Adapter Plus for a product catalog&lt;/strong&gt; — how to put arbitrary shop items on a character while keeping the character's face locked.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2106.09685" rel="noopener noreferrer"&gt;LoRA paper — Hu et al.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bmaltais/kohya_ss" rel="noopener noreferrer"&gt;Kohya_ss SDXL training&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/tencent-ailab/IP-Adapter" rel="noopener noreferrer"&gt;IP-Adapter (for comparison)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0" rel="noopener noreferrer"&gt;SDXL base model&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you've trained character LoRAs in production and have opinions on rank selection or caption strategy, I'd love to hear them in the comments. There's very little public writing on this outside the anime generation community.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>LLM routing per tier via OpenRouter — when one model doesn't fit all</title>
      <dc:creator>sm1ck</dc:creator>
      <pubDate>Tue, 21 Apr 2026 23:50:29 +0000</pubDate>
      <link>https://forem.com/sm1ck/llm-routing-per-tier-via-openrouter-when-one-model-doesnt-fit-all-3ami</link>
      <guid>https://forem.com/sm1ck/llm-routing-per-tier-via-openrouter-when-one-model-doesnt-fit-all-3ami</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📦 Full runnable example:&lt;/strong&gt; &lt;a href="https://github.com/sm1ck/honeychat/tree/main/tutorial/02-routing" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat/tree/main/tutorial/02-routing&lt;/a&gt; — &lt;code&gt;docker compose up&lt;/code&gt; exposes &lt;code&gt;POST /complete&lt;/code&gt; on localhost:8000. Every snippet below is pulled from that repo.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most introductory "chat with AI" tutorials pick one model and call it a day. That works in a toy. It stops being enough in production, where users have different price sensitivity, different conversation styles, and different expectations for what the product should allow.&lt;/p&gt;

&lt;p&gt;Here's how to route LLM calls across a handful of providers via &lt;a href="https://openrouter.ai/" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt;, how that routing handles the &lt;code&gt;finish_reason=content_filter&lt;/code&gt; empty-completion edge case, and the fallback chain pattern that keeps replies flowing.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Route by &lt;strong&gt;tier&lt;/strong&gt; (price elasticity) &lt;em&gt;and&lt;/em&gt; by &lt;strong&gt;content mode&lt;/strong&gt; (what kind of turn this is). A single default model can't do both.&lt;/li&gt;
&lt;li&gt;Some reasoning/model-provider combinations can return &lt;code&gt;finish_reason=content_filter&lt;/code&gt; with empty content on borderline content. A retry policy that only catches HTTP errors can miss this.&lt;/li&gt;
&lt;li&gt;The working pattern: &lt;code&gt;primary → different-provider fallback → specialized last resort&lt;/code&gt;, with retries triggered by both error responses &lt;em&gt;and&lt;/em&gt; suspicious empty completions.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Run it yourself in 3 minutes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Clone and configure&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/sm1ck/honeychat
&lt;span class="nb"&gt;cd &lt;/span&gt;honeychat/tutorial/02-routing
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;.env&lt;/code&gt;, paste your &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt; (&lt;a href="https://openrouter.ai/keys" rel="noopener noreferrer"&gt;get one here&lt;/a&gt;). The three default model slots all point to free-tier OpenRouter models so you can experiment without spending.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Start the service&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;
curl http://localhost:8000/health   &lt;span class="c"&gt;# {"ok":true}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Send a normal turn — primary answers&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/complete &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'content-type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"messages":[{"role":"user","content":"Name three cold-climate fruits."}]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Apples, pears, and cloudberries..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"meta-llama/llama-3.1-8b-instruct:free"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"attempt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"used_fallback"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;attempt: 0&lt;/code&gt; means the primary model answered. &lt;code&gt;used_fallback: false&lt;/code&gt; means no retry was needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Force a fallback&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Override the primary to point at a model you know tends to refuse — or any bogus model name — and watch the chain kick in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/complete &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'content-type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"messages":[{"role":"user","content":"Say hi"}],"primary":"this/model-does-not-exist"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="s1"&gt;'.model, .attempt, .used_fallback'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;attempt: 1&lt;/code&gt; (or 2) — the next rung answered. In production, log this metric: a rising fallback rate on a class of content means it's time to move the content to a different primary, not to tweak retry logic.&lt;/p&gt;
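&lt;p&gt;A minimal sketch of that metric, assuming you record one boolean per completed request (&lt;code&gt;FallbackRate&lt;/code&gt; is a hypothetical name, not part of the tutorial repo):&lt;/p&gt;

```python
from collections import deque

class FallbackRate:
    """Rolling fallback rate over the last N completions — the signal that
    a primary model is wrong for a class of content."""

    def __init__(self, window: int = 1000):
        self._hits: deque[int] = deque(maxlen=window)

    def record(self, used_fallback: bool) -> None:
        self._hits.append(1 if used_fallback else 0)

    @property
    def rate(self) -> float:
        return sum(self._hits) / len(self._hits) if self._hits else 0.0
```

Alert on a sustained rise per content class, then change the primary model for that class rather than loosening the retry policy.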

&lt;p&gt;&lt;strong&gt;5. Run the unit tests&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;
pytest &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Seven tests cover the failure modes in this chain — an empty &lt;code&gt;content_filter&lt;/code&gt; completion, transient 5xx, non-transient 4xx, and all-models-fail.&lt;/p&gt;

&lt;p&gt;With the service running and the tests green, the rest of this post explains why the chain is shaped this way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why one model doesn't fit all
&lt;/h2&gt;

&lt;p&gt;Three distinct pressures push against a single-model setup:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Price elasticity by tier.&lt;/strong&gt; A free user generating 20 messages a day at flagship-model prices burns real money every month and brings in zero revenue. A paying top-tier user sending the same 20 messages may reasonably expect higher quality. The unit economics do not agree.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content mode.&lt;/strong&gt; Mainstream-aligned models can refuse content that some legitimate companion/roleplay products allow on paid tiers. Conversely, less-restrictive models can have weaker long-context coherence. The right model depends on the turn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency vs. depth.&lt;/strong&gt; Instant conversational turns need sub-3-second responses. Long scene-writing turns can tolerate 10+ seconds for better prose. Hardcoding a single model optimizes for one and sacrifices the other.&lt;/p&gt;
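&lt;p&gt;The three pressures suggest a two-axis lookup rather than a single default. A toy routing table, with placeholder model slugs standing in for whatever your tiers actually use:&lt;/p&gt;

```python
# Toy two-axis routing table — slugs are placeholders, not recommendations.
ROUTES = {
    ("free", "chat"):  "some-provider/small-fast-model",
    ("free", "scene"): "some-provider/small-fast-model",
    ("paid", "chat"):  "some-provider/mid-model",
    ("paid", "scene"): "some-provider/large-prose-model",
}

def pick_model(tier: str, mode: str) -> str:
    # Unknown combinations fall back to the cheapest slot.
    return ROUTES.get((tier, mode), ROUTES[("free", "chat")])
```

The point is the key shape, not the table contents: tier alone can't express "this turn is a long scene," and mode alone can't express "this user is on the free plan."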

&lt;h2&gt;
  
  
  The reasoning-model empty-completion edge case
&lt;/h2&gt;

&lt;p&gt;This is the one that cost me a full afternoon to diagnose.&lt;/p&gt;

&lt;p&gt;Some reasoning-class model/provider combinations do server-side moderation or filtering before returning a final answer. On borderline turns, they may not return an HTTP error. Instead, they can return a valid response with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"finish_reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"content_filter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Empty string. No exception. No status code to check. If you don't guard for it, your user sees a blank reply.&lt;/p&gt;

&lt;p&gt;If your retry logic only triggers on &lt;code&gt;httpx.HTTPStatusError&lt;/code&gt;, this can pass through.&lt;/p&gt;

&lt;h2&gt;
  
  
  The guard
&lt;/h2&gt;

&lt;p&gt;The whole failure mode is caught by a tiny function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_is_silent_refusal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    The whole point of this post: reasoning models can return a successful
    HTTP response with finish_reason=content_filter AND an empty content.
    If you only check HTTP status, you ship blank replies to users.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finish_reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content_filter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/02-routing/app/router.py#L64-L73" rel="noopener noreferrer"&gt;full source&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Resilient fallback chain
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbeynrwjlgh7bsgzvb90.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbeynrwjlgh7bsgzvb90.webp" alt="LLM routing fallback chain: a chat turn tries a tier-specific primary model, retries on a different-provider fallback after empty content_filter responses, then falls back to a specialized last resort" width="800" height="373"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;CompletionResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run the fallback chain. Return the first usable response.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;_build_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HTTPStatusError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;TRANSIENT_CODES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;continue&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;
            &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadTimeout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConnectError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="n"&gt;choice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[{}])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;_is_silent_refusal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;CompletionResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;AllModelsFailedError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no model returned usable content; tried &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/02-routing/app/router.py#L90-L128" rel="noopener noreferrer"&gt;full source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two details worth calling out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Empty content check is separate from the finish reason.&lt;/strong&gt; Some models can return &lt;code&gt;finish_reason=stop&lt;/code&gt; with empty content when they refuse. Always check &lt;code&gt;not content.strip()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track which model ultimately answered.&lt;/strong&gt; Log &lt;code&gt;attempt &amp;gt; 0&lt;/code&gt; as a fallback event. If your primary fails 10% of the time on a class of content, that's a routing decision, not a retry problem — move that content to a different primary.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Picking the fallback order
&lt;/h2&gt;

&lt;p&gt;For a permissive roleplay mode, the shape looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;content-mode primary   → first model for this type of turn
  ↓ (on failure / empty)
diff-provider fallback → avoids the same upstream failure mode
  ↓
specialized last resort
  ↓
abort — ask the user to try a shorter or clearer prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ordering rule: &lt;strong&gt;different-provider fallbacks&lt;/strong&gt;. If the primary is hosted on provider A and fails for a provider-side reason, prefer a fallback hosted on provider B. Same-provider fallbacks can fail on the same content because the provider's moderation layer may be upstream of the model. OpenRouter makes this easier because each model's provider metadata is visible.&lt;/p&gt;
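&lt;p&gt;As a sketch of that rule (the model names and provider map here are made up, not the real tier config): build the chain so no two consecutive entries share a provider.&lt;/p&gt;

```python
# Hypothetical model-to-provider map; the real metadata is visible
# per model on OpenRouter.
MODEL_PROVIDERS = {
    "primary-roleplay": "provider-a",
    "general-fallback": "provider-a",
    "other-host-fallback": "provider-b",
    "specialized-last-resort": "provider-c",
}

def build_chain(primary, providers=MODEL_PROVIDERS):
    """Order the fallback chain so consecutive models never share a
    provider, so one upstream moderation layer can't fail twice in a row."""
    chain = [primary]
    rest = [m for m in providers if m != primary]
    while rest:
        last = providers[chain[-1]]
        # prefer a model hosted elsewhere; otherwise take whatever is left
        pick = next((m for m in rest if providers[m] != last), rest[0])
        chain.append(pick)
        rest.remove(pick)
    return chain
```

&lt;p&gt;With the map above, the provider-A primary is followed by a provider-B model before the second provider-A model is ever tried.&lt;/p&gt;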

&lt;h2&gt;
  
  
  Content-level gating happens before the LLM, not after
&lt;/h2&gt;

&lt;p&gt;The fallback chain handles &lt;em&gt;model-level&lt;/em&gt; refusals. But if the user's intent is clearly above your product's content ceiling, retrying on a more permissive model just burns extra tokens before the user hits the real limit. Gate the content level in your system prompt assembly — don't rely on the model to enforce policy.&lt;/p&gt;

&lt;p&gt;Keep the tier-level policy simple: the escalation class (detected from user intent) must be &lt;code&gt;≤&lt;/code&gt; the user's plan ceiling. If the intent is over the ceiling, the character deflects in-character and the bot sends the upsell. The LLM does not need to know the tier exists — it just gets a system prompt with the right constraints for this turn.&lt;/p&gt;
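&lt;p&gt;A minimal sketch of that gate (the level names and the detection step are illustrative; only the clamp-and-upsell shape is the point):&lt;/p&gt;

```python
# Hypothetical escalation classes, mildest first; real names differ.
LEVELS = ["casual", "flirty", "spicy"]

def gate_turn(detected, plan_ceiling):
    """Clamp the detected escalation class to the user's plan ceiling.
    The LLM never sees the tier; it only gets this turn's constraints."""
    d = LEVELS.index(detected)
    c = LEVELS.index(plan_ceiling)
    if min(d, c) == d:   # detected rank does not exceed the ceiling rank
        return {"level": detected, "upsell": False}
    # over the ceiling: respond at the ceiling, in character, flag upsell
    return {"level": plan_ceiling, "upsell": True}
```

&lt;p&gt;A handler would build the system prompt from the returned &lt;code&gt;level&lt;/code&gt; and send the upsell separately when the flag is set.&lt;/p&gt;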

&lt;h2&gt;
  
  
  Instrumentation that matters
&lt;/h2&gt;

&lt;p&gt;Log three things per LLM call:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model that answered&lt;/strong&gt; (primary or fallback index)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time to first token&lt;/strong&gt; vs &lt;strong&gt;total time&lt;/strong&gt; — tells you whether latency was model-side or network-side&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token cost&lt;/strong&gt; (input + output) per message, bucketed by tier&lt;/li&gt;
&lt;/ul&gt;
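&lt;p&gt;One shape such a per-call record can take (field names are illustrative, not HoneyChat's schema):&lt;/p&gt;

```python
import json
import time

def log_llm_call(model, attempt, t_first_token, t_total,
                 tokens_in, tokens_out, tier):
    """One structured record per LLM call: who answered, where the
    time went, what it cost. attempt above 0 marks a fallback event."""
    record = {
        "ts": time.time(),
        "model": model,
        "attempt": attempt,
        "fallback": bool(attempt),
        "ttft_ms": round(t_first_token * 1000),   # time to first token
        "total_ms": round(t_total * 1000),
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "tier": tier,
    }
    print(json.dumps(record))
    return record
```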

&lt;p&gt;Costs are tracked in Redis counters with a short TTL: a global daily sum and a per-user daily sum. A global daily ceiling blocks new generations once spend crosses a configured threshold (fail-closed: if the counter is unreachable, block, don't pass). This helped cap a runaway generation loop at a known ceiling.&lt;/p&gt;
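&lt;p&gt;The fail-closed check can be sketched with the Redis read injected as a callable (the ceiling value and names are illustrative):&lt;/p&gt;

```python
import operator

DAILY_CEILING_USD = 50.0   # illustrative threshold

def allow_generation(read_daily_spend, day_key):
    """Fail-closed spend gate: if the counter is unreachable, block;
    if today's sum has reached the ceiling, block; otherwise allow."""
    try:
        spent = float(read_daily_spend(day_key) or 0.0)
    except Exception:
        return False   # counter unreachable: block, never pass
    # allow only while spend is strictly under the ceiling
    return operator.lt(spent, DAILY_CEILING_USD)
```

&lt;p&gt;In the real path the same day key would be incremented with &lt;code&gt;INCRBYFLOAT&lt;/code&gt; after each call and given a short TTL so counters expire on their own.&lt;/p&gt;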

&lt;h2&gt;
  
  
  What I'd change if starting over
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Route by content mode from day 1&lt;/strong&gt;, not as an afterthought. Retrofitting the split into an existing handler is painful.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instrument the silent-refusal rate&lt;/strong&gt;. It may be rare, but you won't know unless you measure it specifically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't share a single OpenRouter key across environments.&lt;/strong&gt; Rate limits are per-key and dev noise eats prod quota.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publish the tier → model map in your public docs.&lt;/strong&gt; Users comparing products care. Competitors already know. Keeping the docs in sync with the code forces alignment.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where this lives
&lt;/h2&gt;

&lt;p&gt;HoneyChat's LLM router sits behind the chat handler on both the Telegram bot and the web app. Public architecture: &lt;a href="https://github.com/sm1ck/honeychat/blob/main/docs/architecture.md" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat/blob/main/docs/architecture.md&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Previous in the series: &lt;a href="https://dev.to/sm1ck/building-an-ai-companion-with-persistent-memory-redis-chromadb-4i8k"&gt;dual-layer memory with Redis + ChromaDB&lt;/a&gt;.&lt;br&gt;
Next: &lt;a href="https://dev.to/sm1ck/character-consistency-in-ai-image-generation-where-prompts-break-down-and-lora-helps-320b"&gt;character consistency with custom LoRA&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openrouter.ai/models" rel="noopener noreferrer"&gt;OpenRouter model list&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/api-reference/chat/object" rel="noopener noreferrer"&gt;Chat Completions &lt;code&gt;finish_reason&lt;/code&gt; semantics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Curious how others have solved the silent-refusal pattern. If you've hit it on a different provider, drop a comment — I want to know which models ship which behavior.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>openrouter</category>
    </item>
    <item>
      <title>Building an AI companion with persistent memory — Redis + ChromaDB</title>
      <dc:creator>sm1ck</dc:creator>
      <pubDate>Mon, 20 Apr 2026 12:16:42 +0000</pubDate>
      <link>https://forem.com/sm1ck/building-an-ai-companion-with-persistent-memory-redis-chromadb-4i8k</link>
      <guid>https://forem.com/sm1ck/building-an-ai-companion-with-persistent-memory-redis-chromadb-4i8k</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📦 Full runnable example:&lt;/strong&gt; &lt;a href="https://github.com/sm1ck/honeychat/tree/main/tutorial/01-memory" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat/tree/main/tutorial/01-memory&lt;/a&gt; — clone, &lt;code&gt;docker compose up&lt;/code&gt;, chat with the demo bot on Telegram. Every code snippet below is pulled from that repo.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most AI chatbots still struggle with reliable, queryable long-term recall. Character.AI has pinned and chat memories, but unpinned details can still fall out of the active conversation context. Replika remembers profile facts, preferences, and generated memories, but that is not the same as semantic recall over the full conversation. Even ChatGPT's Memory is built for useful preferences and details, not verbatim replay of long sessions.&lt;/p&gt;

&lt;p&gt;I wanted a chat companion with &lt;strong&gt;practical persistent memory&lt;/strong&gt; — not just the current conversation, but older facts and events surfaced when they matter. Here's the architecture that worked well for this use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hot layer (Redis)&lt;/strong&gt; — recent messages per conversation, short TTL, low-latency reads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold layer (ChromaDB)&lt;/strong&gt; holds &lt;em&gt;summaries of chunks&lt;/em&gt;, not individual messages. Every N bot turns, a background task summarizes that window via a cheap LLM and stores the summary as a document. Keeps the vector index tiny, queries fast.&lt;/li&gt;
&lt;li&gt;On every user message, three retrieval paths fire in parallel via &lt;code&gt;asyncio.gather&lt;/code&gt;: recent buffer, latest summary, top-K semantic search. All three get assembled into the system prompt.&lt;/li&gt;
&lt;li&gt;Result: substantially fewer tokens than full-history replay, while still making old context retrievable weeks later.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Run it yourself in 5 minutes
&lt;/h2&gt;

&lt;p&gt;Before the architectural deep-dive, boot the demo so you can poke the memory layers live.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Clone and enter the folder&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/sm1ck/honeychat
&lt;span class="nb"&gt;cd &lt;/span&gt;honeychat/tutorial/01-memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Configure two tokens&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;.env&lt;/code&gt; and fill:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;TELEGRAM_BOT_TOKEN&lt;/code&gt; — get it from &lt;a href="https://t.me/BotFather" rel="noopener noreferrer"&gt;@BotFather&lt;/a&gt; (30 seconds: &lt;code&gt;/newbot&lt;/code&gt;, pick a name, copy the token)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt; — from &lt;a href="https://openrouter.ai/keys" rel="noopener noreferrer"&gt;openrouter.ai/keys&lt;/a&gt;. The default &lt;code&gt;LLM_MODEL&lt;/code&gt; is a free-tier Llama 3.1 8B so you don't spend a cent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Start the stack&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;
docker compose logs &lt;span class="nt"&gt;-f&lt;/span&gt; bot       &lt;span class="c"&gt;# watch the bot come alive&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four containers: &lt;code&gt;redis&lt;/code&gt;, &lt;code&gt;chromadb&lt;/code&gt;, &lt;code&gt;api&lt;/code&gt; (FastAPI inspector on &lt;code&gt;localhost:8000&lt;/code&gt;), &lt;code&gt;bot&lt;/code&gt; (your Telegram bot polling).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Talk to your bot&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open it on Telegram, hit &lt;code&gt;/start&lt;/code&gt;, chat for 10–20 turns. Tell it things about yourself. Come back later and reference something you said earlier — it'll pull it from ChromaDB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Peek at what each layer holds&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Replace 12345 with your own Telegram user ID (ask @userinfobot)&lt;/span&gt;
curl http://localhost:8000/memory/12345/demo/recent  | jq
curl http://localhost:8000/memory/12345/demo/summary | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;recent&lt;/code&gt; shows the raw Redis buffer. &lt;code&gt;summary&lt;/code&gt; shows the latest ChromaDB document.&lt;/p&gt;

&lt;p&gt;With the demo running, the rest of this post explains what you just booted.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why rolling summaries alone don't work
&lt;/h2&gt;

&lt;p&gt;A common pattern for chatbot memory is a rolling summary — every N messages, regenerate a compressed version of older context. It's cheap. It's also &lt;strong&gt;lossy in a very specific way&lt;/strong&gt;: nuance dies in repeated compression.&lt;/p&gt;

&lt;p&gt;Walk it through three regenerations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Turn 1: "She said she hates her boss because he takes credit for her work"
Turn 2 summary: "User mentioned workplace frustration with manager"
Turn 3 summary: "User has job-related stress"
Turn 4 summary: "User has a job"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By turn 4, the &lt;em&gt;reason&lt;/em&gt; is gone. A companion bot starts sounding generic. The fix used here: &lt;strong&gt;keep raw recent messages verbatim&lt;/strong&gt; and only summarize chunks that are genuinely old, while keeping every summary from the full history semantically retrievable when the current conversation calls back to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkh7qpvljz5wjeh349kel.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkh7qpvljz5wjeh349kel.webp" alt="Dual-layer memory architecture: Redis recent buffer and ChromaDB summaries retrieved in parallel before LLM prompt assembly" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two independent layers. Writes to Redis are synchronous on every turn; writes to ChromaDB are asynchronous, batched. Reads from both happen in parallel on every message.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hot layer — Redis
&lt;/h2&gt;

&lt;p&gt;Each &lt;code&gt;(user_id, character_id)&lt;/code&gt; conversation is stored as a bounded Redis list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;char_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_redis&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;char_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rpush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ltrim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;HOT_BUFFER_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;86400&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;HOT_BUFFER_TTL_DAYS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/01-memory/app/memory.py#L75-L89" rel="noopener noreferrer"&gt;full source on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three things matter here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ltrim&lt;/code&gt; on every write.&lt;/strong&gt; The list is bounded. Memory per user is O(1), not O(conversation length).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTL extended on every write.&lt;/strong&gt; Inactive users' history evicts automatically. Configure Redis with &lt;code&gt;allkeys-lru&lt;/code&gt; so overflow evicts instead of refusing writes — &lt;code&gt;noeviction&lt;/code&gt; is the default and it's a footgun.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipelined writes.&lt;/strong&gt; &lt;code&gt;rpush + ltrim + expire&lt;/code&gt; in one round trip.&lt;/li&gt;
&lt;/ol&gt;
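&lt;p&gt;Setting that eviction policy is one line of configuration; the memory bound below is illustrative:&lt;/p&gt;

```shell
# At runtime via redis-cli (or the equivalent lines in redis.conf):
redis-cli config set maxmemory 512mb
redis-cli config set maxmemory-policy allkeys-lru
```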

&lt;h2&gt;
  
  
  The cold layer — ChromaDB with summaries, not messages
&lt;/h2&gt;

&lt;p&gt;A tempting implementation is to embed every message and run semantic search over them. Two problems: the index grows linearly with conversation volume, and individual messages are often too short or context-free to retrieve meaningfully ("yeah" returns a lot of "yeah" matches).&lt;/p&gt;

&lt;p&gt;Instead: &lt;strong&gt;embed LLM-generated summaries of chunks&lt;/strong&gt;. Every N bot turns, compress the window via a cheap LLM and write it as one document to a per-(user, character) ChromaDB collection. Ten weeks of active conversation is maybe 30–50 documents per collection, not tens of thousands.&lt;/p&gt;
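&lt;p&gt;The trigger logic can be sketched with the LLM call and the ChromaDB write injected as callables; the window-size constant is illustrative:&lt;/p&gt;

```python
SUMMARY_EVERY_N_TURNS = 10   # illustrative window size

def maybe_schedule_summary(turn_count, window, summarize, store):
    """Every N bot turns, compress the window via the cheap LLM and
    store the result as one document. `summarize` and `store` are
    injected so the trigger logic stays backend-agnostic."""
    if turn_count % SUMMARY_EVERY_N_TURNS:
        return None   # not at a window boundary yet
    text = "\n".join(f"{m['role']}: {m['content']}" for m in window)
    doc = summarize(text)
    if doc:           # never store an empty summary
        store(doc)
    return doc
```

&lt;p&gt;The &lt;code&gt;if doc&lt;/code&gt; guard matters: a rate-limited LLM can return empty content, and an empty document should never reach the index.&lt;/p&gt;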

&lt;h2&gt;
  
  
  Retrieval — three paths in parallel
&lt;/h2&gt;

&lt;p&gt;On every user message, the chat handler fires three reads in parallel via &lt;code&gt;asyncio.gather&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_prompt_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;char_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Parallel fire the three reads. Returns everything the handler needs.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;get_recent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;char_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;get_latest_summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;char_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;get_relevant_memories&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;char_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memories&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;memories&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/01-memory/app/memory.py#L163-L173" rel="noopener noreferrer"&gt;full source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The fast path for the summary hits Redis. The slower path queries ChromaDB only when the Redis cache has expired, then writes back so the next call is hot again.&lt;/p&gt;
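&lt;p&gt;That read-through shape, sketched with the storage calls injected (names are mine, not the repo's):&lt;/p&gt;

```python
def cached_latest_summary(cache_get, cache_set, cold_query, key):
    """Read-through: Redis first; on a miss, query ChromaDB and write
    the result back so the next call is hot again. Storage calls are
    injected so the shape stays backend-agnostic."""
    hit = cache_get(key)
    if hit:
        return hit
    doc = cold_query(key) or ""
    if doc:   # guard: never cache an empty summary
        cache_set(key, doc)
    return doc
```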

&lt;h2&gt;
  
  
  Production issues that came up
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Double-summarize race.&lt;/strong&gt; Two concurrent messages for the same (user, character) pair both trigger summarization, writing overlapping summaries. Fix: track the in-flight task per key and cancel the pending one when a new message fires.&lt;/p&gt;
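The fix can be sketched with asyncio task bookkeeping; the pending dict and function names here are hypothetical, not the app's actual code:

```python
import asyncio

# One in-flight summarizer per (user, character) pair; a newer message
# cancels the older run so only one summary write survives.
pending: dict = {}

async def schedule_summarize(user_id, char_id, summarize_coro):
    key = (user_id, char_id)
    old = pending.get(key)
    if old and not old.done():
        old.cancel()  # a newer message supersedes the pending run
    task = asyncio.create_task(summarize_coro)
    pending[key] = task
    try:
        return await task
    except asyncio.CancelledError:
        return None  # superseded; the newer task owns the write
    finally:
        if pending.get(key) is task:
            pending.pop(key, None)
```
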

&lt;p&gt;&lt;strong&gt;User clears history mid-summarize.&lt;/strong&gt; A user hits "reset chat" while a summary is in flight. The summary then writes to a collection that just got deleted. Fix: re-check &lt;code&gt;r.exists(key)&lt;/code&gt; before writing; bail if the list is gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Empty summaries cached.&lt;/strong&gt; The LLM got rate-limited and returned empty content — and I was caching that empty string with a 3-day TTL. Fix: an &lt;code&gt;if summary:&lt;/code&gt; guard before &lt;code&gt;setex&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChromaDB collection doesn't exist for new users.&lt;/strong&gt; &lt;code&gt;col.query&lt;/code&gt; raises on a non-existent collection. Fix: wrap it in try/except and return an empty result — that state is normal for a user's first few messages.&lt;/p&gt;
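Sketched as a small wrapper, assuming a standard ChromaDB client; the function name and result shape are illustrative:

```python
def get_relevant_memories_safe(client, name, query_text, k=3):
    """Vector lookup that treats a missing collection as 'no memories yet'."""
    try:
        col = client.get_collection(name)
        res = col.query(query_texts=[query_text], n_results=k)
        return res["documents"][0]
    except Exception:
        return []  # no collection yet: first few messages of a new user
```
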

&lt;h2&gt;
  
  
  What I'd change if starting over
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Skip pgvector for this shape of workload.&lt;/strong&gt; I spent two weeks on it first; for my short-query summaries, recall was worse than with ChromaDB, and the reindexing pain wasn't worth it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't embed per message.&lt;/strong&gt; The index exploded in size and recall didn't improve. Summary-level is the right granularity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summarize fixed-size windows, not time-based batches.&lt;/strong&gt; Daily summaries are useless for users who chatted 500 times in one day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build the cancellation pattern from day 1.&lt;/strong&gt; Race conditions around user actions (clear history, switch character) became one of the top sources of production bugs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where this lives
&lt;/h2&gt;

&lt;p&gt;HoneyChat — an AI companion that runs both as a Telegram bot and a web app on the same backend. The architecture above is in production. Try it: &lt;a href="https://t.me/HoneyChatAIBot" rel="noopener noreferrer"&gt;@HoneyChatAIBot&lt;/a&gt; on Telegram or &lt;a href="https://honeychat.bot" rel="noopener noreferrer"&gt;honeychat.bot&lt;/a&gt; in the browser.&lt;/p&gt;

&lt;p&gt;Public docs: &lt;a href="https://github.com/sm1ck/honeychat" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat&lt;/a&gt; — service topology, API surface, major flows.&lt;/p&gt;

&lt;p&gt;Next in the series: &lt;a href="https://dev.to/sm1ck/llm-routing-per-tier-via-openrouter-when-one-model-doesnt-fit-all-3ami"&gt;LLM routing per tier&lt;/a&gt; — why one model doesn't fit all, and how to handle content_filter errors from reasoning models.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.trychroma.com/" rel="noopener noreferrer"&gt;ChromaDB docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://redis.io/commands/ltrim/" rel="noopener noreferrer"&gt;Redis &lt;code&gt;LTRIM&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aiogram.dev/" rel="noopener noreferrer"&gt;aiogram&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openrouter.ai/" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://support.character.ai/hc/en-us/articles/24327914463003-New-Feature-Pinned-Memories" rel="noopener noreferrer"&gt;Character.AI pinned memories&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.character.ai/helping-characters-remember-what-matters-most/" rel="noopener noreferrer"&gt;Character.AI chat memories&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://help.replika.com/hc/en-us/categories/4410741916045-Conversation-Memory" rel="noopener noreferrer"&gt;Replika memory docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://help.openai.com/en/articles/10303002-how-does-memory-use-past-conversations" rel="noopener noreferrer"&gt;ChatGPT Memory FAQ&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you're building something similar and have questions about the memory layout or the summarization pipeline, drop a comment. Especially curious how others handle race conditions around user-initiated state resets.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
