<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Tasos Nikolaou</title>
    <description>The latest articles on Forem by Tasos Nikolaou (@tasenikol).</description>
    <link>https://forem.com/tasenikol</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3686088%2Ffe43fd9c-afa3-4795-9c34-a3d3974a6e4f.JPG</url>
      <title>Forem: Tasos Nikolaou</title>
      <link>https://forem.com/tasenikol</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tasenikol"/>
    <language>en</language>
    <item>
      <title>When Chrome Ate My RAM: Designing a Pressure-Aware Tab Orchestrator with Rust</title>
      <dc:creator>Tasos Nikolaou</dc:creator>
      <pubDate>Wed, 01 Apr 2026 16:30:07 +0000</pubDate>
      <link>https://forem.com/tasenikol/when-chrome-ate-my-ram-designing-a-pressure-aware-tab-orchestrator-with-rust-1g05</link>
      <guid>https://forem.com/tasenikol/when-chrome-ate-my-ram-designing-a-pressure-aware-tab-orchestrator-with-rust-1g05</guid>
      <description>&lt;p&gt;Chrome wasn't "crashing."&lt;/p&gt;

&lt;p&gt;It was just...slowly suffocating my system.&lt;/p&gt;

&lt;p&gt;Over time, RAM usage would creep up. Background tabs accumulated state. Other applications started freezing. The fan would spin up. And yet, nothing looked obviously wrong. No single tab was the culprit.&lt;/p&gt;

&lt;p&gt;The problem wasn't &lt;em&gt;too many tabs&lt;/em&gt;. &lt;br&gt;
The problem was a lack of coordination between the browser and the system. &lt;br&gt;
So I built something to experiment with that idea.&lt;/p&gt;

&lt;p&gt;This article explains the architecture and reasoning behind a hybrid Chrome extension &amp;amp; Rust native host that manages tab lifecycle based on real system pressure and user context.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Problem: Browser Entropy
&lt;/h2&gt;

&lt;p&gt;Modern browsers are operating systems.&lt;/p&gt;

&lt;p&gt;They manage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Dozens of isolated processes &lt;/li&gt;
&lt;li&gt;  Background timers &lt;/li&gt;
&lt;li&gt;  Network activity &lt;/li&gt;
&lt;li&gt;  Memory-heavy applications (Jira, GitHub, Gmail, ChatGPT, Claude 😊 etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most tab suspension tools rely on a simple rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"If a tab hasn't been used in &lt;code&gt;X&lt;/code&gt; minutes, suspend it."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's convenient, but blind.&lt;/p&gt;

&lt;p&gt;They don't know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Whether the system is under memory pressure&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Whether CPU is spiking&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Whether you're on battery&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Whether the tab is part of your active workflow&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They operate on time, not state.&lt;/p&gt;

&lt;p&gt;What I wanted was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A deterministic, pressure-aware, context-sensitive lifecycle engine.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not AI. Not cloud analytics. Just a well-structured system.&lt;/p&gt;


&lt;h2&gt;
  
  
  Design Goals
&lt;/h2&gt;

&lt;p&gt;Before writing any code, I defined constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Deterministic behavior (no black-box magic)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No cloud, no telemetry&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Respect user intent (never discard active or pinned tabs)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pressure-aware decisions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Context-aware heuristics&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Clean separation of responsibilities&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This last one became the most important architectural decision.&lt;/p&gt;


&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The system consists of two components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chrome Extension (MV3)
  - Tab activity tracking
  - Focus clustering
  - TTL gating &amp;amp; guardrails
        ↓ Native Messaging
Rust Native Host
  - System metrics (RAM, CPU, Battery)
  - Pressure scoring engine
  - Deterministic classification
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why split it?
&lt;/h3&gt;

&lt;p&gt;Chrome extensions cannot reliably access low-level system metrics such as real memory pressure.&lt;/p&gt;

&lt;p&gt;So I separated concerns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;extension&lt;/strong&gt; manages browser lifecycle.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;Rust native host&lt;/strong&gt; understands system state.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They communicate through Chrome's Native Messaging API.&lt;/p&gt;

&lt;p&gt;This keeps the system clean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Browser logic stays in the browser.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;System logic stays native.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
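
&lt;p&gt;For context, Chrome's Native Messaging transport frames every JSON message with a 32-bit length prefix in native byte order, followed by the UTF-8 payload. A minimal Python sketch of that framing (the host itself does this in Rust; function names here are just illustrative):&lt;/p&gt;

```python
import json
import struct

def frame(obj):
    # Chrome native messaging: a 32-bit length prefix in native byte order,
    # followed by the UTF-8 encoded JSON payload.
    payload = json.dumps(obj).encode("utf-8")
    return struct.pack("=I", len(payload)) + payload

def unframe(buf):
    # Inverse operation: read the length, then decode that many bytes of JSON.
    (length,) = struct.unpack("=I", buf[:4])
    return json.loads(buf[4:4 + length].decode("utf-8"))
```

&lt;p&gt;In practice the extension writes framed requests to the host's stdin and reads framed responses from its stdout.&lt;/p&gt;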




&lt;h2&gt;
  
  
  The Pressure Engine (Rust)
&lt;/h2&gt;

&lt;p&gt;Instead of checking raw RAM percentage, I built a weighted pressure scoring model.&lt;/p&gt;

&lt;p&gt;The Rust host collects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Total RAM &lt;/li&gt;
&lt;li&gt;  Used RAM &lt;/li&gt;
&lt;li&gt;  Free RAM &lt;/li&gt;
&lt;li&gt;  CPU usage &lt;/li&gt;
&lt;li&gt;  Battery state (if available)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From these, it computes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;pressure_score&lt;/code&gt; (0-100) &lt;/li&gt;
&lt;li&gt;  &lt;code&gt;pressure_level&lt;/code&gt; (LOW / MEDIUM / HIGH) &lt;/li&gt;
&lt;li&gt;  &lt;code&gt;pressure_reasons&lt;/code&gt; (RAM_HIGH, CPU_ELEVATED, ON_BATTERY, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RAM is the dominant signal.&lt;br&gt;
CPU acts as a modifier.&lt;br&gt;
Battery adds a small aggressiveness bias.&lt;/p&gt;

&lt;p&gt;The goal is not to be perfect; it's to be consistent and explainable.&lt;/p&gt;

&lt;p&gt;Instead of saying:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"System busy."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;HIGH pressure because RAM_HIGH + ON_BATTERY.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That reason tagging matters for transparency.&lt;/p&gt;
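
&lt;p&gt;As an illustration, the weighted scoring and reason tagging can be sketched like this. The weights and thresholds below are invented for the example; the real logic lives in the Rust &lt;code&gt;pressure&lt;/code&gt; module:&lt;/p&gt;

```python
def pressure(ram_pct, cpu_pct, on_battery):
    # RAM is the dominant signal, CPU acts as a modifier,
    # battery adds a small aggressiveness bias. All weights illustrative.
    score = min(100.0, 0.7 * ram_pct + 0.25 * cpu_pct + (5.0 if on_battery else 0.0))

    reasons = []
    if ram_pct >= 80:
        reasons.append("RAM_HIGH")
    if cpu_pct >= 60:
        reasons.append("CPU_ELEVATED")
    if on_battery:
        reasons.append("ON_BATTERY")

    level = "LOW"
    if score >= 70:
        level = "HIGH"
    elif score >= 40:
        level = "MEDIUM"

    return {
        "pressure_score": round(score),
        "pressure_level": level,
        "pressure_reasons": reasons,
    }
```

&lt;p&gt;Because the reasons ride along with the score, the host can always explain its classification instead of just asserting it.&lt;/p&gt;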




&lt;h2&gt;
  
  
  Context Awareness - Focus Clustering
&lt;/h2&gt;

&lt;p&gt;Not all inactive tabs are equal. A tab opened 30 minutes ago in your active workflow is very different from a forgotten tab in another window.&lt;/p&gt;

&lt;p&gt;So I introduced &lt;strong&gt;Focus Mode&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Focus clustering is based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Same hostname as active tab &lt;/li&gt;
&lt;li&gt;  Recent activity window &lt;/li&gt;
&lt;li&gt;  Same window constraint &lt;/li&gt;
&lt;li&gt;  Cluster size cap&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tabs inside the active "cluster" use a longer TTL. Tabs outside the cluster expire faster under pressure. This makes the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Less disruptive&lt;/li&gt;
&lt;li&gt;  More aligned with user context&lt;/li&gt;
&lt;li&gt;  Less likely to discard something you'll immediately need&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's still deterministic, just smarter.&lt;/p&gt;
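
&lt;p&gt;A rough Python sketch of the clustering predicate, with hypothetical field names and defaults (the actual logic lives in the extension):&lt;/p&gt;

```python
from urllib.parse import urlparse

def in_focus_cluster(tab, active_tab, now, cluster_size,
                     window_s=1800, max_cluster=8):
    # Cluster size cap.
    if cluster_size >= max_cluster:
        return False
    # Same-window constraint.
    if tab["window_id"] != active_tab["window_id"]:
        return False
    # Same hostname as the active tab.
    if urlparse(tab["url"]).hostname != urlparse(active_tab["url"]).hostname:
        return False
    # Recent activity window.
    return tab["last_active_ts"] >= now - window_s
```

&lt;p&gt;Tabs that pass this predicate get the longer TTL; everything else competes for survival under pressure.&lt;/p&gt;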




&lt;h2&gt;
  
  
  Guardrails &amp;amp; Safety
&lt;/h2&gt;

&lt;p&gt;Aggressive resource management can easily become destructive. So, strict guardrails were built in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Never discard active tabs&lt;/li&gt;
&lt;li&gt;  Never discard pinned tabs&lt;/li&gt;
&lt;li&gt;  Never discard audible tabs&lt;/li&gt;
&lt;li&gt;  Respect protected domains&lt;/li&gt;
&lt;li&gt;  Enforce TTL minimums&lt;/li&gt;
&lt;li&gt;  Apply cooldown between prune cycles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents oscillation and surprise behavior. The goal is not maximum efficiency. The goal is controlled stability.&lt;/p&gt;
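
&lt;p&gt;The guardrails collapse into a single discard-eligibility predicate. This is an illustrative sketch with hypothetical field names, not the production code:&lt;/p&gt;

```python
from urllib.parse import urlparse

def may_discard(tab, now, last_prune_ts, protected_domains,
                min_ttl_s=300, cooldown_s=120):
    # Hard guardrails: never discard active, pinned, or audible tabs.
    if tab["active"] or tab["pinned"] or tab["audible"]:
        return False
    # Respect protected domains.
    if urlparse(tab["url"]).hostname in protected_domains:
        return False
    # Enforce the TTL minimum: the tab must have been idle long enough.
    if min_ttl_s > now - tab["last_active_ts"]:
        return False
    # Apply cooldown between prune cycles.
    if cooldown_s > now - last_prune_ts:
        return False
    return True
```

&lt;p&gt;Every rule is a cheap boolean check, which keeps prune decisions auditable.&lt;/p&gt;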




&lt;h2&gt;
  
  
  Why Rust?
&lt;/h2&gt;

&lt;p&gt;Rust was chosen for the native host because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Memory safety&lt;/li&gt;
&lt;li&gt;  Explicit modeling&lt;/li&gt;
&lt;li&gt;  Strong type system&lt;/li&gt;
&lt;li&gt;  Clean modular architecture&lt;/li&gt;
&lt;li&gt;  Lightweight binary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Rust side is structured into modules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;metrics&lt;/code&gt;: system state collection&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;battery&lt;/code&gt;: optional battery signal&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;pressure&lt;/code&gt;: scoring logic&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;protocol&lt;/code&gt;: native messaging transport&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;state&lt;/code&gt;: API contract&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes the native host feel like a real subsystem, not a script.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It Achieves
&lt;/h2&gt;

&lt;p&gt;In practice, this system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Reduces RAM pressure under load&lt;/li&gt;
&lt;li&gt;  Keeps active workflows intact&lt;/li&gt;
&lt;li&gt;  Makes browser behavior predictable&lt;/li&gt;
&lt;li&gt;  Avoids blind "time-based" suspension&lt;/li&gt;
&lt;li&gt;  Plays nicer with other system applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It doesn't eliminate memory usage. It orchestrates it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;A few things stood out during this project:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. MV3 Service Workers Have Quirks
&lt;/h3&gt;

&lt;p&gt;Extension background scripts are ephemeral. State management must be deliberate.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Determinism Beats "Smartness"
&lt;/h3&gt;

&lt;p&gt;Clear, explainable rules feel safer than opaque heuristics.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Separation of Concerns Changes Everything
&lt;/h3&gt;

&lt;p&gt;Keeping system logic in Rust and browser logic in the extension made experimentation much easier.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Observability Matters
&lt;/h3&gt;

&lt;p&gt;Reason tagging and structured logging made debugging and tuning far easier.&lt;/p&gt;




&lt;h2&gt;
  
  
  Future Directions
&lt;/h2&gt;

&lt;p&gt;This project is still evolving. Some experimental directions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Event-driven pressure signals (instead of polling)&lt;/li&gt;
&lt;li&gt;  Chrome process memory integration&lt;/li&gt;
&lt;li&gt;  Predictive return probability modeling&lt;/li&gt;
&lt;li&gt;  Offline data analysis of tab lifecycle patterns&lt;/li&gt;
&lt;li&gt;  Adaptive TTL tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture supports these without becoming tangled.&lt;/p&gt;

&lt;p&gt;That was intentional.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Chrome didn't have a bug. It was just operating without coordination. By introducing a pressure-aware, context-sensitive orchestration layer, the browser becomes less chaotic and more cooperative with the system.&lt;/p&gt;

&lt;p&gt;This project started as frustration with RAM usage. It turned into an exploration of how browsers and operating systems can communicate more intelligently, without AI hype, and without cloud dependencies.&lt;/p&gt;

&lt;p&gt;Just clean architecture and deterministic policy!&lt;/p&gt;

&lt;p&gt;Check out the project here:&lt;br&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/tase-nikol/tab-memory-orchestrator" rel="noopener noreferrer"&gt;https://github.com/tase-nikol/tab-memory-orchestrator&lt;/a&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Sometimes the problem isn’t that a system is broken.&lt;br&gt;
It’s that its parts aren’t talking to each other.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>rust</category>
      <category>chromeextension</category>
      <category>architecture</category>
      <category>performance</category>
    </item>
    <item>
      <title>PromptCache Part II: When High Cache Hit Rates Become Dangerous</title>
      <dc:creator>Tasos Nikolaou</dc:creator>
      <pubDate>Thu, 05 Mar 2026 07:50:27 +0000</pubDate>
      <link>https://forem.com/tasenikol/promptcache-part-ii-when-high-cache-hit-rates-become-dangerous-204</link>
      <guid>https://forem.com/tasenikol/promptcache-part-ii-when-high-cache-hit-rates-become-dangerous-204</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A benchmark-driven look at semantic cache safety and intent isolation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the previous article, &lt;em&gt;&lt;a href="https://dev.to/tasenikol/promptcache-part-i-stop-paying-twice-for-the-same-llm-answer-202g"&gt;"Stop Paying Twice for the Same LLM Answer"&lt;/a&gt;&lt;/em&gt;, I introduced &lt;strong&gt;PromptCache&lt;/strong&gt; as a semantic caching layer designed to reduce LLM cost and latency.&lt;/p&gt;

&lt;p&gt;The premise was simple: if two prompts are semantically similar, we shouldn't pay for the answer twice. The results were compelling: high cache hit rates, significant cost reduction, lower latency. But after deploying and stress-testing the design, a more important question emerged:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What guarantees that a cache hit is actually correct?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Reducing cost is easy.&lt;br&gt;
Ensuring safe reuse is harder.&lt;/p&gt;

&lt;p&gt;This article documents the experiment that reshaped PromptCache's architecture, and why &lt;strong&gt;intent isolation&lt;/strong&gt; became a non-negotiable design constraint.&lt;/p&gt;




&lt;h2&gt;
  
  
  Measuring Semantic Cache Safety in LLM Systems
&lt;/h2&gt;

&lt;p&gt;Most semantic caches follow a simple pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Embed the prompt &lt;/li&gt;
&lt;li&gt; Retrieve the nearest cached embedding &lt;/li&gt;
&lt;li&gt; If similarity ≥ threshold then reuse the answer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This works well for performance. &lt;br&gt;
But it assumes something that isn't guaranteed:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;That semantic similarity implies safe reuse.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To test that assumption, I built a controlled benchmark.&lt;/p&gt;
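
&lt;p&gt;The three-step pattern above, reduced to a pure-Python sketch (brute-force cosine over in-memory entries; class and field names are illustrative, not the PromptCache API):&lt;/p&gt;

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

class NaiveSemanticCache:
    """Embed, retrieve nearest neighbor, reuse if similarity reaches the threshold."""

    def __init__(self, threshold=0.9):
        self.entries = []  # list of (embedding, answer) pairs
        self.threshold = threshold

    def get(self, embedding):
        best_answer, best_sim = None, -1.0
        for vec, answer in self.entries:
            sim = cosine(embedding, vec)
            if sim > best_sim:
                best_answer, best_sim = answer, sim
        if best_sim >= self.threshold:
            return best_answer  # cache hit
        return None  # cache miss

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))
```

&lt;p&gt;Note that nothing in &lt;code&gt;get&lt;/code&gt; knows which task an entry came from; that is exactly the assumption under test.&lt;/p&gt;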




&lt;h2&gt;
  
  
  The Question
&lt;/h2&gt;

&lt;p&gt;Can cosine similarity thresholding alone guarantee safe reuse? Or do we need structural isolation between tasks? To answer this, I defined a metric:&lt;/p&gt;

&lt;h2&gt;
  
  
  Unsafe Hit
&lt;/h2&gt;

&lt;p&gt;A cache hit is &lt;strong&gt;unsafe&lt;/strong&gt; if the returned answer belongs to a different task (intent) than the incoming request. This measures semantic collision, &lt;strong&gt;not&lt;/strong&gt; embedding quality.&lt;/p&gt;
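
&lt;p&gt;Concretely, the metric can be computed from (request intent, cached intent) pairs recorded on each hit; this is a hypothetical helper, not part of the PromptCache API:&lt;/p&gt;

```python
def unsafe_hit_rate(hits):
    # hits: list of (request_intent, cached_intent) pairs, one per cache hit.
    if not hits:
        return 0.0
    unsafe = sum(1 for requested, cached in hits if requested != cached)
    return unsafe / len(hits)
```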




&lt;h2&gt;
  
  
  What Is Intent Isolation?
&lt;/h2&gt;

&lt;p&gt;Intent isolation means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Partition the semantic cache by task boundary before performing similarity search.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of searching across all cached entries, we search only within the matching task. Similarity becomes a refinement step, not a boundary mechanism.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyk5uplv3a646aiy93gn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyk5uplv3a646aiy93gn.png" width="800" height="286"&gt;&lt;/a&gt;&lt;strong&gt;Figure 0 - Semantic Cache Search Space.&lt;/strong&gt;&lt;br&gt;
Without isolation, similarity search spans all tasks in one shared embedding space.&lt;/p&gt;

&lt;p&gt;With isolation, search is restricted to the matching intent partition.&lt;/p&gt;
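
&lt;p&gt;A minimal sketch of the partitioned variant (again pure-Python brute force with illustrative names; the key point is that lookup never crosses an &lt;code&gt;intent_id&lt;/code&gt; boundary):&lt;/p&gt;

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5))

class PartitionedSemanticCache:
    """Partition first, threshold second: similarity is a refinement, not a boundary."""

    def __init__(self, threshold=0.9):
        self.partitions = {}  # intent_id mapped to a list of (embedding, answer) pairs
        self.threshold = threshold

    def get(self, intent_id, embedding):
        # Search only within the matching intent partition.
        best_answer, best_sim = None, -1.0
        for vec, answer in self.partitions.get(intent_id, []):
            sim = cosine(embedding, vec)
            if sim > best_sim:
                best_answer, best_sim = answer, sim
        if best_sim >= self.threshold:
            return best_answer
        return None

    def put(self, intent_id, embedding, answer):
        self.partitions.setdefault(intent_id, []).append((embedding, answer))
```

&lt;p&gt;Cross-intent reuse is now impossible by construction, regardless of how close two embeddings happen to be.&lt;/p&gt;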




&lt;h2&gt;
  
  
  Experimental Setup
&lt;/h2&gt;

&lt;p&gt;I evaluated semantic caching across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Embedding model:&lt;/strong&gt; &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; &lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Backends:&lt;/strong&gt; 

&lt;ul&gt;
&lt;li&gt;  In-memory brute-force cosine &lt;/li&gt;
&lt;li&gt;  Redis (HNSW via RediSearch) &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;Workloads:&lt;/strong&gt; 

&lt;ul&gt;
&lt;li&gt;  Support queries &lt;/li&gt;
&lt;li&gt;  RAG-style retrieval questions &lt;/li&gt;
&lt;li&gt;  Creative prompts &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;Threshold sweep:&lt;/strong&gt; 0.82 -&amp;gt; 0.92 &lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;~2400 requests per configuration&lt;/strong&gt;
&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Two configurations were tested:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. No Intent Isolation
&lt;/h3&gt;

&lt;p&gt;All prompts shared the same semantic space.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Intent Isolation Enabled
&lt;/h3&gt;

&lt;p&gt;Cache entries were partitioned by &lt;code&gt;intent_id&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Each configuration was evaluated across identical prompt sequences to ensure comparability.&lt;br&gt;
Unsafe hits were computed by comparing stored &lt;code&gt;intent_id&lt;/code&gt; against request intent.&lt;/p&gt;




&lt;h2&gt;
  
  
  Result 1 - Hit Rate Looked Excellent
&lt;/h2&gt;

&lt;p&gt;Without intent isolation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Hit rate: ~97-99%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With intent isolation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Hit rate: 13-38% depending on threshold&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs63uo4brycdluty1821d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs63uo4brycdluty1821d.png" width="640" height="480"&gt;&lt;/a&gt;&lt;strong&gt;Figure 1 - Hit Rate vs Threshold.&lt;/strong&gt; Without intent isolation, semantic caching achieves ~98% hit rate. Enabling intent partitioning significantly reduces reuse density.  &lt;/p&gt;

&lt;p&gt;At first glance, the non-isolated configuration looks superior. But this metric is incomplete.&lt;/p&gt;




&lt;h2&gt;
  
  
  Result 2 - Unsafe Hit Rate Reveals the Problem
&lt;/h2&gt;

&lt;p&gt;Without intent isolation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Unsafe hit rate: ~95-100%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With intent isolation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Unsafe hit rate: 0%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwnah27xvm8153m8choj7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwnah27xvm8153m8choj7.png" width="640" height="480"&gt;&lt;/a&gt;&lt;strong&gt;Figure 2 - Unsafe Hit Rate vs Threshold.&lt;/strong&gt; Similarity thresholding alone does not prevent cross-intent reuse. Nearly all cache hits become unsafe without intent partitioning.  &lt;/p&gt;

&lt;p&gt;This pattern was consistent across support, RAG, and creative workloads. In other words:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Almost every "successful" cache hit without isolation was incorrect.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is not a marginal effect. It is structural cross-contamination.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Similarity Is Not Enough
&lt;/h2&gt;

&lt;p&gt;Embedding similarity measures geometric proximity in vector space. &lt;br&gt;
Intent boundaries are categorical.&lt;/p&gt;

&lt;p&gt;Cosine similarity answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Are these prompts semantically related?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It does not answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Are these prompts operationally interchangeable?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Semantic closeness is continuous; task equivalence is discrete. Threshold tuning cannot convert a continuous metric into a categorical guarantee, but partitioning can.&lt;/p&gt;




&lt;h2&gt;
  
  
  Result 3 - Backend Did Not Affect Correctness
&lt;/h2&gt;

&lt;p&gt;Both Redis (HNSW) and the in-memory backend produced identical:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Hit rate curves &lt;/li&gt;
&lt;li&gt;  Unsafe hit rate curves&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was expected: both implement cosine nearest-neighbor search with identical threshold logic. Correctness was dominated by key structure, &lt;strong&gt;not&lt;/strong&gt; the vector store implementation.&lt;/p&gt;

&lt;p&gt;Backend choice affects: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Persistence &lt;/li&gt;
&lt;li&gt;  Multi-process access &lt;/li&gt;
&lt;li&gt;  Scalability &lt;/li&gt;
&lt;li&gt;  Latency under load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But correctness properties should not depend on storage details!&lt;/p&gt;




&lt;h2&gt;
  
  
  Cost Savings Followed Hit Rate
&lt;/h2&gt;

&lt;p&gt;In this benchmark, each miss triggered a full LLM call with similar token usage.&lt;/p&gt;

&lt;p&gt;As a result, &lt;code&gt;cost_savings ≈ hit_rate&lt;/code&gt;, which confirms internal consistency. But cost reduction is meaningless if reuse is unsafe: correctness precedes optimization.&lt;/p&gt;




&lt;h2&gt;
  
  
  Production Implications
&lt;/h2&gt;

&lt;p&gt;If you rely solely on similarity thresholding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  You will inflate hit rates
&lt;/li&gt;
&lt;li&gt;  You will inflate cost savings &lt;/li&gt;
&lt;li&gt;  You may silently reuse incorrect answers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is particularly dangerous in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Multi-tenant systems
&lt;/li&gt;
&lt;li&gt;  Support bots
&lt;/li&gt;
&lt;li&gt;  RAG pipelines &lt;/li&gt;
&lt;li&gt;  Tool-driven workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The correct architectural pattern is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Partition first.&lt;br&gt;
Threshold second.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Similarity is a refinement mechanism, &lt;strong&gt;not&lt;/strong&gt; a safety boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;This was a controlled benchmark.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Dataset size was modest (~2-3k prompts) &lt;/li&gt;
&lt;li&gt;  Workloads were synthetic but structured &lt;/li&gt;
&lt;li&gt;  Extreme-scale recall behavior was not evaluated &lt;/li&gt;
&lt;li&gt;  Concurrency stress was not measured&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal was to isolate semantic collision behavior, &lt;strong&gt;not&lt;/strong&gt; benchmark vector database scalability.&lt;/p&gt;

&lt;p&gt;Future work should explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Larger datasets &lt;/li&gt;
&lt;li&gt;  Cross-model embedding drift &lt;/li&gt;
&lt;li&gt;  Concurrency stress testing &lt;/li&gt;
&lt;li&gt;  Partial response reuse&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Core Insight
&lt;/h2&gt;

&lt;p&gt;The dominant factor in semantic cache correctness is not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The embedding model &lt;/li&gt;
&lt;li&gt;  The vector database
&lt;/li&gt;
&lt;li&gt;  The similarity threshold&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is cache key design. &lt;br&gt;
Intent isolation is not an optimization. &lt;br&gt;
It is a safety requirement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Takeaway
&lt;/h2&gt;

&lt;p&gt;A 98% cache hit rate looks impressive. &lt;br&gt;
But without structural boundaries, it may be misleading.&lt;/p&gt;

&lt;p&gt;If your semantic cache shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  98% hit rate &lt;/li&gt;
&lt;li&gt;  98% cost savings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ask yourself: how many of those hits are actually correct? &lt;br&gt;
Optimization without isolation is probabilistic reuse. If you're building LLM infrastructure, this is not an academic nuance but a production concern.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Similarity optimizes reuse. &lt;br&gt;
Isolation guarantees correctness.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>backend</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>PromptCache Part I: Stop Paying Twice for the Same LLM Answer</title>
      <dc:creator>Tasos Nikolaou</dc:creator>
      <pubDate>Tue, 24 Feb 2026 08:23:32 +0000</pubDate>
      <link>https://forem.com/tasenikol/promptcache-part-i-stop-paying-twice-for-the-same-llm-answer-202g</link>
      <guid>https://forem.com/tasenikol/promptcache-part-i-stop-paying-twice-for-the-same-llm-answer-202g</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Designing a semantic cache layer for cost and latency optimization in LLM systems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most LLM cost isn’t spent on novelty.&lt;br&gt;
It’s spent on repetition: requests that are semantically identical, but syntactically different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PromptCache&lt;/strong&gt; was built to eliminate that redundancy.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Invisible Cost Leak in LLM Systems
&lt;/h2&gt;

&lt;p&gt;If you’re running an LLM in production, you are almost certainly paying for this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  "How do I reset my password?" &lt;/li&gt;
&lt;li&gt;  "I forgot my password, what do I do?"
&lt;/li&gt;
&lt;li&gt;  "Steps to reset account password?" &lt;/li&gt;
&lt;li&gt;  "Help me change password"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different strings.&lt;br&gt;
Same intent.&lt;br&gt;
Same answer.&lt;br&gt;
Different billable request.&lt;/p&gt;

&lt;p&gt;Traditional caching doesn't help because:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;"How do I reset my password?" != "Steps to reset account password?"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Exact match fails. &lt;br&gt;
But meaning hasn't changed. &lt;br&gt;
That's where semantic caching comes in.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Theory: Why This Works
&lt;/h2&gt;

&lt;p&gt;LLMs don't understand text as strings. &lt;br&gt;
They convert text into vectors (embeddings). &lt;br&gt;
Two sentences with similar meaning produce vectors that are close together in high-dimensional space.&lt;/p&gt;

&lt;p&gt;Example (simplified):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Reset my password"
      ↓
[0.12, -0.87, 0.44, ...]

"How do I change my password?"
      ↓
[0.11, -0.89, 0.41, ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These vectors are very close.&lt;/p&gt;

&lt;p&gt;So instead of asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Have I seen this exact string before?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Have I seen something &lt;em&gt;semantically similar&lt;/em&gt; before?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the similarity is high enough, we reuse the answer. &lt;br&gt;
That's semantic caching.&lt;/p&gt;
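
&lt;p&gt;With the toy vectors above, a quick pure-Python cosine similarity check makes this concrete:&lt;/p&gt;

```python
def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

reset = [0.12, -0.87, 0.44]   # "Reset my password"
change = [0.11, -0.89, 0.41]  # "How do I change my password?"

print(cosine(reset, change))  # roughly 0.999, well above a typical 0.9 threshold
```

&lt;p&gt;Real embeddings have hundreds of dimensions, but the comparison works exactly the same way.&lt;/p&gt;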


&lt;h2&gt;
  
  
  How It Works in Practice
&lt;/h2&gt;

&lt;p&gt;When a request comes in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Prompt
     ↓
Embedding
     ↓
Vector search in Redis
     ↓
High similarity?
     ↓
Yes → Return cached response
No  → Call LLM and store result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You're adding a semantic memoization layer in front of your LLM.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real Results
&lt;/h2&gt;

&lt;p&gt;In a support-heavy workload with repetitive queries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  ~60% cache hit rate &lt;/li&gt;
&lt;li&gt;  ~50% reduction in token usage &lt;/li&gt;
&lt;li&gt;  ~40% lower API spend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Results vary by workload density and repetition patterns, but in structured environments, the impact is immediate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Example Implementation
&lt;/h2&gt;

&lt;p&gt;Here's a simplified example using Redis vector search:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;promptcache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SemanticCache&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;promptcache.backends.redis_vector&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RedisVectorBackend&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;promptcache.embedders.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbedder&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;promptcache.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CacheMeta&lt;/span&gt;

&lt;span class="n"&gt;embedder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbedder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RedisVectorBackend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis://localhost:6379/0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;support-bot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CacheMeta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful support assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_or_set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How can I change my password?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm_call&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_llm_call&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;extract_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache_hit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it.&lt;/p&gt;

&lt;p&gt;No orchestration framework required.&lt;/p&gt;
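&lt;p&gt;The &lt;code&gt;my_llm_call&lt;/code&gt; in the snippet can be any callable; the only contract the cache relies on is that &lt;code&gt;extract_text&lt;/code&gt; knows the shape of whatever it returns. A minimal stand-in (the response shape here is an assumption, chosen to match the &lt;code&gt;r.output_text&lt;/code&gt; accessor above):&lt;/p&gt;

```python
from dataclasses import dataclass

# Stub standing in for a real LLM client call; the cache only needs
# a callable plus an extract_text accessor matching its return shape.
@dataclass
class FakeResponse:
    output_text: str

def my_llm_call(prompt: str) -> FakeResponse:
    # In production this would hit a real model endpoint instead.
    return FakeResponse(output_text=f"echo: {prompt}")

print(my_llm_call("How can I change my password?").output_text)
```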

&lt;p&gt;If you want to try this approach, I packaged it up here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/tase-nikol/promptcache" rel="noopener noreferrer"&gt;https://github.com/tase-nikol/promptcache&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PyPI: &lt;a href="https://pypi.org/project/promptcache-ai/" rel="noopener noreferrer"&gt;https://pypi.org/project/promptcache-ai/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;promptcache-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  When This Works Best
&lt;/h2&gt;

&lt;p&gt;Semantic caching is powerful when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Prompts are repetitive &lt;/li&gt;
&lt;li&gt;  Temperature is low &lt;/li&gt;
&lt;li&gt;  Answers are stable &lt;/li&gt;
&lt;li&gt;  Volume is high&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It won't help much for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Highly personalized prompts &lt;/li&gt;
&lt;li&gt;  Creative writing &lt;/li&gt;
&lt;li&gt;  Rapidly changing context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In those cases, novelty dominates repetition, and caching provides diminishing returns.&lt;/p&gt;
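&lt;p&gt;You can sanity-check the economics before adopting it. A back-of-envelope model (all prices below are illustrative assumptions, not real quotes): every request pays for one embedding lookup, and every cache hit avoids one full LLM call.&lt;/p&gt;

```python
# Back-of-envelope: when does semantic caching pay off?
def expected_savings(calls: int, hit_rate: float,
                     llm_cost: float, embed_cost: float) -> float:
    """Net savings: each call pays for an embedding,
    each hit avoids one full LLM call."""
    return calls * (hit_rate * llm_cost - embed_cost)

# 100k calls/day, 40% near-duplicate prompts,
# $0.002 per LLM call vs $0.00002 per embedding (assumed figures).
savings = expected_savings(100_000, 0.40, 0.002, 0.00002)
print(f"${savings:,.2f} saved per day")
```

&lt;p&gt;The break-even point is just &lt;code&gt;hit_rate &amp;gt; embed_cost / llm_cost&lt;/code&gt;, which is why low-repetition workloads rarely justify the extra moving part.&lt;/p&gt;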




&lt;h2&gt;
  
  
  The Bigger Insight
&lt;/h2&gt;

&lt;p&gt;Most LLM systems are fundamentally stateless.&lt;br&gt;
They recompute answers even when nothing meaningful has changed.&lt;/p&gt;

&lt;p&gt;Semantic caching introduces selective memory, reusing intelligence only when it is economically justified.&lt;/p&gt;

&lt;p&gt;Instead of optimizing prompts endlessly, sometimes the smarter move is optimizing infrastructure.&lt;/p&gt;




&lt;p&gt;If you're building LLM systems in production, semantic caching is one of the highest-leverage optimizations you can add.&lt;/p&gt;

&lt;p&gt;But optimizing cost raised a more uncomfortable question:&lt;br&gt;
What guarantees that a cache hit is actually correct?&lt;/p&gt;

&lt;p&gt;In the next article, we examine how high hit rates can silently mask semantic errors, and why &lt;strong&gt;PromptCache&lt;/strong&gt; evolved beyond threshold tuning.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Intelligence is expensive.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Memory is cheap.&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Use both wisely.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Building a Framework-Agnostic Health Check Library for Python Microservices</title>
      <dc:creator>Tasos Nikolaou</dc:creator>
      <pubDate>Tue, 17 Feb 2026 21:31:51 +0000</pubDate>
      <link>https://forem.com/tasenikol/building-a-framework-agnostic-health-check-library-for-python-microservices-1402</link>
      <guid>https://forem.com/tasenikol/building-a-framework-agnostic-health-check-library-for-python-microservices-1402</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;From duplicated &lt;code&gt;/health&lt;/code&gt; endpoints to a published PyPI package - an engineering deep dive.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Problem: Death by Copy-Paste Health Checks
&lt;/h2&gt;

&lt;p&gt;In a typical microservice architecture, health endpoints start simple:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;GET /health&lt;/code&gt;&lt;br&gt;
&lt;code&gt;GET /ready&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;But over time, reality sets in.&lt;/p&gt;

&lt;p&gt;Some services use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Django + PostgreSQL + Redis + Celery&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;FastAPI + SQLAlchemy + Redis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;BFFs that depend on upstream HTTP services&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;RabbitMQ + background workers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Async stacks mixed with sync stacks&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each service needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Liveness checks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Readiness checks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dependency verification&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Timeouts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Structured JSON output&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Correct HTTP status codes&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And before long, every service has its own slightly different &lt;code&gt;HealthService&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Different thresholds.&lt;br&gt;
Different response formats.&lt;br&gt;
Different timeout logic.&lt;br&gt;
Different readiness semantics.&lt;/p&gt;

&lt;p&gt;That's when I realized:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Health checks are infrastructure. They should not be rewritten per service.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I built &lt;strong&gt;PulseCheck&lt;/strong&gt; - a framework-agnostic health and readiness library for Python.&lt;/p&gt;


&lt;h2&gt;
  
  
  Design Goals
&lt;/h2&gt;

&lt;p&gt;Before writing a single line of code, I defined constraints:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Framework-agnostic core&lt;/li&gt;
&lt;li&gt; Pluggable dependency checks&lt;/li&gt;
&lt;li&gt; Async-first design (to support FastAPI)&lt;/li&gt;
&lt;li&gt; Sync compatibility (for Django)&lt;/li&gt;
&lt;li&gt; No forced dependency pollution&lt;/li&gt;
&lt;li&gt; Kubernetes-friendly readiness semantics&lt;/li&gt;
&lt;li&gt; Optional dependency extras&lt;/li&gt;
&lt;li&gt; Clean, structured JSON output&lt;/li&gt;
&lt;li&gt; Production-safe timeouts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This wasn't just about code reuse.&lt;/p&gt;

&lt;p&gt;It was about &lt;strong&gt;architectural consistency&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  Architecture: Core + Adapters
&lt;/h2&gt;

&lt;p&gt;The key design decision was separation of concerns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pulsecheck/
│
├── core/        ← Framework-agnostic health engine
├── fastapi/     ← FastAPI adapter
└── django/      ← Django adapter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  1. Core Engine
&lt;/h3&gt;

&lt;p&gt;The core layer contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Health registry &lt;/li&gt;
&lt;li&gt;  Health aggregation logic &lt;/li&gt;
&lt;li&gt;  Status combination rules &lt;/li&gt;
&lt;li&gt;  Dependency check base class &lt;/li&gt;
&lt;li&gt;  Timeout handling &lt;/li&gt;
&lt;li&gt;  Response schema&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It has &lt;strong&gt;zero framework dependencies&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The core doesn't know what FastAPI or Django is.&lt;/p&gt;
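&lt;p&gt;To make that concrete, here is a minimal sketch of what such a framework-agnostic check contract can look like. The names (&lt;code&gt;BaseCheck&lt;/code&gt;, &lt;code&gt;CheckResult&lt;/code&gt;, &lt;code&gt;Status&lt;/code&gt;) are illustrative, not PulseCheck's actual internals:&lt;/p&gt;

```python
# Sketch of a framework-agnostic check: a timeout, a degraded
# threshold, and a structured result -- no web framework in sight.
import asyncio
import time
from dataclasses import dataclass
from enum import Enum

class Status(str, Enum):
    HEALTHY = "HEALTHY"
    DEGRADED = "DEGRADED"
    UNHEALTHY = "UNHEALTHY"

@dataclass
class CheckResult:
    name: str
    status: Status
    response_time_ms: float

class BaseCheck:
    name = "base"
    timeout = 5.0          # hard-failure ceiling, seconds
    degraded_after = 0.5   # slower than this => DEGRADED, seconds

    async def probe(self) -> None:
        """Subclasses touch the real dependency and raise on failure."""
        raise NotImplementedError

    async def run(self) -> CheckResult:
        start = time.perf_counter()
        try:
            await asyncio.wait_for(self.probe(), timeout=self.timeout)
            ok = True
        except Exception:
            ok = False
        elapsed = time.perf_counter() - start
        if not ok:
            status = Status.UNHEALTHY
        elif elapsed > self.degraded_after:
            status = Status.DEGRADED
        else:
            status = Status.HEALTHY
        return CheckResult(self.name, status, elapsed * 1000)
```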




&lt;h3&gt;
  
  
  2. Pluggable Checks
&lt;/h3&gt;

&lt;p&gt;Each dependency is implemented as a check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;SQLAlchemyAsyncCheck&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;DjangoDBCheck&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;RedisAsyncCheck&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;RedisSyncCheck&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;RabbitMQKombuCheck&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;CeleryInspectCheck&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;HttpDependencyCheck&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Has a name&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Has a timeout&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Has a degraded threshold&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Returns structured results&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;registry = HealthRegistry(environment="prod")

registry.register(SQLAlchemyAsyncCheck(engine))
registry.register(RedisAsyncCheck(redis_url))
registry.register(CeleryInspectCheck(celery_app))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No monolithic service class.&lt;br&gt;
Just composition.&lt;/p&gt;


&lt;h2&gt;
  
  
  Async-First, Sync-Compatible
&lt;/h2&gt;

&lt;p&gt;FastAPI is async.&lt;br&gt;
Django is traditionally sync.&lt;/p&gt;

&lt;p&gt;Instead of creating two engines, the core is async-first.&lt;/p&gt;

&lt;p&gt;Sync checks are wrapped using:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;asyncio.to_thread(...)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This gives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Async compatibility &lt;/li&gt;
&lt;li&gt;  Non-blocking readiness &lt;/li&gt;
&lt;li&gt;  Unified aggregation logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This avoids duplicating the health engine.&lt;/p&gt;
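&lt;p&gt;A sketch of the wrapping idea (helper names are illustrative, not the library's API): the blocking call runs on a worker thread via &lt;code&gt;asyncio.to_thread&lt;/code&gt;, so the event loop can fan out all readiness checks concurrently.&lt;/p&gt;

```python
import asyncio
import time

def sync_db_ping() -> bool:
    """Stand-in for a blocking driver call (e.g. a sync DB cursor)."""
    time.sleep(0.05)  # simulate a network round-trip
    return True

async def run_check(fn) -> bool:
    # to_thread pushes the blocking call onto a worker thread,
    # so the event loop stays free for the other checks.
    return await asyncio.to_thread(fn)

async def readiness() -> bool:
    # Both "sync" checks run concurrently instead of back-to-back.
    results = await asyncio.gather(run_check(sync_db_ping),
                                   run_check(sync_db_ping))
    return all(results)

print(asyncio.run(readiness()))
```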


&lt;h2&gt;
  
  
  Readiness vs Liveness
&lt;/h2&gt;

&lt;p&gt;This is often misunderstood.&lt;/p&gt;

&lt;p&gt;Liveness:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Is the process alive?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Readiness:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Can this service safely receive traffic?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;PulseCheck separates them cleanly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;registry.liveness()
await registry.readiness()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Readiness runs dependency checks.&lt;br&gt;
Liveness does not.&lt;/p&gt;

&lt;p&gt;This mirrors Kubernetes probe behavior.&lt;/p&gt;
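&lt;p&gt;The probe contract ultimately boils down to HTTP status codes. A minimal sketch of the mapping Kubernetes expects (helper names are illustrative):&lt;/p&gt;

```python
# Liveness vs readiness as HTTP responses:
# a failing readiness probe stops traffic; a failing liveness
# probe restarts the pod -- so only liveness should be trivial.
def liveness_response() -> tuple[int, dict]:
    # If we can execute this code, the process is alive.
    return 200, {"status": "HEALTHY"}

def readiness_response(dependencies_ok: bool) -> tuple[int, dict]:
    # 503 tells the kubelet / load balancer to stop routing
    # traffic here, without triggering a restart.
    if dependencies_ok:
        return 200, {"status": "HEALTHY"}
    return 503, {"status": "UNHEALTHY"}
```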


&lt;h2&gt;
  
  
  Handling Degraded States
&lt;/h2&gt;

&lt;p&gt;Health isn't binary.&lt;/p&gt;

&lt;p&gt;Instead of just &lt;code&gt;UP&lt;/code&gt; or &lt;code&gt;DOWN&lt;/code&gt;, PulseCheck supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;HEALTHY&lt;/code&gt; &lt;/li&gt;
&lt;li&gt;  &lt;code&gt;DEGRADED&lt;/code&gt; &lt;/li&gt;
&lt;li&gt;  &lt;code&gt;UNHEALTHY&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a dependency is slow but responding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "status": "DEGRADED",
  "response_time_ms": 750
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives operational insight without triggering restarts.&lt;/p&gt;
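&lt;p&gt;One plausible way to aggregate per-dependency results into a single service status (a sketch, not necessarily PulseCheck's exact rule) is "worst status wins": ordering the statuses by severity makes the overall status a one-line &lt;code&gt;max&lt;/code&gt;.&lt;/p&gt;

```python
from enum import IntEnum

# Severity ordering: one UNHEALTHY dependency makes the whole
# service UNHEALTHY; DEGRADED propagates without forcing restarts.
class Status(IntEnum):
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2

def combine(statuses: list[Status]) -> Status:
    # No registered checks => nothing can be wrong.
    return max(statuses, default=Status.HEALTHY)

print(combine([Status.HEALTHY, Status.DEGRADED, Status.HEALTHY]).name)
# DEGRADED
```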




&lt;h2&gt;
  
  
  Optional Dependencies Done Right
&lt;/h2&gt;

&lt;p&gt;One of the most important design decisions was dependency management.&lt;/p&gt;

&lt;p&gt;FastAPI projects already have FastAPI.&lt;br&gt;
Django projects already have Django.&lt;/p&gt;

&lt;p&gt;The library must not force unnecessary installations.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;pyproject.toml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[project.optional-dependencies]
fastapi = ["fastapi&amp;gt;=0.100"]
django = ["Django&amp;gt;=4.2"]
redis_async = ["redis&amp;gt;=5.0"]
rabbitmq = ["kombu&amp;gt;=5.3"]
celery = ["celery&amp;gt;=5.3"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now:&lt;/p&gt;

&lt;p&gt;FastAPI service:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install pulsecheck-py[fastapi,redis_async]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Django service:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install pulsecheck-py[django,redis_sync]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Clean. Explicit. Controlled.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hiding Health Endpoints From Swagger
&lt;/h2&gt;

&lt;p&gt;Health endpoints are infrastructure endpoints.&lt;/p&gt;

&lt;p&gt;In FastAPI:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;@router.get("/health", include_in_schema=False)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;They exist.&lt;br&gt;
They work.&lt;br&gt;
They don't pollute public API docs.&lt;/p&gt;

&lt;p&gt;Small detail. Big professionalism signal.&lt;/p&gt;


&lt;h2&gt;
  
  
  Testing Before Publishing
&lt;/h2&gt;

&lt;p&gt;Before uploading to PyPI, I tested:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Editable installs (&lt;code&gt;pip install -e .&lt;/code&gt;) &lt;/li&gt;
&lt;li&gt;  Wheel builds (&lt;code&gt;python -m build&lt;/code&gt;) &lt;/li&gt;
&lt;li&gt;  Installation from built wheel &lt;/li&gt;
&lt;li&gt;  Installation from TestPyPI &lt;/li&gt;
&lt;li&gt;  Optional extras resolution &lt;/li&gt;
&lt;li&gt;  Fresh virtual environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also learned something important:&lt;/p&gt;

&lt;p&gt;TestPyPI contains junk packages that can interfere with dependency resolution. Always use:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--extra-index-url https://test.pypi.org/simple/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Not &lt;code&gt;--index-url&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Small ecosystem lesson.&lt;/p&gt;


&lt;h2&gt;
  
  
  Publishing to PyPI
&lt;/h2&gt;

&lt;p&gt;Publishing was straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m build
python -m twine upload dist/*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Important rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You cannot overwrite a version on PyPI.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every change requires a version bump.&lt;/p&gt;

&lt;p&gt;This enforces discipline.&lt;/p&gt;




&lt;h2&gt;
  
  
  Engineering Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; Design the API before writing implementation. &lt;/li&gt;
&lt;li&gt; Keep core logic framework-agnostic. &lt;/li&gt;
&lt;li&gt; Async-first design avoids duplication. &lt;/li&gt;
&lt;li&gt; Optional dependencies prevent ecosystem pollution. &lt;/li&gt;
&lt;li&gt; Health endpoints are infrastructure, not business logic. &lt;/li&gt;
&lt;li&gt; Packaging and versioning require discipline. &lt;/li&gt;
&lt;li&gt; Publishing is easier than maintaining.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Microservices suffer from invisible duplication.&lt;/p&gt;

&lt;p&gt;Health checks are often treated as boilerplate.&lt;/p&gt;

&lt;p&gt;But consistency in infrastructure code improves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Operational clarity&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitoring integration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kubernetes reliability&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Onboarding speed&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Codebase maintainability&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PulseCheck turned copy-paste health logic into a reusable, composable abstraction.&lt;/p&gt;




&lt;h2&gt;
  
  
  Future Roadmap
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  OpenTelemetry hooks &lt;/li&gt;
&lt;li&gt;  Prometheus integration &lt;/li&gt;
&lt;li&gt;  Circuit-breaker awareness &lt;/li&gt;
&lt;li&gt;  Startup probe support &lt;/li&gt;
&lt;li&gt;  Health history tracking &lt;/li&gt;
&lt;li&gt;  Async worker health strategies&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Publishing a library is not about writing code.&lt;/p&gt;

&lt;p&gt;It's about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  API design &lt;/li&gt;
&lt;li&gt;  Dependency discipline &lt;/li&gt;
&lt;li&gt;  Versioning strategy &lt;/li&gt;
&lt;li&gt;  Documentation clarity &lt;/li&gt;
&lt;li&gt;  Ecosystem compatibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PulseCheck started as internal cleanup.&lt;br&gt;
It became a reusable infrastructure layer.&lt;/p&gt;

&lt;p&gt;If you're duplicating health logic across services, consider abstracting it.&lt;/p&gt;

&lt;p&gt;Your future self will thank you.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;p&gt;PyPI: &lt;a href="https://pypi.org/project/pulsecheck-py/" rel="noopener noreferrer"&gt;https://pypi.org/project/pulsecheck-py/&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/tase-nikol/pulsecheck-py" rel="noopener noreferrer"&gt;https://github.com/tase-nikol/pulsecheck-py&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;If you'd like feedback on the architecture or want to contribute, feel free to reach out.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;"PulseCheck is intentionally minimal today. But its architecture allows deeper observability and resilience integrations"&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>microservices</category>
      <category>monitoring</category>
      <category>python</category>
    </item>
  </channel>
</rss>
