<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Daniel Popoola</title>
    <description>The latest articles on Forem by Daniel Popoola (@lisan_al_gaib).</description>
    <link>https://forem.com/lisan_al_gaib</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3278167%2F8337ed8b-d96c-4736-82d5-c44818266123.jpg</url>
      <title>Forem: Daniel Popoola</title>
      <link>https://forem.com/lisan_al_gaib</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/lisan_al_gaib"/>
    <language>en</language>
    <item>
      <title>Why Redis Cannot Share the Truth with Postgres - The architecture mistake that will oversell your tickets</title>
      <dc:creator>Daniel Popoola</dc:creator>
      <pubDate>Mon, 13 Apr 2026 22:16:04 +0000</pubDate>
      <link>https://forem.com/lisan_al_gaib/why-redis-cannot-share-the-truth-with-postgres-the-architecture-mistake-that-will-oversell-your-46h5</link>
      <guid>https://forem.com/lisan_al_gaib/why-redis-cannot-share-the-truth-with-postgres-the-architecture-mistake-that-will-oversell-your-46h5</guid>
      <description>&lt;p&gt;There is a moment, somewhere in the design of almost every backend system that mixes Redis and Postgres, where an engineer makes a decision that feels obviously correct and is actually subtly wrong.&lt;/p&gt;

&lt;p&gt;The decision looks like this: &lt;em&gt;Redis is fast, Postgres is durable. Use Redis to track inventory — it can handle the load. Persist the important stuff to Postgres.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It feels right because both halves are true. Redis &lt;em&gt;is&lt;/em&gt; fast. Postgres &lt;em&gt;is&lt;/em&gt; durable. The mistake is not in the premises. The mistake is in the conclusion — that these two systems can jointly own authoritative state.&lt;/p&gt;

&lt;p&gt;They cannot. Not because of a tooling limitation you can engineer around. Because of a fundamental property of distributed systems that no amount of clever code eliminates.&lt;/p&gt;

&lt;p&gt;This article is about that property, why it matters, and what the correct mental model looks like. It is grounded in a real system I built: &lt;strong&gt;FairQueue&lt;/strong&gt;, a virtual queue and inventory allocation engine for high-demand live events in the Nigerian market — the kind of system that has to survive 50,000 people trying to buy 5,000 tickets at exactly the same second.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Space
&lt;/h2&gt;

&lt;p&gt;Picture Detty December in Lagos. A Burna Boy concert. 5,000 tickets. The sale opens at noon.&lt;/p&gt;

&lt;p&gt;By 12:00:00.003, your server is receiving more concurrent requests than it has ever seen. Every one of those requests wants the same thing: a ticket. Most of them will be disappointed. Your job is to make sure exactly 5,000 of them succeed — no more, no less — and that payment for each of those 5,000 is correctly recorded.&lt;/p&gt;

&lt;p&gt;Overselling is not a minor bug. It means you charged someone for a ticket that does not exist. Silent inventory loss means someone got a ticket and you have no payment record. Both outcomes end careers and companies.&lt;/p&gt;

&lt;p&gt;The thundering herd problem is well understood. The less-discussed problem is what happens to your &lt;em&gt;data model&lt;/em&gt; when you try to handle it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Intuitive Architecture (And Why It Breaks)
&lt;/h2&gt;

&lt;p&gt;The most natural response to high read/write volume on a shared counter is: put it in Redis. Redis executes commands on a single thread, so each operation is atomic. A &lt;code&gt;DECR&lt;/code&gt; cannot race with another &lt;code&gt;DECR&lt;/code&gt; the way a separate read-then-&lt;code&gt;UPDATE&lt;/code&gt; against Postgres can without explicit locking. This reasoning is sound as far as it goes.&lt;/p&gt;

&lt;p&gt;So the intuitive architecture emerges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Redis&lt;/strong&gt; holds &lt;code&gt;inventory:{event_id}&lt;/code&gt; — the live ticket count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postgres&lt;/strong&gt; holds orders, claims, payments — the durable record&lt;/li&gt;
&lt;li&gt;The flow: check Redis, decrement Redis, write to Postgres&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is what that looks like in code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Check inventory&lt;/span&gt;
&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"inventory:event-123"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Int64&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ErrSoldOut&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Decrement&lt;/span&gt;
&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"inventory:event-123"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Persist&lt;/span&gt;
&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"INSERT INTO claims (...) VALUES (...)"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code has a bug. The bug is not in any single line. The bug is in the &lt;em&gt;model&lt;/em&gt; — in the assumption that these three operations form a coherent unit.&lt;/p&gt;

&lt;p&gt;They do not. They are three separate operations across two separate systems. No transaction spans them. Between any two of those lines, the process can crash, the network can partition, the Redis instance can restart. Each of those events produces a different kind of corruption.&lt;/p&gt;

&lt;p&gt;Let us be precise about what each failure looks like.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Failure Windows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Window 1: Between check and decrement
&lt;/h3&gt;

&lt;p&gt;You read the inventory: 1 ticket left. Before you decrement, another request reads the same count. Both see 1. Both decrement. Both insert into Postgres. You have sold the same ticket twice.&lt;/p&gt;

&lt;p&gt;This is a classic time-of-check/time-of-use (TOCTOU) race. It is solvable — Redis Lua scripts can make the check-and-decrement atomic. But solving this window does not close the others.&lt;/p&gt;

&lt;h3&gt;
  
  
  Window 2: Between Redis decrement and Postgres insert
&lt;/h3&gt;

&lt;p&gt;You atomically decrement Redis to 0. Before you insert into Postgres, the process crashes — OOM kill, deployment, hardware fault, power failure. It does not matter why.&lt;/p&gt;

&lt;p&gt;Redis says 0 tickets remain. Postgres has no claim record. The ticket has vanished. A real person paid — or was about to pay — and there is no recoverable record of their claim.&lt;/p&gt;

&lt;h3&gt;
  
  
  Window 3: Between Postgres insert and Redis decrement
&lt;/h3&gt;

&lt;p&gt;You reverse the order — Postgres first, Redis second. The Postgres insert commits. Before you decrement Redis, the process crashes.&lt;/p&gt;

&lt;p&gt;Now Redis shows 1 ticket remaining. Postgres has a committed claim. The next request that checks Redis will be told a ticket is available when none is. You may oversell.&lt;/p&gt;

&lt;h3&gt;
  
  
  Window 4: Redis restart
&lt;/h3&gt;

&lt;p&gt;Your Redis instance restarts. The inventory key evaporates. All the careful decrements you performed are gone. Redis now reports the key as missing, and what happens next depends on how your code handles that case: treat a missing key as zero and you stop selling with tickets still unsold; lazily re-seed it to the configured total and every ticket goes back on sale, including the ones already claimed and paid for.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why You Cannot Fix This With More Code
&lt;/h2&gt;

&lt;p&gt;The instinct at this point is to reach for compensating mechanisms. Retry logic. Distributed transactions. Two-phase commit. Sagas.&lt;/p&gt;

&lt;p&gt;These approaches are real and useful in the right contexts. They do not fix the fundamental problem here, because the fundamental problem is not a missing feature. It is a property of the environment.&lt;/p&gt;

&lt;p&gt;Martin Kleppmann puts it clearly in &lt;em&gt;Designing Data-Intensive Applications&lt;/em&gt;: the dual-write problem is not solved by making writes faster or retries smarter. It is solved by &lt;em&gt;choosing one system to be the source of truth and treating all other systems as derived state&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The moment you split authoritative state across Redis and Postgres — the moment both systems are required to agree for your data to be correct — you have created a consistency problem that lives in the gap between them. That gap cannot be closed. It can only be made smaller (with enough engineering complexity) or eliminated (by removing the split).&lt;/p&gt;

&lt;p&gt;There is no atomic operation that spans two storage systems. That is not a Redis limitation or a Postgres limitation. It is a consequence of the fact that they are separate processes, on separate machines, with separate failure modes.&lt;/p&gt;

&lt;p&gt;Every approach that tries to compensate for this — writing to both, reconciling differences, detecting divergence — is acknowledging the problem and managing it, not solving it. Management has a cost: operational complexity, latency, edge cases, and the ever-present risk that your compensation logic has its own bugs.&lt;/p&gt;

&lt;p&gt;The simpler answer is to not create the split.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Correct Mental Model: One Truth, One Cache
&lt;/h2&gt;

&lt;p&gt;The model that actually works is this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Postgres is the single source of truth. Redis is a performance layer. Redis holds nothing that cannot be reconstructed from Postgres.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This sounds like a constraint. It is actually a &lt;em&gt;simplification&lt;/em&gt;. When Redis holds only reconstructible state, every failure mode has a clean answer: reconstruct from Postgres.&lt;/p&gt;

&lt;p&gt;The ordering rule that follows from this model is strict:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Postgres is always written first. Redis is always written second. Never the reverse.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This rule is asymmetric by design. Violating it in one direction (Redis first, Postgres second) creates the possibility of a Redis state that Postgres cannot recover — an authoritative count with no corresponding record. The claim itself is silently lost, and the moment Redis state is wiped and rebuilt from Postgres, the inventory count comes back too high. That is the failure mode that oversells tickets.&lt;/p&gt;

&lt;p&gt;Violating it in the other direction (Postgres first, Redis second) means a process crash between the two writes leaves Redis showing &lt;em&gt;more&lt;/em&gt; inventory than actually exists. This is inflation — Redis is too generous. It is wrong, but it is recoverable. The next reconciliation pass reads the authoritative Postgres count and corrects Redis. No customer was incorrectly turned away. No ticket was oversold.&lt;/p&gt;

&lt;p&gt;Choosing between these two failure modes is not splitting hairs. Temporary inflation that heals automatically is categorically different from silent overselling that requires manual intervention. One is a known, bounded failure. The other is a correctness violation.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Looks Like in FairQueue
&lt;/h2&gt;

&lt;p&gt;FairQueue's inventory flow is built entirely around this model. Here is the actual path a claim request takes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claim request arrives
       │
       ▼
Redis SET NX lock acquired?  ← Layer 1: prevent concurrent claims for same customer
  No  → return ErrAlreadyClaimed
  Yes → continue
       │
       ▼
Redis Lua: DECRBY inventory if &amp;gt; 0  ← Atomic check-and-decrement
  -2 (sold out)   → return ErrEventSoldOut
  -1 (cache miss) → fall back to Postgres count, then retry
  ≥ 0 (success)   → continue
       │
       ▼
Postgres INSERT claim  ← Source of truth write
  unique violation → rollback Redis decrement, return ErrAlreadyClaimed
  success          → claim created ✓
       │
       ▼
Release lock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Several things in this flow are worth examining closely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Lua script is not the correctness guarantee.&lt;/strong&gt; It is a performance optimisation. It prevents most concurrent claims from reaching Postgres at all, which reduces contention. But if Redis is unavailable, if the Lua script has a bug, if the lock fails — the Postgres unique constraint on &lt;code&gt;(customer_id, event_id)&lt;/code&gt; is still there. That constraint is the inviolable correctness guarantee. Two rows cannot be inserted for the same customer and the same event. The database enforces this atomically, regardless of what happened in Redis.&lt;/p&gt;

&lt;p&gt;This is the two-layer concurrency shield: Redis is the cheap doorman that turns away most concurrent attempts before they reach the database. Postgres is the last line of defence that holds even if the doorman is asleep.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rollback on Postgres failure is explicit.&lt;/strong&gt; If the Postgres insert fails after the Redis decrement succeeds, the code immediately increments Redis back. This is a best-effort compensation — if the increment also fails, the reconciliation worker will correct the divergence on its next tick. The failure is bounded and self-healing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cache miss path falls back to Postgres.&lt;/strong&gt; When Redis does not have the inventory key — because it restarted, because it was never set, because the key expired — the code reads the authoritative count from Postgres and retries the decrement. Redis is not required for correctness. It is required for performance.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reconciliation Worker: Embracing Eventual Consistency
&lt;/h2&gt;

&lt;p&gt;No matter how careful your write ordering is, Redis and Postgres will diverge. Process crashes, network blips, partial failures — these are not edge cases in production systems. They are normal operating conditions.&lt;/p&gt;

&lt;p&gt;FairQueue has a reconciliation worker that runs every 30 seconds. Its job is mechanical: for every active event, derive the authoritative inventory count from Postgres (&lt;code&gt;total_inventory - COUNT(active claims)&lt;/code&gt;), compare it to the Redis count, and force-sync if they differ.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ReconciliationWorker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;reconcileEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;activeClaims&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;claims&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CountActive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pgCount&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kt"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TotalInventory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;activeClaims&lt;/span&gt;

    &lt;span class="n"&gt;redisCount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inventory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetCount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;redisCount&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;pgCount&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"inventory divergence detected, healing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"event_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"postgres_count"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pgCount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"redis_count"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redisCount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inventory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ForceSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pgCount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This worker does not make the system eventually consistent in the casual, hand-wavy sense. It makes the system &lt;em&gt;intentionally&lt;/em&gt; eventually consistent with a bounded heal window. The maximum time Redis can be wrong is 30 seconds, and the direction of that wrongness (inflation, not deflation) is controlled.&lt;/p&gt;

&lt;p&gt;The worker also handles Redis restarts entirely. When Redis comes back empty, the next reconciliation tick finds every event with a missing or zero inventory key and rebuilds them from Postgres. No manual intervention. No data loss. The system heals itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Broader Principle
&lt;/h2&gt;

&lt;p&gt;The dual-write problem is one instance of a more general principle: every distributed system design decision is actually a choice between failure modes, not a choice between correctness and incorrectness.&lt;/p&gt;

&lt;p&gt;There is no architecture that eliminates failure. There are only architectures that choose &lt;em&gt;which&lt;/em&gt; failures are acceptable, &lt;em&gt;how long&lt;/em&gt; they last, and &lt;em&gt;whether they are recoverable&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The engineers who get this wrong are not making careless mistakes. They are often making locally reasonable decisions — Redis is fast, Postgres is slow, put the hot path in Redis — without tracking the global consequence: that splitting authoritative state across systems creates a consistency gap, and that gap will be exercised in production.&lt;/p&gt;

&lt;p&gt;The question to ask when designing a system like this is not "what happens when everything works?" It is "what happens when the process dies between these two lines of code?" And then: "is that failure mode acceptable?"&lt;/p&gt;

&lt;p&gt;For FairQueue, the acceptable failure mode is: Redis briefly shows more inventory than exists, a reconciliation worker corrects it within 30 seconds, and no customer is permanently locked out. The unacceptable failure mode is: a ticket is sold that does not exist, or a payment is charged with no record.&lt;/p&gt;

&lt;p&gt;Choosing the right failure mode and designing around it deliberately is what separates systems that survive production from systems that produce incident reports.&lt;/p&gt;




&lt;h2&gt;
  
  
  What FairQueue Ended Up With
&lt;/h2&gt;

&lt;p&gt;For reference, the final architecture that came out of this reasoning:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Inventory count&lt;/td&gt;
&lt;td&gt;Redis (cache)&lt;/td&gt;
&lt;td&gt;Performance — absorbs concurrent reads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inventory truth&lt;/td&gt;
&lt;td&gt;Postgres (derived)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;total - COUNT(active claims)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claim record&lt;/td&gt;
&lt;td&gt;Postgres&lt;/td&gt;
&lt;td&gt;Source of truth, unique constraint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concurrency shield&lt;/td&gt;
&lt;td&gt;Redis SET NX + Postgres unique index&lt;/td&gt;
&lt;td&gt;Two layers; neither alone is sufficient&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Queue position&lt;/td&gt;
&lt;td&gt;Redis ZSET&lt;/td&gt;
&lt;td&gt;Reconstructible from Postgres on restart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Payment record&lt;/td&gt;
&lt;td&gt;Postgres&lt;/td&gt;
&lt;td&gt;Outbox pattern; written before gateway call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Divergence healing&lt;/td&gt;
&lt;td&gt;Reconciliation worker&lt;/td&gt;
&lt;td&gt;Runs every 30s; force-syncs from Postgres&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Redis handles roughly 50,000 concurrent queue joins at O(log N) per operation without touching Postgres. Postgres handles claim inserts with a unique constraint that makes overselling physically impossible. The reconciliation worker makes the system self-healing under any single-component failure.&lt;/p&gt;

&lt;p&gt;The system never requires Redis and Postgres to agree atomically, because it never splits authoritative state between them. Redis is always derived. Postgres is always truth. The failure modes are chosen, bounded, and recoverable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;If you are building a system that mixes Redis and Postgres — and most production backends do — the question worth sitting with is: &lt;em&gt;which system owns the truth?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not "which system is faster" or "which system is more durable." Those are properties of the systems. The question is about your &lt;em&gt;data model&lt;/em&gt;: when Postgres and Redis disagree, which one wins?&lt;/p&gt;

&lt;p&gt;If the answer is not immediately obvious, you may have accidentally split your source of truth. That split will find you eventually. It tends to find you at the worst possible time — when load is highest, when the stakes are real, when the Detty December concert just went on sale.&lt;/p&gt;

&lt;p&gt;Choose one system to own the truth. Let the other be fast. Design your failure modes deliberately. The system will be simpler, more debuggable, and more survivable for it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;FairQueue is open source. The full implementation — including the Lua scripts, reconciliation worker, and integration tests — is available on &lt;a href="https://github.com/DanielPopoola/fairqueue" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>softwareengineering</category>
      <category>go</category>
      <category>postgres</category>
    </item>
    <item>
      <title>ML in Warehouse Operations - How I Built a Production ML System to Automate Fashion Return Classification</title>
      <dc:creator>Daniel Popoola</dc:creator>
      <pubDate>Mon, 16 Mar 2026 06:50:31 +0000</pubDate>
      <link>https://forem.com/lisan_al_gaib/ml-in-warehouse-operations-how-i-built-a-production-ml-system-to-automate-fashion-return-54gf</link>
      <guid>https://forem.com/lisan_al_gaib/ml-in-warehouse-operations-how-i-built-a-production-ml-system-to-automate-fashion-return-54gf</guid>
      <description>&lt;p&gt;&lt;em&gt;From a warehouse problem I read about to a working MLOps pipeline&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;There's a stat that stuck with me when I started this project: &lt;strong&gt;online fashion retailers see return rates of up to 30%.&lt;/strong&gt; That's nearly 1 in 3 items coming back.&lt;/p&gt;

&lt;p&gt;Behind that number is a real operational headache. Every returned item — a pair of casual shoes, a handbag, a watch — has to be physically inspected, categorized, and processed. Is it a shirt or a top? Does it go back on the shelf or get refurbished? That decision, made by a human staring at an item after a long shift, happens &lt;strong&gt;hundreds of times a day&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I wanted to solve that with machine learning. Not just train a model and call it a day — but build something that could actually run in the background of a warehouse operation: automated, reliable, and observable.&lt;/p&gt;

&lt;p&gt;That project is &lt;strong&gt;RefundClassifier&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with "Just Training a Model"
&lt;/h2&gt;

&lt;p&gt;When I started thinking about this, my first instinct was the same as any ML student's: find a dataset, train a classifier, hit 90%+ accuracy, done.&lt;/p&gt;

&lt;p&gt;But accuracy on a test set doesn't keep a warehouse running. The real questions are harder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What happens when the batch job crashes halfway through 400 images at 2 AM?&lt;/li&gt;
&lt;li&gt;How do you update the model without taking the whole system down?&lt;/li&gt;
&lt;li&gt;How do you know if predictions are quietly degrading weeks after deployment?&lt;/li&gt;
&lt;li&gt;Who reviews the results in the morning — and in what format?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are MLOps problems. And they're the gap between a notebook demo and a system someone can actually trust.&lt;/p&gt;

&lt;p&gt;RefundClassifier is my attempt to close that gap.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the System Does
&lt;/h2&gt;

&lt;p&gt;In plain terms: every night at 2 AM, the system picks up all the return images uploaded during the business day, runs them through an ML model, and writes out a results file that warehouse staff can review in the morning.&lt;/p&gt;

&lt;p&gt;The five categories it classifies are: &lt;strong&gt;Casual Shoes, Handbags, Shirts, Tops, and Watches&lt;/strong&gt; — trained on 2,500 product images with &lt;strong&gt;96.53% accuracy&lt;/strong&gt; on the test set.&lt;/p&gt;

&lt;p&gt;But the interesting parts aren't the model. They're the infrastructure around it.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It's Built
&lt;/h2&gt;

&lt;p&gt;The architecture has three main layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Model Service (FastAPI)&lt;/strong&gt;&lt;br&gt;
A lightweight REST API that loads the EfficientNet-B0 model from an MLflow registry and serves &lt;code&gt;/predict&lt;/code&gt; endpoints. It's stateless — it doesn't know or care about batches. It just classifies what it's given.&lt;/p&gt;

&lt;p&gt;Separating the model into its own service was a deliberate choice. It means I can update, restart, or swap the model without touching the batch processing logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Batch Orchestrator (Python)&lt;/strong&gt;&lt;br&gt;
This is the core of the system. It runs on a cron schedule, scans the input directory for unprocessed images, calls the Model Service in batches of 10, writes results to a CSV, and pushes metrics to Prometheus.&lt;/p&gt;

&lt;p&gt;The most important feature here: &lt;strong&gt;checkpoint recovery&lt;/strong&gt;. If the job crashes at image 287 of 400, it doesn't restart from zero. It reads the checkpoint, skips what's already done, and continues. In a production warehouse context, reprocessing already-classified items creates data integrity issues. This prevents that.&lt;/p&gt;
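&lt;p&gt;A minimal sketch of the idea, using only the standard library. The function and file names are illustrative, not the project's actual code:&lt;/p&gt;

```python
import json
from pathlib import Path


def process_batch(images, classify, checkpoint_path):
    """Classify images, skipping any already recorded in the checkpoint.

    A crash mid-batch loses nothing: the next run reads the checkpoint
    and resumes after the last completed image. (Minimal sketch; the
    real orchestrator also writes a results CSV and pushes metrics.)
    """
    ckpt = Path(checkpoint_path)
    done = set(json.loads(ckpt.read_text())) if ckpt.exists() else set()

    results = {}
    for image in images:
        if image in done:
            continue  # already classified in a previous (possibly crashed) run
        results[image] = classify(image)
        done.add(image)
        ckpt.write_text(json.dumps(sorted(done)))  # checkpoint after each item
    return results
```

&lt;p&gt;If the process dies mid-loop, the checkpoint already lists every completed image, so the next run classifies only what is left and no item is ever processed twice.&lt;/p&gt;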

&lt;p&gt;&lt;strong&gt;3. Monitoring (Prometheus + Grafana)&lt;/strong&gt;&lt;br&gt;
Every batch run pushes metrics — inference latency, batch success rate, class distribution — to a Prometheus Pushgateway. Grafana dashboards surface those metrics visually. If the model starts misclassifying at unusual rates, or a batch takes 3x longer than normal, it shows up.&lt;/p&gt;

&lt;p&gt;This was the part I underestimated the most. Monitoring isn't a "nice to have." It's how you find out something is wrong before a human has to tell you.&lt;/p&gt;




&lt;h2&gt;
  
  
  Model Versioning with MLflow
&lt;/h2&gt;

&lt;p&gt;The model is registered in MLflow with a &lt;strong&gt;production alias&lt;/strong&gt; — a pointer that says "this is the version the Model Service should load." When I retrain with new data, I register the new version and promote it to production. The service picks it up on restart, no code changes needed.&lt;/p&gt;

&lt;p&gt;This is the simplest version of a deployment pipeline, but it enforces a useful discipline: the model is never just a file on disk. It has a version, experiment metadata, accuracy metrics attached to it, and a clear promotion path.&lt;/p&gt;




&lt;h2&gt;
  
  
  The UI
&lt;/h2&gt;

&lt;p&gt;There's also a Streamlit interface for manual use — useful for ad-hoc classification or demos. Staff can upload a batch of images, trigger classification, and see the results in a table without touching the command line.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Actually Learned
&lt;/h2&gt;

&lt;p&gt;Building this taught me a few things that no ML course covered:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch processing is underrated.&lt;/strong&gt; Most tutorials show real-time inference. But most real business operations don't need sub-second latency — they need reliable, scheduled, auditable processing. Batch is often the right answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 10% that isn't model accuracy is 90% of the work.&lt;/strong&gt; Getting to 96% accuracy took two days. Getting checkpoint recovery, metric pushing, model registry integration, and error handling right took the rest of the project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability is the difference between a deployed model and a trusted system.&lt;/strong&gt; A model running in the dark is not production. A model with dashboards, alerts, and traceable outputs is.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/DanielPopoola/autorma" rel="noopener noreferrer"&gt;github.com/DanielPopoola/autorma&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dataset: Fashion Product Images (Kaggle) — 2,500 images across 5 categories&lt;/li&gt;
&lt;li&gt;Stack: PyTorch · FastAPI · MLflow · Prometheus · Grafana · Streamlit · Docker&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This was my final year CS project. I'm currently looking for roles in backend engineering and ML engineering — feel free to connect.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>automation</category>
      <category>devops</category>
      <category>machinelearning</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Building a Payment Gateway That Doesn't Lie: How I Solved Distributed State Failures in Go</title>
      <dc:creator>Daniel Popoola</dc:creator>
      <pubDate>Fri, 20 Feb 2026 17:55:38 +0000</pubDate>
      <link>https://forem.com/lisan_al_gaib/i-built-a-production-grade-payment-gateway-in-go-heres-what-i-learned-about-distributed-systems-3882</link>
      <guid>https://forem.com/lisan_al_gaib/i-built-a-production-grade-payment-gateway-in-go-heres-what-i-learned-about-distributed-systems-3882</guid>
      <description>&lt;p&gt;Your server just charged a customer's card. The bank confirmed it — funds reserved, authorization ID returned. Then, a millisecond later, your server crashes.&lt;/p&gt;

&lt;p&gt;Your database never got the memo.&lt;/p&gt;

&lt;p&gt;Now your system thinks the payment failed. Your order service re-routes the customer to a failure page, maybe even prompts them to retry. But the bank already has a hold on their money. The customer gets charged twice, or worse — their funds are locked in limbo with no order attached.&lt;/p&gt;

&lt;p&gt;This isn't a hypothetical. It's the fundamental challenge of payment processing in distributed systems, and it's deceptively easy to ignore until it happens in production. I built &lt;strong&gt;FicMart Payment Gateway&lt;/strong&gt; — a production-grade payment gateway in Go — specifically to confront this problem head-on. Here's how I thought through it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Enemy: Partial Failures
&lt;/h2&gt;

&lt;p&gt;Most engineers think about failures in binary terms. Either a request succeeds or it fails. But distributed systems introduce a third, nastier category: &lt;strong&gt;partial failures&lt;/strong&gt; — where some things succeed and others don't, with no clean way to tell which is which.&lt;/p&gt;

&lt;p&gt;In payment processing, this is especially dangerous because two systems are involved: your gateway and the bank. When you ask the bank to capture $50, the sequence looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Gateway calls bank: "Capture $50 for Auth #123"&lt;/li&gt;
&lt;li&gt;Bank processes it: "Done. Capture ID: #456"&lt;/li&gt;
&lt;li&gt;Gateway prepares to save &lt;code&gt;CAPTURED&lt;/code&gt; to the database&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gateway crashes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Database still says &lt;code&gt;AUTHORIZED&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The money has moved. But your system doesn't know it. And because you have no record of Capture #456, you have no way to reconcile without manual intervention.&lt;/p&gt;

&lt;p&gt;This is the problem I set out to solve. The solution came down to three interlocking patterns.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pattern 1: Capture Intent Before Acting
&lt;/h2&gt;

&lt;p&gt;The core insight is simple: &lt;strong&gt;your database needs to know what you're &lt;em&gt;about&lt;/em&gt; to do, not just what you've done.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before the gateway makes any external bank call, it persists the payment in an intermediate state. For a capture, that means transitioning from &lt;code&gt;AUTHORIZED&lt;/code&gt; to &lt;code&gt;CAPTURING&lt;/code&gt; &lt;em&gt;before&lt;/em&gt; touching the bank. A naive state machine looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PENDING → AUTHORIZED → CAPTURED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But this leaves a blind spot. If the gateway crashes between &lt;code&gt;AUTHORIZED&lt;/code&gt; and &lt;code&gt;CAPTURED&lt;/code&gt;, there's no record that a capture was ever attempted. Was the bank called? Did it succeed? You don't know.&lt;/p&gt;

&lt;p&gt;The intermediate state closes that gap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PENDING → AUTHORIZED → CAPTURING → CAPTURED → REFUNDING → REFUNDED
                    ↓
                 VOIDING → VOIDED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;CAPTURING&lt;/code&gt; is not just a status — it's a signal of intent. It says: &lt;em&gt;"A capture was started here. If you find me stuck in this state, you know exactly what to do."&lt;/em&gt; The transition into it happens inside the same database transaction that acquires the idempotency lock, so the intent is either fully committed or fully rolled back — no ambiguity.&lt;/p&gt;

&lt;p&gt;This is borrowed from database engineering: the Write-Ahead Log pattern, where you record what you're &lt;em&gt;about&lt;/em&gt; to do before doing it so recovery is always possible.&lt;/p&gt;

&lt;p&gt;For authorizations specifically, this gets more nuanced. PCI compliance means you can never store raw card details, so if a crash happens during authorization, there's no way to retry it — the card data is gone. Rather than pretending this is solvable automatically, &lt;code&gt;PENDING&lt;/code&gt; authorizations older than 10 minutes are marked &lt;code&gt;FAILED&lt;/code&gt; and flagged for manual reconciliation. Some failures can't be fully automated away, and being honest about that is better than silently losing money.&lt;/p&gt;

&lt;p&gt;The domain layer enforces all of this with zero database or HTTP dependencies. Business rules — you can't void a captured payment, you can't refund an unauthorized one — live in pure Go. The domain is the source of truth for what's &lt;em&gt;allowed&lt;/em&gt;, completely independent of what's &lt;em&gt;stored&lt;/em&gt;.&lt;/p&gt;
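&lt;p&gt;The gateway is written in Go, but the rule set itself is language-agnostic. A Python sketch of a pure transition table, using the states from the diagram above (the &lt;code&gt;FAILED&lt;/code&gt; handling is simplified here):&lt;/p&gt;

```python
# Allowed transitions for the payment state machine: pure data, no I/O.
TRANSITIONS = {
    "PENDING":    {"AUTHORIZED", "FAILED"},  # stale PENDING auths fail over to manual review
    "AUTHORIZED": {"CAPTURING", "VOIDING"},
    "CAPTURING":  {"CAPTURED"},
    "CAPTURED":   {"REFUNDING"},
    "REFUNDING":  {"REFUNDED"},
    "VOIDING":    {"VOIDED"},
}

def transition(current, target):
    """Return the new state, or raise if the business rules forbid the move."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

&lt;p&gt;Because the table is pure data, "you can't void a captured payment" is enforced by lookup, not scattered through handlers.&lt;/p&gt;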




&lt;h2&gt;
  
  
  Pattern 2: Background Workers That Heal the System
&lt;/h2&gt;

&lt;p&gt;Intermediate states create the evidence. Background workers act on it.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;RetryWorker&lt;/strong&gt; polls the database on a configurable interval, looking for payments stuck in &lt;code&gt;CAPTURING&lt;/code&gt;, &lt;code&gt;VOIDING&lt;/code&gt;, or &lt;code&gt;REFUNDING&lt;/code&gt; past their retry window. For each one, it re-invokes the appropriate bank operation using the &lt;em&gt;original idempotency key&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That last part is what makes this safe. Because the bank supports idempotency, sending the same key twice doesn't trigger a second charge — it returns the cached result from the first attempt. The worker doesn't need to know whether the original call succeeded or not. If the bank already processed it, we get the success response back and update the database. If it didn't, we process it now. Either way, the database eventually converges to reality.&lt;/p&gt;

&lt;p&gt;Before any retry decision is made, errors are classified:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transient errors&lt;/strong&gt; (timeouts, 500s) — retry with exponential backoff and jitter to avoid hammering the bank&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permanent errors&lt;/strong&gt; (card declined, insufficient funds, auth expired) — fail fast, no retry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business rule violations&lt;/strong&gt; (invalid state transitions) — reject immediately at the domain layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This classification is what separates a robust retry system from one that makes things worse. Retrying a permanent error doesn't fix anything — a declined card won't become approved on the fifth attempt. Treating it as retryable wastes cycles and delays the customer from finding out their payment failed.&lt;/p&gt;
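&lt;p&gt;A sketch of the classification step (again in Python for illustration; the error codes are stand-ins, and the backoff uses the common full-jitter variant):&lt;/p&gt;

```python
import random

TRANSIENT = {"timeout", "bank_5xx"}                                # worth retrying
PERMANENT = {"card_declined", "insufficient_funds", "auth_expired"}  # never retry

def classify(error_code):
    """Decide what the retry worker should do with a failed bank call."""
    if error_code in TRANSIENT:
        return "retry"
    if error_code in PERMANENT:
        return "fail_fast"
    return "reject"  # unknown or business-rule violations surface immediately

def backoff_seconds(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter, so retries don't hammer the bank in lockstep."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```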

&lt;p&gt;The &lt;strong&gt;ExpirationWorker&lt;/strong&gt; handles a different edge case: authorized payments approaching the bank's 7-day authorization window. Rather than trusting the local clock blindly, the worker checks the bank's state before marking anything expired — with a 48-hour grace period to account for distributed clock skew.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pattern 3: Idempotency as the Safety Net
&lt;/h2&gt;

&lt;p&gt;Recovery workers only work if retrying is safe. That guarantee comes entirely from idempotency.&lt;/p&gt;

&lt;p&gt;Every external-facing operation requires an &lt;code&gt;Idempotency-Key&lt;/code&gt; header. But the enforcement here goes deeper than most implementations.&lt;/p&gt;

&lt;p&gt;Idempotency state is stored in PostgreSQL, not Redis — deliberately. This means it survives restarts and is subject to ACID guarantees. The &lt;code&gt;idempotency_keys&lt;/code&gt; table does two jobs simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's a response cache.&lt;/strong&gt; Once an operation completes, the result is stored against the key. Future requests with the same key get the cached response instantly, without touching the bank.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's a distributed lock.&lt;/strong&gt; A &lt;code&gt;locked_at&lt;/code&gt; timestamp is set when an operation begins and cleared when it finishes. If two requests arrive with the same key at the same time, the second enters a polling loop — checking every 100ms — until the first completes, then receives the same response. No double-processing, no race conditions.&lt;/p&gt;

&lt;p&gt;There's also a subtler protection: a &lt;code&gt;request_hash&lt;/code&gt; (SHA-256 of the request body) stored alongside each key. If a client tries to reuse an idempotency key with &lt;em&gt;different&lt;/em&gt; parameters — a different amount, a different payment — the gateway rejects it with an &lt;code&gt;IDEMPOTENCY_MISMATCH&lt;/code&gt; error. This prevents a class of silent bugs where key reuse returns a stale result for a completely different operation.&lt;/p&gt;
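&lt;p&gt;A Python sketch of the cache-plus-hash logic in one place (the real store is the PostgreSQL table with its &lt;code&gt;locked_at&lt;/code&gt; column; the locking and polling loop is omitted here, and the names are illustrative):&lt;/p&gt;

```python
import hashlib
import json

class IdempotencyError(Exception):
    pass

# Stand-in for the idempotency_keys table: key -> {"request_hash", "response"}
_keys = {}

def idempotent(key, request_body, operation):
    """Run `operation` at most once per (key, request) pair."""
    digest = hashlib.sha256(
        json.dumps(request_body, sort_keys=True).encode()
    ).hexdigest()
    record = _keys.get(key)
    if record:
        if record["request_hash"] != digest:
            # Same key, different parameters: refuse rather than serve a stale result
            raise IdempotencyError("IDEMPOTENCY_MISMATCH")
        return record["response"]       # cached result, the bank is never called
    response = operation(request_body)  # first time: do the work
    _keys[key] = {"request_hash": digest, "response": response}
    return response
```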

&lt;p&gt;The three patterns form a chain: intermediate states give workers something to act on → workers retry using the original idempotency key → idempotency makes those retries safe. Remove any one of them and the others stop working.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently at Scale
&lt;/h2&gt;

&lt;p&gt;Building this taught me as much about the limits of my approach as about its strengths.&lt;/p&gt;

&lt;p&gt;The most important change in a high-traffic environment would be moving idempotency lookups to Redis. PostgreSQL works here, but for a gateway handling thousands of requests per second, sub-millisecond idempotency checks matter. I'd keep Postgres as the durable fallback but use Redis as the hot path.&lt;/p&gt;

&lt;p&gt;I'd also move to &lt;strong&gt;event sourcing&lt;/strong&gt; for payment state. Right now, the &lt;code&gt;payments&lt;/code&gt; table stores the current state — you can see that a payment is &lt;code&gt;CAPTURED&lt;/code&gt;, but you can't see the full timeline of how it got there. An append-only &lt;code&gt;payment_events&lt;/code&gt; table would make debugging orphaned authorizations significantly easier: you'd be able to reconstruct exactly where the gap between the bank's state and yours opened up.&lt;/p&gt;

&lt;p&gt;The retry worker would also benefit from &lt;code&gt;FOR UPDATE SKIP LOCKED&lt;/code&gt; on its database queries. Currently, multiple worker instances compete for the same stuck payments. Skip-locked semantics let workers divide the work without blocking each other — a meaningful concurrency improvement once the system is under real load.&lt;/p&gt;
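&lt;p&gt;The query shape would look roughly like this (column names and the retry window are illustrative):&lt;/p&gt;

```sql
-- Each worker claims a disjoint batch of stuck payments; rows another
-- worker has already locked are skipped rather than waited on.
SELECT id, status, idempotency_key
FROM payments
WHERE status IN ('CAPTURING', 'VOIDING', 'REFUNDING')
  AND updated_at < now() - interval '5 minutes'
ORDER BY updated_at
LIMIT 100
FOR UPDATE SKIP LOCKED;
```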

&lt;p&gt;Finally, I'd add chaos testing: deliberately crashing the gateway at the exact millisecond between a bank response and the database commit. That's the failure mode this entire system is designed to handle, and the only way to be truly confident it works is to make it happen on purpose.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Really Taught Me
&lt;/h2&gt;

&lt;p&gt;Payment systems forced me to think about a dimension of engineering I hadn't fully internalized before: &lt;strong&gt;correctness under failure&lt;/strong&gt;, not just correctness under normal conditions.&lt;/p&gt;

&lt;p&gt;It's easy to build a service that works when everything goes right. The interesting engineering happens when you ask: &lt;em&gt;what is the worst possible moment for this process to crash, and what does the system look like afterward?&lt;/em&gt; That question shapes every decision in this gateway — the intermediate states, the write-ahead pattern, the idempotency locking, the recovery workers.&lt;/p&gt;

&lt;p&gt;The result is a system that doesn't just handle payments. It handles uncertainty. And in distributed systems, uncertainty is the only thing you can count on.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The full source code is available on GitHub: &lt;a href="https://github.com/DanielPopoola/ficmart-payment-gateway" rel="noopener noreferrer"&gt;DanielPopoola/ficmart-payment-gateway&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>go</category>
      <category>softwareengineering</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Building a Health-Check Microservice with FastAPI</title>
      <dc:creator>Daniel Popoola</dc:creator>
      <pubDate>Fri, 20 Jun 2025 10:44:57 +0000</pubDate>
      <link>https://forem.com/lisan_al_gaib/building-a-health-check-microservice-with-fastapi-26jo</link>
      <guid>https://forem.com/lisan_al_gaib/building-a-health-check-microservice-with-fastapi-26jo</guid>
      <description>&lt;p&gt;In modern application development, health checks play a crucial role in ensuring reliability, observability, and smooth orchestration—especially in containerized environments like Docker or Kubernetes. In this post, I’ll walk you through how I built a production-ready health-check microservice using &lt;strong&gt;FastAPI&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This project features structured logging, clean separation of concerns, and asynchronous service checks for both a database and Redis—all built in a modular and extensible way.&lt;/p&gt;

&lt;p&gt;GitHub Repo: &lt;a href="https://github.com/DanielPopoola/fastapi-microservice-health-check" rel="noopener noreferrer"&gt;https://github.com/DanielPopoola/fastapi-microservice-health-check&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 What This Project Covers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Creating a &lt;code&gt;/health/&lt;/code&gt; endpoint with real service checks (DB, Redis)&lt;/li&gt;
&lt;li&gt;Supporting &lt;code&gt;/live&lt;/code&gt; and &lt;code&gt;/ready&lt;/code&gt; endpoints for Kubernetes probes&lt;/li&gt;
&lt;li&gt;Using async &lt;code&gt;asyncio.gather()&lt;/code&gt; for fast, parallel checks&lt;/li&gt;
&lt;li&gt;Configurable settings with Pydantic&lt;/li&gt;
&lt;li&gt;Structured logging with custom log formatting using loguru&lt;/li&gt;
&lt;li&gt;Middleware for request timing and error handling&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📁 Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;project/
├── main.py             # App factory and configuration
├── config.py           # App settings via Pydantic
├── routers/
│   ├── health.py       # Health check endpoints
│   └── echo.py         # Echo endpoint (for demo)
├── utils/
│   └── logging.py      # Custom logger setup
└── ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🔍 Under the Hood: &lt;code&gt;main.py&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;main.py&lt;/code&gt; acts as the orchestrator. Here's what it handles:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. App Lifecycle Management
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@asynccontextmanager&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lifespan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Application starting up&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Application shutting down&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This cleanly logs startup and shutdown events, essential for container lifecycle awareness.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. App Factory Pattern
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;create_app()&lt;/code&gt; function encapsulates app setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loads settings with &lt;code&gt;get_settings()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Sets up structured logging&lt;/li&gt;
&lt;li&gt;Registers CORS middleware&lt;/li&gt;
&lt;li&gt;Adds global and HTTP exception handlers&lt;/li&gt;
&lt;li&gt;Includes routers for modularity&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Middleware
&lt;/h3&gt;

&lt;p&gt;A custom middleware logs request data and execution time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-Response-Time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Exception Handling
&lt;/h3&gt;

&lt;p&gt;Two global handlers catch errors and format them consistently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One for &lt;code&gt;HTTPException&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;One for unexpected &lt;code&gt;Exception&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ⚕️ Health Check Logic (&lt;code&gt;routers/health.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;routers/health.py&lt;/code&gt; file houses the core of this service:&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ &lt;code&gt;/health/&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Performs parallel health checks using &lt;code&gt;asyncio.gather()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;perform_health_checks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ServiceCheck&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;checks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;database_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;check_database&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;database_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;health_check_timeout&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;check_redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;health_check_timeout&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;return_exceptions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;checks&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result is a combined status response showing the health of each component.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔁 &lt;code&gt;/live&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;A simple liveness check returning HTTP 200 to signal the app is alive.&lt;/p&gt;

&lt;h3&gt;
  
  
  📦 &lt;code&gt;/ready&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Waits for both Redis and DB to pass checks before returning 200. Useful for Kubernetes readiness probes.&lt;/p&gt;




&lt;h2&gt;
  
  
  📡 Root Endpoint and Echo
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/&lt;/code&gt; returns app metadata like name, version, and timestamp&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/echo&lt;/code&gt; is a simple test endpoint to verify connectivity&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🛠️ How to Run It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvicorn app.main:app &lt;span class="nt"&gt;--reload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or using the embedded &lt;code&gt;__main__&lt;/code&gt; block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🌟 What’s Next?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Add more service checks (e.g., external APIs, caches)&lt;/li&gt;
&lt;li&gt;Integrate with Docker’s &lt;code&gt;HEALTHCHECK&lt;/code&gt; instruction&lt;/li&gt;
&lt;li&gt;Configure Kubernetes readiness/liveness probes&lt;/li&gt;
&lt;/ul&gt;
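&lt;p&gt;For the Docker integration, a &lt;code&gt;HEALTHCHECK&lt;/code&gt; instruction can point at &lt;code&gt;/live&lt;/code&gt; (assuming the app listens on port 8000 and &lt;code&gt;curl&lt;/code&gt; exists in the image):&lt;/p&gt;

```dockerfile
# Illustrative: mark the container unhealthy if /live stops answering
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD curl -fsS http://localhost:8000/live || exit 1
```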




&lt;h2&gt;
  
  
  🧠 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Building robust health checks is one of the simplest yet most impactful ways to improve system reliability. With FastAPI’s speed and async support, this project offers a solid base for both simple and enterprise-grade applications.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/DanielPopoola/fastapi-microservice-health-check" rel="noopener noreferrer"&gt;DanielPopoola/fastapi-microservice-health-check&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>fastapi</category>
      <category>beginners</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
