<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Daniel Popoola</title>
    <description>The latest articles on Forem by Daniel Popoola (@lisan_al_gaib).</description>
    <link>https://forem.com/lisan_al_gaib</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3278167%2F8337ed8b-d96c-4736-82d5-c44818266123.jpg</url>
      <title>Forem: Daniel Popoola</title>
      <link>https://forem.com/lisan_al_gaib</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/lisan_al_gaib"/>
    <language>en</language>
    <item>
      <title>Why Redis Cannot Share the Truth with Postgres - The architecture mistake that will oversell your tickets</title>
      <dc:creator>Daniel Popoola</dc:creator>
      <pubDate>Mon, 13 Apr 2026 22:16:04 +0000</pubDate>
      <link>https://forem.com/lisan_al_gaib/why-redis-cannot-share-the-truth-with-postgres-the-architecture-mistake-that-will-oversell-your-46h5</link>
      <guid>https://forem.com/lisan_al_gaib/why-redis-cannot-share-the-truth-with-postgres-the-architecture-mistake-that-will-oversell-your-46h5</guid>
      <description>&lt;p&gt;There is a moment, somewhere in the design of almost every backend system that mixes Redis and Postgres, where an engineer makes a decision that feels obviously correct and is actually subtly wrong.&lt;/p&gt;

&lt;p&gt;The decision looks like this: &lt;em&gt;Redis is fast, Postgres is durable. Use Redis to track inventory — it can handle the load. Persist the important stuff to Postgres.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It feels right because both halves are true. Redis &lt;em&gt;is&lt;/em&gt; fast. Postgres &lt;em&gt;is&lt;/em&gt; durable. The mistake is not in the premises. The mistake is in the conclusion — that these two systems can jointly own authoritative state.&lt;/p&gt;

&lt;p&gt;They cannot. Not because of a tooling limitation you can engineer around. Because of a fundamental property of distributed systems that no amount of clever code eliminates.&lt;/p&gt;

&lt;p&gt;This article is about that property, why it matters, and what the correct mental model looks like. It is grounded in a real system I built: &lt;strong&gt;FairQueue&lt;/strong&gt;, a virtual queue and inventory allocation engine for high-demand live events in the Nigerian market — the kind of system that has to survive 50,000 people trying to buy 5,000 tickets at exactly the same second.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Space
&lt;/h2&gt;

&lt;p&gt;Picture Detty December in Lagos. A Burna Boy concert. 5,000 tickets. The sale opens at noon.&lt;/p&gt;

&lt;p&gt;By 12:00:00.003, your server is receiving more concurrent requests than it has ever seen. Every one of those requests wants the same thing: a ticket. Most of them will be disappointed. Your job is to make sure exactly 5,000 of them succeed — no more, no less — and that payment for each of those 5,000 is correctly recorded.&lt;/p&gt;

&lt;p&gt;Overselling is not a minor bug. It means you charged someone for a ticket that does not exist. Silent inventory loss means someone got a ticket and you have no payment record. Both outcomes end careers and companies.&lt;/p&gt;

&lt;p&gt;The thundering herd problem is well understood. The less-discussed problem is what happens to your &lt;em&gt;data model&lt;/em&gt; when you try to handle it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Intuitive Architecture (And Why It Breaks)
&lt;/h2&gt;

&lt;p&gt;The most natural response to high read/write volume on a shared counter is: put it in Redis. Redis executes commands on a single thread, so each operation is atomic. A &lt;code&gt;DECR&lt;/code&gt; cannot race with another &lt;code&gt;DECR&lt;/code&gt; the way a separate read-then-&lt;code&gt;UPDATE&lt;/code&gt; against Postgres can without explicit locking. This reasoning is sound as far as it goes.&lt;/p&gt;

&lt;p&gt;So the intuitive architecture emerges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Redis&lt;/strong&gt; holds &lt;code&gt;inventory:{event_id}&lt;/code&gt; — the live ticket count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postgres&lt;/strong&gt; holds orders, claims, payments — the durable record&lt;/li&gt;
&lt;li&gt;The flow: check Redis, decrement Redis, write to Postgres&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is what that looks like in code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Check inventory&lt;/span&gt;
&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"inventory:event-123"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Int64&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ErrSoldOut&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Decrement&lt;/span&gt;
&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"inventory:event-123"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Persist&lt;/span&gt;
&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"INSERT INTO claims (...) VALUES (...)"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code has a bug. The bug is not in any single line. The bug is in the &lt;em&gt;model&lt;/em&gt; — in the assumption that these three operations form a coherent unit.&lt;/p&gt;

&lt;p&gt;They do not. They are three separate operations across two separate systems. No transaction spans them. Between any two of those lines, the process can crash, the network can partition, the Redis instance can restart. Each of those events produces a different kind of corruption.&lt;/p&gt;

&lt;p&gt;Let us be precise about what each failure looks like.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Failure Windows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Window 1: Between check and decrement
&lt;/h3&gt;

&lt;p&gt;You read the inventory: 1 ticket left. Before you decrement, another request reads the same count. Both see 1. Both decrement. Both insert into Postgres. You have sold the same ticket twice.&lt;/p&gt;

&lt;p&gt;This is a classic time-of-check/time-of-use (TOCTOU) race. It is solvable — Redis Lua scripts can make the check-and-decrement atomic. But solving this window does not close the others.&lt;/p&gt;

&lt;h3&gt;
  
  
  Window 2: Between Redis decrement and Postgres insert
&lt;/h3&gt;

&lt;p&gt;You atomically decrement Redis to 0. Before you insert into Postgres, the process crashes — OOM kill, deployment, hardware fault, power failure. It does not matter why.&lt;/p&gt;

&lt;p&gt;Redis says 0 tickets remain. Postgres has no claim record. The ticket has vanished. A real person paid — or was about to pay — and there is no recoverable record of their claim.&lt;/p&gt;

&lt;h3&gt;
  
  
  Window 3: Between Postgres insert and Redis decrement
&lt;/h3&gt;

&lt;p&gt;You reverse the order — Postgres first, Redis second. The Postgres insert commits. Before you decrement Redis, the process crashes.&lt;/p&gt;

&lt;p&gt;Now Redis shows 1 ticket remaining. Postgres has a committed claim. The next request that checks Redis will be told a ticket is available when none is. You may oversell.&lt;/p&gt;

&lt;h3&gt;
  
  
  Window 4: Redis restart
&lt;/h3&gt;

&lt;p&gt;Your Redis instance restarts. The inventory key evaporates. All the careful decrements you performed are gone. Redis now reports the key as missing, and what happens next depends on how your code handles that case: treat a missing key as zero and you stop selling with tickets still unsold; lazily re-seed it to the configured total and every ticket goes back on sale, including the ones already claimed and paid for.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why You Cannot Fix This With More Code
&lt;/h2&gt;

&lt;p&gt;The instinct at this point is to reach for compensating mechanisms. Retry logic. Distributed transactions. Two-phase commit. Sagas.&lt;/p&gt;

&lt;p&gt;These approaches are real and useful in the right contexts. They do not fix the fundamental problem here, because the fundamental problem is not a missing feature. It is a property of the environment.&lt;/p&gt;

&lt;p&gt;Martin Kleppmann puts it clearly in &lt;em&gt;Designing Data-Intensive Applications&lt;/em&gt;: the dual-write problem is not solved by making writes faster or retries smarter. It is solved by &lt;em&gt;choosing one system to be the source of truth and treating all other systems as derived state&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The moment you split authoritative state across Redis and Postgres — the moment both systems are required to agree for your data to be correct — you have created a consistency problem that lives in the gap between them. That gap cannot be closed. It can only be made smaller (with enough engineering complexity) or eliminated (by removing the split).&lt;/p&gt;

&lt;p&gt;There is no atomic operation that spans two storage systems. That is not a Redis limitation or a Postgres limitation. It is a consequence of the fact that they are separate processes, on separate machines, with separate failure modes.&lt;/p&gt;

&lt;p&gt;Every approach that tries to compensate for this — writing to both, reconciling differences, detecting divergence — is acknowledging the problem and managing it, not solving it. Management has a cost: operational complexity, latency, edge cases, and the ever-present risk that your compensation logic has its own bugs.&lt;/p&gt;

&lt;p&gt;The simpler answer is to not create the split.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Correct Mental Model: One Truth, One Cache
&lt;/h2&gt;

&lt;p&gt;The model that actually works is this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Postgres is the single source of truth. Redis is a performance layer. Redis holds nothing that cannot be reconstructed from Postgres.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This sounds like a constraint. It is actually a &lt;em&gt;simplification&lt;/em&gt;. When Redis holds only reconstructible state, every failure mode has a clean answer: reconstruct from Postgres.&lt;/p&gt;

&lt;p&gt;The ordering rule that follows from this model is strict:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Postgres is always written first. Redis is always written second. Never the reverse.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This rule is asymmetric by design. Violating it in one direction (Redis first, Postgres second) creates the possibility of a Redis state that Postgres cannot recover — an authoritative count with no corresponding record. The claim itself is silently lost, and the moment Redis state is wiped and rebuilt from Postgres, the inventory count comes back too high. That is the failure mode that oversells tickets.&lt;/p&gt;

&lt;p&gt;Violating it in the other direction (Postgres first, Redis second) means a process crash between the two writes leaves Redis showing &lt;em&gt;more&lt;/em&gt; inventory than actually exists. This is inflation — Redis is too generous. It is wrong, but it is recoverable. The next reconciliation pass reads the authoritative Postgres count and corrects Redis. No customer was incorrectly turned away. No ticket was oversold.&lt;/p&gt;

&lt;p&gt;Choosing between these two failure modes is not splitting hairs. Temporary inflation that heals automatically is categorically different from silent overselling that requires manual intervention. One is a known, bounded failure. The other is a correctness violation.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Looks Like in FairQueue
&lt;/h2&gt;

&lt;p&gt;FairQueue's inventory flow is built entirely around this model. Here is the actual path a claim request takes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claim request arrives
       │
       ▼
Redis SET NX lock acquired?  ← Layer 1: prevent concurrent claims for same customer
  No  → return ErrAlreadyClaimed
  Yes → continue
       │
       ▼
Redis Lua: DECRBY inventory if &amp;gt; 0  ← Atomic check-and-decrement
  -2 (sold out)   → return ErrEventSoldOut
  -1 (cache miss) → fall back to Postgres count, then retry
  ≥ 0 (success)   → continue
       │
       ▼
Postgres INSERT claim  ← Source of truth write
  unique violation → rollback Redis decrement, return ErrAlreadyClaimed
  success          → claim created ✓
       │
       ▼
Release lock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Several things in this flow are worth examining closely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Lua script is not the correctness guarantee.&lt;/strong&gt; It is a performance optimisation. It prevents most concurrent claims from reaching Postgres at all, which reduces contention. But if Redis is unavailable, if the Lua script has a bug, if the lock fails — the Postgres unique constraint on &lt;code&gt;(customer_id, event_id)&lt;/code&gt; is still there. That constraint is the inviolable correctness guarantee. Two rows cannot be inserted for the same customer and the same event. The database enforces this atomically, regardless of what happened in Redis.&lt;/p&gt;

&lt;p&gt;This is the two-layer concurrency shield: Redis is the cheap doorman that turns away most concurrent attempts before they reach the database. Postgres is the last line of defence that holds even if the doorman is asleep.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rollback on Postgres failure is explicit.&lt;/strong&gt; If the Postgres insert fails after the Redis decrement succeeds, the code immediately increments Redis back. This is a best-effort compensation — if the increment also fails, the reconciliation worker will correct the divergence on its next tick. The failure is bounded and self-healing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cache miss path falls back to Postgres.&lt;/strong&gt; When Redis does not have the inventory key — because it restarted, because it was never set, because the key expired — the code reads the authoritative count from Postgres and retries the decrement. Redis is not required for correctness. It is required for performance.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reconciliation Worker: Embracing Eventual Consistency
&lt;/h2&gt;

&lt;p&gt;No matter how careful your write ordering is, Redis and Postgres will diverge. Process crashes, network blips, partial failures — these are not edge cases in production systems. They are normal operating conditions.&lt;/p&gt;

&lt;p&gt;FairQueue has a reconciliation worker that runs every 30 seconds. Its job is mechanical: for every active event, derive the authoritative inventory count from Postgres (&lt;code&gt;total_inventory - COUNT(active claims)&lt;/code&gt;), compare it to the Redis count, and force-sync if they differ.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ReconciliationWorker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;reconcileEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;activeClaims&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;claims&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CountActive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pgCount&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kt"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TotalInventory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;activeClaims&lt;/span&gt;

    &lt;span class="n"&gt;redisCount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inventory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetCount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;redisCount&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;pgCount&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"inventory divergence detected, healing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"event_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"postgres_count"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pgCount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"redis_count"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redisCount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inventory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ForceSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pgCount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This worker does not make the system eventually consistent in the casual, hand-wavy sense. It makes the system &lt;em&gt;intentionally&lt;/em&gt; eventually consistent with a bounded heal window. The maximum time Redis can be wrong is 30 seconds, and the direction of that wrongness (inflation, not deflation) is controlled.&lt;/p&gt;

&lt;p&gt;The worker also handles Redis restarts entirely. When Redis comes back empty, the next reconciliation tick finds every event with a missing or zero inventory key and rebuilds them from Postgres. No manual intervention. No data loss. The system heals itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Broader Principle
&lt;/h2&gt;

&lt;p&gt;The dual-write problem is one instance of a more general principle: every distributed system design decision is actually a choice between failure modes, not a choice between correctness and incorrectness.&lt;/p&gt;

&lt;p&gt;There is no architecture that eliminates failure. There are only architectures that choose &lt;em&gt;which&lt;/em&gt; failures are acceptable, &lt;em&gt;how long&lt;/em&gt; they last, and &lt;em&gt;whether they are recoverable&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The engineers who get this wrong are not making careless mistakes. They are often making locally reasonable decisions — Redis is fast, Postgres is slow, put the hot path in Redis — without tracking the global consequence: that splitting authoritative state across systems creates a consistency gap, and that gap will be exercised in production.&lt;/p&gt;

&lt;p&gt;The question to ask when designing a system like this is not "what happens when everything works?" It is "what happens when the process dies between these two lines of code?" And then: "is that failure mode acceptable?"&lt;/p&gt;

&lt;p&gt;For FairQueue, the acceptable failure mode is: Redis briefly shows more inventory than exists, a reconciliation worker corrects it within 30 seconds, and no customer is permanently locked out. The unacceptable failure mode is: a ticket is sold that does not exist, or a payment is charged with no record.&lt;/p&gt;

&lt;p&gt;Choosing the right failure mode and designing around it deliberately is what separates systems that survive production from systems that produce incident reports.&lt;/p&gt;




&lt;h2&gt;
  
  
  What FairQueue Ended Up With
&lt;/h2&gt;

&lt;p&gt;For reference, the final architecture that came out of this reasoning:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Inventory count&lt;/td&gt;
&lt;td&gt;Redis (cache)&lt;/td&gt;
&lt;td&gt;Performance — absorbs concurrent reads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inventory truth&lt;/td&gt;
&lt;td&gt;Postgres (derived)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;total - COUNT(active claims)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claim record&lt;/td&gt;
&lt;td&gt;Postgres&lt;/td&gt;
&lt;td&gt;Source of truth, unique constraint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concurrency shield&lt;/td&gt;
&lt;td&gt;Redis SET NX + Postgres unique index&lt;/td&gt;
&lt;td&gt;Two layers; neither alone is sufficient&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Queue position&lt;/td&gt;
&lt;td&gt;Redis ZSET&lt;/td&gt;
&lt;td&gt;Reconstructible from Postgres on restart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Payment record&lt;/td&gt;
&lt;td&gt;Postgres&lt;/td&gt;
&lt;td&gt;Outbox pattern; written before gateway call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Divergence healing&lt;/td&gt;
&lt;td&gt;Reconciliation worker&lt;/td&gt;
&lt;td&gt;Runs every 30s; force-syncs from Postgres&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Redis handles roughly 50,000 concurrent queue joins at O(log N) per operation without touching Postgres. Postgres handles claim inserts with a unique constraint that makes overselling physically impossible. The reconciliation worker makes the system self-healing under any single-component failure.&lt;/p&gt;

&lt;p&gt;The system never requires Redis and Postgres to agree atomically, because it never splits authoritative state between them. Redis is always derived. Postgres is always truth. The failure modes are chosen, bounded, and recoverable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;If you are building a system that mixes Redis and Postgres — and most production backends do — the question worth sitting with is: &lt;em&gt;which system owns the truth?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not "which system is faster" or "which system is more durable." Those are properties of the systems. The question is about your &lt;em&gt;data model&lt;/em&gt;: when Postgres and Redis disagree, which one wins?&lt;/p&gt;

&lt;p&gt;If the answer is not immediately obvious, you may have accidentally split your source of truth. That split will find you eventually. It tends to find you at the worst possible time — when load is highest, when the stakes are real, when the Detty December concert just went on sale.&lt;/p&gt;

&lt;p&gt;Choose one system to own the truth. Let the other be fast. Design your failure modes deliberately. The system will be simpler, more debuggable, and more survivable for it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;FairQueue is open source. The full implementation — including the Lua scripts, reconciliation worker, and integration tests — is available on &lt;a href="https://github.com/DanielPopoola/fairqueue" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>softwareengineering</category>
      <category>go</category>
      <category>postgres</category>
    </item>
    <item>
      <title>ML in Warehouse Operations - How I Built a Production ML System to Automate Fashion Return Classification</title>
      <dc:creator>Daniel Popoola</dc:creator>
      <pubDate>Mon, 16 Mar 2026 06:50:31 +0000</pubDate>
      <link>https://forem.com/lisan_al_gaib/ml-in-warehouse-operations-how-i-built-a-production-ml-system-to-automate-fashion-return-54gf</link>
      <guid>https://forem.com/lisan_al_gaib/ml-in-warehouse-operations-how-i-built-a-production-ml-system-to-automate-fashion-return-54gf</guid>
      <description>&lt;p&gt;&lt;em&gt;From a warehouse problem I read about to a working MLOps pipeline&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;There's a stat that stuck with me when I started this project: &lt;strong&gt;online fashion retailers see return rates of up to 30%.&lt;/strong&gt; That's nearly 1 in 3 items coming back.&lt;/p&gt;

&lt;p&gt;Behind that number is a real operational headache. Every returned item — a pair of casual shoes, a handbag, a watch — has to be physically inspected, categorized, and processed. Is it a shirt or a top? Does it go back on the shelf or get refurbished? That decision, made by a human staring at an item after a long shift, happens &lt;strong&gt;hundreds of times a day&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I wanted to solve that with machine learning. Not just train a model and call it a day — but build something that could actually run in the background of a warehouse operation: automated, reliable, and observable.&lt;/p&gt;

&lt;p&gt;That project is &lt;strong&gt;RefundClassifier&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with "Just Training a Model"
&lt;/h2&gt;

&lt;p&gt;When I started thinking about this, my first instinct was the same as any ML student's: find a dataset, train a classifier, hit 90%+ accuracy, done.&lt;/p&gt;

&lt;p&gt;But accuracy on a test set doesn't keep a warehouse running. The real questions are harder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What happens when the batch job crashes halfway through 400 images at 2 AM?&lt;/li&gt;
&lt;li&gt;How do you update the model without taking the whole system down?&lt;/li&gt;
&lt;li&gt;How do you know if predictions are quietly degrading weeks after deployment?&lt;/li&gt;
&lt;li&gt;Who reviews the results in the morning — and in what format?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are MLOps problems. And they're the gap between a notebook demo and a system someone can actually trust.&lt;/p&gt;

&lt;p&gt;RefundClassifier is my attempt to close that gap.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the System Does
&lt;/h2&gt;

&lt;p&gt;In plain terms: every night at 2 AM, the system picks up all the return images uploaded during the business day, runs them through an ML model, and writes out a results file that warehouse staff can review in the morning.&lt;/p&gt;

&lt;p&gt;The five categories it classifies are: &lt;strong&gt;Casual Shoes, Handbags, Shirts, Tops, and Watches&lt;/strong&gt; — trained on 2,500 product images with &lt;strong&gt;96.53% accuracy&lt;/strong&gt; on the test set.&lt;/p&gt;

&lt;p&gt;But the interesting parts aren't the model. They're the infrastructure around it.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It's Built
&lt;/h2&gt;

&lt;p&gt;The architecture has three main layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Model Service (FastAPI)&lt;/strong&gt;&lt;br&gt;
A lightweight REST API that loads the EfficientNet-B0 model from an MLflow registry and serves &lt;code&gt;/predict&lt;/code&gt; endpoints. It's stateless — it doesn't know or care about batches. It just classifies what it's given.&lt;/p&gt;

&lt;p&gt;Separating the model into its own service was a deliberate choice. It means I can update, restart, or swap the model without touching the batch processing logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Batch Orchestrator (Python)&lt;/strong&gt;&lt;br&gt;
This is the core of the system. It runs on a cron schedule, scans the input directory for unprocessed images, calls the Model Service in batches of 10, writes results to a CSV, and pushes metrics to Prometheus.&lt;/p&gt;

&lt;p&gt;The most important feature here: &lt;strong&gt;checkpoint recovery&lt;/strong&gt;. If the job crashes at image 287 of 400, it doesn't restart from zero. It reads the checkpoint, skips what's already done, and continues. In a production warehouse context, reprocessing already-classified items creates data integrity issues. This prevents that.&lt;/p&gt;
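&lt;p&gt;A minimal sketch of the idea, using only the standard library. The function and file names are illustrative, not the project's actual code:&lt;/p&gt;

```python
import json
from pathlib import Path


def process_batch(images, classify, checkpoint_path):
    """Classify images, skipping any already recorded in the checkpoint.

    A crash mid-batch loses nothing: the next run reads the checkpoint
    and resumes after the last completed image. (Minimal sketch; the
    real orchestrator also writes a results CSV and pushes metrics.)
    """
    ckpt = Path(checkpoint_path)
    done = set(json.loads(ckpt.read_text())) if ckpt.exists() else set()

    results = {}
    for image in images:
        if image in done:
            continue  # already classified in a previous (possibly crashed) run
        results[image] = classify(image)
        done.add(image)
        ckpt.write_text(json.dumps(sorted(done)))  # checkpoint after each item
    return results
```

&lt;p&gt;If the process dies mid-loop, the checkpoint already lists every completed image, so the next run classifies only what is left and no item is ever processed twice.&lt;/p&gt;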

&lt;p&gt;&lt;strong&gt;3. Monitoring (Prometheus + Grafana)&lt;/strong&gt;&lt;br&gt;
Every batch run pushes metrics — inference latency, batch success rate, class distribution — to a Prometheus Pushgateway. Grafana dashboards surface those metrics visually. If the model starts misclassifying at unusual rates, or a batch takes 3x longer than normal, it shows up.&lt;/p&gt;

&lt;p&gt;This was the part I underestimated the most. Monitoring isn't a "nice to have." It's how you find out something is wrong before a human has to tell you.&lt;/p&gt;




&lt;h2&gt;
  
  
  Model Versioning with MLflow
&lt;/h2&gt;

&lt;p&gt;The model is registered in MLflow with a &lt;strong&gt;production alias&lt;/strong&gt; — a pointer that says "this is the version the Model Service should load." When I retrain with new data, I register the new version and promote it to production. The service picks it up on restart, no code changes needed.&lt;/p&gt;

&lt;p&gt;This is the simplest version of a deployment pipeline, but it enforces a useful discipline: the model is never just a file on disk. It has a version, experiment metadata, accuracy metrics attached to it, and a clear promotion path.&lt;/p&gt;




&lt;h2&gt;
  
  
  The UI
&lt;/h2&gt;

&lt;p&gt;There's also a Streamlit interface for manual use — useful for ad-hoc classification or demos. Staff can upload a batch of images, trigger classification, and see the results in a table without touching the command line.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Actually Learned
&lt;/h2&gt;

&lt;p&gt;Building this taught me a few things that no ML course covered:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch processing is underrated.&lt;/strong&gt; Most tutorials show real-time inference. But most real business operations don't need sub-second latency — they need reliable, scheduled, auditable processing. Batch is often the right answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 10% that isn't model accuracy is 90% of the work.&lt;/strong&gt; Getting to 96% accuracy took two days. Getting checkpoint recovery, metric pushing, model registry integration, and error handling right took the rest of the project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability is the difference between a deployed model and a trusted system.&lt;/strong&gt; A model running in the dark is not production. A model with dashboards, alerts, and traceable outputs is.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/DanielPopoola/autorma" rel="noopener noreferrer"&gt;github.com/DanielPopoola/autorma&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dataset: Fashion Product Images (Kaggle) — 2,500 images across 5 categories&lt;/li&gt;
&lt;li&gt;Stack: PyTorch · FastAPI · MLflow · Prometheus · Grafana · Streamlit · Docker&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This was my final year CS project. I'm currently looking for roles in backend engineering and ML engineering — feel free to connect.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>automation</category>
      <category>devops</category>
      <category>machinelearning</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Building a Payment Gateway That Doesn't Lie: How I Solved Distributed State Failures in Go</title>
      <dc:creator>Daniel Popoola</dc:creator>
      <pubDate>Fri, 20 Feb 2026 17:55:38 +0000</pubDate>
      <link>https://forem.com/lisan_al_gaib/i-built-a-production-grade-payment-gateway-in-go-heres-what-i-learned-about-distributed-systems-3882</link>
      <guid>https://forem.com/lisan_al_gaib/i-built-a-production-grade-payment-gateway-in-go-heres-what-i-learned-about-distributed-systems-3882</guid>
      <description>&lt;p&gt;Your server just charged a customer's card. The bank confirmed it — funds reserved, authorization ID returned. Then, a millisecond later, your server crashes.&lt;/p&gt;

&lt;p&gt;Your database never got the memo.&lt;/p&gt;

&lt;p&gt;Now your system thinks the payment failed. Your order service re-routes the customer to a failure page, maybe even prompts them to retry. But the bank already has a hold on their money. The customer gets charged twice, or worse — their funds are locked in limbo with no order attached.&lt;/p&gt;

&lt;p&gt;This isn't a hypothetical. It's the fundamental challenge of payment processing in distributed systems, and it's deceptively easy to ignore until it happens in production. I built &lt;strong&gt;FicMart Payment Gateway&lt;/strong&gt; — a production-grade payment gateway in Go — specifically to confront this problem head-on. Here's how I thought through it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Enemy: Partial Failures
&lt;/h2&gt;

&lt;p&gt;Most engineers think about failures in binary terms. Either a request succeeds or it fails. But distributed systems introduce a third, nastier category: &lt;strong&gt;partial failures&lt;/strong&gt; — where some things succeed and others don't, with no clean way to tell which is which.&lt;/p&gt;

&lt;p&gt;In payment processing, this is especially dangerous because two systems are involved: your gateway and the bank. When you ask the bank to capture $50, the sequence looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Gateway calls bank: "Capture $50 for Auth #123"&lt;/li&gt;
&lt;li&gt;Bank processes it: "Done. Capture ID: #456"&lt;/li&gt;
&lt;li&gt;Gateway prepares to save &lt;code&gt;CAPTURED&lt;/code&gt; to the database&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gateway crashes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Database still says &lt;code&gt;AUTHORIZED&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The money has moved. But your system doesn't know it. And because you have no record of Capture #456, you have no way to reconcile without manual intervention.&lt;/p&gt;

&lt;p&gt;This is the problem I set out to solve. The solution came down to three interlocking patterns.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pattern 1: Capture Intent Before Acting
&lt;/h2&gt;

&lt;p&gt;The core insight is simple: &lt;strong&gt;your database needs to know what you're &lt;em&gt;about&lt;/em&gt; to do, not just what you've done.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before the gateway makes any external bank call, it persists the payment in an intermediate state. For a capture, that means transitioning from &lt;code&gt;AUTHORIZED&lt;/code&gt; to &lt;code&gt;CAPTURING&lt;/code&gt; &lt;em&gt;before&lt;/em&gt; touching the bank. A naive state machine looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PENDING → AUTHORIZED → CAPTURED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But this leaves a blind spot. If the gateway crashes between &lt;code&gt;AUTHORIZED&lt;/code&gt; and &lt;code&gt;CAPTURED&lt;/code&gt;, there's no record that a capture was ever attempted. Was the bank called? Did it succeed? You don't know.&lt;/p&gt;

&lt;p&gt;The intermediate state closes that gap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PENDING → AUTHORIZED → CAPTURING → CAPTURED → REFUNDING → REFUNDED
                    ↓
                 VOIDING → VOIDED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;CAPTURING&lt;/code&gt; is not just a status — it's a signal of intent. It says: &lt;em&gt;"A capture was started here. If you find me stuck in this state, you know exactly what to do."&lt;/em&gt; The transition into it happens inside the same database transaction that acquires the idempotency lock, so the intent is either fully committed or fully rolled back — no ambiguity.&lt;/p&gt;

&lt;p&gt;This is borrowed from database engineering: the Write-Ahead Log pattern, where you record what you're &lt;em&gt;about&lt;/em&gt; to do before doing it so recovery is always possible.&lt;/p&gt;

&lt;p&gt;For authorizations specifically, this gets more nuanced. PCI compliance means you can never store raw card details, so if a crash happens during authorization, there's no way to retry it — the card data is gone. Rather than pretending this is solvable automatically, &lt;code&gt;PENDING&lt;/code&gt; authorizations older than 10 minutes are marked &lt;code&gt;FAILED&lt;/code&gt; and flagged for manual reconciliation. Some failures can't be fully automated away, and being honest about that is better than silently losing money.&lt;/p&gt;

&lt;p&gt;The domain layer enforces all of this with zero database or HTTP dependencies. Business rules — you can't void a captured payment, you can't refund an unauthorized one — live in pure Go. The domain is the source of truth for what's &lt;em&gt;allowed&lt;/em&gt;, completely independent of what's &lt;em&gt;stored&lt;/em&gt;.&lt;/p&gt;
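&lt;p&gt;The gateway is written in Go, but the rule set itself is language-agnostic. A Python sketch of a pure transition table, using the states from the diagram above (the &lt;code&gt;FAILED&lt;/code&gt; handling is simplified here):&lt;/p&gt;

```python
# Allowed transitions for the payment state machine: pure data, no I/O.
TRANSITIONS = {
    "PENDING":    {"AUTHORIZED", "FAILED"},  # stale PENDING auths fail over to manual review
    "AUTHORIZED": {"CAPTURING", "VOIDING"},
    "CAPTURING":  {"CAPTURED"},
    "CAPTURED":   {"REFUNDING"},
    "REFUNDING":  {"REFUNDED"},
    "VOIDING":    {"VOIDED"},
}

def transition(current, target):
    """Return the new state, or raise if the business rules forbid the move."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

&lt;p&gt;Because the table is pure data, "you can't void a captured payment" is enforced by lookup, not scattered through handlers.&lt;/p&gt;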




&lt;h2&gt;
  
  
  Pattern 2: Background Workers That Heal the System
&lt;/h2&gt;

&lt;p&gt;Intermediate states create the evidence. Background workers act on it.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;RetryWorker&lt;/strong&gt; polls the database on a configurable interval, looking for payments stuck in &lt;code&gt;CAPTURING&lt;/code&gt;, &lt;code&gt;VOIDING&lt;/code&gt;, or &lt;code&gt;REFUNDING&lt;/code&gt; past their retry window. For each one, it re-invokes the appropriate bank operation using the &lt;em&gt;original idempotency key&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That last part is what makes this safe. Because the bank supports idempotency, sending the same key twice doesn't trigger a second charge — it returns the cached result from the first attempt. The worker doesn't need to know whether the original call succeeded or not. If the bank already processed it, we get the success response back and update the database. If it didn't, we process it now. Either way, the database eventually converges to reality.&lt;/p&gt;

&lt;p&gt;Before any retry decision is made, errors are classified:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transient errors&lt;/strong&gt; (timeouts, 500s) — retry with exponential backoff and jitter to avoid hammering the bank&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permanent errors&lt;/strong&gt; (card declined, insufficient funds, auth expired) — fail fast, no retry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business rule violations&lt;/strong&gt; (invalid state transitions) — reject immediately at the domain layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This classification is what separates a robust retry system from one that makes things worse. Retrying a permanent error doesn't fix anything — a declined card won't become approved on the fifth attempt. Treating it as retryable wastes cycles and delays the customer from finding out their payment failed.&lt;/p&gt;
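&lt;p&gt;A sketch of the classification step (again in Python for illustration; the error codes are stand-ins, and the backoff uses the common full-jitter variant):&lt;/p&gt;

```python
import random

TRANSIENT = {"timeout", "bank_5xx"}                                # worth retrying
PERMANENT = {"card_declined", "insufficient_funds", "auth_expired"}  # never retry

def classify(error_code):
    """Decide what the retry worker should do with a failed bank call."""
    if error_code in TRANSIENT:
        return "retry"
    if error_code in PERMANENT:
        return "fail_fast"
    return "reject"  # unknown or business-rule violations surface immediately

def backoff_seconds(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter, so retries don't hammer the bank in lockstep."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```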

&lt;p&gt;The &lt;strong&gt;ExpirationWorker&lt;/strong&gt; handles a different edge case: authorized payments approaching the bank's 7-day authorization window. Rather than trusting the local clock blindly, the worker checks the bank's state before marking anything expired — with a 48-hour grace period to account for distributed clock skew.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pattern 3: Idempotency as the Safety Net
&lt;/h2&gt;

&lt;p&gt;Recovery workers only work if retrying is safe. That guarantee comes entirely from idempotency.&lt;/p&gt;

&lt;p&gt;Every external-facing operation requires an &lt;code&gt;Idempotency-Key&lt;/code&gt; header. But the enforcement here goes deeper than most implementations.&lt;/p&gt;

&lt;p&gt;Idempotency state is stored in PostgreSQL, not Redis — deliberately. This means it survives restarts and is subject to ACID guarantees. The &lt;code&gt;idempotency_keys&lt;/code&gt; table does two jobs simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's a response cache.&lt;/strong&gt; Once an operation completes, the result is stored against the key. Future requests with the same key get the cached response instantly, without touching the bank.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's a distributed lock.&lt;/strong&gt; A &lt;code&gt;locked_at&lt;/code&gt; timestamp is set when an operation begins and cleared when it finishes. If two requests arrive with the same key at the same time, the second enters a polling loop — checking every 100ms — until the first completes, then receives the same response. No double-processing, no race conditions.&lt;/p&gt;

&lt;p&gt;There's also a subtler protection: a &lt;code&gt;request_hash&lt;/code&gt; (SHA-256 of the request body) stored alongside each key. If a client tries to reuse an idempotency key with &lt;em&gt;different&lt;/em&gt; parameters — a different amount, a different payment — the gateway rejects it with an &lt;code&gt;IDEMPOTENCY_MISMATCH&lt;/code&gt; error. This prevents a class of silent bugs where key reuse returns a stale result for a completely different operation.&lt;/p&gt;
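&lt;p&gt;A Python sketch of the cache-plus-hash logic in one place (the real store is the PostgreSQL table with its &lt;code&gt;locked_at&lt;/code&gt; column; the locking and polling loop is omitted here, and the names are illustrative):&lt;/p&gt;

```python
import hashlib
import json

class IdempotencyError(Exception):
    pass

# Stand-in for the idempotency_keys table: key -> {"request_hash", "response"}
_keys = {}

def idempotent(key, request_body, operation):
    """Run `operation` at most once per (key, request) pair."""
    digest = hashlib.sha256(
        json.dumps(request_body, sort_keys=True).encode()
    ).hexdigest()
    record = _keys.get(key)
    if record:
        if record["request_hash"] != digest:
            # Same key, different parameters: refuse rather than serve a stale result
            raise IdempotencyError("IDEMPOTENCY_MISMATCH")
        return record["response"]       # cached result, the bank is never called
    response = operation(request_body)  # first time: do the work
    _keys[key] = {"request_hash": digest, "response": response}
    return response
```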

&lt;p&gt;The three patterns form a chain: intermediate states give workers something to act on → workers retry using the original idempotency key → idempotency makes those retries safe. Remove any one of them and the others stop working.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently at Scale
&lt;/h2&gt;

&lt;p&gt;Building this taught me as much about the limits of my approach as about its strengths.&lt;/p&gt;

&lt;p&gt;The most important change in a high-traffic environment would be moving idempotency lookups to Redis. PostgreSQL works here, but for a gateway handling thousands of requests per second, sub-millisecond idempotency checks matter. I'd keep Postgres as the durable fallback but use Redis as the hot path.&lt;/p&gt;

&lt;p&gt;I'd also move to &lt;strong&gt;event sourcing&lt;/strong&gt; for payment state. Right now, the &lt;code&gt;payments&lt;/code&gt; table stores the current state — you can see that a payment is &lt;code&gt;CAPTURED&lt;/code&gt;, but you can't see the full timeline of how it got there. An append-only &lt;code&gt;payment_events&lt;/code&gt; table would make debugging orphaned authorizations significantly easier: you'd be able to reconstruct exactly where the gap between the bank's state and yours opened up.&lt;/p&gt;

&lt;p&gt;The retry worker would also benefit from &lt;code&gt;FOR UPDATE SKIP LOCKED&lt;/code&gt; on its database queries. Currently, multiple worker instances compete for the same stuck payments. Skip-locked semantics let workers divide the work without blocking each other — a meaningful concurrency improvement once the system is under real load.&lt;/p&gt;
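&lt;p&gt;The query shape would look roughly like this (column names and the retry window are illustrative):&lt;/p&gt;

```sql
-- Each worker claims a disjoint batch of stuck payments; rows another
-- worker has already locked are skipped rather than waited on.
SELECT id, status, idempotency_key
FROM payments
WHERE status IN ('CAPTURING', 'VOIDING', 'REFUNDING')
  AND updated_at < now() - interval '5 minutes'
ORDER BY updated_at
LIMIT 100
FOR UPDATE SKIP LOCKED;
```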

&lt;p&gt;Finally, I'd add chaos testing: deliberately crashing the gateway at the exact millisecond between a bank response and the database commit. That's the failure mode this entire system is designed to handle, and the only way to be truly confident it works is to make it happen on purpose.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Really Taught Me
&lt;/h2&gt;

&lt;p&gt;Payment systems forced me to think about a dimension of engineering I hadn't fully internalized before: &lt;strong&gt;correctness under failure&lt;/strong&gt;, not just correctness under normal conditions.&lt;/p&gt;

&lt;p&gt;It's easy to build a service that works when everything goes right. The interesting engineering happens when you ask: &lt;em&gt;what is the worst possible moment for this process to crash, and what does the system look like afterward?&lt;/em&gt; That question shapes every decision in this gateway — the intermediate states, the write-ahead pattern, the idempotency locking, the recovery workers.&lt;/p&gt;

&lt;p&gt;The result is a system that doesn't just handle payments. It handles uncertainty. And in distributed systems, uncertainty is the only thing you can count on.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The full source code is available on GitHub: &lt;a href="https://github.com/DanielPopoola/ficmart-payment-gateway" rel="noopener noreferrer"&gt;DanielPopoola/ficmart-payment-gateway&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>go</category>
      <category>softwareengineering</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Building a Health-Check Microservice with FastAPI</title>
      <dc:creator>Daniel Popoola</dc:creator>
      <pubDate>Fri, 20 Jun 2025 10:44:57 +0000</pubDate>
      <link>https://forem.com/lisan_al_gaib/building-a-health-check-microservice-with-fastapi-26jo</link>
      <guid>https://forem.com/lisan_al_gaib/building-a-health-check-microservice-with-fastapi-26jo</guid>
      <description>&lt;p&gt;In modern application development, health checks play a crucial role in ensuring reliability, observability, and smooth orchestration—especially in containerized environments like Docker or Kubernetes. In this post, I’ll walk you through how I built a production-ready health-check microservice using &lt;strong&gt;FastAPI&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This project features structured logging, clean separation of concerns, and asynchronous service checks for both a database and Redis—all built in a modular and extensible way.&lt;/p&gt;

&lt;p&gt;GitHub Repo: &lt;a href="https://github.com/DanielPopoola/fastapi-microservice-health-check" rel="noopener noreferrer"&gt;https://github.com/DanielPopoola/fastapi-microservice-health-check&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 What This Project Covers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Creating a &lt;code&gt;/health/&lt;/code&gt; endpoint with real service checks (DB, Redis)&lt;/li&gt;
&lt;li&gt;Supporting &lt;code&gt;/live&lt;/code&gt; and &lt;code&gt;/ready&lt;/code&gt; endpoints for Kubernetes probes&lt;/li&gt;
&lt;li&gt;Using async &lt;code&gt;asyncio.gather()&lt;/code&gt; for fast, parallel checks&lt;/li&gt;
&lt;li&gt;Configurable settings with Pydantic&lt;/li&gt;
&lt;li&gt;Structured logging with custom log formatting using loguru&lt;/li&gt;
&lt;li&gt;Middleware for request timing and error handling&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📁 Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;project/
├── main.py             # App factory and configuration
├── config.py           # App settings via Pydantic
├── routers/
│   ├── health.py       # Health check endpoints
│   └── echo.py         # Echo endpoint (for demo)
├── utils/
│   └── logging.py      # Custom logger setup
└── ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🔍 Under the Hood: &lt;code&gt;main.py&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;main.py&lt;/code&gt; acts as the orchestrator. Here's what it handles:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. App Lifecycle Management
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@asynccontextmanager&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lifespan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Application starting up&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Application shutting down&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This cleanly logs startup and shutdown events, essential for container lifecycle awareness.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. App Factory Pattern
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;create_app()&lt;/code&gt; function encapsulates app setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loads settings with &lt;code&gt;get_settings()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Sets up structured logging&lt;/li&gt;
&lt;li&gt;Registers CORS middleware&lt;/li&gt;
&lt;li&gt;Adds global and HTTP exception handlers&lt;/li&gt;
&lt;li&gt;Includes routers for modularity&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Middleware
&lt;/h3&gt;

&lt;p&gt;A custom middleware logs request data and execution time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-Response-Time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Exception Handling
&lt;/h3&gt;

&lt;p&gt;Two global handlers catch errors and format them consistently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One for &lt;code&gt;HTTPException&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;One for unexpected &lt;code&gt;Exception&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ⚕️ Health Check Logic (&lt;code&gt;routers/health.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;routers/health.py&lt;/code&gt; file houses the core of this service:&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ &lt;code&gt;/health/&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Performs parallel health checks using &lt;code&gt;asyncio.gather()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;perform_health_checks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ServiceCheck&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;checks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;database_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;check_database&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;database_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;health_check_timeout&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;check_redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;health_check_timeout&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;return_exceptions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;checks&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result is a combined status response showing the health of each component.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔁 &lt;code&gt;/live&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;A simple liveness check returning HTTP 200 to signal the app is alive.&lt;/p&gt;

&lt;h3&gt;
  
  
  📦 &lt;code&gt;/ready&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Waits for both Redis and DB to pass checks before returning 200. Useful for Kubernetes readiness probes.&lt;/p&gt;




&lt;h2&gt;
  
  
  📡 Root Endpoint and Echo
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/&lt;/code&gt; returns app metadata like name, version, and timestamp&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/echo&lt;/code&gt; is a simple test endpoint to verify connectivity&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🛠️ How to Run It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvicorn app.main:app &lt;span class="nt"&gt;--reload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or using the embedded &lt;code&gt;__main__&lt;/code&gt; block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🌟 What’s Next?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Add more service checks (e.g., external APIs, caches)&lt;/li&gt;
&lt;li&gt;Integrate with Docker’s &lt;code&gt;HEALTHCHECK&lt;/code&gt; instruction&lt;/li&gt;
&lt;li&gt;Configure Kubernetes readiness/liveness probes&lt;/li&gt;
&lt;/ul&gt;
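&lt;p&gt;For the Docker integration, a &lt;code&gt;HEALTHCHECK&lt;/code&gt; instruction can point at &lt;code&gt;/live&lt;/code&gt; (assuming the app listens on port 8000 and &lt;code&gt;curl&lt;/code&gt; exists in the image):&lt;/p&gt;

```dockerfile
# Illustrative: mark the container unhealthy if /live stops answering
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD curl -fsS http://localhost:8000/live || exit 1
```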




&lt;h2&gt;
  
  
  🧠 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Building robust health checks is one of the simplest yet most impactful ways to improve system reliability. With FastAPI’s speed and async support, this project offers a solid base for both simple and enterprise-grade applications.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/DanielPopoola/fastapi-microservice-health-check" rel="noopener noreferrer"&gt;DanielPopoola/fastapi-microservice-health-check&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>fastapi</category>
      <category>beginners</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
