<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Macaulay Praise</title>
    <description>The latest articles on Forem by Macaulay Praise (@wolfraider).</description>
    <link>https://forem.com/wolfraider</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3829266%2F3b75624f-53a7-4734-9bfb-5e06c51be579.jpg</url>
      <title>Forem: Macaulay Praise</title>
      <link>https://forem.com/wolfraider</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/wolfraider"/>
    <language>en</language>
    <item>
      <title>I Built a Production-Grade Async Job Queue from Scratch — Here's Everything That Actually Happened</title>
      <dc:creator>Macaulay Praise</dc:creator>
      <pubDate>Mon, 25 May 2026 20:58:19 +0000</pubDate>
      <link>https://forem.com/wolfraider/i-built-a-production-grade-async-job-queue-from-scratch-heres-everything-that-actually-happened-2oac</link>
      <guid>https://forem.com/wolfraider/i-built-a-production-grade-async-job-queue-from-scratch-heres-everything-that-actually-happened-2oac</guid>
      <description>&lt;p&gt;A real account of building an Async Job Queue with Backpressure &amp;amp; Priority Scheduling using Python, FastAPI, and Redis Streams — covering the Reaper service, 47 tests, 85% coverage, and every bug along the way.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;No Celery. No shortcuts. No pretending the first approach worked.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;I decided to build a production-grade project from scratch to bypass the abstraction layer of tools like Celery and truly master backend internals—the kind of work you can defend in any technical interview.&lt;/p&gt;

&lt;p&gt;Instead of reaching for a framework that abstracts the hard parts, I built an async job queue from scratch — backpressure, priority scheduling, crash recovery, zombie detection, the full picture. This is the honest account of what I built, in what order, what broke, what I fixed, and what the final numbers looked like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt; Python 3.12, FastAPI, Redis Streams, PostgreSQL, SQLAlchemy, Alembic, Prometheus, Grafana, Locust, Docker&lt;br&gt;
&lt;strong&gt;Final state:&lt;/strong&gt; 47 tests passing, 85% coverage, services layer at ~92%&lt;/p&gt;


&lt;h2&gt;
  
  
  Why a Job Queue? Why Not Just Use Celery?
&lt;/h2&gt;

&lt;p&gt;Celery is the right tool when you need to ship business features fast and the queue is infrastructure, not the product. Building from scratch is the right choice when you need to understand exactly why a system fails — and how to keep it alive when it does.&lt;/p&gt;

&lt;p&gt;When an interviewer asks &lt;em&gt;"how does a visibility timeout work?"&lt;/em&gt; or &lt;em&gt;"how do you prevent a zombie job from blocking your metrics?"&lt;/em&gt;, knowing how to configure Celery doesn't answer that. Understanding the mechanism does.&lt;/p&gt;

&lt;p&gt;A job queue is the backbone of anything that works outside the HTTP request-response cycle: sending emails, generating reports, processing images, running inference. The interesting engineering isn't the queue — it's everything around it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do you stop a flood of jobs from overwhelming your workers?&lt;/li&gt;
&lt;li&gt;How do you ensure critical jobs run before low-priority ones without starving everything else?&lt;/li&gt;
&lt;li&gt;What happens when a worker crashes mid-job?&lt;/li&gt;
&lt;li&gt;How do you detect a job that's permanently broken vs. one that just needs a retry?&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  The Part Most Tutorials Skip: The Reaper
&lt;/h2&gt;

&lt;p&gt;Before walking through the build, I want to call out the component that separates a job queue from a &lt;em&gt;production-grade&lt;/em&gt; job queue: &lt;strong&gt;the Reaper&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Most tutorials build a producer and a consumer and call it done. But the hardest failure mode isn't a job that errors — it's a worker that &lt;em&gt;disappears&lt;/em&gt;. A worker that picks up a job, starts executing, and then crashes. The message is claimed. No ACK is coming. The job is stuck in limbo.&lt;/p&gt;

&lt;p&gt;The Reaper is the background service that fixes this. It runs every 10 seconds and does two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Visibility timeout recovery&lt;/strong&gt; — queries &lt;code&gt;XPENDING&lt;/code&gt; per Redis Stream for messages claimed but not acknowledged beyond the timeout threshold. Re-enqueues them as retries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zombie job detection&lt;/strong&gt; — queries PostgreSQL for jobs in &lt;code&gt;RUNNING&lt;/code&gt; status whose &lt;code&gt;heartbeat_at&lt;/code&gt; timestamp is older than 60 seconds. Marks them &lt;code&gt;FAILED&lt;/code&gt; and re-enqueues.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without the Reaper, a crashed worker creates jobs that show &lt;code&gt;RUNNING&lt;/code&gt; forever — poisoning your metrics, silently losing work, and giving you no signal that anything went wrong. With it, the system self-heals.&lt;/p&gt;

&lt;p&gt;Every other component in this build is interesting. The Reaper is what makes the system trustworthy.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Build Order (And Why It Matters)
&lt;/h2&gt;

&lt;p&gt;I enforced a strict eight-stage sequence. The rule: &lt;strong&gt;if Stage N is broken, do not start Stage N+1.&lt;/strong&gt; A broken Redis client makes every queue test confusing. A broken DB session makes every status transition unreliable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stage 1 — Environment &amp;amp; tooling
Stage 2 — Docker Compose infrastructure
Stage 3 — Configuration layer
Stage 4 — Database models &amp;amp; migrations
Stage 5 — Core clients (Redis, DB session)
Stage 6 — Business logic (services)
Stage 7 — FastAPI layer (routers, schemas)
Stage 8 — Background workers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every stage had a verification step before moving on. Here's what actually happened at each one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 1 — Environment &amp;amp; Tooling
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Poetry over pip.&lt;/strong&gt; Poetry gives you a lockfile (exact versions of every dependency), separate dev and production groups, and a consistent virtual environment tied to the project. First thing I added was the dev dependencies — pytest, ruff, mypy — before any application code existed.&lt;/p&gt;

&lt;p&gt;One non-obvious setting: &lt;code&gt;asyncio_mode = "auto"&lt;/code&gt; and &lt;code&gt;addopts = "--no-cov"&lt;/code&gt; in &lt;code&gt;pyproject.toml&lt;/code&gt;. The &lt;code&gt;--no-cov&lt;/code&gt; suppresses noisy coverage warnings before there's any application code to cover. Easy win.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;.env&lt;/code&gt; discipline from the start.&lt;/strong&gt; &lt;code&gt;.env.example&lt;/code&gt; committed. &lt;code&gt;.env&lt;/code&gt; in &lt;code&gt;.gitignore&lt;/code&gt; immediately. If you accidentally commit a &lt;code&gt;.env&lt;/code&gt; with real credentials, you must rotate everything in it. Don't find out the hard way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 2 — Docker Compose Infrastructure
&lt;/h2&gt;

&lt;p&gt;No Kafka. Redis Streams provides the same PENDING/ACK semantics Kafka gives you, without the operational overhead of running Zookeeper alongside it.&lt;/p&gt;

&lt;p&gt;The most important thing I did here: used &lt;code&gt;condition: service_healthy&lt;/code&gt; in &lt;code&gt;depends_on&lt;/code&gt; instead of just listing service names. &lt;code&gt;depends_on&lt;/code&gt; without a health condition only waits for the container to &lt;em&gt;start&lt;/em&gt;, not for the service inside it to be &lt;em&gt;ready&lt;/em&gt;. Redis can take a second to accept connections. Without this, your app crashes on startup trying to connect to Redis before it's listening.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_healthy&lt;/span&gt;
  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_healthy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verification gate:&lt;/strong&gt; &lt;code&gt;make check-infra&lt;/code&gt; running &lt;code&gt;redis-cli ping&lt;/code&gt; and &lt;code&gt;pg_isready&lt;/code&gt; before touching any application code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 3 — Configuration Layer
&lt;/h2&gt;

&lt;p&gt;One file: &lt;code&gt;app/config.py&lt;/code&gt;. Its entire job is to read environment variables and expose them as a typed Python object. Nothing else in the codebase calls &lt;code&gt;os.environ&lt;/code&gt; directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;functools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;lru_cache&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic_settings&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseSettings&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseSettings&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;redis_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis://localhost:6379/0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;database_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql+asyncpg://...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;high_watermark&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;
    &lt;span class="n"&gt;low_watermark&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;
    &lt;span class="n"&gt;weight_critical&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
    &lt;span class="n"&gt;weight_high&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
    &lt;span class="n"&gt;weight_normal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
    &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
    &lt;span class="n"&gt;job_timeout_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;

&lt;span class="nd"&gt;@lru_cache&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_settings&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;@lru_cache&lt;/code&gt; ensures settings are read exactly once per process. Without it, environment variables get read on every function call — wasteful and occasionally surprising.&lt;/p&gt;

&lt;p&gt;First tests I wrote: verify defaults load, weights sum to 100, high watermark is greater than low watermark. Three tests. All pass. Move on.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 4 — Database Models &amp;amp; Migrations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use Alembic from day one.&lt;/strong&gt; I learned this the hard way. SQLAlchemy's &lt;code&gt;Base.metadata.create_all()&lt;/code&gt; is convenient for getting started, but it creates untracked schema you can't evolve safely in production. Alembic tracks every change as a versioned migration file.&lt;/p&gt;

&lt;p&gt;The bug I hit: I ran &lt;code&gt;alembic revision --autogenerate&lt;/code&gt; multiple times across sessions. Each run created a new migration file with no dependency on the previous one, breaking Alembic's chain silently. The fix was to delete everything in &lt;code&gt;migrations/versions/&lt;/code&gt; and generate one single clean migration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;jobs&lt;/code&gt; table columns that matter most:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;__tablename__&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jobs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mapped&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mapped_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;primary_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mapped&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;          &lt;span class="c1"&gt;# PENDING → RUNNING → COMPLETED/FAILED
&lt;/span&gt;    &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mapped&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;        &lt;span class="c1"&gt;# critical, high, normal
&lt;/span&gt;    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mapped&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;        &lt;span class="c1"&gt;# JSON
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mapped&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# JSON, nullable
&lt;/span&gt;    &lt;span class="n"&gt;retry_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mapped&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mapped_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mapped&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mapped_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;worker_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mapped&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;heartbeat_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mapped&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# ← key for zombie detection
&lt;/span&gt;    &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mapped&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mapped&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mapped_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mapped&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mapped_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;onupdate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;heartbeat_at&lt;/code&gt; is the column that makes the entire zombie detection system work. Without it, you have no way to distinguish a job that's genuinely running from one whose worker died.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 5 — Core Clients
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;redis[hiredis]&lt;/code&gt;&lt;/strong&gt;, not just &lt;code&gt;redis&lt;/code&gt;. The &lt;code&gt;hiredis&lt;/code&gt; extra is a C extension that makes Redis response parsing ~10x faster. One extra word in the install command, significant throughput difference at load.&lt;/p&gt;

&lt;p&gt;The bug that cost me time: I wrote the health endpoint as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;health&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  &lt;span class="c1"&gt;# WRONG
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;FastAPI saw &lt;code&gt;request&lt;/code&gt; without a type annotation and treated it as a &lt;strong&gt;query parameter&lt;/strong&gt;, not the HTTP request object. The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;health&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  &lt;span class="c1"&gt;# RIGHT
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First-timer mistake. It'll happen to you too. Now you know.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Redis client lifecycle:&lt;/strong&gt; Create it inside the FastAPI lifespan function, not at module import time. Store it on &lt;code&gt;app.state.redis&lt;/code&gt;. Never create it globally.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@asynccontextmanager&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lifespan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;create_redis_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aclose&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Creating it at import time causes "Event loop is closed" errors that are confusing to debug.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 6 — Business Logic
&lt;/h2&gt;

&lt;p&gt;This is the interesting part. All services are plain Python async functions — no FastAPI imports, no HTTP, no request/response objects. Pure logic that's easy to unit test.&lt;/p&gt;

&lt;h3&gt;
  
  
  Backpressure: Two Watermarks, Not One
&lt;/h3&gt;

&lt;p&gt;My first instinct was a single threshold: accept below X, reject above X. This causes &lt;strong&gt;oscillation&lt;/strong&gt; — rapid toggling between accepting and rejecting as queue depth bounces around the threshold under load.&lt;/p&gt;

&lt;p&gt;The fix is a band:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;BACKPRESSURE_STATE_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;backpressure:active&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queue:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;is_active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BACKPRESSURE_STATE_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;high_watermark&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BACKPRESSURE_STATE_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;BackpressureError&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_active&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;low_watermark&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;BackpressureError&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_active&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;low_watermark&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BACKPRESSURE_STATE_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xadd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queue:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;High watermark fires the flag. The flag stays set until depth drops below the low watermark. The band gives the queue time to drain before opening again.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weighted Fair Scheduling: No Starvation
&lt;/h3&gt;

&lt;p&gt;Multiple priority queues are useless without a scheduler that prevents starvation. A naive scheduler draining critical first means low-priority jobs never run when critical is busy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;pick_queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_weights&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# e.g. {"critical": 60, "high": 30, "normal": 10}
&lt;/span&gt;
    &lt;span class="n"&gt;roll&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cumulative&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;cumulative&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;roll&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;cumulative&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queue:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The weights are stored in Redis at runtime, not hardcoded. During an incident, you can set critical to 100% via an API call without redeploying.&lt;/p&gt;

&lt;p&gt;I verified this with a test — 10,000 simulated picks, assert distribution matches configured weights within ±5%:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_scheduler_distribution&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;picks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;normal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;normal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10_000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pick_queue_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;picks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected_pct&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;actual_pct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;picks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual_pct&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;expected_pct&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you can't prove the algorithm produces the right distribution, you haven't finished writing it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Redis Streams: Not Just a List
&lt;/h3&gt;

&lt;p&gt;My first instinct was &lt;code&gt;LPUSH&lt;/code&gt;/&lt;code&gt;RPOP&lt;/code&gt; — simple and familiar. Fatal flaw: if a worker pops a job and crashes before finishing, &lt;strong&gt;the job is gone&lt;/strong&gt;. There's no recovery.&lt;/p&gt;

&lt;p&gt;Redis Streams with consumer groups solve this via &lt;strong&gt;PENDING entries&lt;/strong&gt;. A message delivered to a consumer but not yet acknowledged stays in the pending list. No ACK = not done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Dequeue
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xreadgroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;groupname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;workers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;consumername&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;worker_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;# "&amp;gt;" = only undelivered messages
&lt;/span&gt;    &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After success
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;workers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The per-priority visibility timeouts I set:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Visibility Timeout&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;critical&lt;/td&gt;
&lt;td&gt;30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;high&lt;/td&gt;
&lt;td&gt;60 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;normal&lt;/td&gt;
&lt;td&gt;120 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Critical jobs get recovered faster. Normal jobs have more time before the reaper reclaims them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exponential Backoff with Full Jitter
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_backoff_seconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retry_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_delay&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;60.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;retry_count&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;max_delay&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;* random.random()&lt;/code&gt; is &lt;strong&gt;full jitter&lt;/strong&gt; — prevents the thundering herd problem where all retrying workers wake up simultaneously and hammer the same resources. Without jitter, every worker that fails at the same time retries at exactly the same moment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug I hit during implementation:&lt;/strong&gt; The &lt;code&gt;except Exception as e&lt;/code&gt; retry path referenced &lt;code&gt;error_msg&lt;/code&gt;, a variable only defined inside the &lt;code&gt;except asyncio.TimeoutError&lt;/code&gt; block above it. Python raised &lt;code&gt;NameError&lt;/code&gt; at runtime. The fix was using &lt;code&gt;str(e)&lt;/code&gt; directly in the general exception handler.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 7 — FastAPI Layer
&lt;/h2&gt;

&lt;p&gt;By the time I reached this stage, all business logic was written and tested. Route handlers are thin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@router.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/jobs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;202&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;submit_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JobCreate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DBSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RedisDep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SettingsDep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;JobResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;create_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;JobResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;BackpressureError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;backpressure_rejections_total&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;inc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retry-After&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Queue at capacity. Retry after 5 seconds.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your route handler is longer than 20 lines, business logic has leaked into it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pytest event loop nightmare:&lt;/strong&gt; My first conftest used &lt;code&gt;asyncio_default_fixture_loop_scope = "session"&lt;/code&gt; but &lt;code&gt;asyncio_default_test_loop_scope&lt;/code&gt; defaulted to &lt;code&gt;"function"&lt;/code&gt;. This caused "Future attached to a different loop" errors — SQLAlchemy's connection pool was created on the session loop but tests ran on function loops. The fix was setting &lt;em&gt;both&lt;/em&gt; to &lt;code&gt;"session"&lt;/code&gt; and restructuring conftest with session-scoped engine and Redis client.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 8 — Background Workers
&lt;/h2&gt;

&lt;p&gt;Two workers, each as a separate service in Docker Compose:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;job_worker.py&lt;/code&gt;&lt;/strong&gt; — the main worker loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick a queue via weighted scheduler&lt;/li&gt;
&lt;li&gt;XREADGROUP to claim a message&lt;/li&gt;
&lt;li&gt;Mark job &lt;code&gt;RUNNING&lt;/code&gt; in PostgreSQL&lt;/li&gt;
&lt;li&gt;Start a heartbeat task (writes timestamp every 5 seconds)&lt;/li&gt;
&lt;li&gt;Dispatch to handler with &lt;code&gt;asyncio.wait_for(timeout=30)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;On success: XACK + mark &lt;code&gt;COMPLETED&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;On failure: increment retry, re-enqueue with backoff, or DLQ if exhausted&lt;/li&gt;
&lt;li&gt;Cancel heartbeat task in &lt;code&gt;finally&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;reaper.py&lt;/code&gt;&lt;/strong&gt; — runs every 10 seconds:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;XPENDING&lt;/code&gt; query per stream for messages older than visibility timeout&lt;/li&gt;
&lt;li&gt;Re-enqueue stale pending messages as retries&lt;/li&gt;
&lt;li&gt;Query PostgreSQL for &lt;code&gt;RUNNING&lt;/code&gt; jobs with &lt;code&gt;heartbeat_at &amp;lt; now() - 60 seconds&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Mark zombies as &lt;code&gt;FAILED&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;job_worker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python -m app.workers.job_worker&lt;/span&gt;
  &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;
  &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_healthy&lt;/span&gt;

&lt;span class="na"&gt;reaper&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python -m app.workers.reaper&lt;/span&gt;
  &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;restart: unless-stopped&lt;/code&gt; means a crashed worker comes back up automatically and resumes — same behavior as production.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Dockerfile Problem (And the Right Fix)
&lt;/h3&gt;

&lt;p&gt;My multi-stage Dockerfile put the venv at &lt;code&gt;/app/.venv&lt;/code&gt;. Then I added a bind mount in docker-compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;.:/app&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bind mount &lt;strong&gt;overlaid the entire &lt;code&gt;/app&lt;/code&gt; directory&lt;/strong&gt;, hiding the venv the Dockerfile built. The container had no packages.&lt;/p&gt;

&lt;p&gt;My first fix was a named Docker volume to protect the venv. Workable, but fragile — every time I added a new package, I had to run &lt;code&gt;docker compose down -v&lt;/code&gt; to wipe and repopulate the volume.&lt;/p&gt;

&lt;p&gt;The correct fix: &lt;strong&gt;move the venv outside &lt;code&gt;/app&lt;/code&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Stage 1: build&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; VIRTUAL_ENV=/opt/venv&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv &lt;span class="nv"&gt;$VIRTUAL_ENV&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PATH="$VIRTUAL_ENV/bin:$PATH"&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; ...

&lt;span class="c"&gt;# Stage 2: runtime&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /opt/venv /opt/venv&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PATH="/opt/venv/bin:$PATH"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;/opt/venv&lt;/code&gt; is never touched by the bind mount. No volume management needed. The bind mount only covers &lt;code&gt;/app&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Engineering Pillars I Hardened After the Initial Build
&lt;/h2&gt;

&lt;p&gt;After completing all 8 stages and doing a thorough review of the system against production requirements, I found gaps across three areas: resilience, observability, and testing. Here's how I addressed each one.&lt;/p&gt;




&lt;h3&gt;
  
  
  Pillar 1 — Resilience
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Two-watermark backpressure band (not a single threshold).&lt;/strong&gt; My initial implementation used one cutoff point. The problem: as queue depth oscillates around that number under real traffic, the system rapidly toggles between accepting and rejecting requests. The fix is a band — the high watermark fires a Redis flag, and the flag stays set until depth drops below the separate low watermark. That gap prevents oscillation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exponential backoff with full jitter.&lt;/strong&gt; I had retry logic but used a fixed delay. The issue with fixed delays is synchronization — when ten workers all fail at the same moment, they all retry at the same moment, and you hammer your dependencies in waves.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_backoff_seconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retry_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_delay&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;60.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;retry_count&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;max_delay&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;* random.random()&lt;/code&gt; is &lt;strong&gt;full jitter&lt;/strong&gt;. It spreads retries across a window, breaking the thundering herd.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug hit during implementation:&lt;/strong&gt; The &lt;code&gt;except Exception as e&lt;/code&gt; retry path referenced &lt;code&gt;error_msg&lt;/code&gt;, a variable only defined inside the &lt;code&gt;except asyncio.TimeoutError&lt;/code&gt; block above it. Python raised &lt;code&gt;NameError&lt;/code&gt; at runtime. Fixed by using &lt;code&gt;str(e)&lt;/code&gt; directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-priority visibility timeouts.&lt;/strong&gt; The original reaper used a single global timeout for all streams. Critical jobs need faster recovery. Normal jobs can wait longer before the reaper reclaims them.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Visibility Timeout&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;critical&lt;/td&gt;
&lt;td&gt;30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;high&lt;/td&gt;
&lt;td&gt;60 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;normal&lt;/td&gt;
&lt;td&gt;120 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Dead Letter Queue with replay.&lt;/strong&gt; Permanently failed jobs (retry count exhausted) land in &lt;code&gt;queue:dlq&lt;/code&gt; with their full failure metadata. A &lt;code&gt;POST /queues/dlq/{id}/replay&lt;/code&gt; endpoint moves them back to their original priority stream for human-initiated retry. The DLQ depth is exposed in &lt;code&gt;/queues/metrics&lt;/code&gt; so you know when it needs attention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runtime weight adjustment.&lt;/strong&gt; Scheduling weights are stored in Redis, not config. &lt;code&gt;PATCH /queues/weights&lt;/code&gt; validates that weights sum to 100 and stores them. &lt;code&gt;DELETE /queues/weights&lt;/code&gt; reverts to config defaults. During an incident, you can redirect 100% of worker capacity to critical jobs via a single API call without redeploying.&lt;/p&gt;




&lt;h3&gt;
  
  
  Pillar 2 — Observability
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Structured logging with &lt;code&gt;structlog&lt;/code&gt;.&lt;/strong&gt; Python's built-in &lt;code&gt;logging&lt;/code&gt; produces unstructured text. &lt;code&gt;structlog&lt;/code&gt; with JSON output in production produces log lines that are filterable and machine-parseable. Every log line includes &lt;code&gt;request_id&lt;/code&gt;, &lt;code&gt;job_id&lt;/code&gt;, priority, and the relevant domain context.&lt;/p&gt;

&lt;p&gt;Bug hit: &lt;code&gt;structlog.stdlib.add_logger_name&lt;/code&gt; is designed for &lt;code&gt;logging.Logger&lt;/code&gt; objects. &lt;code&gt;PrintLoggerFactory&lt;/code&gt; (structlog's default) produces &lt;code&gt;PrintLogger&lt;/code&gt; objects with no &lt;code&gt;.name&lt;/code&gt; attribute — the processor raises &lt;code&gt;AttributeError&lt;/code&gt; at runtime. Fix: remove that processor from the chain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Four custom Prometheus metrics — by name, not just by count:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# app/core/metrics.py
&lt;/span&gt;&lt;span class="n"&gt;queue_depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Gauge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queue_depth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Current queue depth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;priority&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;jobs_processed_total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jobs_processed_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Jobs processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;backpressure_rejections_total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;backpressure_rejections_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Backpressure 503s fired&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;zombie_jobs_reaped_total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zombie_jobs_reaped_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Zombie jobs recovered by reaper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;queue_depth&lt;/code&gt; updates on a 15-second background task. &lt;code&gt;backpressure_rejections_total&lt;/code&gt; increments in the router's exception handler. &lt;code&gt;jobs_processed_total&lt;/code&gt; increments in the worker's success and DLQ-failure paths. &lt;code&gt;zombie_jobs_reaped_total&lt;/code&gt; increments in the reaper after each recovery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grafana dashboard provisioned automatically.&lt;/strong&gt; Rather than requiring manual dashboard setup, I added &lt;code&gt;grafana/provisioning/&lt;/code&gt; with datasource and dashboard JSON files mounted into the Grafana container. &lt;code&gt;make dev&lt;/code&gt; gives you a working dashboard at &lt;code&gt;localhost:3000&lt;/code&gt;, no clicking required.&lt;/p&gt;




&lt;h3&gt;
  
  
  Pillar 3 — Testing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Locust load tests with three realistic user classes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;LegitimateUser&lt;/code&gt; (weight 3): submits jobs, polls results, checks metrics&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;BurstUser&lt;/code&gt; (weight 1): submits at 10-50ms intervals to trigger backpressure intentionally — both 202 and 503 are counted as success because 503 &lt;em&gt;is the correct behavior&lt;/em&gt; under overload&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MixedWorkload&lt;/code&gt; (weight 2): submits across all three priorities in a 6:3:1 ratio&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Results at 50 users, dev server: &lt;strong&gt;56 req/s sustained, P50 220ms, P95 640ms&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-commit hooks.&lt;/strong&gt; &lt;code&gt;ruff&lt;/code&gt;, &lt;code&gt;ruff-format&lt;/code&gt;, &lt;code&gt;mypy&lt;/code&gt;, and standard file checks (&lt;code&gt;trailing-whitespace&lt;/code&gt;, &lt;code&gt;check-yaml&lt;/code&gt;, &lt;code&gt;debug-statements&lt;/code&gt;) run before every commit. Code that doesn't pass these never reaches the repo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Actions CI pipeline.&lt;/strong&gt; Every push triggers: ruff → mypy → alembic upgrade head → pytest with Redis 7 and Postgres 15 as sidecar containers. The migration step is included because schema drift between model and database is a silent failure mode that only shows up in integration tests.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Tests That Actually Proved the System Works
&lt;/h2&gt;

&lt;p&gt;I'm including three specific tests because they demonstrate how to test distributed system behavior without &lt;code&gt;asyncio.sleep&lt;/code&gt; — which makes tests flaky and slow. More importantly, they test the failure modes that most implementations never cover: what happens when workers crash mid-execution.&lt;/p&gt;

&lt;p&gt;The 85% overall coverage and ~92% services layer coverage isn't just a number — it specifically covers worker crashes during execution (&lt;code&gt;test_visibility_timeout_reaper&lt;/code&gt;), stale-heartbeat zombie recovery (&lt;code&gt;test_zombie_detection&lt;/code&gt;), and exhausted-retry DLQ routing. These are the edge cases that fail silently in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visibility timeout reaper&lt;/strong&gt; — directly patch the timeout to zero instead of waiting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_visibility_timeout_reaper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mocker&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xadd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queue:normal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orphan-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="c1"&gt;# Claim it but never ACK (simulate crashed worker)
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xreadgroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;workers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dead-worker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queue:normal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Patch timeout to 0ms — no sleeping required
&lt;/span&gt;    &lt;span class="n"&gt;mocker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app.services.reaper_service.VISIBILITY_TIMEOUTS_MS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;normal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;recovered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;recover_pending_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;recovered&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Zombie detection&lt;/strong&gt; — set stale heartbeat directly in the DB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_zombie_detection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_session&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;create_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;send_email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;normal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;mark_running&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;worker_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dead-worker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Directly set heartbeat to 2 minutes ago
&lt;/span&gt;    &lt;span class="n"&gt;stale_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Job&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;heartbeat_at&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stale_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;reaped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;reap_zombie_jobs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;reaped&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refresh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FAILED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;heartbeat timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No sleeps. Direct state manipulation. Tests run in milliseconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Priority distribution&lt;/strong&gt; — the test I showed interviewers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_priority_distribution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;picks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;normal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10_000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;pick_queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;priority&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;picks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;picks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;picks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;picks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;normal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Final Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tests&lt;/td&gt;
&lt;td&gt;47 passing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coverage (overall)&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coverage (services layer)&lt;/td&gt;
&lt;td&gt;~92%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sustained throughput&lt;/td&gt;
&lt;td&gt;56 req/s at 50 users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P50 latency&lt;/td&gt;
&lt;td&gt;220ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P95 latency&lt;/td&gt;
&lt;td&gt;640ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prometheus metrics&lt;/td&gt;
&lt;td&gt;4 custom&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API endpoints&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 56 req/s and 640ms P95 are on a &lt;strong&gt;dev server&lt;/strong&gt; (&lt;code&gt;uvicorn&lt;/code&gt; single worker). The DESIGN.md section on 10x scale calls for Gunicorn with multiple workers as the first step — a fair and honest constraint to document.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons for the Next Build
&lt;/h2&gt;

&lt;p&gt;Knowing where your design breaks is more impressive than pretending it's perfect. These aren't regrets — they're the honest constraints of what a single-node dev server build looks like, and what comes next.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Redis Cluster&lt;/strong&gt; instead of single-node Redis — Redis is currently a single point of failure. At scale, the queue backend needs to survive node loss without losing PENDING state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gunicorn with multiple Uvicorn workers&lt;/strong&gt; — the dev server constraint (single-worker &lt;code&gt;uvicorn&lt;/code&gt;) is what makes the 56 req/s the ceiling, not the floor. Multiple workers multiply that linearly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sorted set for delayed re-enqueue&lt;/strong&gt; — backoff delay currently uses &lt;code&gt;asyncio.sleep&lt;/code&gt; inside the worker. That blocks the worker's event loop for the duration of the delay. The production pattern is a sorted set where the score is the &lt;code&gt;execute_at&lt;/code&gt; timestamp, with a separate scheduler polling for ready entries. The worker enqueues and moves on immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus alerting rules&lt;/strong&gt; — Grafana dashboards are reactive; you see a problem after it's already happening. Alerting rules on &lt;code&gt;queue_depth&lt;/code&gt;, &lt;code&gt;zombie_jobs_reaped_total&lt;/code&gt;, and DLQ depth make the system proactive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-job-type timeouts&lt;/strong&gt; — there's one global &lt;code&gt;job_timeout_seconds = 30&lt;/code&gt;. A &lt;code&gt;send_email&lt;/code&gt; job and a &lt;code&gt;generate_report&lt;/code&gt; job have very different expected runtimes. The next version reads timeout configuration per job type from Redis, same pattern as the scheduling weights.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Honest Takeaways
&lt;/h2&gt;

&lt;p&gt;Building this surfaced things that configuring a framework never would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;depends_on&lt;/code&gt; doesn't mean ready&lt;/strong&gt; — use &lt;code&gt;condition: service_healthy&lt;/code&gt; or your app crashes on startup reaching a service that hasn't finished initializing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bind mounts hide Dockerfile artifacts&lt;/strong&gt; — move your venv outside the mounted directory (&lt;code&gt;/opt/venv&lt;/code&gt; instead of &lt;code&gt;/app/.venv&lt;/code&gt;) or every &lt;code&gt;docker compose up&lt;/code&gt; wipes it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pytest event loops don't mix&lt;/strong&gt; — set both &lt;code&gt;asyncio_default_fixture_loop_scope&lt;/code&gt; AND &lt;code&gt;asyncio_default_test_loop_scope&lt;/code&gt; to &lt;code&gt;"session"&lt;/code&gt; or you get "Future attached to a different loop" errors that are genuinely hard to diagnose&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test your algorithms quantitatively&lt;/strong&gt; — a 10,000-sample distribution test on the scheduler catches bias that type checks and unit assertions miss entirely&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Reaper is what makes it production-grade&lt;/strong&gt; — without visibility timeout recovery and zombie detection, crashed workers silently kill jobs and give you no signal that work was lost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The whole stack — FastAPI, Redis Streams, PostgreSQL, Prometheus, Grafana, Docker Compose — runs on a laptop. No cloud account required to build and demo it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The GitHub repo, DESIGN.md, and CONTRIBUTING.md are &lt;a href="https://github.com/macaulaypraise/async-job-queue-with-backpressure.git" rel="noopener noreferrer"&gt;https://github.com/macaulaypraise/async-job-queue-with-backpressure.git&lt;/a&gt;. If you have questions about the reaper logic, the backpressure band, or the worker integration tests — drop them in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>backend</category>
      <category>redis</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Rate Limiting Wasn't Enough — So I Built an API Gateway with Behavioral Abuse Detection</title>
      <dc:creator>Macaulay Praise</dc:creator>
      <pubDate>Thu, 09 Apr 2026 16:21:33 +0000</pubDate>
      <link>https://forem.com/wolfraider/rate-limiting-wasnt-enough-so-i-built-an-api-gateway-with-behavioral-abuse-detection-24j4</link>
      <guid>https://forem.com/wolfraider/rate-limiting-wasnt-enough-so-i-built-an-api-gateway-with-behavioral-abuse-detection-24j4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Real rate limiting, Bloom filters, credential stuffing detection, and the bugs that almost broke everything. Live demo included.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/macaulaypraise/api-gateway-with-abuse-detection" rel="noopener noreferrer"&gt;macaulaypraise/api-gateway-with-abuse-detection&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Live demo:&lt;/strong&gt; &lt;a href="https://api-gateway-with-abuse-detection.onrender.com/docs" rel="noopener noreferrer"&gt;api-gateway-with-abuse-detection.onrender.com/docs&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;As someone transitioning into backend engineering, I wanted to build something that went beyond tutorials. I didn't want a CRUD app. I wanted something that would teach me how real systems defend themselves — something I could point to in an interview and say: &lt;em&gt;"I built this from scratch and I know exactly why every line exists."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That project became an &lt;strong&gt;API Gateway with Abuse Detection&lt;/strong&gt; — a FastAPI service that sits in front of upstream backends and actively detects credential stuffing, scraping bots, and known-bad actors. Here's a technical breakdown of how it works, the decisions behind it, and the real bugs that nearly cost me my sanity.&lt;/p&gt;


&lt;h2&gt;
  
  
  What the System Does
&lt;/h2&gt;

&lt;p&gt;Every request passes through a &lt;strong&gt;six-step middleware chain&lt;/strong&gt; in this exact order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. RequestID      → UUID trace ID attached to every request
2. Auth           → JWT validation, client_id + role extracted
3. BloomFilter    → O(1) bad IP + bad user-agent check
4. RateLimit      → sliding window per authenticated client
5. AbuseDetector  → graduated response (throttle/block)
6. ShadowMode     → log would-be blocks before enforcement
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each middleware depends on the one before it. If the Bloom filter flags you, the rate limiter never runs. Fail fast, fail cheap.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Components (And Why Each One Exists)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Sliding Window Rate Limiter
&lt;/h3&gt;

&lt;p&gt;Fixed-window rate limiting has a well-known flaw: a client can send &lt;code&gt;N&lt;/code&gt; requests at the end of window 1 and &lt;code&gt;N&lt;/code&gt; more at the start of window 2 — that's &lt;code&gt;2N&lt;/code&gt; requests in 2 seconds while technically never violating the per-window rule.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;sliding window&lt;/strong&gt; eliminates this. Every request gets timestamped and stored in a Redis sorted set. On each new request:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Delete all entries older than the window&lt;/li&gt;
&lt;li&gt;Count what remains&lt;/li&gt;
&lt;li&gt;Allow or deny&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key word is &lt;em&gt;atomic&lt;/em&gt;. If steps 1–3 aren't wrapped in a Lua script, a concurrent request can slip between the remove and the count, creating a race condition that lets clients exceed their limit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Executed atomically on the Redis server&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'ZREMRANGEBYSCORE'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'ZCARD'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'ZADD'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;-- allowed&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;-- blocked&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Production verification:&lt;/strong&gt; 150 parallel requests against the live Render deployment confirmed the enforcer is exact:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;100 × 200 OK  ← exactly the rate limit
 50 × 429     ← every request over the limit rejected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prometheus confirmed &lt;code&gt;rate_limit_rejections_total{client_id="demo"} 200.0&lt;/code&gt; after two parallel test runs. The &lt;code&gt;client_id&lt;/code&gt; label proves the &lt;strong&gt;JWT identity is tracked, not the IP address&lt;/strong&gt; — a crucial distinction for shared NATs and corporate networks.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Two-Dimensional Auth Failure Tracking
&lt;/h3&gt;

&lt;p&gt;Credential stuffing is tracked on two axes simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;By IP&lt;/strong&gt;: &lt;code&gt;failed_auth:{ip}&lt;/code&gt; — one IP failing across many accounts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;By username&lt;/strong&gt;: &lt;code&gt;failed_auth:{username}&lt;/code&gt; — many IPs targeting the same account&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are separate Redis keys with independent TTLs, configurable via environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;AUTH_FAILURE_IP_THRESHOLD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10       &lt;span class="c"&gt;# failures before IP soft-block&lt;/span&gt;
&lt;span class="nv"&gt;AUTH_FAILURE_USER_THRESHOLD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20     &lt;span class="c"&gt;# failures before username soft-block&lt;/span&gt;
&lt;span class="nv"&gt;AUTH_FAILURE_WINDOW_SECONDS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;300    &lt;span class="c"&gt;# counter TTL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keeping these counters independent means you can block a specific IP without penalizing every other IP targeting that same user, and flag a username as under attack without affecting unrelated clients.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Scraping Detection via Request Timing Entropy
&lt;/h3&gt;

&lt;p&gt;Humans generate requests with high temporal variance. Bots generate requests with suspiciously regular inter-request timing.&lt;/p&gt;

&lt;p&gt;For each client, I maintain a sliding window of the last N timestamps in a Redis sorted set and compute the &lt;strong&gt;standard deviation of the inter-arrival gaps&lt;/strong&gt;. A standard deviation below &lt;code&gt;SCRAPING_ENTROPY_THRESHOLD&lt;/code&gt; (default &lt;code&gt;0.5&lt;/code&gt;) triggers a bot flag.&lt;/p&gt;

&lt;p&gt;The elegant part: this doesn't care about request volume. A sophisticated bot that rate-limits itself to human speeds will still be caught if it's too &lt;em&gt;regular&lt;/em&gt;. This pairs with user-agent fingerprinting (the second Bloom filter) to create a multi-signal detection approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Dual Bloom Filters
&lt;/h3&gt;

&lt;p&gt;Two in-memory Bloom filters, both synced from Redis every 60 seconds by a background worker:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;known_bad_ips&lt;/code&gt; — screens every incoming IP at O(1) with no Redis round-trip&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;abusive_agents&lt;/code&gt; — user-agent fingerprinting for known scraper signatures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;BLOOM_FILTER_CAPACITY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1000000  &lt;span class="c"&gt;# expected entries&lt;/span&gt;
&lt;span class="nv"&gt;BLOOM_FILTER_ERROR_RATE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.001  &lt;span class="c"&gt;# 0.1% false positive rate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At a 0.1% false positive rate across 1 million IPs, the filter requires roughly 1.1 MB of memory. The worst case is a legitimate IP being flagged — which shadow mode surfaces before enforcement is ever enabled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical implementation detail&lt;/strong&gt;: the filter must live on &lt;code&gt;app.state.bloom&lt;/code&gt; and be shared across all requests. Per-request instantiation gives you a fresh empty filter on every call — zero enforcement, zero errors, 100% invisible failure. More on this in the bugs section.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Graduated Response System
&lt;/h3&gt;

&lt;p&gt;Three states instead of a binary allow/block:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ALLOWED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Request passes through normally&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;THROTTLED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Response delayed via &lt;code&gt;asyncio.sleep&lt;/code&gt;, served with &lt;code&gt;Retry-After&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SOFT_BLOCK&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Immediate 429 — Redis TTL, temporary, self-expiring&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This matters because going straight to hard block means a legitimate client that briefly triggered a rule is permanently punished. The graduated approach lets real users recover automatically while truly malicious clients face escalating consequences.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Shadow Mode — The Safety Net
&lt;/h3&gt;

&lt;p&gt;Shadow mode is how you deploy new detection rules without blocking real users. When a request would trigger a rule, shadow mode &lt;strong&gt;logs the event to Redis with a 24-hour TTL instead of blocking&lt;/strong&gt;. The request passes through normally.&lt;/p&gt;

&lt;p&gt;What makes this interesting is the implementation: shadow mode is a &lt;strong&gt;runtime toggle&lt;/strong&gt;, not a deploy-time config. It's controlled via a Redis key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable — observe but don't block&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nv"&gt;$BASE&lt;/span&gt;/admin/shadow-mode?enabled&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$ADMIN_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Disable — start enforcing&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nv"&gt;$BASE&lt;/span&gt;/admin/shadow-mode?enabled&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$ADMIN_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The middleware reads &lt;code&gt;config:shadow_mode_enabled&lt;/code&gt; from Redis on every request, falling back to the &lt;code&gt;SHADOW_MODE_ENABLED&lt;/code&gt; environment variable if the key is absent. Toggle takes effect on the next request — no redeployment, no restart.&lt;/p&gt;




&lt;h2&gt;
  
  
  Database-Backed RBAC
&lt;/h2&gt;

&lt;p&gt;The admin role system started as a simple &lt;code&gt;ADMIN_USERNAMES&lt;/code&gt; environment variable. That approach has an obvious flaw: any user who registers with that exact username bypasses all admin checks.&lt;/p&gt;

&lt;p&gt;The replacement: a &lt;code&gt;UserRole&lt;/code&gt; enum (&lt;code&gt;USER&lt;/code&gt;, &lt;code&gt;ADMIN&lt;/code&gt;) stored in the &lt;code&gt;users&lt;/code&gt; table, embedded in the JWT at login time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# JWT payload at login
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sub&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;require_admin&lt;/code&gt; dependency reads the JWT &lt;code&gt;role&lt;/code&gt; claim directly — no database query per request. To promote a user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;role&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'admin'&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'target'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user logs in again, receives a JWT with &lt;code&gt;"role": "admin"&lt;/code&gt;, and admin endpoints immediately become accessible. Their previous token expires in 30 minutes. No server restart required.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bugs That Actually Hurt
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Bug 1: The Async Password Verification Trap
&lt;/h3&gt;

&lt;p&gt;This one was subtle and genuinely dangerous. I had refactored &lt;code&gt;verify_password&lt;/code&gt; to be an &lt;code&gt;async&lt;/code&gt; function wrapping bcrypt's blocking &lt;code&gt;checkpw&lt;/code&gt; in &lt;code&gt;asyncio.to_thread()&lt;/code&gt; — which was correct. But I forgot to &lt;code&gt;await&lt;/code&gt; it at the call site:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 🚨 WRONG — coroutine object is always truthy
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;verify_password&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashed&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# This branch ALWAYS executes
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ CORRECT
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;verify_password&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashed&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A coroutine object that's never awaited evaluates as truthy. Every password check passed, regardless of input. All authentication was silently bypassed. The auth endpoint returned a valid JWT for any password entered against any account.&lt;/p&gt;

&lt;p&gt;There were no exceptions, no warnings, no test failures if your tests weren't checking wrong-password rejection specifically. The fix is trivial once you find it — finding it is the hard part.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug 2: Bloom Filter Instantiated Per-Request
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;block-ip&lt;/code&gt; admin route was creating a new &lt;code&gt;BloomFilterService()&lt;/code&gt; inside the route handler, adding the IP to that instance, and returning. Meanwhile, the middleware's shared in-memory filter (on &lt;code&gt;app.state.bloom&lt;/code&gt;) was never updated — until the 60-second background sync ran.&lt;/p&gt;

&lt;p&gt;The result: a hard-blocked IP could make 60 more requests before the block took effect. The fix was making admin routes update &lt;code&gt;request.app.state.bloom&lt;/code&gt; directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 🚨 WRONG — local instance, never seen by middleware
&lt;/span&gt;&lt;span class="n"&gt;bloom&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BloomFilterService&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;bloom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ CORRECT — updates the shared middleware instance immediately
&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bloom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Bug 3: Static Admin Username Bypassed by Registration
&lt;/h3&gt;

&lt;p&gt;The original &lt;code&gt;ADMIN_USERNAMES&lt;/code&gt; config approach had a security hole: if the env var was set to &lt;code&gt;"admin"&lt;/code&gt;, anyone could register with username &lt;code&gt;admin&lt;/code&gt; and gain admin access. Replaced entirely with the database-backed &lt;code&gt;UserRole&lt;/code&gt; enum. The setting and its associated property were deleted from &lt;code&gt;config.py&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug 4: Duplicate Alembic Migration Head
&lt;/h3&gt;

&lt;p&gt;Running &lt;code&gt;make makemigration&lt;/code&gt; twice without migrating in between creates two heads in the Alembic migration graph. The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;alembic merge heads &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"merge heads"&lt;/span&gt;
alembic stamp &lt;span class="nb"&gt;head
&lt;/span&gt;alembic upgrade &lt;span class="nb"&gt;head&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not a show-stopper, but something that will confuse you the first time you hit it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug 5: Sequential curl Doesn't Test Rate Limiting
&lt;/h3&gt;

&lt;p&gt;This one isn't a code bug — it's a test methodology bug that looks exactly like a code bug.&lt;/p&gt;

&lt;p&gt;A rate limit of 100 requests per 60-second window means requests must arrive within the same 60-second window to count against each other. Over a network connection (Render free tier adds ~500ms per request), 300 sequential calls take roughly 5 minutes. At any point only ~60 requests sit inside the window — well under the limit. The limiter appears broken when it's working correctly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# This will NOT trigger rate limiting against a remote host&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;seq &lt;/span&gt;1 300&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;curl &lt;span class="nv"&gt;$BASE&lt;/span&gt;/gateway/proxy&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# This will — all requests fire within the same window&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;seq &lt;/span&gt;1 150&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /dev/null &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="s2"&gt;"%{http_code}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;$BASE&lt;/span&gt;/gateway/proxy &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &amp;amp;
&lt;span class="k"&gt;done&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; | &lt;span class="nb"&gt;uniq&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt;
&lt;span class="c"&gt;# Output: 100 × 200, 50 × 429&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Always use parallel requests when testing rate limiting against any remote deployment.&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance Numbers
&lt;/h2&gt;

&lt;p&gt;From a 60-second Locust load test, 20 concurrent users (legitimate users, credential stuffers, and scrapers running simultaneously):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;59 req/s sustained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legitimate user failure rate&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Credential stuffing detection&lt;/td&gt;
&lt;td&gt;Blocked within 10 attempts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P50 gateway latency&lt;/td&gt;
&lt;td&gt;10ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P99 gateway latency&lt;/td&gt;
&lt;td&gt;440ms (includes throttle delay)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shadow events logged in 60s&lt;/td&gt;
&lt;td&gt;740&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The P99 spike is intentional — throttled clients hit &lt;code&gt;asyncio.sleep&lt;/code&gt;, which is where the latency comes from. Legitimate users sit at the P50 line throughout.&lt;/p&gt;




&lt;h2&gt;
  
  
  Test Coverage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;67 tests, 93% coverage.&lt;/strong&gt; The most important tests to get right:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;test_sliding_window_blocks_boundary_spike&lt;/code&gt;&lt;/strong&gt; — send N requests at end of window 1, N at start of window 2, assert total allowed is N not 2N&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;test_concurrent_duplicate_requests&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;asyncio.gather&lt;/code&gt; firing same endpoint 5 times simultaneously, assert no race condition in the counter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;test_shadow_mode_does_not_block&lt;/code&gt;&lt;/strong&gt; — enable shadow mode, send a would-be-blocked request, assert 200 returned and shadow log has an entry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;test_credential_stuffing_detected&lt;/code&gt;&lt;/strong&gt; — fail auth 10 times from same IP, assert 11th is blocked&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;test_require_admin_valid_admin&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;test_non_admin_cannot_access_admin_routes&lt;/code&gt;&lt;/strong&gt; — RBAC enforcement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Integration tests run against real Redis and PostgreSQL via a separate &lt;code&gt;docker-compose.test.yml&lt;/code&gt;. Test isolation uses &lt;code&gt;TRUNCATE TABLE ... RESTART IDENTITY CASCADE&lt;/code&gt; per test, not &lt;code&gt;drop_all/create_all&lt;/code&gt; — same isolation, far lower overhead.&lt;/p&gt;




&lt;h2&gt;
  
  
  Production Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Web framework&lt;/td&gt;
&lt;td&gt;FastAPI + Uvicorn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limit state&lt;/td&gt;
&lt;td&gt;Redis 7 (sorted sets + Lua scripts)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IP/agent filtering&lt;/td&gt;
&lt;td&gt;Bloom filter (pybloom-live)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auth&lt;/td&gt;
&lt;td&gt;JWT (python-jose) + bcrypt (asyncio.to_thread)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;PostgreSQL 15 + SQLAlchemy async&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Migrations&lt;/td&gt;
&lt;td&gt;Alembic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metrics&lt;/td&gt;
&lt;td&gt;Prometheus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logging&lt;/td&gt;
&lt;td&gt;structlog (JSON output with request_id on every line)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Testing&lt;/td&gt;
&lt;td&gt;pytest + pytest-asyncio + Locust&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI&lt;/td&gt;
&lt;td&gt;GitHub Actions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hosting&lt;/td&gt;
&lt;td&gt;Render (app) + Upstash (Redis) + Supabase (PostgreSQL)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Interview Talking Points Worth Owning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Why Lua scripts in Redis?"&lt;/strong&gt; — &lt;code&gt;MULTI/EXEC&lt;/code&gt; is optimistic; other clients can interleave between commands. Lua runs atomically on the Redis server. The read-increment-expire cycle cannot be observed in an intermediate state under concurrent load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"How do you handle a Redis outage?"&lt;/strong&gt; — Fail open vs. fail closed is a business decision. A bank fails closed — block everything if rate limit state is unavailable. A media site fails open — serve traffic and accept the abuse risk. Expose it as a config flag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"What about shared IPs and NATs?"&lt;/strong&gt; — IP alone is a weak identifier. The system layers it with JWT &lt;code&gt;client_id&lt;/code&gt;. IP rate limiting catches unauthenticated abuse; user-level limiting catches authenticated abuse. Both are needed, neither is sufficient alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"How does the Bloom filter help performance?"&lt;/strong&gt; — Without it, every request does a Redis &lt;code&gt;SISMEMBER&lt;/code&gt; call — a network round-trip. The Bloom filter checks the same list from process memory in microseconds. At 0.1% false positive rate, 1 in 1000 legitimate IPs might be flagged — which shadow mode surfaces before enforcement is enabled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"What would you change at 10x scale?"&lt;/strong&gt; — Move to Redis Cluster to eliminate the single point of failure. Load detection rules from Redis at runtime instead of config at deploy time. Add ML anomaly detection as a second signal layer. Per-datacenter rate limiting with global sync.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;The most valuable lesson wasn't any individual component — it was &lt;strong&gt;build order&lt;/strong&gt;. The pattern that worked: environment → infrastructure → config → database models → core clients → services → API layer → workers. Never jumping a stage. A broken Redis client makes every rate limiter test confusing. A broken DB session makes every auth test unreliable.&lt;/p&gt;

&lt;p&gt;The second lesson: cross-check against your spec after you think you're done. The graduated response system, user-agent fingerprinting, and several Prometheus metrics were all missing from my "complete" implementation until I ran a systematic audit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;The live demo is running at &lt;a href="https://api-gateway-with-abuse-detection.onrender.com/docs" rel="noopener noreferrer"&gt;api-gateway-with-abuse-detection.onrender.com/docs&lt;/a&gt;. Register a user, grab a JWT, hit the gateway endpoint 110 times in parallel, and watch the 429s start. Shadow stats accumulate at &lt;code&gt;/admin/shadow-stats&lt;/code&gt; if you have an admin token.&lt;/p&gt;

&lt;p&gt;Source, DESIGN.md, and load test scenarios: &lt;a href="https://github.com/macaulaypraise/api-gateway-with-abuse-detection" rel="noopener noreferrer"&gt;github.com/macaulaypraise/api-gateway-with-abuse-detection&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tags: &lt;code&gt;python&lt;/code&gt; &lt;code&gt;fastapi&lt;/code&gt; &lt;code&gt;redis&lt;/code&gt; &lt;code&gt;security&lt;/code&gt; &lt;code&gt;webdev&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>api</category>
      <category>backend</category>
      <category>security</category>
      <category>showdev</category>
    </item>
    <item>
      <title>The Dual-Write Problem: Why Your Payment API Is One Crash Away From Silent Data Loss</title>
      <dc:creator>Macaulay Praise</dc:creator>
      <pubDate>Tue, 17 Mar 2026 11:37:44 +0000</pubDate>
      <link>https://forem.com/wolfraider/the-dual-write-problem-why-your-payment-api-is-one-crash-away-from-silent-data-loss-mk7</link>
      <guid>https://forem.com/wolfraider/the-dual-write-problem-why-your-payment-api-is-one-crash-away-from-silent-data-loss-mk7</guid>
      <description>&lt;p&gt;You commit a payment to your database. Then you publish an event to Kafka so downstream services can settle it. Both succeed — until one day the process crashes in the 3 milliseconds between those two operations.&lt;/p&gt;

&lt;p&gt;The database says the payment happened. Kafka never heard about it. The settlement worker never ran. The customer was charged and nothing moved.&lt;/p&gt;

&lt;p&gt;That's the dual-write problem. This post explains why it's unsolvable with the obvious approaches, and how the Outbox pattern fixes it properly — using an implementation I built and load-tested to 1,000 concurrent users with zero duplicate charges.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the Obvious Solutions Don't Work
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Just publish to Kafka first, then write to the DB."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Same problem, reversed. The event fires but the payment row never gets written. Your downstream consumers process a payment that your database has no record of.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Use a transaction that wraps both."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can't. A database transaction and a Kafka publish are two entirely separate systems. PostgreSQL has no knowledge of Kafka. There is no &lt;code&gt;COMMIT&lt;/code&gt; that covers both. The moment you step outside your DB transaction to call &lt;code&gt;producer.send()&lt;/code&gt;, you're in crash territory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Use Two-Phase Commit (2PC)."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kafka doesn't support it. And even in systems where both sides support 2PC, you're introducing a coordinator as a single point of failure with significantly higher latency. This is why 2PC has largely been abandoned in modern distributed systems in favour of patterns like the Outbox.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Crash Window Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Here's the exact sequence that fails silently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="k"&gt;BEGIN&lt;/span&gt; &lt;span class="n"&gt;transaction&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;payments&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'PENDING'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="n"&gt;DB&lt;/span&gt; &lt;span class="k"&gt;write&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="k"&gt;COMMIT&lt;/span&gt;                                       &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;                                              &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="err"&gt;💥&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt; &lt;span class="n"&gt;crashes&lt;/span&gt; &lt;span class="n"&gt;here&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'payment.initiated'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;      &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="n"&gt;never&lt;/span&gt; &lt;span class="n"&gt;reached&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 4 is real. Network blips, OOM kills, deploys — any of these can fire between steps 3 and 5. The window is tiny, but at scale it closes eventually.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Outbox Pattern
&lt;/h2&gt;

&lt;p&gt;The fix is to stop writing to two systems. Write to one.&lt;/p&gt;

&lt;p&gt;Instead of publishing directly to Kafka, you write the event as a row in an &lt;code&gt;outbox_events&lt;/code&gt; table — &lt;strong&gt;inside the same database transaction as the payment row&lt;/strong&gt;. A separate background poller reads from that table and publishes to Kafka.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="k"&gt;BEGIN&lt;/span&gt; &lt;span class="n"&gt;transaction&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;payments&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'PENDING'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;outbox_events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'payment.initiated'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="k"&gt;COMMIT&lt;/span&gt;                                    &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="k"&gt;both&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt; &lt;span class="n"&gt;land&lt;/span&gt; &lt;span class="n"&gt;atomically&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the Kafka publish is handled by the poller:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;OUTBOX&lt;/span&gt; &lt;span class="n"&gt;POLLER&lt;/span&gt;  &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;outbox_events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
               &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
               &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;outbox_events&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the poller crashes after publishing but before marking the row, it simply replays on restart — Kafka receives a duplicate, which you handle with a deterministic event ID (more on this below). The payment row is never orphaned because the event was committed to the database first.&lt;/p&gt;

&lt;p&gt;The full flow in my implementation looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLIENT  →  POST /payments  +  Idempotency-Key: &amp;lt;uuid&amp;gt;
                │
                ▼
        ┌─ Redis cache check ──── HIT → return stored response (no DB touch)
        ├─ Distributed lock ───── prevents concurrent duplicate requests
        ├─ DB transaction ──────── Payment row + OutboxEvent row (atomic)
        └─ Cache response, release lock → 202 Accepted

OUTBOX POLLER  →  polls outbox_events WHERE published_at IS NULL  →  Kafka

KAFKA  →  SETTLEMENT WORKER
           ├─ PENDING → PROCESSING → SETTLED / FAILED
           ├─ Exponential backoff, max 5 retries
           └─ Dead Letter Queue on exhaustion
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Handling the At-Least-Once Delivery Problem
&lt;/h2&gt;

&lt;p&gt;The outbox poller delivers at-least-once to Kafka — meaning duplicate events are possible on replay. The settlement worker handles this with deterministic UUID5 event IDs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;event_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NAMESPACE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same &lt;code&gt;topic:partition:offset&lt;/code&gt; always produces the same UUID. On replay, the deduplication check is a no-op — it sees the event ID already in &lt;code&gt;processed_events&lt;/code&gt; and skips it. No double processing, no complex coordination.&lt;/p&gt;




&lt;h2&gt;
  
  
  Does It Actually Work?
&lt;/h2&gt;

&lt;p&gt;I ran two load test scenarios with Locust against a single Docker container:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Concurrent Users&lt;/th&gt;
&lt;th&gt;Total Requests&lt;/th&gt;
&lt;th&gt;Duplicate Charges&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Normal load&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;1,378&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stress test&lt;/td&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;td&gt;12,746&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Correctness held at 0% duplicate charges through both. The 0.4% error rate at 1,000 users was connection pool exhaustion — not an idempotency failure. Every retry with the same idempotency key returned the identical &lt;code&gt;payment_id&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Outbox Pattern Trades Off
&lt;/h2&gt;

&lt;p&gt;Nothing is free. The outbox poller introduces a small delay — typically 1–5 seconds — between a payment being committed and its event reaching Kafka. For most use cases this is acceptable. For real-time fraud scoring that needs to act on the event immediately, it isn't, and you'd need a different approach.&lt;/p&gt;

&lt;p&gt;The poller also needs to be a reliable background process. If it stops running silently, your outbox table grows and events stall. Monitoring queue depth is not optional.&lt;/p&gt;




&lt;h2&gt;
  
  
  The One-Sentence Summary
&lt;/h2&gt;

&lt;p&gt;The Outbox pattern solves the dual-write problem by making the event a database record first and delegating the Kafka publish to a separate, restartable poller — so you never write to two systems atomically, you write to one.&lt;/p&gt;




&lt;p&gt;Full source code, DESIGN.md, and load test results: &lt;strong&gt;&lt;a href="https://github.com/macaulaypraise/idempotent-payment-processing-system.git" rel="noopener noreferrer"&gt;https://github.com/macaulaypraise/idempotent-payment-processing-system.git&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Stack: Python 3.12 · FastAPI · PostgreSQL 15 · Redis 7 · Kafka · SQLAlchemy (async) · Docker Compose&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>python</category>
      <category>backend</category>
      <category>kafka</category>
    </item>
  </channel>
</rss>
