Forem: Macaulay Praise

I Built a Production-Grade Async Job Queue from Scratch — Here's Everything That Actually Happened

Macaulay Praise — Mon, 25 May 2026 20:58:19 +0000

A real account of building an Async Job Queue with Backpressure & Priority Scheduling using Python, FastAPI, and Redis Streams — covering the Reaper service, 47 tests, 85% coverage, and every bug along the way.

No Celery. No shortcuts. No pretending the first approach worked.

I decided to build a production-grade project from scratch to bypass the abstraction layer of tools like Celery and truly master backend internals—the kind of work you can defend in any technical interview.

Instead of reaching for a framework that abstracts the hard parts, I built an async job queue from scratch — backpressure, priority scheduling, crash recovery, zombie detection, the full picture. This is the honest account of what I built, in what order, what broke, what I fixed, and what the final numbers looked like.

Stack: Python 3.12, FastAPI, Redis Streams, PostgreSQL, SQLAlchemy, Alembic, Prometheus, Grafana, Locust, Docker
Final state: 47 tests passing, 85% coverage, services layer at ~92%

Why a Job Queue? Why Not Just Use Celery?

Celery is the right tool when you need to ship business features fast and the queue is infrastructure, not the product. Building from scratch is the right choice when you need to understand exactly why a system fails — and how to keep it alive when it does.

When an interviewer asks "how does a visibility timeout work?" or "how do you prevent a zombie job from blocking your metrics?", knowing how to configure Celery doesn't answer that. Understanding the mechanism does.

A job queue is the backbone of anything that works outside the HTTP request-response cycle: sending emails, generating reports, processing images, running inference. The interesting engineering isn't the queue — it's everything around it:

How do you stop a flood of jobs from overwhelming your workers?
How do you ensure critical jobs run before low-priority ones without starving everything else?
What happens when a worker crashes mid-job?
How do you detect a job that's permanently broken vs. one that just needs a retry?

The Part Most Tutorials Skip: The Reaper

Before walking through the build, I want to call out the component that separates a job queue from a production-grade job queue: the Reaper.

Most tutorials build a producer and a consumer and call it done. But the hardest failure mode isn't a job that errors — it's a worker that disappears. A worker that picks up a job, starts executing, and then crashes. The message is claimed. No ACK is coming. The job is stuck in limbo.

The Reaper is the background service that fixes this. It runs every 10 seconds and does two things:

Visibility timeout recovery — queries XPENDING per Redis Stream for messages claimed but not acknowledged beyond the timeout threshold. Re-enqueues them as retries.
Zombie job detection — queries PostgreSQL for jobs in RUNNING status whose heartbeat_at timestamp is older than 60 seconds. Marks them FAILED and re-enqueues.

Without the Reaper, a crashed worker creates jobs that show RUNNING forever — poisoning your metrics, silently losing work, and giving you no signal that anything went wrong. With it, the system self-heals.

Every other component in this build is interesting. The Reaper is what makes the system trustworthy.

The Build Order (And Why It Matters)

I enforced a strict eight-stage sequence. The rule: if Stage N is broken, do not start Stage N+1. A broken Redis client makes every queue test confusing. A broken DB session makes every status transition unreliable.

Stage 1 — Environment & tooling
Stage 2 — Docker Compose infrastructure
Stage 3 — Configuration layer
Stage 4 — Database models & migrations
Stage 5 — Core clients (Redis, DB session)
Stage 6 — Business logic (services)
Stage 7 — FastAPI layer (routers, schemas)
Stage 8 — Background workers

Every stage had a verification step before moving on. Here's what actually happened at each one.

Stage 1 — Environment & Tooling

Poetry over pip. Poetry gives you a lockfile (exact versions of every dependency), separate dev and production groups, and a consistent virtual environment tied to the project. First thing I added was the dev dependencies — pytest, ruff, mypy — before any application code existed.

One non-obvious setting: asyncio_mode = "auto" and addopts = "--no-cov" in pyproject.toml. The --no-cov suppresses noisy coverage warnings before there's any application code to cover. Easy win.

.env discipline from the start. .env.example committed. .env in .gitignore immediately. If you accidentally commit a .env with real credentials, you must rotate everything in it. Don't find out the hard way.

Stage 2 — Docker Compose Infrastructure

No Kafka. Redis Streams consumer groups provide per-message acknowledgement and a pending entries list, which is the delivery guarantee this queue needs, without the operational overhead of running a Kafka cluster.

The most important thing I did here: used condition: service_healthy in depends_on instead of just listing service names. depends_on without a health condition only waits for the container to start, not for the service inside it to be ready. Redis can take a second to accept connections. Without this, your app crashes on startup trying to connect to Redis before it's listening.

depends_on:
  redis:
    condition: service_healthy
  postgres:
    condition: service_healthy

Verification gate: make check-infra running redis-cli ping and pg_isready before touching any application code.

Stage 3 — Configuration Layer

One file: app/config.py. Its entire job is to read environment variables and expose them as a typed Python object. Nothing else in the codebase calls os.environ directly.

from functools import lru_cache
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    redis_url: str = "redis://localhost:6379/0"
    database_url: str = "postgresql+asyncpg://..."
    high_watermark: int = 10000
    low_watermark: int = 2000
    weight_critical: int = 60
    weight_high: int = 30
    weight_normal: int = 10
    max_retries: int = 5
    job_timeout_seconds: int = 30

@lru_cache
def get_settings() -> Settings:
    return Settings()

@lru_cache ensures settings are read exactly once per process. Without it, environment variables get read on every function call — wasteful and occasionally surprising.

First tests I wrote: verify defaults load, weights sum to 100, high watermark is greater than low watermark. Three tests. All pass. Move on.
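Those invariants can be expressed as one small check. This is a sketch under assumptions: `QueueSettings` is a hypothetical stdlib stand-in for the pydantic `Settings` class (so it runs without extra packages), and `check_invariants` is an illustrative helper, not a function from the repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSettings:  # hypothetical stand-in for the Settings class above
    high_watermark: int = 10000
    low_watermark: int = 2000
    weight_critical: int = 60
    weight_high: int = 30
    weight_normal: int = 10


def check_invariants(s: QueueSettings) -> list[str]:
    """Return a list of violated invariants (empty means the config is sane)."""
    problems = []
    total = s.weight_critical + s.weight_high + s.weight_normal
    if total != 100:
        problems.append(f"weights sum to {total}, expected 100")
    if s.high_watermark <= s.low_watermark:
        problems.append("high_watermark must be greater than low_watermark")
    return problems


default_problems = check_invariants(QueueSettings())
bad_problems = check_invariants(QueueSettings(low_watermark=20000, weight_normal=20))
```

The defaults pass cleanly, while the deliberately broken config reports both violations.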

Stage 4 — Database Models & Migrations

Use Alembic from day one. I learned this the hard way. SQLAlchemy's Base.metadata.create_all() is convenient for getting started, but it creates untracked schema you can't evolve safely in production. Alembic tracks every change as a versioned migration file.

The bug I hit: I ran alembic revision --autogenerate multiple times across sessions. Each run created a new migration file with no dependency on the previous one, breaking Alembic's chain silently. The fix was to delete everything in migrations/versions/ and generate one single clean migration.

The jobs table columns that matter most:

class Job(Base):
    __tablename__ = "jobs"

    id: Mapped[UUID] = mapped_column(primary_key=True, default=uuid4)
    status: Mapped[str]          # PENDING → RUNNING → COMPLETED/FAILED
    priority: Mapped[str]        # critical, high, normal
    payload: Mapped[dict]        # JSON
    result: Mapped[dict | None]  # JSON, nullable
    retry_count: Mapped[int] = mapped_column(default=0)
    max_retries: Mapped[int] = mapped_column(default=5)
    worker_id: Mapped[str | None]
    heartbeat_at: Mapped[datetime | None]  # ← key for zombie detection
    error: Mapped[str | None]
    created_at: Mapped[datetime] = mapped_column(default=func.now())
    updated_at: Mapped[datetime] = mapped_column(default=func.now(), onupdate=func.now())

heartbeat_at is the column that makes the entire zombie detection system work. Without it, you have no way to distinguish a job that's genuinely running from one whose worker died.

Stage 5 — Core Clients

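redis[hiredis], not just redis. The hiredis extra is a C extension that speeds up Redis response parsing, by up to roughly 10x according to the redis-py documentation, with the biggest gains on large multi-item replies. One extra word in the install command, and a measurable throughput difference under load.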

The bug that cost me time: I wrote the health endpoint as:

async def health(request):  # WRONG

FastAPI saw request without a type annotation and treated it as a query parameter, not the HTTP request object. The fix:

async def health(request: Request):  # RIGHT

First-timer mistake. It'll happen to you too. Now you know.

The Redis client lifecycle: Create it inside the FastAPI lifespan function, not at module import time. Store it on app.state.redis. Never create it globally.

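from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan(app: FastAPI):
    settings = get_settings()
    # Created inside the running event loop, never at import time
    app.state.redis = await create_redis_client(settings.redis_url)
    yield
    await app.state.redis.aclose()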

Creating it at import time causes "Event loop is closed" errors that are confusing to debug.

Stage 6 — Business Logic

This is the interesting part. All services are plain Python async functions — no FastAPI imports, no HTTP, no request/response objects. Pure logic that's easy to unit test.

Backpressure: Two Watermarks, Not One

My first instinct was a single threshold: accept below X, reject above X. This causes oscillation — rapid toggling between accepting and rejecting as queue depth bounces around the threshold under load.

The fix is a band:

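import json

BACKPRESSURE_STATE_KEY = "backpressure:active"

async def enqueue(redis, job_id, payload, priority, settings):
    depth = await redis.xlen(f"queue:{priority}")
    is_active = await redis.exists(BACKPRESSURE_STATE_KEY)

    # At or above the high watermark: raise the flag and reject
    if depth >= settings.high_watermark:
        await redis.set(BACKPRESSURE_STATE_KEY, "1")
        raise BackpressureError()

    # Flag set and still inside the band: keep rejecting
    if is_active and depth >= settings.low_watermark:
        raise BackpressureError()

    # Drained below the low watermark: lower the flag and accept
    if is_active and depth < settings.low_watermark:
        await redis.delete(BACKPRESSURE_STATE_KEY)

    # Stream fields must be flat strings, so serialize the ID and payload
    await redis.xadd(
        f"queue:{priority}",
        {"job_id": str(job_id), "payload": json.dumps(payload)},
    )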

High watermark fires the flag. The flag stays set until depth drops below the low watermark. The band gives the queue time to drain before opening again.
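To see the band in action without Redis, here's a small in-memory model. It's a sketch: `BackpressureBand` is a hypothetical class, the flag is a plain attribute instead of a Redis key, and the watermarks are the defaults from the config section.

```python
HIGH_WATERMARK = 10_000  # defaults from the Settings class above
LOW_WATERMARK = 2_000


class BackpressureBand:
    """In-memory model of the flag; the real flag lives in Redis."""

    def __init__(self) -> None:
        self.active = False

    def accepts(self, depth: int) -> bool:
        if depth >= HIGH_WATERMARK:
            self.active = True       # raise the flag and reject
            return False
        if self.active and depth >= LOW_WATERMARK:
            return False             # still draining inside the band
        self.active = False          # below the low watermark: reopen
        return True


band = BackpressureBand()
depths = [9_000, 10_000, 9_000, 5_000, 1_999, 9_000]
decisions = [band.accepts(d) for d in depths]
```

A depth of 9,000 is accepted before the flag fires, rejected while the queue is draining, and accepted again once depth has dipped below the low watermark. The same depth gets different answers depending on history, which is exactly what stops the oscillation.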

Weighted Fair Scheduling: No Starvation

Multiple priority queues are useless without a scheduler that prevents starvation. A naive scheduler draining critical first means low-priority jobs never run when critical is busy.

async def pick_queue(redis) -> str:
    weights = await get_weights(redis)
    # e.g. {"critical": 60, "high": 30, "normal": 10}

    roll = random.randint(1, 100)
    cumulative = 0
    for priority, weight in weights.items():
        cumulative += weight
        if roll <= cumulative:
            return f"queue:{priority}"

The weights are stored in Redis at runtime, not hardcoded. During an incident, you can set critical to 100% via an API call without redeploying.

I verified this with a test — 10,000 simulated picks, assert distribution matches configured weights within ±5%:

def test_scheduler_distribution():
    picks = {"critical": 0, "high": 0, "normal": 0}
    weights = {"critical": 60, "high": 30, "normal": 10}

    for _ in range(10_000):
        queue = pick_queue_sync(weights)
        picks[queue] += 1

    for priority, expected_pct in weights.items():
        actual_pct = picks[priority] / 100
        assert abs(actual_pct - expected_pct) < 5

If you can't prove the algorithm produces the right distribution, you haven't finished writing it.

Redis Streams: Not Just a List

My first instinct was LPUSH/RPOP — simple and familiar. Fatal flaw: if a worker pops a job and crashes before finishing, the job is gone. There's no recovery.

Redis Streams with consumer groups solve this via PENDING entries. A message delivered to a consumer but not yet acknowledged stays in the pending list. No ACK = not done.

# Dequeue
messages = await redis.xreadgroup(
    groupname="workers",
    consumername=worker_id,
    streams={stream_name: ">"},  # ">" = only undelivered messages
    count=1,
    block=2000
)

# After success
await redis.xack(stream_name, "workers", message_id)

The per-priority visibility timeouts I set:

Priority	Visibility Timeout
critical	30 seconds
high	60 seconds
normal	120 seconds

Critical jobs get recovered faster. Normal jobs have more time before the reaper reclaims them.
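The table translates into a lookup plus a filter. This sketch assumes pending entries arrive as dicts with `message_id` and `time_since_delivered` (milliseconds) keys, which is how I understand redis-py's `xpending_range` to report them; treat that shape, and the `stale_entries` helper, as assumptions.

```python
VISIBILITY_TIMEOUTS_MS = {"critical": 30_000, "high": 60_000, "normal": 120_000}


def stale_entries(priority: str, pending: list[dict]) -> list[str]:
    """Return IDs of pending messages idle longer than the priority's timeout."""
    timeout_ms = VISIBILITY_TIMEOUTS_MS[priority]
    return [
        entry["message_id"]
        for entry in pending
        if entry["time_since_delivered"] > timeout_ms
    ]


# Same idle times, different verdicts depending on the stream's priority
pending = [
    {"message_id": "1-0", "consumer": "worker-a", "time_since_delivered": 45_000},
    {"message_id": "2-0", "consumer": "worker-b", "time_since_delivered": 5_000},
]
critical_stale = stale_entries("critical", pending)
normal_stale = stale_entries("normal", pending)
```

A message idle for 45 seconds is reclaimed on the critical stream but left alone on the normal stream.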

Exponential Backoff with Full Jitter

def _backoff_seconds(retry_count: int, base: float = 1.0, max_delay: float = 60.0) -> float:
    return min(base * (2 ** retry_count), max_delay) * random.random()

The * random.random() is full jitter — prevents the thundering herd problem where all retrying workers wake up simultaneously and hammer the same resources. Without jitter, every worker that fails at the same time retries at exactly the same moment.

Bug I hit during implementation: The except Exception as e retry path referenced error_msg, a variable only defined inside the except asyncio.TimeoutError block above it. Python raised NameError at runtime. The fix was using str(e) directly in the general exception handler.
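The shape of that bug is worth seeing on its own. This is a reduced sketch, not the worker's code: `run_with_retry_message` is a hypothetical function that returns the error text the retry path would record.

```python
import asyncio


async def run_with_retry_message(handler, timeout: float) -> str:
    """Run a handler and return the error text the retry path would record."""
    try:
        await asyncio.wait_for(handler(), timeout=timeout)
        return "ok"
    except asyncio.TimeoutError:
        error_msg = f"timed out after {timeout}s"  # local to this branch
        return error_msg
    except Exception as e:
        # error_msg was never assigned on this path, so use the exception itself
        return str(e)


async def slow():
    await asyncio.sleep(1)


async def broken():
    raise ValueError("bad payload")


timeout_result = asyncio.run(run_with_retry_message(slow, timeout=0.01))
error_result = asyncio.run(run_with_retry_message(broken, timeout=1.0))
```

Referencing `error_msg` in the second handler would raise NameError whenever the failure was anything other than a timeout, because only one except branch runs per exception.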

Stage 7 — FastAPI Layer

By the time I reached this stage, all business logic was written and tested. Route handlers are thin:

@router.post("/jobs", status_code=202)
async def submit_job(
    body: JobCreate,
    db: DBSession,
    redis: RedisDep,
    settings: SettingsDep,
) -> JobResponse:
    try:
        job = await create_job(db, body.payload, body.priority)
        await enqueue(redis, job.id, body.payload, body.priority, settings)
        return JobResponse.model_validate(job)
    except BackpressureError:
        backpressure_rejections_total.inc()
        raise HTTPException(
            status_code=503,
            headers={"Retry-After": "5"},
            detail="Queue at capacity. Retry after 5 seconds."
        )

If your route handler is longer than 20 lines, business logic has leaked into it.

The pytest event loop nightmare: My first conftest used asyncio_default_fixture_loop_scope = "session" but asyncio_default_test_loop_scope defaulted to "function". This caused "Future attached to a different loop" errors — SQLAlchemy's connection pool was created on the session loop but tests ran on function loops. The fix was setting both to "session" and restructuring conftest with session-scoped engine and Redis client.

Stage 8 — Background Workers

Two workers, each as a separate service in Docker Compose:

job_worker.py — the main worker loop:

Pick a queue via weighted scheduler
XREADGROUP to claim a message
Mark job RUNNING in PostgreSQL
Start a heartbeat task (writes timestamp every 5 seconds)
Dispatch to handler with asyncio.wait_for(timeout=30)
On success: XACK + mark COMPLETED
On failure: increment retry, re-enqueue with backoff, or DLQ if exhausted
Cancel heartbeat task in finally
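The heartbeat, timeout, and cleanup steps from the worker loop above fit together roughly like this. It's a sketch with stand-ins: `heartbeats` is an in-memory dict instead of the `heartbeat_at` column, `run_job` is a hypothetical name, and the intervals are shortened so it runs in a fraction of a second.

```python
import asyncio
import time

heartbeats: dict[str, float] = {}  # in-memory stand-in for jobs.heartbeat_at


async def heartbeat(job_id: str, interval: float) -> None:
    """Record a fresh timestamp for the job until cancelled."""
    while True:
        heartbeats[job_id] = time.monotonic()
        await asyncio.sleep(interval)


async def run_job(job_id: str, handler, timeout: float, interval: float) -> str:
    beat = asyncio.create_task(heartbeat(job_id, interval))
    try:
        await asyncio.wait_for(handler(), timeout=timeout)
        return "COMPLETED"
    except asyncio.TimeoutError:
        return "RETRY"
    finally:
        beat.cancel()  # always stop the heartbeat, success or failure
        await asyncio.gather(beat, return_exceptions=True)


async def quick_handler() -> None:
    await asyncio.sleep(0.05)


async def stuck_handler() -> None:
    await asyncio.sleep(10)


ok_status = asyncio.run(run_job("job-1", quick_handler, timeout=1.0, interval=0.01))
timeout_status = asyncio.run(run_job("job-2", stuck_handler, timeout=0.05, interval=0.01))
```

Cancelling in `finally` matters: a heartbeat that outlives its job would keep a dead job looking alive and hide it from the reaper.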

reaper.py — runs every 10 seconds:

XPENDING query per stream for messages older than visibility timeout
Re-enqueue stale pending messages as retries
Query PostgreSQL for RUNNING jobs with heartbeat_at < now() - 60 seconds
Mark zombies as FAILED

# docker-compose.yml
job_worker:
  build: .
  command: python -m app.workers.job_worker
  restart: unless-stopped
  depends_on:
    app:
      condition: service_healthy

reaper:
  build: .
  command: python -m app.workers.reaper
  restart: unless-stopped

restart: unless-stopped means a crashed worker comes back up automatically and resumes — same behavior as production.

The Dockerfile Problem (And the Right Fix)

My multi-stage Dockerfile put the venv at /app/.venv. Then I added a bind mount in docker-compose:

volumes:
  - .:/app

The bind mount overlaid the entire /app directory, hiding the venv the Dockerfile built. The container had no packages.

My first fix was a named Docker volume to protect the venv. Workable, but fragile — every time I added a new package, I had to run docker compose down -v to wipe and repopulate the volume.

The correct fix: move the venv outside /app:

# Stage 1: build
ENV VIRTUAL_ENV=/opt/venv
RUN python -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN pip install ...

# Stage 2: runtime
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

/opt/venv is never touched by the bind mount. No volume management needed. The bind mount only covers /app.

The Engineering Pillars I Hardened After the Initial Build

After completing all 8 stages and doing a thorough review of the system against production requirements, I found gaps across three areas: resilience, observability, and testing. Here's how I addressed each one.

Pillar 1 — Resilience

Two-watermark backpressure band (not a single threshold). My initial implementation used one cutoff point. The problem: as queue depth oscillates around that number under real traffic, the system rapidly toggles between accepting and rejecting requests. The fix is a band — the high watermark fires a Redis flag, and the flag stays set until depth drops below the separate low watermark. That gap prevents oscillation.

Exponential backoff with full jitter. I had retry logic but used a fixed delay. The issue with fixed delays is synchronization — when ten workers all fail at the same moment, they all retry at the same moment, and you hammer your dependencies in waves.

def _backoff_seconds(retry_count: int, base: float = 1.0, max_delay: float = 60.0) -> float:
    return min(base * (2 ** retry_count), max_delay) * random.random()

The * random.random() is full jitter. It spreads retries across a window, breaking the thundering herd.

Bug hit during implementation: The except Exception as e retry path referenced error_msg, a variable only defined inside the except asyncio.TimeoutError block above it. Python raised NameError at runtime. Fixed by using str(e) directly.

Per-priority visibility timeouts. The original reaper used a single global timeout for all streams. Critical jobs need faster recovery. Normal jobs can wait longer before the reaper reclaims them.

Priority	Visibility Timeout
critical	30 seconds
high	60 seconds
normal	120 seconds

Dead Letter Queue with replay. Permanently failed jobs (retry count exhausted) land in queue:dlq with their full failure metadata. A POST /queues/dlq/{id}/replay endpoint moves them back to their original priority stream for human-initiated retry. The DLQ depth is exposed in /queues/metrics so you know when it needs attention.
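The DLQ flow can be modelled in a few lines. This is a sketch with assumptions called out: the queues are in-memory deques rather than Redis Streams, `handle_failure` and `replay` are hypothetical names, and resetting the retry count on replay is my guess at sensible behavior, not something the endpoint is confirmed to do.

```python
from collections import deque

MAX_RETRIES = 5  # default from the Settings class
queues = {"critical": deque(), "high": deque(), "normal": deque()}
dlq: dict[str, dict] = {}  # job_id -> job plus failure metadata


def handle_failure(job: dict, error: str) -> str:
    """Re-enqueue a failed job, or park it in the DLQ once retries run out."""
    job["retry_count"] += 1
    if job["retry_count"] > MAX_RETRIES:
        dlq[job["id"]] = {**job, "error": error}
        return "DLQ"
    queues[job["priority"]].append(job)
    return "RETRY"


def replay(job_id: str) -> dict:
    """Move a DLQ entry back to its original priority queue (assumed: fresh budget)."""
    entry = dlq.pop(job_id)
    job = {"id": entry["id"], "priority": entry["priority"], "retry_count": 0}
    queues[job["priority"]].append(job)
    return job


job = {"id": "job-1", "priority": "high", "retry_count": MAX_RETRIES}
outcome = handle_failure(job, "upstream 500")
replayed = replay("job-1")
```

The job keeps its original priority through the DLQ, so a replayed entry lands back on the stream it came from.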

Runtime weight adjustment. Scheduling weights are stored in Redis, not config. PATCH /queues/weights validates that weights sum to 100 and stores them. DELETE /queues/weights reverts to config defaults. During an incident, you can redirect 100% of worker capacity to critical jobs via a single API call without redeploying.
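The validation behind that PATCH can be small. Here's a sketch: `validate_weights` is a hypothetical helper, and the rules beyond "sum to 100" (exact key set, no negatives) are assumptions I'd want in place before storing anything in Redis.

```python
PRIORITIES = ("critical", "high", "normal")


def validate_weights(weights: dict[str, int]) -> dict[str, int]:
    """Return the weights unchanged if valid, else raise ValueError."""
    if set(weights) != set(PRIORITIES):
        raise ValueError(f"expected exactly these keys: {PRIORITIES}")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    return weights


incident = validate_weights({"critical": 100, "high": 0, "normal": 0})

try:
    validate_weights({"critical": 70, "high": 30, "normal": 10})
    rejected = False
except ValueError:
    rejected = True
```

The incident setting (100/0/0) passes, while a payload that sums to 110 is rejected before it can skew the scheduler.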

Pillar 2 — Observability

Structured logging with structlog. Python's built-in logging produces unstructured text. structlog with JSON output in production produces log lines that are filterable and machine-parseable. Every log line includes request_id, job_id, priority, and the relevant domain context.

Bug hit: structlog.stdlib.add_logger_name is designed for logging.Logger objects. PrintLoggerFactory (structlog's default) produces PrintLogger objects with no .name attribute — the processor raises AttributeError at runtime. Fix: remove that processor from the chain.

Four custom Prometheus metrics — by name, not just by count:

# app/core/metrics.py
queue_depth = Gauge("queue_depth", "Current queue depth", ["priority"])
jobs_processed_total = Counter("jobs_processed_total", "Jobs processed", ["status"])
backpressure_rejections_total = Counter("backpressure_rejections_total", "Backpressure 503s fired")
zombie_jobs_reaped_total = Counter("zombie_jobs_reaped_total", "Zombie jobs recovered by reaper")

queue_depth updates on a 15-second background task. backpressure_rejections_total increments in the router's exception handler. jobs_processed_total increments in the worker's success and DLQ-failure paths. zombie_jobs_reaped_total increments in the reaper after each recovery.

Grafana dashboard provisioned automatically. Rather than requiring manual dashboard setup, I added grafana/provisioning/ with datasource and dashboard JSON files mounted into the Grafana container. make dev gives you a working dashboard at localhost:3000, no clicking required.

Pillar 3 — Testing

Locust load tests with three realistic user classes:

LegitimateUser (weight 3): submits jobs, polls results, checks metrics
BurstUser (weight 1): submits at 10-50ms intervals to trigger backpressure intentionally — both 202 and 503 are counted as success because 503 is the correct behavior under overload
MixedWorkload (weight 2): submits across all three priorities in a 6:3:1 ratio

Results at 50 users, dev server: 56 req/s sustained, P50 220ms, P95 640ms.

Pre-commit hooks. ruff, ruff-format, mypy, and standard file checks (trailing-whitespace, check-yaml, debug-statements) run before every commit. Code that doesn't pass these never reaches the repo.

GitHub Actions CI pipeline. Every push triggers: ruff → mypy → alembic upgrade head → pytest with Redis 7 and Postgres 15 as sidecar containers. The migration step is included because schema drift between model and database is a silent failure mode that only shows up in integration tests.

The Tests That Actually Proved the System Works

I'm including three specific tests because they demonstrate how to test distributed system behavior without asyncio.sleep — which makes tests flaky and slow. More importantly, they test the failure modes that most implementations never cover: what happens when workers crash mid-execution.

The 85% overall and ~92% services-layer coverage figures aren't just numbers. They include worker crashes during execution (test_visibility_timeout_reaper), stale-heartbeat zombie recovery (test_zombie_detection), and exhausted-retry DLQ routing. These are the edge cases that fail silently in production.

Visibility timeout reaper — directly patch the timeout to zero instead of waiting:

async def test_visibility_timeout_reaper(redis, mocker):
    await redis.xadd("queue:normal", {"job_id": "orphan-1", "payload": "{}"})
    # Claim it but never ACK (simulate crashed worker)
    await redis.xreadgroup("workers", "dead-worker", {"queue:normal": ">"}, count=1)

    # Patch timeout to 0ms — no sleeping required
    mocker.patch.dict("app.services.reaper_service.VISIBILITY_TIMEOUTS_MS",
                      {"critical": 0, "high": 0, "normal": 0})

    recovered = await recover_pending_messages(redis)
    assert recovered >= 1

Zombie detection — set stale heartbeat directly in the DB:

async def test_zombie_detection(db_session):
    job = await create_job(db_session, {"type": "send_email"}, "normal")
    await mark_running(db_session, job.id, worker_id="dead-worker")

    # Directly set heartbeat to 2 minutes ago
    stale_time = datetime.utcnow() - timedelta(seconds=120)
    await db_session.execute(
        update(Job).where(Job.id == job.id).values(heartbeat_at=stale_time)
    )
    await db_session.commit()

    reaped = await reap_zombie_jobs(db_session)
    assert reaped >= 1

    await db_session.refresh(job)
    assert job.status == "FAILED"
    assert "heartbeat timeout" in job.error

No sleeps. Direct state manipulation. Tests run in milliseconds.

Priority distribution — the test I showed interviewers:

async def test_priority_distribution(redis):
    picks = {"critical": 0, "high": 0, "normal": 0}
    for _ in range(10_000):
        queue = await pick_queue(redis)
        priority = queue.split(":")[1]
        picks[priority] += 1

    assert abs(picks["critical"] / 100 - 60) < 5
    assert abs(picks["high"] / 100 - 30) < 5
    assert abs(picks["normal"] / 100 - 10) < 5

Final Numbers

Metric	Value
Tests	47 passing
Coverage (overall)	85%
Coverage (services layer)	~92%
Sustained throughput	56 req/s at 50 users
P50 latency	220ms
P95 latency	640ms
Prometheus metrics	4 custom
API endpoints	9

The 56 req/s and 640ms P95 are on a dev server (uvicorn single worker). The DESIGN.md section on 10x scale calls for Gunicorn with multiple workers as the first step — a fair and honest constraint to document.

Lessons for the Next Build

Knowing where your design breaks is more impressive than pretending it's perfect. These aren't regrets — they're the honest constraints of what a single-node dev server build looks like, and what comes next.

Redis Cluster instead of single-node Redis — Redis is currently a single point of failure. At scale, the queue backend needs to survive node loss without losing PENDING state.
Gunicorn with multiple Uvicorn workers: the 56 req/s figure reflects the single-worker uvicorn dev server, not a limit of the design. Adding worker processes should raise throughput roughly in proportion until Redis or PostgreSQL becomes the bottleneck.
Sorted set for delayed re-enqueue: backoff delay currently uses asyncio.sleep inside the worker. That doesn't block the event loop, but it parks the worker's processing loop on one failed job for the whole delay instead of letting it claim new work. The production pattern is a sorted set where the score is the execute_at timestamp, with a separate scheduler polling for ready entries. The worker enqueues and moves on immediately.
Prometheus alerting rules — Grafana dashboards are reactive; you see a problem after it's already happening. Alerting rules on queue_depth, zombie_jobs_reaped_total, and DLQ depth make the system proactive.
Per-job-type timeouts — there's one global job_timeout_seconds = 30. A send_email job and a generate_report job have very different expected runtimes. The next version reads timeout configuration per job type from Redis, same pattern as the scheduling weights.
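The sorted-set pattern from the list above can be prototyped without Redis. In this sketch a heap plays the role of the sorted set (in Redis, the equivalent would presumably be ZADD to schedule and a score-range query to find ready entries); `DelayedQueue` and its methods are hypothetical names.

```python
import heapq


class DelayedQueue:
    """In-memory stand-in for a Redis sorted set scored by execute_at."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, str]] = []

    def schedule(self, job_id: str, execute_at: float) -> None:
        heapq.heappush(self._heap, (execute_at, job_id))

    def pop_ready(self, now: float) -> list[str]:
        """Remove and return every job whose execute_at has passed."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[1])
        return ready


delayed = DelayedQueue()
delayed.schedule("job-a", execute_at=105.0)  # failed at t=100, 5s backoff
delayed.schedule("job-b", execute_at=101.5)
delayed.schedule("job-c", execute_at=160.0)

ready_at_102 = delayed.pop_ready(now=102.0)
ready_at_110 = delayed.pop_ready(now=110.0)
```

The worker's only job on failure is the `schedule` call. A separate poller calls `pop_ready` on an interval and pushes the results back onto the priority streams.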

The Honest Takeaways

Building this surfaced things that configuring a framework never would:

depends_on doesn't mean ready — use condition: service_healthy or your app crashes on startup reaching a service that hasn't finished initializing
Bind mounts hide Dockerfile artifacts: move your venv outside the mounted directory (/opt/venv instead of /app/.venv), or the bind mount shadows it every time the container starts
Pytest event loops don't mix — set both asyncio_default_fixture_loop_scope AND asyncio_default_test_loop_scope to "session" or you get "Future attached to a different loop" errors that are genuinely hard to diagnose
Test your algorithms quantitatively — a 10,000-sample distribution test on the scheduler catches bias that type checks and unit assertions miss entirely
The Reaper is what makes it production-grade — without visibility timeout recovery and zombie detection, crashed workers silently kill jobs and give you no signal that work was lost

The whole stack — FastAPI, Redis Streams, PostgreSQL, Prometheus, Grafana, Docker Compose — runs on a laptop. No cloud account required to build and demo it.

The GitHub repo, DESIGN.md, and CONTRIBUTING.md are at https://github.com/macaulaypraise/async-job-queue-with-backpressure.git. If you have questions about the reaper logic, the backpressure band, or the worker integration tests, drop them in the comments.

Rate Limiting Wasn't Enough — So I Built an API Gateway with Behavioral Abuse Detection

Macaulay Praise — Thu, 09 Apr 2026 16:21:33 +0000

Real rate limiting, Bloom filters, credential stuffing detection, and the bugs that almost broke everything. Live demo included.

GitHub: macaulaypraise/api-gateway-with-abuse-detection
Live demo: api-gateway-with-abuse-detection.onrender.com/docs

As someone transitioning into backend engineering, I wanted to build something that went beyond tutorials. I didn't want a CRUD app. I wanted something that would teach me how real systems defend themselves — something I could point to in an interview and say: "I built this from scratch and I know exactly why every line exists."

That project became an API Gateway with Abuse Detection — a FastAPI service that sits in front of upstream backends and actively detects credential stuffing, scraping bots, and known-bad actors. Here's a technical breakdown of how it works, the decisions behind it, and the real bugs that nearly cost me my sanity.

What the System Does

Every request passes through a six-step middleware chain in this exact order:

1. RequestID      → UUID trace ID attached to every request
2. Auth           → JWT validation, client_id + role extracted
3. BloomFilter    → O(1) bad IP + bad user-agent check
4. RateLimit      → sliding window per authenticated client
5. AbuseDetector  → graduated response (throttle/block)
6. ShadowMode     → log would-be blocks before enforcement

Each middleware depends on the one before it. If the Bloom filter flags you, the rate limiter never runs. Fail fast, fail cheap.

The Core Components (And Why Each One Exists)

1. Sliding Window Rate Limiter

Fixed-window rate limiting has a well-known flaw: a client can send N requests at the end of window 1 and N more at the start of window 2. That is 2N requests in a burst far shorter than one window, without ever violating the per-window rule.

The sliding window eliminates this. Every request gets timestamped and stored in a Redis sorted set. On each new request:

Delete all entries older than the window
Count what remains
Allow or deny

The key word is atomic. If steps 1–3 aren't wrapped in a Lua script, a concurrent request can slip between the remove and the count, creating a race condition that lets clients exceed their limit.

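-- Executed atomically on the Redis server
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local member = ARGV[4]  -- unique per request, e.g. the request ID

-- Drop entries that have aged out of the window
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
local count = redis.call('ZCARD', key)

if count < limit then
    -- A unique member keeps two requests that share a timestamp
    -- from collapsing into a single sorted-set entry
    redis.call('ZADD', key, now, member)
    return 1  -- allowed
end
return 0  -- blocked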
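The same evict, count, decide sequence is easier to poke at in plain Python. This is an in-memory model for illustration only; `SlidingWindowLimiter` is a hypothetical class, and it has none of the atomicity guarantees the Lua script provides.

```python
class SlidingWindowLimiter:
    """In-memory model of the script: evict, count, then admit or reject."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self.hits: dict[str, list[float]] = {}

    def allow(self, client_id: str, now: float) -> bool:
        # Step 1: delete entries at or before now - window (ZREMRANGEBYSCORE)
        recent = [t for t in self.hits.get(client_id, []) if t > now - self.window]
        # Steps 2 and 3: count what remains (ZCARD), then allow or deny
        allowed = len(recent) < self.limit
        if allowed:
            recent.append(now)
        self.hits[client_id] = recent
        return allowed


limiter = SlidingWindowLimiter(limit=3, window=60.0)
burst = [limiter.allow("demo", now=59.0 + i * 0.1) for i in range(4)]
# Just past a fixed-window boundary the old hits still count
after_boundary = limiter.allow("demo", now=61.0)
# Once the earliest hit ages out, capacity returns
after_window = limiter.allow("demo", now=119.05)
```

The request at t=61 is the one a fixed-window limiter would have let through. Here it is rejected because the three hits from t=59 are still inside the trailing 60 seconds.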

Production verification: 150 parallel requests against the live Render deployment confirmed the enforcer is exact:

100 × 200 OK  ← exactly the rate limit
 50 × 429     ← every request over the limit rejected

Prometheus confirmed rate_limit_rejections_total{client_id="demo"} 200.0 after two parallel test runs. The client_id label proves the JWT identity is tracked, not the IP address — a crucial distinction for shared NATs and corporate networks.

2. Two-Dimensional Auth Failure Tracking

Credential stuffing is tracked on two axes simultaneously:

By IP: failed_auth:{ip} — one IP failing across many accounts
By username: failed_auth:{username} — many IPs targeting the same account

These are separate Redis keys with independent TTLs, configurable via environment variables:

AUTH_FAILURE_IP_THRESHOLD=10       # failures before IP soft-block
AUTH_FAILURE_USER_THRESHOLD=20     # failures before username soft-block
AUTH_FAILURE_WINDOW_SECONDS=300    # counter TTL

Keeping these counters independent means you can block a specific IP without locking out other clients logging in to the accounts it targeted, and flag a username as under attack without affecting unrelated clients.
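Here's a small model of the two axes. It's a sketch under assumptions: `FailureTracker` is a hypothetical class, and per-event timestamps approximate the Redis counter TTL rather than reproducing it exactly. The thresholds are the environment defaults listed above.

```python
from collections import defaultdict

IP_THRESHOLD = 10        # AUTH_FAILURE_IP_THRESHOLD
USER_THRESHOLD = 20      # AUTH_FAILURE_USER_THRESHOLD
WINDOW_SECONDS = 300     # AUTH_FAILURE_WINDOW_SECONDS


class FailureTracker:
    """In-memory model of the two counter families; Redis TTLs become timestamps."""

    def __init__(self) -> None:
        self.by_ip: dict[str, list[float]] = defaultdict(list)
        self.by_user: dict[str, list[float]] = defaultdict(list)

    def record_failure(self, ip: str, username: str, now: float) -> None:
        self.by_ip[ip].append(now)
        self.by_user[username].append(now)

    @staticmethod
    def _count(events: list[float], now: float) -> int:
        return sum(1 for t in events if now - t < WINDOW_SECONDS)

    def ip_blocked(self, ip: str, now: float) -> bool:
        return self._count(self.by_ip.get(ip, []), now) >= IP_THRESHOLD

    def user_blocked(self, username: str, now: float) -> bool:
        return self._count(self.by_user.get(username, []), now) >= USER_THRESHOLD


tracker = FailureTracker()
# One IP sprays ten different accounts with a single failed login each
for i in range(10):
    tracker.record_failure("203.0.113.7", f"user{i}", now=float(i))

ip_flagged = tracker.ip_blocked("203.0.113.7", now=10.0)
user_flagged = tracker.user_blocked("user0", now=10.0)
ip_after_ttl = tracker.ip_blocked("203.0.113.7", now=400.0)
```

The spraying IP trips its threshold while each targeted account, with a single failure, stays well under its own. Once the window passes, the IP recovers on its own.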

3. Scraping Detection via Request Timing Entropy

Humans generate requests with high temporal variance. Bots generate requests with suspiciously regular inter-request timing.

For each client, I maintain a sliding window of the last N timestamps in a Redis sorted set and compute the standard deviation of the inter-arrival gaps. A standard deviation below SCRAPING_ENTROPY_THRESHOLD (default 0.5) triggers a bot flag.

The elegant part: this doesn't care about request volume. A sophisticated bot that rate-limits itself to human speeds will still be caught if it's too regular. This pairs with user-agent fingerprinting (the second Bloom filter) to create a multi-signal detection approach.
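The regularity check boils down to a standard deviation over gaps. In this sketch, `looks_automated` and the `min_samples` guard are hypothetical, and I'm assuming the threshold is expressed in seconds, which the text doesn't state.

```python
from statistics import pstdev

SCRAPING_ENTROPY_THRESHOLD = 0.5  # default from the text; units assumed to be seconds


def looks_automated(timestamps: list[float], min_samples: int = 5) -> bool:
    """Flag a client whose inter-arrival gaps are suspiciously uniform."""
    if len(timestamps) < min_samples:
        return False  # too little evidence to judge
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return pstdev(gaps) < SCRAPING_ENTROPY_THRESHOLD


bot = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]       # one request every 2 s exactly
human = [0.0, 1.2, 9.5, 11.0, 30.4, 31.1]   # bursts and pauses

bot_flagged = looks_automated(bot)
human_flagged = looks_automated(human)
```

Both clients send six requests, and the bot is actually the slower of the two. Only the uniformity of its gaps gives it away.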

4. Dual Bloom Filters

Two in-memory Bloom filters, both synced from Redis every 60 seconds by a background worker:

known_bad_ips — screens every incoming IP at O(1) with no Redis round-trip
abusive_agents — user-agent fingerprinting for known scraper signatures

Configuration:

BLOOM_FILTER_CAPACITY=1000000  # expected entries
BLOOM_FILTER_ERROR_RATE=0.001  # 0.1% false positive rate

At a 0.1% false positive rate across 1 million IPs, an optimally sized filter needs about 14.4 bits per entry, or roughly 1.8 MB of memory per filter. The worst case is a legitimate IP being flagged, which shadow mode surfaces before enforcement is ever enabled.
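The memory figure can be sanity-checked with the textbook Bloom filter sizing formula, m = -n ln p / (ln 2)^2 bits with k = (m / n) ln 2 hash functions. This sketch covers the theoretical optimum only; the footprint of a specific library such as pybloom-live may differ somewhat.

```python
import math


def bloom_size(capacity, error_rate):
    """Optimal bit count and hash count for a classic Bloom filter."""
    bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    hashes = round(bits / capacity * math.log(2))
    return bits, hashes


# BLOOM_FILTER_CAPACITY and BLOOM_FILTER_ERROR_RATE from the config above
bits, hashes = bloom_size(1_000_000, 0.001)
megabytes = bits / 8 / 1_000_000
```

With the configuration above this comes to about 14.4 million bits (roughly 1.8 MB) and 10 hash functions per filter.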

Critical implementation detail: the filter must live on app.state.bloom and be shared across all requests. Per-request instantiation gives you a fresh empty filter on every call — zero enforcement, zero errors, 100% invisible failure. More on this in the bugs section.

5. Graduated Response System

Three states instead of a binary allow/block:

State	Behavior
`ALLOWED`	Request passes through normally
`THROTTLED`	Response delayed via `asyncio.sleep`, served with `Retry-After`
`SOFT_BLOCK`	Immediate 429 — Redis TTL, temporary, self-expiring

This matters because going straight to hard block means a legitimate client that briefly triggered a rule is permanently punished. The graduated approach lets real users recover automatically while truly malicious clients face escalating consequences.
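One way to picture the three states is as a small decision function. The thresholds, the `violations` input, and the `Retry-After` header on the 429 are assumptions made for this sketch; the article does not specify how escalation is scored.

```python
import math
from enum import Enum


class Decision(str, Enum):
    ALLOWED = "ALLOWED"
    THROTTLED = "THROTTLED"
    SOFT_BLOCK = "SOFT_BLOCK"


# Hypothetical escalation thresholds (rule hits inside the current window).
THROTTLE_AFTER = 1
SOFT_BLOCK_AFTER = 3


def decide(violations):
    """Escalate gradually instead of jumping straight to a block."""
    if violations >= SOFT_BLOCK_AFTER:
        return Decision.SOFT_BLOCK
    if violations >= THROTTLE_AFTER:
        return Decision.THROTTLED
    return Decision.ALLOWED


def response_plan(decision, throttle_delay=1.0, block_ttl=60):
    """Map a decision to (status_code, delay_seconds, headers)."""
    if decision is Decision.SOFT_BLOCK:
        # Immediate 429; the block itself expires with its TTL.
        return 429, 0.0, {"Retry-After": str(block_ttl)}
    if decision is Decision.THROTTLED:
        # The caller would await asyncio.sleep(delay) before responding.
        return 200, throttle_delay, {"Retry-After": str(math.ceil(throttle_delay))}
    return 200, 0.0, {}
```

Because the soft block lives behind a TTL, a client that stops misbehaving drops back to `ALLOWED` without any manual intervention.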

6. Shadow Mode — The Safety Net

Shadow mode is how you deploy new detection rules without blocking real users. When a request would trigger a rule, shadow mode logs the event to Redis with a 24-hour TTL instead of blocking. The request passes through normally.

What makes this interesting is the implementation: shadow mode is a runtime toggle, not a deploy-time config. It's controlled via a Redis key:

# Enable — observe but don't block
curl -X POST "$BASE/admin/shadow-mode?enabled=true" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Disable — start enforcing
curl -X POST "$BASE/admin/shadow-mode?enabled=false" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

The middleware reads config:shadow_mode_enabled from Redis on every request, falling back to the SHADOW_MODE_ENABLED environment variable if the key is absent. Toggle takes effect on the next request — no redeployment, no restart.
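The lookup order described above can be sketched as follows. The function names, the accepted truthy strings, and the "disabled" default when neither source is set are assumptions; `redis_value` stands for whatever a GET on the key returned.

```python
import os

SHADOW_KEY = "config:shadow_mode_enabled"


def _as_bool(value):
    if isinstance(value, bytes):       # Redis clients often return bytes
        value = value.decode()
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


def shadow_mode_enabled(redis_value, environ=os.environ):
    """Redis key wins when present; otherwise fall back to the environment."""
    if redis_value is not None:
        return _as_bool(redis_value)
    return _as_bool(environ.get("SHADOW_MODE_ENABLED", "false"))


def enforce(rule_triggered, shadow_enabled):
    """Return (blocked, log_shadow_event) for one request."""
    if not rule_triggered:
        return False, False
    if shadow_enabled:
        return False, True     # observe only: let it through, record the event
    return True, False         # enforcing: block
```

Reading the flag on every request costs one extra Redis GET, which is the price of being able to flip it without a restart.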

Database-Backed RBAC

The admin role system started as a simple ADMIN_USERNAMES environment variable. That approach has an obvious flaw: any user who registers with that exact username bypasses all admin checks.

The replacement: a UserRole enum (USER, ADMIN) stored in the users table, embedded in the JWT at login time.

# JWT payload at login
{"sub": username, "role": user.role}

The require_admin dependency reads the JWT role claim directly — no database query per request. To promote a user:

UPDATE users SET role = 'admin' WHERE username = 'target';

The user logs in again, receives a JWT with "role": "admin", and admin endpoints immediately become accessible. Their previous token expires in 30 minutes. No server restart required.

The Bugs That Actually Hurt

Bug 1: The Async Password Verification Trap

This one was subtle and genuinely dangerous. I had refactored verify_password to be an async function wrapping bcrypt's blocking checkpw in asyncio.to_thread() — which was correct. But I forgot to await it at the call site:

# 🚨 WRONG — coroutine object is always truthy
if verify_password(plain, hashed):
    # This branch ALWAYS executes
    ...

# ✅ CORRECT
if await verify_password(plain, hashed):
    ...

A coroutine object that's never awaited evaluates as truthy. Every password check passed, regardless of input. All authentication was silently bypassed. The auth endpoint returned a valid JWT for any password entered against any account.

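There were no exceptions and no failing tests unless a test specifically checked that a wrong password is rejected. The only trace is Python's "coroutine ... was never awaited" RuntimeWarning, which is easy to lose in server logs. The fix is trivial once you find it; finding it is the hard part.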
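The failure is easy to reproduce in isolation. This sketch swaps bcrypt for a SHA-256 comparison purely so it runs with the standard library; `login_buggy` and `login_fixed` are hypothetical names.

```python
import asyncio
import hashlib
import hmac


def _check(plain, hashed):
    # Stand-in for bcrypt.checkpw so the sketch has no third-party dependency.
    digest = hashlib.sha256(plain.encode()).hexdigest()
    return hmac.compare_digest(digest, hashed)


async def verify_password(plain, hashed):
    # Run the blocking check off the event loop.
    return await asyncio.to_thread(_check, plain, hashed)


async def login_buggy(plain, hashed):
    result = verify_password(plain, hashed)  # missing await: a coroutine object
    accepted = bool(result)                  # coroutine objects are always truthy
    result.close()                           # silence the warning in this demo only
    return accepted


async def login_fixed(plain, hashed):
    return await verify_password(plain, hashed)


stored = hashlib.sha256(b"correct horse").hexdigest()
buggy = asyncio.run(login_buggy("wrong password", stored))   # accepted
fixed = asyncio.run(login_fixed("wrong password", stored))   # rejected
```

A single test asserting that a wrong password is rejected would have caught this immediately.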

Bug 2: Bloom Filter Instantiated Per-Request

The block-ip admin route was creating a new BloomFilterService() inside the route handler, adding the IP to that instance, and returning. Meanwhile, the middleware's shared in-memory filter (on app.state.bloom) was never updated — until the 60-second background sync ran.

The result: a hard-blocked IP could keep making requests for up to 60 seconds before the block took effect. The fix was making admin routes update request.app.state.bloom directly:

# 🚨 WRONG — local instance, never seen by middleware
bloom = BloomFilterService()
bloom.add(ip)

# ✅ CORRECT — updates the shared middleware instance immediately
request.app.state.bloom.add(ip)

Bug 3: Static Admin Username Bypassed by Registration

The original ADMIN_USERNAMES config approach had a security hole: if the env var was set to "admin", anyone could register with username admin and gain admin access. Replaced entirely with the database-backed UserRole enum. The setting and its associated property were deleted from config.py.

Bug 4: Duplicate Alembic Migration Head

Running make makemigration twice without migrating in between creates two heads in the Alembic migration graph. The fix:

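alembic merge heads -m "merge heads"   # create a merge revision that joins both heads
alembic upgrade head                   # apply pending migrations, including the merge
# Only run `alembic stamp head` if the schema already matches: it records the revision without running migrations.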

Not a show-stopper, but something that will confuse you the first time you hit it.

Bug 5: Sequential curl Doesn't Test Rate Limiting

This one isn't a code bug — it's a test methodology bug that looks exactly like a code bug.

A rate limit of 100 requests per 60-second window means requests must arrive within the same 60-second window to count against each other. Against a remote host (Render's free tier adds ~500ms of latency per request), a sequential curl loop that opens a fresh connection for every call managed roughly one request per second, so 300 calls took about 5 minutes. At any point only ~60 requests sit inside the window, well under the limit. The limiter appears broken when it's working correctly.

# This will NOT trigger rate limiting against a remote host
for i in $(seq 1 300); do curl $BASE/gateway/proxy; done

# This will — all requests fire within the same window
for i in $(seq 1 150); do
  curl -s -o /dev/null -w "%{http_code}\n" \
    $BASE/gateway/proxy \
    -H "Authorization: Bearer $TOKEN" &
done | sort | uniq -c
# Output: 100 × 200, 50 × 429

Always use parallel requests when testing rate limiting against any remote deployment.
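The arithmetic behind this is worth making explicit. The helper names below are hypothetical; the inputs are the 100-per-60-seconds limit and the roughly one second per sequential call implied by 300 calls taking about 5 minutes.

```python
def sequential_peak(window_seconds, seconds_per_request):
    """Approximate number of sequential requests that fit in one window."""
    return int(window_seconds / seconds_per_request)


def can_trip_limit(limit, window_seconds, seconds_per_request):
    """A sequential client can only exceed the limit if enough calls fit."""
    return sequential_peak(window_seconds, seconds_per_request) > limit


remote_peak = sequential_peak(60, 1.0)          # about 60 requests in flight per window
remote_trips = can_trip_limit(100, 60, 1.0)     # never reaches 100
fast_trips = can_trip_limit(100, 60, 0.25)      # 240 per window would trip it
```

If the per-call time multiplied by the limit exceeds the window length, a sequential loop cannot trigger the limiter no matter how many calls it makes.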

Performance Numbers

From a 60-second Locust load test, 20 concurrent users (legitimate users, credential stuffers, and scrapers running simultaneously):

Metric	Result
Throughput	59 req/s sustained
Legitimate user failure rate	0%
Credential stuffing detection	Blocked within 10 attempts
P50 gateway latency	10ms
P99 gateway latency	440ms (includes throttle delay)
Shadow events logged in 60s	740

The P99 spike is intentional — throttled clients hit asyncio.sleep, which is where the latency comes from. Legitimate users sit at the P50 line throughout.

Test Coverage

67 tests, 93% coverage. The most important tests to get right:

test_sliding_window_blocks_boundary_spike — send N requests at end of window 1, N at start of window 2, assert total allowed is N not 2N
test_concurrent_duplicate_requests — asyncio.gather firing same endpoint 5 times simultaneously, assert no race condition in the counter
test_shadow_mode_does_not_block — enable shadow mode, send a would-be-blocked request, assert 200 returned and shadow log has an entry
test_credential_stuffing_detected — fail auth 10 times from same IP, assert 11th is blocked
test_require_admin_valid_admin and test_non_admin_cannot_access_admin_routes — RBAC enforcement
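The boundary-spike test is the one that distinguishes a sliding window from a fixed one. Below is a sketch of its shape against an in-memory sliding-window log; `SlidingWindowLimiter` is a hypothetical stand-in for the Redis sorted-set implementation, and the timestamps are injected so the test needs no sleeping.

```python
from collections import deque


class SlidingWindowLimiter:
    """In-memory sliding-window log (stand-in for the Redis version)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self._hits = deque()

    def allow(self, now):
        # Drop timestamps that have slid out of the window.
        while self._hits and self._hits[0] <= now - self.window:
            self._hits.popleft()
        if len(self._hits) >= self.limit:
            return False
        self._hits.append(now)
        return True


def test_sliding_window_blocks_boundary_spike():
    n = 100
    limiter = SlidingWindowLimiter(limit=n, window_seconds=60)
    # N requests in the last second of window 1 ...
    late = [limiter.allow(59.0 + i / 1000) for i in range(n)]
    # ... then N more in the first second of window 2.
    early = [limiter.allow(60.0 + i / 1000) for i in range(n)]
    # A fixed-window counter would allow 2N here; a sliding window allows N.
    assert sum(late) + sum(early) == n
```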

Integration tests run against real Redis and PostgreSQL via a separate docker-compose.test.yml. Test isolation uses TRUNCATE TABLE ... RESTART IDENTITY CASCADE per test, not drop_all/create_all — same isolation, far lower overhead.

Production Stack

Component	Technology
Web framework	FastAPI + Uvicorn
Rate limit state	Redis 7 (sorted sets + Lua scripts)
IP/agent filtering	Bloom filter (pybloom-live)
Auth	JWT (python-jose) + bcrypt (asyncio.to_thread)
Database	PostgreSQL 15 + SQLAlchemy async
Migrations	Alembic
Metrics	Prometheus
Logging	structlog (JSON output with request_id on every line)
Testing	pytest + pytest-asyncio + Locust
CI	GitHub Actions
Hosting	Render (app) + Upstash (Redis) + Supabase (PostgreSQL)

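Interview Talking Points Worth Owning

"Why Lua scripts in Redis?" — MULTI/EXEC runs its queued commands atomically, but it cannot branch on a value read inside the transaction. A read-then-decide sequence needs WATCH, which is optimistic: a concurrent write aborts the transaction and forces a retry. A Lua script runs atomically on the Redis server, so the read-increment-expire cycle cannot be observed in an intermediate state under concurrent load.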

"How do you handle a Redis outage?" — Fail open vs. fail closed is a business decision. A bank fails closed — block everything if rate limit state is unavailable. A media site fails open — serve traffic and accept the abuse risk. Expose it as a config flag.

"What about shared IPs and NATs?" — IP alone is a weak identifier. The system layers it with JWT client_id. IP rate limiting catches unauthenticated abuse; user-level limiting catches authenticated abuse. Both are needed, neither is sufficient alone.

"How does the Bloom filter help performance?" — Without it, every request does a Redis SISMEMBER call — a network round-trip. The Bloom filter checks the same list from process memory in microseconds. At 0.1% false positive rate, 1 in 1000 legitimate IPs might be flagged — which shadow mode surfaces before enforcement is enabled.

"What would you change at 10x scale?" — Move to Redis Cluster to eliminate the single point of failure. Load detection rules from Redis at runtime instead of config at deploy time. Add ML anomaly detection as a second signal layer. Per-datacenter rate limiting with global sync.

What I'd Do Differently

The most valuable lesson wasn't any individual component — it was build order. The pattern that worked: environment → infrastructure → config → database models → core clients → services → API layer → workers. Never jumping a stage. A broken Redis client makes every rate limiter test confusing. A broken DB session makes every auth test unreliable.

The second lesson: cross-check against your spec after you think you're done. The graduated response system, user-agent fingerprinting, and several Prometheus metrics were all missing from my "complete" implementation until I ran a systematic audit.

Try It

The live demo is running at api-gateway-with-abuse-detection.onrender.com/docs. Register a user, grab a JWT, hit the gateway endpoint 110 times in parallel, and watch the 429s start. Shadow stats accumulate at /admin/shadow-stats if you have an admin token.

Source, DESIGN.md, and load test scenarios: github.com/macaulaypraise/api-gateway-with-abuse-detection

Tags: python fastapi redis security webdev

The Dual-Write Problem: Why Your Payment API Is One Crash Away From Silent Data Loss

Macaulay Praise — Tue, 17 Mar 2026 11:37:44 +0000

You commit a payment to your database. Then you publish an event to Kafka so downstream services can settle it. Both succeed — until one day the process crashes in the 3 milliseconds between those two operations.

The database says the payment happened. Kafka never heard about it. The settlement worker never ran. The customer was charged and nothing moved.

That's the dual-write problem. This post explains why it's unsolvable with the obvious approaches, and how the Outbox pattern fixes it properly — using an implementation I built and load-tested to 1,000 concurrent users with zero duplicate charges.

Why the Obvious Solutions Don't Work

"Just publish to Kafka first, then write to the DB."

Same problem, reversed. The event fires but the payment row never gets written. Your downstream consumers process a payment that your database has no record of.

"Use a transaction that wraps both."

You can't. A database transaction and a Kafka publish are two entirely separate systems. PostgreSQL has no knowledge of Kafka. There is no COMMIT that covers both. The moment you step outside your DB transaction to call producer.send(), you're in crash territory.

"Use Two-Phase Commit (2PC)."

Kafka doesn't support it. And even in systems where both sides support 2PC, you're introducing a coordinator as a single point of failure with significantly higher latency. This is why 2PC has largely been abandoned in modern distributed systems in favour of patterns like the Outbox.

The Crash Window Nobody Talks About

Here's the exact sequence that fails silently:

1. BEGIN transaction
2. INSERT INTO payments (status = 'PENDING')   ← DB write
3. COMMIT                                       ← success
4.                                              ← 💥 process crashes here
5. producer.send('payment.initiated', ...)      ← never reached

Step 4 is real. Network blips, OOM kills, deploys: any of these can strike between steps 3 and 5. The window is tiny, but at scale something eventually lands in it.

The Outbox Pattern

The fix is to stop writing to two systems. Write to one.

Instead of publishing directly to Kafka, you write the event as a row in an outbox_events table — inside the same database transaction as the payment row. A separate background poller reads from that table and publishes to Kafka.

1. BEGIN transaction
2. INSERT INTO payments (status = 'PENDING')
3. INSERT INTO outbox_events (event = 'payment.initiated', published_at = NULL)
4. COMMIT                                    ← both rows land atomically

Now the Kafka publish is handled by the poller:

OUTBOX POLLER  →  SELECT * FROM outbox_events WHERE published_at IS NULL
               →  producer.send(event)
               →  UPDATE outbox_events SET published_at = NOW()

If the poller crashes after publishing but before marking the row, it simply replays on restart — Kafka receives a duplicate, which you handle with a deterministic event ID (more on this below). The payment row is never orphaned because the event was committed to the database first.
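Both halves of the pattern fit in a short sketch. This uses SQLite and a plain callable in place of PostgreSQL and a Kafka producer so it runs anywhere; the schema and the `create_payment` and `poll_once` names are illustrative assumptions, not the project's code.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (id TEXT PRIMARY KEY, amount INTEGER, status TEXT);
    CREATE TABLE outbox_events (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        topic TEXT, payload TEXT, published_at TEXT
    );
""")


def create_payment(conn, payment_id, amount):
    # One transaction: both rows commit together or neither does.
    with conn:
        conn.execute(
            "INSERT INTO payments (id, amount, status) VALUES (?, ?, 'PENDING')",
            (payment_id, amount),
        )
        conn.execute(
            "INSERT INTO outbox_events (topic, payload) VALUES (?, ?)",
            ("payment.initiated", json.dumps({"payment_id": payment_id})),
        )


def poll_once(conn, send):
    """Publish unpublished events, marking each row only after send() returns."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox_events "
        "WHERE published_at IS NULL ORDER BY id"
    ).fetchall()
    for event_id, topic, payload in rows:
        send(topic, payload)  # if this raises, the row stays unpublished
        with conn:
            conn.execute(
                "UPDATE outbox_events SET published_at = datetime('now') WHERE id = ?",
                (event_id,),
            )
    return len(rows)
```

A crash between `send` and the UPDATE leaves the row unpublished, so the next poll sends it again: at-least-once delivery, never zero.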

The full flow in my implementation looks like this:

CLIENT  →  POST /payments  +  Idempotency-Key: <uuid>
                │
                ▼
        ┌─ Redis cache check ──── HIT → return stored response (no DB touch)
        ├─ Distributed lock ───── prevents concurrent duplicate requests
        ├─ DB transaction ──────── Payment row + OutboxEvent row (atomic)
        └─ Cache response, release lock → 202 Accepted

OUTBOX POLLER  →  polls outbox_events WHERE published_at IS NULL  →  Kafka

KAFKA  →  SETTLEMENT WORKER
           ├─ PENDING → PROCESSING → SETTLED / FAILED
           ├─ Exponential backoff, max 5 retries
           └─ Dead Letter Queue on exhaustion
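The retry branch of the settlement worker can be sketched as below. The base delay, the cap, and reading "max 5 retries" as one initial attempt plus five retries are assumptions; `settle` and `dead_letter` are hypothetical callables supplied by the caller.

```python
import time

MAX_RETRIES = 5  # from the flow above


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Delay before retry number `attempt` (1-based): base * 2^(attempt-1), capped."""
    return min(cap, base * 2 ** (attempt - 1))


def settle_with_retries(event, settle, dead_letter, sleep=time.sleep):
    """Try settle(event); once retries are exhausted, hand the event to the DLQ."""
    for attempt in range(MAX_RETRIES + 1):  # first try + MAX_RETRIES retries
        try:
            settle(event)
            return "SETTLED"
        except Exception:
            if attempt == MAX_RETRIES:
                break
            sleep(backoff_delay(attempt + 1))
    dead_letter(event)
    return "FAILED"
```

Passing `sleep` as a parameter keeps the function testable without waiting through the real delays.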

Handling the At-Least-Once Delivery Problem

The outbox poller delivers at-least-once to Kafka — meaning duplicate events are possible on replay. The settlement worker handles this with deterministic UUID5 event IDs:

import uuid

# Stable ID derived from the record's Kafka coordinates
event_id = uuid.uuid5(
    uuid.NAMESPACE_URL,
    f"{topic}:{partition}:{offset}",
)

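The same topic:partition:offset always produces the same UUID. When the worker re-reads a record it has already handled (for example after a crash before the offset commit), the deduplication check finds the event ID in processed_events and skips the work. No double processing, no complex coordination. One caveat: a row the poller publishes twice lands at two different offsets, so that case needs an ID derived from the outbox row itself or a settlement step that is idempotent on the payment's state.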
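A small sketch of the check itself, using an in-memory set where the real system would use the processed_events table. `event_id_for` and `handle` are hypothetical names.

```python
import uuid

processed_events = set()  # stand-in for the processed_events table


def event_id_for(topic, partition, offset):
    """Same coordinates in, same UUID out."""
    return uuid.uuid5(uuid.NAMESPACE_URL, f"{topic}:{partition}:{offset}")


def handle(topic, partition, offset, settle):
    """Run settle() at most once per Kafka record; return True if it ran."""
    event_id = event_id_for(topic, partition, offset)
    if event_id in processed_events:
        return False              # redelivery of a record already handled
    settle()
    processed_events.add(event_id)
    return True
```

In a real worker the insert into processed_events and the settlement write would share one database transaction, so a crash cannot record one without the other.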

Does It Actually Work?

I ran two load test scenarios with Locust against a single Docker container:

Scenario	Concurrent Users	Total Requests	Duplicate Charges
Normal load	50	1,378	0
Stress test	1,000	12,746	0

Correctness held at 0% duplicate charges through both. The stress test did see a 0.4% error rate at 1,000 users, but that was connection pool exhaustion, not an idempotency failure. Every retry with the same idempotency key returned the identical payment_id.
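The property being tested, same key in, same payment_id out, can be sketched with in-process stand-ins for the Redis cache and the distributed lock. Everything here (`submit_payment`, the dict cache, the per-key `threading.Lock`) is illustrative, not the project's implementation.

```python
import threading
import uuid

_cache = {}                       # stand-in for the Redis response cache
_locks = {}                       # stand-in for per-key distributed locks
_locks_guard = threading.Lock()


def _lock_for(key):
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())


def submit_payment(idempotency_key, amount):
    cached = _cache.get(idempotency_key)
    if cached is not None:
        return cached                          # cache hit: no DB touch
    with _lock_for(idempotency_key):
        cached = _cache.get(idempotency_key)   # re-check after acquiring the lock
        if cached is not None:
            return cached
        response = {
            "payment_id": str(uuid.uuid4()),
            "amount": amount,
            "status": "PENDING",
        }
        # The Payment + OutboxEvent transaction would run here.
        _cache[idempotency_key] = response
        return response
```

The second cache check inside the lock is what stops two concurrent requests with the same key from both creating a payment.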

What the Outbox Pattern Trades Off

Nothing is free. The outbox poller introduces a small delay — typically 1–5 seconds — between a payment being committed and its event reaching Kafka. For most use cases this is acceptable. For real-time fraud scoring that needs to act on the event immediately, it isn't, and you'd need a different approach.

The poller also needs to be a reliable background process. If it stops running silently, your outbox table grows and events stall. Monitoring queue depth is not optional.

The One-Sentence Summary

The Outbox pattern solves the dual-write problem by making the event a database record first and delegating the Kafka publish to a separate, restartable poller, so you never have to write to two systems atomically: you write to one.

Full source code, DESIGN.md, and load test results: https://github.com/macaulaypraise/idempotent-payment-processing-system.git

Stack: Python 3.12 · FastAPI · PostgreSQL 15 · Redis 7 · Kafka · SQLAlchemy (async) · Docker Compose