Forem: Temitope

Building a Fault-Tolerant Job Queue: Node.js Producers, Elixir/OTP Consumers

Temitope — Mon, 11 May 2026 10:39:22 +0000

The pitch for OTP is always the same: "let it crash," nine nines of uptime, Erlang running phone switches. That's all true and also completely useless when you're staring at a blank mix new project wondering how to actually structure the thing.

This tutorial skips the theory tour. We build something real: a distributed job processing system where a Node.js API enqueues work into Redis, and an Elixir/OTP application consumes it — with a supervision tree that keeps the whole thing running when individual workers die, when Redis blips, and when a job payload is malformed.

By the end you'll have:

A Node.js producer API with Redis streams (not just lists — we want consumer groups)
An Elixir Application with a proper OTP supervision tree
A QueueConsumer GenServer that polls Redis and dispatches work
A WorkerSupervisor (DynamicSupervisor) that spawns and monitors per-job workers
A JobWorker GenServer that processes a job, retries on failure, and dead-letters after max attempts
A Telemetry integration so you can see what's actually happening

No Oban, no Exq. We're building the layer below so you understand what those libraries are doing.

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│  Node.js Producer API                                   │
│  POST /jobs  →  Redis XADD  →  Stream: "jobs:work"     │
└─────────────────────────────────────────────────────────┘
                          │ Redis Streams
                          ▼
┌─────────────────────────────────────────────────────────┐
│  Elixir OTP Application                                 │
│                                                         │
│  Application (supervisor)                               │
│  ├── RedisPool (Redix connections)                      │
│  ├── QueueConsumer (GenServer — polls + dispatches)     │
│  └── WorkerSupervisor (DynamicSupervisor)               │
│       ├── JobWorker<job_id_1> (GenServer)               │
│       ├── JobWorker<job_id_2> (GenServer)               │
│       └── JobWorker<job_id_n> (GenServer)               │
└─────────────────────────────────────────────────────────┘

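Redis Streams give us persistent, consumer-group-aware queuing. A job isn't acknowledged until the worker finishes it. Crash the worker mid-job and the entry stays in the group's pending entry list (PEL); Redis does not redeliver it on its own, so the consumer reclaims it with XAUTOCLAIM once it has been idle long enough.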

Part 1: The Node.js Producer

Project Setup

mkdir job-producer && cd job-producer
npm init -y
npm install express ioredis ulid zod

Redis Stream Producer

We use XADD to append jobs to a Redis stream. Unlike LPUSH/RPUSH, streams give us:

Persistent, ordered log of all jobs (not consumed on read)
Consumer groups (multiple consumers, each gets different jobs)
Built-in pending entry list (PEL) — unacknowledged jobs are trackable

// src/redis.js
const Redis = require('ioredis');

const redis = new Redis({
  host: process.env.REDIS_HOST || 'localhost',
  port: parseInt(process.env.REDIS_PORT || '6379', 10),
  maxRetriesPerRequest: 3,
  retryStrategy: (times) => Math.min(times * 50, 2000),
  lazyConnect: false,
});

redis.on('error',  (err) => console.error('[redis] error:', err.message));
redis.on('connect', ()   => console.log('[redis] connected'));

module.exports = redis;

// src/jobs.js
const { ulid }  = require('ulid');
const redis     = require('./redis');
const { z }     = require('zod');

const STREAM_KEY    = 'jobs:work';
const MAX_LEN       = 10_000;  // cap stream length, trim old entries

// Job schema — validate before enqueuing
const JobSchema = z.object({
  type:    z.enum(['email', 'report', 'webhook', 'thumbnail']),
  payload: z.record(z.string(), z.unknown()),  // two-argument form works on Zod 3 and Zod 4
  priority: z.number().int().min(1).max(10).default(5),
});

async function enqueueJob(rawInput) {
  const parsed = JobSchema.parse(rawInput);  // throws ZodError if invalid

  const job = {
    id:         ulid(),           // sortable, unique job ID
    type:       parsed.type,
    payload:    JSON.stringify(parsed.payload),
    priority:   String(parsed.priority),
    enqueued_at: new Date().toISOString(),
    attempts:   '0',
  };

  // XADD <stream> MAXLEN ~ 10000 * field value field value ...
  // '*' tells Redis to auto-generate the stream entry ID
  const entryId = await redis.xadd(
    STREAM_KEY,
    'MAXLEN', '~', String(MAX_LEN),
    '*',                          // auto-ID
    ...Object.entries(job).flat() // field-value pairs
  );

  console.log(`[jobs] enqueued ${job.type} job=${job.id} entry=${entryId}`);
  return { jobId: job.id, streamEntryId: entryId };
}

async function getJobStats() {
  const [length, groups] = await Promise.all([
    redis.xlen(STREAM_KEY),
    redis.xinfo('GROUPS', STREAM_KEY).catch(() => []),
  ]);

  return { stream: STREAM_KEY, length, consumerGroups: groups };
}

module.exports = { enqueueJob, getJobStats };
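
To make the XADD argument list concrete, here is a minimal sketch of what the spread in enqueueJob produces. The helper name buildXaddArgs is hypothetical and not part of the producer; it only shows the argument order, and it assumes every job value is already a string, because flat() would also spread any array-valued field.

```javascript
// Hypothetical helper (not in src/jobs.js): build the positional arguments
// that enqueueJob passes to redis.xadd.
function buildXaddArgs(streamKey, maxLen, job) {
  // Object.entries gives [[field, value], ...]; flat() merges one level,
  // yielding the field/value pairs XADD expects.
  const fieldValues = Object.entries(job).flat();
  return [streamKey, 'MAXLEN', '~', String(maxLen), '*', ...fieldValues];
}

const exampleJob = { id: 'job-1', type: 'email', attempts: '0' };
const xaddArgs = buildXaddArgs('jobs:work', 10000, exampleJob);
console.log(xaddArgs.join(' '));
// jobs:work MAXLEN ~ 10000 * id job-1 type email attempts 0
```

Because streams store flat field/value pairs, nested data such as payload has to be serialized first, which is why enqueueJob runs it through JSON.stringify.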

The API

// src/index.js
const express        = require('express');
const { enqueueJob, getJobStats } = require('./jobs');

const app  = express();
app.use(express.json());

// POST /jobs  — enqueue a new job
app.post('/jobs', async (req, res) => {
  try {
    const result = await enqueueJob(req.body);
    res.status(202).json({ status: 'accepted', ...result });
  } catch (err) {
    if (err.name === 'ZodError') {
      return res.status(422).json({ error: 'Invalid job', issues: err.issues });
    }
    console.error('[api] enqueue error:', err);
    res.status(500).json({ error: 'Internal error' });
  }
});

// GET /jobs/stats  — queue depth and consumer group info
app.get('/jobs/stats', async (req, res) => {
  const stats = await getJobStats();
  res.json(stats);
});

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => console.log(`[api] listening on :${PORT}`));

Test it:

node src/index.js &

curl -s -X POST http://localhost:3000/jobs \
  -H 'Content-Type: application/json' \
  -d '{"type":"email","payload":{"to":"user@example.com","template":"welcome"}}' \
  | jq .

# => { "status": "accepted", "jobId": "01HXYZ...", "streamEntryId": "1699...-0" }

Part 2: The Elixir/OTP Consumer

Project Setup

mix new job_consumer --sup   # --sup scaffolds an Application module
cd job_consumer

The --sup flag is important — it generates a JobConsumer.Application module with a supervision tree stub. We'll fill that in.

# mix.exs
defp deps do
  [
    {:redix, "~> 1.4"},
    {:poolboy, "~> 1.5"},
    {:jason, "~> 1.4"},
    {:telemetry, "~> 1.2"},
    {:telemetry_metrics, "~> 0.6"},
  ]
end

mix deps.get

Configuration

# config/config.exs
import Config

config :job_consumer,
  redis_host:       System.get_env("REDIS_HOST", "localhost"),
  redis_port:       String.to_integer(System.get_env("REDIS_PORT", "6379")),
  stream_key:       System.get_env("STREAM_KEY", "jobs:work"),
  consumer_group:   System.get_env("CONSUMER_GROUP", "elixir-workers"),
  consumer_name:    System.get_env("CONSUMER_NAME", "consumer-#{:inet.gethostname() |> elem(1)}"),
  max_concurrency:  String.to_integer(System.get_env("MAX_CONCURRENCY", "10")),
  poll_interval_ms: String.to_integer(System.get_env("POLL_INTERVAL_MS", "100")),
  max_attempts:     String.to_integer(System.get_env("MAX_ATTEMPTS", "3"))

The Supervision Tree

This is the heart of the OTP design. Get this right and everything else is pluggable.

# lib/job_consumer/application.ex
defmodule JobConsumer.Application do
  use Application
  require Logger

  @impl true
  def start(_type, _args) do
    config = Application.get_all_env(:job_consumer)

    children = [
      # 1. Redis connection pool: must start before anything that uses Redis
      {JobConsumer.RedisPool, config},

      # 2. Telemetry: attach handlers before any job events are emitted
      JobConsumer.Telemetry,

      # 3. Dynamic supervisor: spawns/monitors per-job worker processes.
      #    Must be up before the QueueConsumer, whose first poll calls it.
      {JobConsumer.WorkerSupervisor, config},

      # 4. Queue consumer: polls Redis, dispatches to WorkerSupervisor.
      #    Started last because children start in list order.
      {JobConsumer.QueueConsumer, config},
    ]

    # :one_for_one — if one child crashes, only restart that child
    # This is correct here: a crashed QueueConsumer shouldn't kill the WorkerSupervisor
    # and vice versa. The RedisPool restart policy handles reconnection.
    opts = [
      strategy:  :one_for_one,
      name:      JobConsumer.Supervisor,
      max_restarts: 10,
      max_seconds:  60,
    ]

    Logger.info("[app] starting supervision tree")
    Supervisor.start_link(children, opts)
  end
end

A note on strategy choice: :one_for_one is right here because our children are loosely coupled — the QueueConsumer and WorkerSupervisor don't share state. If we had children where a crash in one makes the others' state invalid, we'd use :one_for_all (restart everyone) or :rest_for_one (restart the crashed child and all children started after it).
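
The difference between the three strategies is easier to see as a toy model. The sketch below is not how the BEAM implements supervision; childrenToRestart is a hypothetical function and the child order is illustrative. It only answers one question: given a start order and a crashed child, which children get restarted?

```javascript
// Toy model of OTP restart strategies (illustration only).
// `children` is the start order; `crashed` is the child that died.
function childrenToRestart(strategy, children, crashed) {
  const index = children.indexOf(crashed);
  if (index === -1) throw new Error(`unknown child: ${crashed}`);

  switch (strategy) {
    case 'one_for_one':
      return [crashed];               // only the crashed child
    case 'one_for_all':
      return [...children];           // everyone
    case 'rest_for_one':
      return children.slice(index);   // crashed child and those started after it
    default:
      throw new Error(`unknown strategy: ${strategy}`);
  }
}

const tree = ['RedisPool', 'QueueConsumer', 'WorkerSupervisor'];
for (const strategy of ['one_for_one', 'one_for_all', 'rest_for_one']) {
  console.log(strategy, childrenToRestart(strategy, tree, 'QueueConsumer'));
}
```

Start order only affects the result under :rest_for_one, which is one reason to list children in dependency order even when you use :one_for_one.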

The Redis Connection Pool

We use Poolboy to maintain a pool of Redix connections. One connection handles one command at a time; pooling gives us concurrency.

# lib/job_consumer/redis_pool.ex
defmodule JobConsumer.RedisPool do
  @pool_name :redix_pool

  def child_spec(config) do
    pool_opts = [
      name:          {:local, @pool_name},
      worker_module: JobConsumer.RedisWorker,
      size:          10,    # idle connections
      max_overflow:  5,     # burst connections
    ]

    redis_opts = [
      host: config[:redis_host],
      port: config[:redis_port],
    ]

    :poolboy.child_spec(@pool_name, pool_opts, redis_opts)
  end

  # Execute a Redis command, borrowing a worker from the pool.
  # `worker` is a JobConsumer.RedisWorker pid, not a Redix connection,
  # so we go through GenServer.call instead of calling Redix on it directly.
  def command(cmd) do
    :poolboy.transaction(@pool_name, fn worker ->
      GenServer.call(worker, {:command, cmd})
    end)
  end

  # Pipeline multiple commands in one round trip
  def pipeline(cmds) do
    :poolboy.transaction(@pool_name, fn worker ->
      GenServer.call(worker, {:pipeline, cmds})
    end)
  end
end

# lib/job_consumer/redis_worker.ex
defmodule JobConsumer.RedisWorker do
  use GenServer

  def start_link(redis_opts) do
    GenServer.start_link(__MODULE__, redis_opts)
  end

  @impl true
  def init(opts) do
    host = Keyword.get(opts, :host, "localhost")
    port = Keyword.get(opts, :port, 6379)

    case Redix.start_link(host: host, port: port) do
      {:ok, conn}      -> {:ok, conn}
      {:error, reason} -> {:stop, reason}
    end
  end

  # Forward commands and pipelines to the wrapped Redix connection
  @impl true
  def handle_call({:command, cmd}, _from, conn) do
    {:reply, Redix.command(conn, cmd), conn}
  end

  def handle_call({:pipeline, cmds}, _from, conn) do
    {:reply, Redix.pipeline(conn, cmds), conn}
  end
end

Ensuring the Consumer Group Exists

Redis consumer groups must be created before XREADGROUP can be called. We do this lazily in the QueueConsumer init:

# lib/job_consumer/stream.ex
defmodule JobConsumer.Stream do
  require Logger

  @doc """
  Ensure the consumer group exists on the stream.
  XGROUP CREATE with MKSTREAM creates the stream if it doesn't exist yet.
  '$' means 'start from new messages only' (use '0' to reprocess all).
  """
  def ensure_consumer_group!(stream_key, group_name) do
    case JobConsumer.RedisPool.command(
      ["XGROUP", "CREATE", stream_key, group_name, "$", "MKSTREAM"]
    ) do
      {:ok, "OK"} ->
        Logger.info("[stream] created consumer group '#{group_name}' on '#{stream_key}'")

      {:error, %Redix.Error{message: "BUSYGROUP" <> _}} ->
        # Group already exists — this is fine, not an error
        :ok

      {:error, reason} ->
        raise "Failed to create consumer group: #{inspect(reason)}"
    end
  end

  @doc """
  Read up to `count` new messages from the stream via consumer group.
  '>' means 'give me messages not yet delivered to any consumer'.
  """
  def read_new(stream_key, group_name, consumer_name, count \\ 10) do
    JobConsumer.RedisPool.command([
      "XREADGROUP",
      "GROUP",    group_name,
      consumer_name,
      "COUNT",    Integer.to_string(count),
      "BLOCK",    "1000", # wait up to 1s for messages; 0 would block forever and outlast the 5s default call timeout
      "STREAMS",  stream_key,
      ">"         # deliver only undelivered messages
    ])
  end

  @doc """
  Re-claim messages that have been pending (delivered but not acknowledged)
  for longer than `min_idle_ms`. Used for crash recovery.
  """
  def reclaim_stale(stream_key, group_name, consumer_name, min_idle_ms \\ 30_000) do
    JobConsumer.RedisPool.command([
      "XAUTOCLAIM",
      stream_key,
      group_name,
      consumer_name,
      Integer.to_string(min_idle_ms),
      "0-0",      # start from beginning of PEL
      "COUNT", "100"
    ])
  end

  @doc """
  Acknowledge a message — removes it from the Pending Entry List.
  Call this only after successful processing.
  """
  def ack(stream_key, group_name, entry_id) do
    JobConsumer.RedisPool.command(["XACK", stream_key, group_name, entry_id])
  end
end
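
The reply that XREADGROUP sends back is a set of nested arrays, and it helps to see that shape before reading the consumer. The sketch below assumes the RESP2 layout [[streamKey, [[entryId, [field, value, ...]], ...]]] and a null reply when BLOCK times out. parseStreamReply is a hypothetical name, and the sample reply is hand-written rather than captured from a server.

```javascript
// Hypothetical parser for an XREADGROUP reply (assumed RESP2 shape):
// [[streamKey, [[entryId, [field, value, ...]], ...]]]
function parseStreamReply(reply) {
  if (reply === null) return []; // BLOCK timed out with no new entries

  const jobs = [];
  for (const [streamKey, entries] of reply) {
    for (const [entryId, fields] of entries) {
      const job = { stream: streamKey, streamEntryId: entryId };
      // Fields arrive as a flat list: pair them up two at a time.
      for (let i = 0; i < fields.length; i += 2) {
        job[fields[i]] = fields[i + 1];
      }
      jobs.push(job);
    }
  }
  return jobs;
}

// Hand-written sample, not real server output.
const sampleReply = [
  ['jobs:work', [
    ['1700000000000-0', ['id', 'job-1', 'type', 'email', 'attempts', '0']],
  ]],
];
console.log(parseStreamReply(sampleReply));
```

Every level is an array (a list on the Elixir side), never a tuple, and every value is a string, which is why numeric fields such as attempts need an explicit conversion after parsing.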

The QueueConsumer GenServer

This is the poller. It wakes up, reads a batch of jobs from Redis, spawns a JobWorker for each via the WorkerSupervisor, and loops.

# lib/job_consumer/queue_consumer.ex
defmodule JobConsumer.QueueConsumer do
  use GenServer
  require Logger

  alias JobConsumer.{Stream, WorkerSupervisor}

  @reclaim_interval_ms 30_000  # check for stale pending entries every 30s

  # ── Public API ─────────────────────────────────────────────────────────────

  def start_link(config) do
    GenServer.start_link(__MODULE__, config, name: __MODULE__)
  end

  def status do
    GenServer.call(__MODULE__, :status)
  end

  # ── GenServer Callbacks ────────────────────────────────────────────────────

  @impl true
  def init(config) do
    stream_key     = config[:stream_key]
    group_name     = config[:consumer_group]
    consumer_name  = config[:consumer_name]
    poll_interval  = config[:poll_interval_ms]
    max_concurrent = config[:max_concurrency]

    # Ensure the consumer group exists before we start polling
    Stream.ensure_consumer_group!(stream_key, group_name)

    state = %{
      stream_key:     stream_key,
      group_name:     group_name,
      consumer_name:  consumer_name,
      poll_interval:  poll_interval,
      max_concurrent: max_concurrent,
      dispatched:     0,
      errors:         0,
    }

    # Schedule first poll immediately, then reclaim loop
    send(self(), :poll)
    Process.send_after(self(), :reclaim_stale, @reclaim_interval_ms)

    Logger.info("[consumer] started — group=#{group_name} consumer=#{consumer_name}")
    {:ok, state}
  end

  @impl true
  def handle_info(:poll, state) do
    # Backpressure: don't read more jobs than we can handle concurrently
    active_workers = WorkerSupervisor.active_count()

    new_state =
      if active_workers >= state.max_concurrent do
        Logger.debug("[consumer] at capacity (#{active_workers}/#{state.max_concurrent}), skipping poll")
        state
      else
        read_and_dispatch(state)
      end

    # Schedule next poll
    Process.send_after(self(), :poll, state.poll_interval)
    {:noreply, new_state}
  end

  @impl true
  def handle_info(:reclaim_stale, state) do
    new_state =
      case Stream.reclaim_stale(state.stream_key, state.group_name, state.consumer_name) do
        # Redis 7 appends a third element (deleted IDs); Redis 6.2 returns only two
        {:ok, [_next_id, entries | _]} when entries != [] ->
          Logger.warning("[consumer] reclaimed #{length(entries)} stale entries")
          dispatch_entries(entries, state)

        {:ok, _} ->
          state

        {:error, reason} ->
          Logger.error("[consumer] reclaim failed: #{inspect(reason)}")
          state
      end

    Process.send_after(self(), :reclaim_stale, @reclaim_interval_ms)
    {:noreply, new_state}
  end

  @impl true
  def handle_call(:status, _from, state) do
    {:reply, Map.take(state, [:dispatched, :errors, :max_concurrent]), state}
  end

  # ── Private ────────────────────────────────────────────────────────────────

  defp read_and_dispatch(state) do
    case Stream.read_new(
      state.stream_key,
      state.group_name,
      state.consumer_name,
      state.max_concurrent
    ) do
      {:ok, [[_stream_key, entries]]} ->
        dispatch_entries(entries, state)

      {:ok, nil} ->
        # Timeout with no messages — normal
        state

      {:error, reason} ->
        Logger.error("[consumer] read error: #{inspect(reason)}")
        %{state | errors: state.errors + 1}
    end
  end

  defp dispatch_entries(entries, state) do
    # Redix returns each stream entry as a two-element list, not a tuple
    Enum.reduce(entries, state, fn [entry_id, fields], acc ->
      job = parse_job(entry_id, fields)

      case WorkerSupervisor.start_worker(job) do
        {:ok, _pid} ->
          Logger.debug("[consumer] dispatched job=#{job.id} entry=#{entry_id}")
          %{acc | dispatched: acc.dispatched + 1}

        {:error, reason} ->
          Logger.error("[consumer] dispatch failed job=#{job.id}: #{inspect(reason)}")
          %{acc | errors: acc.errors + 1}
      end
    end)
  end

  defp parse_job(entry_id, fields) do
    field_map = Enum.chunk_every(fields, 2)
                |> Enum.into(%{}, fn [k, v] -> {k, v} end)

    %{
      stream_entry_id: entry_id,
      id:              field_map["id"],
      type:            field_map["type"],
      payload:         Jason.decode!(field_map["payload"]),
      priority:        String.to_integer(field_map["priority"] || "5"),
      attempts:        String.to_integer(field_map["attempts"] || "0"),
      enqueued_at:     field_map["enqueued_at"],
    }
  end
end

The backpressure check (active_workers >= state.max_concurrent) is critical. Without it, a burst of 10,000 jobs would spawn 10,000 GenServer processes simultaneously. With it, we cap concurrency and let Redis hold the overflow.
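
One detail worth noticing: the check skips a poll only when the consumer is already at capacity, and the read that follows still asks for up to max_concurrent entries, so the cap is soft. A possible variation, sketched below, requests only the free slots. readCount is a hypothetical helper and not something the QueueConsumer above does.

```javascript
// Hypothetical helper: how many entries to request on the next poll so that
// active workers plus the new batch never exceed the concurrency cap.
function readCount(activeWorkers, maxConcurrent) {
  return Math.max(0, maxConcurrent - activeWorkers);
}

console.log(readCount(0, 10));  // 10: idle, fill every slot
console.log(readCount(9, 10));  // 1: one free slot, request one entry
console.log(readCount(12, 10)); // 0: over capacity, skip the read
```

A result of 0 plays the same role as the "at capacity" branch: skip the read and let Redis hold the backlog.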

The WorkerSupervisor

A DynamicSupervisor that spawns JobWorker processes on demand and supervises them independently.

# lib/job_consumer/worker_supervisor.ex
defmodule JobConsumer.WorkerSupervisor do
  use DynamicSupervisor
  require Logger

  def start_link(config) do
    DynamicSupervisor.start_link(__MODULE__, config, name: __MODULE__)
  end

  @impl true
  def init(_config) do
    # :one_for_one is the only strategy DynamicSupervisor supports.
    # max_restarts/max_seconds only count restarts. JobWorker is :temporary
    # and is never restarted, so these limits matter only if a restartable
    # child is added to this supervisor later.
    DynamicSupervisor.init(
      strategy:     :one_for_one,
      max_restarts: 3,
      max_seconds:  5
    )
  end

  @doc "Spawn a supervised JobWorker for the given job map"
  def start_worker(job) do
    spec = {JobConsumer.JobWorker, job}
    DynamicSupervisor.start_child(__MODULE__, spec)
  end

  @doc "Count currently active (living) worker processes"
  def active_count do
    DynamicSupervisor.count_children(__MODULE__).active
  end

  @doc "List all active worker PIDs"
  def list_workers do
    DynamicSupervisor.which_children(__MODULE__)
    |> Enum.map(fn {_id, pid, _type, _modules} -> pid end)
  end
end

The JobWorker GenServer

This is where the actual work happens. Each job gets its own process — isolated heap, isolated failure domain, independent retry logic.

# lib/job_consumer/job_worker.ex
defmodule JobConsumer.JobWorker do
  use GenServer, restart: :temporary  # don't auto-restart crashed workers
  require Logger

  alias JobConsumer.{Stream, DeadLetter}

  @base_retry_delay_ms 1_000
  @max_attempts        Application.compile_env(:job_consumer, :max_attempts, 3)

  # ── Public API ─────────────────────────────────────────────────────────────

  def start_link(job) do
    GenServer.start_link(__MODULE__, job)
  end

  # ── GenServer Callbacks ────────────────────────────────────────────────────

  @impl true
  def init(job) do
    # Process the job immediately after init — don't block the supervisor
    send(self(), :process)
    {:ok, job}
  end

  @impl true
  def handle_info(:process, job) do
    start_time = System.monotonic_time()

    :telemetry.execute(
      [:job_consumer, :job, :start],
      %{system_time: System.system_time()},
      %{job_type: job.type, job_id: job.id}
    )

    result =
      try do
        {:ok, execute_job(job)}
      rescue
        e -> {:error, Exception.format(:error, e, __STACKTRACE__)}
      catch
        :exit, reason -> {:error, "exit: #{inspect(reason)}"}
      end

    duration = System.monotonic_time() - start_time

    case result do
      {:ok, _output} ->
        handle_success(job, duration)

      {:error, reason} ->
        handle_failure(job, reason, duration)
    end

    # Worker is done — stop normally. The supervisor does not restart :temporary workers.
    {:stop, :normal, job}
  end

  # ── Job Dispatch ───────────────────────────────────────────────────────────

  defp execute_job(%{type: "email"} = job) do
    # Simulate: in production, call your mailer here
    %{to: to, template: template} = atomize(job.payload)
    Logger.info("[worker] sending email to=#{to} template=#{template}")
    Process.sleep(100)  # simulate I/O
    %{sent_to: to, template: template}
  end

  defp execute_job(%{type: "report"} = job) do
    %{report_id: id} = atomize(job.payload)
    Logger.info("[worker] generating report id=#{id}")
    Process.sleep(500)
    %{report_id: id, rows: :rand.uniform(10_000)}
  end

  defp execute_job(%{type: "webhook"} = job) do
    %{url: url} = atomize(job.payload)
    Logger.info("[worker] dispatching webhook to=#{url}")
    # In production: HTTP call here; raise on non-2xx for retry
    Process.sleep(200)
    %{url: url, status: 200}
  end

  defp execute_job(%{type: type}) do
    raise "Unknown job type: #{type}"
  end

  # ── Success / Failure Handling ─────────────────────────────────────────────

  defp handle_success(job, duration_native) do
    duration_ms = System.convert_time_unit(duration_native, :native, :millisecond)

    # Acknowledge the message — removes it from Redis PEL
    case Stream.ack(config(:stream_key), config(:consumer_group), job.stream_entry_id) do
      {:ok, 1} ->
        Logger.info("[worker] ✓ job=#{job.id} type=#{job.type} duration=#{duration_ms}ms")

      {:ok, 0} ->
        Logger.warning("[worker] ack returned 0 for job=#{job.id} — already acked?")

      {:error, reason} ->
        Logger.error("[worker] ack failed job=#{job.id}: #{inspect(reason)}")
    end

    :telemetry.execute(
      [:job_consumer, :job, :success],
      %{duration: duration_native},
      %{job_type: job.type, job_id: job.id}
    )
  end

  defp handle_failure(job, reason, duration_native) do
    attempts = job.attempts + 1
    duration_ms = System.convert_time_unit(duration_native, :native, :millisecond)

    Logger.error("[worker] ✗ job=#{job.id} type=#{job.type} attempt=#{attempts} reason=#{inspect(reason)}")

    :telemetry.execute(
      [:job_consumer, :job, :failure],
      %{duration: duration_native},
      %{job_type: job.type, job_id: job.id, attempt: attempts, reason: reason}
    )

    if attempts >= @max_attempts do
      # Max attempts reached — move to dead letter stream, then ack to clear PEL
      Logger.error("[worker] dead-lettering job=#{job.id} after #{attempts} attempts")
      DeadLetter.push(%{job | attempts: attempts}, reason)  # record the final attempt count
      Stream.ack(config(:stream_key), config(:consumer_group), job.stream_entry_id)
    else
      # Exponential backoff retry: re-enqueue with incremented attempts
      # We ack the current entry and re-add to the stream with updated attempts count
      delay_ms = @base_retry_delay_ms * :math.pow(2, attempts) |> round()
      Logger.info("[worker] retrying job=#{job.id} in #{delay_ms}ms (attempt #{attempts}/#{@max_attempts})")

      Stream.ack(config(:stream_key), config(:consumer_group), job.stream_entry_id)

      # Re-enqueue after delay in a detached process so we don't block.
      # Caveat: the entry is already acked, so a node crash during the delay loses the job.
      job_to_retry = %{job | attempts: attempts}
      Task.start(fn ->
        Process.sleep(delay_ms)
        re_enqueue(job_to_retry)
      end)
    end
  end

  defp re_enqueue(job) do
    fields = [
      "id",          job.id,
      "type",        job.type,
      "payload",     Jason.encode!(job.payload),
      "priority",    Integer.to_string(job.priority),
      "enqueued_at", job.enqueued_at,
      "attempts",    Integer.to_string(job.attempts),
    ]

    JobConsumer.RedisPool.command(
      ["XADD", config(:stream_key), "MAXLEN", "~", "10000", "*" | fields]
    )
  end

  defp atomize(map) do
    Map.new(map, fn {k, v} -> {String.to_existing_atom(k), v} end)
  end

  defp config(key), do: Application.fetch_env!(:job_consumer, key)
end

The restart: :temporary option on use GenServer is essential. It tells the WorkerSupervisor not to automatically restart a worker that exits — whether normally or abnormally. We want full control over retry logic inside the worker itself. Auto-restart would bypass our backoff and dead-letter logic.
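
The retry arithmetic in handle_failure is easy to misread, so here is the same calculation as a small sketch. It assumes the defaults used above (a 1,000 ms base delay and 3 max attempts); retryPlan is a hypothetical name.

```javascript
const BASE_RETRY_DELAY_MS = 1000; // mirrors @base_retry_delay_ms
const MAX_ATTEMPTS = 3;           // mirrors the default max_attempts

// Same arithmetic as handle_failure: delay = base * 2^attempts, where
// `attempts` already includes the failure that was just observed.
function retryPlan(attemptsSoFar) {
  const attempts = attemptsSoFar + 1;
  if (attempts >= MAX_ATTEMPTS) {
    return { action: 'dead_letter', attempts };
  }
  return { action: 'retry', attempts, delayMs: BASE_RETRY_DELAY_MS * 2 ** attempts };
}

for (let n = 0; n < MAX_ATTEMPTS; n++) console.log(retryPlan(n));
// { action: 'retry', attempts: 1, delayMs: 2000 }
// { action: 'retry', attempts: 2, delayMs: 4000 }
// { action: 'dead_letter', attempts: 3 }
```

With these defaults a job runs at most three times: the first retry waits 2 seconds, the second waits 4 seconds, and the third failure goes to the dead letter stream.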

Dead Letter Queue

Jobs that exhaust retries go here for inspection, not into the void:

# lib/job_consumer/dead_letter.ex
defmodule JobConsumer.DeadLetter do
  require Logger

  @stream_key "jobs:dead"

  def push(job, reason) do
    fields = [
      "original_id",  job.id,
      "type",         job.type,
      "payload",      Jason.encode!(job.payload),
      "attempts",     Integer.to_string(job.attempts),
      "failed_at",    DateTime.utc_now() |> DateTime.to_iso8601(),
      "reason",       inspect(reason),
    ]

    case JobConsumer.RedisPool.command(["XADD", @stream_key, "*" | fields]) do
      {:ok, entry_id} ->
        Logger.info("[dead_letter] stored job=#{job.id} at entry=#{entry_id}")
        {:ok, entry_id}

      {:error, reason} ->
        Logger.error("[dead_letter] failed to store job=#{job.id}: #{inspect(reason)}")
        {:error, reason}
    end
  end

  def list(count \\ 100) do
    case JobConsumer.RedisPool.command(["XRANGE", @stream_key, "-", "+", "COUNT", Integer.to_string(count)]) do
      {:ok, entries} -> {:ok, Enum.map(entries, &parse_entry/1)}
      error -> error
    end
  end

  defp parse_entry([entry_id, fields]) do  # Redix returns entries as lists, not tuples
    field_map = Enum.chunk_every(fields, 2)
                |> Enum.into(%{}, fn [k, v] -> {k, v} end)
    Map.put(field_map, "entry_id", entry_id)
  end
end

Telemetry

Wire up metrics so you actually know what's happening:

# lib/job_consumer/telemetry.ex
defmodule JobConsumer.Telemetry do
  use GenServer
  require Logger

  def start_link(_opts) do
    GenServer.start_link(__MODULE__, [], name: __MODULE__)
  end

  @impl true
  def init(_) do
    events = [
      [:job_consumer, :job, :start],
      [:job_consumer, :job, :success],
      [:job_consumer, :job, :failure],
    ]

    :telemetry.attach_many(
      "job-consumer-logger",
      events,
      &__MODULE__.handle_event/4,
      nil
    )

    {:ok, %{processed: 0, failed: 0}}
  end

  def handle_event([:job_consumer, :job, :start], _measurements, meta, _config) do
    Logger.debug("[telemetry] job started type=#{meta.job_type} id=#{meta.job_id}")
  end

  def handle_event([:job_consumer, :job, :success], measurements, meta, _config) do
    duration_ms = System.convert_time_unit(measurements.duration, :native, :millisecond)
    Logger.info("[telemetry] job success type=#{meta.job_type} id=#{meta.job_id} duration=#{duration_ms}ms")
    # In production: emit to StatsD, Prometheus, Datadog, etc.
  end

  def handle_event([:job_consumer, :job, :failure], _measurements, meta, _config) do
    Logger.warning("[telemetry] job failure type=#{meta.job_type} id=#{meta.job_id} attempt=#{meta.attempt}")
  end
end

Watching the Supervision Tree in Action

Start the Elixir application:

mix run --no-halt

In another terminal, enqueue a batch of jobs:

for i in $(seq 1 20); do
  curl -s -X POST http://localhost:3000/jobs \
    -H 'Content-Type: application/json' \
    -d "{\"type\":\"email\",\"payload\":{\"to\":\"user${i}@example.com\",\"template\":\"welcome\"}}" \
    > /dev/null
done
echo "Enqueued 20 jobs"

Observe the Elixir logs — you'll see workers spawning, processing, and acknowledging:

[consumer] dispatched job=01HX... entry=1699...-0
[worker] sending email to=user1@example.com template=welcome
[worker] ✓ job=01HX... type=email duration=103ms
[consumer] dispatched job=01HY... entry=1699...-1
...

Now simulate a crash. Stop the consumer, start it again with iex -S mix so a shell is attached, then run:

# Kill the QueueConsumer process directly
Process.whereis(JobConsumer.QueueConsumer) |> Process.exit(:kill)

# The Application supervisor restarts it automatically within milliseconds
# Watch the logs:
# [consumer] started — group=elixir-workers consumer=consumer-hostname

The supervision tree just restarted the consumer. Any jobs that were mid-flight but not yet acknowledged are still in the Redis PEL — the reclaim_stale loop will pick them up on the next cycle.

The Failure Matrix

Failure	What happens	Recovery
JobWorker crashes mid-job	Job stays in Redis PEL (not acked)	`XAUTOCLAIM` reclaims after 30s
JobWorker raises exception	`try/rescue` catches it, retry logic runs	Exponential backoff, then dead letter
QueueConsumer crashes	App supervisor restarts it	Polls resume; PEL intact in Redis
WorkerSupervisor crashes	App supervisor restarts it	All workers lost; PEL covers in-flight jobs
Redis connection drops	Redix auto-reconnects; pool returns errors	Consumer logs errors, retries next poll
Job exceeds max_attempts	Moved to `jobs:dead` stream, PEL cleared	Manual inspection + replay
Burst of jobs	Backpressure check caps concurrency	Overflow sits in Redis stream safely

Every cell in that table has code behind it in what we built. None of it relies on hope.

Where to Take It Next

Priorities: Add a separate stream per priority level (jobs:high, jobs:normal, jobs:low). Poll high-priority first; fall through to lower streams only when high is empty.
Observability: Replace the Telemetry logger with a Prometheus exporter. Track queue depth (Redis XLEN), processing rate, p99 duration per job type.
Horizontal scaling: Run multiple Elixir nodes. Each gets a unique consumer_name. A consumer group hands each new entry to exactly one consumer, so no coordinator is needed. Reclaimed entries can still run twice, so keep job handlers idempotent.
Rate limiting: Add a RateLimiter GenServer that tracks jobs-per-second per job type and blocks the QueueConsumer dispatch when limits are hit.
Job cancellation: XDEL a stream entry by ID before it's claimed. Workers should check a cancellation flag at the start of execute_job.
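
For the priorities idea, the fall-through rule can be sketched in a few lines. The stream names come from the suggestion above. The pendingByStream argument is a stand-in for whatever a real implementation would learn from Redis (for example from XLEN or a non-blocking read), and nextStream is a hypothetical name.

```javascript
// Streams in strict priority order, highest first.
const PRIORITY_STREAMS = ['jobs:high', 'jobs:normal', 'jobs:low'];

// Pick the highest-priority stream that has work waiting.
// `pendingByStream` maps stream key -> number of waiting entries (assumed input).
function nextStream(pendingByStream) {
  const key = PRIORITY_STREAMS.find((name) => (pendingByStream[name] ?? 0) > 0);
  return key ?? null; // null means every stream is empty
}

console.log(nextStream({ 'jobs:high': 0, 'jobs:normal': 2, 'jobs:low': 5 }));
// jobs:normal
```

Strict fall-through like this can starve jobs:low under sustained high-priority load, so a real scheduler may want a weighted or time-sliced rule instead.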

The OTP supervision tree you have now is the skeleton that all of this hangs on. Add a new capability → add a supervised child. Something breaks → the tree heals it. That's the promise, and it's not magic — it's just processes all the way down.

Ractors vs Fibers: Ruby Concurrency Without the Hand-Waving

Temitope — Mon, 11 May 2026 10:34:00 +0000

Ruby has had a concurrency story for years. For most of that time, the story was "threads exist but the GIL means they don't give you parallelism for CPU-bound work, and Fibers are for cooperative scheduling if you want to manage it yourself." Ruby 3 changed the first half of that sentence.

Ractors (introduced in Ruby 3.0, still experimental as of 3.3) give you genuine parallelism: multiple Ractors run on multiple OS threads without the GIL constraining them. Fibers (given a pluggable scheduler interface, Fiber::Scheduler, in Ruby 3.0) give you async I/O concurrency without threads.

These are different tools for different problems. This article puts them side by side with real code so you can reason about which one belongs in your next design.

The Core Distinction, Precisely

Before the code: get the mental model right.

Fibers are cooperative. A Fiber runs until it explicitly yields. The scheduler decides what runs next. There is still only one OS thread (by default). You get concurrency — multiple things making progress — but not parallelism — multiple things running simultaneously on different CPUs. The payoff is I/O-bound work: while one Fiber waits for a socket, another runs.

Ractors are parallel. Each Ractor runs on its own OS thread, free of the GIL. The payoff is CPU-bound work: genuine multi-core computation. The cost is strict isolation — Ractors cannot share mutable objects, and the rules Ruby enforces to guarantee this are strict enough to break a lot of idioms you're used to.

Fibers:   [F1]--yield--[F2]--yield--[F1]  (one thread, interleaved)
Ractors:  [R1]                            (thread 1, core 1)
          [R2]                            (thread 2, core 2, simultaneously)

Neither replaces threads. Both are alternatives with sharper tradeoffs.

Part 1: Fibers

The Basics — Fibers as Resumable Closures

A Fiber is a chunk of code with a suspension point. Fiber.yield suspends it; fiber.resume picks it back up. State is preserved across yields.

fib = Fiber.new do
  a, b = 0, 1
  loop do
    Fiber.yield a
    a, b = b, a + b
  end
end

10.times { print "#{fib.resume} " }
# => 0 1 1 2 3 5 8 13 21 34

This is an infinite Fibonacci generator that produces one value at a time. No array allocated, no recursion limit. The Fiber's stack frame persists between resumes — a and b are just sitting there between calls.

Fibers as Enumerators

Ruby's Enumerator is built on Fibers under the hood. Understanding this makes lazy enumerators click:

# Enumerator::new takes a yielder — same pattern as Fiber.yield
counter = Enumerator.new do |yielder|
  n = 0
  loop { yielder << n; n += 1 }
end

counter.take(5)       # => [0, 1, 2, 3, 4]
counter.lazy
       .select(&:odd?)
       .first(3)      # => [1, 3, 5]  — only computes what's needed

# You can compose lazy pipelines over Fibers manually too
def integers_from(n)
  Fiber.new do
    loop { Fiber.yield n; n += 1 }
  end
end

def fiber_map(fiber, &block)
  Fiber.new do
    loop { Fiber.yield block.call(fiber.resume) }
  end
end

def fiber_select(fiber, &predicate)
  Fiber.new do
    loop do
      val = fiber.resume
      Fiber.yield val if predicate.call(val)
    end
  end
end

source  = integers_from(1)
squares = fiber_map(source) { |n| n * n }
evens   = fiber_select(squares, &:even?)

5.times { print "#{evens.resume} " }
# => 4 16 36 64 100

Each step pulls exactly one value. No intermediate arrays. This is why lazy pipelines are memory-efficient at scale.

Fiber::Scheduler — The Real Power

The Fiber::Scheduler interface (introduced in Ruby 3.0, with more hooks added in 3.1) lets you plug in a custom scheduler that intercepts blocking I/O operations and runs other Fibers while one is waiting. The standard library doesn't ship a scheduler: you implement the interface, or use a gem like async or evt.

Here's a minimal scheduler that makes the interface concrete:

# A stripped-down scheduler — production would use libev or io_uring
class MinimalScheduler
  def initialize
    @readable  = {}  # io => fiber
    @writable  = {}
    @timers    = []  # [[fire_at, fiber], ...]
    @ready     = []  # fibers ready to run immediately
  end

  # Called by Fiber.schedule { } — registers a fiber to run
  def fiber(&block)
    fiber = Fiber.new(&block)
    @ready << fiber
    fiber
  end

  # Intercepted by Kernel#sleep inside a Fiber
  def kernel_sleep(duration)
    @timers << [Process.clock_gettime(Process::CLOCK_MONOTONIC) + duration,
                Fiber.current]
    Fiber.yield  # suspend the current fiber
  end

  # Intercepted by IO#wait_readable inside a Fiber
  def io_wait(io, events, timeout)
    if events & IO::READABLE > 0
      @readable[io] = Fiber.current
    elsif events & IO::WRITABLE > 0
      @writable[io] = Fiber.current
    end
    Fiber.yield
  end

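  # Fiber.set_scheduler rejects a scheduler that lacks #block and #unblock.
  # This sketch parks the fiber and ignores the timeout argument.
  def block(_blocker, _timeout = nil)
    Fiber.yield
  end

  def unblock(_blocker, fiber)
    @ready << fiber
  end

  # Ruby calls #close when the thread exits; draining the loop here lets
  # every pending fiber finish
  def close
    run
  end

  # The event loop: runs until no fiber is ready or waiting
  def run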
    until @readable.empty? && @writable.empty? && @timers.empty? && @ready.empty?
      # Run anything immediately ready
      @ready.dup.each { |f| @ready.delete(f); f.resume if f.alive? }

      now = Process.clock_gettime(Process::CLOCK_MONOTONIC)

      # Fire elapsed timers
      @timers.select! do |fire_at, fiber|
        if fire_at <= now
          fiber.resume if fiber.alive?
          false
        else
          true
        end
      end

      # Select on I/O with a short timeout
      timeout = @timers.map(&:first).min&.-(now) || 0.1
      timeout = [timeout, 0].max

      readable, writable = IO.select(
        @readable.keys, @writable.keys, [], timeout
      ) || [[], []]

      # io_wait is expected to return the events that became ready
      readable.each { |io| @readable.delete(io)&.resume(IO::READABLE) }
      writable.each { |io| @writable.delete(io)&.resume(IO::WRITABLE) }
    end
  end
end

Now wire it up:

scheduler = MinimalScheduler.new
Fiber.set_scheduler(scheduler)

# These three Fibers run concurrently on ONE thread
Fiber.schedule do
  puts "[A] start"
  sleep 0.1   # intercepted → this fiber suspends, others run
  puts "[A] after sleep"
end

Fiber.schedule do
  puts "[B] start"
  sleep 0.05
  puts "[B] after sleep"
end

Fiber.schedule do
  puts "[C] instant"
end

# Output:
# [A] start
# [B] start
# [C] instant
# [B] after sleep   ← B's timer fires first (0.05s)
# [A] after sleep   ← A's timer fires second (0.1s)

One thread. Three Fibers making progress "simultaneously." The scheduler is the traffic cop.

Fibers for Concurrent HTTP Requests (Real Example)

Using the async gem, which ships a production-grade scheduler:

# gem 'async', gem 'async-http'
require 'async'
require 'async/http/internet'
require 'json'

urls = %w[
  https://api.github.com/users/rails
  https://api.github.com/users/matz
  https://api.github.com/users/tenderlove
]

results = {}

Async do |task|
  internet = Async::HTTP::Internet.new

  tasks = urls.map do |url|
    task.async do
      response = internet.get(url)
      body     = response.read
      results[url] = JSON.parse(body)["public_repos"]
    end
  end

  tasks.each(&:wait)
  internet.close
end

results.each { |url, repos| puts "#{url.split('/').last}: #{repos} repos" }

All three HTTP requests fire concurrently. The Fiber scheduler switches between them as each one waits on I/O. Total wall-clock time ≈ the slowest single request, not the sum of all three.

When Fibers win:

Many concurrent I/O operations (HTTP, DB, Redis, file)
You want async without threads (no locking; switches happen only at I/O or explicit yields, so races on shared state are far rarer)
Streaming pipelines, generators, lazy evaluation

When Fibers don't help:

CPU-bound computation — one thread means one core
Work that blocks without yielding (C extensions that do blocking I/O without going through the scheduler hooks)

Part 2: Ractors

The Basics — Isolated Parallel Workers

A Ractor is an actor-model execution unit. It owns the objects it creates, runs on its own thread, and communicates via message passing. The GIL does not apply between Ractors.

# Simple parallel computation
r = Ractor.new do
  sum = 0
  1_000_000.times { |i| sum += i }
  sum
end

puts r.take  # => 499999500000
# Runs on a separate thread while main continues

Multiple Ractors running in parallel:

workers = 4.times.map do |i|
  Ractor.new(i) do |id|
    # Each Ractor computes its slice
    start = id * 250_000
    finish = start + 250_000
    sum = 0
    (start...finish).each { |n| sum += n }
    sum
  end
end

total = workers.sum(&:take)
puts total  # => 499999500000  (same answer; up to ~4x faster on 4 cores once the work outweighs Ractor startup)

The Isolation Rules — Where Most Code Breaks

Ractors cannot share mutable objects. This is enforced at runtime, not compile time. The rules:

# ✅ Immutable objects can be shared freely
CONSTANT = "hello".freeze
r = Ractor.new { puts CONSTANT }  # fine — frozen string is shareable

# ✅ Value types are always shareable
r = Ractor.new { puts 42 + Math::PI }  # integers, floats, symbols: fine

# ❌ A Ractor block cannot reach outer local variables
shared_hash = { count: 0 }
r = Ractor.new do
  shared_hash[:count] += 1
end
# ArgumentError: can not isolate a Proc because it accesses outer variables

# ✅ Pass the object as an argument instead: the Ractor gets a deep copy
data = { count: 0 }
r = Ractor.new(data) do |h|
  h[:count] += 1
  h
end
result = r.take
puts result[:count]  # => 1
puts data[:count]    # => 0 (the main Ractor's original is untouched)

Copying is the default. For large objects you can opt into move semantics with move: true on send or yield, which transfers ownership instead of copying. This is the hardest part to get used to:

payload = [1, 2, 3, 4, 5]

r = Ractor.new do
  arr = Ractor.receive
  arr.map { |n| n ** 2 }
end

r.send(payload, move: true)  # ownership transfers to r
result = r.take

# payload is now inaccessible from the main Ractor
begin
  payload.size  # raises Ractor::MovedError
rescue Ractor::MovedError => e
  puts "As expected: #{e.message}"
end

This forces you to think about ownership, which is uncomfortable if you're used to Ruby's free-for-all sharing. It's also what makes Ractor-based code genuinely thread-safe without locks.

Message Passing Patterns

Ractors communicate via send/receive (push-based) or yield/take (pull-based):

# Push model: supervisor sends work to workers
workers = 3.times.map do
  Ractor.new do
    loop do
      job = Ractor.receive  # blocks until a message arrives
      break if job == :stop
      result = job * job
      Ractor.yield result   # makes result available to anyone who calls .take
    end
  end
end

# Distribute work
jobs = [10, 20, 30, 40, 50, 60, 70, 80, 90]
jobs.each_with_index do |job, i|
  workers[i % workers.size].send(job)
end

workers.each { |w| w.send(:stop) }

# Collect results (order not guaranteed)
results = workers.flat_map do |w|
  collected = []
  # Ractor::ClosedError is a StopIteration, so loop ends quietly once w is drained
  loop { collected << w.take }
  collected.compact  # take also delivers the block's own return value (nil)
end

puts results.sort.inspect
# => [100, 400, 900, 1600, 2500, 3600, 4900, 6400, 8100]

A Worker Pool with a Shared Queue

The most common Ractor pattern: a pool of workers pulling from a central pipeline.

# A Ractor object is itself shareable, so it can sit in a constant and be handed to workers
PIPELINE = Ractor.new do
  loop { Ractor.yield Ractor.receive }
end

N_WORKERS = 4

workers = N_WORKERS.times.map do
  Ractor.new(PIPELINE) do |pipeline|
    loop do
      job = pipeline.take     # pull next job from the pipeline
      break if job == :done

      # CPU-intensive work — genuinely parallel across workers
      result = {
        input:  job,
        output: job.chars.sort.join,  # simulate work
        worker: Ractor.current.inspect,
      }
      Ractor.yield result
    end
  end
end

# Feed jobs
words = %w[banana apple cherry date elderberry fig grape]
words.each { |w| PIPELINE.send(w) }
N_WORKERS.times { PIPELINE.send(:done) }

# Collect in arrival order
results = []
pending = workers.dup
until pending.empty?
  ractor, result = Ractor.select(*pending)
  if result.nil?
    pending.delete(ractor)  # nil is the worker block's return value: it is done
  else
    results << result
  end
end

results.each { |r| puts "#{r[:input]} → #{r[:output]} (#{r[:worker]})" }

Ractor.select is the key — it blocks until any of the given Ractors has a value ready, then returns whichever fires first. This is how you avoid blocking on a slow worker when faster ones have results ready.

Ractors and Classes — Where Things Get Thorny

Most Ruby classes are not Ractor-safe because they hold mutable class-level state. This is the thing that will bite you most in practice:

# ❌ Class with mutable state — breaks in Ractors
class Counter
  @@count = 0                      # class variable: mutable, shared
  def self.increment = @@count += 1
  def self.value     = @@count
end

r = Ractor.new { Counter.increment }  # Ractor::IsolationError

# ✅ Approach 1: Pass data explicitly, return results
r = Ractor.new(0) do |count|
  count + 1
end
puts r.take  # => 1

# ✅ Approach 2: Freeze the class's shareable parts
class Config
  DEFAULTS = {
    timeout: 30,
    retries: 3,
  }.freeze                         # frozen Hash is shareable

  def self.defaults = DEFAULTS
end

r = Ractor.new { Config.defaults[:timeout] }
puts r.take  # => 30  — works because DEFAULTS is frozen

# ✅ Approach 3: Ractor-local state in objects created inside the Ractor
# Anything a Ractor allocates belongs to it alone, so a plain local
# variable is private to that Ractor by definition
r = Ractor.new do
  local_cache = {}                 # lives only inside this Ractor
  local_cache[:key] = "value"
  local_cache[:key]
end
puts r.take  # => "value"

Practical Ractor: Parallel File Processing

require 'json'

# Process a directory of JSON files in parallel
files = Dir.glob("data/*.json")

# Freeze paths — strings are mutable but freeze makes them shareable
frozen_files = files.map(&:freeze)

workers = frozen_files.map do |path|
  Ractor.new(path) do |file_path|
    raw  = File.read(file_path)
    data = JSON.parse(raw)

    # Simulate per-file processing
    {
      file:   file_path,
      count:  data.length,
      sample: data.first,
    }
  end
end

# Collect as workers finish (whichever finishes first)
until workers.empty?
  done, result = Ractor.select(*workers)
  workers.delete(done)
  puts "#{result[:file]}: #{result[:count]} records"
end

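Without Ractors, Dir.glob + processing would be sequential. This version spawns one Ractor per file, so on a 4-core machine up to four files are parsed simultaneously. For large directories, cap the Ractor count with a worker pool like the one above.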

Head-to-Head Benchmarks

Let's measure both approaches on their native problem types.

I/O-Bound: 50 Concurrent HTTP Requests

require 'benchmark'
require 'net/http'
require 'async'
require 'async/http/internet'

URLS = Array.new(50) { "https://httpbin.org/delay/0.1" }

# Approach 1: Sequential (baseline)
sequential_time = Benchmark.realtime do
  URLS.each do |url|
    uri = URI(url)
    Net::HTTP.get(uri)
  end
end

# Approach 2: Threads
threads_time = Benchmark.realtime do
  threads = URLS.map do |url|
    Thread.new { Net::HTTP.get(URI(url)) }
  end
  threads.each(&:join)
end

# Approach 3: Fibers with Async scheduler
fibers_time = Benchmark.realtime do
  Async do |task|
    internet = Async::HTTP::Internet.new
    tasks = URLS.map { |url| task.async { internet.get(url).read } }
    tasks.each(&:wait)
    internet.close
  end
end

puts "Sequential: #{sequential_time.round(2)}s"
puts "Threads:    #{threads_time.round(2)}s"
puts "Fibers:     #{fibers_time.round(2)}s"

# Typical output (50 requests × 100ms each):
# Sequential: 5.21s
# Threads:    0.38s   ← thread overhead per request
# Fibers:     0.12s   ← near-theoretical minimum (one connection pool)

Fibers beat threads for high-concurrency I/O because they have lower overhead per "concurrent unit" and can share a single connection pool efficiently.

CPU-Bound: Parallel Prime Sieve

require 'benchmark'

def primes_in_range(start, finish)
  sieve = Array.new(finish + 1, true)
  sieve[0] = sieve[1] = false
  (2..Math.sqrt(finish)).each do |i|
    if sieve[i]
      (i*i..finish).step(i) { |j| sieve[j] = false }
    end
  end
  (start..finish).select { |n| sieve[n] }
end

RANGES = [
  [2,       500_000],
  [500_001, 1_000_000],
  [1_000_001, 1_500_000],
  [1_500_001, 2_000_000],
]

# Approach 1: Sequential
sequential_time = Benchmark.realtime do
  results = RANGES.map { |s, e| primes_in_range(s, e) }
  puts "Sequential primes found: #{results.sum(&:length)}"
end

# Approach 2: Threads (GIL-constrained — won't parallelize Ruby bytecode)
threads_time = Benchmark.realtime do
  threads = RANGES.map do |s, e|
    Thread.new { primes_in_range(s, e) }
  end
  results = threads.map(&:value)
  puts "Threads primes found: #{results.sum(&:length)}"
end

# Approach 3: Ractors (genuine parallelism)
ractors_time = Benchmark.realtime do
  workers = RANGES.map do |s, e|
    Ractor.new(s, e) { |start, finish| primes_in_range(start, finish) }
  end
  results = workers.map(&:take)
  puts "Ractors primes found: #{results.sum(&:length)}"
end

puts "\nSequential: #{sequential_time.round(3)}s"
puts "Threads:    #{threads_time.round(3)}s  (should be ~same as sequential)"
puts "Ractors:    #{ractors_time.round(3)}s  (should be ~1/4 on 4 cores)"

# Typical output on a 4-core machine:
# Sequential primes found: 148933
# Threads primes found:    148933
# Ractors primes found:    148933
#
# Sequential: 1.847s
# Threads:    1.831s   ← GIL: no speedup
# Ractors:    0.512s   ← ~3.6x speedup (real parallelism)

This is the number that matters. Threads give you zero speedup on CPU-bound Ruby work because the GIL serializes bytecode execution. Ractors give you near-linear scaling with cores.

The Comparison Table

Aspect	Fibers	Ractors
Parallelism	No (cooperative, 1 thread)	Yes (preemptive, N threads)
Use case	I/O-bound concurrency	CPU-bound parallelism
Shared state	Freely shared (same heap)	Forbidden (isolated heaps)
Communication	Direct — same memory	Message passing only
Error handling	Exceptions propagate normally	Isolated; surface as Ractor::RemoteError on take
Maturity	Stable	Experimental (Ruby 3.x)
Ecosystem	`async`, `evt` gems	Limited; most gems not safe
Overhead	Very low (KB per fiber)	Higher (full OS thread)
Debugging	Standard tools work	Limited tooling
When it breaks	Blocking C extensions	Any mutable shared state

Combining Both: Parallel Workers, Each with Async I/O

The really interesting architecture is Ractors for parallelism at the top level, with each Ractor using a Fiber scheduler internally for its I/O work. This gives you both:

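# Each Ractor handles a batch of URLs concurrently via Fibers
# Multiple Ractors run in parallel on multiple cores
# Caveat: this only works if every library called inside the Ractor is
# Ractor-safe; check async and json against your Ruby version first
require 'async'
require 'async/http/internet'
require 'json'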

URL_BATCHES = [
  %w[https://api.github.com/users/rails https://api.github.com/users/matz],
  %w[https://api.github.com/users/tenderlove https://api.github.com/users/dhh],
].map { |batch| batch.freeze }.freeze

workers = URL_BATCHES.map do |batch|
  Ractor.new(batch) do |urls|
    # Inside this Ractor: Async scheduler for concurrent I/O
    results = {}

    Async do |task|
      internet = Async::HTTP::Internet.new

      fetches = urls.map do |url|
        task.async do
          response = internet.get(url)
          body     = JSON.parse(response.read)
          [url, body["public_repos"]]
        end
      end

      fetches.each do |f|
        url, repos = f.wait
        results[url] = repos
      end

      internet.close
    end

    results
  end
end

all_results = workers.map(&:take).reduce(:merge)
all_results.each do |url, repos|
  puts "#{url.split('/').last}: #{repos} repos"
end

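Each Ractor runs on its own thread. Within each Ractor, Fibers handle concurrent HTTP. You get horizontal scaling (Ractors) and I/O multiplexing (Fibers) simultaneously.

Note the freeze calls. They freeze the batch arrays but not the URL strings inside them, and strictly neither is required: Ractor.new deep-copies any argument that isn't shareable. Deep-freezing matters when you want data shared by reference instead of copied. Ruby ships Ractor.make_shareable for exactly this; a hand-rolled equivalent looks like: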

module RactorSafe
  def self.freeze_deep(obj)
    case obj
    when Hash
      obj.transform_values { |v| freeze_deep(v) }.freeze
    when Array
      obj.map { |v| freeze_deep(v) }.freeze
    when String
      obj.frozen? ? obj : obj.dup.freeze
    else
      obj.frozen? ? obj : obj.freeze
    end
  end
end

payload = RactorSafe.freeze_deep({ urls: ["http://example.com"], timeout: 30 })
r = Ractor.new(payload) { |p| p[:timeout] }
puts r.take  # => 30

What's Still Broken with Ractors (Honest Assessment)

Ractors are marked experimental for a reason. As of Ruby 3.3:

# Most stdlib classes are not Ractor-safe
require 'date'
r = Ractor.new { Date.today }  # Ractor::IsolationError in some versions

# Logger is not Ractor-safe
require 'logger'
logger = Logger.new($stdout)
r = Ractor.new(logger) { |l| l.info("hello") }  # IsolationError

# ERB is not Ractor-safe
# OpenSSL contexts are not Ractor-safe
# Most ActiveRecord is emphatically not Ractor-safe

The practical upshot: Ractors work well for pure computation with simple data types. They break badly when you reach for anything that has mutable class-level state — which is most of the Ruby ecosystem.

The path forward is gems marking themselves as Ractor-safe and Ruby's standard library getting audited. It's happening, slowly. Ractors today are best suited to isolated computation pipelines where you control the code, not general-purpose Rails-style applications.

Decision Guide

Use Fibers when:

You're doing many concurrent network calls, DB queries, or file operations
You want async without the complexity of thread synchronization
You're building a streaming pipeline or lazy generator
Your codebase uses gems that aren't Ractor-safe (which is most gems)

Use Ractors when:

You have genuinely CPU-bound work (parsing, compression, cryptography, simulation)
You control the full call stack and can audit it for Ractor-safety
You need to saturate all available cores, not just improve I/O throughput
You're building a new system, not retrofitting an existing Rails app

Use Threads when:

You need concurrency over shared mutable data that can't be frozen (legacy code), and can live with the GIL serializing pure-Ruby work
You're wrapping a C extension that releases the GIL (SQLite, some crypto libs)
The ergonomics of Ractors aren't worth it for your team yet

The honest answer for most production Ruby applications in 2024: Fibers via async for I/O, threads for the rare parallelism case, and Ractors for isolated computation services where you own the entire stack. Ractors are the future — they're just not universally the present yet.

Node.js Performance at the Limit: Profiling, Fixing, and Proving It with Real Numbers

Temitope — Mon, 11 May 2026 10:29:54 +0000

Most Node.js performance content teaches you to avoid eval, use streams instead of buffers, and "don't block the event loop." That's fine advice — and it won't help you when your p99 latency is 2.3 seconds and your CTO is in your Slack DMs.

This is a different kind of article. We start with a realistic API that has real problems, profile it properly, fix each bottleneck with actual code, and measure the delta at every step. No platitudes. Numbers or it didn't happen.

The Benchmark Harness First

Before touching a single line of application code, establish your measurement baseline. Every optimization you make needs a before and after. Without this, you're just guessing with extra steps.

We'll use autocannon for HTTP benchmarking, and Node's built-in --prof flag plus Clinic.js for CPU, heap, and flame profiling.

npm install -g autocannon clinic

The baseline test we'll run throughout:

# 10 seconds, 50 concurrent connections, pipe results to JSON
autocannon -c 50 -d 10 -j http://localhost:3000/api/reports > baseline.json

A helper script to diff two runs:

// scripts/compare.js
const before = require('./baseline.json');
const after  = require('./optimized.json');

const metrics = ['requests', 'latency', 'throughput'];

for (const m of metrics) {
  const b = before[m];
  const a = after[m];
  const deltaAvg = (((a.average - b.average) / b.average) * 100).toFixed(1);
  console.log(`${m}.average: ${b.average} → ${a.average} (${deltaAvg}%)`);
}

Run this after every change. Keep all your JSON files. You'll need receipts.

The Patient: A Realistic Slow API

Here's the kind of endpoint that exists in every codebase that's survived long enough. It generates a report — fetches some data, processes it, formats it, returns JSON.

// src/routes/reports.js  (the before — intentionally broken)
const express = require('express');
const db      = require('../db');       // Postgres via pg
const crypto  = require('crypto');
const router  = express.Router();

router.get('/api/reports', async (req, res) => {
  const { org_id, start, end } = req.query;

  // Fetch orders
  const orders = await db.query(
    `SELECT * FROM orders WHERE org_id = $1
     AND created_at BETWEEN $2 AND $3`,
    [org_id, start, end]
  );

  // For each order, fetch its line items separately
  const enriched = [];
  for (const order of orders.rows) {
    const items = await db.query(
      `SELECT * FROM line_items WHERE order_id = $1`,
      [order.id]
    );

    const total = items.rows.reduce((sum, i) => sum + i.price * i.qty, 0);

    // Compute a "fingerprint" for cache busting downstream
    const fingerprint = crypto
      .createHash('sha256')
      .update(JSON.stringify(order))
      .digest('hex');

    enriched.push({ ...order, items: items.rows, total, fingerprint });
  }

  // Sort by total descending
  enriched.sort((a, b) => b.total - a.total);

  res.json({ data: enriched, count: enriched.length });
});

If you've been around long enough, you felt something in your chest reading that. Let's quantify the pain.

Baseline numbers (50 concurrent, 10s, 200 orders in the result set):

Requests/sec:  47.3
Latency avg:   1,041ms
Latency p99:   2,380ms
Throughput:    1.1 MB/s

Five things are wrong: four in this handler and one in how it is configured. We'll fix them one at a time, measuring after each.

Problem 1: The N+1 Query

The for...of loop that fires a db.query per order is the worst offender. With 200 orders, that's 201 round trips to Postgres. Each one waits for the previous to complete because await inside a for loop is sequential.
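To see the serialization without a database, here is a minimal sketch. The fake db object and its 2ms delay are hypothetical stand-ins for db.query, not part of the original handler. It records how many queries are in flight at once: the loop never has more than one, which is why the round trips add up instead of overlapping.

```javascript
// Hypothetical stand-in for db.query: resolves after `delayMs` and tracks
// how many calls are in flight at the same time.
function makeFakeDb(delayMs) {
  let inFlight = 0;
  let maxInFlight = 0;
  return {
    async query() {
      inFlight += 1;
      maxInFlight = Math.max(maxInFlight, inFlight);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      inFlight -= 1;
    },
    get maxInFlight() {
      return maxInFlight;
    },
  };
}

// Same shape as the handler's loop: each await finishes before the next
// query is even issued.
async function sequential(db, n) {
  for (let i = 0; i < n; i++) await db.query();
  return db.maxInFlight;
}

// For contrast only: issuing every query up front lets the waits overlap.
async function concurrent(db, n) {
  await Promise.all(Array.from({ length: n }, () => db.query()));
  return db.maxInFlight;
}

async function demo() {
  const seq = await sequential(makeFakeDb(2), 20);
  const con = await concurrent(makeFakeDb(2), 20);
  return { seq, con };
}

demo().then(({ seq, con }) => {
  console.log(`max in flight: sequential=${seq}, concurrent=${con}`);
});
```

Overlapping the queries would shorten the wait, but it still sends one query per order and now competes for pool connections, so the fix that follows removes the extra queries instead.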

Proof First

node --require ./src/db-logger.js src/index.js &
autocannon -c 1 -d 3 http://localhost:3000/api/reports 2>/dev/null

// src/db-logger.js  — count queries per request
let count = 0;
const { Pool } = require('pg');
const originalQuery = Pool.prototype.query;
Pool.prototype.query = function(...args) {
  count++;
  process.stdout.write(`\rQueries this process: ${count}`);
  return originalQuery.apply(this, args);
};

Output confirms: 201 queries per request. At 5ms average round-trip, that's 1,005ms of pure waiting before any processing begins.

The Fix: JOIN Everything

// One query, zero loops
const result = await db.query(
  `SELECT
     o.*,
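     COALESCE(
       json_agg(
         json_build_object(
           'id',    li.id,
           'price', li.price,
           'qty',   li.qty,
           'sku',   li.sku
         )
       ) FILTER (WHERE li.id IS NOT NULL),
       '[]'::json
     ) AS items,
     COALESCE(SUM(li.price * li.qty), 0) AS total
   FROM orders o
   -- LEFT JOIN keeps orders that have no line items, as the original loop did
   LEFT JOIN line_items li ON li.order_id = o.id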
   WHERE o.org_id = $1
     AND o.created_at BETWEEN $2 AND $3
   GROUP BY o.id
   ORDER BY total DESC`,
  [org_id, start, end]
);

json_agg builds the nested items array directly in Postgres. The SUM computes the total in SQL, skipping the JS reduce entirely. One round trip.
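One detail worth seeing concretely is what a row from this query looks like once pg hands it to JavaScript. The sample row below is hypothetical, with made-up values. The point is that json columns arrive already parsed, while NUMERIC results such as SUM arrive as strings, so the total needs converting before you sort or serialize it as a number.

```javascript
// Hypothetical row, shaped the way pg would return it for the JOIN query:
// the json_agg column is a parsed array, the SUM(...) column is a string.
const sampleRow = {
  id: 42,
  org_id: 7,
  items: [
    { id: 1, price: 19.99, qty: 2, sku: 'A-100' },
    { id: 2, price: 5, qty: 4, sku: 'B-200' },
  ],
  total: '59.98',
};

// Convert the string total to a number and guard against an order whose
// items or total came back null.
function normalizeRow(row) {
  return {
    id: row.id,
    org_id: row.org_id,
    items: row.items ?? [],
    total: Number(row.total ?? 0),
  };
}

console.log(normalizeRow(sampleRow));
```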

After fix 1:

Requests/sec:  312.4   (+560%)
Latency avg:   158ms   (-85%)
Latency p99:   401ms   (-83%)

That's your N+1. Find it, kill it, collect your 6.6x throughput improvement.

Problem 2: CPU Blocking — The Fingerprint Loop

With the database bottleneck gone, the CPU profile becomes readable. Let's generate one:

node --prof src/index.js &
autocannon -c 50 -d 10 http://localhost:3000/api/reports
kill %1
node --prof-process isolate-*.log > profile.txt

Look for the hot functions at the top of profile.txt:

 [JavaScript]:
   ticks  total  nonlib   name
   2,847   31.2%   34.1%  crypto.Hash.update
   1,203   13.2%   14.4%  JSON.stringify
     891    9.8%   10.7%  Array.prototype.sort

crypto.Hash.update eating 31% of CPU time for a "fingerprint" that's used for... cache busting? This needs scrutiny.

The Analysis

// The original — called 200 times per request
const fingerprint = crypto
  .createHash('sha256')
  .update(JSON.stringify(order))  // stringify a full order object, 200x
  .digest('hex');

Two problems:

JSON.stringify on a full order object, 200 times per request, under 50 concurrent connections = up to 10,000 stringifies in flight at once, each allocating a new string.
SHA-256 is cryptographically secure. We don't need that for a cache-busting fingerprint. We need fast and unique, not secure.

If the fingerprint is truly needed, use a cheaper hash and stop serializing the whole object:

// Option A: Hash only the fields that actually affect cache validity
const fingerprint = crypto
  .createHash('md5')              // cheaper than sha256; md4 is unavailable by default under OpenSSL 3 (Node 17+)
  .update(`${order.id}:${order.updated_at.getTime()}`)
  .digest('hex');

// Option B: If you only need uniqueness, not a hash
// updated_at is already a change signal — use it directly
const fingerprint = `${order.id}-${order.updated_at.getTime().toString(36)}`;

Option B is what you almost certainly actually want. It's a string concat, not a hash. It's unique per order-version. It takes microseconds.
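If you want to check the gap on your own hardware, a rough sketch like this compares the original approach with Option B. The order object and the timeIt helper are hypothetical and exist only for this comparison. Timings vary by machine, so none are quoted here.

```javascript
const { createHash } = require('node:crypto');

// The original approach: serialize the whole order, then SHA-256 it.
function hashFingerprint(order) {
  return createHash('sha256').update(JSON.stringify(order)).digest('hex');
}

// Option B: id plus the update timestamp in base 36.
function cheapFingerprint(order) {
  return `${order.id}-${order.updated_at.getTime().toString(36)}`;
}

// Rough timing helper (hypothetical): runs fn repeatedly, returns milliseconds.
function timeIt(fn, order, iterations = 10_000) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) fn(order);
  return Number(process.hrtime.bigint() - start) / 1e6;
}

// Hypothetical order for illustration.
const order = {
  id: 1001,
  org_id: 7,
  status: 'paid',
  updated_at: new Date('2026-01-15T12:00:00Z'),
};

console.log('sha256 + stringify:', timeIt(hashFingerprint, order).toFixed(1), 'ms');
console.log('string concat:     ', timeIt(cheapFingerprint, order).toFixed(1), 'ms');
```

The concat version still changes whenever updated_at changes, which is the only property the downstream cache needs.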

After fix 2:

Requests/sec:  489.1   (+57% on top of fix 1)
Latency avg:   101ms   (-36%)
Latency p99:   229ms   (-43%)
CPU idle:      ~62%    (was ~21%)

Problem 3: Memory Pressure and GC Pauses

Run the Clinic.js heap profiler to see allocation patterns:

clinic heapprofiler -- node src/index.js &
autocannon -c 50 -d 10 http://localhost:3000/api/reports

Clinic will generate an HTML flamegraph. The allocation spike you'll see is from building the enriched array: 200 objects, each with a spread copy of the order, plus items array, plus computed fields. Under 50 concurrent connections, that's up to 10,000 of these objects alive at once, many of them large.

The V8 GC handles this, but not for free. You'll see GC pauses in the p99 latency as the minor GC sweeps short-lived allocations from the new-space.
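One way to put a number on those pauses from inside the process is Node's perf_hooks GC entries. This is a minimal sketch, not necessarily how the figures in this article were collected. The createGcStats and watchGcPauses names are hypothetical helpers.

```javascript
const { PerformanceObserver } = require('node:perf_hooks');

// Tracks how many GC pauses were seen, plus the longest and the total (ms).
function createGcStats() {
  const stats = { count: 0, maxMs: 0, totalMs: 0 };
  stats.record = (durationMs) => {
    stats.count += 1;
    stats.totalMs += durationMs;
    stats.maxMs = Math.max(stats.maxMs, durationMs);
  };
  return stats;
}

// Call once at startup. Each 'gc' performance entry carries the pause
// duration in milliseconds. Returns the observer so it can be disconnected.
function watchGcPauses(stats) {
  const observer = new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) stats.record(entry.duration);
  });
  observer.observe({ entryTypes: ['gc'] });
  return observer;
}
```

In the server you would call watchGcPauses(createGcStats()) at boot and log stats.maxMs after each benchmark run, then compare the value before and after a change.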

The Fix: Return the Postgres Result Directly

The JOIN query already gives us the shape we need. Stop copying:

// Before: building enriched[] with spreads and mutations
const enriched = [];
for (const row of result.rows) {
  enriched.push({ ...row, items: row.items, total: row.total, fingerprint });
}

// After: map each row straight to the response shape, with no spread copies
const data = result.rows.map(row => ({
  id:          row.id,
  org_id:      row.org_id,
  created_at:  row.created_at,
  items:       row.items,         // already json_agg'd by Postgres
  total:       parseFloat(row.total),
  // updated_at is the change signal chosen in fix 2
  fingerprint: `${row.id}-${new Date(row.updated_at).getTime().toString(36)}`,
}));

Explicit field selection instead of spreading also avoids accidentally sending internal fields (internal_notes, cost_price, etc.) to the client — a common security issue hiding inside performance code.

After fix 3:

Requests/sec:  541.8   (+11%)
Latency avg:   91ms    (-10%)
Latency p99:   198ms   (-14%)
GC pause max:  4ms     (was 23ms)

The absolute numbers are a modest improvement, but GC max pause dropping from 23ms to 4ms matters — that's what was spiking your p99.

Problem 4: The Event Loop — Blocking JSON Serialization

res.json() calls JSON.stringify() synchronously on the main thread. For small responses this doesn't matter. For a response that's 200 orders × 10 line items each, you're stringifying a 400KB+ object on the event loop, blocking all other requests during that serialization.

Let's prove it with a flame chart:

clinic flame -- node src/index.js &
autocannon -c 50 -d 10 http://localhost:3000/api/reports

You'll see JSON.stringify as a wide horizontal band — it's synchronous time on the main thread. For a 50-concurrent test, this means requests queuing behind each other's serialization.
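You can also measure the blocking time directly. The sketch below builds a response of the same shape (200 orders with 10 items each) and times one JSON.stringify call. The field values are made up for illustration, so the byte count will differ from a real payload.

```javascript
// Build a response shaped like the report (field values are illustrative).
function buildReport(orderCount, itemsPerOrder) {
  const data = Array.from({ length: orderCount }, (_, o) => ({
    id: o + 1,
    org_id: 7,
    created_at: new Date(0).toISOString(),
    total: 123.45,
    fingerprint: `${o + 1}-abc123`,
    items: Array.from({ length: itemsPerOrder }, (_, i) => ({
      id: o * itemsPerOrder + i + 1,
      price: 9.99,
      qty: 2,
      sku: `SKU-${i}`,
    })),
  }));
  return { data, count: data.length };
}

// Everything between the two clock reads is time the event loop cannot
// spend on any other request.
function timeStringify(value) {
  const start = process.hrtime.bigint();
  const json = JSON.stringify(value);
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  return { ms, bytes: Buffer.byteLength(json) };
}

const measured = timeStringify(buildReport(200, 10));
console.log(`${(measured.bytes / 1024).toFixed(0)} KB serialized in ${measured.ms.toFixed(2)} ms`);
```

Multiply the measured time by the number of requests arriving together to estimate how long the last one in line waits on serialization alone.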

Fix A: Schema-Compiled Serialization with `fast-json-stringify`

npm install fast-json-stringify

const fastJson = require('fast-json-stringify');

// Define the shape of your response once — compile it
const stringify = fastJson({
  type: 'object',
  properties: {
    count: { type: 'integer' },
    data: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          id:          { type: 'integer' },
          org_id:      { type: 'integer' },
          created_at:  { type: 'string' },
          total:       { type: 'number' },
          fingerprint: { type: 'string' },
          items: {
            type: 'array',
            items: {
              type: 'object',
              properties: {
                id:    { type: 'integer' },
                price: { type: 'number' },
                qty:   { type: 'integer' },
                sku:   { type: 'string' },
              }
            }
          }
        }
      }
    }
  }
});

// In the route handler
const payload = stringify({ data, count: data.length });
res.setHeader('Content-Type', 'application/json');
res.end(payload);

fast-json-stringify generates a schema-specific serializer at startup — no runtime type-checking, no property iteration. For a known schema it's typically 2–5x faster than JSON.stringify.

Fix B: For Very Large Responses — `JSONStream`

If your response can be megabytes, don't serialize it all before sending. Stream it:

npm install JSONStream

const JSONStream = require('JSONStream');

router.get('/api/reports', async (req, res) => {
  // ... query ...

  res.setHeader('Content-Type', 'application/json');

  const stream = JSONStream.stringify('{"data":[', ',', ']}');
  stream.pipe(res);

  for (const row of result.rows) {
    stream.write(transformRow(row));
  }

  stream.end();
});

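This hands the response to the socket in pieces instead of building one giant string first. Note that a synchronous loop over result.rows still serializes every row before the event loop gets a chance to flush anything. For the client to start receiving bytes before the last row is processed, the rows need to arrive incrementally (for example from a cursor via pg-query-stream) and the writer needs to respect backpressure. Critical for very large datasets.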
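Respecting backpressure means pausing when write() returns false and resuming on 'drain'. Here is a minimal sketch, assuming a standard Node Writable such as res. The writeAll helper is hypothetical and not part of JSONStream.

```javascript
const { once } = require('node:events');
const { finished } = require('node:stream/promises');

// Write every chunk to `stream`, pausing whenever its internal buffer is full.
async function writeAll(stream, chunks) {
  for (const chunk of chunks) {
    // write() returns false once the buffer passes highWaterMark;
    // wait for 'drain' before pushing more.
    if (!stream.write(chunk)) await once(stream, 'drain');
  }
  stream.end();
  await finished(stream); // resolves once everything has been flushed
}
```

Because each await yields to the event loop, other requests get served between chunks, and memory stays bounded by the stream's buffer instead of the full response.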

After fix 4 (fast-json-stringify):

Requests/sec:  618.3   (+14%)
Latency avg:   80ms    (-12%)
Latency p99:   171ms   (-14%)

Problem 5: Connection Pool Starvation

Under sustained 50-connection load, you'll hit a subtler problem: connection pool exhaustion. The default pg Pool size is 10. With 50 concurrent requests each needing a connection, 40 of them are waiting in queue.

// The invisible default that's killing your concurrency
const pool = new Pool({
  // max: 10  ← this is the default you never set
});

Tuning the Pool

const { Pool } = require('pg');

const pool = new Pool({
  host:     process.env.PGHOST,
  database: process.env.PGDATABASE,
  user:     process.env.PGUSER,
  password: process.env.PGPASSWORD,
  port:     5432,

  // Tune these to your Postgres max_connections and node count
  max:             25,    // per Node process; multiply by process count
  idleTimeoutMillis: 30_000,
  connectionTimeoutMillis: 2_000,

  // Log pool events in development — essential for diagnosing starvation
  ...(process.env.NODE_ENV === 'development' && {
    log: (...args) => console.log('[pool]', ...args),
  }),
});

// Monitor pool health — expose this to your metrics system
pool.on('connect',  () => metrics.gauge('pg.pool.size', pool.totalCount));
pool.on('acquire',  () => metrics.gauge('pg.pool.waiting', pool.waitingCount));
pool.on('remove',   () => metrics.gauge('pg.pool.idle', pool.idleCount));

The right max value:

max per process = floor(postgres_max_connections / node_process_count) - headroom

If Postgres is configured for 100 connections and you run 4 Node processes:
floor(100 / 4) - 5 = 20 per process, 80 in total across the 4 processes. The remaining 20 cover admin connections, migrations, etc.
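The same arithmetic as a small helper, in case you would rather compute the value at startup than hard-code it. This is a sketch: the function name and the default headroom of 5 are assumptions taken from the example above, not pg settings.

```javascript
// Hypothetical helper implementing:
//   max per process = floor(postgres_max_connections / node_process_count) - headroom
function poolMaxPerProcess(postgresMaxConnections, nodeProcessCount, headroom = 5) {
  if (nodeProcessCount < 1) {
    throw new RangeError('nodeProcessCount must be at least 1');
  }
  const share = Math.floor(postgresMaxConnections / nodeProcessCount);
  // Never return less than 1: a pool with max 0 could not serve any query.
  return Math.max(1, share - headroom);
}

console.log(poolMaxPerProcess(100, 4)); // 20
console.log(poolMaxPerProcess(100, 8)); // 7
```

The result would go straight into the Pool config as max. Note that the headroom is subtracted per process, so the total left over for other clients grows with the process count.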

After fix 5:

Requests/sec:  791.2   (+28%)
Latency avg:   62ms    (-23%)
Latency p99:   134ms   (-22%)

Connection pool sizing is pure configuration — no code to write, enormous impact.

The Complete Optimized Handler

Here's the final version — everything applied:

// src/routes/reports.js  (the after)
const express    = require('express');
const db         = require('../db');
const fastJson   = require('fast-json-stringify');
const router     = express.Router();

const stringify = fastJson({
  type: 'object',
  properties: {
    count: { type: 'integer' },
    data: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          id:          { type: 'integer' },
          org_id:      { type: 'integer' },
          created_at:  { type: 'string'  },
          total:       { type: 'number'  },
          fingerprint: { type: 'string'  },
          items: {
            type: 'array',
            items: {
              type: 'object',
              properties: {
                id:    { type: 'integer' },
                price: { type: 'number'  },
                qty:   { type: 'integer' },
                sku:   { type: 'string'  },
              }
            }
          }
        }
      }
    }
  }
});

router.get('/api/reports', async (req, res) => {
  const { org_id, start, end } = req.query;

  if (!org_id || !start || !end) {
    return res.status(400).json({ error: 'org_id, start, end are required' });
  }

  const result = await db.query(
    `SELECT
       o.id, o.org_id, o.created_at,
       json_agg(
         json_build_object('id', li.id, 'price', li.price, 'qty', li.qty, 'sku', li.sku)
         ORDER BY li.id
       ) AS items,
       SUM(li.price * li.qty) AS total
     FROM orders o
     JOIN line_items li ON li.order_id = o.id
     WHERE o.org_id = $1
       AND o.created_at BETWEEN $2 AND $3
     GROUP BY o.id
     ORDER BY total DESC`,
    [org_id, start, end]
  );

  const data = result.rows.map(row => ({
    id:          row.id,
    org_id:      row.org_id,
    created_at:  row.created_at.toISOString(),
    items:       row.items,
    total:       parseFloat(row.total),
    fingerprint: `${row.id}-${row.created_at.getTime().toString(36)}`,
  }));

  const payload = stringify({ data, count: data.length });
  res.setHeader('Content-Type', 'application/json');
  res.end(payload);
});

module.exports = router;

Full Benchmark Summary

Every fix measured, no cherry-picking:

Fix	Req/sec	Avg latency	p99 latency	Delta req/sec
Baseline	47	1,041ms	2,380ms	—
1. Eliminate N+1	312	158ms	401ms	+562%
2. Cheaper fingerprint	489	101ms	229ms	+57%
3. Reduce allocations	542	91ms	198ms	+11%
4. fast-json-stringify	618	80ms	171ms	+14%
5. Pool tuning	791	62ms	134ms	+28%
Total	791	62ms	134ms	+1,574%

The N+1 was worth more than 6x on its own (47 → 312 req/sec). Everything else stacked another 2.5x on top. That's the real distribution of performance work: one structural problem and a handful of incremental improvements.
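To see how the per-fix deltas compound, here is a small sketch that recomputes the step-by-step and cumulative speedups from the req/sec column of the table. It uses the rounded figures, so its numbers can differ slightly from the table's percentages, which were presumably computed from unrounded measurements.

```javascript
// Req/sec after each step, copied from the summary table (rounded values).
const steps = [
  ['Baseline', 47],
  ['Eliminate N+1', 312],
  ['Cheaper fingerprint', 489],
  ['Reduce allocations', 542],
  ['fast-json-stringify', 618],
  ['Pool tuning', 791],
];

// Speedup of each step relative to the previous step and to the baseline.
function speedups(rows) {
  const baseline = rows[0][1];
  return rows.slice(1).map(([name, rps], i) => ({
    name,
    stepFactor: rps / rows[i][1], // rows[i] is the step before this one
    cumulativeFactor: rps / baseline,
  }));
}

for (const s of speedups(steps)) {
  console.log(
    `${s.name}: x${s.stepFactor.toFixed(2)} step, x${s.cumulativeFactor.toFixed(2)} total`
  );
}
```

The step factors multiply to the cumulative factor, which is why a single large structural win dominates the total even after four further fixes.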

What to Do When the Low-Hanging Fruit Is Gone

After these fixes, you've addressed the common offenders. Further gains require different tools:

Worker threads for CPU-heavy work. If you have actual computation (image processing, cryptography on large data, PDF generation), offload it:

const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

// main thread
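function runInWorker(data) {
  return new Promise((resolve, reject) => {
    const w = new Worker(__filename, { workerData: data });
    w.on('message', resolve);
    w.on('error',   reject);
    // Without this, a worker that exits before posting a result leaves the promise pending
    w.on('exit', (code) => {
      if (code !== 0) reject(new Error(`Worker stopped with exit code ${code}`));
    });
  });
}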

// worker thread
if (!isMainThread) {
  const result = expensiveComputation(workerData);
  parentPort.postMessage(result);
}

Caching at the right layer. If the same org_id + date range is queried repeatedly, cache at the route level, but measure hit rate before adding cache complexity. A cache that misses 80% of the time adds latency instead of removing it.

const cache = new Map();          // Replace with Redis in production

router.get('/api/reports', async (req, res) => {
  const key = `${req.query.org_id}:${req.query.start}:${req.query.end}`;
  const cached = cache.get(key);

  if (cached) {
    res.setHeader('X-Cache', 'HIT');
    res.setHeader('Content-Type', 'application/json');
    return res.end(cached);
  }

  // ... query and process ...

  const payload = stringify({ data, count: data.length });
  cache.set(key, payload);
  setTimeout(() => cache.delete(key), 30_000);  // 30s TTL

  res.setHeader('X-Cache', 'MISS');
  res.setHeader('Content-Type', 'application/json');
  res.end(payload);
});
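Measuring the hit rate needs nothing more than two counters around the lookup. A minimal sketch, assuming an in-memory Map like the one in the example above. The class name CountingCache is made up for illustration.

```javascript
// Hypothetical wrapper around a Map that counts hits and misses, so the
// hit rate can be checked before deciding whether the cache earns its keep.
class CountingCache {
  constructor() {
    this.store = new Map();
    this.hits = 0;
    this.misses = 0;
  }

  get(key) {
    if (this.store.has(key)) {
      this.hits += 1;
      return this.store.get(key);
    }
    this.misses += 1;
    return undefined;
  }

  set(key, value) {
    this.store.set(key, value);
  }

  // Fraction of lookups served from the cache; 0 when nothing was looked up.
  hitRate() {
    const lookups = this.hits + this.misses;
    return lookups === 0 ? 0 : this.hits / lookups;
  }
}
```

Reporting hitRate() alongside the pool gauges under production traffic would show whether the 30s TTL catches enough repeat queries to justify the extra moving part.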

Horizontal scaling. Node is single-threaded per process. Use the cluster module or PM2 to run one process per CPU core. This is orthogonal to the optimizations above — do both.

// cluster.js
const cluster = require('cluster');
const os      = require('os');

if (cluster.isPrimary) {
  const cpus = os.cpus().length;
  console.log(`Forking ${cpus} workers`);
  for (let i = 0; i < cpus; i++) cluster.fork();
  cluster.on('exit', (worker) => {
    console.log(`Worker ${worker.process.pid} died, restarting`);
    cluster.fork();
  });
} else {
  require('./src/index.js');
}

The Discipline

Performance work without measurement is superstition. The discipline is:

Baseline before you touch anything. No exceptions.
Change one thing at a time. If you fix three things together, you don't know which one mattered.
Profile before you optimize. The bottleneck is almost never where you think it is until you look.
Keep all your benchmark JSON files. You'll need to explain the improvement to someone who wasn't there.
Test under realistic concurrency. A benchmark with -c 1 (a single connection) will not find pool exhaustion or GC pressure.

The 16x improvement above came from five targeted fixes to one endpoint. That's not unusual. Most production APIs have an N+1 they've lived with for years, a hash function nobody remembers adding, and a connection pool set to its default. Profile, fix, measure, repeat.

Row-Level Multitenancy in Rails: Building a Bulletproof Tenant Isolation Layer from Scratch

Temitope — Mon, 11 May 2026 10:26:45 +0000

If you've reached for acts_as_tenant or apartment without really understanding what's happening underneath, this tutorial is the correction. We're building a row-level multitenant system from first principles — the kind you can actually reason about when something goes wrong at 2am.

By the end, you'll have:

Middleware that resolves the current tenant from a subdomain or JWT claim
A Current-based tenant context that threads safely through the request lifecycle
An ApplicationRecord mixin that enforces tenant scoping at the model layer
Postgres row-level security policies as a hard backstop
A TenantAware concern for background jobs that re-establishes context correctly
A test helper that won't let you accidentally write cross-tenant specs

No gems. Just Rails, Postgres, and honest code.

The Mental Model

Row-level multitenancy means every tenant's data lives in the same tables, separated by a tenant_id foreign key. The application layer is responsible for filtering. The database can optionally enforce this too (and should, as defense-in-depth).

The failure modes are well-known:

A developer forgets to scope a query → data leak
A background job runs without tenant context → unscoped query touches all tenants
An association crosses tenant boundaries silently

We'll build guardrails against all three.

Step 1: The Tenant Model and Migration

First, the tenant itself. Keep it simple — tenants own a subdomain and that's the primary resolution mechanism.

# db/migrate/20240901000001_create_tenants.rb
class CreateTenants < ActiveRecord::Migration[7.2]
  def change
    create_table :tenants do |t|
      t.string  :name,      null: false
      t.string  :subdomain, null: false
      t.string  :status,    null: false, default: "active"
      t.jsonb   :settings,  null: false, default: {}
      t.timestamps
    end

    add_index :tenants, :subdomain, unique: true
    add_index :tenants, :status
  end
end

# db/migrate/20240901000002_create_accounts.rb
# A representative tenant-scoped resource
class CreateAccounts < ActiveRecord::Migration[7.2]
  def change
    create_table :accounts do |t|
      t.references :tenant, null: false, foreign_key: true, index: true
      t.string :name,  null: false
      t.string :email, null: false
      t.timestamps
    end

    # Composite index: tenant lookups are always tenant-first
    add_index :accounts, [:tenant_id, :email], unique: true
  end
end

The tenant_id column goes on every tenant-scoped table. No exceptions. Make this a convention enforced in code review, or better — a custom RuboCop rule that checks migrations.

Step 2: Thread-Local Tenant Context via `Current`

Rails 5.2 shipped ActiveSupport::CurrentAttributes. It gives you a request-scoped (or thread-scoped) object that's automatically reset between requests. This is the right place for tenant context.

# app/models/current.rb
class Current < ActiveSupport::CurrentAttributes
  attribute :tenant

  # Convenience predicate used throughout the app
  def tenant?
    tenant.present?
  end

  # Hard assertion — use in contexts where a missing tenant is a bug
  def tenant!
    tenant || raise(TenantNotSetError, "Current.tenant is not set")
  end
end

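# app/errors/tenant_not_set_error.rb
class TenantNotSetError < StandardError; end

# app/errors/tenant_not_found_error.rb
# (one constant per file, so Zeitwerk can autoload each class by name)
class TenantNotFoundError < StandardError; end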

CurrentAttributes resets between requests automatically because Rails calls reset on it at the end of each request via the executor. You get thread safety for free. Don't roll your own thread-local here — this is one case where the Rails abstraction is genuinely better.

Step 3: Middleware for Tenant Resolution

Subdomain-based resolution is the most common pattern. The middleware runs before your controllers and sets Current.tenant for the entire request.

# app/middleware/tenant_resolver.rb
class TenantResolver
  EXCLUDED_SUBDOMAINS = %w[www api admin].freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    request  = ActionDispatch::Request.new(env)
    subdomain = extract_subdomain(request)

    if subdomain && EXCLUDED_SUBDOMAINS.exclude?(subdomain)
      tenant = Tenant.active.find_by(subdomain: subdomain)

      unless tenant
        # Respond with 404 rather than redirecting: a redirect to the same
        # host would hit this middleware again and loop forever.
        return [404, { "content-type" => "text/plain" }, ["Tenant not found"]]
      end

      Current.tenant = tenant
    end

    @app.call(env)
  ensure
    # CurrentAttributes resets itself, but be explicit for clarity
    Current.reset
  end

  private

  def extract_subdomain(request)
    # Handles localhost (single part) and production (multi-part) hosts
    parts = request.host.split(".")
    return nil if parts.length <= (Rails.env.production? ? 2 : 1)
    parts.first.downcase.presence
  end
end

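# config/application.rb
# Classes under app/ are not autoloadable yet while this file is evaluated,
# so load the middleware explicitly before referencing it.
require_relative "../app/middleware/tenant_resolver"

config.middleware.insert_after ActionDispatch::Session::CookieStore, TenantResolver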

If you're using API mode with JWT instead of subdomains, the shape is the same — just parse the tenant claim from the Authorization header and resolve accordingly:

# Private helper on TenantResolver; call it from #call in API mode
def resolve_from_jwt(request)
  token = request.headers["Authorization"]&.delete_prefix("Bearer ")
  return unless token

  payload = JwtService.decode(token)  # your existing JWT layer
  Tenant.active.find_by(id: payload["tenant_id"])
rescue JWT::DecodeError
  nil
end

Step 4: Enforcing Tenant Scope at the Model Layer

This is the core of the system. Every model that belongs to a tenant should be impossible to query without a scope.

# app/models/concerns/tenant_scoped.rb
module TenantScoped
  extend ActiveSupport::Concern

  included do
    belongs_to :tenant

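    # Default scope: every query automatically filters by current tenant.
    # Current.tenant! raises TenantNotSetError when no tenant is set, so a
    # query outside tenant context fails loudly instead of returning every
    # tenant's rows. Use the explicit unscoped helpers below for admin work.
    default_scope { where(tenant: Current.tenant!) }

    # Validate that the record's tenant matches the current tenant
    validates :tenant_id, presence: true
    validate  :tenant_matches_current_context, on: :create

    # Must run before validation: a before_create callback fires only after
    # the presence validation above has already failed
    before_validation :assign_current_tenant, on: :create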
  end

  class_methods do
    # Escape hatch for internal/admin queries — use sparingly and explicitly
    def unscoped_for_tenant(tenant)
      unscoped.where(tenant: tenant)
    end

    # For cross-tenant admin operations only
    def all_tenants
      raise TenantNotSetError, "Use unscoped explicitly" unless Current.tenant.nil?
      unscoped
    end
  end

  private

  def assign_current_tenant
    self.tenant ||= Current.tenant
  end

  def tenant_matches_current_context
    return unless Current.tenant?
    return if tenant_id == Current.tenant.id

    errors.add(:tenant_id, "does not match current tenant context")
  end
end

Include it in your models:

# app/models/account.rb
class Account < ApplicationRecord
  include TenantScoped

  validates :name,  presence: true
  validates :email, presence: true, format: { with: URI::MailTo::EMAIL_REGEXP }
end

Now Account.all automatically returns only the current tenant's accounts. Account.create!(name: "Acme") automatically assigns tenant_id. You can't accidentally cross tenant boundaries in normal operation.

A Note on `default_scope`

default_scope is controversial in the Rails community, and the criticism is fair: it makes behaviour non-obvious, and it can cause surprising joins. Here's the rule: use it only for tenant scoping, never for anything else (no order, no where deleted_at IS NULL). Tenant scoping is the one case where you actually want it to be invisible — the whole point is that forgetting the scope is the bug.

Step 5: Postgres Row-Level Security as a Hard Backstop

The application layer is the first line of defense. RLS is the second. Even if a bug slips through your default scope, Postgres won't return data that doesn't belong to the current tenant.

# db/migrate/20240901000010_enable_rls_on_accounts.rb
class EnableRlsOnAccounts < ActiveRecord::Migration[7.2]
  def up
    # The app must connect as a non-superuser role without BYPASSRLS;
    # superusers and BYPASSRLS roles skip RLS policies entirely.
    execute <<~SQL
      ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;
      ALTER TABLE accounts FORCE ROW LEVEL SECURITY;

      -- Policy: SELECT/INSERT/UPDATE/DELETE only see rows for the current tenant
      CREATE POLICY tenant_isolation ON accounts
        USING (tenant_id = current_setting('app.current_tenant_id')::bigint)
        WITH CHECK (tenant_id = current_setting('app.current_tenant_id')::bigint);
    SQL
  end

  def down
    execute <<~SQL
      DROP POLICY IF EXISTS tenant_isolation ON accounts;
      ALTER TABLE accounts DISABLE ROW LEVEL SECURITY;
    SQL
  end
end

Now wire the Postgres session variable from Rails. The cleanest place is an around_action in ApplicationController, after the middleware has already set Current.tenant:

# app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  around_action :set_rls_tenant_context

  private

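  def set_rls_tenant_context
    return yield unless Current.tenant?

    begin
      # Set the session-level variable Postgres uses in the RLS policy
      set_rls_tenant_id(Current.tenant.id.to_s)
      yield
    ensure
      # The setting is session-scoped, so clear it before the connection
      # goes back to the pool. An empty value fails the ::bigint cast in the
      # policy, which blocks access instead of exposing the last tenant.
      set_rls_tenant_id("")
    end
  rescue ActiveRecord::StatementInvalid => e
    # Don't let RLS config failure silently succeed
    Rails.logger.error("RLS context error: #{e.message}")
    raise
  end

  def set_rls_tenant_id(value)
    ActiveRecord::Base.connection.execute(
      ActiveRecord::Base.sanitize_sql(
        ["SELECT set_config('app.current_tenant_id', ?, false)", value]
      )
    )
  end
end

The third argument to set_config is is_local. Passing false makes the setting session-scoped: it stays on the connection until it is overwritten or the connection closes. Passing true would scope it to the current transaction, which only helps if the whole request runs inside one transaction. Because the value persists, it has to be set at the start of every tenant request and cleared at the end, so a pooled connection never carries one request's tenant into the next.

Connection pool caveat: Rails' own pool hands a connection to one thread at a time, so a session-scoped setting is safe there as long as it is set and cleared per request. PgBouncer in transaction-pooling mode is different: consecutive statements can land on different server connections, so session settings are unreliable. In that setup, wrap the work in a transaction and pass true to set_config instead.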

Step 6: Background Jobs Without Foot Guns

Background jobs are where multitenant systems most commonly break. A job enqueued in a tenant context runs later in a worker process that has no HTTP request — so Current.tenant is nil, and your default scopes stop working.

The pattern: always serialize the tenant_id with the job, and restore Current.tenant before the job body runs.

# app/jobs/concerns/tenant_aware.rb
module TenantAware
  extend ActiveSupport::Concern

  included do
    attr_accessor :tenant_id

    before_enqueue :capture_tenant_context
    around_perform :with_tenant_context
  end

  # ActiveJob persists only what #serialize returns, so the tenant id has to
  # travel in the serialized payload; a plain instance variable would be lost.
  def serialize
    super.merge("tenant_id" => tenant_id)
  end

  def deserialize(job_data)
    super
    self.tenant_id = job_data["tenant_id"]
  end

  private

  def capture_tenant_context
    # On a retry the id is already present from the first enqueue
    self.tenant_id ||= Current.tenant&.id
    raise TenantNotSetError, "#{self.class} enqueued outside tenant context" unless tenant_id
  end

  # A single around callback, so the context is cleared even when perform raises
  def with_tenant_context
    raise TenantNotSetError, "No tenant_id found in job data" unless tenant_id

    Current.tenant = Tenant.find(tenant_id)
    set_rls_tenant_id(tenant_id.to_s)
    yield
  ensure
    Current.reset
  end

  def set_rls_tenant_id(value)
    ActiveRecord::Base.connection.execute(
      ActiveRecord::Base.sanitize_sql(
        ["SELECT set_config('app.current_tenant_id', ?, false)", value]
      )
    )
  end
end

For Sidekiq specifically, the cleanest approach is a middleware that injects and restores tenant context automatically, so you don't have to remember to include the concern on every job:

# config/initializers/sidekiq.rb

# Client middleware: inject tenant_id when the job is pushed to Redis
class SidekiqTenantClientMiddleware
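  def call(_worker_class, job, _queue, _redis_pool)
    # ||= keeps the original tenant when Sidekiq re-pushes a scheduled or
    # retried job from a thread that has no tenant context
    job["tenant_id"] ||= Current.tenant&.id
    yield
  end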
end

# Server middleware: restore Current.tenant before the worker runs
class SidekiqTenantServerMiddleware
  def call(worker, job, queue)
    tenant_id = job["tenant_id"]

    if tenant_id
      Current.tenant = Tenant.find(tenant_id)

      ActiveRecord::Base.connection.execute(
        ActiveRecord::Base.sanitize_sql(
          ["SELECT set_config('app.current_tenant_id', ?, false)", tenant_id.to_s]
        )
      )
    end

    yield
  ensure
    Current.reset
  end
end

Sidekiq.configure_client do |config|
  config.client_middleware do |chain|
    chain.add SidekiqTenantClientMiddleware
  end
end

Sidekiq.configure_server do |config|
  config.client_middleware do |chain|
    chain.add SidekiqTenantClientMiddleware
  end

  config.server_middleware do |chain|
    chain.add SidekiqTenantServerMiddleware
  end
end

Now every job automatically carries its tenant context. Workers restore it without any per-job boilerplate.

Step 7: Test Helpers That Enforce the Rules

If your test suite doesn't force tenant context, you'll write tests that pass but miss real bugs. Here's a helper module that makes tenant context explicit and prevents accidental unscoped queries in specs.

# spec/support/tenant_helpers.rb
module TenantHelpers
  # Sets Current.tenant for the duration of a block
  def with_tenant(tenant, &block)
    previous = Current.tenant
    Current.tenant = tenant
    block.call
  ensure
    Current.tenant = previous
  end

  # Creates a tenant and sets it as current for the example
  def acting_as_tenant(tenant = nil)
    tenant ||= create(:tenant)
    Current.tenant = tenant
    tenant
  end

  # Assert that a block does NOT execute any unscoped cross-tenant queries
  def expect_tenant_safe(&block)
    queries = []
    subscriber = ActiveSupport::Notifications.subscribe("sql.active_record") do |_, _, _, _, payload|
      queries << payload[:sql] if payload[:sql].match?(/FROM "accounts"|FROM "orders"/)
    end

    block.call

    unscoped = queries.reject { |q| q.include?("tenant_id") }
    expect(unscoped).to be_empty,
      "Found #{unscoped.count} unscoped queries:\n#{unscoped.join("\n")}"
  ensure
    ActiveSupport::Notifications.unsubscribe(subscriber)
  end
end

RSpec.configure do |config|
  config.include TenantHelpers

  # Auto-reset Current between examples
  config.around(:each) do |example|
    Current.reset
    example.run
    Current.reset
  end
end

Usage in specs:

# spec/models/account_spec.rb
RSpec.describe Account, type: :model do
  let(:tenant_a) { create(:tenant, subdomain: "alpha") }
  let(:tenant_b) { create(:tenant, subdomain: "beta") }

  describe "tenant isolation" do
    before do
      with_tenant(tenant_a) { create_list(:account, 3) }
      with_tenant(tenant_b) { create_list(:account, 2) }
    end

    it "only returns the current tenant's accounts" do
      with_tenant(tenant_a) do
        expect(Account.count).to eq(3)
      end
    end

    it "does not leak tenant_b records into tenant_a context" do
      with_tenant(tenant_a) do
        tenant_b_account = Account.unscoped.where(tenant: tenant_b).first
        expect { Account.find(tenant_b_account.id) }.to raise_error(ActiveRecord::RecordNotFound)
      end
    end

    it "raises when tenant context is absent" do
      Current.reset
      expect { Account.all.load }.to raise_error(TenantNotSetError)
    end
  end

  describe "cross-tenant query safety" do
    it "scopes all queries to the current tenant" do
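      # with_tenant takes a block; acting_as_tenant does not, so a block
      # passed to it would never run and the example would pass vacuously
      with_tenant(tenant_a) do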
        expect_tenant_safe { Account.where(name: "anything").to_a }
      end
    end
  end
end

Step 8: The Admin Escape Hatch

You'll need internal tooling — Sidekiq Web, an admin dashboard, a Rails console task — that operates without a tenant context. Do this explicitly, never by accident.

# app/models/concerns/tenant_scoped.rb (add to class_methods block)
def for_tenant(tenant)
  unscoped.where(tenant: tenant)
end

def across_all_tenants
  # Explicitly documents intent; cannot be called by accident
  raise ArgumentError, "You must acknowledge cross-tenant access" unless block_given?
  unscoped { yield }
end

# Usage in a Rake task or admin controller
namespace :tenants do
  desc "Backfill a field across all tenants"
  task backfill_account_status: :environment do
    Tenant.find_each do |tenant|
      Account.for_tenant(tenant).find_each do |account|
        account.update_columns(status: "active") if account.status.nil?
      end
      puts "Done: #{tenant.subdomain}"
    end
  end
end

Note the pattern: even in cross-tenant admin tasks, we still loop through tenants explicitly rather than doing a single Account.update_all. This keeps queries tenant-anchored, which is better for Postgres query plans on large datasets anyway.

Common Pitfalls and How to Catch Them

Joins That Escape the Scope

A model's default_scope covers queries on that model only. Associated records, including eagerly loaded ones, are filtered by their own model's default scope, if it has one. Test this explicitly:

# This is safe — tenant scoped on Account
Account.includes(:orders).where(name: "Acme")

# But what about Order? Does it have TenantScoped too?
# If yes, the included records are separately scoped. Good.
# If no, you're loading all orders for the matching accounts across all tenants. Bad.

Every model in a has_many or belongs_to relationship should include TenantScoped if it's tenant data. Don't let associations be the escape valve.

Cached Queries in Mailers

ActionMailer runs outside a request context in production (delivered via background job). Make sure your mailers go through the same Sidekiq middleware path and have Current.tenant set before rendering:

class AccountMailer < ApplicationMailer
  def welcome(account_id)
    # Don't pass the object — pass the ID and reload in the mailer
    # This forces the query to run inside the correct tenant context
    @account = Account.find(account_id)
    mail(to: @account.email, subject: "Welcome")
  end
end

Fixtures and Seeds With `tenant_id`

If you use db/seeds.rb or fixtures, they run outside tenant context. Either set Current.tenant explicitly in the seed file, or use Account.unscoped.create! with explicit tenant_id:

# db/seeds.rb
alpha = Tenant.create!(name: "Alpha Corp", subdomain: "alpha")
beta  = Tenant.create!(name: "Beta LLC",   subdomain: "beta")

Current.tenant = alpha
Account.create!(name: "Alice", email: "alice@alpha.com")

Current.tenant = beta
Account.create!(name: "Bob", email: "bob@beta.com")

Current.reset

What You Now Have

Let's audit the threat model:

Threat	Mitigation
Developer forgets to scope a query	`default_scope` on every model via `TenantScoped`
Scope is bypassed via `unscoped`	RLS policy in Postgres catches it
Background job runs without context	Sidekiq middleware serializes and restores `tenant_id`
Cross-tenant association load	Every associated model includes `TenantScoped`
Test suite masks real bugs	`expect_tenant_safe` helper + `Current.reset` around each spec
Admin task accidentally touches all tenants	Explicit `for_tenant` / `across_all_tenants` API

This isn't magic — it's conventions, enforced at multiple layers. The value is that no single layer needs to be perfect. If the application scope slips, RLS catches it. If RLS is misconfigured for a table, the model validation catches it on write. Defense-in-depth, each layer honest about what it does.

Where to Go Next

Schema-per-tenant: If your tenants need true schema isolation (compliance reasons, not just data volume), look at apartment or roll your own search_path switcher. The tradeoffs are well documented; schema-per-tenant is significantly more operationally complex.
Tenant provisioning: Automating CREATE POLICY for new tables as your schema evolves. A custom migration generator that adds the RLS boilerplate helps.
Query performance at scale: At 10k+ tenants, your tenant_id index cardinality affects query plans. Consider BRIN indexes for time-series tenant data and composite indexes that put tenant_id first on any multi-column lookup.
Audit logging: Every mutation should record tenant_id, user_id, and the previous value. PaperTrail's controller_info hook is the cleanest place to attach tenant context to the audit trail.

The system above will hold through 99% of what production throws at it. The other 1% is where you earn your scars — and now at least you'll know exactly which layer to look at first.

Error Handling Patterns for Python AI Pipelines: What to Catch, What to Retry, and What to Alert On

Temitope — Mon, 11 May 2026 10:13:02 +0000

AI pipelines fail differently from regular software.

A web API fails predictably — the database is down, the network is unreachable, the input is invalid. You catch the exception, return an error response, and move on. The failure modes are finite and well-understood.

An AI pipeline fails in ways that standard exception handling wasn't designed for. The model returns a response that is structurally valid but semantically wrong. The output JSON is malformed because the model decided to add a comment inside it. The pipeline succeeds on 98% of inputs and silently produces garbage on the other 2%. The same input produces different outputs on different runs, making failures intermittent and hard to reproduce.

Worse, some failures aren't exceptions at all. They're successful API calls that returned something unusable.

This article builds a systematic approach to error handling in Python AI pipelines — categorizing failure modes, deciding what to catch versus retry versus alert on, and implementing the patterns that make pipelines debuggable in production.

The Four Categories of AI Pipeline Failures

Before writing any error handling code, it helps to categorize what can go wrong. AI pipeline failures fall into four distinct categories, each requiring a different response.

1. Infrastructure Failures

These are failures you'd see in any distributed system — network timeouts, rate limiting, service unavailability. They're transient by nature and almost always safe to retry.

# Infrastructure failures — retry these
openai.APITimeoutError
openai.APIConnectionError
openai.RateLimitError
anthropic.APIConnectionError
httpx.TimeoutException

2. Input Failures

These failures happen because the input to the pipeline is invalid — too long, wrong format, contains content that triggers a safety filter. Retrying won't help because the same input will produce the same failure.

# Input failures — don't retry, fix the input
openai.BadRequestError      # Often: context too long
anthropic.BadRequestError
ContentFilterError          # Safety system triggered (application-defined, not an SDK class)
TokenLimitExceededError     # Input exceeds model context window (application-defined)

3. Output Failures

These are the most subtle. The API call succeeds, but the output isn't what you expected — malformed JSON, missing required fields, truncated response, wrong format. Standard exception handling misses these entirely because no exception was raised.

# Output failures — no exception raised, but response is unusable
# finish_reason == "length"    → response was cut off
# finish_reason == "content_filter" → output was filtered
# JSON parsing fails on response content
# Required fields missing from structured output
# Response is in wrong language

4. Logic Failures

These are failures in how the pipeline processes the model's output — transformation errors, validation failures, downstream system errors triggered by bad model output. They live outside the LLM call itself.

# Logic failures — pipeline code failed processing model output
json.JSONDecodeError
pydantic.ValidationError
KeyError  # Expected field missing from model output
ValueError  # Model output failed business rule validation

Setting Up the Error Handling Infrastructure

errors.py

from enum import Enum
from dataclasses import dataclass, field
from typing import Optional, Any


class FailureCategory(Enum):
    INFRASTRUCTURE = "infrastructure"
    INPUT = "input"
    OUTPUT = "output"
    LOGIC = "logic"


class RetryStrategy(Enum):
    RETRY = "retry"           # Safe to retry immediately or with backoff
    NO_RETRY = "no_retry"     # Retrying won't help — fix the input
    ALERT = "alert"           # Needs human attention
    FALLBACK = "fallback"     # Use a fallback response


@dataclass
class PipelineError:
    """
    Structured representation of an AI pipeline failure.
    Carries enough context to make retry and alerting decisions.
    """
    category: FailureCategory
    retry_strategy: RetryStrategy
    message: str
    original_error: Optional[Exception] = None
    context: dict = field(default_factory=dict)
    recoverable: bool = True

    def should_retry(self) -> bool:
        return self.retry_strategy == RetryStrategy.RETRY

    def should_alert(self) -> bool:
        return self.retry_strategy == RetryStrategy.ALERT

    def should_fallback(self) -> bool:
        return self.retry_strategy == RetryStrategy.FALLBACK

Step 1: Catching Infrastructure Failures

Infrastructure failures are the easiest to handle — they're transient, well-documented, and safe to retry with exponential backoff.

retry.py

import asyncio
import time
import structlog
from functools import wraps
from typing import Callable, TypeVar, Awaitable
from opentelemetry import trace

from errors import PipelineError, FailureCategory, RetryStrategy

logger = structlog.get_logger()
tracer = trace.get_tracer("pipeline-retry")

T = TypeVar("T")


def with_retry(
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retryable_errors: tuple = (),
):
    """
    Decorator that retries async functions with exponential backoff.
    Only retries on explicitly listed error types.
    """
    def decorator(func: Callable[..., Awaitable[T]]) -> Callable[..., Awaitable[T]]:
        @wraps(func)
        async def wrapper(*args, **kwargs) -> T:
            current_span = trace.get_current_span()
            last_error = None

            for attempt in range(max_attempts):
                try:
                    if attempt > 0:
                        delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
                        logger.info(
                            "retrying_after_delay",
                            attempt=attempt,
                            delay_seconds=delay,
                            function=func.__name__,
                        )
                        current_span.set_attribute("retry.attempt", attempt)
                        await asyncio.sleep(delay)

                    return await func(*args, **kwargs)

                except retryable_errors as e:
                    last_error = e
                    logger.warning(
                        "retryable_error",
                        attempt=attempt + 1,
                        max_attempts=max_attempts,
                        error_type=type(e).__name__,
                        function=func.__name__,
                    )
                    current_span.set_attribute("retry.count", attempt + 1)
                    continue

                except Exception:
                    # Non-retryable errors propagate immediately
                    raise

            # All retries exhausted
            logger.error(
                "all_retries_exhausted",
                max_attempts=max_attempts,
                function=func.__name__,
                error_type=type(last_error).__name__,
            )
            raise last_error

        return wrapper
    return decorator
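
To see what the backoff formula does in practice, it helps to compute the schedule by hand. The helper below is a hypothetical addition (`backoff_delays` is not part of retry.py) that applies the same `min(base_delay * 2 ** (attempt - 1), max_delay)` expression to each retry.

```python
def backoff_delays(
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
) -> list[float]:
    """Delays slept before each retry, mirroring the formula in with_retry."""
    return [
        min(base_delay * (2 ** (attempt - 1)), max_delay)
        for attempt in range(1, max_attempts)  # attempt 0 runs without a delay
    ]


print(backoff_delays())                # [1.0, 2.0]
print(backoff_delays(max_attempts=7))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

With the defaults, a call that fails every time sleeps for 3 seconds in total before the last error is re-raised. The decorator as written has no jitter; if many workers retry in lockstep, adding a random component to each delay is a common refinement.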

Step 2: Handling Output Failures

Output failures require a different approach because they don't raise exceptions — you have to check the response explicitly.

output_validator.py

import json
from typing import Optional, Type, TypeVar
from pydantic import BaseModel, ValidationError
import structlog
from opentelemetry import trace

from errors import PipelineError, FailureCategory, RetryStrategy

logger = structlog.get_logger()
T = TypeVar("T", bound=BaseModel)


def check_finish_reason(finish_reason: str, model: str) -> Optional[PipelineError]:
    """
    Evaluate the model's finish reason and return a PipelineError if
    the response should not be used.
    """
    if finish_reason == "stop":
        return None  # Normal completion — no error

    if finish_reason == "length":
        return PipelineError(
            category=FailureCategory.OUTPUT,
            retry_strategy=RetryStrategy.FALLBACK,
            message="Response truncated by token limit",
            context={"finish_reason": finish_reason, "model": model},
            recoverable=False,
        )

    if finish_reason == "content_filter":
        return PipelineError(
            category=FailureCategory.INPUT,
            retry_strategy=RetryStrategy.ALERT,
            message="Response blocked by content filter",
            context={"finish_reason": finish_reason, "model": model},
            recoverable=False,
        )

    # tool_calls and function_call are valid non-stop reasons
    if finish_reason in ("tool_calls", "function_call"):
        return None

    # Unknown finish reason — log but don't fail
    logger.warning("unknown_finish_reason", finish_reason=finish_reason, model=model)
    return None


def parse_structured_output(
    content: str,
    output_model: Type[T],
    operation: str,
) -> tuple[Optional[T], Optional[PipelineError]]:
    """
    Parse and validate structured JSON output from an LLM.
    Returns (result, None) on success or (None, PipelineError) on failure.
    """
    span = trace.get_current_span()

    # Step 1: Extract JSON if wrapped in markdown code blocks
    # Models frequently wrap JSON in ```json ... ``` even when told not to
    content = content.strip()
    if content.startswith("```"):
        lines = content.split("\n")
        # Remove first and last lines (the code fence markers)
        content = "\n".join(lines[1:-1] if lines[-1] == "```" else lines[1:])

    # Step 2: Parse JSON
    try:
        raw_data = json.loads(content)
    except json.JSONDecodeError as e:
        span.set_attribute("output.parse_error", "json_decode")
        logger.warning(
            "json_parse_failed",
            operation=operation,
            error=str(e),
            content_preview=content[:200],
        )
        return None, PipelineError(
            category=FailureCategory.OUTPUT,
            retry_strategy=RetryStrategy.RETRY,
            message=f"Model returned invalid JSON: {str(e)}",
            original_error=e,
            context={"operation": operation},
            recoverable=True,
        )

    # Step 3: Validate against expected schema
    try:
        result = output_model.model_validate(raw_data)
        span.set_attribute("output.validation", "passed")
        return result, None

    except ValidationError as e:
        span.set_attribute("output.parse_error", "schema_validation")
        logger.warning(
            "schema_validation_failed",
            operation=operation,
            errors=e.errors(),
        )
        return None, PipelineError(
            category=FailureCategory.OUTPUT,
            retry_strategy=RetryStrategy.RETRY,
            message=f"Model output failed schema validation: {str(e)}",
            original_error=e,
            context={"operation": operation, "validation_errors": e.errors()},
            recoverable=True,
        )
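
The fence-stripping step in parse_structured_output is easy to get subtly wrong, so it can be worth exercising on its own. Below is a minimal standalone sketch using only the standard library; `strip_code_fence` and `FENCE` are hypothetical names, and the sample payload is made up.

```python
import json

FENCE = "`" * 3  # the three-backtick Markdown fence marker


def strip_code_fence(content: str) -> str:
    """Drop a leading fence line (with optional language tag) and a trailing fence line."""
    content = content.strip()
    if not content.startswith(FENCE):
        return content
    lines = content.split("\n")
    # Keep everything between the fences; tolerate a missing closing fence
    body = lines[1:-1] if lines[-1].strip() == FENCE else lines[1:]
    return "\n".join(body)


wrapped = f'{FENCE}json\n{{"category": "billing", "urgency": "high"}}\n{FENCE}'
print(json.loads(strip_code_fence(wrapped)))
# {'category': 'billing', 'urgency': 'high'}
```

Unfenced content passes through untouched, so the helper is safe to call on every response.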

Step 3: A Complete Pipeline With Error Handling

Now let's put it together in a realistic pipeline that classifies support tickets and returns structured output.

pipeline.py

import os
from openai import AsyncOpenAI, RateLimitError, APITimeoutError, APIConnectionError, BadRequestError
from pydantic import BaseModel
from typing import Optional
import structlog
from opentelemetry import trace

from retry import with_retry
from output_validator import check_finish_reason, parse_structured_output
from errors import PipelineError, FailureCategory, RetryStrategy
from llm_tracer import llm_span, record_llm_response, record_llm_error

logger = structlog.get_logger()
tracer = trace.get_tracer("support-pipeline")
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])


class TicketClassification(BaseModel):
    category: str
    urgency: str  # "low" | "medium" | "high" | "critical"
    sentiment: str  # "positive" | "neutral" | "negative" | "angry"
    requires_human: bool
    suggested_response: Optional[str] = None


class ClassificationResult(BaseModel):
    success: bool
    classification: Optional[TicketClassification] = None
    error: Optional[str] = None
    fallback_used: bool = False


RETRYABLE_OPENAI_ERRORS = (RateLimitError, APITimeoutError, APIConnectionError)


@with_retry(
    max_attempts=3,
    base_delay=1.0,
    retryable_errors=RETRYABLE_OPENAI_ERRORS,
)
async def _call_openai(messages: list, model: str):
    """
    Raw OpenAI call with retry decoration; returns the SDK's ChatCompletion object.
    Separated from business logic so retries are clean.
    """
    response = await client.chat.completions.create(
        model=model,
        temperature=0.0,
        max_tokens=500,
        response_format={"type": "json_object"},
        messages=messages,
    )
    return response


async def classify_ticket(
    ticket_text: str,
    ticket_id: str,
) -> ClassificationResult:
    """
    Classify a support ticket with full error handling.
    Returns a ClassificationResult regardless of what goes wrong.
    """
    model = "gpt-4o-mini"
    log = logger.bind(ticket_id=ticket_id, model=model)

    with llm_span(
        model=model,
        operation="classify",
        feature="support_triage",
    ) as span:
        span.set_attribute("ticket.id", ticket_id)

        # Step 1: Check input length before calling the API
        # Avoids paying for a call that will fail with a context length error
        estimated_tokens = len(ticket_text.split()) * 1.3
        if estimated_tokens > 3000:
            log.warning("ticket_too_long", estimated_tokens=estimated_tokens)
            return ClassificationResult(
                success=False,
                error="Ticket too long for classification",
                fallback_used=True,
            )

        messages = [
            {
                "role": "system",
                "content": """Classify the support ticket. Return JSON with these exact fields:
{
  "category": "billing|technical|account|general",
  "urgency": "low|medium|high|critical",
  "sentiment": "positive|neutral|negative|angry",
  "requires_human": true|false,
  "suggested_response": "optional brief response suggestion or null"
}""",
            },
            {
                "role": "user",
                "content": ticket_text,
            },
        ]

        # Step 2: Call the API (with retry on infrastructure failures)
        try:
            response = await _call_openai(messages, model)

        except RETRYABLE_OPENAI_ERRORS as e:
            # All retries exhausted
            record_llm_error(span, e, error_type="infrastructure_exhausted")
            log.error("classification_failed_infrastructure", exc_info=True)
            return ClassificationResult(
                success=False,
                error=f"Service temporarily unavailable: {type(e).__name__}",
                fallback_used=True,
            )

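        except BadRequestError as e:
            # Input error — don't retry
            record_llm_error(span, e, error_type="bad_request")
            log.warning("classification_bad_request", error=str(e))
            return ClassificationResult(
                success=False,
                error="Invalid request",
                fallback_used=False,
            )

        except Exception as e:
            # Anything else (auth, unexpected API errors): record it and fail gracefully
            record_llm_error(span, e, error_type="unknown")
            log.error("classification_failed_unexpected", exc_info=True)
            return ClassificationResult(
                success=False,
                error="Classification failed",
                fallback_used=False,
            )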

        # Step 3: Check finish reason before trusting the output
        choice = response.choices[0]
        usage = response.usage

        record_llm_response(
            span=span,
            model=model,
            prompt_tokens=usage.prompt_tokens,
            completion_tokens=usage.completion_tokens,
            finish_reason=choice.finish_reason,
        )

        finish_error = check_finish_reason(choice.finish_reason, model)
        if finish_error:
            log.warning(
                "bad_finish_reason",
                finish_reason=choice.finish_reason,
                category=finish_error.category.value,
            )
            return ClassificationResult(
                success=False,
                error=finish_error.message,
                fallback_used=finish_error.should_fallback(),
            )

        # Step 4: Parse and validate the structured output
        classification, parse_error = parse_structured_output(
            content=choice.message.content,
            output_model=TicketClassification,
            operation="classify_ticket",
        )

        if parse_error:
            log.warning(
                "output_parse_failed",
                category=parse_error.category.value,
                message=parse_error.message,
            )
            # Output parsing failed. A second call might succeed, but
            # with_retry only covers infrastructure errors, so there is
            # no output retry budget here: return a graceful failure
            return ClassificationResult(
                success=False,
                error=parse_error.message,
                fallback_used=True,
            )

        log.info(
            "ticket_classified",
            category=classification.category,
            urgency=classification.urgency,
            requires_human=classification.requires_human,
        )

        span.set_attributes({
            "ticket.category": classification.category,
            "ticket.urgency": classification.urgency,
            "ticket.requires_human": classification.requires_human,
        })

        return ClassificationResult(
            success=True,
            classification=classification,
        )
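
The parse errors above carry RetryStrategy.RETRY, yet classify_ticket returns a failure after a single parse attempt. One option is a small, separate budget for output retries. The sketch below assumes a zero-argument coroutine that performs the model call and returns its text; `call_with_output_retry`, `max_output_attempts`, and `fake_model` are hypothetical names, and the fake model stands in for the real API.

```python
import asyncio
import json
from typing import Awaitable, Callable, Optional


async def call_with_output_retry(
    call_model: Callable[[], Awaitable[str]],
    max_output_attempts: int = 2,
) -> Optional[dict]:
    """
    Re-ask the model when its output is not valid JSON.
    Returns the parsed object, or None once the output budget is spent.
    """
    for _ in range(max_output_attempts):
        content = await call_model()
        try:
            return json.loads(content)
        except json.JSONDecodeError:
            continue  # malformed output: spend one unit of the output budget
    return None


async def demo() -> Optional[dict]:
    # Stand-in for the real API call: malformed first, valid second
    replies = iter(['{"category": "billing"', '{"category": "billing"}'])

    async def fake_model() -> str:
        return next(replies)

    return await call_with_output_retry(fake_model)


print(asyncio.run(demo()))  # {'category': 'billing'}
```

Keeping this budget separate from with_retry means a flaky network and a flaky output format cannot multiply each other's retries without you noticing. Every extra attempt is a paid API call, so the budget should stay small.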

Step 4: What to Alert On

Not every failure needs a human. Here's how to think about alerting thresholds:

Alert immediately:

content_filter finish reason — a prompt in your system may be triggering safety systems
infrastructure_exhausted errors exceeding 1% of requests — your retry budget is being consumed
Parse failures on structured output exceeding 5% — the model is producing malformed output consistently

Alert on trend, not on individual events:

Rising length finish reasons — prompts are growing and approaching token limits
Increasing retry counts — a dependency is degrading
Validation failure rate increasing — the model's output format may have drifted

Log but don't alert:

Single rate limit errors (retries handle these)
Individual JSON parse failures (occasional, expected)
Unknown finish reasons (rare edge cases)

monitoring.py

from opentelemetry import metrics

from errors import PipelineError

meter = metrics.get_meter("pipeline-metrics")

# Counters for tracking failure rates
pipeline_errors = meter.create_counter(
    "pipeline.errors",
    description="Count of pipeline errors by category and type",
)

pipeline_retries = meter.create_counter(
    "pipeline.retries",
    description="Count of retry attempts",
)

pipeline_fallbacks = meter.create_counter(
    "pipeline.fallbacks",
    description="Count of fallback responses served",
)


def record_pipeline_error(
    error: PipelineError,
    feature: str,
    model: str,
) -> None:
    pipeline_errors.add(1, {
        "error.category": error.category.value,
        "error.retry_strategy": error.retry_strategy.value,
        "feature": feature,
        "model": model,
    })

    if error.should_fallback():
        pipeline_fallbacks.add(1, {"feature": feature, "model": model})
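
Counters only record events; the thresholds from the alerting lists still have to be evaluated somewhere, usually in your observability backend. To make the rule concrete, here is a rough sketch that evaluates one window of request outcomes in plain Python. The outcome labels, `RATE_THRESHOLDS`, and `alerts_for_window` are hypothetical names for illustration.

```python
from collections import Counter

# Thresholds taken from the "Alert immediately" list
RATE_THRESHOLDS = {
    "infrastructure_exhausted": 0.01,  # more than 1% of requests
    "parse_failure": 0.05,             # more than 5% of requests
}


def alerts_for_window(outcomes: list[str]) -> list[str]:
    """
    outcomes holds one label per request in the evaluation window,
    e.g. "ok", "parse_failure", "infrastructure_exhausted", "content_filter".
    """
    if not outcomes:
        return []
    counts = Counter(outcomes)
    alerts = []
    if counts["content_filter"] > 0:
        alerts.append("content_filter")  # any occurrence alerts
    for label, threshold in RATE_THRESHOLDS.items():
        if counts[label] / len(outcomes) > threshold:
            alerts.append(label)
    return alerts


window = ["ok"] * 93 + ["parse_failure"] * 6 + ["infrastructure_exhausted"]
print(alerts_for_window(window))  # ['parse_failure']
```

In the sample window, parse failures sit at 6% and trip the alert, while a single exhausted retry at exactly 1% does not.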

The Decision Tree

When an AI pipeline fails, this is how to decide what to do:

Exception raised?
├── Yes → Is it a known infrastructure error? (timeout, rate limit, connection)
│         ├── Yes → Retry with exponential backoff
│         └── No → Is it an input error? (too long, bad request)
│                  ├── Yes → Return error immediately, don't retry
│                  └── No → Log, alert, return graceful failure
│
└── No → Check finish_reason
          ├── "stop" → Parse and validate output
          │            ├── Valid → Return result
          │            └── Invalid → Retry if budget remains, else fallback
          ├── "length" → Response truncated → Fallback
          ├── "content_filter" → Alert, return graceful failure
          └── Other → Log, continue
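
The tree can also be written as a single function, which makes it easy to unit test. This is a sketch under assumptions: `decide`, `InputError`, and the returned action strings are hypothetical names, and the built-in `TimeoutError` and `ConnectionError` stand in for the SDK's infrastructure errors.

```python
from typing import Optional

INFRASTRUCTURE_ERRORS = (TimeoutError, ConnectionError)  # stand-ins for SDK error types


class InputError(Exception):
    """Hypothetical marker for too-long or otherwise rejected input."""


def decide(
    error: Optional[Exception] = None,
    finish_reason: Optional[str] = None,
    output_valid: bool = False,
    retries_left: int = 0,
) -> str:
    """Walk the decision tree and return the name of the action to take."""
    if error is not None:
        if isinstance(error, INFRASTRUCTURE_ERRORS):
            return "retry_with_backoff"
        if isinstance(error, InputError):
            return "return_error"
        return "alert_and_fail_gracefully"
    if finish_reason == "stop":
        if output_valid:
            return "return_result"
        return "retry" if retries_left > 0 else "fallback"
    if finish_reason == "length":
        return "fallback"
    if finish_reason == "content_filter":
        return "alert_and_fail_gracefully"
    return "log_and_continue"


print(decide(error=TimeoutError()))                  # retry_with_backoff
print(decide(finish_reason="stop", retries_left=1))  # retry
print(decide(finish_reason="length"))                # fallback
```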

Summary

AI pipeline error handling requires explicit categorization before writing any catch blocks. The four categories — infrastructure, input, output, and logic — each have different retry strategies and alerting thresholds.

The patterns that matter most in production:

Retry infrastructure failures with exponential backoff, but only infrastructure failures
Check finish reason before trusting model output — a 200 response doesn't mean usable output
Validate structured output against a schema before using it downstream
Return graceful failures rather than propagating exceptions to users
Alert on rates, not individuals — single failures are noise, trends are signal

An AI pipeline that handles errors well degrades gracefully. Users get a fallback instead of a 500. Engineers get structured telemetry instead of a stack trace. And the system stays observable when something goes wrong at 2am.

Find me on GitHub or LinkedIn.

Monitoring LLM API Calls in Python: Latency, Token Usage, and Cost Tracking With OpenTelemetry

Temitope — Mon, 11 May 2026 10:09:00 +0000

LLM API calls are unlike any other external dependency in your Python application.

A database query takes milliseconds. A Redis call takes microseconds. An LLM call takes anywhere from half a second to thirty seconds, consumes a variable number of tokens on every invocation, costs real money on every request, and can fail in ways that have nothing to do with network connectivity — token limits, content filters, model refusals, context window exhaustion.

Standard application monitoring was not built for this. Your existing latency dashboards will show LLM calls as outliers. Your error rate alerts will fire on model refusals that aren't actually errors. Your cost monitoring won't exist at all unless you build it.

This article builds it. We'll instrument LLM API calls in Python with OpenTelemetry — capturing latency, token consumption, estimated cost, and finish reasons as structured telemetry that you can query, dashboard, and alert on.

The Monitoring Gap in LLM Applications

When you add an LLM to a Python application, you typically get visibility into two things: whether the call succeeded, and how long it took. Everything else — how many tokens it consumed, what the model decided to do, how much it cost, whether it hit a limit — is invisible unless you instrument it explicitly.

This creates real operational problems:

A feature that works in testing starts timing out in production because prompts grew longer than expected and token counts climbed
Costs spike unexpectedly because one endpoint is generating unusually long completions
Users report bad responses but you can't tell whether the model refused, truncated, or hallucinated because finish_reason is never captured
You can't tell which of your ten LLM-powered features is responsible for 80% of your API spend

Structured telemetry on LLM calls fixes all of these. Let's build it.

Prerequisites

Python 3.10+
An OpenAI or Anthropic API key
A running OpenTelemetry Collector or observability backend

Installing Dependencies

pip install opentelemetry-sdk
pip install opentelemetry-api
pip install opentelemetry-exporter-otlp-proto-grpc
pip install openai
pip install anthropic
pip install fastapi uvicorn
pip install opentelemetry-instrumentation-fastapi
pip install structlog

Project Structure

llm-monitoring/
├── tracing.py          # OpenTelemetry setup
├── llm_tracer.py       # LLM instrumentation layer
├── cost_estimator.py   # Token cost calculation
├── main.py             # FastAPI application
└── services.py         # LLM-powered features

Step 1: OpenTelemetry Setup

tracing.py

import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME


def init_tracer(service_name: str) -> trace.Tracer:
    resource = Resource.create({
        SERVICE_NAME: service_name,
        "service.version": "1.0.0",
    })

    exporter = OTLPSpanExporter(
        endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "localhost:4317"),
        insecure=True,
    )

    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    return trace.get_tracer(service_name)

Step 2: Cost Estimation

Before building the instrumentation layer, we need a way to estimate costs. LLM providers charge per token, with different rates for input and output tokens.

cost_estimator.py

from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelPricing:
    input_cost_per_token: float   # USD per token
    output_cost_per_token: float  # USD per token


# Pricing as of early 2026 — verify against provider pricing pages
# before building cost dashboards on these numbers
MODEL_PRICING: dict[str, ModelPricing] = {
    # OpenAI
    "gpt-4o": ModelPricing(
        input_cost_per_token=0.000005,
        output_cost_per_token=0.000015,
    ),
    "gpt-4o-mini": ModelPricing(
        input_cost_per_token=0.00000015,
        output_cost_per_token=0.0000006,
    ),
    "gpt-3.5-turbo": ModelPricing(
        input_cost_per_token=0.0000005,
        output_cost_per_token=0.0000015,
    ),
    # Anthropic
    "claude-sonnet-4-6": ModelPricing(
        input_cost_per_token=0.000003,
        output_cost_per_token=0.000015,
    ),
    "claude-haiku-4-5": ModelPricing(
        input_cost_per_token=0.00000025,
        output_cost_per_token=0.00000125,
    ),
}


def estimate_cost(
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
) -> Optional[float]:
    """
    Estimate the cost of an LLM call in USD.
    Returns None if the model is not in the pricing table.
    """
    pricing = MODEL_PRICING.get(model)
    if not pricing:
        return None

    input_cost = prompt_tokens * pricing.input_cost_per_token
    output_cost = completion_tokens * pricing.output_cost_per_token
    return round(input_cost + output_cost, 8)
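
Provider pricing pages usually quote prices per million tokens, so a quick worked example helps confirm the table entries are in the right unit. The sketch below redoes the arithmetic for a gpt-4o-mini call using the same rates as the table ($0.15 input and $0.60 output per million tokens); `per_token` is a hypothetical helper and the token counts are example values.

```python
def per_token(price_per_million_usd: float) -> float:
    """Convert a price quoted per million tokens into a per-token price."""
    return price_per_million_usd / 1_000_000


input_rate = per_token(0.15)   # 0.00000015 USD per input token
output_rate = per_token(0.60)  # 0.0000006 USD per output token

prompt_tokens, completion_tokens = 87, 52
cost = round(prompt_tokens * input_rate + completion_tokens * output_rate, 8)
print(cost)  # 4.425e-05
```

A single call costs a fraction of a cent, which is why cost only becomes visible once it is summed per feature over thousands of requests.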

Step 3: The LLM Instrumentation Layer

This is the core of the setup — a context manager that wraps any LLM call and captures the telemetry we care about.

llm_tracer.py

import time
from contextlib import contextmanager
from typing import Optional, Generator
from opentelemetry import trace
from opentelemetry.trace import Span, Status, StatusCode

from cost_estimator import estimate_cost

tracer = trace.get_tracer("llm-instrumentation")


@contextmanager
def llm_span(
    model: str,
    operation: str,
    feature: str,
    prompt_tokens: Optional[int] = None,
    temperature: float = 0.0,
    max_tokens: Optional[int] = None,
) -> Generator[Span, None, None]:
    """
    Context manager that creates a span for an LLM API call.

    Args:
        model: The model identifier (e.g. "gpt-4o", "claude-sonnet-4-6")
        operation: What this call is doing (e.g. "summarize", "classify", "generate")
        feature: Which product feature triggered this call (e.g. "order_summary", "search")
        prompt_tokens: Estimated prompt token count (if known before the call)
        temperature: Sampling temperature
        max_tokens: Maximum tokens requested
    """
    with tracer.start_as_current_span(f"llm.{operation}") as span:
        # Request attributes — known before the call
        span.set_attributes({
            "llm.model": model,
            "llm.operation": operation,
            "llm.feature": feature,
            "llm.temperature": temperature,
            "llm.request_time": time.time(),
        })

        if prompt_tokens is not None:
            span.set_attribute("llm.prompt_tokens", prompt_tokens)

        if max_tokens is not None:
            span.set_attribute("llm.max_tokens", max_tokens)

        start_time = time.perf_counter()

        try:
            yield span
        finally:
            latency_ms = (time.perf_counter() - start_time) * 1000
            span.set_attribute("llm.latency_ms", round(latency_ms, 2))


def record_llm_response(
    span: Span,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    finish_reason: str,
    cached: bool = False,
) -> None:
    """
    Record response attributes after an LLM call completes.
    Call this inside the llm_span context manager after the API call returns.
    """
    total_tokens = prompt_tokens + completion_tokens
    cost = estimate_cost(model, prompt_tokens, completion_tokens)

    span.set_attributes({
        "llm.prompt_tokens": prompt_tokens,
        "llm.completion_tokens": completion_tokens,
        "llm.total_tokens": total_tokens,
        "llm.finish_reason": finish_reason,
        "llm.cached": cached,
    })

    if cost is not None:
        span.set_attribute("llm.estimated_cost_usd", cost)

    # Set span status based on finish reason
    # Not all non-"stop" finish reasons are errors — but they need visibility
    if finish_reason == "length":
        # Response was cut off — may indicate prompt is too long
        # or max_tokens is set too low
        span.set_status(Status(StatusCode.ERROR, "Response truncated by token limit"))
        span.set_attribute("llm.truncated", True)

    elif finish_reason == "content_filter":
        # Content policy triggered — usually a prompt design issue
        span.set_status(Status(StatusCode.ERROR, "Response blocked by content filter"))

    elif finish_reason == "stop":
        span.set_status(Status(StatusCode.OK))

    else:
        # tool_calls, function_call, or unknown — not an error
        span.set_status(Status(StatusCode.OK))


def record_llm_error(span: Span, error: Exception, error_type: str) -> None:
    """
    Record an LLM API error on the span.
    Use error_type to distinguish between different failure modes.
    """
    span.record_exception(error)
    span.set_attributes({
        "llm.error": True,
        "llm.error_type": error_type,
    })
    span.set_status(Status(StatusCode.ERROR, str(error)))

The finish_reason handling is worth examining. When an LLM response is truncated because of a token limit, most monitoring systems record it as a successful call — the HTTP request returned 200. But from a product perspective, the response is incomplete and the user may get a broken experience. Treating finish_reason == "length" as an error in the span means you can alert on it separately from network failures or API errors.

Step 4: Instrumenting Real LLM Calls

Now let's use the instrumentation layer with actual API calls.

services.py

import os
from openai import AsyncOpenAI, RateLimitError, APITimeoutError
from anthropic import AsyncAnthropic, APIStatusError
import structlog

from llm_tracer import llm_span, record_llm_response, record_llm_error

logger = structlog.get_logger()
openai_client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
anthropic_client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])


async def summarize_order(order_text: str, user_id: str) -> str:
    """Summarize an order for the customer dashboard."""
    model = "gpt-4o-mini"

    with llm_span(
        model=model,
        operation="summarize",
        feature="order_dashboard",
        temperature=0.0,
        max_tokens=200,
    ) as span:
        try:
            response = await openai_client.chat.completions.create(
                model=model,
                temperature=0.0,
                max_tokens=200,
                messages=[
                    {
                        "role": "system",
                        "content": "Summarize the following order in 2-3 sentences for a customer.",
                    },
                    {
                        "role": "user",
                        "content": order_text,
                    },
                ],
            )

            choice = response.choices[0]
            usage = response.usage

            record_llm_response(
                span=span,
                model=model,
                prompt_tokens=usage.prompt_tokens,
                completion_tokens=usage.completion_tokens,
                finish_reason=choice.finish_reason,
            )

            logger.info(
                "order_summarized",
                user_id=user_id,
                model=model,
                prompt_tokens=usage.prompt_tokens,
                completion_tokens=usage.completion_tokens,
                finish_reason=choice.finish_reason,
            )

            return choice.message.content

        except RateLimitError as e:
            record_llm_error(span, e, error_type="rate_limit")
            logger.warning("llm_rate_limited", model=model, feature="order_dashboard")
            raise

        except APITimeoutError as e:
            record_llm_error(span, e, error_type="timeout")
            logger.error("llm_timeout", model=model, feature="order_dashboard")
            raise

        except Exception as e:
            record_llm_error(span, e, error_type="unknown")
            logger.error("llm_error", model=model, exc_info=True)
            raise


async def classify_support_ticket(ticket_text: str) -> dict:
    """Classify a support ticket by category and urgency."""
    model = "claude-haiku-4-5"

    with llm_span(
        model=model,
        operation="classify",
        feature="support_triage",
        temperature=0.0,
        max_tokens=100,
    ) as span:
        try:
            response = await anthropic_client.messages.create(
                model=model,
                max_tokens=100,
                temperature=0.0,  # match the temperature recorded on the span
                messages=[
                    {
                        "role": "user",
                        "content": f"""Classify this support ticket. 
Respond with JSON only: {{"category": "...", "urgency": "low|medium|high"}}

Ticket: {ticket_text}""",
                    }
                ],
            )

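            usage = response.usage
            # Anthropic reports stop_reason ("end_turn", "max_tokens", ...), not
            # OpenAI-style finish_reason values. Map it before recording, otherwise
            # truncated responses ("max_tokens") are never flagged as "length"
            stop_reason_map = {
                "end_turn": "stop",
                "stop_sequence": "stop",
                "max_tokens": "length",
                "tool_use": "tool_calls",
            }
            stop_reason = response.stop_reason or "end_turn"
            finish_reason = stop_reason_map.get(stop_reason, stop_reason)

            record_llm_response(
                span=span,
                model=model,
                prompt_tokens=usage.input_tokens,
                completion_tokens=usage.output_tokens,
                finish_reason=finish_reason,
            )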

            import json
            result = json.loads(response.content[0].text)

            # Add classification result to span for filtering
            span.set_attributes({
                "ticket.category": result.get("category", "unknown"),
                "ticket.urgency": result.get("urgency", "unknown"),
            })

            return result

        except APIStatusError as e:
            record_llm_error(span, e, error_type=f"api_status_{e.status_code}")
            raise

        except Exception as e:
            record_llm_error(span, e, error_type="unknown")
            raise

Step 5: Wiring Into FastAPI

main.py

import os
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

from tracing import init_tracer
from services import summarize_order, classify_support_ticket

init_tracer("llm-powered-api")

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)


class OrderSummaryRequest(BaseModel):
    order_text: str
    user_id: str


class SupportTicketRequest(BaseModel):
    ticket_text: str


@app.post("/orders/summarize")
async def summarize(request: OrderSummaryRequest):
    try:
        summary = await summarize_order(request.order_text, request.user_id)
        return {"summary": summary}
    except Exception:
        raise HTTPException(status_code=503, detail="Summary service unavailable")


@app.post("/support/classify")
async def classify(request: SupportTicketRequest):
    try:
        classification = await classify_support_ticket(request.ticket_text)
        return classification
    except Exception:
        raise HTTPException(status_code=503, detail="Classification service unavailable")

What the Telemetry Looks Like

A successful call to /orders/summarize produces a span with these attributes:

{
  "name": "llm.summarize",
  "status": "OK",
  "attributes": {
    "llm.model": "gpt-4o-mini",
    "llm.operation": "summarize",
    "llm.feature": "order_dashboard",
    "llm.temperature": 0.0,
    "llm.max_tokens": 200,
    "llm.prompt_tokens": 87,
    "llm.completion_tokens": 52,
    "llm.total_tokens": 139,
    "llm.finish_reason": "stop",
    "llm.estimated_cost_usd": 0.00004425,
    "llm.latency_ms": 1243.5,
    "llm.cached": false
  }
}

A truncated response — where the model hit the token limit — looks like:

{
  "name": "llm.summarize",
  "status": "ERROR",
  "status_message": "Response truncated by token limit",
  "attributes": {
    "llm.model": "gpt-4o-mini",
    "llm.finish_reason": "length",
    "llm.truncated": true,
    "llm.prompt_tokens": 312,
    "llm.completion_tokens": 200,
    "llm.total_tokens": 512,
    "llm.estimated_cost_usd": 0.0001672,
    "llm.latency_ms": 3821.2
  }
}

Dashboards and Alerts That Actually Matter

With this telemetry in place, here are the queries that become useful:

Cost by feature: Group spans by llm.feature and sum llm.estimated_cost_usd. This tells you which features are driving your LLM spend. In most applications, one or two features account for the majority of cost.

Truncation rate by model: Filter spans where llm.truncated = true and group by llm.model. A rising truncation rate on a specific model usually means prompts are growing — often because you've added more context or the input data has changed.

Latency percentiles by operation: P50 and P99 latency grouped by llm.operation. LLM latency distributions are wide — P50 might be 800ms while P99 is 12 seconds. Alerting on P99 rather than average catches the tail latency issues that users actually experience.

Error rate by error type: Group spans by llm.error_type. Rate limit errors, timeouts, and content filter triggers have completely different remediation paths. Grouping them together hides what's actually wrong.

Recommended alerts:

Alert	Condition	Threshold
High latency	P99 `llm.latency_ms`	> 10,000ms
Truncation spike	`llm.truncated = true` rate	> 5% of calls
Rate limiting	`llm.error_type = rate_limit` count	> 10 per minute
Cost spike	Sum `llm.estimated_cost_usd` per hour	> 2x baseline
Content filter	`llm.error_type = content_filter` count	> 3 per hour
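To make the queries above concrete, here is a minimal sketch that runs three of them over span attributes held in memory. The `spans` sample data, the `support_triage` feature name, and the helper functions are all hypothetical; in practice your observability backend runs these aggregations for you.

```python
import math
from collections import defaultdict

# Hypothetical exported span attributes (values are illustrative only)
spans = [
    {"llm.feature": "order_dashboard", "llm.model": "gpt-4o-mini",
     "llm.estimated_cost_usd": 0.0001, "llm.latency_ms": 800.0, "llm.truncated": False},
    {"llm.feature": "order_dashboard", "llm.model": "gpt-4o-mini",
     "llm.estimated_cost_usd": 0.0002, "llm.latency_ms": 12000.0, "llm.truncated": True},
    {"llm.feature": "support_triage", "llm.model": "gpt-4o-mini",
     "llm.estimated_cost_usd": 0.0004, "llm.latency_ms": 1500.0, "llm.truncated": False},
    {"llm.feature": "support_triage", "llm.model": "gpt-4o-mini",
     "llm.estimated_cost_usd": 0.0003, "llm.latency_ms": 2000.0, "llm.truncated": False},
]


def cost_by_feature(spans):
    """Sum estimated cost per llm.feature."""
    totals = defaultdict(float)
    for s in spans:
        totals[s["llm.feature"]] += s["llm.estimated_cost_usd"]
    return dict(totals)


def truncation_rate_by_model(spans):
    """Fraction of calls per model where llm.truncated is true."""
    counts = defaultdict(lambda: [0, 0])  # model -> [truncated, total]
    for s in spans:
        bucket = counts[s["llm.model"]]
        bucket[0] += 1 if s.get("llm.truncated") else 0
        bucket[1] += 1
    return {model: truncated / total for model, (truncated, total) in counts.items()}


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


latencies = [s["llm.latency_ms"] for s in spans]
print(cost_by_feature(spans))
print(truncation_rate_by_model(spans))
print(percentile(latencies, 50), percentile(latencies, 99))
```

Even with four sample spans, the median and the P99 differ by almost an order of magnitude, which is why the alert table targets P99 rather than the average.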

Handling Retries Without Double-Counting

If your application retries failed LLM calls, you need to track retry counts to avoid double-counting costs and misattributing errors.

async def summarize_with_retry(order_text: str, user_id: str, max_retries: int = 2) -> str:
    model = "gpt-4o-mini"
    last_error = None

    for attempt in range(max_retries + 1):
        with llm_span(
            model=model,
            operation="summarize",
            feature="order_dashboard",
        ) as span:
            span.set_attribute("llm.attempt", attempt)
            span.set_attribute("llm.is_retry", attempt > 0)

            try:
                response = await openai_client.chat.completions.create(
                    model=model,
                    max_tokens=200,
                    messages=[
                        {"role": "system", "content": "Summarize this order."},
                        {"role": "user", "content": order_text},
                    ],
                )

                usage = response.usage
                record_llm_response(
                    span=span,
                    model=model,
                    prompt_tokens=usage.prompt_tokens,
                    completion_tokens=usage.completion_tokens,
                    finish_reason=response.choices[0].finish_reason,
                )

                return response.choices[0].message.content

            except RateLimitError as e:
                record_llm_error(span, e, error_type="rate_limit")
                last_error = e
                if attempt < max_retries:
                    import asyncio
                    await asyncio.sleep(2 ** attempt)
                continue

    raise last_error

With llm.attempt and llm.is_retry on every span, you can count logical requests by filtering to llm.attempt = 0, report retry cost separately from first-attempt cost, or query retried calls specifically to understand which operations are flaky.
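As a rough illustration of that kind of breakdown, the sketch below splits a handful of spans into logical calls and retries. The sample records and the `summarize_attempts` helper are hypothetical, and it assumes failed attempts carry a cost of zero.

```python
# Hypothetical span attributes: one call that was rate limited and retried, one that succeeded first time
spans = [
    {"llm.attempt": 0, "llm.is_retry": False, "llm.error_type": "rate_limit",
     "llm.estimated_cost_usd": 0.0},
    {"llm.attempt": 1, "llm.is_retry": True, "llm.estimated_cost_usd": 0.0002},
    {"llm.attempt": 0, "llm.is_retry": False, "llm.estimated_cost_usd": 0.0001},
]


def summarize_attempts(spans):
    """Separate logical calls (attempt 0) from retries so neither is counted twice."""
    logical_calls = sum(1 for s in spans if s["llm.attempt"] == 0)
    retries = sum(1 for s in spans if s["llm.is_retry"])
    total_cost = sum(s.get("llm.estimated_cost_usd", 0.0) for s in spans)
    retry_cost = sum(s.get("llm.estimated_cost_usd", 0.0) for s in spans if s["llm.is_retry"])
    return {
        "logical_calls": logical_calls,
        "retries": retries,
        "total_cost_usd": total_cost,
        "retry_cost_usd": retry_cost,
    }


summary = summarize_attempts(spans)
print(summary)
```

Counting spans alone would report three calls here; counting attempt 0 reports the two requests users actually made.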

Summary

LLM API calls require a different approach to monitoring than standard HTTP dependencies. The key attributes to capture are:

Latency — LLM calls are slow and variable; P99 matters more than average
Token counts — input and output separately, since they have different costs
Finish reason — stop, length, content_filter, and tool_calls each indicate different conditions
Estimated cost — per-call and aggregated by feature
Error type — rate limits, timeouts, and content filters need different responses

The instrumentation layer in this article wraps both OpenAI and Anthropic calls with a consistent span structure. As you add more models or providers, the pattern stays the same — llm_span as the context manager, record_llm_response after the call, record_llm_error in the exception handler.

Without this telemetry, LLM-powered features are a black box. With it, you can answer the questions that actually matter in production: what is this costing, why is it slow, and what is the model actually doing.

Find me on GitHub or LinkedIn.

Structured Logging in Python With Structlog: Correlating Logs, Traces, and Errors in Production

Temitope — Mon, 11 May 2026 09:58:22 +0000

Most Python applications log like this:

logger.info(f"Processing order {order_id} for user {user_id}")
logger.error(f"Payment failed for order {order_id}: {str(e)}")

It works fine in development. In production, it falls apart.

When you're debugging a latency spike at 2am, grep-ing through gigabytes of unstructured log strings is painful. When you want to correlate a log line with the trace that generated it, you can't — the trace ID isn't in the log. When you want to count how many payment failures happened for a specific product in the last hour, you can't — the data is buried in a string.

Structured logging fixes all of this. Instead of log strings, you emit log events — dictionaries of key-value pairs that your log aggregator can index, query, and alert on.

In this article, we'll build a production-grade structured logging setup using Structlog, connect it to OpenTelemetry traces, and establish the patterns that make logs actually useful when something goes wrong.

Why Structlog Over Python's Built-in Logging

Python's standard logging module can emit JSON, but it requires significant configuration and produces output that's awkward to work with. Structlog takes a different approach — it makes structured, context-rich logging the default rather than something you have to force out of the standard library.

The core difference is how context is handled. With standard logging, you manually include context in every log call:

# Standard logging — context repeated everywhere
logger.info(f"Order created: order_id={order_id}, user_id={user_id}, product_id={product_id}")
logger.info(f"Payment processed: order_id={order_id}, user_id={user_id}, amount={amount}")
logger.info(f"Email sent: order_id={order_id}, user_id={user_id}, template=confirmation")

With Structlog, you bind context once and it automatically appears on every subsequent log event in that request:

# Structlog — bind once, appears everywhere
log = structlog.get_logger().bind(order_id=order_id, user_id=user_id)
log.info("order_created")
log.info("payment_processed", amount=amount)
log.info("email_sent", template="confirmation")

This isn't just cleaner — it means you can never accidentally forget to include the order ID on a critical log line.

Installing Dependencies

pip install structlog
pip install fastapi uvicorn
pip install opentelemetry-sdk
pip install opentelemetry-api
pip install python-json-logger  # Optional: not used below, Structlog's JSONRenderer produces the JSON output

Project Structure

structured-logging/
├── logging_config.py   # Structlog setup
├── middleware.py        # Request context injection
├── main.py             # FastAPI application
└── services.py         # Business logic with logging

Step 1: Configuring Structlog

The configuration is where most of the important decisions happen. We want different behavior in development (readable, colored output) and production (JSON, machine-parseable).

logging_config.py

import logging
import sys
import structlog
from typing import Any


def add_trace_context(
    logger: Any, method: str, event_dict: dict
) -> dict:
    """
    Inject the current OpenTelemetry trace and span ID into every log event.
    This is the bridge between logs and traces.
    """
    from opentelemetry import trace

    current_span = trace.get_current_span()
    span_context = current_span.get_span_context()

    if span_context.is_valid:
        # Format as hex strings matching the W3C traceparent format
        event_dict["trace_id"] = format(span_context.trace_id, "032x")
        event_dict["span_id"] = format(span_context.span_id, "016x")
        event_dict["trace_sampled"] = span_context.trace_flags.sampled

    return event_dict


def add_log_level(
    logger: Any, method: str, event_dict: dict
) -> dict:
    """Add log level as a string field."""
    event_dict["level"] = method.upper()
    return event_dict


def configure_logging(environment: str = "production") -> None:
    """
    Configure Structlog for the given environment.
    Call this once at application startup.
    """

    # Shared processors that run in both development and production
    shared_processors = [
        structlog.contextvars.merge_contextvars,       # Merge request-scoped context
        add_log_level,                                 # Add uppercase level field (defined above)
        structlog.processors.TimeStamper(fmt="iso"),   # ISO 8601 timestamps
        add_trace_context,                             # Inject trace/span IDs
        structlog.processors.StackInfoRenderer(),      # Include stack info if present
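        # exc_info is rendered per environment below: ConsoleRenderer formats it
        # in development, dict_tracebacks structures it in production.
        # (Running format_exc_info first would consume exc_info before dict_tracebacks sees it.)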
    ]

    if environment == "development":
        # Human-readable output for local development
        structlog.configure(
            processors=shared_processors + [
                structlog.dev.ConsoleRenderer(colors=True),
            ],
            wrapper_class=structlog.make_filtering_bound_logger(logging.DEBUG),
            context_class=dict,
            logger_factory=structlog.PrintLoggerFactory(),
            cache_logger_on_first_use=True,
        )
    else:
        # JSON output for production — parseable by log aggregators
        structlog.configure(
            processors=shared_processors + [
                structlog.processors.dict_tracebacks,      # Structured tracebacks
                structlog.processors.JSONRenderer(),        # JSON output
            ],
            wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
            context_class=dict,
            logger_factory=structlog.PrintLoggerFactory(),
            cache_logger_on_first_use=True,
        )

    # Send standard library logs (SQLAlchemy, httpx, etc.) to stdout as well.
    # Note: basicConfig alone does not run these records through the Structlog
    # processors; use structlog.stdlib.ProcessorFormatter if you need that.
    logging.basicConfig(
        format="%(message)s",
        stream=sys.stdout,
        level=logging.INFO,
    )

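The add_trace_context processor is the most important piece here. Every log event emitted through Structlog while a span is active automatically gets the current trace ID and span ID appended. Logs from third-party libraries that use the standard logging module only pick up these fields if you also route them through Structlog's processors, for example with structlog.stdlib.ProcessorFormatter. With the IDs in place, you can take any log line from production and immediately jump to the full trace in your observability backend, or take a trace and find every log line that was emitted during that request.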
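If the processor chain feels opaque, it may help to see the idea reduced to plain Python: each processor is a callable that takes (logger, method_name, event_dict) and returns the event dictionary, and the last one renders it. The sketch below is a simplified stand-in, not Structlog's implementation; `run_chain` and `add_fake_trace_context` are hypothetical names, and the trace and span IDs are fixed sample values instead of being read from the current span.

```python
import json


def add_level(logger, method_name, event_dict):
    # Same idea as add_log_level above: derive the level from the method called
    event_dict["level"] = method_name.upper()
    return event_dict


def add_fake_trace_context(logger, method_name, event_dict):
    # Stand-in for add_trace_context: fixed integer IDs instead of the active span's
    trace_id = 0x4BF92F3577B34DA6A3CE929D0E0E4736
    span_id = 0x00F067AA0BA902B7
    event_dict["trace_id"] = format(trace_id, "032x")  # 32 hex chars, zero padded
    event_dict["span_id"] = format(span_id, "016x")    # 16 hex chars, zero padded
    return event_dict


def render_json(logger, method_name, event_dict):
    # The final processor turns the dictionary into the output line
    return json.dumps(event_dict, sort_keys=True)


def run_chain(processors, method_name, event, **fields):
    """Apply each processor in order, feeding the result of one into the next."""
    result = {"event": event, **fields}
    for processor in processors:
        result = processor(None, method_name, result)
    return result


line = run_chain(
    [add_level, add_fake_trace_context, render_json],
    "info",
    "order_created",
    order_id="order_u123_p456",
)
print(line)
```

Order matters in the same way it does in the real configuration: anything that adds or rewrites fields has to run before the renderer.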

Step 2: Request-Scoped Context With Middleware

One of Structlog's most powerful features is context variables — thread-local (or async-local) storage that holds log context for the duration of a request. Set it once in middleware, and it appears on every log event that request generates.

middleware.py

import uuid
import time
import structlog
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
from starlette.responses import Response

logger = structlog.get_logger()


class LoggingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next) -> Response:
        # Clear any context from previous requests
        # This is critical — without this, context leaks between requests
        structlog.contextvars.clear_contextvars()

        # Generate a unique request ID for correlating all logs in this request
        request_id = str(uuid.uuid4())

        # Bind request-scoped context that will appear on all log events
        structlog.contextvars.bind_contextvars(
            request_id=request_id,
            method=request.method,
            path=request.url.path,
            client_ip=request.client.host if request.client else None,
        )

        start_time = time.perf_counter()

        # Log the incoming request
        logger.info("request_started")

        try:
            response = await call_next(request)
            duration_ms = (time.perf_counter() - start_time) * 1000

            # Log the completed request with response details
            logger.info(
                "request_completed",
                status_code=response.status_code,
                duration_ms=round(duration_ms, 2),
            )

            # Add request ID to response headers for client-side correlation
            response.headers["X-Request-ID"] = request_id
            return response

        except Exception:
            duration_ms = (time.perf_counter() - start_time) * 1000
            logger.error(
                "request_failed",
                duration_ms=round(duration_ms, 2),
                exc_info=True,
            )
            raise

The clear_contextvars() call at the start of every request is easy to forget and critical to include. Without it, context from one request can bleed into the next request handled by the same worker — you'd see logs from request B carrying the request ID of request A.
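The leak is easy to reproduce with a simplified stand-in for the bind and clear calls, built directly on Python's contextvars module. The `bind`, `clear`, `current`, and `handle_request` functions below are hypothetical and only mimic the behavior; what leaks is any field the next request does not rebind itself.

```python
import contextvars

# Simplified stand-in for Structlog's context storage (illustration only)
_log_context = contextvars.ContextVar("log_context", default={})


def bind(**fields):
    _log_context.set({**_log_context.get(), **fields})


def clear():
    _log_context.set({})


def current():
    return dict(_log_context.get())


def handle_request(request_id, user_id=None, clear_first=True):
    """Bind per-request fields, optionally clearing leftovers first."""
    if clear_first:
        clear()
    bind(request_id=request_id)
    if user_id is not None:
        bind(user_id=user_id)
    return current()


# Without clearing: the anonymous second request inherits user_id from the first
handle_request("req-A", user_id="u123", clear_first=False)
leaked = handle_request("req-B", clear_first=False)
print(leaked)  # {'request_id': 'req-B', 'user_id': 'u123'}

# With clearing: each request starts from an empty context
handle_request("req-A", user_id="u123")
clean = handle_request("req-B")
print(clean)  # {'request_id': 'req-B'}
```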

Step 3: Business Logic Logging

With the infrastructure in place, logging in your business logic is clean and context-rich by default.

services.py

import structlog
from opentelemetry import trace

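logger = structlog.get_logger()
tracer = trace.get_tracer("order-service")


# Placeholder exception types so this example runs;
# replace them with the errors your payment client raises
class PaymentDeclinedError(Exception):
    def __init__(self, reason: str, code: str):
        super().__init__(f"Payment declined: {reason}")
        self.reason = reason
        self.code = code


class ExternalServiceError(Exception):
    pass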


class OrderService:
    async def create_order(self, user_id: str, product_id: str, quantity: int) -> dict:
        # Bind order-specific context for all logs in this operation
        log = logger.bind(
            user_id=user_id,
            product_id=product_id,
            quantity=quantity,
        )

        log.info("order_creation_started")

        with tracer.start_as_current_span("validate_order") as span:
            if quantity <= 0:
                # Structured error — quantity is a queryable field
                log.warning(
                    "order_validation_failed",
                    reason="invalid_quantity",
                    quantity=quantity,
                )
                span.set_attribute("validation.error", "invalid_quantity")
                raise ValueError("Quantity must be positive")

            log.debug("order_validation_passed")

        with tracer.start_as_current_span("process_payment") as span:
            try:
                payment_result = await self._process_payment(user_id, product_id, quantity)
                log.info(
                    "payment_processed",
                    payment_id=payment_result["payment_id"],
                    amount=payment_result["amount"],
                    currency=payment_result["currency"],
                )
                span.set_attribute("payment.id", payment_result["payment_id"])
                span.set_attribute("payment.amount", payment_result["amount"])

            except PaymentDeclinedError as e:
                log.warning(
                    "payment_declined",
                    decline_reason=e.reason,
                    decline_code=e.code,
                )
                span.record_exception(e)
                raise

            except ExternalServiceError as e:
                log.error(
                    "payment_service_unavailable",
                    service="payment-gateway",
                    exc_info=True,
                )
                span.record_exception(e)
                raise

        order_id = f"order_{user_id}_{product_id}"
        log.info("order_created", order_id=order_id)

        return {"order_id": order_id, "status": "created"}

    async def _process_payment(self, user_id: str, product_id: str, quantity: int) -> dict:
        # Simulate payment processing
        return {
            "payment_id": f"pay_{user_id}",
            "amount": quantity * 29.99,
            "currency": "USD",
        }

Notice that different exceptions get different log levels with different context:

PaymentDeclinedError is a warning — it's expected behavior, not a system failure
ExternalServiceError is an error with exc_info=True — the payment gateway is down, which needs immediate attention

This distinction matters in production. If everything is logged as error, alerts become noise and get ignored. Reserving error for genuine system failures means every error alert deserves immediate attention.

Step 4: Wiring It Into FastAPI

main.py

import os
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import structlog

from logging_config import configure_logging
from middleware import LoggingMiddleware
from services import OrderService

logger = structlog.get_logger()


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Configure logging at startup
    environment = os.environ.get("ENVIRONMENT", "production")
    configure_logging(environment=environment)
    logger.info("application_started", environment=environment)
    yield
    logger.info("application_stopping")


app = FastAPI(lifespan=lifespan)
app.add_middleware(LoggingMiddleware)

order_service = OrderService()


class OrderRequest(BaseModel):
    user_id: str
    product_id: str
    quantity: int


@app.post("/orders")
async def create_order(order: OrderRequest):
    try:
        result = await order_service.create_order(
            user_id=order.user_id,
            product_id=order.product_id,
            quantity=order.quantity,
        )
        return result
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
    except Exception:
        raise HTTPException(status_code=500, detail="Internal server error")

What Production Logs Look Like

A successful order request produces a sequence of JSON log events, each carrying the full context:

{"event": "request_started", "level": "INFO", "timestamp": "2026-05-01T10:23:41Z", "request_id": "a3f1c2d4", "method": "POST", "path": "/orders", "client_ip": "192.168.1.1", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"}

{"event": "order_creation_started", "level": "INFO", "timestamp": "2026-05-01T10:23:41Z", "request_id": "a3f1c2d4", "user_id": "u123", "product_id": "p456", "quantity": 2, "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"}

{"event": "payment_processed", "level": "INFO", "timestamp": "2026-05-01T10:23:41Z", "request_id": "a3f1c2d4", "user_id": "u123", "product_id": "p456", "quantity": 2, "payment_id": "pay_u123", "amount": 59.98, "currency": "USD", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "ab12cd34ef567890"}

{"event": "order_created", "level": "INFO", "timestamp": "2026-05-01T10:23:41Z", "request_id": "a3f1c2d4", "user_id": "u123", "product_id": "p456", "quantity": 2, "order_id": "order_u123_p456", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"}

{"event": "request_completed", "level": "INFO", "timestamp": "2026-05-01T10:23:41Z", "request_id": "a3f1c2d4", "status_code": 200, "duration_ms": 342.5, "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"}

Every log event carries the same trace_id. In your log aggregator, you can:

Filter by trace_id to see every log event from a single request
Filter by user_id to see every order a specific user has placed
Filter by event=payment_declined and group by decline_reason to understand why payments are failing
Alert when event=payment_service_unavailable appears more than 3 times in 5 minutes

None of this is possible with unstructured log strings.
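For a sense of how little work these queries take once the logs are JSON, here is a small sketch that does the filtering and grouping in plain Python. The log lines and the `events_for` and `count_by` helpers are hypothetical; a log aggregator does the same thing at scale with its own query language.

```python
import json
from collections import Counter

# Hypothetical JSON log lines, one event per line
raw_logs = """
{"event": "payment_declined", "decline_reason": "insufficient_funds", "user_id": "u1", "trace_id": "t1"}
{"event": "payment_declined", "decline_reason": "card_expired", "user_id": "u2", "trace_id": "t2"}
{"event": "payment_declined", "decline_reason": "insufficient_funds", "user_id": "u3", "trace_id": "t3"}
{"event": "order_created", "user_id": "u1", "trace_id": "t4"}
"""

events = [json.loads(line) for line in raw_logs.splitlines() if line.strip()]


def events_for(events, **match):
    """Return events whose fields equal every keyword given, e.g. user_id='u1'."""
    return [e for e in events if all(e.get(k) == v for k, v in match.items())]


def count_by(events, field):
    """Group events by one field and count each value."""
    return Counter(e[field] for e in events if field in e)


declines = events_for(events, event="payment_declined")
print(count_by(declines, "decline_reason"))
print(events_for(events, user_id="u1"))
```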

Correlating Logs and Traces in Practice

The trace_id field is the bridge between your logs and your traces. Here's how to use it:

From a log to a trace: A user reports their order failed. You find their user_id in your log aggregator, filter by event=payment_declined, get the trace_id from the log event, and paste it into your tracing backend. You now see the full execution timeline — every span, every operation, every timing.

From a trace to logs: You notice a trace with unusually high latency in your tracing backend. You copy the trace_id and filter your log aggregator. You see that a payment_processed log event was emitted 800ms into the request — but the request_completed event came 2.3 seconds later. Something happened after payment processing that doesn't appear in the trace. The logs fill the gap.

This bidirectional navigation — from logs to traces and back — is what makes a production observability stack genuinely useful rather than just theoretically complete.
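The trace-to-logs direction can be sketched as a small gap finder: given the log events for one trace_id, find the largest stretch of time with no log output. The events and the `largest_gap` helper are hypothetical, and the timestamps assume the ISO 8601 UTC format with a trailing "Z".

```python
from datetime import datetime

# Hypothetical log events already filtered down to a single trace_id
events = [
    {"event": "request_started", "timestamp": "2026-05-01T10:23:41.000000Z"},
    {"event": "payment_processed", "timestamp": "2026-05-01T10:23:41.800000Z"},
    {"event": "request_completed", "timestamp": "2026-05-01T10:23:44.100000Z"},
]


def parse_ts(value):
    # Older Python versions do not accept a trailing "Z" in fromisoformat
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def largest_gap(events):
    """Return (seconds, event_before, event_after) for the longest silent interval."""
    ordered = sorted(events, key=lambda e: parse_ts(e["timestamp"]))
    gaps = [
        (
            (parse_ts(after["timestamp"]) - parse_ts(before["timestamp"])).total_seconds(),
            before["event"],
            after["event"],
        )
        for before, after in zip(ordered, ordered[1:])
    ]
    return max(gaps)


print(largest_gap(events))
```

In this sample the longest gap sits between payment_processed and request_completed, which is where you would look for work that is missing a span.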

Common Mistakes

Logging sensitive data. It's easy to accidentally log payment details, passwords, or personally identifiable information as structured fields. Before adding a field to a log event, ask whether it belongs in your log aggregator. User IDs are fine. Card numbers are not.

Forgetting to clear context between requests. If you're using bind_contextvars without clear_contextvars at the start of each request, context leaks between requests on the same worker. Always clear first.

Using string formatting instead of keyword arguments. log.info(f"Payment of {amount} processed") defeats the purpose of structured logging. Use log.info("payment_processed", amount=amount) — keep the event name short and stable, put the data in fields.

Logging at the wrong level. Reserve error for conditions that require immediate human attention. Use warning for expected failure cases (payment declined, validation failed). Use info for significant business events. Use debug for everything else. If everything is error, nothing is.
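For the first mistake in this list, one possible safety net is a redaction processor. The sketch below uses the same (logger, method_name, event_dict) signature as the processors in logging_config.py, so it could be placed in shared_processors ahead of the renderer. The `SENSITIVE_KEYS` set and the `redact_sensitive_fields` name are assumptions for illustration; adapt the key list to your own data.

```python
# Hypothetical list of field names that should never reach the log aggregator
SENSITIVE_KEYS = {"card_number", "cvv", "password", "authorization"}


def redact_sensitive_fields(logger, method_name, event_dict):
    """Replace the value of any sensitive top-level field with a placeholder."""
    for key in list(event_dict):
        if key.lower() in SENSITIVE_KEYS:
            event_dict[key] = "[REDACTED]"
    return event_dict


event = redact_sensitive_fields(
    None,
    "info",
    {"event": "payment_processed", "user_id": "u123", "card_number": "4111111111111111"},
)
print(event)
```

This only checks top-level keys, so nested dictionaries and values embedded in strings still need care at the call site.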

Summary

Structured logging with Structlog gives you three things that unstructured logging cannot:

Queryable log data — filter, aggregate, and alert on specific fields rather than parsing strings
Automatic context propagation — bind context once per request and it appears on every log event
Log-trace correlation — every log event carries the trace ID, making bidirectional navigation between logs and traces possible

The setup is straightforward: configure Structlog once, add the trace context processor, use middleware to bind request-scoped context, and log events with keyword arguments instead of formatted strings. From there, your log aggregator does the heavy lifting.

Unstructured logs tell you something happened. Structured logs tell you what happened, to whom, when, why, and where in the trace you can find the full picture.


Tracing Async Python: How to Instrument FastAPI and Celery in the Same Trace

Temitope — Mon, 11 May 2026 09:50:54 +0000

Most observability guides stop at the HTTP layer.

You add OpenTelemetry to your FastAPI app, traces start showing up in your backend, and everything looks great — until you realize that the moment a request hands off work to a Celery worker, the trace disappears. The worker runs in a completely separate process. The span context doesn't travel with the task. What should be one unified trace becomes two unrelated fragments.

This is one of the most common observability gaps in Python backend systems, and it's surprisingly underserved in the documentation. In this article, we'll close it.

We'll instrument a FastAPI application and a Celery worker so that every task a request spawns appears as a child span in the same trace — giving you end-to-end visibility from HTTP request to background job completion.

What We're Building

A simple order processing system:

POST /orders
    │
    ├── FastAPI handler (validates, saves order)
    │
    └── Celery task (sends confirmation email, updates inventory)

Without proper trace propagation, you'd see:

Trace A: POST /orders (FastAPI) — 45ms
Trace B: process_order (Celery) — 1.2s   ← orphaned, no parent

With it, you'd see:

Trace A: POST /orders — 1.25s total
  ├── validate_order — 8ms
  ├── save_order — 35ms
  └── process_order (Celery worker) — 1.2s
        ├── send_confirmation_email — 800ms
        └── update_inventory — 400ms

That second view is what we're building towards.

Prerequisites

Python 3.10+
A running Redis instance (for Celery's broker)
Basic familiarity with FastAPI and Celery

Installing Dependencies

pip install fastapi uvicorn celery redis
pip install opentelemetry-sdk
pip install opentelemetry-api
pip install opentelemetry-instrumentation-fastapi
pip install opentelemetry-instrumentation-celery
pip install opentelemetry-exporter-otlp-proto-grpc
pip install opentelemetry-propagator-b3

Project Structure

order-tracing/
├── tracing.py        # Shared tracer setup
├── main.py           # FastAPI application
├── worker.py         # Celery worker
└── tasks.py          # Celery tasks

Step 1: Setting Up the Shared Tracer

The most important design decision in this setup is that both FastAPI and Celery must share the same tracer configuration. They're separate processes, but they need to use the same trace exporter and the same propagator so span context can be serialized and deserialized correctly.

tracing.py

import os
from opentelemetry import trace, propagate
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.propagators.b3 import B3MultiFormat
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator


def init_tracer(service_name: str) -> trace.Tracer:
    resource = Resource.create({
        SERVICE_NAME: service_name,
    })

    exporter = OTLPSpanExporter(
        endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "localhost:4317"),
        insecure=True,
    )

    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(exporter))

    trace.set_tracer_provider(provider)

    # Set a composite propagator that supports both W3C TraceContext and B3
    # W3C is the modern standard; B3 is common in older infrastructure
    propagate.set_global_textmap(
        CompositePropagator([
            TraceContextTextMapPropagator(),
            B3MultiFormat(),
        ])
    )

    return trace.get_tracer(service_name)

The propagator setup here is worth pausing on. The propagator is responsible for serializing the trace context into a format that can be passed between processes (in our case, into Celery task headers) and deserializing it on the other side. Without a matching propagator on both ends, the worker finds no headers it recognizes, extraction yields an empty context, and the task span starts a new trace instead of joining the existing one.

Step 2: The FastAPI Application

main.py

import json
from fastapi import FastAPI, HTTPException
from opentelemetry import trace, propagate
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from pydantic import BaseModel

from tracing import init_tracer
from tasks import process_order_task

# Initialize tracer for the API service
tracer = init_tracer("order-api")

app = FastAPI()

# Auto-instrument FastAPI — this creates spans for every request automatically
FastAPIInstrumentor.instrument_app(app)


class OrderRequest(BaseModel):
    user_id: str
    product_id: str
    quantity: int


@app.post("/orders")
async def create_order(order: OrderRequest):
    # FastAPIInstrumentor has already started a span for this request
    # We can get the current span to add business context
    current_span = trace.get_current_span()
    current_span.set_attribute("order.user_id", order.user_id)
    current_span.set_attribute("order.product_id", order.product_id)
    current_span.set_attribute("order.quantity", order.quantity)

    with tracer.start_as_current_span("validate_order") as span:
        if order.quantity <= 0:
            span.set_attribute("validation.error", "invalid_quantity")
            raise HTTPException(status_code=400, detail="Quantity must be positive")
        span.set_attribute("validation.status", "passed")

    with tracer.start_as_current_span("save_order") as span:
        # Simulate saving to database
        order_id = f"order_{order.user_id}_{order.product_id}"
        span.set_attribute("order.id", order_id)
        span.set_attribute("db.operation", "insert")

    # This is the critical step: inject the current trace context into
    # the Celery task headers so the worker can continue the trace
    trace_context = {}
    propagate.inject(trace_context)

    # Pass the serialized trace context as part of the task
    process_order_task.apply_async(
        args=[order_id, order.model_dump()],  # Pydantic v2; use order.dict() on Pydantic v1
        headers={"trace_context": json.dumps(trace_context)},
    )

    return {"order_id": order_id, "status": "processing"}

The propagate.inject(trace_context) call is doing the heavy lifting here. It takes the current span context — the trace ID, span ID, and sampling flags — and serializes it into the trace_context dictionary using whatever propagator format we configured. That dictionary then travels with the Celery task as a header.
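To show roughly what ends up in that dictionary, here is a stdlib-only sketch of the W3C traceparent header and its round trip through a JSON-encoded task header. `inject_sketch` and `extract_sketch` are hypothetical stand-ins, not the OpenTelemetry API, and the IDs are sample values. With the composite propagator configured earlier, the real carrier would also contain B3 headers.

```python
import json


def inject_sketch(carrier, trace_id, span_id, sampled):
    """Write a W3C traceparent entry: version-traceid-spanid-flags, all lowercase hex."""
    flags = 1 if sampled else 0
    carrier["traceparent"] = f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def extract_sketch(carrier):
    """Parse the traceparent entry back into its parts, or return None if absent."""
    header = carrier.get("traceparent")
    if header is None:
        return None
    _version, trace_hex, span_hex, flags_hex = header.split("-")
    return {
        "trace_id": int(trace_hex, 16),
        "span_id": int(span_hex, 16),
        "sampled": bool(int(flags_hex, 16) & 1),
    }


# API side: serialize the context into a task header
carrier = {}
inject_sketch(carrier, 0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7, True)
header_value = json.dumps(carrier)

# Worker side: decode the header and recover the parent context
restored = extract_sketch(json.loads(header_value))
print(carrier["traceparent"])
print(restored)
```

An empty carrier returns None here, which mirrors the failure mode in the real system: no recognizable header means no parent, and the worker span becomes a new root.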

Step 3: The Celery Worker

worker.py

from celery import Celery
from celery.signals import worker_process_init
from opentelemetry.instrumentation.celery import CeleryInstrumentor

from tracing import init_tracer


# Initialize tracing inside each worker process rather than at import time.
# main.py imports this module (via tasks.py), so import-time setup would also
# run in the API process and claim the global tracer provider first.
# Same exporter endpoint as the API, different service name.
@worker_process_init.connect(weak=False)
def init_worker_tracing(*args, **kwargs):
    init_tracer("order-worker")
    CeleryInstrumentor().instrument()

celery_app = Celery(
    "order_worker",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/0",
    include=["tasks"],  # Register tasks.py when the worker starts with -A worker
)

celery_app.conf.update(
    task_serializer="json",
    result_serializer="json",
    accept_content=["json"],
    task_track_started=True,
)

tasks.py

import json
import time
from opentelemetry import trace, propagate
from opentelemetry.trace import Status, StatusCode

from worker import celery_app

tracer = trace.get_tracer("order-worker")


@celery_app.task(bind=True, name="process_order")
def process_order_task(self, order_id: str, order_data: dict):
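    # Extract trace context from task headers
    # This is how we reconnect to the parent trace from the API.
    # Depending on the Celery version, custom headers surface either under
    # request.headers or directly as attributes on the request, so check both.
    headers = self.request.headers or {}
    raw_context = (
        headers.get("trace_context")
        or getattr(self.request, "trace_context", None)
        or "{}"
    )
    carrier = json.loads(raw_context)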

    # Deserialize the trace context using the same propagator
    parent_context = propagate.extract(carrier)

    # Start a new span as a child of the API request span
    with tracer.start_as_current_span(
        "process_order",
        context=parent_context,
        kind=trace.SpanKind.CONSUMER,
    ) as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("messaging.system", "celery")
        span.set_attribute("messaging.operation", "process")

        try:
            _send_confirmation_email(order_id, order_data)
            _update_inventory(order_data)

            span.set_attribute("order.status", "completed")
            span.set_status(Status(StatusCode.OK))

        except Exception as e:
            span.record_exception(e)
            span.set_attribute("order.status", "failed")
            span.set_status(Status(StatusCode.ERROR, str(e)))
            # Re-raise so Celery can handle retry logic
            raise


def _send_confirmation_email(order_id: str, order_data: dict):
    with tracer.start_as_current_span("send_confirmation_email") as span:
        span.set_attribute("email.recipient", order_data.get("user_id"))
        span.set_attribute("email.template", "order_confirmation")

        # Simulate email sending latency
        time.sleep(0.8)

        span.set_attribute("email.status", "sent")


def _update_inventory(order_data: dict):
    with tracer.start_as_current_span("update_inventory") as span:
        span.set_attribute("product.id", order_data.get("product_id"))
        span.set_attribute("inventory.delta", -order_data.get("quantity", 0))
        span.set_attribute("db.operation", "update")

        # Simulate database write
        time.sleep(0.4)

        span.set_attribute("inventory.status", "updated")

The key line is context=parent_context in tracer.start_as_current_span. This tells OpenTelemetry to create the new span as a child of the span that was active in the FastAPI handler — even though that span is in a completely different process. Without this, OpenTelemetry would create a new root span, giving you the orphaned trace problem we started with.
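To make the parent/child link concrete, here is a stdlib-only model of what travels between the two processes. It is a sketch, not OpenTelemetry's implementation: SpanRef, new_root, child_of, to_header and from_header are hypothetical names, and it assumes the carrier uses the W3C traceparent format (version-traceid-spanid-flags).

```python
import json
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanRef:
    trace_id: str                    # 32 hex chars, shared by every span in a trace
    span_id: str                     # 16 hex chars, unique per span
    parent_id: Optional[str] = None  # span_id of the parent, None for a root span


def new_root() -> SpanRef:
    # What you get when no parent context is available: a brand new trace.
    return SpanRef(secrets.token_hex(16), secrets.token_hex(8))


def child_of(parent: SpanRef) -> SpanRef:
    # Same trace ID, fresh span ID, parent pointer to the remote span.
    return SpanRef(parent.trace_id, secrets.token_hex(8), parent.span_id)


def to_header(span: SpanRef) -> str:
    # API side: serialize the active span into a JSON string for the task headers.
    return json.dumps({"traceparent": f"00-{span.trace_id}-{span.span_id}-01"})


def from_header(raw: str) -> Optional[SpanRef]:
    # Worker side: rebuild the remote parent, or None if nothing usable arrived.
    carrier = json.loads(raw or "{}")
    parts = carrier.get("traceparent", "").split("-")
    if len(parts) != 4:
        return None
    _version, trace_id, span_id, _flags = parts
    return SpanRef(trace_id, span_id)


api_span = new_root()
remote = from_header(to_header(api_span))
worker_span = child_of(remote) if remote else new_root()

# With an empty carrier the worker falls back to a new root: the orphaned trace.
missing = from_header("{}")
orphan_span = child_of(missing) if missing else new_root()

print(worker_span.trace_id == api_span.trace_id)   # same trace
print(orphan_span.trace_id == api_span.trace_id)   # different trace
```

The worker span shares the API span's trace ID and points at it as parent, which is the relationship the backend uses to draw one tree across both processes.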

Step 4: Running Everything

Start the OpenTelemetry Collector (or point directly at your backend):

docker run -p 4317:4317 otel/opentelemetry-collector-contrib

Start the Celery worker:

celery -A worker worker --loglevel=info

Start the FastAPI app:

uvicorn main:app --reload

Send a test request:

curl -X POST http://localhost:8000/orders \
  -H "Content-Type: application/json" \
  -d '{"user_id": "u123", "product_id": "p456", "quantity": 2}'

Open your observability backend and you should see a single trace with the full span hierarchy:

POST /orders — 1.25s
  ├── validate_order — 8ms
  ├── save_order — 35ms
  └── process_order (worker) — 1.2s
        ├── send_confirmation_email — 800ms
        └── update_inventory — 400ms

Common Issues and How to Fix Them

Spans appear but Celery tasks are orphaned

Cause: The propagator on the API side doesn't match the propagator on the worker side.

Fix: Make sure tracing.py is imported by both main.py and worker.py, and that propagate.set_global_textmap is called before any spans are created. The easiest way to guarantee this is to call init_tracer at module import time, not inside a function.

CeleryInstrumentor creates duplicate spans

Cause: CeleryInstrumentor().instrument() was called after the Celery app was initialized.

Fix: Instrument before creating the Celery instance. The instrumentation patches Celery's signal system, which only works if the patch is applied before Celery sets up its internal event hooks.

Trace context arrives empty in the worker

Cause: Celery's default serializer strips unknown header fields.

Fix: Make sure task_serializer is set to "json" and accept_content includes "json". Also verify you're reading from self.request.headers and not self.request.kwargs — the context travels in headers, not task arguments.

Worker spans show a different trace ID

Cause: The context parameter was passed to start_as_current_span but the extracted context was empty (e.g. the header key name was wrong).

Fix: Log the raw carrier dict before calling propagate.extract to verify the context is arriving correctly:

raw_context = self.request.headers.get("trace_context", "{}")
carrier = json.loads(raw_context)
print(f"Received trace context: {carrier}")  # Should not be empty
parent_context = propagate.extract(carrier)
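Printing the carrier tells you whether something arrived; a small check can also tell you whether it is usable. The helper below is a hypothetical debugging aid (describe_carrier is not part of OpenTelemetry or Celery) and assumes the W3C traceparent format, where all-zero trace or span IDs are invalid.

```python
import re

# version - trace id (32 hex) - parent span id (16 hex) - flags
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def describe_carrier(carrier: dict) -> str:
    """Return a one-line diagnosis of the traceparent entry in a carrier dict."""
    value = carrier.get("traceparent")
    if value is None:
        return "missing: no traceparent key in carrier"
    match = TRACEPARENT_RE.match(value)
    if match is None:
        return "malformed: expected version-traceid-spanid-flags"
    _version, trace_id, span_id, _flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return "invalid: all-zero trace or span ID"
    return f"ok: trace_id={trace_id} parent_span_id={span_id}"


print(describe_carrier({}))
print(describe_carrier(
    {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
))
```

Logging this diagnosis instead of the raw dict makes an empty carrier and a malformed one easy to tell apart in worker logs.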

Adding Retry Visibility

Celery's retry mechanism is another common observability blind spot. When a task retries, it creates a new execution — but without instrumentation, you can't tell from the trace how many times it tried.

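class ExternalServiceError(Exception):
    """Placeholder for a transient downstream failure (e.g. the email provider)."""


@celery_app.task(bind=True, name="process_order", max_retries=3)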
def process_order_task(self, order_id: str, order_data: dict):
    raw_context = self.request.headers.get("trace_context", "{}")
    carrier = json.loads(raw_context)
    parent_context = propagate.extract(carrier)

    with tracer.start_as_current_span(
        "process_order",
        context=parent_context,
        kind=trace.SpanKind.CONSUMER,
    ) as span:
        # Track retry count on every execution
        span.set_attribute("task.retry_count", self.request.retries)
        span.set_attribute("task.max_retries", self.max_retries)
        span.set_attribute("order.id", order_id)

        try:
            _send_confirmation_email(order_id, order_data)
            _update_inventory(order_data)

        except ExternalServiceError as e:
            span.set_attribute("task.will_retry", self.request.retries < self.max_retries)
            span.record_exception(e)

            # Exponential backoff
            raise self.retry(exc=e, countdown=2 ** self.request.retries)

        except Exception as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            raise

With task.retry_count on every span, you can filter your observability backend for spans where task.retry_count > 0 and immediately see which operations are flaky, and how many retries they typically need before they succeed or fail permanently.
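For reference, the countdown expression in the task (2 ** self.request.retries with max_retries=3) works out as follows. This is plain arithmetic wrapped in two hypothetical helpers, backoff_schedule and will_retry, which are not part of Celery.

```python
def backoff_schedule(max_retries: int, base: int = 2) -> list:
    """Countdown in seconds before each retry, indexed by the current retry count."""
    return [base ** attempt for attempt in range(max_retries)]


def will_retry(retries: int, max_retries: int) -> bool:
    """Mirrors the task.will_retry attribute set in the except branch."""
    return retries < max_retries


schedule = backoff_schedule(3)
print(schedule)        # [1, 2, 4]
print(sum(schedule))   # 7 seconds of total backoff in the worst case

# The fourth execution (retries == 3) is the last one: no further retry.
print([will_retry(r, 3) for r in range(4)])  # [True, True, True, False]
```

So a task that keeps failing runs four times in total, and the retry attributes on each span tell you which of those executions you are looking at.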

What to Monitor

With this instrumentation in place, here are the metrics and queries that become useful:

End-to-end latency: Filter traces by POST /orders and look at the total duration. Spikes here could be API-side or worker-side — the span breakdown tells you which.

Worker queue depth: Combine trace latency data with queue-length metrics from your broker or Celery monitoring tool. If API spans complete quickly but total trace duration is high, tasks are sitting in the queue longer than expected.

Task retry rate: Query for spans where task.retry_count > 0. A rising retry rate is often the first signal of a degrading downstream dependency.

Failed tasks by error type: Filter spans by status=ERROR on process_order spans and group by the exception type. This tells you whether failures are clustering around a specific integration (email, inventory) or spread across everything.
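If your backend lets you export spans, the last two queries are easy to prototype offline. The sketch below works on plain dicts; the sample spans are made up for illustration, and the error_type key is a hypothetical stand-in for however your backend exposes the recorded exception type.

```python
from collections import Counter


def retry_rate(spans: list) -> float:
    """Fraction of spans that ran as a retry (task.retry_count > 0)."""
    if not spans:
        return 0.0
    retried = sum(1 for s in spans if s.get("task.retry_count", 0) > 0)
    return retried / len(spans)


def failures_by_type(spans: list) -> Counter:
    """Count ERROR spans grouped by exception type."""
    return Counter(
        s.get("error_type", "unknown") for s in spans if s.get("status") == "ERROR"
    )


# Illustrative data only, not real measurements.
SAMPLE_SPANS = [
    {"name": "process_order", "task.retry_count": 0, "status": "OK"},
    {"name": "process_order", "task.retry_count": 2, "status": "OK"},
    {"name": "process_order", "task.retry_count": 3, "status": "ERROR",
     "error_type": "ExternalServiceError"},
    {"name": "process_order", "task.retry_count": 0, "status": "ERROR",
     "error_type": "KeyError"},
]

print(retry_rate(SAMPLE_SPANS))
print(failures_by_type(SAMPLE_SPANS))
```

In a real setup these would be saved queries or dashboard panels in the backend rather than Python functions; the logic is the same.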

Summary

Connecting FastAPI and Celery in a single trace requires three things:

A shared tracer configuration imported by both processes
propagate.inject on the API side to serialize the trace context into task headers
propagate.extract on the worker side to deserialize it and pass it as the parent context

The rest is standard OpenTelemetry — spans, attributes, status codes. But without those three pieces in place, you're flying blind the moment work leaves your API process.

Async Python systems hand off work constantly — to queues, to background workers, to scheduled jobs. Trace propagation is what lets you follow that work all the way through, and debug the full picture when something goes wrong.

Find me on GitHub or LinkedIn.

The Death of the "Black Box": Why the Future of AI is Modular

Temitope — Thu, 07 May 2026 13:57:53 +0000

For the past few years, the narrative around AI has been dominated by the "Black Box"—massive, monolithic models that live in the cloud, gatekept by APIs, and disconnected from our actual workflows.

But as someone building at the intersection of history and engineering, I see a shift happening. We are moving away from the monolith and toward Modularity.

The Shift to Agentic Architecture
The real breakthrough isn’t just a "smarter" LLM; it’s the ability to wrap that intelligence in a framework that can act. Using p-agent as an orchestrator allows us to treat the LLM as just one component of a larger machine.

When you pair this with the Model Context Protocol (MCP), you solve the biggest hurdle in AI: Context. Instead of begging a model to "remember" your project structure, you give it a standardized pipe directly to your filesystem or database.

Why This Matters for Developers
Vendor Agility: If a better model comes out tomorrow (like the next iteration of Gemma), a modular stack lets you swap the "brain" without rebuilding your entire toolset.

Privacy by Design: By running orchestration locally, we move from "Trust us with your data" to "We never see your data".

Local-First Engineering: As we've seen with startups like Ex Machina Technologies, the goal is to build systems that work at the speed of local hardware, not the speed of an API queue.

Implementation: The Modular Loop

Here is the "Invisible Logic" of a modular agent. It’s not a single script; it’s a conversation between an orchestrator and its environment.

# The "Modular" approach: Separation of Brain and Body
from p_agent.core import Agent
from p_agent.providers import AnthropicProvider # or Gemma via Ollama

# 1. Define the Brain (Intelligence)
brain = AnthropicProvider(model="claude-3-5-sonnet")

# 2. Define the Body (Tools/Context via MCP)
assistant = Agent(
    name="Architect",
    provider=brain,
    tools=["filesystem-mcp", "postgres-mcp"] # Standardized interfaces
)

# 3. Execution
assistant.run("Audit my local database schema against the latest documentation.")

Final Thoughts: From Tools to Teammates
We are no longer just "using" AI; we are architecting digital colleagues. By leaning into open-source frameworks and standardized protocols, we ensure that the future of AI is transparent, private, and—most importantly—under our control.

This journey from a simple script to a production-ready agent isn't just a technical upgrade; it's a new philosophy of software.

Open-Source Intelligence: Building with Gemma and MCP

Temitope — Thu, 07 May 2026 13:27:40 +0000

Open-source models have reached a pivotal moment. With the release of Gemma, developers are no longer forced to choose between "smart" cloud-based models and "private" local ones. We finally have both.

In this guide, I’ll walk you through setting up a fully local agentic system using Gemma as the brain, p-agent as the orchestrator, and the Model Context Protocol (MCP) to bridge the gap between AI and your local data.

Why Gemma for Agents?
Gemma is built on the same technical foundations as Google’s Gemini models but tailored specifically for the open-source community. For agentic workflows, it offers:

Superior Logic: Its architecture excels at following complex, multi-step instructions—the "bread and butter" of agent planning.
Efficiency: The 9B and 27B variants run comfortably on consumer hardware while outperforming many models twice their size.
Total Privacy: This is the perfect setup for a "Knowledge Brain" where you want to query sensitive local files without your data ever leaving your machine.

The Stack: Local Intelligence in Action
To get this running, we'll use Ollama to serve Gemma locally. This allows p-agent to call it just like a cloud API, but with 100% data sovereignty.

1. Serve Gemma Locally
First, grab Ollama and pull the model:

ollama run gemma2

2. Configure p-agent to use Gemma
In your p-agent setup, we simply point the provider to your local Ollama instance. This gives your agent a private reasoning engine.

from p_agent.core import Agent
from p_agent.providers import OllamaProvider

# Set up the Gemma brain
gemma_provider = OllamaProvider(model="gemma2")

local_agent = Agent(
    name="GemmaAgent",
    instructions="You are a local researcher. Use MCP tools to analyze data.",
    provider=gemma_provider
)

Bridging the Gap with MCP

An agent is only as useful as the tools it can access. By registering a Filesystem MCP server, we allow Gemma to "reach out" and actually read your local documentation or code repositories.

from p_agent.mcp import MCPTool

# Connect Gemma to your local filesystem
fs_tool = MCPTool(server_url="http://localhost:8080")
local_agent.register_tool(fs_tool)

# Execute a complex local task
local_agent.run("Summarize the architectural decisions found in the /docs folder.")

Final Thoughts: What This Means for Us
The ability to run a model with Gemma’s reasoning capabilities on a local machine—orchestrated by a framework like p-agent—is a genuine game-changer. For developers and founders, it offers:

Cost Independence: No more per-token billing for your internal development tools.
Data Sovereignty: Your codebases and strategy documents stay exactly where they belong: on your hardware.
Rapid Prototyping: Build and test agentic loops all day without worrying about API rate limits or latency spikes.

Standardized protocols like MCP, paired with high-capability open models, are the foundation of the next generation of software.

From Prototype to Production: Deploying p-agent Workflows

Temitope — Thu, 07 May 2026 13:04:24 +0000

The journey from a main.py script to a production-ready AI service is often the hardest part of the "agentic" lifecycle. When building for the real world—especially in high-growth tech hubs like Lagos—developers need to balance power with cost-effectiveness and latency.

In this final installment, we’re looking at how to take the p-agent workflows we’ve built and wrap them in a scalable architecture that’s ready for users.

The Deployment Lifecycle
Moving to production means solving for three things: Persistence, Scalability, and Connectivity.

1. Session Persistence
In a local script, your agent's memory disappears when the process ends. In production, you need to maintain state across multiple user interactions. p-agent handles this by allowing you to inject session managers that store conversation history in a database rather than just RAM.
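As an illustration of what such a session manager has to do, here is a minimal sketch backed by SQLite from the standard library. SessionStore and its methods are hypothetical names; the interface p-agent actually expects is not shown in this article, so treat this as the storage idea only.

```python
import sqlite3


class SessionStore:
    """Keeps per-session conversation history in SQLite instead of process memory."""

    def __init__(self, path: str = ":memory:"):
        # Pass a file path in production so history survives restarts.
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "session_id TEXT NOT NULL, role TEXT NOT NULL, content TEXT NOT NULL)"
        )

    def append(self, session_id: str, role: str, content: str) -> None:
        # The connection context manager commits on success, rolls back on error.
        with self._db:
            self._db.execute(
                "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
                (session_id, role, content),
            )

    def history(self, session_id: str) -> list:
        rows = self._db.execute(
            "SELECT role, content FROM messages WHERE session_id = ? ORDER BY id",
            (session_id,),
        ).fetchall()
        return [{"role": role, "content": content} for role, content in rows]


store = SessionStore()
store.append("s1", "user", "What changed in the schema?")
store.append("s1", "assistant", "Two columns were added to orders.")
print(store.history("s1"))
```

Keying every row by session_id is what lets several workers serve the same user without sharing memory.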

2. Scaling with MCP Microservices
In our previous tutorials, we ran MCP servers locally. For production, you can host your MCP servers as independent microservices.

The Benefit: Your "GitHub MCP" or "Database MCP" can run on a separate container, allowing your main p-agent orchestrator to remain lightweight.

The Standard: Because MCP is a protocol, your p-agent instance can connect to these remote tools over secure transports (like SSE or WebSockets).

3. Optimizing for Latency
Not every task requires GPT-4o. A professional architecture uses a "Routing Model" (like a smaller, faster LLM) to handle simple tool-calling tasks, reserving the "Reasoning Model" for complex problem-solving. This saves both time and API credits.
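The routing idea can be sketched without any LLM at all. The example below is a rule-based stand-in: in the architecture described above the decision would come from a small model, and the model names, the Task fields and the 500-character threshold are all hypothetical.

```python
from dataclasses import dataclass

ROUTING_MODEL = "small-fast-model"         # hypothetical name
REASONING_MODEL = "large-reasoning-model"  # hypothetical name


@dataclass
class Task:
    prompt: str
    needs_planning: bool = False  # set by the caller for multi-step work


def choose_model(task: Task, max_simple_chars: int = 500) -> str:
    """Send long or multi-step tasks to the reasoning model, everything else to the cheap one."""
    if task.needs_planning or len(task.prompt) > max_simple_chars:
        return REASONING_MODEL
    return ROUTING_MODEL


print(choose_model(Task("List open pull requests")))
print(choose_model(Task("Refactor the billing module", needs_planning=True)))
```

Whatever does the routing, keeping it behind one function like this makes it easy to measure how often the expensive model is actually needed.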

Implementation: The Production Wrapper
Here is how you might wrap a p-agent workflow into a FastAPI endpoint for deployment:

from fastapi import FastAPI
from p_agent.core import Agent
from p_agent.providers import OpenAIProvider

app = FastAPI()
provider = OpenAIProvider(model="gpt-4o-mini") # Cost-effective for routing

# Initialize a production-ready agent
deploy_agent = Agent(
    name="ProdAssistant",
    instructions="Process user requests efficiently using connected tools.",
    provider=provider
)

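@app.post("/chat")
def chat_endpoint(user_input: str, session_id: str):
    # Declared with plain `def` so FastAPI runs the blocking agent call in its
    # threadpool instead of stalling the event loop.
    # Retrieve session context and run the agent
    response = deploy_agent.run(user_input, session_id=session_id)
    return {"reply": response.content}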

Foundational Future: Building Locally, Scaling Globally
Frameworks like p-agent are lowering the barrier to entry for AI startups. Whether you are building the next big thing at a Lagos-based firm like Ex Machina Technologies or optimizing internal workflows for a global team, the focus remains the same: building modular, open-source-first systems.

By using p-agent and MCP, you aren't just building a feature; you are architecting a system that can evolve with the AI landscape.

What’s your biggest challenge when moving AI to production? Let’s troubleshoot in the comments!

Temitope Ajao, AI Engineering professional based in Lagos; founder of Ex Machina Technologies.

Building an LLM-as-a-Judge Evaluation Pipeline for Translation Quality

Temitope — Sun, 03 May 2026 10:45:02 +0000

Introduction

How do you know if your AI-generated translation is actually good?

Traditional metrics like BLEU scores measure word overlap — but they miss fluency, context, and cultural nuance entirely. A translation can score well on BLEU and still read like gibberish to a native speaker.

This is where LLM-as-a-Judge comes in — using a large language model to evaluate the quality of another model's output. In this tutorial, we'll build a practical evaluation pipeline that scores translation quality across multiple dimensions using Claude as the judge.

By the end, you'll have a working system you can plug into any translation workflow.
What We're Building
A Python-based evaluation pipeline that:

Accepts a source text + translated output.
Sends both to an LLM judge with a structured scoring prompt.
Returns scores for fluency, accuracy, and cultural appropriateness.
Logs results to a simple JSON file for tracking over time.

Prerequisites

Python 3.9+
An Anthropic API key (or OpenAI)
Basic familiarity with REST APIs and Python

pip install anthropic python-dotenv

Step 1: Set Up Your Project Structure

llm-judge-pipeline/
├── evaluator.py
├── prompts.py
├── logger.py
├── results/
│   └── evaluations.json
└── .env

Create your .env:

ANTHROPIC_API_KEY=your_api_key_here

Step 2: Design Your Evaluation Prompt
The quality of your judge depends almost entirely on your prompt. We want structured, consistent output — so we'll ask the model to respond in JSON.
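The template for this step is filled in with Python's str.format, which is why the literal JSON braces in it are doubled. A short demonstration with a cut-down, hypothetical template (not the full judge prompt):

```python
# Doubled braces are emitted as literal braces; {source} is a placeholder.
TEMPLATE = 'Return ONLY JSON like {{"score": <1-10>}}.\nSOURCE: {source}'
rendered = TEMPLATE.format(source="Welcome to our platform.")
print(rendered)

# With single braces, str.format treats the JSON example as a placeholder and fails.
BROKEN = 'Return ONLY JSON like {"score": <1-10>}.\nSOURCE: {source}'
try:
    BROKEN.format(source="Welcome to our platform.")
    failed = False
except (KeyError, ValueError, IndexError):
    failed = True
print(failed)
```

If you later switch to f-strings or a templating library, the escaping rules change, so re-check the rendered prompt before trusting the scores.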

prompts.py

JUDGE_PROMPT = """
You are an expert translation evaluator with deep knowledge of linguistics and cultural context.

You will be given:
- SOURCE: The original text
- TRANSLATION: The translated output to evaluate
- TARGET LANGUAGE: The language the translation is written in

Evaluate the translation on three dimensions and return ONLY a JSON object:

{{
  "fluency": {{
    "score": <1-10>,
    "reason": "<one sentence explanation>"
  }},
  "accuracy": {{
    "score": <1-10>,
    "reason": "<one sentence explanation>"
  }},
  "cultural_appropriateness": {{
    "score": <1-10>,
    "reason": "<one sentence explanation>"
  }},
  "overall_score": <average of the three scores>,
  "recommendation": "<pass | review | reject>"
}}

SOURCE: {source}
TRANSLATION: {translation}
TARGET LANGUAGE: {target_language}
"""

Step 3: Build the Evaluator
evaluator.py

import os
import json
import anthropic
from dotenv import load_dotenv
from prompts import JUDGE_PROMPT

load_dotenv()

client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

def evaluate_translation(source: str, translation: str, target_language: str) -> dict:
    prompt = JUDGE_PROMPT.format(
        source=source,
        translation=translation,
        target_language=target_language
    )

    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )

    raw_response = message.content[0].text

    try:
        result = json.loads(raw_response)
    except json.JSONDecodeError:
        import re
        json_match = re.search(r'\{.*\}', raw_response, re.DOTALL)
        if json_match:
            result = json.loads(json_match.group())
        else:
            raise ValueError("Could not parse JSON from model response")

    return result


def batch_evaluate(pairs: list[dict]) -> list[dict]:
    """Evaluate multiple source/translation pairs"""
    results = []
    for pair in pairs:
        result = evaluate_translation(
            source=pair["source"],
            translation=pair["translation"],
            target_language=pair["target_language"]
        )
        result["source"] = pair["source"]
        result["translation"] = pair["translation"]
        results.append(result)
    return results
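Parsing the response as JSON does not guarantee the judge followed the rubric. A small validation step can catch out-of-range scores and recompute the average instead of trusting the model's arithmetic. This is a sketch: validate_evaluation is a hypothetical helper, and the field names are taken from the judge prompt.

```python
DIMENSIONS = ("fluency", "accuracy", "cultural_appropriateness")
RECOMMENDATIONS = {"pass", "review", "reject"}


def validate_evaluation(result: dict) -> dict:
    """Check the judge's output against the rubric and recompute overall_score."""
    scores = []
    for name in DIMENSIONS:
        entry = result.get(name)
        if not isinstance(entry, dict) or "score" not in entry:
            raise ValueError(f"missing score for {name}")
        score = entry["score"]
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 1 <= score <= 10:
            raise ValueError(f"{name} score out of range: {score!r}")
        scores.append(score)

    if result.get("recommendation") not in RECOMMENDATIONS:
        raise ValueError(f"unexpected recommendation: {result.get('recommendation')!r}")

    cleaned = dict(result)
    cleaned["overall_score"] = round(sum(scores) / len(scores), 2)
    return cleaned


sample = {
    "fluency": {"score": 9, "reason": "Reads naturally."},
    "accuracy": {"score": 8, "reason": "Meaning preserved."},
    "cultural_appropriateness": {"score": 9, "reason": "Appropriate terminology."},
    "overall_score": 0,  # deliberately wrong; it gets recomputed
    "recommendation": "pass",
}
print(validate_evaluation(sample)["overall_score"])
```

Calling this on the parsed result before logging keeps malformed evaluations out of the results file.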

Step 4: Add a Logger
logger.py

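import json
import os
from datetime import datetime, timezone

RESULTS_FILE = "results/evaluations.json"

def log_evaluation(evaluation: dict):
    os.makedirs(os.path.dirname(RESULTS_FILE), exist_ok=True)

    # Load earlier evaluations; treat a missing or empty file as "no history yet"
    existing = []
    if os.path.exists(RESULTS_FILE) and os.path.getsize(RESULTS_FILE) > 0:
        with open(RESULTS_FILE, "r", encoding="utf-8") as f:
            existing = json.load(f)

    # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
    evaluation["timestamp"] = datetime.now(timezone.utc).isoformat()
    existing.append(evaluation)

    # ensure_ascii=False keeps non-ASCII text (e.g. Yoruba diacritics) readable in the log
    with open(RESULTS_FILE, "w", encoding="utf-8") as f:
        json.dump(existing, f, indent=2, ensure_ascii=False)

    print(f"✅ Logged evaluation. Overall score: {evaluation['overall_score']}/10")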

Step 5: Run It
main.py

import json

from evaluator import evaluate_translation
from logger import log_evaluation

source = "Please ensure the patient takes the medication twice daily with food."
translation = "Jọwọ rii daju pe alaisan mu oogun naa lẹmeji lojoojumọ pẹlu ounjẹ."
target_language = "Yoruba"

result = evaluate_translation(source, translation, target_language)
log_evaluation(result)

print(json.dumps(result, indent=2))

Sample Output

{
  "fluency": {
    "score": 9,
    "reason": "The translation reads naturally and follows Yoruba grammatical structure."
  },
  "accuracy": {
    "score": 8,
    "reason": "Core meaning is preserved; minor phrasing differences don't affect intent."
  },
  "cultural_appropriateness": {
    "score": 9,
    "reason": "Terminology is appropriate for a Nigerian Yoruba-speaking audience."
  },
  "overall_score": 8.67,
  "recommendation": "pass"
}

Step 6: Scale It with Batch Processing

from evaluator import batch_evaluate
from logger import log_evaluation

pairs = [
    {
        "source": "Welcome to our platform.",
        "translation": "Kaabọ si pẹpẹ wa.",
        "target_language": "Yoruba"
    },
    {
        "source": "Your payment was successful.",
        "translation": "Isanwo rẹ ti ṣaṣeyọri.",
        "target_language": "Yoruba"
    }
]

results = batch_evaluate(pairs)
for result in results:
    log_evaluation(result)

When to Use This Pattern

LLM-as-a-Judge works best when:

You need nuanced evaluation beyond keyword matching
You're working with low-resource languages where reference datasets are scarce
You want explainable scores — not just a number, but a reason
You're building human-in-the-loop review systems

Limitations to Be Aware Of

Cost: Every evaluation is an API call — batch wisely
Judge bias: LLMs have their own language biases; calibrate against human evaluators
Consistency: Add temperature=0 for more deterministic scoring
Self-evaluation: Don't use the same model as judge and translator
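Calibrating against human evaluators can start very simply: score the same translations both ways and compare. The sketch below uses a hypothetical helper, calibration_report, and made-up example scores; it reports how far the judge is from the humans on average and in which direction.

```python
def calibration_report(judge_scores: list, human_scores: list) -> dict:
    """Compare judge and human scores given for the same translations."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    n = len(diffs)
    return {
        "mean_absolute_error": sum(abs(d) for d in diffs) / n,
        "mean_bias": sum(diffs) / n,  # positive means the judge scores higher than humans
    }


# Illustrative numbers only, not real evaluation results.
report = calibration_report([9, 8, 6, 7], [8, 8, 4, 7])
print(report)  # {'mean_absolute_error': 0.75, 'mean_bias': 0.75}
```

A consistently positive bias suggests tightening the rubric in the judge prompt; a large error with little bias points to inconsistency rather than leniency.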

What's Next

Add a dashboard to visualize score trends over time
Integrate with GitHub Actions to auto-evaluate translations in CI/CD
Extend to multi-language pairs with language-specific rubrics
Add human feedback loops to fine-tune your judge prompt over time

Conclusion
LLM-as-a-Judge is one of the most practical evaluation techniques available today — especially for language tasks where ground truth is hard to define. With just a well-crafted prompt and a structured output format, you can build an evaluation system that catches what traditional metrics miss.
The full code is available on GitHub.

Building a Fault-Tolerant Job Queue: Node.js Producers, Elixir/OTP Consumers

Architecture Overview

Part 1: The Node.js Producer

Project Setup

Redis Stream Producer

The API

Part 2: The Elixir/OTP Consumer

Project Setup

Configuration

The Supervision Tree

The Redis Connection Pool

Ensuring the Consumer Group Exists

The QueueConsumer GenServer

The WorkerSupervisor

The JobWorker GenServer

Dead Letter Queue

Telemetry

Watching the Supervision Tree in Action

The Failure Matrix

Where to Take It Next

Ractors vs Fibers: Ruby Concurrency Without the Hand-Waving

The Core Distinction, Precisely

Part 1: Fibers

The Basics — Fibers as Resumable Closures

Fibers as Enumerators

Fiber::Scheduler — The Real Power

Fibers for Concurrent HTTP Requests (Real Example)

Part 2: Ractors

The Basics — Isolated Parallel Workers

The Isolation Rules — Where Most Code Breaks

Message Passing Patterns

A Worker Pool with a Shared Queue

Ractors and Classes — Where Things Get Thorny

Practical Ractor: Parallel File Processing

Head-to-Head Benchmarks

I/O-Bound: 50 Concurrent HTTP Requests

CPU-Bound: Parallel Prime Sieve

The Comparison Table

Combining Both: Parallel Workers, Each with Async I/O

What's Still Broken with Ractors (Honest Assessment)

Decision Guide

Node.js Performance at the Limit: Profiling, Fixing, and Proving It with Real Numbers

The Benchmark Harness First

The Patient: A Realistic Slow API

Problem 1: The N+1 Query

Proof First

The Fix: JOIN Everything

Problem 2: CPU Blocking — The Fingerprint Loop

The Analysis

Problem 3: Memory Pressure and GC Pauses

The Fix: Return the Postgres Result Directly

Problem 4: The Event Loop — Blocking JSON Serialization

Fix A: Streaming JSON with fast-json-stringify

Fix B: For Very Large Responses — JSONStream

Problem 5: Connection Pool Starvation

Tuning the Pool

The Complete Optimized Handler

Full Benchmark Summary

What to Do When the Low-Hanging Fruit Is Gone

The Discipline

Row-Level Multitenancy in Rails: Building a Bulletproof Tenant Isolation Layer from Scratch

The Mental Model

Step 1: The Tenant Model and Migration

Step 2: Thread-Local Tenant Context via Current

Step 3: Middleware for Tenant Resolution

Step 4: Enforcing Tenant Scope at the Model Layer

A Note on default_scope

Step 5: Postgres Row-Level Security as a Hard Backstop

Step 6: Background Jobs Without Foot Guns

Step 7: Test Helpers That Enforce the Rules

Step 8: The Admin Escape Hatch

Common Pitfalls and How to Catch Them

Joins That Escape the Scope

Cached Queries in Mailers

Fixtures and Seeds With tenant_id

What You Now Have

Where to Go Next

Error Handling Patterns for Python AI Pipelines: What to Catch, What to Retry, and What to Alert On

The Four Categories of AI Pipeline Failures
