<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ankit Jangwan</title>
    <description>The latest articles on Forem by Ankit Jangwan (@jangwanankit).</description>
    <link>https://forem.com/jangwanankit</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1789656%2F53532038-5a9a-45e4-a8e6-7c3631d2d5d3.jpeg</url>
      <title>Forem: Ankit Jangwan</title>
      <link>https://forem.com/jangwanankit</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jangwanankit"/>
    <language>en</language>
    <item>
      <title>How to Optimise Backend Performance: A Practical Playbook</title>
      <dc:creator>Ankit Jangwan</dc:creator>
      <pubDate>Fri, 03 Apr 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/jangwanankit/how-to-optimise-backend-performance-a-practical-playbook-1485</link>
      <guid>https://forem.com/jangwanankit/how-to-optimise-backend-performance-a-practical-playbook-1485</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: Backend performance work is a loop: observe, profile, fix, verify. This post covers the full cycle. Setting up observability, identifying bottlenecks with percentile metrics, applying targeted fixes (N+1 queries, indexing, caching, async offloading), and verifying improvements against p75/p95/p99 latency targets.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Percentiles Matter More Than Averages
&lt;/h2&gt;

&lt;p&gt;Average response time is misleading. An endpoint averaging 80 ms might seem fine until you realise 5% of your users are waiting 800 ms or more.&lt;/p&gt;

&lt;p&gt;Percentile metrics give you the actual picture:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What It Tells You&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;p50&lt;/strong&gt; (median)&lt;/td&gt;
&lt;td&gt;The typical user experience&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;p75&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Where the experience starts degrading&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;p95&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The worst experience for most users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;p99&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The tail: your worst case under normal load&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The goal I worked towards: p95 under 200 ms, p99 under 500 ms, and critical queries completing in under 50 ms.&lt;/p&gt;

&lt;p&gt;When you optimise, you're compressing the gap between p50 and p99. A fast median with a slow tail means your system is unpredictable, and users notice unpredictability more than raw speed.&lt;/p&gt;
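
&lt;p&gt;Percentiles are cheap to compute from raw latency samples. A stdlib-only sketch (the sample data is invented to mimic the 80 ms / 800 ms split above):&lt;/p&gt;

```python
import statistics

# Simulated response times (ms): a fast median with a slow 5% tail
latencies_ms = [80] * 95 + [800] * 5

# statistics.quantiles with n=100 returns the 99 percentile cut points
cuts = statistics.quantiles(latencies_ms, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={statistics.mean(latencies_ms):.0f} ms, "
      f"p50={p50:.0f} ms, p95={p95:.0f} ms, p99={p99:.0f} ms")
```

&lt;p&gt;The mean lands at 116 ms and looks healthy; the tail percentiles expose the 800 ms experience the average hides.&lt;/p&gt;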




&lt;h2&gt;
  
  
  Step 1: Establish Observability
&lt;/h2&gt;

&lt;p&gt;Before touching any code, you need visibility into what your system is actually doing. I've seen teams spend weeks optimising the wrong endpoint because they didn't have the data to tell them where the real problems were.&lt;/p&gt;

&lt;h3&gt;
  
  
  Application Performance Monitoring (APM)
&lt;/h3&gt;

&lt;p&gt;APM tools trace requests end-to-end through your stack. They break down where time goes: application code, database queries, external API calls, template rendering, serialisation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools&lt;/strong&gt;: Datadog APM, New Relic, Elastic APM, Jaeger (open-source)&lt;/p&gt;

&lt;p&gt;What to look for in APM data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flame graphs give you a visual breakdown of time spent in each function call&lt;/li&gt;
&lt;li&gt;Trace waterfalls show sequential vs. parallel execution of sub-operations&lt;/li&gt;
&lt;li&gt;Service maps lay out which services call which, and where dependencies bottleneck&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Database Profiling
&lt;/h3&gt;

&lt;p&gt;Most backend latency lives in the database layer. Profiling queries tells you exactly which ones are slow and why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools&lt;/strong&gt;: Datadog Database Monitoring, pganalyze (PostgreSQL), &lt;code&gt;django-debug-toolbar&lt;/code&gt; (local development). For a hands-on walkthrough of using &lt;code&gt;django-debug-toolbar&lt;/code&gt; and &lt;code&gt;snakeviz&lt;/code&gt; for local profiling, see my &lt;a href="https://ankitjang.one/blog/profiling-django-apis-debug-toolbar-snakeviz" rel="noopener noreferrer"&gt;Case Study: Profiling Django APIs with Debug Toolbar and snakeviz&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Key metrics to track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query execution time: how long the database spends running each query&lt;/li&gt;
&lt;li&gt;Query frequency: a 5 ms query executed 200 times per request is worse than a single 100 ms query&lt;/li&gt;
&lt;li&gt;Lock wait time: queries blocked waiting for row or table locks&lt;/li&gt;
&lt;li&gt;Rows scanned vs. rows returned: a high ratio points to missing indexes&lt;/li&gt;
&lt;/ul&gt;
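
&lt;p&gt;The scanned-vs-returned ratio shows up directly in a query plan. A self-contained SQLite sketch (table and data invented for illustration) of how an index changes the plan:&lt;/p&gt;

```python
import sqlite3

# In-memory database to show how an index changes the query plan
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(i % 50, i * 1.5) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the human-readable detail in column 3
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT * FROM orders WHERE customer_id = 7"
print(plan(query))   # full table scan: 1000 rows examined for ~20 returned

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
print(plan(query))   # index search: rows scanned close to rows returned
```

&lt;p&gt;The same exercise works against PostgreSQL with &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;, which additionally reports actual row counts and timings.&lt;/p&gt;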

&lt;h3&gt;
  
  
  Structured Logging
&lt;/h3&gt;

&lt;p&gt;Logs are your investigation trail. When APM shows a slow trace, logs tell you &lt;em&gt;what happened&lt;/em&gt; during that request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;structlog&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;structlog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_logger&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;duration_ms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;item_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;cache_hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cache_hit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Log with enough context to reconstruct the request path: IDs, durations, counts, cache hit/miss status.&lt;/p&gt;
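
&lt;p&gt;A small timing helper keeps &lt;code&gt;duration_ms&lt;/code&gt; consistent across log lines. A stdlib-only sketch (&lt;code&gt;timed&lt;/code&gt; is a hypothetical helper of my own, not a structlog API):&lt;/p&gt;

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(log_fields):
    # Record wall-clock duration and attach it to the log context
    start = time.perf_counter()
    try:
        yield log_fields
    finally:
        log_fields["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)

with timed({"event": "order_processed", "order_id": 42}) as fields:
    time.sleep(0.01)  # stand-in for the real work

print(fields)  # includes a duration_ms field ready to pass to the logger
```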

&lt;h3&gt;
  
  
  Dashboards and Alerting
&lt;/h3&gt;

&lt;p&gt;Combine these signals into dashboards. I use Datadog dashboards tracking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;p75 / p95 / p99 latency per endpoint over time&lt;/li&gt;
&lt;li&gt;Error rate alongside latency (slow responses often precede errors)&lt;/li&gt;
&lt;li&gt;Database query count per request, where a sudden jump signals a regression&lt;/li&gt;
&lt;li&gt;Queue depth for async workers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Set up both threshold-based and rate-of-change alerts. Static thresholds catch known-bad states; rate-of-change alerts catch regressions as they happen.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Screenshot to add&lt;/strong&gt; (&lt;code&gt;dd-dashboard-latency.png&lt;/code&gt;): Datadog dashboard showing p75/p95/p99 latency timeseries for a single endpoint. Capture a view where the three percentile lines are visible and diverging (e.g., p50 flat around 80 ms while p99 spikes to 800 ms). Include the time range selector and the endpoint name in the title. This gives readers a concrete reference for what "observability" looks like in practice.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Reading Profiling Data
&lt;/h2&gt;

&lt;p&gt;Setting up observability tools is step one. Getting useful information out of them is where most people get stuck. Below is how I read the output from the three profiling interfaces I use most: Python's &lt;code&gt;cProfile&lt;/code&gt;, Django Debug Toolbar, and Datadog APM.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python cProfile
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;cProfile&lt;/code&gt; is built into Python and requires no dependencies. It profiles function-level execution time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running a profile:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cProfile&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pstats&lt;/span&gt;

&lt;span class="c1"&gt;# Profile a function call
&lt;/span&gt;&lt;span class="n"&gt;cProfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my_slow_function()&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output.prof&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read the results
&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pstats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output.prof&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort_stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cumulative&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Top 20 functions
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For profiling a Django view in isolation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cProfile&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.test&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RequestFactory&lt;/span&gt;

&lt;span class="n"&gt;factory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RequestFactory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/api/orders/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;profiler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cProfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Profile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;profiler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enable&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;profiler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;disable&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;profiler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cumulative&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Reading the output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      1    0.000    0.000    1.842    1.842 views.py:45(order_list)
    200    0.003    0.000    1.650    0.008 models.py:12(get_customer)
    200    1.580    0.008    1.580    0.008 base.py:330(execute)
      1    0.001    0.001    0.180    0.180 serializers.py:88(to_representation)
      1    0.000    0.000    0.012    0.012 pagination.py:22(paginate)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ncalls&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How many times this function was called&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tottime&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Time spent inside this function, excluding sub-calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;percall&lt;/code&gt; (first)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;tottime / ncalls&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cumtime&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Total time spent in this function, including sub-calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;percall&lt;/code&gt; (second)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;cumtime / ncalls&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;How to read this:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start from the top (sorted by &lt;code&gt;cumtime&lt;/code&gt;). In the example above, &lt;code&gt;order_list&lt;/code&gt; takes 1.84 seconds total. Drilling down, &lt;code&gt;get_customer&lt;/code&gt; is called 200 times and accounts for 1.65 seconds — that's 89% of the total. The actual time is spent in &lt;code&gt;base.py:execute&lt;/code&gt;, which is Django's database query executor. This is a textbook N+1: 200 individual queries to fetch customer data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to focus on:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High &lt;code&gt;ncalls&lt;/code&gt; on database functions: N+1 queries&lt;/li&gt;
&lt;li&gt;High &lt;code&gt;tottime&lt;/code&gt; on a single function: CPU-bound bottleneck (serialisation, computation)&lt;/li&gt;
&lt;li&gt;High &lt;code&gt;cumtime&lt;/code&gt; with low &lt;code&gt;tottime&lt;/code&gt;: the function itself is fast but calls something slow&lt;/li&gt;
&lt;/ul&gt;
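
&lt;p&gt;When the full table is noisy, &lt;code&gt;pstats&lt;/code&gt; can restrict the report with a regex filter. A runnable sketch (&lt;code&gt;helper&lt;/code&gt; and &lt;code&gt;handler&lt;/code&gt; are invented stand-ins):&lt;/p&gt;

```python
import cProfile
import io
import pstats

def helper(n):
    return sum(i * i for i in range(n))

def handler():
    # Calls helper many times, mimicking a hot inner loop
    return [helper(1000) for _ in range(200)]

profiler = cProfile.Profile()
profiler.enable()
handler()
profiler.disable()

# print_stats accepts regex restrictions applied against
# the "filename:lineno(function)" column
out = io.StringIO()
stats = pstats.Stats(profiler, stream=out).sort_stats("cumulative")
stats.print_stats("helper")
print(out.getvalue())
```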

&lt;p&gt;For a visual alternative to the text output, load the cProfile output file into &lt;code&gt;snakeviz&lt;/code&gt; — it renders the same data as an interactive flame graph in the browser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; cProfile &lt;span class="nt"&gt;-o&lt;/span&gt; output.prof my_script.py
snakeviz output.prof
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read the &lt;code&gt;snakeviz&lt;/code&gt; view top-down. Each box is a function, and boxes nested below others were called by the function above. Wider boxes took more time. Click a box to zoom in, and sort the table below by &lt;code&gt;ncalls&lt;/code&gt; or &lt;code&gt;cumtime&lt;/code&gt; to find outliers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Screenshot to add&lt;/strong&gt; (&lt;code&gt;snakeviz-flamegraph.png&lt;/code&gt;): snakeviz browser output showing a sunburst or icicle chart for a Django view profile. Ideally capture a view where one function (e.g., a database query) is visibly wider than the rest, with the stats table below showing &lt;code&gt;ncalls&lt;/code&gt;, &lt;code&gt;tottime&lt;/code&gt;, and &lt;code&gt;cumtime&lt;/code&gt; columns. Annotate or circle the wide block to show what "this is where the time goes" looks like.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Django Debug Toolbar Profiling
&lt;/h3&gt;

&lt;p&gt;Django Debug Toolbar gives you per-request profiling without writing any code. It has several panels, but for performance work the most useful are the &lt;strong&gt;SQL panel&lt;/strong&gt; and the &lt;strong&gt;Profiling panel&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enabling the profiler:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# settings.py (development only)
&lt;/span&gt;&lt;span class="n"&gt;INSTALLED_APPS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;debug_toolbar&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;MIDDLEWARE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;debug_toolbar.middleware.DebugToolbarMiddleware&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;DEBUG_TOOLBAR_PANELS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;debug_toolbar.panels.sql.SQLPanel&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;debug_toolbar.panels.profiling.ProfilingPanel&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;debug_toolbar.panels.timer.TimerPanel&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;debug_toolbar.panels.cache.CachePanel&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;INTERNAL_IPS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The SQL Panel:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is usually the first place to look. It shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total number of queries and total time&lt;/li&gt;
&lt;li&gt;Each individual query with its SQL, execution time, and stack trace&lt;/li&gt;
&lt;li&gt;Duplicate queries highlighted (immediate N+1 indicator)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;EXPLAIN&lt;/code&gt; output for each query (click to expand)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What to look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"Similar" or "Duplicated" badges&lt;/strong&gt; — these are N+1 queries. The toolbar groups identical query patterns and shows how many times each pattern was executed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total query count&lt;/strong&gt; — a list API returning 50 items should not fire 150 queries. If it does, you're missing &lt;code&gt;select_related&lt;/code&gt; or &lt;code&gt;prefetch_related&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query time distribution&lt;/strong&gt; — if one query takes 200 ms and the rest take 1 ms each, that single query is your target&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The stack trace&lt;/strong&gt; — click on any query to see exactly which line of Python code triggered it. This tells you whether the query came from the view, the serialiser, a model method, or a template&lt;/li&gt;
&lt;/ul&gt;
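
&lt;p&gt;In Django the fix is usually &lt;code&gt;select_related&lt;/code&gt; (foreign keys, done as a JOIN) or &lt;code&gt;prefetch_related&lt;/code&gt; (to-many relations, done as a second batched query). The underlying difference is one JOIN versus N+1 round trips, which a plain-SQL sketch can demonstrate (tables and counts invented for illustration):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(i, f"c{i}") for i in range(50)])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(i, i % 50) for i in range(50)])

queries = 0
def run(sql, args=()):
    # Count every round trip to the database
    global queries
    queries += 1
    return conn.execute(sql, args).fetchall()

# N+1: one query for the list, then one per row for the related customer
orders = run("SELECT id, customer_id FROM orders")
for _, cid in orders:
    run("SELECT name FROM customers WHERE id = ?", (cid,))
print("N+1 pattern:", queries, "queries")   # 51

# The JOIN equivalent of select_related: everything in a single query
queries = 0
run("SELECT o.id, c.name FROM orders o JOIN customers c ON c.id = o.customer_id")
print("joined:", queries, "query")          # 1
```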

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Screenshot to add&lt;/strong&gt; (&lt;code&gt;ddt-sql-panel.png&lt;/code&gt;): Django Debug Toolbar SQL panel on a page with an N+1 problem. Capture the panel showing a high query count (e.g., "187 queries in 420 ms") with several queries marked "Duplicated" or "Similar" in red/orange badges. Expand one query to show the SQL text and the stack trace link. This is the most common first encounter with N+1 queries for Django developers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The Profiling Panel:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The profiling panel is disabled by default. Click its checkbox in the toolbar to activate it. On Python 3.12+, you need to run the dev server with &lt;code&gt;--nothreading&lt;/code&gt; for it to work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python manage.py runserver &lt;span class="nt"&gt;--nothreading&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once enabled, it shows a collapsible call tree for the current request, similar to &lt;code&gt;cProfile&lt;/code&gt; output but rendered as an indented HTML table. Each row shows a function, its cumulative time, own time, and call count. You can expand and collapse levels to drill into the call hierarchy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /api/orders/ — 1842 ms
├── OrderListView.get() — 1842 ms (cumtime)
│   ├── OrderQuerySet.all() — 12 ms
│   ├── OrderSerializer.to_representation() — 1650 ms
│   │   ├── CustomerField.to_representation() × 200 — 1580 ms
│   │   │   └── SQL: SELECT * FROM customers WHERE id = %s × 200
│   │   └── ItemSerializer.to_representation() × 200 — 60 ms
│   └── Paginator.paginate() — 180 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reading this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nesting&lt;/strong&gt; shows the call hierarchy. A slow parent with a fast own-time means the parent is slow because of its children&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call count&lt;/strong&gt; (× 200) is the key signal. If a function repeats many times inside a loop, you're probably looking at an N+1 or a missing batch operation&lt;/li&gt;
&lt;li&gt;Start from the deepest nodes with the highest cumulative time and work upward&lt;/li&gt;
&lt;li&gt;You can adjust &lt;code&gt;PROFILER_MAX_DEPTH&lt;/code&gt; (default: 10) and &lt;code&gt;PROFILER_THRESHOLD_RATIO&lt;/code&gt; (default: 8) in &lt;code&gt;DEBUG_TOOLBAR_CONFIG&lt;/code&gt; to control how deep the tree goes and which functions get included&lt;/li&gt;
&lt;/ul&gt;
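
&lt;p&gt;Both knobs live in &lt;code&gt;DEBUG_TOOLBAR_CONFIG&lt;/code&gt;. A minimal sketch (the values here are arbitrary examples, not recommendations):&lt;/p&gt;

```python
# settings.py (development only) -- tune how much of the call tree
# the Profiling panel renders
DEBUG_TOOLBAR_CONFIG = {
    "PROFILER_MAX_DEPTH": 15,        # default 10: maximum nesting depth shown
    "PROFILER_THRESHOLD_RATIO": 16,  # default 8: higher keeps more cheap calls
}
```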

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Screenshot to add&lt;/strong&gt; (&lt;code&gt;ddt-profiling-panel.png&lt;/code&gt;): Django Debug Toolbar Profiling panel showing the call tree for a request. Capture a view with several levels expanded, where one branch has a high cumulative time and a high call count (e.g., a serialiser method called 200× inside a loop). The indented table format with CumTime, TotTime, and Per Call columns should be visible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Datadog APM Resource Pages
&lt;/h3&gt;

&lt;p&gt;When you open a resource (endpoint) in Datadog APM, you see several tabs and visualisations. Here's what each one tells you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Resource Page Overview:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The top of the page shows aggregate metrics for the selected endpoint over your chosen time range:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Requests/sec&lt;/strong&gt; — throughput&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; — shown as p50, p75, p90, p95, p99 over time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Errors&lt;/strong&gt; — error rate as a percentage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total time&lt;/strong&gt; — the proportion of your service's total processing time spent on this resource&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Latency Distribution:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A histogram showing how response times are distributed. You want a tight cluster on the left. A long tail to the right means outlier requests are much slower than typical ones. Bimodal distributions (two humps) suggest two distinct code paths — for example, cache hits completing in 20 ms and cache misses in 400 ms.&lt;/p&gt;
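
&lt;p&gt;That bimodal shape falls straight out of a get-or-set cache pattern. A toy sketch with scaled-down timings (&lt;code&gt;get_customer&lt;/code&gt; and the sleep are invented stand-ins for a cached lookup):&lt;/p&gt;

```python
import time

cache = {}

def get_customer(cid):
    # Cache hit: fast path. Miss: simulate a slow database fetch, then store.
    if cid in cache:
        return cache[cid]
    time.sleep(0.005)  # stand-in for the slow query on a miss
    cache[cid] = {"id": cid}
    return cache[cid]

def timed_ms(fn, *args):
    start = time.perf_counter()
    fn(*args)
    return (time.perf_counter() - start) * 1000

miss = timed_ms(get_customer, 7)   # slow path, first request
hit = timed_ms(get_customer, 7)    # fast path, every request after

print(f"miss={miss:.1f} ms, hit={hit:.3f} ms")
```

&lt;p&gt;Plot enough of these and you get exactly two humps in the histogram, one per code path.&lt;/p&gt;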

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Screenshot to add&lt;/strong&gt; (&lt;code&gt;dd-latency-distribution.png&lt;/code&gt;): Datadog resource page latency distribution histogram for a problematic endpoint. Capture one showing a long tail (bulk of requests clustered around 50–100 ms but a visible tail stretching to 800+ ms). If you have an example of a bimodal distribution (two distinct humps), capture that as a second image (&lt;code&gt;dd-latency-bimodal.png&lt;/code&gt;) — it's a much clearer illustration of the cache-hit vs. cache-miss pattern.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Spans (Trace Waterfall):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you click into an individual trace, you get the span waterfall. Each span represents a unit of work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[django.request]──────────────────────── 1200 ms
  [django.middleware]──── 5 ms
  [django.view]──────────────────────── 1190 ms
    [postgresql.query]── 4 ms    SELECT * FROM orders WHERE ...
    [postgresql.query]── 3 ms    SELECT * FROM customers WHERE id = 1
    [postgresql.query]── 4 ms    SELECT * FROM customers WHERE id = 2
    [postgresql.query]── 3 ms    SELECT * FROM customers WHERE id = 3
    ... (197 more identical spans)
    [serialization]───── 15 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What each span tells you:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Span Attribute&lt;/th&gt;
&lt;th&gt;What It Shows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Which service produced this span (web app, database, cache, external API)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The type of work (e.g., &lt;code&gt;postgresql.query&lt;/code&gt;, &lt;code&gt;redis.command&lt;/code&gt;, &lt;code&gt;http.request&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Duration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How long this span took&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The specific query, URL, or cache key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Error flag&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Whether this span resulted in an error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Child count&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Number of child spans (sub-operations)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;How to read the waterfall:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spans stacked vertically with small gaps are executing sequentially. This is normal for database queries within a single-threaded request&lt;/li&gt;
&lt;li&gt;A tall stack of identical spans (same operation, same resource pattern) is an N+1. In the example above, 200 &lt;code&gt;postgresql.query&lt;/code&gt; spans with &lt;code&gt;SELECT * FROM customers WHERE id = ?&lt;/code&gt; is the smoking gun&lt;/li&gt;
&lt;li&gt;Spans with long durations but no children indicate time spent in application code (CPU-bound work, synchronous I/O)&lt;/li&gt;
&lt;li&gt;A single very wide span early in the waterfall followed by fast spans suggests a slow initial query or connection setup&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Screenshot to add&lt;/strong&gt; (&lt;code&gt;dd-trace-waterfall-n1.png&lt;/code&gt;): Datadog trace waterfall view for a request with an N+1 problem. Capture a trace where you can see many identical &lt;code&gt;postgresql.query&lt;/code&gt; spans stacked vertically (each 3–5 ms, but dozens of them). The total trace should be visibly long (1000+ ms). The span colours should show postgresql spans in a distinct colour from the django spans. This is the "smoking gun" visual for N+1.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Span List tab:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The span list groups spans by resource and service, sorted by span count. Instead of a timeline, you see a table with columns for resource name, number of spans, average duration, execution time, and percentage of total trace time. This is useful when you want to quickly answer "which database query ran the most times?" or "which service consumed the most time?" without scrolling through a long waterfall.&lt;/p&gt;

&lt;p&gt;Sort by &lt;code&gt;SPANS&lt;/code&gt; to find N+1 patterns (one query repeated hundreds of times), or by &lt;code&gt;% EXEC TIME&lt;/code&gt; to find the single heaviest operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flame Graph tab:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The flame graph is the default trace visualisation in Datadog. It shows all spans from a trace laid out on a timeline, colour-coded by service.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The x-axis is time. Wider spans took longer&lt;/li&gt;
&lt;li&gt;The y-axis is call depth. Each row is a child of the row above it&lt;/li&gt;
&lt;li&gt;Colours represent services by default (you can switch to group by host or container)&lt;/li&gt;
&lt;li&gt;Spans from different services are visually distinct, so you can tell at a glance whether time is spent in your application code, the database, or an external API call&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reading the flame graph:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Look for the widest spans at the deepest level. These are where actual time is spent&lt;/li&gt;
&lt;li&gt;Hover over any span to see the service name, operation, resource, and duration&lt;/li&gt;
&lt;li&gt;Click a span to open the detail panel below, which includes the full query text, error details, and related logs&lt;/li&gt;
&lt;li&gt;Use the legend at the top to see what percentage of total execution time each service accounts for. If postgresql takes 80% of the trace, the database is your bottleneck&lt;/li&gt;
&lt;li&gt;Toggle the &lt;strong&gt;Errors&lt;/strong&gt; checkbox under "Filter Spans" to highlight error spans in the graph&lt;/li&gt;
&lt;li&gt;Compare flame graphs before and after a fix. The previously wide span should be narrower or gone&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Screenshot to add&lt;/strong&gt; (&lt;code&gt;dd-flamegraph.png&lt;/code&gt;): Datadog flame graph for a trace, colour-coded by service. Capture one where the postgresql service takes a large portion of the total width, with many narrow child spans visible. The legend at the top should show the &lt;code&gt;% Exec Time&lt;/code&gt; breakdown per service. If possible, capture a second image (&lt;code&gt;dd-flamegraph-fixed.png&lt;/code&gt;) of the same endpoint after an N+1 fix — the postgresql portion should be visibly smaller.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Walkthrough: Finding a Bottleneck End-to-End
&lt;/h2&gt;

&lt;p&gt;Here's how the full process looks on a real endpoint. I'll use a simplified version of a case I've worked through.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The symptom:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Datadog dashboard shows &lt;code&gt;/api/orders/&lt;/code&gt; with p95 at 1200 ms, well above the 200 ms target. The endpoint handles 50,000 requests/day, making it a critical priority.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Check the Datadog resource page&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open the resource page for &lt;code&gt;/api/orders/&lt;/code&gt;. The latency distribution shows a long tail — p50 is 180 ms, but p95 jumps to 1200 ms. The tail requests correlate with customers who have many orders.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Drill into a slow trace&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Filter traces by duration &amp;gt; 1000 ms. Open one. The span waterfall shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[django.request] ────────────────────────── 1180 ms
  [django.view] ─────────────────────────── 1170 ms
    [postgresql.query] ─── 8 ms   SELECT * FROM orders WHERE user_id = 42 ...
    [postgresql.query] ─── 3 ms   SELECT * FROM customers WHERE id = 42
    [postgresql.query] ─── 4 ms   SELECT * FROM order_items WHERE order_id = 101
    [postgresql.query] ─── 3 ms   SELECT * FROM products WHERE id = 55
    [postgresql.query] ─── 4 ms   SELECT * FROM order_items WHERE order_id = 102
    [postgresql.query] ─── 3 ms   SELECT * FROM products WHERE id = 23
    ... (380 more query spans)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total query count: 384. Total database time: ~980 ms. The rest is serialisation overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Identify the pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two N+1 patterns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For each order, a separate query fetches its items (&lt;code&gt;SELECT * FROM order_items WHERE order_id = ?&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;For each item, a separate query fetches the product (&lt;code&gt;SELECT * FROM products WHERE id = ?&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Reproduce locally with Debug Toolbar&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hit the same endpoint locally with &lt;code&gt;django-debug-toolbar&lt;/code&gt; enabled. The SQL panel confirms: 384 queries, with "Duplicated" badges on the &lt;code&gt;order_items&lt;/code&gt; and &lt;code&gt;products&lt;/code&gt; queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Apply the fix&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After
&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_related&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prefetch_related&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;Prefetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;queryset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;OrderItem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_related&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query count drops from 384 to 2: one query for orders with the customer joined in (&lt;code&gt;select_related&lt;/code&gt;), and one for all order items with their products joined via the &lt;code&gt;Prefetch&lt;/code&gt; queryset's &lt;code&gt;select_related('product')&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Verify&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Push to staging. Monitor the Datadog resource page for &lt;code&gt;/api/orders/&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;p50: 180 ms → 45 ms&lt;/li&gt;
&lt;li&gt;p95: 1200 ms → 95 ms&lt;/li&gt;
&lt;li&gt;p99: 2400 ms → 180 ms&lt;/li&gt;
&lt;li&gt;Query count per request: 384 → 2&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The latency distribution shifts from a long-tail shape to a tight cluster under 100 ms. Ship to production.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Screenshot to add&lt;/strong&gt; (&lt;code&gt;walkthrough-before-after.png&lt;/code&gt;): Side-by-side or stacked comparison of the Datadog latency distribution for &lt;code&gt;/api/orders/&lt;/code&gt; before and after the fix. The "before" should show a long tail; the "after" should show a tight cluster. If a side-by-side isn't possible, use two separate images (&lt;code&gt;walkthrough-before.png&lt;/code&gt; and &lt;code&gt;walkthrough-after.png&lt;/code&gt;). This is the payoff visual — it shows the reader what success looks like in Datadog.&lt;/p&gt;
&lt;/blockquote&gt;
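&lt;p&gt;If you want to sanity-check percentile maths outside Datadog, the standard library is enough. A minimal sketch with synthetic latencies (illustrative numbers, not the real endpoint's data) showing how a healthy-looking average hides a slow tail:&lt;/p&gt;

```python
from statistics import mean, quantiles

def latency_report(samples_ms):
    """p50/p95/p99 from raw latency samples.

    method='inclusive' interpolates between data points, matching the
    common 'linear' percentile definition.
    """
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Synthetic long tail: 90% fast, 8% slow, 2% very slow.
samples = [40] * 90 + [300] * 8 + [1200] * 2

print(mean(samples))            # 84.0 -- the average looks fine
print(latency_report(samples))  # p50=40, p95=300, p99=1200
```

&lt;p&gt;The average (84 ms) would comfortably pass a 200 ms target while 2% of requests take over a second — which is exactly why the targets in this post are stated as percentiles.&lt;/p&gt;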




&lt;h2&gt;
  
  
  Step 2: Identify and Prioritise Bottlenecks
&lt;/h2&gt;

&lt;p&gt;With observability in place, the next step is triage. Not every slow endpoint matters equally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prioritisation Framework
&lt;/h3&gt;

&lt;p&gt;Rank endpoints by impact × frequency:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;p95 Latency&lt;/th&gt;
&lt;th&gt;Requests/day&lt;/th&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/api/orders/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1200 ms&lt;/td&gt;
&lt;td&gt;50,000&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Critical&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/api/users/profile/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;400 ms&lt;/td&gt;
&lt;td&gt;30,000&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/api/reports/monthly/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3000 ms&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Low&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/api/dashboard/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;600 ms&lt;/td&gt;
&lt;td&gt;15,000&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A 3-second report endpoint used 200 times a day is less urgent than a 1.2-second orders endpoint hit 50,000 times. Fix what affects the most users first.&lt;/p&gt;
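&lt;p&gt;The ranking can be made mechanical. A small sketch using the table above (the scoring formula — total user-facing slow time per day — is one reasonable choice, not a standard):&lt;/p&gt;

```python
# Rank endpoints by approximate total slow time per day:
# p95 latency x daily request volume. Data mirrors the table above.

endpoints = [
    {"path": "/api/orders/", "p95_ms": 1200, "req_per_day": 50_000},
    {"path": "/api/users/profile/", "p95_ms": 400, "req_per_day": 30_000},
    {"path": "/api/reports/monthly/", "p95_ms": 3000, "req_per_day": 200},
    {"path": "/api/dashboard/", "p95_ms": 600, "req_per_day": 15_000},
]

def impact_score(ep):
    # Seconds of p95-level waiting inflicted per day: impact x frequency.
    return ep["p95_ms"] * ep["req_per_day"] / 1000

for ep in sorted(endpoints, key=impact_score, reverse=True):
    print(f'{ep["path"]:28} {impact_score(ep):>10,.0f} s/day')
```

&lt;p&gt;The sort reproduces the table's priorities: &lt;code&gt;/api/orders/&lt;/code&gt; first, the monthly report last, despite the report being slowest per request.&lt;/p&gt;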

&lt;h3&gt;
  
  
  Common Bottleneck Patterns
&lt;/h3&gt;

&lt;p&gt;These are the patterns I've run into most often:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;N+1 queries&lt;/strong&gt;: a list endpoint fires one query per item instead of batching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing database indexes&lt;/strong&gt;: full table scans on filtered or sorted columns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over-fetching&lt;/strong&gt;: loading entire rows when only a few columns are needed, especially with large text or JSON fields&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synchronous blocking&lt;/strong&gt;: waiting on external APIs, email sending, or file processing in the request cycle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No caching&lt;/strong&gt;: recomputing identical results on every request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unoptimised serialisation&lt;/strong&gt;: serialisers performing additional queries or heavy computation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As any system scales, these patterns get more noticeable. An N+1 that's invisible with 10 records becomes a real problem at 10,000.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Apply Targeted Fixes
&lt;/h2&gt;

&lt;p&gt;Start with small wins. They're often low-effort but make a disproportionate difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix N+1 Queries
&lt;/h3&gt;

&lt;p&gt;N+1 is probably the most common performance bug in ORM-based backends. It happens when you load a list of objects and then access a related object on each one individually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This fires 1 query for orders + N queries for customer (one per order)
&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Each access = 1 query
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# select_related: single JOIN query for ForeignKey/OneToOne
&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_related&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# prefetch_related: two queries for ManyToMany/reverse FK
&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prefetch_related&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Detecting N+1 queries:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Datadog APM traces showing dozens of identical &lt;code&gt;SELECT&lt;/code&gt; statements per request&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;django-debug-toolbar&lt;/code&gt; showing query count spikes on list views&lt;/li&gt;
&lt;li&gt;Middleware that logs query count per request (useful in staging or with a debug flag):
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.db&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;connection&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;QueryCountMiddleware&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_response&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;initial&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;initial&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high_query_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: &lt;code&gt;connection.queries&lt;/code&gt; only populates when &lt;code&gt;DEBUG=True&lt;/code&gt;. In production, rely on APM tracing or a package like &lt;code&gt;django-querycount&lt;/code&gt; instead.&lt;/p&gt;
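&lt;p&gt;Whichever tool captures the queries, the tell-tale signature is many statements that differ only in their literal values. A rough sketch that groups captured SQL by shape (a regex heuristic, not a real SQL parser — the threshold is arbitrary):&lt;/p&gt;

```python
import re
from collections import Counter

def normalise(sql: str) -> str:
    """Collapse literal values so queries differing only in their
    parameters group together (a rough heuristic, not a SQL parser)."""
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # numeric literals
    return re.sub(r"\s+", " ", sql).strip()

def n_plus_one_suspects(queries, threshold=10):
    """Return (shape, count) pairs repeated at least `threshold` times."""
    counts = Counter(normalise(q) for q in queries)
    return [(shape, n) for shape, n in counts.most_common() if n >= threshold]

# Example: the repeated per-order lookup dominates.
captured = ["SELECT * FROM orders WHERE user_id = 42"]
captured += [f"SELECT * FROM order_items WHERE order_id = {i}" for i in range(190)]
for shape, n in n_plus_one_suspects(captured):
    print(n, shape)
```

&lt;p&gt;Feed it the &lt;code&gt;sql&lt;/code&gt; values from &lt;code&gt;connection.queries&lt;/code&gt; locally, or query text exported from trace spans; the shape at the top of the list is usually the N+1.&lt;/p&gt;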

&lt;h3&gt;
  
  
  Add Database Indexes
&lt;/h3&gt;

&lt;p&gt;Indexes are the highest-leverage single change for query performance. Without one, the database scans every row.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identifying missing indexes:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- PostgreSQL: find slow queries and their execution plans&lt;/span&gt;
&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;ANALYZE&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for &lt;code&gt;Seq Scan&lt;/code&gt; in the output. That means no index is being used.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adding targeted indexes:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Single-column index for filtered lookups&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_orders_status&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Composite index for queries that filter + sort&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_orders_status_created&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Partial index for a common filter condition&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_orders_pending&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;In Django migrations:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Migration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;migrations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Migration&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;operations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;migrations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AddIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;order&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-created_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;idx_order_status_created&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Index trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Indexes speed up reads but slow down writes (every INSERT/UPDATE must update the index)&lt;/li&gt;
&lt;li&gt;A plain &lt;code&gt;CREATE INDEX&lt;/code&gt; blocks writes to the table for the duration of the build, which can take minutes on large tables. Use &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; in PostgreSQL to keep writes flowing (at the cost of a slower build)&lt;/li&gt;
&lt;li&gt;Over-indexing wastes storage and makes the query planner's job harder&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Stop Over-Fetching Data
&lt;/h3&gt;

&lt;p&gt;Loading columns you don't need wastes memory and network bandwidth, especially with large text or JSON fields.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Loads ALL columns including a 50KB description field
&lt;/span&gt;&lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Only fetch what you need
&lt;/span&gt;&lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;only&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Or explicitly defer heavy fields
&lt;/span&gt;&lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;defer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata_json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# For read-only list views, use values/values_list
&lt;/span&gt;&lt;span class="n"&gt;product_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Implement Caching
&lt;/h3&gt;

&lt;p&gt;Cache results that are expensive to compute and don't change frequently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Application-level cache (Redis/Memcached)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.core.cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_dashboard_stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cache_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dashboard_stats:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_expensive_stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 5 minutes
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 2: Per-request memoisation with &lt;code&gt;cached_property&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.utils.functional&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cached_property&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrderSerializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serializers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ModelSerializer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@cached_property&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_prefetched_items&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_related&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 3: HTTP caching for read-heavy endpoints&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.views.decorators.cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cache_page&lt;/span&gt;

&lt;span class="nd"&gt;@cache_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Cache for 5 minutes
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;product_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cache invalidation strategies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time-based (TTL): simplest, works for data that can tolerate staleness&lt;/li&gt;
&lt;li&gt;Event-based: invalidate on write operations using signals or hooks&lt;/li&gt;
&lt;li&gt;Versioned keys: append a version counter to cache keys, increment on data changes
&lt;/li&gt;
&lt;/ul&gt;
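&lt;p&gt;Versioned keys avoid explicit deletes entirely: writers bump a counter, and readers never see keys built with the old version. A minimal sketch (a plain dict stands in for Redis/Memcached, and the key names are illustrative):&lt;/p&gt;

```python
# Versioned cache keys: invalidation is a single counter increment,
# which makes every key built with the old version unreachable.
cache = {}  # stand-in for Redis/Memcached

def _version(user_id):
    return cache.get(f"orders_version:{user_id}", 1)

def cache_orders(user_id, payload):
    cache[f"orders:{user_id}:v{_version(user_id)}"] = payload

def get_cached_orders(user_id):
    return cache.get(f"orders:{user_id}:v{_version(user_id)}")

def invalidate_orders(user_id):
    # One increment invalidates every entry cached under the old version.
    cache[f"orders_version:{user_id}"] = _version(user_id) + 1

cache_orders(42, ["order-101", "order-102"])
assert get_cached_orders(42) == ["order-101", "order-102"]
invalidate_orders(42)
assert get_cached_orders(42) is None  # old entry is unreachable
```

&lt;p&gt;The trade-off: stale entries linger until evicted, so in a real cache pair versioned keys with a TTL to reclaim the dead space.&lt;/p&gt;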

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.db.models.signals&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;post_save&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.dispatch&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;receiver&lt;/span&gt;

&lt;span class="nd"&gt;@receiver&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_save&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invalidate_order_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dashboard_stats:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_detail:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Offload Async Work
&lt;/h3&gt;

&lt;p&gt;Anything that doesn't need to happen before the HTTP response can be moved out of the request cycle. I use this pattern extensively in my &lt;a href="https://ankitjang.one/case-studies/message-scheduler" rel="noopener noreferrer"&gt;Message Scheduler&lt;/a&gt; project — Celery workers handle email and Telegram delivery asynchronously, keeping the API response under 50 ms even during peak load.&lt;/p&gt;

&lt;p&gt;Common candidates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sending emails and notifications&lt;/li&gt;
&lt;li&gt;Generating reports or PDFs&lt;/li&gt;
&lt;li&gt;Processing uploaded files&lt;/li&gt;
&lt;li&gt;Syncing data with external services&lt;/li&gt;
&lt;li&gt;Updating search indexes&lt;/li&gt;
&lt;li&gt;Aggregating analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Using Celery in Django:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tasks.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;celery&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shared_task&lt;/span&gt;

&lt;span class="nd"&gt;@shared_task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_order_confirmation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;send_email&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Order #&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; confirmed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;render_confirmation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# views.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="n"&gt;send_order_confirmation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Non-blocking
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;201&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This takes email sending (which can take 500 ms–2 s depending on the provider) out of the response path entirely. For a production example of Celery + Redis in action with retry logic and idempotency keys, see the &lt;a href="https://ankitjang.one/case-studies/message-scheduler" rel="noopener noreferrer"&gt;Message Scheduler case study&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Connection Pooling
&lt;/h3&gt;

&lt;p&gt;Opening a new database connection per request is expensive. Connection pooling keeps a pool of reusable connections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Django with PostgreSQL:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# settings.py
&lt;/span&gt;&lt;span class="n"&gt;DATABASES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ENGINE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;django.db.backends.postgresql&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NAME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mydb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CONN_MAX_AGE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Keep connections alive for 10 minutes
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;CONN_MAX_AGE&lt;/code&gt; keeps connections alive across requests within a thread. For actual connection pooling with control over pool size, use PgBouncer as an external pooler between Django and PostgreSQL. This matters when you're running multiple application workers.&lt;/p&gt;
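&lt;p&gt;A minimal PgBouncer configuration sketch (all values illustrative, tune for your workload):&lt;/p&gt;

```ini
; pgbouncer.ini (illustrative values)
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
; Django's DATABASES HOST/PORT point at PgBouncer, not Postgres
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; release the server connection back to the pool after each transaction
pool_mode = transaction
; server connections per database/user pair
default_pool_size = 20
; total client connections PgBouncer will accept
max_client_conn = 500
```

&lt;p&gt;With &lt;code&gt;pool_mode = transaction&lt;/code&gt;, set &lt;code&gt;CONN_MAX_AGE = 0&lt;/code&gt; in Django and avoid session-level features such as advisory locks, since consecutive transactions from one client may run on different server connections.&lt;/p&gt;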

&lt;h3&gt;
  
  
  Use Read Replicas
&lt;/h3&gt;

&lt;p&gt;For read-heavy workloads, route read queries to replica databases while writes go to the primary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Django database router:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PrimaryReplicaRouter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;db_for_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;hints&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;replica&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;db_for_write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;hints&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;allow_relation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obj1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obj2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;hints&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;allow_migrate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;app_label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;hints&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Caveat&lt;/strong&gt;: Read replicas have replication lag (usually milliseconds, but it can spike under load). Don't route reads to replicas immediately after a write if the user expects to see their own changes. This causes "read-your-own-write" inconsistency.&lt;/p&gt;
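&lt;p&gt;One mitigation is to pin reads to the primary for a short window after a write. A minimal sketch, not from the post; a real router would track the last write per request or session (thread-locals or middleware), not on a single shared instance:&lt;/p&gt;

```python
import time

class StickyPrimaryRouter:
    """Sketch: after a write, keep routing reads to the primary for a
    short window, so users always see their own changes even when the
    replica is lagging."""

    STICKY_SECONDS = 5.0

    def __init__(self):
        self._last_write = float("-inf")

    def db_for_write(self, model, **hints):
        self._last_write = time.monotonic()
        return "default"

    def db_for_read(self, model, **hints):
        # Within the sticky window the replica may not have caught up yet.
        if time.monotonic() - self._last_write < self.STICKY_SECONDS:
            return "default"
        return "replica"
```

&lt;p&gt;Pick the window based on your observed replication lag percentiles, not a guess.&lt;/p&gt;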




&lt;h2&gt;
  
  
  Step 4: Query-Level Deep Dives
&lt;/h2&gt;

&lt;p&gt;When quick fixes aren't enough, you need to go deeper into individual query performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using EXPLAIN ANALYZE
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; executes the query and shows the plan the database used. It's where you go when you need to understand exactly why a specific query is slow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;ANALYZE&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'processing'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What to look for in the output:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Indicator&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Seq Scan&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full table scan&lt;/td&gt;
&lt;td&gt;Add an index on the filtered columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Nested Loop&lt;/code&gt; with high row count&lt;/td&gt;
&lt;td&gt;Looping join on large result set&lt;/td&gt;
&lt;td&gt;Consider a hash join, or add indexes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Sort&lt;/code&gt; with high cost&lt;/td&gt;
&lt;td&gt;Sorting without index&lt;/td&gt;
&lt;td&gt;Add an index that matches the ORDER BY&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Rows Removed by Filter&lt;/code&gt; (high number)&lt;/td&gt;
&lt;td&gt;Index not selective enough&lt;/td&gt;
&lt;td&gt;Use a more specific composite index&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Buffers: shared read&lt;/code&gt; (high)&lt;/td&gt;
&lt;td&gt;Data not in memory&lt;/td&gt;
&lt;td&gt;Increase &lt;code&gt;shared_buffers&lt;/code&gt; or optimise query to touch fewer pages&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Screenshot to add&lt;/strong&gt; (&lt;code&gt;explain-analyze-output.png&lt;/code&gt;): Terminal output from running &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; on a query with a &lt;code&gt;Seq Scan&lt;/code&gt;. Highlight or annotate the &lt;code&gt;Seq Scan&lt;/code&gt; node and the &lt;code&gt;rows=&lt;/code&gt; vs &lt;code&gt;Rows Removed by Filter&lt;/code&gt; values. If you have a second capture showing the same query after adding an index (showing &lt;code&gt;Index Scan&lt;/code&gt; instead), include that as &lt;code&gt;explain-analyze-indexed.png&lt;/code&gt; for a before/after comparison.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Reproducing Production Queries Safely
&lt;/h3&gt;

&lt;p&gt;Don't run &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; on production directly: it actually executes the query, and on an &lt;code&gt;UPDATE&lt;/code&gt; or &lt;code&gt;DELETE&lt;/code&gt; it performs the writes (wrap those in &lt;code&gt;BEGIN ... ROLLBACK&lt;/code&gt; if you have no other option). Instead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Copy the slow query from Datadog/APM traces&lt;/li&gt;
&lt;li&gt;Run it in a read replica or staging environment with production-like data&lt;/li&gt;
&lt;li&gt;Compare plans between staging and production (data distribution matters)
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Connect to read replica&lt;/span&gt;
psql &lt;span class="nt"&gt;-h&lt;/span&gt; replica-host &lt;span class="nt"&gt;-U&lt;/span&gt; readonly_user &lt;span class="nt"&gt;-d&lt;/span&gt; mydb

&lt;span class="c"&gt;# Set statement timeout as a safety net&lt;/span&gt;
SET statement_timeout &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'30s'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

EXPLAIN &lt;span class="o"&gt;(&lt;/span&gt;ANALYZE, BUFFERS, FORMAT TEXT&lt;span class="o"&gt;)&lt;/span&gt;
SELECT ...&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 5: Verify and Monitor
&lt;/h2&gt;

&lt;p&gt;Every fix needs measurement. The process I follow: push to staging, monitor, confirm the improvement, then ship to production; if the numbers don't move, revert.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Verification Process
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Push fix to staging
       ↓
Monitor Datadog dashboards (p75, p95, p99)
       ↓
  ┌─── Improved? ───┐
  ↓                  ↓
 YES                 NO
  ↓                  ↓
Push to prod     Investigate further
  ↓                  ↓
Monitor prod     Iterate or revert
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What to Check After Each Fix
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Did p95/p99 latency actually drop?&lt;/li&gt;
&lt;li&gt;Did N+1 fixes reduce total query count per request?&lt;/li&gt;
&lt;li&gt;Did the fix introduce any new errors? (Performance "fixes" sometimes break things.)&lt;/li&gt;
&lt;li&gt;Did database CPU and I/O improve? Reduced query time should show up here too.&lt;/li&gt;
&lt;li&gt;For caching changes, is the hit ratio trending upward?&lt;/li&gt;
&lt;/ul&gt;
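&lt;p&gt;For spot checks outside the dashboard, for example against raw timings pulled from logs, percentiles are one standard-library call away. A small sketch:&lt;/p&gt;

```python
import statistics

def percentile(latencies_ms, q):
    """Return the q-th percentile (q in 1..99) of raw request timings.

    quantiles(n=100) returns the 99 cut points that split the data
    into 100 equal groups, so index q-1 is the q-th percentile.
    """
    cut_points = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cut_points[q - 1]
```

&lt;p&gt;Run it over the same time window before and after the deploy to confirm that p95 and p99 actually moved.&lt;/p&gt;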

&lt;h3&gt;
  
  
  Tracking Regressions
&lt;/h3&gt;

&lt;p&gt;Performance work isn't something you do once and move on. Set up monitors that alert on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;p75, p95, p99 latency exceeding a threshold for more than 5 minutes&lt;/li&gt;
&lt;li&gt;Query count per request increasing by more than 20%&lt;/li&gt;
&lt;li&gt;Cache hit rate dropping below 80%&lt;/li&gt;
&lt;li&gt;New slow queries appearing (anything exceeding 500 ms)&lt;/li&gt;
&lt;/ul&gt;
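&lt;p&gt;For the last one, the &lt;code&gt;pg_stat_statements&lt;/code&gt; extension gives you the same signal straight from the database (column names per PostgreSQL 13+, where timings are reported in milliseconds):&lt;/p&gt;

```sql
-- Statements whose mean execution time exceeds the 500 ms threshold
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 500
ORDER BY mean_exec_time DESC
LIMIT 20;
```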




&lt;h2&gt;
  
  
  Trade-offs and Gotchas
&lt;/h2&gt;

&lt;p&gt;Performance work involves trade-offs. Here are the ones I've dealt with most.&lt;/p&gt;

&lt;h3&gt;
  
  
  Indexing on Large Tables
&lt;/h3&gt;

&lt;p&gt;Adding indexes to tables with hundreds of millions of rows isn't always possible. &lt;code&gt;CREATE INDEX&lt;/code&gt; can lock the table for minutes. &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; avoids locking but takes longer and can fail under high write throughput. Sometimes the answer is redesigning: partition the table, archive old data, or use a materialised view.&lt;/p&gt;
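&lt;p&gt;For reference, the concurrent form looks like this (index name and columns are illustrative):&lt;/p&gt;

```sql
-- Builds without holding a long write lock. Cannot run inside a
-- transaction block, and a failed build leaves an INVALID index behind.
CREATE INDEX CONCURRENTLY idx_orders_status_created
    ON orders (status, created_at DESC);

-- Clean up after a failed build before retrying:
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_status_created;
```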

&lt;h3&gt;
  
  
  Cache Invalidation
&lt;/h3&gt;

&lt;p&gt;Incorrect invalidation leads to stale data: users seeing outdated information, balance mismatches, ghost records. Start with short TTLs and event-based invalidation. Don't cache data that changes on every request.&lt;/p&gt;

&lt;h3&gt;
  
  
  Over-Optimisation
&lt;/h3&gt;

&lt;p&gt;Not every endpoint needs 50 ms latency. A monthly report endpoint used by 3 internal users can take 5 seconds and nobody will notice. Spend your time where users are, not where the numbers look bad in isolation.&lt;/p&gt;

&lt;p&gt;One useful approach: set alerting thresholds relative to traffic. If an endpoint handles 50,000 requests/day, alert at 150 ms p75, 300 ms p95, 500 ms p99. A low-traffic internal endpoint can have much looser thresholds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Read Replica Lag
&lt;/h3&gt;

&lt;p&gt;Replication lag is usually sub-second, but under load it can spike. Design your application to tolerate this. Route reads-after-writes to the primary, not the replica.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tools Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;APM / Tracing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Datadog APM, New Relic, Elastic APM, Jaeger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Database Profiling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Datadog DB Monitoring, pganalyze, &lt;code&gt;django-debug-toolbar&lt;/code&gt; — see also my &lt;a href="https://ankitjang.one/blog/profiling-django-apis-debug-toolbar-snakeviz" rel="noopener noreferrer"&gt;Django profiling walkthrough&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Logging&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;structlog, ELK Stack, Datadog Logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Caching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Redis, Memcached, Django cache framework&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Async Workers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Celery + Redis/RabbitMQ — production example in &lt;a href="https://ankitjang.one/case-studies/message-scheduler" rel="noopener noreferrer"&gt;Message Scheduler&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Pooling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PgBouncer, Django &lt;code&gt;CONN_MAX_AGE&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Load Testing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Locust, k6, Apache Bench&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Python Profiling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cProfile&lt;/code&gt;, &lt;code&gt;snakeviz&lt;/code&gt;, Pyinstrument&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Analysis&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;, &lt;code&gt;pg_stat_statements&lt;/code&gt;, &lt;code&gt;auto_explain&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For language-specific performance tradeoffs: my &lt;a href="https://ankitjang.one/case-studies/healthlab" rel="noopener noreferrer"&gt;HealthLab project&lt;/a&gt; uses Go for a single-binary deployment with goroutine-based concurrency, which avoids the GIL limitations I hit in Python for CPU-bound bot processing. And my &lt;a href="https://ankitjang.one/case-studies/portfolio" rel="noopener noreferrer"&gt;portfolio system&lt;/a&gt; shows how Jinja2 template rendering with LaTeX achieves a 5x improvement over the previous manual workflow.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Measure first, optimise second. Observability is a prerequisite.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Percentiles over averages. p95 and p99 show what users actually experience.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fix the boring stuff first. N+1 queries, missing indexes, and over-fetching account for most backend latency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If it doesn't need to happen before the response, move it out of the request cycle.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cache deliberately: short TTLs, event-based invalidation, clear cache keys.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Verify every change. The loop is observe → fix → measure → ship, not fix → hope.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Accept trade-offs. Not everything needs to be fast, and some optimisations create new problems.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Performance work is incremental. As traffic grows and features ship, new bottlenecks surface. The system is never "done." The point is having a process that catches regressions early and fixes them before users feel it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://ankitjang.one/blog/how-to-optimise-performance" rel="noopener noreferrer"&gt;ankitjang.one&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>backend</category>
      <category>performance</category>
      <category>database</category>
      <category>caching</category>
    </item>
    <item>
      <title>Profiling Django APIs with Debug Toolbar and snakeviz</title>
      <dc:creator>Ankit Jangwan</dc:creator>
      <pubDate>Thu, 02 Apr 2026 12:57:28 +0000</pubDate>
      <link>https://forem.com/jangwanankit/profiling-django-apis-with-debug-toolbar-and-snakeviz-1510</link>
      <guid>https://forem.com/jangwanankit/profiling-django-apis-with-debug-toolbar-and-snakeviz-1510</guid>
      <description>&lt;p&gt;You don't need paid monitoring tools to find what's slow in your Django application. Two free, open-source tools cover most of it: &lt;strong&gt;Django Debug Toolbar&lt;/strong&gt; for per-request profiling and &lt;strong&gt;snakeviz&lt;/strong&gt; for visualizing Python's built-in &lt;code&gt;cProfile&lt;/code&gt; data.&lt;/p&gt;

&lt;p&gt;This post walks through how I use both tools to find and fix performance problems, based on patterns from my own &lt;a href="https://ankitjang.one/projects" rel="noopener noreferrer"&gt;projects&lt;/a&gt;. The examples are grounded in a Django API that handles 10,000+ scheduled messages per day with Celery workers and external API calls.&lt;/p&gt;

&lt;p&gt;If you want the broader performance optimization workflow — including production monitoring, caching, and async offloading — I covered that in &lt;a href="https://ankitjang.one/blog/how-to-optimise-performance" rel="noopener noreferrer"&gt;How to Optimise Backend Performance&lt;/a&gt;. This post goes deeper on the local profiling tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting Up Django Debug Toolbar
&lt;/h2&gt;

&lt;p&gt;Installation takes about two minutes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;django-debug-toolbar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# settings.py (development only)
&lt;/span&gt;&lt;span class="n"&gt;INSTALLED_APPS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;debug_toolbar&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;MIDDLEWARE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;debug_toolbar.middleware.DebugToolbarMiddleware&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;DEBUG_TOOLBAR_PANELS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;debug_toolbar.panels.sql.SQLPanel&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;debug_toolbar.panels.profiling.ProfilingPanel&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;debug_toolbar.panels.timer.TimerPanel&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;debug_toolbar.panels.cache.CachePanel&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;INTERNAL_IPS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the URL configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# urls.py
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DEBUG&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;debug_toolbar&lt;/span&gt;
    &lt;span class="n"&gt;urlpatterns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nf"&gt;path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__debug__/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;include&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;debug_toolbar&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;urlpatterns&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Python 3.12+, the profiling panel needs the dev server running single-threaded:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python manage.py runserver &lt;span class="nt"&gt;--nothreading&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The SQL Panel: Your N+1 Detector
&lt;/h2&gt;

&lt;p&gt;The SQL panel is where I spend most of my time in Debug Toolbar. It shows every database query fired during a request, with timing, SQL text, and stack traces.&lt;/p&gt;

&lt;h3&gt;
  
  
  What to look for
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query count.&lt;/strong&gt; A list endpoint returning 50 items should not fire 150 queries. If it does, you're missing &lt;code&gt;select_related&lt;/code&gt; or &lt;code&gt;prefetch_related&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Duplicated" and "Similar" badges.&lt;/strong&gt; Debug Toolbar groups identical query patterns and flags them. If you see a red "Duplicated" badge next to &lt;code&gt;SELECT * FROM customers WHERE id = ?&lt;/code&gt; repeated 200 times, that's a textbook N+1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The stack trace.&lt;/strong&gt; Click any query to see which line of Python triggered it. This tells you whether the query came from the view, a serializer, a model method, or a template. Knowing &lt;em&gt;where&lt;/em&gt; matters as much as knowing &lt;em&gt;what&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query time distribution.&lt;/strong&gt; If one query takes 200 ms and the rest take 1 ms each, that query is your target. Often it's a missing index — the query is doing a sequential scan instead of using an index.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding an N+1 in practice
&lt;/h3&gt;

&lt;p&gt;On a project similar to my &lt;a href="https://ankitjang.one/projects/message-scheduler" rel="noopener noreferrer"&gt;Message Scheduler&lt;/a&gt;, I hit this endpoint locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /api/messages/?status=pending
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Debug Toolbar showed: &lt;strong&gt;187 queries in 420 ms&lt;/strong&gt;. Several queries had "Duplicated" badges — the same &lt;code&gt;SELECT * FROM users WHERE id = ?&lt;/code&gt; pattern repeated for every message in the list.&lt;/p&gt;

&lt;p&gt;The view was loading messages and then accessing &lt;code&gt;message.user.email&lt;/code&gt; in the serializer. Each access triggered a separate query.&lt;/p&gt;

&lt;p&gt;The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: 187 queries
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After: 2 queries
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;select_related&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the change, Debug Toolbar showed &lt;strong&gt;2 queries in 12 ms&lt;/strong&gt;. One query for messages with a JOIN to users, one for the count.&lt;/p&gt;
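
&lt;p&gt;The same shape is easy to reproduce outside Django. Here's a minimal &lt;code&gt;sqlite3&lt;/code&gt; sketch (hypothetical tables, not the project's real schema) contrasting the per-row lookup with the single JOIN that &lt;code&gt;select_related&lt;/code&gt; generates:&lt;br&gt;
&lt;/p&gt;

```python
# Illustrating the N+1 pattern and its fix at the SQL level.
# Tables and data are hypothetical; the point is the query count.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE messages (id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT);
    INSERT INTO users VALUES (1, 'a@example.com'), (2, 'b@example.com');
    INSERT INTO messages VALUES (1, 1, 'pending'), (2, 2, 'pending'), (3, 1, 'pending');
""")

# N+1: one query for the messages, then one more per message for its user.
queries = 1
messages = conn.execute(
    "SELECT id, user_id FROM messages WHERE status = 'pending'"
).fetchall()
emails = []
for _msg_id, user_id in messages:
    row = conn.execute('SELECT email FROM users WHERE id = ?', (user_id,))
    emails.append(row.fetchone()[0])
    queries += 1
print(queries)  # 4 queries for 3 messages; grows linearly with the list

# The select_related equivalent: a single JOIN, constant query count.
joined = conn.execute("""
    SELECT m.id, u.email FROM messages m
    JOIN users u ON m.user_id = u.id
    WHERE m.status = 'pending'
""").fetchall()
print(len(joined))  # same 3 rows, 1 query
```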




&lt;h2&gt;
  
  
  The Profiling Panel: Where Time Actually Goes
&lt;/h2&gt;

&lt;p&gt;The SQL panel tells you about database time. The profiling panel tells you about everything else — serialization, template rendering, Python computation, middleware.&lt;/p&gt;

&lt;p&gt;The panel isn't on by default: add &lt;code&gt;debug_toolbar.panels.profiling.ProfilingPanel&lt;/code&gt; to your panels setting, then tick its checkbox in the toolbar. It shows a collapsible call tree for the request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /api/messages/ — 1842 ms
├── MessageListView.get() — 1842 ms (cumtime)
│   ├── MessageQuerySet.all() — 12 ms
│   ├── MessageSerializer.to_representation() — 1650 ms
│   │   ├── UserField.to_representation() × 200 — 1580 ms
│   │   │   └── SQL: SELECT * FROM users WHERE id = %s × 200
│   │   └── ChannelSerializer.to_representation() × 200 — 60 ms
│   └── Paginator.paginate() — 180 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Reading the call tree
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Call count is the key signal.&lt;/strong&gt; A function called 200 times inside a loop is almost always an N+1 or a missing batch operation. In the example above, &lt;code&gt;UserField.to_representation()&lt;/code&gt; runs 200 times — once per message in the list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nesting shows the hierarchy.&lt;/strong&gt; A slow parent with fast own-time means the parent is slow because of its children. &lt;code&gt;MessageSerializer.to_representation()&lt;/code&gt; takes 1650 ms, but it's not doing anything slow itself — its child &lt;code&gt;UserField.to_representation()&lt;/code&gt; is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start from the deepest nodes with the highest cumulative time and work upward.&lt;/strong&gt; The actual bottleneck is usually at the bottom of the tree.&lt;/p&gt;

&lt;p&gt;You can adjust the profiling depth with &lt;code&gt;DEBUG_TOOLBAR_CONFIG&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;DEBUG_TOOLBAR_CONFIG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PROFILER_MAX_DEPTH&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# default: 10
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PROFILER_THRESHOLD_RATIO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# default: 8
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  snakeviz: Visual Profiling with cProfile
&lt;/h2&gt;

&lt;p&gt;Django Debug Toolbar works great for web requests. But when you need to profile a management command, a Celery task, or a function in isolation, &lt;code&gt;cProfile&lt;/code&gt; + &lt;code&gt;snakeviz&lt;/code&gt; is the tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  Capturing a profile
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cProfile&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.test&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RequestFactory&lt;/span&gt;

&lt;span class="n"&gt;factory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RequestFactory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/api/messages/?status=pending&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;profiler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cProfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Profile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;profiler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enable&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;message_list_view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;profiler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;disable&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;profiler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump_stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message_list.prof&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or from the command line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; cProfile &lt;span class="nt"&gt;-o&lt;/span&gt; output.prof manage.py some_management_command
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
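
&lt;p&gt;If you just need a quick text summary without leaving the terminal, the stdlib &lt;code&gt;pstats&lt;/code&gt; module reads the same &lt;code&gt;.prof&lt;/code&gt; file. A minimal sketch; &lt;code&gt;busy()&lt;/code&gt; here is a stand-in for whatever you actually profiled:&lt;br&gt;
&lt;/p&gt;

```python
# Text-mode inspection of a .prof file with the stdlib pstats module,
# no snakeviz required. busy() is a stand-in for the code you profiled.
import cProfile
import pstats

def busy():
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
busy()
profiler.disable()
profiler.dump_stats('example.prof')

stats = pstats.Stats('example.prof')
stats.sort_stats('cumulative')  # sort by cumtime, the column to read first
stats.print_stats(10)           # top 10 rows
```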



&lt;h3&gt;
  
  
  Viewing with snakeviz
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;snakeviz
snakeviz message_list.prof
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;snakeviz opens a browser with an interactive visualization — either a sunburst chart or an icicle chart. Each block is a function. Wider blocks took more time. Blocks nested inside others were called by the parent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reading snakeviz output
&lt;/h3&gt;

&lt;p&gt;The text table below the chart shows the same data as &lt;code&gt;cProfile&lt;/code&gt;'s text output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      1    0.000    0.000    1.842    1.842 views.py:45(message_list)
    200    0.003    0.000    1.650    0.008 serializers.py:12(get_user)
    200    1.580    0.008    1.580    0.008 base.py:330(execute)
      1    0.001    0.001    0.180    0.180 pagination.py:22(paginate)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ncalls&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How many times this function was called&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tottime&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Time inside this function, excluding sub-calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cumtime&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Total time including sub-calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;percall&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Time per call: the first &lt;code&gt;percall&lt;/code&gt; column is &lt;code&gt;tottime / ncalls&lt;/code&gt;, the second is &lt;code&gt;cumtime / ncalls&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;How to read this:&lt;/strong&gt; Start from the top (sorted by &lt;code&gt;cumtime&lt;/code&gt;). &lt;code&gt;message_list&lt;/code&gt; takes 1.84 seconds total. &lt;code&gt;get_user&lt;/code&gt; is called 200 times and accounts for 1.65 seconds — 89% of the view's time. The actual time is in &lt;code&gt;base.py:execute&lt;/code&gt;, which is Django's database query executor. Classic N+1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What patterns to look for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High &lt;code&gt;ncalls&lt;/code&gt; on database functions → N+1 queries&lt;/li&gt;
&lt;li&gt;High &lt;code&gt;tottime&lt;/code&gt; on a single function → CPU-bound bottleneck (serialization, computation)&lt;/li&gt;
&lt;li&gt;High &lt;code&gt;cumtime&lt;/code&gt; with low &lt;code&gt;tottime&lt;/code&gt; → the function itself is fast but calls something slow&lt;/li&gt;
&lt;/ul&gt;
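
&lt;p&gt;The &lt;code&gt;cumtime&lt;/code&gt;-versus-&lt;code&gt;tottime&lt;/code&gt; distinction is easy to see with a tiny synthetic profile: a parent that does no work of its own inherits its child's cost as &lt;code&gt;cumtime&lt;/code&gt; while its own &lt;code&gt;tottime&lt;/code&gt; stays near zero. A sketch:&lt;br&gt;
&lt;/p&gt;

```python
# Demonstrating cumtime vs tottime: parent() is cheap itself (low tottime)
# but pays for its child (high cumtime).
import cProfile
import io
import pstats

def child():
    return sum(i * i for i in range(200_000))

def parent():
    return child()  # no real work of its own

profiler = cProfile.Profile()
profiler.enable()
parent()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats()
print(stream.getvalue())  # parent: high cumtime, near-zero tottime; child carries the tottime
```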

&lt;h3&gt;
  
  
  Profiling Celery tasks
&lt;/h3&gt;

&lt;p&gt;For my Message Scheduler's delivery tasks, I profile individual task functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cProfile&lt;/span&gt;

&lt;span class="nd"&gt;@shared_task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Normal task code...
&lt;/span&gt;    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="c1"&gt;# Profile it
&lt;/span&gt;&lt;span class="n"&gt;cProfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;send_message(42)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;send_task.prof&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then &lt;code&gt;snakeviz send_task.prof&lt;/code&gt; shows exactly where delivery time goes — API calls to SES, Telegram latency, database reads for message content. This is how I discovered that loading the full message object (including a large &lt;code&gt;metadata&lt;/code&gt; JSON field) was adding unnecessary overhead. Switching to &lt;code&gt;.only('id', 'channel', 'recipient', 'body')&lt;/code&gt; cut the database portion by 60%.&lt;/p&gt;
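
&lt;p&gt;If you profile tasks often, the boilerplate is worth wrapping up. A small stdlib-only decorator sketch (names here are illustrative, not from the scheduler's codebase); with Celery you'd apply it beneath &lt;code&gt;@shared_task&lt;/code&gt; so it wraps the task body:&lt;br&gt;
&lt;/p&gt;

```python
# A reusable profiling decorator: dumps one .prof file per call.
# The file name and wrapped function are illustrative.
import cProfile
import functools

def profiled(path):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            profiler = cProfile.Profile()
            profiler.enable()
            try:
                return func(*args, **kwargs)
            finally:
                profiler.disable()
                profiler.dump_stats(path)
        return wrapper
    return decorator

@profiled('send_message.prof')
def send_message(message_id):
    # stand-in for the real task body
    return sum(range(message_id * 1000))

send_message(42)  # runs normally and leaves send_message.prof behind
```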




&lt;h2&gt;
  
  
  Deep Dives with EXPLAIN ANALYZE
&lt;/h2&gt;

&lt;p&gt;When Debug Toolbar or snakeviz points to a slow query, &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; tells you &lt;em&gt;why&lt;/em&gt; it's slow at the database level.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;ANALYZE&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-03-01'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send_at&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What the output tells you
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Indicator&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Seq Scan&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full table scan — no index used&lt;/td&gt;
&lt;td&gt;Add index on filtered columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Nested Loop&lt;/code&gt; + high rows&lt;/td&gt;
&lt;td&gt;Looping join on large sets&lt;/td&gt;
&lt;td&gt;Check join indexes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Sort&lt;/code&gt; with high cost&lt;/td&gt;
&lt;td&gt;Sorting without index support&lt;/td&gt;
&lt;td&gt;Add index matching ORDER BY&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Rows Removed by Filter&lt;/code&gt; (high)&lt;/td&gt;
&lt;td&gt;Index not selective enough&lt;/td&gt;
&lt;td&gt;Use composite or partial index&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Fixing a missing index
&lt;/h3&gt;

&lt;p&gt;Debug Toolbar showed a query on the messages table taking 180 ms. &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; confirmed a sequential scan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;Seq&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;28453&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1250&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;028&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;178&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;403&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1247&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;send_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-03-01'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="k"&gt;Rows&lt;/span&gt; &lt;span class="n"&gt;Removed&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;498753&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scanning 500,000 rows to return 1,247. A partial index fixed it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_messages_pending&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;send_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;idx_messages_pending&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1250&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;015&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;203&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1247&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From 178 ms to 1.2 ms. The partial index is small because it only covers pending messages, so it stays fast even as the table grows.&lt;/p&gt;

&lt;p&gt;In Django migrations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Migration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;migrations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Migration&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;operations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;migrations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AddIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;send_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Q&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;idx_messages_pending&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Full Workflow
&lt;/h2&gt;

&lt;p&gt;Here's the process I follow for every slow endpoint:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Hit the endpoint with Debug Toolbar enabled.&lt;/strong&gt;&lt;br&gt;
Check the SQL panel first. High query count with duplicate badges = N+1. Fix with &lt;code&gt;select_related&lt;/code&gt; or &lt;code&gt;prefetch_related&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Check the profiling panel.&lt;/strong&gt;&lt;br&gt;
If query count is fine but the request is still slow, the profiling panel shows where time goes in Python code — serialization, computation, template rendering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Profile in isolation with cProfile + snakeviz.&lt;/strong&gt;&lt;br&gt;
For deeper analysis or non-web-request profiling (management commands, Celery tasks), capture a &lt;code&gt;.prof&lt;/code&gt; file and visualize it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Run EXPLAIN ANALYZE on slow queries.&lt;/strong&gt;&lt;br&gt;
When a specific query is the bottleneck, check the execution plan. Look for &lt;code&gt;Seq Scan&lt;/code&gt; and add targeted indexes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Verify the fix.&lt;/strong&gt;&lt;br&gt;
Hit the endpoint again with Debug Toolbar. Confirm query count dropped, execution time improved. Run &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; again to confirm the index is being used.&lt;/p&gt;

&lt;p&gt;This loop — observe, identify, fix, verify — is the same one I follow across all my projects. I wrote about it in broader context (including production monitoring, caching strategies, and async offloading) in &lt;a href="https://ankitjang.one/blog/how-to-optimise-performance" rel="noopener noreferrer"&gt;How to Optimise Backend Performance&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Beyond Local Profiling
&lt;/h2&gt;

&lt;p&gt;Debug Toolbar and snakeviz are local development tools. They catch problems before code ships. But some issues only appear under production load — connection pool exhaustion, cache stampedes, replication lag.&lt;/p&gt;

&lt;p&gt;For my &lt;a href="https://ankitjang.one/case-studies/message-scheduler" rel="noopener noreferrer"&gt;Message Scheduler&lt;/a&gt;, I use Celery Flower for worker monitoring and structured logging with &lt;code&gt;structlog&lt;/code&gt; for production request tracing. On my &lt;a href="https://ankitjang.one/projects/portfolio" rel="noopener noreferrer"&gt;portfolio's AI chatbot&lt;/a&gt;, the Cloudflare Worker proxy handles error states and I track response latency through server logs.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://ankitjang.one/projects/healthlab" rel="noopener noreferrer"&gt;HealthLab&lt;/a&gt; platform uses health check endpoints that verify database connectivity — simple but catches the most common production failure.&lt;/p&gt;

&lt;p&gt;The tools change between local and production, but the principle stays: find where time goes, fix the biggest bottleneck, verify the improvement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tools Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;When to Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Django Debug Toolbar (SQL panel)&lt;/td&gt;
&lt;td&gt;Shows all queries per request with timing and stack traces&lt;/td&gt;
&lt;td&gt;First check on any slow endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Django Debug Toolbar (Profiling panel)&lt;/td&gt;
&lt;td&gt;Call tree with cumulative time per function&lt;/td&gt;
&lt;td&gt;When query count is fine but request is slow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cProfile + snakeviz&lt;/td&gt;
&lt;td&gt;Python profiler with visual flame graph&lt;/td&gt;
&lt;td&gt;Management commands, Celery tasks, isolated functions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;PostgreSQL execution plan with actual timings&lt;/td&gt;
&lt;td&gt;When a specific query is the bottleneck&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QueryCountMiddleware&lt;/td&gt;
&lt;td&gt;Logs query count per request in staging&lt;/td&gt;
&lt;td&gt;Catching N+1 regressions before they hit production&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
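
&lt;p&gt;The table mentions &lt;code&gt;QueryCountMiddleware&lt;/code&gt; without showing it. The core idea is framework-agnostic: count queries around a unit of work and fail loudly over a budget. A hypothetical stdlib-only sketch using &lt;code&gt;sqlite3&lt;/code&gt; (a Django version would compare &lt;code&gt;len(connection.queries)&lt;/code&gt; before and after the view instead):&lt;br&gt;
&lt;/p&gt;

```python
# Framework-agnostic sketch of the QueryCountMiddleware idea: count queries
# around a unit of work and raise if a budget is exceeded.
import sqlite3
from contextlib import contextmanager

class CountingConnection:
    """Wraps a sqlite3 connection and counts execute() calls."""
    def __init__(self, conn):
        self._conn = conn
        self.query_count = 0

    def execute(self, sql, params=()):
        self.query_count += 1
        return self._conn.execute(sql, params)

@contextmanager
def query_budget(conn, limit):
    """Raise if the wrapped block exceeds its query budget."""
    start = conn.query_count
    yield
    used = conn.query_count - start
    if used > limit:
        raise AssertionError(f'{used} queries, budget was {limit}')

conn = CountingConnection(sqlite3.connect(':memory:'))
conn.execute('CREATE TABLE t (id INTEGER)')
with query_budget(conn, limit=5):
    for i in range(3):
        conn.execute('INSERT INTO t VALUES (?)', (i,))
print(conn.query_count)  # 4: the CREATE plus three INSERTs
```

&lt;p&gt;In staging you'd log the count per request rather than raise, which is enough to catch an N+1 regression before it reaches production.&lt;/p&gt;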




&lt;p&gt;All my projects — including architecture diagrams, tradeoff analysis, and failure mode documentation — are at &lt;a href="https://ankitjang.one/projects" rel="noopener noreferrer"&gt;ankitjang.one/projects&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;About me&lt;/strong&gt;: I'm &lt;a href="https://ankitjang.one" rel="noopener noreferrer"&gt;Ankit Jangwan&lt;/a&gt;, a Senior Software Engineer building backend systems with Django, PostgreSQL, Celery, and Go. See my case studies at &lt;a href="https://ankitjang.one/case-studies" rel="noopener noreferrer"&gt;ankitjang.one/case-studies&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>django</category>
      <category>performance</category>
      <category>database</category>
      <category>profiling</category>
    </item>
  </channel>
</rss>
