Forem: ohyeah

Your Next.js health check is lying to you (and how to fix it)

ohyeah — Fri, 01 May 2026 08:55:55 +0000

I've been monitoring my own SaaS in production for the last two months, and I've watched the same bug pattern hit indie projects over and over:

The app is on fire. Customers are seeing 500s. Stripe webhooks are silently failing. And yet GET /api/health is cheerfully returning 200 OK, every minute, like nothing's wrong.

The reason is almost always the same: the health check is testing the wrong thing.

This post is about what a health check should actually do, the three failure modes that catch people, and a working Next.js 13+ implementation you can paste in.

The "useless 200" pattern

Here's the health check I see most often in indie Next.js codebases:

// src/app/api/health/route.ts
export async function GET() {
  return Response.json({ ok: true });
}

This endpoint can only fail in one way: the Next.js process itself is dead. If that happens, your hosting platform was already going to know — Vercel/Render/Fly notice the process crashed before your monitor does.

What this endpoint cannot tell you:

Did someone rename DATABASE_URL to DATABASE_POOL_URL in env vars and forget to update the code?
Did your Supabase service-role token expire?
Did the connection pool max out and start refusing connections?
Did a middleware change start returning 308 redirects to /login for everything?
Is your background queue stuck?
Is the Stripe webhook handler returning 200 but silently swallowing events?

Every one of those bugs has hit a real production app I know of in the last 90 days. None of them were caught by a return { ok: true } health check. All of them were eventually caught by customer complaints — the worst possible monitor.

The three layers of "healthy"

Before showing the fix, the mental model that makes this easier:

Layer 1: shallow. "Is the function reachable?" This is the useless 200. It tells you the runtime is up, nothing more.

Layer 2: middle. "Are my critical dependencies reachable from this function right now?" Database. Auth provider. Cache. The cheapest possible roundtrip that actually exercises auth and the connection pool.

Layer 3: deep. "Is the entire system functioning?" Background workers running. Cron jobs not stuck. Queue not backed up. This is expensive and runs less often.

Most indie projects only need Layer 2. Layer 1 is what you have today and it doesn't help. Layer 3 is what big shops do; you don't need it yet.

The rest of this post is about doing Layer 2 correctly in Next.js.

The fix: a real Next.js 13+ health endpoint

Here's the Route Handler I run on my own SaaS. It uses Supabase but the pattern is the same for any DB:

// src/app/api/health/route.ts
import { NextResponse } from "next/server";
import { createSupabaseServiceRole } from "@/lib/supabase/server";

export const runtime = "nodejs";
export const dynamic = "force-dynamic";

export async function GET() {
  try {
    const supabase = createSupabaseServiceRole();

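    // Cheapest possible call that exercises the connection pool + auth.
    // head: true returns no rows, so the response carries no payload.
    // count: "exact" makes Postgres count the table; on a large table,
    // "planned" or "estimated" keeps the check cheap.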
    const { error } = await supabase
      .from("profiles")
      .select("id", { count: "exact", head: true })
      .limit(1);

    if (error) throw error;

    return NextResponse.json(
      { ok: true, ts: Date.now() },
      { headers: { "Cache-Control": "no-store" } }
    );
  } catch (e) {
    return NextResponse.json(
      { ok: false, error: (e as Error).message },
      { status: 503, headers: { "Cache-Control": "no-store" } }
    );
  }
}

There are five non-obvious decisions in those 25 lines. Each one is a bug I've personally watched bite somebody.

1. `runtime = "nodejs"`, not edge

Health checks should hit the same runtime your real traffic hits. If your app runs on the Node.js runtime (most indie SaaS), your health check should too. Otherwise you're testing a runtime your customers never use.

2. `dynamic = "force-dynamic"`

Without this, Next.js or your CDN can serve a cached 200 even after your DB is down. The cache happily reports "healthy" while every customer request is failing. I've seen this exact bug in production. Hard to debug because the health check looks fine.

3. `Cache-Control: no-store` on every response

Same reason. Belt and suspenders. CDNs respect no-store even if Next.js gets it wrong.

4. A real DB roundtrip — not `SELECT 1`

SELECT 1 works for raw Postgres, but it's a half-measure. You want a query that exercises:

Connection pool acquisition (catches "pool exhausted")
Auth (catches "service role token expired")
A real table the app uses (catches "ran migrations on the wrong DB")

The head: true count query above does all three. It costs one short roundtrip (milliseconds on a small table) and transfers no rows. Use the cheapest possible real query, not the cheapest possible fake query.
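One thing a real roundtrip introduces is the risk that the health check itself hangs when the dependency hangs. Here is a minimal sketch of a timeout-bounded check, assuming a hypothetical runCheck helper and CheckResult shape (neither is from the route above) and an arbitrary 2-second budget:

```typescript
// Result of probing one dependency. The shape is illustrative only.
type CheckResult = { name: string; ok: boolean; ms: number; error?: string };

// Run one probe, but give up after `timeoutMs` so a hung dependency
// produces a failed check instead of a hung health endpoint.
async function runCheck(
  name: string,
  probe: () => Promise<void>,
  timeoutMs = 2000, // assumed budget; keep it below your monitor's own timeout
): Promise<CheckResult> {
  const started = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${name} timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  try {
    // Whichever settles first wins: the probe or the timer.
    await Promise.race([probe(), timeout]);
    return { name, ok: true, ms: Date.now() - started };
  } catch (e) {
    const error = e instanceof Error ? e.message : String(e);
    return { name, ok: false, ms: Date.now() - started, error };
  } finally {
    clearTimeout(timer);
  }
}
```

In the route above, the probe would be a function that runs the Supabase query and throws when error is set. Note that the race only stops waiting; it does not cancel the underlying query.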

5. `503` on failure — not `200` with `{ ok: false }`

This is the one people get most wrong. Most upstream monitors (Kubernetes liveness and readiness probes, GCP/AWS health checks, external uptime tools) trigger on HTTP status, not body content. If you return 200 { ok: false }, your monitor sees a successful response and your platform never takes the bad pod out of rotation.

503 Service Unavailable is the right status for "I'm running but I can't serve traffic." Use it.
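If you check more than one dependency, the same rule applies to the combined result. A small sketch, assuming a hypothetical healthResponse helper and DependencyStatus shape, that maps check results to a status code using only the standard Response class:

```typescript
// One entry per dependency you probed. Illustrative shape, not from the route above.
type DependencyStatus = { name: string; ok: boolean; error?: string };

// Any failed dependency turns the whole response into a 503.
// An empty list counts as healthy, so always pass at least one real check.
function healthResponse(checks: DependencyStatus[]): Response {
  const ok = checks.every((c) => c.ok);
  return new Response(JSON.stringify({ ok, checks, ts: Date.now() }), {
    status: ok ? 200 : 503,
    headers: {
      "Content-Type": "application/json",
      "Cache-Control": "no-store", // same caching rule as the route above
    },
  });
}
```

The body still carries the detail for humans reading it, but the status code alone is enough for a probe to make the right call.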

"But I want my health check at `/healthz`"

The k8s convention is /healthz. Easy with a rewrite in next.config.js (rewrites are a config-level feature, not specific to the App Router):

module.exports = {
  rewrites: async () => [
    { source: "/healthz", destination: "/api/health" },
  ],
};

Now both URLs work and your liveness probe stays idiomatic.

Monitoring it from outside

Here's the part that costs people the most, because they think a health check is enough by itself.

It isn't. Your platform's liveness probe (Vercel internal, Kubernetes, etc.) checks the pod. It does not check:

DNS — did your registrar accidentally let the domain expire?
TLS — did your certificate auto-renewal silently fail?
CDN edge — is Cloudflare serving stale 502s while origin is fine?
The path between user and pod — region outage, BGP drama
Third-party degradation — your code is fine, but Stripe/OpenAI is throwing 500s

The only way to catch these is to hit your public URL from outside your infra, on a schedule, from multiple regions.
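A minimal sketch of such an outside check, meant to run on a scheduler that lives outside your own infra. TARGET_URL and SLACK_WEBHOOK_URL are placeholders, checkAndAlert is a made-up name, and the 10-second timeout is an assumption. The fetch function is injectable so the logic can be tested without a network:

```typescript
type FetchLike = (url: string, init?: RequestInit) => Promise<Response>;

const TARGET_URL = "https://example.com/api/health"; // placeholder: your public URL
const SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"; // placeholder

// Probe the public URL; if it is unreachable or non-2xx, post to Slack.
async function checkAndAlert(fetchImpl: FetchLike = fetch): Promise<"up" | "down"> {
  let reason = "";
  try {
    const res = await fetchImpl(TARGET_URL, {
      signal: AbortSignal.timeout(10_000), // assumed 10s budget (Node 18+)
    });
    await res.arrayBuffer().catch(() => undefined); // drain the body
    if (res.ok) return "up"; // judge on status code only, never on the body
    reason = `HTTP ${res.status}`;
  } catch (e) {
    // DNS, TLS, connection and timeout failures all land here.
    reason = e instanceof Error ? e.message : String(e);
  }
  // Slack incoming webhooks accept a JSON body with a `text` field.
  await fetchImpl(SLACK_WEBHOOK_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: `DOWN: ${TARGET_URL} (${reason})` }),
  });
  return "down";
}
```

As written this alerts on every failed run. A real setup would remember the previous state and only alert when it changes.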

You can roll your own with cron-job.org + a Slack webhook in 30 minutes. Or use any external uptime monitor. I built SitePulse for this exact reason (the war story below was what kicked it off), but the stack doesn't matter. The point is: don't rely on your own infra to tell you your own infra is broken.

The war story that made me write this

Last year I shipped a deploy that renamed an env var. I'd updated .env.example. I'd updated the code. I had not updated the production env var. The deploy went green. The 200-only /api/health kept returning 200. CI passed because tests use a different config.

For 41 minutes, every customer request to the affected endpoint returned 500. I noticed because someone tweeted at me.

If the health check had done a real DB roundtrip with the production config, it would have failed at deploy time, and a platform that gates promotion on health checks would have refused to promote the build. Instead it merrily reported "healthy" while 100% of real traffic broke.

That bug cost me a customer. Worse, it cost me trust — they'd been one of my early users.

The two-line fix (real DB query, 503 on failure) would have caught it inside the first request after deploy.

TL;DR

A health check that returns 200 without checking anything tells you the function is reachable. That's it. It's the cheapest possible information and it's almost never the information you need.

A useful health check:

Hits the same runtime your customers hit (runtime = "nodejs" if that's what you use)
Refuses to be cached (dynamic = "force-dynamic" + Cache-Control: no-store)
Does a real, cheap roundtrip to your most fragile dependency
Returns 503 on failure, not 200 with a flag in the body
Is checked from outside your infra, not just by your platform's internal probe

Do that and you'll catch the boring bugs that take down indie SaaS — environment drift, expired tokens, silent CDN issues — before your customers do.

If you found this useful, I wrote a shorter version on Stack Overflow covering the same patterns, and I publish more indie-SaaS-on-Vercel posts here on dev.to. The monitoring tool I built (SitePulse) is free for 5 monitors if you want the "external monitor" half of the story without writing it yourself.

I built my own UptimeRobot in a weekend with Next.js 16 + Vercel Cron

ohyeah — Thu, 30 Apr 2026 05:50:13 +0000

I've been paying UptimeRobot for years. It works. The free tier is generous. I have no real beef with them.

But every time I added a 6th monitor, the upgrade modal appeared. Every time I logged in to check a site, the dashboard nudged me toward Pro. Every time I wanted a public status page on my own domain, that was a paid feature too.

Eventually I asked the question every indie dev asks at some point: how hard could this actually be?

It turned out: a weekend to MVP, two weeks to ship to paying customers. Here's the architecture, the parts that surprised me, and the bugs that cost me an afternoon each.

The whole product, on one page

Probe a list of URLs every minute. HEAD or GET, optional body keyword check.
Detect "down" reliably — don't email you because of one flaky packet.
Email when status flips. Don't email every minute the site stays down.
Render a public status page at a custom slug.
Bill it. $9/month for 25 monitors, free up to 5.

That's the spec. Anything else I considered building, I asked: "would my own indie projects need this?" The answer for incident management, on-call rotations, request tracing, RUM, and Slack threading was: no. So they didn't get built.
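The second and third spec items (debounced "down" detection, alert only on flips) boil down to a tiny state machine. A sketch under assumptions: the threshold of 3 consecutive failures and the names applyProbe and MonitorState are made up for illustration, not taken from the product:

```typescript
type Status = "up" | "down";
type MonitorState = { status: Status; consecutiveFailures: number };

const FAILURES_BEFORE_DOWN = 3; // assumed threshold: one flaky packet is not an outage

// Fold one probe result into the monitor state.
// `alert` is non-null only when the status actually flipped.
function applyProbe(
  state: MonitorState,
  probeOk: boolean,
): { next: MonitorState; alert: Status | null } {
  const consecutiveFailures = probeOk ? 0 : state.consecutiveFailures + 1;
  const status: Status = probeOk
    ? "up"
    : consecutiveFailures >= FAILURES_BEFORE_DOWN
      ? "down"
      : state.status; // not enough failures yet: keep the previous status
  const alert = status !== state.status ? status : null;
  return { next: { status, consecutiveFailures }, alert };
}
```

Starting from "up", two failed probes change nothing, the third flips to "down" and yields one alert, further failures stay silent, and the first success flips back with a recovery alert.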

The stack

Next.js 16 (App Router) on Vercel
Supabase for Postgres + Auth (Tokyo region — more on this below)
Vercel Cron runs a single endpoint every minute
Resend for alert emails
Stripe Checkout + webhooks for billing

That's it. No queue, no Redis, no separate worker fleet. The whole backend is one cron endpoint and a handful of Server Actions.

The 1-minute heartbeat

Vercel Cron sends a GET to /api/cron/check every minute. A single endpoint handles every monitor on the platform — no per-monitor crons, no fan-out queue.

The flow:

cron tick
  → claim_due_monitors (Postgres function, atomic SELECT FOR UPDATE)
    → process up to 200 monitors in parallel batches of 25
      → fetch each URL with AbortController timeout
        → upsert check result + flip status if needed
          → enqueue alert if status transitioned

The Postgres function is the load-bearing piece. It locks rows that are due for a check, bumps their next_check_at, and returns them in one round-trip. Two cron workers will never claim the same monitor in the same tick, because Postgres handles the contention for me.

-- simplified
create function claim_due_monitors(p_limit int)
returns setof monitors
language plpgsql
as $$
begin
  return query
    update monitors
    set next_check_at = now() + (interval_seconds * interval '1 second')
    where id in (
      select id from monitors
      where next_check_at <= now() and active = true
      order by next_check_at
      for update skip locked
      limit p_limit
    )
    returning *;
end;
$$;

for update skip locked is the magic. It lets a second cron worker (which won't happen here, but you want it to be safe) skip rows that are already being processed instead of waiting for a lock.
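To make the claim semantics concrete, here is a single-threaded, in-memory model of what the function does to the rows. It is only a sketch: the Monitor shape and the camelCase names are assumptions, and it deliberately leaves out the row locking, which is the part Postgres contributes:

```typescript
type Monitor = {
  id: number;
  active: boolean;
  intervalSeconds: number;
  nextCheckAt: number; // epoch milliseconds
};

// Pick the most overdue active monitors, push their next check into the
// future, and return them. Mirrors the UPDATE ... RETURNING shape above.
function claimDueMonitors(monitors: Monitor[], limit: number, now: number): Monitor[] {
  const due = monitors
    .filter((m) => m.active && m.nextCheckAt <= now)
    .sort((a, b) => a.nextCheckAt - b.nextCheckAt)
    .slice(0, limit);
  for (const m of due) {
    m.nextCheckAt = now + m.intervalSeconds * 1000;
  }
  return due;
}
```

Because claiming a monitor moves its nextCheckAt forward in the same step, a second call at the same instant finds nothing left to claim.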

Probing 25 URLs concurrently in one function

Each tick can hit dozens of URLs. The cron route batches them with Promise.allSettled:

const CONCURRENCY = 25;
for (let i = 0; i < monitors.length; i += CONCURRENCY) {
  const slice = monitors.slice(i, i + CONCURRENCY);
  const results = await Promise.allSettled(
    slice.map((m) => processMonitor(admin, m))
  );
  // ...tally results, log errors
}
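The same loop can be pulled out into a reusable helper. A sketch, assuming a hypothetical runInBatches function that also does the tallying hinted at in the comment above:

```typescript
// Run `worker` over `items`, at most `batchSize` at a time.
// One failing item never prevents the rest of its batch from finishing.
async function runInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<R>,
): Promise<{ fulfilled: R[]; rejected: unknown[] }> {
  const fulfilled: R[] = [];
  const rejected: unknown[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const slice = items.slice(i, i + batchSize);
    // The async wrapper turns a synchronous throw into a rejection too.
    const settled = await Promise.allSettled(slice.map(async (item) => worker(item)));
    for (const s of settled) {
      if (s.status === "fulfilled") fulfilled.push(s.value);
      else rejected.push(s.reason);
    }
  }
  return { fulfilled, rejected };
}
```

One trade-off of fixed batches: each batch waits for its slowest member, so the per-probe timeout effectively bounds how long a batch can take.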

The probe itself is just fetch with three things you must get right:

const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), monitor.timeout_ms);

try {
  const res = await fetch(monitor.url, {
    method: monitor.keyword ? "GET" : monitor.method,
    redirect: "follow",
    signal: controller.signal,
    headers: {
      "user-agent": "SitePulseBot/1.0 (+https://sitepulse.satosushi.co)",
      "cache-control": "no-cache",
    },
    cache: "no-store", // never let Next cache a probe
  });

  // ALWAYS drain the body, even if you don't need it.
  // Otherwise the socket stays open and the next probe pays connect cost.
  if (!monitor.keyword) {
    try { await res.arrayBuffer(); } catch {}
  }

  // ...record result
} finally {
  clearTimeout(timeout);
}

Three subtle things in there:

Set cache: "no-store". Next.js can cache fetch responses in production (it was the default before Next.js 15). Being explicit means a probe never reads from a cache.
Drain the body. If you don't read the response body, the underlying connection stays occupied and can't be reused. Across hundreds of probes per minute, this matters.
Use AbortController for timeouts. fetch has no timeout option, and the runtime's own limits are far longer than a one-minute tick. Don't rely on them.
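Putting those pieces together with the optional keyword check from the spec, a self-contained probe could look roughly like this. The ProbeResult shape, the probe signature, and the injectable fetchImpl parameter are assumptions for the sake of a testable sketch; inside Next.js you would also pass cache: "no-store" as in the snippet above:

```typescript
type FetchLike = (url: string, init?: RequestInit) => Promise<Response>;
type ProbeResult = { up: boolean; status: number | null; latencyMs: number; error?: string };

async function probe(
  url: string,
  opts: { timeoutMs: number; keyword?: string },
  fetchImpl: FetchLike = fetch,
): Promise<ProbeResult> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), opts.timeoutMs);
  const started = Date.now();
  try {
    const res = await fetchImpl(url, { redirect: "follow", signal: controller.signal });
    // Reading the text both drains the body and feeds the keyword check.
    const body = await res.text();
    const keywordOk = opts.keyword === undefined || body.includes(opts.keyword);
    return { up: res.ok && keywordOk, status: res.status, latencyMs: Date.now() - started };
  } catch (e) {
    // Timeouts, DNS failures and refused connections all count as down.
    const error = e instanceof Error ? e.message : String(e);
    return { up: false, status: null, latencyMs: Date.now() - started, error };
  } finally {
    clearTimeout(timer);
  }
}
```

A 200 response that lacks the keyword is reported as down here, which is what you want when a broken deploy serves an error page with a success status.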

The 1-second latency I didn't notice for two days

I deployed the first version and a page load felt sluggish. Not broken — just sluggish. Maybe 800ms-1.2s for the dashboard to render.

Vercel Functions default to iad1 (Washington DC). My Supabase project is in Tokyo. Every Server Component that hit the database was making a US-east → Tokyo → US-east round trip per query. With 3-4 queries per page render, that's a second of pure network sitting between the user and the page.

// vercel.json
{
  "regions": ["hnd1"]
}

One line. Pinning functions to Tokyo (hnd1) drops Server Component render time to under 100ms. The lesson generalises: always colocate compute with your primary data store, especially for Server Components, where every render is a synchronous database conversation.
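The arithmetic behind that is simple enough to write down. The round-trip figures below are illustrative assumptions chosen to match the rough numbers in this section, not measurements:

```typescript
// Network time spent waiting on the database when queries run one after another.
function networkCostMs(sequentialQueries: number, roundTripMs: number): number {
  return sequentialQueries * roundTripMs;
}

const CROSS_REGION_RTT_MS = 250; // assumed: US-east function talking to a Tokyo database
const SAME_REGION_RTT_MS = 5; // assumed: function and database both in Tokyo

const before = networkCostMs(4, CROSS_REGION_RTT_MS); // 1000
const after = networkCostMs(4, SAME_REGION_RTT_MS); // 20
console.log(`cross-region: ${before}ms, same-region: ${after}ms`);
```

The cost scales with the number of sequential queries, which is why pages with several dependent queries feel it most.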

The Server Component cookie crash

Next.js 16's App Router gives Server Components access to cookies. Supabase's createServerClient wants a setAll callback so it can refresh tokens.

But Server Components are read-only: you can't set cookies during a render. If a token refresh happens during a Server Component pass, setAll throws, and the entire page returns a 500 with that lovely React digest:

Server Components render
 → Supabase tries to refresh expired token
 → setAll attempts to write cookies
 → ERR_HTTP_HEADERS_SENT-style error
 → 500 with digest 972974443

Fix is one try/catch:

{
  cookies: {
    getAll() {
      return cookieStore.getAll();
    },
    setAll(cookiesToSet) {
      try {
        cookiesToSet.forEach(({ name, value, options }) =>
          cookieStore.set(name, value, options)
        );
      } catch {
        // Server Component context — token will be refreshed on next request.
        // Safe to swallow.
      }
    },
  },
}

The Supabase docs hint at this but the existing examples I copied didn't have the try/catch. If your Supabase + Next.js 16 app sometimes 500s on logged-in users after a token expiry, this is probably why.
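To see the behaviour in isolation, here is a sketch with a fake cookie store standing in for the one Next.js provides. CookieStore, makeCookieMethods and the boolean return value are all assumptions added to make it testable; the real setAll returns nothing:

```typescript
type CookieToSet = { name: string; value: string; options?: Record<string, unknown> };

interface CookieStore {
  getAll(): { name: string; value: string }[];
  set(name: string, value: string, options?: Record<string, unknown>): void;
}

// Same try/catch as above, wrapped so it can run against any store.
function makeCookieMethods(store: CookieStore) {
  return {
    getAll: () => store.getAll(),
    setAll(cookiesToSet: CookieToSet[]): boolean {
      try {
        cookiesToSet.forEach(({ name, value, options }) => store.set(name, value, options));
        return true;
      } catch {
        // Read-only context: skip the write, the next request will refresh.
        return false;
      }
    },
  };
}

// Stand-in for a Server Component render: reads work, writes throw.
const readOnlyStore: CookieStore = {
  getAll: () => [{ name: "sb-token", value: "old" }],
  set() {
    throw new Error("Cookies can only be modified in a Server Action or Route Handler");
  },
};

console.log(makeCookieMethods(readOnlyStore).setAll([{ name: "sb-token", value: "new" }]));
```

The write is skipped instead of crashing the render, and the old cookie stays readable.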

The trailing whitespace bug that ate three hours

I copied my STRIPE_WEBHOOK_SECRET from the Stripe CLI output into Vercel's env var UI. Webhooks were rejected in production. Local was fine.

A webhook secret copied from a terminal can pick up a trailing newline. Vercel stores it verbatim, newline included. The signature your handler computes with that secret then never matches the Stripe-Signature header, verification fails, and you get a 400 in your logs with no obvious cause.

The fix is to never trust your clipboard:

printf '%s' "$STRIPE_WEBHOOK_SECRET" | tr -d '\n\r\t ' | vercel env add STRIPE_WEBHOOK_SECRET production

Same gotcha applies to any header-borne secret: API tokens, basic auth, JWT signing keys. If signature verification fails in prod but works locally, check whitespace before checking anything else.
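A second line of defence is to trim the secret where the app reads it. The sketch below uses a generic HMAC-SHA256 rather than Stripe's exact signing scheme, and sign, readSecret and the whsec_example value are made up for illustration. It assumes your secrets never legitimately start or end with whitespace:

```typescript
import { createHmac } from "node:crypto";

// Generic HMAC-SHA256 signature, hex encoded.
function sign(secret: string, payload: string): string {
  return createHmac("sha256", secret).update(payload).digest("hex");
}

// Read a secret defensively: fail loudly if missing, strip stray whitespace.
function readSecret(raw: string | undefined): string {
  if (!raw) throw new Error("secret is not set");
  return raw.trim();
}

const payload = '{"type":"checkout.session.completed"}';
const clean = "whsec_example"; // placeholder, not a real secret
const pasted = "whsec_example\n"; // what a terminal copy can leave behind

console.log(sign(pasted, payload) === sign(clean, payload)); // false
console.log(sign(readSecret(pasted), payload) === sign(clean, payload)); // true
```

A single invisible byte in the key changes the whole digest, which is why the failure looks total and unexplained.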

What I deliberately didn't build

The list of things people expect from a "real" uptime monitor that I left out:

Logs / RUM / transaction monitoring. That's what Sentry and Logflare are for.
Multi-region probing. I check from one region. If Cloudflare is down in São Paulo but your site is reachable from my probe region, you and I will both find out at the same time.
On-call schedules / rotations. Indie devs are a one-person rotation. If I'm asleep, the alert waits.
Slack / Discord / PagerDuty / OpsGenie. Email + SMS covers the people I'm building for.
5-second checks. 1-minute is enough for indie projects. Sub-minute is genuinely expensive to do reliably.

Cutting features wasn't a sacrifice — it was the product. The competitors I respect (UptimeRobot, BetterStack, Pingdom) all do most of what I left out, and that's exactly why their pricing pages have four columns and a "contact sales" button.

What I learned about flat pricing

The most-discussed part of this product hasn't been the technical stack — it's been the price.

$9/month for 25 monitors. No per-seat. No per-region. No per-channel. Free up to 5 monitors.

The reasoning: when I'm picking a tool for a side project, I don't have time to evaluate three pricing tiers and figure out which one I'd grow into. I want one number. Flat pricing forces the product to do less, which forces me to make better tradeoffs about what to build.

It also means the product can never become "enterprise." That's fine. There are already excellent enterprise uptime monitors. There aren't enough good ones for indie devs.

The link

The product is live at sitepulse.satosushi.co — 5 monitors free, $9/mo for 25, no card to start. If you've been on UptimeRobot and want to see how it stacks up, I wrote a side-by-side comparison too.

Not open source (that's the business), but happy to answer architecture questions in the comments. The cron-claim-and-fan-out pattern in particular has been more reliable than any queue I've shipped, and I think it generalises to a lot of "do this thing every N seconds for M users" problems where you'd otherwise reach for SQS.