Forem: Kriss

Your Sidekiq job ran. It processed nothing. Nobody knew.

Kriss — Tue, 12 May 2026 18:41:58 +0000

Your nightly billing sync ran at 2am. Sidekiq shows it completed. No exceptions, no retries, no dead queue entries. Your app looks healthy.

It processed zero invoices.

It's been doing this for eleven days.

This happens more than people admit. Sidekiq is excellent at handling failed jobs — its retry mechanism and dead queue are genuinely well designed. But "failed" in Sidekiq means "raised an exception." A job that connects to the database, queries 0 rows, and exits cleanly isn't a failed job. It's a successful job that did nothing. Sidekiq has no opinion on the difference.

This article covers how to close that gap.

Why Sidekiq's built-in monitoring isn't enough for scheduled jobs

Sidekiq ships with a web UI that shows queue depths, processed counts, failed jobs, and scheduled jobs. For a queue-based system, this is useful. But for scheduled jobs — the kind you run with sidekiq-cron or sidekiq-scheduler — you need something different.

The questions that matter for scheduled jobs are:

Did it run on schedule? (Not just "has it ever run?")
Did it actually process anything?
Is it taking longer than usual?

Sidekiq's web UI answers none of these. It shows you the last enqueued time and whether the job class exists in the schedule. That's not the same as knowing whether it ran at 2am last Tuesday, and whether it exported 1,400 rows like it should have.

The dead man's switch pattern

The fix is to invert the monitoring model. Instead of your monitoring system polling Sidekiq to check if jobs ran, you make your jobs proactively check in with an external service. If the external service stops receiving check-ins, it alerts you.

This is called a dead man's switch (or heartbeat monitoring). The idea: if the job dies or goes silent, the external service notices — because it's looking for a regular ping that never came.

Here's the three-signal implementation: start, success, fail.

# app/workers/daily_export_worker.rb
require 'net/http'
require 'json'

class DailyExportWorker
  include Sidekiq::Job

  TOKEN = ENV['DEADMANCHECK_TOKEN']
  BASE  = "https://deadmancheck.io/ping/#{TOKEN}"

  def perform
    dmc_start   # begins duration timer

    rows = run_export

    dmc_success(rows)   # signals completion + row count
  rescue
    dmc_fail
    raise   # re-raise so Sidekiq handles retries normally
  end

  private

  def dmc_start
    Net::HTTP.get(URI("#{BASE}/start"))
  rescue; end

  def dmc_success(count)
    uri = URI(BASE)
    req = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
    req.body = { count: count }.to_json
    Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |h| h.request(req) }
  rescue; end

  def dmc_fail
    Net::HTTP.get(URI("#{BASE}/fail"))
  rescue; end
end

A few things worth noting:

Each ping helper rescues all exceptions silently. A monitoring outage should never kill a production job — the monitoring is less important than the job.
The raise after dmc_fail is intentional. Let Sidekiq handle its own retry logic; don't swallow the error just because you've notified the external service.
Uses Ruby's stdlib Net::HTTP — no extra gem to add to your Gemfile.

Works the same with sidekiq-cron or sidekiq-scheduler

If you're using sidekiq-cron or sidekiq-scheduler to run workers on a cron schedule, the perform method is already the right integration point. Your schedule config stays the same:

# config/schedule.yml (sidekiq-scheduler)
daily_export:
  cron: "0 2 * * *"
  class: DailyExportWorker
  queue: default

Create one monitor per scheduled job and set its interval to your schedule length plus a buffer. For a daily job: 25 hours. For an hourly job: 70 minutes. The buffer prevents false alerts from minor timing drift.

Output assertions: the part most tutorials skip

Here's the thing about "job ran successfully": Sidekiq marks a job successful when it completes without an exception. That tells you about the job's execution. It tells you nothing about whether the job's output was valid.

If your export job queries a table that returns 0 rows (because an upstream pipeline broke two days ago), Sidekiq marks it done. Your success rate metrics stay green. You find out eleven days later when someone asks why their data is stale.

DeadManCheck lets you configure an output assertion: alert if the count in the ping is below a threshold. You set it to count > 0. Now a job that exports zero rows triggers an alert, even though Sidekiq considers it a success.

This is done through the POST body:

# In dmc_success, POST the row count
req.body = { count: rows_exported }.to_json

Then in the monitor settings, configure: "alert if count is 0 or less."

The other cron monitoring tools — Cronitor, Healthchecks.io, Better Stack — check whether the ping arrived. They don't check what the ping reported. Output assertions are the difference between knowing your job ran and knowing your job worked.

Duration monitoring

The start ping does double duty: it starts a duration timer. When the success ping arrives, DeadManCheck records the elapsed time.

After 5 or more runs, it builds a rolling average. If a run takes significantly longer than the baseline — say, your 30-second export starts taking 8 minutes — it flags the anomaly.

This is a useful leading indicator. A slow job often means:

A query that's hitting an un-indexed table after a data volume threshold was crossed
A downstream API starting to time out
A Redis or database connection pool under pressure

You find out before users notice latency in the actual product.

The full setup takes about 10 minutes

Create a free account — no credit card needed, free for 5 monitors
Add a new monitor, set the interval to match your schedule + buffer
Copy the token into your environment as DEADMANCHECK_TOKEN
Add the three helper methods to your worker (or a shared concern)
Set the output assertion threshold if your job processes records
Deploy, trigger the job manually once, confirm the ping arrives in the dashboard

After that, you'll get an alert if:

The job doesn't run on schedule (missed ping)
The job raises an exception (fail ping)
The job runs but processes nothing (output assertion)
The job takes significantly longer than usual (duration anomaly)

That's the full set of failure modes — including the silent ones that Sidekiq alone won't catch.

DeadManCheck is open source and self-hostable. If you'd rather run the monitoring infrastructure yourself: GitHub →

Kubernetes CronJobs silently fail more than you think

Kriss — Tue, 05 May 2026 14:06:03 +0000

A backup job missed 24 days of runs. Nobody knew. The CronJob looked fine in kubectl get cronjobs. No alerts fired. The last successful run timestamp in the status field just sat there, quietly getting older.

The root cause: the CronJob controller had silently given up scheduling after missing 100 runs. Logged an error. Stopped trying. Moved on.

This article explains why Kubernetes CronJobs are structurally unreliable without external monitoring, and what you can do about it.

The three failure modes Kubernetes won't tell you about

1. The 100 missed-schedule limit

This is the one that produces the war stories.

The Kubernetes CronJob controller checks how many schedules it missed since the last successful run. If that number exceeds 100, it permanently stops scheduling that CronJob — and logs a single error line:

Cannot determine if job needs to be started: too many missed start time (> 100)

That's it. No event. No alert. kubectl describe cronjob shows the last scheduled time getting stale. The CronJob shows as ACTIVE: 0. Everything looks fine until you notice your data is 24 days old.

This happens if:

The CronJob controller crashes or the API server is unreachable for an extended period
You set startingDeadlineSeconds too low and the cluster was briefly overloaded
A node outage prevented scheduling for long enough

The fix is restarting the CronJob (delete and recreate it, or bump the schedule), but the point is: you won't know it happened until you check manually.

2. Exit code 0 is not success

Your CronJob container can exit 0 after:

Connecting to a read replica that's 6 hours behind
Finding an empty queue and processing nothing
Silently swallowing an exception in a try/catch
Successfully completing a database backup of 0 bytes

Kubernetes marks the Job as Succeeded. The CronJob status shows the last successful run timestamp updated. Everything looks healthy. Your data pipeline has been doing nothing for a week.

3. Job history purged, evidence gone

By default:

successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 1

After three successful runs, the oldest Job pod is deleted. Its logs go with it. When you eventually notice something's wrong and go looking for "what happened on Tuesday?", the evidence no longer exists.

You can increase these limits, but you'll never retain more than a handful of runs. A real audit trail requires shipping logs to an external system.

The deeper problem: no external check

All of these failure modes share the same root cause: your monitoring system lives inside the cluster, so it fails along with the cluster.

If your alerting depends on the cluster being healthy, it won't alert you when the cluster is unhealthy. And CronJob failures almost always correlate with cluster health problems.

What you need is a check that runs outside your cluster and asks: "did this job run? Did it do something?" If the answer is no, it pages you — regardless of what the cluster thinks.

This is the dead man's switch pattern: instead of your monitoring system checking whether the job ran, the job checks in with an external system, and the external system alerts if it stops hearing from the job.

Implementing external monitoring for a Kubernetes CronJob

Add a start/success/fail ping to your job. Here's a minimal implementation:

Shell wrapper (works with any container)

#!/bin/bash
set -euo pipefail

BASE="https://deadmancheck.io/ping/${DEADMANCHECK_TOKEN}"

# Signal start (enables duration monitoring) — || true so a network blip never kills the job
curl -fsS "${BASE}/start" > /dev/null || true

# Alert on any error
trap 'curl -fsS "${BASE}/fail" > /dev/null' ERR

# Your actual job
ROWS=$(/app/run-export.sh)

# Signal success + row count for output assertion
curl -fsS -X POST -H "Content-Type: application/json" \
  -d "{\"count\": ${ROWS}}" \
  "${BASE}" > /dev/null

Python job

import requests
import os
import sys

TOKEN = os.environ["DEADMANCHECK_TOKEN"]
BASE = f"https://deadmancheck.io/ping/{TOKEN}"

def main():
    # Signal start — wrapped so a monitoring outage never kills the job
    try:
        requests.get(f"{BASE}/start", timeout=5)
    except Exception:
        pass
    try:
        records_processed = run_job()
        # POST count for output assertion: alert if count is 0
        requests.post(BASE, json={"count": records_processed}, timeout=5)
    except Exception:
        try:
            requests.get(f"{BASE}/fail", timeout=5)
        except Exception:
            pass
        sys.exit(1)

if __name__ == "__main__":
    main()

CronJob spec

Store the token in a Kubernetes Secret:

kubectl create secret generic deadmancheck-secret \
  --from-literal=token=your-token-here

Reference it in your CronJob spec:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-export
  namespace: production
spec:
  schedule: "0 2 * * *"
  successfulJobsHistoryLimit: 5   # keep more history than the default
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: exporter
            image: your-registry/exporter:latest
            env:
            - name: DEADMANCHECK_TOKEN
              valueFrom:
                secretKeyRef:
                  name: deadmancheck-secret
                  key: token
          restartPolicy: OnFailure

Output assertions: the check Kubernetes can't do

Output assertions are the piece most monitoring tutorials skip. Here's why it matters.

Your job runs. Exits 0. Kubernetes marks it Succeeded. But the job processed 0 records.

If your monitoring only checks "did the job ping?" — like every other cron monitoring tool — you don't get alerted. The job pinged. It just pinged with count=0.

DeadManCheck lets you configure an output assertion: alert if count < N. Set it to count > 0. Now your job can't silently export nothing without triggering an alert.

This catches the failure mode that pure heartbeat monitoring misses: the job that runs, succeeds by every technical measure, and still does nothing useful.

What external monitoring catches vs what it doesn't

Failure mode	kubectl catches?	External monitoring catches?
Pod CrashLoopBackOff	Visible in logs/events	YES (missed ping)
100 missed-schedule limit hit	No alert fires	YES (missed ping)
Job exits 0, processes nothing	No	YES (output assertion)
Cluster outage kills controller	No	YES (missed ping)
Job takes 5× longer than usual	No	YES (duration anomaly)
CronJob accidentally deleted	No	YES (missed ping)

The realistic setup time

For an existing CronJob:

Create a free monitor — takes 2 minutes
Set interval to match your schedule + buffer (e.g., 25h for a daily job)
Enable output assertion if your job reports a count
Add the start/success/fail pings to your container script
Create the Secret, update the CronJob spec
Deploy and verify the first ping arrives

Total: 15-20 minutes including deployment. The first time a silent failure happens, you'll have wished you'd done it sooner.

One more thing: set a reasonable history limit

While you're in the CronJob spec, increase the history limits from the defaults:

successfulJobsHistoryLimit: 10
failedJobsHistoryLimit: 5

This doesn't replace external monitoring, but it gives you more context in kubectl describe cronjob when you're investigating an incident. The default of 3/1 is genuinely too low for production jobs.

DeadManCheck is open source and self-hostable if you'd rather run it on your own infrastructure. GitHub →

How to monitor Apache Airflow DAGs so you know when they silently fail

Kriss — Fri, 01 May 2026 12:14:44 +0000

Your Airflow DAG ran last night. All tasks: green. All durations: normal. Export job completed at 02:14.

Zero rows exported. Nobody knows.

This is the silent failure Airflow's built-in alerting doesn't catch. on_failure_callback fires when a task crashes. It doesn't fire when a task exits 0 after connecting to a stale database replica and processing nothing. That's the failure mode that eats your Monday morning.

This article shows you two ways to add external monitoring to Airflow DAGs — so you get paged for both kinds of failures.

Why Airflow's built-in alerts aren't enough

Airflow gives you several callback hooks:

on_failure_callback — task or DAG run failed
on_success_callback — task or DAG run succeeded
on_retry_callback — task queued for retry
on_execute_callback — task about to start
on_skipped_callback — task raised AirflowSkipException

These are useful. But none of them answer the question that actually matters for data pipelines: did the job do something?

Your export DAG catches a database timeout, logs it, and exits cleanly. Airflow marks it green. No callbacks fire. The data never lands.

You need an independent check — something outside Airflow that asks "did this DAG complete, and did it report non-zero output?" every time the schedule fires.

The approach: dead man's switch + output assertions

A dead man's switch monitor works like this:

You set up a monitor with an expected interval — say, "this DAG should report in every 24 hours"
Your DAG pings the monitor when it completes
If the monitor doesn't hear from the DAG within the window, it alerts you

This catches missed runs, paused DAGs, scheduler issues, and slow drift.

But the more powerful feature is output assertions: you pass a count with your ping, and the monitor alerts if count is 0 — even when the job completed and pinged successfully.

I'll use DeadManCheck for the examples. It's the only cron monitoring tool that supports output assertions, and it has a free tier for up to 5 monitors.

Option 1: DAG-level callback (cleanest approach)

If you want to monitor the whole DAG run — not individual tasks — use on_success_callback and on_failure_callback at the DAG level.

# airflow/dags/daily_export.py
# Airflow 2.x imports. For Airflow 3.x use:
#   from airflow.sdk import DAG
#   from airflow.providers.standard.operators.python import PythonOperator

import requests
import os
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

DEADMANCHECK_TOKEN = os.environ.get("DEADMANCHECK_TOKEN")
BASE_URL = f"https://deadmancheck.io/ping/{DEADMANCHECK_TOKEN}"


def ping_start(context):
    """Signal that the DAG has started — enables duration monitoring."""
    try:
        requests.get(f"{BASE_URL}/start", timeout=5)
    except Exception:
        pass  # never let monitoring break the job


def ping_success(context):
    """Signal success. Pull row count from XCom for output assertion."""
    try:
        rows = context["ti"].xcom_pull(task_ids="export_data", key="rows_exported") or 0
        requests.post(BASE_URL, json={"count": rows}, timeout=5)
    except Exception:
        pass


def ping_failure(context):
    """Signal explicit failure."""
    try:
        requests.get(f"{BASE_URL}/fail", timeout=5)
    except Exception:
        pass


with DAG(
    dag_id="daily_export",
    schedule="0 2 * * *",
    start_date=datetime(2026, 1, 1),
    catchup=False,
    on_success_callback=ping_success,
    on_failure_callback=ping_failure,
) as dag:

    def export_data(**context):
        rows = run_export()
        # Push count to XCom so ping_success can read it
        context["ti"].xcom_push(key="rows_exported", value=rows)
        return rows

    export_task = PythonOperator(
        task_id="export_data",
        python_callable=export_data,
        on_execute_callback=ping_start,  # task-level in 2.x; fires when this task starts
    )

A few things to note:

Wrap every ping in try/except. The monitoring call must never fail the DAG. If DeadManCheck is unreachable, your pipeline keeps running.

Push row count via XCom. The success callback receives the context object, which includes a TaskInstance. Use xcom_pull to retrieve the count from the last task.

Set on_execute_callback for duration monitoring. In Airflow 2.x this is a task-level callback, so it lives on the first task rather than the DAG itself. It sends the /start signal before that task runs. DeadManCheck then tracks how long each run takes and alerts when a run is significantly longer than the rolling average.

Option 2: Final task in the DAG graph

If you want the monitoring ping visible in the Airflow task graph — useful for debugging — add it as a final PythonOperator.

from airflow.operators.python import PythonOperator


def notify_deadmancheck(**context):
    rows = context["ti"].xcom_pull(task_ids="export_data", key="rows_exported") or 0
    requests.post(BASE_URL, json={"count": rows}, timeout=5)


# In your DAG:
notify = PythonOperator(
    task_id="notify_deadmancheck",
    python_callable=notify_deadmancheck,
)

export_task >> validate_data >> notify  # replace validate_data with your existing tasks

This approach makes the monitoring step explicit and auditable. Note that notify_deadmancheck deliberately has no try/except — if the ping fails, you want Airflow to retry it (and mark the task failed if retries are exhausted), rather than silently swallowing the error. This is the opposite of the callback approach above, where the pipeline must never be blocked by the monitoring call.

Configuring the monitor

In DeadManCheck, create a new monitor:

Type: Cron / Heartbeat
Interval: set to 25h (slightly longer than your 24h schedule, to allow for run time)
Output assertion: alert if count = 0
Alert channels: Slack, PagerDuty, email — whatever's in your incident flow

The output assertion is the key part. When your export runs and calls:

curl -fsS -X POST -H "Content-Type: application/json" \
  -d '{"count": 0}' \
  https://deadmancheck.io/ping/your-token > /dev/null

You get an alert. Even though Airflow shows the DAG as green.

Setting the environment variable

In your Airflow deployment, add DEADMANCHECK_TOKEN as an environment variable. Where you set it depends on your setup:

Docker Compose:

environment:
  - DEADMANCHECK_TOKEN=your-token-here

Kubernetes (via Secret):

kubectl create secret generic deadmancheck-secret \
  --from-literal=token=your-token-here

env:
  - name: DEADMANCHECK_TOKEN
    valueFrom:
      secretKeyRef:
        name: deadmancheck-secret
        key: token

Astronomer / MWAA: add it as an Airflow Variable or environment variable via the platform's UI.

What you catch with this setup

With the callback approach + output assertion:

Failure mode	Airflow catches?	DeadManCheck catches?
Task raises exception	YES	YES (via on_failure_callback)
DAG paused accidentally	No	YES (missed ping)
Scheduler down	No	YES (missed ping)
Job exports 0 rows	No	YES (output assertion)
Run takes 3× longer than usual	No	YES (duration anomaly)
API token expired, job exits 0	No	YES (output assertion)

Two minutes to set up

Create a free account — no credit card needed
Create a monitor, set interval to match your DAG schedule + buffer
Enable output assertion: alert if count = 0
Add the callbacks to your DAG
Deploy, run once, verify the ping arrives

After the first successful run, DeadManCheck will alert you if the DAG ever goes silent — or succeeds while doing nothing.

DeadManCheck is open source and self-hostable. If you'd rather run it on your own infrastructure, the GitHub repo has setup instructions.

How to add dead man's switch monitoring to any cron job in 2 minutes

Kriss — Thu, 30 Apr 2026 10:47:42 +0000

How to add dead man's switch monitoring to any cron job in 2 minutes

The concept is simple: your job checks in when it runs. If it stops checking in, you get alerted.

No agent to install. No SDK to integrate. Just a curl at the end of your script.

The one-liner

curl -fsS https://deadmancheck.io/ping/YOUR-TOKEN > /dev/null

That's it. Stick that at the end of your cron job. If the job stops running — server dies, cron daemon crashes, script errors out before it gets there — you get an alert.

The flags: -f fails silently on HTTP errors, -s suppresses progress output, -S still shows errors if -s is set. Redirect to /dev/null because you don't want curl output polluting your logs.

Setting it up

Sign up at deadmancheck.io (free for up to 5 monitors). Create a monitor, set the expected interval — say, every 24 hours — and copy your unique token.

Then configure the alert window. If you're running a daily job, set it to alert after 25 hours of silence. That gives a 1-hour grace period for slow servers and slight scheduling drift.

Start/end pattern for longer jobs

The one-liner is fine for quick jobs. For anything that runs more than a few minutes, use the start/end pattern. This also catches jobs that start but hang indefinitely.

# Signal job started
curl -fsS https://deadmancheck.io/ping/YOUR-TOKEN/start > /dev/null

# ... your job logic ...

# Signal job completed
curl -fsS https://deadmancheck.io/ping/YOUR-TOKEN > /dev/null

If the job starts but never pings the end URL within your configured timeout, you get alerted. Useful for ETL jobs that sometimes decide to run for 6 hours when they should take 20 minutes.

Python

import requests
import os

DEADMANCHECK_TOKEN = os.environ["DEADMANCHECK_TOKEN"]
BASE_URL = f"https://deadmancheck.io/ping/{DEADMANCHECK_TOKEN}"

def ping(path="", count=None):
    try:
        url = f"{BASE_URL}{path}"
        if count is not None:
            requests.post(url, json={"count": count}, timeout=5)
        else:
            requests.get(url, timeout=5)
    except requests.RequestException:
        pass  # never let monitoring break the job

ping("/start")
try:
    rows = run_export()
    ping(count=len(rows))
except Exception:
    ping("/fail")
    raise

The try/except around each ping is deliberate. Your monitoring call should never take down your job.

Ruby

require 'net/http'
require 'uri'
require 'json'

TOKEN = ENV['DEADMANCHECK_TOKEN']
BASE = "https://deadmancheck.io/ping/#{TOKEN}"

def ping(path = '', count = nil)
  uri = URI("#{BASE}#{path}")
  if count
    req = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
    req.body = JSON.generate({ count: count })
    Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
  else
    Net::HTTP.get(uri)
  end
rescue StandardError
  # don't let monitoring kill the job
end

ping('/start')

begin
  count = run_etl
  ping('', count)
rescue => e
  ping('/fail')
  raise
end

Bash with error handling

For bash scripts, use a trap to ping the fail URL on any error:

#!/bin/bash
set -euo pipefail

TOKEN="YOUR-TOKEN"
BASE="https://deadmancheck.io/ping/${TOKEN}"

curl -fsS "${BASE}/start" > /dev/null

trap 'curl -fsS "${BASE}/fail" > /dev/null' ERR

/usr/local/bin/run-backup.sh

ROW_COUNT=$(wc -l < /backups/output.csv)
curl -fsS -X POST -H "Content-Type: application/json" \
  -d "{\"count\": ${ROW_COUNT}}" \
  "${BASE}" > /dev/null

set -euo pipefail means any unhandled error exits the script and triggers the trap. The ERR trap fires before exit, pinging the fail endpoint.

What to monitor first

If you're not sure where to start:

Database backups — silent failures here are catastrophic
ETL/data pipeline jobs — wrong data is worse than no data
Invoice/billing jobs — customers notice immediately
Report generation — stakeholders notice next morning
Cache warmers — performance degrades silently

Anything that runs unattended and that you'd be embarrassed to find broken three weeks later.

One token per cron job. If you have 10 jobs, create 10 monitors. DeadManCheck's free tier covers 5 monitors — the $12/mo plan covers 100, which handles most teams.

Two minutes of setup. One less thing to find out about the hard way.

Monitoring GitHub Actions scheduled workflows: a practical guide

Kriss — Wed, 29 Apr 2026 16:14:13 +0000

Monitoring GitHub Actions scheduled workflows: a practical guide

GitHub Actions is a surprisingly capable cron scheduler. Schedule a workflow, let it run nightly, forget about it.

Until it stops running. And you don't notice for two weeks.

Scheduled workflows in GitHub Actions are quietly unreliable. GitHub delays them, skips them during high load, and — most importantly — gives you no built-in alerting when they fail silently. Adding external monitoring takes about 5 minutes and saves you from that two-week discovery.

The basic setup

Here's a minimal scheduled workflow with monitoring:

name: Nightly export

on:
  schedule:
    - cron: '0 2 * * *'  # 2am UTC every day
  workflow_dispatch:  # allows manual triggering for testing

jobs:
  export:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run export
        run: python scripts/export.py

      - name: Ping DeadManCheck
        if: success()
        run: curl -fsS https://deadmancheck.io/ping/${{ secrets.DEADMANCHECK_TOKEN }} > /dev/null

The last step pings DeadManCheck only if all previous steps succeeded (if: success()). If the export script fails, the ping doesn't fire, and you get alerted after your configured grace period.

Set up the monitor with a 25-hour interval (giving a 1-hour buffer on the 24-hour schedule). Store your token in GitHub: Settings → Secrets and variables → Actions → New repository secret named DEADMANCHECK_TOKEN.

Adding start/end pings for longer jobs

For jobs that run more than a few minutes, use the start/end pattern. This catches jobs that hang:

steps:
  - uses: actions/checkout@v4

  - name: Ping start
    run: curl -fsS https://deadmancheck.io/ping/${{ secrets.DEADMANCHECK_TOKEN }}/start > /dev/null || true

  - name: Run ETL
    id: etl
    run: |
      python scripts/run_etl.py
      echo "rows=$(cat /tmp/etl_row_count.txt)" >> $GITHUB_OUTPUT

  - name: Ping done
    if: success()
    run: |
      curl -fsS -X POST -H "Content-Type: application/json" \
        -d "{\"count\": ${{ steps.etl.outputs.rows }}}" \
        "https://deadmancheck.io/ping/${{ secrets.DEADMANCHECK_TOKEN }}" \
        > /dev/null || true

  - name: Ping fail
    if: failure()
    run: curl -fsS https://deadmancheck.io/ping/${{ secrets.DEADMANCHECK_TOKEN }}/fail > /dev/null || true

Your ETL script writes the row count to /tmp/etl_row_count.txt. The monitoring step picks it up and includes it in the ping — so your monitor can alert on zero-output runs, not just missed runs.

The gotchas

GitHub delays scheduled workflows

This is the big one. GitHub's docs admit that scheduled workflows may be delayed during high load. A workflow scheduled for 2:00am UTC might run at 2:23am or 2:51am. During busy periods, delays of 30–60 minutes aren't unusual.

Don't set your DeadManCheck interval to exactly 24 hours. Set it to 25 hours. That buffer absorbs GitHub's scheduling jitter without letting real failures go undetected.

Scheduled workflows stop on inactive repos

If a repository has no commits in 60 days, GitHub disables scheduled workflows. You'll get an email warning. If you miss it, the job silently stops running — and your external monitor will catch it where GitHub's notification didn't reach you.

Test with workflow_dispatch before trusting the schedule

Always add workflow_dispatch as a trigger (it's in all examples above). You can trigger the workflow manually from the Actions tab or via the CLI:

gh workflow run nightly-export.yml

Test your monitoring integration before the first scheduled run. Confirm the ping appears in your DeadManCheck dashboard with the correct count.

Secrets aren't available in forks

If your repo is public and someone forks it, secrets.DEADMANCHECK_TOKEN will be empty in their fork. The curl will fail silently. This is fine — you don't want random forks pinging your monitor — but be aware of it when debugging.

Full production example

name: Nightly database backup

on:
  schedule:
    - cron: '0 2 * * *'
  workflow_dispatch:

jobs:
  backup:
    runs-on: ubuntu-latest
    timeout-minutes: 30  # hard limit — prevent hung jobs accumulating

    steps:
      - uses: actions/checkout@v4

      - name: Ping start
        run: |
          curl -fsS \
            "https://deadmancheck.io/ping/${{ secrets.DEADMANCHECK_TOKEN }}/start" \
            > /dev/null || true  # don't fail if monitoring is down

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Run backup
        id: backup
        run: |
          python scripts/backup.py
          echo "rows=$(cat /tmp/backup_row_count.txt)" >> $GITHUB_OUTPUT

      - name: Upload to S3
        run: aws s3 cp /backups/latest.dump s3://my-backups/

      - name: Ping done
        if: success()
        run: |
          curl -fsS -X POST -H "Content-Type: application/json" \
            -d "{\"count\": ${{ steps.backup.outputs.rows }}}" \
            "https://deadmancheck.io/ping/${{ secrets.DEADMANCHECK_TOKEN }}" \
            > /dev/null || true

      - name: Ping fail
        if: failure()
        run: |
          curl -fsS \
            "https://deadmancheck.io/ping/${{ secrets.DEADMANCHECK_TOKEN }}/fail" \
            > /dev/null || true

A few things worth noting:

timeout-minutes: 30 is a hard ceiling. Without it, a hung job can sit there for 6 hours consuming a runner.
|| true on the monitoring pings means a DeadManCheck outage won't cause your backup job to report failed.
The row count flows from the backup step through $GITHUB_OUTPUT to the ping step.

After deploying

Trigger the workflow manually and confirm:

The workflow runs end-to-end without errors
DeadManCheck shows a recent ping on your monitor dashboard
The count looks correct for what the job processed

Wait for the first scheduled run and verify again. Two successful data points before you trust it.

Scheduled workflows are one of those things that feel reliable until the day they aren't. External monitoring is the difference between finding out immediately and finding out when someone asks why the weekly report is missing.

Output assertions: the cron job check most monitoring tools skip

Kriss — Tue, 28 Apr 2026 21:28:25 +0000

Output assertions: the cron job check most monitoring tools skip

A follow-up to A reader comment made me realise I'd only solved half the problem — this is a deeper reference guide on output assertions specifically.

"Did it run?" is the wrong question.

Every monitoring tool asks it. Heartbeat monitors, cron schedulers, even purpose-built tools like Cronitor and Healthchecks.io — they all fundamentally ask: did the job check in? If yes, green. If no, red.

It's a useful question. But it's not the useful question.

The failure mode that looks like success

Imagine a nightly job that syncs user records from your CRM into your database. It runs at midnight, takes about 90 seconds, and exits cleanly. Your heartbeat monitor sees the ping at 12:01:34am and marks it healthy.

What it doesn't see: the job synced 0 records. It has been syncing 0 records for eight days, since someone rotated the CRM API credentials and forgot to update the environment variable. The job connects, gets a 401, logs a warning, falls back to a no-op, and exits 0.

All monitoring: green. Business: broken for eight days.

This is not a hypothetical. Variants of this failure happen constantly. The job ran. That fact is true and also completely useless.

What "did it do anything?" looks like

Output assertions flip the question. Instead of only checking that the job pinged in, you also check what it reported.

A job that processes records should report how many it processed. A job that generates a file should report the file size. A job that sends emails should report how many it sent. You instrument the job to emit a count — one number representing meaningful work done — and your monitoring layer validates it falls within expected bounds.

The failure modes this catches:

Zero when non-zero expected: sync runs, processes nothing, exits clean
Suspiciously low counts: normally syncs 500 records, today synced 3
Count drift over time: weekly report used to include 10k rows, now consistently 200

None of these trip a heartbeat check. All of them are real problems.

Why most tools don't do this

Heartbeat monitoring is architecturally simple: job pings URL, URL records timestamp, alerting checks timestamp age. The data model is just "last seen at".

Output assertions require more: the job must emit structured data, the tool must store it, and the alerting logic must understand what "normal" looks like for that specific job. That's a significantly more complex product to build.

Most tools solve the simpler problem because it covers the obvious failure mode and is much easier to ship.

How to instrument your jobs

The instrumentation is lightweight. Pick a number that represents meaningful work and emit it at the end:

# Database backup — report dump file size
result = subprocess.run(["pg_dump", "-Fc", "mydb", "-f", "/backups/mydb.dump"])
dump_size = os.path.getsize("/backups/mydb.dump")
ping_monitor(count=dump_size)

# CRM sync — report records synced
synced = sync_from_crm()
ping_monitor(count=len(synced))

# Email campaign — report emails sent
sent = send_campaign(campaign_id)
ping_monitor(count=sent)

Three extra lines per job. The return is knowing your job didn't just run — it did something. (ping_monitor is a wrapper around your monitoring call — implementation below.)

Sending the count to your monitor

DeadManCheck accepts a count parameter with each ping:

curl -fsS -X POST -H "Content-Type: application/json" \
  -d '{"count": 1547}' \
  https://deadmancheck.io/ping/YOUR-TOKEN > /dev/null

You configure the assertion on the monitor: "alert if count is 0" or "alert if count drops below threshold". If the job checks in but reports zero records, you get alerted — even though the job technically ran fine.

It also does duration monitoring with rolling average anomaly detection. If your 90-second job starts taking 45 minutes, that gets flagged too. Jobs that hang are a separate silent failure mode that output counts don't catch on their own.

The right question

Monitoring that only asks "did it run?" will eventually lie to you at the worst possible moment.

The right question is "did it do anything useful?" Output assertions are how you ask that question automatically, at 2am, every night, without anyone having to check.

Start with your backup jobs. That's where the answer matters most.

A reader comment made me realise I'd only solved half the problem

Kriss — Sat, 25 Apr 2026 13:20:52 +0000

A reader comment made me realise I'd only solved half the problem

Last month I wrote about the cron job failure mode nobody talks about: the job that doesn't die, it just drags.

The short version: a nightly ETL job at a previous employer took four hours instead of forty minutes for six days before anyone noticed. It ran. It completed. It exited zero. Every dashboard showed green. Downstream data was silently wrong.

The fix I described was duration anomaly detection — once you have a few weeks of run history, you know what "normal" looks like. A job that takes 4x its baseline is a signal even if it succeeded. I built DeadManCheck partly because I couldn't find a tool that combined silence detection with duration tracking.

The article got some traction. Then someone left a comment that stopped me in my tracks.

The comment

The failure mode I keep seeing: the job runs, logs "complete," and the output silently goes nowhere.

No error. No alert. Just a cron that appeared healthy while accomplishing nothing for days.

The fix that actually works is external verification. Don't check that the job ran; check that the downstream artifact exists. A job that succeeds but doesn't write the expected DB record is the same as a failed job.

They were right. And I hadn't covered it.

Duration anomaly detection catches "job ran slow." Silence detection catches "job didn't run." Neither catches "job ran fine, on time, but produced nothing."

That's a third failure mode entirely.

What this looks like in practice

Here's a simplified backup script:

import psycopg2
import csv
import os

conn = psycopg2.connect(os.environ["DATABASE_URL"])
cur = conn.cursor()

cur.execute("SELECT * FROM orders WHERE exported = false")
rows = cur.fetchall()

with open("/backups/orders.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

cur.execute("UPDATE orders SET exported = true WHERE exported = false")
conn.commit()

print(f"Backup complete. {len(rows)} rows exported.")

Can you spot the bug?

The script runs. It prints "Backup complete. 0 rows exported." It exits cleanly.

The bug is in a migration from three weeks earlier. A developer renamed the exported column to is_exported. The WHERE clause now silently returns nothing. Every night: zero rows fetched, empty CSV written, nothing marked, exit code 0.

Exit code: 0. Monitoring alert: none.

This is exactly what the commenter was describing. A job that succeeds but produces nothing is functionally the same as a failed job. Your monitoring just doesn't know that yet.

Why the standard fix is hard to scale

The commenter suggested checking the downstream artifact — verify the DB record exists, check the file isn't empty. That's the correct instinct, but it requires custom verification logic for every job. Each job writes to a different place, in a different format, with different expectations about what "something" looks like.

What I wanted was a generalised version: tell the monitoring service what your job produced, and let it decide if that's suspicious.

That's what I built into DeadManCheck as output assertions.

How output assertions work

The idea is simple. When your job pings the monitoring service at completion, it includes a count of what it actually did:

curl -fsS -X POST -H "Content-Type: application/json" \
  -d '{"count": 0}' \
  https://deadmancheck.io/ping/YOUR-TOKEN > /dev/null

You configure a rule: "alert if count is 0 more than once in a row" or "alert if count drops more than 80% below the rolling average."

The job ran. It just did nothing. Now you know.

In Python:

import requests
import os

def ping_deadmancheck(count=None):
    token = os.environ["DEADMANCHECK_TOKEN"]
    url = f"https://deadmancheck.io/ping/{token}"
    try:
        if count is not None:
            requests.post(url, json={"count": count}, timeout=5)
        else:
            requests.get(url, timeout=5)
    except requests.RequestException:
        pass  # never let monitoring break the job

rows_processed = do_the_work()
ping_deadmancheck(count=rows_processed)

Ten lines. The complexity stays in the service, not in your scripts. And unlike checking a downstream artifact, it works the same way regardless of what your job actually produces.

The full picture: three failure modes

After that comment, I updated my mental model. There are three distinct ways a cron job can fail silently:

Failure mode	What happens	What catches it
Job doesn't run	Silence. No ping arrives.	Dead man's switch (silence detection)
Job runs slow	Ping arrives late or after too long	Duration anomaly detection
Job runs, produces nothing	Ping arrives on time, output is empty	Output assertions

Most tools only cover the first row. Some cover the first two. The third is almost always a blind spot.

What I do now

Every background job I write now has three things:

A counter variable tracking records processed
A guard clause that exits non-zero if zero is never a valid outcome
A heartbeat ping that includes the count

rows_processed = do_the_work()

if rows_processed == 0:
    raise RuntimeError("Processed 0 records — investigate before marking success")

ping_deadmancheck(count=rows_processed)

For jobs where zero is sometimes valid (quiet periods, weekends), skip the guard clause and let the monitoring service decide based on historical patterns.

Credit where it's due

I wouldn't have built output assertions without that comment. Sometimes the feature request hiding in a code review or a reply thread is the most valuable one you'll get.

If you've got a background job running right now, ask yourself three questions:

Will I know if it silently stops running?
Will I know if it starts taking 4x longer than normal?
Will I know if it ran perfectly but accomplished nothing?

If any of those is "no" — that's your monitoring gap.

→ Try DeadManCheck free at deadmancheck.io

The cron job failure mode nobody talks about

Kriss — Sun, 29 Mar 2026 19:04:31 +0000

A few months ago, a nightly ETL job at a previous job nearly cost us a major client. Not because it failed. Because it took four hours instead of forty minutes — and nobody noticed for six days.

The job ran. It completed. It exited zero. Every monitoring dashboard showed green. Meanwhile, the downstream data pipeline was ingesting half-processed records, and reports were silently wrong. By the time a client flagged it, we had six days of corrupted reporting to unpick.

This is the failure mode nobody talks about: the job that doesn't die, it just... drags.

Why your existing monitoring misses it -

If you're using Healthchecks.io, Better Uptime, or a similar dead man's switch tool, here's how it works: your cron job pings a URL at the end of each run. If the ping doesn't arrive within a grace window, you get an alert.

That's genuinely useful. It catches jobs that crash, hang indefinitely, or never start. But what it doesn't catch is a job that completes in 240 minutes when it should take 45. The ping arrives. The check passes. Everything looks fine. The tool has no idea what "normal" looks like for that job — it only knows silence vs. noise.

Duration anomaly detection is the missing piece.

What duration anomaly detection actually means

The concept is simple: instead of only checking whether a job completed, you also check how long it took.

Once you have a few weeks of run history, you know that your nightly job usually takes 40–50 minutes. So when it takes four hours, that's a signal — even if it succeeded. Something changed: the dataset grew, a dependency got slow, a query plan degraded, a network hop started timing out and retrying.

Catching this early means you can investigate before it causes damage downstream.

The /start + /finish pattern -

Job begins

curl -s "https://deadmancheck.io/ping/abc123/start"

... your actual job logic ...

Job ends

curl -s "https://deadmancheck.io/ping/abc123"

Now the monitoring service knows: this run started at T, it ended at T+4h. It compares that against the rolling average of previous runs and alerts if the duration exceeds a configurable threshold — say, 2x the usual runtime. Two curl calls. The complexity lives in the service, not in your scripts.

Why this matters more as systems age -

New jobs are fast. As systems mature, things get slower in ways that creep up on you. Rows accumulate. Indexes bloat. Third-party APIs introduce latency. Your job that took 8 minutes in January takes 35 minutes in October.

Without duration tracking, you have no visibility into this degradation. With it, you have a canary. The alert fires at 70 minutes, you investigate, you find the index that needs rebuilding. Crisis averted before the downstream effects compound.

So I built this -

After looking for a tool that combined silence detection with duration anomaly detection and not finding one, I built DeadManCheck (deadmancheck.io). It supports the /start + /ping pattern, tracks rolling run history, and alerts you when a job takes significantly longer than its baseline. Standard silence detection is included too, so both failure modes are covered in one place.

Free tier available, no credit card required.

The checklist -

Next time you wire up a cron job, ask yourself:

Will I know if this job silently stops running?
Will I know if this job starts taking 4x longer than normal?
Will I know before my users do?

If the answer to any of those is "no", you have a monitoring gap. It's a small one to close.

→ Try DeadManCheck free at deadmancheck.io