<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kriss</title>
    <description>The latest articles on Forem by Kriss (@krissv).</description>
    <link>https://forem.com/krissv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3849903%2Fc4d1b106-f99a-4865-aa09-67b370de21bf.png</url>
      <title>Forem: Kriss</title>
      <link>https://forem.com/krissv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/krissv"/>
    <language>en</language>
    <item>
      <title>Output assertions: the cron job check most monitoring tools skip</title>
      <dc:creator>Kriss</dc:creator>
      <pubDate>Tue, 28 Apr 2026 21:28:25 +0000</pubDate>
      <link>https://forem.com/krissv/output-assertions-the-cron-job-check-most-monitoring-tools-skip-15kn</link>
      <guid>https://forem.com/krissv/output-assertions-the-cron-job-check-most-monitoring-tools-skip-15kn</guid>
      <description>&lt;h1&gt;
  
  
  Output assertions: the cron job check most monitoring tools skip
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;A follow-up to &lt;a href="https://dev.to/krissv/a-reader-comment-made-me-realise-id-only-solved-half-the-problem-3cpg"&gt;A reader comment made me realise I'd only solved half the problem&lt;/a&gt; — this is a deeper reference guide on output assertions specifically.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;"Did it run?" is the wrong question.&lt;/p&gt;

&lt;p&gt;Every monitoring tool asks it. Heartbeat monitors, cron schedulers, even purpose-built tools like Cronitor and Healthchecks.io — they all fundamentally ask: did the job check in? If yes, green. If no, red.&lt;/p&gt;

&lt;p&gt;It's a useful question. But it's not the useful question.&lt;/p&gt;

&lt;h2&gt;
  
  
  The failure mode that looks like success
&lt;/h2&gt;

&lt;p&gt;Imagine a nightly job that syncs user records from your CRM into your database. It runs at midnight, takes about 90 seconds, and exits cleanly. Your heartbeat monitor sees the ping at 12:01:34am and marks it healthy.&lt;/p&gt;

&lt;p&gt;What it doesn't see: the job synced 0 records. It has been syncing 0 records for eight days, since someone rotated the CRM API credentials and forgot to update the environment variable. The job connects, gets a 401, logs a warning, falls back to a no-op, and exits 0.&lt;/p&gt;

&lt;p&gt;All monitoring: green. Business: broken for eight days.&lt;/p&gt;

&lt;p&gt;This is not a hypothetical. Variants of this failure happen constantly. The job ran. That fact is true and also completely useless.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "did it do anything?" looks like
&lt;/h2&gt;

&lt;p&gt;Output assertions flip the question. Instead of only checking that the job pinged in, you also check what it reported.&lt;/p&gt;

&lt;p&gt;A job that processes records should report how many it processed. A job that generates a file should report the file size. A job that sends emails should report how many it sent. You instrument the job to emit a count — one number representing meaningful work done — and your monitoring layer validates it falls within expected bounds.&lt;/p&gt;

&lt;p&gt;The failure modes this catches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero when non-zero expected&lt;/strong&gt;: sync runs, processes nothing, exits clean&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Suspiciously low counts&lt;/strong&gt;: normally syncs 500 records, today synced 3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Count drift over time&lt;/strong&gt;: weekly report used to include 10k rows, now consistently 200&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these trip a heartbeat check. All of them are real problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why most tools don't do this
&lt;/h2&gt;

&lt;p&gt;Heartbeat monitoring is architecturally simple: job pings URL, URL records timestamp, alerting checks timestamp age. The data model is just "last seen at".&lt;/p&gt;

&lt;p&gt;Output assertions require more: the job must emit structured data, the tool must store it, and the alerting logic must understand what "normal" looks like for that specific job. That's a significantly more complex product to build.&lt;/p&gt;

&lt;p&gt;Most tools solve the simpler problem because it covers the obvious failure mode and is much easier to ship.&lt;/p&gt;
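That "last seen at" data model is small enough to sketch in full. A minimal illustration (the in-memory store and grace window are my own stand-ins, not any particular tool's internals):

```python
import time

# The entire heartbeat data model: one timestamp per job.
last_seen = {}  # job name -> unix timestamp of the last ping

def record_ping(job):
    """What the ping URL does: stamp 'last seen at' for this job."""
    last_seen[job] = time.time()

def is_healthy(job, grace_seconds):
    """Green iff the job pinged within the grace window. Output is never consulted."""
    ts = last_seen.get(job)
    return ts is not None and time.time() - ts <= grace_seconds

record_ping("crm-sync")
is_healthy("crm-sync", grace_seconds=3600)  # True, even if the sync moved 0 records
```

Everything the tool knows fits in one dictionary entry, which is exactly why it can't tell a productive run from a no-op.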

&lt;h2&gt;
  
  
  How to instrument your jobs
&lt;/h2&gt;

&lt;p&gt;The instrumentation is lightweight. Pick a number that represents meaningful work and emit it at the end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Database backup — report dump file size
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pg_dump&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-Fc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mydb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/backups/mydb.dump&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;dump_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getsize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/backups/mydb.dump&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;ping_monitor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dump_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# CRM sync — report records synced
&lt;/span&gt;&lt;span class="n"&gt;synced&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sync_from_crm&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;ping_monitor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;synced&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Email campaign — report emails sent
&lt;/span&gt;&lt;span class="n"&gt;sent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;send_campaign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;campaign_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;ping_monitor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A couple of extra lines per job. The return is knowing your job didn't just run — it did something. (&lt;code&gt;ping_monitor&lt;/code&gt; is a thin wrapper around your monitoring service's ping call — the curl request shown in the next section.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Sending the count to your monitor
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://deadmancheck.io" rel="noopener noreferrer"&gt;DeadManCheck&lt;/a&gt; accepts a &lt;code&gt;count&lt;/code&gt; parameter with each ping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; &lt;span class="s2"&gt;"https://deadmancheck.io/ping/YOUR-TOKEN?count=1547"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You configure the assertion on the monitor: "alert if count is 0" or "alert if count drops below threshold". If the job checks in but reports zero records, you get alerted — even though the job technically ran fine.&lt;/p&gt;
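The assertion itself is cheap to evaluate. A sketch of the rule logic (function and parameter names are mine for illustration, not DeadManCheck's API):

```python
def count_ok(count, min_count=1, baseline=None, max_drop=0.8):
    """Validate a reported count: absolute floor plus optional drop-vs-baseline rule."""
    if count < min_count:
        return False  # "alert if count is 0" (or below a fixed floor)
    if baseline is not None and count < baseline * (1 - max_drop):
        return False  # "alert if count drops below threshold" relative to normal
    return True

count_ok(0)                  # False: sync ran but moved nothing
count_ok(3, baseline=500)    # False: normally ~500 records, today 3
count_ok(480, baseline=500)  # True: within the expected band
```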

&lt;p&gt;It also does duration monitoring with rolling average anomaly detection. If your 90-second job starts taking 45 minutes, that gets flagged too. Jobs that hang are a separate silent failure mode that output counts don't catch on their own.&lt;/p&gt;

&lt;h2&gt;
  
  
  The right question
&lt;/h2&gt;

&lt;p&gt;Monitoring that only asks "did it run?" will eventually lie to you at the worst possible moment.&lt;/p&gt;

&lt;p&gt;The right question is "did it do anything useful?" Output assertions are how you ask that question automatically, at 2am, every night, without anyone having to check.&lt;/p&gt;

&lt;p&gt;Start with your backup jobs. That's where the answer matters most.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
    <item>
      <title>A reader comment made me realise I'd only solved half the problem</title>
      <dc:creator>Kriss</dc:creator>
      <pubDate>Sat, 25 Apr 2026 13:20:52 +0000</pubDate>
      <link>https://forem.com/krissv/a-reader-comment-made-me-realise-id-only-solved-half-the-problem-3cpg</link>
      <guid>https://forem.com/krissv/a-reader-comment-made-me-realise-id-only-solved-half-the-problem-3cpg</guid>
      <description>&lt;h1&gt;
  
  
  A reader comment made me realise I'd only solved half the problem
&lt;/h1&gt;

&lt;p&gt;Last month I wrote about the cron job failure mode nobody talks about: the job that doesn't die, it just drags.&lt;/p&gt;

&lt;p&gt;The short version: a nightly ETL job at a previous employer took four hours instead of forty minutes for six days before anyone noticed. It ran. It completed. It exited zero. Every dashboard showed green. Downstream data was silently wrong.&lt;/p&gt;

&lt;p&gt;The fix I described was duration anomaly detection — once you have a few weeks of run history, you know what "normal" looks like. A job that takes 4x its baseline is a signal even if it succeeded. I built &lt;a href="https://deadmancheck.io" rel="noopener noreferrer"&gt;DeadManCheck&lt;/a&gt; partly because I couldn't find a tool that combined silence detection with duration tracking.&lt;/p&gt;

&lt;p&gt;The article got some traction. Then someone left a comment on &lt;a href="https://dev.to/krissv/the-cron-job-failure-mode-nobody-talks-about-3p1a"&gt;the original post&lt;/a&gt; that stopped me in my tracks.&lt;/p&gt;




&lt;h2&gt;
  
  
  The comment
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The failure mode I keep seeing: the job runs, logs "complete," and the output silently goes nowhere.&lt;/p&gt;

&lt;p&gt;No error. No alert. Just a cron that appeared healthy while accomplishing nothing for days.&lt;/p&gt;

&lt;p&gt;The fix that actually works is external verification. Don't check that the job ran; check that the downstream artifact exists. A job that succeeds but doesn't write the expected DB record is the same as a failed job.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They were right. And I hadn't covered it.&lt;/p&gt;

&lt;p&gt;Duration anomaly detection catches "job ran slow." Silence detection catches "job didn't run." Neither catches "job ran fine, on time, but produced nothing."&lt;/p&gt;

&lt;p&gt;That's a third failure mode entirely.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this looks like in practice
&lt;/h2&gt;

&lt;p&gt;Here's a simplified backup script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM orders WHERE exported = false&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/backups/orders.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writerows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UPDATE orders SET exported = true WHERE exported = false&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Backup complete. &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows exported.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Can you spot the bug?&lt;/p&gt;

&lt;p&gt;The script runs. It prints "Backup complete. 0 rows exported." It exits cleanly.&lt;/p&gt;

&lt;p&gt;The bug is in a migration from three weeks earlier. A developer changed the column's default so new orders are inserted with &lt;code&gt;exported = true&lt;/code&gt;. The query still runs without error; it just never matches a row again. Every night: zero rows fetched, empty CSV written, nothing updated, exit code 0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exit code: &lt;code&gt;0&lt;/code&gt;. Monitoring alert: none.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is exactly what the commenter was describing. A job that succeeds but produces nothing is functionally the same as a failed job. Your monitoring just doesn't know that yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the standard fix is hard to scale
&lt;/h2&gt;

&lt;p&gt;The commenter suggested checking the downstream artifact — verify the DB record exists, check the file isn't empty. That's the correct instinct, but it requires custom verification logic for every job. Each job writes to a different place, in a different format, with different expectations about what "something" looks like.&lt;/p&gt;
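To make the contrast concrete, here's roughly what per-job artifact verification looks like for a file-producing job (the path and thresholds are invented for illustration):

```python
import os
import time

def artifact_ok(path, min_bytes=1, max_age_seconds=86400):
    """External check: the artifact exists, is non-empty, and was written recently."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False  # the job "succeeded" but never wrote the file
    if st.st_size < min_bytes:
        return False  # an empty export is as bad as no export
    return time.time() - st.st_mtime <= max_age_seconds
```

Correct, but every job needs its own variant: a row count here, a file size there, an API call somewhere else. That's the scaling problem.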

&lt;p&gt;What I wanted was a generalised version: tell the monitoring service what your job produced, and let it decide if that's suspicious.&lt;/p&gt;

&lt;p&gt;That's what I built into DeadManCheck as output assertions.&lt;/p&gt;




&lt;h2&gt;
  
  
  How output assertions work
&lt;/h2&gt;

&lt;p&gt;The idea is simple. When your job pings the monitoring service at completion, it includes a count of what it actually did:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; &lt;span class="s2"&gt;"https://deadmancheck.io/ping/YOUR-TOKEN?count=0"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You configure a rule: "alert if count is 0 more than once in a row" or "alert if count drops more than 80% below the rolling average."&lt;/p&gt;

&lt;p&gt;The job ran. It just did nothing. Now you know.&lt;/p&gt;

&lt;p&gt;In Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ping_deadmancheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEADMANCHECK_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://deadmancheck.io/ping/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# never let monitoring break the job
&lt;/span&gt;
&lt;span class="n"&gt;rows_processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;do_the_work&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;ping_deadmancheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rows_processed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ten lines. The complexity stays in the service, not in your scripts. And unlike checking a downstream artifact, it works the same way regardless of what your job actually produces.&lt;/p&gt;




&lt;h2&gt;
  
  
  The full picture: three failure modes
&lt;/h2&gt;

&lt;p&gt;After that comment, I updated my mental model. There are three distinct ways a cron job can fail silently:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure mode&lt;/th&gt;
&lt;th&gt;What happens&lt;/th&gt;
&lt;th&gt;What catches it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Job doesn't run&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Silence. No ping arrives.&lt;/td&gt;
&lt;td&gt;Dead man's switch (silence detection)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Job runs slow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ping arrives late or after too long&lt;/td&gt;
&lt;td&gt;Duration anomaly detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Job runs, produces nothing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ping arrives on time, output is empty&lt;/td&gt;
&lt;td&gt;Output assertions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most tools only cover the first row. Some cover the first two. The third is almost always a blind spot.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I do now
&lt;/h2&gt;

&lt;p&gt;Every background job I write now has three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A counter variable tracking records processed&lt;/li&gt;
&lt;li&gt;A guard clause that exits non-zero on a zero count, for jobs where zero is never a valid outcome&lt;/li&gt;
&lt;li&gt;A heartbeat ping that includes the count
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rows_processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;do_the_work&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rows_processed&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processed 0 records — investigate before marking success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;ping_deadmancheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rows_processed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For jobs where zero is sometimes valid (quiet periods, weekends), skip the guard clause and let the monitoring service decide based on historical patterns.&lt;/p&gt;
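One way a service can make that call, sketched here as my own illustration of the idea rather than DeadManCheck's actual algorithm: compare today's count to recent history instead of a fixed floor.

```python
def suspicious(count, history, max_drop=0.8, min_baseline=10):
    """Flag a count far below the recent average; tolerate normally-quiet jobs.

    history: counts from recent runs of the same job.
    """
    if not history:
        return False  # no baseline yet: nothing to compare against
    baseline = sum(history) / len(history)
    if baseline < min_baseline:
        return False  # near-zero is normal for this job, so zero isn't alarming
    return count < baseline * (1 - max_drop)

suspicious(0, [0, 2, 0, 1])     # False: zeros are normal here
suspicious(3, [500, 510, 490])  # True: far below the usual ~500
```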




&lt;h2&gt;
  
  
  Credit where it's due
&lt;/h2&gt;

&lt;p&gt;I wouldn't have built output assertions without that comment. Sometimes the feature request hiding in a code review or a reply thread is the most valuable one you'll get.&lt;/p&gt;

&lt;p&gt;If you've got a background job running right now, ask yourself three questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Will I know if it silently stops running?&lt;/li&gt;
&lt;li&gt;Will I know if it starts taking 4x longer than normal?&lt;/li&gt;
&lt;li&gt;Will I know if it ran perfectly but accomplished nothing?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of those is "no" — that's your monitoring gap.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://deadmancheck.io" rel="noopener noreferrer"&gt;Try DeadManCheck free at deadmancheck.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>backend</category>
      <category>productivity</category>
    </item>
    <item>
      <title>The cron job failure mode nobody talks about</title>
      <dc:creator>Kriss</dc:creator>
      <pubDate>Sun, 29 Mar 2026 19:04:31 +0000</pubDate>
      <link>https://forem.com/krissv/the-cron-job-failure-mode-nobody-talks-about-3p1a</link>
      <guid>https://forem.com/krissv/the-cron-job-failure-mode-nobody-talks-about-3p1a</guid>
      <description>&lt;p&gt;A few months ago, a nightly ETL job at a previous employer nearly cost us a major client. Not because it failed. Because it took four hours instead of forty minutes — and nobody noticed for six days.&lt;/p&gt;

&lt;p&gt;The job ran. It completed. It exited zero. Every monitoring dashboard showed green. Meanwhile, the downstream data pipeline was ingesting half-processed records, and reports were silently wrong. By the time a client flagged it, we had six days of corrupted reporting to unpick.&lt;/p&gt;

&lt;p&gt;This is the failure mode nobody talks about: the job that doesn't die, it just... drags.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why your existing monitoring misses it
&lt;/h2&gt;

&lt;p&gt;If you're using Healthchecks.io, Better Uptime, or a similar dead man's switch tool, here's how it works: your cron job pings a URL at the end of each run. If the ping doesn't arrive within a grace window, you get an alert.&lt;/p&gt;

&lt;p&gt;That's genuinely useful. It catches jobs that crash, hang indefinitely, or never start. But what it doesn't catch is a job that completes in 240 minutes when it should take 45. The ping arrives. The check passes. Everything looks fine. The tool has no idea what "normal" looks like for that job — it only knows silence vs. noise.&lt;/p&gt;

&lt;p&gt;Duration anomaly detection is the missing piece.&lt;/p&gt;

&lt;h2&gt;
  
  
  What duration anomaly detection actually means
&lt;/h2&gt;

&lt;p&gt;The concept is simple: instead of only checking whether a job completed, you also check how long it took.&lt;/p&gt;

&lt;p&gt;Once you have a few weeks of run history, you know that your nightly job usually takes 40–50 minutes. So when it takes four hours, that's a signal — even if it succeeded. Something changed: the dataset grew, a dependency got slow, a query plan degraded, a network hop started timing out and retrying.&lt;/p&gt;

&lt;p&gt;Catching this early means you can investigate before it causes damage downstream.&lt;/p&gt;

&lt;h2&gt;
  
  
  The /start + /finish pattern
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Job begins
curl -s "https://deadmancheck.io/ping/abc123/start"

# ... your actual job logic ...

# Job ends
curl -s "https://deadmancheck.io/ping/abc123"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now the monitoring service knows: this run started at T, it ended at T+4h. It compares that against the rolling average of previous runs and alerts if the duration exceeds a configurable threshold — say, 2x the usual runtime. Two curl calls. The complexity lives in the service, not in your scripts.&lt;/p&gt;
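Server-side, the comparison is the same shape as any baseline check: weigh the latest run against history. A sketch using the 2x-baseline example above (function and variable names are mine, not the service's internals):

```python
def duration_anomalous(duration_s, history_s, factor=2.0, min_runs=5):
    """Alert when a run takes more than `factor` times the rolling average duration."""
    if len(history_s) < min_runs:
        return False  # not enough history yet to define "normal"
    baseline = sum(history_s) / len(history_s)
    return duration_s > baseline * factor

history = [2400, 2700, 2500, 2600, 2550]  # a 40-50 minute nightly job, in seconds
duration_anomalous(14400, history)  # True: four hours against a ~42-minute baseline
duration_anomalous(2650, history)   # False: an ordinary run
```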

&lt;h2&gt;
  
  
  Why this matters more as systems age
&lt;/h2&gt;

&lt;p&gt;New jobs are fast. As systems mature, things get slower in ways that creep up on you. Rows accumulate. Indexes bloat. Third-party APIs introduce latency. Your job that took 8 minutes in January takes 35 minutes in October.&lt;/p&gt;

&lt;p&gt;Without duration tracking, you have no visibility into this degradation. With it, you have a canary. The alert fires at 70 minutes, you investigate, you find the index that needs rebuilding. Crisis averted before the downstream effects compound.&lt;/p&gt;

&lt;h2&gt;
  
  
  So I built this
&lt;/h2&gt;

&lt;p&gt;I couldn't find a tool that combined silence detection with duration anomaly detection, so I built DeadManCheck (deadmancheck.io). It supports the /start + /finish pattern, tracks rolling run history, and alerts you when a job takes significantly longer than its baseline. Standard silence detection is included too, so both failure modes are covered in one place.&lt;/p&gt;

&lt;p&gt;Free tier available, no credit card required.&lt;/p&gt;

&lt;h2&gt;
  
  
  The checklist
&lt;/h2&gt;

&lt;p&gt;Next time you wire up a cron job, ask yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Will I know if this job silently stops running?&lt;/li&gt;
&lt;li&gt;Will I know if this job starts taking 4x longer than normal?&lt;/li&gt;
&lt;li&gt;Will I know before my users do?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer to any of those is "no", you have a monitoring gap. It's a small one to close.&lt;/p&gt;

&lt;p&gt;→ Try DeadManCheck free at deadmancheck.io&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>backend</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
