<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Myroslav Vivcharyk</title>
    <description>The latest articles on Forem by Myroslav Vivcharyk (@devgeist).</description>
    <link>https://forem.com/devgeist</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2895644%2Fe0ecbeae-8773-45ef-8cba-4839d9a5854f.jpeg</url>
      <title>Forem: Myroslav Vivcharyk</title>
      <link>https://forem.com/devgeist</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/devgeist"/>
    <language>en</language>
    <item>
      <title>Ordered Retries in Kafka: Why You Probably Shouldn't Build This</title>
      <dc:creator>Myroslav Vivcharyk</dc:creator>
      <pubDate>Tue, 24 Mar 2026 15:10:00 +0000</pubDate>
      <link>https://forem.com/devgeist/ordered-retries-in-kafka-why-you-probably-shouldnt-build-this-e3l</link>
      <guid>https://forem.com/devgeist/ordered-retries-in-kafka-why-you-probably-shouldnt-build-this-e3l</guid>
      <description>&lt;p&gt;I once spent three weeks in architectural meetings arguing that we should insert SQS between two Kafka topics.&lt;/p&gt;

&lt;p&gt;I wasn't doing it because I liked SQS. I was doing it because the "clean" Kafka-native retry logic felt like a complexity bomb our team wasn't ready to defuse.&lt;/p&gt;

&lt;p&gt;As engineers, our job isn't just to build clever systems for the next promotion cycle. Our job is to build systems that don't set the on-call engineer's life on fire at 3 AM.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://devgeist.com/kafka-ordered-retries-1" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt; and &lt;a href="https://devgeist.com/kafka-ordered-retries-2" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt;, we built a &lt;a href="https://en.wikipedia.org/wiki/Rube_Goldberg_machine" rel="noopener noreferrer"&gt;Rube Goldberg machine&lt;/a&gt; of distributed locks, compacted topics, and reference counting. It works. It maintains strict ordering during failures without blocking entire partitions.&lt;/p&gt;

&lt;p&gt;Before you import &lt;a href="https://github.com/vmyroslav/kafka-resilience" rel="noopener noreferrer"&gt;my library&lt;/a&gt; or build your own, let's talk about the engineering hangover that comes after the party.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Technical Costs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The Cold Start Penalty
&lt;/h3&gt;

&lt;p&gt;Your service was stateless. Kubernetes could kill pods and spin up new ones in seconds. Not anymore.&lt;/p&gt;

&lt;p&gt;Now your application carries state. That state lives in the lock topic. On startup, every pod must:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Connect to Kafka&lt;/li&gt;
&lt;li&gt;Read the lock topic from the beginning&lt;/li&gt;
&lt;li&gt;Rebuild the in-memory map before processing anything&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With 100 active locks? Instant. With 500,000 lock records after a bad day — every lock and unlock is a separate record, your pod startup goes from seconds to minutes.&lt;/p&gt;

&lt;p&gt;Here's where it gets ugly: your Horizontal Pod Autoscaler sees growing lag and spins up more pods. Those pods sit there "syncing state..." while the lag grows. The new pods can't help because they're still starting up. You've created a scaling death spiral.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggressive retention on the lock topic (minutes, not days)&lt;/li&gt;
&lt;li&gt;Tune &lt;code&gt;max.poll.interval.ms&lt;/code&gt; to accommodate sync time&lt;/li&gt;
&lt;li&gt;Use this pattern only for low-cardinality keys&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. The Memory Pressure
&lt;/h3&gt;

&lt;p&gt;The lock map itself is small — a Go map with 100,000 string keys and integer counters fits in single-digit megabytes. That's not the problem.&lt;/p&gt;

&lt;p&gt;The problem is what happens upstream. During a failure storm, every failed message triggers a lock write and a retry redirect. The lock topic and retry topic both grow rapidly. Your Kafka consumers buffer incoming records in memory before processing. The lock consumer, replaying a suddenly-huge topic during a restart, can spike memory well beyond steady-state expectations.&lt;/p&gt;

&lt;p&gt;Combine this with Cost #1: your pod OOM-kills during state restore, restarts, tries again, OOM-kills again. Crash loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Circuit breaker: if lock count exceeds threshold, fall back to stop-the-world&lt;/li&gt;
&lt;li&gt;Set memory limits with headroom for failure storms, not steady-state&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Topic Overhead
&lt;/h3&gt;

&lt;p&gt;You wanted one topic: &lt;code&gt;orders&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To make ordered retries work, you now operate four:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;orders&lt;/code&gt; — main topic&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;orders-retry&lt;/code&gt; — the retry queue&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;orders-lock&lt;/code&gt; — the compacted state topic&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;orders-dlq&lt;/code&gt; — dead letters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;200 topics across 50 services won't bring your Kafka cluster down — production clusters run thousands. The cost is operational: each topic needs provisioning, monitoring, retention policies, and alerting. Your SRE team will love you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only use this pattern for topics that truly need it&lt;/li&gt;
&lt;li&gt;Budget the operational overhead before committing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Zombie Lock Operations
&lt;/h3&gt;

&lt;p&gt;In Part 2, we discussed zombie locks — keys locked forever because the retry message never made it. The fix was compensating transactions and careful ordering.&lt;/p&gt;

&lt;p&gt;But "rare" still means "happens." When it does, you need to detect and clean it up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detection:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alert on locks older than N hours&lt;/li&gt;
&lt;li&gt;Compare lock topic keys against retry topic — orphans are zombies&lt;/li&gt;
&lt;li&gt;Dashboard showing lock age distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cleanup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manual tombstone producer (you'll need the CLI tool)&lt;/li&gt;
&lt;li&gt;Automated reaper: background job that scans for stale locks and releases them&lt;/li&gt;
&lt;li&gt;TTL simulation: treat locks older than X hours as expired&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reaper sounds easy until you realize: how do you know a lock is stale vs. just slow? A retry with exponential backoff might legitimately take hours. You need metadata — timestamps, attempt counts — in the lock record itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; Do you have this tooling built? Do you have observability for it? If not, you're one zombie lock away from a very long night.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3 AM Test
&lt;/h2&gt;

&lt;p&gt;For me, this matters more than the technical drawbacks — even though they're intertwined.&lt;/p&gt;

&lt;p&gt;Before building any complex system, try to write the runbook for it.&lt;/p&gt;

&lt;p&gt;A downstream team calls you: "Why is Order-123 not updating?"&lt;/p&gt;

&lt;p&gt;How many hops to find the answer?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check the main consumer logs&lt;/li&gt;
&lt;li&gt;See a "key locked, redirecting" message&lt;/li&gt;
&lt;li&gt;Query the lock topic to confirm the lock exists&lt;/li&gt;
&lt;li&gt;Query the retry topic to find the original failure&lt;/li&gt;
&lt;li&gt;Realize the lock is 4 hours old and the retry message... isn't there&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now what? Do you have tooling to manually produce a tombstone? Do you know which partition that key hashes to? Can you do this at 3 AM, half-asleep, with Slack pinging?&lt;/p&gt;

&lt;p&gt;Here's the kicker: the person debugging this at 3 AM might not be you. In many organizations, the on-call rotation is a separate team — engineers who've never seen your code, who are juggling alerts from a dozen services. Can they navigate your distributed lock registry with a runbook, or are they going to escalate and wake you up anyway?&lt;/p&gt;

&lt;p&gt;And there's another problem: &lt;strong&gt;Kafka doesn't have key-based lookup&lt;/strong&gt;. You can't just say "show me the lock for Order-123." To find a specific key, you either scan the entire topic or build custom tooling that maintains an index. Does your company have that tooling? Will you build and maintain it? How will this system behave during a global failure when hundreds of keys are locked simultaneously?&lt;/p&gt;

&lt;p&gt;If you can't visualize this "penalty box" of locked keys — which keys are locked, for how long, and why — you haven't built a resilient system. You've built a black box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rule:&lt;/strong&gt; If you can't write the runbook, you can't operate the system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The alerts that matter:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;acquired - released&lt;/code&gt; growing over time → locks aren't clearing, potential zombies&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;state_restore.duration&lt;/code&gt; approaching &lt;code&gt;max.poll.interval.ms&lt;/code&gt; → rebalance loop incoming (see Part 2)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dlq.enqueued&lt;/code&gt; &amp;gt; 0 → manual intervention required&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;redirected&lt;/code&gt; spike without corresponding &lt;code&gt;retry.processed&lt;/code&gt; increase → retries aren't succeeding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this, you're flying blind. Build the dashboards before you ship the feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before You Build This
&lt;/h2&gt;

&lt;p&gt;Before reaching for distributed locks, exhaust the simpler options.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: Make It Commutative
&lt;/h3&gt;

&lt;p&gt;I know — in an existing system, this is easy to say and almost impossible to do. But if you're building something new, ask: do you &lt;em&gt;really&lt;/em&gt; need strict ordering?&lt;/p&gt;

&lt;p&gt;Sometimes you can redesign the problem away. Store deltas instead of absolute values. Use event sourcing with snapshots and recalculate on late arrivals. Add idempotency keys so duplicates and reordering become harmless. These aren't always possible. But when they are, they eliminate the problem entirely rather than managing it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: External State Store (Redis/Dragonfly)
&lt;/h3&gt;

&lt;p&gt;Most of you reading this probably thought: "Why not just use Redis and skip all this Kafka-native complexity?"&lt;/p&gt;

&lt;p&gt;Fair point. Redis gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Native TTL&lt;/strong&gt; — zombie locks expire automatically, no reaper needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instant startup&lt;/strong&gt; — no cold start problem, no state to rebuild&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast lookups&lt;/strong&gt; — no consumer lag to worry about, O(1) key access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Familiar tooling&lt;/strong&gt; — your team probably already knows how to operate Redis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But this isn't free either. You're adding a new infrastructure dependency. Your Kafka consumer now fails if Redis is down. Network latency on every lock check adds up. And now you're debugging two systems instead of one.&lt;/p&gt;

&lt;p&gt;A word on &lt;a href="https://redis.io/docs/latest/develop/clients/patterns/distributed-locks/" rel="noopener noreferrer"&gt;Redlock&lt;/a&gt;: if you're considering distributed locking with Redis, read &lt;a href="https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html" rel="noopener noreferrer"&gt;Martin Kleppmann's analysis&lt;/a&gt; first. Distributed locking is hard regardless of the tool. Redis doesn't make the fundamental problem easier — it makes the &lt;em&gt;infrastructure&lt;/em&gt; easier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; If I were starting fresh today and needed ordered retries with blocking semantics, I'd reach for Redis/Dragonfly before building the Kafka-native solution we covered in Parts 1 and 2. Redis doesn't eliminate complexity — it moves it somewhere your team probably already understands.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 3: Accept the Trade-off
&lt;/h3&gt;

&lt;p&gt;Sometimes the "stop-the-world" approach from Part 1 is actually fine.&lt;/p&gt;

&lt;p&gt;If your topic has low throughput — blocking the partition during a retry might be acceptable. The complexity of distributed locks only pays off when throughput requirements make blocking unacceptable.&lt;/p&gt;

&lt;p&gt;Run the numbers. If your worst-case retry takes 5 minutes and your partition processes 100 messages per hour, you're blocking 8 messages. Is that worth an operational burden?&lt;/p&gt;

&lt;h2&gt;
  
  
  When It's Worth It
&lt;/h2&gt;

&lt;p&gt;After all that, there are cases where this pattern might be the right call:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Strict ordering is contractual&lt;/strong&gt; — Downstream systems will break or charge you money if events arrive out of order. Not "might cause weird bugs" — actually break, with consequences.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stop-the-world is unacceptable&lt;/strong&gt; — Throughput requirements mean you can't block a partition for minutes. You've measured this, not assumed it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You have the tooling for debugging and observability&lt;/strong&gt; — Your team can see which keys are locked, for how long, and why. You can produce a tombstone manually. An SRE can follow the runbook at 3 AM without escalating.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Your team understands it&lt;/strong&gt; — The next engineer can debug it without reading three blog posts. You've documented the architecture, the failure modes, and the recovery procedures.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If all four are true, go ahead. The &lt;a href="https://github.com/vmyroslav/kafka-resilience" rel="noopener noreferrer"&gt;kafka-resilience&lt;/a&gt; library handles the edge cases we covered in Part 2.&lt;/p&gt;

&lt;p&gt;If any are false, reconsider.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tax
&lt;/h2&gt;

&lt;p&gt;Engineering is the art of choosing which problems you want to have.&lt;/p&gt;

&lt;p&gt;This solution solves the ordering problem. In exchange, it gives you the state management problem, the cold start problem, the memory pressure problem, and the 3 AM debugging problem.&lt;/p&gt;

&lt;p&gt;The technical challenges are solvable. The question is whether you should be the one solving them — or whether a simpler solution, even an imperfect one, is the wiser choice for your team and your system.&lt;/p&gt;

&lt;p&gt;Build the clever thing when you've exhausted the simple options. Not before.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>go</category>
      <category>eventdriven</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Ordered Retries in Kafka: The Bugs You'll Find in Production</title>
      <dc:creator>Myroslav Vivcharyk</dc:creator>
      <pubDate>Tue, 17 Mar 2026 15:10:00 +0000</pubDate>
      <link>https://forem.com/devgeist/ordered-retries-in-kafka-the-bugs-youll-find-in-production-3d5a</link>
      <guid>https://forem.com/devgeist/ordered-retries-in-kafka-the-bugs-youll-find-in-production-3d5a</guid>
      <description>&lt;p&gt;&lt;a href="https://devgeist.com/kafka-ordered-retries-1" rel="noopener noreferrer"&gt;In Part 1&lt;/a&gt;, we introduced a "middle ground" for Kafka retries: when a message fails, lock its key, send the message to a retry topic, and let the main partition continue processing other keys. If you need a recap, the &lt;a href="https://www.confluent.io/blog/error-handling-patterns-in-kafka/#pattern-4" rel="noopener noreferrer"&gt;Confluent blog&lt;/a&gt; covers the pattern.&lt;/p&gt;

&lt;p&gt;Simple on a whiteboard. In production, things break in ways the diagrams don't show.&lt;/p&gt;

&lt;p&gt;This article covers implementation gaps that tutorials skip. The code examples are from &lt;a href="https://github.com/vmyroslav/kafka-resilience" rel="noopener noreferrer"&gt;kafka-resilience&lt;/a&gt; — simplified for clarity, but the edge cases are real.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Recap
&lt;/h2&gt;

&lt;p&gt;The goal: when a message fails, block only that key. Other keys keep flowing.&lt;/p&gt;

&lt;p&gt;To make this work across multiple consumer instances, we need three topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Main Topic&lt;/strong&gt; — Your business events. The consumer checks if a key is locked before processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retry Topic&lt;/strong&gt; — Messages land here either because they failed or because a predecessor with the same key failed. A separate consumer processes them with backoff. When a message succeeds here, it releases the lock.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lock Topic&lt;/strong&gt; (compacted) — The shared lock registry. When Instance A locks "Order-123", it writes here. Instance B reads this topic to learn about locks it didn't create.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each instance runs &lt;strong&gt;two consumer loops&lt;/strong&gt;: one for business messages (main or retry), and one that continuously reads the lock topic to keep the local lock map in sync.&lt;/p&gt;

&lt;p&gt;Here's how they connect:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5j834zbr8kat0xd9a0b7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5j834zbr8kat0xd9a0b7.png" alt="Mermaid_1" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's where it breaks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 1: The Dual-Write
&lt;/h2&gt;

&lt;p&gt;When a message fails, you need to do two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Write a lock to the lock topic&lt;/li&gt;
&lt;li&gt;Write the payload to the retry topic&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the first succeeds but the second fails, you've created a &lt;strong&gt;zombie lock&lt;/strong&gt; — the key is locked forever, but there's no message in the retry queue to unlock it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjy0mzw56iz7qa415mx0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjy0mzw56iz7qa415mx0.png" alt="Mermaid 2" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution A: Kafka Transactions
&lt;/h3&gt;

&lt;p&gt;The obvious solution is Kafka transactions — wrap both writes in a transaction so they either both succeed or both fail.&lt;/p&gt;

&lt;p&gt;Most modern clients support this, even outside the Java ecosystem. If your setup allows it, this is a valid approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why I didn't use it:&lt;/strong&gt; Kafka transactions aren't just a flag you flip. They require a transactional producer, which means a dedicated transaction coordinator on the broker side. Every transactional write follows an &lt;code&gt;init → begin → produce → commit&lt;/code&gt; cycle — that's at least two extra round-trips per message. With &lt;code&gt;acks=all&lt;/code&gt; across availability zones, each round-trip adds latency.&lt;/p&gt;

&lt;p&gt;Under normal load, this overhead is manageable. During a failure storm — exactly when your retry system is under heavy load — it compounds. If 10,000 messages fail in a burst, you're running 10,000 transactions, each waiting for cross-AZ acknowledgments before committing. The retry system becomes the bottleneck at the worst possible moment.&lt;/p&gt;

&lt;p&gt;Transactions are the correct solution if your broker topology can absorb the overhead. For high-throughput systems, there's a lighter alternative.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution B: Compensating Transaction
&lt;/h3&gt;

&lt;p&gt;Instead, I used a &lt;strong&gt;compensating transaction&lt;/strong&gt;: acquire the lock first, then try to write the payload. If the write fails, undo the lock by writing a release.&lt;/p&gt;

&lt;p&gt;When processing fails, the &lt;code&gt;Tracker&lt;/code&gt; calls &lt;code&gt;redirect&lt;/code&gt; to lock the key and forward the message to the retry topic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Tracker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Step 1: Acquire lock first&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coordinator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Acquire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failed to acquire lock: %w"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Step 2: Publish to retry topic&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;publisher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retryTopic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// Step 3: Compensate on failure — undo the lock&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;releaseErr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coordinator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Release&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;releaseErr&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"compensation failed - potential zombie lock"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="s"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;releaseErr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failed to publish to retry: %w"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The trade-off&lt;/strong&gt;: if the process crashes between acquiring the lock and completing the compensation, you get a zombie. Rare, but possible. For those edge cases, you'll need a background "reaper" to clean stale locks. We'll cover that in Part 3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Whichever solution you choose, the remaining problems still apply.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 2: The Reference Counting Gap
&lt;/h2&gt;

&lt;p&gt;Most descriptions of this pattern treat the retry topic as a queue of failed messages. It's not. It also holds messages that never failed — they were redirected because a predecessor with the same key did. This distinction matters now.&lt;/p&gt;

&lt;p&gt;Watch this sequence carefully:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T1: Event A (Key: 123) fails
    → Lock Key 123
    → Messages in retry: 1

T2: Event B (Key: 123) arrives, sees lock
    → Redirect to retry topic (not failed — just blocked)
    → Messages in retry: 2

T3: Event A retries successfully
    → Release lock... but HOW?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If "release" means "delete the lock," you've just broken ordering. Event &lt;strong&gt;B&lt;/strong&gt; is still sitting in the retry topic. But the lock is gone. When Event &lt;strong&gt;C&lt;/strong&gt; arrives for &lt;em&gt;Key 123&lt;/em&gt;, it processes immediately — before &lt;strong&gt;B&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T4: Event C (Key: 123) arrives
    → Check lock: NOT LOCKED (wrong!)
    → Processes immediately
    → C processed before B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The lock isn't a boolean. It's a counter.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each instance maintains a &lt;code&gt;LocalCoordinator&lt;/code&gt; — an in-memory map counting pending retries per key. Later, we'll wrap this with a &lt;code&gt;KafkaCoordinator&lt;/code&gt; that broadcasts state changes to other instances.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;LocalCoordinator&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;locks&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;  &lt;span class="c"&gt;// key → reference count&lt;/span&gt;
    &lt;span class="n"&gt;mu&lt;/span&gt;    &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RWMutex&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;LocalCoordinator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Acquire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;locks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;  &lt;span class="c"&gt;// Increment&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;LocalCoordinator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Release&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;locks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;locks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nb"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;locks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;// Only free when count hits zero&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;LocalCoordinator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;IsLocked&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RLock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RUnlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;locks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the sequence works correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T1: Event A fails      → count = 1
T2: Event B redirects  → count = 2
T3: Event A succeeds   → count = 1 (still locked!)
T4: Event C arrives    → blocked (correct!)
T5: Event B succeeds   → count = 0 (now free)
T6: Event C proceeds   → processes in correct order
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main topic stays blocked for Key 123 until &lt;em&gt;every&lt;/em&gt; pending message clears — both the ones that actually failed and the ones that were redirected to preserve order.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 3: The Self-Consumption Trap
&lt;/h2&gt;

&lt;p&gt;Reference counting works locally. But we're broadcasting every lock change to Kafka so other instances can learn about it. And Kafka doesn't filter — you'll also consume your own messages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwz2x7xnjjuf5i1zp374.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwz2x7xnjjuf5i1zp374.png" alt="Mermaid" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instance A applies the lock twice: once when it acquires locally, once when it consumes its own broadcast. The reference count is now wrong. When the retry succeeds, the count drops to 1 instead of 0. The key stays locked forever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix: tag every message with an origin ID and ignore your own.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;HeaderCoordinatorID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"x-coordinator-id"&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;KafkaCoordinator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Acquire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Acquire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  &lt;span class="c"&gt;// Apply locally first&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lockTopic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;LockMessage&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;Headers&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;HeaderCoordinatorID&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;instanceID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c"&gt;// Tag with our ID&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;KafkaCoordinator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;processLockMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;originID&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Headers&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HeaderCoordinatorID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Skip our own messages&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;originID&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;instanceID&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Apply locks from other instances&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Acquire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Release locks for tombstones&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Release&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now each instance only applies &lt;em&gt;foreign&lt;/em&gt; locks from the broadcast.&lt;/p&gt;

&lt;p&gt;One detail: the instance ID must be stable across restarts. If a pod restarts and gets a new ID, it won't recognize its own pre-crash messages during replay and will double-count them. Use something deterministic — the consumer group member ID or a stable pod identifier — not a random UUID generated at startup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 4: The Compaction Collision
&lt;/h2&gt;

&lt;p&gt;The lock topic is compacted — Kafka periodically deduplicates records, keeping only the latest value per key. Great for storage. Dangerous for counting.&lt;/p&gt;

&lt;p&gt;Here's the trap: compaction doesn't happen instantly. It runs in the background, whenever Kafka decides. So your system works fine... until it doesn't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: Three failures for the same key

T1: Event A fails  → Produce Key="Order-123", Value="LOCKED"
T2: Event B fails  → Produce Key="Order-123", Value="LOCKED"
T3: Event C fails  → Produce Key="Order-123", Value="LOCKED"

Local state: count = 3 ✓ (correct, you saw all three)

Hours later, Kafka compaction runs.
Three records with Key="Order-123" → collapsed into one.

T4: Pod restarts, replays lock topic to rebuild state.

Consumer sees one "LOCKED" record. Count = 1. (wrong!)

T5: Event A succeeds → Tombstone deletes the only record.

Count = 0. System thinks Order-123 is free.
But B and C are still in the retry queue.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The local in-memory state was correct. The problem only surfaces after a restart — when you replay the compacted topic and your counts are silently wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix: use a unique ID for each lock operation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By using a UUID as the Kafka key and putting the business key in headers, you force Kafka to keep all three lock records distinct. After restart, the consumer replays three separate records and counts correctly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;KafkaCoordinator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Acquire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;lockID&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c"&gt;// Unique per lock operation&lt;/span&gt;

    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Acquire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lockTopic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;LockMessage&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;lockID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c"&gt;// Kafka key = UUID (unique, never collides)&lt;/span&gt;
        &lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;lockID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c"&gt;// Lock marker&lt;/span&gt;
        &lt;span class="n"&gt;Headers&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"business-key"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;  &lt;span class="c"&gt;// Actual key in headers&lt;/span&gt;
            &lt;span class="n"&gt;HeaderCoordinatorID&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;instanceID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The consuming side reads the &lt;code&gt;business-key&lt;/code&gt; header to know which business key the lock applies to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trade-off you need to understand:&lt;/strong&gt; UUID keys mean compaction can never deduplicate lock records — every record has a unique key by definition. The lock topic now grows with every lock and unlock operation. With high failure rates, this topic can grow fast, which directly impacts cold start time (Problem 5) and memory usage (covered in Part 3). Tune your retention aggressively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 5: The Rebalancing Race
&lt;/h2&gt;

&lt;p&gt;Pod A crashes. Kubernetes starts Pod B. Kafka assigns Partition 0 to Pod B.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzerdosj3qzxc7mfjisq0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzerdosj3qzxc7mfjisq0.png" alt="Mermaid" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pod B's local state is stale. It hasn't finished reading the lock topic, so it doesn't know &lt;em&gt;Order-123&lt;/em&gt; is locked. It processes messages out of order.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Synchronization Barrier
&lt;/h3&gt;

&lt;p&gt;A consumer must not process messages until its view of the lock state is current.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;KafkaCoordinator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Synchronize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Step 1: Ask broker for the high water mark (latest offset)&lt;/span&gt;
    &lt;span class="n"&gt;hwm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;admin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetHighWaterMark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lockTopic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failed to get high water mark: %w"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Step 2: Wait until our local consumer catches up&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;currentOffset&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lockConsumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CurrentOffset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;currentOffset&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;hwm&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;  &lt;span class="c"&gt;// Caught up, safe to proceed&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;After&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Millisecond&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="c"&gt;// Keep waiting&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hook this into the consumer's setup phase:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Setup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="n"&gt;sarama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConsumerGroupSession&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Block until lock state is current&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coordinator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Synchronize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failed to synchronize state: %w"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now a pod never "guesses" the lock state. It waits until it &lt;em&gt;knows&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Rebalance Loop
&lt;/h3&gt;

&lt;p&gt;This barrier has teeth. Every time a pod restarts or Kafka rebalances partitions, the consumer blocks until it catches up on the lock topic. With thousands of active locks, this adds seconds to startup. With millions — especially if the lock topic has grown due to the UUID trade-off from Problem 4 — it can trigger a rebalance loop: you exceed &lt;code&gt;max.poll.interval.ms&lt;/code&gt; waiting to sync, Kafka kicks you out, the next pod inherits the partition and does the same thing, and the partition becomes effectively frozen.&lt;/p&gt;

&lt;p&gt;For high-churn systems, you need to tune retention on the lock topic aggressively, or move to an external state store (covered in Part 3).&lt;/p&gt;

&lt;h3&gt;
  
  
  What About New Locks During Sync?
&lt;/h3&gt;

&lt;p&gt;New locks may arrive after the high water mark snapshot. But this doesn't cause a race — and the reason is worth understanding.&lt;/p&gt;

&lt;p&gt;The lock is always written &lt;em&gt;before&lt;/em&gt; the failed message gets redirected to the retry topic, and the main consumer commits its offset &lt;em&gt;after&lt;/em&gt; the redirect. So the sequence is: lock written → message redirected → offset committed. When Pod B takes over the partition, it starts consuming from the last committed offset. Any message that needs a lock was already redirected by Pod A before the offset advanced. Pod B won't encounter that message on the main topic — it's already in the retry topic.&lt;/p&gt;

&lt;p&gt;For messages that arrive &lt;em&gt;after&lt;/em&gt; Pod B starts, the lock consumer is already tailing the lock topic continuously. Since the lock is written before the redirect, Pod B will see the lock before the redirected message could possibly arrive.&lt;/p&gt;

&lt;h3&gt;
  
  
  What About the Retry Consumer?
&lt;/h3&gt;

&lt;p&gt;The retry consumer doesn't need a sync barrier. Its job is to process messages that are &lt;em&gt;already&lt;/em&gt; in the retry topic and release locks when they succeed. It doesn't check whether a key is locked — every message in the retry topic is there precisely because the key is (or was) locked. It processes sequentially within each partition, releases on success, and re-publishes on failure. No lock lookup needed, no stale state risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Implementing ordered retries isn't routing messages to different topics. It's building a distributed state machine on top of Kafka — and distributed state is never consistent when you need it to be.&lt;/p&gt;

&lt;p&gt;The pattern looks clean on a whiteboard. This is what happens when you actually build it.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Dual-Write&lt;/strong&gt; — Use transactions or compensating actions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reference Counting&lt;/strong&gt; — Locks are counters, not booleans&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-Consumption&lt;/strong&gt; — Tag messages with origin ID, ignore your own&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compaction Collision&lt;/strong&gt; — UUID per lock, business key in headers (but this disables compaction deduplication)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rebalancing Race&lt;/strong&gt; — Sync barrier before processing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The full implementation is in &lt;a href="https://github.com/vmyroslav/kafka-resilience" rel="noopener noreferrer"&gt;kafka-resilience&lt;/a&gt;. The code snippets here are simplified — the real version handles more edge cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In Part 3&lt;/strong&gt;, we'll look at the operational cost: cold starts, memory pressure, and why you might choose a simpler solution instead.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>go</category>
      <category>eventdriven</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Ordered Retries in Kafka: Why The Retry Topic Is Breaking Downstream</title>
      <dc:creator>Myroslav Vivcharyk</dc:creator>
      <pubDate>Tue, 10 Mar 2026 15:10:00 +0000</pubDate>
      <link>https://forem.com/devgeist/ordered-retries-in-kafka-why-the-retry-topic-is-breaking-downstream-a31</link>
      <guid>https://forem.com/devgeist/ordered-retries-in-kafka-why-the-retry-topic-is-breaking-downstream-a31</guid>
      <description>&lt;p&gt;I was on-call when the P1 hit. Slack lighting up, a downstream team's database rejecting writes. Constraint violations everywhere.&lt;/p&gt;

&lt;p&gt;I pulled up the logs. The events looked fine. The schema was correct. The data made sense. Except the timestamps didn't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10:00:01 - Event 1: ADD 100 units to SKU-123
10:00:03 - Event 2: REMOVE 50 units from SKU-123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the sequence that happened in the real world. Here's what the consumers saw:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10:00:01 - Event 1: ADD 100 → FAILED (API timeout) → sent to retry topic
10:00:03 - Event 2: REMOVE 50 → FAILED (CHECK constraint: quantity &amp;lt; 0)
10:03:01 - Event 1: ADD 100 → SUCCESS (from retry topic, 3 minutes late)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The downstream inventory table had &lt;code&gt;CHECK (quantity &amp;gt;= 0)&lt;/code&gt;. Because &lt;em&gt;Event 2&lt;/em&gt; arrived before &lt;em&gt;Event 1&lt;/em&gt; was retried, the system tried to remove stock that didn't exist yet. PostgreSQL said no.&lt;/p&gt;

&lt;p&gt;The retry logic did exactly what it was designed to do — keep the pipeline moving by shoving failed messages aside. In doing so, it broke the ordering that every downstream consumer depended on.&lt;/p&gt;

&lt;p&gt;The postmortem took two days. The fix took even more. The root cause? &lt;strong&gt;Your retry decisions propagate to every downstream consumer.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Downstream Ripple Effect
&lt;/h2&gt;

&lt;p&gt;When you consume from a topic and publish to another, you aren't just moving data. You're a guardian of ordering semantics for everyone downstream.&lt;/p&gt;

&lt;p&gt;The team consuming your enriched stream might be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Building a state machine&lt;/strong&gt; — order status must flow &lt;code&gt;CREATED → PAID → SHIPPED&lt;/code&gt;, never backwards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calculating running totals&lt;/strong&gt; — account balances or inventory counts where &lt;code&gt;State(N)&lt;/code&gt; depends on &lt;code&gt;State(N-1)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforcing causality&lt;/strong&gt; — you can't update a record that hasn't been "created" yet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The worst part? You won't know you've broken ordering until production data starts drifting. Or worse — a 2 AM call about corrupted state that's been silently building for weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Two Bad Options
&lt;/h2&gt;

&lt;p&gt;If your operations are idempotent and commutative — &lt;code&gt;A then B&lt;/code&gt; gives the same result as &lt;code&gt;B then A&lt;/code&gt; — the standard retry-topic-and-DLQ pattern works fine. &lt;a href="https://www.uber.com/blog/reliable-reprocessing/" rel="noopener noreferrer"&gt;Uber does it&lt;/a&gt;. &lt;a href="https://www.confluent.io/blog/error-handling-patterns-in-kafka/#pattern-3" rel="noopener noreferrer"&gt;Confluent shows it&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But for stateful systems where order matters, you're stuck choosing between two bad options:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 1: Stop the World.&lt;/strong&gt; Don't commit the offset. Retry in place until it succeeds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ConsumeClaim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="n"&gt;sarama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConsumerGroupSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;claim&lt;/span&gt; &lt;span class="n"&gt;sarama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConsumerGroupClaim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;claim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Messages&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backoff&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Next&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MarkMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Order is preserved. But one poison pill blocks the entire partition — if offset 10 fails for three hours, offsets 11 through 1,000,000 sit waiting. Your client's background heartbeat thread won't save you — &lt;code&gt;max.poll.interval.ms&lt;/code&gt; is a separate timer, and if you exceed it, Kafka assumes your consumer is livelocked, kicks it from the group, and triggers a rebalance storm.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 2: Skip and Pray.&lt;/strong&gt; Send the failed message to a retry topic. Commit. Keep going.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ConsumeClaim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="n"&gt;sarama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConsumerGroupSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;claim&lt;/span&gt; &lt;span class="n"&gt;sarama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConsumerGroupClaim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;claim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Messages&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sendToRetryTopic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MarkMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Throughput maintained. Ordering sacrificed. This is what caused my P1.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Third Way
&lt;/h2&gt;

&lt;p&gt;We need the availability of Option 2 with the ordering guarantees of Option 1.&lt;/p&gt;

&lt;p&gt;Picture a restaurant kitchen during dinner rush. &lt;em&gt;Table 5&lt;/em&gt; orders appetizers, main, dessert — in that order. The steak comes out wrong and needs a redo. You don't shut down the kitchen. You don't stop serving every other table. You hold &lt;em&gt;Table 5's&lt;/em&gt; remaining courses until the steak is ready. Tables 6, 7, and 8 keep getting their food on time.&lt;/p&gt;

&lt;p&gt;Each table is a message key. A failed dish doesn't block the kitchen — it blocks that table's order until resolved.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;When a message for key K fails, all subsequent messages for key K must wait — but messages for other keys proceed normally.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This pattern is described in a &lt;a href="https://www.confluent.io/blog/error-handling-patterns-in-kafka/#pattern-4" rel="noopener noreferrer"&gt;Confluent blog post&lt;/a&gt;. The concept is straightforward. The implementation is where things get interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;To make this work across multiple consumer instances — say, 50 pods in Kubernetes — we need a way to coordinate which keys are currently blocked. A pod that locks a key must somehow tell every other pod. And when a pod crashes mid-retry, the lock must survive.&lt;/p&gt;

&lt;p&gt;We'll use Kafka itself as the coordination layer. On top of your original topic, you need two additional topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retry Topic&lt;/strong&gt; — messages land here either because they failed or because a predecessor with the same key failed. A separate consumer processes them with backoff. After exhausting retries, messages move to a DLQ — that's a standard pattern we won't rehash here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lock Topic&lt;/strong&gt; (compacted) — the shared lock registry. Every lock and unlock is a record here.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A retry topic alone isn't enough. It handles the failed message — but the main consumer doesn't know that &lt;code&gt;SKU-123&lt;/code&gt; is in trouble. When the next message for &lt;code&gt;SKU-123&lt;/code&gt; arrives, it processes immediately, out of order. We need a way to tell every consumer instance: "this key is blocked, don't touch it." That's what the lock topic does.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Lock State Propagates
&lt;/h3&gt;

&lt;p&gt;When a message fails, the consumer writes a record to the lock topic: key=&lt;code&gt;"SKU-123"&lt;/code&gt;, value=&lt;code&gt;"LOCKED"&lt;/code&gt;. When the retry eventually succeeds, it writes a tombstone (&lt;code&gt;null&lt;/code&gt; value) to signal the lock is cleared.&lt;/p&gt;

&lt;p&gt;Every consumer instance subscribes to the lock topic and builds a local in-memory map from the stream. Because the topic is &lt;a href="https://docs.confluent.io/kafka/design/log_compaction.html" rel="noopener noreferrer"&gt;compacted&lt;/a&gt;, Kafka periodically deduplicates records — keeping only the latest value per key. A new instance can read from the beginning and reconstruct the full lock state. The topic &lt;em&gt;is&lt;/em&gt; the database.&lt;/p&gt;

&lt;p&gt;This means each instance runs &lt;strong&gt;two consumer loops&lt;/strong&gt;: one for business messages (main or retry), and one that continuously tails the lock topic to keep the local map in sync.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Lock Check
&lt;/h3&gt;

&lt;p&gt;Before processing any message, the consumer checks its local map: &lt;em&gt;"Is this key currently blocked?"&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;New message arrives: key="order-123"
│
├─ Lock map: "order-123" locked?
│   │
│   ├─ YES → Redirect to retry topic
│   │        (preserve order behind predecessor)
│   │
│   └─ NO  → Process normally
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the key is locked, the message gets redirected to the retry topic — not because it failed, but because a &lt;em&gt;predecessor&lt;/em&gt; with the same key failed. It lines up behind that predecessor and waits its turn.&lt;/p&gt;

&lt;h2&gt;
  
  
  Seeing It In Action
&lt;/h2&gt;

&lt;p&gt;Here's how the full flow handles the scenario from the opening:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpt8mac1yrarrmx25bass.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpt8mac1yrarrmx25bass.png" alt="Mermaid"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Event 3 (different key) was never blocked. Events 1 and 2 (same key) maintained their order despite the failure. The partition kept flowing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where It Breaks
&lt;/h2&gt;

&lt;p&gt;That diagram shows the happy path. In a distributed system with 50 pods, there are at least five ways this breaks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Dual-Write.&lt;/strong&gt; When a message fails, you need to write a lock &lt;em&gt;and&lt;/em&gt; publish the message to the retry topic. If you crash between the two, you've created a zombie lock — a key locked forever with no retry message that will ever release it. Do you wrap both writes in a Kafka transaction? Or is there something cheaper?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Reference Counting Gap.&lt;/strong&gt; Two messages for the same key fail. You lock the key. The first retry succeeds — do you release the lock? If you do, the third message for that key processes immediately, before the second retry. The lock isn't a boolean. It's a counter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The Self-Consumption Trap.&lt;/strong&gt; Every instance publishes lock records and subscribes to the lock topic. Kafka doesn't filter — you'll consume your own messages. Apply the lock locally &lt;em&gt;and&lt;/em&gt; from the broadcast, and you've double-counted every lock.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The Compaction Collision.&lt;/strong&gt; Three failures for the same key produce three lock records with the same Kafka key. Compaction collapses them into one. After a restart, you replay one record instead of three. Your reference count is silently wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The Rebalancing Race.&lt;/strong&gt; A new pod starts, gets assigned a partition, and begins processing. But it hasn't finished reading the lock topic yet. It doesn't know what's locked. It processes a message out of order.&lt;/p&gt;

&lt;p&gt;None of these are theoretical. They all showed up during implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I've built a Go library — &lt;a href="https://github.com/vmyroslav/kafka-resilience" rel="noopener noreferrer"&gt;kafka-resilience&lt;/a&gt; — that handles these edge cases.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://devgeist.com/kafka-ordered-retries-2" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt;, we'll walk through each of the five problems above and their fixes. Code included.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 3&lt;/strong&gt; is the counterargument: operational costs, failure modes, and why you might choose a simpler solution.&lt;/p&gt;

&lt;p&gt;The goal isn't to convince you to use this — it's to give you the full picture so you can make an informed decision.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>go</category>
      <category>eventdriven</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>The Ghost in the Database: Why Is This Empty Table Taking 20 Seconds to Query?</title>
      <dc:creator>Myroslav Vivcharyk</dc:creator>
      <pubDate>Tue, 03 Feb 2026 15:05:00 +0000</pubDate>
      <link>https://forem.com/devgeist/the-ghost-in-the-database-why-is-this-empty-table-taking-20-seconds-to-query-47ma</link>
      <guid>https://forem.com/devgeist/the-ghost-in-the-database-why-is-this-empty-table-taking-20-seconds-to-query-47ma</guid>
      <description>&lt;p&gt;&lt;code&gt;SELECT * FROM my_table;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Query execution time: &lt;strong&gt;~5 seconds&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Table rows: &lt;strong&gt;0&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This made absolutely no sense.&lt;/p&gt;

&lt;p&gt;A few years ago, I found myself staring at a performance metric that defied logic. We had a table that was, for all intents and purposes, &lt;strong&gt;empty&lt;/strong&gt;. Yet, querying it was taking longer than a coffee break.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Here's the context: one massive MySQL database attached to an old monolith. Multiple teams outside of monolith had read-only access. The database was running on one of the beefiest RDS instances available - typically at &lt;strong&gt;50-60%&lt;/strong&gt; load.&lt;/p&gt;

&lt;p&gt;What bothered me was that this load level never quite matched our actual traffic patterns. Something felt off.&lt;/p&gt;

&lt;p&gt;Our monolith used the &lt;a href="https://microservices.io/patterns/data/transactional-outbox.html" rel="noopener noreferrer"&gt;transactional outbox pattern&lt;/a&gt; - storing records in MySQL before publishing to RabbitMQ. The setup was straightforward: the monolith writes messages to an &lt;code&gt;outbox_message&lt;/code&gt; table, and a separate worker service reads them in batches, publishes to RabbitMQ, then deletes them.&lt;/p&gt;

&lt;p&gt;High throughput table. Lots of writes. Lots of deletes. Should be simple.&lt;/p&gt;

&lt;p&gt;Then these warnings started appearing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[WARN] Slow batch processing detected: 23.4s
[WARN] Slow batch processing detected: 31.2s
[WARN] Slow batch processing detected: 47.8s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Confirming the Symptoms
&lt;/h2&gt;

&lt;p&gt;The monitoring dashboard showed the first signs. The outbox worker's latency metric - which measures the time to fetch messages from the database and publish them to RabbitMQ - had jumped to high levels. We were seeing &lt;em&gt;20+&lt;/em&gt; seconds with spikes up to &lt;em&gt;50&lt;/em&gt; seconds.&lt;/p&gt;

&lt;p&gt;Distributed traces pointed to the culprit: the &lt;code&gt;SELECT&lt;/code&gt; query that fetches message batches was taking almost all the time. The commit operation was also suspiciously slow.&lt;/p&gt;

&lt;p&gt;Time to verify directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;outbox_message&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An empty table taking 5 seconds to query.&lt;/p&gt;

&lt;p&gt;I checked RDS Performance Insights expecting to see obvious problems - CPU maxed out, disk I/O bottlenecks, &lt;em&gt;something&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Nothing. The database was sitting at a steady 50-60% load. It looked "busy," but not "breaking." No unusual spikes. The Top Queries view didn't even show our problematic query.&lt;/p&gt;

&lt;p&gt;Which was actually the first real clue. This wasn't a typical bottleneck.&lt;/p&gt;

&lt;h2&gt;
  
  
  Following the Clues
&lt;/h2&gt;

&lt;p&gt;The MySQL slow query log revealed what Performance Insights had missed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Query_time: 18.814028  Lock_time: 0.000021 Rows_sent: 1000  Rows_examined: 1000
SELECT * FROM outbox_message LIMIT 1000 FOR UPDATE SKIP LOCKED;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;18.8 seconds&lt;/strong&gt; for a query that examined only 1000 rows. The lock time was negligible.&lt;/p&gt;

&lt;p&gt;My first theory: blocking transactions. The worker retrieves messages in batches concurrently, so maybe they're stepping on each other.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;trx_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trx_state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trx_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;dl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lock_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lock_mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOCK_STATUS&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;performance_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_locks&lt;/span&gt; &lt;span class="n"&gt;dl&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;trx_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ENGINE_TRANSACTION_ID&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;trx_query&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%outbox_message%'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trx_started&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes - several long-running transactions on the &lt;code&gt;outbox_message&lt;/code&gt; table. But here's the thing: no deadlocks. All transactions had GRANTED locks, and all lock types were &lt;a href="https://dev.mysql.com/doc/refman/8.4/en/innodb-locking.html#innodb-intention-locks" rel="noopener noreferrer"&gt;compatible&lt;/a&gt;. No conflicts.&lt;/p&gt;

&lt;p&gt;The locks weren't the problem.&lt;/p&gt;

&lt;p&gt;Let's look at the table itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
       &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;data_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_free&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;free_space&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;TABLE_ROWS&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tables&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;TABLE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'outbox_message'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data_size: 1.2 GiB
free_space: 1.1 GiB
TABLE_ROWS: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An empty table occupying &lt;strong&gt;1.2 GiB&lt;/strong&gt; on disk. Something was fundamentally wrong with this particular table.&lt;/p&gt;

&lt;p&gt;Time to look deeper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="n"&gt;INNODB&lt;/span&gt; &lt;span class="n"&gt;STATUS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And there it was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;History list length 21,842,680
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;21 million&lt;/strong&gt; entries in the history list.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding What's Actually Happening
&lt;/h2&gt;

&lt;p&gt;Before we go further, let's step back and understand what this history list means.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Version Concurrency Control (MVCC)
&lt;/h3&gt;

&lt;p&gt;InnoDB uses &lt;a href="https://dev.mysql.com/doc/refman/8.4/en/innodb-multi-versioning.html" rel="noopener noreferrer"&gt;MVCC&lt;/a&gt; to allow concurrent reads and writes without locking. When you &lt;strong&gt;update&lt;/strong&gt; or &lt;strong&gt;delete&lt;/strong&gt; a row:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;InnoDB doesn't immediately overwrite or remove the old version&lt;/li&gt;
&lt;li&gt;It creates a &lt;strong&gt;new version&lt;/strong&gt; of the row (or marks it as deleted)&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;old version&lt;/strong&gt; gets stored in the undo log&lt;/li&gt;
&lt;li&gt;A reference to this old version is added to the &lt;strong&gt;history list&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This enables readers to access old versions while writers work on new ones - no blocking required.&lt;/p&gt;

&lt;p&gt;InnoDB's &lt;a href="https://dev.mysql.com/doc/refman/8.4/en/innodb-purge-configuration.html" rel="noopener noreferrer"&gt;purge operations&lt;/a&gt; run in the background to clean up old row versions. But here's the critical constraint: the purge thread can only remove versions that are older than &lt;strong&gt;currently active transactions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When you start a transaction, InnoDB creates a "snapshot" at that point in time. Even if your transaction sits idle, MySQL must keep all row versions created after your transaction started - because you might need them for consistent reads.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Domino Effect
&lt;/h3&gt;

&lt;p&gt;Here's what happened in our case:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three transactions started.&lt;/strong&gt; Due to a bug in another service, they never completed - no commit, no rollback. They just sat there, open for hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meanwhile, the outbox worker continued its normal operation:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write message to &lt;code&gt;outbox_message&lt;/code&gt; (creates new row version) &lt;/li&gt;
&lt;li&gt;Read and process message &lt;/li&gt;
&lt;li&gt;Delete message (creates another row version marking the row as deleted) &lt;/li&gt;
&lt;li&gt;Repeat millions of times&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The purge thread's dilemma:&lt;/strong&gt; It could see all these deleted row versions in the history list. It &lt;em&gt;wanted&lt;/em&gt; to clean them up. But it couldn't - those three transactions were still active. MySQL had to keep &lt;strong&gt;every single row version&lt;/strong&gt; created since they started.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; The history list grew to &lt;strong&gt;21 million&lt;/strong&gt; entries at that moment. The table physically contained gigabytes of old row versions for rows that had been processed and deleted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The query impact:&lt;/strong&gt; When our outbox worker ran &lt;code&gt;SELECT * FROM outbox_message&lt;/code&gt;, InnoDB had to scan through &lt;strong&gt;21 million old versions&lt;/strong&gt; to determine which rows should be visible to this transaction.&lt;/p&gt;

&lt;p&gt;This is why an empty table was taking 20 seconds to query. We weren't querying an empty table - we were querying through &lt;em&gt;21 million&lt;/em&gt; historical versions of rows that didn't exist anymore.&lt;/p&gt;

&lt;p&gt;And here's the kicker about that 50-60% database load I mentioned earlier: it wasn't doing useful work. The database was spending enormous resources on background that was going nowhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding the Smoking Gun
&lt;/h2&gt;

&lt;p&gt;Now that I understood the problem, I needed to find what was preventing the purge. Let's find those hanging transactions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;processlist_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;trx_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
       &lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;TIMESTAMPDIFF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HOUR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hours_running&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;trx_query&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;innodb_trx&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processlist&lt;/span&gt; &lt;span class="n"&gt;ps&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_mysql_thread_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;trx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trx_started&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIME&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;MINUTE&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;trx_started&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There they are: three transactions that had been running for &lt;strong&gt;looong time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;These queries were coming from another service that had database access - a bug in their code was leaving transactions open, preventing MySQL from cleaning up millions of row versions for rows that had been processed and deleted.&lt;/p&gt;

&lt;p&gt;And here was the systemic issue hiding in plain sight:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;VARIABLES&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;Variable_name&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'innodb_max_purge_lag'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'innodb_max_purge_lag_delay'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'innodb_purge_threads'&lt;/span&gt; 
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    innodb_max_purge_lag: 0
    innodb_max_purge_lag_delay: 0
    innodb_purge_threads: 4 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;innodb_max_purge_lag = 0&lt;/code&gt; means the history list can grow &lt;strong&gt;indefinitely&lt;/strong&gt;. No guardrails. No limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Resolution
&lt;/h2&gt;

&lt;p&gt;Now that we understood the problem, the fix was straightforward:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Immediate actions:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Kill the hanging transactions&lt;/li&gt;
&lt;li&gt;Wait for the purge thread to catch up (took sime time to process 21 million entries)&lt;/li&gt;
&lt;li&gt;Optimize the table to reclaim disk space&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Within minutes, query performance returned to normal. And the database load? It dropped from 50-60% down to 10%. Same traffic — but now the database wasn't wasting resources on futile purge operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-term fixes:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Configure proper purge limits:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;    &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;GLOBAL&lt;/span&gt; &lt;span class="n"&gt;innodb_max_purge_lag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;XXX&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents the history list from growing beyond &lt;code&gt;XXX&lt;/code&gt; entries. If it reaches this limit, InnoDB adds small delays to DML operations to let the purge thread catch up. A small performance may be better than a complete breakdown.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fix transaction timeout handling&lt;/strong&gt; across all services with database access. No transaction should run for hours/days.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add monitoring and alerts&lt;/strong&gt; for history list length. This metric should be on every write-heavy database dashboard.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INNODB_METRICS&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'trx_rseg_history_len'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;You can't control everything in a large systems. Bugs will happen. But what you &lt;em&gt;can&lt;/em&gt; control:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Monitoring Matters
&lt;/h3&gt;

&lt;p&gt;Add HLL metric to your database dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INNODB_METRICS&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'trx_rseg_history_len'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or to your Performance insights dashboard.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Set Proper Guardrails
&lt;/h3&gt;

&lt;p&gt;Default MySQL configurations assume perfect transaction hygiene. In complex systems with multiple services, you may need limits.&lt;/p&gt;

&lt;p&gt;Don't let &lt;code&gt;innodb_max_purge_lag&lt;/code&gt; stay at 0 in write-heavy databases with high delete volumes. Set it to something reasonable and accept the small performance trade-off for the reliability gain.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Understand the Internals
&lt;/h3&gt;

&lt;p&gt;Knowing how MVCC and the purge thread work was important to diagnosing this issue. The history list concept isn't obscure trivia - it's fundamental to how InnoDB handles concurrency.&lt;/p&gt;

&lt;p&gt;Every &lt;code&gt;DELETE&lt;/code&gt; doesn't immediately remove data; it just marks a new version. Every &lt;code&gt;UPDATE&lt;/code&gt; creates a new row version. All of this goes into the history list until the purge thread can safely remove it.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Discipline Matters
&lt;/h3&gt;

&lt;p&gt;The service that caused this only had read-only access. But even a read-only service can bring down your database if it doesn't properly manage transactions.&lt;/p&gt;

&lt;p&gt;Every service needs proper timeout handling and error recovery that ensures transactions complete. No exceptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.jcole.us/2014/04/16/the-basics-of-the-innodb-undo-logging-and-history-system/" rel="noopener noreferrer"&gt;The basics of the InnoDB undo logging and history system&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-purge-configuration.html" rel="noopener noreferrer"&gt;MySQL Purge Configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lefred.be/content/a-graph-a-day-keeps-the-doctor-away-mysql-history-list-length/" rel="noopener noreferrer"&gt;MySQL History List Length&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.percona.com/blog/chasing-a-hung-transaction-in-mysql-innodb-history-length-strikes-back/" rel="noopener noreferrer"&gt;Chasing a Hung Transaction in MySQL&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mysql</category>
      <category>database</category>
      <category>backend</category>
    </item>
    <item>
      <title>Go Graceful Shutdown: Beyond ctx.Done()</title>
      <dc:creator>Myroslav Vivcharyk</dc:creator>
      <pubDate>Tue, 13 Jan 2026 15:15:00 +0000</pubDate>
      <link>https://forem.com/devgeist/when-ctxdone-isnt-enough-3ia6</link>
      <guid>https://forem.com/devgeist/when-ctxdone-isnt-enough-3ia6</guid>
      <description>&lt;p&gt;As Go engineers, we've all internalized the context pattern. When an application shuts down, we propagate the cancellation, our select blocks catch &lt;code&gt;&amp;lt;-ctx.Done()&lt;/code&gt;, and we return. Clean. Simple. Idiomatic.&lt;/p&gt;

&lt;p&gt;It usually looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="c"&gt;// Clean exit, right?&lt;/span&gt;
    &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Do work&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For many applications, this is "good enough." If you're running a stateless process that can safely restart mid-operation, Kubernetes will just spin up a new POD and life goes on. A few "context canceled" errors in your logs? &lt;em&gt;Who cares.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But what about processing a payment when &lt;code&gt;SIGTERM&lt;/code&gt; arrives? Or acknowledging a message you've half-processed?&lt;/p&gt;

&lt;h2&gt;
  
  
  When "good enough" isn't enough anymore?
&lt;/h2&gt;

&lt;p&gt;Not every application can afford to drop everything and exit. Consider what happens when you're halfway through processing a payment and &lt;code&gt;SIGTERM&lt;/code&gt; arrives. Or when you've pulled a message from a queue but haven't acknowledged it yet.&lt;/p&gt;

&lt;p&gt;Here are few scenarios where naive context cancellation creates real problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Message queues with acknowledgments&lt;/strong&gt; — SQS, RabbitMQ, Kafka — any system where you must explicitly confirm message processing. Unacknowledged messages get reprocessed, causing duplicates or delays.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-step transactions&lt;/strong&gt; — Payment captured but not recorded. Inventory reserved but not confirmed. Customer charged but order not created.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-idempotent operations&lt;/strong&gt; — Notifications already sent. Resources already created. Webhooks already fired. You can't just "try again."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-running batch jobs&lt;/strong&gt; — Hours of computation that can't resume from the middle. Data migrations left in inconsistent states.&lt;/p&gt;

&lt;p&gt;In all these cases, the cost of interruption exceeds the cost of waiting a few extra seconds for clean completion.&lt;/p&gt;

&lt;p&gt;If your code lives in any of these worlds, keep reading.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: One Signal, Two Jobs
&lt;/h2&gt;

&lt;p&gt;Let's look at a typical message processing pipeline - I'll use SQS as an example since it clearly demonstrates the issue, but the pattern applies to any acknowledge-based system:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faifu1ea9zo31kq64wbwo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faifu1ea9zo31kq64wbwo.png" alt="Scenario 1: SQS Processing" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the typical implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;ConsumeMessages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queueURL&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// Buffered channel&lt;/span&gt;

    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="n"&gt;pollMessages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queueURL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="c"&gt;// Problem: returns immediately!&lt;/span&gt;
            &lt;span class="c"&gt;// Buffered messages abandoned&lt;/span&gt;
            &lt;span class="c"&gt;// In-flight processing interrupted&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;processMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c"&gt;// ← SIGTERM during this&lt;/span&gt;
            &lt;span class="n"&gt;deleteFromSQS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c"&gt;// ← Never executes&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;SIGTERM&lt;/code&gt; arrives and &lt;code&gt;ctx.Done()&lt;/code&gt; fires, here's what may happen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;processMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Step 1: Call external API&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;callSomeAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;  &lt;span class="c"&gt;// ctx.Done() could interrupt here&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Step 2: Database save interrupted by ctx.Done()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;saveToDatabase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;  &lt;span class="c"&gt;// or here&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we propagate the context to &lt;code&gt;processMessage()&lt;/code&gt;, the context cancellation flows through your function calls. The external API call might complete, but the database save fails with &lt;code&gt;context.Canceled&lt;/code&gt;. Now you have partial state — the API side effect happened, but nothing was saved. And since &lt;code&gt;acknowledgeMessage()&lt;/code&gt; never runs, the message gets reprocessed, potentially duplicating the API call.&lt;/p&gt;

&lt;p&gt;Additionally, any messages already fetched from the queue but sitting in a buffer waiting to be processed get abandoned when the loop exits — they'll timeout and reappear for reprocessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The root issue&lt;/strong&gt;: We have one signal (&lt;code&gt;ctx.Done()&lt;/code&gt;) trying to do two completely different jobs.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stop accepting new work&lt;/li&gt;
&lt;li&gt;Stop working immediately&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We need to separate these concerns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mental Model: The Restaurant Kitchen
&lt;/h2&gt;

&lt;p&gt;Imagine your system like a restaurant kitchen during closing time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Orders In  →  Prep Station  →  Cooking  →  Serve to Customer
   (fetch)       (buffer)       (process)     (complete)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Naive shutdown&lt;/strong&gt; (&lt;code&gt;ctx.Done()&lt;/code&gt;):  &lt;/p&gt;

&lt;p&gt;The manager shouts "We're closed!" and the chef drops the pan mid-stir and walks out.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Orders on the counter? Tossed.&lt;/li&gt;
&lt;li&gt;Food on the stove? Abandoned half-cooked.&lt;/li&gt;
&lt;li&gt;Customers waiting for meals? Kicked out hungry.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Good shutdown&lt;/strong&gt; (graceful):  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Lock the front door — Stop taking new orders&lt;/li&gt;
&lt;li&gt;Finish what's cooking — Complete dishes on the stove
&lt;/li&gt;
&lt;li&gt;Serve completed meals — Deliver to waiting customers&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Then&lt;/em&gt; turn off the lights and go home&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Of course, you can't wait forever. If it's been 30 minutes and someone's still cooking, you'll eventually have to call it. But you give the kitchen a reasonable chance to finish.&lt;/p&gt;

&lt;p&gt;This is exactly our strategy: control the shutdown signal so work can flow to completion.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Two-Phase Shutdown
&lt;/h2&gt;

&lt;p&gt;The solution is elegantly simple: use two signals instead of one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal 1&lt;/strong&gt;: "Please stop gracefully" — Stop accepting new work&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Signal 2&lt;/strong&gt;: "Emergency brake" — Force shutdown if graceful fails (via &lt;code&gt;ctx.Done()&lt;/code&gt;)&lt;/p&gt;

&lt;p&gt;Here's the conceptual flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// The core idea&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="c"&gt;// Emergency: parent context canceled, stop everything now&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;stopSignalCh&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="c"&gt;// Graceful shutdown in three phases:&lt;/span&gt;

    &lt;span class="c"&gt;// Phase 1: Stop accepting new work&lt;/span&gt;
    &lt;span class="n"&gt;cancelPoller&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c"&gt;// Phase 2: Wait for poller to finish and close the work channel&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;pollerDone&lt;/span&gt;

    &lt;span class="c"&gt;// Phase 3: Processor drains remaining messages from channel&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;processorDone&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're using the simple channel close as a coordination mechanism. When the poller stops, it closes the message channel. The processor, ranging over that channel, naturally drains any remaining messages and then exits. No complex synchronization needed — just Go's built-in channel semantics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation: A Real Example
&lt;/h2&gt;

&lt;p&gt;Let me show you how this looks in practice. I'll use a real SQS consumer implementation that I built - you can see the &lt;a href="https://github.com/vmyroslav/sqs-go" rel="noopener noreferrer"&gt;full code on GitHub&lt;/a&gt;, but I'll walk through the key parts here.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Data Flow
&lt;/h3&gt;

&lt;p&gt;Before diving into code, let's visualize what we're building:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Single Message Flow:
┌──────┐    ┌─────────┐    ┌─────────┐    ┌────────┐
│ Poll │───▶│Transform│───▶│ Process │───▶│ Delete │
│ SQS  │    │ Message │    │Handler  │    │  (ACK) │
└──────┘    └─────────┘    └─────────┘    └────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Graceful Shutdown Flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;stopSignalCh&lt;/code&gt; receives signal&lt;/li&gt;
&lt;li&gt;Poller stops fetching → closes channel&lt;/li&gt;
&lt;li&gt;Processor sees closed channel → drains remaining messages&lt;/li&gt;
&lt;li&gt;All messages acknowledged → Exit&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Core Structure
&lt;/h3&gt;

&lt;p&gt;We need to track both the shutdown request and completion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Consumer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;poller&lt;/span&gt;      &lt;span class="n"&gt;Poller&lt;/span&gt;
    &lt;span class="n"&gt;processor&lt;/span&gt;   &lt;span class="n"&gt;Processor&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;stopSignalCh&lt;/span&gt; &lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;  &lt;span class="c"&gt;// Trigger graceful shutdown&lt;/span&gt;
    &lt;span class="n"&gt;stoppedCh&lt;/span&gt;    &lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;  &lt;span class="c"&gt;// Signal completion&lt;/span&gt;

    &lt;span class="c"&gt;// ... config, logging, etc&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Main Loop: Control Tower
&lt;/h3&gt;

&lt;p&gt;Here's the heart of the system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;SQSConsumer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="n"&gt;Consume&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;queueURL&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Buffer size balances throughput against shutdown time&lt;/span&gt;
    &lt;span class="n"&gt;bufferSize&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProcessorWorkerPoolSize&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="n"&gt;msgs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;sqstypes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bufferSize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Separate contexts let us control shutdown independently&lt;/span&gt;
    &lt;span class="n"&gt;processCtx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancelProcess&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithCancel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pollerCtx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancelPoller&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithCancel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Cleanup always happens, whether we exit gracefully or via context cancellation&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;cancelPoller&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;cancelProcess&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isRunning&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RecordShutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;

    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isRunning&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;

    &lt;span class="n"&gt;pollerErrCh&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;processErrCh&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Start both components&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;pollerErrCh&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;poller&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Poll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pollerCtx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msgs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}()&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;processErrCh&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processCtx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}()&lt;/span&gt;

    &lt;span class="c"&gt;// The control tower&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Emergency brake—parent context canceled (e.g., timeout expired)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stopSignalCh&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Graceful shutdown sequence&lt;/span&gt;

        &lt;span class="c"&gt;// PHASE 1: Stop accepting new work&lt;/span&gt;
        &lt;span class="n"&gt;cancelPoller&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c"&gt;// PHASE 2: Wait for poller to close the channel&lt;/span&gt;
        &lt;span class="c"&gt;// CRITICAL: The poller MUST close 'msgs' when it exits&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;pollerErrCh&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c"&gt;// PHASE 3: Processor drains remaining messages&lt;/span&gt;
        &lt;span class="c"&gt;// It exits naturally when the channel closes&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;processErrCh&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c"&gt;// Signal that shutdown is complete&lt;/span&gt;
            &lt;span class="c"&gt;// This unblocks Close() which is waiting on this channel&lt;/span&gt;
        &lt;span class="nb"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stoppedCh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;we cancel only the poller's context&lt;/strong&gt; (closing the door). The processor keeps running until the channel is empty and closed. The processor doesn't know or care that we're shutting down — it just keeps processing until there's no more work.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Poller: Locking the Front Door
&lt;/h3&gt;

&lt;p&gt;Remember our restaurant analogy? The poller is the host who locks the door. Its job is simple but critical — &lt;strong&gt;it must close the channel when it exits&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Poller&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Poll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="k"&gt;chan&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="nb"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;// This one line coordinates everything&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="c"&gt;// Context canceled—time to stop&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c"&gt;// Poll for messages&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fetchMessages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queueURL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                &lt;span class="c"&gt;// Message sent to processor&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                &lt;span class="c"&gt;// Context canceled while sending&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;defer close(ch)&lt;/code&gt; does the heavy lifting. When the poller's context is canceled:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The function returns&lt;/li&gt;
&lt;li&gt;The defer closes the channel&lt;/li&gt;
&lt;li&gt;The processor's range loop exits naturally&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No explicit coordination needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Processor: Finishing What's on the Stove
&lt;/h3&gt;

&lt;p&gt;The processor simply ranges over the channel until it closes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Processor&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msgs&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;

    &lt;span class="c"&gt;// Start worker pool&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;workerPoolSize&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="c"&gt;// Process messages until channel closes&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;msgs&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                        &lt;span class="c"&gt;// Emergency exit if timeout expires&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt;
                &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Wait for all workers to finish&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The processor keeps the &lt;code&gt;ctx.Done()&lt;/code&gt; check in case we hit the timeout and need to force-exit, but during normal graceful shutdown, it simply processes until &lt;code&gt;msgs&lt;/code&gt; closes.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Close Method: Initiating Shutdown
&lt;/h3&gt;

&lt;p&gt;This is the public API your &lt;code&gt;main.go&lt;/code&gt; calls when it catches &lt;code&gt;SIGTERM&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Consumer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isRunning&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isClosing&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isClosing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stopSignalCh&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}{}&lt;/span&gt;  &lt;span class="c"&gt;// Trigger graceful shutdown&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c"&gt;// Wait for completion, but not forever&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GracefulShutdownTimeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;

    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stoppedCh&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Clean shutdown completed&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;

    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;After&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Timeout expired—shutdown incomplete&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"shutdown timeout after %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The timeout is crucial. In the real world, you can't wait indefinitely. Kubernetes will eventually send &lt;code&gt;SIGKILL&lt;/code&gt; if your POD doesn't terminate. The timeout gives you a chance to clean up, but ensures you don't hang forever.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choosing a Timeout Value
&lt;/h3&gt;

&lt;p&gt;How do you pick a good timeout?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;shutdownTime = (bufferSize / workerPoolSize) × avgProcessingTime × safetyFactor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Buffer: &lt;em&gt;30&lt;/em&gt; messages&lt;/li&gt;
&lt;li&gt;Workers: &lt;em&gt;10&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Processing time: &lt;em&gt;100ms&lt;/em&gt; per message&lt;/li&gt;
&lt;li&gt;Safety factor: &lt;em&gt;3x&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;shutdownTime = (30 / 10) × 100ms × 3 = 900ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set timeout to 2-3x this: &lt;code&gt;GracefulShutdownTimeout&lt;/code&gt; = &lt;strong&gt;3 seconds&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What Happens When The Timeout Expires?
&lt;/h3&gt;

&lt;p&gt;If the timeout expires, &lt;code&gt;Close()&lt;/code&gt; returns an error, but the consumer is still running. At this point, you have a few options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Let it go&lt;/strong&gt; — Kubernetes will &lt;code&gt;SIGKILL&lt;/code&gt; the POD shortly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Force cancel&lt;/strong&gt; — Cancel the parent context to trigger the emergency brake.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In production code, this looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancel&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithCancel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;consumer&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;NewConsumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;poller&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Handle shutdown signals&lt;/span&gt;
    &lt;span class="n"&gt;sigCh&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Signal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Notify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sigCh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SIGTERM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SIGINT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;sigCh&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Shutdown signal received"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="c"&gt;// Decision point: what now?          &lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"some messages may be reprocessed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c"&gt;// Emergency brake: force everything to stop&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;

    &lt;span class="c"&gt;// This blocks until shutdown completes&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Consume&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Consumer error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; &lt;code&gt;ctx.Done()&lt;/code&gt; can't distinguish "stop accepting work" from "finish current work." This may leave in-flight work in inconsistent states.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use separate contexts for different components&lt;/li&gt;
&lt;li&gt;Use a dedicated signal channel for graceful shutdown&lt;/li&gt;
&lt;li&gt;Let channel closure coordinate the drain naturally&lt;/li&gt;
&lt;li&gt;Always include a timeout — you can't wait forever&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use this pattern:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Message queue consumers (SQS, RabbitMQ, Kafka)&lt;/li&gt;
&lt;li&gt;Any system with explicit acknowledgments&lt;/li&gt;
&lt;li&gt;Multi-step operations that must complete atomically&lt;/li&gt;
&lt;li&gt;Long-running batch jobs&lt;/li&gt;
&lt;li&gt;Non-idempotent external API calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When you don't need it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stateless operations&lt;/li&gt;
&lt;li&gt;Idempotent operations that can safely restart&lt;/li&gt;
&lt;li&gt;Pure computational tasks with no side effects&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;If you're building systems with in-flight work, consider what your "units of work" look like. What state do they maintain? What happens if they're interrupted? How long does completion typically take?&lt;/p&gt;

&lt;p&gt;The answers to these questions will guide your shutdown strategy. Sometimes immediate exit is fine. Sometimes you need another approach. The key is making a conscious choice rather than defaulting to &lt;code&gt;ctx.Done()&lt;/code&gt; everywhere.&lt;/p&gt;

&lt;p&gt;The approach I've shown here is specific to message consumers, but the principle applies broadly. &lt;/p&gt;

&lt;p&gt;For the complete implementation check out the &lt;a href="https://github.com/vmyroslav/sqs-go" rel="noopener noreferrer"&gt;full code on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And if you've implemented graceful shutdown in a different way, or have real-life stories about what happens when you don't, I'd love to discuss them.&lt;/p&gt;

</description>
      <category>go</category>
      <category>backend</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Kafka Consumer Health Checks: Why Progress Beats Lag</title>
      <dc:creator>Myroslav Vivcharyk</dc:creator>
      <pubDate>Tue, 16 Dec 2025 14:02:20 +0000</pubDate>
      <link>https://forem.com/devgeist/kafka-consumer-health-checks-dead-or-alive-179e</link>
      <guid>https://forem.com/devgeist/kafka-consumer-health-checks-dead-or-alive-179e</guid>
      <description>&lt;p&gt;Many of you have been there - it’s 3 AM, your phone buzzes, and you’re staring at an alert: “&lt;strong&gt;Kafka consumer lag exceeded threshold.&lt;/strong&gt;” &lt;/p&gt;

&lt;p&gt;You stumble to your laptop, check the metrics, and… everything looks fine. The messages are processing. The lag was just a temporary spike from a slow downstream service. You silence the alert and go back to bed, mentally adding another item to your "fix someday" list.&lt;/p&gt;

&lt;p&gt;Sound familiar? &lt;/p&gt;

&lt;p&gt;Here's what's actually happening: &lt;strong&gt;lag tells you how far behind you are, but not whether you're making progress&lt;/strong&gt;. A consumer can sit at 1000 messages lag for 10 minutes because it's stuck, or because it's processing at exactly the rate messages arrive. From lag alone, you can't tell the difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real question isn't "how much lag?" — it's "are we making progress?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That’s the problem I found myself wrestling with a while back. After dealing with false positive alerts and delayed detection of real issues, I discovered a simple but brilliant solution from &lt;a href="https://www.pagerduty.com/eng/kafka-health-checks/" rel="noopener noreferrer"&gt;PagerDuty’s engineering team&lt;/a&gt;. Rather than retelling their story, I want to show you how to implement this approach in Go, with some insights I’ve picked up along the way.&lt;/p&gt;

&lt;p&gt;💡 &lt;em&gt;Want to dive straight into the code? Check out the complete &lt;a href="https://github.com/vmyroslav/kafka-pulse-go" rel="noopener noreferrer"&gt;source code here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Traditional Health Checks Fall Short
&lt;/h2&gt;

&lt;p&gt;Let’s talk about common patterns and why they don’t quite work:&lt;/p&gt;

&lt;h3&gt;
  
  
  The “Always Healthy” Approach
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;healthHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteHeader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusOK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"OK"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c"&gt;// "I'm alive!" (Are you though?)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells you nothing. Your consumer could be completely frozen, and Kubernetes (any orchestrator) would happily keep it alive.&lt;/p&gt;

&lt;h3&gt;
  
  
  The “Ping the Broker” Approach
&lt;/h3&gt;

&lt;p&gt;This is better than nothing — at least you’re verifying connectivity. But connectivity doesn’t mean your consumer is processing messages. Your network might be fine while your consumer group is stuck in an infinite rebalance loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  The “Lag Threshold” Trap
&lt;/h3&gt;

&lt;p&gt;This is where most teams land. You’re probably already tracking consumer lag through Prometheus or similar tools. You’ve set up alerts when lag exceeds some threshold — maybe 100 messages, maybe 1000, maybe a million.&lt;/p&gt;

&lt;p&gt;But here’s the problem: &lt;em&gt;what should that threshold be?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set it too low&lt;/strong&gt; → You wake up at 3 AM because a downstream API responded slowly for 30 seconds. The consumer is fine, just briefly overwhelmed. False alarm.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set it too high&lt;/strong&gt; → You miss genuine issues until they’ve already cascaded into customer-visible problems. By the time your alert fires, you’re in damage control mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The granularity problem&lt;/strong&gt;: A single stuck POD in your deployment of 50 can go unnoticed if you’re monitoring average lag. That POD quietly fails while the other 49 keep working, masking the issue in your aggregate metrics.&lt;/p&gt;

&lt;p&gt;The fundamental tension: &lt;strong&gt;fast detection means false positives, reliable alerts mean delayed detection&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You can build sophisticated heuristics or even ML models to detect anomalies, but all of that adds complexity and still detects problems with noticeable delay.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Insight: Progress vs. Position
&lt;/h2&gt;

&lt;p&gt;What we really need is a way to answer a simple question: &lt;strong&gt;Is this specific consumer instance making progress?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of measuring lag, we measure &lt;em&gt;progress&lt;/em&gt;. Here's the approach:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Heartbeat&lt;/strong&gt;: Track the timestamp of the last processed message for each partition. If we're processing messages, we're healthy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Verification&lt;/strong&gt;: If enough time passes without new messages (say, X seconds), we don't panic yet. We query the Kafka broker for the latest offset. Now we can make a decision:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consumer Offset &amp;lt; Broker Offset&lt;/strong&gt;: ❌ &lt;strong&gt;UNHEALTHY&lt;/strong&gt; &lt;em&gt;(there are messages available, but we're not processing them — we're stuck)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer Offset ≥ Broker Offset&lt;/strong&gt;: ✅ &lt;strong&gt;HEALTHY&lt;/strong&gt; &lt;em&gt;(we're caught up, just waiting for more work — we're idle)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This elegantly distinguishes between a stuck consumer and a consumer that's simply idle.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works: Three Scenarios
&lt;/h3&gt;

&lt;p&gt;Let me show you exactly what happens in each case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1: Active Processing (Healthy)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When messages are flowing and your consumer is processing them, health checks are instant:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidftrxd4dmno7olrtp7l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidftrxd4dmno7olrtp7l.png" alt="Scenario 1: Active Processing" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No broker queries needed — we've seen recent activity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2: Stuck Consumer (Unhealthy)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The consumer freezes, but messages keep arriving:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjyal6821gqft76b81zw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjyal6821gqft76b81zw.png" alt="Scenario 2: Stuck Consumer" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The broker query reveals there's work to do, but we're not doing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 3: Idle Consumer (Healthy)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The consumer is caught up, just waiting for new messages:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3pqxhqw36dx8zvp16gox.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3pqxhqw36dx8zvp16gox.png" alt="Scenario 3: Idle Consumer" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the key distinction — same timeout, different outcome based on broker state.&lt;/p&gt;

&lt;p&gt;The beauty of this approach is its simplicity — we're not doing complex analysis or ML. We're just asking two questions: &lt;em&gt;"Have I seen new work? If not, is there new work available?"&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;I’ve packaged this logic into &lt;a href="https://github.com/vmyroslav/kafka-pulse-go" rel="noopener noreferrer"&gt;&lt;code&gt;kafka-pulse-go&lt;/code&gt;&lt;/a&gt;, a lightweight library that works with most popular Kafka clients. The core logic is decoupled from specific client implementations, with adapters provided for &lt;a href="https://github.com/IBM/sarama" rel="noopener noreferrer"&gt;Sarama&lt;/a&gt;, &lt;a href="https://github.com/segmentio/kafka-go" rel="noopener noreferrer"&gt;segmentio/kafka-go&lt;/a&gt;, and &lt;a href="https://github.com/confluentinc/confluent-kafka-go" rel="noopener noreferrer"&gt;Confluent’s client&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let me walk you through how to use it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting Up the Monitor
&lt;/h3&gt;

&lt;p&gt;The setup is straightforward. Here's what you need with Sarama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/IBM/sarama"&lt;/span&gt;
    &lt;span class="n"&gt;adapter&lt;/span&gt; &lt;span class="s"&gt;"github.com/vmyroslav/kafka-pulse-go/adapter/sarama"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/vmyroslav/kafka-pulse-go/pulse"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Your existing Sarama setup&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sarama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewConfig&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sarama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;brokers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Create the health checker adapter&lt;/span&gt;
    &lt;span class="n"&gt;brokerClient&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;adapter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewClientAdapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Configure the monitor&lt;/span&gt;
    &lt;span class="n"&gt;monitorConfig&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;pulse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Logger&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;       &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;StuckTimeout&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;pulse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewHealthChecker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monitorConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;brokerClient&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Pass monitor to your consumer...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What's&lt;/strong&gt; &lt;code&gt;StuckTimeout&lt;/code&gt;? This defines how long we'll wait without seeing new messages before we query the broker. Set this based on your expected message frequency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-throughput topics: 10-30 seconds works well&lt;/li&gt;
&lt;li&gt;Medium-volume topics: 1-2 minutes is reasonable&lt;/li&gt;
&lt;li&gt;Low-volume topics: You might need 5-10 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key is balancing detection speed with broker query overhead. Too short and you’ll query the broker constantly during idle periods. Too long and you’ll delay detecting real issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integrating with Your Consumer
&lt;/h3&gt;

&lt;p&gt;The integration is minimal — just call &lt;code&gt;Track()&lt;/code&gt; for each message you process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;consumerGroupHandler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ConsumeClaim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="n"&gt;sarama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConsumerGroupSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;claim&lt;/span&gt; &lt;span class="n"&gt;sarama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConsumerGroupClaim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;claim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Messages&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// Process your business logic&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;processMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c"&gt;// Handle error...&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c"&gt;// Tell the health monitor we're making progress&lt;/span&gt;
        &lt;span class="n"&gt;wrappedMessage&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;adapter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Track&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;wrappedMessage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MarkMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it. The monitor now knows when you’ve processed a message and can track progress per partition.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exposing the Health Check
&lt;/h3&gt;

&lt;p&gt;Wire this into your HTTP server for health checks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hs&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;HealthServer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;healthHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancel&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;isHealthy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Healthy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// Handle broker connection errors&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteHeader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusServiceUnavailable&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"unhealthy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"error"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;isHealthy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteHeader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusOK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"healthy"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteHeader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusServiceUnavailable&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"unhealthy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"reason"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"consumer stuck behind available messages"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now your orchestrator (Kubernetes, ECS, whatever) can hit this endpoint and know the real health of your consumer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Critical Detail: Handling Rebalances
&lt;/h2&gt;

&lt;p&gt;Here’s where custom implementations may fail. In Kafka, partition ownership is fluid. If you add a new POD or an old one crashes, Kafka triggers a rebalance. Your consumer instance might lose ownership of &lt;em&gt;Partition 1&lt;/em&gt; and gain an ownership of &lt;em&gt;Partition 2&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Danger of the "Zombie Partition"
&lt;/h3&gt;

&lt;p&gt;Without proper cleanup, here's what happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Rebalance Triggered&lt;/strong&gt;: You added a new POD to consumer group, or an existing one crashes. Kafka initiates a consumer group rebalance to redistribute partition ownership.&lt;/li&gt;
&lt;li&gt;Your consumer loses ownership of &lt;em&gt;Partition 1&lt;/em&gt; and stops reading from it (correctly - you no longer own it)&lt;/li&gt;
&lt;li&gt;The health checker still holds a reference to &lt;em&gt;Partition 1&lt;/em&gt; in its memory&lt;/li&gt;
&lt;li&gt;Time passes without new messages for &lt;em&gt;Partition 1&lt;/em&gt; in your tracker (obviously — you're not reading it anymore)&lt;/li&gt;
&lt;li&gt;The health checker eventually queries the broker and sees the offset for &lt;em&gt;Partition 1&lt;/em&gt; is moving (thanks to whoever owns it now)&lt;/li&gt;
&lt;li&gt;But your local tracked offset for &lt;em&gt;Partition 1&lt;/em&gt; is frozen at the last message you processed before losing ownership&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result&lt;/strong&gt;: The health checker flags your pod as &lt;em&gt;UNHEALTHY&lt;/em&gt; for a partition it doesn't even own anymore&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is a "zombie partition" problem — your health checker is tracking ghosts.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fix
&lt;/h3&gt;

&lt;p&gt;You must manage the partition lifecycle. Call &lt;code&gt;Release()&lt;/code&gt; when a session ends to purge that partition from the monitor’s memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;consumerGroupHandler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ConsumeClaim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="n"&gt;sarama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConsumerGroupSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;claim&lt;/span&gt; &lt;span class="n"&gt;sarama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConsumerGroupClaim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="c"&gt;// CRITICAL: Clean up partition tracking during rebalance&lt;/span&gt;
            &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Release&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;claim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Topic&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;claim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Partition&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;

        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;claim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Messages&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;processMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;wrappedMessage&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;adapter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Track&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wrappedMessage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MarkMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ctx.Done()&lt;/code&gt; signal from Sarama indicates that the session is ending — either due to a rebalance, shutdown, or error. This is your cue to clean up tracking state for this partition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters&lt;/strong&gt;: In a production system with frequent deployments, rebalances happen constantly. Without proper cleanup, you’ll see cascading false alerts as PODs report unhealthy for partitions they’ve lost ownership of.&lt;/p&gt;

&lt;h2&gt;
  
  
  Under the Hood
&lt;/h2&gt;

&lt;p&gt;The library’s design is intentionally simple. Here are the key pieces:&lt;/p&gt;

&lt;h3&gt;
  
  
  State Tracking
&lt;/h3&gt;

&lt;p&gt;We maintain a map of topic-partitions to their last-seen offset and timestamp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;topicPartitionOffsets&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;OffsetTimestamp&lt;/span&gt;
    &lt;span class="n"&gt;mu&lt;/span&gt;                    &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RWMutex&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;OffsetTimestamp&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Timestamp&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt;
    &lt;span class="n"&gt;Offset&lt;/span&gt;    &lt;span class="kt"&gt;int64&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why a map?&lt;/strong&gt; Because each partition is independent. &lt;code&gt;Partition 0&lt;/code&gt; being stuck doesn’t mean &lt;code&gt;Partition 1&lt;/code&gt; is stuck. We need per-partition granularity to avoid false positives.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Health Check Logic
&lt;/h3&gt;

&lt;p&gt;The verification follows the two-step process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;HealthChecker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Healthy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;currentState&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tracker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;currentOffsets&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;currentState&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;currentState&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;offsetTimestamp&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;currentState&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

            &lt;span class="c"&gt;// Step 1: Has it been too long since we saw a new message?&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;offsetTimestamp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stuckTimeout&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="c"&gt;// Step 2: Query broker for latest offset&lt;/span&gt;
                &lt;span class="n"&gt;latestOffset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetLatestOffset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ignoreBrokerErrors&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="k"&gt;continue&lt;/span&gt; &lt;span class="c"&gt;// Don't fail on transient broker issues&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failed to get latest offset: %w"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;

                &lt;span class="c"&gt;// Are we stuck or just idle?&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;offsetTimestamp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Offset&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;latestOffset&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="c"&gt;// Stuck! There are messages we haven't processed&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="c"&gt;// We're caught up, just idle - that's fine&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight is in that comparison: &lt;code&gt;offsetTimestamp.Offset &amp;lt; latestOffset&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What About Poison Pill Messages?
&lt;/h3&gt;

&lt;p&gt;A common pitfall in health checks is confusing activity with progress. If a consumer is stuck in an infinite retry loop on a single malformed message, it’s technically “active” — burning CPU and making network calls — but it’s not “healthy” because the stream is blocked.&lt;/p&gt;

&lt;p&gt;This approach handles it correctly. Here’s why:&lt;/p&gt;

&lt;p&gt;The tracker only updates the timestamp when the offset actually moves forward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Only update timestamp if offset has changed&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Offset&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Offset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Timestamp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsZero&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;topicPartitionOffsets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Topic&lt;/span&gt;&lt;span class="p"&gt;()][&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Partition&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OffsetTimestamp&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Offset&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Offset&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;Timestamp&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your consumer spends 10 minutes retrying the same message at offset 100, the offset never advances. The timestamp stays frozen at the last successful message. Eventually the timeout expires, and because the broker has newer messages (101, 102, 103…), the health check correctly reports &lt;strong&gt;Unhealthy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This forces visibility on “zombie loops” that may otherwise go unnoticed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Broker Outages: Choosing Your Failure Mode
&lt;/h3&gt;

&lt;p&gt;Since our verification step requires querying the broker, you need to decide what happens when that call fails.&lt;/p&gt;

&lt;p&gt;The library exposes an &lt;code&gt;IgnoreBrokerErrors&lt;/code&gt; flag that controls this behavior:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A: Fail Closed&lt;/strong&gt; &lt;em&gt;(default)&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;pulse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;StuckTimeout&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;       &lt;span class="m"&gt;30&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;IgnoreBrokerErrors&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c"&gt;// default - fail on broker errors&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Upside&lt;/strong&gt;: You never report “Healthy” when you’re actually blind to the real state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk&lt;/strong&gt;: A temporary Kafka blip can cause Kubernetes to restart your entire consumer fleet simultaneously (a “restart storm”). Since restarting won’t fix the broker, this just adds chaos to an ongoing outage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Option B: Fail Open&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;pulse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;StuckTimeout&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;       &lt;span class="m"&gt;30&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;IgnoreBrokerErrors&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c"&gt;// ignore transient broker errors&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Upside&lt;/strong&gt;: Your PODs stay running, warm, and ready to resume work the moment the broker recovers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk&lt;/strong&gt;: You might miss a legitimate network partition affecting only your specific POD&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;My recommendation&lt;/strong&gt;: Start with fail-closed (the default). It's safer — you never report healthy when you're blind to the actual state. If you experience restart storms during broker incidents, switch to fail-open and add separate monitoring for broker connectivity at the infrastructure level.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does This Replace Consumer Lag Monitoring?
&lt;/h3&gt;

&lt;p&gt;No! Consumer lag metrics are still valuable for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capacity planning&lt;/strong&gt;: Understanding if you need to scale your consumer group&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance trending&lt;/strong&gt;: Tracking how your processing speed changes over time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business SLAs&lt;/strong&gt;: If your requirement is “process messages within 5 minutes,” lag is exactly what you need to measure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This health check complements lag monitoring — it doesn't replace it. Lag tells you if you need more consumers. Health checks tell you if your existing consumers are working.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;The core insight is simple: &lt;strong&gt;health checks should measure progress&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;By distinguishing between “stuck with work to do” and “idle but caught up,” we eliminate false positives without missing real issues. No more 3 AM alerts for temporary slowdowns. No more delayed detection of genuinely stuck consumers.&lt;/p&gt;

&lt;p&gt;The implementation is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Track offset progression for each partition&lt;/li&gt;
&lt;li&gt;When idle too long, verify against broker state&lt;/li&gt;
&lt;li&gt;Handle rebalances cleanly with proper partition lifecycle management&lt;/li&gt;
&lt;li&gt;Choose your failure mode for broker errors&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Whether you use &lt;a href="https://github.com/vmyroslav/kafka-pulse-go" rel="noopener noreferrer"&gt;&lt;code&gt;kafka-pulse-go&lt;/code&gt;&lt;/a&gt; or implement this pattern yourself, it’s universal across languages and Kafka clients. The principle remains the same.&lt;/p&gt;

&lt;h3&gt;
  
  
  Try It Out
&lt;/h3&gt;

&lt;p&gt;The repo includes a &lt;a href="https://github.com/vmyroslav/kafka-pulse-go/tree/main/examples/sarama/testcontainers" rel="noopener noreferrer"&gt;complete working example&lt;/a&gt;. It's the fastest way to see how this works in practice.&lt;/p&gt;

</description>
      <category>go</category>
      <category>kafka</category>
      <category>distributedsystems</category>
      <category>devops</category>
    </item>
    <item>
      <title>API testing with simulation Vol.2</title>
      <dc:creator>Myroslav Vivcharyk</dc:creator>
      <pubDate>Thu, 06 Mar 2025 17:32:28 +0000</pubDate>
      <link>https://forem.com/devgeist/api-testing-with-simulation-vol2-hbb</link>
      <guid>https://forem.com/devgeist/api-testing-with-simulation-vol2-hbb</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/esprit_codes/api-testing-through-simulations-2fgf"&gt;Part 1&lt;/a&gt;, we introduced Hoverfly as a tool for API simulation and explored its basic setup. We saw how it can capture API interactions and replay them during tests, making our test suite more reliable.&lt;/p&gt;

&lt;p&gt;However, real-world applications present some &lt;strong&gt;real-world challenges&lt;/strong&gt; that go beyond those basic examples. Let's be honest — if the simple scenarios from first part were sufficient for your needs, you probably don't need this series.&lt;/p&gt;

&lt;p&gt;This part explores solutions to common challenges. We'll focus on specific scenarios that frequently appear in applications, and I believe you can adapt these examples for your own needs far beyond what we'll cover here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenge One: Dynamic Data
&lt;/h2&gt;

&lt;p&gt;Let's look at a test scenario:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"create activity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;testFunc&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ClientWithResponses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;ctx&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;id&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;title&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;randomTitleForResource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"activity"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;dueDate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PostApiV1ActivitiesWithApplicationJSONV10BodyWithResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;PostApiV1ActivitiesApplicationJSONV10RequestBody&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToPtr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToPtr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;DueDate&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToPtr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dueDate&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;Completed&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToPtr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NoError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NotNil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"expected resp, got nil"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NotNil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"expected resp body, got nil"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusOK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;dueDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UTC&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RFC3339&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DueDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UTC&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RFC3339&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Completed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With a helper function to generate random titles:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Generate a random title to test the possibility of random data in Hoverfly&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;randomTitleForResource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s-%s-title-%s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fakePrefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A couple of points that may catch our attention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Random data generation&lt;/strong&gt;: The test creates a unique random title for each test run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Current timestamps&lt;/strong&gt;: The test uses the current time as the &lt;code&gt;dueDate&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These scenarios are quite common in testing.&lt;/p&gt;

&lt;p&gt;But when we run these tests with simulations, they'll &lt;strong&gt;fail&lt;/strong&gt;. Why? Because our captured simulation contains specific values, but our test generates new ones each time. The simulation can't match the new requests to the captured responses.&lt;/p&gt;

&lt;p&gt;Similarly, look at this test with &lt;em&gt;UUIDs&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"create activity with UUID in title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;testFunc&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ClientWithResponses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c"&gt;// Create multiple activities with different UUIDs&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kt"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PostApiV1ActivitiesWithApplicationJSONV10BodyWithResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;PostApiV1ActivitiesApplicationJSONV10RequestBody&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToPtr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToPtr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="n"&gt;DueDate&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToPtr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                    &lt;span class="n"&gt;Completed&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToPtr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NoError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NotNil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NotNil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusOK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NotNil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each iteration generates a new &lt;em&gt;UUID&lt;/em&gt;, creating a unique request that won't match our captured simulation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Hoverfly Simulations
&lt;/h2&gt;

&lt;p&gt;Let's peek inside the simulation file. Hoverfly works by storing pairs of requests and responses. When it receives an incoming request, it compares it against stored requests to find a match, then returns the corresponding response. Here's what a captured request-response pair looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"matcher"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exact"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/api/v1/Activities/1"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"matcher"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exact"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"destination"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"matcher"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exact"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fakerestapi.azurewebsites.net"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"scheme"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"matcher"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exact"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"matcher"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exact"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"response"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:1,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;title&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Activity 1&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;dueDate&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;2025-03-04T23:24:16.4781493+00:00&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;completed&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:false}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"encodedBody"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"headers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Api-Supported-Versions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Content-Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"application/json; charset=utf-8; v=1.0"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"Tue, 04 Mar 2025 22:24:15 GMT"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Hoverfly"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"Was-Here"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Server"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"Kestrel"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"templated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Request Matchers
&lt;/h3&gt;

&lt;p&gt;When Hoverfly captures a request, it creates a &lt;strong&gt;Request Matcher&lt;/strong&gt; for each field in the request. A Request Matcher consists of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The request field name (path, method, destination, etc.)&lt;/li&gt;
&lt;li&gt;The type of match to use for comparison&lt;/li&gt;
&lt;li&gt;The request field value&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By default, Hoverfly sets the type of match to &lt;code&gt;exact&lt;/code&gt; for each field, which means the incoming request must exactly match the stored request for each field. This is why random data in our tests breaks the simulation—the new random values won't exactly match the captured ones.&lt;/p&gt;

&lt;p&gt;Hoverfly supports multiple matcher types including &lt;code&gt;exact&lt;/code&gt;, &lt;code&gt;glob&lt;/code&gt;, &lt;code&gt;regex&lt;/code&gt;, and more. You can find details in &lt;a href="https://docs.hoverfly.io/en/latest/pages/keyconcepts/simulations/pairs.html#request-matchers" rel="noopener noreferrer"&gt;Hoverfly's official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Response Templating
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;templated&lt;/code&gt; field in the response determines whether Hoverfly should use templating. When enabled, you can &lt;strong&gt;inject data from the request into the response&lt;/strong&gt; using templating syntax.&lt;/p&gt;

&lt;p&gt;For example, this templated response includes the URL path from the original request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"response"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You requested: {{ Request.Path.[0].Value }}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"encodedBody"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"headers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Content-Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"text/plain"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"templated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's worth noting that Hoverfly's templating is far more powerful than just simple value injection. You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use conditional logic with &lt;code&gt;{{if}}&lt;/code&gt;, &lt;code&gt;{{else}}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Extract data using &lt;code&gt;JSONPath&lt;/code&gt; or &lt;code&gt;XPath&lt;/code&gt; expressions&lt;/li&gt;
&lt;li&gt;Perform string manipulations (&lt;code&gt;substring&lt;/code&gt;, &lt;code&gt;lowercase&lt;/code&gt;, etc.)&lt;/li&gt;
&lt;li&gt;Generate random data, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When Hoverfly receives a request, it will extract the values from the request body and inject them into the response, creating a consistent experience even with random data.&lt;/p&gt;

&lt;p&gt;This is exactly what we need to solve our problems. By combining request matchers with templated responses, we can create simulations that work with any random values our tests generate.&lt;/p&gt;

&lt;p&gt;For our dynamic data problem, we'll use these features to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Change the matcher types&lt;/strong&gt; for fields with dynamic data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable response templating&lt;/strong&gt; to inject request values into responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apply these changes&lt;/strong&gt; automatically with our simulation processor&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Creating a Simulation Processor
&lt;/h2&gt;

&lt;p&gt;Now that we understand how Hoverfly processes requests and responses, let's build a solution for our problems. In my code, I'll use the internal implementations from the Hoverfly package since it's open-source. However, remember this is just a JSON file, so if you use any language other than Go, you can easily implement the same approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Goal
&lt;/h3&gt;

&lt;p&gt;Our processor needs to achieve several objectives:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identify dynamic data patterns&lt;/strong&gt; in both requests and responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replace exact matchers with matchers&lt;/strong&gt; for fields containing UUIDs, timestamps, prefixed random strings, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable templating&lt;/strong&gt; in responses to use values from the request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support static responses&lt;/strong&gt; for specific endpoints that need predetermined behavior&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We'll use a declarative configuration file to define our processing rules, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0"&lt;/span&gt;
&lt;span class="na"&gt;settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;debug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

&lt;span class="na"&gt;patterns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# All UUIDs will be replaced with the default regex pattern&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uuid"&lt;/span&gt;

  &lt;span class="c1"&gt;# Dates will be replaced&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datetime"&lt;/span&gt;
    &lt;span class="na"&gt;formats&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2006-01-02T15:04:05Z07:00"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2006-01-02"&lt;/span&gt;

  &lt;span class="c1"&gt;# Prefix matching&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prefix"&lt;/span&gt;
    &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fake-api-test-"&lt;/span&gt;
    &lt;span class="na"&gt;length&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;

&lt;span class="na"&gt;endpoints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Replace the response of the endpoint with the static_response&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET"&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/v1/Activities/30"&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
    &lt;span class="na"&gt;static_response&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"id": 77,&lt;/span&gt;
        &lt;span class="s"&gt;"title": "John Doe",&lt;/span&gt;
        &lt;span class="s"&gt;"dueDate": "2025-02-19T17:12:52.127Z",&lt;/span&gt;
        &lt;span class="s"&gt;"completed": false&lt;/span&gt;
      &lt;span class="s"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this configuration, we aim to automatically transform our simulation files to handle dynamic data without manually editing them.&lt;/p&gt;

&lt;p&gt;I don't want to go deep into the implementation details in this article (if you're interested, you can find everything in the &lt;a href="https://github.com/vmyroslav/api-test-demo/tree/part2/tools/postprocessor" rel="noopener noreferrer"&gt;code&lt;/a&gt;), so I'll just show you the main concepts.&lt;br&gt;
Here's the entry point of the processor implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;PostProcessor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;simulation&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;v2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SimulationViewV5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;simulation&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"nil simulation provided"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;simulation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestResponsePairs&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;pair&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;simulation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestResponsePairs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c"&gt;// 1. Check if this is a static endpoint rule&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpointProcessor&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpointProcessor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FindMatchingRule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpointProcessor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplyRule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"applying static response for pair %d: %w"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c"&gt;// 2. Process the pair&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processingStrategy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"processing pair %d: %w"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's skip the static endpoint for now and look closer at the processing of each pair.&lt;br&gt;
For prefixed patterns in requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;PrefixProcessor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ProcessRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replaceWith&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replaceWith&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Properly escape any regex special characters in the prefix&lt;/span&gt;
    &lt;span class="n"&gt;escapedPrefix&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;regexp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QuoteMeta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Check if we detected a resource name embedded in the value&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;resourceName&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;detectResourceName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;resourceName&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;includeResourceName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
        &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resourceName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resourceName&lt;/span&gt;

        &lt;span class="c"&gt;// Pattern with resource name included&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s%s-[a-zA-Z0-9]{%d}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;escapedPrefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resourceName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randomLength&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Standard pattern without resource name&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s[a-zA-Z0-9]{%d}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;escapedPrefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randomLength&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And for responses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;PrefixProcessor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ProcessResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;modifiedFields&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// If this field was modified in the request and no fixed replacement&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;modifiedFields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replaceWith&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{{ Request.Body 'jsonpath' '$.%s' }}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replaceWith&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replaceWith&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Changes the matcher for fields to use &lt;code&gt;regex&lt;/code&gt; matcher&lt;/li&gt;
&lt;li&gt;Uses Hoverfly's templating system to extract values from the incoming request&lt;/li&gt;
&lt;li&gt;Uses those same values in the response, maintaining consistency&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Following a similar principle, let's look at the UUID example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;UUIDProcessor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ProcessRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replaceWith&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replaceWith&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c"&gt;// Use proper regex pattern for UUIDs that will work with Hoverfly's RegexMatch&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;`[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;UUIDProcessor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ProcessResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;modifiedFields&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// If we have a fixed replacement, use it regardless of modified fields&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replaceWith&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replaceWith&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// If this field was modified in the request and modifiedFields is provided&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;modifiedFields&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;modifiedFields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{{ Request.Body 'jsonpath' '$.%s' }}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All these processors are orchestrated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Process handles both the request and response processing for a pair&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;DefaultProcessingStrategy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;v2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestMatcherResponsePairViewV5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// First try to process the request&lt;/span&gt;
    &lt;span class="n"&gt;modifiedFields&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProcessRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Then process the response&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modifiedFields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// If we modified fields in the request, process response with them&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProcessResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;modifiedFields&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// If we didn't modify any request fields (e.g., no request body or no matches),&lt;/span&gt;
    &lt;span class="c"&gt;// try to process the response independently&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processResponseForRequestsWithoutBody&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First process the request matchers, identifying dynamic fields&lt;/li&gt;
&lt;li&gt;Tracks which fields were modified in the request&lt;/li&gt;
&lt;li&gt;Process the response, applying templating where appropriate&lt;/li&gt;
&lt;li&gt;Handles special cases like requests without bodies&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: I understand that the code snippets above can be a bit complex to digest on first reading. Don't worry if you don't grasp all the implementation immediately - they're included primarily for those who want to dive deeper. If you prefer, you can focus on the overall concept rather than the specific code. The full source code is available on GitHub if you want to explore further.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To summarize the main idea: &lt;strong&gt;We modify the output JSON files, telling Hoverfly to use the request data in the response&lt;/strong&gt;. You can, of course, modify this approach to make responses with static data, random data, or whatever else you need.&lt;/p&gt;

&lt;h2&gt;
  
  
  API Retries
&lt;/h2&gt;

&lt;p&gt;Sometimes things don't work, at least not after the first attempt. Our application might need to handle temporary failures by retrying requests.&lt;br&gt;
Let's look at a test server that simulates an unreliable API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;TestServer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;handleGetActivity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attempts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Extract the id from the URL path&lt;/span&gt;
    &lt;span class="n"&gt;pathParts&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pathParts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Invalid URL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusBadRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;idStr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;pathParts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;strconv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Atoi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idStr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Invalid ID"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusBadRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Simulate different responses based on attempt number&lt;/span&gt;
    &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// First attempt - Service Unavailable&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteHeader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusServiceUnavailable&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"error"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Service Temporarily Unavailable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"code"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;"SERVER_BUSY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Second attempt - Gateway Timeout&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteHeader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusGatewayTimeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Third attempt - Success&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Content-Type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"application/json"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteHeader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusOK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{}{&lt;/span&gt;
            &lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="kt"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="s"&gt;"title"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="s"&gt;"Test Activity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"dueDate"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;"2025-02-20T09:58:59.009Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"completed"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Any subsequent attempts - Bad Request&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteHeader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusTooManyRequests&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"error"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Too many attempts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This server:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Returns a 503 (Service Unavailable) on the first attempt&lt;/li&gt;
&lt;li&gt;Returns a 504 (Gateway Timeout) on the second attempt&lt;/li&gt;
&lt;li&gt;Succeeds with a 200 (OK) on the third attempt&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By default, Hoverfly would only capture the final successful response. And of course our test will fail, but like we have a trick for that as well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enabling Stateful Capture
&lt;/h3&gt;

&lt;p&gt;To capture this sequence of responses, we need to enable &lt;strong&gt;stateful capture mode&lt;/strong&gt; in Hoverfly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hoverctl mode capture &lt;span class="nt"&gt;--stateful&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How Stateful Capture Works
&lt;/h3&gt;

&lt;p&gt;When in stateful capture mode, Hoverfly adds metadata to the captured request/response pairs to track their sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;requiresState&lt;/strong&gt; - Added to the request to indicate which state is required to match this entry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;transitionsState&lt;/strong&gt; - Added to the response to indicate how the state should change after this response&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's look at how this appears in the simulation file:&lt;/p&gt;

&lt;p&gt;Here's how the sequence works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Hoverfly automatically creates a state variable called &lt;code&gt;sequence:1&lt;/code&gt; and initializes it to "1"&lt;/li&gt;
&lt;li&gt;The first request matches when the state is "1"&lt;/li&gt;
&lt;li&gt;After serving the first response, the state transitions to "2"&lt;/li&gt;
&lt;li&gt;The second request matches when the state is "2"&lt;/li&gt;
&lt;li&gt;And so on…&lt;/li&gt;
&lt;li&gt;After serving the last response, there's no further transition defined&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Testing with Stateful Simulations
&lt;/h3&gt;

&lt;p&gt;When we run tests against this simulation, Hoverfly will play back the sequence of responses in order, perfectly simulating our retry scenario:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestActivitiesWithRetries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;serverUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LookupEnv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"API_SERVER_URL"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;serverUrl&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTestServer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c"&gt;// Start our test server&lt;/span&gt;
        &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;serverUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;httpClient&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewHttpClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;NewClientWithResponses&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serverUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WithHTTPClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;httpClient&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatalf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failed to inistialize client: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// Reset the server state&lt;/span&gt;
        &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serverUrl&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;"/reset"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"application/json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;

    &lt;span class="n"&gt;scenarios&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;testCase&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"create activity with retries"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;testFunc&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ClientWithResponses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;ctx&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                    &lt;span class="n"&gt;id&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;7474&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Test Activity"&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;lastErr&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetApiV1ActivitiesIdWithResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="n"&gt;lastErr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
                        &lt;span class="k"&gt;continue&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;

                    &lt;span class="c"&gt;// Success!&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusOK&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NotNil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="k"&gt;return&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;

                    &lt;span class="c"&gt;// Continue if it's a retryable status&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusServiceUnavailable&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
                        &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusGatewayTimeout&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="k"&gt;continue&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;

                    &lt;span class="c"&gt;// Non-retryable error&lt;/span&gt;
                    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatalf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Got non-retryable status: %d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;

                &lt;span class="c"&gt;// If we got here, all retries failed&lt;/span&gt;
                &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatalf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"All retries failed. Last error: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lastErr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tt&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;scenarios&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;tt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;testFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Benefits of Stateful Simulation
&lt;/h3&gt;

&lt;p&gt;Using stateful capture may provide several benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Test retry mechanisms properly&lt;/strong&gt; - Verify that your client correctly handles temporary errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simulate workflows&lt;/strong&gt; - Test multi-step processes that change state with each request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reproduce edge cases&lt;/strong&gt; - Capture and replay rare error conditions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By combining our pattern processors with stateful capture, we gain the ability to test both dynamic data and complex stateful interactions without changing our test code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Static Response Replacement
&lt;/h2&gt;

&lt;p&gt;Sometimes you want to provide a completely different response for specific endpoints. This is useful for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Creating test-specific behavior&lt;/strong&gt; - Returning predefined responses for certain test cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handling edge cases&lt;/strong&gt; - Simulating error conditions or special scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintaining consistent test data&lt;/strong&gt; - Ensuring tests run with known values&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our processor supports this through the endpoints section in the configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;endpoints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET"&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/v1/Activities/30"&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
    &lt;span class="na"&gt;static_response&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"id": 77,&lt;/span&gt;
        &lt;span class="s"&gt;"title": "John Doe",&lt;/span&gt;
        &lt;span class="s"&gt;"dueDate": "2025-02-19T17:12:52.127Z",&lt;/span&gt;
        &lt;span class="s"&gt;"completed": false&lt;/span&gt;
      &lt;span class="s"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This completely overrides any captured response for the specified endpoint, giving you precise control over the behavior in your tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We've explored several powerful techniques for API testing with simulations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Understanding Hoverfly's matching system&lt;/strong&gt; - Learning how request matchers work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handling dynamic data&lt;/strong&gt; - Making simulations work with UUIDs, random strings, and timestamps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing retry logic&lt;/strong&gt; - Using stateful capture to simulate unreliable APIs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static response replacement&lt;/strong&gt; - Overriding specific endpoints for test scenarios&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Of course, this approach may not be for everyone. If you require much more flexibility or you're starting a new project, then tools like &lt;a href="https://wiremock.org/docs/" rel="noopener noreferrer"&gt;WireMock&lt;/a&gt; might be a better choice. But there are specific use cases where this approach is particularly helpful and more effective than alternatives, making Hoverfly a useful tool to have in your testing arsenal.&lt;/p&gt;

&lt;p&gt;In the final part of this series, we'll explore integrating these tools into &lt;strong&gt;CI&lt;/strong&gt; pipelines and development workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Source Code
&lt;/h2&gt;

&lt;p&gt;The complete source code for this solution is available on &lt;a href="https://github.com/vmyroslav/api-test-demo/tree/part2" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>go</category>
      <category>api</category>
      <category>testing</category>
      <category>hoverfly</category>
    </item>
    <item>
      <title>API testing through simulations</title>
      <dc:creator>Myroslav Vivcharyk</dc:creator>
      <pubDate>Sun, 02 Mar 2025 14:36:41 +0000</pubDate>
      <link>https://forem.com/devgeist/api-testing-through-simulations-2fgf</link>
      <guid>https://forem.com/devgeist/api-testing-through-simulations-2fgf</guid>
      <description>&lt;h2&gt;
  
  
  The Challenge of API Testing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;API testing&lt;/strong&gt; is a challenging aspect of software development. While there are many approaches available — from mocking to contract testing — each comes with its own set of trade-offs.&lt;br&gt;
Rather than debating the pros and cons of different testing approaches (&lt;em&gt;feel free to draw your testing pyramids in the comments&lt;/em&gt;), this series of articles focuses on a specific scenario: &lt;strong&gt;testing against real APIs&lt;/strong&gt;. Whether it’s production, staging, or a sandbox environment, we’ll explore how to improve tests that interact with actual API endpoints.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Real-World Context
&lt;/h2&gt;

&lt;p&gt;Let’s face it: many development teams find themselves deeply integrated with real APIs — &lt;em&gt;and that’s normal&lt;/em&gt;. Instead of suggesting a complete architectural overhaul or debating how we arrived at this point, let’s focus on practical improvements to your existing setup.&lt;br&gt;
When interacting with real APIs, developers often face several critical challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network Reliability&lt;/strong&gt;: Random connection errors and timeouts leading to flaky tests&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Latency&lt;/strong&gt;: API calls taking seconds (or even minutes!) to complete, significantly slowing down test execution&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resource Consumption&lt;/strong&gt;: Every test run potentially consuming API quotas or incurring actual costs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Environment Stability&lt;/strong&gt;: External API availability and rate limits affecting test reliability&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  A Practical Solution
&lt;/h2&gt;

&lt;p&gt;We’ll explore &lt;a href="https://docs.hoverfly.io/en/latest/" rel="noopener noreferrer"&gt;Hoverfly&lt;/a&gt; — an open-source API simulation tool that deserves more attention. Despite its capabilities, comprehensive examples and real-world use cases can be hard to find — &lt;em&gt;which is precisely why we’re writing this series&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Hoverfly is a lightweight, open source API simulation tool. Using Hoverfly, you can create realistic simulations of the APIs your application depends on.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;What does this mean in practice?&lt;/em&gt; We can run our tests against a real API once (or periodically), record all interactions, and use these recordings for future test runs. This gives us the benefits of both worlds: real API behavior with predictable test data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Want to dive straight into the code? Check out the complete &lt;a href="https://github.com/vmyroslav/api-test-demo/tree/part1" rel="noopener noreferrer"&gt;source code here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Docker and Docker Compose&lt;/li&gt;
&lt;li&gt;Go 1.23 or later&lt;/li&gt;
&lt;li&gt;Task (&lt;a href="https://taskfile.dev" rel="noopener noreferrer"&gt;taskfile&lt;/a&gt;) for running our commands&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Setting Up Our Test Environment
&lt;/h2&gt;

&lt;p&gt;First, let’s set up our development environment. We’ll use &lt;a href="https://github.com/oapi-codegen/oapi-codegen" rel="noopener noreferrer"&gt;OpenAPI Generator&lt;/a&gt; within Docker to generate our HTTP client, keeping our local environment clean. For this demonstration, we’ll work with &lt;a href="https://fakerestapi.azurewebsites.net/index.html" rel="noopener noreferrer"&gt;fakerestapi.azurewebsites.net&lt;/a&gt; (&lt;em&gt;kudos to the maintainers of public APIs&lt;/em&gt;).&lt;br&gt;
Here’s our Docker configuration for the client generator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; golang:1.23-alpine&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;apk add &lt;span class="nt"&gt;--no-cache&lt;/span&gt; git

&lt;span class="k"&gt;RUN &lt;/span&gt;go &lt;span class="nb"&gt;install &lt;/span&gt;github.com/oapi-codegen/oapi-codegen/v2/cmd/oapi-codegen@v2.4.1

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["oapi-codegen"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the corresponding docker-compose service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;oapi-generator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;oapi-generator&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
      &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./docker/oapi.Dockerfile&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;.:/app&lt;/span&gt;
    &lt;span class="na"&gt;working_dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-package"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;oapi"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-generate"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;types,client"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-o"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./client/oapi/client.go"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./api/test_api.json"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Writing First Tests
&lt;/h2&gt;

&lt;p&gt;With our client generated, let’s write some basic tests to demonstrate API interactions. We’ll focus on the Authors endpoint for this example (but you can more in the &lt;a href="https://github.com/vmyroslav/api-test-demo/blob/part1/client/oapi/client_test.go" rel="noopener noreferrer"&gt;source&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestAuthors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;scenarios&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;testCase&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"get author"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;testFunc&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ClientWithResponses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
     &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
     &lt;span class="n"&gt;id&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetApiV1AuthorsIdWithResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NoError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NotNil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"expected resp, got nil"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NotNil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"expected resp body, got nil"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusOK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NotEmpty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FirstName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"expected non-empty first name"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NotEmpty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LastName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"expected non-empty last name"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"create new author"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;testFunc&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ClientWithResponses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
     &lt;span class="n"&gt;ctx&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
     &lt;span class="n"&gt;id&lt;/span&gt;        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="n"&gt;firstName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;randomTitleForResource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"author"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="n"&gt;lastName&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;randomTitleForResource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"author"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PostApiV1AuthorsWithApplicationJSONV10BodyWithResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
     &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="n"&gt;PostApiV1AuthorsApplicationJSONV10RequestBody&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToPtr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="n"&gt;FirstName&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToPtr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;firstName&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="n"&gt;LastName&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToPtr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lastName&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NoError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NotNil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"expected resp, got nil"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NotNil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"expected resp body, got nil"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusOK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;firstName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FirstName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lastName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LastName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;setupClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

 &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tt&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;scenarios&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;tt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;testFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;setupClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ClientWithResponses&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;httpClient&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultClient&lt;/span&gt;
 &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;NewClientWithResponses&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WithHTTPClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;httpClient&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatalf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failed to inistialize client: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Enter Hoverfly: Setting Up API Simulation
&lt;/h2&gt;

&lt;p&gt;The core concept in Hoverfly is “simulations” — JSON files that store request-response pairs from your API interactions. Let’s set up Hoverfly in a Docker container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; alpine:3.21.0&lt;/span&gt;

&lt;span class="c"&gt;# Packages&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apk add &lt;span class="nt"&gt;--no-cache&lt;/span&gt; wget unzip curl

&lt;span class="c"&gt;# Set default arguments&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; HOVERFLY_VERSION="v1.10.6"&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; HOVERFLY_ADMIN_PORT=8888&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; HOVERFLY_PROXY_PORT=8500&lt;/span&gt;

&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; HOVERFLY_ADMIN_PORT=${HOVERFLY_ADMIN_PORT}&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; HOVERFLY_PROXY_PORT=${HOVERFLY_PROXY_PORT}&lt;/span&gt;

&lt;span class="c"&gt;# Download and install both hoverfly and hoverctl&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;wget &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"https://github.com/SpectoLabs/hoverfly/releases/download/v&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HOVERFLY_VERSION&lt;/span&gt;&lt;span class="p"&gt;#v&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/hoverfly_bundle_linux_amd64.zip"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    unzip hoverfly_bundle_linux_amd64.zip &lt;span class="nt"&gt;-d&lt;/span&gt; /tmp &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;mv&lt;/span&gt; /tmp/hoverfly /usr/local/bin/ &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;mv&lt;/span&gt; /tmp/hoverctl /usr/local/bin/ &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;chmod&lt;/span&gt; +x /usr/local/bin/hoverfly &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;chmod&lt;/span&gt; +x /usr/local/bin/hoverctl &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; hoverfly_bundle_linux_amd64.zip /tmp/&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="c"&gt;# Create default hoverctl config with environment variables&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /root/.hoverfly &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"hoverfly.host: localhost"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /root/.hoverfly/config.yaml &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"hoverfly.admin.port: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HOVERFLY_ADMIN_PORT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /root/.hoverfly/config.yaml &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"hoverfly.proxy.port: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HOVERFLY_PROXY_PORT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /root/.hoverfly/config.yaml

&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; ${HOVERFLY_PROXY_PORT} ${HOVERFLY_ADMIN_PORT}&lt;/span&gt;

&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["hoverfly", "-listen-on-host=0.0.0.0"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add new docker-compose service to our configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;hoverfly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-test-demo-hoverfly&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./docker&lt;/span&gt;
      &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hoverfly.Dockerfile&lt;/span&gt;
      &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;HOVERFLY_VERSION=1.10.6&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;HOVERFLY_ADMIN_PORT=${HOVERFLY_ADMIN_PORT:-8888}&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;HOVERFLY_PROXY_PORT=${HOVERFLY_PROXY_PORT:-8500}&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${HOVERFLY_HOST_ADMIN_PORT:-8888}:${HOVERFLY_ADMIN_PORT:-8888}"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${HOVERFLY_HOST_PROXY_PORT:-8500}:${HOVERFLY_PROXY_PORT:-8500}"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./testdata/hoverfly:/testdata/hoverfly&lt;/span&gt;
    &lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;curl"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-f"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:${HOVERFLY_ADMIN_PORT:-8888}/api/health"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1s&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3s&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This setup provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Hoverfly instance&lt;/li&gt;
&lt;li&gt;Web UI for monitoring (port 8888)&lt;/li&gt;
&lt;li&gt;Proxy server for API interception (port 8500)&lt;/li&gt;
&lt;li&gt;Persistent storage for our simulations
The last step would be adding proxy for our HTTP client configuration. For this we need to add a tiny change:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;NewHttpClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Helper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;hoverflyAddr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"HOVERFLY_PROXY"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;hoverflyAddr&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultClient&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// We will run tests in parallel only if we are using Hoverfly.&lt;/span&gt;
    &lt;span class="c"&gt;// Just to avoid redundant load for public API.&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parallel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Timeout&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Transport&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Transport&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Proxy&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProxyURL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;Scheme&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"http"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;Host&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;hoverflyAddr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}),&lt;/span&gt;
            &lt;span class="n"&gt;TLSClientConfig&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;tls&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;InsecureSkipVerify&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c"&gt;// Skip certificate verification when using Hoverfly&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Recording and Replaying API Interactions
&lt;/h2&gt;

&lt;p&gt;Using task definitions, we can capture API interactions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;test:local:capture&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;desc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests on local machine and capture traffic with Hoverfly. Export simulation to file.&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;HOVERFLY_PROXY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{.HOVERFLY_PROXY}}'&lt;/span&gt;
    &lt;span class="na"&gt;vars&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;sh&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;date +%Y%m%d_%H%M%S&lt;/span&gt;
    &lt;span class="na"&gt;cmds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;defer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;hoverfly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;stop&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;internal:bootstrap&lt;/span&gt;
        &lt;span class="na"&gt;vars&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;CLI_ARGS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hoverfly&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hoverfly:mode&lt;/span&gt;
        &lt;span class="na"&gt;vars&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;MODE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;capture&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test:local&lt;/span&gt;
        &lt;span class="na"&gt;vars&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;HOVERFLY_PROXY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost:{{.HOVERFLY_PORT&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;8500}}"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hoverfly:export&lt;/span&gt;
        &lt;span class="na"&gt;vars&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;SIMULATIONS_DIR&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{.SIMULATIONS_DIR}}'&lt;/span&gt;
          &lt;span class="na"&gt;SIMULATION_FILE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{.TIMESTAMP}}'&lt;/span&gt;

&lt;span class="err"&gt;   &lt;/span&gt;&lt;span class="na"&gt;test:local&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;desc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;Run all tests on local machine. By default, tests are run against real API.&lt;/span&gt;
      &lt;span class="s"&gt;Optionally use Hoverfly for simulation or capturing by setting HOVERFLY_PROXY environment variable and running hoverfly service.&lt;/span&gt;
    &lt;span class="na"&gt;deps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;internal:check-gotestsum&lt;/span&gt;
    &lt;span class="na"&gt;cmds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PATH="$(go env GOPATH)/bin:$PATH" gotestsum&lt;/span&gt;
        &lt;span class="s"&gt;--format pkgname&lt;/span&gt;
        &lt;span class="s"&gt;--hide-summary=skipped&lt;/span&gt;
        &lt;span class="s"&gt;--&lt;/span&gt;
        &lt;span class="s"&gt;-json ./...&lt;/span&gt;
        &lt;span class="s"&gt;-race&lt;/span&gt;
        &lt;span class="s"&gt;-timeout=30s&lt;/span&gt;
        &lt;span class="s"&gt;-count=1&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;HOVERFLY_PROXY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{.HOVERFLY_PROXY&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;""}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we run this task, it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Starts Hoverfly container&lt;/li&gt;
&lt;li&gt;Enables capture mode&lt;/li&gt;
&lt;li&gt;Runs tests through the proxy&lt;/li&gt;
&lt;li&gt;Saves interactions to a JSON file&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Running Simulations
&lt;/h2&gt;

&lt;p&gt;Once we’ve captured our API interactions, we can run tests using the simulated responses. Here’s our task definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;test:local:simulate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;desc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests locally simulating the API with Hoverfly&lt;/span&gt;
    &lt;span class="na"&gt;deps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;internal:bootstrap&lt;/span&gt;
        &lt;span class="na"&gt;vars&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;CLI_ARGS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hoverfly&lt;/span&gt;
    &lt;span class="na"&gt;vars&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;LATEST_SIM&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;sh&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;if ! SIM=$(task internal:find:latest:simulation SIMULATIONS_DIR={{.SIMULATIONS_DIR}} EXPIRED_AFTER_DAYS={{.EXPIRED_AFTER_DAYS}}); then&lt;/span&gt;
            &lt;span class="s"&gt;echo ""&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "$SIM"&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;
    &lt;span class="na"&gt;preconditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;sh&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-f&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"{{.LATEST_SIM}}"&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;]'&lt;/span&gt;
        &lt;span class="na"&gt;msg&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Simulation&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;does&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;exist.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Please&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;run&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'task&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;test:local:capture'&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;first"&lt;/span&gt;
    &lt;span class="na"&gt;cmds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hoverfly:mode&lt;/span&gt;
        &lt;span class="na"&gt;vars&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;MODE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;simulate&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hoverfly:import&lt;/span&gt;
        &lt;span class="na"&gt;vars&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;SIMULATION_FILE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{.LATEST_SIM}}'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test:local&lt;/span&gt;
        &lt;span class="na"&gt;vars&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;HOVERFLY_PROXY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost:{{.HOVERFLY_PORT&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;8500}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The workflow is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Hoverfly starts in simulate mode&lt;/li&gt;
&lt;li&gt;We load our previously captured simulation file (&lt;em&gt;the task automatically finds the most recent one&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Tests run against Hoverfly instead of the real API&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Validating Our Setup
&lt;/h2&gt;

&lt;p&gt;To verify that our simulation is actually working (and not secretly hitting the real API), let’s intentionally break one of our tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestAuthors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;scenarios&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;testCase&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"get author"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;testFunc&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ClientWithResponses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                    &lt;span class="n"&gt;id&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetApiV1AuthorsIdWithResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NoError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NotNil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"expected resp, got nil"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NotNil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"expected resp body, got nil"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusOK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

                &lt;span class="c"&gt;// Intentionally incorrect assertion - the real API returns id=1&lt;/span&gt;
                &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ApplicationjsonV10200&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This failure confirms three points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Isolation&lt;/strong&gt;: Tests are running against simulated responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt;: Simulations faithfully reflect real API behavior&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation&lt;/strong&gt;: Test assertions work correctly against simulated data&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Let’s recap what we’ve built: a testing infrastructure that combines real API testing with simulation capabilities. The best part? We achieved this with minimal changes to our existing codebase. No need for extensive refactoring or rewriting tests — we simply added a proxy configuration to our HTTP client and let Hoverfly handle the rest.&lt;br&gt;
The workflow is straightforward — we write our tests as usual, but run them through a Hoverfly proxy server. Hoverfly captures API interactions and stores them as simulation files. For subsequent test runs, Hoverfly can replay these stored responses instead of hitting the real API.&lt;br&gt;
This setup gives us the benefits of both worlds — we’re testing against real API behavior, but with the reliability of local tests. No more flaky tests due to network issues, no more burning through API quotas during development, and most importantly, consistent test execution across all environments.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 A Note on Performance Expectations&lt;/p&gt;

&lt;p&gt;While Hoverfly is excellent for reliability and simulation, it’s important to manage your expectations. If your API typically responds within 150–200ms, you might not see significant speed improvements in your tests.&lt;br&gt;
Choose Hoverfly for reliability and consistent behavior first, speed improvements second.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Looking Ahead
&lt;/h2&gt;

&lt;p&gt;In the next part of this series, we’ll explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Data&lt;/strong&gt;: Handling variable responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD Integration&lt;/strong&gt;: Seamless pipeline implementation&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>go</category>
      <category>api</category>
      <category>testing</category>
      <category>hoverfly</category>
    </item>
  </channel>
</rss>
