<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Vishal C. Chaliya</title>
    <description>The latest articles on Forem by Vishal C. Chaliya (@crvishal).</description>
    <link>https://forem.com/crvishal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3765922%2F8d9e9766-78e0-444e-9530-552dc0e41e90.jpg</url>
      <title>Forem: Vishal C. Chaliya</title>
      <link>https://forem.com/crvishal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/crvishal"/>
    <language>en</language>
    <item>
      <title>Mastering the Saga Microservice Pattern in Event-Driven Systems</title>
      <dc:creator>Vishal C. Chaliya</dc:creator>
      <pubDate>Sat, 02 May 2026 06:47:33 +0000</pubDate>
      <link>https://forem.com/crvishal/mastering-the-saga-microservice-pattern-in-event-driven-systems-2h5j</link>
      <guid>https://forem.com/crvishal/mastering-the-saga-microservice-pattern-in-event-driven-systems-2h5j</guid>
      <description>&lt;p&gt;When you send money overseas to a friend or a family member you just tap and the money is sent, it feels instantaneous, but behind the scenes, a complex dance of microservices ensures that the transaction is a success. Let's explore a common scenario in a cross-border transaction and see how different microservices coordinate this intricate process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Cross-Border Payment Journey&lt;/strong&gt;&lt;br&gt;
Inside a financial institution, say a bank or a currency-exchange partner, a series of specialized microservices comes alive when you initiate a transfer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;br&gt;
A Payment Gateway acknowledges your request, passing it to the Currency Conversion service, which determines the optimal exchange rate. After conversion, the transaction must pass through the Compliance service to meet international regulations and through Fraud Detection to ensure legitimacy. Finally, the Payment Execution service processes the transaction, and a Notification service confirms the transfer to your friend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Complexity of Distributed Systems&lt;/strong&gt;&lt;br&gt;
In the flow above, what if Fraud Detection flags the transaction, or Currency Conversion fails? In a monolithic system, rolling back is straightforward, but a distributed architecture complicates things. This is where the saga pattern can save your life, coordinating the services without a central controller, much like a flock of birds flying in formation or emergency responders reacting to a radio call, each playing their part independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding the Saga Pattern&lt;/strong&gt;&lt;br&gt;
The saga pattern manages distributed transactions by allowing each microservice to execute its part of the process independently and react to failures with compensation logic. Each service performs local transactions and is responsible for its "oops, let's fix that" plan, ensuring system stability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Cross-Border Payment: The Saga Approach&lt;/strong&gt;&lt;br&gt;
In a distributed Saga, the "perfect" flow is broken into a chain of independent local transactions. Instead of one giant lock, each service commits its own work immediately and then shouts to the next service: "I’m done, your turn!"&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How it works: Choreography&lt;/em&gt;&lt;br&gt;
In our payment journey, the Currency Conversion service doesn't wait for permission. It locks in the exchange rate, updates its own database, and emits a RateConverted event. The Compliance service, which has been "listening" for that specific event, wakes up and begins its check.&lt;/p&gt;

&lt;p&gt;This creates a Choreography: a decentralized dance where no single "boss" directs the flow. Like a jazz ensemble, each microservice knows the "rhythm" (the sequence of events) and improvises its part when it hears the right cue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "Safety Net": Compensating Transactions&lt;/strong&gt;&lt;br&gt;
The true power of a Saga is how it handles failure. If the Fraud Detection service flags the transfer as suspicious, it emits a FraudDetected event. Because there is no "Undo" button in a distributed system, the previous services must execute compensating transactions.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For example:&lt;br&gt;
Currency Conversion sees the fraud event and automatically executes a reversal to release the held funds at the original rate.&lt;/p&gt;

&lt;p&gt;Payment Gateway receives the failure and updates your dashboard to "Rejected," triggering a refund if necessary.&lt;/p&gt;
&lt;/blockquote&gt;
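&lt;p&gt;To make the choreography and its compensation concrete, here is a minimal sketch using a tiny in-memory event bus. This is purely illustrative: the bus stands in for a broker like Kafka, and the service and event names follow the example above.&lt;/p&gt;

```python
from collections import defaultdict

# Minimal in-memory event bus: a stand-in for a broker such as Kafka.
class EventBus:
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self.handlers[event_type]:
            handler(payload)

bus = EventBus()
log = []

# Currency Conversion: does its local work, then announces it.
def convert_currency(payload):
    log.append(("rate_locked", payload["tx_id"]))
    bus.publish("RateConverted", payload)

# Compensating transaction: release the held funds if fraud is flagged.
def reverse_conversion(payload):
    log.append(("rate_released", payload["tx_id"]))

# Fraud Detection: listens for the conversion, flags this transfer.
def check_fraud(payload):
    if payload.get("suspicious"):
        bus.publish("FraudDetected", payload)

bus.subscribe("PaymentRequested", convert_currency)
bus.subscribe("RateConverted", check_fraud)
bus.subscribe("FraudDetected", reverse_conversion)

bus.publish("PaymentRequested", {"tx_id": "tx-1", "suspicious": True})
# log now shows the local work and its compensation:
# [("rate_locked", "tx-1"), ("rate_released", "tx-1")]
```

&lt;p&gt;Notice that no service calls another directly; each only reacts to events it subscribed to, which is exactly the choreography idea.&lt;/p&gt;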

&lt;p&gt;&lt;strong&gt;Real-World Challenges&lt;/strong&gt;&lt;br&gt;
The saga pattern sounds good on paper, but implementing it in production introduces challenges. Debugging distributed systems is complex, akin to piecing together a mystery novel from scattered pages, and smooth scaling is a constant concern. Because there is no central log, you must rely on Distributed Tracing to follow a single transaction across five different services.&lt;/p&gt;

&lt;p&gt;Furthermore, developers must ensure Idempotency—the guarantee that if a service receives the same "Failure" event twice, it doesn't accidentally execute the compensation twice. Managing the load on event brokers like Kafka and handling "out-of-order" events are the technical taxes you pay for such high resilience.&lt;/p&gt;
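&lt;p&gt;A small sketch of what idempotency means in practice: the handler records event IDs it has already seen, so a redelivered failure event becomes a no-op. The event IDs here are hypothetical, and a real system would persist the seen-set durably rather than keep it in memory.&lt;/p&gt;

```python
# Idempotent compensation handler: processing the same event twice
# must have the same effect as processing it once.
processed = set()
refunds_issued = []

def handle_failure_event(event):
    if event["event_id"] in processed:
        return  # duplicate delivery: safely ignored
    processed.add(event["event_id"])
    refunds_issued.append(event["tx_id"])

evt = {"event_id": "evt-42", "tx_id": "tx-1"}
handle_failure_event(evt)
handle_failure_event(evt)  # redelivered by the broker
# refunds_issued == ["tx-1"]  -- compensation ran exactly once
```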

&lt;p&gt;&lt;strong&gt;Embracing a New Perspective&lt;/strong&gt;&lt;br&gt;
The saga pattern is not about avoiding failures but embracing them and designing for graceful recovery. This approach shifts focus from preventing errors to effectively managing them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trade-Offs and Decisions&lt;/strong&gt;&lt;br&gt;
Is the saga pattern always the best choice? Not necessarily. For applications needing real-time responses or where compensating actions are intricate, sagas might add unwarranted complexity. Eventual consistency is a trade-off, and if your business can't accept it, sagas might not be suitable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion: Crafting a Harmonious System&lt;/strong&gt;&lt;br&gt;
The saga microservice pattern acts as a safety net for distributed systems, allowing each service to function independently while preserving overall process integrity. It's not a panacea, but when applied judiciously, it turns a chaotic orchestra into a harmonious symphony. The next time you send money globally, remember the saga pattern quietly ensuring smooth operations behind the scenes. For engineers, embracing this complexity is part of the adventure.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>eventdriven</category>
      <category>systemdesign</category>
      <category>designpatterns</category>
    </item>
    <item>
      <title>When do companies actually adopt Kafka / event-driven architecture?</title>
      <dc:creator>Vishal C. Chaliya</dc:creator>
      <pubDate>Sat, 28 Feb 2026 17:04:19 +0000</pubDate>
      <link>https://forem.com/crvishal/when-do-companies-actually-adopt-kafka-event-driven-architecture-56lf</link>
      <guid>https://forem.com/crvishal/when-do-companies-actually-adopt-kafka-event-driven-architecture-56lf</guid>
      <description>&lt;p&gt;I’ve been spending a lot time learning more about Kafka, streaming systems, CDC, and event-driven architecture.&lt;/p&gt;

&lt;p&gt;It’s really interesting — but I’m trying to figure out whether this specialization actually makes sense as a service offering.&lt;/p&gt;

&lt;p&gt;At what point does a team say, “Okay, we need Kafka now”?&lt;/p&gt;

&lt;p&gt;From what I’ve seen, early-stage startups usually try to keep things simple (no CDC, no Kafka, no microservices).&lt;br&gt;
On the other hand, larger companies often already have dedicated teams and established infrastructure.&lt;br&gt;
So I’m curious: does anyone actually hire a Kafka specialist, or do they just hire full-time employees?&lt;br&gt;
If you’ve worked at a company that adopted Kafka or event-driven systems:&lt;br&gt;
What triggered it?&lt;br&gt;
Was it traffic growth?&lt;br&gt;
Microservices getting messy?&lt;br&gt;
Data consistency issues?&lt;br&gt;
Analytics or integration needs?&lt;br&gt;
Something breaking in production?&lt;/p&gt;

&lt;p&gt;And when that complexity showed up, how did your team handle it?&lt;br&gt;
Did you grow the expertise internally?&lt;br&gt;
Hire someone specifically for it?&lt;br&gt;
Bring in outside help?&lt;br&gt;
Or just let backend engineers figure it out over time?&lt;/p&gt;

&lt;p&gt;I’m not asking whether specialization helps land a job — I understand that it can.&lt;br&gt;
I’m asking whether, from a business standpoint, there’s a sustainable niche for independent specialists in streaming architecture. Or is this almost always something companies internalize once they’re big enough?&lt;/p&gt;

&lt;p&gt;I’d really appreciate any insight on this topic.&lt;br&gt;
Thank you in advance.&lt;/p&gt;

</description>
      <category>discuss</category>
    </item>
    <item>
      <title>What is the Outbox Pattern? Solving a Nightmare</title>
      <dc:creator>Vishal C. Chaliya</dc:creator>
      <pubDate>Tue, 24 Feb 2026 10:46:49 +0000</pubDate>
      <link>https://forem.com/crvishal/what-is-the-outbox-pattern-solving-a-nightmare-93b</link>
      <guid>https://forem.com/crvishal/what-is-the-outbox-pattern-solving-a-nightmare-93b</guid>
      <description>&lt;p&gt;Before we define it, let’s understand the nightmare it was designed to solve:&lt;br&gt;
&lt;strong&gt;Why does this design pattern exist in the first place?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s take a simple example.&lt;/p&gt;

&lt;p&gt;Suppose you’re in the mood to watch a movie and relax.&lt;br&gt;
You’ve got your popcorn, your comfy sofa, and Netflix open.&lt;/p&gt;

&lt;p&gt;You log in. Netflix already recommends movies based on your watch history.&lt;br&gt;
You select one, press “Watch Now,” and the movie starts streaming.&lt;/p&gt;

&lt;p&gt;That’s the ideal scenario.&lt;/p&gt;

&lt;p&gt;But behind that smooth experience, what is actually happening?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Imagine You’re Designing This System&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You are responsible for two critical operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Save the selected movie in the database 
(Very important — so Netflix can improve recommendations using behavioral data.)&lt;/li&gt;
&lt;li&gt;Notify another service to start streaming the movie
(By emitting an event like MovieStarted.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Seems simple.&lt;/p&gt;

&lt;p&gt;But here’s where distributed systems start laughing at you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Real Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The database save succeeds&lt;/li&gt;
&lt;li&gt;But the event emission fails?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or worse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The event is emitted&lt;/li&gt;
&lt;li&gt;But the database transaction fails?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now your system is inconsistent.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The recommendation service thinks the user watched the movie.&lt;br&gt;
  The streaming service thinks nothing happened.&lt;br&gt;
  Or the opposite.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is called the &lt;strong&gt;Dual Write Problem&lt;/strong&gt; (the nightmare we need to solve).&lt;/p&gt;

&lt;p&gt;You are writing to two different systems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A relational database (ACID guarantees — atomicity, consistency, isolation, durability)&lt;/li&gt;
&lt;li&gt;A message broker (asynchronous, eventually consistent)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And there is no single atomic transaction spanning both.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No shared commit boundary.&lt;/li&gt;
&lt;li&gt;No guaranteed consistency.&lt;/li&gt;
&lt;li&gt;No safety.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enter the &lt;strong&gt;Transactional Outbox Pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The idea is simple but powerful.&lt;/p&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Save to DB → Emit event to broker&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You do:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Save business data&lt;br&gt;
→ Insert event into OUTBOX table&lt;br&gt;
→ Commit transaction&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now both operations happen inside the same database transaction: both succeed or both fail.&lt;/p&gt;

&lt;p&gt;If the commit fails → nothing is persisted.&lt;br&gt;
If it commits → both the state change and the event record are durable.&lt;/p&gt;

&lt;p&gt;This solves the atomicity problem.&lt;/p&gt;
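&lt;p&gt;Here is a minimal sketch of that single-transaction write, using SQLite as a stand-in for the relational database. Table and column names are illustrative, not from any specific framework.&lt;/p&gt;

```python
import sqlite3, json

# Business table plus outbox table, in one database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE watch_history (user_id TEXT, movie_id TEXT)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, "
           "event_type TEXT, payload TEXT, processed INTEGER DEFAULT 0)")

def start_movie(user_id, movie_id):
    # Both writes share one transaction: they commit or roll back together.
    with db:
        db.execute("INSERT INTO watch_history VALUES (?, ?)",
                   (user_id, movie_id))
        db.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                   ("MovieStarted",
                    json.dumps({"user": user_id, "movie": movie_id})))

start_movie("u1", "inception")
rows = db.execute("SELECT event_type, processed FROM outbox").fetchall()
# rows == [("MovieStarted", 0)]  -- event recorded, awaiting relay
```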

&lt;p&gt;&lt;strong&gt;Ok, so now two tables have the required data — what changed?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The change is that the event is no longer sent directly to Kafka.&lt;br&gt;
Instead:&lt;/p&gt;

&lt;p&gt;You can now poll the outbox table, read the unprocessed events, emit them to the message broker, and then mark them as processed.&lt;/p&gt;

&lt;p&gt;Or you can use CDC (Change Data Capture) on the outbox table so that it directly captures database changes (from WAL/binlog) and emits them to the message broker automatically.&lt;/p&gt;

&lt;p&gt;Or you can even introduce an entirely separate service dedicated to handling this responsibility.&lt;/p&gt;

&lt;p&gt;We removed distributed transactions (2PC) and still preserved atomicity between state change and event creation.&lt;/p&gt;
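&lt;p&gt;The polling option can be sketched in a few lines. The broker is faked with a list here; in a real deployment this loop would run on a schedule and publish via a Kafka producer. Note the window between publishing and marking the row processed: a crash in that window is exactly what produces duplicates.&lt;/p&gt;

```python
import sqlite3

# Outbox table pre-seeded with one unprocessed event.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, "
           "event_type TEXT, processed INTEGER DEFAULT 0)")
db.execute("INSERT INTO outbox (event_type) VALUES ('MovieStarted')")
db.commit()

broker = []  # stand-in for the message broker

def poll_outbox_once():
    rows = db.execute(
        "SELECT id, event_type FROM outbox WHERE processed = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type in rows:
        broker.append(event_type)  # publish step
        with db:                   # a crash between these two steps
            db.execute(            # re-publishes the event on restart
                "UPDATE outbox SET processed = 1 WHERE id = ?", (row_id,))

poll_outbox_once()
poll_outbox_once()  # nothing left to publish on the second pass
# broker == ["MovieStarted"]
```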

&lt;p&gt;&lt;strong&gt;So You Solved the Problem and Saved Your Job… But What If?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The event is published, the broker acknowledges it, but the application crashes before marking the event as processed. After restart → the event is published again.
Now you have duplicates.&lt;/li&gt;
&lt;li&gt;In a horizontally scaled system, multiple instances poll the same outbox table and the same event is picked more than once.&lt;/li&gt;
&lt;li&gt;The DB transaction commits, the event is emitted, but the broker crashes before persisting it. Or a network timeout occurs and you don’t know whether the publish succeeded. You retry — and create duplicates.&lt;/li&gt;
&lt;li&gt;You have a high-throughput system and polling the outbox table increases database load, creates lag, and eventually becomes a bottleneck.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these reasons, you should never rely on the Outbox implementation alone.&lt;/p&gt;

&lt;p&gt;Outbox guarantees atomicity — not delivery perfection.&lt;/p&gt;

&lt;p&gt;You must design your consumers to handle failure scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;    Consumers should be idempotent&lt;/li&gt;
&lt;li&gt;    Use idempotency keys&lt;/li&gt;
&lt;li&gt;    Partition by aggregate ID to preserve ordering&lt;/li&gt;
&lt;li&gt;    Handle duplicate messages safely&lt;/li&gt;
&lt;li&gt;    Use deduplication tables if required&lt;/li&gt;
&lt;li&gt;    For high DB load, prefer CDC tools like Debezium over polling&lt;/li&gt;
&lt;/ul&gt;
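&lt;p&gt;One way to implement the deduplication-table idea, again sketched with SQLite: the event ID acts as the idempotency key, and a primary-key constraint turns duplicate inserts into no-ops. Table names are illustrative.&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE ledger (event_id TEXT, amount INTEGER)")

def consume(event):
    with db:
        # INSERT OR IGNORE does nothing if the key already exists,
        # so rowcount tells us whether this event is new.
        cur = db.execute(
            "INSERT OR IGNORE INTO processed_events VALUES (?)",
            (event["event_id"],))
        if cur.rowcount == 0:
            return  # already seen: skip the side effect
        db.execute("INSERT INTO ledger VALUES (?, ?)",
                   (event["event_id"], event["amount"]))

consume({"event_id": "e1", "amount": 100})
consume({"event_id": "e1", "amount": 100})  # duplicate delivery
count = db.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
# count == 1  -- the ledger was written exactly once
```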

&lt;p&gt;&lt;strong&gt;What I’m Trying to Say Is&lt;/strong&gt;:&lt;br&gt;
The Outbox Pattern is not a one-stop solution.&lt;br&gt;
Many engineers assume it solves broker reliability.&lt;/p&gt;

&lt;p&gt;It does not.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Outbox Pattern does not solve reliability of the broker.&lt;br&gt;
It only guarantees atomicity between state change and event creation.&lt;br&gt;
It guarantees at-least-once delivery, not exactly-once.&lt;br&gt;
If you want to keep your job as a system designer, you must design around its weaknesses — not ignore them.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now that you understand the Outbox Pattern properly, let’s look at some examples of when and where it should be used — and where it should not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When To Use the Outbox Pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ 1. Database is the Source of Truth&lt;/p&gt;

&lt;p&gt;Example: Order Management System&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Order saved in PostgreSQL
OrderCreated event must be emitted
Losing that event breaks inventory and billing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Strong consistency is required → Outbox is ideal.&lt;/p&gt;

&lt;p&gt;✅ 2. Financial or Healthcare Systems&lt;/p&gt;

&lt;p&gt;Example: Payment Processing&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Transaction written to DB
Event triggers ledger updates, fraud checks, notifications
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Losing the event = financial inconsistency.&lt;/p&gt;

&lt;p&gt;Outbox ensures atomicity between transaction and event creation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When NOT To Use the Outbox Pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;❌ 1. When the Event Log Is the Source of Truth&lt;/p&gt;

&lt;p&gt;Example: An Event Sourcing system built around Kafka&lt;/p&gt;

&lt;p&gt;In this architecture:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;All state changes are written directly to Kafka first.
The database is just a projection (materialized view).
The event log is the system of record.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here, writing to the database first and then using an outbox adds unnecessary complexity.&lt;/p&gt;

&lt;p&gt;You should publish to Kafka as the primary write operation and build state from events.&lt;/p&gt;

&lt;p&gt;Outbox is not needed.&lt;/p&gt;

&lt;p&gt;❌ 2. Ultra High Throughput Streaming Systems&lt;/p&gt;

&lt;p&gt;Example: Real-time clickstream analytics or ad impression tracking&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Millions of events per second
Events are transient and not tightly coupled to transactional DB state
Occasional event loss may be acceptable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In such systems, polling a relational database becomes a bottleneck:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Heavy I/O
Lock contention
Index scans
Increased latency
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It is better to:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write directly to Kafka
Use stream processing (Kafka Streams / Flink)
Materialize views downstream
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;❌ 3. When Eventual Consistency Is Acceptable&lt;/p&gt;

&lt;p&gt;Example: Tracking “user viewed product” for analytics&lt;/p&gt;

&lt;p&gt;If one tracking event is lost:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;It does not break core business logic
No financial or critical data is affected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Using Outbox here adds operational overhead without strong benefit.&lt;/p&gt;

&lt;p&gt;❌ 4. When You Don’t Control the Database&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Using a third-party SaaS database
No ability to create tables
No transaction control
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Since Outbox relies on atomic database transactions, it cannot be properly implemented.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Closing Thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Outbox Pattern is not about Kafka, polling, or CDC.&lt;br&gt;
It is about solving the dual write problem in a practical way.&lt;/p&gt;

&lt;p&gt;It guarantees &lt;strong&gt;atomicity between state change and event creation&lt;/strong&gt; —&lt;br&gt;
but it does not guarantee broker reliability or exactly-once delivery.&lt;/p&gt;

&lt;p&gt;The mistake many engineers make is believing a pattern solves the entire problem.&lt;/p&gt;

&lt;p&gt;It doesn’t.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Outbox is a powerful tool — but real reliability comes from designing for failure, not assuming it won’t happen.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you’re exploring event-driven architectures further, I’ve also written about &lt;a href="https://dev.to/crvishal/what-are-kafka-streams-and-why-should-you-care-about-them-2d80"&gt;Kafka Streams and why it matters in real-world systems:&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>eventdriven</category>
      <category>designpatterns</category>
      <category>pubsub</category>
    </item>
    <item>
      <title>What Are Kafka Streams and Why Should You Care About Them?</title>
      <dc:creator>Vishal C. Chaliya</dc:creator>
      <pubDate>Wed, 11 Feb 2026 09:28:32 +0000</pubDate>
      <link>https://forem.com/crvishal/what-are-kafka-streams-and-why-should-you-care-about-them-2d80</link>
      <guid>https://forem.com/crvishal/what-are-kafka-streams-and-why-should-you-care-about-them-2d80</guid>
      <description>&lt;p&gt;Have you ever wondered how streaming giants like &lt;strong&gt;YouTube, Netflix&lt;/strong&gt; or &lt;strong&gt;Amazon Prime&lt;/strong&gt; suggest content from the same creators you’re currently watching, recommend similar videos, or even pitch specific products in real-time?&lt;br&gt;
We know this as targeted marketing, driven by your watch history, genres, and preferred content length. To a business, this data is pure gold. But to an engineer, the real challenge is: How do we process this data "on the fly"?&lt;/p&gt;

&lt;p&gt;Suppose you are the Chief System Architect of YouTube. You are tasked with building a system that collects and analyzes this massive influx of "gold." How would you process a vast, never-ending stream of data without the system buckling?&lt;/p&gt;

&lt;p&gt;In this scenario, you turn to Stream Processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Stream Processing?&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Martin Kleppmann&lt;/strong&gt; defines it as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Stream processing is a computing paradigm focused on continuously processing data as it is generated, rather than storing it first and processing it in batches. It allows systems to react to events in near real-time, enabling low-latency analytics, monitoring, and decision making. Stream processing systems ingest data streams, apply transformations or computations, and emit results while the input is still being produced.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Essentially, instead of storing data and running a massive batch job at 2:00 AM, you process it the moment it arrives. But how do we implement this?&lt;/p&gt;

&lt;p&gt;This is where Kafka Streams enters the picture.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;By textbook definition&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Kafka Streams is a lightweight, Java-based library for building real-time, scalable stream processing applications that read from and write to Apache Kafka topics. It provides high-level abstractions for continuous processing such as filtering, mapping, grouping, windowing, and aggregations, while handling fault tolerance and state management internally.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now that we know &lt;em&gt;what&lt;/em&gt; to do and &lt;em&gt;which tool&lt;/em&gt; to use, let’s build our stream pipeline.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;: This is a simplified mental model to explain the role of stream processing and Kafka Streams, not an exact representation of YouTube’s internal architecture. A giant like YouTube uses multiple stream processors, batch + streaming, ML pipelines, feature stores, etc to provide a seamless user experience.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Designing the Stream Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In Kafka Streams, we map our logic into a Topology. A topology is a &lt;strong&gt;directed acyclic graph&lt;/strong&gt; (DAG) of processing nodes that represent the transformation steps applied to the data stream.&lt;/p&gt;

&lt;p&gt;We start with &lt;strong&gt;Watch History&lt;/strong&gt; and &lt;strong&gt;User Activities&lt;/strong&gt; as our source of truth. In technical terms, this is our Source Processor (reading from a Kafka Topic).&lt;br&gt;
Using the Kafka Streams DSL (Domain Specific Language), we can define four distinct operations:&lt;/p&gt;

&lt;p&gt;1: &lt;strong&gt;Data Masking and Sanitization&lt;/strong&gt;&lt;br&gt;
Before deriving any higher-level signals, it is often necessary to sanitize incoming events.&lt;br&gt;
This node:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;consumes raw user interaction events&lt;/li&gt;
&lt;li&gt;removes or masks unnecessary or sensitive fields&lt;/li&gt;
&lt;li&gt;standardizes the event structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This step ensures that downstream processors operate only on relevant and safe data, reducing coupling and improving maintainability.&lt;br&gt;
The output of this node is a sanitized event stream, which becomes the input for subsequent processing steps.&lt;/p&gt;

&lt;p&gt;2: &lt;strong&gt;Similar Content Recommendation&lt;/strong&gt;&lt;br&gt;
To power this, we need the &lt;strong&gt;User ID, Channel Name&lt;/strong&gt;, and &lt;strong&gt;Genre&lt;/strong&gt;. For example, if you watch a WWE video, the genre is Professional Wrestling. The goal is to immediately suggest related promotions like AEW or TNA.&lt;/p&gt;

&lt;p&gt;In this node, we take the raw KStream, apply a map or transform operation to extract the relevant metadata, and pass it to a Sink Processor. This sink then emits the event into a new Kafka topic: similar-content.&lt;/p&gt;

&lt;p&gt;3: &lt;strong&gt;Preferred Video Length&lt;/strong&gt;&lt;br&gt;
Here, we focus on user behaviour. Does the user prefer 30-second Shorts or 20-minute video essays?&lt;br&gt;
We transform the incoming KStream into a specialized object containing the User ID and duration metrics. This transformed data is then streamed into a dedicated topic: preferred-content-length.&lt;/p&gt;

&lt;p&gt;4: &lt;strong&gt;Product Discovery&lt;/strong&gt;&lt;br&gt;
If a user searches for specific items within the platform, we can extract these signals immediately. By filtering search events within the topology, we can transform them into product-intent objects and emit them into a product-recommendations topic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4e6mlsm3kj6om2ng3tv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4e6mlsm3kj6om2ng3tv.png" alt="fig. 1.1" width="800" height="1129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that the data is emitted as well-defined events, downstream applications can analyze it independently and serve users far more effectively — and you get to keep your high-paying job, all thanks to stream processing and Kafka Streams 😉&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kafka Streams as a Transformer, Not the Brain&lt;/strong&gt;&lt;br&gt;
Many descriptions label Kafka Streams as the &lt;em&gt;"brain"&lt;/em&gt; or &lt;em&gt;"heart"&lt;/em&gt; of an application (which, in some cases, may be true). However, in this architecture, Kafka Streams acts as a high-performance Transformer and Supplier.&lt;/p&gt;

&lt;p&gt;It cleans, shapes, and routes data so that downstream microservices can act on it. &lt;em&gt;This is the hallmark of a well-designed Event-Driven Architecture.&lt;br&gt;
Congratulations!&lt;/em&gt; You’ve just scratched the surface of real-time data orchestration.&lt;/p&gt;

&lt;p&gt;But a question remains: &lt;strong&gt;Why not just use a traditional database?&lt;/strong&gt; Beyond the sheer volume of "heavy writes," what are the structural drawbacks of using a database for this?&lt;br&gt;
Stay tuned for Part 2.&lt;/p&gt;




</description>
      <category>kafka</category>
      <category>eventdriven</category>
      <category>kafkastreams</category>
      <category>streamprocessing</category>
    </item>
  </channel>
</rss>
