<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Precious Adedibu</title>
    <description>The latest articles on Forem by Precious Adedibu (@preshyjones).</description>
    <link>https://forem.com/preshyjones</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F359202%2Fd0a68d59-028f-4329-aecc-44080488e2ac.png</url>
      <title>Forem: Precious Adedibu</title>
      <link>https://forem.com/preshyjones</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/preshyjones"/>
    <language>en</language>
    <item>
      <title>The Outbox Pattern in Payment Systems</title>
      <dc:creator>Precious Adedibu</dc:creator>
      <pubDate>Thu, 14 May 2026 14:02:28 +0000</pubDate>
      <link>https://forem.com/preshyjones/the-outbox-pattern-in-payment-systems-58dj</link>
      <guid>https://forem.com/preshyjones/the-outbox-pattern-in-payment-systems-58dj</guid>
      <description>&lt;p&gt;&lt;em&gt;How to guarantee no transaction event is ever silently lost, even when Kafka goes down&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every payment system eventually faces the same problem. A transaction completes. Downstream systems need to know. An SMS must be sent. An audit log must be written. A fraud analysis must run. A compliance record must be created.&lt;br&gt;
The obvious solution is to publish an event to Kafka after the payment is processed and let downstream consumers handle each task asynchronously. It works perfectly in development. In production, Kafka goes down during peak processing, and you discover your architecture has a gap that no amount of retry logic can fix.&lt;br&gt;
This article explains that gap precisely, why it exists, and how the outbox pattern closes it permanently. It covers the database schema, the Spring Boot implementation, the failure scenarios the pattern handles, and the exactly-once delivery nuance that most engineers miss the first time they implement this.&lt;/p&gt;

&lt;h1&gt;
  
  
  The problem: direct Kafka publishing
&lt;/h1&gt;

&lt;p&gt;The naive implementation of async event publishing looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Service
public class PaymentService {

    @Autowired
    private KafkaTemplate&amp;lt;String, PaymentEvent&amp;gt; kafkaTemplate;

    @Transactional
    public void processPayment(Payment payment) {
        // Step 1: Save payment to database
        paymentRepository.save(payment);

        // Step 2: Publish event to Kafka
        kafkaTemplate.send("payment.completed", new PaymentEvent(payment));

        // Step 3: Return success to user
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This works until Kafka is unavailable. When that happens, you have two bad options:&lt;br&gt;
Fail the entire payment because Kafka is unreachable. The user gets an error for a problem that has nothing to do with their payment.&lt;br&gt;
Complete the payment but swallow the Kafka exception. The payment is recorded but the event is silently lost. No SMS sent, no audit log written, no compliance record created.&lt;/p&gt;

&lt;p&gt;Neither is acceptable. The first creates a false dependency between payment processing and Kafka availability. The second creates silent data loss in a regulated financial system.&lt;br&gt;
There is also a third failure mode that is easy to miss. If your application crashes between step 1 and step 2, the payment is recorded in the database but the Kafka publish never happened. The transaction committed but the event was never produced. You have no way to know this happened, and no automatic recovery mechanism.&lt;/p&gt;

&lt;h1&gt;
  
  
  The solution: the outbox pattern
&lt;/h1&gt;

&lt;p&gt;The outbox pattern eliminates these failure modes by removing Kafka from the critical path of payment processing entirely. The payment service never publishes to Kafka directly. Instead it writes to a dedicated outbox table in PostgreSQL, and a separate background processor handles the Kafka publishing independently.&lt;/p&gt;

&lt;h2&gt;
  
  
  The core insight
&lt;/h2&gt;

&lt;p&gt;Writing to the outbox table and writing the payment record happen in the same database transaction. Both succeed together or both fail together. PostgreSQL's ACID guarantees make this atomic. There is no window where the payment is recorded but the event is not, because they are committed as a single operation.&lt;br&gt;
Kafka publishing moves outside the transaction entirely. It becomes a best-effort background operation that can fail, retry, and eventually succeed without ever affecting the user-facing payment flow.&lt;/p&gt;

&lt;h1&gt;
  
  
  The database schema
&lt;/h1&gt;

&lt;p&gt;First, create the outbox table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE outbox_events (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_id    VARCHAR(100) NOT NULL,  -- e.g. transaction ID
    aggregate_type  VARCHAR(100) NOT NULL,  -- e.g. 'PAYMENT'
    event_type      VARCHAR(100) NOT NULL,  -- e.g. 'PAYMENT_COMPLETED'
    topic           VARCHAR(100) NOT NULL,  -- Kafka topic to publish to
    payload         JSONB        NOT NULL,  -- message content
    sent            BOOLEAN      NOT NULL DEFAULT FALSE,
    created_at      TIMESTAMP    NOT NULL DEFAULT NOW(),
    sent_at         TIMESTAMP
);

-- Index for the processor to efficiently find unsent events
CREATE INDEX idx_outbox_unsent ON outbox_events (sent, created_at)
    WHERE sent = FALSE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The index on sent and created_at is important. At scale this table will have millions of rows. Without the index, the processor scans the entire table on every run. With the partial index filtering on sent = FALSE, the database only scans the small subset of rows that actually need processing.&lt;/p&gt;
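
&lt;p&gt;For reference, a minimal sketch of a matching JPA entity and repository. The Lombok annotations and the jsonb-as-String mapping are simplifications chosen to line up with the schema above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Entity
@Table(name = "outbox_events")
@Builder
@Getter
@Setter
@NoArgsConstructor
@AllArgsConstructor
public class OutboxEvent {

    @Id
    @GeneratedValue
    private UUID id;

    private String aggregateId;
    private String aggregateType;
    private String eventType;
    private String topic;

    // Simplification: JSONB mapped as a raw String
    @Column(columnDefinition = "jsonb")
    private String payload;

    private boolean sent;
    private LocalDateTime createdAt;
    private LocalDateTime sentAt;
}

public interface OutboxRepository extends JpaRepository&amp;lt;OutboxEvent, UUID&amp;gt; {
    List&amp;lt;OutboxEvent&amp;gt; findBySentFalseOrderByCreatedAtAsc();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;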

&lt;h1&gt;
  
  
  The payment service implementation
&lt;/h1&gt;

&lt;p&gt;The payment service writes both the transaction record and the outbox event in a single atomic transaction:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Service
public class PaymentService {

    @Autowired private PaymentRepository paymentRepository;
    @Autowired private OutboxRepository outboxRepository;

    @Transactional
    public PaymentResponse processPayment(PaymentRequest request) {

        // Step 1: Validate and build payment
        Payment payment = buildPayment(request);

        // Step 2: Write payment record
        paymentRepository.save(payment);

        // Step 3: Write outbox event IN THE SAME TRANSACTION
        OutboxEvent event = OutboxEvent.builder()
            .aggregateId(payment.getId())
            .aggregateType("PAYMENT")
            .eventType("PAYMENT_COMPLETED")
            .topic("payment.completed")
            .payload(buildPayload(payment))
            .sent(false)
            .build();
        outboxRepository.save(event);

        // Step 4: Commit transaction
        // Both payment and outbox event committed atomically

        // Step 5: Return success
        // Kafka not mentioned anywhere in this method
        return PaymentResponse.success(payment.getId());
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Notice that Kafka does not appear anywhere in this class. The payment service has no Kafka dependency. It only knows about PostgreSQL. This is the decoupling that makes the pattern resilient.&lt;/p&gt;

&lt;h1&gt;
  
  
  The outbox processor
&lt;/h1&gt;

&lt;p&gt;A separate Spring component runs on a schedule and handles all Kafka publishing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Component
public class OutboxProcessor {

    @Autowired private OutboxRepository outboxRepository;
    @Autowired private KafkaTemplate&amp;lt;String, String&amp;gt; kafkaTemplate;

    @Scheduled(fixedDelay = 5000)  // runs every 5 seconds
    public void processOutbox() {

        // Read unsent events, ordered by creation time
        List&amp;lt;OutboxEvent&amp;gt; events = outboxRepository
            .findBySentFalseOrderByCreatedAtAsc();

        for (OutboxEvent event : events) {
            try {
                // Publish to Kafka
                kafkaTemplate.send(
                    event.getTopic(),
                    event.getAggregateId(),
                    event.getPayload()
                ).get(); // wait for broker acknowledgement

                // Mark as sent only after Kafka confirms receipt
                event.setSent(true);
                event.setSentAt(LocalDateTime.now());
                outboxRepository.save(event);

            } catch (Exception e) {
                // Kafka unavailable or publish failed
                // Log and continue to next event
                // This event will be retried on the next run
                log.error("Failed to publish event {}: {}",
                    event.getId(), e.getMessage());
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The processor runs every 5 seconds. In normal operation events are published within seconds of being written. When Kafka is unavailable, events accumulate in the outbox table with sent=false and are published in order once Kafka recovers. Nothing is lost.&lt;/p&gt;

&lt;h1&gt;
  
  
  Failure scenarios and how the pattern handles each
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Kafka is down when payment is processed
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Payment service:
  Write payment record to transactions table     [success]
  Write event to outbox table, sent=false        [success]
  Return success to user                         [success]

Outbox processor runs:
  Read event from outbox table                   [success]
  Attempt to publish to Kafka                    [fails - Kafka down]
  Log error, event stays sent=false              [retries next run]

Kafka recovers:
  Outbox processor reads same event, sent=false
  Publishes to Kafka                             [success]
  Marks sent=true
  Downstream consumers receive event
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Payment service crashes before returning success
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Payment service:
  Write payment record                           [success]
  Write outbox event, sent=false                 [success]
  CRASH before returning response to user

User retries payment:
  Idempotency key check: has this been processed?
  Yes - same payment record exists in database
  Return success, do not create duplicate payment

Outbox processor:
  Finds original event, sent=false
  Publishes to Kafka
  Marks sent=true
  Single event published, single outcome
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
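
&lt;p&gt;The idempotency check in the retry step could look like this minimal sketch, assuming payments are stored with a client-supplied idempotency key. The column and finder method here are hypothetical, not part of the schema above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Hypothetical guard at the top of processPayment:
// the client resends the same idempotency key on retry
Optional&amp;lt;Payment&amp;gt; existing =
    paymentRepository.findByIdempotencyKey(request.getIdempotencyKey());
if (existing.isPresent()) {
    // Same request seen before: return the original outcome,
    // do not create a second payment or a second outbox event
    return PaymentResponse.success(existing.get().getId());
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;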

&lt;h2&gt;
  
  
  Processor crashes after publishing but before marking sent
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Outbox processor:
  Read event, sent=false
  Publish to Kafka                               [success]
  CRASH before marking sent=true

Processor restarts:
  Reads same event, still sent=false
  Publishes to Kafka again - DUPLICATE
  Marks sent=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Kafka now has this event twice.&lt;br&gt;
This is why consumers must be idempotent.&lt;/p&gt;

&lt;p&gt;This is the most important failure scenario to understand. The outbox pattern guarantees at-least-once delivery, not exactly-once delivery. Duplicate messages are possible in crash scenarios. The solution is idempotent consumers, not attempting to prevent duplicates at the producer level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Idempotent consumers: the final safety net
&lt;/h2&gt;

&lt;p&gt;Every consumer that receives events from Kafka must handle duplicates gracefully. The pattern is consistent across all consumers: check whether this event ID has already been processed before doing any work.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Service
public class SmsNotificationConsumer {

    @Autowired private SmsService smsService;
    @Autowired private ProcessedEventRepository processedEvents;

    @KafkaListener(topics = "payment.completed")
    @Transactional
    public void handlePaymentCompleted(PaymentEvent event) {

        // Check if already processed
        if (processedEvents.existsByEventId(event.getId())) {
            log.info("Duplicate event ignored: {}", event.getId());
            return;
        }

        // Process the event
        smsService.send(
            event.getUserPhone(),
            "Your payment of " + event.getAmount() + " was successful"
        );

        // Record that we have processed this event
        // This and the SMS send are in one transaction
        processedEvents.save(new ProcessedEvent(event.getId()));
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each consumer maintains its own processed events table. The check and the processing happen in one database transaction, making the consumer idempotent even in the face of duplicates.&lt;/p&gt;

&lt;h1&gt;
  
  
  Exactly-once delivery: the honest picture
&lt;/h1&gt;

&lt;p&gt;The outbox pattern is frequently described as providing exactly-once delivery. This is imprecise and worth correcting.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the outbox pattern actually guarantees
&lt;/h2&gt;

&lt;p&gt;Every event will be published to Kafka at least once. No events are silently lost.&lt;br&gt;
Duplicate events are possible in crash scenarios, specifically when the processor crashes after publishing but before marking sent=true.&lt;br&gt;
The frequency of duplicates is very low in practice but cannot be eliminated entirely without coordination overhead that makes the system impractical.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Kafka's enable.idempotence covers
&lt;/h2&gt;

&lt;p&gt;Enabling idempotence on the Kafka producer eliminates duplicates caused by network retries within a single producer session. Kafka assigns each producer a unique ID and tracks sequence numbers. If the same message arrives twice from the same producer session, Kafka deduplicates it automatically.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;props.put("enable.idempotence", "true");
props.put("acks", "all");
props.put("retries", Integer.MAX_VALUE);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;However, enable.idempotence does not cover the crash scenario described above. When the processor restarts, it receives a new producer ID. Kafka treats it as a completely different producer and has no way to know that the message it is receiving is a duplicate of one sent in the previous session.&lt;/p&gt;

&lt;h2&gt;
  
  
  The practical solution
&lt;/h2&gt;

&lt;p&gt;The industry standard is to combine at-least-once delivery from the outbox pattern with idempotent consumers. This gives you the correct outcome in every scenario:&lt;br&gt;
Outbox pattern: no events lost regardless of Kafka availability&lt;br&gt;
enable.idempotence: no duplicates from network retries within a session&lt;br&gt;
Idempotent consumers: no duplicate outcomes even when duplicates arrive&lt;/p&gt;

&lt;p&gt;The goal is not exactly-once publishing. The goal is exactly-once outcomes. These are different things, and the combination above achieves the second without the impractical overhead of attempting the first.&lt;/p&gt;

&lt;h1&gt;
  
  
  Operational considerations
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Outbox table growth
&lt;/h2&gt;

&lt;p&gt;The outbox table grows continuously. Sent events should be cleaned up periodically to prevent it from becoming a performance problem:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Run on a schedule, e.g. daily
DELETE FROM outbox_events
WHERE sent = TRUE
AND sent_at &amp;lt; NOW() - INTERVAL '7 days';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keep sent events for a few days before deleting them. This gives you a window to investigate any consumer issues and replay events if needed.&lt;/p&gt;
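
&lt;p&gt;A minimal sketch of wiring that cleanup into the application itself, assuming a Spring Data derived delete query; the method name and schedule are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Component
public class OutboxCleanup {

    @Autowired private OutboxRepository outboxRepository;

    // Daily at 03:00; retention window matches the SQL above
    @Scheduled(cron = "0 0 3 * * *")
    @Transactional
    public void purgeSentEvents() {
        outboxRepository.deleteBySentTrueAndSentAtBefore(
            LocalDateTime.now().minusDays(7));
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;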

&lt;h2&gt;
  
  
  Multiple processor instances
&lt;/h2&gt;

&lt;p&gt;If you run multiple instances of your application, multiple outbox processors will run simultaneously. Without coordination, they will all try to publish the same events at the same time.&lt;br&gt;
The solution is to use SELECT FOR UPDATE SKIP LOCKED when reading from the outbox table. This is a PostgreSQL feature that allows concurrent processors to claim events exclusively without blocking each other:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Query(value = "SELECT * FROM outbox_events " +
    "WHERE sent = FALSE " +
    "ORDER BY created_at ASC " +
    "LIMIT 100 " +
    "FOR UPDATE SKIP LOCKED",
    nativeQuery = true)
List&amp;lt;OutboxEvent&amp;gt; findUnsentEventsWithLock();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each processor instance claims a batch of events exclusively. Other instances skip locked rows and claim their own batches. Multiple instances process events in parallel without conflicts.&lt;/p&gt;
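
&lt;p&gt;One nuance: PostgreSQL releases row locks at the end of the transaction, so the claim, the publish, and the sent=true update belong in one transactional method. A minimal sketch of the processor using the locked query above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Locks from FOR UPDATE SKIP LOCKED are held until this method's
// transaction commits, so other instances skip these rows entirely
@Scheduled(fixedDelay = 5000)
@Transactional
public void processOutbox() {
    for (OutboxEvent event : outboxRepository.findUnsentEventsWithLock()) {
        try {
            kafkaTemplate.send(event.getTopic(),
                event.getAggregateId(), event.getPayload()).get();
            event.setSent(true);
            event.setSentAt(LocalDateTime.now());
        } catch (Exception e) {
            // Leave sent=false; the row unlocks at commit and is retried
            log.error("Failed to publish event {}: {}",
                event.getId(), e.getMessage());
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;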

&lt;h2&gt;
  
  
  Monitoring
&lt;/h2&gt;

&lt;p&gt;Monitor these two metrics to catch outbox problems early:&lt;br&gt;
Outbox table depth: the number of rows where sent=false. Under normal conditions this should be near zero. A growing backlog indicates either Kafka is unavailable or the processor has stopped running.&lt;br&gt;
Outbox event age: the oldest created_at timestamp among unsent events. Events more than a few minutes old indicate a problem that needs investigation.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;Direct Kafka publishing from the payment service creates a hidden dependency on Kafka availability and a data loss window on application crashes.&lt;br&gt;
The outbox pattern removes Kafka from the critical payment path by writing events to a PostgreSQL table atomically with the payment record.&lt;br&gt;
A separate background processor reads the outbox table and publishes to Kafka independently of payment processing.&lt;br&gt;
Kafka downtime delays downstream processing but never prevents payment confirmation or causes event loss.&lt;br&gt;
The outbox pattern guarantees at-least-once delivery, not exactly-once. Duplicate messages are possible.&lt;br&gt;
Idempotent consumers handle duplicates gracefully, making the end-to-end outcome correct even when duplicates arrive.&lt;br&gt;
Use SELECT FOR UPDATE SKIP LOCKED when running multiple application instances to prevent concurrent processors from duplicating work.&lt;br&gt;
Monitor outbox table depth and event age to catch problems before they affect users.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>outbox</category>
      <category>systemdesign</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Caching in Payment Systems</title>
      <dc:creator>Precious Adedibu</dc:creator>
      <pubDate>Thu, 30 Apr 2026 17:51:23 +0000</pubDate>
      <link>https://forem.com/preshyjones/caching-in-payment-systems-24c7</link>
      <guid>https://forem.com/preshyjones/caching-in-payment-systems-24c7</guid>
      <description>&lt;p&gt;Every backend engineer has heard the advice: add a cache. It will make your system faster. And they are right, it will. But a cache that is not properly understood is one of the most effective ways to take down your entire production system, including the database it was supposed to protect.&lt;br&gt;
This article covers everything you need to know about caching in a payment system: what it is, how it works, the patterns that govern it, the failure modes that will wake you up at 2am, and the decisions that separate engineers who use cache from engineers who understand it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a cache actually is
&lt;/h2&gt;

&lt;p&gt;A cache is a temporary storage layer that holds copies of data in memory so that future requests for that data can be served faster. The operative word is temporary. A cache is not a database. It is not your source of truth. It is a shortcut.&lt;br&gt;
Three properties define cache, and everything else follows from them:&lt;br&gt;
Temporary: cached data has a finite lifespan and will eventually be deleted&lt;br&gt;
In memory: cache lives in RAM, not on disk. RAM access takes nanoseconds. Disk access takes milliseconds. That is a speed difference of several orders of magnitude.&lt;br&gt;
A copy: the real data lives in your database. Cache holds a duplicate for fast retrieval.&lt;/p&gt;

&lt;p&gt;In a payment platform, database traffic breaks down roughly as:&lt;br&gt;
95% reads: balance checks, transaction history, KYC status lookups, fraud checks&lt;br&gt;
5% writes: new transactions, status updates, profile changes&lt;/p&gt;

&lt;p&gt;Running all of this through one database means reads and writes compete for the same CPU, memory, and disk I/O. A heavy reconciliation job scanning millions of rows can starve your live payment processing of resources. Cache solves this by absorbing the read traffic before it ever reaches the database.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cache-aside pattern
&lt;/h2&gt;

&lt;p&gt;The most common caching strategy is cache-aside, also called lazy loading. The flow has three steps:&lt;br&gt;
Web server receives a request&lt;br&gt;
Server checks cache first. If data is there (cache hit), return it immediately. Database never touched.&lt;br&gt;
If data is not in cache (cache miss), query the database, store the result in cache, return to client.&lt;/p&gt;

&lt;p&gt;The name lazy loading comes from the fact that data is only loaded into cache when someone actually requests it, not proactively. Your cache gradually warms up as users make requests. Cold cache after a restart means more database hits until the cache repopulates.&lt;br&gt;
In Redis, the implementation is straightforward:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cache.set('user:USR_001:kyc_status', 'VERIFIED', EX, 21600)
cache.get('user:USR_001:kyc_status')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The three components of every cache write are the key (the unique identifier), the value (the data being stored), and the expiry (how long before it is automatically deleted). That expiry is called the TTL, Time To Live.&lt;/p&gt;
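
&lt;p&gt;A minimal cache-aside sketch in Java using the Jedis client. The key format and TTL mirror the example above; loadKycStatusFromDb is a hypothetical database lookup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public String getKycStatus(Jedis cache, String userId) {
    String key = "user:" + userId + ":kyc_status";

    // 1. Check cache first
    String cached = cache.get(key);
    if (cached != null) {
        return cached;  // cache hit: database never touched
    }

    // 2. Cache miss: query the database (hypothetical lookup)
    String fresh = loadKycStatusFromDb(userId);

    // 3. Store with a 6-hour TTL, then return
    cache.setex(key, 21600, fresh);
    return fresh;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;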

&lt;h2&gt;
  
  
  TTL: the expiration timer that governs everything
&lt;/h2&gt;

&lt;p&gt;TTL is the number of seconds before a cached value is automatically deleted. Once a key expires, the next request for it is a cache miss, triggering a fresh database query. The result is stored again with a new TTL and the cycle continues.&lt;br&gt;
Setting the right TTL for each piece of data is one of the most important caching decisions you make. Both extremes are wrong:&lt;br&gt;
TTL too short: cache expires constantly, every request goes to the database, you have gained nothing from having a cache&lt;br&gt;
TTL too long: cached data becomes stale, users see outdated information, which in a fintech means showing wrong balances or transaction states&lt;/p&gt;

&lt;p&gt;The right question for every piece of data is: if a user sees this value X seconds after it was written, does anything bad happen? That answer determines your TTL.&lt;br&gt;
In a payment platform, different data has very different tolerance for staleness:&lt;br&gt;
Transaction fee configuration: changes monthly. TTL of 1 hour is safe.&lt;br&gt;
User KYC status: changes rarely. TTL of 6 hours is acceptable.&lt;br&gt;
User session token: security-sensitive. TTL of 30 minutes maximum.&lt;br&gt;
OTP codes: single use, expires in minutes by design. TTL of 5 minutes.&lt;br&gt;
Wallet balance for display: changes on every transaction. TTL of 10 seconds at most.&lt;br&gt;
Wallet balance for authorisation decisions: never cache. Always read from the primary database.&lt;/p&gt;

&lt;p&gt;The most important rule in fintech caching: never use a cached value to authorise a financial transaction. Cache is for display. Authorisation always reads from the primary database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cache consistency: the invisible problem
&lt;/h2&gt;

&lt;p&gt;When a user updates their data, you update the database. Then you update the cache. These are two separate operations. They cannot be wrapped in a single atomic transaction the way two database writes can be.&lt;br&gt;
If the database update succeeds but the cache update fails, your cache now holds stale data. Every request that hits cache gets the wrong answer. And it stays wrong until the TTL expires and forces a fresh database read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this is harder than it sounds&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a microservices architecture, multiple services may write to the same database tables. If Service A updates a user's account tier and Service B has that tier cached, Service B will serve stale data until its TTL expires, regardless of what Service A did.&lt;br&gt;
This window where the database and cache disagree is called an inconsistency window. For most data it is acceptable and self-healing. For financial data it is unacceptable, which is exactly why wallet balances used for authorisation must never come from cache.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The consistency strategies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Write-through: update the database and the cache in the same operation. If either fails, you retry until both succeed. Keeps cache consistent but adds latency to every write.&lt;br&gt;
Cache invalidation: when data changes, delete the cache key entirely rather than updating it. The next read will be a cache miss and will fetch fresh data. Simpler than write-through, but creates a spike of cache misses immediately after updates. A sketch of this strategy follows below.&lt;br&gt;
TTL-based expiry: accept that the cache will be stale for up to X seconds and set your TTL accordingly. The simplest approach and perfectly valid for data where short-term staleness is acceptable.&lt;/p&gt;

&lt;p&gt;Facebook published a paper called Scaling Memcache at Facebook that is one of the most important pieces of engineering writing on cache consistency at scale. It is worth reading once you have completed the foundations of your systems design study.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cache as a single point of failure
&lt;/h2&gt;

&lt;p&gt;A single cache server is a single point of failure. If that one Redis instance goes down, every request that would have been served by cache is now a cache miss. Every cache miss becomes a database query. If your database was sized to handle, say, 5,000 queries per second with cache absorbing everything else, it suddenly receives 50,000 queries per second.&lt;br&gt;
The database was not built for this. It struggles. It dies. Your entire system is down.&lt;br&gt;
This is called a cache avalanche. Redis did not just fail. It took your database with it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How cache avalanche happens step by step&lt;/strong&gt;&lt;br&gt;
Redis goes offline&lt;br&gt;
Every cached key is gone&lt;br&gt;
Every incoming request is a cache miss&lt;br&gt;
Every cache miss becomes a database query&lt;br&gt;
Database receives a sudden flood of traffic it was never designed to handle&lt;br&gt;
Database CPU spikes to 100%&lt;br&gt;
Database response times grow to seconds&lt;br&gt;
Connection pool exhausted&lt;br&gt;
Database dies&lt;br&gt;
Total system outage&lt;/p&gt;

&lt;p&gt;This entire sequence can happen in under 60 seconds. Faster than any engineer can manually intervene.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three things that prevent cache avalanche&lt;/strong&gt;&lt;br&gt;
First: run Redis in cluster mode across multiple availability zones. A cluster distributes your keys across multiple nodes, each with its own replica in a separate availability zone. When one node dies, its replica is promoted automatically. Your application continues serving cache hits without interruption. No avalanche.&lt;br&gt;
Second: overprovision memory. Never run Redis above 60% capacity. That 40% buffer absorbs two specific scenarios. Traffic spikes: when unexpected load brings more data to cache, Redis has room to absorb it without immediately evicting existing keys. Node failure redistribution: when one node dies, its keys redistribute to surviving nodes. If those nodes are at 90% capacity, the redistribution causes immediate mass eviction and a flood of cache misses. At 60% capacity, surviving nodes absorb the redistribution without evicting anything.&lt;br&gt;
Third: stagger your TTLs with random jitter. If you cache 100,000 keys all with a TTL of 3,600 seconds and they all expire at the same time, you get 100,000 simultaneous cache misses. Every one of them hits the database in the same second. To prevent this, add a random offset to each TTL:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ttl = 3600 + random(0, 300)  // expire between 60 and 65 minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Misses now spread across a 5-minute window instead of hitting simultaneously. The database handles them at a manageable rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Redis cluster: how data distributes across nodes
&lt;/h2&gt;

&lt;p&gt;When you run Redis in cluster mode, data does not just randomly distribute across nodes. Redis uses a deterministic system called hash slots to decide exactly which node stores which key.&lt;br&gt;
&lt;strong&gt;What hash slots are&lt;/strong&gt;&lt;br&gt;
Redis pre-divides the entire key space into exactly 16,384 slots, numbered 0 to 16,383. This number is fixed regardless of how many nodes you have. Every possible key maps to exactly one slot via a hash function:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;slot = CRC16(key) % 16384
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;CRC16 is a standard algorithm that converts any string into a number. The same key always produces the same slot number. This determinism is what makes routing fast and reliable.&lt;br&gt;
&lt;strong&gt;How slots distribute across nodes&lt;/strong&gt;&lt;br&gt;
With three primary nodes, the 16,384 slots divide roughly equally:&lt;br&gt;
Primary Node 1 owns slots 0 to 5,460&lt;br&gt;
Primary Node 2 owns slots 5,461 to 10,922&lt;br&gt;
Primary Node 3 owns slots 10,923 to 16,383&lt;/p&gt;

&lt;p&gt;When you write a key, the Redis client library hashes it, calculates the slot, and sends the write directly to the node that owns that slot. No searching. No broadcasting. Pure math tells the client exactly where to go.&lt;/p&gt;
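
&lt;p&gt;For intuition, the slot calculation is small enough to sketch in Java. This is CRC16-XModem, the variant Redis Cluster specifies; real clients also honour {hash tags} in keys, which this sketch ignores for brevity:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import java.nio.charset.StandardCharsets;

public class ClusterSlots {

    // CRC16-XModem: polynomial 0x1021, initial value 0x0000
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b &amp;amp; 0xFF) &amp;lt;&amp;lt; 8;
            for (int i = 0; i &amp;lt; 8; i++) {
                boolean msb = (crc &amp;amp; 0x8000) != 0;
                crc = (crc &amp;lt;&amp;lt; 1) &amp;amp; 0xFFFF;
                if (msb) crc ^= 0x1021;
            }
        }
        return crc;
    }

    // Same key, same slot, every time: deterministic routing
    static int slotFor(String key) {
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % 16384;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;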

&lt;p&gt;&lt;strong&gt;What happens when a client asks the wrong node&lt;/strong&gt;&lt;br&gt;
If a client sends a request to the wrong node, that node responds with a MOVED redirect:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client: GET user:USR_001  (sent to Node 1 by mistake)
Node 1:  MOVED 5789 node2:7001
Client: GET user:USR_001  (resent to Node 2)
Node 2:  returns the data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The client updates its local slot map so future requests go directly to the correct node. Cluster-aware Redis client libraries like Jedis (Java) or ioredis (Node.js) maintain this map automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who assigns the hash slots&lt;/strong&gt;&lt;br&gt;
Hash slot assignment happens once at cluster creation time. The redis-cli cluster create command divides the 16,384 slots equally across your primary nodes and stores the assignment on every node. Every node in the cluster holds a complete copy of the slot map. There is no central coordinator. Any node can redirect any client to the right place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Eviction policies: what happens when cache is full
&lt;/h2&gt;

&lt;p&gt;Redis has a fixed memory limit. When that limit is reached and new data needs to be stored, Redis must delete something to make room. The eviction policy controls which keys get deleted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LRU: Least Recently Used&lt;/strong&gt;&lt;br&gt;
Delete whichever key has not been accessed for the longest time. The assumption is that recently accessed data will likely be accessed again soon. Data that has not been touched in hours is probably not hot.&lt;br&gt;
LRU is the most commonly used eviction policy. Note that Redis itself ships with maxmemory-policy set to noeviction, which rejects writes once memory is full, so production caches typically configure allkeys-lru explicitly. LRU works well because of a natural property of real-world access patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 80/20 rule in cache access&lt;/strong&gt;&lt;br&gt;
In any application with many users and many pieces of data, a small minority of keys get requested constantly while the vast majority are rarely touched. In a payment platform:&lt;br&gt;
Transaction fee configurations: read on every single transaction. Extremely hot.&lt;br&gt;
Active user KYC status: read multiple times per session for active users. Hot.&lt;br&gt;
System configuration values: read constantly by all services. Hot.&lt;br&gt;
Transaction history for users who logged in once six months ago: cold.&lt;br&gt;
Profiles of dormant accounts: cold.&lt;/p&gt;

&lt;p&gt;Roughly 20% of your keys account for 80% of your cache reads. LRU naturally protects this hot 20% because those keys are accessed constantly, keeping their last-accessed timestamp recent. LRU will never evict them. It evicts the cold 80%, which is exactly the right behaviour.&lt;br&gt;
&lt;strong&gt;LFU: Least Frequently Used&lt;/strong&gt;&lt;br&gt;
Delete whichever key has been accessed the fewest total times, regardless of recency. Better than LRU when some data is genuinely hot long-term and you do not want it evicted just because it was not accessed in the last few minutes. More complex to implement and requires Redis 4.0 or later.&lt;br&gt;
&lt;strong&gt;FIFO: First In First Out&lt;/strong&gt;&lt;br&gt;
Delete whichever key was stored in the cache earliest, regardless of access frequency or recency. Simplest to implement, worst performing in practice. Evicts based on age not utility. Rarely the right choice for production systems.&lt;br&gt;
For most production payment systems, LRU with 60% memory provisioning is the correct starting configuration. Change it only when you have data showing a different pattern in your specific workload.&lt;/p&gt;

&lt;h2&gt;
  
  
  When NOT to cache
&lt;/h2&gt;

&lt;p&gt;The question of what to cache gets a lot of attention. The question of what never to cache gets almost none. In a payment system, this is where the real discipline lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Never cache wallet balances used for authorisation&lt;/strong&gt;&lt;br&gt;
A cached balance that is 10 seconds stale is fine for displaying to a user on their dashboard. It is catastrophic for authorising a debit. If a user makes two simultaneous transfers and both are authorised against a stale cached balance, you have a double spend. The authorisation check for money movement always reads from the primary database. Always.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Never cache OTP codes as the sole source of truth&lt;/strong&gt;&lt;br&gt;
OTPs must be verified against the value stored in the database or a dedicated time-based generation system. A cached OTP that persists beyond its intended expiry due to a TTL misconfiguration is a security vulnerability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Never cache active transaction status&lt;/strong&gt;&lt;br&gt;
A transaction in flight has a status that changes rapidly: initiated, processing, authorised, settled, failed. Serving a cached status of processing for a transaction that has already failed or settled creates a confusing and potentially dangerous user experience. Active transaction status always comes from the database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Never treat cache as your only storage&lt;/strong&gt;&lt;br&gt;
If Redis restarts, all in-memory data is gone. Any data that exists only in Redis and nowhere else is permanently lost. Every piece of data in your cache must have its source of truth in a persistent store. Cache is a copy. The database is the original.&lt;/p&gt;

&lt;h2&gt;
  
  
  Replication is not a backup
&lt;/h2&gt;

&lt;p&gt;This applies equally to cache and databases and is worth stating explicitly. Redis replication keeps a copy of your data on a replica node. If the primary dies, the replica is promoted and cache continues serving requests.&lt;br&gt;
But replication copies everything, including mistakes. If a bug in your application writes corrupted data to Redis, that corruption replicates to every replica. Replication does not protect you from data corruption. It protects you from infrastructure failure.&lt;br&gt;
For cache this is less critical because cache data is always regenerable from the database. But the mental model matters: replication is availability protection, not data protection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary: the caching mental model for payment systems
&lt;/h2&gt;

&lt;p&gt;Cache is a temporary, in-memory copy of data from your database. It is not your source of truth.&lt;br&gt;
Use cache for data that is read frequently but changes infrequently.&lt;br&gt;
Set TTLs appropriate to how stale each piece of data can be. Add random jitter to prevent simultaneous expiry.&lt;br&gt;
Cache and database can become inconsistent. Design for this. Never use cached values for authorisation decisions.&lt;br&gt;
A single cache instance is a single point of failure. Run in cluster mode across multiple availability zones.&lt;br&gt;
Never run Redis above 60% capacity. The buffer protects you during spikes and node failure redistribution.&lt;br&gt;
Cache avalanche is real. Redis dying can take your database with it if you have not planned for it.&lt;br&gt;
LRU eviction with memory overprovisioning is the correct default for most payment systems.&lt;br&gt;
Wallet balances for authorisation, OTPs, and active transaction status should never come from cache.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>fintech</category>
      <category>distributedsystems</category>
      <category>redis</category>
    </item>
  </channel>
</rss>
