<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ankit Kumar Shaw</title>
    <description>The latest articles on Forem by Ankit Kumar Shaw (@ankitkumarshaw).</description>
    <link>https://forem.com/ankitkumarshaw</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3857996%2F8087e457-fdb9-4df4-b58c-3e0ef7d1d41e.png</url>
      <title>Forem: Ankit Kumar Shaw</title>
      <link>https://forem.com/ankitkumarshaw</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ankitkumarshaw"/>
    <language>en</language>
    <item>
      <title>Building a 5000+ Notifications/sec Event Pipeline with Kafka &amp; Distributed Idempotency</title>
      <dc:creator>Ankit Kumar Shaw</dc:creator>
      <pubDate>Fri, 03 Apr 2026 10:24:46 +0000</pubDate>
      <link>https://forem.com/ankitkumarshaw/building-a-5000-notificationssec-event-pipeline-with-kafka-distributed-idempotency-1fpp</link>
      <guid>https://forem.com/ankitkumarshaw/building-a-5000-notificationssec-event-pipeline-with-kafka-distributed-idempotency-1fpp</guid>
      <description>&lt;h2&gt;The Hook: When Synchronous Becomes the Bottleneck&lt;/h2&gt;

&lt;p&gt;Picture this: Your leave approval API is taking 2.3 seconds to respond. Not because the database is slow. Not because the business logic is complex. But because it's waiting for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An email to be sent to the approver&lt;/li&gt;
&lt;li&gt;An SMS notification to the employee&lt;/li&gt;
&lt;li&gt;An audit log to be written to a compliance database&lt;/li&gt;
&lt;li&gt;A push notification to mobile devices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these external calls adds 300-500ms of latency. And if one fails? The entire approval operation fails. The user sees an error. The leave request is lost. Support tickets flood in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This was the reality&lt;/strong&gt; when I joined the notification team. Our leave management system—serving 20,000 employees—was tightly coupled with notification delivery. Every approval action had to wait for emails, SMS, and audit logs to complete. If SendGrid was slow or Twilio was down, leave approvals ground to a halt.&lt;/p&gt;

&lt;p&gt;The solution? &lt;strong&gt;Decouple notification delivery from the critical path using event-driven architecture with Apache Kafka.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the story of how we re-architected the notification system to handle &lt;strong&gt;5000+ notifications/sec&lt;/strong&gt; with &lt;strong&gt;zero message loss&lt;/strong&gt;, &lt;strong&gt;100% idempotency&lt;/strong&gt;, and &lt;strong&gt;85% reduction in third-party failure propagation&lt;/strong&gt;—all while cutting approval API latency by &lt;strong&gt;70%&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;The Problem: Synchronous Coupling at Scale&lt;/h2&gt;

&lt;h4&gt;Original Architecture (The Synchronous Nightmare)&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → Leave Approval API → [Business Logic]
                                    ↓
                    ┌───────────────┴───────────────┐
                    │                               │
          [Send Email] ──→ SendGrid (500ms)        │
                    │                               │
          [Send SMS] ──→ Twilio (400ms)            │
                    │                               │
          [Audit Log] ──→ Compliance DB (300ms)    │
                    │                               │
          [Push Notification] ──→ FCM (350ms)      │
                    │                               │
                    └───────────────┬───────────────┘
                                    ↓
                          Total Latency: 2.3s
                          Return Response to User
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Cascading Failures:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Latency Amplification&lt;/strong&gt;: Every notification channel adds to API response time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure Propagation&lt;/strong&gt;: If SendGrid is down, the entire approval fails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Retry Logic&lt;/strong&gt;: Failed notifications are lost forever&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poor User Experience&lt;/strong&gt;: Users wait 2+ seconds for a simple approval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tight Coupling&lt;/strong&gt;: Business logic can't evolve independently of notification logic&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Production Impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;450 incidents/month&lt;/strong&gt; from third-party notification provider failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;p99 latency: 2.3 seconds&lt;/strong&gt; (unacceptable for user-facing APIs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero fault isolation&lt;/strong&gt;: One provider outage affects all operations&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;The Solution: Event-Driven Architecture with Kafka&lt;/h2&gt;

&lt;h4&gt;The Core Insight&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt; Notification delivery is not part of the core business transaction. If a leave is approved, it's approved—whether or not the email sends immediately is a separate concern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Architectural Decision:&lt;/strong&gt; Decouple notification delivery from the approval workflow using &lt;strong&gt;asynchronous event processing&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;Target Architecture (Event-Driven)&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → Leave Approval API → [Business Logic]
                                    ↓
                    ┌───────────────┴───────────────┐
                    │  Save to Database             │
                    │  (50ms)                       │
                    └───────────────┬───────────────┘
                                    ↓
                    Publish Event to Kafka Topic
                    (5ms - async, fire-and-forget)
                                    ↓
                          Return Response: 55ms
                          (User sees success immediately)

                    ┌─────────────────────────────────┐
                    │   Kafka Topic (12 partitions)   │
                    └─────────────┬───────────────────┘
                                  ↓
              ┌───────────────────┴───────────────────┐
              │                                       │
    [Consumer Group: Email]              [Consumer Group: SMS]
    [Consumer Group: Audit]              [Consumer Group: Push]
              │                                       │
              ↓                                       ↓
    Process independently with                Circuit breakers,
    retries, DLQ, idempotency                exponential backoff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Immediate Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;70% latency reduction&lt;/strong&gt;: API response time drops from 2.3s → 700ms (then further optimized to 250ms with DB tuning)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault Isolation&lt;/strong&gt;: Email provider outage doesn't affect leave approval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Independent Scaling&lt;/strong&gt;: Notification consumers scale independently of API layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry Logic&lt;/strong&gt;: Failed notifications automatically retry with exponential backoff&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero Message Loss&lt;/strong&gt;: Dead Letter Queues (DLQ) capture failures for manual intervention&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Deep Dive: Kafka Architecture Decisions&lt;/h2&gt;

&lt;h4&gt;Why Kafka Over RabbitMQ or SQS?&lt;/h4&gt;

&lt;p&gt;As I researched event-streaming platforms, I evaluated three options:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Kafka&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;RabbitMQ&lt;/th&gt;
&lt;th&gt;AWS SQS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1M+ msg/sec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50K msg/sec&lt;/td&gt;
&lt;td&gt;Nearly unlimited (standard); limited for FIFO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ordering Guarantee&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Per-partition ordering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-queue FIFO (lost with competing consumers)&lt;/td&gt;
&lt;td&gt;FIFO queues (limited throughput)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Replay Capability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes (offset management)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No (consumed messages are deleted; unconsumed retained up to 14 days)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Durability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Replicated logs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optional persistence&lt;/td&gt;
&lt;td&gt;High (managed by AWS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consumer Groups&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Native support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual implementation&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best For&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Event streaming, high throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Task queues, RPC&lt;/td&gt;
&lt;td&gt;Serverless, decoupled microservices&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why We Chose Kafka:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Throughput Requirements&lt;/strong&gt;: We needed to scale from 100 → 5000+ notifications/sec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay Capability&lt;/strong&gt;: Critical for debugging and reprocessing failed batches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ordering Guarantees&lt;/strong&gt;: Per-user notification ordering was essential (e.g., "leave requested" before "leave approved")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer Groups&lt;/strong&gt;: Native support for multiple independent consumers (Email, SMS, Push, Audit)&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;Kafka Partitioning Strategy: Scaling to 5000+ Notifications/sec&lt;/h2&gt;

&lt;h4&gt;The Partitioning Challenge&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; Kafka partitions are the unit of parallelism. More partitions = more concurrent consumers = higher throughput.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design Decision:&lt;/strong&gt; Use &lt;strong&gt;12 partitions&lt;/strong&gt; partitioned by &lt;code&gt;userId&lt;/code&gt; hash.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Configuration:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Idempotent producer with userId as partition key&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ProducerConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ENABLE_IDEMPOTENCE_CONFIG&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ProducerConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ACKS_CONFIG&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"all"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Wait for all replicas&lt;/span&gt;

&lt;span class="n"&gt;kafkaTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;send&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"notification-events"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
                   &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getUserId&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="c1"&gt;// Partition key&lt;/span&gt;
                   &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why Partition by &lt;code&gt;userId&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ordering Guarantee&lt;/strong&gt;: All notifications for a user go to the same partition (FIFO order preserved)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load Distribution&lt;/strong&gt;: Uniform distribution across partitions (assuming balanced user activity)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer Affinity&lt;/strong&gt;: Each consumer processes a subset of users (better caching)&lt;/li&gt;
&lt;/ul&gt;
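&lt;p&gt;The key-to-partition mapping can be sketched in a few lines. Kafka's default partitioner hashes the key bytes (murmur2) modulo the partition count; the Python sketch below uses CRC32 as a stand-in, which preserves the property that matters here: the same &lt;code&gt;userId&lt;/code&gt; always lands on the same partition.&lt;/p&gt;

```python
import zlib

NUM_PARTITIONS = 12

def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a user ID to a partition by hashing the key.

    Kafka's default partitioner uses murmur2 over the key bytes;
    CRC32 is a stand-in here with the same essential property:
    a given key always maps to the same partition.
    """
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions

# The same user always maps to the same partition, so their
# notifications stay in FIFO order within that partition.
assert partition_for("user-42") == partition_for("user-42")

# Different users spread across partitions for load distribution.
partitions = {partition_for(f"user-{i}") for i in range(1000)}
print(sorted(partitions))
```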

&lt;p&gt;&lt;strong&gt;Throughput Calculation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single consumer throughput: &lt;strong&gt;~500 msg/sec&lt;/strong&gt; (bounded by external API calls)&lt;/li&gt;
&lt;li&gt;12 partitions × 500 msg/sec = &lt;strong&gt;6000 msg/sec peak capacity&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Production average: &lt;strong&gt;~800 msg/sec&lt;/strong&gt;, peak: &lt;strong&gt;5000+ msg/sec&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
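&lt;p&gt;The capacity math above is worth checking directly. A small sketch (the 500 msg/sec per-consumer figure is the measured bound quoted above; note that consumers beyond the partition count sit idle in a consumer group):&lt;/p&gt;

```python
import math

def consumers_needed(target_rate: int, per_consumer_rate: int) -> int:
    """Minimum consumers required to sustain a target message rate."""
    return math.ceil(target_rate / per_consumer_rate)

PARTITIONS = 12
PER_CONSUMER = 500  # msg/sec, bounded by external API calls

# Peak capacity with one consumer per partition.
peak_capacity = PARTITIONS * PER_CONSUMER
print(peak_capacity)  # 6000

# Consumers needed for the 5000 msg/sec peak; must not exceed the
# partition count, since extra consumers in a group receive nothing.
needed = consumers_needed(5000, PER_CONSUMER)
print(needed)  # 10
assert needed <= PARTITIONS
```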




&lt;h2&gt;The Idempotency Challenge: Ensuring "Exactly-Once" Semantics&lt;/h2&gt;

&lt;h4&gt;The Duplicate Notification Problem&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Kafka delivers "at-least-once" by default. If a consumer processes a message but crashes before committing the offset, the message is redelivered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without Idempotency:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Consumer receives "Leave Approved" email notification
2. SendGrid API call succeeds (email sent)
3. Consumer crashes before committing Kafka offset
4. Message redelivered → duplicate email sent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Real-World Impact:&lt;/strong&gt; Users receiving 3-5 duplicate emails for the same action is unacceptable.&lt;/p&gt;

&lt;h4&gt;Solution: Two-Barrier Idempotency Pattern&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kafka Message → [Check Redis] → [Check PostgreSQL] → [Send Notification] → [Update Both]
                     ↓                ↓
              Fast dedup          Durable dedup
              (~1ms)              (~10ms)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Core Logic:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@KafkaListener&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"notification-events"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;processNotification&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;NotificationEvent&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;dedupeKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generateDedupeKey&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// userId + eventType + timestamp&lt;/span&gt;

    &lt;span class="c1"&gt;// Barrier 1: Redis (Fast Path - 99.9% of duplicates caught here)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;redisTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;opsForValue&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setIfAbsent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dedupeKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"PROCESSING"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Duration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofMinutes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Duplicate detected&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Barrier 2: PostgreSQL (Durable Fallback)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notificationRepo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;existsByDedupeKey&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dedupeKey&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Duplicate detected in database&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Process notification and mark as completed in both barriers&lt;/span&gt;
    &lt;span class="n"&gt;emailService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;send&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;notificationRepo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;save&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;NotificationRecord&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dedupeKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="n"&gt;redisTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;opsForValue&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dedupeKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"COMPLETED"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Duration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofHours&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why Two Barriers?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Redis (Primary)&lt;/strong&gt;: Ultra-fast deduplication (&amp;lt;1ms) for 99.9% of cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL (Fallback)&lt;/strong&gt;: Durable guarantee even if Redis is evicted or fails&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; 100% deduplication with minimal performance overhead&lt;/p&gt;
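&lt;p&gt;To make the two-barrier flow concrete outside of Spring, here is a compact Python sketch: a dict stands in for Redis and a set for the PostgreSQL table. The dedupe key is built from fields of the event itself (not processing time), so a redelivered message reproduces the same key:&lt;/p&gt;

```python
def dedupe_key(user_id: str, event_type: str, event_ts: int) -> str:
    """Build the dedupe key from the event's own fields, so a
    redelivered message produces an identical key."""
    return f"{user_id}:{event_type}:{event_ts}"

redis_like = {}       # fast barrier (stand-in for Redis SETNX + TTL)
durable_like = set()  # durable barrier (stand-in for the PostgreSQL table)
sent = []

def process(event):
    key = dedupe_key(event["user_id"], event["type"], event["ts"])
    # Barrier 1: set-if-absent (Redis SETNX); present means duplicate.
    if key in redis_like:
        return False
    redis_like[key] = "PROCESSING"
    # Barrier 2: durable check, in case the Redis entry was evicted.
    if key in durable_like:
        return False
    sent.append(key)              # side effect: the notification itself
    durable_like.add(key)         # mark completed in both barriers
    redis_like[key] = "COMPLETED"
    return True

event = {"user_id": "u1", "type": "LEAVE_APPROVED", "ts": 1712140000}
assert process(event) is True    # first delivery sends
assert process(event) is False   # redelivery is suppressed
print(len(sent))  # 1
```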




&lt;h2&gt;Circuit Breakers &amp;amp; Exponential Backoff: Surviving Third-Party Failures&lt;/h2&gt;

&lt;h4&gt;The Third-Party Dependency Problem&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt; External notification providers (SendGrid, Twilio, FCM) fail regularly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rate limits (429 errors)&lt;/li&gt;
&lt;li&gt;Transient network issues&lt;/li&gt;
&lt;li&gt;Scheduled maintenance&lt;/li&gt;
&lt;li&gt;Provider outages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Before Event-Driven Architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;These failures propagated to the leave approval API&lt;/li&gt;
&lt;li&gt;450 incidents/month affecting core business flows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After Kafka + Circuit Breakers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Failures isolated to notification consumers&lt;/li&gt;
&lt;li&gt;Core business logic unaffected&lt;/li&gt;
&lt;li&gt;~68 incidents/month (85% reduction)&lt;/li&gt;
&lt;/ul&gt;
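&lt;p&gt;We used an off-the-shelf breaker in production, but the mechanism fits in a few lines. A minimal count-based Python sketch (thresholds and names here are illustrative, not our production configuration): it opens after a run of consecutive failures and fast-fails while open, so a dead provider stops consuming retries.&lt;/p&gt;

```python
import time

class CircuitBreaker:
    """Minimal count-based breaker: opens after `threshold` consecutive
    failures, rejects calls while open, and half-opens after
    `reset_after` seconds to probe the provider again."""

    def __init__(self, threshold=5, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    @property
    def state(self):
        if self.opened_at is None:
            return "CLOSED"
        if time.monotonic() - self.opened_at >= self.reset_after:
            return "HALF_OPEN"
        return "OPEN"

    def call(self, fn, *args):
        if self.state == "OPEN":
            raise RuntimeError("circuit open: fast-fail, provider not called")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0   # any success resets the breaker
        self.opened_at = None
        return result

breaker = CircuitBreaker(threshold=3)

def flaky_send(_msg):
    raise ConnectionError("provider down")

for _ in range(3):
    try:
        breaker.call(flaky_send, "hello")
    except ConnectionError:
        pass
print(breaker.state)  # OPEN
```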

&lt;p&gt;&lt;strong&gt;Exponential Backoff:&lt;/strong&gt; Configured Kafka consumer retry with exponential backoff (1s → 2s → 4s → 8s → 16s max).&lt;/p&gt;
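&lt;p&gt;The schedule itself is a one-liner; real deployments would typically also add jitter so that many failed consumers don't retry in lockstep:&lt;/p&gt;

```python
def backoff_schedule(attempts: int, base: float = 1.0, cap: float = 16.0):
    """Exponential backoff delays with a cap, matching the
    1s -> 2s -> 4s -> 8s -> 16s progression described above."""
    return [min(base * (2 ** n), cap) for n in range(attempts)]

print(backoff_schedule(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 16.0]
```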

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SendGrid outage → Circuit breaker opens → DLQ captures events → No impact to core system&lt;/li&gt;
&lt;li&gt;450 → ~68 incidents/month (85% reduction in failure propagation)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Dead Letter Queue (DLQ): Zero-Loss Failure Handling&lt;/h2&gt;

&lt;h4&gt;The Problem of Permanent Failures&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Scenarios:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Invalid phone number (SMS)&lt;/li&gt;
&lt;li&gt;Malformed email address&lt;/li&gt;
&lt;li&gt;User opted out of notifications&lt;/li&gt;
&lt;li&gt;Provider returns 400 (non-retryable error)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Without DLQ:&lt;/strong&gt; After N retries, message is discarded → &lt;strong&gt;data loss&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;DLQ Architecture&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kafka Topic: notification-events
        ↓
    Consumer
        ↓
    [Try Process]
        ↓
    ┌──────────┬──────────┐
    │          │          │
  Success   Retryable  Non-Retryable
    │       Failure     Failure
    │          │          │
    ✓      [Retry]   [Send to DLQ]
            (3x)          ↓
                   Kafka Topic: notification-dlq
                          ↓
                   [Manual Review Queue]
                   [Alerting/Monitoring]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@KafkaListener&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"notification-events"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;consume&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;NotificationEvent&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;processNotification&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;RetryableException&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Let Kafka consumer retry&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;NonRetryableException&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Non-retryable error, sending to DLQ: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getMessage&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
        &lt;span class="n"&gt;dlqProducer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;send&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"notification-dlq"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;DLQ Monitoring:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grafana dashboard for DLQ message count&lt;/li&gt;
&lt;li&gt;PagerDuty alert if DLQ &amp;gt; 100 messages&lt;/li&gt;
&lt;li&gt;Daily batch job to review and reprocess DLQ messages&lt;/li&gt;
&lt;/ul&gt;
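&lt;p&gt;The daily reprocessing job can be as simple as replaying quarantined events through the (hopefully fixed) handler and keeping whatever still fails for the next review cycle. A minimal Python sketch, with in-memory lists standing in for the DLQ topic and &lt;code&gt;send_email&lt;/code&gt; as a hypothetical handler:&lt;/p&gt;

```python
def reprocess_dlq(dlq, handler):
    """Replay quarantined events through the handler. Events that
    still fail stay quarantined for another review cycle."""
    still_failing = []
    recovered = 0
    for event in dlq:
        try:
            handler(event)
            recovered += 1
        except Exception:
            still_failing.append(event)
    return recovered, still_failing

dlq = [{"id": 1, "email": "ok@example.com"},
       {"id": 2, "email": "bad-address"}]

def send_email(event):
    # Hypothetical handler: rejects malformed addresses.
    if "@" not in event["email"]:
        raise ValueError("invalid address")

recovered, remaining = reprocess_dlq(dlq, send_email)
print(recovered, [e["id"] for e in remaining])  # 1 [2]
```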




&lt;h2&gt;The Backend Engineer's Perspective: Event-Driven Patterns&lt;/h2&gt;

&lt;p&gt;As I built this system, I kept seeing parallels to distributed system patterns I'd encountered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kafka Consumers = Microservices&lt;/strong&gt;&lt;br&gt;
Each consumer group (Email, SMS, Push, Audit) is an independent microservice. They scale, fail, and deploy independently.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Idempotency = Distributed Transactions&lt;/strong&gt;&lt;br&gt;
The two-barrier pattern is similar to two-phase commit (2PC), but optimized for performance (fast Redis path + durable PostgreSQL fallback).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Circuit Breakers = Resilience Patterns&lt;/strong&gt;&lt;br&gt;
Same pattern I use in Spring Boot microservices with Resilience4j. The only difference? Here it's protecting Kafka consumers instead of REST APIs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DLQ = Exception Handling at Scale&lt;/strong&gt;&lt;br&gt;
DLQ is like a try-catch block for distributed systems. Instead of rethrowing, we quarantine failures for human review.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Partitioning = Database Sharding&lt;/strong&gt;&lt;br&gt;
Kafka partitions are like database shards—both distribute load across multiple nodes while preserving per-key ordering.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Insight:&lt;/strong&gt; Event-driven architecture isn't fundamentally different from synchronous microservices. It's the same engineering principles applied to asynchronous, decoupled workflows.&lt;/p&gt;




&lt;h2&gt;Conclusion: From Synchronous Coupling to Scalable Event Streams&lt;/h2&gt;

&lt;p&gt;When I started this project, I thought event-driven architecture was "advanced" and maybe unnecessary. Why not just keep it simple with synchronous REST calls?&lt;/p&gt;

&lt;p&gt;But as I profiled the system, measured the latency, and watched third-party failures bring down core business flows, the path forward became clear: &lt;strong&gt;services that don't need to fail together shouldn't be coupled together.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kafka gave us that decoupling. Circuit breakers gave us resilience. Idempotency gave us correctness. And DLQs gave us zero-loss guarantees.&lt;/p&gt;

&lt;p&gt;The result? A notification system that handles &lt;strong&gt;5000+ notifications/sec&lt;/strong&gt;, recovers from failures autonomously, and never blocks the critical path of leave approvals.&lt;/p&gt;

&lt;p&gt;If you're building microservices and hitting synchronous coupling bottlenecks, start with one question: &lt;strong&gt;&lt;em&gt;"Does this operation need to complete before returning a response to the user?"&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the answer is no, make it asynchronous. Your users—and your on-call engineers—will thank you.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Connect with me:&lt;/strong&gt; If you're building event-driven systems or have questions about Kafka architecture, let's discuss! I'm always learning and happy to share insights.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>microservices</category>
      <category>backend</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Demystifying Agentic AI: Why I'm Trading Chains for Graphs with LangGraph</title>
      <dc:creator>Ankit Kumar Shaw</dc:creator>
      <pubDate>Fri, 03 Apr 2026 07:52:30 +0000</pubDate>
      <link>https://forem.com/ankitkumarshaw/demystifying-agentic-ai-why-im-trading-chains-for-graphs-with-langgraph-2678</link>
      <guid>https://forem.com/ankitkumarshaw/demystifying-agentic-ai-why-im-trading-chains-for-graphs-with-langgraph-2678</guid>
      <description>&lt;h2&gt;The Hook: When Simple Prompts Aren't Enough&lt;/h2&gt;

&lt;p&gt;A year ago, if you asked me about AI integration in backend systems, I'd point to API calls to OpenAI with well-crafted prompts. Simple, predictable, stateless:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize this article&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Linear. Deterministic in its invocation pattern (even if the output varied). Just another REST API call in my Spring Boot microservices world.&lt;/p&gt;

&lt;p&gt;But then I encountered a real problem: &lt;strong&gt;What if the AI needs to make decisions, call tools, evaluate results, and loop back based on outcomes?&lt;/strong&gt; What if it needs to research a topic, validate findings, and &lt;em&gt;autonomously decide&lt;/em&gt; whether to dig deeper or move forward?&lt;/p&gt;

&lt;p&gt;That simple prompt-response pattern breaks down. You need &lt;strong&gt;state management&lt;/strong&gt;, &lt;strong&gt;conditional branching&lt;/strong&gt;, &lt;strong&gt;tool orchestration&lt;/strong&gt;, and &lt;strong&gt;retry logic&lt;/strong&gt;—concepts backend engineers like me live and breathe, but applied to non-deterministic AI workflows.&lt;/p&gt;

&lt;p&gt;This is where I discovered &lt;strong&gt;Agentic AI&lt;/strong&gt; and &lt;strong&gt;LangGraph&lt;/strong&gt;—a framework that treats AI tasks not as linear chains, but as state machines represented by graphs. As I explored the documentation and experimented with multi-agent systems, I realized: &lt;strong&gt;the engineering principles haven't changed. Only the substrate has.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Evolution: From Prompts to Chains to Autonomous Agents
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Stage 1: Simple Prompts (The API Call Era)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Input → LLM → Output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Limitation&lt;/strong&gt;: No memory, no context, no tool use. Every interaction is isolated.&lt;/p&gt;

&lt;h4&gt;
  
  
  Stage 2: LangChain (The Linear Pipeline Era)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input → Prompt Template → LLM → Output Parser → Result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Innovation&lt;/strong&gt;: Introduced &lt;strong&gt;chains&lt;/strong&gt;—sequential steps where each step's output feeds into the next. Added memory and basic tool calling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation&lt;/strong&gt;: Chains are &lt;em&gt;linear&lt;/em&gt;. If the LLM needs to loop back based on a condition (e.g., "research more if the answer isn't confident"), you're stuck. You can't model &lt;strong&gt;feedback loops&lt;/strong&gt; or &lt;strong&gt;conditional branching&lt;/strong&gt; elegantly.&lt;/p&gt;

&lt;h4&gt;
  
  
  Stage 3: Agentic AI with LangGraph (The Autonomous Decision Era)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    ┌─────────────┐
                    │   Agent     │
                    │  (Planner)  │
                    └──────┬──────┘
                           │
                ┌──────────┴──────────┐
                │                     │
         ┌──────▼──────┐       ┌─────▼──────┐
         │  Research   │       │ Summarize  │
         │   Tool      │       │   Tool     │
         └──────┬──────┘       └─────┬──────┘
                │                     │
                └──────────┬──────────┘
                           │
                    ┌──────▼──────┐
                    │  Evaluator  │
                    │  (Quality   │
                    │   Check)    │
                    └──────┬──────┘
                           │
                  ┌────────┴────────┐
                  │                 │
              Good Quality     Needs More Research
                  │                 │
            ┌─────▼──────┐         Loop Back to Agent
            │   Output   │
            └────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Shift&lt;/strong&gt;: The AI now controls the flow. It decides when to research, when to validate, when to retry, when to stop. This is &lt;strong&gt;Agentic AI&lt;/strong&gt;—autonomous, self-directed workflows.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why LangGraph? The Graph vs. Chain Paradigm
&lt;/h2&gt;

&lt;p&gt;As I researched frameworks (LangChain, CrewAI, AutoGen, LangGraph), a core architectural question emerged: &lt;strong&gt;How do you model complex, conditional AI workflows?&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem with Chains
&lt;/h3&gt;

&lt;p&gt;LangChain's chains are fundamentally &lt;strong&gt;Directed Acyclic Graphs (DAGs)&lt;/strong&gt; where each node executes once, in order. Perfect for pipelines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Retrieve Documents → Rank by Relevance → Send to LLM → Format Output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But what if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The LLM's output quality is poor, and you need to &lt;strong&gt;retry with a different strategy&lt;/strong&gt;?&lt;/li&gt;
&lt;li&gt;You want the AI to &lt;strong&gt;self-critique&lt;/strong&gt; and loop back to refine its answer?&lt;/li&gt;
&lt;li&gt;Different outcomes require &lt;strong&gt;branching logic&lt;/strong&gt; (e.g., if confidence &amp;lt; 70%, call a specialist agent)?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Chains don't handle &lt;strong&gt;cycles&lt;/strong&gt; or &lt;strong&gt;conditional edges&lt;/strong&gt; well. You end up with brittle workarounds.&lt;/p&gt;

&lt;h3&gt;
  
  
  LangGraph's Solution: Stateful Graphs with Cycles
&lt;/h3&gt;

&lt;p&gt;LangGraph treats workflows as &lt;strong&gt;state machines&lt;/strong&gt;. Each node is a function. Edges define transitions. The graph can have &lt;strong&gt;cycles&lt;/strong&gt; (loops), &lt;strong&gt;conditional routing&lt;/strong&gt;, and &lt;strong&gt;persistent state&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Concepts:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Nodes&lt;/strong&gt;: Functions that take state, perform an action (call LLM, invoke tool, validate), and return updated state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edges&lt;/strong&gt;: Define transitions. Can be conditional (&lt;code&gt;if confidence &amp;gt; 80%, go to "output"; else, go to "research_more"&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State&lt;/strong&gt;: A shared data structure (like a &lt;code&gt;TypedDict&lt;/code&gt;) that all nodes read/write. This is your "application context."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cycles&lt;/strong&gt;: Allowed! An agent can loop back to itself or a previous node.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;In backend engineering terms:&lt;/strong&gt; LangGraph is like Spring State Machine or workflow engines (Camunda, Temporal), but purpose-built for AI agents.&lt;/p&gt;
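
&lt;p&gt;These four concepts can be sketched without any framework. Below is a plain-Python toy (deliberately &lt;em&gt;not&lt;/em&gt; LangGraph's real API; all names are illustrative) showing a typed state, node functions, a conditional edge, and a bounded cycle:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from typing import TypedDict

# Hypothetical shared state -- the "application context" every node reads/writes.
class AgentState(TypedDict):
    question: str
    findings: list
    confidence: float
    iteration_count: int
    draft: str

def research(state):
    # Node: stand-in for a tool/LLM call; each pass adds evidence.
    state["findings"].append("fact #" + str(state["iteration_count"] + 1))
    state["confidence"] += 0.3
    state["iteration_count"] += 1
    return state

def summarize(state):
    # Node: produce the final answer from accumulated findings.
    state["draft"] = "summary of " + str(len(state["findings"])) + " findings"
    return state

def route(state):
    # Conditional edge: loop back until confident, with an escape hatch.
    if state["confidence"] &amp;gt;= 0.8 or state["iteration_count"] &amp;gt;= 5:
        return "summarize"
    return "research"

def run(state):
    node = "research"
    while node == "research":   # the cycle: research may route back to itself
        state = research(state)
        node = route(state)
    return summarize(state)

result = run({"question": "q", "findings": [], "confidence": 0.0,
              "iteration_count": 0, "draft": ""})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In LangGraph itself you would declare the same shape on a &lt;code&gt;StateGraph&lt;/code&gt; (nodes, conditional edges, an entry point) and let the compiled runtime drive the loop.&lt;/p&gt;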




&lt;h2&gt;
  
  
  LangGraph vs. LangChain: An Architectural Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;LangChain&lt;/th&gt;
&lt;th&gt;LangGraph&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mental Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Linear pipelines (chains)&lt;/td&gt;
&lt;td&gt;State machines (graphs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flow Control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sequential, mostly linear&lt;/td&gt;
&lt;td&gt;Conditional routing, cycles allowed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Chain-scoped memory&lt;/td&gt;
&lt;td&gt;Graph-level, persistent state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best For&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Simple RAG, Q&amp;amp;A, document processing&lt;/td&gt;
&lt;td&gt;Multi-step reasoning, agent loops, self-correction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low to Medium&lt;/td&gt;
&lt;td&gt;Medium to High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;When to Use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Do A → B → C"&lt;/td&gt;
&lt;td&gt;"Do A, evaluate, maybe do B, loop if needed, then C"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Example Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangChain&lt;/strong&gt;: Search documents, rank by relevance, answer question. (One-shot workflow)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph&lt;/strong&gt;: Research a topic, validate findings, if insufficient → research more, summarize when confident. (Iterative, agent-driven)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In my research&lt;/strong&gt;, I found LangGraph is ideal when the AI needs to &lt;em&gt;think in loops&lt;/em&gt;—like a backend retry mechanism, but at the &lt;em&gt;decision-making level&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Backend Engineer's Perspective: Familiar Patterns in New Territory
&lt;/h2&gt;

&lt;p&gt;As I explored LangGraph, I kept seeing parallels to systems I've built with Spring Boot:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. &lt;strong&gt;State Management = Request Context&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;In Spring Boot, I use &lt;code&gt;@RequestScope&lt;/code&gt; beans or &lt;code&gt;ThreadLocal&lt;/code&gt; to share context across service layers. LangGraph's &lt;code&gt;AgentState&lt;/code&gt; is the same concept—a shared object that flows through the workflow.&lt;/p&gt;
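
&lt;p&gt;As a sketch of that analogy (field names are hypothetical, not from any real schema), an agent state is just a typed, shared structure that flows through the graph the way a request context flows through service layers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from typing import TypedDict

# Hypothetical AgentState: the graph-level "request context".
# Every node receives it, reads what it needs, and returns updates.
class AgentState(TypedDict):
    user_query: str        # immutable input, like a request payload
    retrieved_docs: list   # accumulated by research nodes
    draft_answer: str      # written by the summarizer node
    confidence: float      # written by the evaluator node

state: AgentState = {
    "user_query": "What changed in Q3?",
    "retrieved_docs": [],
    "draft_answer": "",
    "confidence": 0.0,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;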

&lt;h4&gt;
  
  
  2. &lt;strong&gt;Conditional Routing = Service Orchestration&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;When building microservices, I often have logic like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;hasPermission&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;orderService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createOrder&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;UnauthorizedException&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LangGraph's &lt;strong&gt;conditional edges&lt;/strong&gt; are the same—routing based on state.&lt;/p&gt;
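
&lt;p&gt;The same branch, expressed as a LangGraph-style routing function over state (names are illustrative): instead of calling a service directly, it returns the &lt;em&gt;name&lt;/em&gt; of the next node.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def route_after_evaluation(state):
    # Conditional edge: pick the next node from state,
    # just like the permission check picks a code path.
    if state["confidence"] &amp;gt; 0.8:
        return "output"
    return "research_more"

print(route_after_evaluation({"confidence": 0.55}))   # research_more
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;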

&lt;h4&gt;
  
  
  3. &lt;strong&gt;Retry Logic = Resilience Patterns&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The evaluator-to-planner loop is like Resilience4j's retry mechanism:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Retry&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"orderService"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallbackMethod&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"fallback"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Order&lt;/span&gt; &lt;span class="nf"&gt;createOrder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LangGraph just applies this pattern to &lt;em&gt;AI decision-making&lt;/em&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. &lt;strong&gt;Preventing Infinite Loops = Rate Limiting&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;iteration_count&lt;/code&gt; guard is like rate limiting or max retry configs in Spring Boot. You always need an escape hatch.&lt;/p&gt;
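
&lt;p&gt;A minimal version of that escape hatch (illustrative names, not LangGraph's API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;MAX_ITERATIONS = 5   # hard cap, like max-attempts in a retry config

def guard(state):
    # Force termination even if the model never becomes "confident".
    if state["iteration_count"] &amp;gt;= MAX_ITERATIONS:
        return "give_up"           # fallback path
    if state["confidence"] &amp;gt;= 0.8:
        return "output"
    state["iteration_count"] += 1
    return "research_more"         # loop back
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Cost caps work the same way: accumulate token spend in state and route to a terminal node once the budget is exceeded.&lt;/p&gt;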

&lt;p&gt;&lt;strong&gt;The Insight&lt;/strong&gt;: &lt;strong&gt;Agentic AI isn't magic—it's distributed systems applied to non-deterministic workflows.&lt;/strong&gt; My experience with microservices, state machines, and fault tolerance directly translates.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters: The Future of Backend + AI Integration
&lt;/h2&gt;

&lt;p&gt;As I dug deeper into agentic AI, I realized: &lt;strong&gt;backend engineers are uniquely positioned to excel in this space&lt;/strong&gt;. Why?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;We Understand State&lt;/strong&gt;: Managing state across distributed systems is our bread and butter. AI agents are just another stateful workflow.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;We Think in Graphs&lt;/strong&gt;: Microservices communication, DAG-based workflows (Airflow, Temporal), dependency graphs—we already model systems as graphs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;We Prioritize Reliability&lt;/strong&gt;: Retry logic, fallbacks, timeout handling, idempotency—all critical for production AI systems. Most AI tutorials skip this.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;We Know Observability&lt;/strong&gt;: Logging, tracing, monitoring—essential when your AI agent makes 10 autonomous decisions before producing output.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;LangGraph is essentially a workflow engine for AI&lt;/strong&gt;. If you've worked with Camunda, Temporal, or AWS Step Functions, you already have the mental model.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned: Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graphs &amp;gt; Chains for Complex Reasoning:&lt;/strong&gt;  If your AI needs to loop, self-correct, or make decisions, linear chains won't cut it. LangGraph's state machine model handles complexity gracefully.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;State is Everything:&lt;/strong&gt;  The &lt;code&gt;AgentState&lt;/code&gt; object is the contract between nodes. Design it carefully—just like you would a database schema.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Conditional Routing is Where Intelligence Lives:&lt;/strong&gt;  The &lt;code&gt;should_continue_research&lt;/code&gt; function determines the agent's behavior. This is where you encode business logic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Always Have an Escape Hatch:&lt;/strong&gt;  AI can loop forever if you're not careful. Iteration limits, cost caps, and timeout mechanisms are non-negotiable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backend Principles Apply:&lt;/strong&gt; Separation of concerns, idempotency, retry logic, observability—all the patterns I use in Spring Boot apply to agentic AI.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion: AI Meets Backend Engineering
&lt;/h2&gt;

&lt;p&gt;When I started exploring agentic AI, I worried I was stepping outside my backend engineering domain. But as I built these systems, I realized: &lt;strong&gt;the fundamentals are identical&lt;/strong&gt;. State management, control flow, error handling, observability—these are universal engineering principles.&lt;/p&gt;

&lt;p&gt;LangGraph is the bridge between AI's non-deterministic nature and backend engineering's demand for structure. It lets us build &lt;strong&gt;reliable, autonomous systems&lt;/strong&gt; where AI makes decisions, but engineering rigor ensures they're safe, observable, and maintainable.&lt;/p&gt;

&lt;p&gt;If you're a backend engineer curious about AI, my advice: &lt;strong&gt;Start with LangGraph&lt;/strong&gt;. You'll recognize the patterns. The only difference? Instead of HTTP requests between services, you have LLM calls between agents.&lt;/p&gt;

&lt;p&gt;And that's not a limitation—it's an opportunity.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Connect with me:&lt;/strong&gt;  If you're exploring AI integration in backend systems or have questions about LangGraph, let's discuss! I'm actively learning this space and happy to share insights.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>langgraph</category>
      <category>llm</category>
      <category>agentaichallenge</category>
    </item>
    <item>
      <title>From 1.5s to 250ms: How We 6x'd API Latency with Spring Boot Optimization</title>
      <dc:creator>Ankit Kumar Shaw</dc:creator>
      <pubDate>Fri, 03 Apr 2026 06:26:52 +0000</pubDate>
      <link>https://forem.com/ankitkumarshaw/from-15s-to-250ms-how-we-6xd-api-latency-with-spring-boot-optimization-27e0</link>
      <guid>https://forem.com/ankitkumarshaw/from-15s-to-250ms-how-we-6xd-api-latency-with-spring-boot-optimization-27e0</guid>
      <description>&lt;h2&gt;
  
  
  The Hook: When Your System Meets Reality
&lt;/h2&gt;

&lt;p&gt;Picture this: It's Tuesday morning, and your leave management system—serving 20,000 employees across enterprise teams—suddenly becomes the bottleneck. Approvals that should take seconds are taking &lt;strong&gt;1.5 seconds per request&lt;/strong&gt;. Dashboard loads feel sluggish. Support tickets flood in. The database team reports &lt;strong&gt;85% CPU utilization&lt;/strong&gt;, and your ops team is preparing incident escalation.&lt;/p&gt;

&lt;p&gt;The system was built to "work." But production doesn't reward "working"—it demands &lt;strong&gt;reliability, speed, and efficiency at scale&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is the story of how we inherited a system with &lt;strong&gt;p99 latency of 1.5s&lt;/strong&gt; on a platform serving 15K+ daily requests, and through methodical architectural profiling and optimization, reduced it to &lt;strong&gt;250ms (6x faster)&lt;/strong&gt;, while simultaneously cutting &lt;strong&gt;database CPU consumption by 40%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Even after migrating from legacy Python/Flask to Spring Boot, the latency bottleneck persisted. The migration provided better observability and scalability, but the root causes—inefficient queries, undersized connection pools, and missing caching—remained. The optimizations below solved the actual problem.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Scenario: A System Under Pressure
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Current State (Post-Migration, Spring Boot)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System&lt;/strong&gt;: Newly migrated Spring Boot microservice (from legacy Python/Flask)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale&lt;/strong&gt;: 20,000 employees, 15,000+ requests/day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: p99 = 1.5 seconds (unacceptable for user experience)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Load&lt;/strong&gt;: 85% CPU utilization on a 4-core instance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Problem&lt;/strong&gt;: Migration alone didn't solve the performance issues; architectural inefficiencies remained&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Why This Matters
&lt;/h4&gt;

&lt;p&gt;At scale, every 100ms saved = better UX, lower operational costs, and reduced infrastructure spend. For 15K daily requests, cutting latency by 1.25s saves &lt;strong&gt;roughly 5 compute-hours of request time daily&lt;/strong&gt; (15,000 × 1.25s ≈ 18,750s). That's the difference between needing two database instances or one.&lt;/p&gt;
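
&lt;p&gt;The back-of-the-envelope math:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;requests_per_day = 15_000
latency_saved_s = 1.25            # 1.5s -&amp;gt; 0.25s per request
total_s = requests_per_day * latency_saved_s
print(total_s / 3600)             # about 5.2 compute-hours of request time/day
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;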

&lt;h3&gt;
  
  
  Root Cause Analysis: Finding the Bottleneck
&lt;/h3&gt;

&lt;p&gt;Our investigation focused on three key areas:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. &lt;strong&gt;Database Query Profiling&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Using Spring Boot Actuator and MySQL slow query logs, we discovered the smoking gun: &lt;strong&gt;the N+1 Query Problem&lt;/strong&gt;, a textbook horror story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ BEFORE: N+1 Query Anti-pattern&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;LeaveRequest&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getLeaveRequestsByDepartment&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;deptId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Department&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;depts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;em&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"SELECT d FROM Department d WHERE d.id = :deptId"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
        &lt;span class="nc"&gt;Department&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setParameter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"deptId"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;deptId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getResultList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// Query 1: Fetch department&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Department&lt;/span&gt; &lt;span class="n"&gt;dept&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;depts&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Employee&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;em&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;"SELECT e FROM Employee e WHERE e.department.id = :deptId"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="nc"&gt;Employee&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setParameter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"deptId"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dept&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getId&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getResultList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// Query 2-N: Fetch all employees (1 per loop)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Employee&lt;/span&gt; &lt;span class="n"&gt;emp&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;LeaveRequest&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;leaves&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;em&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                &lt;span class="s"&gt;"SELECT l FROM LeaveRequest l WHERE l.employee.id = :empId"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                &lt;span class="nc"&gt;LeaveRequest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setParameter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"empId"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;emp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getId&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getResultList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// Query 3-N²: Fetch leaves per employee&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Impact&lt;/strong&gt;: For a department with 500 employees, this triggered &lt;strong&gt;1 (department) + 1 (employee list) + 500 (one leave query per employee) = 502 queries&lt;/strong&gt; per request.&lt;br&gt;&lt;br&gt;
Database response time per request: &lt;strong&gt;~1.2 seconds&lt;/strong&gt; (just waiting for the database).&lt;/p&gt;
&lt;h4&gt;
  
  
  2. &lt;strong&gt;Connection Pool Exhaustion&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;HikariCP was configured with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;spring.datasource.hikari.maximum-pool-size=10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With 15K requests/day and 1.2s per database roundtrip, connections were being held too long. We hit connection pool saturation, causing requests to queue.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. &lt;strong&gt;Missing Caching Layer&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;User roles, department hierarchies, and leave policies were fetched on every request—data that changes infrequently but was being queried thousands of times daily.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution 1: Hibernate JOIN FETCH (Eliminate N+1)
&lt;/h3&gt;

&lt;p&gt;The most impactful change was using &lt;strong&gt;Hibernate's JOIN FETCH&lt;/strong&gt; to eagerly load relationships in a single query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ✅ AFTER: Single JOIN FETCH query&lt;/span&gt;
&lt;span class="nd"&gt;@Query&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""
    SELECT DISTINCT d FROM Department d
    LEFT JOIN FETCH d.employees e
    LEFT JOIN FETCH e.leaveRequests l
    WHERE d.id = :deptId
    """&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Department&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getLeaveRequestsByDepartment&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;@Param&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"deptId"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;deptId&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Trade-off&lt;/strong&gt;: This fetches &lt;em&gt;all&lt;/em&gt; data upfront (potentially unnecessary if you only need recent leaves). But in this case, the leave request list was always needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: From 2000+ queries → &lt;strong&gt;1 query&lt;/strong&gt;. Database latency dropped from 1.2s → 350ms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution 2: HikariCP Connection Pool Tuning
&lt;/h3&gt;

&lt;p&gt;Before diving into pool size, we profiled connection lifecycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ BEFORE (Undersized)&lt;/span&gt;
&lt;span class="s"&gt;spring.datasource.hikari.maximum-pool-size=10&lt;/span&gt;
&lt;span class="s"&gt;spring.datasource.hikari.minimum-idle=2&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ AFTER (Right-sized for concurrency)&lt;/span&gt;
&lt;span class="s"&gt;spring.datasource.hikari.maximum-pool-size=50&lt;/span&gt;
&lt;span class="s"&gt;spring.datasource.hikari.minimum-idle=10&lt;/span&gt;
&lt;span class="s"&gt;spring.datasource.hikari.connection-timeout=10000&lt;/span&gt; &lt;span class="c1"&gt;# 10s timeout before failing&lt;/span&gt;
&lt;span class="s"&gt;spring.datasource.hikari.idle-timeout=600000&lt;/span&gt;     &lt;span class="c1"&gt;# 10 min before closing idle connections&lt;/span&gt;
&lt;span class="s"&gt;spring.datasource.hikari.leak-detection-threshold=60000&lt;/span&gt; &lt;span class="c1"&gt;# Detect connection leaks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why 50?&lt;/strong&gt; &lt;br&gt;
Using &lt;strong&gt;Little's Law&lt;/strong&gt; (in-flight queries = request arrival rate × average query time, which tells you how many connections are busy at once):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Measured concurrent requests at p99: &lt;strong&gt;30-40&lt;/strong&gt; (from production JMX metrics)&lt;/li&gt;
&lt;li&gt;Safety buffer for traffic spikes: &lt;strong&gt;+10-20&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total: 50 connections&lt;/strong&gt; (30-40 + 10-20)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This sizing eliminates connection queue time while avoiding resource waste.&lt;/p&gt;
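
&lt;p&gt;As a quick sanity check of that sizing (the peak arrival rate below is an assumed figure consistent with the measured 30-40 concurrent requests):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Little's Law: busy connections L = arrival_rate (lambda) * avg_query_time (W)
peak_rps = 30            # assumed peak request rate (req/s)
avg_query_s = 1.2        # measured DB time per request before the query fix
in_flight = peak_rps * avg_query_s    # 36 connections busy on average
spike_buffer = 14        # headroom for bursts
pool_size = round(in_flight) + spike_buffer
print(pool_size)         # 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;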

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Connection queue time eliminated. No more "Connection pool exhausted" errors.&lt;/p&gt;
&lt;h3&gt;
  
  
  Solution 3: Redis-Backed Caching Strategy
&lt;/h3&gt;

&lt;p&gt;We implemented a &lt;strong&gt;two-tier caching strategy&lt;/strong&gt; for frequently-accessed, infrequently-changed data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Service&lt;/span&gt;
&lt;span class="nd"&gt;@RequiredArgsConstructor&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LeavePolicyCache&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;RedisTemplate&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LeavePolicy&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;redisTemplate&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;LeavePolicyRepository&lt;/span&gt; &lt;span class="n"&gt;leavePolicyRepository&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="no"&gt;CACHE_KEY_PREFIX&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"leave:policy:"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="no"&gt;TTL_SECONDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;86400&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// 24 hours&lt;/span&gt;

    &lt;span class="cm"&gt;/**
     * Fetch leave policy for an employee (vacation quota, sick leave, carry-forward rules).
     * Policies rarely change, so we cache them aggressively.
     */&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;LeavePolicy&lt;/span&gt; &lt;span class="nf"&gt;getLeavePolicy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;employeeId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;cacheKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;CACHE_KEY_PREFIX&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;employeeId&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// Try Redis first&lt;/span&gt;
        &lt;span class="nc"&gt;LeavePolicy&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redisTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;opsForValue&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cacheKey&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Cache hit: O(1) lookup, &amp;lt;1ms response&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Cache miss: Fetch from database&lt;/span&gt;
        &lt;span class="c1"&gt;// (joins employee → employment_type → leave_policy tables)&lt;/span&gt;
        &lt;span class="nc"&gt;LeavePolicy&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;leavePolicyRepository&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findByEmployeeId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;employeeId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;orElseThrow&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// Populate cache with 24-hour TTL&lt;/span&gt;
        &lt;span class="n"&gt;redisTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;opsForValue&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cacheKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Duration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofSeconds&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;TTL_SECONDS&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@EventListener&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;onLeavePolicyUpdated&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LeavePolicyUpdatedEvent&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Invalidate cache when policies change (e.g., annual quota reset, policy update)&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;cacheKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;CACHE_KEY_PREFIX&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getEmployeeId&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;redisTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;delete&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cacheKey&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What We Cache (and Why)&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Leave Policies&lt;/strong&gt;: Vacation quotas, sick leave limits, carry-forward rules — accessed on every leave request submission/validation, but change only during annual resets or policy updates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approval Workflows&lt;/strong&gt;: Manager hierarchies and approval chains — needed to route leave requests, but organization structure changes infrequently.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Cache Invalidation Strategy&lt;/strong&gt;: Event-driven (on policy/org change, we emit events and invalidate) + 24-hour TTL.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Why Long TTL?&lt;/strong&gt; Leave policies are extremely stable (change annually or when employee switches departments). 24-hour TTL ensures we catch manual DB changes without event emission.&lt;/p&gt;
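&lt;p&gt;The cache-aside flow above can be reduced to a plain-Java sketch that makes the hit/miss/invalidate paths explicit. This is a hypothetical stand-in, not the production code: a &lt;code&gt;ConcurrentHashMap&lt;/code&gt; plays the role of Redis, and &lt;code&gt;loadFromDb()&lt;/code&gt; plays the repository:&lt;/p&gt;

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Plain-Java sketch of the cache-aside pattern described above.
// A ConcurrentHashMap stands in for Redis and loadFromDb() for the
// repository; both are illustrative stand-ins, not the production code.
public class PolicyCacheSketch {

    // 24-hour TTL, matching the real cache
    public static final long TTL_MILLIS = 86_400_000L;

    static class Entry {
        final Object value;
        final long expiresAt;
        Entry(Object value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Map cache = new ConcurrentHashMap(); // raw types keep the sketch short
    public int dbReads = 0;                            // counts cache misses, for illustration

    // Stand-in for leavePolicyRepository.findByEmployeeId(...)
    public Object loadFromDb(Long employeeId) {
        dbReads++;
        return "policy-for-" + employeeId;
    }

    public Object get(Long employeeId) {
        Entry e = (Entry) cache.get(employeeId);
        long now = System.currentTimeMillis();
        if (e != null) {
            if (now > e.expiresAt) {
                cache.remove(employeeId);  // TTL catches changes made without an event
            } else {
                return e.value;            // cache hit
            }
        }
        Object policy = loadFromDb(employeeId);                     // cache miss
        cache.put(employeeId, new Entry(policy, now + TTL_MILLIS)); // repopulate
        return policy;
    }

    // Mirrors the @EventListener: policy changed, drop the stale entry
    public void onPolicyUpdated(Long employeeId) {
        cache.remove(employeeId);
    }
}
```

&lt;p&gt;One difference worth noting: Redis enforces the TTL server-side (the &lt;code&gt;Duration&lt;/code&gt; passed to &lt;code&gt;set()&lt;/code&gt;), so the real implementation never checks expiry itself the way this sketch must.&lt;/p&gt;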

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Leave policy lookups: 120ms (database join across 3 tables) → &lt;strong&gt;&amp;lt;1ms (Redis)&lt;/strong&gt;. Reduced database CPU by ~40% since these queries were happening on every leave request view/submission.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview: Before vs. After
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Before Optimization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client Request
    ↓
Spring Boot Controller
    ↓
Service Layer (Business Logic)
    ↓
[N+1 Queries] → Database (1.2s latency)
    ↓
[No Caching]
    ↓
Response (1.5s p99)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  After Optimization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client Request
    ↓
Spring Boot Controller
    ↓
Cache Check (Redis) ← 1ms ✅
    ├─ Hit: Return cached Leave Policy
    └─ Miss: Continue to database
    ↓
Service Layer (Business Logic)
    ↓
[Single JOIN FETCH Query] → Database (350ms)
    ↓
[HikariCP Optimized] (50-connection pool)
    ↓
Response (250ms p99) ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
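&lt;p&gt;For reference, the pool settings in the diagram map onto standard Spring Boot HikariCP properties. The keys below are the real property names; the values mirror the numbers in this post and are illustrative, not a recommendation — derive yours from your own load profile:&lt;/p&gt;

```yaml
# Illustrative application.yml fragment; keys are standard
# spring.datasource.hikari.* properties, values mirror this post.
spring:
  datasource:
    hikari:
      maximum-pool-size: 50      # raised from the default of 10
      minimum-idle: 10
      connection-timeout: 30000  # ms to wait for a free connection
      max-lifetime: 1800000      # recycle connections every 30 min
```

&lt;p&gt;With Actuator enabled, the &lt;code&gt;hikaricp.connections.active&lt;/code&gt; and &lt;code&gt;hikaricp.connections.pending&lt;/code&gt; metrics show whether the pool is actually saturating.&lt;/p&gt;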



&lt;h2&gt;
  
  
  Results: The Metrics That Matter
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;p99 Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.5s&lt;/td&gt;
&lt;td&gt;250ms&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;6x faster&lt;/strong&gt; ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;p50 Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;850ms&lt;/td&gt;
&lt;td&gt;180ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.7x faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Database CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;51%&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;40% reduction&lt;/strong&gt; ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Queries/Request&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2000+&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;99.95% fewer&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Redis Hit Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;87%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Pool Timeout Errors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;45/day&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100% elimination&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Business Impact
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User Experience&lt;/strong&gt;: Dashboard loads feel snappy (250ms vs 1.5s)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt;: Supports 15K+ daily requests on a smaller database instance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability&lt;/strong&gt;: No more p99 latency spikes during business hours&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Lessons Learned: Engineering Trade-offs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What We Sacrificed
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;: JOIN FETCH loads all data. If you only needed recent leaves, you'd fetch unnecessary historical data.&lt;br&gt;
&lt;strong&gt;Solution&lt;/strong&gt;: Separate queries for different use cases (leave summary vs. full history).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt;: Increased HikariCP pool size (10 → 50) uses more heap.&lt;br&gt;
&lt;strong&gt;Reality&lt;/strong&gt;: Better to spend extra memory than to let requests queue. Monitoring showed peak heap at 1.2GB of the 4GB available.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cache Consistency&lt;/strong&gt;: Redis introduces eventual consistency.&lt;br&gt;
&lt;strong&gt;Mitigation&lt;/strong&gt;: Event-driven invalidation catches normal updates immediately, and the 24-hour TTL bounds staleness from any missed event to a day.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
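&lt;p&gt;To see why the first trade-off exists at all, here is a toy model of the N+1 pattern in plain Java (no JPA; all names are made up). The lazy path issues one query for the employee list plus one per employee; the JPA fix is a single fetch join along the lines of &lt;code&gt;@Query("SELECT DISTINCT e FROM Employee e JOIN FETCH e.leaveRequests")&lt;/code&gt;:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the N+1 pattern (plain Java, no JPA; all names hypothetical).
// The lazy path issues one query for the employee list plus one query per
// employee; the join-fetch path issues a single query for everything.
public class NPlusOneDemo {
    public int queriesIssued = 0;

    public List fetchEmployees(int n) {               // 1 query for the parent rows
        queriesIssued++;
        List out = new ArrayList();
        for (int i = n; i > 0; i--) out.add("emp-" + i);
        return out;
    }

    public void fetchLeavesLazily(Object employee) {  // 1 query per parent: the "+N"
        queriesIssued++;
    }

    public void fetchEmployeesWithLeavesJoined() {    // single JOIN FETCH-style query
        queriesIssued++;
    }

    public static void main(String[] args) {
        NPlusOneDemo lazy = new NPlusOneDemo();
        List emps = lazy.fetchEmployees(2000);
        for (Object e : emps) lazy.fetchLeavesLazily(e);
        System.out.println("lazy path: " + lazy.queriesIssued + " queries");  // 2001

        NPlusOneDemo joined = new NPlusOneDemo();
        joined.fetchEmployeesWithLeavesJoined();
        System.out.println("join fetch: " + joined.queriesIssued + " query"); // 1
    }
}
```

&lt;p&gt;With 2,000 parent rows the lazy path issues 2,001 "queries" against the joined path's one, the same shape as the 2000+ roundtrips per request we measured.&lt;/p&gt;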

&lt;h3&gt;
  
  
  What Worked
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;JOIN FETCH&lt;/strong&gt;: The single largest win, a 99.95% query reduction.&lt;br&gt;
&lt;strong&gt;Connection pooling&lt;/strong&gt;: Eliminated queueing entirely.&lt;br&gt;
&lt;strong&gt;Caching strategy&lt;/strong&gt;: 87% hit rate validates our data access patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Profile First, Optimize Second&lt;/strong&gt;&lt;br&gt;
Don't guess where the bottleneck is. We assumed CPU was the issue—it was actually database queries. Use Spring Boot Actuator, MySQL slow logs, and flamegraphs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;N+1 is Insidious&lt;/strong&gt;&lt;br&gt;
With 20K employees and historical leave records, N+1 queries cascaded into 2000+ database roundtrips per request. &lt;strong&gt;Always use EXPLAIN PLAN and test with realistic data sizes&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Connection Pooling Matters More Than You Think&lt;/strong&gt;&lt;br&gt;
Undersized pools (10 connections) caused request queueing, which is invisible in application metrics but devastating to latency. Right-size based on concurrency math, not gut feel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Caching is Not Free&lt;/strong&gt;&lt;br&gt;
Cache invalidation is hard. We chose event-driven + TTL because it's reliable and simple. Measure hit rates to validate your strategy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;6x Latency Improvement = 6x Better UX&lt;/strong&gt;&lt;br&gt;
Users feel the difference between 1.5s and 250ms. This wasn't just a technical victory—it was a product improvement.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
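&lt;p&gt;The "concurrency math" in takeaway 3 can be made concrete with Little's Law: connections in use equal the arrival rate times the time each request holds a connection. A minimal sketch, with illustrative numbers rather than our production figures:&lt;/p&gt;

```java
// Back-of-envelope connection-pool sizing via Little's Law:
//   connections in use ~= arrival rate (req/s) * time a request holds a connection (s)
// The numbers in main() are illustrative, not our production config.
public class PoolSizing {
    public static int requiredConnections(double requestsPerSecond,
                                          double connectionHoldSeconds,
                                          double burstHeadroom) {
        double inUse = requestsPerSecond * connectionHoldSeconds; // L = lambda * W
        return (int) Math.ceil(inUse * burstHeadroom);            // round up, leave burst room
    }

    public static void main(String[] args) {
        // e.g. 100 req/s, each holding a connection for about 350 ms, 40% headroom
        System.out.println(requiredConnections(100.0, 0.35, 1.4)); // prints 49
    }
}
```

&lt;p&gt;At 100 req/s with a ~350 ms hold time and 40% burst headroom, the formula lands at 49 connections, in the neighborhood of the 50-connection pool used here.&lt;/p&gt;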




</description>
      <category>springboot</category>
      <category>distributedsystems</category>
      <category>redis</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
