<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ankit Kumar Shaw</title>
    <description>The latest articles on Forem by Ankit Kumar Shaw (@ankitkumarshaw).</description>
    <link>https://forem.com/ankitkumarshaw</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3857996%2F8087e457-fdb9-4df4-b58c-3e0ef7d1d41e.png</url>
      <title>Forem: Ankit Kumar Shaw</title>
      <link>https://forem.com/ankitkumarshaw</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ankitkumarshaw"/>
    <language>en</language>
    <item>
      <title>Building a 5000+ Notifications/sec Event Pipeline with Kafka &amp; Distributed Idempotency</title>
      <dc:creator>Ankit Kumar Shaw</dc:creator>
      <pubDate>Fri, 03 Apr 2026 10:24:46 +0000</pubDate>
      <link>https://forem.com/ankitkumarshaw/building-a-5000-notificationssec-event-pipeline-with-kafka-distributed-idempotency-1fpp</link>
      <guid>https://forem.com/ankitkumarshaw/building-a-5000-notificationssec-event-pipeline-with-kafka-distributed-idempotency-1fpp</guid>
      <description>&lt;h2&gt;The Hook: When Synchronous Becomes the Bottleneck&lt;/h2&gt;

&lt;p&gt;Picture this: Your leave approval API is taking 2.3 seconds to respond. Not because the database is slow. Not because the business logic is complex. But because it's waiting for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An email to be sent to the approver&lt;/li&gt;
&lt;li&gt;An SMS notification to the employee&lt;/li&gt;
&lt;li&gt;An audit log to be written to a compliance database&lt;/li&gt;
&lt;li&gt;A push notification to mobile devices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these external calls adds 300-500ms of latency. And if one fails? The entire approval operation fails. The user sees an error. The leave request is lost. Support tickets flood in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This was the reality&lt;/strong&gt; when I joined the notification team. Our leave management system—serving 20,000 employees—was tightly coupled with notification delivery. Every approval action had to wait for emails, SMS, and audit logs to complete. If SendGrid was slow or Twilio was down, leave approvals ground to a halt.&lt;/p&gt;

&lt;p&gt;The solution? &lt;strong&gt;Decouple notification delivery from the critical path using event-driven architecture with Apache Kafka.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the story of how we re-architected the notification system to handle &lt;strong&gt;5000+ notifications/sec&lt;/strong&gt; with &lt;strong&gt;zero message loss&lt;/strong&gt;, &lt;strong&gt;100% idempotency&lt;/strong&gt;, and &lt;strong&gt;85% reduction in third-party failure propagation&lt;/strong&gt;—all while cutting approval API latency by &lt;strong&gt;70%&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;The Problem: Synchronous Coupling at Scale&lt;/h2&gt;

&lt;h4&gt;Original Architecture (The Synchronous Nightmare)&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → Leave Approval API → [Business Logic]
                                    ↓
                    ┌───────────────┴───────────────┐
                    │                               │
          [Send Email] ──→ SendGrid (500ms)        │
                    │                               │
          [Send SMS] ──→ Twilio (400ms)            │
                    │                               │
          [Audit Log] ──→ Compliance DB (300ms)    │
                    │                               │
          [Push Notification] ──→ FCM (350ms)      │
                    │                               │
                    └───────────────┬───────────────┘
                                    ↓
                          Total Latency: 2.3s
                          Return Response to User
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Cascading Failures:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Latency Amplification&lt;/strong&gt;: Every notification channel adds to API response time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure Propagation&lt;/strong&gt;: If SendGrid is down, the entire approval fails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Retry Logic&lt;/strong&gt;: Failed notifications are lost forever&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poor User Experience&lt;/strong&gt;: Users wait 2+ seconds for a simple approval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tight Coupling&lt;/strong&gt;: Business logic can't evolve independently of notification logic&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Production Impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;450 incidents/month&lt;/strong&gt; from third-party notification provider failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;p99 latency: 2.3 seconds&lt;/strong&gt; (unacceptable for user-facing APIs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero fault isolation&lt;/strong&gt;: One provider outage affects all operations&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;The Solution: Event-Driven Architecture with Kafka&lt;/h2&gt;

&lt;h4&gt;The Core Insight&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt; Notification delivery is not part of the core business transaction. If a leave is approved, it's approved—whether or not the email sends immediately is a separate concern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Architectural Decision:&lt;/strong&gt; Decouple notification delivery from the approval workflow using &lt;strong&gt;asynchronous event processing&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;Target Architecture (Event-Driven)&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → Leave Approval API → [Business Logic]
                                    ↓
                    ┌───────────────┴───────────────┐
                    │  Save to Database             │
                    │  (50ms)                       │
                    └───────────────┬───────────────┘
                                    ↓
                    Publish Event to Kafka Topic
                    (5ms - async, fire-and-forget)
                                    ↓
                          Return Response: 55ms
                          (User sees success immediately)

                    ┌─────────────────────────────────┐
                    │   Kafka Topic (12 partitions)   │
                    └─────────────┬───────────────────┘
                                  ↓
              ┌───────────────────┴───────────────────┐
              │                                       │
    [Consumer Group: Email]              [Consumer Group: SMS]
    [Consumer Group: Audit]              [Consumer Group: Push]
              │                                       │
              ↓                                       ↓
    Process independently with                Circuit breakers,
    retries, DLQ, idempotency                exponential backoff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Immediate Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;70% latency reduction&lt;/strong&gt;: API response time drops from 2.3s → 700ms (then further optimized to 250ms with DB tuning)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault Isolation&lt;/strong&gt;: Email provider outage doesn't affect leave approval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Independent Scaling&lt;/strong&gt;: Notification consumers scale independently of API layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry Logic&lt;/strong&gt;: Failed notifications automatically retry with exponential backoff&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero Message Loss&lt;/strong&gt;: Dead Letter Queues (DLQ) capture failures for manual intervention&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Deep Dive: Kafka Architecture Decisions&lt;/h2&gt;

&lt;h4&gt;Why Kafka Over RabbitMQ or SQS?&lt;/h4&gt;

&lt;p&gt;As I researched event-streaming platforms, I evaluated three options:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Kafka&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;RabbitMQ&lt;/th&gt;
&lt;th&gt;AWS SQS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1M+ msg/sec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50K msg/sec&lt;/td&gt;
&lt;td&gt;Nearly unlimited (standard); limited for FIFO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ordering Guarantee&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Per-partition ordering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-queue FIFO (lost with competing consumers)&lt;/td&gt;
&lt;td&gt;FIFO queues (limited throughput)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Replay Capability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes (offset management)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No (consumed messages are deleted; unconsumed retained up to 14 days)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Durability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Replicated logs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optional persistence&lt;/td&gt;
&lt;td&gt;High (managed by AWS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consumer Groups&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Native support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual implementation&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best For&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Event streaming, high throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Task queues, RPC&lt;/td&gt;
&lt;td&gt;Serverless, decoupled microservices&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why We Chose Kafka:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Throughput Requirements&lt;/strong&gt;: We needed to scale from 100 → 5000+ notifications/sec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay Capability&lt;/strong&gt;: Critical for debugging and reprocessing failed batches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ordering Guarantees&lt;/strong&gt;: Per-user notification ordering was essential (e.g., "leave requested" before "leave approved")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer Groups&lt;/strong&gt;: Native support for multiple independent consumers (Email, SMS, Push, Audit)&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;Kafka Partitioning Strategy: Scaling to 5000+ Notifications/sec&lt;/h2&gt;

&lt;h4&gt;The Partitioning Challenge&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; Kafka partitions are the unit of parallelism. More partitions = more concurrent consumers = higher throughput.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design Decision:&lt;/strong&gt; Use &lt;strong&gt;12 partitions&lt;/strong&gt; partitioned by &lt;code&gt;userId&lt;/code&gt; hash.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Configuration:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Idempotent producer with userId as partition key&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ProducerConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ENABLE_IDEMPOTENCE_CONFIG&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ProducerConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ACKS_CONFIG&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"all"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Wait for all replicas&lt;/span&gt;

&lt;span class="n"&gt;kafkaTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;send&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"notification-events"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
                   &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getUserId&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="c1"&gt;// Partition key&lt;/span&gt;
                   &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why Partition by &lt;code&gt;userId&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ordering Guarantee&lt;/strong&gt;: All notifications for a user go to the same partition (FIFO order preserved)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load Distribution&lt;/strong&gt;: Uniform distribution across partitions (assuming balanced user activity)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer Affinity&lt;/strong&gt;: Each consumer processes a subset of users (better caching)&lt;/li&gt;
&lt;/ul&gt;
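&lt;p&gt;The key-to-partition mapping can be sketched in a few lines. Kafka's default partitioner hashes the key bytes (murmur2) modulo the partition count; the Python sketch below uses CRC32 as a stand-in, which preserves the property that matters here: the same &lt;code&gt;userId&lt;/code&gt; always lands on the same partition.&lt;/p&gt;

```python
import zlib

NUM_PARTITIONS = 12

def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a user ID to a partition by hashing the key.

    Kafka's default partitioner uses murmur2 over the key bytes;
    CRC32 is a stand-in here with the same essential property:
    a given key always maps to the same partition.
    """
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions

# The same user always maps to the same partition, so their
# notifications stay in FIFO order within that partition.
assert partition_for("user-42") == partition_for("user-42")

# Different users spread across partitions for load distribution.
partitions = {partition_for(f"user-{i}") for i in range(1000)}
print(sorted(partitions))
```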

&lt;p&gt;&lt;strong&gt;Throughput Calculation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single consumer throughput: &lt;strong&gt;~500 msg/sec&lt;/strong&gt; (bounded by external API calls)&lt;/li&gt;
&lt;li&gt;12 partitions × 500 msg/sec = &lt;strong&gt;6000 msg/sec peak capacity&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Production average: &lt;strong&gt;~800 msg/sec&lt;/strong&gt;, peak: &lt;strong&gt;5000+ msg/sec&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
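&lt;p&gt;The capacity math above is worth checking directly. A small sketch (the 500 msg/sec per-consumer figure is the measured bound quoted above; note that consumers beyond the partition count sit idle in a consumer group):&lt;/p&gt;

```python
import math

def consumers_needed(target_rate: int, per_consumer_rate: int) -> int:
    """Minimum consumers required to sustain a target message rate."""
    return math.ceil(target_rate / per_consumer_rate)

PARTITIONS = 12
PER_CONSUMER = 500  # msg/sec, bounded by external API calls

# Peak capacity with one consumer per partition.
peak_capacity = PARTITIONS * PER_CONSUMER
print(peak_capacity)  # 6000

# Consumers needed for the 5000 msg/sec peak; must not exceed the
# partition count, since extra consumers in a group receive nothing.
needed = consumers_needed(5000, PER_CONSUMER)
print(needed)  # 10
assert needed <= PARTITIONS
```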




&lt;h2&gt;The Idempotency Challenge: Ensuring "Exactly-Once" Semantics&lt;/h2&gt;

&lt;h4&gt;The Duplicate Notification Problem&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Kafka delivers "at-least-once" by default. If a consumer processes a message but crashes before committing the offset, the message is redelivered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without Idempotency:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Consumer receives "Leave Approved" email notification
2. SendGrid API call succeeds (email sent)
3. Consumer crashes before committing Kafka offset
4. Message redelivered → duplicate email sent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Real-World Impact:&lt;/strong&gt; Users receiving 3-5 duplicate emails for the same action is unacceptable.&lt;/p&gt;

&lt;h4&gt;Solution: Two-Barrier Idempotency Pattern&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kafka Message → [Check Redis] → [Check PostgreSQL] → [Send Notification] → [Update Both]
                     ↓                ↓
              Fast dedup          Durable dedup
              (~1ms)              (~10ms)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Core Logic:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@KafkaListener&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"notification-events"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;processNotification&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;NotificationEvent&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;dedupeKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generateDedupeKey&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// userId + eventType + timestamp&lt;/span&gt;

    &lt;span class="c1"&gt;// Barrier 1: Redis (Fast Path - 99.9% of duplicates caught here)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;redisTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;opsForValue&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setIfAbsent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dedupeKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"PROCESSING"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Duration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofMinutes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Duplicate detected&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Barrier 2: PostgreSQL (Durable Fallback)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notificationRepo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;existsByDedupeKey&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dedupeKey&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Duplicate detected in database&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Process notification and mark as completed in both barriers&lt;/span&gt;
    &lt;span class="n"&gt;emailService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;send&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;notificationRepo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;save&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;NotificationRecord&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dedupeKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="n"&gt;redisTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;opsForValue&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dedupeKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"COMPLETED"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Duration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofHours&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why Two Barriers?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Redis (Primary)&lt;/strong&gt;: Ultra-fast deduplication (&amp;lt;1ms) for 99.9% of cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL (Fallback)&lt;/strong&gt;: Durable guarantee even if Redis is evicted or fails&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; 100% deduplication with minimal performance overhead&lt;/p&gt;
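&lt;p&gt;To make the two-barrier flow concrete outside of Spring, here is a compact Python sketch: a dict stands in for Redis and a set for the PostgreSQL table. The dedupe key is built from fields of the event itself (not processing time), so a redelivered message reproduces the same key:&lt;/p&gt;

```python
def dedupe_key(user_id: str, event_type: str, event_ts: int) -> str:
    """Build the dedupe key from the event's own fields, so a
    redelivered message produces an identical key."""
    return f"{user_id}:{event_type}:{event_ts}"

redis_like = {}       # fast barrier (stand-in for Redis SETNX + TTL)
durable_like = set()  # durable barrier (stand-in for the PostgreSQL table)
sent = []

def process(event):
    key = dedupe_key(event["user_id"], event["type"], event["ts"])
    # Barrier 1: set-if-absent (Redis SETNX); present means duplicate.
    if key in redis_like:
        return False
    redis_like[key] = "PROCESSING"
    # Barrier 2: durable check, in case the Redis entry was evicted.
    if key in durable_like:
        return False
    sent.append(key)              # side effect: the notification itself
    durable_like.add(key)         # mark completed in both barriers
    redis_like[key] = "COMPLETED"
    return True

event = {"user_id": "u1", "type": "LEAVE_APPROVED", "ts": 1712140000}
assert process(event) is True    # first delivery sends
assert process(event) is False   # redelivery is suppressed
print(len(sent))  # 1
```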




&lt;h2&gt;Circuit Breakers &amp;amp; Exponential Backoff: Surviving Third-Party Failures&lt;/h2&gt;

&lt;h4&gt;The Third-Party Dependency Problem&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt; External notification providers (SendGrid, Twilio, FCM) fail regularly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rate limits (429 errors)&lt;/li&gt;
&lt;li&gt;Transient network issues&lt;/li&gt;
&lt;li&gt;Scheduled maintenance&lt;/li&gt;
&lt;li&gt;Provider outages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Before Event-Driven Architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;These failures propagated to the leave approval API&lt;/li&gt;
&lt;li&gt;450 incidents/month affecting core business flows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After Kafka + Circuit Breakers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Failures isolated to notification consumers&lt;/li&gt;
&lt;li&gt;Core business logic unaffected&lt;/li&gt;
&lt;li&gt;~68 incidents/month (85% reduction)&lt;/li&gt;
&lt;/ul&gt;
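&lt;p&gt;We used an off-the-shelf breaker in production, but the mechanism fits in a few lines. A minimal count-based Python sketch (thresholds and names here are illustrative, not our production configuration): it opens after a run of consecutive failures and fast-fails while open, so a dead provider stops consuming retries.&lt;/p&gt;

```python
import time

class CircuitBreaker:
    """Minimal count-based breaker: opens after `threshold` consecutive
    failures, rejects calls while open, and half-opens after
    `reset_after` seconds to probe the provider again."""

    def __init__(self, threshold=5, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    @property
    def state(self):
        if self.opened_at is None:
            return "CLOSED"
        if time.monotonic() - self.opened_at >= self.reset_after:
            return "HALF_OPEN"
        return "OPEN"

    def call(self, fn, *args):
        if self.state == "OPEN":
            raise RuntimeError("circuit open: fast-fail, provider not called")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0   # any success resets the breaker
        self.opened_at = None
        return result

breaker = CircuitBreaker(threshold=3)

def flaky_send(_msg):
    raise ConnectionError("provider down")

for _ in range(3):
    try:
        breaker.call(flaky_send, "hello")
    except ConnectionError:
        pass
print(breaker.state)  # OPEN
```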

&lt;p&gt;&lt;strong&gt;Exponential Backoff:&lt;/strong&gt; Configured Kafka consumer retry with exponential backoff (1s → 2s → 4s → 8s → 16s max).&lt;/p&gt;
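&lt;p&gt;The schedule itself is a one-liner; real deployments would typically also add jitter so that many failed consumers don't retry in lockstep:&lt;/p&gt;

```python
def backoff_schedule(attempts: int, base: float = 1.0, cap: float = 16.0):
    """Exponential backoff delays with a cap, matching the
    1s -> 2s -> 4s -> 8s -> 16s progression described above."""
    return [min(base * (2 ** n), cap) for n in range(attempts)]

print(backoff_schedule(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 16.0]
```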

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SendGrid outage → Circuit breaker opens → DLQ captures events → No impact to core system&lt;/li&gt;
&lt;li&gt;450 → ~68 incidents/month (85% reduction in failure propagation)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Dead Letter Queue (DLQ): Zero-Loss Failure Handling&lt;/h2&gt;

&lt;h4&gt;The Problem of Permanent Failures&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Scenarios:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Invalid phone number (SMS)&lt;/li&gt;
&lt;li&gt;Malformed email address&lt;/li&gt;
&lt;li&gt;User opted out of notifications&lt;/li&gt;
&lt;li&gt;Provider returns 400 (non-retryable error)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Without DLQ:&lt;/strong&gt; After N retries, message is discarded → &lt;strong&gt;data loss&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;DLQ Architecture&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kafka Topic: notification-events
        ↓
    Consumer
        ↓
    [Try Process]
        ↓
    ┌──────────┬──────────┐
    │          │          │
  Success   Retryable  Non-Retryable
    │       Failure     Failure
    │          │          │
    ✓      [Retry]   [Send to DLQ]
            (3x)          ↓
                   Kafka Topic: notification-dlq
                          ↓
                   [Manual Review Queue]
                   [Alerting/Monitoring]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@KafkaListener&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"notification-events"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;consume&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;NotificationEvent&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;processNotification&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;RetryableException&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Let Kafka consumer retry&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;NonRetryableException&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Non-retryable error, sending to DLQ: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getMessage&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
        &lt;span class="n"&gt;dlqProducer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;send&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"notification-dlq"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;DLQ Monitoring:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grafana dashboard for DLQ message count&lt;/li&gt;
&lt;li&gt;PagerDuty alert if DLQ &amp;gt; 100 messages&lt;/li&gt;
&lt;li&gt;Daily batch job to review and reprocess DLQ messages&lt;/li&gt;
&lt;/ul&gt;
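&lt;p&gt;The daily reprocessing job can be as simple as replaying quarantined events through the (hopefully fixed) handler and keeping whatever still fails for the next review cycle. A minimal Python sketch, with in-memory lists standing in for the DLQ topic and &lt;code&gt;send_email&lt;/code&gt; as a hypothetical handler:&lt;/p&gt;

```python
def reprocess_dlq(dlq, handler):
    """Replay quarantined events through the handler. Events that
    still fail stay quarantined for another review cycle."""
    still_failing = []
    recovered = 0
    for event in dlq:
        try:
            handler(event)
            recovered += 1
        except Exception:
            still_failing.append(event)
    return recovered, still_failing

dlq = [{"id": 1, "email": "ok@example.com"},
       {"id": 2, "email": "bad-address"}]

def send_email(event):
    # Hypothetical handler: rejects malformed addresses.
    if "@" not in event["email"]:
        raise ValueError("invalid address")

recovered, remaining = reprocess_dlq(dlq, send_email)
print(recovered, [e["id"] for e in remaining])  # 1 [2]
```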




&lt;h2&gt;The Backend Engineer's Perspective: Event-Driven Patterns&lt;/h2&gt;

&lt;p&gt;As I built this system, I kept seeing parallels to distributed system patterns I'd encountered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kafka Consumers = Microservices&lt;/strong&gt;&lt;br&gt;
Each consumer group (Email, SMS, Push, Audit) is an independent microservice. They scale, fail, and deploy independently.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Idempotency = Distributed Transactions&lt;/strong&gt;&lt;br&gt;
The two-barrier pattern is similar to two-phase commit (2PC), but optimized for performance (fast Redis path + durable PostgreSQL fallback).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Circuit Breakers = Resilience Patterns&lt;/strong&gt;&lt;br&gt;
Same pattern I use in Spring Boot microservices with Resilience4j. The only difference? Here it's protecting Kafka consumers instead of REST APIs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DLQ = Exception Handling at Scale&lt;/strong&gt;&lt;br&gt;
DLQ is like a try-catch block for distributed systems. Instead of rethrowing, we quarantine failures for human review.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Partitioning = Database Sharding&lt;/strong&gt;&lt;br&gt;
Kafka partitions are like database shards—both distribute load across multiple nodes while preserving per-key ordering.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Insight:&lt;/strong&gt; Event-driven architecture isn't fundamentally different from synchronous microservices. It's the same engineering principles applied to asynchronous, decoupled workflows.&lt;/p&gt;




&lt;h2&gt;Conclusion: From Synchronous Coupling to Scalable Event Streams&lt;/h2&gt;

&lt;p&gt;When I started this project, I thought event-driven architecture was "advanced" and maybe unnecessary. Why not just keep it simple with synchronous REST calls?&lt;/p&gt;

&lt;p&gt;But as I profiled the system, measured the latency, and watched third-party failures bring down core business flows, the path forward became clear: &lt;strong&gt;services that don't need to fail together shouldn't be coupled together.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kafka gave us that decoupling. Circuit breakers gave us resilience. Idempotency gave us correctness. And DLQs gave us zero-loss guarantees.&lt;/p&gt;

&lt;p&gt;The result? A notification system that handles &lt;strong&gt;5000+ notifications/sec&lt;/strong&gt;, recovers from failures autonomously, and never blocks the critical path of leave approvals.&lt;/p&gt;

&lt;p&gt;If you're building microservices and hitting synchronous coupling bottlenecks, start with one question: &lt;strong&gt;&lt;em&gt;"Does this operation need to complete before returning a response to the user?"&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the answer is no, make it asynchronous. Your users—and your on-call engineers—will thank you.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Connect with me:&lt;/strong&gt; If you're building event-driven systems or have questions about Kafka architecture, let's discuss! I'm always learning and happy to share insights.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>microservices</category>
      <category>backend</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Demystifying Agentic AI: Why I'm Trading Chains for Graphs with LangGraph</title>
      <dc:creator>Ankit Kumar Shaw</dc:creator>
      <pubDate>Fri, 03 Apr 2026 07:52:30 +0000</pubDate>
      <link>https://forem.com/ankitkumarshaw/demystifying-agentic-ai-why-im-trading-chains-for-graphs-with-langgraph-2678</link>
      <guid>https://forem.com/ankitkumarshaw/demystifying-agentic-ai-why-im-trading-chains-for-graphs-with-langgraph-2678</guid>
      <description>&lt;h2&gt;The Hook: When Simple Prompts Aren't Enough&lt;/h2&gt;

&lt;p&gt;A year ago, if you asked me about AI integration in backend systems, I'd point to API calls to OpenAI with well-crafted prompts. Simple, predictable, stateless:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize this article&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Linear. Deterministic in its invocation pattern (even if the output varied). Just another REST API call in my Spring Boot microservices world.&lt;/p&gt;

&lt;p&gt;But then I encountered a real problem: &lt;strong&gt;What if the AI needs to make decisions, call tools, evaluate results, and loop back based on outcomes?&lt;/strong&gt; What if it needs to research a topic, validate findings, and &lt;em&gt;autonomously decide&lt;/em&gt; whether to dig deeper or move forward?&lt;/p&gt;

&lt;p&gt;That simple prompt-response pattern breaks down. You need &lt;strong&gt;state management&lt;/strong&gt;, &lt;strong&gt;conditional branching&lt;/strong&gt;, &lt;strong&gt;tool orchestration&lt;/strong&gt;, and &lt;strong&gt;retry logic&lt;/strong&gt;—concepts backend engineers like me live and breathe, but applied to non-deterministic AI workflows.&lt;/p&gt;

&lt;p&gt;This is where I discovered &lt;strong&gt;Agentic AI&lt;/strong&gt; and &lt;strong&gt;LangGraph&lt;/strong&gt;—a framework that treats AI tasks not as linear chains, but as state machines represented by graphs. As I explored the documentation and experimented with multi-agent systems, I realized: &lt;strong&gt;the engineering principles haven't changed. Only the substrate has.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Evolution: From Prompts to Chains to Autonomous Agents
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Stage 1: Simple Prompts (The API Call Era)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Input → LLM → Output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Limitation&lt;/strong&gt;: No memory, no context, no tool use. Every interaction is isolated.&lt;/p&gt;

&lt;h4&gt;
  
  
  Stage 2: LangChain (The Linear Pipeline Era)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input → Prompt Template → LLM → Output Parser → Result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Innovation&lt;/strong&gt;: Introduced &lt;strong&gt;chains&lt;/strong&gt;—sequential steps where each step's output feeds into the next. Added memory and basic tool calling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation&lt;/strong&gt;: Chains are &lt;em&gt;linear&lt;/em&gt;. If the LLM needs to loop back based on a condition (e.g., "research more if the answer isn't confident"), you're stuck. You can't model &lt;strong&gt;feedback loops&lt;/strong&gt; or &lt;strong&gt;conditional branching&lt;/strong&gt; elegantly.&lt;/p&gt;

&lt;h4&gt;
  
  
  Stage 3: Agentic AI with LangGraph (The Autonomous Decision Era)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    ┌─────────────┐
                    │   Agent     │
                    │  (Planner)  │
                    └──────┬──────┘
                           │
                ┌──────────┴──────────┐
                │                     │
         ┌──────▼──────┐       ┌─────▼──────┐
         │  Research   │       │ Summarize  │
         │   Tool      │       │   Tool     │
         └──────┬──────┘       └─────┬──────┘
                │                     │
                └──────────┬──────────┘
                           │
                    ┌──────▼──────┐
                    │  Evaluator  │
                    │  (Quality   │
                    │   Check)    │
                    └──────┬──────┘
                           │
                  ┌────────┴────────┐
                  │                 │
              Good Quality     Needs More Research
                  │                 │
            ┌─────▼──────┐         Loop Back to Agent
            │   Output   │
            └────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Shift&lt;/strong&gt;: The AI now controls the flow. It decides when to research, when to validate, when to retry, when to stop. This is &lt;strong&gt;Agentic AI&lt;/strong&gt;—autonomous, self-directed workflows.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why LangGraph? The Graph vs. Chain Paradigm
&lt;/h2&gt;

&lt;p&gt;As I researched frameworks (LangChain, CrewAI, AutoGen, LangGraph), a core architectural question emerged: &lt;strong&gt;How do you model complex, conditional AI workflows?&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem with Chains
&lt;/h3&gt;

&lt;p&gt;LangChain's chains are fundamentally &lt;strong&gt;Directed Acyclic Graphs (DAGs)&lt;/strong&gt; where each node executes once, in order. Perfect for pipelines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Retrieve Documents → Rank by Relevance → Send to LLM → Format Output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But what if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The LLM's output quality is poor, and you need to &lt;strong&gt;retry with a different strategy&lt;/strong&gt;?&lt;/li&gt;
&lt;li&gt;You want the AI to &lt;strong&gt;self-critique&lt;/strong&gt; and loop back to refine its answer?&lt;/li&gt;
&lt;li&gt;Different outcomes require &lt;strong&gt;branching logic&lt;/strong&gt; (e.g., if confidence &amp;lt; 70%, call a specialist agent)?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Chains don't handle &lt;strong&gt;cycles&lt;/strong&gt; or &lt;strong&gt;conditional edges&lt;/strong&gt; well. You end up with brittle workarounds.&lt;/p&gt;

&lt;h3&gt;
  
  
  LangGraph's Solution: Stateful Graphs with Cycles
&lt;/h3&gt;

&lt;p&gt;LangGraph treats workflows as &lt;strong&gt;state machines&lt;/strong&gt;. Each node is a function. Edges define transitions. The graph can have &lt;strong&gt;cycles&lt;/strong&gt; (loops), &lt;strong&gt;conditional routing&lt;/strong&gt;, and &lt;strong&gt;persistent state&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Concepts:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Nodes&lt;/strong&gt;: Functions that take state, perform an action (call LLM, invoke tool, validate), and return updated state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edges&lt;/strong&gt;: Define transitions. Can be conditional (&lt;code&gt;if confidence &amp;gt; 80%, go to "output"; else, go to "research_more"&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State&lt;/strong&gt;: A shared data structure (like a &lt;code&gt;TypedDict&lt;/code&gt;) that all nodes read/write. This is your "application context."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cycles&lt;/strong&gt;: Allowed! An agent can loop back to itself or a previous node.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;In backend engineering terms:&lt;/strong&gt; LangGraph is like Spring State Machine or workflow engines (Camunda, Temporal), but purpose-built for AI agents.&lt;/p&gt;
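
&lt;p&gt;These four concepts can be sketched without any framework. Below is a plain-Python toy (deliberately &lt;em&gt;not&lt;/em&gt; LangGraph's real API; all names are illustrative) showing a typed state, node functions, a conditional edge, and a bounded cycle:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from typing import TypedDict

# Hypothetical shared state -- the "application context" every node reads/writes.
class AgentState(TypedDict):
    question: str
    findings: list
    confidence: float
    iteration_count: int
    draft: str

def research(state):
    # Node: stand-in for a tool/LLM call; each pass adds evidence.
    state["findings"].append("fact #" + str(state["iteration_count"] + 1))
    state["confidence"] += 0.3
    state["iteration_count"] += 1
    return state

def summarize(state):
    # Node: produce the final answer from accumulated findings.
    state["draft"] = "summary of " + str(len(state["findings"])) + " findings"
    return state

def route(state):
    # Conditional edge: loop back until confident, with an escape hatch.
    if state["confidence"] &amp;gt;= 0.8 or state["iteration_count"] &amp;gt;= 5:
        return "summarize"
    return "research"

def run(state):
    node = "research"
    while node == "research":   # the cycle: research may route back to itself
        state = research(state)
        node = route(state)
    return summarize(state)

result = run({"question": "q", "findings": [], "confidence": 0.0,
              "iteration_count": 0, "draft": ""})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In LangGraph itself you would declare the same shape on a &lt;code&gt;StateGraph&lt;/code&gt; (nodes, conditional edges, an entry point) and let the compiled runtime drive the loop.&lt;/p&gt;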




&lt;h2&gt;
  
  
  LangGraph vs. LangChain: An Architectural Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;LangChain&lt;/th&gt;
&lt;th&gt;LangGraph&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mental Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Linear pipelines (chains)&lt;/td&gt;
&lt;td&gt;State machines (graphs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flow Control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sequential, mostly linear&lt;/td&gt;
&lt;td&gt;Conditional routing, cycles allowed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Chain-scoped memory&lt;/td&gt;
&lt;td&gt;Graph-level, persistent state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best For&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Simple RAG, Q&amp;amp;A, document processing&lt;/td&gt;
&lt;td&gt;Multi-step reasoning, agent loops, self-correction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low to Medium&lt;/td&gt;
&lt;td&gt;Medium to High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;When to Use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Do A → B → C"&lt;/td&gt;
&lt;td&gt;"Do A, evaluate, maybe do B, loop if needed, then C"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Example Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangChain&lt;/strong&gt;: Search documents, rank by relevance, answer question. (One-shot workflow)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph&lt;/strong&gt;: Research a topic, validate findings, if insufficient → research more, summarize when confident. (Iterative, agent-driven)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In my research&lt;/strong&gt;, I found LangGraph is ideal when the AI needs to &lt;em&gt;think in loops&lt;/em&gt;—like a backend retry mechanism, but at the &lt;em&gt;decision-making level&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Backend Engineer's Perspective: Familiar Patterns in New Territory
&lt;/h2&gt;

&lt;p&gt;As I explored LangGraph, I kept seeing parallels to systems I've built with Spring Boot:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. &lt;strong&gt;State Management = Request Context&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;In Spring Boot, I use &lt;code&gt;@RequestScope&lt;/code&gt; beans or &lt;code&gt;ThreadLocal&lt;/code&gt; to share context across service layers. LangGraph's &lt;code&gt;AgentState&lt;/code&gt; is the same concept—a shared object that flows through the workflow.&lt;/p&gt;
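
&lt;p&gt;As a sketch of that analogy (field names are hypothetical, not from any real schema), an agent state is just a typed, shared structure that flows through the graph the way a request context flows through service layers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from typing import TypedDict

# Hypothetical AgentState: the graph-level "request context".
# Every node receives it, reads what it needs, and returns updates.
class AgentState(TypedDict):
    user_query: str        # immutable input, like a request payload
    retrieved_docs: list   # accumulated by research nodes
    draft_answer: str      # written by the summarizer node
    confidence: float      # written by the evaluator node

state: AgentState = {
    "user_query": "What changed in Q3?",
    "retrieved_docs": [],
    "draft_answer": "",
    "confidence": 0.0,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;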

&lt;h4&gt;
  
  
  2. &lt;strong&gt;Conditional Routing = Service Orchestration&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;When building microservices, I often have logic like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;hasPermission&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;orderService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createOrder&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;UnauthorizedException&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LangGraph's &lt;strong&gt;conditional edges&lt;/strong&gt; are the same—routing based on state.&lt;/p&gt;
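
&lt;p&gt;The same branch, expressed as a LangGraph-style routing function over state (names are illustrative): instead of calling a service directly, it returns the &lt;em&gt;name&lt;/em&gt; of the next node.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def route_after_evaluation(state):
    # Conditional edge: pick the next node from state,
    # just like the permission check picks a code path.
    if state["confidence"] &amp;gt; 0.8:
        return "output"
    return "research_more"

print(route_after_evaluation({"confidence": 0.55}))   # research_more
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;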

&lt;h4&gt;
  
  
  3. &lt;strong&gt;Retry Logic = Resilience Patterns&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The evaluator-to-planner loop is like Resilience4j's retry mechanism:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Retry&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"orderService"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallbackMethod&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"fallback"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Order&lt;/span&gt; &lt;span class="nf"&gt;createOrder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LangGraph just applies this pattern to &lt;em&gt;AI decision-making&lt;/em&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. &lt;strong&gt;Preventing Infinite Loops = Rate Limiting&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;iteration_count&lt;/code&gt; guard is like rate limiting or max retry configs in Spring Boot. You always need an escape hatch.&lt;/p&gt;
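
&lt;p&gt;A minimal version of that escape hatch (illustrative names, not LangGraph's API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;MAX_ITERATIONS = 5   # hard cap, like max-attempts in a retry config

def guard(state):
    # Force termination even if the model never becomes "confident".
    if state["iteration_count"] &amp;gt;= MAX_ITERATIONS:
        return "give_up"           # fallback path
    if state["confidence"] &amp;gt;= 0.8:
        return "output"
    state["iteration_count"] += 1
    return "research_more"         # loop back
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Cost caps work the same way: accumulate token spend in state and route to a terminal node once the budget is exceeded.&lt;/p&gt;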

&lt;p&gt;&lt;strong&gt;The Insight&lt;/strong&gt;: &lt;strong&gt;Agentic AI isn't magic—it's distributed systems applied to non-deterministic workflows.&lt;/strong&gt; My experience with microservices, state machines, and fault tolerance directly translates.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters: The Future of Backend + AI Integration
&lt;/h2&gt;

&lt;p&gt;As I dug deeper into agentic AI, I realized: &lt;strong&gt;backend engineers are uniquely positioned to excel in this space&lt;/strong&gt;. Why?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;We Understand State&lt;/strong&gt;: Managing state across distributed systems is our bread and butter. AI agents are just another stateful workflow.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;We Think in Graphs&lt;/strong&gt;: Microservices communication, DAG-based workflows (Airflow, Temporal), dependency graphs—we already model systems as graphs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;We Prioritize Reliability&lt;/strong&gt;: Retry logic, fallbacks, timeout handling, idempotency—all critical for production AI systems. Most AI tutorials skip this.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;We Know Observability&lt;/strong&gt;: Logging, tracing, monitoring—essential when your AI agent makes 10 autonomous decisions before producing output.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;LangGraph is essentially a workflow engine for AI&lt;/strong&gt;. If you've worked with Camunda, Temporal, or AWS Step Functions, you already have the mental model.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned: Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graphs &amp;gt; Chains for Complex Reasoning:&lt;/strong&gt;  If your AI needs to loop, self-correct, or make decisions, linear chains won't cut it. LangGraph's state machine model handles complexity gracefully.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;State is Everything:&lt;/strong&gt;  The &lt;code&gt;AgentState&lt;/code&gt; object is the contract between nodes. Design it carefully—just like you would a database schema.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Conditional Routing is Where Intelligence Lives:&lt;/strong&gt;  The &lt;code&gt;should_continue_research&lt;/code&gt; function determines the agent's behavior. This is where you encode business logic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Always Have an Escape Hatch:&lt;/strong&gt;  AI can loop forever if you're not careful. Iteration limits, cost caps, and timeout mechanisms are non-negotiable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backend Principles Apply:&lt;/strong&gt; Separation of concerns, idempotency, retry logic, observability—all the patterns I use in Spring Boot apply to agentic AI.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion: AI Meets Backend Engineering
&lt;/h2&gt;

&lt;p&gt;When I started exploring agentic AI, I worried I was stepping outside my backend engineering domain. But as I built these systems, I realized: &lt;strong&gt;the fundamentals are identical&lt;/strong&gt;. State management, control flow, error handling, observability—these are universal engineering principles.&lt;/p&gt;

&lt;p&gt;LangGraph is the bridge between AI's non-deterministic nature and backend engineering's demand for structure. It lets us build &lt;strong&gt;reliable, autonomous systems&lt;/strong&gt; where AI makes decisions, but engineering rigor ensures they're safe, observable, and maintainable.&lt;/p&gt;

&lt;p&gt;If you're a backend engineer curious about AI, my advice: &lt;strong&gt;Start with LangGraph&lt;/strong&gt;. You'll recognize the patterns. The only difference? Instead of HTTP requests between services, you have LLM calls between agents.&lt;/p&gt;

&lt;p&gt;And that's not a limitation—it's an opportunity.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Connect with me:&lt;/strong&gt;  If you're exploring AI integration in backend systems or have questions about LangGraph, let's discuss! I'm actively learning this space and happy to share insights.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>langgraph</category>
      <category>llm</category>
      <category>agentaichallenge</category>
    </item>
    <item>
      <title>From 1.5s to 250ms: How We 6x'd API Latency with Spring Boot Optimization</title>
      <dc:creator>Ankit Kumar Shaw</dc:creator>
      <pubDate>Fri, 03 Apr 2026 06:26:52 +0000</pubDate>
      <link>https://forem.com/ankitkumarshaw/from-15s-to-250ms-how-we-6xd-api-latency-with-spring-boot-optimization-27e0</link>
      <guid>https://forem.com/ankitkumarshaw/from-15s-to-250ms-how-we-6xd-api-latency-with-spring-boot-optimization-27e0</guid>
      <description>&lt;h2&gt;
  
  
  The Hook: When Your System Meets Reality
&lt;/h2&gt;

&lt;p&gt;Picture this: It's Tuesday morning, and your leave management system—serving 20,000 employees across enterprise teams—suddenly becomes the bottleneck. Approvals that should take seconds are taking &lt;strong&gt;1.5 seconds per request&lt;/strong&gt;. Dashboard loads feel sluggish. Support tickets flood in. The database team reports &lt;strong&gt;85% CPU utilization&lt;/strong&gt;, and your ops team is preparing incident escalation.&lt;/p&gt;

&lt;p&gt;The system was built to "work." But production doesn't reward "working"—it demands &lt;strong&gt;reliability, speed, and efficiency at scale&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is the story of how we inherited a system with &lt;strong&gt;p99 latency of 1.5s&lt;/strong&gt; on a platform serving 15K+ daily requests, and through methodical architectural profiling and optimization, reduced it to &lt;strong&gt;250ms (6x faster)&lt;/strong&gt;, while simultaneously cutting &lt;strong&gt;database CPU consumption by 40%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Even after migrating from legacy Python/Flask to Spring Boot, the latency bottleneck persisted. The migration provided better observability and scalability, but the root causes—inefficient queries, undersized connection pools, and missing caching—remained. The optimizations below solved the actual problem.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Scenario: A System Under Pressure
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Current State (Post-Migration, Spring Boot)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System&lt;/strong&gt;: Newly migrated Spring Boot microservice (from legacy Python/Flask)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale&lt;/strong&gt;: 20,000 employees, 15,000+ requests/day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: p99 = 1.5 seconds (unacceptable for user experience)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Load&lt;/strong&gt;: 85% CPU utilization on a 4-core instance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Problem&lt;/strong&gt;: Migration alone didn't solve the performance issues; architectural inefficiencies remained&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Why This Matters
&lt;/h4&gt;

&lt;p&gt;At scale, every 100ms saved = better UX, lower operational costs, and reduced infrastructure spend. For 15K daily requests, cutting latency by 1.25s saves &lt;strong&gt;roughly 5 compute-hours of request time daily&lt;/strong&gt; (15,000 × 1.25s ≈ 18,750s). That's the difference between needing two database instances or one.&lt;/p&gt;
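
&lt;p&gt;The back-of-the-envelope math:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;requests_per_day = 15_000
latency_saved_s = 1.25            # 1.5s -&amp;gt; 0.25s per request
total_s = requests_per_day * latency_saved_s
print(total_s / 3600)             # about 5.2 compute-hours of request time/day
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;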

&lt;h3&gt;
  
  
  Root Cause Analysis: Finding the Bottleneck
&lt;/h3&gt;

&lt;p&gt;Our investigation focused on three key areas:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. &lt;strong&gt;Database Query Profiling&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Using Spring Boot Actuator and MySQL slow query logs, we discovered the smoking gun: &lt;strong&gt;the N+1 Query Problem&lt;/strong&gt;, a textbook horror story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ BEFORE: N+1 Query Anti-pattern&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;LeaveRequest&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getLeaveRequestsByDepartment&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;deptId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Department&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;depts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;em&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"SELECT d FROM Department d WHERE d.id = :deptId"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
        &lt;span class="nc"&gt;Department&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setParameter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"deptId"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;deptId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getResultList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// Query 1: Fetch department&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Department&lt;/span&gt; &lt;span class="n"&gt;dept&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;depts&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Employee&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;em&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;"SELECT e FROM Employee e WHERE e.department.id = :deptId"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="nc"&gt;Employee&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setParameter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"deptId"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dept&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getId&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getResultList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// Query 2-N: Fetch all employees (1 per loop)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Employee&lt;/span&gt; &lt;span class="n"&gt;emp&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;LeaveRequest&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;leaves&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;em&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                &lt;span class="s"&gt;"SELECT l FROM LeaveRequest l WHERE l.employee.id = :empId"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                &lt;span class="nc"&gt;LeaveRequest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setParameter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"empId"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;emp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getId&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getResultList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// Query 3-N²: Fetch leaves per employee&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Impact&lt;/strong&gt;: For a department with 500 employees, this triggered &lt;strong&gt;1 (department) + 1 (employee list) + 500 (one leave query per employee) = 502 queries&lt;/strong&gt; per request.&lt;br&gt;&lt;br&gt;
Database response time per request: &lt;strong&gt;~1.2 seconds&lt;/strong&gt; (just waiting for the database).&lt;/p&gt;
&lt;h4&gt;
  
  
  2. &lt;strong&gt;Connection Pool Exhaustion&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;HikariCP was configured with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;spring.datasource.hikari.maximum-pool-size=10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With 15K requests/day and 1.2s per database roundtrip, connections were being held too long. We hit connection pool saturation, causing requests to queue.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. &lt;strong&gt;Missing Caching Layer&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;User roles, department hierarchies, and leave policies were fetched on every request—data that changes infrequently but was being queried thousands of times daily.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution 1: Hibernate JOIN FETCH (Eliminate N+1)
&lt;/h3&gt;

&lt;p&gt;The most impactful change was using &lt;strong&gt;Hibernate's JOIN FETCH&lt;/strong&gt; to eagerly load relationships in a single query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ✅ AFTER: Single JOIN FETCH query&lt;/span&gt;
&lt;span class="nd"&gt;@Query&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""
    SELECT DISTINCT d FROM Department d
    LEFT JOIN FETCH d.employees e
    LEFT JOIN FETCH e.leaveRequests l
    WHERE d.id = :deptId
    """&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Department&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getLeaveRequestsByDepartment&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;@Param&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"deptId"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;deptId&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Trade-off&lt;/strong&gt;: This fetches &lt;em&gt;all&lt;/em&gt; data upfront (potentially unnecessary if you only need recent leaves). But in this case, the leave request list was always needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: From 2000+ queries → &lt;strong&gt;1 query&lt;/strong&gt;. Database latency dropped from 1.2s → 350ms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution 2: HikariCP Connection Pool Tuning
&lt;/h3&gt;

&lt;p&gt;Before diving into pool size, we profiled connection lifecycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ BEFORE (Undersized)&lt;/span&gt;
&lt;span class="s"&gt;spring.datasource.hikari.maximum-pool-size=10&lt;/span&gt;
&lt;span class="s"&gt;spring.datasource.hikari.minimum-idle=2&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ AFTER (Right-sized for concurrency)&lt;/span&gt;
&lt;span class="s"&gt;spring.datasource.hikari.maximum-pool-size=50&lt;/span&gt;
&lt;span class="s"&gt;spring.datasource.hikari.minimum-idle=10&lt;/span&gt;
&lt;span class="s"&gt;spring.datasource.hikari.connection-timeout=10000&lt;/span&gt; &lt;span class="c1"&gt;# 10s timeout before failing&lt;/span&gt;
&lt;span class="s"&gt;spring.datasource.hikari.idle-timeout=600000&lt;/span&gt;     &lt;span class="c1"&gt;# 10 min before closing idle connections&lt;/span&gt;
&lt;span class="s"&gt;spring.datasource.hikari.leak-detection-threshold=60000&lt;/span&gt; &lt;span class="c1"&gt;# Detect connection leaks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why 50?&lt;/strong&gt; &lt;br&gt;
Using &lt;strong&gt;Little's Law&lt;/strong&gt; (in-flight queries = request arrival rate × average query time, which tells you how many connections are busy at once):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Measured concurrent requests at p99: &lt;strong&gt;30-40&lt;/strong&gt; (from production JMX metrics)&lt;/li&gt;
&lt;li&gt;Safety buffer for traffic spikes: &lt;strong&gt;+10-20&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total: 50 connections&lt;/strong&gt; (30-40 + 10-20)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This sizing eliminates connection queue time while avoiding resource waste.&lt;/p&gt;
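
&lt;p&gt;As a quick sanity check of that sizing (the peak arrival rate below is an assumed figure consistent with the measured 30-40 concurrent requests):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Little's Law: busy connections L = arrival_rate (lambda) * avg_query_time (W)
peak_rps = 30            # assumed peak request rate (req/s)
avg_query_s = 1.2        # measured DB time per request before the query fix
in_flight = peak_rps * avg_query_s    # 36 connections busy on average
spike_buffer = 14        # headroom for bursts
pool_size = round(in_flight) + spike_buffer
print(pool_size)         # 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;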

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Connection queue time eliminated. No more "Connection pool exhausted" errors.&lt;/p&gt;
&lt;h3&gt;
  
  
  Solution 3: Redis-Backed Caching Strategy
&lt;/h3&gt;

&lt;p&gt;We implemented a &lt;strong&gt;two-tier caching strategy&lt;/strong&gt; for frequently-accessed, infrequently-changed data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Service&lt;/span&gt;
&lt;span class="nd"&gt;@RequiredArgsConstructor&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LeavePolicyCache&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;RedisTemplate&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LeavePolicy&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;redisTemplate&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;LeavePolicyRepository&lt;/span&gt; &lt;span class="n"&gt;leavePolicyRepository&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="no"&gt;CACHE_KEY_PREFIX&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"leave:policy:"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="no"&gt;TTL_SECONDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;86400&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// 24 hours&lt;/span&gt;

    &lt;span class="cm"&gt;/**
     * Fetch leave policy for an employee (vacation quota, sick leave, carry-forward rules).
     * Policies rarely change, so we cache them aggressively.
     */&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;LeavePolicy&lt;/span&gt; &lt;span class="nf"&gt;getLeavePolicy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;employeeId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;cacheKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;CACHE_KEY_PREFIX&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;employeeId&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// Try Redis first&lt;/span&gt;
        &lt;span class="nc"&gt;LeavePolicy&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redisTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;opsForValue&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cacheKey&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Cache hit: O(1) lookup, &amp;lt;1ms response&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Cache miss: Fetch from database&lt;/span&gt;
        &lt;span class="c1"&gt;// (joins employee → employment_type → leave_policy tables)&lt;/span&gt;
        &lt;span class="nc"&gt;LeavePolicy&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;leavePolicyRepository&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findByEmployeeId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;employeeId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;orElseThrow&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// Populate cache with 24-hour TTL&lt;/span&gt;
        &lt;span class="n"&gt;redisTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;opsForValue&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cacheKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Duration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofSeconds&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;TTL_SECONDS&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@EventListener&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;onLeavePolicyUpdated&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LeavePolicyUpdatedEvent&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Invalidate cache when policies change (e.g., annual quota reset, policy update)&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;cacheKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;CACHE_KEY_PREFIX&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getEmployeeId&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;redisTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;delete&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cacheKey&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What We Cache (and Why)&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Leave Policies&lt;/strong&gt;: Vacation quotas, sick leave limits, carry-forward rules — accessed on every leave request submission/validation, but change only during annual resets or policy updates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approval Workflows&lt;/strong&gt;: Manager hierarchies and approval chains — needed to route leave requests, but organization structure changes infrequently.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Cache Invalidation Strategy&lt;/strong&gt;: Event-driven (on policy/org change, we emit events and invalidate) + 24-hour TTL.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Why Long TTL?&lt;/strong&gt; Leave policies are extremely stable (change annually or when employee switches departments). 24-hour TTL ensures we catch manual DB changes without event emission.&lt;/p&gt;
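&lt;p&gt;The cache-aside flow above can be reduced to a plain-Java sketch that makes the hit/miss/invalidate paths explicit. This is a hypothetical stand-in, not the production code: a &lt;code&gt;ConcurrentHashMap&lt;/code&gt; plays the role of Redis, and &lt;code&gt;loadFromDb()&lt;/code&gt; plays the repository:&lt;/p&gt;

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Plain-Java sketch of the cache-aside pattern described above.
// A ConcurrentHashMap stands in for Redis and loadFromDb() for the
// repository; both are illustrative stand-ins, not the production code.
public class PolicyCacheSketch {

    // 24-hour TTL, matching the real cache
    public static final long TTL_MILLIS = 86_400_000L;

    static class Entry {
        final Object value;
        final long expiresAt;
        Entry(Object value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Map cache = new ConcurrentHashMap(); // raw types keep the sketch short
    public int dbReads = 0;                            // counts cache misses, for illustration

    // Stand-in for leavePolicyRepository.findByEmployeeId(...)
    public Object loadFromDb(Long employeeId) {
        dbReads++;
        return "policy-for-" + employeeId;
    }

    public Object get(Long employeeId) {
        Entry e = (Entry) cache.get(employeeId);
        long now = System.currentTimeMillis();
        if (e != null) {
            if (now > e.expiresAt) {
                cache.remove(employeeId);  // TTL catches changes made without an event
            } else {
                return e.value;            // cache hit
            }
        }
        Object policy = loadFromDb(employeeId);                     // cache miss
        cache.put(employeeId, new Entry(policy, now + TTL_MILLIS)); // repopulate
        return policy;
    }

    // Mirrors the @EventListener: policy changed, drop the stale entry
    public void onPolicyUpdated(Long employeeId) {
        cache.remove(employeeId);
    }
}
```

&lt;p&gt;One difference worth noting: Redis enforces the TTL server-side (the &lt;code&gt;Duration&lt;/code&gt; passed to &lt;code&gt;set()&lt;/code&gt;), so the real implementation never checks expiry itself the way this sketch must.&lt;/p&gt;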

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Leave policy lookups: 120ms (database join across 3 tables) → &lt;strong&gt;&amp;lt;1ms (Redis)&lt;/strong&gt;. Reduced database CPU by ~40% since these queries were happening on every leave request view/submission.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview: Before vs. After
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Before Optimization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client Request
    ↓
Spring Boot Controller
    ↓
Service Layer (Business Logic)
    ↓
[N+1 Queries] → Database (1.2s latency)
    ↓
[No Caching]
    ↓
Response (1.5s p99)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  After Optimization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client Request
    ↓
Spring Boot Controller
    ↓
Cache Check (Redis) ← 1ms ✅
    ├─ Hit: Return cached Leave Policy
    └─ Miss: Continue to database
    ↓
Service Layer (Business Logic)
    ↓
[Single JOIN FETCH Query] → Database (350ms)
    ↓
[HikariCP Optimized] (50-connection pool)
    ↓
Response (250ms p99) ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
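&lt;p&gt;For reference, the pool settings in the diagram map onto standard Spring Boot HikariCP properties. The keys below are the real property names; the values mirror the numbers in this post and are illustrative, not a recommendation — derive yours from your own load profile:&lt;/p&gt;

```yaml
# Illustrative application.yml fragment; keys are standard
# spring.datasource.hikari.* properties, values mirror this post.
spring:
  datasource:
    hikari:
      maximum-pool-size: 50      # raised from the default of 10
      minimum-idle: 10
      connection-timeout: 30000  # ms to wait for a free connection
      max-lifetime: 1800000      # recycle connections every 30 min
```

&lt;p&gt;With Actuator enabled, the &lt;code&gt;hikaricp.connections.active&lt;/code&gt; and &lt;code&gt;hikaricp.connections.pending&lt;/code&gt; metrics show whether the pool is actually saturating.&lt;/p&gt;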



&lt;h2&gt;
  
  
  Results: The Metrics That Matter
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;p99 Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.5s&lt;/td&gt;
&lt;td&gt;250ms&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;6x faster&lt;/strong&gt; ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;p50 Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;850ms&lt;/td&gt;
&lt;td&gt;180ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.7x faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Database CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;51%&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;40% reduction&lt;/strong&gt; ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Queries/Request&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2000+&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;99.95% fewer&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Redis Hit Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;87%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Pool Timeout Errors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;45/day&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100% elimination&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Business Impact
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User Experience&lt;/strong&gt;: Dashboard loads feel snappy (250ms vs 1.5s)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt;: Supports 15K+ daily requests on a smaller database instance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability&lt;/strong&gt;: No more p99 latency spikes during business hours&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Lessons Learned: Engineering Trade-offs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What We Sacrificed
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;: JOIN FETCH loads all data. If you only needed recent leaves, you'd fetch unnecessary historical data.&lt;br&gt;
&lt;strong&gt;Solution&lt;/strong&gt;: Separate queries for different use cases (leave summary vs. full history).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt;: Increased HikariCP pool size (10 → 50) uses more heap.&lt;br&gt;
&lt;strong&gt;Reality&lt;/strong&gt;: Better to spend extra memory than to let requests queue. Monitoring showed peak heap at 1.2GB of the 4GB available.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cache Consistency&lt;/strong&gt;: Redis introduces eventual consistency.&lt;br&gt;
&lt;strong&gt;Mitigation&lt;/strong&gt;: Event-driven invalidation catches normal updates immediately, and the 24-hour TTL bounds staleness from any missed event to a day.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
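&lt;p&gt;To see why the first trade-off exists at all, here is a toy model of the N+1 pattern in plain Java (no JPA; all names are made up). The lazy path issues one query for the employee list plus one per employee; the JPA fix is a single fetch join along the lines of &lt;code&gt;@Query("SELECT DISTINCT e FROM Employee e JOIN FETCH e.leaveRequests")&lt;/code&gt;:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the N+1 pattern (plain Java, no JPA; all names hypothetical).
// The lazy path issues one query for the employee list plus one query per
// employee; the join-fetch path issues a single query for everything.
public class NPlusOneDemo {
    public int queriesIssued = 0;

    public List fetchEmployees(int n) {               // 1 query for the parent rows
        queriesIssued++;
        List out = new ArrayList();
        for (int i = n; i > 0; i--) out.add("emp-" + i);
        return out;
    }

    public void fetchLeavesLazily(Object employee) {  // 1 query per parent: the "+N"
        queriesIssued++;
    }

    public void fetchEmployeesWithLeavesJoined() {    // single JOIN FETCH-style query
        queriesIssued++;
    }

    public static void main(String[] args) {
        NPlusOneDemo lazy = new NPlusOneDemo();
        List emps = lazy.fetchEmployees(2000);
        for (Object e : emps) lazy.fetchLeavesLazily(e);
        System.out.println("lazy path: " + lazy.queriesIssued + " queries");  // 2001

        NPlusOneDemo joined = new NPlusOneDemo();
        joined.fetchEmployeesWithLeavesJoined();
        System.out.println("join fetch: " + joined.queriesIssued + " query"); // 1
    }
}
```

&lt;p&gt;With 2,000 parent rows the lazy path issues 2,001 "queries" against the joined path's one, the same shape as the 2000+ roundtrips per request we measured.&lt;/p&gt;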

&lt;h3&gt;
  
  
  What Worked
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;JOIN FETCH&lt;/strong&gt;: The single largest win, a 99.95% query reduction.&lt;br&gt;
&lt;strong&gt;Connection pooling&lt;/strong&gt;: Eliminated queueing entirely.&lt;br&gt;
&lt;strong&gt;Caching strategy&lt;/strong&gt;: 87% hit rate validates our data access patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Profile First, Optimize Second&lt;/strong&gt;&lt;br&gt;
Don't guess where the bottleneck is. We assumed CPU was the issue—it was actually database queries. Use Spring Boot Actuator, MySQL slow logs, and flamegraphs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;N+1 is Insidious&lt;/strong&gt;&lt;br&gt;
With 20K employees and historical leave records, N+1 queries cascaded into 2000+ database roundtrips per request. &lt;strong&gt;Always use EXPLAIN PLAN and test with realistic data sizes&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Connection Pooling Matters More Than You Think&lt;/strong&gt;&lt;br&gt;
Undersized pools (10 connections) caused request queueing, which is invisible in application metrics but devastating to latency. Right-size based on concurrency math, not gut feel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Caching is Not Free&lt;/strong&gt;&lt;br&gt;
Cache invalidation is hard. We chose event-driven + TTL because it's reliable and simple. Measure hit rates to validate your strategy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;6x Latency Improvement = 6x Better UX&lt;/strong&gt;&lt;br&gt;
Users feel the difference between 1.5s and 250ms. This wasn't just a technical victory—it was a product improvement.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
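&lt;p&gt;The "concurrency math" in takeaway 3 can be made concrete with Little's Law: connections in use equal the arrival rate times the time each request holds a connection. A minimal sketch, with illustrative numbers rather than our production figures:&lt;/p&gt;

```java
// Back-of-envelope connection-pool sizing via Little's Law:
//   connections in use ~= arrival rate (req/s) * time a request holds a connection (s)
// The numbers in main() are illustrative, not our production config.
public class PoolSizing {
    public static int requiredConnections(double requestsPerSecond,
                                          double connectionHoldSeconds,
                                          double burstHeadroom) {
        double inUse = requestsPerSecond * connectionHoldSeconds; // L = lambda * W
        return (int) Math.ceil(inUse * burstHeadroom);            // round up, leave burst room
    }

    public static void main(String[] args) {
        // e.g. 100 req/s, each holding a connection for about 350 ms, 40% headroom
        System.out.println(requiredConnections(100.0, 0.35, 1.4)); // prints 49
    }
}
```

&lt;p&gt;At 100 req/s with a ~350 ms hold time and 40% burst headroom, the formula lands at 49 connections, in the neighborhood of the 50-connection pool used here.&lt;/p&gt;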




</description>
      <category>springboot</category>
      <category>distributedsystems</category>
      <category>redis</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
