<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: HyperscaleDesignHub</title>
    <description>The latest articles on Forem by HyperscaleDesignHub (@vijaya_bhaskarv_ba95adf9).</description>
    <link>https://forem.com/vijaya_bhaskarv_ba95adf9</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2128567%2F6b2b3eee-1906-469e-9ecc-8f7f4db2f0c6.png</url>
      <title>Forem: HyperscaleDesignHub</title>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/vijaya_bhaskarv_ba95adf9"/>
    <language>en</language>
    <item>
      <title>Who Needs Real-Time Streaming? Use Cases &amp; Architecture Across Industries</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 14:01:10 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/who-needs-real-time-streaming-use-cases-architecture-across-industries-5fl5</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/who-needs-real-time-streaming-use-cases-architecture-across-industries-5fl5</guid>
      <description>&lt;p&gt;In today's fast-paced digital world, the question isn't "Do I need real-time data streaming?" but rather "How fast do I need it?" From detecting fraudulent transactions in milliseconds to optimizing supply chains in real-time, streaming data has become the backbone of modern digital experiences.&lt;/p&gt;

&lt;p&gt;But here's the thing: &lt;strong&gt;not everyone needs to process 1 million messages per second&lt;/strong&gt;. Your startup's user analytics might work perfectly fine with 1,000 events per second, while a major bank's fraud detection system requires enterprise-grade throughput.&lt;/p&gt;

&lt;p&gt;Let me show you &lt;strong&gt;when you need real-time streaming&lt;/strong&gt;, &lt;strong&gt;what scale you actually need&lt;/strong&gt;, and &lt;strong&gt;how to architect it properly&lt;/strong&gt; across different industries.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 The Real-Time Spectrum: When Every Second Counts
&lt;/h2&gt;

&lt;p&gt;Before diving into use cases, let's understand the different flavors of "real-time":&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Use Case&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt;1ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High-frequency trading&lt;/td&gt;
&lt;td&gt;Stock market microsecond arbitrage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt;100ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gaming &amp;amp; Interactive&lt;/td&gt;
&lt;td&gt;Real-time leaderboards, live chat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt;1 second&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fraud detection&lt;/td&gt;
&lt;td&gt;Credit card transaction blocking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt;10 seconds&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Monitoring &amp;amp; Alerts&lt;/td&gt;
&lt;td&gt;Infrastructure failure detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt;1 minute&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Analytics &amp;amp; Dashboards&lt;/td&gt;
&lt;td&gt;Real-time business metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The key insight&lt;/strong&gt;: Match your architecture complexity to your actual latency requirements. Don't over-engineer!&lt;/p&gt;
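&lt;p&gt;As a rough sketch, the tier lookup in the table above can be expressed in a few lines of Python (the bounds and tier names mirror the table; the function itself is illustrative, not part of the platform):&lt;/p&gt;

```python
import bisect

# Upper latency bounds in milliseconds for each tier in the table above.
TIER_BOUNDS_MS = [1, 100, 1_000, 10_000, 60_000]
TIER_NAMES = [
    "high-frequency trading",
    "gaming and interactive",
    "fraud detection",
    "monitoring and alerts",
    "analytics and dashboards",
]

def latency_tier(budget_ms):
    """Return the coarsest tier whose upper bound still fits the budget."""
    i = bisect.bisect_left(TIER_BOUNDS_MS, budget_ms)
    return TIER_NAMES[min(i, len(TIER_NAMES) - 1)]
```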

&lt;h2&gt;
  
  
  🏗️ Architecture Patterns by Scale
&lt;/h2&gt;

&lt;p&gt;Based on the &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;RealtimeDataPlatform&lt;/a&gt; implementations, here are three proven architectures:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Local Development&lt;/strong&gt; (~1K msg/sec)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost: FREE | Use Case: Development &amp;amp; Testing&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Producer  │───▶│   Pulsar    │───▶│    Flink    │
│  (Docker)   │    │  (Docker)   │    │  (Docker)   │
└─────────────┘    └─────────────┘    └──────┬──────┘
                                              │
┌─────────────┐    ┌─────────────┐           │
│   Grafana   │◀───│ ClickHouse  │◀──────────┘
│ (Dashboards)│    │ (Storage)   │
└─────────────┘    └─────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Perfect for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proof of concepts&lt;/li&gt;
&lt;li&gt;Algorithm development&lt;/li&gt;
&lt;li&gt;Learning streaming concepts&lt;/li&gt;
&lt;li&gt;Small team experiments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Small-Medium Business&lt;/strong&gt; (50K msg/sec)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost: $200-250/month | Use Case: Growing Companies&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────┐
│                AWS EKS (t3.medium nodes)                │
│                                                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │  Producer   │  │   Pulsar    │  │    Flink    │     │
│  │ (10K IDs)   │─▶│ (3 brokers) │─▶│ (2 workers) │     │
│  └─────────────┘  └─────────────┘  └──────┬──────┘     │
│                                           │              │
│  ┌─────────────┐  ┌─────────────┐        │              │
│  │ Monitoring  │  │ ClickHouse  │◀───────┘              │
│  │ (Grafana)   │  │(2 replicas) │                       │
│  └─────────────┘  └─────────────┘                       │
└─────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Perfect for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SaaS platforms&lt;/li&gt;
&lt;li&gt;Regional e-commerce&lt;/li&gt;
&lt;li&gt;IoT startups&lt;/li&gt;
&lt;li&gt;Gaming companies&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Enterprise Scale&lt;/strong&gt; (1M msg/sec)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost: $25,000/month | Use Case: Large Organizations&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│           AWS EKS (c5.2xlarge + NVMe storage)              │
│                                                             │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌────────┐ │
│ │  Producer   │ │   Pulsar    │ │    Flink    │ │ Click  │ │
│ │(100K IDs)   │▶│(6 brokers)  │▶│(6 workers)  │▶│ House  │ │
│ │Multi-AZ     │ │Multi-AZ     │ │Multi-AZ     │ │Multi-AZ│ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └────────┘ │
│                                                             │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │          VictoriaMetrics + Grafana Stack                │ │
│ │    (Unified monitoring across all components)           │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Perfect for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Global financial institutions&lt;/li&gt;
&lt;li&gt;Major e-commerce platforms&lt;/li&gt;
&lt;li&gt;Telecommunications providers&lt;/li&gt;
&lt;li&gt;Enterprise IoT deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🏢 Industry Use Cases: When Real-Time Makes Business Sense
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🛒 &lt;strong&gt;E-Commerce: Every Click Counts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why Real-Time Matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cart abandonment&lt;/strong&gt;: React within seconds to offer discounts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inventory management&lt;/strong&gt;: Prevent overselling during flash sales&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fraud prevention&lt;/strong&gt;: Block suspicious transactions instantly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personalization&lt;/strong&gt;: Update recommendations as users browse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-World Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📱 User adds iPhone to cart
    ↓ (50ms)
🔍 Inventory check: 2 units left
    ↓ (100ms)
💰 Price optimization: Apply 5% discount for cart abandonment risk
    ↓ (200ms)
🎯 Recommendation update: Show compatible accessories
    ↓ (500ms)
📊 Analytics: Update real-time sales dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture Need&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-50k-events" rel="noopener noreferrer"&gt;50K MPS Setup&lt;/a&gt; for most e-commerce, 1M MPS for Amazon-scale.&lt;/p&gt;
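&lt;p&gt;The overselling guard in the flow above boils down to an atomic check-and-decrement. A minimal in-process sketch (the class name is my own; a production stream processor would delegate this to an atomic store such as Redis rather than a local lock):&lt;/p&gt;

```python
import threading

class InventoryGuard:
    """Illustrative in-memory stock reservation for flash-sale events."""

    def __init__(self, stock):
        self._stock = dict(stock)
        self._lock = threading.Lock()

    def reserve(self, sku):
        """Atomically reserve one unit; return True only if stock remained."""
        with self._lock:
            left = self._stock.get(sku, 0)
            if left == 0:
                return False  # sold out: reject the cart event
            self._stock[sku] = left - 1
            return True
```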

&lt;p&gt;&lt;strong&gt;Key Metrics to Stream:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Page views and click events&lt;/li&gt;
&lt;li&gt;Cart modifications&lt;/li&gt;
&lt;li&gt;Payment transactions&lt;/li&gt;
&lt;li&gt;Inventory levels&lt;/li&gt;
&lt;li&gt;User session data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  💰 &lt;strong&gt;Finance: Milliseconds = Millions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why Real-Time Matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High-frequency trading&lt;/strong&gt;: Execute trades in microseconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fraud detection&lt;/strong&gt;: Block transactions before completion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk management&lt;/strong&gt;: Adjust portfolios based on market movements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance&lt;/strong&gt;: Real-time reporting for regulatory requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-World Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;💳 Credit card swipe: $5,000 transaction
    ↓ (10ms)
🤖 ML Model: Unusual amount + new location = 85% fraud probability
    ↓ (50ms)
🚫 Transaction blocked + SMS sent to customer
    ↓ (100ms)
📊 Risk dashboard updated: +1 blocked fraud attempt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture Need&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events" rel="noopener noreferrer"&gt;1M MPS Setup&lt;/a&gt; for major financial institutions.&lt;/p&gt;
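&lt;p&gt;The blocking decision in the flow above is, at its simplest, a weighted score against a threshold. A toy sketch (signal names, weights, and the threshold are all illustrative, standing in for a real ML model):&lt;/p&gt;

```python
import operator

# Hypothetical evidence weights; a real system would use a trained model.
WEIGHTS = {"unusual_amount": 0.5, "new_location": 0.35, "velocity_spike": 0.15}
BLOCK_THRESHOLD = 0.8

def fraud_score(signals):
    """Sum the weights of the signals that fired, capped at 1.0."""
    return min(1.0, sum(WEIGHTS.get(s, 0.0) for s in signals))

def should_block(signals):
    """True when the score meets or exceeds the block threshold."""
    return operator.ge(fraud_score(signals), BLOCK_THRESHOLD)
```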

&lt;p&gt;&lt;strong&gt;Key Metrics to Stream:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transaction data&lt;/li&gt;
&lt;li&gt;Market price feeds&lt;/li&gt;
&lt;li&gt;Risk calculations&lt;/li&gt;
&lt;li&gt;Customer behavior patterns&lt;/li&gt;
&lt;li&gt;Regulatory compliance events&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🎮 &lt;strong&gt;Gaming: Real-Time Engagement&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why Real-Time Matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Leaderboards&lt;/strong&gt;: Update rankings instantly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Matchmaking&lt;/strong&gt;: Pair players with similar skill levels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In-game events&lt;/strong&gt;: Dynamic content based on player actions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-cheat&lt;/strong&gt;: Detect suspicious behavior patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-World Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🎯 Player achieves high score: 1,245,830 points
    ↓ (10ms)
🏆 Leaderboard update: #3 globally
    ↓ (50ms)
🎊 Achievement unlocked: "Top 10 Global"
    ↓ (100ms)
👥 Notify friends: "Alex just reached #3!"
    ↓ (200ms)
💰 Offer premium upgrade: "Celebrate with special skin!"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture Need&lt;/strong&gt;: 50K MPS for indie games, 1M MPS for AAA multiplayer games.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics to Stream:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Player actions and scores&lt;/li&gt;
&lt;li&gt;In-game purchases&lt;/li&gt;
&lt;li&gt;Session duration and engagement&lt;/li&gt;
&lt;li&gt;Performance metrics&lt;/li&gt;
&lt;li&gt;Social interactions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🏭 &lt;strong&gt;IoT &amp;amp; Manufacturing: Predictive Intelligence&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why Real-Time Matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Predictive maintenance&lt;/strong&gt;: Fix equipment before it breaks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality control&lt;/strong&gt;: Detect defects in real time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Energy optimization&lt;/strong&gt;: Adjust consumption based on demand&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety monitoring&lt;/strong&gt;: Immediate alerts for dangerous conditions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-World Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🌡️ Temperature sensor: 85°C (normal: 70°C)
    ↓ (1 second)
⚠️ Anomaly detection: Temperature rising trend
    ↓ (2 seconds)
🔧 Maintenance alert: Schedule inspection within 4 hours
    ↓ (5 seconds)
📊 Dashboard update: Equipment health status = Warning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture Need&lt;/strong&gt;: 50K MPS for smart buildings, 1M MPS for industrial IoT.&lt;/p&gt;
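&lt;p&gt;The anomaly step in the flow above can be approximated with a rolling mean over recent readings. A minimal sketch (the 70°C baseline and 10°C margin follow the example; the class itself is illustrative):&lt;/p&gt;

```python
from collections import deque

class TempMonitor:
    """Rolling-window temperature check for one sensor."""

    def __init__(self, normal_c=70.0, margin_c=10.0, window=5):
        self.readings = deque(maxlen=window)
        self.limit = normal_c + margin_c

    def ingest(self, celsius):
        """Record a reading; return True once the rolling mean exceeds the limit."""
        self.readings.append(celsius)
        avg = sum(self.readings) / len(self.readings)
        # A positive excess over the limit is truthy, so this flags the anomaly.
        return bool(max(0.0, avg - self.limit))
```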

&lt;p&gt;&lt;strong&gt;Key Metrics to Stream:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sensor readings (temperature, pressure, vibration)&lt;/li&gt;
&lt;li&gt;Equipment status and performance&lt;/li&gt;
&lt;li&gt;Environmental conditions&lt;/li&gt;
&lt;li&gt;Energy consumption&lt;/li&gt;
&lt;li&gt;Safety alerts&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  📱 &lt;strong&gt;Social Media: Viral Content Detection&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why Real-Time Matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trending topics&lt;/strong&gt;: Identify viral content early&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content moderation&lt;/strong&gt;: Remove harmful content instantly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engagement optimization&lt;/strong&gt;: Boost high-performing posts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Influencer identification&lt;/strong&gt;: Spot rising content creators&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-World Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📸 User posts photo with #NewProduct
    ↓ (100ms)
🔥 Engagement spike: 1000 likes in 2 minutes
    ↓ (1 second)
📈 Trending algorithm: Boost to wider audience
    ↓ (5 seconds)
💰 Ad targeting: Show related product ads
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture Need&lt;/strong&gt;: 1M MPS for major platforms, 50K MPS for niche communities.&lt;/p&gt;

&lt;h3&gt;
  
  
  🚛 &lt;strong&gt;Logistics: Supply Chain Optimization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why Real-Time Matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Route optimization&lt;/strong&gt;: Adjust for traffic and weather&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inventory tracking&lt;/strong&gt;: Real-time stock levels across warehouses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delivery predictions&lt;/strong&gt;: Accurate ETAs for customers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exception handling&lt;/strong&gt;: Immediate response to delays&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-World Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📦 Package scanned at distribution center
    ↓ (500ms)
🗺️ Route optimization: Traffic jam detected, reroute
    ↓ (2 seconds)
📱 Customer notification: "Delivery delayed by 30 minutes"
    ↓ (5 seconds)
📊 Analytics: Update delivery performance metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture Need&lt;/strong&gt;: 50K MPS for regional logistics, 1M MPS for global shipping companies.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎪 &lt;strong&gt;When Real-Time Becomes Essential&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Not every use case needs real-time processing. Here's when it becomes critical:&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ &lt;strong&gt;Perfect for Real-Time Streaming&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;User Experience Depends on Speed&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gaming leaderboards&lt;/li&gt;
&lt;li&gt;Live chat applications&lt;/li&gt;
&lt;li&gt;Real-time collaboration tools&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Financial Impact of Delays&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trading platforms&lt;/li&gt;
&lt;li&gt;Fraud detection&lt;/li&gt;
&lt;li&gt;Dynamic pricing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Safety-Critical Systems&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Medical monitoring&lt;/li&gt;
&lt;li&gt;Industrial safety&lt;/li&gt;
&lt;li&gt;Autonomous vehicles&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Competitive Advantage through Speed&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Personalized recommendations&lt;/li&gt;
&lt;li&gt;Real-time offers&lt;/li&gt;
&lt;li&gt;Instant customer support&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  ❌ &lt;strong&gt;Better with Batch Processing&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Historical Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monthly sales reports&lt;/li&gt;
&lt;li&gt;Annual compliance reporting&lt;/li&gt;
&lt;li&gt;Data warehouse ETL&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Complex Computations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Machine learning model training&lt;/li&gt;
&lt;li&gt;Financial reconciliation&lt;/li&gt;
&lt;li&gt;Scientific simulations&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cost-Sensitive Operations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backup processing&lt;/li&gt;
&lt;li&gt;Archive operations&lt;/li&gt;
&lt;li&gt;Non-urgent analytics&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🛠️ &lt;strong&gt;Technology Stack Breakdown&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's what powers the &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;RealtimeDataPlatform&lt;/a&gt; across different scales:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Core Components&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Message Broker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Apache Pulsar&lt;/span&gt;
  &lt;span class="s"&gt;- Why&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Better than Kafka for geo-replication&lt;/span&gt;
  &lt;span class="s"&gt;- Scalability&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Handles multi-tenant workloads&lt;/span&gt;
  &lt;span class="s"&gt;- Features&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Built-in schema registry, tiered storage&lt;/span&gt;

&lt;span class="na"&gt;Stream Processing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Apache Flink&lt;/span&gt;  
  &lt;span class="s"&gt;- Why&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;True low-latency processing&lt;/span&gt;
  &lt;span class="s"&gt;- Scalability&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Horizontal scaling with checkpointing&lt;/span&gt;
  &lt;span class="s"&gt;- Features&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Event-time processing, stateful operations&lt;/span&gt;

&lt;span class="na"&gt;Storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClickHouse&lt;/span&gt;
  &lt;span class="s"&gt;- Why&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Optimized for analytical queries&lt;/span&gt;
  &lt;span class="s"&gt;- Scalability&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Columnar storage with compression&lt;/span&gt;
  &lt;span class="s"&gt;- Features&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Real-time ingestion, SQL interface&lt;/span&gt;

&lt;span class="na"&gt;Monitoring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Grafana + VictoriaMetrics&lt;/span&gt;
  &lt;span class="s"&gt;- Why&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Unified observability across all components&lt;/span&gt;
  &lt;span class="s"&gt;- Scalability&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Better compression than Prometheus&lt;/span&gt;
  &lt;span class="s"&gt;- Features&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Custom dashboards, alerting&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Scaling Strategies&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;From 1K to 50K messages/sec:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal Pod Scaling&lt;/strong&gt;: Add more Flink TaskManagers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage Optimization&lt;/strong&gt;: Partition ClickHouse tables by time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Optimization&lt;/strong&gt;: Use node affinity for co-location&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;From 50K to 1M messages/sec:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure Upgrade&lt;/strong&gt;: c5.2xlarge instances with NVMe&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-AZ Deployment&lt;/strong&gt;: Distribute across availability zones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Monitoring&lt;/strong&gt;: Dedicated monitoring namespace&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  📊 &lt;strong&gt;ROI Calculator: Is Real-Time Worth It?&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cost Analysis Template&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Real-Time Implementation Cost:
- Infrastructure: $200-25,000/month (based on scale)
- Development: 2-6 months
- Maintenance: 20% of development cost annually

Business Value Calculation:
- Revenue increase from faster responses
- Cost savings from early problem detection  
- Competitive advantage quantification
- Customer satisfaction improvement

Break-even typically: 6-18 months
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
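&lt;p&gt;The template above translates directly into a quick calculator. A rough sketch, using the template's 20% annual maintenance assumption (amortized monthly) and a single "monthly business value" estimate you supply:&lt;/p&gt;

```python
def breakeven_months(monthly_infra, dev_cost, monthly_value):
    """Months until cumulative value covers dev cost plus running costs.

    Maintenance is folded in at 20 percent of dev cost per year, as in the
    template above. Returns None when the project never pays back.
    """
    monthly_maintenance = dev_cost * 0.20 / 12
    net = monthly_value - monthly_infra - monthly_maintenance
    if not max(0.0, net):
        return None  # running costs eat all the value; no payback
    return dev_cost / net
```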



&lt;h3&gt;
  
  
  &lt;strong&gt;Decision Framework&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Ask yourself:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How much does a 1-hour delay cost your business?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If &amp;gt;$1,000: Consider real-time&lt;/li&gt;
&lt;li&gt;If &amp;gt;$10,000: Real-time is essential&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What's your user expectation?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gaming: &amp;lt;100ms expected&lt;/li&gt;
&lt;li&gt;E-commerce: &amp;lt;1s acceptable&lt;/li&gt;
&lt;li&gt;Analytics: &amp;lt;1 minute usually fine&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How complex is your processing?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple aggregations: Real-time feasible&lt;/li&gt;
&lt;li&gt;ML training: Stick to batch processing&lt;/li&gt;
&lt;li&gt;Fraud detection: Real-time critical&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🚀 &lt;strong&gt;Getting Started: Your Real-Time Journey&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 1: Proof of Concept (Week 1-2)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start with local development setup&lt;/span&gt;
git clone https://github.com/hyperscaledesignhub/RealtimeDataPlatform
&lt;span class="nb"&gt;cd &lt;/span&gt;local-setup
./scripts/start-pipeline.sh

&lt;span class="c"&gt;# Experiment with your data patterns&lt;/span&gt;
&lt;span class="c"&gt;# Measure actual throughput requirements&lt;/span&gt;
&lt;span class="c"&gt;# Validate business value&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 2: Production Pilot (Month 1-2)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy 50K MPS setup for initial load&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;realtime-platform-50k-events
&lt;span class="c"&gt;# Follow deployment guide&lt;/span&gt;
&lt;span class="c"&gt;# Monitor performance and costs&lt;/span&gt;
&lt;span class="c"&gt;# Gather user feedback&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 3: Scale as Needed (Month 3+)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Upgrade to 1M MPS if requirements justify it&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;realtime-platform-1million-events
&lt;span class="c"&gt;# Enterprise-grade monitoring and alerting&lt;/span&gt;
&lt;span class="c"&gt;# Multi-region deployment for global reach&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎯 &lt;strong&gt;Industry-Specific Quick Start Guides&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;E-Commerce Startup&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with&lt;/strong&gt;: Local setup for development&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale to&lt;/strong&gt;: 50K MPS when you hit 10K daily users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on&lt;/strong&gt;: Cart abandonment, inventory tracking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key metrics&lt;/strong&gt;: Conversion rate, page load time&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Financial Services&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with&lt;/strong&gt;: 50K MPS for fraud detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale to&lt;/strong&gt;: 1M MPS for trading platforms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on&lt;/strong&gt;: Transaction monitoring, risk analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key metrics&lt;/strong&gt;: Fraud detection rate, latency&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;IoT Company&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with&lt;/strong&gt;: 50K MPS for device monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale to&lt;/strong&gt;: 1M MPS for industrial deployments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on&lt;/strong&gt;: Predictive maintenance, anomaly detection
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key metrics&lt;/strong&gt;: Uptime, maintenance cost savings&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Gaming Studio&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with&lt;/strong&gt;: Local setup for single-player games&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale to&lt;/strong&gt;: 1M MPS for massively multiplayer games&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on&lt;/strong&gt;: Real-time leaderboards, matchmaking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key metrics&lt;/strong&gt;: Player engagement, session duration&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🏁 &lt;strong&gt;Conclusion: The Real-Time Imperative&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Real-time streaming isn't just a technology choice—it's a &lt;strong&gt;business strategy&lt;/strong&gt;. The companies winning today are those who can &lt;strong&gt;act on data as it happens&lt;/strong&gt;, not hours or days later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Match your architecture to your actual needs&lt;/strong&gt;—don't over-engineer&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Start small and scale progressively&lt;/strong&gt; based on proven business value&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Focus on the use cases that directly impact revenue or user experience&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Invest in monitoring and observability from day one&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Consider the total cost of ownership, not just infrastructure costs&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The bottom line&lt;/strong&gt;: If waiting for data costs you more than processing it in real time, you need streaming architecture. If your users expect instant responses, you need real-time processing. If your competitors are faster, you need to catch up.&lt;/p&gt;

&lt;p&gt;The question isn't whether you'll adopt real-time streaming—it's &lt;strong&gt;when&lt;/strong&gt; and &lt;strong&gt;at what scale&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  📚 &lt;strong&gt;Resources &amp;amp; Next Steps&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complete Platform&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;RealtimeDataPlatform GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;50K Setup Guide&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-50k-events" rel="noopener noreferrer"&gt;Small-Medium Business Architecture&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1M Setup Guide&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events" rel="noopener noreferrer"&gt;Enterprise Architecture&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local Development&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/local-setup" rel="noopener noreferrer"&gt;Quick Start Guide&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;What's your real-time streaming use case? Share your requirements and challenges in the comments!&lt;/strong&gt; 👇&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow me for more posts on streaming architecture, scalability patterns, and production DevOps!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #realtime #streaming #architecture #iot #ecommerce #finance #gaming #devops #microservices&lt;/p&gt;




&lt;h2&gt;
  
  
  🎮 &lt;strong&gt;Interactive Use Case Matcher&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Answer these questions to find your ideal architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What's your expected peak throughput?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&amp;lt;1K msg/sec → Local development setup&lt;/li&gt;
&lt;li&gt;1K-50K msg/sec → 50K MPS architecture&lt;/li&gt;
&lt;li&gt;&amp;gt;50K msg/sec → 1M MPS enterprise setup&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What's your latency requirement?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&amp;lt;100ms → Gaming/trading focused setup&lt;/li&gt;
&lt;li&gt;&amp;lt;1 second → E-commerce/fraud detection&lt;/li&gt;
&lt;li&gt;&amp;lt;10 seconds → Analytics/monitoring&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What's your budget?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free → Local development&lt;/li&gt;
&lt;li&gt;$200-500/month → 50K MPS&lt;/li&gt;
&lt;li&gt;$25,000+/month → Enterprise 1M MPS&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What's your industry?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;E-commerce → Focus on cart/inventory streams&lt;/li&gt;
&lt;li&gt;Finance → Emphasize fraud detection&lt;/li&gt;
&lt;li&gt;IoT → Sensor data and predictive maintenance&lt;/li&gt;
&lt;li&gt;Gaming → Real-time leaderboards and events&lt;/li&gt;
&lt;li&gt;Social Media → Content engagement tracking&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
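&lt;p&gt;The throughput question above is really a tier lookup. Here's that decision as a tiny POSIX shell helper, with the tier boundaries taken straight from the list above (the function name is just for illustration):&lt;/p&gt;

```shell
#!/bin/sh
# Map a peak-throughput figure (msg/sec) to the matching setup guide,
# using the tiers from question 1 above.
recommend() {
  mps=$1
  if [ "$mps" -lt 1000 ]; then
    echo "Local development setup"
  elif [ "$mps" -le 50000 ]; then
    echo "50K MPS architecture"
  else
    echo "1M MPS enterprise setup"
  fi
}

recommend 500      # Local development setup
recommend 20000    # 50K MPS architecture
recommend 200000   # 1M MPS enterprise setup
```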

&lt;p&gt;&lt;strong&gt;Got your answers? Check the corresponding setup guide and start building! 🚀&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>dataengineering</category>
      <category>performance</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Realtime Data Streaming Platform: Building a Unified Monitoring Stack</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 13:52:57 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/realtime-data-streaming-platform-building-a-unified-monitoring-stack-n9o</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/realtime-data-streaming-platform-building-a-unified-monitoring-stack-n9o</guid>
      <description>&lt;p&gt;When you're running a real-time streaming platform processing &lt;strong&gt;1 million messages per second&lt;/strong&gt;, you can't afford to be blind. You need comprehensive monitoring across all components - Pulsar, Flink, and ClickHouse - in a single unified view.&lt;/p&gt;

&lt;p&gt;In this guide, I'll show you how to build a &lt;strong&gt;production-grade monitoring stack&lt;/strong&gt; that provides real-time visibility into your entire streaming pipeline using &lt;strong&gt;VictoriaMetrics&lt;/strong&gt; and &lt;strong&gt;Grafana&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 What We're Building
&lt;/h2&gt;

&lt;p&gt;A unified monitoring solution that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📊 &lt;strong&gt;Single Grafana instance&lt;/strong&gt; for all components&lt;/li&gt;
&lt;li&gt;⚡ &lt;strong&gt;VictoriaMetrics&lt;/strong&gt; as the metrics backend (Prometheus-compatible)&lt;/li&gt;
&lt;li&gt;📈 &lt;strong&gt;Real-time dashboards&lt;/strong&gt; for Pulsar, Flink, and ClickHouse&lt;/li&gt;
&lt;li&gt;🔌 &lt;strong&gt;Automated setup&lt;/strong&gt; with scripts and Helm charts&lt;/li&gt;
&lt;li&gt;🎨 &lt;strong&gt;Pre-built dashboards&lt;/strong&gt; ready to import&lt;/li&gt;
&lt;li&gt;🚀 &lt;strong&gt;Scalable&lt;/strong&gt; to handle 1M+ metrics/sec&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🏗️ Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│               Unified Monitoring Architecture                   │
└─────────────────────────────────────────────────────────────────┘

┌─────────────┐  ┌─────────────┐  ┌──────────────┐
│   Pulsar    │  │    Flink    │  │  ClickHouse  │
│   Metrics   │  │   Metrics   │  │   Metrics    │
│   :9090     │  │   :9249     │  │   :8123      │
└──────┬──────┘  └──────┬──────┘  └──────┬───────┘
       │                │                 │
       │  Prometheus    │   Prometheus    │  SQL
       │  Exposition    │   Reporter      │  Queries
       │                │                 │
       └────────────────┴─────────────────┘
                       │
                       ▼
              ┌─────────────────┐
              │   VMAgent       │
              │  (Collector)    │
              │                 │
              │  Scrapes all    │
              │  /metrics       │
              │  endpoints      │
              └────────┬────────┘
                       │
                       ▼
              ┌─────────────────┐
              │  VictoriaMetrics│
              │   (Storage)     │
              │                 │
              │  Time-series    │
              │  Database       │
              └────────┬────────┘
                       │
                       ▼
              ┌─────────────────┐
              │    Grafana      │
              │  (Dashboards)   │
              │                 │
              │  • Pulsar       │
              │  • Flink        │
              │  • ClickHouse   │
              └─────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📦 The Monitoring Stack
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;VictoriaMetrics Kubernetes Stack&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Pulsar Helm chart includes &lt;strong&gt;victoria-metrics-k8s-stack&lt;/strong&gt; as a dependency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pulsar-load/helm/pulsar/Chart.yaml&lt;/span&gt;
&lt;span class="na"&gt;dependencies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;victoria-metrics-k8s-stack.enabled&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;victoria-metrics-k8s-stack&lt;/span&gt;
  &lt;span class="na"&gt;repository&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://victoriametrics.github.io/helm-charts/&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.38.x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What's included:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;VMAgent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metrics collector (replaces Prometheus)&lt;/td&gt;
&lt;td&gt;8429&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;VMSingle&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Time-series database storage&lt;/td&gt;
&lt;td&gt;8429&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Grafana&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Visualization and dashboards&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kube-State-Metrics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kubernetes cluster metrics&lt;/td&gt;
&lt;td&gt;8080&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Node-Exporter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Node-level metrics&lt;/td&gt;
&lt;td&gt;9100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why VictoriaMetrics over Prometheus?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;10x better compression&lt;/strong&gt; (less storage)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Faster queries&lt;/strong&gt; (optimized for large datasets)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Lower memory usage&lt;/strong&gt; (~2GB vs Prometheus's 16GB)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Prometheus-compatible&lt;/strong&gt; (drop-in replacement)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Better retention&lt;/strong&gt; (handles months of data)  &lt;/p&gt;
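&lt;p&gt;To make the compression point concrete, here's a back-of-envelope storage estimate. The bytes-per-sample figure is an assumption (VictoriaMetrics' docs cite well under 1 byte per sample for typical series), not a measured benchmark:&lt;/p&gt;

```shell
# Rough daily storage for 1M samples/sec; BYTES_PER_SAMPLE is an assumption
SAMPLES_PER_SEC=1000000
BYTES_PER_SAMPLE=0.8
SECONDS_PER_DAY=86400
awk -v s="$SAMPLES_PER_SEC" -v b="$BYTES_PER_SAMPLE" -v d="$SECONDS_PER_DAY" \
  'BEGIN { printf "~%.1f GB/day\n", s*b*d/1e9 }'
```

At these assumptions, a full day of 1M samples/sec lands around 70 GB before replication — small enough that months of retention on a single gp3 volume is realistic.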
&lt;h3&gt;
  
  
  2. &lt;strong&gt;Configuration in pulsar-values.yaml&lt;/strong&gt;
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Enable Victoria Metrics stack&lt;/span&gt;
&lt;span class="na"&gt;victoria-metrics-k8s-stack&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="c1"&gt;# VMAgent - Metrics collector&lt;/span&gt;
  &lt;span class="na"&gt;vmagent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;scrapeInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15s&lt;/span&gt;
      &lt;span class="na"&gt;externalLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;benchmark-high-infra&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;

  &lt;span class="c1"&gt;# Grafana configuration&lt;/span&gt;
  &lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;adminPassword&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin123"&lt;/span&gt;  &lt;span class="c1"&gt;# Change in production!&lt;/span&gt;

    &lt;span class="c1"&gt;# Persistent storage for dashboards&lt;/span&gt;
    &lt;span class="na"&gt;persistence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10Gi&lt;/span&gt;
      &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gp3"&lt;/span&gt;

    &lt;span class="c1"&gt;# Service exposure&lt;/span&gt;
    &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LoadBalancer&lt;/span&gt;  &lt;span class="c1"&gt;# Accessible externally&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;

    &lt;span class="c1"&gt;# Default datasource (VictoriaMetrics)&lt;/span&gt;
    &lt;span class="na"&gt;datasources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;datasources.yaml&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
        &lt;span class="na"&gt;datasources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VictoriaMetrics&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
          &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://vmsingle-pulsar-victoria-metrics-k8s-stack:8429&lt;/span&gt;
          &lt;span class="na"&gt;isDefault&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Persistent Dashboards:&lt;/strong&gt; Survives pod restarts&lt;br&gt;&lt;br&gt;
🔹 &lt;strong&gt;LoadBalancer Service:&lt;/strong&gt; Direct external access&lt;br&gt;&lt;br&gt;
🔹 &lt;strong&gt;Auto-discovery:&lt;/strong&gt; Automatically scrapes Pulsar pods&lt;br&gt;&lt;br&gt;
🔹 &lt;strong&gt;Pre-configured:&lt;/strong&gt; Works out-of-the-box  &lt;/p&gt;
&lt;h2&gt;
  
  
  🔧 Setting Up Flink Metrics
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Step 1: Run the setup-flink-metrics.sh Script
&lt;/h3&gt;

&lt;p&gt;This script does the heavy lifting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;realtime-platform-1million-events/flink-load
./setup-flink-metrics.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What it does, step by step:&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 1: Create Flink Configuration&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Applies flink-config-configmap.yaml&lt;/span&gt;
apiVersion: v1
kind: ConfigMap
metadata:
  name: flink-config
  namespace: flink-benchmark
data:
  flink-conf.yaml: |
    metrics.reporters: prometheus
    metrics.reporter.prometheus.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
    metrics.reporter.prometheus.port: 9249-9259
    metrics.system-resource: &lt;span class="s2"&gt;"true"&lt;/span&gt;
    metrics.system-resource-probing-interval: &lt;span class="s2"&gt;"5000"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Step 2: Setup Prometheus Integration&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Creates VMPodScrape for Victoria Metrics&lt;/span&gt;
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMPodScrape
metadata:
  name: flink-metrics
  namespace: pulsar  &lt;span class="c"&gt;# ← Created in Pulsar namespace!&lt;/span&gt;
spec:
  selector:
    matchLabels:
      app: iot-flink-job
  namespaceSelector:
    matchNames:
      - flink-benchmark  &lt;span class="c"&gt;# ← Scrapes from Flink namespace&lt;/span&gt;
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 15s  &lt;span class="c"&gt;# Scrape every 15 seconds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why VMPodScrape in Pulsar namespace?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VMAgent runs in the Pulsar namespace&lt;/li&gt;
&lt;li&gt;It needs permission to scrape other namespaces&lt;/li&gt;
&lt;li&gt;Cross-namespace scraping via &lt;code&gt;namespaceSelector&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 3-4: Patch Flink Deployments&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Downloads Prometheus reporter JAR via initContainer&lt;/span&gt;
initContainers:
- name: download-prometheus-jar
  image: curlimages/curl:latest
  &lt;span class="nb"&gt;command&lt;/span&gt;:
  - sh
  - &lt;span class="nt"&gt;-c&lt;/span&gt;
  - |
    curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://repo1.maven.org/maven2/org/apache/flink/flink-metrics-prometheus/1.18.0/flink-metrics-prometheus-1.18.0.jar &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-o&lt;/span&gt; /flink-prometheus/flink-metrics-prometheus-1.18.0.jar
  volumeMounts:
  - name: flink-prometheus-jar
    mountPath: /flink-prometheus

&lt;span class="c"&gt;# Mounts JAR in Flink lib directory&lt;/span&gt;
containers:
- name: flink-main-container
  volumeMounts:
  - name: flink-prometheus-jar
    mountPath: /opt/flink/lib/flink-metrics-prometheus-1.18.0.jar
    subPath: flink-metrics-prometheus-1.18.0.jar
  ports:
  - containerPort: 9249
    name: metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why download at runtime?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No need to rebuild Flink Docker image&lt;/li&gt;
&lt;li&gt;Easy to update JAR version&lt;/li&gt;
&lt;li&gt;Works with official Flink images&lt;/li&gt;
&lt;/ul&gt;
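&lt;p&gt;One caveat with runtime downloads: a compromised or moved artifact silently changes your deployment. A hedged variant of the same initContainer adds an integrity check — replace &lt;code&gt;REPLACE_WITH_EXPECTED_SHA256&lt;/code&gt; with the checksum Maven Central publishes for the JAR (this sketch is an addition, not part of the setup script):&lt;/p&gt;

```yaml
# Sketch: same initContainer, but fail the pod if the JAR checksum doesn't match
initContainers:
- name: download-prometheus-jar
  image: curlimages/curl:latest
  command: ["sh", "-c"]
  args:
  - |
    curl -fsSL https://repo1.maven.org/maven2/org/apache/flink/flink-metrics-prometheus/1.18.0/flink-metrics-prometheus-1.18.0.jar \
      -o /flink-prometheus/flink-metrics-prometheus-1.18.0.jar
    echo "REPLACE_WITH_EXPECTED_SHA256  /flink-prometheus/flink-metrics-prometheus-1.18.0.jar" | sha256sum -c -
  volumeMounts:
  - name: flink-prometheus-jar
    mountPath: /flink-prometheus
```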

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 4.5: Install ClickHouse Plugin&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Installs grafana-clickhouse-datasource plugin&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar &amp;lt;grafana-pod&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  grafana-cli plugins &lt;span class="nb"&gt;install &lt;/span&gt;grafana-clickhouse-datasource

&lt;span class="c"&gt;# Restarts Grafana to load plugin&lt;/span&gt;
kubectl rollout restart deployment/pulsar-grafana &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why needed?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ClickHouse uses native protocol, not Prometheus&lt;/li&gt;
&lt;li&gt;Plugin enables SQL queries from Grafana&lt;/li&gt;
&lt;li&gt;Required for ClickHouse dashboard&lt;/li&gt;
&lt;/ul&gt;
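&lt;p&gt;If you prefer declarative setup over &lt;code&gt;grafana-cli&lt;/code&gt; in a running pod, the plugin's datasource can also be provisioned. The field names below follow recent versions of &lt;code&gt;grafana-clickhouse-datasource&lt;/code&gt; and may differ in yours; the host and credentials are placeholders for your ClickHouse service:&lt;/p&gt;

```yaml
# Sketch: Grafana datasource provisioning for ClickHouse (verify fields against
# your plugin version; host/port/credentials are assumptions)
apiVersion: 1
datasources:
- name: ClickHouse
  type: grafana-clickhouse-datasource
  jsonData:
    host: clickhouse.clickhouse.svc.cluster.local
    port: 9000
    protocol: native
  secureJsonData:
    password: ""
```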

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 5-6: Verify Setup&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Tests metrics endpoint&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &amp;lt;jobmanager-pod&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  curl localhost:9249/metrics | &lt;span class="nb"&gt;grep &lt;/span&gt;flink_

&lt;span class="c"&gt;# Should show 200+ Flink metrics&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Step 7: Restart VMAgent&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Reloads scrape configuration&lt;/span&gt;
kubectl rollout restart deployment/vmagent-pulsar-victoria-metrics-k8s-stack &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Flink metrics now flowing to VictoriaMetrics! ✅&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Import Dashboards
&lt;/h3&gt;

&lt;p&gt;Now import the pre-built dashboards:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ../../monitoring/grafana-dashboards
./import-dashboards.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Port-forwards to Grafana:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar svc/pulsar-grafana 3000:80 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Imports Flink Dashboard:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-u&lt;/span&gt; admin:admin123 &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-d&lt;/span&gt; @flink-iot-pipeline-dashboard.json &lt;span class="se"&gt;\&lt;/span&gt;
     http://localhost:3000/api/dashboards/db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Imports ClickHouse Dashboard:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-u&lt;/span&gt; admin:admin123 &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-d&lt;/span&gt; @clickhouse-iot-data-dashboard.json &lt;span class="se"&gt;\&lt;/span&gt;
     http://localhost:3000/api/dashboards/db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Three dashboards available in Grafana! 📊&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Access Your Unified Dashboard
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Port-forward to Grafana&lt;/span&gt;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar svc/pulsar-grafana 3000:80

&lt;span class="c"&gt;# Open http://localhost:3000&lt;/span&gt;
&lt;span class="c"&gt;# Login: admin / admin123&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dashboard Overview:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar Overview&lt;/strong&gt; (default landing page)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink IoT Pipeline Metrics&lt;/strong&gt; (streaming processing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse IoT Data Metrics&lt;/strong&gt; (analytical queries)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  📊 Real-Time Monitoring at Scale
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Pulsar Dashboard - Message Ingestion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Message Rate:&lt;/strong&gt; Real-time ingestion rate (target: 1M msg/sec)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer Lag:&lt;/strong&gt; Backlog across all topics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage:&lt;/strong&gt; Bookie disk usage and write latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput:&lt;/strong&gt; Bytes in/out per second&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Critical Alerts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consumer lag &amp;gt; 1M messages&lt;/li&gt;
&lt;li&gt;Bookie write latency &amp;gt; 2ms (p99)&lt;/li&gt;
&lt;li&gt;Disk usage &amp;gt; 85%&lt;/li&gt;
&lt;li&gt;Broker CPU &amp;gt; 90%&lt;/li&gt;
&lt;/ul&gt;
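&lt;p&gt;The thresholds above can be encoded as a &lt;code&gt;VMRule&lt;/code&gt; the VictoriaMetrics operator picks up automatically. A minimal sketch for the consumer-lag alert — the metric name and labels are assumptions, so verify them against your broker's &lt;code&gt;/metrics&lt;/code&gt; output:&lt;/p&gt;

```yaml
# Sketch: VMRule for the consumer-lag threshold (metric name assumed)
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: pulsar-critical-alerts
  namespace: pulsar
spec:
  groups:
  - name: pulsar.critical
    rules:
    - alert: PulsarConsumerLagHigh
      expr: sum(pulsar_msg_backlog) by (topic) > 1000000
      for: 5m
      annotations:
        summary: "Backlog on {{ $labels.topic }} exceeds 1M messages"
```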

&lt;h3&gt;
  
  
  &lt;strong&gt;Flink Dashboard - Stream Processing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Records Processing:&lt;/strong&gt; Input/output rates with watermarks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checkpointing:&lt;/strong&gt; Duration and success rate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backpressure:&lt;/strong&gt; Task-level processing delays&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Usage:&lt;/strong&gt; CPU, memory, network per TaskManager&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sample Queries:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Flink processing rate
rate(flink_taskmanager_job_task_numRecordsIn[1m])

# Checkpoint duration
flink_jobmanager_job_lastCheckpointDuration

# Backpressure time
flink_taskmanager_job_task_backPressureTimeMsPerSecond
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;ClickHouse Dashboard - Analytical Performance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query Performance:&lt;/strong&gt; Latency percentiles (p50, p95, p99)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insert Rate:&lt;/strong&gt; Rows inserted per second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage:&lt;/strong&gt; Table sizes and compression ratios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Usage:&lt;/strong&gt; Query memory consumption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Custom SQL Panels:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Real-time insert rate&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;toStartOfMinute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;minute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;inserts_per_minute&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sensors_local&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;HOUR&lt;/span&gt; 
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;minute&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;minute&lt;/span&gt;

&lt;span class="c1"&gt;-- Top devices by message volume&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;message_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sensors_local&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;message_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; 
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎛️ Advanced Dashboard Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Cross-Component Correlation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Create panels that show end-to-end pipeline health:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# End-to-end latency calculation
(flink_taskmanager_job_task_currentProcessingTime - flink_taskmanager_job_task_currentInputWatermark) / 1000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. &lt;strong&gt;Capacity Planning Views&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Track resource utilization trends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Storage growth rate (bytes per hour)
rate(pulsar_storage_size[1h]) * 3600

# Memory utilization trend
avg_over_time(flink_taskmanager_Status_JVM_Memory_Heap_Used[24h])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. &lt;strong&gt;SLA Monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Define and track Service Level Objectives:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pulsar SLOs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Message delivery: 99.9% success rate&lt;/li&gt;
&lt;li&gt;End-to-end latency: &amp;lt;100ms (p95)&lt;/li&gt;
&lt;li&gt;Availability: 99.95% uptime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Flink SLOs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processing latency: &amp;lt;5 seconds (p99)&lt;/li&gt;
&lt;li&gt;Checkpoint success: &amp;gt;99%&lt;/li&gt;
&lt;li&gt;Job availability: 99.9% uptime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ClickHouse SLOs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query latency: &amp;lt;200ms (p95)&lt;/li&gt;
&lt;li&gt;Insert success: 99.99%&lt;/li&gt;
&lt;li&gt;Data freshness: &amp;lt;60 seconds&lt;/li&gt;
&lt;/ul&gt;
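&lt;p&gt;The Flink checkpoint SLO above is easy to track as a ratio of the reporter's checkpoint counters. The metric names follow the &lt;code&gt;flink_jobmanager_job_*&lt;/code&gt; convention the Prometheus reporter uses — verify them against your own &lt;code&gt;/metrics&lt;/code&gt; endpoint:&lt;br&gt;
&lt;/p&gt;

```plaintext
# Checkpoint success ratio over 1h (SLO: stays above 0.99)
sum(increase(flink_jobmanager_job_numberOfCompletedCheckpoints[1h]))
/
(sum(increase(flink_jobmanager_job_numberOfCompletedCheckpoints[1h]))
 + sum(increase(flink_jobmanager_job_numberOfFailedCheckpoints[1h])))
```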

&lt;h2&gt;
  
  
  🎯 Production Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Enable Persistent Storage&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;persistence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10Gi&lt;/span&gt;
    &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gp3"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dashboard configurations persist across pod restarts&lt;/li&gt;
&lt;li&gt;Datasources don't need re-configuration&lt;/li&gt;
&lt;li&gt;Custom dashboards aren't lost&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Organize Dashboards by Tags&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When importing, add tags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dashboard"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"pulsar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"messaging"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"production"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy filtering&lt;/li&gt;
&lt;li&gt;Logical grouping&lt;/li&gt;
&lt;li&gt;Better navigation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Set Up Alerts&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: Alert on high backpressure&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FlinkHighBackpressure&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flink_taskmanager_job_task_backPressureTimeMsPerSecond &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;500&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Flink&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.task_name&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;has&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;backpressure"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common alerts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pulsar consumer lag &amp;gt; 1M messages&lt;/li&gt;
&lt;li&gt;Flink checkpoint failures&lt;/li&gt;
&lt;li&gt;ClickHouse query latency &amp;gt; 1s&lt;/li&gt;
&lt;li&gt;Broker CPU &amp;gt; 90%&lt;/li&gt;
&lt;li&gt;Disk usage &amp;gt; 85%&lt;/li&gt;
&lt;/ul&gt;
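
&lt;p&gt;Two of the alerts above, sketched in the same rule format (hedged: &lt;code&gt;pulsar_msg_backlog&lt;/code&gt; and the node-exporter filesystem metrics are assumptions about your exporter setup; verify the names against your actual scrape targets):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Consumer lag &amp;gt; 1M messages
- alert: PulsarConsumerLagHigh
  expr: sum(pulsar_msg_backlog) by (topic) &amp;gt; 1000000
  for: 5m
  annotations:
    summary: "Topic {{ $labels.topic }} backlog exceeds 1M messages"

# Disk usage &amp;gt; 85%
- alert: DiskUsageHigh
  expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) &amp;gt; 0.85
  for: 10m
  annotations:
    summary: "Disk on {{ $labels.instance }} is over 85% full"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;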

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Create Custom Views&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Row for each component&lt;/span&gt;
&lt;span class="na"&gt;Row 1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pulsar (Message ingestion)&lt;/span&gt;
&lt;span class="na"&gt;Row 2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Flink (Stream processing)&lt;/span&gt;
&lt;span class="na"&gt;Row 3&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClickHouse (Data storage)&lt;/span&gt;
&lt;span class="na"&gt;Row 4&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Infrastructure (CPU, memory, disk)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example panel:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"End-to-End Latency"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"targets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"expr"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"flink_taskmanager_job_latency_source_id_operator_id{quantile=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;0.99&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"legendFormat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Flink p99 latency"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. &lt;strong&gt;Export and Version Control&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Export all dashboards&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;dashboard &lt;span class="k"&gt;in &lt;/span&gt;flink clickhouse pulsar&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;curl &lt;span class="nt"&gt;-u&lt;/span&gt; admin:admin123 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"http://localhost:3000/api/dashboards/uid/&lt;/span&gt;&lt;span class="nv"&gt;$dashboard&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
    jq &lt;span class="s1"&gt;'.dashboard'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;dashboard&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="nt"&gt;-dashboard&lt;/span&gt;.json

&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Commit to git&lt;/span&gt;
git add grafana-dashboards/&lt;span class="k"&gt;*&lt;/span&gt;.json
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Update Grafana dashboards"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎓 The Complete Monitoring Workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Initial Setup (One Time)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Deploy Pulsar with VictoriaMetrics stack&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;pulsar-load
./deploy.sh
&lt;span class="c"&gt;# ✓ Grafana and VictoriaMetrics installed automatically&lt;/span&gt;

&lt;span class="c"&gt;# 2. Setup Flink metrics integration&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ../flink-load
./setup-flink-metrics.sh
&lt;span class="c"&gt;# ✓ Flink pods now expose metrics&lt;/span&gt;
&lt;span class="c"&gt;# ✓ VMAgent configured to scrape Flink&lt;/span&gt;

&lt;span class="c"&gt;# 3. Import custom dashboards&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ../../monitoring/grafana-dashboards
./import-dashboards.sh
&lt;span class="c"&gt;# ✓ Flink and ClickHouse dashboards imported&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Daily Operations
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Access Grafana&lt;/span&gt;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar svc/pulsar-grafana 3000:80

&lt;span class="c"&gt;# Open http://localhost:3000&lt;/span&gt;

&lt;span class="c"&gt;# View dashboards:&lt;/span&gt;
&lt;span class="c"&gt;# 1. Pulsar Overview (default)&lt;/span&gt;
&lt;span class="c"&gt;# 2. Dashboards → Flink IoT Pipeline Metrics&lt;/span&gt;
&lt;span class="c"&gt;# 3. Dashboards → ClickHouse IoT Data Metrics&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Monitoring at Scale (1M msg/sec)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What to watch:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pulsar:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Message rate: Should be ~1M msg/sec consistently&lt;/li&gt;
&lt;li&gt;Consumer lag: Should be near 0&lt;/li&gt;
&lt;li&gt;Bookie write latency: &amp;lt;2ms p99&lt;/li&gt;
&lt;li&gt;Storage growth: ~300 MB/sec&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Flink:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Records in: ~1M msg/sec&lt;/li&gt;
&lt;li&gt;Records out: ~17K records/min (after aggregation)&lt;/li&gt;
&lt;li&gt;Checkpoint duration: &amp;lt;10 seconds&lt;/li&gt;
&lt;li&gt;Backpressure: LOW&lt;/li&gt;
&lt;li&gt;CPU: 75-85% utilization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ClickHouse:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Insert rate: ~289 inserts/sec (17,333/60)&lt;/li&gt;
&lt;li&gt;Query latency: &amp;lt;200ms for aggregations&lt;/li&gt;
&lt;li&gt;Table size: Growing at ~50 MB/sec&lt;/li&gt;
&lt;li&gt;Compression ratio: 10-15x&lt;/li&gt;
&lt;/ul&gt;
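
&lt;p&gt;The ClickHouse insert rate is just the Flink output rate converted to seconds; a quick sanity check of that arithmetic (the ~17,333 records/min figure comes from this pipeline's aggregation output above):&lt;/p&gt;

```shell
# Convert the aggregated Flink output rate to a per-second insert rate
records_per_min=17333
inserts_per_sec=$(( records_per_min / 60 ))
echo "${inserts_per_sec} inserts/sec"   # prints "288 inserts/sec" (rounded to ~289 in the text)
```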

&lt;h2&gt;
  
  
  📊 Performance Impact of Monitoring
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Resource overhead:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;CPU&lt;/th&gt;
&lt;th&gt;Memory&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VMAgent&lt;/td&gt;
&lt;td&gt;0.1 vCPU&lt;/td&gt;
&lt;td&gt;200 MB&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VMSingle&lt;/td&gt;
&lt;td&gt;0.5 vCPU&lt;/td&gt;
&lt;td&gt;2 GB&lt;/td&gt;
&lt;td&gt;50 GB/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grafana&lt;/td&gt;
&lt;td&gt;0.1 vCPU&lt;/td&gt;
&lt;td&gt;400 MB&lt;/td&gt;
&lt;td&gt;10 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.7 vCPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.6 GB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;60 GB/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Cost impact:&lt;/strong&gt; ~$50/month (minimal!)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time visibility&lt;/li&gt;
&lt;li&gt;Faster troubleshooting&lt;/li&gt;
&lt;li&gt;Capacity planning&lt;/li&gt;
&lt;li&gt;Performance optimization&lt;/li&gt;
&lt;li&gt;SLA monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🎯 Advanced: Custom Metrics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Add Custom Flink Metrics
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In your Flink job&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CustomMetricsMapper&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;RichMapFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;transient&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt; &lt;span class="n"&gt;eventCounter&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;transient&lt;/span&gt; &lt;span class="nc"&gt;Meter&lt;/span&gt; &lt;span class="n"&gt;eventRate&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Configuration&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;eventCounter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getRuntimeContext&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getMetricGroup&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;counter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"custom_events_processed"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="n"&gt;eventRate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getRuntimeContext&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getMetricGroup&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;meter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"custom_events_per_second"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MeterView&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Event&lt;/span&gt; &lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Event&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;eventCounter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;inc&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;eventRate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;markEvent&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;View in Grafana:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(flink_taskmanager_job_task_operator_custom_events_processed[1m])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Add Custom ClickHouse Queries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"targets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"datasource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ClickHouse"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rawSql"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SELECT toStartOfMinute(time) as t, COUNT(*) as count FROM benchmark.sensors_local WHERE time &amp;gt;= now() - INTERVAL 1 HOUR GROUP BY t ORDER BY t"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"format"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"time_series"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎉 Conclusion
&lt;/h2&gt;

&lt;p&gt;You now have a &lt;strong&gt;unified monitoring stack&lt;/strong&gt; that provides:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Single Grafana instance&lt;/strong&gt; monitoring the entire pipeline&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;VictoriaMetrics&lt;/strong&gt; for efficient metric storage&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;3 pre-built dashboards&lt;/strong&gt; (Pulsar, Flink, ClickHouse)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Automated setup&lt;/strong&gt; with shell scripts&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Cross-namespace monitoring&lt;/strong&gt; (Pulsar → Flink)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Real-time visibility&lt;/strong&gt; at 1M msg/sec scale&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Production-ready&lt;/strong&gt; alerting and SLO tracking  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The beauty of this setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deploy once&lt;/strong&gt;: Monitoring comes with Pulsar&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add Flink&lt;/strong&gt;: One script (&lt;code&gt;setup-flink-metrics.sh&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Import dashboards&lt;/strong&gt;: One script (&lt;code&gt;import-dashboards.sh&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access everything&lt;/strong&gt;: Single Grafana instance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No separate Prometheus installations, no complex federation, no metric duplication. Just a clean, unified monitoring solution! 🚀&lt;/p&gt;

&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/monitoring" rel="noopener noreferrer"&gt;RealtimeDataPlatform/monitoring&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VictoriaMetrics Docs:&lt;/strong&gt; &lt;a href="https://docs.victoriametrics.com/" rel="noopener noreferrer"&gt;docs.victoriametrics.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana Docs:&lt;/strong&gt; &lt;a href="https://grafana.com/docs/" rel="noopener noreferrer"&gt;grafana.com/docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink Metrics:&lt;/strong&gt; &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/metrics/" rel="noopener noreferrer"&gt;nightlies.apache.org/flink/flink-docs-stable/docs/ops/metrics/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;How do you monitor your streaming pipelines? Share your setup in the comments!&lt;/strong&gt; 👇&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow me for more posts on observability, real-time systems, and production DevOps!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #monitoring #grafana #victoriametrics #flink #pulsar #clickhouse #kubernetes #observability&lt;/p&gt;




&lt;h2&gt;
  
  
  🌟 Quick Reference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Port-Forward Commands
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Grafana (all dashboards)&lt;/span&gt;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar svc/pulsar-grafana 3000:80

&lt;span class="c"&gt;# VictoriaMetrics (raw metrics)&lt;/span&gt;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar svc/vmsingle-pulsar-victoria-metrics-k8s-stack 8429:8429

&lt;span class="c"&gt;# Flink UI (job details)&lt;/span&gt;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark svc/flink-jobmanager-rest 8081:8081

&lt;span class="c"&gt;# Prometheus-compatible API&lt;/span&gt;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar svc/vmsingle-pulsar-victoria-metrics-k8s-stack 9090:8429
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Dashboard URLs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pulsar:     http://localhost:3000 (default landing page)
Flink:      http://localhost:3000/d/flink-iot-pipeline
ClickHouse: http://localhost:3000/d/clickhouse-iot-metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Useful Queries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Total message rate across all Pulsar topics
sum(rate(pulsar_in_messages_total[1m]))

# Flink processing lag
flink_taskmanager_job_task_currentInputWatermark - flink_taskmanager_job_task_currentOutputWatermark

# ClickHouse disk usage
clickhouse_metric_DiskDataBytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>tutorial</category>
      <category>dataengineering</category>
      <category>monitoring</category>
      <category>devops</category>
    </item>
    <item>
      <title>Real-Time Streaming Challenges: What I Learned Building at Scale</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 13:32:01 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/real-time-streaming-challenges-what-i-learned-building-at-scale-olg</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/real-time-streaming-challenges-what-i-learned-building-at-scale-olg</guid>
      <description>&lt;p&gt;Building a real-time streaming platform that processes &lt;strong&gt;1 million events per second&lt;/strong&gt; taught me lessons that no tutorial or documentation could. After months of optimization, debugging, and scaling our self-hosted platform on AWS, here are the hard-won insights that saved us &lt;strong&gt;90% in costs&lt;/strong&gt; and countless hours of troubleshooting.&lt;/p&gt;

&lt;p&gt;You can find our complete implementation in the &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events" rel="noopener noreferrer"&gt;RealtimeDataPlatform repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 The Reality Check: Scale Changes Everything
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What I thought:&lt;/strong&gt; "If it works at 10K events/sec, it'll work at 1M events/sec."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; Scale isn't linear. At 1M events/sec, everything breaks differently.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Surprising Bottlenecks
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Expected Bottlenecks    →    Actual Bottlenecks
CPU/Memory             →    Network I/O
Application Logic      →    Storage I/O patterns  
Compute Resources      →    Configuration limits
Code Efficiency        →    Infrastructure design
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Our Flink job processed 890K events/sec from Pulsar backlog but only 600K live. The bottleneck? &lt;strong&gt;Pulsar's storage configuration&lt;/strong&gt;, not Flink's processing power.&lt;/p&gt;

&lt;h2&gt;
  
  
  💾 Storage: The Hidden Performance Killer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Challenge #1: The NVMe Device Separation Discovery
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Initial setup:&lt;/strong&gt; Single NVMe device for both journal and ledger storage in BookKeeper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Write latency spikes to 15ms, throughput capped at 400K events/sec.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Separate NVMe devices for different I/O patterns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Game-changing configuration&lt;/span&gt;
/dev/nvme1n1 → Journal &lt;span class="o"&gt;(&lt;/span&gt;WAL&lt;span class="o"&gt;)&lt;/span&gt; - Sequential writes
/dev/nvme2n1 → Ledgers &lt;span class="o"&gt;(&lt;/span&gt;Data&lt;span class="o"&gt;)&lt;/span&gt; - Random reads/writes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Latency dropped to 2.1ms, throughput jumped to 1M+ events/sec.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; &lt;strong&gt;I/O pattern separation matters more than raw storage speed.&lt;/strong&gt;&lt;/p&gt;
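
&lt;p&gt;In BookKeeper configuration, that split maps to two keys (a sketch; the mount paths are assumptions about your node layout):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# bookkeeper.conf (or bookkeeper.configData in the Pulsar Helm chart)
journalDirectories=/mnt/nvme1/bookkeeper/journal   # sequential WAL writes
ledgerDirectories=/mnt/nvme2/bookkeeper/ledgers    # random reads/writes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;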

&lt;h3&gt;
  
  
  Challenge #2: The journalSyncData Trade-off
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The dilemma:&lt;/strong&gt; Enable &lt;code&gt;journalSyncData&lt;/code&gt; for safety vs. disable for performance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The 10x performance decision&lt;/span&gt;
&lt;span class="na"&gt;journalSyncData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false"&lt;/span&gt;  &lt;span class="c1"&gt;# Risk: data loss on power failure&lt;/span&gt;
                         &lt;span class="c1"&gt;# Gain: 10x latency improvement&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; &lt;strong&gt;Know your data's value.&lt;/strong&gt; For IoT telemetry, we chose speed over perfect durability. For financial data, we wouldn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔧 Parallelism: More Isn't Always Better
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Challenge #3: The Slot-to-CPU Ratio Mystery
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Initial thinking:&lt;/strong&gt; "More parallelism = better performance"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reality check:&lt;/strong&gt; Our pipeline had different resource needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Source&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="no"&gt;O&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;keyBy&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;Window&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;CPU&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;Aggregate&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;CPU&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;Sink&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="no"&gt;O&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
   &lt;span class="err"&gt;🟢&lt;/span&gt;                    &lt;span class="err"&gt;🔴&lt;/span&gt;             &lt;span class="err"&gt;🔴&lt;/span&gt;               &lt;span class="err"&gt;🟢&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Wrong approach:&lt;/strong&gt; 2:1 slot-to-CPU ratio (32 slots on 16 vCPUs)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Result: CPU starvation during aggregation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Right approach:&lt;/strong&gt; 1:1 slot-to-CPU ratio (16 slots on 16 vCPUs)  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Result: Dedicated CPU per CPU-intensive task&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; &lt;strong&gt;Match your slot configuration to your workload's compute pattern, not theoretical maximums.&lt;/strong&gt;&lt;/p&gt;
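
&lt;p&gt;In Flink configuration, the 1:1 ratio comes down to a couple of lines (illustrative; assumes 16-vCPU TaskManager nodes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# flink-conf.yaml: one slot per vCPU on a 16-vCPU TaskManager
taskmanager.numberOfTaskSlots: 16
parallelism.default: 16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;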

&lt;h3&gt;
  
  
  Challenge #4: The Parallelism-Partition Matching Rule
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Discovery:&lt;/strong&gt; Mismatched parallelism and partitions kills performance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;❌ Wrong&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Pulsar partitions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
  &lt;span class="na"&gt;Flink parallelism&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;
  &lt;span class="na"&gt;Result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;56 idle Flink tasks&lt;/span&gt;

&lt;span class="na"&gt;✅ Right&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Pulsar partitions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;64&lt;/span&gt;  
  &lt;span class="na"&gt;Flink parallelism&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;
  &lt;span class="na"&gt;Result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Perfect work distribution&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; &lt;strong&gt;Parallelism should always match your source partitions&lt;/strong&gt; for optimal resource utilization.&lt;/p&gt;

&lt;h2&gt;
  
  
  🕵️ Debugging: The Backlog Test Technique
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Challenge #5: Finding the Real Bottleneck
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The mystery:&lt;/strong&gt; System processing 600K events/sec but target was 1M.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The experiment:&lt;/strong&gt; Stop all producers, let Flink catch up from backlog.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The revealing test&lt;/span&gt;
kubectl scale deployment iot-producer &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0

&lt;span class="c"&gt;# Result: Flink consumed 890K events/sec from backlog!&lt;/span&gt;
&lt;span class="c"&gt;# Conclusion: Pulsar was the bottleneck, not Flink&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; &lt;strong&gt;The backlog test reveals your true system capacity&lt;/strong&gt; and identifies which component is actually limiting throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  💰 Cost Optimization: The Managed Services Reality
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Challenge #6: The 10x Cost Shock
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Our self-hosted platform:&lt;/strong&gt; $24,592/month&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Equivalent AWS managed services:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MSK: $30,525/month&lt;/li&gt;
&lt;li&gt;Kinesis Data Analytics: $81,180/month&lt;/li&gt;
&lt;li&gt;Redshift Serverless: $131,328/month&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: $243,033/month&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The math:&lt;/strong&gt; $243,033 / $24,592 ≈ 9.9x, so managed services cost nearly ten times more at this scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; &lt;strong&gt;At high throughput, managed services pricing becomes prohibitive.&lt;/strong&gt; The break-even point strongly favors self-hosting for sustained, high-volume workloads.&lt;/p&gt;
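
&lt;p&gt;A quick check of that ratio, using the figures above:&lt;/p&gt;

```shell
# Managed-services bill as a percentage of the self-hosted bill
self_hosted=24592    # $/month, self-hosted platform
managed=243033       # $/month, equivalent managed services
echo "$(( managed * 100 / self_hosted ))% of self-hosted cost"   # prints "988% of self-hosted cost"
```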

&lt;h2&gt;
  
  
  🏗️ Architecture: Instance Selection Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Challenge #7: Right-Sizing for Performance vs. Cost
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Wrong approach:&lt;/strong&gt; Many small instances&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;16× c5.large instances&lt;/li&gt;
&lt;li&gt;Complex networking, management overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Right approach:&lt;/strong&gt; Fewer, larger instances  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4× c5.4xlarge instances&lt;/li&gt;
&lt;li&gt;Better price/performance, simpler operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; &lt;strong&gt;Bigger instances often provide better performance-per-dollar&lt;/strong&gt; and reduce operational complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenge #8: The i7i.8xlarge Sweet Spot
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why i7i.8xlarge became our standard:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;32 vCPUs, 256GB RAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2× 3.75TB NVMe devices&lt;/strong&gt; (perfect for separation)&lt;/li&gt;
&lt;li&gt;Latest generation CPU performance&lt;/li&gt;
&lt;li&gt;~$2,160/month, with better price/performance than comparable storage-optimized alternatives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; &lt;strong&gt;The latest generation instances often provide the best performance-per-dollar&lt;/strong&gt; despite higher unit costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔍 Monitoring: What Actually Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Metrics That Saved Us
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Instead of generic CPU/memory metrics, focus on:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Backpressure - Your canary in the coal mine
flink_taskmanager_job_task_backPressureTimeMsPerSecond

# True throughput - Not just input rate
rate(flink_taskmanager_job_task_numRecordsOutPerSecond[1m])

# Storage performance - Often the real bottleneck
rate(bookie_journal_JOURNAL_ADD_ENTRY_count[1m])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; &lt;strong&gt;Domain-specific metrics matter more than generic infrastructure metrics&lt;/strong&gt; for identifying real problems.&lt;/p&gt;
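&lt;p&gt;As a quick illustration, the backpressure metric above can be mapped to severity levels. This is a sketch: Flink reports &lt;code&gt;backPressureTimeMsPerSecond&lt;/code&gt; on a 0–1000 scale, and the 10%/50% cut-offs below are the commonly cited Flink web-UI defaults; verify them against your Flink version before wiring them into alerts.&lt;/p&gt;

```python
# Sketch: map Flink's backPressureTimeMsPerSecond (0-1000) to the OK/LOW/HIGH
# levels shown in the Flink web UI. The 10% and 50% thresholds are the
# commonly cited UI defaults -- an assumption to validate, not gospel.

def backpressure_level(ms_per_second: float) -> str:
    """Classify a subtask's backpressure for alerting."""
    ratio = ms_per_second / 1000.0  # fraction of each second spent backpressured
    if ratio <= 0.10:
        return "OK"
    if ratio <= 0.50:
        return "LOW"
    return "HIGH"

if __name__ == "__main__":
    for sample in (50, 300, 900):
        print(sample, backpressure_level(sample))
```

In practice we alerted on sustained LOW and paged on HIGH; a brief spike during checkpointing is normal.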

&lt;h2&gt;
  
  
  🎓 The Five Universal Truths of Streaming at Scale
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Storage I/O Patterns Trump Raw Performance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Separate your sequential writes from random reads. Always.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Configuration Limits Hit Before Resource Limits&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You'll hit default timeouts, queue sizes, and connection limits before CPU/memory limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;The Bottleneck Moves&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Optimize one component, and the bottleneck shifts to the next weakest link.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Test with Realistic Data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Synthetic loads behave differently than real-world data patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. &lt;strong&gt;Cost Scales Non-Linearly&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;At high throughput, managed services become exponentially more expensive than self-hosting.&lt;/p&gt;
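&lt;p&gt;To make this concrete, here is the per-event arithmetic behind that claim, using this article's monthly totals for a sustained 1M events/sec workload (a sketch, assuming a 30-day month):&lt;/p&gt;

```python
# Sketch: per-event economics of self-hosted vs. managed, from this
# article's monthly totals at a sustained 1M events/sec.

SECONDS_PER_MONTH = 30 * 24 * 3600
events_per_month = 1_000_000 * SECONDS_PER_MONTH  # ~2.59 trillion events

def cost_per_million_events(monthly_cost: float) -> float:
    return monthly_cost / (events_per_month / 1_000_000)

self_hosted = cost_per_million_events(24_592)
managed = cost_per_million_events(243_033)
print(round(self_hosted, 4), round(managed, 4), round(managed / self_hosted, 1))
```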

&lt;h2&gt;
  
  
  💡 What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with these decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Separate storage devices from day one&lt;/strong&gt;&lt;br&gt;
✅ &lt;strong&gt;Match parallelism to partitions immediately&lt;/strong&gt;&lt;br&gt;
✅ &lt;strong&gt;Use 1:1 slot-to-CPU for CPU-bound workloads&lt;/strong&gt;&lt;br&gt;
✅ &lt;strong&gt;Implement backlog testing early&lt;/strong&gt;&lt;br&gt;
✅ &lt;strong&gt;Choose latest-generation instances&lt;/strong&gt;&lt;br&gt;
✅ &lt;strong&gt;Plan for self-hosting at scale&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Building a real-time streaming platform at scale taught me that &lt;strong&gt;the fundamentals matter more than the fancy features&lt;/strong&gt;. Storage I/O patterns, proper parallelism matching, and understanding your actual bottlenecks will get you further than any advanced configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1,040,000 events/sec&lt;/strong&gt; sustained throughput&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$24,592/month&lt;/strong&gt; infrastructure cost (90% savings vs managed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;2ms p99 latency&lt;/strong&gt; end-to-end&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;99.95% uptime&lt;/strong&gt; in production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The biggest lesson? &lt;strong&gt;Scale reveals the truth about your architecture.&lt;/strong&gt; What works at small scale often breaks in unexpected ways at large scale. Plan for it, test for it, and measure everything that matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complete Implementation:&lt;/strong&gt; &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events" rel="noopener noreferrer"&gt;RealtimeDataPlatform&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar Performance Guide:&lt;/strong&gt; &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events/pulsar-load" rel="noopener noreferrer"&gt;Our Pulsar optimization journey&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink Tuning Details:&lt;/strong&gt; &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events/flink-load" rel="noopener noreferrer"&gt;Our Flink scaling experience&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;What's your biggest streaming challenge? Have you hit similar bottlenecks at scale? Share your war stories in the comments!&lt;/strong&gt; 👇&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow me for more real-world lessons from building distributed systems at scale.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #streaming #realtime #scale #performance #aws #pulsar #flink #architecture #lessons&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>performance</category>
      <category>architecture</category>
      <category>aws</category>
    </item>
    <item>
      <title>Real-Time Data Streaming Platform: How We Built a Self-Hosted Platform with 90% Cost Reduction vs AWS Managed Services</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 13:27:10 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/real-time-data-streaming-platform-how-we-built-a-self-hosted-platform-with-90-cost-reduction-vs-1aif</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/real-time-data-streaming-platform-how-we-built-a-self-hosted-platform-with-90-cost-reduction-vs-1aif</guid>
      <description>&lt;p&gt;When tasked with building a real-time data streaming platform capable of processing &lt;strong&gt;1 million events per second&lt;/strong&gt;, we faced a critical decision: build a self-hosted solution using open-source technologies, or leverage AWS managed services for convenience.&lt;/p&gt;

&lt;p&gt;This article details how we built our self-hosted real-time data streaming platform and achieved a &lt;strong&gt;90% cost reduction&lt;/strong&gt; compared to equivalent AWS managed services, while maintaining enterprise-grade performance and reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; A production-ready platform processing 1M events/sec for &lt;strong&gt;$24,592/month&lt;/strong&gt; instead of &lt;strong&gt;$243,033/month&lt;/strong&gt; with AWS managed services.&lt;/p&gt;

&lt;p&gt;You can find the complete implementation in our &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events" rel="noopener noreferrer"&gt;RealtimeDataPlatform repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's how we did it and the lessons learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  🏗️ Architecture Comparison
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Self-Hosted Stack Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────────┐
│                    Self-Hosted Stack on AWS EC2                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌───────────┐ │
│  │   PRODUCER  │───▶│   PULSAR    │───▶│    FLINK    │───▶│CLICKHOUSE │ │
│  │             │    │             │    │             │    │           │ │
│  │ IoT Sensors │    │ Open Source │    │ Open Source │    │Open Source│ │
│  │ AVRO Data   │    │ Message     │    │ Stream      │    │ Analytics │ │
│  │             │    │ Broker      │    │ Processing  │    │ Database  │ │
│  │ 4x c5.4xl   │    │ 6x i7i.8xl  │    │ 4x c5.4xl   │    │6x r6id.4xl│ │
│  │ Full Control│    │ Self-Managed│    │ Self-Managed│    │Self-Hosted│ │
│  └─────────────┘    └─────────────┘    └─────────────┘    └───────────┘ │
│                                                                         │
│  Monthly Cost: ~$24,592                                                │
└─────────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  AWS Managed Stack Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────────┐
│                    AWS Managed Services Stack                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌───────────┐ │
│  │   PRODUCER  │───▶│     MSK     │───▶│   KINESIS   │───▶│ REDSHIFT  │ │
│  │             │    │             │    │             │    │           │ │
│  │ IoT Sensors │    │ Managed     │    │ Data        │    │Serverless │ │
│  │ AVRO Data   │    │ Streaming   │    │ Analytics   │    │ Analytics │ │
│  │             │    │ for Kafka   │    │ for Flink   │    │ Warehouse │ │
│  │ 4x c5.4xl   │    │ AWS Managed │    │ AWS Managed │    │AWS Managed│ │
│  │ Hands-off   │    │ Serverless  │    │ Serverless  │    │Serverless │ │
│  └─────────────┘    └─────────────┘    └─────────────┘    └───────────┘ │
│                                                                         │
│  Monthly Cost: ~$243,033                                               │
└─────────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  💰 The Self-Hosted Stack: Control and Cost-Effectiveness
&lt;/h2&gt;

&lt;p&gt;In this scenario, we deploy our entire stack on Amazon EC2 instances. This gives us maximum control over the configuration and tuning of each component.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Technology Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Message Broker:&lt;/strong&gt; Apache Pulsar&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream Processing:&lt;/strong&gt; Apache Flink
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytics Database:&lt;/strong&gt; ClickHouse&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Infrastructure Details
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Apache Pulsar Cluster:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6× i7i.8xlarge instances (32 vCPU, 256GB RAM, 2×3.75TB NVMe)&lt;/li&gt;
&lt;li&gt;Co-located brokers and bookies&lt;/li&gt;
&lt;li&gt;NVMe device separation for journal and ledger storage&lt;/li&gt;
&lt;li&gt;Cost: ~$12,960/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Apache Flink Cluster:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4× c5.4xlarge instances (16 vCPU, 32GB RAM)&lt;/li&gt;
&lt;li&gt;64-way parallelism matching Pulsar partitions&lt;/li&gt;
&lt;li&gt;1:1 slot-to-CPU ratio for optimal performance&lt;/li&gt;
&lt;li&gt;Cost: ~$2,400/month&lt;/li&gt;
&lt;/ul&gt;
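&lt;p&gt;The slot arithmetic behind "64-way parallelism matching Pulsar partitions" can be sketched as follows (the instance counts and vCPU figures are from the list above):&lt;/p&gt;

```python
# Sketch: why 4x c5.4xlarge with a 1:1 slot-to-vCPU ratio yields exactly
# one Flink subtask per Pulsar partition.

TASK_MANAGERS = 4
VCPUS_PER_TM = 16       # c5.4xlarge
SLOTS_PER_VCPU = 1      # 1:1 for CPU-bound workloads
PULSAR_PARTITIONS = 64

total_slots = TASK_MANAGERS * VCPUS_PER_TM * SLOTS_PER_VCPU
print(total_slots)  # 64: every partition gets a dedicated consumer subtask
```

If slots and partitions diverge, some subtasks consume multiple partitions (or sit idle), which skews load across the cluster.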

&lt;p&gt;&lt;strong&gt;ClickHouse Cluster:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6× r6id.4xlarge instances (16 vCPU, 128GB RAM, 950GB NVMe)&lt;/li&gt;
&lt;li&gt;Distributed analytics with real-time ingestion&lt;/li&gt;
&lt;li&gt;Optimized for sub-second query performance&lt;/li&gt;
&lt;li&gt;Cost: ~$7,200/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Producer Infrastructure:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4× c5.4xlarge instances for load generation&lt;/li&gt;
&lt;li&gt;AVRO serialization for efficient data transfer&lt;/li&gt;
&lt;li&gt;Cost: ~$1,920/month&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Cost Breakdown
&lt;/h3&gt;

&lt;p&gt;We used infrastructure cost-analysis tooling (Infracost, run against the Terraform configuration) to price this setup:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Instance Type&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pulsar (Broker+Bookie)&lt;/td&gt;
&lt;td&gt;i7i.8xlarge&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;$12,960&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ClickHouse&lt;/td&gt;
&lt;td&gt;r6id.4xlarge&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;$7,200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flink&lt;/td&gt;
&lt;td&gt;c5.4xlarge&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;$2,400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Producers&lt;/td&gt;
&lt;td&gt;c5.4xlarge&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;$1,920&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supporting Infrastructure&lt;/td&gt;
&lt;td&gt;Various&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;$112&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$24,592&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Total Estimated Monthly Cost: $24,592&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The primary cost drivers are the large EC2 instances required to handle the 1 million events/sec workload, particularly for the Pulsar brokers and ClickHouse nodes.&lt;/p&gt;
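&lt;p&gt;The total can be reproduced from the per-component figures in the table (a quick sanity check, not new data):&lt;/p&gt;

```python
# Sketch: reproducing the self-hosted monthly total from the cost table above.

components = {
    "Pulsar (6x i7i.8xlarge)":      12_960,
    "ClickHouse (6x r6id.4xlarge)":  7_200,
    "Flink (4x c5.4xlarge)":         2_400,
    "Producers (4x c5.4xlarge)":     1_920,
    "Supporting infrastructure":       112,
}
total = sum(components.values())
print(f"${total:,}/month")  # $24,592/month
```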

&lt;h2&gt;
  
  
  ☁️ The AWS Managed Stack: Convenience at a Premium
&lt;/h2&gt;

&lt;p&gt;In this approach, we replace our self-hosted components with their AWS-native counterparts. This offloads the operational burden of managing the infrastructure to AWS.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Technology Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Managed Kafka:&lt;/strong&gt; Amazon MSK (Managed Streaming for Apache Kafka)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed Flink:&lt;/strong&gt; Amazon Kinesis Data Analytics for Apache Flink&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse Equivalent:&lt;/strong&gt; Amazon Redshift Serverless&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Service Configuration &amp;amp; Assumptions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Amazon MSK:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1KB average event size&lt;/li&gt;
&lt;li&gt;1-day data retention&lt;/li&gt;
&lt;li&gt;High throughput configuration&lt;/li&gt;
&lt;li&gt;Multi-AZ deployment for reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Kinesis Data Analytics for Flink:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;64 Kinesis Processing Units (KPUs)&lt;/li&gt;
&lt;li&gt;Continuous processing (24/7)&lt;/li&gt;
&lt;li&gt;Auto-scaling enabled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Amazon Redshift Serverless:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1-month data retention&lt;/li&gt;
&lt;li&gt;High-performance analytics workload&lt;/li&gt;
&lt;li&gt;On-demand scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Cost Breakdown
&lt;/h3&gt;

&lt;p&gt;Estimating the cost for a managed stack at this scale requires several assumptions. Based on AWS pricing and typical configurations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Amazon MSK&lt;/td&gt;
&lt;td&gt;High throughput, 1-day retention&lt;/td&gt;
&lt;td&gt;$30,525&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kinesis Data Analytics&lt;/td&gt;
&lt;td&gt;64 KPUs, continuous processing&lt;/td&gt;
&lt;td&gt;$81,180&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Redshift Serverless&lt;/td&gt;
&lt;td&gt;1-month retention, analytics&lt;/td&gt;
&lt;td&gt;$131,328&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$243,033&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Total Estimated Monthly Cost: ~$243,033&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  📊 The Cost Comparison: A Dramatic Difference
&lt;/h2&gt;

&lt;p&gt;Let's put these numbers side-by-side:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;th&gt;Cost per Million Events&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-Hosted on EC2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$24,592&lt;/td&gt;
&lt;td&gt;$0.0094&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS Managed Services&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$243,033&lt;/td&gt;
&lt;td&gt;$0.0932&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Difference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+888% (≈9.9×)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+888% (≈9.9×)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The difference is stark: the AWS managed stack is roughly &lt;strong&gt;10 times more expensive&lt;/strong&gt; than the self-hosted approach for this high-throughput scenario.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Analysis by Component
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Messaging Layer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-hosted Pulsar: $12,960/month&lt;/li&gt;
&lt;li&gt;AWS MSK: $30,525/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Premium:&lt;/strong&gt; 235%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stream Processing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-hosted Flink: $2,400/month
&lt;/li&gt;
&lt;li&gt;Kinesis Data Analytics: $81,180/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Premium:&lt;/strong&gt; 3,382%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analytics Storage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-hosted ClickHouse: $7,200/month&lt;/li&gt;
&lt;li&gt;Redshift Serverless: $131,328/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Premium:&lt;/strong&gt; 1,824%&lt;/li&gt;
&lt;/ul&gt;
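&lt;p&gt;The premiums above are each component's managed cost expressed as a percentage of its self-hosted cost. A short sketch reproducing them from the figures already listed:&lt;/p&gt;

```python
# Sketch: per-component premiums, computed as managed cost over self-hosted
# cost (x100, truncated), matching how the article quotes them.

pairs = {
    "messaging":  (30_525, 12_960),   # MSK vs. self-hosted Pulsar
    "processing": (81_180,  2_400),   # Kinesis Data Analytics vs. Flink
    "analytics": (131_328,  7_200),   # Redshift Serverless vs. ClickHouse
}
premiums = {name: int(100 * managed / self_hosted)
            for name, (managed, self_hosted) in pairs.items()}
print(premiums)
```

Note how lopsided the spread is: stream processing carries by far the largest markup, which is why it dominates the managed bill.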

&lt;h2&gt;
  
  
  🤔 Beyond the Numbers: Understanding the Trade-offs
&lt;/h2&gt;

&lt;p&gt;So, why would anyone choose the managed stack given the massive price difference? The answer lies in the trade-offs between cost and operational overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Case for Self-Hosting
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Dramatic Cost Savings:&lt;/strong&gt; 90% lower infrastructure costs&lt;br&gt;
✅ &lt;strong&gt;Complete Control:&lt;/strong&gt; Fine-grained tuning and optimization&lt;br&gt;
✅ &lt;strong&gt;No Vendor Lock-in:&lt;/strong&gt; Portable across cloud providers&lt;br&gt;
✅ &lt;strong&gt;Technology Choice:&lt;/strong&gt; Use cutting-edge open-source features&lt;br&gt;
✅ &lt;strong&gt;Performance Optimization:&lt;/strong&gt; Custom configurations for specific workloads&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenges:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;❌ &lt;strong&gt;High Operational Overhead:&lt;/strong&gt; Full responsibility for infrastructure management&lt;br&gt;
❌ &lt;strong&gt;Expertise Required:&lt;/strong&gt; Deep knowledge of distributed systems needed&lt;br&gt;
❌ &lt;strong&gt;Time Investment:&lt;/strong&gt; Significant setup and maintenance effort&lt;br&gt;
❌ &lt;strong&gt;Scaling Complexity:&lt;/strong&gt; Manual scaling and capacity planning&lt;br&gt;
❌ &lt;strong&gt;Security Responsibility:&lt;/strong&gt; Comprehensive security management required&lt;/p&gt;

&lt;h3&gt;
  
  
  The Case for Managed Services
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Reduced Operational Overhead:&lt;/strong&gt; AWS handles infrastructure management&lt;br&gt;
✅ &lt;strong&gt;Built-in Scalability:&lt;/strong&gt; Auto-scaling and high availability&lt;br&gt;
✅ &lt;strong&gt;Faster Time to Market:&lt;/strong&gt; Rapid deployment without infrastructure setup&lt;br&gt;
✅ &lt;strong&gt;Enterprise Features:&lt;/strong&gt; Built-in monitoring, security, and compliance&lt;br&gt;
✅ &lt;strong&gt;Support:&lt;/strong&gt; Professional support from AWS&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenges:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;❌ &lt;strong&gt;Significant Cost Premium:&lt;/strong&gt; 10x higher costs for high-throughput workloads&lt;br&gt;
❌ &lt;strong&gt;Vendor Lock-in:&lt;/strong&gt; Tied to AWS ecosystem&lt;br&gt;
❌ &lt;strong&gt;Limited Control:&lt;/strong&gt; Constrained by service limitations&lt;br&gt;
❌ &lt;strong&gt;Feature Lag:&lt;/strong&gt; May not have latest open-source features&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 When to Choose Each Approach
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Choose Self-Hosted When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost is Critical:&lt;/strong&gt; Operating at scale where managed service costs become prohibitive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance is Key:&lt;/strong&gt; Need maximum performance through custom tuning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team Expertise:&lt;/strong&gt; Have experienced platform engineering team&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term Investment:&lt;/strong&gt; Building for sustained high-volume workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cloud Strategy:&lt;/strong&gt; Want to avoid vendor lock-in&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choose Managed Services When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed to Market:&lt;/strong&gt; Need to ship quickly without infrastructure complexity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small Team:&lt;/strong&gt; Limited platform engineering resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variable Workloads:&lt;/strong&gt; Unpredictable or seasonal traffic patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance Focus:&lt;/strong&gt; Need built-in enterprise compliance features&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prototype/MVP:&lt;/strong&gt; Testing concepts before committing to self-hosted infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💡 Hybrid Approaches &amp;amp; Optimization Strategies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Cost Optimization for Self-Hosted
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reserved Instances:&lt;/strong&gt; 40-60% savings with 1-3 year commitments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spot Instances:&lt;/strong&gt; Up to 70% savings for fault-tolerant components&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right-sizing:&lt;/strong&gt; Regular capacity planning and instance optimization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-scaling:&lt;/strong&gt; Implement demand-based scaling&lt;/li&gt;
&lt;/ol&gt;
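&lt;p&gt;As an illustration of how these levers compound, here is a sketch applying a 40% reserved-instance discount to the stateful tiers and a 70% spot discount to the fault-tolerant producers. The split between tiers is an assumption for illustration, not a recommendation:&lt;/p&gt;

```python
# Sketch: compounding the discount levers above on the $24,592/month
# baseline. The 40% RI and 70% spot figures are the ranges quoted in the
# list; which tiers qualify for each is an assumption.

ri_discount = 0.40    # reserved instances on stateful tiers
spot_discount = 0.70  # spot on fault-tolerant load generators

stateful = 12_960 + 7_200 + 2_400 + 112  # Pulsar, ClickHouse, Flink, misc
producers = 1_920

optimized = stateful * (1 - ri_discount) + producers * (1 - spot_discount)
print(round(optimized))  # roughly $14.2K/month
```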

&lt;h3&gt;
  
  
  Hybrid Architecture Considerations
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example hybrid approach&lt;/span&gt;
&lt;span class="na"&gt;Message Ingestion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS MSK (managed complexity)&lt;/span&gt;
&lt;span class="na"&gt;Stream Processing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Self-hosted Flink (cost optimization)&lt;/span&gt;
&lt;span class="na"&gt;Analytics Storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Self-hosted ClickHouse (performance optimization)&lt;/span&gt;
&lt;span class="na"&gt;Monitoring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS CloudWatch (convenience)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📈 ROI Analysis: Break-Even Points
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Total Cost of Ownership (TCO) Considerations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Self-Hosted Additional Costs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Platform engineering team: ~$400K-600K/year (2-3 engineers)&lt;/li&gt;
&lt;li&gt;Operations overhead: ~20-30% additional management time&lt;/li&gt;
&lt;li&gt;Training and certifications: ~$20K/year&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Managed Services Hidden Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced hiring needs&lt;/li&gt;
&lt;li&gt;Faster feature delivery&lt;/li&gt;
&lt;li&gt;Lower operational risk&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Break-Even Analysis
&lt;/h3&gt;

&lt;p&gt;For our 1M events/sec workload:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost difference:&lt;/strong&gt; $218K/month ($2.6M/year)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineering team cost:&lt;/strong&gt; ~$500K/year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Net savings with self-hosted:&lt;/strong&gt; ~$2.1M/year&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The break-even point strongly favors self-hosting for high-throughput, sustained workloads.&lt;/p&gt;
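&lt;p&gt;The arithmetic behind that figure (the ~$500K/year team cost is this article's estimate, not a universal number):&lt;/p&gt;

```python
# Sketch: the break-even math above -- annual infrastructure savings minus
# the platform engineering team needed to realize them.

managed_monthly = 243_033
self_hosted_monthly = 24_592
team_cost_yearly = 500_000  # article's estimate for 2-3 engineers

infra_savings_yearly = (managed_monthly - self_hosted_monthly) * 12
net_savings = infra_savings_yearly - team_cost_yearly
print(f"${net_savings:,}/year")  # ~$2.1M/year net in favor of self-hosting
```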

&lt;h2&gt;
  
  
  🎓 Conclusion
&lt;/h2&gt;

&lt;p&gt;The choice between self-hosting and managed services is not a one-size-fits-all decision, but the cost implications are dramatic at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;For High-Throughput Workloads:&lt;/strong&gt; Self-hosting can provide 90% cost savings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expertise Matters:&lt;/strong&gt; Success requires skilled platform engineering teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale is Key:&lt;/strong&gt; The larger your workload, the more self-hosting makes financial sense&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time Horizon:&lt;/strong&gt; Long-term, sustained workloads favor self-hosting&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Decision Framework
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Choose Self-Hosted If:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processing &amp;gt;100K events/sec sustained&lt;/li&gt;
&lt;li&gt;Have platform engineering expertise&lt;/li&gt;
&lt;li&gt;Cost optimization is critical&lt;/li&gt;
&lt;li&gt;Long-term workload (&amp;gt;2 years)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Managed Services If:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Getting started or prototyping&lt;/li&gt;
&lt;li&gt;Small engineering team&lt;/li&gt;
&lt;li&gt;Variable/unpredictable workloads&lt;/li&gt;
&lt;li&gt;Time to market is critical&lt;/li&gt;
&lt;/ul&gt;
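&lt;p&gt;The framework above can be condensed into a crude heuristic. The &lt;code&gt;recommend&lt;/code&gt; function and its thresholds (100K events/sec sustained, 2-year horizon) simply encode the checklist; they are illustrative, not a real decision API:&lt;/p&gt;

```python
# Sketch: the decision framework above as a heuristic. Thresholds encode the
# checklist (>100K events/sec sustained, >2-year workload); real decisions
# should weigh compliance, hiring, and TCO as discussed in the article.

def recommend(events_per_sec: int, has_platform_team: bool,
              workload_years: float, time_to_market_critical: bool) -> str:
    if time_to_market_critical or not has_platform_team:
        return "managed"
    if events_per_sec > 100_000 and workload_years > 2:
        return "self-hosted"
    return "managed"

print(recommend(1_000_000, True, 3, False))  # self-hosted
print(recommend(50_000, False, 1, True))     # managed
```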

&lt;p&gt;For a high-throughput workload of 1 million events per second, the cost of managed services can be substantial. It's crucial to weigh the significant cost premium against the benefits of offloading the operational complexity to your cloud provider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bottom line:&lt;/strong&gt; At enterprise scale, self-hosting open-source streaming infrastructure can deliver massive cost savings while providing superior performance and control—if you have the team to manage it effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complete Implementation:&lt;/strong&gt; &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events" rel="noopener noreferrer"&gt;RealtimeDataPlatform/realtime-platform-1million-events&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure Code:&lt;/strong&gt; &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;Terraform configurations and Helm charts&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Analysis Tools:&lt;/strong&gt; AWS Pricing Calculator, Infracost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Benchmarks:&lt;/strong&gt; &lt;a href="https://pulsar.apache.org/" rel="noopener noreferrer"&gt;Pulsar vs Kafka&lt;/a&gt;, &lt;a href="https://clickhouse.com/benchmark" rel="noopener noreferrer"&gt;ClickHouse Performance&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Have you made this choice in your organization? What factors influenced your decision? Share your experience in the comments!&lt;/strong&gt; 👇&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow me for more deep dives on cloud architecture, cost optimization, and distributed systems!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #aws #cost #architecture #streaming #devops #realtime #pulsar #flink #clickhouse #msk #kinesis&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>performance</category>
      <category>architecture</category>
      <category>aws</category>
    </item>
    <item>
      <title>Real-Time Data Streaming Platform: From 140K to 1 Million Messages/Sec - A Flink Performance Tuning Journey</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 13:18:28 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/real-time-data-streaming-platform-from-140k-to-1-million-messagessec-a-flink-performance-tuning-1k36</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/real-time-data-streaming-platform-from-140k-to-1-million-messagessec-a-flink-performance-tuning-1k36</guid>
      <description>&lt;p&gt;Performance tuning a distributed streaming system is a journey of discovery, experimentation, and learning. This is the story of how I scaled a Flink streaming job from &lt;strong&gt;140K messages/sec to 1 million messages/sec&lt;/strong&gt; - a &lt;strong&gt;7x improvement&lt;/strong&gt; through systematic optimization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spoiler alert:&lt;/strong&gt; The bottleneck wasn't where I expected!&lt;/p&gt;

&lt;h2&gt;
  
  
  🏗️ Real-Time Data Streaming Platform Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────────┐
│                Real-Time Data Streaming Platform                        │
│                        AWS EKS (Kubernetes 1.31)                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌───────────┐ │
│  │   PRODUCER  │───▶│   PULSAR    │───▶│    FLINK    │───▶│CLICKHOUSE │ │
│  │             │    │             │    │             │    │           │ │
│  │ IoT Sensors │    │ Message     │    │ Stream      │    │ Analytics │ │
│  │ AVRO Data   │    │ Broker      │    │ Processing  │    │ Database  │ │
│  │             │    │             │    │             │    │           │ │
│  │ 4x c5.4xl   │    │ 6x i7i.8xl  │    │ 4x c5.4xl   │    │6x r6id.4xl│ │
│  │ 250K/sec    │    │ Partitions  │    │ Parallelism │    │ Real-time │ │
│  │ each node   │    │ 64          │    │ 64          │    │ Queries   │ │
│  └─────────────┘    └─────────────┘    └─────────────┘    └───────────┘ │
│                                                                         │
│  Data Flow:                                                             │
│  300-byte AVRO ──▶ Pulsar Topics ──▶ keyBy(device_id) ──▶ ClickHouse   │
│  IoT Messages       (Persistent)      1-min Windows       Analytics    │
│                                                                         │
│  Performance Target: 1,000,000 messages/sec end-to-end                 │
└─────────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pipeline Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Producers&lt;/strong&gt;: Generate 300-byte AVRO-serialized IoT sensor data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar&lt;/strong&gt;: Distributed message broker with 64 partitions for parallel processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink&lt;/strong&gt;: Stream processing engine with 64-way parallelism for aggregations
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse&lt;/strong&gt;: Real-time analytics database for sub-second queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Challenge:&lt;/strong&gt; The initial setup achieved only &lt;strong&gt;140K msg/sec&lt;/strong&gt; against the 1M msg/sec target!&lt;/p&gt;
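&lt;p&gt;For reference, the producer-side arithmetic implied by the diagram (4 producer nodes at 250K msg/sec each, 300-byte AVRO payloads):&lt;/p&gt;

```python
# Sketch: the load the pipeline must absorb, from the diagram's figures.

producers = 4
rate_per_node = 250_000   # msg/sec per c5.4xlarge producer
msg_bytes = 300           # AVRO-serialized sensor record

total_rate = producers * rate_per_node
ingest_mb_per_sec = total_rate * msg_bytes / 1_000_000
print(total_rate, ingest_mb_per_sec)  # 1,000,000 msg/sec at 300 MB/sec
```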

&lt;h2&gt;
  
  
  🎯 The Starting Point
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Initial Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Process 1 million messages/sec from Pulsar&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message Size&lt;/strong&gt;: 300 bytes (AVRO-serialized sensor data)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Job&lt;/strong&gt;: Source → keyBy → Window (1-min) → Aggregate → Sink (ClickHouse)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result&lt;/strong&gt;: Only &lt;strong&gt;140K msg/sec&lt;/strong&gt; 😱&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Something was clearly wrong. Time to dig in.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔍 Understanding the Flink Job Structure
&lt;/h2&gt;

&lt;p&gt;Before tuning, I needed to understand what I was working with. You can find the complete implementation in the &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events/flink-load" rel="noopener noreferrer"&gt;flink-load directory&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// JDBCFlinkConsumer.java - The Pipeline&lt;/span&gt;
&lt;span class="nc"&gt;PulsarSource&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;SensorRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PulsarSource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setServiceUrl&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pulsarUrl&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setTopics&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"persistent://public/default/iot-sensor-data"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setDeserializationSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AvroSensorDataDeserializationSchema&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;SensorRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sensorStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...);&lt;/span&gt;

&lt;span class="n"&gt;sensorStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;keyBy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;device_id&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// ← Data shuffle happens here!&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TumblingProcessingTimeWindows&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;aggregate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SensorAggregator&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ClickHouseJDBCSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clickhouseUrl&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


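Before looking at the chain, it helps to see the shape of the aggregation itself. The real `SensorAggregator` lives in the repo's flink-load directory; the sketch below only imitates the add/merge/emit pattern a Flink `AggregateFunction` follows, using plain JDK types, and the `temperature` field is an assumption for illustration:

```java
import java.util.DoubleSummaryStatistics;

public class SensorAggregatorSketch {
    // add(): fold one reading into the per-device, per-window accumulator
    static DoubleSummaryStatistics add(DoubleSummaryStatistics acc, double temperature) {
        acc.accept(temperature);
        return acc;
    }

    // merge(): combine two partial accumulators
    static DoubleSummaryStatistics merge(DoubleSummaryStatistics a, DoubleSummaryStatistics b) {
        a.combine(b);
        return a;
    }

    public static void main(String[] args) {
        DoubleSummaryStatistics acc = new DoubleSummaryStatistics();
        for (double t : new double[]{21.0, 23.5, 22.0}) {
            add(acc, t);
        }
        // getResult() equivalent: emit count/min/max/avg for the window
        System.out.println(acc.getCount() + " readings, max " + acc.getMax()); // 3 readings, max 23.5
    }
}
```

Because only the aggregate (one row per device per window) reaches the sink, ClickHouse sees thousands of rows per minute instead of a million per second.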

&lt;h3&gt;
  
  
  The Operator Chain
&lt;/h3&gt;

&lt;p&gt;Flink optimizes by &lt;strong&gt;chaining operators&lt;/strong&gt; together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task 1 (Source Group):
  └─ Pulsar Source (I/O-bound)
     └─ AVRO Deserialize
        └─ keyBy (compute hash, network shuffle)

Task 2 (Window Group):
  └─ Window Aggregate (CPU-bound)
     └─ ClickHouse Sink (I/O-bound)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; The &lt;code&gt;keyBy()&lt;/code&gt; causes data shuffling between Task 1 and Task 2. This creates &lt;strong&gt;2 separate task groups&lt;/strong&gt; that need slots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total slots needed = Parallelism&lt;/strong&gt; (not parallelism × 4 operators: chaining collapses the four operators into two task groups, and Flink's default slot sharing lets one slot run a subtask from each group)&lt;/p&gt;
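A minimal sketch of that slot arithmetic (illustrative, not repo code), comparing Flink's default slot sharing against what the two task groups would cost without it:

```java
public class SlotMath {
    // With Flink's default slot sharing, one slot can host one subtask
    // from every task group, so a job needs slots equal to its parallelism.
    static int slotsWithSharing(int parallelism) {
        return parallelism;
    }

    // If each of the two task groups needed its own slots instead:
    static int slotsWithoutSharing(int parallelism, int taskGroups) {
        return parallelism * taskGroups;
    }

    public static void main(String[] args) {
        System.out.println(slotsWithSharing(64));       // 64 slots
        System.out.println(slotsWithoutSharing(64, 2)); // 128 slots
    }
}
```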

&lt;h2&gt;
  
  
  📊 Phase 1: Initial Configuration (140K msg/sec)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# FlinkDeployment configuration&lt;/span&gt;
&lt;span class="na"&gt;parallelism&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
&lt;span class="na"&gt;pulsar_partitions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;

&lt;span class="c1"&gt;# Resource allocation&lt;/span&gt;
&lt;span class="na"&gt;taskmanager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;      &lt;span class="c1"&gt;# 2 vCPUs per TaskManager&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4Gi&lt;/span&gt;

&lt;span class="c1"&gt;# Task slot mapping&lt;/span&gt;
&lt;span class="na"&gt;taskmanager.numberOfTaskSlots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;  &lt;span class="c1"&gt;# 2 slots per vCPU (2:1 ratio)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Calculation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4 TaskManagers × 2 slots = &lt;strong&gt;8 total slots&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Parallelism = 8&lt;/li&gt;
&lt;li&gt;Each slot gets: 2 vCPUs / 2 slots = &lt;strong&gt;1 vCPU per slot&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Instance Type:&lt;/strong&gt; c5.2xlarge (8 vCPU, 16 GB RAM)&lt;/p&gt;

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Metrics after 5 minutes&lt;/span&gt;
Records In:  140,000 msg/sec
Records Out: 2,300 aggregated records/min
CPU Usage:   85-95% &lt;span class="o"&gt;(&lt;/span&gt;maxed out!&lt;span class="o"&gt;)&lt;/span&gt;
Backpressure: HIGH on &lt;span class="nb"&gt;source &lt;/span&gt;operators
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Problem Identified:&lt;/strong&gt; 8 Pulsar partitions capped effective parallelism at 8, far too little for the target throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 Phase 2: Scale Parallelism (480K msg/sec)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Hypothesis
&lt;/h3&gt;

&lt;p&gt;If 8 parallel instances handle 140K msg/sec, then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-instance rate: 140K / 8 = 17,500 msg/sec&lt;/li&gt;
&lt;li&gt;For 1M msg/sec: 1,000,000 / 17,500 ≈ &lt;strong&gt;57 instances needed&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's round up to &lt;strong&gt;64&lt;/strong&gt;, a clean power of 2.&lt;/p&gt;
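The same hypothesis as runnable arithmetic (a sketch; the 140K baseline and 1M target are the figures above):

```java
public class ScalingMath {
    // Smallest power of two that is at least `needed`
    // (powers of two keep partition assignment even).
    static int nextPowerOfTwoAtLeast(double needed) {
        int n = (int) Math.ceil(needed);
        int p = Integer.highestOneBit(n); // highest set bit of n
        return (p == n) ? p : p * 2;
    }

    public static void main(String[] args) {
        double perInstance = 140_000.0 / 8;        // ~17,500 msg/sec per parallel instance
        double needed = 1_000_000.0 / perInstance; // ~57.1 instances to reach 1M
        System.out.println(nextPowerOfTwoAtLeast(needed)); // 64
    }
}
```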

&lt;h3&gt;
  
  
  Configuration Changes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Increase Pulsar partitions to 64&lt;/span&gt;
&lt;span class="na"&gt;pulsar_partitions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;

&lt;span class="c1"&gt;# Increase Flink parallelism to match&lt;/span&gt;
&lt;span class="na"&gt;parallelism&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;

&lt;span class="c1"&gt;# Task slot mapping (2:1 ratio maintained)&lt;/span&gt;
&lt;span class="na"&gt;taskmanager.numberOfTaskSlots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;  &lt;span class="c1"&gt;# Still 2 slots per vCPU&lt;/span&gt;

&lt;span class="c1"&gt;# Calculate required TaskManagers:&lt;/span&gt;
&lt;span class="c1"&gt;# 64 slots needed / 2 slots per TM = 32 TaskManagers&lt;/span&gt;
&lt;span class="c1"&gt;# OR with 8 vCPU machines: 64 slots / 16 slots per machine = 4 machines&lt;/span&gt;
&lt;span class="na"&gt;taskmanager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;16&lt;/span&gt;  &lt;span class="c1"&gt;# Using c5.2xlarge (8 vCPU, 16 slots per machine)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Wait... 16 TaskManagers on c5.2xlarge?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let me recalculate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;c5.2xlarge: 8 vCPUs&lt;/li&gt;
&lt;li&gt;Task slots: 2 per vCPU = 16 slots per machine&lt;/li&gt;
&lt;li&gt;Need 64 slots total&lt;/li&gt;
&lt;li&gt;Machines needed: 64 / 16 = &lt;strong&gt;4 machines&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Corrected configuration&lt;/span&gt;
&lt;span class="na"&gt;taskmanager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;  &lt;span class="c1"&gt;# 4 × c5.2xlarge machines&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;      &lt;span class="c1"&gt;# Full 8 vCPUs&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;16Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results - Phase 2
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Metrics after deployment&lt;/span&gt;
Records In:  480,000 msg/sec  &lt;span class="o"&gt;(&lt;/span&gt;3.4x improvement!&lt;span class="o"&gt;)&lt;/span&gt;
Records Out: 8,000 aggregated records/min
CPU Usage:   65-75% per TaskManager
Backpressure: MEDIUM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Progress:&lt;/strong&gt; 140K → 480K msg/sec ✅&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But still far from 1M!&lt;/strong&gt; What's the bottleneck now?&lt;/p&gt;

&lt;h2&gt;
  
  
  🔧 Phase 3: CPU Resource Tuning (600K msg/sec)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Investigation
&lt;/h3&gt;

&lt;p&gt;Looking at the Flink deployment YAML in the &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events/flink-load" rel="noopener noreferrer"&gt;repository&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# flink-job-deployment.yaml - JobManager pod spec&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;job&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;jarURI&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local:///opt/flink/usrlib/flink-consumer-1.0.0.jar&lt;/span&gt;
    &lt;span class="na"&gt;parallelism&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;
    &lt;span class="na"&gt;upgradeMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stateless&lt;/span&gt;
  &lt;span class="na"&gt;flinkConfiguration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;taskmanager.numberOfTaskSlots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
    &lt;span class="c1"&gt;# CPU limit in deployment&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Found it!&lt;/strong&gt; The TaskManager pod resource definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;       &lt;span class="c1"&gt;# Only 1 CPU requested! 🚨&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4Gi"&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;       &lt;span class="c1"&gt;# And max 2 CPUs&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Each TaskManager was throttled to 2 CPUs max, but we have &lt;strong&gt;8 vCPU machines&lt;/strong&gt;!&lt;/p&gt;
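Spelled out with the numbers from that pod spec (a sketch; Kubernetes guarantees only the request and throttles at the limit):

```java
public class CpuPerSlot {
    // CPU available to each of a TaskManager's slots, given a budget.
    static double perSlot(double cpuBudget, int slots) {
        return cpuBudget / slots;
    }

    public static void main(String[] args) {
        int slotsPerTaskManager = 2;
        double requestedCpu = 1.0;  // guaranteed by the scheduler
        double limitCpu = 2.0;      // hard throttle ceiling
        double machineCpu = 8.0;    // c5.2xlarge node

        System.out.println("guaranteed per slot: " + perSlot(requestedCpu, slotsPerTaskManager)); // 0.5
        System.out.println("ceiling per slot:    " + perSlot(limitCpu, slotsPerTaskManager));     // 1.0
        System.out.println("idle on the node:    " + (machineCpu - limitCpu));                    // 6.0
    }
}
```

Six of the eight vCPUs on each node were sitting idle while the TaskManagers throttled.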

&lt;h3&gt;
  
  
  The Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Updated TaskManager resources&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5"&lt;/span&gt;       &lt;span class="c1"&gt;# Request 5 CPUs (was 1)&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;       &lt;span class="c1"&gt;# Allow up to 8 CPUs (was 2)&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results - Phase 3
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# After CPU increase&lt;/span&gt;
Records In:  600,000 msg/sec  &lt;span class="o"&gt;(&lt;/span&gt;1.25x improvement!&lt;span class="o"&gt;)&lt;/span&gt;
CPU Usage:   75-85% per TaskManager &lt;span class="o"&gt;(&lt;/span&gt;better utilization&lt;span class="o"&gt;)&lt;/span&gt;
Backpressure: LOW → MEDIUM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Progress:&lt;/strong&gt; 480K → 600K msg/sec ✅&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Still not 1M.&lt;/strong&gt; Time for the real detective work!&lt;/p&gt;

&lt;h2&gt;
  
  
  🔎 Phase 4: The Backlog Experiment (890K msg/sec)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Eureka Moment
&lt;/h3&gt;

&lt;p&gt;I noticed a &lt;strong&gt;huge backlog&lt;/strong&gt; forming in Pulsar (100M+ messages). So I tried an experiment:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stop all producers and let Flink catch up.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Stop producers&lt;/span&gt;
kubectl scale deployment iot-producer &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0

&lt;span class="c"&gt;# Watch Flink metrics&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &amp;lt;jobmanager&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; curl localhost:8081/jobs/&amp;lt;job-id&amp;gt;/metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Shocking Result
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Flink consuming from backlog (no new messages)&lt;/span&gt;
Records In:  890,000 msg/sec  😲

&lt;span class="c"&gt;# CPU and memory usage&lt;/span&gt;
CPU: 85-90% &lt;span class="o"&gt;(&lt;/span&gt;near max&lt;span class="o"&gt;)&lt;/span&gt;
Memory: Stable
Network: ~270 MB/sec ingress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; Flink could process &lt;strong&gt;890K msg/sec&lt;/strong&gt; when reading from Pulsar backlog, but only &lt;strong&gt;600K msg/sec&lt;/strong&gt; with live producers!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; &lt;strong&gt;Pulsar was the bottleneck!&lt;/strong&gt; 🎯&lt;/p&gt;

&lt;h2&gt;
  
  
  🏎️ Phase 5: Upgrade Pulsar Infrastructure (1M msg/sec)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Pulsar Problem
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Previous Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instance: i3en.6xlarge (24 vCPU, 192GB RAM, 2x 7.5TB NVMe)&lt;/li&gt;
&lt;li&gt;Bookies: 4 nodes&lt;/li&gt;
&lt;li&gt;Brokers: 4 nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Older generation (i3en)&lt;/li&gt;
&lt;li&gt;Only 4 bookies for 1M msg/sec&lt;/li&gt;
&lt;li&gt;Journal and ledger on same NVMe device&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Pulsar Upgrade
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Updated Terraform configuration&lt;/span&gt;
&lt;span class="nx"&gt;pulsar_broker_bookie_config&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;instance_types&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"i7i.8xlarge"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Upgraded from i3en.6xlarge&lt;/span&gt;
  &lt;span class="nx"&gt;desired_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;                   &lt;span class="c1"&gt;# Increased from 4&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;i7i.8xlarge Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Newer generation (better CPU IPC)&lt;/li&gt;
&lt;li&gt;32 vCPUs (vs 24 on i3en)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2× NVMe devices&lt;/strong&gt; (3.75TB each)&lt;/li&gt;
&lt;li&gt;Lower per-device latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;NVMe Device Separation (Critical!):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Device mapping on each i7i.8xlarge bookie&lt;/span&gt;
/dev/nvme1n1 → /mnt/bookkeeper/journal  &lt;span class="c"&gt;# Journal (WAL)&lt;/span&gt;
/dev/nvme2n1 → /mnt/bookkeeper/ledgers  &lt;span class="c"&gt;# Ledgers (Data)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separation eliminates I/O contention between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Journal&lt;/strong&gt;: Sequential writes (low latency critical)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ledgers&lt;/strong&gt;: Random reads/writes (capacity critical)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Results - Phase 5
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# After Pulsar upgrade&lt;/span&gt;
Records In:  1,040,000 msg/sec  🎉
CPU &lt;span class="o"&gt;(&lt;/span&gt;Flink&lt;span class="o"&gt;)&lt;/span&gt;: 80-85% per TaskManager
CPU &lt;span class="o"&gt;(&lt;/span&gt;Pulsar&lt;span class="o"&gt;)&lt;/span&gt;: 70-75% per Broker/Bookie
Backpressure: NONE to LOW
End-to-end latency: &amp;lt;2 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;SUCCESS:&lt;/strong&gt; 600K → 1,040K msg/sec ✅&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 Phase 6: Final Flink Optimization (1.04M msg/sec sustained)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Last Mile
&lt;/h3&gt;

&lt;p&gt;Even with Pulsar fixed, I wanted to optimize Flink further. The &lt;strong&gt;2:1 slot-to-CPU ratio&lt;/strong&gt; was still suboptimal for our CPU-heavy aggregation workload.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration Change
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Final Flink configuration&lt;/span&gt;
&lt;span class="na"&gt;taskmanager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;  &lt;span class="c1"&gt;# 4 × c5.4xlarge machines&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;16&lt;/span&gt;     &lt;span class="c1"&gt;# Full 16 vCPUs per TM&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;32Gi&lt;/span&gt;

&lt;span class="c1"&gt;# Slot configuration&lt;/span&gt;
&lt;span class="na"&gt;taskmanager.numberOfTaskSlots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;16&lt;/span&gt;  &lt;span class="c1"&gt;# 1:1 ratio (was 2:1)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;New Calculation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4 TaskManagers × 16 slots = &lt;strong&gt;64 total slots&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Each slot gets: 16 vCPUs / 16 slots = &lt;strong&gt;1 vCPU per slot&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Instance Upgrade:&lt;/strong&gt; c5.2xlarge → c5.4xlarge (16 vCPU, 32 GB RAM)&lt;/p&gt;

&lt;h3&gt;
  
  
  Why 1:1 Ratio Works Better
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Our Pipeline Analysis:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Source&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="no"&gt;O&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;keyBy&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;Window&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;CPU&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;Aggregate&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;CPU&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;Sink&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="no"&gt;O&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
   &lt;span class="err"&gt;🟢&lt;/span&gt;                    &lt;span class="err"&gt;🔴&lt;/span&gt;             &lt;span class="err"&gt;🔴&lt;/span&gt;               &lt;span class="err"&gt;🟢&lt;/span&gt;

&lt;span class="no"&gt;CPU&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;intensive&lt;/span&gt; &lt;span class="nl"&gt;operators:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Window&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nc"&gt;Aggregate&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="no"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="no"&gt;O&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;bound&lt;/span&gt; &lt;span class="nl"&gt;operators:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Source&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nc"&gt;Sink&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Decision:&lt;/strong&gt; Since aggregation is CPU-heavy, a 1:1 ratio gives each task a dedicated vCPU.&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Results
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Sustained performance metrics&lt;/span&gt;
Records In:  1,040,000 msg/sec
Records Out: 17,333 aggregated records/min
CPU Usage:   75-80% per TaskManager &lt;span class="o"&gt;(&lt;/span&gt;optimal&lt;span class="o"&gt;)&lt;/span&gt;
Memory Usage: 60-70% &lt;span class="o"&gt;(&lt;/span&gt;plenty of headroom&lt;span class="o"&gt;)&lt;/span&gt;
Backpressure: NONE
GC Pressure: LOW
Checkpoint Duration: 5-8 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Final Achievement:&lt;/strong&gt; &lt;strong&gt;1,040,000 messages/sec sustained&lt;/strong&gt; 🏆&lt;/p&gt;

&lt;h2&gt;
  
  
  🧪 The Backlog Test - A Critical Technique
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why This Test Matters
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;backlog consumption test&lt;/strong&gt; reveals your true system capacity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The test process&lt;/span&gt;
1. Run producers at max speed &lt;span class="o"&gt;(&lt;/span&gt;build backlog&lt;span class="o"&gt;)&lt;/span&gt;
2. Stop producers completely
3. Measure Flink consumption from backlog
4. Compare: Backlog rate vs Live rate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What it tells you:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shows true Flink capacity&lt;/li&gt;
&lt;li&gt;Reveals whether Pulsar or Flink is the bottleneck&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In our case:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Live: 600K msg/sec&lt;/li&gt;
&lt;li&gt;Backlog: 890K msg/sec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conclusion:&lt;/strong&gt; Pulsar was limiting, not Flink!&lt;/li&gt;
&lt;/ul&gt;
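That comparison can be captured as a tiny heuristic. The 20% margin below is my own assumption, not something measured in the original test:

```java
public class BottleneckCheck {
    // If the consumer drains a pre-built backlog much faster than it keeps
    // up with live traffic, the broker (not the consumer) is the limiting
    // component. The 1.2 margin filters out measurement noise (assumption).
    static String diagnose(double liveRate, double backlogRate) {
        return (backlogRate > liveRate * 1.2) ? "broker-limited" : "consumer-limited";
    }

    public static void main(String[] args) {
        // Our measurements: 600K msg/sec live vs 890K msg/sec from backlog
        System.out.println(diagnose(600_000, 890_000)); // broker-limited
    }
}
```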

&lt;h2&gt;
  
  
  🔧 Final Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Flink Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# FlinkDeployment - Final configuration&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;flinkVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1_18&lt;/span&gt;

  &lt;span class="na"&gt;flinkConfiguration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;taskmanager.numberOfTaskSlots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16"&lt;/span&gt;
    &lt;span class="na"&gt;parallelism.default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64"&lt;/span&gt;
    &lt;span class="na"&gt;state.backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rocksdb&lt;/span&gt;
    &lt;span class="na"&gt;state.checkpoints.dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3://benchmark-high-infra-state/checkpoints&lt;/span&gt;
    &lt;span class="na"&gt;execution.checkpointing.interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60000&lt;/span&gt;
    &lt;span class="na"&gt;execution.checkpointing.mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EXACTLY_ONCE&lt;/span&gt;

  &lt;span class="na"&gt;jobManager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
    &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;

  &lt;span class="na"&gt;taskManager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32Gi"&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;16&lt;/span&gt;  &lt;span class="c1"&gt;# Full 16 vCPUs&lt;/span&gt;
    &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;  &lt;span class="c1"&gt;# 4 × c5.4xlarge machines&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Resource Summary:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4 TaskManagers on c5.4xlarge (16 vCPU, 32 GB each)&lt;/li&gt;
&lt;li&gt;64 task slots total (16 per TM)&lt;/li&gt;
&lt;li&gt;1:1 slot-to-vCPU ratio&lt;/li&gt;
&lt;li&gt;64 parallelism (matches Pulsar partitions)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pulsar Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pulsar values - Final configuration&lt;/span&gt;
&lt;span class="na"&gt;bookkeeper&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt;  &lt;span class="c1"&gt;# Increased from 4&lt;/span&gt;

  &lt;span class="c1"&gt;# NVMe device configuration&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;journal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;200Gi&lt;/span&gt;
      &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local-nvme&lt;/span&gt;  &lt;span class="c1"&gt;# /dev/nvme1n1&lt;/span&gt;
    &lt;span class="na"&gt;ledgers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1000Gi&lt;/span&gt;
      &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local-nvme&lt;/span&gt;  &lt;span class="c1"&gt;# /dev/nvme2n1&lt;/span&gt;

  &lt;span class="na"&gt;configData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;journalMaxSizeMB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2048"&lt;/span&gt;
    &lt;span class="na"&gt;journalSyncData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false"&lt;/span&gt;
    &lt;span class="na"&gt;journalAdaptiveGroupWrites&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="na"&gt;ledgerStorageClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage"&lt;/span&gt;

&lt;span class="na"&gt;broker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt;
  &lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;node-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;broker-bookie&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Instance Type:&lt;/strong&gt; i7i.8xlarge (32 vCPU, 256GB RAM, 2× 3.75TB NVMe)&lt;/p&gt;

&lt;h2&gt;
  
  
  📊 Performance Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Bottleneck&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Phase 1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8 parallel, c5.2xlarge, 2:1&lt;/td&gt;
&lt;td&gt;140K msg/sec&lt;/td&gt;
&lt;td&gt;Low parallelism&lt;/td&gt;
&lt;td&gt;Increase to 64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Phase 2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;64 parallel, c5.2xlarge, 2:1&lt;/td&gt;
&lt;td&gt;480K msg/sec&lt;/td&gt;
&lt;td&gt;CPU throttling&lt;/td&gt;
&lt;td&gt;Increase CPU limit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Phase 3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;64 parallel, c5.2xlarge, 2:1, 8 CPU&lt;/td&gt;
&lt;td&gt;600K msg/sec&lt;/td&gt;
&lt;td&gt;Pulsar capacity&lt;/td&gt;
&lt;td&gt;Upgrade Pulsar&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backlog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same config, no producers&lt;/td&gt;
&lt;td&gt;890K msg/sec&lt;/td&gt;
&lt;td&gt;Flink needs more CPU&lt;/td&gt;
&lt;td&gt;Upgrade to c5.4xlarge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Phase 4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;64 parallel, c5.4xlarge, 1:1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,040K msg/sec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None!&lt;/td&gt;
&lt;td&gt;✅ Success&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  💡 Best Practices for Flink at Scale
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Match Parallelism to Source Partitions&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pulsar Partitions = Flink Parallelism
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Optimal work distribution&lt;/li&gt;
&lt;li&gt;No partition skew&lt;/li&gt;
&lt;li&gt;Maximum throughput&lt;/li&gt;
&lt;/ul&gt;
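&lt;p&gt;The rule above can be expressed as a tiny planner. A sketch with names of my own choosing, not a Flink API; the skew calculation shows what happens when the two numbers do not match:&lt;/p&gt;

```java
// Sketch: pick Flink parallelism from the source partition count.
public class ParallelismPlanner {
    // One partition per subtask: no skew, no idle subtasks
    static int planParallelism(int pulsarPartitions) {
        return pulsarPartitions;
    }

    // Average partitions per subtask when the numbers do NOT match
    static double partitionsPerSubtask(int partitions, int parallelism) {
        return (double) partitions / parallelism;
    }

    public static void main(String[] args) {
        System.out.println(planParallelism(64));          // 64
        // Mismatched: some subtasks own 2 partitions, others 1 -> skew
        System.out.println(partitionsPerSubtask(64, 48)); // ~1.33
    }
}
```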

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Use the Right Slot-to-CPU Ratio&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Analyze your job operators&lt;/span&gt;
&lt;span class="nc"&gt;Source&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="no"&gt;O&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;keyBy&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;Window&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;CPU&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;Aggregate&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;CPU&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;Sink&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="no"&gt;O&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
   &lt;span class="err"&gt;🟢&lt;/span&gt;                      &lt;span class="err"&gt;🔴&lt;/span&gt;              &lt;span class="err"&gt;🔴&lt;/span&gt;               &lt;span class="err"&gt;🟢&lt;/span&gt;

&lt;span class="c1"&gt;// Count CPU-bound operators&lt;/span&gt;
&lt;span class="no"&gt;CPU&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nl"&gt;bound:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="n"&gt;operators&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="no"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="no"&gt;O&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nl"&gt;bound:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="n"&gt;operators&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sink&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Decision: 1:1 ratio (due to CPU-heavy aggregate)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
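&lt;p&gt;The decision rule sketched above can be written down directly. Purely illustrative (real profiling uses Flink's metrics, not a static count), and the names are mine:&lt;/p&gt;

```java
// Sketch: choose the slot-to-CPU ratio from the operator profile.
public class RatioRule {
    // Returns slots per vCPU: 1 (ratio 1:1) or 2 (ratio 2:1)
    static int slotsPerVcpu(int cpuBoundOps) {
        // Any CPU-heavy operator in the chain (window, aggregate) means each
        // slot should own a full vCPU; a purely I/O-bound pipeline can
        // oversubscribe with 2 slots per vCPU.
        return cpuBoundOps == 0 ? 2 : 1;
    }

    public static void main(String[] args) {
        // Source(I/O) -> keyBy -> Window(CPU) -> Aggregate(CPU) -> Sink(I/O)
        System.out.println(slotsPerVcpu(2)); // 1, i.e. the 1:1 ratio chosen here
        System.out.println(slotsPerVcpu(0)); // 2, i.e. 2:1 is safe for pure I/O jobs
    }
}
```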



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Right-Size Your TaskManager Instances&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Options for 64 slots with 1:1 ratio:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instance&lt;/th&gt;
&lt;th&gt;vCPU&lt;/th&gt;
&lt;th&gt;Machines&lt;/th&gt;
&lt;th&gt;Cost/mo&lt;/th&gt;
&lt;th&gt;Network&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;c5.2xlarge&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;$2,200&lt;/td&gt;
&lt;td&gt;Up to 10 Gbps&lt;/td&gt;
&lt;td&gt;Budget&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;c5.4xlarge&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;$2,400&lt;/td&gt;
&lt;td&gt;Up to 10 Gbps&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Balanced&lt;/strong&gt; ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;c5.9xlarge&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;$2,700&lt;/td&gt;
&lt;td&gt;10 Gbps&lt;/td&gt;
&lt;td&gt;High memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;c5.12xlarge&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;$3,600&lt;/td&gt;
&lt;td&gt;12 Gbps&lt;/td&gt;
&lt;td&gt;Max performance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Our choice:&lt;/strong&gt; c5.4xlarge, the best balance of cost and manageability&lt;/p&gt;
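&lt;p&gt;The "machines needed" column follows from simple division. A sketch (my names, rounded up so the last machine may run partially filled):&lt;/p&gt;

```java
// Sketch: how many machines are needed for 64 slots at a 1:1 slot-to-vCPU ratio.
public class MachineCount {
    static int machinesNeeded(int totalSlots, int vcpusPerMachine) {
        // Ceiling division: round up when slots don't divide evenly
        return (totalSlots + vcpusPerMachine - 1) / vcpusPerMachine;
    }

    public static void main(String[] args) {
        System.out.println(machinesNeeded(64, 8));  // 8 x c5.2xlarge
        System.out.println(machinesNeeded(64, 16)); // 4 x c5.4xlarge
        System.out.println(machinesNeeded(64, 36)); // 2 x c5.9xlarge
    }
}
```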

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Monitor These Metrics&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Critical Flink metrics to watch

# 1. Backpressure (should be LOW)
flink_taskmanager_job_task_backPressureTimeMsPerSecond

# 2. Records per second
rate(flink_taskmanager_job_task_numRecordsInPerSecond[1m])

# 3. Checkpoint duration (should be &amp;lt; 10% of interval)
flink_jobmanager_job_lastCheckpointDuration

# 4. CPU usage (should be 70-85%)
container_cpu_usage_seconds_total{pod=~"flink-taskmanager.*"}

# 5. Memory usage
flink_taskmanager_Status_JVM_Memory_Heap_Used
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
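&lt;p&gt;The thresholds in the comments above ("below 10% of the interval", "70-85% CPU") can be turned into a simple health check. A sketch with illustrative names; wire the real metric values in from your monitoring stack:&lt;/p&gt;

```java
// Sketch: alerting rules for the two threshold-based metrics listed above.
public class MetricHealth {
    // Checkpoint duration should stay under 10% of the checkpoint interval
    static boolean checkpointHealthy(long durationMs, long intervalMs) {
        return durationMs * 10 < intervalMs;
    }

    // CPU should sit in the 70-85% band: busy, but not saturated
    static boolean cpuHealthy(double cpuUtilization) {
        return cpuUtilization >= 0.70 && cpuUtilization <= 0.85;
    }

    public static void main(String[] args) {
        System.out.println(checkpointHealthy(4_000, 60_000)); // true: 4s of a 60s interval
        System.out.println(checkpointHealthy(9_000, 60_000)); // false: 15%, investigate state size
        System.out.println(cpuHealthy(0.78));                 // true
    }
}
```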



&lt;h3&gt;
  
  
  5. &lt;strong&gt;Test with Backlog&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Always do the backlog test:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Build backlog (run producers at full speed)&lt;/span&gt;
&lt;span class="c"&gt;# 2. Stop producers&lt;/span&gt;
&lt;span class="c"&gt;# 3. Measure Flink consumption rate&lt;/span&gt;
&lt;span class="c"&gt;# 4. This is your TRUE Flink capacity&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If backlog consumption &amp;gt; live consumption:&lt;br&gt;
→ &lt;strong&gt;Upstream system (Pulsar) is the bottleneck&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If backlog consumption ≈ live consumption:&lt;br&gt;
→ &lt;strong&gt;Flink is the bottleneck&lt;/strong&gt;&lt;/p&gt;
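&lt;p&gt;That interpretation step can be codified. A sketch: the 10% tolerance band standing in for "≈" is my assumption, not a figure from the benchmark:&lt;/p&gt;

```java
// Sketch: classify the bottleneck from the backlog test results.
public class BacklogVerdict {
    static String bottleneck(double liveRate, double backlogRate) {
        if (backlogRate > liveRate * 1.10) {
            // Flink drains faster once producers stop: the upstream (Pulsar) was limiting it
            return "upstream";
        }
        // Rates roughly match: Flink itself is saturated
        return "flink";
    }

    public static void main(String[] args) {
        System.out.println(bottleneck(600_000, 890_000)); // upstream (the situation in this benchmark)
        System.out.println(bottleneck(600_000, 610_000)); // flink
    }
}
```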

&lt;h2&gt;
  
  
  🎯 Key Takeaways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Worked
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;Parallelism = Partitions&lt;/strong&gt; (64 = 64)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;1:1 slot-to-CPU ratio&lt;/strong&gt; for CPU-bound workloads&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Bigger instances&lt;/strong&gt; (c5.4xlarge) over many small ones&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Upgraded Pulsar&lt;/strong&gt; (i3en → i7i, 4 → 6 nodes)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;NVMe device separation&lt;/strong&gt; (journal vs ledgers)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Disabled journalSyncData&lt;/strong&gt; in BookKeeper&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Backlog testing&lt;/strong&gt; to identify bottlenecks  &lt;/p&gt;

&lt;h3&gt;
  
  
  What Didn't Work
&lt;/h3&gt;

&lt;p&gt;❌ &lt;strong&gt;2:1 slot-to-CPU ratio&lt;/strong&gt; (insufficient CPU per task)&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Low CPU limits&lt;/strong&gt; in pod specs (throttling)&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Too few Pulsar bookies&lt;/strong&gt; (4 → needed 6)&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Single NVMe device&lt;/strong&gt; for journal+ledger (I/O contention)&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Older instance types&lt;/strong&gt; (i3en vs i7i)  &lt;/p&gt;

&lt;h2&gt;
  
  
  💰 Final Cost
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Instance&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Flink JM&lt;/td&gt;
&lt;td&gt;c5.4xlarge&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;$480&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flink TM&lt;/td&gt;
&lt;td&gt;c5.4xlarge&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;$1,920&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flink Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$2,400&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pulsar (Broker+Bookie)&lt;/td&gt;
&lt;td&gt;i7i.8xlarge&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;$12,960&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$15,360/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Cost per message:&lt;/strong&gt; roughly $0.006 per million messages ($15,360/month spread over ~2.6 trillion messages at a sustained 1M msg/sec)&lt;/p&gt;
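&lt;p&gt;The per-message cost follows from the table. A sketch of the arithmetic, assuming a 30-day month of sustained load (my assumption):&lt;/p&gt;

```java
// Sketch: derive cost per million messages from the monthly infrastructure bill.
public class CostPerMessage {
    static double costPerMillionMessages(double monthlyUsd, long msgPerSec) {
        // 86,400 seconds/day x 30 days: about 2.59 trillion messages per month
        double messagesPerMonth = (double) msgPerSec * 86_400 * 30;
        return monthlyUsd / messagesPerMonth * 1_000_000;
    }

    public static void main(String[] args) {
        double c = costPerMillionMessages(15_360, 1_000_000);
        System.out.printf("$%.4f per million messages%n", c); // about $0.0059
    }
}
```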

&lt;h2&gt;
  
  
  🚀 Scaling Beyond 1M
&lt;/h2&gt;

&lt;p&gt;Want to go higher? Here's the roadmap:&lt;/p&gt;

&lt;h3&gt;
  
  
  2M msg/sec:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pulsar: 8 bookies (i7i.8xlarge)&lt;/li&gt;
&lt;li&gt;Flink: 8 TaskManagers (c5.4xlarge), parallelism 128&lt;/li&gt;
&lt;li&gt;Partitions: 128&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; ~$23K/month&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5M msg/sec:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pulsar: 15 bookies (i7i.8xlarge)&lt;/li&gt;
&lt;li&gt;Flink: 16 TaskManagers (c5.4xlarge), parallelism 256&lt;/li&gt;
&lt;li&gt;Partitions: 256&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; ~$40K/month&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  10M msg/sec:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pulsar: 30 bookies (i7i.16xlarge)&lt;/li&gt;
&lt;li&gt;Flink: 32 TaskManagers (c5.9xlarge), parallelism 512&lt;/li&gt;
&lt;li&gt;Network: Upgrade to 100 Gbps instances&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; ~$80K/month&lt;/li&gt;
&lt;/ul&gt;
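&lt;p&gt;One way to read the roadmap above is cost efficiency per tier: dollars per month for each 1M msg/sec of capacity. A sketch using the article's rough estimates; the tiers themselves would still need re-benchmarking before committing to hardware:&lt;/p&gt;

```java
// Sketch: cost efficiency of each scaling tier, in $/month per 1M msg/sec.
public class TierEfficiency {
    static double usdPerMillionPerSec(double monthlyUsd, double millionMsgPerSec) {
        return monthlyUsd / millionMsgPerSec;
    }

    public static void main(String[] args) {
        System.out.println(usdPerMillionPerSec(15_360, 1));  // ~15,360: the 1M baseline
        System.out.println(usdPerMillionPerSec(23_000, 2));  // ~11,500: better utilization
        System.out.println(usdPerMillionPerSec(40_000, 5));  // 8,000
        System.out.println(usdPerMillionPerSec(80_000, 10)); // 8,000: efficiency flattens out
    }
}
```

&lt;p&gt;Unit cost drops sharply from 1M to 5M msg/sec, then flattens: beyond that point you are paying linearly for capacity, mostly driven by network upgrades.&lt;/p&gt;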

&lt;h2&gt;
  
  
  🎓 Conclusion
&lt;/h2&gt;

&lt;p&gt;Going from &lt;strong&gt;140K to 1 million messages/sec&lt;/strong&gt; required:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Understanding the architecture&lt;/strong&gt; (operators, tasks, slots)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Systematic testing&lt;/strong&gt; (change one thing at a time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bottleneck identification&lt;/strong&gt; (backlog test was key!)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right-sizing resources&lt;/strong&gt; (not just throwing more hardware)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure matching&lt;/strong&gt; (Pulsar + Flink capacity aligned)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The biggest lesson?&lt;/strong&gt; &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The bottleneck is rarely where you think it is. Measure, test, iterate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In our case, we assumed Flink was the problem. Turned out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Phases 1-2: Flink configuration issues (low parallelism, then CPU throttling)&lt;/li&gt;
&lt;li&gt;Phase 3: &lt;strong&gt;Pulsar was limiting Flink!&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Backlog test and Phase 4: back to Flink (it needed more CPU per slot)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Both systems needed optimization to achieve 1M msg/sec.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The journey taught me:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with fundamentals&lt;/strong&gt;: Parallelism, partitions, resource allocation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use systematic testing&lt;/strong&gt;: Change one variable at a time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leverage diagnostic tools&lt;/strong&gt;: Backlog testing, metrics monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Think holistically&lt;/strong&gt;: Tune the entire pipeline, not just one component&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flink Load Repository&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events/flink-load" rel="noopener noreferrer"&gt;flink-load&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete Implementation&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;RealtimeDataPlatform&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Flink Performance Tuning&lt;/strong&gt;: &lt;a href="https://flink.apache.org/performance" rel="noopener noreferrer"&gt;flink.apache.org/performance&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink Resource Configuration&lt;/strong&gt;: &lt;a href="https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/memory/mem_setup/" rel="noopener noreferrer"&gt;Flink Memory Setup Guide&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Have you tuned Flink for high throughput? What challenges did you face? Share your optimization stories in the comments!&lt;/strong&gt; 👇&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow me for more deep dives on stream processing, performance optimization, and distributed systems architecture!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next in the series&lt;/strong&gt;: "ClickHouse Performance: Ingesting 1M Events/Sec with Sub-Second Queries"&lt;/p&gt;




&lt;h2&gt;
  
  
  📋 Quick Reference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Final Configuration Checklist
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;✅ Pulsar partitions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;
&lt;span class="na"&gt;✅ Flink parallelism&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;
&lt;span class="na"&gt;✅ TaskManager slots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;16 (per TM)&lt;/span&gt;
&lt;span class="na"&gt;✅ Slot-to-CPU ratio&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1:1&lt;/span&gt;
&lt;span class="na"&gt;✅ TaskManager count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
&lt;span class="na"&gt;✅ Instance type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;c5.4xlarge&lt;/span&gt;
&lt;span class="na"&gt;✅ Total vCPUs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;
&lt;span class="na"&gt;✅ Total slots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;
&lt;span class="na"&gt;✅ Pulsar bookies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt;
&lt;span class="na"&gt;✅ Pulsar instance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;i7i.8xlarge&lt;/span&gt;
&lt;span class="na"&gt;✅ NVMe separation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Yes&lt;/span&gt;
&lt;span class="na"&gt;✅ journalSyncData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Performance Validation Commands
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check Flink throughput&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark deployment/iot-flink-job | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"records"&lt;/span&gt;

&lt;span class="c"&gt;# Check parallelism&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &amp;lt;jm-pod&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  curl localhost:8081/jobs/&amp;lt;job-id&amp;gt; | jq &lt;span class="s1"&gt;'.vertices[].parallelism'&lt;/span&gt;

&lt;span class="c"&gt;# Check CPU usage&lt;/span&gt;
kubectl top pods &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark

&lt;span class="c"&gt;# Check backpressure&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &amp;lt;jm-pod&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  curl localhost:8081/jobs/&amp;lt;job-id&amp;gt;/vertices/&amp;lt;vertex-id&amp;gt;/backpressure

&lt;span class="c"&gt;# Run the backlog test&lt;/span&gt;
kubectl scale deployment iot-producer &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;span class="c"&gt;# Watch throughput increase as Flink catches up&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #flink #performance #streaming #aws #optimization #parallelism #tuning #apacheflink #realtimedata&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>dataengineering</category>
      <category>performance</category>
      <category>aws</category>
    </item>
    <item>
      <title>How I Achieved 1 Million Messages/Sec with Apache Pulsar on AWS EKS - A Deep Dive into NVMe, BookKeeper, and Performance Tuning</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 12:44:48 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/how-i-achieved-1-million-messagessec-with-apache-pulsar-on-aws-eks-a-deep-dive-into-nvme-33b8</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/how-i-achieved-1-million-messagessec-with-apache-pulsar-on-aws-eks-a-deep-dive-into-nvme-33b8</guid>
      <description>&lt;p&gt;Processing &lt;strong&gt;1 million messages per second&lt;/strong&gt; isn't just about throwing more hardware at the problem. It requires deep understanding of storage I/O, careful configuration tuning, and smart architectural decisions.&lt;/p&gt;

&lt;p&gt;In this article, I'll share the exact configurations and optimizations that enabled Apache Pulsar to reliably handle &lt;strong&gt;1,000,000 messages/sec with 300-byte payloads&lt;/strong&gt; on AWS EKS.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 The Challenge
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt;: 1 million messages/second sustained&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message Size&lt;/strong&gt;: ~300 bytes (AVRO-serialized sensor data)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Bandwidth&lt;/strong&gt;: ~2.4 Gbps (300 MB/sec)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: &amp;lt; 10ms p99&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Durability&lt;/strong&gt;: No message loss, replicated storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: Optimized for AWS infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🏗️ Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────────────┐
│                     Apache Pulsar on AWS EKS                            │
│                 benchmark-high-infra (k8s 1.31)                         │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────────┐   ┌─────────────┐   ┌─────────────┐   ┌──────────┐ │
│  │   PRODUCERS     │──▶│ ZooKeeper   │   │   Pulsar    │──▶│ PROXIES  │ │
│  │                 │   │             │   │  Brokers    │   │          │ │
│  │ 4 nodes         │   │ 3 nodes     │   │ 6 nodes     │   │ 2 nodes  │ │
│  │ c5.4xlarge      │   │ t3.medium   │   │ i7i.8xlarge │   │c5.2xlarge│ │
│  │                 │   │             │   │             │   │          │ │
│  │ Java/AVRO       │   │ Metadata    │   │ Message     │   │ Load     │ │
│  │ 250K evt/sec    │   │ Management  │   │ Routing     │   │ Balance  │ │
│  │ per node        │   │             │   │             │   │          │ │
│  └─────────────────┘   └─────────────┘   └─────────────┘   └──────────┘ │
│                                                 │                        │
│                                                 ▼                        │
│                                         ┌─────────────┐                  │
│                                         │ BookKeeper  │                  │
│                                         │  Bookies    │                  │
│                                         │             │                  │
│                                         │ 6 nodes     │                  │
│                                         │ i7i.8xlarge │                  │
│                                         │             │                  │
│                                         │ NVMe Storage│                  │
│                                         │ Separation: │                  │
│                                         │ Device 0:   │                  │
│                                         │ Journal WAL │                  │
│                                         │ Device 1:   │                  │
│                                         │ Ledger Data │                  │
│                                         └─────────────┘                  │
└──────────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  💾 The Storage Strategy - Why NVMe Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  EC2 Instance Selection: i7i.8xlarge
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why i7i.8xlarge?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Instance: i7i.8xlarge
- 32 vCPUs
- 256 GiB RAM
- 2x 3,750 GB NVMe SSDs (7.5TB total)
- Network: 25 Gbps
- Cost: ~$2,160/month per instance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ultra-low latency&lt;/strong&gt;: NVMe SSDs provide &amp;lt;100µs latency vs EBS's ~1-3ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High IOPS&lt;/strong&gt;: 3.75M IOPS vs EBS's 64K IOPS (gp3)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sustained throughput&lt;/strong&gt;: 30 GB/s vs EBS's 1 GB/s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No network overhead&lt;/strong&gt;: Local storage doesn't compete with network bandwidth&lt;/li&gt;
&lt;/ol&gt;
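&lt;p&gt;Put as ratios, the gap in the list above is stark. A trivial sketch using the article's figures:&lt;/p&gt;

```java
// Sketch: the NVMe-vs-EBS gap from the benefits list, expressed as multiples.
public class StorageGap {
    static double ratio(double nvme, double ebs) {
        return nvme / ebs;
    }

    public static void main(String[] args) {
        System.out.println(ratio(3_750_000, 64_000)); // ~58.6x the IOPS of gp3 EBS
        System.out.println(ratio(30, 1));             // 30x the sustained throughput (GB/s)
    }
}
```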

&lt;h3&gt;
  
  
  NVMe Device Separation - The Game Changer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Critical Design Decision:&lt;/strong&gt; Use &lt;strong&gt;2 separate NVMe devices&lt;/strong&gt; per node&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Device configuration on each i7i.8xlarge node&lt;/span&gt;
/dev/nvme1n1 → Journal &lt;span class="o"&gt;(&lt;/span&gt;Write-Ahead Log&lt;span class="o"&gt;)&lt;/span&gt; - 3,750 GB
/dev/nvme2n1 → Ledgers &lt;span class="o"&gt;(&lt;/span&gt;Message Storage&lt;span class="o"&gt;)&lt;/span&gt; - 3,750 GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why separate devices?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Journal (Write-Ahead Log):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sequential writes only&lt;/li&gt;
&lt;li&gt;Low latency critical (blocks producer ACKs)&lt;/li&gt;
&lt;li&gt;Large capacity available (3.75TB per device)&lt;/li&gt;
&lt;li&gt;High write frequency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ledgers (Message Storage):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Random reads/writes&lt;/li&gt;
&lt;li&gt;Large capacity needed (3.75TB per device)&lt;/li&gt;
&lt;li&gt;Background compaction operations&lt;/li&gt;
&lt;li&gt;Read-heavy for consumers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Performance Impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Without separation&lt;/strong&gt;: Journal writes compete with ledger I/O → increased latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With separation&lt;/strong&gt;: Independent I/O queues → consistent &amp;lt;1ms write latency&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🚀 Producer Infrastructure - High-Volume Event Generation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Producer Instance Selection: c5.4xlarge
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why c5.4xlarge for producers?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Instance: c5.4xlarge
- 16 vCPUs (high single-thread performance)
- 32 GiB RAM
- Up to 10 Gbps network performance
- EBS Optimized: Up to 4,750 Mbps
- Cost: ~$480/month per instance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Producer Architecture Details:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Node Configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Count&lt;/strong&gt;: 4 producer nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instance Type&lt;/strong&gt;: c5.4xlarge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target Throughput&lt;/strong&gt;: 250,000 messages/sec per node&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Capacity&lt;/strong&gt;: 1,000,000 messages/sec across all nodes&lt;/li&gt;
&lt;/ul&gt;
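&lt;p&gt;The fleet size falls out of the per-node target. A sketch (ceiling division, so a lower per-node rate simply means more nodes):&lt;/p&gt;

```java
// Sketch: producer fleet sizing from a total target and a per-node capacity.
public class ProducerFleet {
    static int nodesNeeded(long totalMsgPerSec, long perNodeMsgPerSec) {
        return (int) Math.ceil((double) totalMsgPerSec / perNodeMsgPerSec);
    }

    public static void main(String[] args) {
        System.out.println(nodesNeeded(1_000_000, 250_000)); // 4 x c5.4xlarge
    }
}
```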

&lt;p&gt;&lt;strong&gt;Producer Implementation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Language&lt;/strong&gt;: Java with high-performance Pulsar client&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serialization&lt;/strong&gt;: AVRO for efficient message encoding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message Size&lt;/strong&gt;: ~300 bytes (AVRO-serialized sensor data)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batching&lt;/strong&gt;: Optimized batch sizes for throughput&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connection Pooling&lt;/strong&gt;: Multiple connections per producer&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Producer Performance Characteristics
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Per-Node Metrics:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Each c5.4xlarge producer node generates:&lt;/span&gt;
- Messages/sec: 250,000
- Data rate: 75 MB/sec &lt;span class="o"&gt;(&lt;/span&gt;300 bytes × 250K&lt;span class="o"&gt;)&lt;/span&gt;
- CPU utilization: 70-80%
- Memory usage: 8-12 GB
- Network utilization: ~600 Mbps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
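&lt;p&gt;Those per-node figures are easy to double-check. A sketch of the arithmetic (decimal megabytes; names are mine):&lt;/p&gt;

```java
// Sketch: verify the data-rate and network figures for one producer node.
public class ProducerMath {
    static double dataRateMBps(int messageBytes, long msgPerSec) {
        return messageBytes * msgPerSec / 1_000_000.0; // bytes/sec -> MB/sec
    }

    static double networkMbps(double mbPerSec) {
        return mbPerSec * 8; // bytes to bits
    }

    public static void main(String[] args) {
        double mb = dataRateMBps(300, 250_000);
        System.out.println(mb);                           // 75.0 MB/sec per node
        System.out.println(networkMbps(mb));              // 600.0 Mbps, well under 10 Gbps
        System.out.println(dataRateMBps(300, 1_000_000)); // 300.0 MB/sec fleet-wide, ~2.4 Gbps
    }
}
```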



&lt;p&gt;&lt;strong&gt;Pulsar Proxy Configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instance Type&lt;/strong&gt;: c5.2xlarge (8 vCPUs, 16 GiB RAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why c5.2xlarge?&lt;/strong&gt; Higher network performance and CPU for connection handling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Role&lt;/strong&gt;: Load balancing and connection management for 1M+ connections&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Performance&lt;/strong&gt;: Up to 10 Gbps (critical for high-throughput scenarios)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: ~$240/month per instance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can find the complete producer implementation in the &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events/producer-load" rel="noopener noreferrer"&gt;producer-load directory&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔧 BookKeeper Configuration - The Secret Sauce
&lt;/h2&gt;

&lt;p&gt;The complete configuration can be found in the &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events/pulsar-load" rel="noopener noreferrer"&gt;pulsar-load directory&lt;/a&gt; and the specific &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/blob/main/realtime-platform-1million-events/pulsar-load/helm/pulsar/values.yaml" rel="noopener noreferrer"&gt;values.yaml file&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Journal Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pulsar-values.yaml - BookKeeper section&lt;/span&gt;
&lt;span class="na"&gt;bookkeeper&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;configData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Journal write buffer - CRITICAL for throughput&lt;/span&gt;
    &lt;span class="na"&gt;journalMaxSizeMB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2048"&lt;/span&gt;  &lt;span class="c1"&gt;# 2GB buffer (default: 512MB)&lt;/span&gt;
    &lt;span class="na"&gt;journalMaxBackups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5"&lt;/span&gt;

    &lt;span class="c1"&gt;# Disable fsync for each write (NVMe is reliable enough)&lt;/span&gt;
    &lt;span class="na"&gt;journalSyncData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false"&lt;/span&gt;  &lt;span class="c1"&gt;# Default: true&lt;/span&gt;

    &lt;span class="c1"&gt;# Enable adaptive group writes (batch small writes)&lt;/span&gt;
    &lt;span class="na"&gt;journalAdaptiveGroupWrites&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;  &lt;span class="c1"&gt;# Default: false&lt;/span&gt;

    &lt;span class="c1"&gt;# Flush immediately when queue is empty&lt;/span&gt;
    &lt;span class="na"&gt;journalFlushWhenQueueEmpty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;  &lt;span class="c1"&gt;# Default: false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;journalMaxSizeMB: "2048"&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increased from default 512MB to 2GB&lt;/li&gt;
&lt;li&gt;Allows buffering more writes before flush&lt;/li&gt;
&lt;li&gt;Reduces fsync frequency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: 3-4x improvement in write throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;journalSyncData: "false"&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Disables fsync() after each write&lt;/li&gt;
&lt;li&gt;Relies on NVMe's own write cache and power loss protection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk&lt;/strong&gt;: Potential data loss on sudden power failure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mitigation&lt;/strong&gt;: NVMe drives have capacitor-backed cache&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: 10x reduction in write latency (10ms → 1ms)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;journalAdaptiveGroupWrites: "true"&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groups multiple small writes into batches&lt;/li&gt;
&lt;li&gt;Reduces system call overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: Improves throughput by 20-30% under high load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;journalFlushWhenQueueEmpty: "true"&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Immediately flushes when no pending writes&lt;/li&gt;
&lt;li&gt;Reduces latency for sporadic writes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: Better p99 latency during variable load&lt;/li&gt;
&lt;/ul&gt;
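&lt;p&gt;The latency claim above can be reasoned about with a toy model. The timing constants below are illustrative assumptions (roughly 10 ms per fsync and 1 ms per buffered write), not measured BookKeeper internals:&lt;/p&gt;

```python
# Toy model of journal write latency with and without per-write fsync.
# Timing constants are illustrative assumptions, not measurements.
FSYNC_MS = 10.0          # assumed cost of fsync() on the journal device
BUFFERED_WRITE_MS = 1.0  # assumed cost of a buffered write absorbed by NVMe cache

def journal_write_latency_ms(sync_data: bool) -> float:
    """Per-write latency: a buffered write, plus an fsync if journalSyncData=true."""
    return BUFFERED_WRITE_MS + (FSYNC_MS if sync_data else 0.0)

speedup = journal_write_latency_ms(True) / journal_write_latency_ms(False)
print(f"journalSyncData=false is ~{speedup:.0f}x faster per write")
```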

&lt;h3&gt;
  
  
  2. Ledger Storage Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Entry log settings&lt;/span&gt;
&lt;span class="na"&gt;entryLogSizeLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2147483648"&lt;/span&gt;  &lt;span class="c1"&gt;# 2GB per entry log file&lt;/span&gt;
&lt;span class="na"&gt;entryLogFilePreAllocationEnabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;  &lt;span class="c1"&gt;# Pre-allocate files&lt;/span&gt;

&lt;span class="c1"&gt;# Flush interval&lt;/span&gt;
&lt;span class="na"&gt;flushInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;60000"&lt;/span&gt;  &lt;span class="c1"&gt;# 60 seconds (default: 60000)&lt;/span&gt;

&lt;span class="c1"&gt;# Use RocksDB for better performance&lt;/span&gt;
&lt;span class="na"&gt;ledgerStorageClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;entryLogSizeLimit: "2147483648"&lt;/code&gt; (2GB)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Larger entry log files reduce file rotation overhead&lt;/li&gt;
&lt;li&gt;Better sequential write patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: 15% improvement in write throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;entryLogFilePreAllocationEnabled: "true"&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-allocates disk space for entry log files&lt;/li&gt;
&lt;li&gt;Eliminates file system overhead during writes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: More predictable latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;ledgerStorageClass: "DbLedgerStorage"&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses RocksDB instead of InterleavedLedgerStorage&lt;/li&gt;
&lt;li&gt;Better for high-throughput workloads&lt;/li&gt;
&lt;li&gt;Faster index lookups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: 40% improvement in random read performance&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Garbage Collection Tuning
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# GC settings optimized for high throughput&lt;/span&gt;
&lt;span class="na"&gt;minorCompactionInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3600"&lt;/span&gt;     &lt;span class="c1"&gt;# 1 hour (default: 2 hours)&lt;/span&gt;
&lt;span class="na"&gt;majorCompactionInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;86400"&lt;/span&gt;    &lt;span class="c1"&gt;# 24 hours (default: 24 hours)&lt;/span&gt;
&lt;span class="na"&gt;isForceGCAllowWhenNoSpace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;   &lt;span class="c1"&gt;# Force GC when disk full&lt;/span&gt;
&lt;span class="na"&gt;gcWaitTime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;900000"&lt;/span&gt;                &lt;span class="c1"&gt;# 15 minutes (default: 15 minutes)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High throughput generates ledgers quickly&lt;/li&gt;
&lt;li&gt;Compaction reclaims space from deleted messages&lt;/li&gt;
&lt;li&gt;Balance: too frequent compaction wastes CPU; too infrequent risks filling the disk&lt;/li&gt;
&lt;/ul&gt;
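&lt;p&gt;A rough fill-rate estimate shows why the compaction interval matters at this scale. This sketch assumes 300-byte messages at 1M msg/sec, a write quorum of 2 spread across 6 bookies, and one 3.75 TB NVMe ledger device per bookie (the i7i.8xlarge layout used in this post):&lt;/p&gt;

```python
# Rough estimate of how fast a bookie's ledger device fills at 1M msg/sec.
# Assumptions: 300-byte messages, write quorum 2, 6 bookies, 3.75 TB ledger device.
msg_rate = 1_000_000
msg_size = 300
write_quorum = 2
bookies = 6
ledger_device_tb = 3.75

per_bookie_mb_s = msg_rate * msg_size * write_quorum / bookies / 1e6  # MB/sec
hours_to_fill = ledger_device_tb * 1e6 / per_bookie_mb_s / 3600

print(f"{per_bookie_mb_s:.0f} MB/sec per bookie, device full in ~{hours_to_fill:.0f}h")
```

With the device filling in roughly half a day, an hourly minor compaction leaves plenty of headroom, while the default two-hour interval cuts that margin in half.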

&lt;h3&gt;
  
  
  4. Cache and Memory Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Broker configuration&lt;/span&gt;
&lt;span class="na"&gt;broker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;configData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Managed ledger cache (hot data in memory)&lt;/span&gt;
    &lt;span class="na"&gt;managedLedgerCacheSizeMB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512"&lt;/span&gt;  &lt;span class="c1"&gt;# 512MB per broker&lt;/span&gt;

    &lt;span class="c1"&gt;# Replication settings&lt;/span&gt;
    &lt;span class="na"&gt;managedLedgerDefaultEnsembleSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3"&lt;/span&gt;  &lt;span class="c1"&gt;# 3 bookies per ledger&lt;/span&gt;
    &lt;span class="na"&gt;managedLedgerDefaultWriteQuorum&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;   &lt;span class="c1"&gt;# Write to 2 bookies&lt;/span&gt;
    &lt;span class="na"&gt;managedLedgerDefaultAckQuorum&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;     &lt;span class="c1"&gt;# Wait for 2 ACKs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;managedLedgerCacheSizeMB: "512"&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Caches recently written messages&lt;/li&gt;
&lt;li&gt;Speeds up tailing reads (consumers close to tail)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: 50% reduction in read latency for hot data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quorum Configuration (3/2/2):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensemble: 3 bookies hold each ledger segment&lt;/li&gt;
&lt;li&gt;Write Quorum: Write to 2 bookies simultaneously&lt;/li&gt;
&lt;li&gt;Ack Quorum: Wait for 2 ACKs before acknowledging producer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-off&lt;/strong&gt;: Balance between durability and latency

&lt;ul&gt;
&lt;li&gt;3/3/3: More durable, higher latency&lt;/li&gt;
&lt;li&gt;3/2/2: Balanced (our choice)&lt;/li&gt;
&lt;li&gt;3/2/1: Fast but less durable&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
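&lt;p&gt;The ack quorum determines &lt;em&gt;which&lt;/em&gt; bookie's response time the producer waits for: with ack quorum &lt;em&gt;k&lt;/em&gt;, the producer is acknowledged once the &lt;em&gt;k&lt;/em&gt;-th fastest bookie responds. A small sketch (using a write quorum of 3 for illustration; the 3/2/2 setup above writes each entry to only 2 bookies, and the latency numbers are hypothetical):&lt;/p&gt;

```python
# With ack quorum k, producer latency is the k-th fastest bookie response.
def ack_latency_ms(bookie_latencies_ms, ack_quorum):
    """Latency seen by the producer: the ack_quorum-th fastest bookie write."""
    return sorted(bookie_latencies_ms)[ack_quorum - 1]

latencies = [1.2, 1.8, 9.5]  # hypothetical per-bookie write latencies (one slow node)

print(ack_latency_ms(latencies, 1))  # 1.2 - fastest bookie wins, least durable
print(ack_latency_ms(latencies, 2))  # 1.8 - ack quorum 2 masks the one slow bookie
print(ack_latency_ms(latencies, 3))  # 9.5 - waiting for all 3 pays for the slowest
```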

&lt;h2&gt;
  
  
  🚀 Broker Configuration for High Throughput
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;broker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt;  &lt;span class="c1"&gt;# 6 broker instances&lt;/span&gt;

  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2Gi"&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
    &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4Gi"&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;

  &lt;span class="na"&gt;configData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Increase connection limits&lt;/span&gt;
    &lt;span class="na"&gt;maxConcurrentLookupRequest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;50000"&lt;/span&gt;      &lt;span class="c1"&gt;# Default: 5000&lt;/span&gt;
    &lt;span class="na"&gt;maxConcurrentTopicLoadRequest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;50000"&lt;/span&gt;   &lt;span class="c1"&gt;# Default: 5000&lt;/span&gt;

    &lt;span class="c1"&gt;# Batch settings for better throughput&lt;/span&gt;
    &lt;span class="na"&gt;maxMessagesBatchingEnabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="na"&gt;maxNumMessagesInBatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1000"&lt;/span&gt;
    &lt;span class="na"&gt;maxBatchingDelayInMillis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10"&lt;/span&gt;

    &lt;span class="c1"&gt;# Producer settings&lt;/span&gt;
    &lt;span class="na"&gt;maxProducersPerTopic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10000"&lt;/span&gt;
    &lt;span class="na"&gt;maxConsumersPerTopic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10000"&lt;/span&gt;
    &lt;span class="na"&gt;maxConsumersPerSubscription&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10000"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key optimizations:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connection Limits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increased from 5K to 50K concurrent requests&lt;/li&gt;
&lt;li&gt;Handles high producer/consumer concurrency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: Eliminates connection throttling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Batching Configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groups messages for efficient network utilization&lt;/li&gt;
&lt;li&gt;10ms delay balances latency vs throughput&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: 30% improvement in network efficiency&lt;/li&gt;
&lt;/ul&gt;
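&lt;p&gt;The two batching knobs interact: a batch ships when it reaches &lt;code&gt;maxNumMessagesInBatch&lt;/code&gt; (1000) or when &lt;code&gt;maxBatchingDelayInMillis&lt;/code&gt; (10 ms) expires, whichever comes first. A simplified steady-state model of that interaction:&lt;/p&gt;

```python
# A batch is sent when it hits 1000 messages or the 10 ms window closes,
# whichever happens first (simplified steady-state model).
MAX_BATCH_MSGS = 1000
MAX_DELAY_MS = 10

def batch_profile(msgs_per_sec: float):
    """Return (messages per batch, added batching latency in ms) for a steady rate."""
    msgs_in_window = msgs_per_sec * MAX_DELAY_MS / 1000
    if msgs_in_window >= MAX_BATCH_MSGS:
        # Size-limited: batches fill before the delay window expires.
        return MAX_BATCH_MSGS, MAX_BATCH_MSGS / msgs_per_sec * 1000
    # Time-limited: the batch ships only after the full 10 ms window.
    return msgs_in_window, MAX_DELAY_MS

print(batch_profile(500_000))  # hot topic: size-limited, full batches every 2 ms
print(batch_profile(5_000))    # quiet topic: small batches pay the full 10 ms delay
```

This is why the 10 ms delay is a reasonable compromise: hot topics never feel it, while quiet topics trade a bounded 10 ms for far fewer network round trips.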

&lt;h2&gt;
  
  
  📊 Performance Monitoring
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Essential Metrics to Track
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Throughput Metrics:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Messages per second&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"msg/s"&lt;/span&gt;

&lt;span class="c"&gt;# Bytes per second&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Latency Metrics:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# BookKeeper journal write latency&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/bookkeeper shell listledgers | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-10&lt;/span&gt;

&lt;span class="c"&gt;# End-to-end producer latency (check application logs)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Storage Metrics:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# NVMe utilization&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; iostat &lt;span class="nt"&gt;-x&lt;/span&gt; 1 5

&lt;span class="c"&gt;# Disk space usage&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Critical Performance Indicators
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Target Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Message Rate&lt;/strong&gt;: 1,000,000+ msg/sec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Journal Write Latency&lt;/strong&gt;: &amp;lt; 2ms p99&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU Utilization&lt;/strong&gt;: 70-80% (brokers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Utilization&lt;/strong&gt;: 60-70% (bookies)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Utilization&lt;/strong&gt;: &amp;lt; 80% of 25 Gbps&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Grafana Dashboards
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Critical Panels:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Message Rate In/Out&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   rate(pulsar_in_messages_total[1m])
   rate(pulsar_out_messages_total[1m])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;BookKeeper Write Latency&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   histogram_quantile(0.99, 
     rate(bookie_journal_JOURNAL_ADD_ENTRY_bucket[5m])
   )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Storage Fill Rate&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   rate(bookie_ledgers_size_bytes[1h])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  💡 Lessons Learned &amp;amp; Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;NVMe Device Separation is Critical&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before optimization (single device):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write latency p99: ~15ms&lt;/li&gt;
&lt;li&gt;Throughput: ~400K msg/sec&lt;/li&gt;
&lt;li&gt;Frequent latency spikes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After optimization (separate devices):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write latency p99: ~2.1ms&lt;/li&gt;
&lt;li&gt;Throughput: 1M+ msg/sec&lt;/li&gt;
&lt;li&gt;Stable performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaway:&lt;/strong&gt; Journal and ledger I/O patterns conflict. Separate them.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Disable journalSyncData (Carefully)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Trade-off Analysis:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10x reduction in write latency&lt;/li&gt;
&lt;li&gt;2-3x increase in throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Risks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data loss on sudden power failure (rare with NVMe)&lt;/li&gt;
&lt;li&gt;Not suitable for financial transactions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IoT/telemetry data (some loss is acceptable)&lt;/li&gt;
&lt;li&gt;High-volume logs&lt;/li&gt;
&lt;li&gt;Event streaming&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When NOT to use:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Financial transactions&lt;/li&gt;
&lt;li&gt;Critical business data&lt;/li&gt;
&lt;li&gt;Compliance-regulated workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Right-Size Your Instances&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Instance comparison for Pulsar:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instance&lt;/th&gt;
&lt;th&gt;vCPU&lt;/th&gt;
&lt;th&gt;RAM&lt;/th&gt;
&lt;th&gt;NVMe&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;i3en.6xlarge&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;192GB&lt;/td&gt;
&lt;td&gt;2x 7.5TB&lt;/td&gt;
&lt;td&gt;High capacity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;i7i.8xlarge&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;256GB&lt;/td&gt;
&lt;td&gt;2x 3.75TB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Balanced&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Our choice: i7i.8xlarge&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latest generation (better CPU performance)&lt;/li&gt;
&lt;li&gt;2 NVMe devices (perfect for journal/ledger separation)&lt;/li&gt;
&lt;li&gt;Optimal balance of CPU, memory, and storage for Pulsar workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Broker and Bookie Co-location&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced network hops (broker→bookie is local)&lt;/li&gt;
&lt;li&gt;Lower latency&lt;/li&gt;
&lt;li&gt;Cost savings (fewer instances)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resource contention (CPU, memory)&lt;/li&gt;
&lt;li&gt;Harder to scale independently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Our approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Co-locate for high throughput workloads&lt;/li&gt;
&lt;li&gt;Separate for latency-sensitive applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Works well for 1M msg/sec with proper resource allocation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. &lt;strong&gt;Tune the Garbage Collection Wait Time&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Critical metric: How long messages stay in journal cache&lt;/span&gt;
&lt;span class="na"&gt;gcWaitTime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;900000"&lt;/span&gt;  &lt;span class="c1"&gt;# 15 minutes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;gcWaitTime&lt;/code&gt; sets how long BookKeeper waits between garbage-collection runs&lt;/li&gt;
&lt;li&gt;A longer interval means less GC overhead, and recently written data remains available longer for tailing consumers&lt;/li&gt;
&lt;li&gt;Too long = reclaimable space accumulates and the disk fills faster between cycles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sweet spot:&lt;/strong&gt; 10-15 minutes for streaming workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🚧 Common Pitfalls &amp;amp; How to Avoid Them
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;EBS Instead of NVMe&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Throughput caps at ~100K msg/sec, high latency&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause:&lt;/strong&gt; EBS gp3 maxes at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;16K IOPS (baseline)&lt;/li&gt;
&lt;li&gt;64K IOPS (provisioned)&lt;/li&gt;
&lt;li&gt;1 GB/s throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Use NVMe-backed instances (i7i, i4i, i3en)&lt;/p&gt;
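&lt;p&gt;The reason gp3 caps out well below its headline throughput is that journal writes are small, so the IOPS limit binds long before the 1 GB/s throughput limit. A quick back-of-the-envelope check (the 4 KiB average I/O size is an assumption for illustration):&lt;/p&gt;

```python
# With small journal writes, the gp3 IOPS cap binds before the throughput cap.
# Assumes a 4 KiB average I/O size (illustrative).
GP3_MAX_IOPS = 64_000           # provisioned maximum
GP3_MAX_THROUGHPUT_MB = 1_000   # MB/sec
io_size_kib = 4

iops_limited_mb = GP3_MAX_IOPS * io_size_kib / 1024   # throughput at the IOPS cap
effective_mb = min(iops_limited_mb, GP3_MAX_THROUGHPUT_MB)

print(f"effective gp3 write ceiling: {effective_mb:.0f} MB/sec")  # 250 MB/sec
```

Local NVMe on i7i/i4i/i3en instances offers hundreds of thousands of IOPS per device, which removes this ceiling entirely.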

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Single NVMe Device for Journal + Ledger&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Latency spikes, inconsistent throughput&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause:&lt;/strong&gt; Journal sequential writes blocked by ledger random I/O&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Use separate devices or at least separate partitions&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;journalSyncData Enabled&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Write latency &amp;gt;10ms, throughput &amp;lt;200K msg/sec&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause:&lt;/strong&gt; fsync() after every write (10ms overhead)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Disable it if your workload can tolerate rare data loss (see the trade-off analysis above)&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Insufficient Broker Count&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; High CPU on brokers, throttling, connection refused&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause:&lt;/strong&gt; Too few brokers for traffic volume&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rule of thumb: 1 broker per 150-200K msg/sec&lt;/li&gt;
&lt;li&gt;For 1M msg/sec: Minimum 5-6 brokers&lt;/li&gt;
&lt;/ul&gt;
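&lt;p&gt;The rule of thumb above is easy to encode. Using the midpoint of the 150-200K msg/sec range (an assumption for the sketch), the math lands on the 6-broker deployment used in this post:&lt;/p&gt;

```python
import math

# Rule-of-thumb broker sizing: one broker per 150-200K msg/sec.
def min_brokers(target_msg_per_sec: int, per_broker_capacity: int = 175_000) -> int:
    """Assumes ~175K msg/sec per broker (midpoint of the 150-200K rule of thumb)."""
    return math.ceil(target_msg_per_sec / per_broker_capacity)

print(min_brokers(1_000_000))  # 6 - matches the 6-broker deployment in this post
```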

&lt;h3&gt;
  
  
  5. &lt;strong&gt;Not Using Separate Node Groups&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Performance degradation when other workloads deploy&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause:&lt;/strong&gt; Resource contention with non-Pulsar pods&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Dedicated node groups with taints&lt;/p&gt;

&lt;h2&gt;
  
  
  📈 Scaling Beyond 1M msg/sec
&lt;/h2&gt;

&lt;h3&gt;
  
  
  To 2M msg/sec:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Increase broker and bookie count&lt;/span&gt;
&lt;span class="na"&gt;broker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;  &lt;span class="c1"&gt;# Up from 6&lt;/span&gt;

&lt;span class="na"&gt;bookkeeper&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;  &lt;span class="c1"&gt;# Up from 6&lt;/span&gt;

&lt;span class="c1"&gt;# Scale producer infrastructure&lt;/span&gt;
&lt;span class="na"&gt;producers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;  &lt;span class="c1"&gt;# Up from 4 (250K each = 2M total)&lt;/span&gt;
  &lt;span class="na"&gt;instanceType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;c5.4xlarge&lt;/span&gt;  &lt;span class="c1"&gt;# Keep same instance type&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cost impact:&lt;/strong&gt; ~+$6,240/month (2 more i7i.8xlarge @ $2,160 each + 4 more c5.4xlarge @ $480 each)&lt;/p&gt;
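&lt;p&gt;The cost delta above is just the sum of the added instances:&lt;/p&gt;

```python
# Breakdown of the ~$6,240/month increase for scaling from 1M to 2M msg/sec.
extra_i7i = 2 * 2_160       # 2 more i7i.8xlarge broker/bookie nodes
extra_c5 = 4 * 480          # 4 more c5.4xlarge producer nodes

print(extra_i7i + extra_c5)  # 6240
```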

&lt;h3&gt;
  
  
  To 5M msg/sec:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal scaling:&lt;/strong&gt; 15-20 brokers, 15-20 bookies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Producer scaling:&lt;/strong&gt; 20 c5.4xlarge nodes (250K each)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network:&lt;/strong&gt; Upgrade to 50-100 Gbps instances&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage:&lt;/strong&gt; Consider i4i.16xlarge (4x NVMe, 64 vCPU)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; ~$45,000-50,000/month&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💰 Cost Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Monthly Cost Breakdown (1M msg/sec)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Instance&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Cost/mo&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Producer Nodes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;c5.4xlarge&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;$1,920&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pulsar Brokers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;i7i.8xlarge&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;$12,960&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BookKeeper Bookies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;i7i.8xlarge&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;(Co-located)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ZooKeeper&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;t3.medium&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;$90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pulsar Proxy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;c5.2xlarge&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;$480&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$15,450&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Cost per message:&lt;/strong&gt; ~$0.006 per million messages (≈2.59 trillion messages in a 30-day month at a sustained 1M msg/sec)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure Breakdown:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Producer Infrastructure&lt;/strong&gt;: $1,920/month (12.4% of total)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar Core Infrastructure&lt;/strong&gt;: $13,530/month (87.6% of total)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Infrastructure&lt;/strong&gt;: $15,450/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; i7i.8xlarge cost = $12,960 ÷ 6 instances = $2,160/month per instance&lt;/p&gt;
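&lt;p&gt;The per-message cost follows from the monthly total and the sustained rate (assuming a 30-day month):&lt;/p&gt;

```python
# Cost per million messages at a sustained 1M msg/sec over a 30-day month.
monthly_cost = 15_450
msgs_per_month = 1_000_000 * 86_400 * 30   # ~2.59 trillion messages

cost_per_million = monthly_cost / (msgs_per_month / 1_000_000)
print(f"${cost_per_million:.4f} per million messages")  # $0.0060
```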

&lt;h3&gt;
  
  
  Cost Optimization Options
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Savings Plans (26% off):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3-year commitment&lt;/li&gt;
&lt;li&gt;Reduces to ~$11,433/month&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Spot Instances (60% off):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$6,180/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk:&lt;/strong&gt; Potential interruptions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mitigation:&lt;/strong&gt; Use for non-critical environments&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reserved Instances (40% off):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1-year commitment&lt;/li&gt;
&lt;li&gt;Reduces to ~$9,270/month&lt;/li&gt;
&lt;li&gt;Balance between savings and flexibility&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
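&lt;p&gt;All three discounted figures derive from the same $15,450 on-demand baseline:&lt;/p&gt;

```python
# Discounted monthly costs, derived from the $15,450 on-demand total above.
on_demand = 15_450
discounts = {"savings_plan_3yr": 0.26, "spot": 0.60, "reserved_1yr": 0.40}

for name, pct in discounts.items():
    print(f"{name}: ${on_demand * (1 - pct):,.0f}/month")
```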

&lt;h2&gt;
  
  
  🎓 Conclusion
&lt;/h2&gt;

&lt;p&gt;Achieving 1 million messages/sec with Pulsar requires:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;NVMe storage&lt;/strong&gt; with separated journal and ledger devices&lt;br&gt;
✅ &lt;strong&gt;Careful BookKeeper tuning&lt;/strong&gt; (journalSyncData, buffer sizes)&lt;br&gt;
✅ &lt;strong&gt;Right-sized instances&lt;/strong&gt; (i7i.8xlarge sweet spot)&lt;br&gt;
✅ &lt;strong&gt;Horizontal scaling&lt;/strong&gt; (6+ brokers, 6+ bookies)&lt;br&gt;
✅ &lt;strong&gt;Dedicated infrastructure&lt;/strong&gt; (node groups with taints)&lt;br&gt;
✅ &lt;strong&gt;Monitoring&lt;/strong&gt; (latency, IOPS, CPU, throughput)  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics Achieved:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput:&lt;/strong&gt; 1,040,000 msg/sec sustained&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency:&lt;/strong&gt; 0.8ms p50, 2.1ms p99&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure Cost:&lt;/strong&gt; $15,450/month ($11,433 with savings plans)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability:&lt;/strong&gt; 99.95% uptime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The secret sauce combinations:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;NVMe device separation&lt;/strong&gt; for journal vs ledgers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;journalSyncData: false&lt;/strong&gt; for 10x latency improvement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;i7i.8xlarge instances&lt;/strong&gt; for optimal price/performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broker-bookie co-location&lt;/strong&gt; for reduced network hops&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proper resource allocation&lt;/strong&gt; and monitoring&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar Load Repository&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events/pulsar-load" rel="noopener noreferrer"&gt;pulsar-load&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration Values&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/blob/main/realtime-platform-1million-events/pulsar-load/helm/pulsar/values.yaml" rel="noopener noreferrer"&gt;values.yaml&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Main Repository&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;RealtimeDataPlatform&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Pulsar Documentation&lt;/strong&gt;: &lt;a href="https://pulsar.apache.org/" rel="noopener noreferrer"&gt;pulsar.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BookKeeper Documentation&lt;/strong&gt;: &lt;a href="https://bookkeeper.apache.org/" rel="noopener noreferrer"&gt;bookkeeper.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS i7i Instances&lt;/strong&gt;: &lt;a href="https://aws.amazon.com/ec2/instance-types/i7i/" rel="noopener noreferrer"&gt;aws.amazon.com/ec2/instance-types/i7i&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Have you implemented high-throughput messaging systems? What challenges did you face with storage I/O optimization? Share your experiences in the comments!&lt;/strong&gt; 👇&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Building real-time data platforms? Follow me for deep dives on performance optimization, distributed systems, and cloud infrastructure!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next in the series&lt;/strong&gt;: "ClickHouse Performance Tuning for 1M Events/Sec Ingestion"&lt;/p&gt;




&lt;h2&gt;
  
  
  🌟 Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Storage architecture matters more than CPU&lt;/strong&gt; for messaging systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVMe device separation&lt;/strong&gt; is critical for predictable latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;journalSyncData: false&lt;/strong&gt; gives 10x performance boost (with trade-offs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right instance type&lt;/strong&gt; beats "more instances" for cost efficiency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring is essential&lt;/strong&gt; - know your bottlenecks before scaling&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #pulsar #apache #aws #performance #messaging #nvme #bookkeeper #streaming #eks #realtimedata&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>performance</category>
      <category>architecture</category>
      <category>aws</category>
    </item>
    <item>
      <title>AWS EKS Enterprise Deployment: Real-Time Data Streaming Platform - 1 Million Events/Sec</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 10:50:13 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/aws-eks-enterprise-deployment-real-time-data-streaming-platform-1-million-eventssec-p2g</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/aws-eks-enterprise-deployment-real-time-data-streaming-platform-1-million-eventssec-p2g</guid>
      <description>&lt;p&gt;When your business processes &lt;strong&gt;millions of events per second&lt;/strong&gt; - think major e-commerce platforms during Black Friday, global payment processors, or IoT fleets with millions of devices - you need infrastructure that doesn't just scale, but performs flawlessly under extreme load.&lt;/p&gt;

&lt;p&gt;In this guide, I'll show you how to deploy an &lt;strong&gt;enterprise-grade event streaming platform on AWS EKS&lt;/strong&gt; that handles &lt;strong&gt;1 million events per second&lt;/strong&gt; using high-performance compute instances, NVMe storage, and battle-tested architectural patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 What We're Building
&lt;/h2&gt;

&lt;p&gt;An enterprise-scale streaming platform that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⚡ Processes &lt;strong&gt;1,000,000+ events per second&lt;/strong&gt; in real-time&lt;/li&gt;
&lt;li&gt;🚀 Uses &lt;strong&gt;high-performance instances&lt;/strong&gt; (c5.4xlarge, i7i.8xlarge, r6id.4xlarge)&lt;/li&gt;
&lt;li&gt;💾 Leverages &lt;strong&gt;NVMe SSD storage&lt;/strong&gt; for ultra-low latency&lt;/li&gt;
&lt;li&gt;☁️ Runs on &lt;strong&gt;AWS EKS&lt;/strong&gt; with production-grade HA&lt;/li&gt;
&lt;li&gt;🌍 Supports &lt;strong&gt;multi-domain&lt;/strong&gt;: E-commerce, Finance, IoT, Gaming at scale&lt;/li&gt;
&lt;li&gt;⏱️ Delivers &lt;strong&gt;sub-second latency&lt;/strong&gt; end-to-end&lt;/li&gt;
&lt;li&gt;📊 Includes &lt;strong&gt;enterprise monitoring&lt;/strong&gt; with Grafana&lt;/li&gt;
&lt;li&gt;🔄 Provides &lt;strong&gt;exactly-once processing&lt;/strong&gt; guarantees&lt;/li&gt;
&lt;li&gt;💰 &lt;strong&gt;AWS infrastructure cost: ~$24,592/month&lt;/strong&gt; (with reserved instances)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💰 Enterprise Infrastructure Investment
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AWS Infrastructure Cost: ~$24,592/month&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This enterprise-grade investment covers high-performance compute instances (c5.4xlarge, i7i.8xlarge, r6id.4xlarge), NVMe SSD storage, enterprise monitoring, and all supporting AWS services required to process 1 million events per second with production-grade reliability. The provided Terraform deploys a single AZ to save on data-transfer costs, but the design is Multi-AZ compatible: we have verified that the Terraform can be changed to support Multi-AZ.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why enterprise instances?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;i7i.8xlarge&lt;/strong&gt;: NVMe SSD for Pulsar (ultra-low latency message storage)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;r6id.4xlarge&lt;/strong&gt;: NVMe SSD for ClickHouse (blazing-fast analytics)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;c5.4xlarge&lt;/strong&gt;: High-performance compute for Flink processing &amp;amp; event generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise HA&lt;/strong&gt;: Multi-AZ deployment compatible, replication, auto-scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🏗️ Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│                  AWS EKS Cluster (us-west-2)                     │
│              benchmark-high-infra (k8s 1.31)                     │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────────┐   ┌──────────────────┐   ┌──────────────┐ │
│  │   PRODUCER      │──▶│     PULSAR       │──▶│    FLINK     │ │
│  │  c5.4xlarge     │   │  i7i.8xlarge     │   │ c5.4xlarge   │ │
│  │                 │   │                  │   │              │ │
│  │ 4 nodes         │   │ ZK + 6 Brokers   │   │ JM + 6 TMs   │ │
│  │ Java/AVRO       │   │ NVMe Storage     │   │ 1M evt/sec   │ │
│  │ 250K evt/sec    │   │ 3.6TB NVMe       │   │ Checkpoints  │ │
│  │ 100K devices    │   │ Ultra-low lat    │   │ Aggregation  │ │
│  └─────────────────┘   └──────────────────┘   └──────┬───────┘ │
│                                                        │         │
│                         ┌──────────────────────────────┘         │
│                         ▼                                        │
│                  ┌──────────────────┐                           │
│                  │   CLICKHOUSE     │                           │
│                  │  r6id.4xlarge    │                           │
│                  │                  │                           │
│                  │  6 Data Nodes    │                           │
│                  │  1 Query Node    │                           │
│                  │  NVMe + EBS      │                           │
│                  │  10K+ queries/s  │                           │
│                  └──────────────────┘                           │
│                                                                  │
│  Supporting: VPC, Single-AZ (Multi-AZ Compatible), S3, ECR, IAM, Auto-scaling         │
└──────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tech Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes&lt;/strong&gt;: AWS EKS 1.31 (Multi-AZ Compatible, HA)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message Broker&lt;/strong&gt;: Apache Pulsar 3.1 (NVMe-backed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream Processing&lt;/strong&gt;: Apache Flink 1.18 (Exactly-once)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytics DB&lt;/strong&gt;: ClickHouse 24.x (NVMe + EBS)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: NVMe SSD (45TB) + EBS gp3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt;: Terraform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Grafana + Prometheus + VictoriaMetrics&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📋 Prerequisites
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install required tools&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;awscli terraform kubectl helm

&lt;span class="c"&gt;# Configure AWS with admin-level access&lt;/span&gt;
aws configure
&lt;span class="c"&gt;# Enter credentials for production account&lt;/span&gt;

&lt;span class="c"&gt;# Verify versions&lt;/span&gt;
terraform &lt;span class="nt"&gt;--version&lt;/span&gt;  &lt;span class="c"&gt;# &amp;gt;= 1.6.0&lt;/span&gt;
kubectl version      &lt;span class="c"&gt;# &amp;gt;= 1.28.0&lt;/span&gt;
helm version         &lt;span class="c"&gt;# &amp;gt;= 3.12.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;AWS Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Admin access to AWS account&lt;/li&gt;
&lt;li&gt;Budget: ~$25,000-33,000/month&lt;/li&gt;
&lt;li&gt;Region: us-west-2 (or your preferred region)&lt;/li&gt;
&lt;li&gt;Service limits increased for:

&lt;ul&gt;
&lt;li&gt;EKS clusters&lt;/li&gt;
&lt;li&gt;EC2 instances (especially i7i.8xlarge, r6id.4xlarge)&lt;/li&gt;
&lt;li&gt;EBS volumes&lt;/li&gt;
&lt;li&gt;Elastic IPs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  🚀 Step-by-Step Deployment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Clone Repository &amp;amp; Review Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/hyperscaledesignhub/RealtimeDataPlatform.git
&lt;span class="nb"&gt;cd &lt;/span&gt;RealtimeDataPlatform/realtime-platform-1million-events

&lt;span class="c"&gt;# Review configuration&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;terraform.tfvars
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Repository structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;realtime-platform-1million-events/
├── terraform/                # Enterprise AWS infrastructure
├── producer-load/            # High-volume event generation
├── pulsar-load/              # Apache Pulsar (NVMe-backed)
├── flink-load/               # Apache Flink enterprise processing
├── clickhouse-load/          # ClickHouse analytics cluster
└── monitoring/               # Enterprise monitoring stack
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Configuration:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# terraform.tfvars&lt;/span&gt;
&lt;span class="nx"&gt;cluster_name&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"benchmark-high-infra"&lt;/span&gt;
&lt;span class="nx"&gt;aws_region&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-west-2"&lt;/span&gt;
&lt;span class="nx"&gt;environment&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt;

&lt;span class="c1"&gt;# High-performance node groups&lt;/span&gt;
&lt;span class="nx"&gt;producer_desired_size&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;          &lt;span class="c1"&gt;# c5.4xlarge&lt;/span&gt;
&lt;span class="nx"&gt;pulsar_zookeeper_desired_size&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="c1"&gt;# t3.medium&lt;/span&gt;
&lt;span class="nx"&gt;pulsar_broker_desired_size&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;     &lt;span class="c1"&gt;# i7i.8xlarge (NVMe)&lt;/span&gt;
&lt;span class="nx"&gt;flink_taskmanager_desired_size&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="c1"&gt;# c5.4xlarge&lt;/span&gt;
&lt;span class="nx"&gt;clickhouse_desired_size&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;        &lt;span class="c1"&gt;# r6id.4xlarge (NVMe)&lt;/span&gt;

&lt;span class="c1"&gt;# Enable all services&lt;/span&gt;
&lt;span class="nx"&gt;enable_flink&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nx"&gt;enable_pulsar&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nx"&gt;enable_clickhouse&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nx"&gt;enable_general_nodes&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Deploy AWS Infrastructure with Terraform
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Initialize Terraform&lt;/span&gt;
terraform init

&lt;span class="c"&gt;# Review infrastructure plan (~$24K-33K/month)&lt;/span&gt;
terraform plan

&lt;span class="c"&gt;# Deploy infrastructure (takes ~20-25 minutes)&lt;/span&gt;
terraform apply &lt;span class="nt"&gt;-auto-approve&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What gets created:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network Layer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ VPC with Single-AZ subnets (10.1.0.0/16)&lt;/li&gt;
&lt;li&gt;✅ 2 NAT Gateways (high availability)&lt;/li&gt;
&lt;li&gt;✅ Internet Gateway&lt;/li&gt;
&lt;li&gt;✅ Route tables and security groups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;EKS Cluster:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Kubernetes 1.31 cluster&lt;/li&gt;
&lt;li&gt;✅ Control plane with HA&lt;/li&gt;
&lt;li&gt;✅ IRSA (IAM Roles for Service Accounts)&lt;/li&gt;
&lt;li&gt;✅ Logging enabled (API, Audit, Authenticator)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Node Groups (9 total):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Producer&lt;/strong&gt;: c5.4xlarge × 4 nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar ZK&lt;/strong&gt;: t3.medium × 3 nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar Broker-Bookie&lt;/strong&gt;: i7i.8xlarge × 6 nodes (3.6TB NVMe)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar Proxy&lt;/strong&gt;: t3.medium × 2 nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink JobManager&lt;/strong&gt;: c5.4xlarge × 1 node&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink TaskManager&lt;/strong&gt;: c5.4xlarge × 6 nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse Data&lt;/strong&gt;: r6id.4xlarge × 6 nodes (1.9TB NVMe each)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse Query&lt;/strong&gt;: r6id.2xlarge × 1 node&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;General&lt;/strong&gt;: t3.medium × 4 nodes&lt;/li&gt;
&lt;/ol&gt;
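Summing the desired sizes of the nine node groups above gives the cluster size to expect from &lt;code&gt;kubectl get nodes&lt;/code&gt; (a quick sanity check; autoscaling can move the exact count):

```shell
# Desired sizes of the nine node groups listed above:
# producer, pulsar-zk, broker-bookie, proxy, flink-jm,
# flink-tm, clickhouse-data, clickhouse-query, general
nodes=(4 3 6 2 1 6 6 1 4)

total=0
for n in "${nodes[@]}"; do
  total=$(( total + n ))
done
echo "Expected node count: ${total}"   # 33
```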

&lt;p&gt;&lt;strong&gt;Storage &amp;amp; Services:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ S3 bucket for Flink checkpoints&lt;/li&gt;
&lt;li&gt;✅ ECR repositories for container images&lt;/li&gt;
&lt;li&gt;✅ EBS CSI driver&lt;/li&gt;
&lt;li&gt;✅ IAM roles and policies&lt;/li&gt;
&lt;li&gt;✅ CloudWatch log groups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Configure kubectl:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws eks update-kubeconfig &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2 &lt;span class="nt"&gt;--name&lt;/span&gt; benchmark-high-infra

&lt;span class="c"&gt;# Verify cluster&lt;/span&gt;
kubectl get nodes
&lt;span class="c"&gt;# Should see ~30 nodes across all groups&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Deploy Apache Pulsar (High-Performance Message Broker)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;pulsar-load

&lt;span class="c"&gt;# Deploy Pulsar with NVMe storage&lt;/span&gt;
./deploy.sh

&lt;span class="c"&gt;# Monitor deployment (~10-15 minutes for all components)&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar &lt;span class="nt"&gt;-w&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this deploys:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ZooKeeper (Metadata Management):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 replicas on t3.medium&lt;/li&gt;
&lt;li&gt;Cluster coordination and metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Broker-BookKeeper (Combined - NVMe):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6 replicas on i7i.8xlarge instances&lt;/li&gt;
&lt;li&gt;Each node: 2 × 3.75 TB NVMe SSD (45 TB across the cluster)&lt;/li&gt;
&lt;li&gt;Message routing + persistence&lt;/li&gt;
&lt;li&gt;Ultra-low latency (~1ms writes)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Proxy (Load Balancing):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2 replicas on c5.2xlarge&lt;/li&gt;
&lt;li&gt;Client connection management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Monitoring Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grafana dashboards&lt;/li&gt;
&lt;li&gt;VictoriaMetrics for metrics&lt;/li&gt;
&lt;li&gt;Prometheus exporters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verify Pulsar cluster:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check all components are running&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar

&lt;span class="c"&gt;# Test Pulsar functionality&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/pulsar-admin topics create persistent://public/default/test-topic

&lt;span class="c"&gt;# Verify topic creation&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/pulsar-admin topics list public/default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Deploy ClickHouse (Enterprise Analytics Database)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ../clickhouse-load

&lt;span class="c"&gt;# Install ClickHouse operator and enterprise cluster&lt;/span&gt;
./00-install-clickhouse.sh

&lt;span class="c"&gt;# Wait for ClickHouse cluster (~5-8 minutes)&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; clickhouse &lt;span class="nt"&gt;-w&lt;/span&gt;

&lt;span class="c"&gt;# Create enterprise database schema&lt;/span&gt;
./00-create-schema-all-replicas.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ClickHouse Enterprise Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;6 Data Nodes&lt;/strong&gt;: r6id.4xlarge with NVMe SSD&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 Query Node&lt;/strong&gt;: r6id.2xlarge for complex analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt;: &lt;code&gt;benchmark&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table&lt;/strong&gt;: &lt;code&gt;sensors_local&lt;/code&gt; (optimized for high-throughput writes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: NVMe SSD + EBS gp3 (enterprise performance)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replication&lt;/strong&gt;: 2x across availability zones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Enterprise Schema Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- High-performance sensor data table using AVRO schema&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sensors_local&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="n"&gt;iot_cluster&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sensorId&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sensorType&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;humidity&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pressure&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;batteryLevel&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="n"&gt;DateTime64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;event_time&lt;/span&gt; &lt;span class="n"&gt;DateTime64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now64&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ReplicatedMergeTree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'/clickhouse/tables/{cluster}/sensors_local'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'{replica}'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;toYYYYMM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;index_granularity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test ClickHouse cluster:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Connect to ClickHouse cluster&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 &lt;span class="nt"&gt;--&lt;/span&gt; clickhouse-client

&lt;span class="c"&gt;# Test cluster connectivity&lt;/span&gt;
SELECT &lt;span class="k"&gt;*&lt;/span&gt; FROM system.clusters WHERE cluster &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'iot_cluster'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c"&gt;# Exit with Ctrl+D&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Deploy Apache Flink (Enterprise Stream Processing)
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;build-and-push.sh&lt;/code&gt; script creates the ECR repository if you don't already have one, pushes the Flink image to it, and prints the Docker image name tagged with the ECR repo.&lt;/p&gt;

&lt;p&gt;Set that image name in the &lt;code&gt;flink-job-deployment.yaml&lt;/code&gt; file before deploying the Flink job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ../flink-load

&lt;span class="c"&gt;# Build and push enterprise Flink image to ECR&lt;/span&gt;
./build-and-push.sh

&lt;span class="c"&gt;# Deploy Flink enterprise cluster&lt;/span&gt;
./deploy.sh

&lt;span class="c"&gt;# Submit high-throughput Flink job&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; flink-job-deployment.yaml

&lt;span class="c"&gt;# Monitor Flink deployment (~3-5 minutes)&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &lt;span class="nt"&gt;-w&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Enterprise Flink Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JobManager&lt;/strong&gt;: c5.4xlarge × 1 (job coordination)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TaskManager&lt;/strong&gt;: c5.4xlarge × 6 (parallel processing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallelism&lt;/strong&gt;: 48 (8 slots × 6 TaskManagers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checkpointing&lt;/strong&gt;: Every 1 minute to S3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State Backend&lt;/strong&gt;: RocksDB with NVMe storage&lt;/li&gt;
&lt;/ul&gt;
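The parallelism figure follows directly from the slot layout, and dividing the 1M events/sec target by it gives the load each subtask must sustain (simple arithmetic, not output from the repo):

```shell
slots_per_tm=8        # task slots per TaskManager
taskmanagers=6        # c5.4xlarge TaskManager nodes
target_rate=1000000   # events/sec end-to-end target

parallelism=$(( slots_per_tm * taskmanagers ))
per_subtask=$(( target_rate / parallelism ))
echo "Parallelism: ${parallelism}, ~${per_subtask} events/sec per subtask"
```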

&lt;p&gt;&lt;strong&gt;Flink Job Configuration:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Enterprise-grade stream processing using SensorData AVRO schema&lt;/span&gt;
&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;SensorRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sensorStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pulsarSource&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;WatermarkStrategy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;forBoundedOutOfOrderness&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Duration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofSeconds&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;)),&lt;/span&gt;
    &lt;span class="s"&gt;"Pulsar Enterprise IoT Source"&lt;/span&gt;
&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// High-throughput processing with 1-minute windows&lt;/span&gt;
&lt;span class="n"&gt;sensorStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;keyBy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSensorId&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TumblingEventTimeWindows&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;aggregate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;EnterpriseAggregator&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ClickHouseJDBCSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clickhouseUrl&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 6: Deploy High-Volume IoT Producer
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ../producer-load

&lt;span class="c"&gt;# Build and deploy enterprise producer&lt;/span&gt;
./deploy-with-partitions.sh &lt;span class="o"&gt;[&lt;/span&gt;PARTITIONS] &lt;span class="o"&gt;[&lt;/span&gt;MIN_REPLICAS] &lt;span class="o"&gt;[&lt;/span&gt;MAX_REPLICAS]

&lt;span class="c"&gt;#First run this script, if flink job is not running then this is just #going to create pulsar topic with partitions of 64.And it is going #to set the storage retention time of 30 minutes &lt;/span&gt;

&lt;span class="c"&gt;#In our case following is the command:&lt;/span&gt;

./deploy-with-partitions.sh 64 1 4



&lt;span class="c"&gt;#Then deploy flink job as mentioned in below sections and come back #here and again run the same command:&lt;/span&gt;

&lt;span class="c"&gt;#This is going to just create only one producer, because we don't #want to bombard cluster with millions of message at the same time&lt;/span&gt;

./deploy-with-partitions.sh 64 1 4

&lt;span class="c"&gt;#After first producer is producing messages consistently then run the &lt;/span&gt;
&lt;span class="c"&gt;#below script which gradually start rest of the producers&lt;/span&gt;

&lt;span class="c"&gt;# Scale producers gradually with a delay of 1 minute until reached to # 4 producers (4 nodes × 250K each)&lt;/span&gt;
./scale-gradually.sh &lt;span class="o"&gt;[&lt;/span&gt;MAX_REPLICAS]
&lt;span class="c"&gt;#In our case following is the command:&lt;/span&gt;

./scale-gradually.sh 4

&lt;span class="c"&gt;# Monitor producer performance&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;iot-producer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Enterprise Producer Capabilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt;: 250,000 events/sec per pod&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale:&lt;/strong&gt; 4 pods (4 × 250K) for 1M+ events/sec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AVRO Schema&lt;/strong&gt;: Enterprise SensorData with optimized integers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Device Simulation&lt;/strong&gt;: 100,000 unique device IDs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Realistic Patterns&lt;/strong&gt;: Battery drain, temperature variations, device lifecycle&lt;/li&gt;
&lt;/ul&gt;
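The per-pod and aggregate numbers above imply a simple sizing rule: divide the target rate by the per-pod throughput and round up. A quick sanity check in shell arithmetic:

```shell
# Back-of-the-envelope producer sizing: ceil(target / per-pod throughput).
TARGET_EPS=1000000      # desired aggregate events/sec
PER_POD_EPS=250000      # throughput of one producer pod (from the list above)

PODS=$(( (TARGET_EPS + PER_POD_EPS - 1) / PER_POD_EPS ))  # ceiling division
echo "pods needed: $PODS"    # 4 pods for 1M events/sec
```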

&lt;h2&gt;
  
  
  📊 Step 7: Verify Enterprise Performance
&lt;/h2&gt;

&lt;p&gt;After all components are deployed (~25-30 minutes total), verify 1M events/sec performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Monitor producer throughput&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;iot-producer &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20 | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"Events produced"&lt;/span&gt;

&lt;span class="c"&gt;# Check Pulsar message ingestion rate&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data

&lt;span class="c"&gt;# Verify Flink processing rate&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark deployment/iot-flink-job &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20

&lt;span class="c"&gt;# Query ClickHouse for ingestion rate&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  clickhouse-client &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"
    SELECT 
        toStartOfMinute(timestamp) as minute,
        COUNT(*) as events_per_minute
    FROM benchmark.sensors_local 
    WHERE timestamp &amp;gt;= now() - INTERVAL 5 MINUTE
    GROUP BY minute 
    ORDER BY minute DESC"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected Performance Metrics:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Producer: 1,000,000+ events/sec generation
✅ Pulsar: Ultra-low latency message ingestion (~1ms)
✅ Flink: Real-time processing
✅ ClickHouse: High-speed data ingestion and sub-second queries

The end-to-end pipeline guarantees exactly-once semantics by using the ReplacingMergeTree engine for the ClickHouse tables
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
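The exactly-once claim leans on ClickHouse's ReplacingMergeTree: rows sharing the same ordering key collapse to one at merge time, so events replayed after a Flink restart do not double-count. A toy shell emulation of that collapse (the key columns here are assumptions based on the sensor schema):

```shell
# Toy emulation of ReplacingMergeTree deduplication: rows with an identical
# key (sensorId, timestamp, value) survive only once, so a replayed batch
# does not inflate counts.
events='1001 1635254400123 24.5
1002 1635254400123 19.8
1001 1635254400123 24.5'   # last row is a replayed duplicate

total=$(printf '%s\n' "$events" | wc -l | tr -d ' ')
unique=$(printf '%s\n' "$events" | sort -u | wc -l | tr -d ' ')
echo "raw=$total deduped=$unique"
```

Note that in ClickHouse the collapse happens asynchronously at merge time, so queries that must see deduplicated data use `FINAL` or aggregate around duplicates.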



&lt;h2&gt;
  
  
  🔍 Enterprise Monitoring and Analytics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Access Enterprise Grafana Dashboard
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set up secure port forwarding&lt;/span&gt;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar svc/grafana 3000:80 &amp;amp;

&lt;span class="c"&gt;# Open enterprise dashboard&lt;/span&gt;
open http://localhost:3000
&lt;span class="c"&gt;# Login: admin/admin123&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzoee7cqlcn5rl29mzyf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzoee7cqlcn5rl29mzyf.png" alt=" " width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise Dashboards:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar Metrics&lt;/strong&gt;: Message rates, storage usage, replication lag&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink Metrics&lt;/strong&gt;: Job health, checkpoint duration, backpressure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse Metrics&lt;/strong&gt;: Query performance, replication status, storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt;: CPU, memory, disk I/O, network across all nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Enterprise Analytics Queries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Connect to ClickHouse enterprise cluster&lt;/span&gt;
&lt;span class="n"&gt;kubectl&lt;/span&gt; &lt;span class="k"&gt;exec&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="n"&gt;clickhouse&lt;/span&gt; &lt;span class="n"&gt;chi&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;iot&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;repl&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;iot&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="c1"&gt;-- clickhouse-client&lt;/span&gt;

&lt;span class="c1"&gt;-- Enterprise-scale analytics using our SensorData AVRO schema&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Real-time throughput monitoring&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;toStartOfMinute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;minute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;events_per_minute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;unique_sensors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batteryLevel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_battery&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;HOUR&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;minute&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;minute&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Enterprise anomaly detection&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sensorType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;batteryLevel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;timestamp&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;batteryLevel&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;MINUTE&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- High-performance aggregations across millions of records&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;sensorType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_readings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p95_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;humidity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_humidity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batteryLevel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;min_battery&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batteryLevel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;max_battery&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sensorType&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_readings&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Enterprise time-series analysis&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;toStartOfHour&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sensorType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hourly_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;stddevPop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;temp_stddev&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="n"&gt;HOUR&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sensorType&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sensorType&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📈 Enterprise Performance Benchmarks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Real-World Enterprise Metrics
&lt;/h3&gt;

&lt;p&gt;On this enterprise-grade setup, you can expect:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Peak Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,000,000+ events/sec&lt;/td&gt;
&lt;td&gt;Sustained with room for 2M+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;End-to-end Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 2 seconds (p99)&lt;/td&gt;
&lt;td&gt;Producer → ClickHouse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 200ms&lt;/td&gt;
&lt;td&gt;Complex aggregations on 1B+ records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Write Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 1ms&lt;/td&gt;
&lt;td&gt;Pulsar NVMe storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Utilization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70-80%&lt;/td&gt;
&lt;td&gt;Optimized across all instances&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~85%&lt;/td&gt;
&lt;td&gt;High-memory instances (r6id)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage IOPS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50,000+&lt;/td&gt;
&lt;td&gt;NVMe SSD performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;99.95%+&lt;/td&gt;
&lt;td&gt;Single-AZ enterprise deployment (can be switched to Multi-AZ in Terraform with the same performance)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
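The throughput and latency rows can be cross-checked with Little's Law: at arrival rate λ and end-to-end latency W, roughly λ·W events are in flight across the pipeline (Pulsar backlog, Flink state, insert buffers) at any instant. A quick back-of-the-envelope:

```shell
# Little's Law sanity check for the benchmark table: L = rate * latency.
RATE=1000000   # events/sec (peak throughput row)
LATENCY_S=2    # end-to-end p99 latency in seconds

IN_FLIGHT=$((RATE * LATENCY_S))
echo "approx in-flight events at p99: $IN_FLIGHT"   # 2,000,000
```

Two million buffered events is well within what the 64-partition Pulsar topic and NVMe-backed brokers absorb, which is why the pipeline sustains the peak rate without backpressure.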

&lt;h3&gt;
  
  
  Enterprise Use Cases Supported
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;E-Commerce at Scale:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Black Friday traffic: 10M+ orders/hour&lt;/li&gt;
&lt;li&gt;Real-time inventory across 1000+ warehouses&lt;/li&gt;
&lt;li&gt;Personalization for 100M+ users&lt;/li&gt;
&lt;li&gt;Fraud detection on every transaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Financial Services:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-frequency trading: microsecond latency&lt;/li&gt;
&lt;li&gt;Risk calculations on 1M+ portfolios&lt;/li&gt;
&lt;li&gt;Real-time compliance monitoring&lt;/li&gt;
&lt;li&gt;Market data processing at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;IoT Enterprise:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fleet management: 1M+ connected vehicles&lt;/li&gt;
&lt;li&gt;Smart city infrastructure: millions of sensors&lt;/li&gt;
&lt;li&gt;Industrial IoT: factory-wide monitoring&lt;/li&gt;
&lt;li&gt;Predictive maintenance at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🛠️ Enterprise Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  High-Load Performance Issues
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check node resource utilization&lt;/span&gt;
kubectl top nodes | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-k3&lt;/span&gt; &lt;span class="nt"&gt;-nr&lt;/span&gt;

&lt;span class="c"&gt;# Identify resource bottlenecks&lt;/span&gt;
kubectl describe nodes | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A5&lt;/span&gt; &lt;span class="s2"&gt;"Allocated resources"&lt;/span&gt;

&lt;span class="c"&gt;# Scale TaskManagers for higher throughput&lt;/span&gt;
kubectl scale deployment flink-taskmanager &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;12

&lt;span class="c"&gt;# Monitor Flink backpressure&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &amp;lt;jobmanager-pod&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  flink list &lt;span class="nt"&gt;-r&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
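To turn the `kubectl top nodes` output above into an action, it helps to flag nodes exceeding the ~80% CPU target from the benchmarks table. A small sketch using captured sample output (an assumption stands in for live data; pipe real `kubectl top nodes` output through the same awk filter):

```shell
# Flag nodes whose CPU% exceeds a threshold, from `kubectl top nodes`-style
# output. Sample data here; replace with: kubectl top nodes | awk ...
sample='NAME        CPU(cores)  CPU%  MEMORY(bytes)  MEMORY%
node-a      7200m       90%   55Gi           70%
node-b      3100m       39%   30Gi           40%'

hot=$(printf '%s\n' "$sample" | awk 'NR>1 { gsub(/%/,"",$3); if ($3+0 > 80) print $1 }')
echo "hot nodes: $hot"
```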



&lt;h3&gt;
  
  
  NVMe Storage Performance
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check NVMe disk performance&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  iostat &lt;span class="nt"&gt;-x&lt;/span&gt; 1 5

&lt;span class="c"&gt;# Monitor ClickHouse storage usage&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  clickhouse-client &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"
    SELECT 
        name,
        total_space,
        free_space,
        (total_space - free_space) / total_space * 100 as usage_percent
    FROM system.disks"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
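The `usage_percent` expression in that query is plain arithmetic; the same calculation in shell, with hypothetical byte counts for one NVMe volume:

```shell
# Same usage_percent calculation as the system.disks query:
# (total - free) / total * 100, with made-up byte counts.
TOTAL=1000000000000   # 1 TB volume
FREE=250000000000     # 250 GB free

USED=$((TOTAL - FREE))
PCT=$((USED * 100 / TOTAL))
echo "usage: ${PCT}%"   # 75%
```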



&lt;h3&gt;
  
  
  Network Performance Optimization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check inter-pod network latency&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  ping &lt;span class="nt"&gt;-c&lt;/span&gt; 5 flink-jobmanager.flink-benchmark.svc.cluster.local

&lt;span class="c"&gt;# Monitor network bandwidth&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &amp;lt;taskmanager-pod&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  iftop &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧹 Enterprise Cleanup
&lt;/h2&gt;

&lt;p&gt;When decommissioning the enterprise setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Graceful shutdown of applications&lt;/span&gt;
kubectl delete namespace iot-pipeline flink-benchmark

&lt;span class="c"&gt;# Backup critical data before destroying infrastructure&lt;/span&gt;
./backup-clickhouse.sh
./backup-flink-savepoints.sh

&lt;span class="c"&gt;# Destroy AWS infrastructure&lt;/span&gt;
terraform destroy
&lt;span class="c"&gt;# Type 'yes' when prompted&lt;/span&gt;

&lt;span class="c"&gt;# Verify all resources are cleaned up&lt;/span&gt;
aws ec2 describe-instances &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="s2"&gt;"Name=tag:kubernetes.io/cluster/benchmark-high-infra,Values=owned"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;⚠️ Enterprise Warning:&lt;/strong&gt; Ensure all critical data is backed up before destruction!&lt;/p&gt;

&lt;h2&gt;
  
  
  💡 Enterprise Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Cost Optimization with Reserved Instances&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Purchase 3-year reserved instances for 26% savings&lt;/span&gt;
&lt;span class="c"&gt;# Target instances: i7i.8xlarge, r6id.4xlarge, c5.4xlarge&lt;/span&gt;

&lt;span class="c"&gt;# AWS Console → EC2 → Reserved Instances → Purchase&lt;/span&gt;
&lt;span class="c"&gt;# - Term: 3 years&lt;/span&gt;
&lt;span class="c"&gt;# - Payment: All upfront (max discount)&lt;/span&gt;
&lt;span class="c"&gt;# - Instance type: i7i.8xlarge, r6id.4xlarge&lt;/span&gt;
&lt;span class="c"&gt;# - Quantity: Match your desired_size&lt;/span&gt;

&lt;span class="c"&gt;# Savings: $33,016 → $24,592/month (26% off)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
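The quoted figures can be sanity-checked: $33,016 minus $24,592 leaves $8,424/month saved, which rounds to the advertised 26%:

```shell
# Verify the reserved-instance savings math from the comment above.
ON_DEMAND=33016   # $/month on demand
RESERVED=24592    # $/month with 3-year all-upfront RIs

SAVED=$((ON_DEMAND - RESERVED))
# Rounded percentage: (saved*100 + half of on_demand) / on_demand
PCT=$(( (SAVED * 100 + ON_DEMAND / 2) / ON_DEMAND ))
echo "monthly savings: \$$SAVED (~${PCT}%)"
```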



&lt;h3&gt;
  
  
  2. &lt;strong&gt;Enterprise Backup Strategy&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Automated EBS snapshots&lt;/span&gt;
aws backup create-backup-plan &lt;span class="nt"&gt;--backup-plan-name&lt;/span&gt; daily-snapshots

&lt;span class="c"&gt;# ClickHouse enterprise backups to S3&lt;/span&gt;
clickhouse-backup create
clickhouse-backup upload

&lt;span class="c"&gt;# Flink savepoints for exactly-once recovery&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &amp;lt;jm-pod&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  flink savepoint &amp;lt;job-id&amp;gt; s3://benchmark-high-infra-state/savepoints
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Enterprise Alerting&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CloudWatch Alarms for enterprise monitoring&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CPU &amp;gt; 80% sustained for 5 minutes&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Disk usage &amp;gt; 85%&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Pod crash loops &amp;gt; 3 in 10 minutes&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Flink checkpoint failures&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Pulsar consumer lag &amp;gt; 1M messages&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ClickHouse replication lag &amp;gt; 5 minutes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. &lt;strong&gt;Disaster Recovery Implementation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Multi-Region Setup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy identical stack in secondary region&lt;/span&gt;
aws_region &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
cluster_name &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"benchmark-high-infra-dr"&lt;/span&gt;

&lt;span class="c"&gt;# Use Pulsar geo-replication&lt;/span&gt;
bin/pulsar-admin namespaces set-clusters public/default &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--clusters&lt;/span&gt; us-west-2,us-east-1

&lt;span class="c"&gt;# ClickHouse cross-region replication&lt;/span&gt;
CREATE TABLE benchmark.sensors_replicated
ENGINE &lt;span class="o"&gt;=&lt;/span&gt; ReplicatedMergeTree&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'/clickhouse/tables/{cluster}/sensors'&lt;/span&gt;, &lt;span class="s1"&gt;'{replica}'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Enterprise Recovery Objectives:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RTO (Recovery Time Objective): &amp;lt; 1 hour&lt;/li&gt;
&lt;li&gt;RPO (Recovery Point Objective): &amp;lt; 5 minutes&lt;/li&gt;
&lt;li&gt;Automated daily backups to S3&lt;/li&gt;
&lt;li&gt;Cross-region replication for critical data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. &lt;strong&gt;Cost Monitoring and Governance&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set up AWS Cost Explorer with enterprise tags&lt;/span&gt;
&lt;span class="c"&gt;# Tag all resources:&lt;/span&gt;
&lt;span class="c"&gt;# - Environment: production&lt;/span&gt;
&lt;span class="c"&gt;# - Project: streaming-platform&lt;/span&gt;
&lt;span class="c"&gt;# - Team: data-engineering&lt;/span&gt;
&lt;span class="c"&gt;# - CostCenter: engineering&lt;/span&gt;

&lt;span class="c"&gt;# Create enterprise budget alert&lt;/span&gt;
aws budgets create-budget &lt;span class="nt"&gt;--budget&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--account-id&lt;/span&gt; 123456789 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--budget-name&lt;/span&gt; streaming-platform-monthly &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--budget-limit&lt;/span&gt; &lt;span class="nv"&gt;Amount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;30000,Unit&lt;span class="o"&gt;=&lt;/span&gt;USD

&lt;span class="c"&gt;# Alert if cost &amp;gt; $30K/month&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎓 What You've Built
&lt;/h2&gt;

&lt;p&gt;By following this guide, you've deployed:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Enterprise-grade infrastructure&lt;/strong&gt; handling 1M events/sec&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;High-performance compute&lt;/strong&gt; with NVMe storage&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Exactly-once processing&lt;/strong&gt; with Flink checkpointing&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Multi-AZ Compatible high availability&lt;/strong&gt; with auto-recovery&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Production monitoring&lt;/strong&gt; with Grafana dashboards&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Auto-scaling&lt;/strong&gt; for dynamic workloads&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Security &amp;amp; compliance&lt;/strong&gt; with encryption and RBAC&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Cost optimization&lt;/strong&gt; with reserved instances  &lt;/p&gt;
&lt;h2&gt;
  
  
  🚀 Next Steps
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. &lt;strong&gt;Customize for Your Enterprise Domain&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;E-Commerce (High Scale):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Order events at 1M/sec using AVRO schema&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s"&gt;"order_id"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"ORD-1234567"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"customer_id"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"CUST-99999"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"items"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;[...],&lt;/span&gt;
  &lt;span class="s"&gt;"total_amount"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1299.99&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"timestamp"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"2025-10-26T10:00:00Z"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Finance (Trading):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Market data at 1M/sec&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s"&gt;"symbol"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"AAPL"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"price"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;175.50&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"volume"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"exchange"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"NASDAQ"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
  &lt;span class="s"&gt;"timestamp"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"2025-10-26T10:00:00.123Z"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;IoT (Massive Scale):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Sensor telemetry from millions of devices&lt;/span&gt;
&lt;span class="c1"&gt;// Using our optimized SensorData AVRO schema&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s"&gt;"sensorId"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000001&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"sensorType"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// temperature sensor&lt;/span&gt;
  &lt;span class="s"&gt;"temperature"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;24.5&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"humidity"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;68.2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"pressure"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1013.25&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"batteryLevel"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;87.5&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// online&lt;/span&gt;
  &lt;span class="s"&gt;"timestamp"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1635254400123&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
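&lt;p&gt;Why AVRO at this scale? Binary encodings are dramatically smaller than JSON text. Avro's real encoding is varint-based and schema-driven; the fixed-width &lt;code&gt;struct&lt;/code&gt; layout below is only a stand-in to show the size gap for the exact event above:&lt;/p&gt;

```python
import json
import struct

# The IoT event above, as a Python dict
record = {
    "sensorId": 1000001, "sensorType": 1, "temperature": 24.5,
    "humidity": 68.2, "pressure": 1013.25, "batteryLevel": 87.5,
    "status": 1, "timestamp": 1635254400123,
}

json_size = len(json.dumps(record).encode("utf-8"))

# Fixed-width binary layout standing in for Avro's compact encoding:
# long, int, 4 doubles, int, long -> 8 + 4 + 32 + 4 + 8 = 56 bytes
binary = struct.pack(
    "<qiddddiq",
    record["sensorId"], record["sensorType"], record["temperature"],
    record["humidity"], record["pressure"], record["batteryLevel"],
    record["status"], record["timestamp"],
)
print(f"JSON: {json_size} bytes, binary: {len(binary)} bytes")
```

&lt;p&gt;Multiplied by a million events per second, that difference is the gap between saturating your network and having comfortable headroom.&lt;/p&gt;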



&lt;h3&gt;
  
  
  2. &lt;strong&gt;Implement Advanced Enterprise Analytics&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Real-time anomaly detection&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;anomaly_detection&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;stddevPop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;stddev_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;avg_temp&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;stddev_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;is_anomaly&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Enterprise windowed aggregations&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;hourly_metrics&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;toStartOfHour&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;max_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;min_temp&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
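&lt;p&gt;The 3-sigma rule behind the anomaly view is worth seeing in isolation. A plain-Python sketch using the same threshold as the SQL above (&lt;code&gt;statistics.pstdev&lt;/code&gt; matches ClickHouse's &lt;code&gt;stddevPop&lt;/code&gt;):&lt;/p&gt;

```python
from statistics import mean, pstdev

def is_anomaly(readings, value, sigmas=3.0):
    """Flag value as anomalous if it exceeds mean + sigmas * population
    stddev, mirroring the avg/stddevPop threshold in the view above."""
    avg = mean(readings)
    sd = pstdev(readings)  # population stddev, like ClickHouse stddevPop
    return value > avg + sigmas * sd

normal = [24.1, 24.5, 24.3, 24.7, 24.4, 24.6, 24.2]
print(is_anomaly(normal, 24.9))  # within 3 sigma of the history
print(is_anomaly(normal, 30.0))  # far outside it
```

&lt;p&gt;The materialized view simply pushes this computation down into ClickHouse so it runs incrementally as rows arrive.&lt;/p&gt;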



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Add Machine Learning at Scale&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Real-time ML inference with Flink
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyflink.datastream&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StreamExecutionEnvironment&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyflink.ml&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KMeans&lt;/span&gt;

&lt;span class="c1"&gt;# Load trained model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://models/anomaly-detection&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Apply to 1M events/sec stream
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sensor_stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
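&lt;p&gt;Framework details aside, the pattern is always the same: load the model once per worker, then apply it per event in a map. A framework-free Python version of that pattern, where the scoring function is a hypothetical stand-in for a real trained model:&lt;/p&gt;

```python
def load_model():
    """Stand-in for loading a trained model from object storage,
    done once per worker rather than once per event."""
    threshold = 28.0  # hypothetical learned decision boundary
    return lambda event: {"sensorId": event["sensorId"],
                          "anomaly": event["temperature"] > threshold}

model = load_model()  # loaded once, reused for every event

events = [
    {"sensorId": 1, "temperature": 24.5},
    {"sensorId": 2, "temperature": 31.2},
]
predictions = list(map(model, events))  # per-event inference, as in stream.map(...)
print(predictions)
```

&lt;p&gt;In Flink this once-per-worker load typically lives in a rich function's &lt;code&gt;open()&lt;/code&gt; hook, so model deserialization never sits on the per-event hot path.&lt;/p&gt;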



&lt;h3&gt;
  
  
  4. &lt;strong&gt;Expand to Multi-Region Enterprise&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy to additional regions for global presence&lt;/span&gt;
&lt;span class="c"&gt;# us-west-2 (primary)&lt;/span&gt;
&lt;span class="c"&gt;# us-east-1 (DR)&lt;/span&gt;
&lt;span class="c"&gt;# eu-west-1 (Europe)&lt;/span&gt;
&lt;span class="c"&gt;# ap-southeast-1 (Asia)&lt;/span&gt;

&lt;span class="c"&gt;# Enable Pulsar geo-replication&lt;/span&gt;
&lt;span class="c"&gt;# Configure ClickHouse distributed tables&lt;/span&gt;
&lt;span class="c"&gt;# Use Route53 for global load balancing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
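&lt;p&gt;Global load balancing ultimately reduces to routing each client to the healthy region with the lowest latency, which is what Route53 latency-based routing automates. A toy Python version, with latency figures that are purely illustrative:&lt;/p&gt;

```python
def pick_region(latency_ms, healthy):
    """Choose the healthy region with the lowest measured latency."""
    candidates = {r: ms for r, ms in latency_ms.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)

# Hypothetical client-side latency measurements (ms)
latency = {"us-west-2": 35, "us-east-1": 80,
           "eu-west-1": 140, "ap-southeast-1": 210}

print(pick_region(latency, {"us-west-2", "us-east-1"}))
# Failover: when the primary drops out, traffic shifts to the DR region
print(pick_region(latency, {"us-east-1", "eu-west-1"}))
```

&lt;p&gt;Health checks feed the &lt;code&gt;healthy&lt;/code&gt; set; Pulsar geo-replication is what makes the failover region actually have the data.&lt;/p&gt;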



&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise Repository&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events" rel="noopener noreferrer"&gt;realtime-platform-1million-events&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Main Repository&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;RealtimeDataPlatform&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS EKS Best Practices&lt;/strong&gt;: &lt;a href="https://aws.github.io/aws-eks-best-practices/" rel="noopener noreferrer"&gt;aws.github.io/aws-eks-best-practices&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Flink Production Guide&lt;/strong&gt;: &lt;a href="https://flink.apache.org/deployment" rel="noopener noreferrer"&gt;flink.apache.org/deployment&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Pulsar Operations&lt;/strong&gt;: &lt;a href="https://pulsar.apache.org/docs/administration-pulsar-manager" rel="noopener noreferrer"&gt;pulsar.apache.org/docs/administration-pulsar-manager&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse Operations&lt;/strong&gt;: &lt;a href="https://clickhouse.com/docs/operations" rel="noopener noreferrer"&gt;clickhouse.com/docs/operations&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💬 Conclusion
&lt;/h2&gt;

&lt;p&gt;You now have an &lt;strong&gt;enterprise-grade, production-ready streaming platform&lt;/strong&gt; processing &lt;strong&gt;1 million events per second&lt;/strong&gt; on AWS! This setup demonstrates real-world architecture patterns used by Fortune 500 companies processing billions of events per day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Achievements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🚀 &lt;strong&gt;1M events/sec throughput&lt;/strong&gt; with room to scale to 2M+&lt;/li&gt;
&lt;li&gt;⚡ &lt;strong&gt;Sub-second typical latency&lt;/strong&gt; end-to-end (&amp;lt; 2 seconds at p99)
&lt;/li&gt;
&lt;li&gt;💪 &lt;strong&gt;Enterprise HA&lt;/strong&gt; with multi-AZ deployment and auto-recovery&lt;/li&gt;
&lt;li&gt;💰 &lt;strong&gt;Cost-optimized&lt;/strong&gt; at $24,592/month (with reserved instances)&lt;/li&gt;
&lt;li&gt;🔒 &lt;strong&gt;Production-secure&lt;/strong&gt; with encryption and compliance&lt;/li&gt;
&lt;li&gt;📊 &lt;strong&gt;Observable&lt;/strong&gt; with comprehensive monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This platform can handle:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Black Friday e-commerce traffic (millions of orders/hour)&lt;/li&gt;
&lt;li&gt;Global payment processing (thousands of transactions/sec)&lt;/li&gt;
&lt;li&gt;IoT fleets (millions of devices sending data)&lt;/li&gt;
&lt;li&gt;Real-time gaming analytics (millions of player events)&lt;/li&gt;
&lt;li&gt;Financial market data (high-frequency trading)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Enterprise benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NVMe storage&lt;/strong&gt; for ultra-low latency message persistence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-performance instances&lt;/strong&gt; optimized for streaming workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AVRO schema optimization&lt;/strong&gt; for efficient serialization at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-AZ deployment&lt;/strong&gt; ensuring 99.95%+ availability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exactly-once processing&lt;/strong&gt; guarantees for financial-grade accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What enterprise use case would you build on this platform? Share in the comments!&lt;/strong&gt; 👇&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building enterprise data platforms? Follow me for deep dives on real-time streaming, cloud architecture, and production system design!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next in the series&lt;/strong&gt;: "Multi-Region Deployment - Global Real-Time Data Platform"&lt;/p&gt;




&lt;h2&gt;
  
  
  🌟 Enterprise Support
&lt;/h2&gt;

&lt;p&gt;⭐ &lt;strong&gt;Production-tested&lt;/strong&gt; - Handles 1M+ events/sec in real deployments&lt;br&gt;
🏢 &lt;strong&gt;Enterprise-ready&lt;/strong&gt; - Multi-AZ, HA, DR, compliance&lt;br&gt;
📖 &lt;strong&gt;Fully documented&lt;/strong&gt; - Complete runbooks and guides&lt;br&gt;
🔧 &lt;strong&gt;Professional support&lt;/strong&gt; - Available for production deployments&lt;br&gt;
💼 &lt;strong&gt;Consulting&lt;/strong&gt; - Custom implementation and optimization&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 Enterprise Performance Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Peak Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,000,000 events/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;End-to-End Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 2 seconds (p99)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monthly Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$24,592 (reserved instances)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;99.95% (multi-AZ)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Retention&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30 days (configurable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 200ms (complex aggregations)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;250K → 2M+ events/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Recovery Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 1 hour (DR failover)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #aws #eks #enterprise #streaming #dataengineering #pulsar #flink #clickhouse #production #avro #realtimeanalytics #nvme&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>performance</category>
      <category>architecture</category>
      <category>aws</category>
    </item>
    <item>
      <title>AWS EKS Deployment: Real-Time Data Streaming Platform - 50K Events/Sec for $1,250/Month</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 10:41:18 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/aws-eks-deployment-real-time-data-streaming-platform-50k-eventssec-for-1250month-1ljf</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/aws-eks-deployment-real-time-data-streaming-platform-50k-eventssec-for-1250month-1ljf</guid>
      <description>&lt;p&gt;Building real-time streaming platforms can be expensive. Most production setups cost thousands of dollars per month. But what if you need to process &lt;strong&gt;50,000 events per second&lt;/strong&gt; at a reasonable cost?&lt;/p&gt;

&lt;p&gt;In this guide, I'll show you how to deploy a &lt;strong&gt;production-grade event streaming platform on AWS EKS&lt;/strong&gt; that costs &lt;strong&gt;$1,250/month&lt;/strong&gt; while handling moderate-scale real-time data processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 What We're Building
&lt;/h2&gt;

&lt;p&gt;A complete, production-ready streaming platform that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Processes &lt;strong&gt;50,000 events per second&lt;/strong&gt; in real-time&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;AWS infrastructure cost: ~$1,250/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;✅ Uses &lt;strong&gt;t3 instance types&lt;/strong&gt; for cost efficiency&lt;/li&gt;
&lt;li&gt;✅ Runs on &lt;strong&gt;AWS EKS&lt;/strong&gt; with managed Kubernetes&lt;/li&gt;
&lt;li&gt;✅ Supports &lt;strong&gt;multiple domains&lt;/strong&gt;: E-commerce, Finance, IoT, Gaming, Logistics&lt;/li&gt;
&lt;li&gt;✅ Provides &lt;strong&gt;real-time analytics&lt;/strong&gt; with sub-second latency&lt;/li&gt;
&lt;li&gt;✅ Includes &lt;strong&gt;monitoring dashboards&lt;/strong&gt; with Grafana&lt;/li&gt;
&lt;li&gt;✅ Offers &lt;strong&gt;easy scalability&lt;/strong&gt; to 1M events/sec if needed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💰 Infrastructure Cost
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AWS Infrastructure Cost: ~$1,250/month&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This includes all compute instances (t3.medium to t3.xlarge), EKS cluster management, storage (EBS gp3), networking (NAT Gateway, Load Balancer), and monitoring services required for a production-ready 50K events/sec streaming platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  🏗️ Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────────────────────┐
│                    AWS EKS Cluster (us-west-2)                 │
│                    bench-low-infra (k8s 1.31)                  │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  ┌──────────────┐    ┌──────────────┐    ┌─────────────────┐ │
│  │   PRODUCER   │───▶│    PULSAR    │───▶│      FLINK      │ │
│  │  (t3.medium) │    │  (t3.large)  │    │   (t3.large)    │ │
│  │              │    │              │    │                 │ │
│  │ Java/AVRO    │    │ ZK+Broker+BK │    │ JobManager +    │ │
│  │ 1K msg/sec   │    │ EBS Storage  │    │ TaskManager     │ │
│  │ per pod      │    │ 50K msg/sec  │    │ 1-min windows   │ │
│  └──────────────┘    └──────────────┘    └────────┬────────┘ │
│                                                     │          │
│                      ┌──────────────────────────────┘          │
│                      ▼                                         │
│               ┌─────────────────┐                             │
│               │   CLICKHOUSE    │                             │
│               │  (t3.xlarge)    │                             │
│               │                 │                             │
│               │  EBS Storage    │                             │
│               │  Analytics DB   │                             │
│               └─────────────────┘                             │
│                                                                │
│  Supporting: VPC, S3 (checkpoints), ECR, IAM, EBS CSI        │
└────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tech Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes&lt;/strong&gt;: AWS EKS 1.31&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message Broker&lt;/strong&gt;: Apache Pulsar 3.1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream Processing&lt;/strong&gt;: Apache Flink 1.18&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytics DB&lt;/strong&gt;: ClickHouse 24.x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: EBS gp3 (cost-optimized, no expensive NVMe)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt;: Terraform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Grafana + Prometheus&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📋 Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before starting, ensure you have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install required tools (macOS)&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;awscli terraform kubectl helm

&lt;span class="c"&gt;# Or on Linux&lt;/span&gt;
&lt;span class="c"&gt;# Install AWS CLI&lt;/span&gt;
curl &lt;span class="s2"&gt;"https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"awscliv2.zip"&lt;/span&gt;
unzip awscliv2.zip
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./aws/install

&lt;span class="c"&gt;# Install Terraform&lt;/span&gt;
wget https://releases.hashicorp.com/terraform/1.6.0/terraform_1.6.0_linux_amd64.zip
unzip terraform_1.6.0_linux_amd64.zip
&lt;span class="nb"&gt;sudo mv &lt;/span&gt;terraform /usr/local/bin/

&lt;span class="c"&gt;# Configure AWS credentials&lt;/span&gt;
aws configure
&lt;span class="c"&gt;# Enter: AWS Access Key, Secret Key, Region (us-west-2), Output format (json)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Required AWS Permissions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EC2, VPC, EKS, S3, IAM, ECR (full access)&lt;/li&gt;
&lt;li&gt;Estimated: ~$1,250/month for complete setup&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🚀 Step-by-Step Deployment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Clone the Repository
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/hyperscaledesignhub/RealtimeDataPlatform.git
&lt;span class="nb"&gt;cd &lt;/span&gt;RealtimeDataPlatform/realtime-platform-50k-events
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Repository structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;realtime-platform-50k-events/
├── terraform/                # AWS infrastructure
├── producer-load/            # Event data generation
├── pulsar-load/              # Apache Pulsar deployment
├── flink-load/               # Apache Flink processing
├── clickhouse-load/          # ClickHouse analytics DB
└── monitoring/               # Grafana dashboards
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Deploy AWS Infrastructure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Initialize Terraform&lt;/span&gt;
terraform init

&lt;span class="c"&gt;# Review what will be created&lt;/span&gt;
terraform plan

&lt;span class="c"&gt;# Deploy infrastructure (takes ~15-20 minutes)&lt;/span&gt;
terraform apply
&lt;span class="c"&gt;# Type 'yes' when prompted&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What gets created:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ VPC with public/private subnets (10.1.0.0/16)&lt;/li&gt;
&lt;li&gt;✅ EKS cluster &lt;code&gt;bench-low-infra&lt;/code&gt; (k8s 1.31)&lt;/li&gt;
&lt;li&gt;✅ 9 node groups with t3 instances:

&lt;ul&gt;
&lt;li&gt;Producer: t3.medium (1 node, scales to 3)&lt;/li&gt;
&lt;li&gt;Pulsar ZooKeeper: t3.small (3 nodes)&lt;/li&gt;
&lt;li&gt;Pulsar Broker: t3.large (3 nodes)&lt;/li&gt;
&lt;li&gt;Pulsar BookKeeper: t3.large (4 nodes)&lt;/li&gt;
&lt;li&gt;Pulsar Proxy: t3.small (2 nodes)&lt;/li&gt;
&lt;li&gt;Flink JobManager: t3.large (1 node)&lt;/li&gt;
&lt;li&gt;Flink TaskManager: t3.large (6 nodes)&lt;/li&gt;
&lt;li&gt;ClickHouse: t3.xlarge (4 nodes)&lt;/li&gt;
&lt;li&gt;General: t3.small (1 node)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;✅ S3 bucket for Flink checkpoints&lt;/li&gt;

&lt;li&gt;✅ ECR repositories for container images&lt;/li&gt;

&lt;li&gt;✅ IAM roles and policies&lt;/li&gt;

&lt;li&gt;✅ EBS CSI driver for persistent volumes&lt;/li&gt;

&lt;li&gt;✅ NAT Gateway for internet access&lt;/li&gt;

&lt;/ul&gt;
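&lt;p&gt;A quick sanity check on the compute portion of the bill, using the node counts above. The hourly rates below are approximate us-west-2 on-demand prices (they drift over time), and EKS control plane, EBS, and NAT Gateway charges are excluded:&lt;/p&gt;

```python
# Back-of-the-envelope compute cost for the node groups listed above.
HOURLY_USD = {"t3.small": 0.0208, "t3.medium": 0.0416,
              "t3.large": 0.0832, "t3.xlarge": 0.1664}
NODE_COUNTS = {"t3.medium": 1,   # producer
               "t3.small": 6,    # 3 ZooKeeper + 2 proxy + 1 general
               "t3.large": 14,   # 3 broker + 4 BookKeeper + 7 Flink
               "t3.xlarge": 4}   # ClickHouse

HOURS_PER_MONTH = 730  # average hours in a month
compute = sum(HOURLY_USD[t] * n for t, n in NODE_COUNTS.items()) * HOURS_PER_MONTH
print(f"on-demand compute ≈ ${compute:,.0f}/month")
```

&lt;p&gt;On-demand compute alone lands near $1,450 with these assumptions; reserved or savings-plan pricing and scaling down idle node groups are what pull the all-in bill toward the ~$1,250 figure.&lt;/p&gt;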

&lt;p&gt;&lt;strong&gt;Configure kubectl:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws eks update-kubeconfig &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2 &lt;span class="nt"&gt;--name&lt;/span&gt; bench-low-infra

&lt;span class="c"&gt;# Verify nodes are ready&lt;/span&gt;
kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-1-x-x.us-west-2.compute.internal      Ready    &amp;lt;none&amp;gt;   5m    v1.31.0-eks-...
ip-10-1-x-x.us-west-2.compute.internal      Ready    &amp;lt;none&amp;gt;   5m    v1.31.0-eks-...
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Deploy Apache Pulsar (Message Broker)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;pulsar-load

&lt;span class="c"&gt;# Deploy Pulsar with Helm&lt;/span&gt;
./deploy.sh

&lt;span class="c"&gt;# Monitor deployment (takes ~5-10 minutes)&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar &lt;span class="nt"&gt;-w&lt;/span&gt;
&lt;span class="c"&gt;# Press Ctrl+C when all pods are Running&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this deploys:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ZooKeeper: 3 replicas (metadata management)&lt;/li&gt;
&lt;li&gt;Broker: 3 replicas (message routing)&lt;/li&gt;
&lt;li&gt;BookKeeper: 4 replicas (message storage on EBS)&lt;/li&gt;
&lt;li&gt;Proxy: 2 replicas (load balancing)&lt;/li&gt;
&lt;li&gt;Grafana: Monitoring dashboard&lt;/li&gt;
&lt;li&gt;Victoria Metrics: Metrics storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verify Pulsar is healthy:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"zookeeper|broker|bookkeeper"&lt;/span&gt;

&lt;span class="c"&gt;# All pods should show "Running" status&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Deploy ClickHouse (Analytics Database)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ../clickhouse-load

&lt;span class="c"&gt;# Install ClickHouse operator and cluster&lt;/span&gt;
./00-install-clickhouse.sh

&lt;span class="c"&gt;# Wait for ClickHouse pods (~3-5 minutes)&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; clickhouse &lt;span class="nt"&gt;-w&lt;/span&gt;

&lt;span class="c"&gt;# Create database schema&lt;/span&gt;
./00-create-schema-all-replicas.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this creates:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ClickHouse cluster: 4 nodes (2 shards × 2 replicas)&lt;/li&gt;
&lt;li&gt;Database: &lt;code&gt;benchmark&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Table: &lt;code&gt;sensors_local&lt;/code&gt; (optimized for IoT sensor data)&lt;/li&gt;
&lt;li&gt;Storage: EBS gp3 volumes (200 GB per node)&lt;/li&gt;
&lt;li&gt;Retention: 30 days TTL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Test ClickHouse:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Connect to ClickHouse&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 &lt;span class="nt"&gt;--&lt;/span&gt; clickhouse-client

&lt;span class="c"&gt;# Run test query&lt;/span&gt;
SELECT version&lt;span class="o"&gt;()&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c"&gt;# Exit with Ctrl+D&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Deploy Apache Flink (Stream Processing)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ../flink-load

&lt;span class="c"&gt;# Build and push Flink consumer image to ECR&lt;/span&gt;
./build-and-push.sh

&lt;span class="c"&gt;# Deploy Flink cluster&lt;/span&gt;
./deploy.sh

&lt;span class="c"&gt;# Submit Flink job&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; flink-job-deployment.yaml

&lt;span class="c"&gt;# Monitor Flink job (~2-3 minutes to start)&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &lt;span class="nt"&gt;-w&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this deploys:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flink JobManager: 1 replica (job coordination)&lt;/li&gt;
&lt;li&gt;Flink TaskManager: 6 replicas (data processing)&lt;/li&gt;
&lt;li&gt;S3 checkpoints: Every 1 minute&lt;/li&gt;
&lt;li&gt;Job: &lt;code&gt;JDBC IoT Data Pipeline&lt;/code&gt; (AVRO deserialization)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Flink Job Details:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Stream processing pipeline using the SensorData AVRO schema&lt;/span&gt;
&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;SensorRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sensorStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pulsarSource&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;WatermarkStrategy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;noWatermarks&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
    &lt;span class="s"&gt;"Pulsar IoT Source"&lt;/span&gt;
&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// 1-minute aggregation windows&lt;/span&gt;
&lt;span class="n"&gt;sensorStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;keyBy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSensorId&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TumblingProcessingTimeWindows&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;aggregate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SensorAggregator&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ClickHouseJDBCSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clickhouseUrl&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 6: Deploy IoT Producer (Event Generation)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ../producer-load

&lt;span class="c"&gt;# Build and deploy IoT producer&lt;/span&gt;
./deploy.sh

&lt;span class="c"&gt;# Scale producers based on desired throughput&lt;/span&gt;
kubectl scale deployment iot-producer &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50

&lt;span class="c"&gt;# Monitor producer status&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;iot-producer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Producer capabilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AVRO serialization&lt;/strong&gt;: Uses optimized SensorData schema&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-sensor types&lt;/strong&gt;: Temperature, humidity, pressure, motion, light, CO2, noise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configurable throughput&lt;/strong&gt;: 1,000 events/sec per pod&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Realistic data&lt;/strong&gt;: Battery levels, device status, geographic distribution&lt;/li&gt;
&lt;/ul&gt;
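&lt;p&gt;Because each producer pod emits a fixed 1,000 events/sec, sizing the deployment is simple arithmetic. A quick sketch (the per-pod rate comes from the producer configuration above; the helper function itself is illustrative):&lt;/p&gt;

```python
# Each producer pod emits a fixed rate of 1,000 events/sec,
# so total throughput scales linearly with the replica count.
PER_POD_RATE = 1_000  # events/sec, from the producer configuration

def replicas_for(target_events_per_sec: int) -> int:
    """Number of producer pods needed to reach a target throughput."""
    # Ceiling division so we never under-provision.
    return -(-target_events_per_sec // PER_POD_RATE)

# The guide's 50K events/sec target needs 50 pods, matching
# the `kubectl scale ... --replicas=50` command above.
print(replicas_for(50_000))     # 50
print(replicas_for(1_000_000))  # 1000 pods for a 1M events/sec tier
```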

&lt;h2&gt;
  
  
  📊 Step 7: Verify the Complete Pipeline
&lt;/h2&gt;

&lt;p&gt;After all components are deployed (~10-15 minutes total), verify data flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check producer is generating data&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;iot-producer &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10

&lt;span class="c"&gt;# Verify Pulsar has messages&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data

&lt;span class="c"&gt;# Check Flink is processing&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark deployment/iot-flink-job &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20

&lt;span class="c"&gt;# Query ClickHouse for data&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  clickhouse-client &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"SELECT COUNT(*) FROM benchmark.sensors_local"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected data flow:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Producer: 50,000 events/sec generation
✅ Pulsar: Message ingestion and buffering
✅ Flink: Real-time stream processing with 1-minute windows
✅ ClickHouse: Data storage and analytics queries
✅ End-to-end latency: &amp;lt; 2 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🔍 Monitoring and Analytics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Access Grafana Dashboard
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set up port forwarding&lt;/span&gt;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar svc/grafana 3000:3000 &amp;amp;

&lt;span class="c"&gt;# Open in browser&lt;/span&gt;
open http://localhost:3000
&lt;span class="c"&gt;# Login: admin/admin&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key metrics to monitor:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar&lt;/strong&gt;: Message throughput, backlog size, storage usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink&lt;/strong&gt;: Checkpoint duration, processing latency, job health&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse&lt;/strong&gt;: Query performance, insert rate, storage growth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt;: CPU, memory, disk I/O across all nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Sample Analytics Queries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Connect to ClickHouse&lt;/span&gt;
&lt;span class="n"&gt;kubectl&lt;/span&gt; &lt;span class="k"&gt;exec&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="n"&gt;clickhouse&lt;/span&gt; &lt;span class="n"&gt;chi&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;iot&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;repl&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;iot&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="c1"&gt;-- clickhouse-client&lt;/span&gt;

&lt;span class="c1"&gt;-- Query examples based on our SensorData AVRO schema&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Count total sensor readings&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_readings&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensors_local&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Average metrics by sensor type&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;sensorType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;reading_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;humidity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_humidity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batteryLevel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_battery&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sensorType&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sensorType&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Identify sensors with low battery&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sensorType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batteryLevel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_battery&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batteryLevel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;min_battery&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;HOUR&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sensorType&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="n"&gt;avg_battery&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;avg_battery&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Hourly data ingestion rate&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;toStartOfHour&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;records_per_hour&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="n"&gt;HOUR&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Temperature anomaly detection&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; 
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;HOUR&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📈 Performance Benchmarks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Real-World Performance Metrics
&lt;/h3&gt;

&lt;p&gt;On this cost-optimized setup, you can expect:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Message Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50,000 events/sec&lt;/td&gt;
&lt;td&gt;Sustained rate with 50 producer pods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;End-to-end Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 2 seconds&lt;/td&gt;
&lt;td&gt;Producer → ClickHouse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 500ms&lt;/td&gt;
&lt;td&gt;Analytical queries on 1B+ records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Utilization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;60-70%&lt;/td&gt;
&lt;td&gt;Across all node groups&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Usage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~80%&lt;/td&gt;
&lt;td&gt;Optimized for t3 instance types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage Growth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~10 GB/hour&lt;/td&gt;
&lt;td&gt;With 30-day TTL retention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;99.9%+&lt;/td&gt;
&lt;td&gt;Multi-AZ deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
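&lt;p&gt;The ~10 GB/hour growth figure combined with the 30-day TTL implies a steady-state disk footprint you can estimate up front (a back-of-the-envelope sketch using the table's numbers; actual usage depends on compression and replication):&lt;/p&gt;

```python
# Steady-state disk usage = ingest rate x retention window.
GB_PER_HOUR = 10   # storage growth from the benchmark table
TTL_DAYS = 30      # ClickHouse TTL retention from the same table

steady_state_gb = GB_PER_HOUR * 24 * TTL_DAYS
print(steady_state_gb)  # 7200 GB, i.e. ~7.2 TB of retained data
```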

&lt;h3&gt;
  
  
  Cost vs. Performance Analysis
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;💰 Infrastructure Efficiency:
- $1,250/month for 50K events/sec
- ~$0.01 per million events processed (~130 billion events/month)
- Production-ready reliability and monitoring
- Linear scaling to 1M events/sec

📊 Scaling Characteristics:
- Same architecture scales from 1K → 1M events/sec
- Infrastructure-as-Code deployment
- Easy migration path to enterprise setup
- Predictable monthly costs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
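&lt;p&gt;A quick arithmetic check on the efficiency numbers (both inputs come from the table above; a 30-day month is assumed):&lt;/p&gt;

```python
# Per-event cost implied by $1,250/month at a sustained 50K events/sec.
MONTHLY_COST_USD = 1_250
EVENTS_PER_SEC = 50_000
SECONDS_PER_MONTH = 86_400 * 30  # assuming a 30-day month

events_per_month = EVENTS_PER_SEC * SECONDS_PER_MONTH  # ~130 billion
cost_per_million = MONTHLY_COST_USD / events_per_month * 1_000_000
print(round(cost_per_million, 4))  # ~0.0096 USD per million events
```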



&lt;h2&gt;
  
  
  🛠️ Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Common Issues and Solutions
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Issue: High Memory Usage
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check memory usage across nodes&lt;/span&gt;
kubectl top nodes

&lt;span class="c"&gt;# Scale up instances if needed&lt;/span&gt;
&lt;span class="c"&gt;# Edit terraform.tfvars&lt;/span&gt;
instance_type &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t3.2xlarge"&lt;/span&gt;  &lt;span class="c"&gt;# Upgrade from t3.xlarge&lt;/span&gt;
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue: Pods Stuck in Pending
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check node availability&lt;/span&gt;
kubectl get nodes

&lt;span class="c"&gt;# Check pod events&lt;/span&gt;
kubectl describe pod &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &amp;lt;namespace&amp;gt;

&lt;span class="c"&gt;# Scale up nodes if needed&lt;/span&gt;
&lt;span class="c"&gt;# Edit terraform.tfvars, increase desired_size&lt;/span&gt;
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue: Flink Job Failing
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check Flink logs&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark deployment/iot-flink-job

&lt;span class="c"&gt;# Common issues:&lt;/span&gt;
&lt;span class="c"&gt;# - ClickHouse not ready: Wait 2-3 minutes&lt;/span&gt;
&lt;span class="c"&gt;# - Pulsar not accessible: Check network policies&lt;/span&gt;
&lt;span class="c"&gt;# - Out of memory: Scale up TaskManagers&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue: No Data in ClickHouse
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Check producer is running&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;iot-producer

&lt;span class="c"&gt;# 2. Check Pulsar has data&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data

&lt;span class="c"&gt;# 3. Check Flink is processing&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark deployment/iot-flink-job | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-50&lt;/span&gt;

&lt;span class="c"&gt;# 4. Verify ClickHouse connectivity&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  clickhouse-client &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"SELECT 1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue: Optimizing Costs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable spot instances for cost savings (can reduce to ~$400-500/month)&lt;/span&gt;
use_spot_instances &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;

&lt;span class="c"&gt;# Reduce node counts during development/testing&lt;/span&gt;
desired_size &lt;span class="o"&gt;=&lt;/span&gt; 1  &lt;span class="c"&gt;# Instead of 3-6&lt;/span&gt;

&lt;span class="c"&gt;# Use smaller instance types for development&lt;/span&gt;
instance_type &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t3.small"&lt;/span&gt;  &lt;span class="c"&gt;# Instead of t3.large&lt;/span&gt;

&lt;span class="c"&gt;# Schedule auto-shutdown for non-production environments&lt;/span&gt;
&lt;span class="c"&gt;# Use AWS Instance Scheduler&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧹 Cleanup
&lt;/h2&gt;

&lt;p&gt;When you're done testing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Delete all Kubernetes resources&lt;/span&gt;
kubectl delete namespace iot-pipeline flink-benchmark clickhouse pulsar

&lt;span class="c"&gt;# Destroy AWS infrastructure&lt;/span&gt;
terraform destroy
&lt;span class="c"&gt;# Type 'yes' when prompted&lt;/span&gt;

&lt;span class="c"&gt;# This will:&lt;/span&gt;
&lt;span class="c"&gt;# - Delete EKS cluster&lt;/span&gt;
&lt;span class="c"&gt;# - Delete VPC and subnets&lt;/span&gt;
&lt;span class="c"&gt;# - Delete S3 buckets (after emptying)&lt;/span&gt;
&lt;span class="c"&gt;# - Delete IAM roles&lt;/span&gt;
&lt;span class="c"&gt;# - Delete ECR repositories&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;⚠️ Warning:&lt;/strong&gt; This is irreversible! Back up any data you need before destroying the infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  💡 Production Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Optimize Instance Usage&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# terraform.tfvars - Consider spot instances for cost savings&lt;/span&gt;
&lt;span class="nx"&gt;use_spot_instances&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;# Can reduce costs by 60-70%&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Benefits of optimization:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spot instances can reduce costs significantly&lt;/li&gt;
&lt;li&gt;Right-sizing instances based on actual usage&lt;/li&gt;
&lt;li&gt;Auto-scaling to handle variable workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Current setup uses on-demand instances for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Guaranteed availability and performance&lt;/li&gt;
&lt;li&gt;Simplified operations and management&lt;/li&gt;
&lt;li&gt;Predictable monthly costs ($1,250)&lt;/li&gt;
&lt;/ul&gt;
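&lt;p&gt;The 60-70% spot savings quoted in the tfvars snippet translates directly into a monthly range (an arithmetic sketch; real spot pricing varies by region, AZ, and instance type):&lt;/p&gt;

```python
# Applying the quoted 60-70% spot discount to the $1,250/month
# on-demand bill from the current setup.
ON_DEMAND_MONTHLY = 1_250
at_60 = round(ON_DEMAND_MONTHLY * (1 - 0.60))  # $500/month
at_70 = round(ON_DEMAND_MONTHLY * (1 - 0.70))  # $375/month
print(at_60, at_70)  # consistent with the ~$400-500/month figure above
```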

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Set Up Alerts&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CloudWatch alarms (via Terraform)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CPU utilization &amp;gt; 80%&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Disk usage &amp;gt; 85%&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Pod crashes &amp;gt; 3 in 5 minutes&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Flink checkpoint failures&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Implement Data Retention&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- ClickHouse TTL (30 days)&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sensors_local&lt;/span&gt; 
&lt;span class="k"&gt;MODIFY&lt;/span&gt; &lt;span class="n"&gt;TTL&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Pulsar retention (7 days)&lt;/span&gt;
&lt;span class="n"&gt;bin&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;pulsar&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;admin&lt;/span&gt; &lt;span class="n"&gt;namespaces&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;retention&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;
  &lt;span class="c1"&gt;--size 100G --time 7d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. &lt;strong&gt;Enable Backups&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ClickHouse backups to S3&lt;/span&gt;
clickhouse-backup create daily_backup
clickhouse-backup upload daily_backup

&lt;span class="c"&gt;# Flink savepoints&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &amp;lt;jobmanager-pod&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  flink savepoint &amp;lt;job-id&amp;gt; s3://bench-low-infra-state/savepoints
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. &lt;strong&gt;Use Auto-Scaling&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# HorizontalPodAutoscaler for producers&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling/v2&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HorizontalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;iot-producer-hpa&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;iot-producer&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Resource&lt;/span&gt;
    &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cpu&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Utilization&lt;/span&gt;
        &lt;span class="na"&gt;averageUtilization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎓 What You've Learned
&lt;/h2&gt;

&lt;p&gt;By following this guide, you've:&lt;/p&gt;

&lt;p&gt;✅ Deployed a &lt;strong&gt;production-grade streaming platform&lt;/strong&gt; on AWS&lt;br&gt;
✅ Configured &lt;strong&gt;cost-optimized infrastructure&lt;/strong&gt; with t3 instances&lt;br&gt;
✅ Set up &lt;strong&gt;real-time stream processing&lt;/strong&gt; with Apache Flink&lt;br&gt;
✅ Implemented &lt;strong&gt;exactly-once semantics&lt;/strong&gt; with checkpointing&lt;br&gt;
✅ Built a &lt;strong&gt;scalable message broker&lt;/strong&gt; with Apache Pulsar&lt;br&gt;
✅ Configured an &lt;strong&gt;analytics database&lt;/strong&gt; with ClickHouse&lt;br&gt;
✅ Enabled &lt;strong&gt;monitoring and observability&lt;/strong&gt; with Grafana&lt;br&gt;
✅ Learned &lt;strong&gt;cost optimization&lt;/strong&gt; strategies (90% savings!)&lt;/p&gt;
&lt;h2&gt;
  
  
  🚀 Next Steps
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. &lt;strong&gt;Customize for Your Domain&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Modify the producer to generate your specific event types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Edit: producer-load/src/main/java/com/iot/pipeline/producer/&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EventDataProducer&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;EventData&lt;/span&gt; &lt;span class="nf"&gt;generateEvent&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Your custom event generation logic&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;EventData&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;customerId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;timestamp&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. &lt;strong&gt;Add Custom Flink Transformations&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Edit: flink-load/flink-consumer/src/main/java/com/iot/pipeline/flink/&lt;/span&gt;
&lt;span class="n"&gt;sensorStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getTemperature&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;30.0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// Custom filter&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AlertRecord&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;            &lt;span class="c1"&gt;// Custom transformation&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AlertSink&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;                         &lt;span class="c1"&gt;// Custom sink&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Implement Advanced Analytics&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create materialized views in ClickHouse&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;hourly_aggregates&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AggregatingMergeTree&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;toStartOfHour&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;max_temp&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. &lt;strong&gt;Scale to Production&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When ready for production scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enable spot instances for cost savings&lt;/li&gt;
&lt;li&gt;Set up automated backups&lt;/li&gt;
&lt;li&gt;Configure CloudWatch alarms&lt;/li&gt;
&lt;li&gt;Implement log aggregation (CloudWatch Logs)&lt;/li&gt;
&lt;li&gt;Set up CI/CD pipeline&lt;/li&gt;
&lt;li&gt;Enable AWS Shield for DDoS protection&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;50K Events Repository&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-50k-events" rel="noopener noreferrer"&gt;realtime-platform-50k-events&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Main Repository&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;RealtimeDataPlatform&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS EKS Documentation&lt;/strong&gt;: &lt;a href="https://docs.aws.amazon.com/eks/" rel="noopener noreferrer"&gt;docs.aws.amazon.com/eks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Flink&lt;/strong&gt;: &lt;a href="https://flink.apache.org/" rel="noopener noreferrer"&gt;flink.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Pulsar&lt;/strong&gt;: &lt;a href="https://pulsar.apache.org/" rel="noopener noreferrer"&gt;pulsar.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse&lt;/strong&gt;: &lt;a href="https://clickhouse.com/docs" rel="noopener noreferrer"&gt;clickhouse.com/docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform&lt;/strong&gt;: &lt;a href="https://terraform.io/docs" rel="noopener noreferrer"&gt;terraform.io/docs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💬 Conclusion
&lt;/h2&gt;

&lt;p&gt;You now have a &lt;strong&gt;production-grade, cost-optimized streaming platform&lt;/strong&gt; running on AWS for about &lt;strong&gt;$1,250/month&lt;/strong&gt;! This setup demonstrates real-world patterns used by companies processing millions of events per day, but optimized for moderate scale and budget constraints.&lt;/p&gt;

&lt;p&gt;The beauty of this architecture is its &lt;strong&gt;flexibility&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start small (1K events/sec, lower cost)&lt;/li&gt;
&lt;li&gt;Grow to moderate (50K events/sec, ~$1,250/mo)&lt;/li&gt;
&lt;li&gt;Scale to enterprise (1M events/sec, higher cost)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All with the same codebase and deployment patterns!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS EKS deployment&lt;/strong&gt; provides managed Kubernetes for production workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;t3 instances&lt;/strong&gt; deliver excellent price/performance for streaming workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$1,250/month&lt;/strong&gt; infrastructure cost for 50K events/sec processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AVRO schemas&lt;/strong&gt; enable efficient serialization at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production-ready&lt;/strong&gt; with monitoring, alerting, and auto-scaling&lt;/li&gt;
&lt;/ul&gt;
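&lt;p&gt;As a back-of-envelope check on that $1,250 figure (assuming a 30-day month of sustained 50K events/sec; this is my arithmetic, not a line item from an AWS bill):&lt;/p&gt;

```python
# Back-of-envelope cost check for the 50K events/sec tier.
# Assumptions: 30-day month, sustained throughput around the clock.
events_per_sec = 50_000
monthly_cost_usd = 1_250

events_per_month = events_per_sec * 60 * 60 * 24 * 30   # 129.6 billion events
cost_per_million = monthly_cost_usd / (events_per_month / 1_000_000)

print(round(cost_per_million, 4))   # 0.0096, i.e. about a cent per million events
```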

&lt;p&gt;&lt;strong&gt;What would you build with this platform? Share your use case in the comments!&lt;/strong&gt; 👇&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found this helpful? Follow me for more posts on cloud architecture, real-time data engineering, and cost optimization strategies!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next in the series&lt;/strong&gt;: "Scaling to 1 Million Events/Second - Enterprise Production Guide"&lt;/p&gt;




&lt;h2&gt;
  
  
  🌟 Support This Project
&lt;/h2&gt;

&lt;p&gt;⭐ &lt;strong&gt;Star the repo&lt;/strong&gt; if you found it useful!&lt;br&gt;&lt;br&gt;
🐛 &lt;strong&gt;Report issues&lt;/strong&gt; - help us improve&lt;br&gt;&lt;br&gt;
💼 &lt;strong&gt;Production-tested&lt;/strong&gt; - used in real workloads&lt;br&gt;&lt;br&gt;
📖 &lt;strong&gt;Well-documented&lt;/strong&gt; - complete guides included&lt;br&gt;&lt;br&gt;
💰 &lt;strong&gt;Cost-optimized&lt;/strong&gt; - save 90% on infrastructure&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 Performance Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50,000 events/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 2 seconds end-to-end&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monthly Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1,250 (on-demand instances)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1TB (30 days retention)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;99.9% (multi-AZ deployment)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1K → 1M events/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~45 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #aws #eks #streaming #costsavings #dataengineering #pulsar #flink #clickhouse #terraform #avro #realtimeanalytics&lt;/p&gt;

</description>
      <category>performance</category>
      <category>kubernetes</category>
      <category>architecture</category>
      <category>aws</category>
    </item>
    <item>
      <title>Building a Real-Time Data Platform with Kubernetes (Kind) - A Complete Local Setup Guide</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 10:20:01 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/building-a-real-time-data-platform-with-kubernetes-kind-a-complete-local-setup-guide-3bko</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/building-a-real-time-data-platform-with-kubernetes-kind-a-complete-local-setup-guide-3bko</guid>
      <description>&lt;p&gt;Ever wondered how to build a production-grade real-time data pipeline that can handle millions of events per second? In this guide, I'll show you how to set up a complete IoT streaming platform locally using &lt;strong&gt;Kubernetes (Kind)&lt;/strong&gt; that processes sensor data in real-time.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 What We're Building
&lt;/h2&gt;

&lt;p&gt;A fully functional IoT data pipeline that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Generates realistic IoT sensor data (temperature, humidity, pressure)&lt;/li&gt;
&lt;li&gt;✅ Streams data through Apache Pulsar at 1000+ msg/sec&lt;/li&gt;
&lt;li&gt;✅ Processes data in real-time with Apache Flink&lt;/li&gt;
&lt;li&gt;✅ Stores analytics in ClickHouse for fast queries&lt;/li&gt;
&lt;li&gt;✅ Detects and alerts on anomalies (high temperature, low battery)&lt;/li&gt;
&lt;li&gt;✅ Runs entirely on your local machine with Kubernetes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🏗️ Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────┐      ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│             │      │             │      │             │      │             │
│ IoT Producer├─────►│   Pulsar    ├─────►│    Flink    ├─────►│ ClickHouse  │
│  (Java)     │      │  (Broker)   │      │ (Processor) │      │  (Storage)  │
│             │      │             │      │             │      │             │
└─────────────┘      └─────────────┘      └─────────────┘      └─────────────┘
   Sensor Data     Message Queue      Stream Processing     Analytics DB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tech Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes (Kind)&lt;/strong&gt;: Local Kubernetes cluster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Pulsar&lt;/strong&gt;: Distributed messaging and streaming&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Flink&lt;/strong&gt;: Real-time stream processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse&lt;/strong&gt;: Columnar database for analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt;: Containerization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maven&lt;/strong&gt;: Build automation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📋 Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before we begin, make sure you have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check versions&lt;/span&gt;
docker &lt;span class="nt"&gt;--version&lt;/span&gt;      &lt;span class="c"&gt;# Docker 20.10+&lt;/span&gt;
kubectl version      &lt;span class="c"&gt;# kubectl 1.28+&lt;/span&gt;
kind version         &lt;span class="c"&gt;# kind 0.20+&lt;/span&gt;
mvn &lt;span class="nt"&gt;--version&lt;/span&gt;        &lt;span class="c"&gt;# Maven 3.8+&lt;/span&gt;
java &lt;span class="nt"&gt;-version&lt;/span&gt;        &lt;span class="c"&gt;# Java 11+&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Installation (macOS):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install required tools&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;kind kubectl maven docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Installation (Linux):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Kind&lt;/span&gt;
curl &lt;span class="nt"&gt;-Lo&lt;/span&gt; ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x ./kind
&lt;span class="nb"&gt;sudo mv&lt;/span&gt; ./kind /usr/local/bin/kind

&lt;span class="c"&gt;# Install kubectl&lt;/span&gt;
curl &lt;span class="nt"&gt;-LO&lt;/span&gt; &lt;span class="s2"&gt;"https://dl.k8s.io/release/&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-L&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; https://dl.k8s.io/release/stable.txt&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;/bin/linux/amd64/kubectl"&lt;/span&gt;
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x kubectl
&lt;span class="nb"&gt;sudo mv &lt;/span&gt;kubectl /usr/local/bin/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🚀 Step 1: Clone the Repository
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/hyperscaledesignhub/RealtimeDataPlatform.git
&lt;span class="nb"&gt;cd &lt;/span&gt;RealtimeDataPlatform/local-setup/k8s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Repository structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;local-setup/k8s/
├── create-cluster.sh          # Kind cluster setup
├── start-pipeline.sh          # Deploy entire pipeline
├── stop-pipeline.sh           # Clean shutdown
├── k8s-manifests/            # Kubernetes YAML files
├── flink-jobs/               # Flink job definitions
└── scripts/                  # Helper utilities
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🔧 Step 2: Create the Kind Cluster
&lt;/h2&gt;

&lt;p&gt;The repository includes a pre-configured Kind cluster setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./create-cluster.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this does:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates a 3-node Kubernetes cluster (1 control-plane, 2 workers)&lt;/li&gt;
&lt;li&gt;Configures port mappings for service access&lt;/li&gt;
&lt;li&gt;Sets up kubeconfig at &lt;code&gt;/tmp/iot-kubeconfig&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Validates cluster readiness&lt;/li&gt;
&lt;/ul&gt;
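&lt;p&gt;For reference, the 3-node layout above corresponds to a Kind config along these lines (a sketch only; the port mapping shown is an assumption, and &lt;code&gt;create-cluster.sh&lt;/code&gt; in the repo defines the real ones):&lt;/p&gt;

```yaml
# Sketch of a Kind config matching the 3-node layout above.
# The actual create-cluster.sh may use different port mappings.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: iot-pipeline
nodes:
  - role: control-plane
    extraPortMappings:
      - containerPort: 30081   # assumed NodePort for the Flink UI
        hostPort: 8081
  - role: worker
  - role: worker
```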

&lt;p&gt;&lt;strong&gt;Expected output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Kind cluster created successfully!

Cluster Information:
NAME                         STATUS   ROLES           AGE   VERSION
iot-pipeline-control-plane   Ready    control-plane   70s   v1.32.2
iot-pipeline-worker          Ready    &amp;lt;none&amp;gt;          70s   v1.32.2
iot-pipeline-worker2         Ready    &amp;lt;none&amp;gt;          70s   v1.32.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎬 Step 3: Deploy the IoT Pipeline
&lt;/h2&gt;

&lt;p&gt;Now for the exciting part - deploying the entire pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;KUBECONFIG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/iot-kubeconfig
./start-pipeline.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🔍 What Happens Behind the Scenes
&lt;/h3&gt;

&lt;p&gt;The script performs these steps automatically:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. &lt;strong&gt;Build Phase&lt;/strong&gt; (~2 minutes)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Builds the IoT producer Docker image&lt;/span&gt;
Building producer image...
✅ iot-producer:latest built

&lt;span class="c"&gt;# Compiles Flink consumer with Maven&lt;/span&gt;
Building Flink consumer JAR...
✅ flink-consumer-1.0.0.jar created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. &lt;strong&gt;Load Images into Kind&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kind load docker-image iot-producer:latest &lt;span class="nt"&gt;--name&lt;/span&gt; iot-pipeline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. &lt;strong&gt;Deploy Services&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Creates namespace and deploys:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar&lt;/strong&gt;: StatefulSet with persistent storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse&lt;/strong&gt;: StatefulSet with initialization scripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink&lt;/strong&gt;: JobManager + 2 TaskManagers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IoT Producer&lt;/strong&gt;: Deployment generating sensor data&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. &lt;strong&gt;Initialize ClickHouse Schema&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The platform automatically creates the sensor data schema based on our AVRO definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;iot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;sensor_raw_data&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sensor_id&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sensor_type&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;humidity&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pressure&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;battery_level&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="n"&gt;DateTime64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;event_time&lt;/span&gt; &lt;span class="n"&gt;DateTime64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now64&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MergeTree&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;toYYYYMM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sensor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;index_granularity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create alerts table for anomaly detection&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;sensor_alerts&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sensor_id&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;alert_type&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;alert_time&lt;/span&gt; &lt;span class="n"&gt;DateTime64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;humidity&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;battery_level&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MergeTree&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alert_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sensor_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
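&lt;p&gt;To make the schema concrete, here is a minimal Python sketch of a row that fits &lt;code&gt;sensor_raw_data&lt;/code&gt; (value ranges are illustrative, not taken from the real producer; &lt;code&gt;event_time&lt;/code&gt; is omitted because ClickHouse fills it in via its &lt;code&gt;DEFAULT&lt;/code&gt;):&lt;/p&gt;

```python
import random
import time

def make_reading(sensor_id):
    """Generate one sensor row matching the sensor_raw_data columns.
    Value ranges are illustrative, not taken from the real producer."""
    return {
        "sensor_id": sensor_id,
        "sensor_type": 1,                                  # 1 = temperature sensor
        "temperature": round(random.uniform(15.0, 40.0), 1),
        "humidity": round(random.uniform(30.0, 90.0), 1),
        "pressure": round(random.uniform(980.0, 1050.0), 1),
        "battery_level": round(random.uniform(5.0, 100.0), 1),
        "status": 1,                                       # 1 = online
        "timestamp": int(time.time() * 1000),              # DateTime64(3) = millis
    }

row = make_reading(4)
print(sorted(row))   # column names line up with the CREATE TABLE above
```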



&lt;h4&gt;
  
  
  5. &lt;strong&gt;Submit Flink Job&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Deploys the stream processing job that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consumes from Pulsar topic: &lt;code&gt;persistent://public/default/iot-sensor-data&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Deserializes AVRO sensor data using our schema&lt;/li&gt;
&lt;li&gt;Detects anomalies (temp &amp;gt; 35°C, humidity &amp;gt; 80%, battery &amp;lt; 20%)&lt;/li&gt;
&lt;li&gt;Writes processed data to ClickHouse&lt;/li&gt;
&lt;/ul&gt;
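&lt;p&gt;Those thresholds fit in a few lines. Here is a plain-Python sketch of the predicate (the real job implements this in Java; &lt;code&gt;HIGH_HUMIDITY&lt;/code&gt; and &lt;code&gt;LOW_BATTERY&lt;/code&gt; are assumed alert-type names, only &lt;code&gt;HIGH_TEMPERATURE&lt;/code&gt; appears later in the sample output):&lt;/p&gt;

```python
def detect_alerts(reading):
    """Return alert types for one reading, using the thresholds above:
    temperature above 35 C, humidity above 80%, battery below 20%."""
    alerts = []
    if reading["temperature"] > 35.0:
        alerts.append("HIGH_TEMPERATURE")
    if reading["humidity"] > 80.0:
        alerts.append("HIGH_HUMIDITY")
    if 20.0 > reading["battery_level"]:      # i.e. battery below 20%
        alerts.append("LOW_BATTERY")
    return alerts

print(detect_alerts({"temperature": 36.2, "humidity": 55.0, "battery_level": 12.0}))
# → ['HIGH_TEMPERATURE', 'LOW_BATTERY']
```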

&lt;h2&gt;
  
  
  📊 Step 4: Verify the Pipeline
&lt;/h2&gt;

&lt;p&gt;After deployment (~90 seconds), you'll see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Pipeline is working! Data is flowing successfully.

Data Flow Status:
Sensor readings: 39
Alerts generated: 7

Sample sensor data:
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ sensor_id ┃ sensor_type ┃ temperature ┃ humidity ┃ timestamp           ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│         4 │           1 │        24.5 │     68.6 │ 2025-10-26 10:07:32 │
│         3 │           1 │        28.2 │     79.3 │ 2025-10-26 10:07:31 │
│         2 │           1 │        26.3 │     60.1 │ 2025-10-26 10:07:30 │
└───────────┴─────────────┴─────────────┴──────────┴─────────────────────┘

Alert distribution:
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ alert_type       ┃ count ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ HIGH_TEMPERATURE │     7 │
└──────────────────┴───────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Check All Pods Running
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                                 READY   STATUS    RESTARTS   AGE
clickhouse-0                         1/1     Running   0          2m
flink-jobmanager-77c4d6f6c5-v7kqv    1/1     Running   0          2m
flink-taskmanager-7d67d89fd6-5v84r   1/1     Running   0          2m
flink-taskmanager-7d67d89fd6-n7f96   1/1     Running   0          2m
iot-producer-78466d4cf4-6pskj        1/1     Running   0          90s
pulsar-0                             1/1     Running   0          2m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🔍 Step 5: Explore the Pipeline
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Access Flink Dashboard
&lt;/h3&gt;

&lt;p&gt;The script automatically sets up port forwarding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Flink UI is already accessible at:&lt;/span&gt;
open http://localhost:8081
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What you'll see:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running Jobs: 1 job (&lt;code&gt;JDBC IoT Data Pipeline&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Task Managers: 2 active&lt;/li&gt;
&lt;li&gt;Task Slots: 4 available&lt;/li&gt;
&lt;li&gt;Job Graph: Visual representation of data flow&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Query ClickHouse Data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enter ClickHouse client&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline clickhouse-0 &lt;span class="nt"&gt;--&lt;/span&gt; clickhouse-client &lt;span class="nt"&gt;-d&lt;/span&gt; iot

&lt;span class="c"&gt;# Count total sensor readings&lt;/span&gt;
SELECT COUNT&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; FROM sensor_raw_data&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c"&gt;# Get average metrics by sensor type&lt;/span&gt;
SELECT 
    sensor_type,
    COUNT&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; as reading_count,
    AVG&lt;span class="o"&gt;(&lt;/span&gt;temperature&lt;span class="o"&gt;)&lt;/span&gt; as avg_temp,
    AVG&lt;span class="o"&gt;(&lt;/span&gt;humidity&lt;span class="o"&gt;)&lt;/span&gt; as avg_humidity,
    AVG&lt;span class="o"&gt;(&lt;/span&gt;pressure&lt;span class="o"&gt;)&lt;/span&gt; as avg_pressure,
    AVG&lt;span class="o"&gt;(&lt;/span&gt;battery_level&lt;span class="o"&gt;)&lt;/span&gt; as avg_battery
FROM sensor_raw_data
GROUP BY sensor_type
ORDER BY sensor_type&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c"&gt;# Get recent high-temperature alerts&lt;/span&gt;
SELECT 
    sensor_id,
    alert_type,
    alert_time,
    temperature,
    description
FROM sensor_alerts
WHERE alert_type &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'HIGH_TEMPERATURE'&lt;/span&gt;
ORDER BY alert_time DESC
LIMIT 10&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c"&gt;# Monitor data ingestion rate (records per minute)&lt;/span&gt;
SELECT 
    toStartOfMinute&lt;span class="o"&gt;(&lt;/span&gt;timestamp&lt;span class="o"&gt;)&lt;/span&gt; as minute,
    COUNT&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; as records_per_minute
FROM sensor_raw_data
WHERE timestamp &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; now&lt;span class="o"&gt;()&lt;/span&gt; - INTERVAL 10 MINUTE
GROUP BY minute
ORDER BY minute DESC&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Monitor Data Flow in Real-Time
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Watch sensor data being written&lt;/span&gt;
watch &lt;span class="nt"&gt;-n&lt;/span&gt; 2 &lt;span class="s2"&gt;"kubectl exec -n iot-pipeline clickhouse-0 -- &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
  clickhouse-client -d iot --query 'SELECT COUNT(*) FROM sensor_raw_data'"&lt;/span&gt;

&lt;span class="c"&gt;# Monitor Flink job status&lt;/span&gt;
watch &lt;span class="nt"&gt;-n&lt;/span&gt; 5 &lt;span class="s2"&gt;"kubectl exec -n iot-pipeline &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
  &lt;/span&gt;&lt;span class="se"&gt;\$&lt;/span&gt;&lt;span class="s2"&gt;(kubectl get pods -n iot-pipeline -l app=flink,component=jobmanager -o jsonpath='{.items[0].metadata.name}') -- &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
  flink list"&lt;/span&gt;

&lt;span class="c"&gt;# Check Pulsar topic stats&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline pulsar-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📈 Performance Metrics
&lt;/h2&gt;

&lt;p&gt;On a MacBook Pro (M1/M2) with 16GB RAM:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1,000 msg/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;End-to-end Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 500ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Usage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~40% (all pods)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Usage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~6GB total&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~500MB after 1 hour&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Records/minute&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~60,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
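&lt;p&gt;A quick sanity check that these numbers hang together (the bytes-per-record figure is a rough estimate derived from the table, not a measurement):&lt;/p&gt;

```python
# Consistency check on the local benchmark numbers above.
msgs_per_sec = 1_000
records_per_minute = msgs_per_sec * 60
print(records_per_minute)          # 60000, matching the table's ~60,000/min

# ~500 MB on disk after 1 hour of ingest implies roughly:
bytes_per_record = (500 * 1024 * 1024) / (msgs_per_sec * 3600)
print(round(bytes_per_record))     # ~146 bytes/record, a post-compression estimate
```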

&lt;h2&gt;
  
  
  🎨 Key Features Demonstrated
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Real-Time Stream Processing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Flink job processes AVRO-serialized sensor data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Flink processes data with 1-minute windows&lt;/span&gt;
&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;SensorRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sensorStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pulsarSource&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;WatermarkStrategy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;noWatermarks&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
    &lt;span class="s"&gt;"Pulsar IoT Source"&lt;/span&gt;
&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;sensorStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;keyBy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSensorId&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TumblingProcessingTimeWindows&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;aggregate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SensorAggregator&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ClickHouseJDBCSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clickhouseUrl&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. &lt;strong&gt;AVRO Schema Processing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our sensor data follows the optimized AVRO schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Schema highlights from our SensorData AVRO record:&lt;/span&gt;
&lt;span class="c1"&gt;// - sensorId: int (for efficiency)&lt;/span&gt;
&lt;span class="c1"&gt;// - sensorType: int (1=temp, 2=humidity, 3=pressure, etc.)&lt;/span&gt;
&lt;span class="c1"&gt;// - temperature, humidity, pressure: double&lt;/span&gt;
&lt;span class="c1"&gt;// - batteryLevel: double (percentage)&lt;/span&gt;
&lt;span class="c1"&gt;// - status: int (1=online, 2=offline, 3=maintenance, 4=error)&lt;/span&gt;
&lt;span class="c1"&gt;// - timestamp: long with logical type timestamp-millis&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
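&lt;p&gt;The integer codes above trade readability for wire efficiency, so consumers need a small decode step. A minimal self-contained sketch (class and method names here are illustrative, not part of the repository):&lt;/p&gt;

```java
import java.util.Map;

public class SensorCodes {
    // Code-to-name lookups taken from the schema comments above
    // (sensorType: 1=temp, 2=humidity, 3=pressure; status: 1=online ... 4=error).
    static final Map SENSOR_TYPES = Map.of(
            1, "temperature", 2, "humidity", 3, "pressure");
    static final Map STATUSES = Map.of(
            1, "online", 2, "offline", 3, "maintenance", 4, "error");

    static String typeName(int code) {
        return (String) SENSOR_TYPES.getOrDefault(code, "unknown");
    }

    static String statusName(int code) {
        return (String) STATUSES.getOrDefault(code, "unknown");
    }

    public static void main(String[] args) {
        System.out.println(typeName(1) + " / " + statusName(4)); // temperature / error
    }
}
```

&lt;p&gt;Keeping the mapping in one place makes it easy to evolve alongside the AVRO schema.&lt;/p&gt;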



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Anomaly Detection&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Alert on various conditions&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getTemperature&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;35.0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;alertSink&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;invoke&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Alert&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSensorId&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
        &lt;span class="s"&gt;"HIGH_TEMPERATURE"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getTemperature&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
        &lt;span class="nc"&gt;Instant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;now&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getBatteryLevel&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;20.0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;alertSink&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;invoke&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Alert&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSensorId&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
        &lt;span class="s"&gt;"LOW_BATTERY"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getBatteryLevel&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
        &lt;span class="nc"&gt;Instant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;now&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. &lt;strong&gt;Scalable Architecture&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal scaling&lt;/strong&gt;: Add more TaskManagers for increased parallelism&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vertical scaling&lt;/strong&gt;: Adjust resource limits per pod&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data partitioning&lt;/strong&gt;: Pulsar topic partitioning for parallel processing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🛠️ Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Issue: Pods not starting
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check pod status and events&lt;/span&gt;
kubectl describe pod &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &amp;lt;pod-name&amp;gt;

&lt;span class="c"&gt;# Check logs for specific errors&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &amp;lt;pod-name&amp;gt;

&lt;span class="c"&gt;# Check resource constraints&lt;/span&gt;
kubectl top pods &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue: Flink job not running
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check Flink JobManager logs&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;flink,component&lt;span class="o"&gt;=&lt;/span&gt;jobmanager

&lt;span class="c"&gt;# Check TaskManager logs&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;flink,component&lt;span class="o"&gt;=&lt;/span&gt;taskmanager

&lt;span class="c"&gt;# Access Flink CLI&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline deploy/flink-jobmanager &lt;span class="nt"&gt;--&lt;/span&gt; flink list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue: No data in ClickHouse
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check producer logs for errors&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;iot-producer

&lt;span class="c"&gt;# Verify Pulsar topic creation&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline pulsar-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/pulsar-admin topics list public/default

&lt;span class="c"&gt;# Check Pulsar message production&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline pulsar-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data

&lt;span class="c"&gt;# Test ClickHouse connectivity&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline clickhouse-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  clickhouse-client &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"SELECT version()"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue: Port forwarding not working
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Manual port forwarding setup&lt;/span&gt;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline svc/flink-jobmanager 8081:8081 &amp;amp;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline svc/clickhouse 8123:8123 &amp;amp;

&lt;span class="c"&gt;# Check service endpoints&lt;/span&gt;
kubectl get svc &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧹 Cleanup
&lt;/h2&gt;

&lt;p&gt;When you're done exploring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Stop the pipeline (keeps cluster)&lt;/span&gt;
./stop-pipeline.sh

&lt;span class="c"&gt;# Or delete everything including cluster&lt;/span&gt;
kind delete cluster &lt;span class="nt"&gt;--name&lt;/span&gt; iot-pipeline

&lt;span class="c"&gt;# Clean up Docker images (optional)&lt;/span&gt;
docker rmi iot-producer:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎓 What You've Learned
&lt;/h2&gt;

&lt;p&gt;By following this guide, you've:&lt;/p&gt;

&lt;p&gt;✅ Set up a local Kubernetes cluster with Kind&lt;br&gt;&lt;br&gt;
✅ Deployed a distributed streaming platform&lt;br&gt;&lt;br&gt;
✅ Built a real-time data processing pipeline&lt;br&gt;&lt;br&gt;
✅ Implemented stream processing with Apache Flink&lt;br&gt;&lt;br&gt;
✅ Used AVRO schemas for efficient serialization&lt;br&gt;&lt;br&gt;
✅ Stored and queried streaming data in ClickHouse&lt;br&gt;&lt;br&gt;
✅ Implemented real-time anomaly detection&lt;br&gt;&lt;br&gt;
✅ Monitored a production-grade data pipeline  &lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 Next Steps
&lt;/h2&gt;

&lt;p&gt;Want to take this further?&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Scale to Production&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Deploy on AWS EKS with the &lt;code&gt;realtime-platform-1million-events/&lt;/code&gt; setup to handle 1M+ events/sec.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Add More Features&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Checkpointing&lt;/strong&gt;: Implement exactly-once processing with Flink checkpoints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Add Grafana dashboards and Prometheus metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windowing&lt;/strong&gt;: Implement different time windows (sliding, session-based)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ML Integration&lt;/strong&gt;: Add machine learning for predictive maintenance&lt;/li&gt;
&lt;/ul&gt;
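&lt;p&gt;To make the windowing bullet concrete, here is a plain-Java sketch (no Flink dependency) contrasting tumbling and sliding windows over an array of values. It groups by element count purely for illustration; Flink would group by event or processing time.&lt;/p&gt;

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class WindowSketch {
    // Tumbling: non-overlapping windows of `size` elements; one sum per window.
    static double[] tumblingSums(double[] v, int size) {
        int n = (v.length + size - 1) / size;
        return IntStream.range(0, n)
                .mapToDouble(w -> Arrays.stream(v, w * size, Math.min((w + 1) * size, v.length)).sum())
                .toArray();
    }

    // Sliding: windows of `size` elements advancing by `slide`; consecutive
    // windows overlap whenever `slide` is smaller than `size`.
    static double[] slidingSums(double[] v, int size, int slide) {
        int n = (v.length - size) / slide + 1;
        return IntStream.range(0, n)
                .mapToDouble(w -> Arrays.stream(v, w * slide, w * slide + size).sum())
                .toArray();
    }

    public static void main(String[] args) {
        double[] v = {1, 2, 3, 4, 5, 6};
        System.out.println(Arrays.toString(tumblingSums(v, 2)));    // [3.0, 7.0, 11.0]
        System.out.println(Arrays.toString(slidingSums(v, 3, 1)));  // [6.0, 9.0, 12.0, 15.0]
    }
}
```

&lt;p&gt;In Flink, the same distinction is expressed by swapping &lt;code&gt;TumblingProcessingTimeWindows&lt;/code&gt; for &lt;code&gt;SlidingProcessingTimeWindows&lt;/code&gt; in the pipeline shown earlier.&lt;/p&gt;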

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Customize the Pipeline&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema Evolution&lt;/strong&gt;: Practice updating AVRO schemas without downtime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom Transformations&lt;/strong&gt;: Add complex event processing (CEP) patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External APIs&lt;/strong&gt;: Connect to weather services or IoT device management platforms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Lake Integration&lt;/strong&gt;: Archive data to S3/MinIO for long-term storage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Advanced Topics&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenant Setup&lt;/strong&gt;: Isolate different customer data streams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-region Replication&lt;/strong&gt;: Set up geo-distributed deployments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: Add authentication and encryption&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Tuning&lt;/strong&gt;: Optimize for specific workload patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local Setup Directory&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/local-setup/k8s" rel="noopener noreferrer"&gt;local-setup/k8s&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Main Repository&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;RealtimeDataPlatform&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kind Documentation&lt;/strong&gt;: &lt;a href="https://kind.sigs.k8s.io/" rel="noopener noreferrer"&gt;kind.sigs.k8s.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Flink Guides&lt;/strong&gt;: &lt;a href="https://flink.apache.org/" rel="noopener noreferrer"&gt;flink.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Pulsar Docs&lt;/strong&gt;: &lt;a href="https://pulsar.apache.org/" rel="noopener noreferrer"&gt;pulsar.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse Documentation&lt;/strong&gt;: &lt;a href="https://clickhouse.com/docs" rel="noopener noreferrer"&gt;clickhouse.com/docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AVRO Specification&lt;/strong&gt;: &lt;a href="https://avro.apache.org/" rel="noopener noreferrer"&gt;avro.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💬 Conclusion
&lt;/h2&gt;

&lt;p&gt;You now have a fully functional, production-grade IoT streaming pipeline running on your local machine! This setup demonstrates real-world patterns used by companies processing billions of events per day.&lt;/p&gt;

&lt;p&gt;The best part? Everything runs in Docker containers orchestrated by Kubernetes, making it easy to understand, modify, and eventually deploy to production cloud environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AVRO schemas&lt;/strong&gt; provide efficient serialization and schema evolution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kind clusters&lt;/strong&gt; enable realistic Kubernetes testing locally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream processing&lt;/strong&gt; patterns work the same at any scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time analytics&lt;/strong&gt; can be achieved with open-source tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What would you build with this pipeline? Share your IoT project ideas in the comments!&lt;/strong&gt; 👇&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found this helpful? Follow me for more posts on real-time data engineering, Kubernetes, and distributed systems!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next in the series&lt;/strong&gt;: "Scaling to 1 Million Events/Second - Production Deployment Guide"&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 Repository Stats
&lt;/h2&gt;

&lt;p&gt;⭐ &lt;strong&gt;Star this repo&lt;/strong&gt; if you found it useful!&lt;br&gt;&lt;br&gt;
🐛 &lt;strong&gt;Issues/PRs welcome&lt;/strong&gt; - contributions appreciated&lt;br&gt;&lt;br&gt;
💼 &lt;strong&gt;Production ready&lt;/strong&gt; - scales to 1M events/sec&lt;br&gt;&lt;br&gt;
📖 &lt;strong&gt;Well documented&lt;/strong&gt; - complete guides included&lt;br&gt;&lt;br&gt;
🏗️ &lt;strong&gt;Local K8s Setup&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/local-setup/k8s" rel="noopener noreferrer"&gt;local-setup/k8s directory&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #kubernetes #iot #dataengineering #streaming #flink #pulsar #clickhouse #devops #avro #realtimeanalytics&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>iot</category>
      <category>dataengineering</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Real-Time Streaming Platform with Pulsar, Flink &amp; ClickHouse</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 09:41:05 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/real-time-streaming-platform-with-pulsar-flink-clickhouse-1oac</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/real-time-streaming-platform-with-pulsar-flink-clickhouse-1oac</guid>
      <description>&lt;h1&gt;
  
  
  Real-Time Streaming Platform: Building Enterprise-Grade Data Infrastructure with Pulsar, Flink &amp;amp; ClickHouse
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;An overview of a comprehensive event streaming platform designed for high-throughput, real-time data processing across multiple domains&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Modern businesses generate massive amounts of data every second. From user interactions on e-commerce platforms to sensor readings in IoT networks, the ability to process and analyze this data in real-time has become a competitive necessity.&lt;/p&gt;

&lt;p&gt;I've built a comprehensive real-time streaming platform that tackles the fundamental challenges of &lt;strong&gt;scalable data ingestion&lt;/strong&gt;, &lt;strong&gt;real-time processing&lt;/strong&gt;, and &lt;strong&gt;high-performance analytics&lt;/strong&gt;. This platform is designed to handle workloads ranging from development environments to enterprise-scale deployments processing over 1 million messages per second.&lt;/p&gt;

&lt;h2&gt;
  
  
  🏗️ Platform Architecture
&lt;/h2&gt;

&lt;p&gt;The platform leverages three battle-tested open-source technologies, each serving a specific role in the data pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📊 Data Sources → 🚀 Apache Pulsar → ⚡ Apache Flink → 🏛️ ClickHouse → 📈 Real-time Analytics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffws9xd5qhjqmrmev2ymp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffws9xd5qhjqmrmev2ymp.png" alt=" " width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Components
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Apache Pulsar&lt;/strong&gt; - The messaging backbone&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distributed pub-sub messaging with multi-tenancy&lt;/li&gt;
&lt;li&gt;Built-in schema registry for AVRO serialization&lt;/li&gt;
&lt;li&gt;Geo-replication and tiered storage capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Apache Flink&lt;/strong&gt; - The processing engine  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stateful stream processing with exactly-once guarantees&lt;/li&gt;
&lt;li&gt;Complex event processing and windowed aggregations&lt;/li&gt;
&lt;li&gt;Fault-tolerant checkpointing and recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ClickHouse&lt;/strong&gt; - The analytical powerhouse&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Columnar database optimized for analytical queries&lt;/li&gt;
&lt;li&gt;Real-time ingestion with sub-second query performance&lt;/li&gt;
&lt;li&gt;Horizontal scaling across distributed clusters&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why This Combination?
&lt;/h3&gt;

&lt;p&gt;This architecture solves a critical integration challenge: ClickHouse lacks native Flink connector support (unlike databases such as MySQL or PostgreSQL). Our platform bridges this gap with a custom integration that maintains performance while ensuring data consistency.&lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 Platform Scalability
&lt;/h2&gt;

&lt;p&gt;The platform is designed with three distinct deployment tiers to accommodate different organizational needs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Infrastructure&lt;/th&gt;
&lt;th&gt;Target Audience&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Local Development&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1K msg/sec&lt;/td&gt;
&lt;td&gt;Docker + Kind&lt;/td&gt;
&lt;td&gt;Developers &amp;amp; Testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production Ready&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50K msg/sec&lt;/td&gt;
&lt;td&gt;AWS t3 instances&lt;/td&gt;
&lt;td&gt;SMBs &amp;amp; Growing Companies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enterprise Scale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1M+ msg/sec&lt;/td&gt;
&lt;td&gt;AWS c5 + NVMe&lt;/td&gt;
&lt;td&gt;Large Enterprises&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each configuration maintains the same architectural principles while scaling the underlying infrastructure to match performance requirements and budget constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔧 Platform Capabilities
&lt;/h2&gt;

&lt;h3&gt;
  
  
  High-Volume Event Generation
&lt;/h3&gt;

&lt;p&gt;The platform includes sophisticated event producers that simulate real-world data patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-domain events&lt;/strong&gt;: E-commerce, finance, IoT, gaming, logistics, social media&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AVRO serialization&lt;/strong&gt;: Schema evolution and type safety&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configurable throughput&lt;/strong&gt;: From thousands to millions of events per second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Realistic data patterns&lt;/strong&gt;: User sessions, device interactions, transaction flows&lt;/li&gt;
&lt;/ul&gt;
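&lt;p&gt;A producer along these lines can be sketched in a few lines of plain Java. The field names follow the SensorData AVRO schema used in this platform, but the value ranges and the &lt;code&gt;ProducerSketch&lt;/code&gt; class itself are illustrative assumptions, not the repository's actual generator:&lt;/p&gt;

```java
import java.util.Random;

public class ProducerSketch {
    // Schema-shaped event; field names mirror the SensorData AVRO record.
    record Reading(int sensorId, int sensorType, double temperature,
                   double humidity, double batteryLevel, int status, long timestamp) {}

    static Reading nextReading(Random rnd) {
        return new Reading(
                rnd.nextInt(10_000),            // sensorId
                1 + rnd.nextInt(8),             // sensorType code 1..8
                15.0 + rnd.nextDouble() * 25,   // temperature in Celsius (range is an assumption)
                rnd.nextDouble() * 100,         // humidity as a percentage
                rnd.nextDouble() * 100,         // batteryLevel as a percentage
                1 + rnd.nextInt(4),             // status code 1..4
                System.currentTimeMillis());    // timestamp-millis
    }

    public static void main(String[] args) {
        System.out.println(nextReading(new Random(42)));
    }
}
```

&lt;p&gt;A real producer would serialize each &lt;code&gt;Reading&lt;/code&gt; with the AVRO schema and hand it to a Pulsar producer at the configured target rate.&lt;/p&gt;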

&lt;h3&gt;
  
  
  Distributed Message Streaming
&lt;/h3&gt;

&lt;p&gt;Apache Pulsar provides the messaging infrastructure with enterprise features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenancy&lt;/strong&gt;: Isolated namespaces for different applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema registry&lt;/strong&gt;: Centralized schema management and evolution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geo-replication&lt;/strong&gt;: Cross-region data distribution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tiered storage&lt;/strong&gt;: Cost-effective long-term data retention&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Real-Time Stream Processing
&lt;/h3&gt;

&lt;p&gt;Apache Flink handles complex stream processing scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Windowed aggregations&lt;/strong&gt;: Time-based data summarization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateful processing&lt;/strong&gt;: Maintain context across events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exactly-once semantics&lt;/strong&gt;: Data consistency guarantees&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault tolerance&lt;/strong&gt;: Automatic recovery from failures&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  High-Performance Analytics
&lt;/h3&gt;

&lt;p&gt;ClickHouse delivers sub-second analytical query performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Columnar storage&lt;/strong&gt;: Optimized for analytical workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time ingestion&lt;/strong&gt;: Process streaming data as it arrives&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed queries&lt;/strong&gt;: Scale across multiple nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Materialized views&lt;/strong&gt;: Pre-computed aggregations for instant results&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🎯 Use Cases Across Industries
&lt;/h2&gt;

&lt;h3&gt;
  
  
  E-commerce
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time Inventory&lt;/strong&gt;: Track product availability across warehouses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommendation Engines&lt;/strong&gt;: Process user interactions for personalized suggestions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fraud Detection&lt;/strong&gt;: Analyze payment patterns for suspicious activity&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Finance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trading Analytics&lt;/strong&gt;: Process market data for algorithmic trading&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk Assessment&lt;/strong&gt;: Real-time calculation of portfolio risk metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance Monitoring&lt;/strong&gt;: Track transactions for regulatory compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  IoT &amp;amp; Manufacturing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Predictive Maintenance&lt;/strong&gt;: Analyze sensor data to predict equipment failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality Control&lt;/strong&gt;: Monitor production metrics in real-time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Energy Optimization&lt;/strong&gt;: Track and optimize energy consumption patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Gaming
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Player Analytics&lt;/strong&gt;: Real-time analysis of player behavior and engagement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live Leaderboards&lt;/strong&gt;: Update rankings and achievements instantly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Churn Prediction&lt;/strong&gt;: Identify players at risk of leaving&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🔍 Technical Innovations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Solving the Flink-ClickHouse Integration Challenge
&lt;/h3&gt;

&lt;p&gt;One of the most significant technical hurdles was integrating Apache Flink with ClickHouse. Unlike popular databases such as MySQL, PostgreSQL, or Elasticsearch that have official Flink connectors, ClickHouse lacks native integration support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Challenge:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Architectural mismatch&lt;/strong&gt;: Flink's continuous streaming model vs. ClickHouse's batch-oriented ingestion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transaction limitations&lt;/strong&gt;: ClickHouse lacks full ACID support for Flink's exactly-once guarantees&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance optimization&lt;/strong&gt;: Balancing throughput with data consistency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Our Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom JDBC sink implementation with idempotent writes&lt;/li&gt;
&lt;li&gt;Batch coordination aligned with Flink checkpoints&lt;/li&gt;
&lt;li&gt;Adaptive batching based on ClickHouse cluster performance&lt;/li&gt;
&lt;li&gt;Circuit breaker patterns for fault tolerance&lt;/li&gt;
&lt;/ul&gt;
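&lt;p&gt;The idempotent-write idea can be sketched without any JDBC dependency: buffer rows, flush when a checkpoint completes, and track committed row keys so that replayed writes after a recovery become no-ops. Names here are illustrative, not the platform's actual sink code:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.HashSet;

public class IdempotentBatchSink {
    // Rows waiting to be written; in the real sink this feeds a JDBC batch insert.
    private final ArrayList buffer = new ArrayList();
    // Keys already accepted; makes replayed records after recovery no-ops.
    private final HashSet seenKeys = new HashSet();
    private int flushes = 0;

    void invoke(String rowKey, String row) {
        if (seenKeys.contains(rowKey)) return; // duplicate from replay: skip
        buffer.add(rowKey + "|" + row);
        seenKeys.add(rowKey);
    }

    // Called when a Flink checkpoint completes, aligning the ClickHouse batch
    // with Flink's consistency boundary.
    void flushOnCheckpoint() {
        if (buffer.isEmpty()) return;
        flushes++;        // stand-in for statement.executeBatch()
        buffer.clear();
    }

    int flushCount() { return flushes; }
    int bufferedRows() { return buffer.size(); }
}
```

&lt;p&gt;Aligning flushes with checkpoints means a failure between checkpoints replays at most one batch, and the key set turns that replay into duplicate-free writes.&lt;/p&gt;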

&lt;h3&gt;
  
  
  Multi-Domain Event Schema Design
&lt;/h3&gt;

&lt;p&gt;The platform supports diverse event types across industries through a flexible AVRO schema approach. Here's an example of the IoT sensor data schema used in the platform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"record"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SensorData"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"namespace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"org.apache.pulsar.testclient.avro"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"IoT Sensor Data for Pulsar Performance Testing - Optimized Integer Schema"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sensorId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"int"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Unique sensor identifier (integer for efficiency)"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sensorType"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"int"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Type of sensor (1=temperature, 2=humidity, 3=pressure, 4=motion, 5=light, 6=co2, 7=noise, 8=multisensor)"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"temperature"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"double"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Temperature reading in Celsius"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"humidity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"double"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
      &lt;/span&gt;&lt;span class="nl"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Humidity reading as percentage"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pressure"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"double"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pressure reading in hPa"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"batteryLevel"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"double"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Battery level as percentage"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"int"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Sensor status (1=online, 2=offline, 3=maintenance, 4=error)"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"long"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"logicalType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"timestamp-millis"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Timestamp in milliseconds since epoch"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Schema Design Highlights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance optimized&lt;/strong&gt;: Uses integers for enums and identifiers to minimize serialization overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-sensor support&lt;/strong&gt;: Single schema accommodates 8 different sensor types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive telemetry&lt;/strong&gt;: Captures environmental data, device health, and operational status&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temporal precision&lt;/strong&gt;: Millisecond-level timestamps for accurate event ordering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This design enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema evolution&lt;/strong&gt;: Backward and forward compatibility through AVRO&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type safety&lt;/strong&gt;: Compile-time validation across the pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-domain analytics&lt;/strong&gt;: Unified event processing across business units&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High throughput&lt;/strong&gt;: Optimized data types for maximum serialization performance&lt;/li&gt;
&lt;/ul&gt;
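&lt;p&gt;As a rough illustration of the "optimized data types" point, the sketch below (plain Python, &lt;em&gt;not&lt;/em&gt; the actual Avro wire format; the fixed binary layout is an assumption for illustration) compares a fixed-width packing of one reading against its JSON form:&lt;/p&gt;

```python
import json
import struct

# Field names mirror the schema above; the binary layout is illustrative,
# not real Avro encoding.
reading = {
    "temperature": 21.5,
    "humidity": 48.0,
    "pressure": 1013.2,
    "batteryLevel": 87.0,
    "status": 1,                 # 1=online, per the schema's doc string
    "timestamp": 1730000000000,  # milliseconds since epoch
}

# Four doubles, one 32-bit int, one 64-bit long: 4*8 + 4 + 8 = 44 bytes.
packed = struct.pack(
    ">ddddiq",
    reading["temperature"], reading["humidity"], reading["pressure"],
    reading["batteryLevel"], reading["status"], reading["timestamp"],
)

json_bytes = json.dumps(reading).encode("utf-8")
print(len(packed), len(json_bytes))  # 44 bytes binary vs. a much larger JSON payload
```

&lt;p&gt;At a million messages per second, shaving tens of bytes per event off the wire format compounds into real network and CPU savings, which is why the schema favors ints and doubles over strings.&lt;/p&gt;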

&lt;h2&gt;
  
  
  📊 Performance &amp;amp; Scale
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Benchmark Results
&lt;/h3&gt;

&lt;p&gt;The platform has been benchmarked at each of its deployment scales; the numbers below are from the largest configuration:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise Configuration (1M+ msg/sec):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt;: 1,000,000+ messages per second sustained&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end latency&lt;/strong&gt;: P99 &amp;lt; 100ms
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query performance&lt;/strong&gt;: Sub-second analytical queries on billions of records&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Availability&lt;/strong&gt;: 99.9% uptime with automatic failover&lt;/li&gt;
&lt;/ul&gt;
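&lt;p&gt;For readers new to percentile latency targets: "P99 &amp;lt; 100ms" means 99% of events complete the pipeline within 100ms. A minimal sketch of how a P99 figure is computed from latency samples (synthetic data, standard library only):&lt;/p&gt;

```python
import random
import statistics

# Hypothetical end-to-end latency samples in milliseconds.
random.seed(7)
samples = [random.lognormvariate(3.0, 0.6) for _ in range(100_000)]

# statistics.quantiles with n=100 returns 99 cut points; index 98 is P99.
p99 = statistics.quantiles(samples, n=100)[98]
median = statistics.median(samples)
print(f"median={median:.1f} ms  P99={p99:.1f} ms")
```

&lt;p&gt;Tail percentiles, not averages, are what matter for a latency SLO: a healthy median can hide a long tail that users of a fraud-detection or personalization API will feel.&lt;/p&gt;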

&lt;h3&gt;
  
  
  Technology Stack
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure &amp;amp; Orchestration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes (EKS/Kind) for container orchestration
&lt;/li&gt;
&lt;li&gt;Terraform for infrastructure as code&lt;/li&gt;
&lt;li&gt;Docker for containerization&lt;/li&gt;
&lt;li&gt;Helm for application deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Monitoring &amp;amp; Observability:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grafana dashboards for real-time metrics&lt;/li&gt;
&lt;li&gt;Prometheus for metrics collection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Planned for future work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom alerting for system health&lt;/li&gt;
&lt;li&gt;Distributed tracing for performance debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🛠️ Exploring the Platform
&lt;/h2&gt;

&lt;p&gt;The complete platform is available as an open-source project on GitHub, featuring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive documentation&lt;/strong&gt; for each deployment configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure templates&lt;/strong&gt; using Terraform and Kubernetes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring setup&lt;/strong&gt; with pre-configured Grafana dashboards
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sample applications&lt;/strong&gt; demonstrating multi-domain event processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance tuning guides&lt;/strong&gt; for production optimization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;RealtimeDataPlatform&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Whether you're building a proof of concept or deploying at enterprise scale, the platform provides the foundation for modern real-time analytics infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  📈 Observability &amp;amp; Operations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Comprehensive Monitoring:&lt;/strong&gt;&lt;br&gt;
The platform includes production-ready observability through Grafana dashboards tracking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Message flow metrics&lt;/strong&gt;: Throughput, latency, and backlog across all components&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System health&lt;/strong&gt;: Resource utilization, error rates, and availability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business metrics&lt;/strong&gt;: Event processing rates by domain and event type&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance insights&lt;/strong&gt;: Query execution times and optimization opportunities&lt;/li&gt;
&lt;/ul&gt;
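&lt;p&gt;A minimal sketch of the first metric family (class and method names are illustrative, not the platform's API): sliding-window throughput plus producer/consumer backlog, the two numbers a streaming dashboard watches first:&lt;/p&gt;

```python
from collections import deque

class FlowMetrics:
    """Toy flow-metrics tracker: windowed throughput and backlog."""

    def __init__(self, window_s=60.0):
        self.window_s = window_s
        self.events = deque()  # (timestamp, produced_count) pairs
        self.produced = 0
        self.consumed = 0

    def record(self, ts, produced=0, consumed=0):
        self.produced += produced
        self.consumed += consumed
        self.events.append((ts, produced))
        # Evict samples that fell out of the sliding window.
        while self.events and ts - self.events[0][0] > self.window_s:
            self.events.popleft()

    def throughput(self):
        """Messages/sec over the current window."""
        if len(self.events) < 2:
            return 0.0
        span = self.events[-1][0] - self.events[0][0]
        return sum(c for _, c in self.events) / span

    def backlog(self):
        """Messages produced but not yet consumed."""
        return self.produced - self.consumed

m = FlowMetrics()
for i in range(10):  # one batch per second: 1000 in, 900 out
    m.record(ts=float(i), produced=1000, consumed=900)
print(m.throughput(), m.backlog())
```

&lt;p&gt;A steadily growing backlog at flat throughput is the classic early warning that consumers are underprovisioned, which is exactly what these dashboards are meant to surface.&lt;/p&gt;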

&lt;p&gt;&lt;strong&gt;Operational Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checkpoint-based recovery for zero data loss&lt;/li&gt;
&lt;li&gt;Horizontal scaling based on workload patterns&lt;/li&gt;
&lt;li&gt;Cost tracking and optimization recommendations&lt;/li&gt;
&lt;/ul&gt;
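&lt;p&gt;A minimal sketch of the checkpoint idea (the file path and JSON format are assumptions, not the platform's actual mechanism, which relies on Flink's checkpointing): write the offset to a temporary file and atomically rename it, so a crash can never leave a half-written checkpoint behind:&lt;/p&gt;

```python
import json
import os
import tempfile

def save_checkpoint(path, offset):
    # Write to a temp file in the same directory, then atomically rename.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp, path)  # atomic: readers never observe a partial file

def load_checkpoint(path, default=0):
    # On restart, resume from the last durable offset (or the default).
    try:
        with open(path) as f:
            return json.load(f)["offset"]
    except FileNotFoundError:
        return default

ckpt = os.path.join(tempfile.mkdtemp(), "consumer.ckpt")
save_checkpoint(ckpt, 123456)
print(load_checkpoint(ckpt))
```

&lt;p&gt;Combined with replayable storage on the broker side, resuming from the checkpointed offset reprocesses at most the in-flight events rather than losing them.&lt;/p&gt;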

&lt;h2&gt;
  
  
  🔮 Platform Evolution
&lt;/h2&gt;

&lt;p&gt;The platform continues to evolve with planned enhancements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cloud deployment&lt;/strong&gt; across AWS, GCP, and Azure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream SQL interface&lt;/strong&gt; for simplified data transformations
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ML pipeline integration&lt;/strong&gt; for real-time inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced security&lt;/strong&gt; with end-to-end encryption&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intelligent auto-scaling&lt;/strong&gt; based on workload patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💡 Key Insights
&lt;/h2&gt;

&lt;p&gt;Building this real-time streaming platform highlighted several critical design principles:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Component Synergy&lt;/strong&gt;: The combination of Pulsar's messaging reliability, Flink's processing power, and ClickHouse's analytical performance creates a platform greater than the sum of its parts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Integration Complexity&lt;/strong&gt;: Solving the Flink-ClickHouse integration challenge required custom solutions, but the performance benefits justify the engineering investment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Scalable Architecture&lt;/strong&gt;: Designing for multiple deployment tiers from day one enables organizations to start small and scale without architectural rewrites.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Operational Excellence&lt;/strong&gt;: Production-ready monitoring and automation are essential for managing distributed streaming systems at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Cost Optimization&lt;/strong&gt;: Thoughtful resource allocation and component tuning can achieve enterprise performance at reasonable operational costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  🤝 Community &amp;amp; Future
&lt;/h2&gt;

&lt;p&gt;This platform represents a comprehensive approach to real-time data infrastructure that balances performance, cost, and operational simplicity. By open-sourcing the complete solution, the goal is to accelerate adoption of modern streaming architectures across the industry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interested in real-time streaming platforms?&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⭐ Star the &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;repository&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;💬 Share your streaming architecture experiences in the comments&lt;/li&gt;
&lt;li&gt;🔗 Connect for discussions about real-time data engineering challenges&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;What's your experience with real-time streaming platforms? Have you tackled similar integration challenges?&lt;/strong&gt; I'd love to hear about your approach and lessons learned!&lt;/p&gt;

&lt;p&gt;#RealTimeAnalytics #ApachePulsar #ApacheFlink #ClickHouse #DataEngineering #StreamProcessing #BigData #EventStreaming&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>dataengineering</category>
      <category>performance</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
