<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: HyperscaleDesignHub</title>
    <description>The latest articles on Forem by HyperscaleDesignHub (@vijaya_bhaskarv_ba95adf9).</description>
    <link>https://forem.com/vijaya_bhaskarv_ba95adf9</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2128567%2F6b2b3eee-1906-469e-9ecc-8f7f4db2f0c6.png</url>
      <title>Forem: HyperscaleDesignHub</title>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/vijaya_bhaskarv_ba95adf9"/>
    <language>en</language>
    <item>
      <title>Who Needs Real-Time Streaming? Use Cases &amp; Architecture Across Industries</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 14:01:10 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/who-needs-real-time-streaming-use-cases-architecture-across-industries-5fl5</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/who-needs-real-time-streaming-use-cases-architecture-across-industries-5fl5</guid>
      <description>&lt;p&gt;In today's fast-paced digital world, the question isn't "Do I need real-time data streaming?" but rather "How fast do I need it?" From detecting fraudulent transactions in milliseconds to optimizing supply chains in real-time, streaming data has become the backbone of modern digital experiences.&lt;/p&gt;

&lt;p&gt;But here's the thing: &lt;strong&gt;not everyone needs to process 1 million messages per second&lt;/strong&gt;. Your startup's user analytics might work perfectly fine with 1,000 events per second, while a major bank's fraud detection system requires enterprise-grade throughput.&lt;/p&gt;

&lt;p&gt;Let me show you &lt;strong&gt;when you need real-time streaming&lt;/strong&gt;, &lt;strong&gt;what scale you actually need&lt;/strong&gt;, and &lt;strong&gt;how to architect it properly&lt;/strong&gt; across different industries.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 The Real-Time Spectrum: When Every Second Counts
&lt;/h2&gt;

&lt;p&gt;Before diving into use cases, let's understand the different flavors of "real-time":&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Use Case&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt;1ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High-frequency trading&lt;/td&gt;
&lt;td&gt;Stock market microsecond arbitrage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt;100ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gaming &amp;amp; Interactive&lt;/td&gt;
&lt;td&gt;Real-time leaderboards, live chat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt;1 second&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fraud detection&lt;/td&gt;
&lt;td&gt;Credit card transaction blocking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt;10 seconds&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Monitoring &amp;amp; Alerts&lt;/td&gt;
&lt;td&gt;Infrastructure failure detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt;1 minute&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Analytics &amp;amp; Dashboards&lt;/td&gt;
&lt;td&gt;Real-time business metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The key insight&lt;/strong&gt;: Match your architecture complexity to your actual latency requirements. Don't over-engineer!&lt;/p&gt;
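&lt;p&gt;As a rough sketch, the tier lookup in the table above can be expressed in a few lines of Python (the bounds and tier names mirror the table; the function itself is illustrative, not part of the platform):&lt;/p&gt;

```python
import bisect

# Upper latency bounds in milliseconds for each tier in the table above.
TIER_BOUNDS_MS = [1, 100, 1_000, 10_000, 60_000]
TIER_NAMES = [
    "high-frequency trading",
    "gaming and interactive",
    "fraud detection",
    "monitoring and alerts",
    "analytics and dashboards",
]

def latency_tier(budget_ms):
    """Return the coarsest tier whose upper bound still fits the budget."""
    i = bisect.bisect_left(TIER_BOUNDS_MS, budget_ms)
    return TIER_NAMES[min(i, len(TIER_NAMES) - 1)]
```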

&lt;h2&gt;
  
  
  🏗️ Architecture Patterns by Scale
&lt;/h2&gt;

&lt;p&gt;Based on the &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;RealtimeDataPlatform&lt;/a&gt; implementations, here are three proven architectures:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Local Development&lt;/strong&gt; (~1K msg/sec)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost: FREE | Use Case: Development &amp;amp; Testing&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Producer  │───▶│   Pulsar    │───▶│    Flink    │
│  (Docker)   │    │  (Docker)   │    │  (Docker)   │
└─────────────┘    └─────────────┘    └──────┬──────┘
                                              │
┌─────────────┐    ┌─────────────┐           │
│   Grafana   │◀───│ ClickHouse  │◀──────────┘
│ (Dashboards)│    │ (Storage)   │
└─────────────┘    └─────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Perfect for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proof of concepts&lt;/li&gt;
&lt;li&gt;Algorithm development&lt;/li&gt;
&lt;li&gt;Learning streaming concepts&lt;/li&gt;
&lt;li&gt;Small team experiments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Small-Medium Business&lt;/strong&gt; (50K msg/sec)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost: $200-250/month | Use Case: Growing Companies&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────┐
│                AWS EKS (t3.medium nodes)                │
│                                                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │  Producer   │  │   Pulsar    │  │    Flink    │     │
│  │ (10K IDs)   │─▶│ (3 brokers) │─▶│ (2 workers) │     │
│  └─────────────┘  └─────────────┘  └──────┬──────┘     │
│                                           │              │
│  ┌─────────────┐  ┌─────────────┐        │              │
│  │ Monitoring  │  │ ClickHouse  │◀───────┘              │
│  │ (Grafana)   │  │(2 replicas) │                       │
│  └─────────────┘  └─────────────┘                       │
└─────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Perfect for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SaaS platforms&lt;/li&gt;
&lt;li&gt;Regional e-commerce&lt;/li&gt;
&lt;li&gt;IoT startups&lt;/li&gt;
&lt;li&gt;Gaming companies&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Enterprise Scale&lt;/strong&gt; (1M msg/sec)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost: $25,000/month | Use Case: Large Organizations&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│           AWS EKS (c5.2xlarge + NVMe storage)              │
│                                                             │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌────────┐ │
│ │  Producer   │ │   Pulsar    │ │    Flink    │ │ Click  │ │
│ │(100K IDs)   │▶│(6 brokers)  │▶│(6 workers)  │▶│ House  │ │
│ │Multi-AZ     │ │Multi-AZ     │ │Multi-AZ     │ │Multi-AZ│ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └────────┘ │
│                                                             │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │          VictoriaMetrics + Grafana Stack                │ │
│ │    (Unified monitoring across all components)           │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Perfect for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Global financial institutions&lt;/li&gt;
&lt;li&gt;Major e-commerce platforms&lt;/li&gt;
&lt;li&gt;Telecommunications providers&lt;/li&gt;
&lt;li&gt;Enterprise IoT deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🏢 Industry Use Cases: When Real-Time Makes Business Sense
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🛒 &lt;strong&gt;E-Commerce: Every Click Counts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why Real-Time Matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cart abandonment&lt;/strong&gt;: React within seconds to offer discounts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inventory management&lt;/strong&gt;: Prevent overselling during flash sales&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fraud prevention&lt;/strong&gt;: Block suspicious transactions instantly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personalization&lt;/strong&gt;: Update recommendations as users browse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-World Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📱 User adds iPhone to cart
    ↓ (50ms)
🔍 Inventory check: 2 units left
    ↓ (100ms)
💰 Price optimization: Apply 5% discount for cart abandonment risk
    ↓ (200ms)
🎯 Recommendation update: Show compatible accessories
    ↓ (500ms)
📊 Analytics: Update real-time sales dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture Need&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-50k-events" rel="noopener noreferrer"&gt;50K MPS Setup&lt;/a&gt; for most e-commerce, 1M MPS for Amazon-scale.&lt;/p&gt;
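&lt;p&gt;The overselling guard in the flow above boils down to an atomic check-and-decrement. A minimal in-process sketch (the class name is my own; a production stream processor would delegate this to an atomic store such as Redis rather than a local lock):&lt;/p&gt;

```python
import threading

class InventoryGuard:
    """Illustrative in-memory stock reservation for flash-sale events."""

    def __init__(self, stock):
        self._stock = dict(stock)
        self._lock = threading.Lock()

    def reserve(self, sku):
        """Atomically reserve one unit; return True only if stock remained."""
        with self._lock:
            left = self._stock.get(sku, 0)
            if left == 0:
                return False  # sold out: reject the cart event
            self._stock[sku] = left - 1
            return True
```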

&lt;p&gt;&lt;strong&gt;Key Metrics to Stream:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Page views and click events&lt;/li&gt;
&lt;li&gt;Cart modifications&lt;/li&gt;
&lt;li&gt;Payment transactions&lt;/li&gt;
&lt;li&gt;Inventory levels&lt;/li&gt;
&lt;li&gt;User session data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  💰 &lt;strong&gt;Finance: Milliseconds = Millions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why Real-Time Matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High-frequency trading&lt;/strong&gt;: Execute trades in microseconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fraud detection&lt;/strong&gt;: Block transactions before completion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk management&lt;/strong&gt;: Adjust portfolios based on market movements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance&lt;/strong&gt;: Real-time reporting for regulatory requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-World Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;💳 Credit card swipe: $5,000 transaction
    ↓ (10ms)
🤖 ML Model: Unusual amount + new location = 85% fraud probability
    ↓ (50ms)
🚫 Transaction blocked + SMS sent to customer
    ↓ (100ms)
📊 Risk dashboard updated: +1 blocked fraud attempt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture Need&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events" rel="noopener noreferrer"&gt;1M MPS Setup&lt;/a&gt; for major financial institutions.&lt;/p&gt;
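&lt;p&gt;The blocking decision in the flow above is, at its simplest, a weighted score against a threshold. A toy sketch (signal names, weights, and the threshold are all illustrative, standing in for a real ML model):&lt;/p&gt;

```python
import operator

# Hypothetical evidence weights; a real system would use a trained model.
WEIGHTS = {"unusual_amount": 0.5, "new_location": 0.35, "velocity_spike": 0.15}
BLOCK_THRESHOLD = 0.8

def fraud_score(signals):
    """Sum the weights of the signals that fired, capped at 1.0."""
    return min(1.0, sum(WEIGHTS.get(s, 0.0) for s in signals))

def should_block(signals):
    """True when the score meets or exceeds the block threshold."""
    return operator.ge(fraud_score(signals), BLOCK_THRESHOLD)
```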

&lt;p&gt;&lt;strong&gt;Key Metrics to Stream:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transaction data&lt;/li&gt;
&lt;li&gt;Market price feeds&lt;/li&gt;
&lt;li&gt;Risk calculations&lt;/li&gt;
&lt;li&gt;Customer behavior patterns&lt;/li&gt;
&lt;li&gt;Regulatory compliance events&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🎮 &lt;strong&gt;Gaming: Real-Time Engagement&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why Real-Time Matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Leaderboards&lt;/strong&gt;: Update rankings instantly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Matchmaking&lt;/strong&gt;: Pair players with similar skill levels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In-game events&lt;/strong&gt;: Dynamic content based on player actions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-cheat&lt;/strong&gt;: Detect suspicious behavior patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-World Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🎯 Player achieves high score: 1,245,830 points
    ↓ (10ms)
🏆 Leaderboard update: #3 globally
    ↓ (50ms)
🎊 Achievement unlocked: "Top 10 Global"
    ↓ (100ms)
👥 Notify friends: "Alex just reached #3!"
    ↓ (200ms)
💰 Offer premium upgrade: "Celebrate with special skin!"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture Need&lt;/strong&gt;: 50K MPS for indie games, 1M MPS for AAA multiplayer games.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics to Stream:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Player actions and scores&lt;/li&gt;
&lt;li&gt;In-game purchases&lt;/li&gt;
&lt;li&gt;Session duration and engagement&lt;/li&gt;
&lt;li&gt;Performance metrics&lt;/li&gt;
&lt;li&gt;Social interactions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🏭 &lt;strong&gt;IoT &amp;amp; Manufacturing: Predictive Intelligence&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why Real-Time Matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Predictive maintenance&lt;/strong&gt;: Fix equipment before it breaks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality control&lt;/strong&gt;: Detect defects in real time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Energy optimization&lt;/strong&gt;: Adjust consumption based on demand&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety monitoring&lt;/strong&gt;: Immediate alerts for dangerous conditions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-World Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🌡️ Temperature sensor: 85°C (normal: 70°C)
    ↓ (1 second)
⚠️ Anomaly detection: Temperature rising trend
    ↓ (2 seconds)
🔧 Maintenance alert: Schedule inspection within 4 hours
    ↓ (5 seconds)
📊 Dashboard update: Equipment health status = Warning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture Need&lt;/strong&gt;: 50K MPS for smart buildings, 1M MPS for industrial IoT.&lt;/p&gt;
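&lt;p&gt;The anomaly step in the flow above can be approximated with a rolling mean over recent readings. A minimal sketch (the 70°C baseline and 10°C margin follow the example; the class itself is illustrative):&lt;/p&gt;

```python
from collections import deque

class TempMonitor:
    """Rolling-window temperature check for one sensor."""

    def __init__(self, normal_c=70.0, margin_c=10.0, window=5):
        self.readings = deque(maxlen=window)
        self.limit = normal_c + margin_c

    def ingest(self, celsius):
        """Record a reading; return True once the rolling mean exceeds the limit."""
        self.readings.append(celsius)
        avg = sum(self.readings) / len(self.readings)
        # A positive excess over the limit is truthy, so this flags the anomaly.
        return bool(max(0.0, avg - self.limit))
```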

&lt;p&gt;&lt;strong&gt;Key Metrics to Stream:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sensor readings (temperature, pressure, vibration)&lt;/li&gt;
&lt;li&gt;Equipment status and performance&lt;/li&gt;
&lt;li&gt;Environmental conditions&lt;/li&gt;
&lt;li&gt;Energy consumption&lt;/li&gt;
&lt;li&gt;Safety alerts&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  📱 &lt;strong&gt;Social Media: Viral Content Detection&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why Real-Time Matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trending topics&lt;/strong&gt;: Identify viral content early&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content moderation&lt;/strong&gt;: Remove harmful content instantly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engagement optimization&lt;/strong&gt;: Boost high-performing posts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Influencer identification&lt;/strong&gt;: Spot rising content creators&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-World Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📸 User posts photo with #NewProduct
    ↓ (100ms)
🔥 Engagement spike: 1000 likes in 2 minutes
    ↓ (1 second)
📈 Trending algorithm: Boost to wider audience
    ↓ (5 seconds)
💰 Ad targeting: Show related product ads
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture Need&lt;/strong&gt;: 1M MPS for major platforms, 50K MPS for niche communities.&lt;/p&gt;

&lt;h3&gt;
  
  
  🚛 &lt;strong&gt;Logistics: Supply Chain Optimization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why Real-Time Matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Route optimization&lt;/strong&gt;: Adjust for traffic and weather&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inventory tracking&lt;/strong&gt;: Real-time stock levels across warehouses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delivery predictions&lt;/strong&gt;: Accurate ETAs for customers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exception handling&lt;/strong&gt;: Immediate response to delays&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-World Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📦 Package scanned at distribution center
    ↓ (500ms)
🗺️ Route optimization: Traffic jam detected, reroute
    ↓ (2 seconds)
📱 Customer notification: "Delivery delayed by 30 minutes"
    ↓ (5 seconds)
📊 Analytics: Update delivery performance metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture Need&lt;/strong&gt;: 50K MPS for regional logistics, 1M MPS for global shipping companies.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎪 &lt;strong&gt;When Real-Time Becomes Essential&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Not every use case needs real-time processing. Here's when it becomes critical:&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ &lt;strong&gt;Perfect for Real-Time Streaming&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;User Experience Depends on Speed&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gaming leaderboards&lt;/li&gt;
&lt;li&gt;Live chat applications&lt;/li&gt;
&lt;li&gt;Real-time collaboration tools&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Financial Impact of Delays&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trading platforms&lt;/li&gt;
&lt;li&gt;Fraud detection&lt;/li&gt;
&lt;li&gt;Dynamic pricing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Safety-Critical Systems&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Medical monitoring&lt;/li&gt;
&lt;li&gt;Industrial safety&lt;/li&gt;
&lt;li&gt;Autonomous vehicles&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Competitive Advantage through Speed&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Personalized recommendations&lt;/li&gt;
&lt;li&gt;Real-time offers&lt;/li&gt;
&lt;li&gt;Instant customer support&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  ❌ &lt;strong&gt;Better with Batch Processing&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Historical Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monthly sales reports&lt;/li&gt;
&lt;li&gt;Annual compliance reporting&lt;/li&gt;
&lt;li&gt;Data warehouse ETL&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Complex Computations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Machine learning model training&lt;/li&gt;
&lt;li&gt;Financial reconciliation&lt;/li&gt;
&lt;li&gt;Scientific simulations&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cost-Sensitive Operations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backup processing&lt;/li&gt;
&lt;li&gt;Archive operations&lt;/li&gt;
&lt;li&gt;Non-urgent analytics&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🛠️ &lt;strong&gt;Technology Stack Breakdown&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's what powers the &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;RealtimeDataPlatform&lt;/a&gt; across different scales:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Core Components&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Message Broker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Apache Pulsar&lt;/span&gt;
  &lt;span class="s"&gt;- Why&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Better than Kafka for geo-replication&lt;/span&gt;
  &lt;span class="s"&gt;- Scalability&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Handles multi-tenant workloads&lt;/span&gt;
  &lt;span class="s"&gt;- Features&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Built-in schema registry, tiered storage&lt;/span&gt;

&lt;span class="na"&gt;Stream Processing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Apache Flink&lt;/span&gt;  
  &lt;span class="s"&gt;- Why&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;True low-latency processing&lt;/span&gt;
  &lt;span class="s"&gt;- Scalability&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Horizontal scaling with checkpointing&lt;/span&gt;
  &lt;span class="s"&gt;- Features&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Event-time processing, stateful operations&lt;/span&gt;

&lt;span class="na"&gt;Storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClickHouse&lt;/span&gt;
  &lt;span class="s"&gt;- Why&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Optimized for analytical queries&lt;/span&gt;
  &lt;span class="s"&gt;- Scalability&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Columnar storage with compression&lt;/span&gt;
  &lt;span class="s"&gt;- Features&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Real-time ingestion, SQL interface&lt;/span&gt;

&lt;span class="na"&gt;Monitoring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Grafana + VictoriaMetrics&lt;/span&gt;
  &lt;span class="s"&gt;- Why&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Unified observability across all components&lt;/span&gt;
  &lt;span class="s"&gt;- Scalability&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Better compression than Prometheus&lt;/span&gt;
  &lt;span class="s"&gt;- Features&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Custom dashboards, alerting&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Scaling Strategies&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;From 1K to 50K messages/sec:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal Pod Scaling&lt;/strong&gt;: Add more Flink TaskManagers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage Optimization&lt;/strong&gt;: Partition ClickHouse tables by time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Optimization&lt;/strong&gt;: Use node affinity for co-location&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;From 50K to 1M messages/sec:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure Upgrade&lt;/strong&gt;: c5.2xlarge instances with NVMe&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-AZ Deployment&lt;/strong&gt;: Distribute across availability zones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Monitoring&lt;/strong&gt;: Dedicated monitoring namespace&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  📊 &lt;strong&gt;ROI Calculator: Is Real-Time Worth It?&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cost Analysis Template&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Real-Time Implementation Cost:
- Infrastructure: $200-25,000/month (based on scale)
- Development: 2-6 months
- Maintenance: 20% of development cost annually

Business Value Calculation:
- Revenue increase from faster responses
- Cost savings from early problem detection  
- Competitive advantage quantification
- Customer satisfaction improvement

Break-even typically: 6-18 months
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
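&lt;p&gt;The template above translates directly into a quick calculator. A rough sketch, using the template's 20% annual maintenance assumption (amortized monthly) and a single "monthly business value" estimate you supply:&lt;/p&gt;

```python
def breakeven_months(monthly_infra, dev_cost, monthly_value):
    """Months until cumulative value covers dev cost plus running costs.

    Maintenance is folded in at 20 percent of dev cost per year, as in the
    template above. Returns None when the project never pays back.
    """
    monthly_maintenance = dev_cost * 0.20 / 12
    net = monthly_value - monthly_infra - monthly_maintenance
    if not max(0.0, net):
        return None  # running costs eat all the value; no payback
    return dev_cost / net
```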



&lt;h3&gt;
  
  
  &lt;strong&gt;Decision Framework&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Ask yourself:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How much does a 1-hour delay cost your business?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If &amp;gt;$1,000: Consider real-time&lt;/li&gt;
&lt;li&gt;If &amp;gt;$10,000: Real-time is essential&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What's your user expectation?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gaming: &amp;lt;100ms expected&lt;/li&gt;
&lt;li&gt;E-commerce: &amp;lt;1s acceptable&lt;/li&gt;
&lt;li&gt;Analytics: &amp;lt;1 minute usually fine&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How complex is your processing?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple aggregations: Real-time feasible&lt;/li&gt;
&lt;li&gt;ML training: Stick to batch processing&lt;/li&gt;
&lt;li&gt;Fraud detection: Real-time critical&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🚀 &lt;strong&gt;Getting Started: Your Real-Time Journey&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 1: Proof of Concept (Week 1-2)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start with local development setup&lt;/span&gt;
git clone https://github.com/hyperscaledesignhub/RealtimeDataPlatform
&lt;span class="nb"&gt;cd &lt;/span&gt;local-setup
./scripts/start-pipeline.sh

&lt;span class="c"&gt;# Experiment with your data patterns&lt;/span&gt;
&lt;span class="c"&gt;# Measure actual throughput requirements&lt;/span&gt;
&lt;span class="c"&gt;# Validate business value&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 2: Production Pilot (Month 1-2)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy 50K MPS setup for initial load&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;realtime-platform-50k-events
&lt;span class="c"&gt;# Follow deployment guide&lt;/span&gt;
&lt;span class="c"&gt;# Monitor performance and costs&lt;/span&gt;
&lt;span class="c"&gt;# Gather user feedback&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 3: Scale as Needed (Month 3+)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Upgrade to 1M MPS if requirements justify it&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;realtime-platform-1million-events
&lt;span class="c"&gt;# Enterprise-grade monitoring and alerting&lt;/span&gt;
&lt;span class="c"&gt;# Multi-region deployment for global reach&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎯 &lt;strong&gt;Industry-Specific Quick Start Guides&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;E-Commerce Startup&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with&lt;/strong&gt;: Local setup for development&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale to&lt;/strong&gt;: 50K MPS when you hit 10K daily users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on&lt;/strong&gt;: Cart abandonment, inventory tracking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key metrics&lt;/strong&gt;: Conversion rate, page load time&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Financial Services&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with&lt;/strong&gt;: 50K MPS for fraud detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale to&lt;/strong&gt;: 1M MPS for trading platforms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on&lt;/strong&gt;: Transaction monitoring, risk analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key metrics&lt;/strong&gt;: Fraud detection rate, latency&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;IoT Company&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with&lt;/strong&gt;: 50K MPS for device monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale to&lt;/strong&gt;: 1M MPS for industrial deployments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on&lt;/strong&gt;: Predictive maintenance, anomaly detection
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key metrics&lt;/strong&gt;: Uptime, maintenance cost savings&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Gaming Studio&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with&lt;/strong&gt;: Local setup for single-player games&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale to&lt;/strong&gt;: 1M MPS for massively multiplayer games&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on&lt;/strong&gt;: Real-time leaderboards, matchmaking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key metrics&lt;/strong&gt;: Player engagement, session duration&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🏁 &lt;strong&gt;Conclusion: The Real-Time Imperative&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Real-time streaming isn't just a technology choice—it's a &lt;strong&gt;business strategy&lt;/strong&gt;. The companies winning today are those who can &lt;strong&gt;act on data as it happens&lt;/strong&gt;, not hours or days later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Match your architecture to your actual needs&lt;/strong&gt;—don't over-engineer&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Start small and scale progressively&lt;/strong&gt; based on proven business value&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Focus on the use cases that directly impact revenue or user experience&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Invest in monitoring and observability from day one&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Consider the total cost of ownership, not just infrastructure costs&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The bottom line&lt;/strong&gt;: If waiting for data costs you more than processing it in real time, you need streaming architecture. If your users expect instant responses, you need real-time processing. If your competitors are faster, you need to catch up.&lt;/p&gt;

&lt;p&gt;The question isn't whether you'll adopt real-time streaming—it's &lt;strong&gt;when&lt;/strong&gt; and &lt;strong&gt;at what scale&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  📚 &lt;strong&gt;Resources &amp;amp; Next Steps&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complete Platform&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;RealtimeDataPlatform GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;50K Setup Guide&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-50k-events" rel="noopener noreferrer"&gt;Small-Medium Business Architecture&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1M Setup Guide&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events" rel="noopener noreferrer"&gt;Enterprise Architecture&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local Development&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/local-setup" rel="noopener noreferrer"&gt;Quick Start Guide&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;What's your real-time streaming use case? Share your requirements and challenges in the comments!&lt;/strong&gt; 👇&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow me for more posts on streaming architecture, scalability patterns, and production DevOps!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #realtime #streaming #architecture #iot #ecommerce #finance #gaming #devops #microservices&lt;/p&gt;




&lt;h2&gt;
  
  
  🎮 &lt;strong&gt;Interactive Use Case Matcher&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Answer these questions to find your ideal architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What's your expected peak throughput?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&amp;lt;1K msg/sec → Local development setup&lt;/li&gt;
&lt;li&gt;1K-50K msg/sec → 50K MPS architecture&lt;/li&gt;
&lt;li&gt;&amp;gt;50K msg/sec → 1M MPS enterprise setup&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What's your latency requirement?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&amp;lt;100ms → Gaming/trading focused setup&lt;/li&gt;
&lt;li&gt;&amp;lt;1 second → E-commerce/fraud detection&lt;/li&gt;
&lt;li&gt;&amp;lt;10 seconds → Analytics/monitoring&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What's your budget?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free → Local development&lt;/li&gt;
&lt;li&gt;$200-500/month → 50K MPS&lt;/li&gt;
&lt;li&gt;$25,000+/month → Enterprise 1M MPS&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What's your industry?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;E-commerce → Focus on cart/inventory streams&lt;/li&gt;
&lt;li&gt;Finance → Emphasize fraud detection&lt;/li&gt;
&lt;li&gt;IoT → Sensor data and predictive maintenance&lt;/li&gt;
&lt;li&gt;Gaming → Real-time leaderboards and events&lt;/li&gt;
&lt;li&gt;Social Media → Content engagement tracking&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
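&lt;p&gt;The throughput question above is really a tier lookup. Here's that decision as a tiny POSIX shell helper, with the tier boundaries taken straight from the list above (the function name is just for illustration):&lt;/p&gt;

```shell
#!/bin/sh
# Map a peak-throughput figure (msg/sec) to the matching setup guide,
# using the tiers from question 1 above.
recommend() {
  mps=$1
  if [ "$mps" -lt 1000 ]; then
    echo "Local development setup"
  elif [ "$mps" -le 50000 ]; then
    echo "50K MPS architecture"
  else
    echo "1M MPS enterprise setup"
  fi
}

recommend 500      # Local development setup
recommend 20000    # 50K MPS architecture
recommend 200000   # 1M MPS enterprise setup
```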

&lt;p&gt;&lt;strong&gt;Got your answers? Check the corresponding setup guide and start building! 🚀&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>dataengineering</category>
      <category>performance</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Realtime Data Streaming Platform: Building a Unified Monitoring Stack</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 13:52:57 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/realtime-data-streaming-platform-building-a-unified-monitoring-stack-n9o</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/realtime-data-streaming-platform-building-a-unified-monitoring-stack-n9o</guid>
      <description>&lt;p&gt;When you're running a real-time streaming platform processing &lt;strong&gt;1 million messages per second&lt;/strong&gt;, you can't afford to be blind. You need comprehensive monitoring across all components - Pulsar, Flink, and ClickHouse - in a single unified view.&lt;/p&gt;

&lt;p&gt;In this guide, I'll show you how to build a &lt;strong&gt;production-grade monitoring stack&lt;/strong&gt; that provides real-time visibility into your entire streaming pipeline using &lt;strong&gt;VictoriaMetrics&lt;/strong&gt; and &lt;strong&gt;Grafana&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 What We're Building
&lt;/h2&gt;

&lt;p&gt;A unified monitoring solution that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📊 &lt;strong&gt;Single Grafana instance&lt;/strong&gt; for all components&lt;/li&gt;
&lt;li&gt;⚡ &lt;strong&gt;VictoriaMetrics&lt;/strong&gt; as the metrics backend (Prometheus-compatible)&lt;/li&gt;
&lt;li&gt;📈 &lt;strong&gt;Real-time dashboards&lt;/strong&gt; for Pulsar, Flink, and ClickHouse&lt;/li&gt;
&lt;li&gt;🔌 &lt;strong&gt;Automated setup&lt;/strong&gt; with scripts and Helm charts&lt;/li&gt;
&lt;li&gt;🎨 &lt;strong&gt;Pre-built dashboards&lt;/strong&gt; ready to import&lt;/li&gt;
&lt;li&gt;🚀 &lt;strong&gt;Scalable&lt;/strong&gt; to handle 1M+ metrics/sec&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🏗️ Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│               Unified Monitoring Architecture                   │
└─────────────────────────────────────────────────────────────────┘

┌─────────────┐  ┌─────────────┐  ┌──────────────┐
│   Pulsar    │  │    Flink    │  │  ClickHouse  │
│   Metrics   │  │   Metrics   │  │   Metrics    │
│   :9090     │  │   :9249     │  │   :8123      │
└──────┬──────┘  └──────┬──────┘  └──────┬───────┘
       │                │                 │
       │  Prometheus    │   Prometheus    │  SQL
       │  Exposition    │   Reporter      │  Queries
       │                │                 │
       └────────────────┴─────────────────┘
                       │
                       ▼
              ┌─────────────────┐
              │   VMAgent       │
              │  (Collector)    │
              │                 │
              │  Scrapes all    │
              │  /metrics       │
              │  endpoints      │
              └────────┬────────┘
                       │
                       ▼
              ┌─────────────────┐
              │  VictoriaMetrics│
              │   (Storage)     │
              │                 │
              │  Time-series    │
              │  Database       │
              └────────┬────────┘
                       │
                       ▼
              ┌─────────────────┐
              │    Grafana      │
              │  (Dashboards)   │
              │                 │
              │  • Pulsar       │
              │  • Flink        │
              │  • ClickHouse   │
              └─────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📦 The Monitoring Stack
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;VictoriaMetrics Kubernetes Stack&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Pulsar Helm chart includes &lt;strong&gt;victoria-metrics-k8s-stack&lt;/strong&gt; as a dependency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pulsar-load/helm/pulsar/Chart.yaml&lt;/span&gt;
&lt;span class="na"&gt;dependencies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;victoria-metrics-k8s-stack.enabled&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;victoria-metrics-k8s-stack&lt;/span&gt;
  &lt;span class="na"&gt;repository&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://victoriametrics.github.io/helm-charts/&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.38.x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What's included:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;VMAgent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metrics collector (replaces Prometheus)&lt;/td&gt;
&lt;td&gt;8429&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;VMSingle&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Time-series database storage&lt;/td&gt;
&lt;td&gt;8429&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Grafana&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Visualization and dashboards&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kube-State-Metrics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kubernetes cluster metrics&lt;/td&gt;
&lt;td&gt;8080&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Node-Exporter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Node-level metrics&lt;/td&gt;
&lt;td&gt;9100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why VictoriaMetrics over Prometheus?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;10x better compression&lt;/strong&gt; (less storage)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Faster queries&lt;/strong&gt; (optimized for large datasets)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Lower memory usage&lt;/strong&gt; (~2GB vs Prometheus's 16GB)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Prometheus-compatible&lt;/strong&gt; (drop-in replacement)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Better retention&lt;/strong&gt; (handles months of data)  &lt;/p&gt;
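&lt;p&gt;To make the compression point concrete, here's a back-of-envelope storage estimate. The bytes-per-sample figure is an assumption (VictoriaMetrics' docs cite well under 1 byte per sample for typical series), not a measured benchmark:&lt;/p&gt;

```shell
# Rough daily storage for 1M samples/sec; BYTES_PER_SAMPLE is an assumption
SAMPLES_PER_SEC=1000000
BYTES_PER_SAMPLE=0.8
SECONDS_PER_DAY=86400
awk -v s="$SAMPLES_PER_SEC" -v b="$BYTES_PER_SAMPLE" -v d="$SECONDS_PER_DAY" \
  'BEGIN { printf "~%.1f GB/day\n", s*b*d/1e9 }'
```

At these assumptions, a full day of 1M samples/sec lands around 70 GB before replication — small enough that months of retention on a single gp3 volume is realistic.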
&lt;h3&gt;
  
  
  2. &lt;strong&gt;Configuration in pulsar-values.yaml&lt;/strong&gt;
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Enable Victoria Metrics stack&lt;/span&gt;
&lt;span class="na"&gt;victoria-metrics-k8s-stack&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="c1"&gt;# VMAgent - Metrics collector&lt;/span&gt;
  &lt;span class="na"&gt;vmagent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;scrapeInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15s&lt;/span&gt;
      &lt;span class="na"&gt;externalLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;benchmark-high-infra&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;

  &lt;span class="c1"&gt;# Grafana configuration&lt;/span&gt;
  &lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;adminPassword&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin123"&lt;/span&gt;  &lt;span class="c1"&gt;# Change in production!&lt;/span&gt;

    &lt;span class="c1"&gt;# Persistent storage for dashboards&lt;/span&gt;
    &lt;span class="na"&gt;persistence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10Gi&lt;/span&gt;
      &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gp3"&lt;/span&gt;

    &lt;span class="c1"&gt;# Service exposure&lt;/span&gt;
    &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LoadBalancer&lt;/span&gt;  &lt;span class="c1"&gt;# Accessible externally&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;

    &lt;span class="c1"&gt;# Default datasource (VictoriaMetrics)&lt;/span&gt;
    &lt;span class="na"&gt;datasources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;datasources.yaml&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
        &lt;span class="na"&gt;datasources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VictoriaMetrics&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
          &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://vmsingle-pulsar-victoria-metrics-k8s-stack:8429&lt;/span&gt;
          &lt;span class="na"&gt;isDefault&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Persistent Dashboards:&lt;/strong&gt; Survives pod restarts&lt;br&gt;&lt;br&gt;
🔹 &lt;strong&gt;LoadBalancer Service:&lt;/strong&gt; Direct external access&lt;br&gt;&lt;br&gt;
🔹 &lt;strong&gt;Auto-discovery:&lt;/strong&gt; Automatically scrapes Pulsar pods&lt;br&gt;&lt;br&gt;
🔹 &lt;strong&gt;Pre-configured:&lt;/strong&gt; Works out-of-the-box  &lt;/p&gt;
&lt;h2&gt;
  
  
  🔧 Setting Up Flink Metrics
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Step 1: Run the setup-flink-metrics.sh Script
&lt;/h3&gt;

&lt;p&gt;This script does the heavy lifting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;realtime-platform-1million-events/flink-load
./setup-flink-metrics.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What it does, step by step:&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 1: Create Flink Configuration&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Applies flink-config-configmap.yaml&lt;/span&gt;
apiVersion: v1
kind: ConfigMap
metadata:
  name: flink-config
  namespace: flink-benchmark
data:
  flink-conf.yaml: |
    metrics.reporters: prometheus
    metrics.reporter.prometheus.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
    metrics.reporter.prometheus.port: 9249-9259
    metrics.system-resource: &lt;span class="s2"&gt;"true"&lt;/span&gt;
    metrics.system-resource-probing-interval: &lt;span class="s2"&gt;"5000"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Step 2: Setup Prometheus Integration&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Creates VMPodScrape for Victoria Metrics&lt;/span&gt;
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMPodScrape
metadata:
  name: flink-metrics
  namespace: pulsar  &lt;span class="c"&gt;# ← Created in Pulsar namespace!&lt;/span&gt;
spec:
  selector:
    matchLabels:
      app: iot-flink-job
  namespaceSelector:
    matchNames:
      - flink-benchmark  &lt;span class="c"&gt;# ← Scrapes from Flink namespace&lt;/span&gt;
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 15s  &lt;span class="c"&gt;# Scrape every 15 seconds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why VMPodScrape in Pulsar namespace?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VMAgent runs in the Pulsar namespace&lt;/li&gt;
&lt;li&gt;It needs permission to scrape other namespaces&lt;/li&gt;
&lt;li&gt;Cross-namespace scraping via &lt;code&gt;namespaceSelector&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 3-4: Patch Flink Deployments&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Downloads Prometheus reporter JAR via initContainer&lt;/span&gt;
initContainers:
- name: download-prometheus-jar
  image: curlimages/curl:latest
  &lt;span class="nb"&gt;command&lt;/span&gt;:
  - sh
  - &lt;span class="nt"&gt;-c&lt;/span&gt;
  - |
    curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://repo1.maven.org/maven2/org/apache/flink/flink-metrics-prometheus/1.18.0/flink-metrics-prometheus-1.18.0.jar &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-o&lt;/span&gt; /flink-prometheus/flink-metrics-prometheus-1.18.0.jar
  volumeMounts:
  - name: flink-prometheus-jar
    mountPath: /flink-prometheus

&lt;span class="c"&gt;# Mounts JAR in Flink lib directory&lt;/span&gt;
containers:
- name: flink-main-container
  volumeMounts:
  - name: flink-prometheus-jar
    mountPath: /opt/flink/lib/flink-metrics-prometheus-1.18.0.jar
    subPath: flink-metrics-prometheus-1.18.0.jar
  ports:
  - containerPort: 9249
    name: metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why download at runtime?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No need to rebuild Flink Docker image&lt;/li&gt;
&lt;li&gt;Easy to update JAR version&lt;/li&gt;
&lt;li&gt;Works with official Flink images&lt;/li&gt;
&lt;/ul&gt;
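&lt;p&gt;One caveat with runtime downloads: a compromised or moved artifact silently changes your deployment. A hedged variant of the same initContainer adds an integrity check — replace &lt;code&gt;REPLACE_WITH_EXPECTED_SHA256&lt;/code&gt; with the checksum Maven Central publishes for the JAR (this sketch is an addition, not part of the setup script):&lt;/p&gt;

```yaml
# Sketch: same initContainer, but fail the pod if the JAR checksum doesn't match
initContainers:
- name: download-prometheus-jar
  image: curlimages/curl:latest
  command: ["sh", "-c"]
  args:
  - |
    curl -fsSL https://repo1.maven.org/maven2/org/apache/flink/flink-metrics-prometheus/1.18.0/flink-metrics-prometheus-1.18.0.jar \
      -o /flink-prometheus/flink-metrics-prometheus-1.18.0.jar
    echo "REPLACE_WITH_EXPECTED_SHA256  /flink-prometheus/flink-metrics-prometheus-1.18.0.jar" | sha256sum -c -
  volumeMounts:
  - name: flink-prometheus-jar
    mountPath: /flink-prometheus
```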

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 4.5: Install ClickHouse Plugin&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Installs grafana-clickhouse-datasource plugin&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar &amp;lt;grafana-pod&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  grafana-cli plugins &lt;span class="nb"&gt;install &lt;/span&gt;grafana-clickhouse-datasource

&lt;span class="c"&gt;# Restarts Grafana to load plugin&lt;/span&gt;
kubectl rollout restart deployment/pulsar-grafana &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why needed?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ClickHouse uses native protocol, not Prometheus&lt;/li&gt;
&lt;li&gt;Plugin enables SQL queries from Grafana&lt;/li&gt;
&lt;li&gt;Required for ClickHouse dashboard&lt;/li&gt;
&lt;/ul&gt;
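&lt;p&gt;If you prefer declarative setup over &lt;code&gt;grafana-cli&lt;/code&gt; in a running pod, the plugin's datasource can also be provisioned. The field names below follow recent versions of &lt;code&gt;grafana-clickhouse-datasource&lt;/code&gt; and may differ in yours; the host and credentials are placeholders for your ClickHouse service:&lt;/p&gt;

```yaml
# Sketch: Grafana datasource provisioning for ClickHouse (verify fields against
# your plugin version; host/port/credentials are assumptions)
apiVersion: 1
datasources:
- name: ClickHouse
  type: grafana-clickhouse-datasource
  jsonData:
    host: clickhouse.clickhouse.svc.cluster.local
    port: 9000
    protocol: native
  secureJsonData:
    password: ""
```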

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 5-6: Verify Setup&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Tests metrics endpoint&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &amp;lt;jobmanager-pod&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  curl localhost:9249/metrics | &lt;span class="nb"&gt;grep &lt;/span&gt;flink_

&lt;span class="c"&gt;# Should show 200+ Flink metrics&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Step 7: Restart VMAgent&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Reloads scrape configuration&lt;/span&gt;
kubectl rollout restart deployment/vmagent-pulsar-victoria-metrics-k8s-stack &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Flink metrics now flowing to VictoriaMetrics! ✅&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Import Dashboards
&lt;/h3&gt;

&lt;p&gt;Now import the pre-built dashboards:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ../../monitoring/grafana-dashboards
./import-dashboards.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Port-forwards to Grafana:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar svc/pulsar-grafana 3000:80 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Imports Flink Dashboard:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-u&lt;/span&gt; admin:admin123 &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-d&lt;/span&gt; @flink-iot-pipeline-dashboard.json &lt;span class="se"&gt;\&lt;/span&gt;
     http://localhost:3000/api/dashboards/db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Imports ClickHouse Dashboard:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-u&lt;/span&gt; admin:admin123 &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-d&lt;/span&gt; @clickhouse-iot-data-dashboard.json &lt;span class="se"&gt;\&lt;/span&gt;
     http://localhost:3000/api/dashboards/db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Three dashboards available in Grafana! 📊&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Access Your Unified Dashboard
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Port-forward to Grafana&lt;/span&gt;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar svc/pulsar-grafana 3000:80

&lt;span class="c"&gt;# Open http://localhost:3000&lt;/span&gt;
&lt;span class="c"&gt;# Login: admin / admin123&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dashboard Overview:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar Overview&lt;/strong&gt; (default landing page)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink IoT Pipeline Metrics&lt;/strong&gt; (streaming processing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse IoT Data Metrics&lt;/strong&gt; (analytical queries)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  📊 Real-Time Monitoring at Scale
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Pulsar Dashboard - Message Ingestion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Message Rate:&lt;/strong&gt; Real-time ingestion rate (target: 1M msg/sec)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer Lag:&lt;/strong&gt; Backlog across all topics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage:&lt;/strong&gt; Bookie disk usage and write latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput:&lt;/strong&gt; Bytes in/out per second&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Critical Alerts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consumer lag &amp;gt; 1M messages&lt;/li&gt;
&lt;li&gt;Bookie write latency &amp;gt; 2ms (p99)&lt;/li&gt;
&lt;li&gt;Disk usage &amp;gt; 85%&lt;/li&gt;
&lt;li&gt;Broker CPU &amp;gt; 90%&lt;/li&gt;
&lt;/ul&gt;
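&lt;p&gt;The thresholds above can be encoded as a &lt;code&gt;VMRule&lt;/code&gt; the VictoriaMetrics operator picks up automatically. A minimal sketch for the consumer-lag alert — the metric name and labels are assumptions, so verify them against your broker's &lt;code&gt;/metrics&lt;/code&gt; output:&lt;/p&gt;

```yaml
# Sketch: VMRule for the consumer-lag threshold (metric name assumed)
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: pulsar-critical-alerts
  namespace: pulsar
spec:
  groups:
  - name: pulsar.critical
    rules:
    - alert: PulsarConsumerLagHigh
      expr: sum(pulsar_msg_backlog) by (topic) > 1000000
      for: 5m
      annotations:
        summary: "Backlog on {{ $labels.topic }} exceeds 1M messages"
```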

&lt;h3&gt;
  
  
  &lt;strong&gt;Flink Dashboard - Stream Processing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Records Processing:&lt;/strong&gt; Input/output rates with watermarks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checkpointing:&lt;/strong&gt; Duration and success rate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backpressure:&lt;/strong&gt; Task-level processing delays&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Usage:&lt;/strong&gt; CPU, memory, network per TaskManager&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sample Queries:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Flink processing rate
rate(flink_taskmanager_job_task_numRecordsIn[1m])

# Checkpoint duration
flink_jobmanager_job_lastCheckpointDuration

# Backpressure time
flink_taskmanager_job_task_backPressureTimeMsPerSecond
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;ClickHouse Dashboard - Analytical Performance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query Performance:&lt;/strong&gt; Latency percentiles (p50, p95, p99)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insert Rate:&lt;/strong&gt; Rows inserted per second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage:&lt;/strong&gt; Table sizes and compression ratios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Usage:&lt;/strong&gt; Query memory consumption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Custom SQL Panels:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Real-time insert rate&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;toStartOfMinute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;minute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;inserts_per_minute&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sensors_local&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;HOUR&lt;/span&gt; 
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;minute&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;minute&lt;/span&gt;

&lt;span class="c1"&gt;-- Top devices by message volume&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;message_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sensors_local&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;message_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; 
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎛️ Advanced Dashboard Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Cross-Component Correlation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Create panels that show end-to-end pipeline health:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# End-to-end latency calculation
(flink_taskmanager_job_task_currentProcessingTime - flink_taskmanager_job_task_currentInputWatermark) / 1000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. &lt;strong&gt;Capacity Planning Views&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Track resource utilization trends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Storage growth rate (bytes per hour)
rate(pulsar_storage_size[1h]) * 3600

# Memory utilization trend
avg_over_time(flink_taskmanager_Status_JVM_Memory_Heap_Used[24h])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. &lt;strong&gt;SLA Monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Define and track Service Level Objectives:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pulsar SLOs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Message delivery: 99.9% success rate&lt;/li&gt;
&lt;li&gt;End-to-end latency: &amp;lt;100ms (p95)&lt;/li&gt;
&lt;li&gt;Availability: 99.95% uptime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Flink SLOs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processing latency: &amp;lt;5 seconds (p99)&lt;/li&gt;
&lt;li&gt;Checkpoint success: &amp;gt;99%&lt;/li&gt;
&lt;li&gt;Job availability: 99.9% uptime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ClickHouse SLOs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query latency: &amp;lt;200ms (p95)&lt;/li&gt;
&lt;li&gt;Insert success: 99.99%&lt;/li&gt;
&lt;li&gt;Data freshness: &amp;lt;60 seconds&lt;/li&gt;
&lt;/ul&gt;
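&lt;p&gt;The Flink checkpoint SLO above is easy to track as a ratio of the reporter's checkpoint counters. The metric names follow the &lt;code&gt;flink_jobmanager_job_*&lt;/code&gt; convention the Prometheus reporter uses — verify them against your own &lt;code&gt;/metrics&lt;/code&gt; endpoint:&lt;br&gt;
&lt;/p&gt;

```plaintext
# Checkpoint success ratio over 1h (SLO: stays above 0.99)
sum(increase(flink_jobmanager_job_numberOfCompletedCheckpoints[1h]))
/
(sum(increase(flink_jobmanager_job_numberOfCompletedCheckpoints[1h]))
 + sum(increase(flink_jobmanager_job_numberOfFailedCheckpoints[1h])))
```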

&lt;h2&gt;
  
  
  🎯 Production Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Enable Persistent Storage&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;persistence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10Gi&lt;/span&gt;
    &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gp3"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dashboard configurations persist across pod restarts&lt;/li&gt;
&lt;li&gt;Datasources don't need re-configuration&lt;/li&gt;
&lt;li&gt;Custom dashboards aren't lost&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Organize Dashboards by Tags&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When importing, add tags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dashboard"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"pulsar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"messaging"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"production"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy filtering&lt;/li&gt;
&lt;li&gt;Logical grouping&lt;/li&gt;
&lt;li&gt;Better navigation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Set Up Alerts&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: Alert on high backpressure&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FlinkHighBackpressure&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flink_taskmanager_job_task_backPressureTimeMsPerSecond &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;500&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Flink&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.task_name&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;has&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;backpressure"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common alerts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pulsar consumer lag &amp;gt; 1M messages&lt;/li&gt;
&lt;li&gt;Flink checkpoint failures&lt;/li&gt;
&lt;li&gt;ClickHouse query latency &amp;gt; 1s&lt;/li&gt;
&lt;li&gt;Broker CPU &amp;gt; 90%&lt;/li&gt;
&lt;li&gt;Disk usage &amp;gt; 85%&lt;/li&gt;
&lt;/ul&gt;
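
&lt;p&gt;Two of the alerts above, sketched in the same rule format (hedged: &lt;code&gt;pulsar_msg_backlog&lt;/code&gt; and the node-exporter filesystem metrics are assumptions about your exporter setup; verify the names against your actual scrape targets):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Consumer lag &amp;gt; 1M messages
- alert: PulsarConsumerLagHigh
  expr: sum(pulsar_msg_backlog) by (topic) &amp;gt; 1000000
  for: 5m
  annotations:
    summary: "Topic {{ $labels.topic }} backlog exceeds 1M messages"

# Disk usage &amp;gt; 85%
- alert: DiskUsageHigh
  expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) &amp;gt; 0.85
  for: 10m
  annotations:
    summary: "Disk on {{ $labels.instance }} is over 85% full"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;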

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Create Custom Views&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Row for each component&lt;/span&gt;
&lt;span class="na"&gt;Row 1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pulsar (Message ingestion)&lt;/span&gt;
&lt;span class="na"&gt;Row 2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Flink (Stream processing)&lt;/span&gt;
&lt;span class="na"&gt;Row 3&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClickHouse (Data storage)&lt;/span&gt;
&lt;span class="na"&gt;Row 4&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Infrastructure (CPU, memory, disk)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example panel:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"End-to-End Latency"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"targets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"expr"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"flink_taskmanager_job_latency_source_id_operator_id{quantile=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;0.99&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"legendFormat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Flink p99 latency"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. &lt;strong&gt;Export and Version Control&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Export all dashboards&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;dashboard &lt;span class="k"&gt;in &lt;/span&gt;flink clickhouse pulsar&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;curl &lt;span class="nt"&gt;-u&lt;/span&gt; admin:admin123 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"http://localhost:3000/api/dashboards/uid/&lt;/span&gt;&lt;span class="nv"&gt;$dashboard&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
    jq &lt;span class="s1"&gt;'.dashboard'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;dashboard&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="nt"&gt;-dashboard&lt;/span&gt;.json

&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Commit to git&lt;/span&gt;
git add grafana-dashboards/&lt;span class="k"&gt;*&lt;/span&gt;.json
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Update Grafana dashboards"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎓 The Complete Monitoring Workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Initial Setup (One Time)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Deploy Pulsar with VictoriaMetrics stack&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;pulsar-load
./deploy.sh
&lt;span class="c"&gt;# ✓ Grafana and VictoriaMetrics installed automatically&lt;/span&gt;

&lt;span class="c"&gt;# 2. Setup Flink metrics integration&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ../flink-load
./setup-flink-metrics.sh
&lt;span class="c"&gt;# ✓ Flink pods now expose metrics&lt;/span&gt;
&lt;span class="c"&gt;# ✓ VMAgent configured to scrape Flink&lt;/span&gt;

&lt;span class="c"&gt;# 3. Import custom dashboards&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ../../monitoring/grafana-dashboards
./import-dashboards.sh
&lt;span class="c"&gt;# ✓ Flink and ClickHouse dashboards imported&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Daily Operations
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Access Grafana&lt;/span&gt;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar svc/pulsar-grafana 3000:80

&lt;span class="c"&gt;# Open http://localhost:3000&lt;/span&gt;

&lt;span class="c"&gt;# View dashboards:&lt;/span&gt;
&lt;span class="c"&gt;# 1. Pulsar Overview (default)&lt;/span&gt;
&lt;span class="c"&gt;# 2. Dashboards → Flink IoT Pipeline Metrics&lt;/span&gt;
&lt;span class="c"&gt;# 3. Dashboards → ClickHouse IoT Data Metrics&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Monitoring at Scale (1M msg/sec)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What to watch:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pulsar:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Message rate: Should be ~1M msg/sec consistently&lt;/li&gt;
&lt;li&gt;Consumer lag: Should be near 0&lt;/li&gt;
&lt;li&gt;Bookie write latency: &amp;lt;2ms p99&lt;/li&gt;
&lt;li&gt;Storage growth: ~300 MB/sec&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Flink:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Records in: ~1M msg/sec&lt;/li&gt;
&lt;li&gt;Records out: ~17K records/min (after aggregation)&lt;/li&gt;
&lt;li&gt;Checkpoint duration: &amp;lt;10 seconds&lt;/li&gt;
&lt;li&gt;Backpressure: LOW&lt;/li&gt;
&lt;li&gt;CPU: 75-85% utilization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ClickHouse:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Insert rate: ~289 inserts/sec (17,333/60)&lt;/li&gt;
&lt;li&gt;Query latency: &amp;lt;200ms for aggregations&lt;/li&gt;
&lt;li&gt;Table size: Growing at ~50 MB/sec&lt;/li&gt;
&lt;li&gt;Compression ratio: 10-15x&lt;/li&gt;
&lt;/ul&gt;
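
&lt;p&gt;The ClickHouse insert rate is just the Flink output rate converted to seconds; a quick sanity check of that arithmetic (the ~17,333 records/min figure comes from this pipeline's aggregation output above):&lt;/p&gt;

```shell
# Convert the aggregated Flink output rate to a per-second insert rate
records_per_min=17333
inserts_per_sec=$(( records_per_min / 60 ))
echo "${inserts_per_sec} inserts/sec"   # prints "288 inserts/sec" (rounded to ~289 in the text)
```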

&lt;h2&gt;
  
  
  📊 Performance Impact of Monitoring
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Resource overhead:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;CPU&lt;/th&gt;
&lt;th&gt;Memory&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VMAgent&lt;/td&gt;
&lt;td&gt;0.1 vCPU&lt;/td&gt;
&lt;td&gt;200 MB&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VMSingle&lt;/td&gt;
&lt;td&gt;0.5 vCPU&lt;/td&gt;
&lt;td&gt;2 GB&lt;/td&gt;
&lt;td&gt;50 GB/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grafana&lt;/td&gt;
&lt;td&gt;0.1 vCPU&lt;/td&gt;
&lt;td&gt;400 MB&lt;/td&gt;
&lt;td&gt;10 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.7 vCPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.6 GB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;60 GB/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Cost impact:&lt;/strong&gt; ~$50/month (minimal!)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time visibility&lt;/li&gt;
&lt;li&gt;Faster troubleshooting&lt;/li&gt;
&lt;li&gt;Capacity planning&lt;/li&gt;
&lt;li&gt;Performance optimization&lt;/li&gt;
&lt;li&gt;SLA monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🎯 Advanced: Custom Metrics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Add Custom Flink Metrics
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In your Flink job&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CustomMetricsMapper&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;RichMapFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;transient&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt; &lt;span class="n"&gt;eventCounter&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;transient&lt;/span&gt; &lt;span class="nc"&gt;Meter&lt;/span&gt; &lt;span class="n"&gt;eventRate&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Configuration&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;eventCounter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getRuntimeContext&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getMetricGroup&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;counter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"custom_events_processed"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="n"&gt;eventRate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getRuntimeContext&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getMetricGroup&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;meter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"custom_events_per_second"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MeterView&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Event&lt;/span&gt; &lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Event&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;eventCounter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;inc&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;eventRate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;markEvent&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;View in Grafana:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(flink_taskmanager_job_task_operator_custom_events_processed[1m])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Add Custom ClickHouse Queries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"targets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"datasource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ClickHouse"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rawSql"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SELECT toStartOfMinute(time) as t, COUNT(*) as count FROM benchmark.sensors_local WHERE time &amp;gt;= now() - INTERVAL 1 HOUR GROUP BY t ORDER BY t"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"format"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"time_series"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎉 Conclusion
&lt;/h2&gt;

&lt;p&gt;You now have a &lt;strong&gt;unified monitoring stack&lt;/strong&gt; that provides:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Single Grafana instance&lt;/strong&gt; monitoring the entire pipeline&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;VictoriaMetrics&lt;/strong&gt; for efficient metric storage&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;3 pre-built dashboards&lt;/strong&gt; (Pulsar, Flink, ClickHouse)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Automated setup&lt;/strong&gt; with shell scripts&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Cross-namespace monitoring&lt;/strong&gt; (Pulsar → Flink)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Real-time visibility&lt;/strong&gt; at 1M msg/sec scale&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Production-ready&lt;/strong&gt; alerting and SLO tracking  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The beauty of this setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deploy once&lt;/strong&gt;: Monitoring comes with Pulsar&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add Flink&lt;/strong&gt;: One script (&lt;code&gt;setup-flink-metrics.sh&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Import dashboards&lt;/strong&gt;: One script (&lt;code&gt;import-dashboards.sh&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access everything&lt;/strong&gt;: Single Grafana instance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No separate Prometheus installations, no complex federation, no metric duplication. Just a clean, unified monitoring solution! 🚀&lt;/p&gt;

&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/monitoring" rel="noopener noreferrer"&gt;RealtimeDataPlatform/monitoring&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VictoriaMetrics Docs:&lt;/strong&gt; &lt;a href="https://docs.victoriametrics.com/" rel="noopener noreferrer"&gt;docs.victoriametrics.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana Docs:&lt;/strong&gt; &lt;a href="https://grafana.com/docs/" rel="noopener noreferrer"&gt;grafana.com/docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink Metrics:&lt;/strong&gt; &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/metrics/" rel="noopener noreferrer"&gt;nightlies.apache.org/flink/flink-docs-stable/docs/ops/metrics/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;How do you monitor your streaming pipelines? Share your setup in the comments!&lt;/strong&gt; 👇&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow me for more posts on observability, real-time systems, and production DevOps!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #monitoring #grafana #victoriametrics #flink #pulsar #clickhouse #kubernetes #observability&lt;/p&gt;




&lt;h2&gt;
  
  
  🌟 Quick Reference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Port-Forward Commands
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Grafana (all dashboards)&lt;/span&gt;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar svc/pulsar-grafana 3000:80

&lt;span class="c"&gt;# VictoriaMetrics (raw metrics)&lt;/span&gt;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar svc/vmsingle-pulsar-victoria-metrics-k8s-stack 8429:8429

&lt;span class="c"&gt;# Flink UI (job details)&lt;/span&gt;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark svc/flink-jobmanager-rest 8081:8081

&lt;span class="c"&gt;# Prometheus-compatible API&lt;/span&gt;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar svc/vmsingle-pulsar-victoria-metrics-k8s-stack 9090:8429
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Dashboard URLs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pulsar:     http://localhost:3000 (default landing page)
Flink:      http://localhost:3000/d/flink-iot-pipeline
ClickHouse: http://localhost:3000/d/clickhouse-iot-metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Useful Queries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Total message rate across all Pulsar topics
sum(rate(pulsar_in_messages_total[1m]))

# Flink processing lag
flink_taskmanager_job_task_currentInputWatermark - flink_taskmanager_job_task_currentOutputWatermark

# ClickHouse disk usage
clickhouse_metric_DiskDataBytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>tutorial</category>
      <category>dataengineering</category>
      <category>monitoring</category>
      <category>devops</category>
    </item>
    <item>
      <title>Real-Time Streaming Challenges: What I Learned Building at Scale</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 13:32:01 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/real-time-streaming-challenges-what-i-learned-building-at-scale-olg</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/real-time-streaming-challenges-what-i-learned-building-at-scale-olg</guid>
      <description>&lt;p&gt;Building a real-time streaming platform that processes &lt;strong&gt;1 million events per second&lt;/strong&gt; taught me lessons that no tutorial or documentation could. After months of optimization, debugging, and scaling our self-hosted platform on AWS, here are the hard-won insights that saved us &lt;strong&gt;90% in costs&lt;/strong&gt; and countless hours of troubleshooting.&lt;/p&gt;

&lt;p&gt;You can find our complete implementation in the &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events" rel="noopener noreferrer"&gt;RealtimeDataPlatform repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 The Reality Check: Scale Changes Everything
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What I thought:&lt;/strong&gt; "If it works at 10K events/sec, it'll work at 1M events/sec."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; Scale isn't linear. At 1M events/sec, everything breaks differently.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Surprising Bottlenecks
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Expected Bottlenecks    →    Actual Bottlenecks
CPU/Memory             →    Network I/O
Application Logic      →    Storage I/O patterns  
Compute Resources      →    Configuration limits
Code Efficiency        →    Infrastructure design
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Our Flink job processed 890K events/sec from Pulsar backlog but only 600K live. The bottleneck? &lt;strong&gt;Pulsar's storage configuration&lt;/strong&gt;, not Flink's processing power.&lt;/p&gt;

&lt;h2&gt;
  
  
  💾 Storage: The Hidden Performance Killer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Challenge #1: The NVMe Device Separation Discovery
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Initial setup:&lt;/strong&gt; Single NVMe device for both journal and ledger storage in BookKeeper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Write latency spikes to 15ms, throughput capped at 400K events/sec.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Separate NVMe devices for different I/O patterns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Game-changing configuration&lt;/span&gt;
/dev/nvme1n1 → Journal &lt;span class="o"&gt;(&lt;/span&gt;WAL&lt;span class="o"&gt;)&lt;/span&gt; - Sequential writes
/dev/nvme2n1 → Ledgers &lt;span class="o"&gt;(&lt;/span&gt;Data&lt;span class="o"&gt;)&lt;/span&gt; - Random reads/writes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Latency dropped to 2.1ms, throughput jumped to 1M+ events/sec.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; &lt;strong&gt;I/O pattern separation matters more than raw storage speed.&lt;/strong&gt;&lt;/p&gt;
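
&lt;p&gt;In BookKeeper configuration, that split maps to two keys (a sketch; the mount paths are assumptions about your node layout):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# bookkeeper.conf (or bookkeeper.configData in the Pulsar Helm chart)
journalDirectories=/mnt/nvme1/bookkeeper/journal   # sequential WAL writes
ledgerDirectories=/mnt/nvme2/bookkeeper/ledgers    # random reads/writes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;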

&lt;h3&gt;
  
  
  Challenge #2: The journalSyncData Trade-off
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The dilemma:&lt;/strong&gt; Enable &lt;code&gt;journalSyncData&lt;/code&gt; for safety vs. disable for performance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The 10x performance decision&lt;/span&gt;
&lt;span class="na"&gt;journalSyncData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false"&lt;/span&gt;  &lt;span class="c1"&gt;# Risk: data loss on power failure&lt;/span&gt;
                         &lt;span class="c1"&gt;# Gain: 10x latency improvement&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; &lt;strong&gt;Know your data's value.&lt;/strong&gt; For IoT telemetry, we chose speed over perfect durability. For financial data, we wouldn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔧 Parallelism: More Isn't Always Better
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Challenge #3: The Slot-to-CPU Ratio Mystery
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Initial thinking:&lt;/strong&gt; "More parallelism = better performance"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reality check:&lt;/strong&gt; Our pipeline had different resource needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Source&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="no"&gt;O&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;keyBy&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;Window&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;CPU&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;Aggregate&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;CPU&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;Sink&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="no"&gt;O&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
   &lt;span class="err"&gt;🟢&lt;/span&gt;                    &lt;span class="err"&gt;🔴&lt;/span&gt;             &lt;span class="err"&gt;🔴&lt;/span&gt;               &lt;span class="err"&gt;🟢&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Wrong approach:&lt;/strong&gt; 2:1 slot-to-CPU ratio (32 slots on 16 vCPUs)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Result: CPU starvation during aggregation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Right approach:&lt;/strong&gt; 1:1 slot-to-CPU ratio (16 slots on 16 vCPUs)  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Result: Dedicated CPU per CPU-intensive task&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; &lt;strong&gt;Match your slot configuration to your workload's compute pattern, not theoretical maximums.&lt;/strong&gt;&lt;/p&gt;
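
&lt;p&gt;In Flink configuration, the 1:1 ratio comes down to a couple of lines (illustrative; assumes 16-vCPU TaskManager nodes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# flink-conf.yaml: one slot per vCPU on a 16-vCPU TaskManager
taskmanager.numberOfTaskSlots: 16
parallelism.default: 16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;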

&lt;h3&gt;
  
  
  Challenge #4: The Parallelism-Partition Matching Rule
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Discovery:&lt;/strong&gt; Mismatched parallelism and partitions kills performance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;❌ Wrong&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Pulsar partitions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
  &lt;span class="na"&gt;Flink parallelism&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;
  &lt;span class="na"&gt;Result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;56 idle Flink tasks&lt;/span&gt;

&lt;span class="na"&gt;✅ Right&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Pulsar partitions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;64&lt;/span&gt;  
  &lt;span class="na"&gt;Flink parallelism&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;
  &lt;span class="na"&gt;Result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Perfect work distribution&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; &lt;strong&gt;Parallelism should always match your source partitions&lt;/strong&gt; for optimal resource utilization.&lt;/p&gt;

&lt;h2&gt;
  
  
  🕵️ Debugging: The Backlog Test Technique
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Challenge #5: Finding the Real Bottleneck
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The mystery:&lt;/strong&gt; System processing 600K events/sec but target was 1M.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The experiment:&lt;/strong&gt; Stop all producers, let Flink catch up from backlog.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The revealing test&lt;/span&gt;
kubectl scale deployment iot-producer &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0

&lt;span class="c"&gt;# Result: Flink consumed 890K events/sec from backlog!&lt;/span&gt;
&lt;span class="c"&gt;# Conclusion: Pulsar was the bottleneck, not Flink&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; &lt;strong&gt;The backlog test reveals your true system capacity&lt;/strong&gt; and identifies which component is actually limiting throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  💰 Cost Optimization: The Managed Services Reality
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Challenge #6: The 10x Cost Shock
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Our self-hosted platform:&lt;/strong&gt; $24,592/month&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Equivalent AWS managed services:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MSK: $30,525/month&lt;/li&gt;
&lt;li&gt;Kinesis Data Analytics: $81,180/month&lt;/li&gt;
&lt;li&gt;Redshift Serverless: $131,328/month&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: $243,033/month&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The math:&lt;/strong&gt; $243,033 / $24,592 ≈ 9.9x, so managed services cost nearly ten times more at this scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; &lt;strong&gt;At high throughput, managed services pricing becomes prohibitive.&lt;/strong&gt; The break-even point strongly favors self-hosting for sustained, high-volume workloads.&lt;/p&gt;
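
&lt;p&gt;A quick check of that ratio, using the figures above:&lt;/p&gt;

```shell
# Managed-services bill as a percentage of the self-hosted bill
self_hosted=24592    # $/month, self-hosted platform
managed=243033       # $/month, equivalent managed services
echo "$(( managed * 100 / self_hosted ))% of self-hosted cost"   # prints "988% of self-hosted cost"
```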

&lt;h2&gt;
  
  
  🏗️ Architecture: Instance Selection Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Challenge #7: Right-Sizing for Performance vs. Cost
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Wrong approach:&lt;/strong&gt; Many small instances&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;16× c5.large instances&lt;/li&gt;
&lt;li&gt;Complex networking, management overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Right approach:&lt;/strong&gt; Fewer, larger instances  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4× c5.4xlarge instances&lt;/li&gt;
&lt;li&gt;Better price/performance, simpler operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; &lt;strong&gt;Bigger instances often provide better performance-per-dollar&lt;/strong&gt; and reduce operational complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenge #8: The i7i.8xlarge Sweet Spot
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why i7i.8xlarge became our standard:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;32 vCPUs, 256GB RAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2× 3.75TB NVMe devices&lt;/strong&gt; (perfect for separation)&lt;/li&gt;
&lt;li&gt;Latest generation CPU performance&lt;/li&gt;
&lt;li&gt;~$2,160/month, with better price/performance than comparable storage-optimized alternatives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; &lt;strong&gt;The latest generation instances often provide the best performance-per-dollar&lt;/strong&gt; despite higher unit costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔍 Monitoring: What Actually Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Metrics That Saved Us
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Instead of generic CPU/memory metrics, focus on:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Backpressure - Your canary in the coal mine
flink_taskmanager_job_task_backPressureTimeMsPerSecond

# True throughput - Not just input rate
rate(flink_taskmanager_job_task_numRecordsOutPerSecond[1m])

# Storage performance - Often the real bottleneck
rate(bookie_journal_JOURNAL_ADD_ENTRY_count[1m])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; &lt;strong&gt;Domain-specific metrics matter more than generic infrastructure metrics&lt;/strong&gt; for identifying real problems.&lt;/p&gt;
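&lt;p&gt;As a quick illustration, the backpressure metric above can be mapped to severity levels. This is a sketch: Flink reports &lt;code&gt;backPressureTimeMsPerSecond&lt;/code&gt; on a 0–1000 scale, and the 10%/50% cut-offs below are the commonly cited Flink web-UI defaults; verify them against your Flink version before wiring them into alerts.&lt;/p&gt;

```python
# Sketch: map Flink's backPressureTimeMsPerSecond (0-1000) to the OK/LOW/HIGH
# levels shown in the Flink web UI. The 10% and 50% thresholds are the
# commonly cited UI defaults -- an assumption to validate, not gospel.

def backpressure_level(ms_per_second: float) -> str:
    """Classify a subtask's backpressure for alerting."""
    ratio = ms_per_second / 1000.0  # fraction of each second spent backpressured
    if ratio <= 0.10:
        return "OK"
    if ratio <= 0.50:
        return "LOW"
    return "HIGH"

if __name__ == "__main__":
    for sample in (50, 300, 900):
        print(sample, backpressure_level(sample))
```

In practice we alerted on sustained LOW and paged on HIGH; a brief spike during checkpointing is normal.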

&lt;h2&gt;
  
  
  🎓 The Five Universal Truths of Streaming at Scale
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Storage I/O Patterns Trump Raw Performance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Separate your sequential writes from random reads. Always.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Configuration Limits Hit Before Resource Limits&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You'll hit default timeouts, queue sizes, and connection limits before CPU/memory limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;The Bottleneck Moves&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Optimize one component, and the bottleneck shifts to the next weakest link.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Test with Realistic Data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Synthetic loads behave differently than real-world data patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. &lt;strong&gt;Cost Scales Non-Linearly&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;At high throughput, managed services become exponentially more expensive than self-hosting.&lt;/p&gt;
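&lt;p&gt;To make this concrete, here is the per-event arithmetic behind that claim, using this article's monthly totals for a sustained 1M events/sec workload (a sketch, assuming a 30-day month):&lt;/p&gt;

```python
# Sketch: per-event economics of self-hosted vs. managed, from this
# article's monthly totals at a sustained 1M events/sec.

SECONDS_PER_MONTH = 30 * 24 * 3600
events_per_month = 1_000_000 * SECONDS_PER_MONTH  # ~2.59 trillion events

def cost_per_million_events(monthly_cost: float) -> float:
    return monthly_cost / (events_per_month / 1_000_000)

self_hosted = cost_per_million_events(24_592)
managed = cost_per_million_events(243_033)
print(round(self_hosted, 4), round(managed, 4), round(managed / self_hosted, 1))
```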

&lt;h2&gt;
  
  
  💡 What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with these decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Separate storage devices from day one&lt;/strong&gt;&lt;br&gt;
✅ &lt;strong&gt;Match parallelism to partitions immediately&lt;/strong&gt;&lt;br&gt;
✅ &lt;strong&gt;Use 1:1 slot-to-CPU for CPU-bound workloads&lt;/strong&gt;&lt;br&gt;
✅ &lt;strong&gt;Implement backlog testing early&lt;/strong&gt;&lt;br&gt;
✅ &lt;strong&gt;Choose latest-generation instances&lt;/strong&gt;&lt;br&gt;
✅ &lt;strong&gt;Plan for self-hosting at scale&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Building a real-time streaming platform at scale taught me that &lt;strong&gt;the fundamentals matter more than the fancy features&lt;/strong&gt;. Storage I/O patterns, proper parallelism matching, and understanding your actual bottlenecks will get you further than any advanced configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1,040,000 events/sec&lt;/strong&gt; sustained throughput&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$24,592/month&lt;/strong&gt; infrastructure cost (90% savings vs managed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;2ms p99 latency&lt;/strong&gt; end-to-end&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;99.95% uptime&lt;/strong&gt; in production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The biggest lesson? &lt;strong&gt;Scale reveals the truth about your architecture.&lt;/strong&gt; What works at small scale often breaks in unexpected ways at large scale. Plan for it, test for it, and measure everything that matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complete Implementation:&lt;/strong&gt; &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events" rel="noopener noreferrer"&gt;RealtimeDataPlatform&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar Performance Guide:&lt;/strong&gt; &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events/pulsar-load" rel="noopener noreferrer"&gt;Our Pulsar optimization journey&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink Tuning Details:&lt;/strong&gt; &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events/flink-load" rel="noopener noreferrer"&gt;Our Flink scaling experience&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;What's your biggest streaming challenge? Have you hit similar bottlenecks at scale? Share your war stories in the comments!&lt;/strong&gt; 👇&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow me for more real-world lessons from building distributed systems at scale.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #streaming #realtime #scale #performance #aws #pulsar #flink #architecture #lessons&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>performance</category>
      <category>architecture</category>
      <category>aws</category>
    </item>
    <item>
      <title>Real-Time Data Streaming Platform: How We Built a Self-Hosted Platform with 90% Cost Reduction vs AWS Managed Services</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 13:27:10 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/real-time-data-streaming-platform-how-we-built-a-self-hosted-platform-with-90-cost-reduction-vs-1aif</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/real-time-data-streaming-platform-how-we-built-a-self-hosted-platform-with-90-cost-reduction-vs-1aif</guid>
      <description>&lt;p&gt;When tasked with building a real-time data streaming platform capable of processing &lt;strong&gt;1 million events per second&lt;/strong&gt;, we faced a critical decision: build a self-hosted solution using open-source technologies, or leverage AWS managed services for convenience.&lt;/p&gt;

&lt;p&gt;This article details how we built our self-hosted real-time data streaming platform and achieved a &lt;strong&gt;90% cost reduction&lt;/strong&gt; compared to equivalent AWS managed services, while maintaining enterprise-grade performance and reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; A production-ready platform processing 1M events/sec for &lt;strong&gt;$24,592/month&lt;/strong&gt; instead of &lt;strong&gt;$243,033/month&lt;/strong&gt; with AWS managed services.&lt;/p&gt;

&lt;p&gt;You can find the complete implementation in our &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events" rel="noopener noreferrer"&gt;RealtimeDataPlatform repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's how we did it and the lessons learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  🏗️ Architecture Comparison
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Self-Hosted Stack Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────────┐
│                    Self-Hosted Stack on AWS EC2                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌───────────┐ │
│  │   PRODUCER  │───▶│   PULSAR    │───▶│    FLINK    │───▶│CLICKHOUSE │ │
│  │             │    │             │    │             │    │           │ │
│  │ IoT Sensors │    │ Open Source │    │ Open Source │    │Open Source│ │
│  │ AVRO Data   │    │ Message     │    │ Stream      │    │ Analytics │ │
│  │             │    │ Broker      │    │ Processing  │    │ Database  │ │
│  │ 4x c5.4xl   │    │ 6x i7i.8xl  │    │ 4x c5.4xl   │    │6x r6id.4xl│ │
│  │ Full Control│    │ Self-Managed│    │ Self-Managed│    │Self-Hosted│ │
│  └─────────────┘    └─────────────┘    └─────────────┘    └───────────┘ │
│                                                                         │
│  Monthly Cost: ~$24,592                                                │
└─────────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  AWS Managed Stack Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────────┐
│                    AWS Managed Services Stack                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌───────────┐ │
│  │   PRODUCER  │───▶│     MSK     │───▶│   KINESIS   │───▶│ REDSHIFT  │ │
│  │             │    │             │    │             │    │           │ │
│  │ IoT Sensors │    │ Managed     │    │ Data        │    │Serverless │ │
│  │ AVRO Data   │    │ Streaming   │    │ Analytics   │    │ Analytics │ │
│  │             │    │ for Kafka   │    │ for Flink   │    │ Warehouse │ │
│  │ 4x c5.4xl   │    │ AWS Managed │    │ AWS Managed │    │AWS Managed│ │
│  │ Hands-off   │    │ Serverless  │    │ Serverless  │    │Serverless │ │
│  └─────────────┘    └─────────────┘    └─────────────┘    └───────────┘ │
│                                                                         │
│  Monthly Cost: ~$243,033                                               │
└─────────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  💰 The Self-Hosted Stack: Control and Cost-Effectiveness
&lt;/h2&gt;

&lt;p&gt;In this scenario, we deploy our entire stack on Amazon EC2 instances. This gives us maximum control over the configuration and tuning of each component.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Technology Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Message Broker:&lt;/strong&gt; Apache Pulsar&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream Processing:&lt;/strong&gt; Apache Flink
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytics Database:&lt;/strong&gt; ClickHouse&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Infrastructure Details
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Apache Pulsar Cluster:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6× i7i.8xlarge instances (32 vCPU, 256GB RAM, 2×3.75TB NVMe)&lt;/li&gt;
&lt;li&gt;Co-located brokers and bookies&lt;/li&gt;
&lt;li&gt;NVMe device separation for journal and ledger storage&lt;/li&gt;
&lt;li&gt;Cost: ~$12,960/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Apache Flink Cluster:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4× c5.4xlarge instances (16 vCPU, 32GB RAM)&lt;/li&gt;
&lt;li&gt;64-way parallelism matching Pulsar partitions&lt;/li&gt;
&lt;li&gt;1:1 slot-to-CPU ratio for optimal performance&lt;/li&gt;
&lt;li&gt;Cost: ~$2,400/month&lt;/li&gt;
&lt;/ul&gt;
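&lt;p&gt;The slot arithmetic behind "64-way parallelism matching Pulsar partitions" can be sketched as follows (the instance counts and vCPU figures are from the list above):&lt;/p&gt;

```python
# Sketch: why 4x c5.4xlarge with a 1:1 slot-to-vCPU ratio yields exactly
# one Flink subtask per Pulsar partition.

TASK_MANAGERS = 4
VCPUS_PER_TM = 16       # c5.4xlarge
SLOTS_PER_VCPU = 1      # 1:1 for CPU-bound workloads
PULSAR_PARTITIONS = 64

total_slots = TASK_MANAGERS * VCPUS_PER_TM * SLOTS_PER_VCPU
print(total_slots)  # 64: every partition gets a dedicated consumer subtask
```

If slots and partitions diverge, some subtasks consume multiple partitions (or sit idle), which skews load across the cluster.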

&lt;p&gt;&lt;strong&gt;ClickHouse Cluster:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6× r6id.4xlarge instances (16 vCPU, 128GB RAM, 950GB NVMe)&lt;/li&gt;
&lt;li&gt;Distributed analytics with real-time ingestion&lt;/li&gt;
&lt;li&gt;Optimized for sub-second query performance&lt;/li&gt;
&lt;li&gt;Cost: ~$7,200/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Producer Infrastructure:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4× c5.4xlarge instances for load generation&lt;/li&gt;
&lt;li&gt;AVRO serialization for efficient data transfer&lt;/li&gt;
&lt;li&gt;Cost: ~$1,920/month&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Cost Breakdown
&lt;/h3&gt;

&lt;p&gt;We used infrastructure cost-analysis tooling (Infracost, run against the Terraform configuration) to price this setup:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Instance Type&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pulsar (Broker+Bookie)&lt;/td&gt;
&lt;td&gt;i7i.8xlarge&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;$12,960&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ClickHouse&lt;/td&gt;
&lt;td&gt;r6id.4xlarge&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;$7,200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flink&lt;/td&gt;
&lt;td&gt;c5.4xlarge&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;$2,400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Producers&lt;/td&gt;
&lt;td&gt;c5.4xlarge&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;$1,920&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supporting Infrastructure&lt;/td&gt;
&lt;td&gt;Various&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;$112&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$24,592&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Total Estimated Monthly Cost: $24,592&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The primary cost drivers are the large EC2 instances required to handle the 1 million events/sec workload, particularly for the Pulsar brokers and ClickHouse nodes.&lt;/p&gt;
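&lt;p&gt;The total can be reproduced from the per-component figures in the table (a quick sanity check, not new data):&lt;/p&gt;

```python
# Sketch: reproducing the self-hosted monthly total from the cost table above.

components = {
    "Pulsar (6x i7i.8xlarge)":      12_960,
    "ClickHouse (6x r6id.4xlarge)":  7_200,
    "Flink (4x c5.4xlarge)":         2_400,
    "Producers (4x c5.4xlarge)":     1_920,
    "Supporting infrastructure":       112,
}
total = sum(components.values())
print(f"${total:,}/month")  # $24,592/month
```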

&lt;h2&gt;
  
  
  ☁️ The AWS Managed Stack: Convenience at a Premium
&lt;/h2&gt;

&lt;p&gt;In this approach, we replace our self-hosted components with their AWS-native counterparts. This offloads the operational burden of managing the infrastructure to AWS.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Technology Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Managed Kafka:&lt;/strong&gt; Amazon MSK (Managed Streaming for Apache Kafka)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed Flink:&lt;/strong&gt; Amazon Kinesis Data Analytics for Apache Flink&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse Equivalent:&lt;/strong&gt; Amazon Redshift Serverless&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Service Configuration &amp;amp; Assumptions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Amazon MSK:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1KB average event size&lt;/li&gt;
&lt;li&gt;1-day data retention&lt;/li&gt;
&lt;li&gt;High throughput configuration&lt;/li&gt;
&lt;li&gt;Multi-AZ deployment for reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Kinesis Data Analytics for Flink:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;64 Kinesis Processing Units (KPUs)&lt;/li&gt;
&lt;li&gt;Continuous processing (24/7)&lt;/li&gt;
&lt;li&gt;Auto-scaling enabled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Amazon Redshift Serverless:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1-month data retention&lt;/li&gt;
&lt;li&gt;High-performance analytics workload&lt;/li&gt;
&lt;li&gt;On-demand scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Cost Breakdown
&lt;/h3&gt;

&lt;p&gt;Estimating the cost for a managed stack at this scale requires several assumptions. Based on AWS pricing and typical configurations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Amazon MSK&lt;/td&gt;
&lt;td&gt;High throughput, 1-day retention&lt;/td&gt;
&lt;td&gt;$30,525&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kinesis Data Analytics&lt;/td&gt;
&lt;td&gt;64 KPUs, continuous processing&lt;/td&gt;
&lt;td&gt;$81,180&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Redshift Serverless&lt;/td&gt;
&lt;td&gt;1-month retention, analytics&lt;/td&gt;
&lt;td&gt;$131,328&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$243,033&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Total Estimated Monthly Cost: ~$243,033&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  📊 The Cost Comparison: A Dramatic Difference
&lt;/h2&gt;

&lt;p&gt;Let's put these numbers side-by-side:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;th&gt;Cost per Million Events&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-Hosted on EC2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$24,592&lt;/td&gt;
&lt;td&gt;$0.0094&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS Managed Services&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$243,033&lt;/td&gt;
&lt;td&gt;$0.0932&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Difference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+888% (≈9.9×)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+888% (≈9.9×)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The difference is stark: the AWS managed stack is roughly &lt;strong&gt;10 times more expensive&lt;/strong&gt; than the self-hosted approach for this high-throughput scenario.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Analysis by Component
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Messaging Layer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-hosted Pulsar: $12,960/month&lt;/li&gt;
&lt;li&gt;AWS MSK: $30,525/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Premium:&lt;/strong&gt; 235%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stream Processing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-hosted Flink: $2,400/month
&lt;/li&gt;
&lt;li&gt;Kinesis Data Analytics: $81,180/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Premium:&lt;/strong&gt; 3,382%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analytics Storage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-hosted ClickHouse: $7,200/month&lt;/li&gt;
&lt;li&gt;Redshift Serverless: $131,328/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Premium:&lt;/strong&gt; 1,824%&lt;/li&gt;
&lt;/ul&gt;
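&lt;p&gt;The premiums above are each component's managed cost expressed as a percentage of its self-hosted cost. A short sketch reproducing them from the figures already listed:&lt;/p&gt;

```python
# Sketch: per-component premiums, computed as managed cost over self-hosted
# cost (x100, truncated), matching how the article quotes them.

pairs = {
    "messaging":  (30_525, 12_960),   # MSK vs. self-hosted Pulsar
    "processing": (81_180,  2_400),   # Kinesis Data Analytics vs. Flink
    "analytics": (131_328,  7_200),   # Redshift Serverless vs. ClickHouse
}
premiums = {name: int(100 * managed / self_hosted)
            for name, (managed, self_hosted) in pairs.items()}
print(premiums)
```

Note how lopsided the spread is: stream processing carries by far the largest markup, which is why it dominates the managed bill.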

&lt;h2&gt;
  
  
  🤔 Beyond the Numbers: Understanding the Trade-offs
&lt;/h2&gt;

&lt;p&gt;So, why would anyone choose the managed stack given the massive price difference? The answer lies in the trade-offs between cost and operational overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Case for Self-Hosting
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Dramatic Cost Savings:&lt;/strong&gt; 90% lower infrastructure costs&lt;br&gt;
✅ &lt;strong&gt;Complete Control:&lt;/strong&gt; Fine-grained tuning and optimization&lt;br&gt;
✅ &lt;strong&gt;No Vendor Lock-in:&lt;/strong&gt; Portable across cloud providers&lt;br&gt;
✅ &lt;strong&gt;Technology Choice:&lt;/strong&gt; Use cutting-edge open-source features&lt;br&gt;
✅ &lt;strong&gt;Performance Optimization:&lt;/strong&gt; Custom configurations for specific workloads&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenges:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;❌ &lt;strong&gt;High Operational Overhead:&lt;/strong&gt; Full responsibility for infrastructure management&lt;br&gt;
❌ &lt;strong&gt;Expertise Required:&lt;/strong&gt; Deep knowledge of distributed systems needed&lt;br&gt;
❌ &lt;strong&gt;Time Investment:&lt;/strong&gt; Significant setup and maintenance effort&lt;br&gt;
❌ &lt;strong&gt;Scaling Complexity:&lt;/strong&gt; Manual scaling and capacity planning&lt;br&gt;
❌ &lt;strong&gt;Security Responsibility:&lt;/strong&gt; Comprehensive security management required&lt;/p&gt;

&lt;h3&gt;
  
  
  The Case for Managed Services
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Reduced Operational Overhead:&lt;/strong&gt; AWS handles infrastructure management&lt;br&gt;
✅ &lt;strong&gt;Built-in Scalability:&lt;/strong&gt; Auto-scaling and high availability&lt;br&gt;
✅ &lt;strong&gt;Faster Time to Market:&lt;/strong&gt; Rapid deployment without infrastructure setup&lt;br&gt;
✅ &lt;strong&gt;Enterprise Features:&lt;/strong&gt; Built-in monitoring, security, and compliance&lt;br&gt;
✅ &lt;strong&gt;Support:&lt;/strong&gt; Professional support from AWS&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenges:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;❌ &lt;strong&gt;Significant Cost Premium:&lt;/strong&gt; 10x higher costs for high-throughput workloads&lt;br&gt;
❌ &lt;strong&gt;Vendor Lock-in:&lt;/strong&gt; Tied to AWS ecosystem&lt;br&gt;
❌ &lt;strong&gt;Limited Control:&lt;/strong&gt; Constrained by service limitations&lt;br&gt;
❌ &lt;strong&gt;Feature Lag:&lt;/strong&gt; May not have latest open-source features&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 When to Choose Each Approach
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Choose Self-Hosted When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost is Critical:&lt;/strong&gt; Operating at scale where managed service costs become prohibitive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance is Key:&lt;/strong&gt; Need maximum performance through custom tuning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team Expertise:&lt;/strong&gt; Have experienced platform engineering team&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term Investment:&lt;/strong&gt; Building for sustained high-volume workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cloud Strategy:&lt;/strong&gt; Want to avoid vendor lock-in&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choose Managed Services When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed to Market:&lt;/strong&gt; Need to ship quickly without infrastructure complexity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small Team:&lt;/strong&gt; Limited platform engineering resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variable Workloads:&lt;/strong&gt; Unpredictable or seasonal traffic patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance Focus:&lt;/strong&gt; Need built-in enterprise compliance features&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prototype/MVP:&lt;/strong&gt; Testing concepts before committing to self-hosted infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💡 Hybrid Approaches &amp;amp; Optimization Strategies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Cost Optimization for Self-Hosted
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reserved Instances:&lt;/strong&gt; 40-60% savings with 1-3 year commitments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spot Instances:&lt;/strong&gt; Up to 70% savings for fault-tolerant components&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right-sizing:&lt;/strong&gt; Regular capacity planning and instance optimization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-scaling:&lt;/strong&gt; Implement demand-based scaling&lt;/li&gt;
&lt;/ol&gt;
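&lt;p&gt;As an illustration of how these levers compound, here is a sketch applying a 40% reserved-instance discount to the stateful tiers and a 70% spot discount to the fault-tolerant producers. The split between tiers is an assumption for illustration, not a recommendation:&lt;/p&gt;

```python
# Sketch: compounding the discount levers above on the $24,592/month
# baseline. The 40% RI and 70% spot figures are the ranges quoted in the
# list; which tiers qualify for each is an assumption.

ri_discount = 0.40    # reserved instances on stateful tiers
spot_discount = 0.70  # spot on fault-tolerant load generators

stateful = 12_960 + 7_200 + 2_400 + 112  # Pulsar, ClickHouse, Flink, misc
producers = 1_920

optimized = stateful * (1 - ri_discount) + producers * (1 - spot_discount)
print(round(optimized))  # roughly $14.2K/month
```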

&lt;h3&gt;
  
  
  Hybrid Architecture Considerations
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example hybrid approach&lt;/span&gt;
&lt;span class="na"&gt;Message Ingestion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS MSK (managed complexity)&lt;/span&gt;
&lt;span class="na"&gt;Stream Processing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Self-hosted Flink (cost optimization)&lt;/span&gt;
&lt;span class="na"&gt;Analytics Storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Self-hosted ClickHouse (performance optimization)&lt;/span&gt;
&lt;span class="na"&gt;Monitoring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS CloudWatch (convenience)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📈 ROI Analysis: Break-Even Points
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Total Cost of Ownership (TCO) Considerations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Self-Hosted Additional Costs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Platform engineering team: ~$400K-600K/year (2-3 engineers)&lt;/li&gt;
&lt;li&gt;Operations overhead: ~20-30% additional management time&lt;/li&gt;
&lt;li&gt;Training and certifications: ~$20K/year&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Managed Services Hidden Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced hiring needs&lt;/li&gt;
&lt;li&gt;Faster feature delivery&lt;/li&gt;
&lt;li&gt;Lower operational risk&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Break-Even Analysis
&lt;/h3&gt;

&lt;p&gt;For our 1M events/sec workload:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost difference:&lt;/strong&gt; $218K/month ($2.6M/year)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineering team cost:&lt;/strong&gt; ~$500K/year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Net savings with self-hosted:&lt;/strong&gt; ~$2.1M/year&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The break-even point strongly favors self-hosting for high-throughput, sustained workloads.&lt;/p&gt;
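&lt;p&gt;The arithmetic behind that figure (the ~$500K/year team cost is this article's estimate, not a universal number):&lt;/p&gt;

```python
# Sketch: the break-even math above -- annual infrastructure savings minus
# the platform engineering team needed to realize them.

managed_monthly = 243_033
self_hosted_monthly = 24_592
team_cost_yearly = 500_000  # article's estimate for 2-3 engineers

infra_savings_yearly = (managed_monthly - self_hosted_monthly) * 12
net_savings = infra_savings_yearly - team_cost_yearly
print(f"${net_savings:,}/year")  # ~$2.1M/year net in favor of self-hosting
```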

&lt;h2&gt;
  
  
  🎓 Conclusion
&lt;/h2&gt;

&lt;p&gt;The choice between self-hosting and managed services is not a one-size-fits-all decision, but the cost implications are dramatic at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;For High-Throughput Workloads:&lt;/strong&gt; Self-hosting can provide 90% cost savings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expertise Matters:&lt;/strong&gt; Success requires skilled platform engineering teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale is Key:&lt;/strong&gt; The larger your workload, the more self-hosting makes financial sense&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time Horizon:&lt;/strong&gt; Long-term, sustained workloads favor self-hosting&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Decision Framework
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Choose Self-Hosted If:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processing &amp;gt;100K events/sec sustained&lt;/li&gt;
&lt;li&gt;Have platform engineering expertise&lt;/li&gt;
&lt;li&gt;Cost optimization is critical&lt;/li&gt;
&lt;li&gt;Long-term workload (&amp;gt;2 years)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Managed Services If:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Getting started or prototyping&lt;/li&gt;
&lt;li&gt;Small engineering team&lt;/li&gt;
&lt;li&gt;Variable/unpredictable workloads&lt;/li&gt;
&lt;li&gt;Time to market is critical&lt;/li&gt;
&lt;/ul&gt;
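&lt;p&gt;The framework above can be condensed into a crude heuristic. The &lt;code&gt;recommend&lt;/code&gt; function and its thresholds (100K events/sec sustained, 2-year horizon) simply encode the checklist; they are illustrative, not a real decision API:&lt;/p&gt;

```python
# Sketch: the decision framework above as a heuristic. Thresholds encode the
# checklist (>100K events/sec sustained, >2-year workload); real decisions
# should weigh compliance, hiring, and TCO as discussed in the article.

def recommend(events_per_sec: int, has_platform_team: bool,
              workload_years: float, time_to_market_critical: bool) -> str:
    if time_to_market_critical or not has_platform_team:
        return "managed"
    if events_per_sec > 100_000 and workload_years > 2:
        return "self-hosted"
    return "managed"

print(recommend(1_000_000, True, 3, False))  # self-hosted
print(recommend(50_000, False, 1, True))     # managed
```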

&lt;p&gt;For a high-throughput workload of 1 million events per second, the cost of managed services can be substantial. It's crucial to weigh the significant cost premium against the benefits of offloading the operational complexity to your cloud provider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bottom line:&lt;/strong&gt; At enterprise scale, self-hosting open-source streaming infrastructure can deliver massive cost savings while providing superior performance and control—if you have the team to manage it effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complete Implementation:&lt;/strong&gt; &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events" rel="noopener noreferrer"&gt;RealtimeDataPlatform/realtime-platform-1million-events&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure Code:&lt;/strong&gt; &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;Terraform configurations and Helm charts&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Analysis Tools:&lt;/strong&gt; AWS Pricing Calculator, Infracost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Benchmarks:&lt;/strong&gt; &lt;a href="https://pulsar.apache.org/" rel="noopener noreferrer"&gt;Pulsar vs Kafka&lt;/a&gt;, &lt;a href="https://clickhouse.com/benchmark" rel="noopener noreferrer"&gt;ClickHouse Performance&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Have you made this choice in your organization? What factors influenced your decision? Share your experience in the comments!&lt;/strong&gt; 👇&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow me for more deep dives on cloud architecture, cost optimization, and distributed systems!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #aws #cost #architecture #streaming #devops #realtime #pulsar #flink #clickhouse #msk #kinesis&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>performance</category>
      <category>architecture</category>
      <category>aws</category>
    </item>
    <item>
      <title>Real-Time Data Streaming Platform: From 140K to 1 Million Messages/Sec - A Flink Performance Tuning Journey</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 13:18:28 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/real-time-data-streaming-platform-from-140k-to-1-million-messagessec-a-flink-performance-tuning-1k36</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/real-time-data-streaming-platform-from-140k-to-1-million-messagessec-a-flink-performance-tuning-1k36</guid>
      <description>&lt;p&gt;Performance tuning a distributed streaming system is a journey of discovery, experimentation, and learning. This is the story of how I scaled a Flink streaming job from &lt;strong&gt;140K messages/sec to 1 million messages/sec&lt;/strong&gt; - a &lt;strong&gt;7x improvement&lt;/strong&gt; through systematic optimization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spoiler alert:&lt;/strong&gt; The bottleneck wasn't where I expected!&lt;/p&gt;

&lt;h2&gt;
  
  
  🏗️ Real-Time Data Streaming Platform Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────────┐
│                Real-Time Data Streaming Platform                        │
│                        AWS EKS (Kubernetes 1.31)                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌───────────┐ │
│  │   PRODUCER  │───▶│   PULSAR    │───▶│    FLINK    │───▶│CLICKHOUSE │ │
│  │             │    │             │    │             │    │           │ │
│  │ IoT Sensors │    │ Message     │    │ Stream      │    │ Analytics │ │
│  │ AVRO Data   │    │ Broker      │    │ Processing  │    │ Database  │ │
│  │             │    │             │    │             │    │           │ │
│  │ 4x c5.4xl   │    │ 6x i7i.8xl  │    │ 4x c5.4xl   │    │6x r6id.4xl│ │
│  │ 250K/sec    │    │ Partitions  │    │ Parallelism │    │ Real-time │ │
│  │ each node   │    │ 64          │    │ 64          │    │ Queries   │ │
│  └─────────────┘    └─────────────┘    └─────────────┘    └───────────┘ │
│                                                                         │
│  Data Flow:                                                             │
│  300-byte AVRO ──▶ Pulsar Topics ──▶ keyBy(device_id) ──▶ ClickHouse   │
│  IoT Messages       (Persistent)      1-min Windows       Analytics    │
│                                                                         │
│  Performance Target: 1,000,000 messages/sec end-to-end                 │
└─────────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pipeline Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Producers&lt;/strong&gt;: Generate 300-byte AVRO-serialized IoT sensor data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar&lt;/strong&gt;: Distributed message broker with 64 partitions for parallel processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink&lt;/strong&gt;: Stream processing engine with 64-way parallelism for aggregations
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse&lt;/strong&gt;: Real-time analytics database for sub-second queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Challenge:&lt;/strong&gt; The initial setup achieved only &lt;strong&gt;140K msg/sec&lt;/strong&gt; against the 1M msg/sec target!&lt;/p&gt;
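&lt;p&gt;For reference, the producer-side arithmetic implied by the diagram (4 producer nodes at 250K msg/sec each, 300-byte AVRO payloads):&lt;/p&gt;

```python
# Sketch: the load the pipeline must absorb, from the diagram's figures.

producers = 4
rate_per_node = 250_000   # msg/sec per c5.4xlarge producer
msg_bytes = 300           # AVRO-serialized sensor record

total_rate = producers * rate_per_node
ingest_mb_per_sec = total_rate * msg_bytes / 1_000_000
print(total_rate, ingest_mb_per_sec)  # 1,000,000 msg/sec at 300 MB/sec
```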

&lt;h2&gt;
  
  
  🎯 The Starting Point
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Initial Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Process 1 million messages/sec from Pulsar&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message Size&lt;/strong&gt;: 300 bytes (AVRO-serialized sensor data)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Job&lt;/strong&gt;: Source → keyBy → Window (1-min) → Aggregate → Sink (ClickHouse)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result&lt;/strong&gt;: Only &lt;strong&gt;140K msg/sec&lt;/strong&gt; 😱&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Something was clearly wrong. Time to dig in.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔍 Understanding the Flink Job Structure
&lt;/h2&gt;

&lt;p&gt;Before tuning, I needed to understand what I was working with. You can find the complete implementation in the &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events/flink-load" rel="noopener noreferrer"&gt;flink-load directory&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// JDBCFlinkConsumer.java - The Pipeline&lt;/span&gt;
&lt;span class="nc"&gt;PulsarSource&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;SensorRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PulsarSource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setServiceUrl&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pulsarUrl&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setTopics&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"persistent://public/default/iot-sensor-data"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setDeserializationSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AvroSensorDataDeserializationSchema&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;SensorRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sensorStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...);&lt;/span&gt;

&lt;span class="n"&gt;sensorStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;keyBy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;device_id&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// ← Data shuffle happens here!&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TumblingProcessingTimeWindows&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;aggregate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SensorAggregator&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ClickHouseJDBCSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clickhouseUrl&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


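Before looking at the chain, it helps to see the shape of the aggregation itself. The real `SensorAggregator` lives in the repo's flink-load directory; the sketch below only imitates the add/merge/emit pattern a Flink `AggregateFunction` follows, using plain JDK types, and the `temperature` field is an assumption for illustration:

```java
import java.util.DoubleSummaryStatistics;

public class SensorAggregatorSketch {
    // add(): fold one reading into the per-device, per-window accumulator
    static DoubleSummaryStatistics add(DoubleSummaryStatistics acc, double temperature) {
        acc.accept(temperature);
        return acc;
    }

    // merge(): combine two partial accumulators
    static DoubleSummaryStatistics merge(DoubleSummaryStatistics a, DoubleSummaryStatistics b) {
        a.combine(b);
        return a;
    }

    public static void main(String[] args) {
        DoubleSummaryStatistics acc = new DoubleSummaryStatistics();
        for (double t : new double[]{21.0, 23.5, 22.0}) {
            add(acc, t);
        }
        // getResult() equivalent: emit count/min/max/avg for the window
        System.out.println(acc.getCount() + " readings, max " + acc.getMax()); // 3 readings, max 23.5
    }
}
```

Because only the aggregate (one row per device per window) reaches the sink, ClickHouse sees thousands of rows per minute instead of a million per second.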

&lt;h3&gt;
  
  
  The Operator Chain
&lt;/h3&gt;

&lt;p&gt;Flink optimizes by &lt;strong&gt;chaining operators&lt;/strong&gt; together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task 1 (Source Group):
  └─ Pulsar Source (I/O-bound)
     └─ AVRO Deserialize
        └─ keyBy (compute hash, network shuffle)

Task 2 (Window Group):
  └─ Window Aggregate (CPU-bound)
     └─ ClickHouse Sink (I/O-bound)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; The &lt;code&gt;keyBy()&lt;/code&gt; causes data shuffling between Task 1 and Task 2. This creates &lt;strong&gt;2 separate task groups&lt;/strong&gt; that need slots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total slots needed = Parallelism&lt;/strong&gt; (not parallelism × 4 operators: chaining collapses the four operators into two task groups, and Flink's default slot sharing lets one slot run a subtask from each group)&lt;/p&gt;
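A minimal sketch of that slot arithmetic (illustrative, not repo code), comparing Flink's default slot sharing against what the two task groups would cost without it:

```java
public class SlotMath {
    // With Flink's default slot sharing, one slot can host one subtask
    // from every task group, so a job needs slots equal to its parallelism.
    static int slotsWithSharing(int parallelism) {
        return parallelism;
    }

    // If each of the two task groups needed its own slots instead:
    static int slotsWithoutSharing(int parallelism, int taskGroups) {
        return parallelism * taskGroups;
    }

    public static void main(String[] args) {
        System.out.println(slotsWithSharing(64));       // 64 slots
        System.out.println(slotsWithoutSharing(64, 2)); // 128 slots
    }
}
```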

&lt;h2&gt;
  
  
  📊 Phase 1: Initial Configuration (140K msg/sec)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# FlinkDeployment configuration&lt;/span&gt;
&lt;span class="na"&gt;parallelism&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
&lt;span class="na"&gt;pulsar_partitions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;

&lt;span class="c1"&gt;# Resource allocation&lt;/span&gt;
&lt;span class="na"&gt;taskmanager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;      &lt;span class="c1"&gt;# 2 vCPUs per TaskManager&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4Gi&lt;/span&gt;

&lt;span class="c1"&gt;# Task slot mapping&lt;/span&gt;
&lt;span class="na"&gt;taskmanager.numberOfTaskSlots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;  &lt;span class="c1"&gt;# 2 slots per vCPU (2:1 ratio)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Calculation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4 TaskManagers × 2 slots = &lt;strong&gt;8 total slots&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Parallelism = 8&lt;/li&gt;
&lt;li&gt;Each slot gets: 2 vCPUs / 2 slots = &lt;strong&gt;1 vCPU per slot&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Instance Type:&lt;/strong&gt; c5.2xlarge (8 vCPU, 16 GB RAM)&lt;/p&gt;

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Metrics after 5 minutes&lt;/span&gt;
Records In:  140,000 msg/sec
Records Out: 2,300 aggregated records/min
CPU Usage:   85-95% &lt;span class="o"&gt;(&lt;/span&gt;maxed out!&lt;span class="o"&gt;)&lt;/span&gt;
Backpressure: HIGH on &lt;span class="nb"&gt;source &lt;/span&gt;operators
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Problem Identified:&lt;/strong&gt; 8 Pulsar partitions capped effective parallelism at 8, far too little for the target throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 Phase 2: Scale Parallelism (480K msg/sec)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Hypothesis
&lt;/h3&gt;

&lt;p&gt;If 8 parallel instances handle 140K msg/sec, then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-instance rate: 140K / 8 = 17,500 msg/sec&lt;/li&gt;
&lt;li&gt;For 1M msg/sec: 1,000,000 / 17,500 ≈ &lt;strong&gt;57 instances needed&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's round up to &lt;strong&gt;64&lt;/strong&gt;, a clean power of 2.&lt;/p&gt;
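The same hypothesis as runnable arithmetic (a sketch; the 140K baseline and 1M target are the figures above):

```java
public class ScalingMath {
    // Smallest power of two that is at least `needed`
    // (powers of two keep partition assignment even).
    static int nextPowerOfTwoAtLeast(double needed) {
        int n = (int) Math.ceil(needed);
        int p = Integer.highestOneBit(n); // highest set bit of n
        return (p == n) ? p : p * 2;
    }

    public static void main(String[] args) {
        double perInstance = 140_000.0 / 8;        // ~17,500 msg/sec per parallel instance
        double needed = 1_000_000.0 / perInstance; // ~57.1 instances to reach 1M
        System.out.println(nextPowerOfTwoAtLeast(needed)); // 64
    }
}
```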

&lt;h3&gt;
  
  
  Configuration Changes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Increase Pulsar partitions to 64&lt;/span&gt;
&lt;span class="na"&gt;pulsar_partitions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;

&lt;span class="c1"&gt;# Increase Flink parallelism to match&lt;/span&gt;
&lt;span class="na"&gt;parallelism&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;

&lt;span class="c1"&gt;# Task slot mapping (2:1 ratio maintained)&lt;/span&gt;
&lt;span class="na"&gt;taskmanager.numberOfTaskSlots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;  &lt;span class="c1"&gt;# Still 2 slots per vCPU&lt;/span&gt;

&lt;span class="c1"&gt;# Calculate required TaskManagers:&lt;/span&gt;
&lt;span class="c1"&gt;# 64 slots needed / 2 slots per TM = 32 TaskManagers&lt;/span&gt;
&lt;span class="c1"&gt;# OR with 8 vCPU machines: 64 slots / 16 slots per machine = 4 machines&lt;/span&gt;
&lt;span class="na"&gt;taskmanager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;16&lt;/span&gt;  &lt;span class="c1"&gt;# Using c5.2xlarge (8 vCPU, 16 slots per machine)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Wait... 16 TaskManagers on c5.2xlarge?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let me recalculate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;c5.2xlarge: 8 vCPUs&lt;/li&gt;
&lt;li&gt;Task slots: 2 per vCPU = 16 slots per machine&lt;/li&gt;
&lt;li&gt;Need 64 slots total&lt;/li&gt;
&lt;li&gt;Machines needed: 64 / 16 = &lt;strong&gt;4 machines&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Corrected configuration&lt;/span&gt;
&lt;span class="na"&gt;taskmanager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;  &lt;span class="c1"&gt;# 4 × c5.2xlarge machines&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;      &lt;span class="c1"&gt;# Full 8 vCPUs&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;16Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results - Phase 2
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Metrics after deployment&lt;/span&gt;
Records In:  480,000 msg/sec  &lt;span class="o"&gt;(&lt;/span&gt;3.4x improvement!&lt;span class="o"&gt;)&lt;/span&gt;
Records Out: 8,000 aggregated records/min
CPU Usage:   65-75% per TaskManager
Backpressure: MEDIUM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Progress:&lt;/strong&gt; 140K → 480K msg/sec ✅&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But still far from 1M!&lt;/strong&gt; What's the bottleneck now?&lt;/p&gt;

&lt;h2&gt;
  
  
  🔧 Phase 3: CPU Resource Tuning (600K msg/sec)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Investigation
&lt;/h3&gt;

&lt;p&gt;Looking at the Flink deployment YAML in the &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events/flink-load" rel="noopener noreferrer"&gt;repository&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# flink-job-deployment.yaml - JobManager pod spec&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;job&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;jarURI&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local:///opt/flink/usrlib/flink-consumer-1.0.0.jar&lt;/span&gt;
    &lt;span class="na"&gt;parallelism&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;
    &lt;span class="na"&gt;upgradeMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stateless&lt;/span&gt;
  &lt;span class="na"&gt;flinkConfiguration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;taskmanager.numberOfTaskSlots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
    &lt;span class="c1"&gt;# CPU limit in deployment&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Found it!&lt;/strong&gt; The TaskManager pod resource definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;       &lt;span class="c1"&gt;# Only 1 CPU requested! 🚨&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4Gi"&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;       &lt;span class="c1"&gt;# And max 2 CPUs&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Each TaskManager was throttled to 2 CPUs max, but we have &lt;strong&gt;8 vCPU machines&lt;/strong&gt;!&lt;/p&gt;
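Spelled out with the numbers from that pod spec (a sketch; Kubernetes guarantees only the request and throttles at the limit):

```java
public class CpuPerSlot {
    // CPU available to each of a TaskManager's slots, given a budget.
    static double perSlot(double cpuBudget, int slots) {
        return cpuBudget / slots;
    }

    public static void main(String[] args) {
        int slotsPerTaskManager = 2;
        double requestedCpu = 1.0;  // guaranteed by the scheduler
        double limitCpu = 2.0;      // hard throttle ceiling
        double machineCpu = 8.0;    // c5.2xlarge node

        System.out.println("guaranteed per slot: " + perSlot(requestedCpu, slotsPerTaskManager)); // 0.5
        System.out.println("ceiling per slot:    " + perSlot(limitCpu, slotsPerTaskManager));     // 1.0
        System.out.println("idle on the node:    " + (machineCpu - limitCpu));                    // 6.0
    }
}
```

Six of the eight vCPUs on each node were sitting idle while the TaskManagers throttled.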

&lt;h3&gt;
  
  
  The Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Updated TaskManager resources&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5"&lt;/span&gt;       &lt;span class="c1"&gt;# Request 5 CPUs (was 1)&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;       &lt;span class="c1"&gt;# Allow up to 8 CPUs (was 2)&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results - Phase 3
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# After CPU increase&lt;/span&gt;
Records In:  600,000 msg/sec  &lt;span class="o"&gt;(&lt;/span&gt;1.25x improvement!&lt;span class="o"&gt;)&lt;/span&gt;
CPU Usage:   75-85% per TaskManager &lt;span class="o"&gt;(&lt;/span&gt;better utilization&lt;span class="o"&gt;)&lt;/span&gt;
Backpressure: LOW → MEDIUM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Progress:&lt;/strong&gt; 480K → 600K msg/sec ✅&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Still not 1M.&lt;/strong&gt; Time for the real detective work!&lt;/p&gt;

&lt;h2&gt;
  
  
  🔎 Phase 4: The Backlog Experiment (890K msg/sec)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Eureka Moment
&lt;/h3&gt;

&lt;p&gt;I noticed a &lt;strong&gt;huge backlog&lt;/strong&gt; forming in Pulsar (100M+ messages). So I tried an experiment:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stop all producers and let Flink catch up.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Stop producers&lt;/span&gt;
kubectl scale deployment iot-producer &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0

&lt;span class="c"&gt;# Watch Flink metrics&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &amp;lt;jobmanager&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; curl localhost:8081/jobs/&amp;lt;job-id&amp;gt;/metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Shocking Result
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Flink consuming from backlog (no new messages)&lt;/span&gt;
Records In:  890,000 msg/sec  😲

&lt;span class="c"&gt;# CPU and memory usage&lt;/span&gt;
CPU: 85-90% &lt;span class="o"&gt;(&lt;/span&gt;near max&lt;span class="o"&gt;)&lt;/span&gt;
Memory: Stable
Network: ~270 MB/sec ingress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; Flink could process &lt;strong&gt;890K msg/sec&lt;/strong&gt; when reading from Pulsar backlog, but only &lt;strong&gt;600K msg/sec&lt;/strong&gt; with live producers!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; &lt;strong&gt;Pulsar was the bottleneck!&lt;/strong&gt; 🎯&lt;/p&gt;

&lt;h2&gt;
  
  
  🏎️ Phase 5: Upgrade Pulsar Infrastructure (1M msg/sec)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Pulsar Problem
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Previous Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instance: i3en.6xlarge (24 vCPU, 192GB RAM, 2x 7.5TB NVMe)&lt;/li&gt;
&lt;li&gt;Bookies: 4 nodes&lt;/li&gt;
&lt;li&gt;Brokers: 4 nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Older generation (i3en)&lt;/li&gt;
&lt;li&gt;Only 4 bookies for 1M msg/sec&lt;/li&gt;
&lt;li&gt;Journal and ledger on same NVMe device&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Pulsar Upgrade
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Updated Terraform configuration&lt;/span&gt;
&lt;span class="nx"&gt;pulsar_broker_bookie_config&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;instance_types&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"i7i.8xlarge"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Upgraded from i3en.6xlarge&lt;/span&gt;
  &lt;span class="nx"&gt;desired_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;                   &lt;span class="c1"&gt;# Increased from 4&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;i7i.8xlarge Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Newer generation (better CPU IPC)&lt;/li&gt;
&lt;li&gt;32 vCPUs (vs 24 on i3en)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2× NVMe devices&lt;/strong&gt; (3.75TB each)&lt;/li&gt;
&lt;li&gt;Lower per-device latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;NVMe Device Separation (Critical!):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Device mapping on each i7i.8xlarge bookie&lt;/span&gt;
/dev/nvme1n1 → /mnt/bookkeeper/journal  &lt;span class="c"&gt;# Journal (WAL)&lt;/span&gt;
/dev/nvme2n1 → /mnt/bookkeeper/ledgers  &lt;span class="c"&gt;# Ledgers (Data)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separation eliminates I/O contention between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Journal&lt;/strong&gt;: Sequential writes (low latency critical)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ledgers&lt;/strong&gt;: Random reads/writes (capacity critical)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Results - Phase 5
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# After Pulsar upgrade&lt;/span&gt;
Records In:  1,040,000 msg/sec  🎉
CPU &lt;span class="o"&gt;(&lt;/span&gt;Flink&lt;span class="o"&gt;)&lt;/span&gt;: 80-85% per TaskManager
CPU &lt;span class="o"&gt;(&lt;/span&gt;Pulsar&lt;span class="o"&gt;)&lt;/span&gt;: 70-75% per Broker/Bookie
Backpressure: NONE to LOW
End-to-end latency: &amp;lt;2 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;SUCCESS:&lt;/strong&gt; 600K → 1,040K msg/sec ✅&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 Phase 6: Final Flink Optimization (1.04M msg/sec sustained)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Last Mile
&lt;/h3&gt;

&lt;p&gt;Even with Pulsar fixed, I wanted to optimize Flink further. The &lt;strong&gt;2:1 slot-to-CPU ratio&lt;/strong&gt; was still suboptimal for our CPU-heavy aggregation workload.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration Change
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Final Flink configuration&lt;/span&gt;
&lt;span class="na"&gt;taskmanager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;  &lt;span class="c1"&gt;# 4 × c5.4xlarge machines&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;16&lt;/span&gt;     &lt;span class="c1"&gt;# Full 16 vCPUs per TM&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;32Gi&lt;/span&gt;

&lt;span class="c1"&gt;# Slot configuration&lt;/span&gt;
&lt;span class="na"&gt;taskmanager.numberOfTaskSlots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;16&lt;/span&gt;  &lt;span class="c1"&gt;# 1:1 ratio (was 2:1)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;New Calculation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4 TaskManagers × 16 slots = &lt;strong&gt;64 total slots&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Each slot gets: 16 vCPUs / 16 slots = &lt;strong&gt;1 vCPU per slot&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Instance Upgrade:&lt;/strong&gt; c5.2xlarge → c5.4xlarge (16 vCPU, 32 GB RAM)&lt;/p&gt;

&lt;h3&gt;
  
  
  Why 1:1 Ratio Works Better
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Our Pipeline Analysis:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Source&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="no"&gt;O&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;keyBy&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;Window&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;CPU&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;Aggregate&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;CPU&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;Sink&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="no"&gt;O&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
   &lt;span class="err"&gt;🟢&lt;/span&gt;                    &lt;span class="err"&gt;🔴&lt;/span&gt;             &lt;span class="err"&gt;🔴&lt;/span&gt;               &lt;span class="err"&gt;🟢&lt;/span&gt;

&lt;span class="no"&gt;CPU&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;intensive&lt;/span&gt; &lt;span class="nl"&gt;operators:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Window&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nc"&gt;Aggregate&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="no"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="no"&gt;O&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;bound&lt;/span&gt; &lt;span class="nl"&gt;operators:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Source&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nc"&gt;Sink&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Decision:&lt;/strong&gt; Since aggregation is CPU-heavy, a 1:1 ratio gives each task a dedicated vCPU.&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Results
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Sustained performance metrics&lt;/span&gt;
Records In:  1,040,000 msg/sec
Records Out: 17,333 aggregated records/min
CPU Usage:   75-80% per TaskManager &lt;span class="o"&gt;(&lt;/span&gt;optimal&lt;span class="o"&gt;)&lt;/span&gt;
Memory Usage: 60-70% &lt;span class="o"&gt;(&lt;/span&gt;plenty of headroom&lt;span class="o"&gt;)&lt;/span&gt;
Backpressure: NONE
GC Pressure: LOW
Checkpoint Duration: 5-8 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Final Achievement:&lt;/strong&gt; &lt;strong&gt;1,040,000 messages/sec sustained&lt;/strong&gt; 🏆&lt;/p&gt;

&lt;h2&gt;
  
  
  🧪 The Backlog Test - A Critical Technique
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why This Test Matters
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;backlog consumption test&lt;/strong&gt; reveals your true system capacity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The test process&lt;/span&gt;
1. Run producers at max speed &lt;span class="o"&gt;(&lt;/span&gt;build backlog&lt;span class="o"&gt;)&lt;/span&gt;
2. Stop producers completely
3. Measure Flink consumption from backlog
4. Compare: Backlog rate vs Live rate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What it tells you:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shows true Flink capacity&lt;/li&gt;
&lt;li&gt;Reveals whether Pulsar or Flink is the bottleneck&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In our case:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Live: 600K msg/sec&lt;/li&gt;
&lt;li&gt;Backlog: 890K msg/sec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conclusion:&lt;/strong&gt; Pulsar was limiting, not Flink!&lt;/li&gt;
&lt;/ul&gt;
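That comparison can be captured as a tiny heuristic. The 20% margin below is my own assumption, not something measured in the original test:

```java
public class BottleneckCheck {
    // If the consumer drains a pre-built backlog much faster than it keeps
    // up with live traffic, the broker (not the consumer) is the limiting
    // component. The 1.2 margin filters out measurement noise (assumption).
    static String diagnose(double liveRate, double backlogRate) {
        return (backlogRate > liveRate * 1.2) ? "broker-limited" : "consumer-limited";
    }

    public static void main(String[] args) {
        // Our measurements: 600K msg/sec live vs 890K msg/sec from backlog
        System.out.println(diagnose(600_000, 890_000)); // broker-limited
    }
}
```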

&lt;h2&gt;
  
  
  🔧 Final Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Flink Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# FlinkDeployment - Final configuration&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;flinkVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1_18&lt;/span&gt;

  &lt;span class="na"&gt;flinkConfiguration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;taskmanager.numberOfTaskSlots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16"&lt;/span&gt;
    &lt;span class="na"&gt;parallelism.default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64"&lt;/span&gt;
    &lt;span class="na"&gt;state.backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rocksdb&lt;/span&gt;
    &lt;span class="na"&gt;state.checkpoints.dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3://benchmark-high-infra-state/checkpoints&lt;/span&gt;
    &lt;span class="na"&gt;execution.checkpointing.interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60000&lt;/span&gt;
    &lt;span class="na"&gt;execution.checkpointing.mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EXACTLY_ONCE&lt;/span&gt;

  &lt;span class="na"&gt;jobManager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
    &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;

  &lt;span class="na"&gt;taskManager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32Gi"&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;16&lt;/span&gt;  &lt;span class="c1"&gt;# Full 16 vCPUs&lt;/span&gt;
    &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;  &lt;span class="c1"&gt;# 4 × c5.4xlarge machines&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Resource Summary:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4 TaskManagers on c5.4xlarge (16 vCPU, 32 GB each)&lt;/li&gt;
&lt;li&gt;64 task slots total (16 per TM)&lt;/li&gt;
&lt;li&gt;1:1 slot-to-vCPU ratio&lt;/li&gt;
&lt;li&gt;64 parallelism (matches Pulsar partitions)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pulsar Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pulsar values - Final configuration&lt;/span&gt;
&lt;span class="na"&gt;bookkeeper&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt;  &lt;span class="c1"&gt;# Increased from 4&lt;/span&gt;

  &lt;span class="c1"&gt;# NVMe device configuration&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;journal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;200Gi&lt;/span&gt;
      &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local-nvme&lt;/span&gt;  &lt;span class="c1"&gt;# /dev/nvme1n1&lt;/span&gt;
    &lt;span class="na"&gt;ledgers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1000Gi&lt;/span&gt;
      &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local-nvme&lt;/span&gt;  &lt;span class="c1"&gt;# /dev/nvme2n1&lt;/span&gt;

  &lt;span class="na"&gt;configData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;journalMaxSizeMB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2048"&lt;/span&gt;
    &lt;span class="na"&gt;journalSyncData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false"&lt;/span&gt;
    &lt;span class="na"&gt;journalAdaptiveGroupWrites&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="na"&gt;ledgerStorageClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage"&lt;/span&gt;

&lt;span class="na"&gt;broker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt;
  &lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;node-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;broker-bookie&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Instance Type:&lt;/strong&gt; i7i.8xlarge (32 vCPU, 256GB RAM, 2× 3.75TB NVMe)&lt;/p&gt;

&lt;h2&gt;
  
  
  📊 Performance Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Bottleneck&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Phase 1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8 parallel, c5.2xlarge, 2:1&lt;/td&gt;
&lt;td&gt;140K msg/sec&lt;/td&gt;
&lt;td&gt;Low parallelism&lt;/td&gt;
&lt;td&gt;Increase to 64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Phase 2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;64 parallel, c5.2xlarge, 2:1&lt;/td&gt;
&lt;td&gt;480K msg/sec&lt;/td&gt;
&lt;td&gt;CPU throttling&lt;/td&gt;
&lt;td&gt;Increase CPU limit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Phase 3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;64 parallel, c5.2xlarge, 2:1, 8 CPU&lt;/td&gt;
&lt;td&gt;600K msg/sec&lt;/td&gt;
&lt;td&gt;Pulsar capacity&lt;/td&gt;
&lt;td&gt;Upgrade Pulsar&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backlog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same config, no producers&lt;/td&gt;
&lt;td&gt;890K msg/sec&lt;/td&gt;
&lt;td&gt;Flink needs more CPU&lt;/td&gt;
&lt;td&gt;Upgrade to c5.4xlarge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Phase 4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;64 parallel, c5.4xlarge, 1:1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,040K msg/sec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None!&lt;/td&gt;
&lt;td&gt;✅ Success&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  💡 Best Practices for Flink at Scale
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Match Parallelism to Source Partitions&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pulsar Partitions = Flink Parallelism
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Optimal work distribution&lt;/li&gt;
&lt;li&gt;No partition skew&lt;/li&gt;
&lt;li&gt;Maximum throughput&lt;/li&gt;
&lt;/ul&gt;
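&lt;p&gt;The rule above can be expressed as a tiny planner. A sketch with names of my own choosing, not a Flink API; the skew calculation shows what happens when the two numbers do not match:&lt;/p&gt;

```java
// Sketch: pick Flink parallelism from the source partition count.
public class ParallelismPlanner {
    // One partition per subtask: no skew, no idle subtasks
    static int planParallelism(int pulsarPartitions) {
        return pulsarPartitions;
    }

    // Average partitions per subtask when the numbers do NOT match
    static double partitionsPerSubtask(int partitions, int parallelism) {
        return (double) partitions / parallelism;
    }

    public static void main(String[] args) {
        System.out.println(planParallelism(64));          // 64
        // Mismatched: some subtasks own 2 partitions, others 1 -> skew
        System.out.println(partitionsPerSubtask(64, 48)); // ~1.33
    }
}
```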

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Use the Right Slot-to-CPU Ratio&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Analyze your job operators&lt;/span&gt;
&lt;span class="nc"&gt;Source&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="no"&gt;O&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;keyBy&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;Window&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;CPU&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;Aggregate&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;CPU&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;Sink&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="no"&gt;O&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
   &lt;span class="err"&gt;🟢&lt;/span&gt;                      &lt;span class="err"&gt;🔴&lt;/span&gt;              &lt;span class="err"&gt;🔴&lt;/span&gt;               &lt;span class="err"&gt;🟢&lt;/span&gt;

&lt;span class="c1"&gt;// Count CPU-bound operators&lt;/span&gt;
&lt;span class="no"&gt;CPU&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nl"&gt;bound:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="n"&gt;operators&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="no"&gt;I&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="no"&gt;O&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nl"&gt;bound:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="n"&gt;operators&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sink&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Decision: 1:1 ratio (due to CPU-heavy aggregate)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
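&lt;p&gt;The decision rule sketched above can be written down directly. Purely illustrative (real profiling uses Flink's metrics, not a static count), and the names are mine:&lt;/p&gt;

```java
// Sketch: choose the slot-to-CPU ratio from the operator profile.
public class RatioRule {
    // Returns slots per vCPU: 1 (ratio 1:1) or 2 (ratio 2:1)
    static int slotsPerVcpu(int cpuBoundOps) {
        // Any CPU-heavy operator in the chain (window, aggregate) means each
        // slot should own a full vCPU; a purely I/O-bound pipeline can
        // oversubscribe with 2 slots per vCPU.
        return cpuBoundOps == 0 ? 2 : 1;
    }

    public static void main(String[] args) {
        // Source(I/O) -> keyBy -> Window(CPU) -> Aggregate(CPU) -> Sink(I/O)
        System.out.println(slotsPerVcpu(2)); // 1, i.e. the 1:1 ratio chosen here
        System.out.println(slotsPerVcpu(0)); // 2, i.e. 2:1 is safe for pure I/O jobs
    }
}
```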



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Right-Size Your TaskManager Instances&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Options for 64 slots with 1:1 ratio:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instance&lt;/th&gt;
&lt;th&gt;vCPU&lt;/th&gt;
&lt;th&gt;Machines&lt;/th&gt;
&lt;th&gt;Cost/mo&lt;/th&gt;
&lt;th&gt;Network&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;c5.2xlarge&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;$2,200&lt;/td&gt;
&lt;td&gt;Up to 10 Gbps&lt;/td&gt;
&lt;td&gt;Budget&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;c5.4xlarge&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;$2,400&lt;/td&gt;
&lt;td&gt;Up to 10 Gbps&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Balanced&lt;/strong&gt; ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;c5.9xlarge&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;$2,700&lt;/td&gt;
&lt;td&gt;10 Gbps&lt;/td&gt;
&lt;td&gt;High memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;c5.12xlarge&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;$3,600&lt;/td&gt;
&lt;td&gt;12 Gbps&lt;/td&gt;
&lt;td&gt;Max performance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Our choice:&lt;/strong&gt; c5.4xlarge, the best balance of cost and manageability&lt;/p&gt;
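&lt;p&gt;The "machines needed" column follows from simple division. A sketch (my names, rounded up so the last machine may run partially filled):&lt;/p&gt;

```java
// Sketch: how many machines are needed for 64 slots at a 1:1 slot-to-vCPU ratio.
public class MachineCount {
    static int machinesNeeded(int totalSlots, int vcpusPerMachine) {
        // Ceiling division: round up when slots don't divide evenly
        return (totalSlots + vcpusPerMachine - 1) / vcpusPerMachine;
    }

    public static void main(String[] args) {
        System.out.println(machinesNeeded(64, 8));  // 8 x c5.2xlarge
        System.out.println(machinesNeeded(64, 16)); // 4 x c5.4xlarge
        System.out.println(machinesNeeded(64, 36)); // 2 x c5.9xlarge
    }
}
```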

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Monitor These Metrics&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Critical Flink metrics to watch

# 1. Backpressure (should be LOW)
flink_taskmanager_job_task_backPressureTimeMsPerSecond

# 2. Records per second
rate(flink_taskmanager_job_task_numRecordsInPerSecond[1m])

# 3. Checkpoint duration (should be &amp;lt; 10% of interval)
flink_jobmanager_job_lastCheckpointDuration

# 4. CPU usage (should be 70-85%)
container_cpu_usage_seconds_total{pod=~"flink-taskmanager.*"}

# 5. Memory usage
flink_taskmanager_Status_JVM_Memory_Heap_Used
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
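&lt;p&gt;The thresholds in the comments above ("below 10% of the interval", "70-85% CPU") can be turned into a simple health check. A sketch with illustrative names; wire the real metric values in from your monitoring stack:&lt;/p&gt;

```java
// Sketch: alerting rules for the two threshold-based metrics listed above.
public class MetricHealth {
    // Checkpoint duration should stay under 10% of the checkpoint interval
    static boolean checkpointHealthy(long durationMs, long intervalMs) {
        return durationMs * 10 < intervalMs;
    }

    // CPU should sit in the 70-85% band: busy, but not saturated
    static boolean cpuHealthy(double cpuUtilization) {
        return cpuUtilization >= 0.70 && cpuUtilization <= 0.85;
    }

    public static void main(String[] args) {
        System.out.println(checkpointHealthy(4_000, 60_000)); // true: 4s of a 60s interval
        System.out.println(checkpointHealthy(9_000, 60_000)); // false: 15%, investigate state size
        System.out.println(cpuHealthy(0.78));                 // true
    }
}
```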



&lt;h3&gt;
  
  
  5. &lt;strong&gt;Test with Backlog&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Always do the backlog test:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Build backlog (run producers at full speed)&lt;/span&gt;
&lt;span class="c"&gt;# 2. Stop producers&lt;/span&gt;
&lt;span class="c"&gt;# 3. Measure Flink consumption rate&lt;/span&gt;
&lt;span class="c"&gt;# 4. This is your TRUE Flink capacity&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If backlog consumption &amp;gt; live consumption:&lt;br&gt;
→ &lt;strong&gt;Upstream system (Pulsar) is the bottleneck&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If backlog consumption ≈ live consumption:&lt;br&gt;
→ &lt;strong&gt;Flink is the bottleneck&lt;/strong&gt;&lt;/p&gt;
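&lt;p&gt;That interpretation step can be codified. A sketch: the 10% tolerance band standing in for "≈" is my assumption, not a figure from the benchmark:&lt;/p&gt;

```java
// Sketch: classify the bottleneck from the backlog test results.
public class BacklogVerdict {
    static String bottleneck(double liveRate, double backlogRate) {
        if (backlogRate > liveRate * 1.10) {
            // Flink drains faster once producers stop: the upstream (Pulsar) was limiting it
            return "upstream";
        }
        // Rates roughly match: Flink itself is saturated
        return "flink";
    }

    public static void main(String[] args) {
        System.out.println(bottleneck(600_000, 890_000)); // upstream (the situation in this benchmark)
        System.out.println(bottleneck(600_000, 610_000)); // flink
    }
}
```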

&lt;h2&gt;
  
  
  🎯 Key Takeaways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Worked
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;Parallelism = Partitions&lt;/strong&gt; (64 = 64)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;1:1 slot-to-CPU ratio&lt;/strong&gt; for CPU-bound workloads&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Bigger instances&lt;/strong&gt; (c5.4xlarge) over many small ones&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Upgraded Pulsar&lt;/strong&gt; (i3en → i7i, 4 → 6 nodes)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;NVMe device separation&lt;/strong&gt; (journal vs ledgers)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Disabled journalSyncData&lt;/strong&gt; in BookKeeper&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Backlog testing&lt;/strong&gt; to identify bottlenecks  &lt;/p&gt;

&lt;h3&gt;
  
  
  What Didn't Work
&lt;/h3&gt;

&lt;p&gt;❌ &lt;strong&gt;2:1 slot-to-CPU ratio&lt;/strong&gt; (insufficient CPU per task)&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Low CPU limits&lt;/strong&gt; in pod specs (throttling)&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Too few Pulsar bookies&lt;/strong&gt; (4 → needed 6)&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Single NVMe device&lt;/strong&gt; for journal+ledger (I/O contention)&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Older instance types&lt;/strong&gt; (i3en vs i7i)  &lt;/p&gt;

&lt;h2&gt;
  
  
  💰 Final Cost
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Instance&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Flink JM&lt;/td&gt;
&lt;td&gt;c5.4xlarge&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;$480&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flink TM&lt;/td&gt;
&lt;td&gt;c5.4xlarge&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;$1,920&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flink Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$2,400&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pulsar (Broker+Bookie)&lt;/td&gt;
&lt;td&gt;i7i.8xlarge&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;$12,960&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$15,360/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Cost per message:&lt;/strong&gt; roughly $0.006 per million messages ($15,360/month spread over ~2.6 trillion messages at a sustained 1M msg/sec)&lt;/p&gt;
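&lt;p&gt;The per-message cost follows from the table. A sketch of the arithmetic, assuming a 30-day month of sustained load (my assumption):&lt;/p&gt;

```java
// Sketch: derive cost per million messages from the monthly infrastructure bill.
public class CostPerMessage {
    static double costPerMillionMessages(double monthlyUsd, long msgPerSec) {
        // 86,400 seconds/day x 30 days: about 2.59 trillion messages per month
        double messagesPerMonth = (double) msgPerSec * 86_400 * 30;
        return monthlyUsd / messagesPerMonth * 1_000_000;
    }

    public static void main(String[] args) {
        double c = costPerMillionMessages(15_360, 1_000_000);
        System.out.printf("$%.4f per million messages%n", c); // about $0.0059
    }
}
```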

&lt;h2&gt;
  
  
  🚀 Scaling Beyond 1M
&lt;/h2&gt;

&lt;p&gt;Want to go higher? Here's the roadmap:&lt;/p&gt;

&lt;h3&gt;
  
  
  2M msg/sec:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pulsar: 8 bookies (i7i.8xlarge)&lt;/li&gt;
&lt;li&gt;Flink: 8 TaskManagers (c5.4xlarge), parallelism 128&lt;/li&gt;
&lt;li&gt;Partitions: 128&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; ~$23K/month&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5M msg/sec:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pulsar: 15 bookies (i7i.8xlarge)&lt;/li&gt;
&lt;li&gt;Flink: 16 TaskManagers (c5.4xlarge), parallelism 256&lt;/li&gt;
&lt;li&gt;Partitions: 256&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; ~$40K/month&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  10M msg/sec:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pulsar: 30 bookies (i7i.16xlarge)&lt;/li&gt;
&lt;li&gt;Flink: 32 TaskManagers (c5.9xlarge), parallelism 512&lt;/li&gt;
&lt;li&gt;Network: Upgrade to 100 Gbps instances&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; ~$80K/month&lt;/li&gt;
&lt;/ul&gt;
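&lt;p&gt;One way to read the roadmap above is cost efficiency per tier: dollars per month for each 1M msg/sec of capacity. A sketch using the article's rough estimates; the tiers themselves would still need re-benchmarking before committing to hardware:&lt;/p&gt;

```java
// Sketch: cost efficiency of each scaling tier, in $/month per 1M msg/sec.
public class TierEfficiency {
    static double usdPerMillionPerSec(double monthlyUsd, double millionMsgPerSec) {
        return monthlyUsd / millionMsgPerSec;
    }

    public static void main(String[] args) {
        System.out.println(usdPerMillionPerSec(15_360, 1));  // ~15,360: the 1M baseline
        System.out.println(usdPerMillionPerSec(23_000, 2));  // ~11,500: better utilization
        System.out.println(usdPerMillionPerSec(40_000, 5));  // 8,000
        System.out.println(usdPerMillionPerSec(80_000, 10)); // 8,000: efficiency flattens out
    }
}
```

&lt;p&gt;Unit cost drops sharply from 1M to 5M msg/sec, then flattens: beyond that point you are paying linearly for capacity, mostly driven by network upgrades.&lt;/p&gt;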

&lt;h2&gt;
  
  
  🎓 Conclusion
&lt;/h2&gt;

&lt;p&gt;Going from &lt;strong&gt;140K to 1 million messages/sec&lt;/strong&gt; required:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Understanding the architecture&lt;/strong&gt; (operators, tasks, slots)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Systematic testing&lt;/strong&gt; (change one thing at a time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bottleneck identification&lt;/strong&gt; (backlog test was key!)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right-sizing resources&lt;/strong&gt; (not just throwing more hardware)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure matching&lt;/strong&gt; (Pulsar + Flink capacity aligned)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The biggest lesson?&lt;/strong&gt; &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The bottleneck is rarely where you think it is. Measure, test, iterate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In our case, we assumed Flink was the problem. Turned out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Phases 1-2: Flink configuration issues (low parallelism, then CPU throttling)&lt;/li&gt;
&lt;li&gt;Phase 3: &lt;strong&gt;Pulsar was limiting Flink!&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Backlog test and Phase 4: back to Flink (it needed more CPU per slot)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Both systems needed optimization to achieve 1M msg/sec.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The journey taught me:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with fundamentals&lt;/strong&gt;: Parallelism, partitions, resource allocation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use systematic testing&lt;/strong&gt;: Change one variable at a time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leverage diagnostic tools&lt;/strong&gt;: Backlog testing, metrics monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Think holistically&lt;/strong&gt;: Tune the entire pipeline, not just one component&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flink Load Repository&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events/flink-load" rel="noopener noreferrer"&gt;flink-load&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete Implementation&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;RealtimeDataPlatform&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Flink Performance Tuning&lt;/strong&gt;: &lt;a href="https://flink.apache.org/performance" rel="noopener noreferrer"&gt;flink.apache.org/performance&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink Resource Configuration&lt;/strong&gt;: &lt;a href="https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/memory/mem_setup/" rel="noopener noreferrer"&gt;Flink Memory Setup Guide&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Have you tuned Flink for high throughput? What challenges did you face? Share your optimization stories in the comments!&lt;/strong&gt; 👇&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow me for more deep dives on stream processing, performance optimization, and distributed systems architecture!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next in the series&lt;/strong&gt;: "ClickHouse Performance: Ingesting 1M Events/Sec with Sub-Second Queries"&lt;/p&gt;




&lt;h2&gt;
  
  
  📋 Quick Reference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Final Configuration Checklist
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;✅ Pulsar partitions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;
&lt;span class="na"&gt;✅ Flink parallelism&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;
&lt;span class="na"&gt;✅ TaskManager slots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;16 (per TM)&lt;/span&gt;
&lt;span class="na"&gt;✅ Slot-to-CPU ratio&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1:1&lt;/span&gt;
&lt;span class="na"&gt;✅ TaskManager count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
&lt;span class="na"&gt;✅ Instance type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;c5.4xlarge&lt;/span&gt;
&lt;span class="na"&gt;✅ Total vCPUs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;
&lt;span class="na"&gt;✅ Total slots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;
&lt;span class="na"&gt;✅ Pulsar bookies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt;
&lt;span class="na"&gt;✅ Pulsar instance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;i7i.8xlarge&lt;/span&gt;
&lt;span class="na"&gt;✅ NVMe separation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Yes&lt;/span&gt;
&lt;span class="na"&gt;✅ journalSyncData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Performance Validation Commands
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check Flink throughput&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark deployment/iot-flink-job | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"records"&lt;/span&gt;

&lt;span class="c"&gt;# Check parallelism&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &amp;lt;jm-pod&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  curl localhost:8081/jobs/&amp;lt;job-id&amp;gt; | jq &lt;span class="s1"&gt;'.vertices[].parallelism'&lt;/span&gt;

&lt;span class="c"&gt;# Check CPU usage&lt;/span&gt;
kubectl top pods &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark

&lt;span class="c"&gt;# Check backpressure&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &amp;lt;jm-pod&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  curl localhost:8081/jobs/&amp;lt;job-id&amp;gt;/vertices/&amp;lt;vertex-id&amp;gt;/backpressure

&lt;span class="c"&gt;# Run the backlog test&lt;/span&gt;
kubectl scale deployment iot-producer &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;span class="c"&gt;# Watch throughput increase as Flink catches up&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #flink #performance #streaming #aws #optimization #parallelism #tuning #apacheflink #realtimedata&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>dataengineering</category>
      <category>performance</category>
      <category>aws</category>
    </item>
    <item>
      <title>How I Achieved 1 Million Messages/Sec with Apache Pulsar on AWS EKS - A Deep Dive into NVMe, BookKeeper, and Performance Tuning</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 12:44:48 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/how-i-achieved-1-million-messagessec-with-apache-pulsar-on-aws-eks-a-deep-dive-into-nvme-33b8</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/how-i-achieved-1-million-messagessec-with-apache-pulsar-on-aws-eks-a-deep-dive-into-nvme-33b8</guid>
      <description>&lt;p&gt;Processing &lt;strong&gt;1 million messages per second&lt;/strong&gt; isn't just about throwing more hardware at the problem. It requires deep understanding of storage I/O, careful configuration tuning, and smart architectural decisions.&lt;/p&gt;

&lt;p&gt;In this article, I'll share the exact configurations and optimizations that enabled Apache Pulsar to reliably handle &lt;strong&gt;1,000,000 messages/sec with 300-byte payloads&lt;/strong&gt; on AWS EKS.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 The Challenge
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt;: 1 million messages/second sustained&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message Size&lt;/strong&gt;: ~300 bytes (AVRO-serialized sensor data)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Bandwidth&lt;/strong&gt;: ~2.4 Gbps (300 MB/sec)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: &amp;lt; 10ms p99&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Durability&lt;/strong&gt;: No message loss, replicated storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: Optimized for AWS infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🏗️ Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────────────┐
│                     Apache Pulsar on AWS EKS                            │
│                 benchmark-high-infra (k8s 1.31)                         │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────────┐   ┌─────────────┐   ┌─────────────┐   ┌──────────┐ │
│  │   PRODUCERS     │──▶│ ZooKeeper   │   │   Pulsar    │──▶│ PROXIES  │ │
│  │                 │   │             │   │  Brokers    │   │          │ │
│  │ 4 nodes         │   │ 3 nodes     │   │ 6 nodes     │   │ 2 nodes  │ │
│  │ c5.4xlarge      │   │ t3.medium   │   │ i7i.8xlarge │   │c5.2xlarge│ │
│  │                 │   │             │   │             │   │          │ │
│  │ Java/AVRO       │   │ Metadata    │   │ Message     │   │ Load     │ │
│  │ 250K evt/sec    │   │ Management  │   │ Routing     │   │ Balance  │ │
│  │ per node        │   │             │   │             │   │          │ │
│  └─────────────────┘   └─────────────┘   └─────────────┘   └──────────┘ │
│                                                 │                        │
│                                                 ▼                        │
│                                         ┌─────────────┐                  │
│                                         │ BookKeeper  │                  │
│                                         │  Bookies    │                  │
│                                         │             │                  │
│                                         │ 6 nodes     │                  │
│                                         │ i7i.8xlarge │                  │
│                                         │             │                  │
│                                         │ NVMe Storage│                  │
│                                         │ Separation: │                  │
│                                         │ Device 0:   │                  │
│                                         │ Journal WAL │                  │
│                                         │ Device 1:   │                  │
│                                         │ Ledger Data │                  │
│                                         └─────────────┘                  │
└──────────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  💾 The Storage Strategy - Why NVMe Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  EC2 Instance Selection: i7i.8xlarge
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why i7i.8xlarge?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Instance: i7i.8xlarge
- 32 vCPUs
- 256 GiB RAM
- 2x 3,750 GB NVMe SSDs (7.5TB total)
- Network: 25 Gbps
- Cost: ~$2,160/month per instance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ultra-low latency&lt;/strong&gt;: NVMe SSDs provide &amp;lt;100µs latency vs EBS's ~1-3ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High IOPS&lt;/strong&gt;: 3.75M IOPS vs EBS's 64K IOPS (gp3)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sustained throughput&lt;/strong&gt;: 30 GB/s vs EBS's 1 GB/s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No network overhead&lt;/strong&gt;: Local storage doesn't compete with network bandwidth&lt;/li&gt;
&lt;/ol&gt;
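&lt;p&gt;Put as ratios, the gap in the list above is stark. A trivial sketch using the article's figures:&lt;/p&gt;

```java
// Sketch: the NVMe-vs-EBS gap from the benefits list, expressed as multiples.
public class StorageGap {
    static double ratio(double nvme, double ebs) {
        return nvme / ebs;
    }

    public static void main(String[] args) {
        System.out.println(ratio(3_750_000, 64_000)); // ~58.6x the IOPS of gp3 EBS
        System.out.println(ratio(30, 1));             // 30x the sustained throughput (GB/s)
    }
}
```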

&lt;h3&gt;
  
  
  NVMe Device Separation - The Game Changer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Critical Design Decision:&lt;/strong&gt; Use &lt;strong&gt;2 separate NVMe devices&lt;/strong&gt; per node&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Device configuration on each i7i.8xlarge node&lt;/span&gt;
/dev/nvme1n1 → Journal &lt;span class="o"&gt;(&lt;/span&gt;Write-Ahead Log&lt;span class="o"&gt;)&lt;/span&gt; - 3,750 GB
/dev/nvme2n1 → Ledgers &lt;span class="o"&gt;(&lt;/span&gt;Message Storage&lt;span class="o"&gt;)&lt;/span&gt; - 3,750 GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why separate devices?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Journal (Write-Ahead Log):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sequential writes only&lt;/li&gt;
&lt;li&gt;Low latency critical (blocks producer ACKs)&lt;/li&gt;
&lt;li&gt;Large capacity available (3.75TB per device)&lt;/li&gt;
&lt;li&gt;High write frequency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ledgers (Message Storage):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Random reads/writes&lt;/li&gt;
&lt;li&gt;Large capacity needed (3.75TB per device)&lt;/li&gt;
&lt;li&gt;Background compaction operations&lt;/li&gt;
&lt;li&gt;Read-heavy for consumers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Performance Impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Without separation&lt;/strong&gt;: Journal writes compete with ledger I/O → increased latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With separation&lt;/strong&gt;: Independent I/O queues → consistent &amp;lt;1ms write latency&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🚀 Producer Infrastructure - High-Volume Event Generation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Producer Instance Selection: c5.4xlarge
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why c5.4xlarge for producers?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Instance: c5.4xlarge
- 16 vCPUs (high single-thread performance)
- 32 GiB RAM
- Up to 10 Gbps network performance
- EBS Optimized: Up to 4,750 Mbps
- Cost: ~$480/month per instance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Producer Architecture Details:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Node Configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Count&lt;/strong&gt;: 4 producer nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instance Type&lt;/strong&gt;: c5.4xlarge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target Throughput&lt;/strong&gt;: 250,000 messages/sec per node&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Capacity&lt;/strong&gt;: 1,000,000 messages/sec across all nodes&lt;/li&gt;
&lt;/ul&gt;
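&lt;p&gt;The fleet size falls out of the per-node target. A sketch (ceiling division, so a lower per-node rate simply means more nodes):&lt;/p&gt;

```java
// Sketch: producer fleet sizing from a total target and a per-node capacity.
public class ProducerFleet {
    static int nodesNeeded(long totalMsgPerSec, long perNodeMsgPerSec) {
        return (int) Math.ceil((double) totalMsgPerSec / perNodeMsgPerSec);
    }

    public static void main(String[] args) {
        System.out.println(nodesNeeded(1_000_000, 250_000)); // 4 x c5.4xlarge
    }
}
```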

&lt;p&gt;&lt;strong&gt;Producer Implementation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Language&lt;/strong&gt;: Java with high-performance Pulsar client&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serialization&lt;/strong&gt;: AVRO for efficient message encoding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message Size&lt;/strong&gt;: ~300 bytes (AVRO-serialized sensor data)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batching&lt;/strong&gt;: Optimized batch sizes for throughput&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connection Pooling&lt;/strong&gt;: Multiple connections per producer&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Producer Performance Characteristics
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Per-Node Metrics:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Each c5.4xlarge producer node generates:&lt;/span&gt;
- Messages/sec: 250,000
- Data rate: 75 MB/sec &lt;span class="o"&gt;(&lt;/span&gt;300 bytes × 250K&lt;span class="o"&gt;)&lt;/span&gt;
- CPU utilization: 70-80%
- Memory usage: 8-12 GB
- Network utilization: ~600 Mbps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
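&lt;p&gt;Those per-node figures are easy to double-check. A sketch of the arithmetic (decimal megabytes; names are mine):&lt;/p&gt;

```java
// Sketch: verify the data-rate and network figures for one producer node.
public class ProducerMath {
    static double dataRateMBps(int messageBytes, long msgPerSec) {
        return messageBytes * msgPerSec / 1_000_000.0; // bytes/sec -> MB/sec
    }

    static double networkMbps(double mbPerSec) {
        return mbPerSec * 8; // bytes to bits
    }

    public static void main(String[] args) {
        double mb = dataRateMBps(300, 250_000);
        System.out.println(mb);                           // 75.0 MB/sec per node
        System.out.println(networkMbps(mb));              // 600.0 Mbps, well under 10 Gbps
        System.out.println(dataRateMBps(300, 1_000_000)); // 300.0 MB/sec fleet-wide, ~2.4 Gbps
    }
}
```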



&lt;p&gt;&lt;strong&gt;Pulsar Proxy Configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instance Type&lt;/strong&gt;: c5.2xlarge (8 vCPUs, 16 GiB RAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why c5.2xlarge?&lt;/strong&gt; Higher network performance and CPU for connection handling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Role&lt;/strong&gt;: Load balancing and connection management for 1M+ connections&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Performance&lt;/strong&gt;: Up to 10 Gbps (critical for high-throughput scenarios)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: ~$240/month per instance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can find the complete producer implementation in the &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events/producer-load" rel="noopener noreferrer"&gt;producer-load directory&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔧 BookKeeper Configuration - The Secret Sauce
&lt;/h2&gt;

&lt;p&gt;The complete configuration can be found in the &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events/pulsar-load" rel="noopener noreferrer"&gt;pulsar-load directory&lt;/a&gt; and the specific &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/blob/main/realtime-platform-1million-events/pulsar-load/helm/pulsar/values.yaml" rel="noopener noreferrer"&gt;values.yaml file&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Journal Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pulsar-values.yaml - BookKeeper section&lt;/span&gt;
&lt;span class="na"&gt;bookkeeper&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;configData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Journal write buffer - CRITICAL for throughput&lt;/span&gt;
    &lt;span class="na"&gt;journalMaxSizeMB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2048"&lt;/span&gt;  &lt;span class="c1"&gt;# 2GB buffer (default: 512MB)&lt;/span&gt;
    &lt;span class="na"&gt;journalMaxBackups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5"&lt;/span&gt;

    &lt;span class="c1"&gt;# Disable fsync for each write (NVMe is reliable enough)&lt;/span&gt;
    &lt;span class="na"&gt;journalSyncData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false"&lt;/span&gt;  &lt;span class="c1"&gt;# Default: true&lt;/span&gt;

    &lt;span class="c1"&gt;# Enable adaptive group writes (batch small writes)&lt;/span&gt;
    &lt;span class="na"&gt;journalAdaptiveGroupWrites&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;  &lt;span class="c1"&gt;# Default: false&lt;/span&gt;

    &lt;span class="c1"&gt;# Flush immediately when queue is empty&lt;/span&gt;
    &lt;span class="na"&gt;journalFlushWhenQueueEmpty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;  &lt;span class="c1"&gt;# Default: false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;journalMaxSizeMB: "2048"&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increased from default 512MB to 2GB&lt;/li&gt;
&lt;li&gt;Allows buffering more writes before flush&lt;/li&gt;
&lt;li&gt;Reduces fsync frequency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: 3-4x improvement in write throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;journalSyncData: "false"&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Disables fsync() after each write&lt;/li&gt;
&lt;li&gt;Relies on NVMe's own write cache and power loss protection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk&lt;/strong&gt;: Potential data loss on sudden power failure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mitigation&lt;/strong&gt;: NVMe drives have capacitor-backed cache&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: 10x reduction in write latency (10ms → 1ms)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;journalAdaptiveGroupWrites: "true"&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groups multiple small writes into batches&lt;/li&gt;
&lt;li&gt;Reduces system call overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: Improves throughput by 20-30% under high load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;journalFlushWhenQueueEmpty: "true"&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Immediately flushes when no pending writes&lt;/li&gt;
&lt;li&gt;Reduces latency for sporadic writes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: Better p99 latency during variable load&lt;/li&gt;
&lt;/ul&gt;
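&lt;p&gt;The latency claim above can be reasoned about with a toy model. The timing constants below are illustrative assumptions (roughly 10 ms per fsync and 1 ms per buffered write), not measured BookKeeper internals:&lt;/p&gt;

```python
# Toy model of journal write latency with and without per-write fsync.
# Timing constants are illustrative assumptions, not measurements.
FSYNC_MS = 10.0          # assumed cost of fsync() on the journal device
BUFFERED_WRITE_MS = 1.0  # assumed cost of a buffered write absorbed by NVMe cache

def journal_write_latency_ms(sync_data: bool) -> float:
    """Per-write latency: a buffered write, plus an fsync if journalSyncData=true."""
    return BUFFERED_WRITE_MS + (FSYNC_MS if sync_data else 0.0)

speedup = journal_write_latency_ms(True) / journal_write_latency_ms(False)
print(f"journalSyncData=false is ~{speedup:.0f}x faster per write")
```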

&lt;h3&gt;
  
  
  2. Ledger Storage Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Entry log settings&lt;/span&gt;
&lt;span class="na"&gt;entryLogSizeLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2147483648"&lt;/span&gt;  &lt;span class="c1"&gt;# 2GB per entry log file&lt;/span&gt;
&lt;span class="na"&gt;entryLogFilePreAllocationEnabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;  &lt;span class="c1"&gt;# Pre-allocate files&lt;/span&gt;

&lt;span class="c1"&gt;# Flush interval&lt;/span&gt;
&lt;span class="na"&gt;flushInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;60000"&lt;/span&gt;  &lt;span class="c1"&gt;# 60 seconds (default: 60000)&lt;/span&gt;

&lt;span class="c1"&gt;# Use RocksDB for better performance&lt;/span&gt;
&lt;span class="na"&gt;ledgerStorageClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;entryLogSizeLimit: "2147483648"&lt;/code&gt; (2GB)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Larger entry log files reduce file rotation overhead&lt;/li&gt;
&lt;li&gt;Better sequential write patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: 15% improvement in write throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;entryLogFilePreAllocationEnabled: "true"&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-allocates disk space for entry log files&lt;/li&gt;
&lt;li&gt;Eliminates file system overhead during writes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: More predictable latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;ledgerStorageClass: "DbLedgerStorage"&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses RocksDB instead of InterleavedLedgerStorage&lt;/li&gt;
&lt;li&gt;Better for high-throughput workloads&lt;/li&gt;
&lt;li&gt;Faster index lookups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: 40% improvement in random read performance&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Garbage Collection Tuning
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# GC settings optimized for high throughput&lt;/span&gt;
&lt;span class="na"&gt;minorCompactionInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3600"&lt;/span&gt;     &lt;span class="c1"&gt;# 1 hour (default: 2 hours)&lt;/span&gt;
&lt;span class="na"&gt;majorCompactionInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;86400"&lt;/span&gt;    &lt;span class="c1"&gt;# 24 hours (default: 24 hours)&lt;/span&gt;
&lt;span class="na"&gt;isForceGCAllowWhenNoSpace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;   &lt;span class="c1"&gt;# Force GC when disk full&lt;/span&gt;
&lt;span class="na"&gt;gcWaitTime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;900000"&lt;/span&gt;                &lt;span class="c1"&gt;# 15 minutes (default: 15 minutes)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High throughput generates ledgers quickly&lt;/li&gt;
&lt;li&gt;Compaction reclaims space from deleted messages&lt;/li&gt;
&lt;li&gt;Balance: too frequent compaction wastes CPU; too infrequent risks filling the disk&lt;/li&gt;
&lt;/ul&gt;
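&lt;p&gt;A rough fill-rate estimate shows why the compaction interval matters at this scale. This sketch assumes 300-byte messages at 1M msg/sec, a write quorum of 2 spread across 6 bookies, and one 3.75 TB NVMe ledger device per bookie (the i7i.8xlarge layout used in this post):&lt;/p&gt;

```python
# Rough estimate of how fast a bookie's ledger device fills at 1M msg/sec.
# Assumptions: 300-byte messages, write quorum 2, 6 bookies, 3.75 TB ledger device.
msg_rate = 1_000_000
msg_size = 300
write_quorum = 2
bookies = 6
ledger_device_tb = 3.75

per_bookie_mb_s = msg_rate * msg_size * write_quorum / bookies / 1e6  # MB/sec
hours_to_fill = ledger_device_tb * 1e6 / per_bookie_mb_s / 3600

print(f"{per_bookie_mb_s:.0f} MB/sec per bookie, device full in ~{hours_to_fill:.0f}h")
```

With the device filling in roughly half a day, an hourly minor compaction leaves plenty of headroom, while the default two-hour interval cuts that margin in half.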

&lt;h3&gt;
  
  
  4. Cache and Memory Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Broker configuration&lt;/span&gt;
&lt;span class="na"&gt;broker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;configData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Managed ledger cache (hot data in memory)&lt;/span&gt;
    &lt;span class="na"&gt;managedLedgerCacheSizeMB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512"&lt;/span&gt;  &lt;span class="c1"&gt;# 512MB per broker&lt;/span&gt;

    &lt;span class="c1"&gt;# Replication settings&lt;/span&gt;
    &lt;span class="na"&gt;managedLedgerDefaultEnsembleSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3"&lt;/span&gt;  &lt;span class="c1"&gt;# 3 bookies per ledger&lt;/span&gt;
    &lt;span class="na"&gt;managedLedgerDefaultWriteQuorum&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;   &lt;span class="c1"&gt;# Write to 2 bookies&lt;/span&gt;
    &lt;span class="na"&gt;managedLedgerDefaultAckQuorum&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;     &lt;span class="c1"&gt;# Wait for 2 ACKs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;managedLedgerCacheSizeMB: "512"&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Caches recently written messages&lt;/li&gt;
&lt;li&gt;Speeds up tailing reads (consumers close to tail)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: 50% reduction in read latency for hot data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quorum Configuration (3/2/2):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensemble: 3 bookies hold each ledger segment&lt;/li&gt;
&lt;li&gt;Write Quorum: Write to 2 bookies simultaneously&lt;/li&gt;
&lt;li&gt;Ack Quorum: Wait for 2 ACKs before acknowledging producer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-off&lt;/strong&gt;: Balance between durability and latency

&lt;ul&gt;
&lt;li&gt;3/3/3: More durable, higher latency&lt;/li&gt;
&lt;li&gt;3/2/2: Balanced (our choice)&lt;/li&gt;
&lt;li&gt;3/2/1: Fast but less durable&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
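&lt;p&gt;The ack quorum determines &lt;em&gt;which&lt;/em&gt; bookie's response time the producer waits for: with ack quorum &lt;em&gt;k&lt;/em&gt;, the producer is acknowledged once the &lt;em&gt;k&lt;/em&gt;-th fastest bookie responds. A small sketch (using a write quorum of 3 for illustration; the 3/2/2 setup above writes each entry to only 2 bookies, and the latency numbers are hypothetical):&lt;/p&gt;

```python
# With ack quorum k, producer latency is the k-th fastest bookie response.
def ack_latency_ms(bookie_latencies_ms, ack_quorum):
    """Latency seen by the producer: the ack_quorum-th fastest bookie write."""
    return sorted(bookie_latencies_ms)[ack_quorum - 1]

latencies = [1.2, 1.8, 9.5]  # hypothetical per-bookie write latencies (one slow node)

print(ack_latency_ms(latencies, 1))  # 1.2 - fastest bookie wins, least durable
print(ack_latency_ms(latencies, 2))  # 1.8 - ack quorum 2 masks the one slow bookie
print(ack_latency_ms(latencies, 3))  # 9.5 - waiting for all 3 pays for the slowest
```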

&lt;h2&gt;
  
  
  🚀 Broker Configuration for High Throughput
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;broker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt;  &lt;span class="c1"&gt;# 6 broker instances&lt;/span&gt;

  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2Gi"&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
    &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4Gi"&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;

  &lt;span class="na"&gt;configData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Increase connection limits&lt;/span&gt;
    &lt;span class="na"&gt;maxConcurrentLookupRequest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;50000"&lt;/span&gt;      &lt;span class="c1"&gt;# Default: 5000&lt;/span&gt;
    &lt;span class="na"&gt;maxConcurrentTopicLoadRequest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;50000"&lt;/span&gt;   &lt;span class="c1"&gt;# Default: 5000&lt;/span&gt;

    &lt;span class="c1"&gt;# Batch settings for better throughput&lt;/span&gt;
    &lt;span class="na"&gt;maxMessagesBatchingEnabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="na"&gt;maxNumMessagesInBatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1000"&lt;/span&gt;
    &lt;span class="na"&gt;maxBatchingDelayInMillis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10"&lt;/span&gt;

    &lt;span class="c1"&gt;# Producer settings&lt;/span&gt;
    &lt;span class="na"&gt;maxProducersPerTopic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10000"&lt;/span&gt;
    &lt;span class="na"&gt;maxConsumersPerTopic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10000"&lt;/span&gt;
    &lt;span class="na"&gt;maxConsumersPerSubscription&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10000"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key optimizations:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connection Limits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increased from 5K to 50K concurrent requests&lt;/li&gt;
&lt;li&gt;Handles high producer/consumer concurrency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: Eliminates connection throttling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Batching Configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groups messages for efficient network utilization&lt;/li&gt;
&lt;li&gt;10ms delay balances latency vs throughput&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt;: 30% improvement in network efficiency&lt;/li&gt;
&lt;/ul&gt;
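&lt;p&gt;The two batching knobs interact: a batch ships when it reaches &lt;code&gt;maxNumMessagesInBatch&lt;/code&gt; (1000) or when &lt;code&gt;maxBatchingDelayInMillis&lt;/code&gt; (10 ms) expires, whichever comes first. A simplified steady-state model of that interaction:&lt;/p&gt;

```python
# A batch is sent when it hits 1000 messages or the 10 ms window closes,
# whichever happens first (simplified steady-state model).
MAX_BATCH_MSGS = 1000
MAX_DELAY_MS = 10

def batch_profile(msgs_per_sec: float):
    """Return (messages per batch, added batching latency in ms) for a steady rate."""
    msgs_in_window = msgs_per_sec * MAX_DELAY_MS / 1000
    if msgs_in_window >= MAX_BATCH_MSGS:
        # Size-limited: batches fill before the delay window expires.
        return MAX_BATCH_MSGS, MAX_BATCH_MSGS / msgs_per_sec * 1000
    # Time-limited: the batch ships only after the full 10 ms window.
    return msgs_in_window, MAX_DELAY_MS

print(batch_profile(500_000))  # hot topic: size-limited, full batches every 2 ms
print(batch_profile(5_000))    # quiet topic: small batches pay the full 10 ms delay
```

This is why the 10 ms delay is a reasonable compromise: hot topics never feel it, while quiet topics trade a bounded 10 ms for far fewer network round trips.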

&lt;h2&gt;
  
  
  📊 Performance Monitoring
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Essential Metrics to Track
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Throughput Metrics:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Messages per second&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"msg/s"&lt;/span&gt;

&lt;span class="c"&gt;# Bytes per second&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Latency Metrics:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# BookKeeper journal write latency&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/bookkeeper shell listledgers | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-10&lt;/span&gt;

&lt;span class="c"&gt;# End-to-end producer latency (check application logs)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Storage Metrics:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# NVMe utilization&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; iostat &lt;span class="nt"&gt;-x&lt;/span&gt; 1 5

&lt;span class="c"&gt;# Disk space usage&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Critical Performance Indicators
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Target Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Message Rate&lt;/strong&gt;: 1,000,000+ msg/sec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Journal Write Latency&lt;/strong&gt;: &amp;lt; 2ms p99&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU Utilization&lt;/strong&gt;: 70-80% (brokers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Utilization&lt;/strong&gt;: 60-70% (bookies)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Utilization&lt;/strong&gt;: &amp;lt; 80% of 25 Gbps&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Grafana Dashboards
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Critical Panels:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Message Rate In/Out&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   rate(pulsar_in_messages_total[1m])
   rate(pulsar_out_messages_total[1m])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;BookKeeper Write Latency&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   histogram_quantile(0.99, 
     rate(bookie_journal_JOURNAL_ADD_ENTRY_bucket[5m])
   )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Storage Fill Rate&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   rate(bookie_ledgers_size_bytes[1h])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  💡 Lessons Learned &amp;amp; Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;NVMe Device Separation is Critical&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before optimization (single device):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write latency p99: ~15ms&lt;/li&gt;
&lt;li&gt;Throughput: ~400K msg/sec&lt;/li&gt;
&lt;li&gt;Frequent latency spikes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After optimization (separate devices):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write latency p99: ~2.1ms&lt;/li&gt;
&lt;li&gt;Throughput: 1M+ msg/sec&lt;/li&gt;
&lt;li&gt;Stable performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaway:&lt;/strong&gt; Journal and ledger I/O patterns conflict. Separate them.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Disable journalSyncData (Carefully)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Trade-off Analysis:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10x reduction in write latency&lt;/li&gt;
&lt;li&gt;2-3x increase in throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Risks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data loss on sudden power failure (rare with NVMe)&lt;/li&gt;
&lt;li&gt;Not suitable for financial transactions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IoT/telemetry data (some loss is acceptable)&lt;/li&gt;
&lt;li&gt;High-volume logs&lt;/li&gt;
&lt;li&gt;Event streaming&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When NOT to use:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Financial transactions&lt;/li&gt;
&lt;li&gt;Critical business data&lt;/li&gt;
&lt;li&gt;Compliance-regulated workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Right-Size Your Instances&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Instance comparison for Pulsar:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instance&lt;/th&gt;
&lt;th&gt;vCPU&lt;/th&gt;
&lt;th&gt;RAM&lt;/th&gt;
&lt;th&gt;NVMe&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;i3en.6xlarge&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;192GB&lt;/td&gt;
&lt;td&gt;2x 7.5TB&lt;/td&gt;
&lt;td&gt;High capacity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;i7i.8xlarge&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;256GB&lt;/td&gt;
&lt;td&gt;2x 3.75TB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Balanced&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Our choice: i7i.8xlarge&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latest generation (better CPU performance)&lt;/li&gt;
&lt;li&gt;2 NVMe devices (perfect for journal/ledger separation)&lt;/li&gt;
&lt;li&gt;Optimal balance of CPU, memory, and storage for Pulsar workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Broker and Bookie Co-location&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced network hops (broker→bookie is local)&lt;/li&gt;
&lt;li&gt;Lower latency&lt;/li&gt;
&lt;li&gt;Cost savings (fewer instances)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resource contention (CPU, memory)&lt;/li&gt;
&lt;li&gt;Harder to scale independently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Our approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Co-locate for high throughput workloads&lt;/li&gt;
&lt;li&gt;Separate for latency-sensitive applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Works well for 1M msg/sec with proper resource allocation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. &lt;strong&gt;Tune the Garbage Collection Wait Time&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Critical metric: How long messages stay in journal cache&lt;/span&gt;
&lt;span class="na"&gt;gcWaitTime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;900000"&lt;/span&gt;  &lt;span class="c1"&gt;# 15 minutes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;gcWaitTime&lt;/code&gt; sets how long BookKeeper waits between garbage-collection runs&lt;/li&gt;
&lt;li&gt;A longer interval means less GC overhead, and recently written data remains available longer for tailing consumers&lt;/li&gt;
&lt;li&gt;Too long = reclaimable space accumulates and the disk fills faster between cycles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sweet spot:&lt;/strong&gt; 10-15 minutes for streaming workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🚧 Common Pitfalls &amp;amp; How to Avoid Them
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;EBS Instead of NVMe&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Throughput caps at ~100K msg/sec, high latency&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause:&lt;/strong&gt; EBS gp3 maxes at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;16K IOPS (baseline)&lt;/li&gt;
&lt;li&gt;64K IOPS (provisioned)&lt;/li&gt;
&lt;li&gt;1 GB/s throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Use NVMe-backed instances (i7i, i4i, i3en)&lt;/p&gt;
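&lt;p&gt;The reason gp3 caps out well below its headline throughput is that journal writes are small, so the IOPS limit binds long before the 1 GB/s throughput limit. A quick back-of-the-envelope check (the 4 KiB average I/O size is an assumption for illustration):&lt;/p&gt;

```python
# With small journal writes, the gp3 IOPS cap binds before the throughput cap.
# Assumes a 4 KiB average I/O size (illustrative).
GP3_MAX_IOPS = 64_000           # provisioned maximum
GP3_MAX_THROUGHPUT_MB = 1_000   # MB/sec
io_size_kib = 4

iops_limited_mb = GP3_MAX_IOPS * io_size_kib / 1024   # throughput at the IOPS cap
effective_mb = min(iops_limited_mb, GP3_MAX_THROUGHPUT_MB)

print(f"effective gp3 write ceiling: {effective_mb:.0f} MB/sec")  # 250 MB/sec
```

Local NVMe on i7i/i4i/i3en instances offers hundreds of thousands of IOPS per device, which removes this ceiling entirely.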

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Single NVMe Device for Journal + Ledger&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Latency spikes, inconsistent throughput&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause:&lt;/strong&gt; Journal sequential writes blocked by ledger random I/O&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Use separate devices or at least separate partitions&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;journalSyncData Enabled&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Write latency &amp;gt;10ms, throughput &amp;lt;200K msg/sec&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause:&lt;/strong&gt; fsync() after every write (10ms overhead)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Disable it if your workload can tolerate rare data loss (see the trade-off analysis above)&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Insufficient Broker Count&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; High CPU on brokers, throttling, connection refused&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause:&lt;/strong&gt; Too few brokers for traffic volume&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rule of thumb: 1 broker per 150-200K msg/sec&lt;/li&gt;
&lt;li&gt;For 1M msg/sec: Minimum 5-6 brokers&lt;/li&gt;
&lt;/ul&gt;
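&lt;p&gt;The rule of thumb above is easy to encode. Using the midpoint of the 150-200K msg/sec range (an assumption for the sketch), the math lands on the 6-broker deployment used in this post:&lt;/p&gt;

```python
import math

# Rule-of-thumb broker sizing: one broker per 150-200K msg/sec.
def min_brokers(target_msg_per_sec: int, per_broker_capacity: int = 175_000) -> int:
    """Assumes ~175K msg/sec per broker (midpoint of the 150-200K rule of thumb)."""
    return math.ceil(target_msg_per_sec / per_broker_capacity)

print(min_brokers(1_000_000))  # 6 - matches the 6-broker deployment in this post
```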

&lt;h3&gt;
  
  
  5. &lt;strong&gt;Not Using Separate Node Groups&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Performance degradation when other workloads deploy&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause:&lt;/strong&gt; Resource contention with non-Pulsar pods&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Dedicated node groups with taints&lt;/p&gt;

&lt;h2&gt;
  
  
  📈 Scaling Beyond 1M msg/sec
&lt;/h2&gt;

&lt;h3&gt;
  
  
  To 2M msg/sec:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Increase broker and bookie count&lt;/span&gt;
&lt;span class="na"&gt;broker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;  &lt;span class="c1"&gt;# Up from 6&lt;/span&gt;

&lt;span class="na"&gt;bookkeeper&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;  &lt;span class="c1"&gt;# Up from 6&lt;/span&gt;

&lt;span class="c1"&gt;# Scale producer infrastructure&lt;/span&gt;
&lt;span class="na"&gt;producers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;  &lt;span class="c1"&gt;# Up from 4 (250K each = 2M total)&lt;/span&gt;
  &lt;span class="na"&gt;instanceType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;c5.4xlarge&lt;/span&gt;  &lt;span class="c1"&gt;# Keep same instance type&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cost impact:&lt;/strong&gt; ~+$6,240/month (2 more i7i.8xlarge @ $2,160 each + 4 more c5.4xlarge @ $480 each)&lt;/p&gt;
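&lt;p&gt;The cost delta above is just the sum of the added instances:&lt;/p&gt;

```python
# Breakdown of the ~$6,240/month increase for scaling from 1M to 2M msg/sec.
extra_i7i = 2 * 2_160       # 2 more i7i.8xlarge broker/bookie nodes
extra_c5 = 4 * 480          # 4 more c5.4xlarge producer nodes

print(extra_i7i + extra_c5)  # 6240
```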

&lt;h3&gt;
  
  
  To 5M msg/sec:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal scaling:&lt;/strong&gt; 15-20 brokers, 15-20 bookies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Producer scaling:&lt;/strong&gt; 20 c5.4xlarge nodes (250K each)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network:&lt;/strong&gt; Upgrade to 50-100 Gbps instances&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage:&lt;/strong&gt; Consider i4i.16xlarge (4x NVMe, 64 vCPU)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; ~$45,000-50,000/month&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💰 Cost Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Monthly Cost Breakdown (1M msg/sec)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Instance&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Cost/mo&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Producer Nodes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;c5.4xlarge&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;$1,920&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pulsar Brokers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;i7i.8xlarge&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;$12,960&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BookKeeper Bookies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;i7i.8xlarge&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;(Co-located)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ZooKeeper&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;t3.medium&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;$90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pulsar Proxy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;c5.2xlarge&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;$480&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$15,450&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Cost per message:&lt;/strong&gt; ~$0.006 per million messages (≈2.59 trillion messages in a 30-day month at a sustained 1M msg/sec)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure Breakdown:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Producer Infrastructure&lt;/strong&gt;: $1,920/month (12.4% of total)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar Core Infrastructure&lt;/strong&gt;: $13,530/month (87.6% of total)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Infrastructure&lt;/strong&gt;: $15,450/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; i7i.8xlarge cost = $12,960 ÷ 6 instances = $2,160/month per instance&lt;/p&gt;
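&lt;p&gt;The per-message cost follows from the monthly total and the sustained rate (assuming a 30-day month):&lt;/p&gt;

```python
# Cost per million messages at a sustained 1M msg/sec over a 30-day month.
monthly_cost = 15_450
msgs_per_month = 1_000_000 * 86_400 * 30   # ~2.59 trillion messages

cost_per_million = monthly_cost / (msgs_per_month / 1_000_000)
print(f"${cost_per_million:.4f} per million messages")  # $0.0060
```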

&lt;h3&gt;
  
  
  Cost Optimization Options
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Savings Plans (26% off):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3-year commitment&lt;/li&gt;
&lt;li&gt;Reduces to ~$11,433/month&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Spot Instances (60% off):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$6,180/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk:&lt;/strong&gt; Potential interruptions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mitigation:&lt;/strong&gt; Use for non-critical environments&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reserved Instances (40% off):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1-year commitment&lt;/li&gt;
&lt;li&gt;Reduces to ~$9,270/month&lt;/li&gt;
&lt;li&gt;Balance between savings and flexibility&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
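&lt;p&gt;All three discounted figures derive from the same $15,450 on-demand baseline:&lt;/p&gt;

```python
# Discounted monthly costs, derived from the $15,450 on-demand total above.
on_demand = 15_450
discounts = {"savings_plan_3yr": 0.26, "spot": 0.60, "reserved_1yr": 0.40}

for name, pct in discounts.items():
    print(f"{name}: ${on_demand * (1 - pct):,.0f}/month")
```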

&lt;h2&gt;
  
  
  🎓 Conclusion
&lt;/h2&gt;

&lt;p&gt;Achieving 1 million messages/sec with Pulsar requires:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;NVMe storage&lt;/strong&gt; with separated journal and ledger devices&lt;br&gt;
✅ &lt;strong&gt;Careful BookKeeper tuning&lt;/strong&gt; (journalSyncData, buffer sizes)&lt;br&gt;
✅ &lt;strong&gt;Right-sized instances&lt;/strong&gt; (i7i.8xlarge sweet spot)&lt;br&gt;
✅ &lt;strong&gt;Horizontal scaling&lt;/strong&gt; (6+ brokers, 6+ bookies)&lt;br&gt;
✅ &lt;strong&gt;Dedicated infrastructure&lt;/strong&gt; (node groups with taints)&lt;br&gt;
✅ &lt;strong&gt;Monitoring&lt;/strong&gt; (latency, IOPS, CPU, throughput)  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics Achieved:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput:&lt;/strong&gt; 1,040,000 msg/sec sustained&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency:&lt;/strong&gt; 0.8ms p50, 2.1ms p99&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure Cost:&lt;/strong&gt; $15,450/month ($11,433 with savings plans)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability:&lt;/strong&gt; 99.95% uptime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The secret sauce combinations:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;NVMe device separation&lt;/strong&gt; for journal vs ledgers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;journalSyncData: false&lt;/strong&gt; for 10x latency improvement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;i7i.8xlarge instances&lt;/strong&gt; for optimal price/performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broker-bookie co-location&lt;/strong&gt; for reduced network hops&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proper resource allocation&lt;/strong&gt; and monitoring&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar Load Repository&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events/pulsar-load" rel="noopener noreferrer"&gt;pulsar-load&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration Values&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/blob/main/realtime-platform-1million-events/pulsar-load/helm/pulsar/values.yaml" rel="noopener noreferrer"&gt;values.yaml&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Main Repository&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;RealtimeDataPlatform&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Pulsar Documentation&lt;/strong&gt;: &lt;a href="https://pulsar.apache.org/" rel="noopener noreferrer"&gt;pulsar.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BookKeeper Documentation&lt;/strong&gt;: &lt;a href="https://bookkeeper.apache.org/" rel="noopener noreferrer"&gt;bookkeeper.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS i7i Instances&lt;/strong&gt;: &lt;a href="https://aws.amazon.com/ec2/instance-types/i7i/" rel="noopener noreferrer"&gt;aws.amazon.com/ec2/instance-types/i7i&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Have you implemented high-throughput messaging systems? What challenges did you face with storage I/O optimization? Share your experiences in the comments!&lt;/strong&gt; 👇&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Building real-time data platforms? Follow me for deep dives on performance optimization, distributed systems, and cloud infrastructure!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next in the series&lt;/strong&gt;: "ClickHouse Performance Tuning for 1M Events/Sec Ingestion"&lt;/p&gt;




&lt;h2&gt;
  
  
  🌟 Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Storage architecture matters more than CPU&lt;/strong&gt; for messaging systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVMe device separation&lt;/strong&gt; is critical for predictable latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;journalSyncData: false&lt;/strong&gt; gives 10x performance boost (with trade-offs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right instance type&lt;/strong&gt; beats "more instances" for cost efficiency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring is essential&lt;/strong&gt; - know your bottlenecks before scaling&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #pulsar #apache #aws #performance #messaging #nvme #bookkeeper #streaming #eks #realtimedata&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>performance</category>
      <category>architecture</category>
      <category>aws</category>
    </item>
    <item>
      <title>AWS EKS Enterprise Deployment: Real-Time Data Streaming Platform - 1 Million Events/Sec</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 10:50:13 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/aws-eks-enterprise-deployment-real-time-data-streaming-platform-1-million-eventssec-p2g</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/aws-eks-enterprise-deployment-real-time-data-streaming-platform-1-million-eventssec-p2g</guid>
      <description>&lt;p&gt;When your business processes &lt;strong&gt;millions of events per second&lt;/strong&gt; - think major e-commerce platforms during Black Friday, global payment processors, or IoT fleets with millions of devices - you need infrastructure that doesn't just scale, but performs flawlessly under extreme load.&lt;/p&gt;

&lt;p&gt;In this guide, I'll show you how to deploy an &lt;strong&gt;enterprise-grade event streaming platform on AWS EKS&lt;/strong&gt; that handles &lt;strong&gt;1 million events per second&lt;/strong&gt; using high-performance compute instances, NVMe storage, and battle-tested architectural patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 What We're Building
&lt;/h2&gt;

&lt;p&gt;An enterprise-scale streaming platform that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⚡ Processes &lt;strong&gt;1,000,000+ events per second&lt;/strong&gt; in real-time&lt;/li&gt;
&lt;li&gt;🚀 Uses &lt;strong&gt;high-performance instances&lt;/strong&gt; (c5.4xlarge, i7i.8xlarge, r6id.4xlarge)&lt;/li&gt;
&lt;li&gt;💾 Leverages &lt;strong&gt;NVMe SSD storage&lt;/strong&gt; for ultra-low latency&lt;/li&gt;
&lt;li&gt;☁️ Runs on &lt;strong&gt;AWS EKS&lt;/strong&gt; with production-grade HA&lt;/li&gt;
&lt;li&gt;🌍 Supports &lt;strong&gt;multi-domain&lt;/strong&gt;: E-commerce, Finance, IoT, Gaming at scale&lt;/li&gt;
&lt;li&gt;⏱️ Delivers &lt;strong&gt;sub-second latency&lt;/strong&gt; end-to-end&lt;/li&gt;
&lt;li&gt;📊 Includes &lt;strong&gt;enterprise monitoring&lt;/strong&gt; with Grafana&lt;/li&gt;
&lt;li&gt;🔄 Provides &lt;strong&gt;exactly-once processing&lt;/strong&gt; guarantees&lt;/li&gt;
&lt;li&gt;💰 &lt;strong&gt;AWS infrastructure cost: ~$24,592/month&lt;/strong&gt; (with reserved instances)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💰 Enterprise Infrastructure Investment
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AWS Infrastructure Cost: ~$24,592/month&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This enterprise-grade investment covers high-performance compute instances (c5.4xlarge, i7i.8xlarge, r6id.4xlarge), NVMe SSD storage, enterprise monitoring, and all supporting AWS services required to process 1 million events per second with production-grade reliability. The provided Terraform deploys a single AZ to save on data-transfer costs, but the design is Multi-AZ compatible: we have verified that the Terraform can be changed to support Multi-AZ.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why enterprise instances?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;i7i.8xlarge&lt;/strong&gt;: NVMe SSD for Pulsar (ultra-low latency message storage)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;r6id.4xlarge&lt;/strong&gt;: NVMe SSD for ClickHouse (blazing-fast analytics)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;c5.4xlarge&lt;/strong&gt;: High-performance compute for Flink processing &amp;amp; event generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise HA&lt;/strong&gt;: Multi-AZ deployment compatible, replication, auto-scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🏗️ Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│                  AWS EKS Cluster (us-west-2)                     │
│              benchmark-high-infra (k8s 1.31)                     │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────────┐   ┌──────────────────┐   ┌──────────────┐ │
│  │   PRODUCER      │──▶│     PULSAR       │──▶│    FLINK     │ │
│  │  c5.4xlarge     │   │  i7i.8xlarge     │   │ c5.4xlarge   │ │
│  │                 │   │                  │   │              │ │
│  │ 4 nodes         │   │ ZK + 6 Brokers   │   │ JM + 6 TMs   │ │
│  │ Java/AVRO       │   │ NVMe Storage     │   │ 1M evt/sec   │ │
│  │ 250K evt/sec    │   │ 3.6TB NVMe       │   │ Checkpoints  │ │
│  │ 100K devices    │   │ Ultra-low lat    │   │ Aggregation  │ │
│  └─────────────────┘   └──────────────────┘   └──────┬───────┘ │
│                                                        │         │
│                         ┌──────────────────────────────┘         │
│                         ▼                                        │
│                  ┌──────────────────┐                           │
│                  │   CLICKHOUSE     │                           │
│                  │  r6id.4xlarge    │                           │
│                  │                  │                           │
│                  │  6 Data Nodes    │                           │
│                  │  1 Query Node    │                           │
│                  │  NVMe + EBS      │                           │
│                  │  10K+ queries/s  │                           │
│                  └──────────────────┘                           │
│                                                                  │
│  Supporting: VPC, Single-AZ (Multi-AZ Compatible), S3, ECR, IAM, Auto-scaling         │
└──────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tech Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes&lt;/strong&gt;: AWS EKS 1.31 (Multi-AZ Compatible, HA)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message Broker&lt;/strong&gt;: Apache Pulsar 3.1 (NVMe-backed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream Processing&lt;/strong&gt;: Apache Flink 1.18 (Exactly-once)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytics DB&lt;/strong&gt;: ClickHouse 24.x (NVMe + EBS)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: NVMe SSD (45TB) + EBS gp3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt;: Terraform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Grafana + Prometheus + VictoriaMetrics&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📋 Prerequisites
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install required tools&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;awscli terraform kubectl helm

&lt;span class="c"&gt;# Configure AWS with admin-level access&lt;/span&gt;
aws configure
&lt;span class="c"&gt;# Enter credentials for production account&lt;/span&gt;

&lt;span class="c"&gt;# Verify versions&lt;/span&gt;
terraform &lt;span class="nt"&gt;--version&lt;/span&gt;  &lt;span class="c"&gt;# &amp;gt;= 1.6.0&lt;/span&gt;
kubectl version      &lt;span class="c"&gt;# &amp;gt;= 1.28.0&lt;/span&gt;
helm version         &lt;span class="c"&gt;# &amp;gt;= 3.12.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;AWS Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Admin access to AWS account&lt;/li&gt;
&lt;li&gt;Budget: ~$25,000-33,000/month&lt;/li&gt;
&lt;li&gt;Region: us-west-2 (or your preferred region)&lt;/li&gt;
&lt;li&gt;Service limits increased for:

&lt;ul&gt;
&lt;li&gt;EKS clusters&lt;/li&gt;
&lt;li&gt;EC2 instances (especially i7i.8xlarge, r6id.4xlarge)&lt;/li&gt;
&lt;li&gt;EBS volumes&lt;/li&gt;
&lt;li&gt;Elastic IPs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  🚀 Step-by-Step Deployment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Clone Repository &amp;amp; Review Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/hyperscaledesignhub/RealtimeDataPlatform.git
&lt;span class="nb"&gt;cd &lt;/span&gt;RealtimeDataPlatform/realtime-platform-1million-events

&lt;span class="c"&gt;# Review configuration&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;terraform.tfvars
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Repository structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;realtime-platform-1million-events/
├── terraform/                # Enterprise AWS infrastructure
├── producer-load/            # High-volume event generation
├── pulsar-load/              # Apache Pulsar (NVMe-backed)
├── flink-load/               # Apache Flink enterprise processing
├── clickhouse-load/          # ClickHouse analytics cluster
└── monitoring/               # Enterprise monitoring stack
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Configuration:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# terraform.tfvars&lt;/span&gt;
&lt;span class="nx"&gt;cluster_name&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"benchmark-high-infra"&lt;/span&gt;
&lt;span class="nx"&gt;aws_region&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-west-2"&lt;/span&gt;
&lt;span class="nx"&gt;environment&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt;

&lt;span class="c1"&gt;# High-performance node groups&lt;/span&gt;
&lt;span class="nx"&gt;producer_desired_size&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;          &lt;span class="c1"&gt;# c5.4xlarge&lt;/span&gt;
&lt;span class="nx"&gt;pulsar_zookeeper_desired_size&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="c1"&gt;# t3.medium&lt;/span&gt;
&lt;span class="nx"&gt;pulsar_broker_desired_size&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;     &lt;span class="c1"&gt;# i7i.8xlarge (NVMe)&lt;/span&gt;
&lt;span class="nx"&gt;flink_taskmanager_desired_size&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="c1"&gt;# c5.4xlarge&lt;/span&gt;
&lt;span class="nx"&gt;clickhouse_desired_size&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;        &lt;span class="c1"&gt;# r6id.4xlarge (NVMe)&lt;/span&gt;

&lt;span class="c1"&gt;# Enable all services&lt;/span&gt;
&lt;span class="nx"&gt;enable_flink&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nx"&gt;enable_pulsar&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nx"&gt;enable_clickhouse&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nx"&gt;enable_general_nodes&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Deploy AWS Infrastructure with Terraform
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Initialize Terraform&lt;/span&gt;
terraform init

&lt;span class="c"&gt;# Review infrastructure plan (~$24K-33K/month)&lt;/span&gt;
terraform plan

&lt;span class="c"&gt;# Deploy infrastructure (takes ~20-25 minutes)&lt;/span&gt;
terraform apply &lt;span class="nt"&gt;-auto-approve&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What gets created:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network Layer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ VPC with Single-AZ subnets (10.1.0.0/16)&lt;/li&gt;
&lt;li&gt;✅ 2 NAT Gateways (high availability)&lt;/li&gt;
&lt;li&gt;✅ Internet Gateway&lt;/li&gt;
&lt;li&gt;✅ Route tables and security groups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;EKS Cluster:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Kubernetes 1.31 cluster&lt;/li&gt;
&lt;li&gt;✅ Control plane with HA&lt;/li&gt;
&lt;li&gt;✅ IRSA (IAM Roles for Service Accounts)&lt;/li&gt;
&lt;li&gt;✅ Logging enabled (API, Audit, Authenticator)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Node Groups (9 total):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Producer&lt;/strong&gt;: c5.4xlarge × 4 nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar ZK&lt;/strong&gt;: t3.medium × 3 nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar Broker-Bookie&lt;/strong&gt;: i7i.8xlarge × 6 nodes (3.6TB NVMe)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar Proxy&lt;/strong&gt;: t3.medium × 2 nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink JobManager&lt;/strong&gt;: c5.4xlarge × 1 node&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink TaskManager&lt;/strong&gt;: c5.4xlarge × 6 nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse Data&lt;/strong&gt;: r6id.4xlarge × 6 nodes (1.9TB NVMe each)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse Query&lt;/strong&gt;: r6id.2xlarge × 1 node&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;General&lt;/strong&gt;: t3.medium × 4 nodes&lt;/li&gt;
&lt;/ol&gt;
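Summing the desired sizes of the nine node groups above gives the cluster size to expect from &lt;code&gt;kubectl get nodes&lt;/code&gt; (a quick sanity check; autoscaling can move the exact count):

```shell
# Desired sizes of the nine node groups listed above:
# producer, pulsar-zk, broker-bookie, proxy, flink-jm,
# flink-tm, clickhouse-data, clickhouse-query, general
nodes=(4 3 6 2 1 6 6 1 4)

total=0
for n in "${nodes[@]}"; do
  total=$(( total + n ))
done
echo "Expected node count: ${total}"   # 33
```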

&lt;p&gt;&lt;strong&gt;Storage &amp;amp; Services:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ S3 bucket for Flink checkpoints&lt;/li&gt;
&lt;li&gt;✅ ECR repositories for container images&lt;/li&gt;
&lt;li&gt;✅ EBS CSI driver&lt;/li&gt;
&lt;li&gt;✅ IAM roles and policies&lt;/li&gt;
&lt;li&gt;✅ CloudWatch log groups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Configure kubectl:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws eks update-kubeconfig &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2 &lt;span class="nt"&gt;--name&lt;/span&gt; benchmark-high-infra

&lt;span class="c"&gt;# Verify cluster&lt;/span&gt;
kubectl get nodes
&lt;span class="c"&gt;# Should see ~30 nodes across all groups&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Deploy Apache Pulsar (High-Performance Message Broker)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;pulsar-load

&lt;span class="c"&gt;# Deploy Pulsar with NVMe storage&lt;/span&gt;
./deploy.sh

&lt;span class="c"&gt;# Monitor deployment (~10-15 minutes for all components)&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar &lt;span class="nt"&gt;-w&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this deploys:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ZooKeeper (Metadata Management):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 replicas on t3.medium&lt;/li&gt;
&lt;li&gt;Cluster coordination and metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Broker-BookKeeper (Combined - NVMe):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6 replicas on i7i.8xlarge instances&lt;/li&gt;
&lt;li&gt;Each node: 2 × 3.75 TB NVMe SSD (45 TB across the cluster)&lt;/li&gt;
&lt;li&gt;Message routing + persistence&lt;/li&gt;
&lt;li&gt;Ultra-low latency (~1ms writes)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Proxy (Load Balancing):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2 replicas on c5.2xlarge&lt;/li&gt;
&lt;li&gt;Client connection management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Monitoring Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grafana dashboards&lt;/li&gt;
&lt;li&gt;VictoriaMetrics for metrics&lt;/li&gt;
&lt;li&gt;Prometheus exporters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verify Pulsar cluster:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check all components are running&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar

&lt;span class="c"&gt;# Test Pulsar functionality&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/pulsar-admin topics create persistent://public/default/test-topic

&lt;span class="c"&gt;# Verify topic creation&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/pulsar-admin topics list public/default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Deploy ClickHouse (Enterprise Analytics Database)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ../clickhouse-load

&lt;span class="c"&gt;# Install ClickHouse operator and enterprise cluster&lt;/span&gt;
./00-install-clickhouse.sh

&lt;span class="c"&gt;# Wait for ClickHouse cluster (~5-8 minutes)&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; clickhouse &lt;span class="nt"&gt;-w&lt;/span&gt;

&lt;span class="c"&gt;# Create enterprise database schema&lt;/span&gt;
./00-create-schema-all-replicas.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ClickHouse Enterprise Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;6 Data Nodes&lt;/strong&gt;: r6id.4xlarge with NVMe SSD&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 Query Node&lt;/strong&gt;: r6id.2xlarge for complex analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt;: &lt;code&gt;benchmark&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table&lt;/strong&gt;: &lt;code&gt;sensors_local&lt;/code&gt; (optimized for high-throughput writes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: NVMe SSD + EBS gp3 (enterprise performance)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replication&lt;/strong&gt;: 2x across availability zones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Enterprise Schema Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- High-performance sensor data table using AVRO schema&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sensors_local&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="n"&gt;iot_cluster&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sensorId&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sensorType&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;humidity&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pressure&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;batteryLevel&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="n"&gt;DateTime64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;event_time&lt;/span&gt; &lt;span class="n"&gt;DateTime64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now64&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ReplicatedMergeTree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'/clickhouse/tables/{cluster}/sensors_local'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'{replica}'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;toYYYYMM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;index_granularity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test ClickHouse cluster:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Connect to ClickHouse cluster&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 &lt;span class="nt"&gt;--&lt;/span&gt; clickhouse-client

&lt;span class="c"&gt;# Test cluster connectivity&lt;/span&gt;
SELECT &lt;span class="k"&gt;*&lt;/span&gt; FROM system.clusters WHERE cluster &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'iot_cluster'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c"&gt;# Exit with Ctrl+D&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Deploy Apache Flink (Enterprise Stream Processing)
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;build-and-push.sh&lt;/code&gt; script creates the ECR repository if you don't already have one, pushes the Flink image to it, and prints the Docker image name tagged with the ECR repo.&lt;/p&gt;

&lt;p&gt;Set that image name in the &lt;code&gt;flink-job-deployment.yaml&lt;/code&gt; file before deploying the Flink job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ../flink-load

&lt;span class="c"&gt;# Build and push enterprise Flink image to ECR&lt;/span&gt;
./build-and-push.sh

&lt;span class="c"&gt;# Deploy Flink enterprise cluster&lt;/span&gt;
./deploy.sh

&lt;span class="c"&gt;# Submit high-throughput Flink job&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; flink-job-deployment.yaml

&lt;span class="c"&gt;# Monitor Flink deployment (~3-5 minutes)&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &lt;span class="nt"&gt;-w&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Enterprise Flink Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JobManager&lt;/strong&gt;: c5.4xlarge × 1 (job coordination)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TaskManager&lt;/strong&gt;: c5.4xlarge × 6 (parallel processing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallelism&lt;/strong&gt;: 48 (8 slots × 6 TaskManagers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checkpointing&lt;/strong&gt;: Every 1 minute to S3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State Backend&lt;/strong&gt;: RocksDB with NVMe storage&lt;/li&gt;
&lt;/ul&gt;
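The parallelism figure follows directly from the slot layout, and dividing the 1M events/sec target by it gives the load each subtask must sustain (simple arithmetic, not output from the repo):

```shell
slots_per_tm=8        # task slots per TaskManager
taskmanagers=6        # c5.4xlarge TaskManager nodes
target_rate=1000000   # events/sec end-to-end target

parallelism=$(( slots_per_tm * taskmanagers ))
per_subtask=$(( target_rate / parallelism ))
echo "Parallelism: ${parallelism}, ~${per_subtask} events/sec per subtask"
```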

&lt;p&gt;&lt;strong&gt;Flink Job Configuration:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Enterprise-grade stream processing using SensorData AVRO schema&lt;/span&gt;
&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;SensorRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sensorStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pulsarSource&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;WatermarkStrategy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;forBoundedOutOfOrderness&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Duration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofSeconds&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;)),&lt;/span&gt;
    &lt;span class="s"&gt;"Pulsar Enterprise IoT Source"&lt;/span&gt;
&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// High-throughput processing with 1-minute windows&lt;/span&gt;
&lt;span class="n"&gt;sensorStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;keyBy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSensorId&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TumblingEventTimeWindows&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;aggregate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;EnterpriseAggregator&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ClickHouseJDBCSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clickhouseUrl&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 6: Deploy High-Volume IoT Producer
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ../producer-load

&lt;span class="c"&gt;# Build and deploy enterprise producer&lt;/span&gt;
./deploy-with-partitions.sh &lt;span class="o"&gt;[&lt;/span&gt;PARTITIONS] &lt;span class="o"&gt;[&lt;/span&gt;MIN_REPLICAS] &lt;span class="o"&gt;[&lt;/span&gt;MAX_REPLICAS]

&lt;span class="c"&gt;#First run this script, if flink job is not running then this is just #going to create pulsar topic with partitions of 64.And it is going #to set the storage retention time of 30 minutes &lt;/span&gt;

&lt;span class="c"&gt;#In our case following is the command:&lt;/span&gt;

./deploy-with-partitions.sh 64 1 4



&lt;span class="c"&gt;#Then deploy flink job as mentioned in below sections and come back #here and again run the same command:&lt;/span&gt;

&lt;span class="c"&gt;#This is going to just create only one producer, because we don't #want to bombard cluster with millions of message at the same time&lt;/span&gt;

./deploy-with-partitions.sh 64 1 4

&lt;span class="c"&gt;#After first producer is producing messages consistently then run the &lt;/span&gt;
&lt;span class="c"&gt;#below script which gradually start rest of the producers&lt;/span&gt;

&lt;span class="c"&gt;# Scale producers gradually with a delay of 1 minute until reached to # 4 producers (4 nodes × 250K each)&lt;/span&gt;
./scale-gradually.sh &lt;span class="o"&gt;[&lt;/span&gt;MAX_REPLICAS]
&lt;span class="c"&gt;#In our case following is the command:&lt;/span&gt;

./scale-gradually.sh 4

&lt;span class="c"&gt;# Monitor producer performance&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;iot-producer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Enterprise Producer Capabilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt;: 250,000 events/sec per pod&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale:&lt;/strong&gt; 4 pods (4 × 250K) for 1M+ events/sec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AVRO Schema&lt;/strong&gt;: Enterprise SensorData with optimized integers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Device Simulation&lt;/strong&gt;: 100,000 unique device IDs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Realistic Patterns&lt;/strong&gt;: Battery drain, temperature variations, device lifecycle&lt;/li&gt;
&lt;/ul&gt;
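The per-pod and aggregate numbers above imply a simple sizing rule: divide the target rate by the per-pod throughput and round up. A quick sanity check in shell arithmetic:

```shell
# Back-of-the-envelope producer sizing: ceil(target / per-pod throughput).
TARGET_EPS=1000000      # desired aggregate events/sec
PER_POD_EPS=250000      # throughput of one producer pod (from the list above)

PODS=$(( (TARGET_EPS + PER_POD_EPS - 1) / PER_POD_EPS ))  # ceiling division
echo "pods needed: $PODS"    # 4 pods for 1M events/sec
```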

&lt;h2&gt;
  
  
  📊 Step 7: Verify Enterprise Performance
&lt;/h2&gt;

&lt;p&gt;After all components are deployed (~25-30 minutes total), verify 1M events/sec performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Monitor producer throughput&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;iot-producer &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20 | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"Events produced"&lt;/span&gt;

&lt;span class="c"&gt;# Check Pulsar message ingestion rate&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data

&lt;span class="c"&gt;# Verify Flink processing rate&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark deployment/iot-flink-job &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20

&lt;span class="c"&gt;# Query ClickHouse for ingestion rate&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  clickhouse-client &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"
    SELECT 
        toStartOfMinute(timestamp) as minute,
        COUNT(*) as events_per_minute
    FROM benchmark.sensors_local 
    WHERE timestamp &amp;gt;= now() - INTERVAL 5 MINUTE
    GROUP BY minute 
    ORDER BY minute DESC"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected Performance Metrics:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Producer: 1,000,000+ events/sec generation
✅ Pulsar: Ultra-low latency message ingestion (~1ms)
✅ Flink: Real-time processing
✅ ClickHouse: High-speed data ingestion and sub-second queries

The end-to-end pipeline guarantees exactly-once semantics by using the ReplacingMergeTree engine for the ClickHouse tables
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
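The exactly-once claim leans on ClickHouse's ReplacingMergeTree: rows sharing the same ordering key collapse to one at merge time, so events replayed after a Flink restart do not double-count. A toy shell emulation of that collapse (the key columns here are assumptions based on the sensor schema):

```shell
# Toy emulation of ReplacingMergeTree deduplication: rows with an identical
# key (sensorId, timestamp, value) survive only once, so a replayed batch
# does not inflate counts.
events='1001 1635254400123 24.5
1002 1635254400123 19.8
1001 1635254400123 24.5'   # last row is a replayed duplicate

total=$(printf '%s\n' "$events" | wc -l | tr -d ' ')
unique=$(printf '%s\n' "$events" | sort -u | wc -l | tr -d ' ')
echo "raw=$total deduped=$unique"
```

Note that in ClickHouse the collapse happens asynchronously at merge time, so queries that must see deduplicated data use `FINAL` or aggregate around duplicates.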



&lt;h2&gt;
  
  
  🔍 Enterprise Monitoring and Analytics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Access Enterprise Grafana Dashboard
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set up secure port forwarding&lt;/span&gt;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar svc/grafana 3000:80 &amp;amp;

&lt;span class="c"&gt;# Open enterprise dashboard&lt;/span&gt;
open http://localhost:3000
&lt;span class="c"&gt;# Login: admin/admin123&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzoee7cqlcn5rl29mzyf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzoee7cqlcn5rl29mzyf.png" alt=" " width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise Dashboards:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar Metrics&lt;/strong&gt;: Message rates, storage usage, replication lag&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink Metrics&lt;/strong&gt;: Job health, checkpoint duration, backpressure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse Metrics&lt;/strong&gt;: Query performance, replication status, storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt;: CPU, memory, disk I/O, network across all nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Enterprise Analytics Queries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Connect to ClickHouse enterprise cluster&lt;/span&gt;
&lt;span class="n"&gt;kubectl&lt;/span&gt; &lt;span class="k"&gt;exec&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="n"&gt;clickhouse&lt;/span&gt; &lt;span class="n"&gt;chi&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;iot&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;repl&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;iot&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="c1"&gt;-- clickhouse-client&lt;/span&gt;

&lt;span class="c1"&gt;-- Enterprise-scale analytics using our SensorData AVRO schema&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Real-time throughput monitoring&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;toStartOfMinute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;minute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;events_per_minute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;unique_sensors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batteryLevel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_battery&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;HOUR&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;minute&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;minute&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Enterprise anomaly detection&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sensorType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;batteryLevel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;timestamp&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;batteryLevel&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;MINUTE&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- High-performance aggregations across millions of records&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;sensorType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_readings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p95_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;humidity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_humidity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batteryLevel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;min_battery&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batteryLevel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;max_battery&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sensorType&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_readings&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Enterprise time-series analysis&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;toStartOfHour&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sensorType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hourly_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;stddevPop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;temp_stddev&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="n"&gt;HOUR&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sensorType&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sensorType&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📈 Enterprise Performance Benchmarks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Real-World Enterprise Metrics
&lt;/h3&gt;

&lt;p&gt;On this enterprise-grade setup, you can expect:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Peak Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,000,000+ events/sec&lt;/td&gt;
&lt;td&gt;Sustained with room for 2M+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;End-to-end Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 2 seconds (p99)&lt;/td&gt;
&lt;td&gt;Producer → ClickHouse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 200ms&lt;/td&gt;
&lt;td&gt;Complex aggregations on 1B+ records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Write Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 1ms&lt;/td&gt;
&lt;td&gt;Pulsar NVMe storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Utilization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70-80%&lt;/td&gt;
&lt;td&gt;Optimized across all instances&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~85%&lt;/td&gt;
&lt;td&gt;High-memory instances (r6id)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage IOPS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50,000+&lt;/td&gt;
&lt;td&gt;NVMe SSD performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;99.95%+&lt;/td&gt;
&lt;td&gt;Single-AZ enterprise deployment (can be switched to Multi-AZ in Terraform with the same performance)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
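The throughput and latency rows can be cross-checked with Little's Law: at arrival rate λ and end-to-end latency W, roughly λ·W events are in flight across the pipeline (Pulsar backlog, Flink state, insert buffers) at any instant. A quick back-of-the-envelope:

```shell
# Little's Law sanity check for the benchmark table: L = rate * latency.
RATE=1000000   # events/sec (peak throughput row)
LATENCY_S=2    # end-to-end p99 latency in seconds

IN_FLIGHT=$((RATE * LATENCY_S))
echo "approx in-flight events at p99: $IN_FLIGHT"   # 2,000,000
```

Two million buffered events is well within what the 64-partition Pulsar topic and NVMe-backed brokers absorb, which is why the pipeline sustains the peak rate without backpressure.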

&lt;h3&gt;
  
  
  Enterprise Use Cases Supported
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;E-Commerce at Scale:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Black Friday traffic: 10M+ orders/hour&lt;/li&gt;
&lt;li&gt;Real-time inventory across 1000+ warehouses&lt;/li&gt;
&lt;li&gt;Personalization for 100M+ users&lt;/li&gt;
&lt;li&gt;Fraud detection on every transaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Financial Services:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-frequency trading: microsecond latency&lt;/li&gt;
&lt;li&gt;Risk calculations on 1M+ portfolios&lt;/li&gt;
&lt;li&gt;Real-time compliance monitoring&lt;/li&gt;
&lt;li&gt;Market data processing at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;IoT Enterprise:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fleet management: 1M+ connected vehicles&lt;/li&gt;
&lt;li&gt;Smart city infrastructure: millions of sensors&lt;/li&gt;
&lt;li&gt;Industrial IoT: factory-wide monitoring&lt;/li&gt;
&lt;li&gt;Predictive maintenance at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🛠️ Enterprise Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  High-Load Performance Issues
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check node resource utilization&lt;/span&gt;
kubectl top nodes | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-k3&lt;/span&gt; &lt;span class="nt"&gt;-nr&lt;/span&gt;

&lt;span class="c"&gt;# Identify resource bottlenecks&lt;/span&gt;
kubectl describe nodes | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A5&lt;/span&gt; &lt;span class="s2"&gt;"Allocated resources"&lt;/span&gt;

&lt;span class="c"&gt;# Scale TaskManagers for higher throughput&lt;/span&gt;
kubectl scale deployment flink-taskmanager &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;12

&lt;span class="c"&gt;# Monitor Flink backpressure&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &amp;lt;jobmanager-pod&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  flink list &lt;span class="nt"&gt;-r&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
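To turn the `kubectl top nodes` output above into an action, it helps to flag nodes exceeding the ~80% CPU target from the benchmarks table. A small sketch using captured sample output (an assumption stands in for live data; pipe real `kubectl top nodes` output through the same awk filter):

```shell
# Flag nodes whose CPU% exceeds a threshold, from `kubectl top nodes`-style
# output. Sample data here; replace with: kubectl top nodes | awk ...
sample='NAME        CPU(cores)  CPU%  MEMORY(bytes)  MEMORY%
node-a      7200m       90%   55Gi           70%
node-b      3100m       39%   30Gi           40%'

hot=$(printf '%s\n' "$sample" | awk 'NR>1 { gsub(/%/,"",$3); if ($3+0 > 80) print $1 }')
echo "hot nodes: $hot"
```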



&lt;h3&gt;
  
  
  NVMe Storage Performance
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check NVMe disk performance&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  iostat &lt;span class="nt"&gt;-x&lt;/span&gt; 1 5

&lt;span class="c"&gt;# Monitor ClickHouse storage usage&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  clickhouse-client &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"
    SELECT 
        name,
        total_space,
        free_space,
        (total_space - free_space) / total_space * 100 as usage_percent
    FROM system.disks"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
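The `usage_percent` expression in that query is plain arithmetic; the same calculation in shell, with hypothetical byte counts for one NVMe volume:

```shell
# Same usage_percent calculation as the system.disks query:
# (total - free) / total * 100, with made-up byte counts.
TOTAL=1000000000000   # 1 TB volume
FREE=250000000000     # 250 GB free

USED=$((TOTAL - FREE))
PCT=$((USED * 100 / TOTAL))
echo "usage: ${PCT}%"   # 75%
```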



&lt;h3&gt;
  
  
  Network Performance Optimization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check inter-pod network latency&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  ping &lt;span class="nt"&gt;-c&lt;/span&gt; 5 flink-jobmanager.flink-benchmark.svc.cluster.local

&lt;span class="c"&gt;# Monitor network bandwidth&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &amp;lt;taskmanager-pod&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  iftop &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧹 Enterprise Cleanup
&lt;/h2&gt;

&lt;p&gt;When decommissioning the enterprise setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Graceful shutdown of applications&lt;/span&gt;
kubectl delete namespace iot-pipeline flink-benchmark

&lt;span class="c"&gt;# Backup critical data before destroying infrastructure&lt;/span&gt;
./backup-clickhouse.sh
./backup-flink-savepoints.sh

&lt;span class="c"&gt;# Destroy AWS infrastructure&lt;/span&gt;
terraform destroy
&lt;span class="c"&gt;# Type 'yes' when prompted&lt;/span&gt;

&lt;span class="c"&gt;# Verify all resources are cleaned up&lt;/span&gt;
aws ec2 describe-instances &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="s2"&gt;"Name=tag:kubernetes.io/cluster/benchmark-high-infra,Values=owned"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;⚠️ Enterprise Warning:&lt;/strong&gt; Ensure all critical data is backed up before destruction!&lt;/p&gt;

&lt;h2&gt;
  
  
  💡 Enterprise Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Cost Optimization with Reserved Instances&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Purchase 3-year reserved instances for 26% savings&lt;/span&gt;
&lt;span class="c"&gt;# Target instances: i7i.8xlarge, r6id.4xlarge, c5.4xlarge&lt;/span&gt;

&lt;span class="c"&gt;# AWS Console → EC2 → Reserved Instances → Purchase&lt;/span&gt;
&lt;span class="c"&gt;# - Term: 3 years&lt;/span&gt;
&lt;span class="c"&gt;# - Payment: All upfront (max discount)&lt;/span&gt;
&lt;span class="c"&gt;# - Instance type: i7i.8xlarge, r6id.4xlarge&lt;/span&gt;
&lt;span class="c"&gt;# - Quantity: Match your desired_size&lt;/span&gt;

&lt;span class="c"&gt;# Savings: $33,016 → $24,592/month (26% off)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
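The quoted figures can be sanity-checked: $33,016 minus $24,592 leaves $8,424/month saved, which rounds to the advertised 26%:

```shell
# Verify the reserved-instance savings math from the comment above.
ON_DEMAND=33016   # $/month on demand
RESERVED=24592    # $/month with 3-year all-upfront RIs

SAVED=$((ON_DEMAND - RESERVED))
# Rounded percentage: (saved*100 + half of on_demand) / on_demand
PCT=$(( (SAVED * 100 + ON_DEMAND / 2) / ON_DEMAND ))
echo "monthly savings: \$$SAVED (~${PCT}%)"
```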



&lt;h3&gt;
  
  
  2. &lt;strong&gt;Enterprise Backup Strategy&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Automated EBS snapshots&lt;/span&gt;
aws backup create-backup-plan &lt;span class="nt"&gt;--backup-plan-name&lt;/span&gt; daily-snapshots

&lt;span class="c"&gt;# ClickHouse enterprise backups to S3&lt;/span&gt;
clickhouse-backup create
clickhouse-backup upload

&lt;span class="c"&gt;# Flink savepoints for exactly-once recovery&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &amp;lt;jm-pod&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  flink savepoint &amp;lt;job-id&amp;gt; s3://benchmark-high-infra-state/savepoints
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Enterprise Alerting&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CloudWatch Alarms for enterprise monitoring&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CPU &amp;gt; 80% sustained for 5 minutes&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Disk usage &amp;gt; 85%&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Pod crash loops &amp;gt; 3 in 10 minutes&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Flink checkpoint failures&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Pulsar consumer lag &amp;gt; 1M messages&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ClickHouse replication lag &amp;gt; 5 minutes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. &lt;strong&gt;Disaster Recovery Implementation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Multi-Region Setup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy identical stack in secondary region&lt;/span&gt;
aws_region &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
cluster_name &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"benchmark-high-infra-dr"&lt;/span&gt;

&lt;span class="c"&gt;# Use Pulsar geo-replication&lt;/span&gt;
bin/pulsar-admin namespaces set-clusters public/default &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--clusters&lt;/span&gt; us-west-2,us-east-1

&lt;span class="c"&gt;# ClickHouse cross-region replication&lt;/span&gt;
CREATE TABLE benchmark.sensors_replicated
ENGINE &lt;span class="o"&gt;=&lt;/span&gt; ReplicatedMergeTree&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'/clickhouse/tables/{cluster}/sensors'&lt;/span&gt;, &lt;span class="s1"&gt;'{replica}'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Enterprise Recovery Objectives:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RTO (Recovery Time Objective): &amp;lt; 1 hour&lt;/li&gt;
&lt;li&gt;RPO (Recovery Point Objective): &amp;lt; 5 minutes&lt;/li&gt;
&lt;li&gt;Automated daily backups to S3&lt;/li&gt;
&lt;li&gt;Cross-region replication for critical data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. &lt;strong&gt;Cost Monitoring and Governance&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set up AWS Cost Explorer with enterprise tags&lt;/span&gt;
&lt;span class="c"&gt;# Tag all resources:&lt;/span&gt;
&lt;span class="c"&gt;# - Environment: production&lt;/span&gt;
&lt;span class="c"&gt;# - Project: streaming-platform&lt;/span&gt;
&lt;span class="c"&gt;# - Team: data-engineering&lt;/span&gt;
&lt;span class="c"&gt;# - CostCenter: engineering&lt;/span&gt;

&lt;span class="c"&gt;# Create enterprise budget alert&lt;/span&gt;
aws budgets create-budget &lt;span class="nt"&gt;--budget&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--account-id&lt;/span&gt; 123456789 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--budget-name&lt;/span&gt; streaming-platform-monthly &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--budget-limit&lt;/span&gt; &lt;span class="nv"&gt;Amount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;30000,Unit&lt;span class="o"&gt;=&lt;/span&gt;USD

&lt;span class="c"&gt;# Alert if cost &amp;gt; $30K/month&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎓 What You've Built
&lt;/h2&gt;

&lt;p&gt;By following this guide, you've deployed:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Enterprise-grade infrastructure&lt;/strong&gt; handling 1M events/sec&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;High-performance compute&lt;/strong&gt; with NVMe storage&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Exactly-once processing&lt;/strong&gt; with Flink checkpointing&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Multi-AZ Compatible high availability&lt;/strong&gt; with auto-recovery&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Production monitoring&lt;/strong&gt; with Grafana dashboards&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Auto-scaling&lt;/strong&gt; for dynamic workloads&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Security &amp;amp; compliance&lt;/strong&gt; with encryption and RBAC&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Cost optimization&lt;/strong&gt; with reserved instances  &lt;/p&gt;
&lt;h2&gt;
  
  
  🚀 Next Steps
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. &lt;strong&gt;Customize for Your Enterprise Domain&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;E-Commerce (High Scale):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Order events at 1M/sec using AVRO schema&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s"&gt;"order_id"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"ORD-1234567"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"customer_id"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"CUST-99999"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"items"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;[...],&lt;/span&gt;
  &lt;span class="s"&gt;"total_amount"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1299.99&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"timestamp"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"2025-10-26T10:00:00Z"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Finance (Trading):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Market data at 1M/sec&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s"&gt;"symbol"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"AAPL"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"price"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;175.50&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"volume"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"exchange"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"NASDAQ"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
  &lt;span class="s"&gt;"timestamp"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"2025-10-26T10:00:00.123Z"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;IoT (Massive Scale):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Sensor telemetry from millions of devices&lt;/span&gt;
&lt;span class="c1"&gt;// Using our optimized SensorData AVRO schema&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s"&gt;"sensorId"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000001&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"sensorType"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// temperature sensor&lt;/span&gt;
  &lt;span class="s"&gt;"temperature"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;24.5&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"humidity"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;68.2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"pressure"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1013.25&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"batteryLevel"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;87.5&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// online&lt;/span&gt;
  &lt;span class="s"&gt;"timestamp"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1635254400123&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
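&lt;p&gt;Why AVRO at this scale? Binary encodings are dramatically smaller than JSON text. Avro's real encoding is varint-based and schema-driven; the fixed-width &lt;code&gt;struct&lt;/code&gt; layout below is only a stand-in to show the size gap for the exact event above:&lt;/p&gt;

```python
import json
import struct

# The IoT event above, as a Python dict
record = {
    "sensorId": 1000001, "sensorType": 1, "temperature": 24.5,
    "humidity": 68.2, "pressure": 1013.25, "batteryLevel": 87.5,
    "status": 1, "timestamp": 1635254400123,
}

json_size = len(json.dumps(record).encode("utf-8"))

# Fixed-width binary layout standing in for Avro's compact encoding:
# long, int, 4 doubles, int, long -> 8 + 4 + 32 + 4 + 8 = 56 bytes
binary = struct.pack(
    "<qiddddiq",
    record["sensorId"], record["sensorType"], record["temperature"],
    record["humidity"], record["pressure"], record["batteryLevel"],
    record["status"], record["timestamp"],
)
print(f"JSON: {json_size} bytes, binary: {len(binary)} bytes")
```

&lt;p&gt;Multiplied by a million events per second, that difference is the gap between saturating your network and having comfortable headroom.&lt;/p&gt;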



&lt;h3&gt;
  
  
  2. &lt;strong&gt;Implement Advanced Enterprise Analytics&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Real-time anomaly detection&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;anomaly_detection&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;stddevPop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;stddev_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;avg_temp&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;stddev_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;is_anomaly&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Enterprise windowed aggregations&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;hourly_metrics&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;toStartOfHour&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;event_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;max_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;min_temp&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
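&lt;p&gt;The 3-sigma rule behind the anomaly view is worth seeing in isolation. A plain-Python sketch using the same threshold as the SQL above (&lt;code&gt;statistics.pstdev&lt;/code&gt; matches ClickHouse's &lt;code&gt;stddevPop&lt;/code&gt;):&lt;/p&gt;

```python
from statistics import mean, pstdev

def is_anomaly(readings, value, sigmas=3.0):
    """Flag value as anomalous if it exceeds mean + sigmas * population
    stddev, mirroring the avg/stddevPop threshold in the view above."""
    avg = mean(readings)
    sd = pstdev(readings)  # population stddev, like ClickHouse stddevPop
    return value > avg + sigmas * sd

normal = [24.1, 24.5, 24.3, 24.7, 24.4, 24.6, 24.2]
print(is_anomaly(normal, 24.9))  # within 3 sigma of the history
print(is_anomaly(normal, 30.0))  # far outside it
```

&lt;p&gt;The materialized view simply pushes this computation down into ClickHouse so it runs incrementally as rows arrive.&lt;/p&gt;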



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Add Machine Learning at Scale&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Real-time ML inference with Flink
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyflink.datastream&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StreamExecutionEnvironment&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyflink.ml&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KMeans&lt;/span&gt;

&lt;span class="c1"&gt;# Load trained model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://models/anomaly-detection&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Apply to 1M events/sec stream
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sensor_stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
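&lt;p&gt;Framework details aside, the pattern is always the same: load the model once per worker, then apply it per event in a map. A framework-free Python version of that pattern, where the scoring function is a hypothetical stand-in for a real trained model:&lt;/p&gt;

```python
def load_model():
    """Stand-in for loading a trained model from object storage,
    done once per worker rather than once per event."""
    threshold = 28.0  # hypothetical learned decision boundary
    return lambda event: {"sensorId": event["sensorId"],
                          "anomaly": event["temperature"] > threshold}

model = load_model()  # loaded once, reused for every event

events = [
    {"sensorId": 1, "temperature": 24.5},
    {"sensorId": 2, "temperature": 31.2},
]
predictions = list(map(model, events))  # per-event inference, as in stream.map(...)
print(predictions)
```

&lt;p&gt;In Flink this once-per-worker load typically lives in a rich function's &lt;code&gt;open()&lt;/code&gt; hook, so model deserialization never sits on the per-event hot path.&lt;/p&gt;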



&lt;h3&gt;
  
  
  4. &lt;strong&gt;Expand to Multi-Region Enterprise&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy to additional regions for global presence&lt;/span&gt;
&lt;span class="c"&gt;# us-west-2 (primary)&lt;/span&gt;
&lt;span class="c"&gt;# us-east-1 (DR)&lt;/span&gt;
&lt;span class="c"&gt;# eu-west-1 (Europe)&lt;/span&gt;
&lt;span class="c"&gt;# ap-southeast-1 (Asia)&lt;/span&gt;

&lt;span class="c"&gt;# Enable Pulsar geo-replication&lt;/span&gt;
&lt;span class="c"&gt;# Configure ClickHouse distributed tables&lt;/span&gt;
&lt;span class="c"&gt;# Use Route53 for global load balancing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
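&lt;p&gt;Global load balancing ultimately reduces to routing each client to the healthy region with the lowest latency, which is what Route53 latency-based routing automates. A toy Python version, with latency figures that are purely illustrative:&lt;/p&gt;

```python
def pick_region(latency_ms, healthy):
    """Choose the healthy region with the lowest measured latency."""
    candidates = {r: ms for r, ms in latency_ms.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)

# Hypothetical client-side latency measurements (ms)
latency = {"us-west-2": 35, "us-east-1": 80,
           "eu-west-1": 140, "ap-southeast-1": 210}

print(pick_region(latency, {"us-west-2", "us-east-1"}))
# Failover: when the primary drops out, traffic shifts to the DR region
print(pick_region(latency, {"us-east-1", "eu-west-1"}))
```

&lt;p&gt;Health checks feed the &lt;code&gt;healthy&lt;/code&gt; set; Pulsar geo-replication is what makes the failover region actually have the data.&lt;/p&gt;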



&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise Repository&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-1million-events" rel="noopener noreferrer"&gt;realtime-platform-1million-events&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Main Repository&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;RealtimeDataPlatform&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS EKS Best Practices&lt;/strong&gt;: &lt;a href="https://aws.github.io/aws-eks-best-practices/" rel="noopener noreferrer"&gt;aws.github.io/aws-eks-best-practices&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Flink Production Guide&lt;/strong&gt;: &lt;a href="https://flink.apache.org/deployment" rel="noopener noreferrer"&gt;flink.apache.org/deployment&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Pulsar Operations&lt;/strong&gt;: &lt;a href="https://pulsar.apache.org/docs/administration-pulsar-manager" rel="noopener noreferrer"&gt;pulsar.apache.org/docs/administration-pulsar-manager&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse Operations&lt;/strong&gt;: &lt;a href="https://clickhouse.com/docs/operations" rel="noopener noreferrer"&gt;clickhouse.com/docs/operations&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💬 Conclusion
&lt;/h2&gt;

&lt;p&gt;You now have an &lt;strong&gt;enterprise-grade, production-ready streaming platform&lt;/strong&gt; processing &lt;strong&gt;1 million events per second&lt;/strong&gt; on AWS! This setup demonstrates real-world architecture patterns used by Fortune 500 companies processing billions of events per day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Achievements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🚀 &lt;strong&gt;1M events/sec throughput&lt;/strong&gt; with room to scale to 2M+&lt;/li&gt;
&lt;li&gt;⚡ &lt;strong&gt;Sub-second typical latency&lt;/strong&gt; end-to-end (&amp;lt; 2 seconds at p99)
&lt;/li&gt;
&lt;li&gt;💪 &lt;strong&gt;Enterprise HA&lt;/strong&gt; with multi-AZ deployment and auto-recovery&lt;/li&gt;
&lt;li&gt;💰 &lt;strong&gt;Cost-optimized&lt;/strong&gt; at $24,592/month (with reserved instances)&lt;/li&gt;
&lt;li&gt;🔒 &lt;strong&gt;Production-secure&lt;/strong&gt; with encryption and compliance&lt;/li&gt;
&lt;li&gt;📊 &lt;strong&gt;Observable&lt;/strong&gt; with comprehensive monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This platform can handle:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Black Friday e-commerce traffic (millions of orders/hour)&lt;/li&gt;
&lt;li&gt;Global payment processing (thousands of transactions/sec)&lt;/li&gt;
&lt;li&gt;IoT fleets (millions of devices sending data)&lt;/li&gt;
&lt;li&gt;Real-time gaming analytics (millions of player events)&lt;/li&gt;
&lt;li&gt;Financial market data (high-frequency trading)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Enterprise benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NVMe storage&lt;/strong&gt; for ultra-low latency message persistence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-performance instances&lt;/strong&gt; optimized for streaming workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AVRO schema optimization&lt;/strong&gt; for efficient serialization at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-AZ deployment&lt;/strong&gt; ensuring 99.95%+ availability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exactly-once processing&lt;/strong&gt; guarantees for financial-grade accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What enterprise use case would you build on this platform? Share in the comments!&lt;/strong&gt; 👇&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building enterprise data platforms? Follow me for deep dives on real-time streaming, cloud architecture, and production system design!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next in the series&lt;/strong&gt;: "Multi-Region Deployment - Global Real-Time Data Platform"&lt;/p&gt;




&lt;h2&gt;
  
  
  🌟 Enterprise Support
&lt;/h2&gt;

&lt;p&gt;⭐ &lt;strong&gt;Production-tested&lt;/strong&gt; - Handles 1M+ events/sec in real deployments&lt;br&gt;
🏢 &lt;strong&gt;Enterprise-ready&lt;/strong&gt; - Multi-AZ, HA, DR, compliance&lt;br&gt;
📖 &lt;strong&gt;Fully documented&lt;/strong&gt; - Complete runbooks and guides&lt;br&gt;
🔧 &lt;strong&gt;Professional support&lt;/strong&gt; - Available for production deployments&lt;br&gt;
💼 &lt;strong&gt;Consulting&lt;/strong&gt; - Custom implementation and optimization&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 Enterprise Performance Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Peak Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,000,000 events/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;End-to-End Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 2 seconds (p99)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monthly Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$24,592 (reserved instances)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;99.95% (multi-AZ)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Retention&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30 days (configurable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 200ms (complex aggregations)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;250K → 2M+ events/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Recovery Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 1 hour (DR failover)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #aws #eks #enterprise #streaming #dataengineering #pulsar #flink #clickhouse #production #avro #realtimeanalytics #nvme&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>performance</category>
      <category>architecture</category>
      <category>aws</category>
    </item>
    <item>
      <title>AWS EKS Deployment: Real-Time Data Streaming Platform - 50K Events/Sec for $1,250/Month</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 10:41:18 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/aws-eks-deployment-real-time-data-streaming-platform-50k-eventssec-for-1250month-1ljf</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/aws-eks-deployment-real-time-data-streaming-platform-50k-eventssec-for-1250month-1ljf</guid>
      <description>&lt;p&gt;Building real-time streaming platforms can be expensive. Most production setups cost thousands of dollars per month. But what if you need to process &lt;strong&gt;50,000 events per second&lt;/strong&gt; at a reasonable cost?&lt;/p&gt;

&lt;p&gt;In this guide, I'll show you how to deploy a &lt;strong&gt;production-grade event streaming platform on AWS EKS&lt;/strong&gt; that costs &lt;strong&gt;$1,250/month&lt;/strong&gt; while handling moderate-scale real-time data processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 What We're Building
&lt;/h2&gt;

&lt;p&gt;A complete, production-ready streaming platform that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Processes &lt;strong&gt;50,000 events per second&lt;/strong&gt; in real-time&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;AWS infrastructure cost: ~$1,250/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;✅ Uses &lt;strong&gt;t3 instance types&lt;/strong&gt; for cost efficiency&lt;/li&gt;
&lt;li&gt;✅ Runs on &lt;strong&gt;AWS EKS&lt;/strong&gt; with managed Kubernetes&lt;/li&gt;
&lt;li&gt;✅ Supports &lt;strong&gt;multiple domains&lt;/strong&gt;: E-commerce, Finance, IoT, Gaming, Logistics&lt;/li&gt;
&lt;li&gt;✅ Provides &lt;strong&gt;real-time analytics&lt;/strong&gt; with sub-second latency&lt;/li&gt;
&lt;li&gt;✅ Includes &lt;strong&gt;monitoring dashboards&lt;/strong&gt; with Grafana&lt;/li&gt;
&lt;li&gt;✅ Offers &lt;strong&gt;easy scalability&lt;/strong&gt; to 1M events/sec if needed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💰 Infrastructure Cost
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AWS Infrastructure Cost: ~$1,250/month&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This includes all compute instances (t3.medium to t3.xlarge), EKS cluster management, storage (EBS gp3), networking (NAT Gateway, Load Balancer), and monitoring services required for a production-ready 50K events/sec streaming platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  🏗️ Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────────────────────┐
│                    AWS EKS Cluster (us-west-2)                 │
│                    bench-low-infra (k8s 1.31)                  │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  ┌──────────────┐    ┌──────────────┐    ┌─────────────────┐ │
│  │   PRODUCER   │───▶│    PULSAR    │───▶│      FLINK      │ │
│  │  (t3.medium) │    │  (t3.large)  │    │   (t3.large)    │ │
│  │              │    │              │    │                 │ │
│  │ Java/AVRO    │    │ ZK+Broker+BK │    │ JobManager +    │ │
│  │ 1K msg/sec   │    │ EBS Storage  │    │ TaskManager     │ │
│  │ per pod      │    │ 50K msg/sec  │    │ 1-min windows   │ │
│  └──────────────┘    └──────────────┘    └────────┬────────┘ │
│                                                     │          │
│                      ┌──────────────────────────────┘          │
│                      ▼                                         │
│               ┌─────────────────┐                             │
│               │   CLICKHOUSE    │                             │
│               │  (t3.xlarge)    │                             │
│               │                 │                             │
│               │  EBS Storage    │                             │
│               │  Analytics DB   │                             │
│               └─────────────────┘                             │
│                                                                │
│  Supporting: VPC, S3 (checkpoints), ECR, IAM, EBS CSI        │
└────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tech Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes&lt;/strong&gt;: AWS EKS 1.31&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message Broker&lt;/strong&gt;: Apache Pulsar 3.1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream Processing&lt;/strong&gt;: Apache Flink 1.18&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytics DB&lt;/strong&gt;: ClickHouse 24.x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: EBS gp3 (cost-optimized, no expensive NVMe)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt;: Terraform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Grafana + Prometheus&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📋 Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before starting, ensure you have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install required tools (macOS)&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;awscli terraform kubectl helm

&lt;span class="c"&gt;# Or on Linux&lt;/span&gt;
&lt;span class="c"&gt;# Install AWS CLI&lt;/span&gt;
curl &lt;span class="s2"&gt;"https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"awscliv2.zip"&lt;/span&gt;
unzip awscliv2.zip
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./aws/install

&lt;span class="c"&gt;# Install Terraform&lt;/span&gt;
wget https://releases.hashicorp.com/terraform/1.6.0/terraform_1.6.0_linux_amd64.zip
unzip terraform_1.6.0_linux_amd64.zip
&lt;span class="nb"&gt;sudo mv &lt;/span&gt;terraform /usr/local/bin/

&lt;span class="c"&gt;# Configure AWS credentials&lt;/span&gt;
aws configure
&lt;span class="c"&gt;# Enter: AWS Access Key, Secret Key, Region (us-west-2), Output format (json)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Required AWS Permissions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EC2, VPC, EKS, S3, IAM, ECR (full access)&lt;/li&gt;
&lt;li&gt;Estimated: ~$1,250/month for complete setup&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🚀 Step-by-Step Deployment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Clone the Repository
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/hyperscaledesignhub/RealtimeDataPlatform.git
&lt;span class="nb"&gt;cd &lt;/span&gt;RealtimeDataPlatform/realtime-platform-50k-events
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Repository structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;realtime-platform-50k-events/
├── terraform/                # AWS infrastructure
├── producer-load/            # Event data generation
├── pulsar-load/              # Apache Pulsar deployment
├── flink-load/               # Apache Flink processing
├── clickhouse-load/          # ClickHouse analytics DB
└── monitoring/               # Grafana dashboards
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Deploy AWS Infrastructure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Initialize Terraform&lt;/span&gt;
terraform init

&lt;span class="c"&gt;# Review what will be created&lt;/span&gt;
terraform plan

&lt;span class="c"&gt;# Deploy infrastructure (takes ~15-20 minutes)&lt;/span&gt;
terraform apply
&lt;span class="c"&gt;# Type 'yes' when prompted&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What gets created:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ VPC with public/private subnets (10.1.0.0/16)&lt;/li&gt;
&lt;li&gt;✅ EKS cluster &lt;code&gt;bench-low-infra&lt;/code&gt; (k8s 1.31)&lt;/li&gt;
&lt;li&gt;✅ 9 node groups with t3 instances:

&lt;ul&gt;
&lt;li&gt;Producer: t3.medium (1 node, scales to 3)&lt;/li&gt;
&lt;li&gt;Pulsar ZooKeeper: t3.small (3 nodes)&lt;/li&gt;
&lt;li&gt;Pulsar Broker: t3.large (3 nodes)&lt;/li&gt;
&lt;li&gt;Pulsar BookKeeper: t3.large (4 nodes)&lt;/li&gt;
&lt;li&gt;Pulsar Proxy: t3.small (2 nodes)&lt;/li&gt;
&lt;li&gt;Flink JobManager: t3.large (1 node)&lt;/li&gt;
&lt;li&gt;Flink TaskManager: t3.large (6 nodes)&lt;/li&gt;
&lt;li&gt;ClickHouse: t3.xlarge (4 nodes)&lt;/li&gt;
&lt;li&gt;General: t3.small (1 node)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;✅ S3 bucket for Flink checkpoints&lt;/li&gt;

&lt;li&gt;✅ ECR repositories for container images&lt;/li&gt;

&lt;li&gt;✅ IAM roles and policies&lt;/li&gt;

&lt;li&gt;✅ EBS CSI driver for persistent volumes&lt;/li&gt;

&lt;li&gt;✅ NAT Gateway for internet access&lt;/li&gt;

&lt;/ul&gt;
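&lt;p&gt;A quick sanity check on the compute portion of the bill, using the node counts above. The hourly rates below are approximate us-west-2 on-demand prices (they drift over time), and EKS control plane, EBS, and NAT Gateway charges are excluded:&lt;/p&gt;

```python
# Back-of-the-envelope compute cost for the node groups listed above.
HOURLY_USD = {"t3.small": 0.0208, "t3.medium": 0.0416,
              "t3.large": 0.0832, "t3.xlarge": 0.1664}
NODE_COUNTS = {"t3.medium": 1,   # producer
               "t3.small": 6,    # 3 ZooKeeper + 2 proxy + 1 general
               "t3.large": 14,   # 3 broker + 4 BookKeeper + 7 Flink
               "t3.xlarge": 4}   # ClickHouse

HOURS_PER_MONTH = 730  # average hours in a month
compute = sum(HOURLY_USD[t] * n for t, n in NODE_COUNTS.items()) * HOURS_PER_MONTH
print(f"on-demand compute ≈ ${compute:,.0f}/month")
```

&lt;p&gt;On-demand compute alone lands near $1,450 with these assumptions; reserved or savings-plan pricing and scaling down idle node groups are what pull the all-in bill toward the ~$1,250 figure.&lt;/p&gt;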

&lt;p&gt;&lt;strong&gt;Configure kubectl:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws eks update-kubeconfig &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2 &lt;span class="nt"&gt;--name&lt;/span&gt; bench-low-infra

&lt;span class="c"&gt;# Verify nodes are ready&lt;/span&gt;
kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-1-x-x.us-west-2.compute.internal      Ready    &amp;lt;none&amp;gt;   5m    v1.31.0-eks-...
ip-10-1-x-x.us-west-2.compute.internal      Ready    &amp;lt;none&amp;gt;   5m    v1.31.0-eks-...
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Deploy Apache Pulsar (Message Broker)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;pulsar-load

&lt;span class="c"&gt;# Deploy Pulsar with Helm&lt;/span&gt;
./deploy.sh

&lt;span class="c"&gt;# Monitor deployment (takes ~5-10 minutes)&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar &lt;span class="nt"&gt;-w&lt;/span&gt;
&lt;span class="c"&gt;# Press Ctrl+C when all pods are Running&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this deploys:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ZooKeeper: 3 replicas (metadata management)&lt;/li&gt;
&lt;li&gt;Broker: 3 replicas (message routing)&lt;/li&gt;
&lt;li&gt;BookKeeper: 4 replicas (message storage on EBS)&lt;/li&gt;
&lt;li&gt;Proxy: 2 replicas (load balancing)&lt;/li&gt;
&lt;li&gt;Grafana: Monitoring dashboard&lt;/li&gt;
&lt;li&gt;Victoria Metrics: Metrics storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verify Pulsar is healthy:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"zookeeper|broker|bookkeeper"&lt;/span&gt;

&lt;span class="c"&gt;# All pods should show "Running" status&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Deploy ClickHouse (Analytics Database)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ../clickhouse-load

&lt;span class="c"&gt;# Install ClickHouse operator and cluster&lt;/span&gt;
./00-install-clickhouse.sh

&lt;span class="c"&gt;# Wait for ClickHouse pods (~3-5 minutes)&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; clickhouse &lt;span class="nt"&gt;-w&lt;/span&gt;

&lt;span class="c"&gt;# Create database schema&lt;/span&gt;
./00-create-schema-all-replicas.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this creates:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ClickHouse cluster: 4 nodes (2 shards × 2 replicas)&lt;/li&gt;
&lt;li&gt;Database: &lt;code&gt;benchmark&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Table: &lt;code&gt;sensors_local&lt;/code&gt; (optimized for IoT sensor data)&lt;/li&gt;
&lt;li&gt;Storage: EBS gp3 volumes (200 GB per node)&lt;/li&gt;
&lt;li&gt;Retention: 30 days TTL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Test ClickHouse:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Connect to ClickHouse&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 &lt;span class="nt"&gt;--&lt;/span&gt; clickhouse-client

&lt;span class="c"&gt;# Run test query&lt;/span&gt;
SELECT version&lt;span class="o"&gt;()&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c"&gt;# Exit with Ctrl+D&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Deploy Apache Flink (Stream Processing)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ../flink-load

&lt;span class="c"&gt;# Build and push Flink consumer image to ECR&lt;/span&gt;
./build-and-push.sh

&lt;span class="c"&gt;# Deploy Flink cluster&lt;/span&gt;
./deploy.sh

&lt;span class="c"&gt;# Submit Flink job&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; flink-job-deployment.yaml

&lt;span class="c"&gt;# Monitor Flink job (~2-3 minutes to start)&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &lt;span class="nt"&gt;-w&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this deploys:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flink JobManager: 1 replica (job coordination)&lt;/li&gt;
&lt;li&gt;Flink TaskManager: 6 replicas (data processing)&lt;/li&gt;
&lt;li&gt;S3 checkpoints: Every 1 minute&lt;/li&gt;
&lt;li&gt;Job: &lt;code&gt;JDBC IoT Data Pipeline&lt;/code&gt; (AVRO deserialization)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Flink Job Details:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Stream processing pipeline using the SensorData AVRO schema&lt;/span&gt;
&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;SensorRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sensorStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pulsarSource&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;WatermarkStrategy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;noWatermarks&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
    &lt;span class="s"&gt;"Pulsar IoT Source"&lt;/span&gt;
&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// 1-minute aggregation windows&lt;/span&gt;
&lt;span class="n"&gt;sensorStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;keyBy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSensorId&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TumblingProcessingTimeWindows&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;aggregate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SensorAggregator&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ClickHouseJDBCSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clickhouseUrl&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 6: Deploy IoT Producer (Event Generation)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ../producer-load

&lt;span class="c"&gt;# Build and deploy IoT producer&lt;/span&gt;
./deploy.sh

&lt;span class="c"&gt;# Scale producers based on desired throughput&lt;/span&gt;
kubectl scale deployment iot-producer &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50

&lt;span class="c"&gt;# Monitor producer status&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;iot-producer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Producer capabilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AVRO serialization&lt;/strong&gt;: Uses optimized SensorData schema&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-sensor types&lt;/strong&gt;: Temperature, humidity, pressure, motion, light, CO2, noise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configurable throughput&lt;/strong&gt;: 1,000 events/sec per pod&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Realistic data&lt;/strong&gt;: Battery levels, device status, geographic distribution&lt;/li&gt;
&lt;/ul&gt;
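&lt;p&gt;Because each producer pod emits a fixed 1,000 events/sec, sizing the deployment is simple arithmetic. A quick sketch (the per-pod rate comes from the producer configuration above; the helper function itself is illustrative):&lt;/p&gt;

```python
# Each producer pod emits a fixed rate of 1,000 events/sec,
# so total throughput scales linearly with the replica count.
PER_POD_RATE = 1_000  # events/sec, from the producer configuration

def replicas_for(target_events_per_sec: int) -> int:
    """Number of producer pods needed to reach a target throughput."""
    # Ceiling division so we never under-provision.
    return -(-target_events_per_sec // PER_POD_RATE)

# The guide's 50K events/sec target needs 50 pods, matching
# the `kubectl scale ... --replicas=50` command above.
print(replicas_for(50_000))     # 50
print(replicas_for(1_000_000))  # 1000 pods for a 1M events/sec tier
```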

&lt;h2&gt;
  
  
  📊 Step 7: Verify the Complete Pipeline
&lt;/h2&gt;

&lt;p&gt;After all components are deployed (~10-15 minutes total), verify data flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check producer is generating data&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;iot-producer &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10

&lt;span class="c"&gt;# Verify Pulsar has messages&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data

&lt;span class="c"&gt;# Check Flink is processing&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark deployment/iot-flink-job &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20

&lt;span class="c"&gt;# Query ClickHouse for data&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  clickhouse-client &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"SELECT COUNT(*) FROM benchmark.sensors_local"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected data flow:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Producer: 50,000 events/sec generation
✅ Pulsar: Message ingestion and buffering
✅ Flink: Real-time stream processing with 1-minute windows
✅ ClickHouse: Data storage and analytics queries
✅ End-to-end latency: &amp;lt; 2 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🔍 Monitoring and Analytics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Access Grafana Dashboard
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set up port forwarding&lt;/span&gt;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar svc/grafana 3000:3000 &amp;amp;

&lt;span class="c"&gt;# Open in browser&lt;/span&gt;
open http://localhost:3000
&lt;span class="c"&gt;# Login: admin/admin&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key metrics to monitor:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar&lt;/strong&gt;: Message throughput, backlog size, storage usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink&lt;/strong&gt;: Checkpoint duration, processing latency, job health&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse&lt;/strong&gt;: Query performance, insert rate, storage growth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt;: CPU, memory, disk I/O across all nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Sample Analytics Queries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Connect to ClickHouse&lt;/span&gt;
&lt;span class="n"&gt;kubectl&lt;/span&gt; &lt;span class="k"&gt;exec&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="n"&gt;clickhouse&lt;/span&gt; &lt;span class="n"&gt;chi&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;iot&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;repl&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;iot&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="c1"&gt;-- clickhouse-client&lt;/span&gt;

&lt;span class="c1"&gt;-- Query examples based on our SensorData AVRO schema&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Count total sensor readings&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_readings&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensors_local&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Average metrics by sensor type&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;sensorType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;reading_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;humidity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_humidity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batteryLevel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_battery&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sensorType&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sensorType&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Identify sensors with low battery&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sensorType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batteryLevel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_battery&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batteryLevel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;min_battery&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;HOUR&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sensorType&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="n"&gt;avg_battery&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;avg_battery&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Hourly data ingestion rate&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;toStartOfHour&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;records_per_hour&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="n"&gt;HOUR&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Temperature anomaly detection&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; 
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;HOUR&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📈 Performance Benchmarks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Real-World Performance Metrics
&lt;/h3&gt;

&lt;p&gt;On this cost-optimized setup, you can expect:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Message Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50,000 events/sec&lt;/td&gt;
&lt;td&gt;Sustained rate with 50 producer pods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;End-to-end Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 2 seconds&lt;/td&gt;
&lt;td&gt;Producer → ClickHouse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 500ms&lt;/td&gt;
&lt;td&gt;Analytical queries on 1B+ records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Utilization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;60-70%&lt;/td&gt;
&lt;td&gt;Across all node groups&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Usage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~80%&lt;/td&gt;
&lt;td&gt;Optimized for t3 instance types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage Growth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~10 GB/hour&lt;/td&gt;
&lt;td&gt;With 30-day TTL retention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;99.9%+&lt;/td&gt;
&lt;td&gt;Multi-AZ deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
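&lt;p&gt;The ~10 GB/hour growth figure combined with the 30-day TTL implies a steady-state disk footprint you can estimate up front (a back-of-the-envelope sketch using the table's numbers; actual usage depends on compression and replication):&lt;/p&gt;

```python
# Steady-state disk usage = ingest rate x retention window.
GB_PER_HOUR = 10   # storage growth from the benchmark table
TTL_DAYS = 30      # ClickHouse TTL retention from the same table

steady_state_gb = GB_PER_HOUR * 24 * TTL_DAYS
print(steady_state_gb)  # 7200 GB, i.e. ~7.2 TB of retained data
```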

&lt;h3&gt;
  
  
  Cost vs. Performance Analysis
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;💰 Infrastructure Efficiency:
- $1,250/month for 50K events/sec
- ~$0.01 per million events processed (~130 billion events/month)
- Production-ready reliability and monitoring
- Linear scaling to 1M events/sec

📊 Scaling Characteristics:
- Same architecture scales from 1K → 1M events/sec
- Infrastructure-as-Code deployment
- Easy migration path to enterprise setup
- Predictable monthly costs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
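&lt;p&gt;A quick arithmetic check on the efficiency numbers (both inputs come from the table above; a 30-day month is assumed):&lt;/p&gt;

```python
# Per-event cost implied by $1,250/month at a sustained 50K events/sec.
MONTHLY_COST_USD = 1_250
EVENTS_PER_SEC = 50_000
SECONDS_PER_MONTH = 86_400 * 30  # assuming a 30-day month

events_per_month = EVENTS_PER_SEC * SECONDS_PER_MONTH  # ~130 billion
cost_per_million = MONTHLY_COST_USD / events_per_month * 1_000_000
print(round(cost_per_million, 4))  # ~0.0096 USD per million events
```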



&lt;h2&gt;
  
  
  🛠️ Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Common Issues and Solutions
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Issue: High Memory Usage
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check memory usage across nodes&lt;/span&gt;
kubectl top nodes

&lt;span class="c"&gt;# Scale up instances if needed&lt;/span&gt;
&lt;span class="c"&gt;# Edit terraform.tfvars&lt;/span&gt;
instance_type &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t3.2xlarge"&lt;/span&gt;  &lt;span class="c"&gt;# Upgrade from t3.xlarge&lt;/span&gt;
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue: Pods Stuck in Pending
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check node availability&lt;/span&gt;
kubectl get nodes

&lt;span class="c"&gt;# Check pod events&lt;/span&gt;
kubectl describe pod &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &amp;lt;namespace&amp;gt;

&lt;span class="c"&gt;# Scale up nodes if needed&lt;/span&gt;
&lt;span class="c"&gt;# Edit terraform.tfvars, increase desired_size&lt;/span&gt;
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue: Flink Job Failing
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check Flink logs&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark deployment/iot-flink-job

&lt;span class="c"&gt;# Common issues:&lt;/span&gt;
&lt;span class="c"&gt;# - ClickHouse not ready: Wait 2-3 minutes&lt;/span&gt;
&lt;span class="c"&gt;# - Pulsar not accessible: Check network policies&lt;/span&gt;
&lt;span class="c"&gt;# - Out of memory: Scale up TaskManagers&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue: No Data in ClickHouse
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Check producer is running&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;iot-producer

&lt;span class="c"&gt;# 2. Check Pulsar has data&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; pulsar pulsar-broker-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data

&lt;span class="c"&gt;# 3. Check Flink is processing&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark deployment/iot-flink-job | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-50&lt;/span&gt;

&lt;span class="c"&gt;# 4. Verify ClickHouse connectivity&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; clickhouse chi-iot-cluster-repl-iot-cluster-0-0-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  clickhouse-client &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"SELECT 1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue: Optimizing Costs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable spot instances for cost savings (can reduce to ~$400-500/month)&lt;/span&gt;
use_spot_instances &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;

&lt;span class="c"&gt;# Reduce node counts during development/testing&lt;/span&gt;
desired_size &lt;span class="o"&gt;=&lt;/span&gt; 1  &lt;span class="c"&gt;# Instead of 3-6&lt;/span&gt;

&lt;span class="c"&gt;# Use smaller instance types for development&lt;/span&gt;
instance_type &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t3.small"&lt;/span&gt;  &lt;span class="c"&gt;# Instead of t3.large&lt;/span&gt;

&lt;span class="c"&gt;# Schedule auto-shutdown for non-production environments&lt;/span&gt;
&lt;span class="c"&gt;# Use AWS Instance Scheduler&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧹 Cleanup
&lt;/h2&gt;

&lt;p&gt;When you're done testing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Delete all Kubernetes resources&lt;/span&gt;
kubectl delete namespace iot-pipeline flink-benchmark clickhouse pulsar

&lt;span class="c"&gt;# Destroy AWS infrastructure&lt;/span&gt;
terraform destroy
&lt;span class="c"&gt;# Type 'yes' when prompted&lt;/span&gt;

&lt;span class="c"&gt;# This will:&lt;/span&gt;
&lt;span class="c"&gt;# - Delete EKS cluster&lt;/span&gt;
&lt;span class="c"&gt;# - Delete VPC and subnets&lt;/span&gt;
&lt;span class="c"&gt;# - Delete S3 buckets (after emptying)&lt;/span&gt;
&lt;span class="c"&gt;# - Delete IAM roles&lt;/span&gt;
&lt;span class="c"&gt;# - Delete ECR repositories&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;⚠️ Warning:&lt;/strong&gt; This is irreversible! Back up any data you need before destroying the infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  💡 Production Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Optimize Instance Usage&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# terraform.tfvars - Consider spot instances for cost savings&lt;/span&gt;
&lt;span class="nx"&gt;use_spot_instances&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;# Can reduce costs by 60-70%&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Benefits of optimization:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spot instances can reduce costs significantly&lt;/li&gt;
&lt;li&gt;Right-sizing instances based on actual usage&lt;/li&gt;
&lt;li&gt;Auto-scaling to handle variable workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Current setup uses on-demand instances for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Guaranteed availability and performance&lt;/li&gt;
&lt;li&gt;Simplified operations and management&lt;/li&gt;
&lt;li&gt;Predictable monthly costs ($1,250)&lt;/li&gt;
&lt;/ul&gt;
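&lt;p&gt;The 60-70% spot savings quoted in the tfvars snippet translates directly into a monthly range (an arithmetic sketch; real spot pricing varies by region, AZ, and instance type):&lt;/p&gt;

```python
# Applying the quoted 60-70% spot discount to the $1,250/month
# on-demand bill from the current setup.
ON_DEMAND_MONTHLY = 1_250
at_60 = round(ON_DEMAND_MONTHLY * (1 - 0.60))  # $500/month
at_70 = round(ON_DEMAND_MONTHLY * (1 - 0.70))  # $375/month
print(at_60, at_70)  # consistent with the ~$400-500/month figure above
```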

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Set Up Alerts&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CloudWatch alarms (via Terraform)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CPU utilization &amp;gt; 80%&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Disk usage &amp;gt; 85%&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Pod crashes &amp;gt; 3 in 5 minutes&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Flink checkpoint failures&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Implement Data Retention&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- ClickHouse TTL (30 days)&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sensors_local&lt;/span&gt; 
&lt;span class="k"&gt;MODIFY&lt;/span&gt; &lt;span class="n"&gt;TTL&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Pulsar retention (7 days)&lt;/span&gt;
&lt;span class="n"&gt;bin&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;pulsar&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;admin&lt;/span&gt; &lt;span class="n"&gt;namespaces&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;retention&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;
  &lt;span class="c1"&gt;--size 100G --time 7d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. &lt;strong&gt;Enable Backups&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ClickHouse backups to S3&lt;/span&gt;
clickhouse-backup create daily_backup
clickhouse-backup upload daily_backup

&lt;span class="c"&gt;# Flink savepoints&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; flink-benchmark &amp;lt;jobmanager-pod&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  flink savepoint &amp;lt;job-id&amp;gt; s3://bench-low-infra-state/savepoints
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. &lt;strong&gt;Use Auto-Scaling&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# HorizontalPodAutoscaler for producers&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling/v2&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HorizontalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;iot-producer-hpa&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;iot-producer&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Resource&lt;/span&gt;
    &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cpu&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Utilization&lt;/span&gt;
        &lt;span class="na"&gt;averageUtilization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎓 What You've Learned
&lt;/h2&gt;

&lt;p&gt;By following this guide, you've:&lt;/p&gt;

&lt;p&gt;✅ Deployed a &lt;strong&gt;production-grade streaming platform&lt;/strong&gt; on AWS&lt;br&gt;
✅ Configured &lt;strong&gt;cost-optimized infrastructure&lt;/strong&gt; with t3 instances&lt;br&gt;
✅ Set up &lt;strong&gt;real-time stream processing&lt;/strong&gt; with Apache Flink&lt;br&gt;
✅ Implemented &lt;strong&gt;exactly-once semantics&lt;/strong&gt; with checkpointing&lt;br&gt;
✅ Built a &lt;strong&gt;scalable message broker&lt;/strong&gt; with Apache Pulsar&lt;br&gt;
✅ Configured an &lt;strong&gt;analytics database&lt;/strong&gt; with ClickHouse&lt;br&gt;
✅ Enabled &lt;strong&gt;monitoring and observability&lt;/strong&gt; with Grafana&lt;br&gt;
✅ Learned &lt;strong&gt;cost optimization&lt;/strong&gt; strategies (90% savings!)&lt;/p&gt;
&lt;h2&gt;
  
  
  🚀 Next Steps
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. &lt;strong&gt;Customize for Your Domain&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Modify the producer to generate your specific event types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Edit: producer-load/src/main/java/com/iot/pipeline/producer/&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EventDataProducer&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;EventData&lt;/span&gt; &lt;span class="nf"&gt;generateEvent&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Your custom event generation logic&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;EventData&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;customerId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;timestamp&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. &lt;strong&gt;Add Custom Flink Transformations&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Edit: flink-load/flink-consumer/src/main/java/com/iot/pipeline/flink/&lt;/span&gt;
&lt;span class="n"&gt;sensorStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getTemperature&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;30.0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// Custom filter&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AlertRecord&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;            &lt;span class="c1"&gt;// Custom transformation&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AlertSink&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;                         &lt;span class="c1"&gt;// Custom sink&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Implement Advanced Analytics&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create materialized views in ClickHouse&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;hourly_aggregates&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AggregatingMergeTree&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;toStartOfHour&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;max_temp&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sensors_local&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. &lt;strong&gt;Scale to Production&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When ready for production scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enable spot instances for cost savings&lt;/li&gt;
&lt;li&gt;Set up automated backups&lt;/li&gt;
&lt;li&gt;Configure CloudWatch alarms&lt;/li&gt;
&lt;li&gt;Implement log aggregation (CloudWatch Logs)&lt;/li&gt;
&lt;li&gt;Set up CI/CD pipeline&lt;/li&gt;
&lt;li&gt;Enable AWS Shield for DDoS protection&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;50K Events Repository&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/realtime-platform-50k-events" rel="noopener noreferrer"&gt;realtime-platform-50k-events&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Main Repository&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;RealtimeDataPlatform&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS EKS Documentation&lt;/strong&gt;: &lt;a href="https://docs.aws.amazon.com/eks/" rel="noopener noreferrer"&gt;docs.aws.amazon.com/eks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Flink&lt;/strong&gt;: &lt;a href="https://flink.apache.org/" rel="noopener noreferrer"&gt;flink.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Pulsar&lt;/strong&gt;: &lt;a href="https://pulsar.apache.org/" rel="noopener noreferrer"&gt;pulsar.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse&lt;/strong&gt;: &lt;a href="https://clickhouse.com/docs" rel="noopener noreferrer"&gt;clickhouse.com/docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform&lt;/strong&gt;: &lt;a href="https://terraform.io/docs" rel="noopener noreferrer"&gt;terraform.io/docs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💬 Conclusion
&lt;/h2&gt;

&lt;p&gt;You now have a &lt;strong&gt;production-grade, cost-optimized streaming platform&lt;/strong&gt; running on AWS for about &lt;strong&gt;$1,250/month&lt;/strong&gt;! This setup demonstrates real-world patterns used by companies processing millions of events per day, but optimized for moderate scale and budget constraints.&lt;/p&gt;

&lt;p&gt;The beauty of this architecture is its &lt;strong&gt;flexibility&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start small (1K events/sec, lower cost)&lt;/li&gt;
&lt;li&gt;Grow to moderate (50K events/sec, ~$1,250/mo)&lt;/li&gt;
&lt;li&gt;Scale to enterprise (1M events/sec, higher cost)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All with the same codebase and deployment patterns!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS EKS deployment&lt;/strong&gt; provides managed Kubernetes for production workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;t3 instances&lt;/strong&gt; deliver excellent price/performance for streaming workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$1,250/month&lt;/strong&gt; infrastructure cost for 50K events/sec processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AVRO schemas&lt;/strong&gt; enable efficient serialization at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production-ready&lt;/strong&gt; with monitoring, alerting, and auto-scaling&lt;/li&gt;
&lt;/ul&gt;
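&lt;p&gt;As a back-of-envelope check on that $1,250 figure (assuming a 30-day month of sustained 50K events/sec; this is my arithmetic, not a line item from an AWS bill):&lt;/p&gt;

```python
# Back-of-envelope cost check for the 50K events/sec tier.
# Assumptions: 30-day month, sustained throughput around the clock.
events_per_sec = 50_000
monthly_cost_usd = 1_250

events_per_month = events_per_sec * 60 * 60 * 24 * 30   # 129.6 billion events
cost_per_million = monthly_cost_usd / (events_per_month / 1_000_000)

print(round(cost_per_million, 4))   # 0.0096, i.e. about a cent per million events
```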

&lt;p&gt;&lt;strong&gt;What would you build with this platform? Share your use case in the comments!&lt;/strong&gt; 👇&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found this helpful? Follow me for more posts on cloud architecture, real-time data engineering, and cost optimization strategies!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next in the series&lt;/strong&gt;: "Scaling to 1 Million Events/Second - Enterprise Production Guide"&lt;/p&gt;




&lt;h2&gt;
  
  
  🌟 Support This Project
&lt;/h2&gt;

&lt;p&gt;⭐ &lt;strong&gt;Star the repo&lt;/strong&gt; if you found it useful!&lt;br&gt;&lt;br&gt;
🐛 &lt;strong&gt;Report issues&lt;/strong&gt; - help us improve&lt;br&gt;&lt;br&gt;
💼 &lt;strong&gt;Production-tested&lt;/strong&gt; - used in real workloads&lt;br&gt;&lt;br&gt;
📖 &lt;strong&gt;Well-documented&lt;/strong&gt; - complete guides included&lt;br&gt;&lt;br&gt;
💰 &lt;strong&gt;Cost-optimized&lt;/strong&gt; - save 90% on infrastructure&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 Performance Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50,000 events/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 2 seconds end-to-end&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monthly Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1,250 (on-demand instances)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1TB (30 days retention)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;99.9% (multi-AZ deployment)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1K → 1M events/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~45 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #aws #eks #streaming #costsavings #dataengineering #pulsar #flink #clickhouse #terraform #avro #realtimeanalytics&lt;/p&gt;

</description>
      <category>performance</category>
      <category>kubernetes</category>
      <category>architecture</category>
      <category>aws</category>
    </item>
    <item>
      <title>Building a Real-Time Data Platform with Kubernetes (Kind) - A Complete Local Setup Guide</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 10:20:01 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/building-a-real-time-data-platform-with-kubernetes-kind-a-complete-local-setup-guide-3bko</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/building-a-real-time-data-platform-with-kubernetes-kind-a-complete-local-setup-guide-3bko</guid>
      <description>&lt;p&gt;Ever wondered how to build a production-grade real-time data pipeline that can handle millions of events per second? In this guide, I'll show you how to set up a complete IoT streaming platform locally using &lt;strong&gt;Kubernetes (Kind)&lt;/strong&gt; that processes sensor data in real-time.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 What We're Building
&lt;/h2&gt;

&lt;p&gt;A fully functional IoT data pipeline that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Generates realistic IoT sensor data (temperature, humidity, pressure)&lt;/li&gt;
&lt;li&gt;✅ Streams data through Apache Pulsar at 1000+ msg/sec&lt;/li&gt;
&lt;li&gt;✅ Processes data in real-time with Apache Flink&lt;/li&gt;
&lt;li&gt;✅ Stores analytics in ClickHouse for fast queries&lt;/li&gt;
&lt;li&gt;✅ Detects and alerts on anomalies (high temperature, low battery)&lt;/li&gt;
&lt;li&gt;✅ Runs entirely on your local machine with Kubernetes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🏗️ Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────┐      ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│             │      │             │      │             │      │             │
│ IoT Producer├─────►│   Pulsar    ├─────►│    Flink    ├─────►│ ClickHouse  │
│  (Java)     │      │  (Broker)   │      │ (Processor) │      │  (Storage)  │
│             │      │             │      │             │      │             │
└─────────────┘      └─────────────┘      └─────────────┘      └─────────────┘
   Sensor Data     Message Queue      Stream Processing     Analytics DB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tech Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes (Kind)&lt;/strong&gt;: Local Kubernetes cluster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Pulsar&lt;/strong&gt;: Distributed messaging and streaming&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Flink&lt;/strong&gt;: Real-time stream processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse&lt;/strong&gt;: Columnar database for analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt;: Containerization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maven&lt;/strong&gt;: Build automation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📋 Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before we begin, make sure you have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check versions&lt;/span&gt;
docker &lt;span class="nt"&gt;--version&lt;/span&gt;      &lt;span class="c"&gt;# Docker 20.10+&lt;/span&gt;
kubectl version      &lt;span class="c"&gt;# kubectl 1.28+&lt;/span&gt;
kind version         &lt;span class="c"&gt;# kind 0.20+&lt;/span&gt;
mvn &lt;span class="nt"&gt;--version&lt;/span&gt;        &lt;span class="c"&gt;# Maven 3.8+&lt;/span&gt;
java &lt;span class="nt"&gt;-version&lt;/span&gt;        &lt;span class="c"&gt;# Java 11+&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Installation (macOS):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install required tools&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;kind kubectl maven docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Installation (Linux):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Kind&lt;/span&gt;
curl &lt;span class="nt"&gt;-Lo&lt;/span&gt; ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x ./kind
&lt;span class="nb"&gt;sudo mv&lt;/span&gt; ./kind /usr/local/bin/kind

&lt;span class="c"&gt;# Install kubectl&lt;/span&gt;
curl &lt;span class="nt"&gt;-LO&lt;/span&gt; &lt;span class="s2"&gt;"https://dl.k8s.io/release/&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-L&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; https://dl.k8s.io/release/stable.txt&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;/bin/linux/amd64/kubectl"&lt;/span&gt;
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x kubectl
&lt;span class="nb"&gt;sudo mv &lt;/span&gt;kubectl /usr/local/bin/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🚀 Step 1: Clone the Repository
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/hyperscaledesignhub/RealtimeDataPlatform.git
&lt;span class="nb"&gt;cd &lt;/span&gt;RealtimeDataPlatform/local-setup/k8s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Repository structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;local-setup/k8s/
├── create-cluster.sh          # Kind cluster setup
├── start-pipeline.sh          # Deploy entire pipeline
├── stop-pipeline.sh           # Clean shutdown
├── k8s-manifests/            # Kubernetes YAML files
├── flink-jobs/               # Flink job definitions
└── scripts/                  # Helper utilities
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🔧 Step 2: Create the Kind Cluster
&lt;/h2&gt;

&lt;p&gt;The repository includes a pre-configured Kind cluster setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./create-cluster.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this does:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates a 3-node Kubernetes cluster (1 control-plane, 2 workers)&lt;/li&gt;
&lt;li&gt;Configures port mappings for service access&lt;/li&gt;
&lt;li&gt;Sets up kubeconfig at &lt;code&gt;/tmp/iot-kubeconfig&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Validates cluster readiness&lt;/li&gt;
&lt;/ul&gt;
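&lt;p&gt;For reference, the 3-node layout above corresponds to a Kind config along these lines (a sketch only; the port mapping shown is an assumption, and &lt;code&gt;create-cluster.sh&lt;/code&gt; in the repo defines the real ones):&lt;/p&gt;

```yaml
# Sketch of a Kind config matching the 3-node layout above.
# The actual create-cluster.sh may use different port mappings.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: iot-pipeline
nodes:
  - role: control-plane
    extraPortMappings:
      - containerPort: 30081   # assumed NodePort for the Flink UI
        hostPort: 8081
  - role: worker
  - role: worker
```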

&lt;p&gt;&lt;strong&gt;Expected output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Kind cluster created successfully!

Cluster Information:
NAME                         STATUS   ROLES           AGE   VERSION
iot-pipeline-control-plane   Ready    control-plane   70s   v1.32.2
iot-pipeline-worker          Ready    &amp;lt;none&amp;gt;          70s   v1.32.2
iot-pipeline-worker2         Ready    &amp;lt;none&amp;gt;          70s   v1.32.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎬 Step 3: Deploy the IoT Pipeline
&lt;/h2&gt;

&lt;p&gt;Now for the exciting part - deploying the entire pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;KUBECONFIG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/iot-kubeconfig
./start-pipeline.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🔍 What Happens Behind the Scenes
&lt;/h3&gt;

&lt;p&gt;The script performs these steps automatically:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. &lt;strong&gt;Build Phase&lt;/strong&gt; (~2 minutes)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Builds the IoT producer Docker image&lt;/span&gt;
Building producer image...
✅ iot-producer:latest built

&lt;span class="c"&gt;# Compiles Flink consumer with Maven&lt;/span&gt;
Building Flink consumer JAR...
✅ flink-consumer-1.0.0.jar created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. &lt;strong&gt;Load Images into Kind&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kind load docker-image iot-producer:latest &lt;span class="nt"&gt;--name&lt;/span&gt; iot-pipeline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. &lt;strong&gt;Deploy Services&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Creates namespace and deploys:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pulsar&lt;/strong&gt;: StatefulSet with persistent storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse&lt;/strong&gt;: StatefulSet with initialization scripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink&lt;/strong&gt;: JobManager + 2 TaskManagers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IoT Producer&lt;/strong&gt;: Deployment generating sensor data&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. &lt;strong&gt;Initialize ClickHouse Schema&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The platform automatically creates the sensor data schema based on our AVRO definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;iot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;sensor_raw_data&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sensor_id&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sensor_type&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;humidity&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pressure&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;battery_level&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="n"&gt;DateTime64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;event_time&lt;/span&gt; &lt;span class="n"&gt;DateTime64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now64&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MergeTree&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;toYYYYMM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sensor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;index_granularity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create alerts table for anomaly detection&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;sensor_alerts&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sensor_id&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;alert_type&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;alert_time&lt;/span&gt; &lt;span class="n"&gt;DateTime64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;humidity&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;battery_level&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MergeTree&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alert_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sensor_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
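&lt;p&gt;To make the schema concrete, here is a minimal Python sketch of a row that fits &lt;code&gt;sensor_raw_data&lt;/code&gt; (value ranges are illustrative, not taken from the real producer; &lt;code&gt;event_time&lt;/code&gt; is omitted because ClickHouse fills it in via its &lt;code&gt;DEFAULT&lt;/code&gt;):&lt;/p&gt;

```python
import random
import time

def make_reading(sensor_id):
    """Generate one sensor row matching the sensor_raw_data columns.
    Value ranges are illustrative, not taken from the real producer."""
    return {
        "sensor_id": sensor_id,
        "sensor_type": 1,                                  # 1 = temperature sensor
        "temperature": round(random.uniform(15.0, 40.0), 1),
        "humidity": round(random.uniform(30.0, 90.0), 1),
        "pressure": round(random.uniform(980.0, 1050.0), 1),
        "battery_level": round(random.uniform(5.0, 100.0), 1),
        "status": 1,                                       # 1 = online
        "timestamp": int(time.time() * 1000),              # DateTime64(3) = millis
    }

row = make_reading(4)
print(sorted(row))   # column names line up with the CREATE TABLE above
```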



&lt;h4&gt;
  
  
  5. &lt;strong&gt;Submit Flink Job&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Deploys the stream processing job that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consumes from Pulsar topic: &lt;code&gt;persistent://public/default/iot-sensor-data&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Deserializes AVRO sensor data using our schema&lt;/li&gt;
&lt;li&gt;Detects anomalies (temp &amp;gt; 35°C, humidity &amp;gt; 80%, battery &amp;lt; 20%)&lt;/li&gt;
&lt;li&gt;Writes processed data to ClickHouse&lt;/li&gt;
&lt;/ul&gt;
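&lt;p&gt;Those thresholds fit in a few lines. Here is a plain-Python sketch of the predicate (the real job implements this in Java; &lt;code&gt;HIGH_HUMIDITY&lt;/code&gt; and &lt;code&gt;LOW_BATTERY&lt;/code&gt; are assumed alert-type names, only &lt;code&gt;HIGH_TEMPERATURE&lt;/code&gt; appears later in the sample output):&lt;/p&gt;

```python
def detect_alerts(reading):
    """Return alert types for one reading, using the thresholds above:
    temperature above 35 C, humidity above 80%, battery below 20%."""
    alerts = []
    if reading["temperature"] > 35.0:
        alerts.append("HIGH_TEMPERATURE")
    if reading["humidity"] > 80.0:
        alerts.append("HIGH_HUMIDITY")
    if 20.0 > reading["battery_level"]:      # i.e. battery below 20%
        alerts.append("LOW_BATTERY")
    return alerts

print(detect_alerts({"temperature": 36.2, "humidity": 55.0, "battery_level": 12.0}))
# → ['HIGH_TEMPERATURE', 'LOW_BATTERY']
```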

&lt;h2&gt;
  
  
  📊 Step 4: Verify the Pipeline
&lt;/h2&gt;

&lt;p&gt;After deployment (~90 seconds), you'll see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Pipeline is working! Data is flowing successfully.

Data Flow Status:
Sensor readings: 39
Alerts generated: 7

Sample sensor data:
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ sensor_id ┃ sensor_type ┃ temperature ┃ humidity ┃ timestamp           ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│         4 │           1 │        24.5 │     68.6 │ 2025-10-26 10:07:32 │
│         3 │           1 │        28.2 │     79.3 │ 2025-10-26 10:07:31 │
│         2 │           1 │        26.3 │     60.1 │ 2025-10-26 10:07:30 │
└───────────┴─────────────┴─────────────┴──────────┴─────────────────────┘

Alert distribution:
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ alert_type       ┃ count ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ HIGH_TEMPERATURE │     7 │
└──────────────────┴───────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Check All Pods Running
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                                 READY   STATUS    RESTARTS   AGE
clickhouse-0                         1/1     Running   0          2m
flink-jobmanager-77c4d6f6c5-v7kqv    1/1     Running   0          2m
flink-taskmanager-7d67d89fd6-5v84r   1/1     Running   0          2m
flink-taskmanager-7d67d89fd6-n7f96   1/1     Running   0          2m
iot-producer-78466d4cf4-6pskj        1/1     Running   0          90s
pulsar-0                             1/1     Running   0          2m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🔍 Step 5: Explore the Pipeline
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Access Flink Dashboard
&lt;/h3&gt;

&lt;p&gt;The script automatically sets up port forwarding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Flink UI is already accessible at:&lt;/span&gt;
open http://localhost:8081
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What you'll see:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running Jobs: 1 job (&lt;code&gt;JDBC IoT Data Pipeline&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Task Managers: 2 active&lt;/li&gt;
&lt;li&gt;Task Slots: 4 available&lt;/li&gt;
&lt;li&gt;Job Graph: Visual representation of data flow&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Query ClickHouse Data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enter ClickHouse client&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline clickhouse-0 &lt;span class="nt"&gt;--&lt;/span&gt; clickhouse-client &lt;span class="nt"&gt;-d&lt;/span&gt; iot

&lt;span class="c"&gt;# Count total sensor readings&lt;/span&gt;
SELECT COUNT&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; FROM sensor_raw_data&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c"&gt;# Get average metrics by sensor type&lt;/span&gt;
SELECT 
    sensor_type,
    COUNT&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; as reading_count,
    AVG&lt;span class="o"&gt;(&lt;/span&gt;temperature&lt;span class="o"&gt;)&lt;/span&gt; as avg_temp,
    AVG&lt;span class="o"&gt;(&lt;/span&gt;humidity&lt;span class="o"&gt;)&lt;/span&gt; as avg_humidity,
    AVG&lt;span class="o"&gt;(&lt;/span&gt;pressure&lt;span class="o"&gt;)&lt;/span&gt; as avg_pressure,
    AVG&lt;span class="o"&gt;(&lt;/span&gt;battery_level&lt;span class="o"&gt;)&lt;/span&gt; as avg_battery
FROM sensor_raw_data
GROUP BY sensor_type
ORDER BY sensor_type&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c"&gt;# Get recent high-temperature alerts&lt;/span&gt;
SELECT 
    sensor_id,
    alert_type,
    alert_time,
    temperature,
    description
FROM sensor_alerts
WHERE alert_type &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'HIGH_TEMPERATURE'&lt;/span&gt;
ORDER BY alert_time DESC
LIMIT 10&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c"&gt;# Monitor data ingestion rate (records per minute)&lt;/span&gt;
SELECT 
    toStartOfMinute&lt;span class="o"&gt;(&lt;/span&gt;timestamp&lt;span class="o"&gt;)&lt;/span&gt; as minute,
    COUNT&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; as records_per_minute
FROM sensor_raw_data
WHERE timestamp &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; now&lt;span class="o"&gt;()&lt;/span&gt; - INTERVAL 10 MINUTE
GROUP BY minute
ORDER BY minute DESC&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Monitor Data Flow in Real-Time
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Watch sensor data being written&lt;/span&gt;
watch &lt;span class="nt"&gt;-n&lt;/span&gt; 2 &lt;span class="s2"&gt;"kubectl exec -n iot-pipeline clickhouse-0 -- &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
  clickhouse-client -d iot --query 'SELECT COUNT(*) FROM sensor_raw_data'"&lt;/span&gt;

&lt;span class="c"&gt;# Monitor Flink job status&lt;/span&gt;
watch &lt;span class="nt"&gt;-n&lt;/span&gt; 5 &lt;span class="s2"&gt;"kubectl exec -n iot-pipeline &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
  &lt;/span&gt;&lt;span class="se"&gt;\$&lt;/span&gt;&lt;span class="s2"&gt;(kubectl get pods -n iot-pipeline -l app=flink,component=jobmanager -o jsonpath='{.items[0].metadata.name}') -- &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
  flink list"&lt;/span&gt;

&lt;span class="c"&gt;# Check Pulsar topic stats&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline pulsar-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📈 Performance Metrics
&lt;/h2&gt;

&lt;p&gt;On a MacBook Pro (M1/M2) with 16GB RAM:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1,000 msg/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;End-to-end Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 500ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Usage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~40% (all pods)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Usage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~6GB total&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~500MB after 1 hour&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Records/minute&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~60,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
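&lt;p&gt;A quick sanity check that these numbers hang together (the bytes-per-record figure is a rough estimate derived from the table, not a measurement):&lt;/p&gt;

```python
# Consistency check on the local benchmark numbers above.
msgs_per_sec = 1_000
records_per_minute = msgs_per_sec * 60
print(records_per_minute)          # 60000, matching the table's ~60,000/min

# ~500 MB on disk after 1 hour of ingest implies roughly:
bytes_per_record = (500 * 1024 * 1024) / (msgs_per_sec * 3600)
print(round(bytes_per_record))     # ~146 bytes/record, a post-compression estimate
```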

&lt;h2&gt;
  
  
  🎨 Key Features Demonstrated
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Real-Time Stream Processing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Flink job processes AVRO-serialized sensor data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Flink processes data with 1-minute windows&lt;/span&gt;
&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;SensorRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sensorStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pulsarSource&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;WatermarkStrategy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;noWatermarks&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
    &lt;span class="s"&gt;"Pulsar IoT Source"&lt;/span&gt;
&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;sensorStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;keyBy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSensorId&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TumblingProcessingTimeWindows&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;aggregate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SensorAggregator&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ClickHouseJDBCSink&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clickhouseUrl&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. &lt;strong&gt;AVRO Schema Processing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our sensor data follows the optimized AVRO schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Schema highlights from our SensorData AVRO record:&lt;/span&gt;
&lt;span class="c1"&gt;// - sensorId: int (for efficiency)&lt;/span&gt;
&lt;span class="c1"&gt;// - sensorType: int (1=temp, 2=humidity, 3=pressure, etc.)&lt;/span&gt;
&lt;span class="c1"&gt;// - temperature, humidity, pressure: double&lt;/span&gt;
&lt;span class="c1"&gt;// - batteryLevel: double (percentage)&lt;/span&gt;
&lt;span class="c1"&gt;// - status: int (1=online, 2=offline, 3=maintenance, 4=error)&lt;/span&gt;
&lt;span class="c1"&gt;// - timestamp: long with logical type timestamp-millis&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
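&lt;p&gt;The integer codes above trade readability for wire efficiency, so consumers need a small decode step. A minimal self-contained sketch (class and method names here are illustrative, not part of the repository):&lt;/p&gt;

```java
import java.util.Map;

public class SensorCodes {
    // Code-to-name lookups taken from the schema comments above
    // (sensorType: 1=temp, 2=humidity, 3=pressure; status: 1=online ... 4=error).
    static final Map SENSOR_TYPES = Map.of(
            1, "temperature", 2, "humidity", 3, "pressure");
    static final Map STATUSES = Map.of(
            1, "online", 2, "offline", 3, "maintenance", 4, "error");

    static String typeName(int code) {
        return (String) SENSOR_TYPES.getOrDefault(code, "unknown");
    }

    static String statusName(int code) {
        return (String) STATUSES.getOrDefault(code, "unknown");
    }

    public static void main(String[] args) {
        System.out.println(typeName(1) + " / " + statusName(4)); // temperature / error
    }
}
```

&lt;p&gt;Keeping the mapping in one place makes it easy to evolve alongside the AVRO schema.&lt;/p&gt;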



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Anomaly Detection&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Alert on various conditions&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getTemperature&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;35.0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;alertSink&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;invoke&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Alert&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSensorId&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
        &lt;span class="s"&gt;"HIGH_TEMPERATURE"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getTemperature&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
        &lt;span class="nc"&gt;Instant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;now&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getBatteryLevel&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;20.0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;alertSink&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;invoke&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Alert&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSensorId&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
        &lt;span class="s"&gt;"LOW_BATTERY"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getBatteryLevel&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
        &lt;span class="nc"&gt;Instant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;now&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. &lt;strong&gt;Scalable Architecture&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal scaling&lt;/strong&gt;: Add more TaskManagers for increased parallelism&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vertical scaling&lt;/strong&gt;: Adjust resource limits per pod&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data partitioning&lt;/strong&gt;: Pulsar topic partitioning for parallel processing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🛠️ Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Issue: Pods not starting
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check pod status and events&lt;/span&gt;
kubectl describe pod &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &amp;lt;pod-name&amp;gt;

&lt;span class="c"&gt;# Check logs for specific errors&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &amp;lt;pod-name&amp;gt;

&lt;span class="c"&gt;# Check resource constraints&lt;/span&gt;
kubectl top pods &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue: Flink job not running
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check Flink JobManager logs&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;flink,component&lt;span class="o"&gt;=&lt;/span&gt;jobmanager

&lt;span class="c"&gt;# Check TaskManager logs&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;flink,component&lt;span class="o"&gt;=&lt;/span&gt;taskmanager

&lt;span class="c"&gt;# Access Flink CLI&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline deploy/flink-jobmanager &lt;span class="nt"&gt;--&lt;/span&gt; flink list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue: No data in ClickHouse
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check producer logs for errors&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;iot-producer

&lt;span class="c"&gt;# Verify Pulsar topic creation&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline pulsar-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/pulsar-admin topics list public/default

&lt;span class="c"&gt;# Check Pulsar message production&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline pulsar-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  bin/pulsar-admin topics stats persistent://public/default/iot-sensor-data

&lt;span class="c"&gt;# Test ClickHouse connectivity&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline clickhouse-0 &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  clickhouse-client &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"SELECT version()"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue: Port forwarding not working
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Manual port forwarding setup&lt;/span&gt;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline svc/flink-jobmanager 8081:8081 &amp;amp;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline svc/clickhouse 8123:8123 &amp;amp;

&lt;span class="c"&gt;# Check service endpoints&lt;/span&gt;
kubectl get svc &lt;span class="nt"&gt;-n&lt;/span&gt; iot-pipeline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧹 Cleanup
&lt;/h2&gt;

&lt;p&gt;When you're done exploring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Stop the pipeline (keeps cluster)&lt;/span&gt;
./stop-pipeline.sh

&lt;span class="c"&gt;# Or delete everything including cluster&lt;/span&gt;
kind delete cluster &lt;span class="nt"&gt;--name&lt;/span&gt; iot-pipeline

&lt;span class="c"&gt;# Clean up Docker images (optional)&lt;/span&gt;
docker rmi iot-producer:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎓 What You've Learned
&lt;/h2&gt;

&lt;p&gt;By following this guide, you've:&lt;/p&gt;

&lt;p&gt;✅ Set up a local Kubernetes cluster with Kind&lt;br&gt;&lt;br&gt;
✅ Deployed a distributed streaming platform&lt;br&gt;&lt;br&gt;
✅ Built a real-time data processing pipeline&lt;br&gt;&lt;br&gt;
✅ Implemented stream processing with Apache Flink&lt;br&gt;&lt;br&gt;
✅ Used AVRO schemas for efficient serialization&lt;br&gt;&lt;br&gt;
✅ Stored and queried streaming data in ClickHouse&lt;br&gt;&lt;br&gt;
✅ Implemented real-time anomaly detection&lt;br&gt;&lt;br&gt;
✅ Monitored a production-grade data pipeline  &lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 Next Steps
&lt;/h2&gt;

&lt;p&gt;Want to take this further?&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Scale to Production&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Deploy on AWS EKS with the &lt;code&gt;realtime-platform-1million-events/&lt;/code&gt; setup to handle 1M+ events/sec.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Add More Features&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Checkpointing&lt;/strong&gt;: Implement exactly-once processing with Flink checkpoints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Add Grafana dashboards and Prometheus metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windowing&lt;/strong&gt;: Implement different time windows (sliding, session-based)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ML Integration&lt;/strong&gt;: Add machine learning for predictive maintenance&lt;/li&gt;
&lt;/ul&gt;
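&lt;p&gt;To make the windowing bullet concrete, here is a plain-Java sketch (no Flink dependency) contrasting tumbling and sliding windows over an array of values. It groups by element count purely for illustration; Flink would group by event or processing time.&lt;/p&gt;

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class WindowSketch {
    // Tumbling: non-overlapping windows of `size` elements; one sum per window.
    static double[] tumblingSums(double[] v, int size) {
        int n = (v.length + size - 1) / size;
        return IntStream.range(0, n)
                .mapToDouble(w -> Arrays.stream(v, w * size, Math.min((w + 1) * size, v.length)).sum())
                .toArray();
    }

    // Sliding: windows of `size` elements advancing by `slide`; consecutive
    // windows overlap whenever `slide` is smaller than `size`.
    static double[] slidingSums(double[] v, int size, int slide) {
        int n = (v.length - size) / slide + 1;
        return IntStream.range(0, n)
                .mapToDouble(w -> Arrays.stream(v, w * slide, w * slide + size).sum())
                .toArray();
    }

    public static void main(String[] args) {
        double[] v = {1, 2, 3, 4, 5, 6};
        System.out.println(Arrays.toString(tumblingSums(v, 2)));    // [3.0, 7.0, 11.0]
        System.out.println(Arrays.toString(slidingSums(v, 3, 1)));  // [6.0, 9.0, 12.0, 15.0]
    }
}
```

&lt;p&gt;In Flink, the same distinction is expressed by swapping &lt;code&gt;TumblingProcessingTimeWindows&lt;/code&gt; for &lt;code&gt;SlidingProcessingTimeWindows&lt;/code&gt; in the pipeline shown earlier.&lt;/p&gt;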

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Customize the Pipeline&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema Evolution&lt;/strong&gt;: Practice updating AVRO schemas without downtime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom Transformations&lt;/strong&gt;: Add complex event processing (CEP) patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External APIs&lt;/strong&gt;: Connect to weather services or IoT device management platforms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Lake Integration&lt;/strong&gt;: Archive data to S3/MinIO for long-term storage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Advanced Topics&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenant Setup&lt;/strong&gt;: Isolate different customer data streams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-region Replication&lt;/strong&gt;: Set up geo-distributed deployments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: Add authentication and encryption&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Tuning&lt;/strong&gt;: Optimize for specific workload patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📚 Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local Setup Directory&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/local-setup/k8s" rel="noopener noreferrer"&gt;local-setup/k8s&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Main Repository&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;RealtimeDataPlatform&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kind Documentation&lt;/strong&gt;: &lt;a href="https://kind.sigs.k8s.io/" rel="noopener noreferrer"&gt;kind.sigs.k8s.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Flink Guides&lt;/strong&gt;: &lt;a href="https://flink.apache.org/" rel="noopener noreferrer"&gt;flink.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Pulsar Docs&lt;/strong&gt;: &lt;a href="https://pulsar.apache.org/" rel="noopener noreferrer"&gt;pulsar.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse Documentation&lt;/strong&gt;: &lt;a href="https://clickhouse.com/docs" rel="noopener noreferrer"&gt;clickhouse.com/docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AVRO Specification&lt;/strong&gt;: &lt;a href="https://avro.apache.org/" rel="noopener noreferrer"&gt;avro.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💬 Conclusion
&lt;/h2&gt;

&lt;p&gt;You now have a fully functional, production-grade IoT streaming pipeline running on your local machine! This setup demonstrates real-world patterns used by companies processing billions of events per day.&lt;/p&gt;

&lt;p&gt;The best part? Everything runs in Docker containers orchestrated by Kubernetes, making it easy to understand, modify, and eventually deploy to production cloud environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AVRO schemas&lt;/strong&gt; provide efficient serialization and schema evolution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kind clusters&lt;/strong&gt; enable realistic Kubernetes testing locally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream processing&lt;/strong&gt; patterns work the same at any scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time analytics&lt;/strong&gt; can be achieved with open-source tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What would you build with this pipeline? Share your IoT project ideas in the comments!&lt;/strong&gt; 👇&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found this helpful? Follow me for more posts on real-time data engineering, Kubernetes, and distributed systems!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next in the series&lt;/strong&gt;: "Scaling to 1 Million Events/Second - Production Deployment Guide"&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 Repository Stats
&lt;/h2&gt;

&lt;p&gt;⭐ &lt;strong&gt;Star this repo&lt;/strong&gt; if you found it useful!&lt;br&gt;&lt;br&gt;
🐛 &lt;strong&gt;Issues/PRs welcome&lt;/strong&gt; - contributions appreciated&lt;br&gt;&lt;br&gt;
💼 &lt;strong&gt;Production ready&lt;/strong&gt; - scales to 1M events/sec&lt;br&gt;&lt;br&gt;
📖 &lt;strong&gt;Well documented&lt;/strong&gt; - complete guides included&lt;br&gt;&lt;br&gt;
🏗️ &lt;strong&gt;Local K8s Setup&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform/tree/main/local-setup/k8s" rel="noopener noreferrer"&gt;local-setup/k8s directory&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #kubernetes #iot #dataengineering #streaming #flink #pulsar #clickhouse #devops #avro #realtimeanalytics&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>iot</category>
      <category>dataengineering</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Real-Time Streaming Platform with Pulsar, Flink &amp; ClickHouse</title>
      <dc:creator>HyperscaleDesignHub</dc:creator>
      <pubDate>Sun, 26 Oct 2025 09:41:05 +0000</pubDate>
      <link>https://forem.com/vijaya_bhaskarv_ba95adf9/real-time-streaming-platform-with-pulsar-flink-clickhouse-1oac</link>
      <guid>https://forem.com/vijaya_bhaskarv_ba95adf9/real-time-streaming-platform-with-pulsar-flink-clickhouse-1oac</guid>
      <description>&lt;h1&gt;
  
  
  Real-Time Streaming Platform: Building Enterprise-Grade Data Infrastructure with Pulsar, Flink &amp;amp; ClickHouse
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;An overview of a comprehensive event streaming platform designed for high-throughput, real-time data processing across multiple domains&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Modern businesses generate massive amounts of data every second. From user interactions on e-commerce platforms to sensor readings in IoT networks, the ability to process and analyze this data in real-time has become a competitive necessity.&lt;/p&gt;

&lt;p&gt;I've built a comprehensive real-time streaming platform that tackles the fundamental challenges of &lt;strong&gt;scalable data ingestion&lt;/strong&gt;, &lt;strong&gt;real-time processing&lt;/strong&gt;, and &lt;strong&gt;high-performance analytics&lt;/strong&gt;. This platform is designed to handle workloads ranging from development environments to enterprise-scale deployments processing over 1 million messages per second.&lt;/p&gt;

&lt;h2&gt;
  
  
  🏗️ Platform Architecture
&lt;/h2&gt;

&lt;p&gt;The platform leverages three battle-tested open-source technologies, each serving a specific role in the data pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📊 Data Sources → 🚀 Apache Pulsar → ⚡ Apache Flink → 🏛️ ClickHouse → 📈 Real-time Analytics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffws9xd5qhjqmrmev2ymp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffws9xd5qhjqmrmev2ymp.png" alt=" " width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Components
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Apache Pulsar&lt;/strong&gt; - The messaging backbone&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distributed pub-sub messaging with multi-tenancy&lt;/li&gt;
&lt;li&gt;Built-in schema registry for AVRO serialization&lt;/li&gt;
&lt;li&gt;Geo-replication and tiered storage capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Apache Flink&lt;/strong&gt; - The processing engine  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stateful stream processing with exactly-once guarantees&lt;/li&gt;
&lt;li&gt;Complex event processing and windowed aggregations&lt;/li&gt;
&lt;li&gt;Fault-tolerant checkpointing and recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ClickHouse&lt;/strong&gt; - The analytical powerhouse&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Columnar database optimized for analytical queries&lt;/li&gt;
&lt;li&gt;Real-time ingestion with sub-second query performance&lt;/li&gt;
&lt;li&gt;Horizontal scaling across distributed clusters&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why This Combination?
&lt;/h3&gt;

&lt;p&gt;This architecture solves a critical integration challenge: ClickHouse lacks native Flink connector support (unlike databases such as MySQL or PostgreSQL). Our platform bridges this gap with a custom integration that maintains performance while ensuring data consistency.&lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 Platform Scalability
&lt;/h2&gt;

&lt;p&gt;The platform is designed with three distinct deployment tiers to accommodate different organizational needs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Infrastructure&lt;/th&gt;
&lt;th&gt;Target Audience&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Local Development&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1K msg/sec&lt;/td&gt;
&lt;td&gt;Docker + Kind&lt;/td&gt;
&lt;td&gt;Developers &amp;amp; Testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production Ready&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50K msg/sec&lt;/td&gt;
&lt;td&gt;AWS t3 instances&lt;/td&gt;
&lt;td&gt;SMBs &amp;amp; Growing Companies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enterprise Scale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1M+ msg/sec&lt;/td&gt;
&lt;td&gt;AWS c5 + NVMe&lt;/td&gt;
&lt;td&gt;Large Enterprises&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each configuration maintains the same architectural principles while scaling the underlying infrastructure to match performance requirements and budget constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔧 Platform Capabilities
&lt;/h2&gt;

&lt;h3&gt;
  
  
  High-Volume Event Generation
&lt;/h3&gt;

&lt;p&gt;The platform includes sophisticated event producers that simulate real-world data patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-domain events&lt;/strong&gt;: E-commerce, finance, IoT, gaming, logistics, social media&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AVRO serialization&lt;/strong&gt;: Schema evolution and type safety&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configurable throughput&lt;/strong&gt;: From thousands to millions of events per second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Realistic data patterns&lt;/strong&gt;: User sessions, device interactions, transaction flows&lt;/li&gt;
&lt;/ul&gt;
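&lt;p&gt;A producer along these lines can be sketched in a few lines of plain Java. The field names follow the SensorData AVRO schema used in this platform, but the value ranges and the &lt;code&gt;ProducerSketch&lt;/code&gt; class itself are illustrative assumptions, not the repository's actual generator:&lt;/p&gt;

```java
import java.util.Random;

public class ProducerSketch {
    // Schema-shaped event; field names mirror the SensorData AVRO record.
    record Reading(int sensorId, int sensorType, double temperature,
                   double humidity, double batteryLevel, int status, long timestamp) {}

    static Reading nextReading(Random rnd) {
        return new Reading(
                rnd.nextInt(10_000),            // sensorId
                1 + rnd.nextInt(8),             // sensorType code 1..8
                15.0 + rnd.nextDouble() * 25,   // temperature in Celsius (range is an assumption)
                rnd.nextDouble() * 100,         // humidity as a percentage
                rnd.nextDouble() * 100,         // batteryLevel as a percentage
                1 + rnd.nextInt(4),             // status code 1..4
                System.currentTimeMillis());    // timestamp-millis
    }

    public static void main(String[] args) {
        System.out.println(nextReading(new Random(42)));
    }
}
```

&lt;p&gt;A real producer would serialize each &lt;code&gt;Reading&lt;/code&gt; with the AVRO schema and hand it to a Pulsar producer at the configured target rate.&lt;/p&gt;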

&lt;h3&gt;
  
  
  Distributed Message Streaming
&lt;/h3&gt;

&lt;p&gt;Apache Pulsar provides the messaging infrastructure with enterprise features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenancy&lt;/strong&gt;: Isolated namespaces for different applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema registry&lt;/strong&gt;: Centralized schema management and evolution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geo-replication&lt;/strong&gt;: Cross-region data distribution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tiered storage&lt;/strong&gt;: Cost-effective long-term data retention&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Real-Time Stream Processing
&lt;/h3&gt;

&lt;p&gt;Apache Flink handles complex stream processing scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Windowed aggregations&lt;/strong&gt;: Time-based data summarization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateful processing&lt;/strong&gt;: Maintain context across events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exactly-once semantics&lt;/strong&gt;: Data consistency guarantees&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault tolerance&lt;/strong&gt;: Automatic recovery from failures&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  High-Performance Analytics
&lt;/h3&gt;

&lt;p&gt;ClickHouse delivers sub-second analytical query performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Columnar storage&lt;/strong&gt;: Optimized for analytical workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time ingestion&lt;/strong&gt;: Process streaming data as it arrives&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed queries&lt;/strong&gt;: Scale across multiple nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Materialized views&lt;/strong&gt;: Pre-computed aggregations for instant results&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🎯 Use Cases Across Industries
&lt;/h2&gt;

&lt;h3&gt;
  
  
  E-commerce
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time Inventory&lt;/strong&gt;: Track product availability across warehouses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommendation Engines&lt;/strong&gt;: Process user interactions for personalized suggestions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fraud Detection&lt;/strong&gt;: Analyze payment patterns for suspicious activity&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Finance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trading Analytics&lt;/strong&gt;: Process market data for algorithmic trading&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk Assessment&lt;/strong&gt;: Real-time calculation of portfolio risk metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance Monitoring&lt;/strong&gt;: Track transactions for regulatory compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  IoT &amp;amp; Manufacturing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Predictive Maintenance&lt;/strong&gt;: Analyze sensor data to predict equipment failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality Control&lt;/strong&gt;: Monitor production metrics in real-time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Energy Optimization&lt;/strong&gt;: Track and optimize energy consumption patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Gaming
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Player Analytics&lt;/strong&gt;: Real-time analysis of player behavior and engagement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live Leaderboards&lt;/strong&gt;: Update rankings and achievements instantly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Churn Prediction&lt;/strong&gt;: Identify players at risk of leaving&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🔍 Technical Innovations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Solving the Flink-ClickHouse Integration Challenge
&lt;/h3&gt;

&lt;p&gt;One of the most significant technical hurdles was integrating Apache Flink with ClickHouse. Unlike popular databases such as MySQL, PostgreSQL, or Elasticsearch that have official Flink connectors, ClickHouse lacks native integration support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Challenge:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Architectural mismatch&lt;/strong&gt;: Flink's continuous streaming model vs. ClickHouse's batch-oriented ingestion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transaction limitations&lt;/strong&gt;: ClickHouse lacks full ACID support for Flink's exactly-once guarantees&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance optimization&lt;/strong&gt;: Balancing throughput with data consistency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Our Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom JDBC sink implementation with idempotent writes&lt;/li&gt;
&lt;li&gt;Batch coordination aligned with Flink checkpoints&lt;/li&gt;
&lt;li&gt;Adaptive batching based on ClickHouse cluster performance&lt;/li&gt;
&lt;li&gt;Circuit breaker patterns for fault tolerance&lt;/li&gt;
&lt;/ul&gt;
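&lt;p&gt;The idempotent-write idea can be sketched without any JDBC dependency: buffer rows, flush when a checkpoint completes, and track committed row keys so that replayed writes after a recovery become no-ops. Names here are illustrative, not the platform's actual sink code:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.HashSet;

public class IdempotentBatchSink {
    // Rows waiting to be written; in the real sink this feeds a JDBC batch insert.
    private final ArrayList buffer = new ArrayList();
    // Keys already accepted; makes replayed records after recovery no-ops.
    private final HashSet seenKeys = new HashSet();
    private int flushes = 0;

    void invoke(String rowKey, String row) {
        if (seenKeys.contains(rowKey)) return; // duplicate from replay: skip
        buffer.add(rowKey + "|" + row);
        seenKeys.add(rowKey);
    }

    // Called when a Flink checkpoint completes, aligning the ClickHouse batch
    // with Flink's consistency boundary.
    void flushOnCheckpoint() {
        if (buffer.isEmpty()) return;
        flushes++;        // stand-in for statement.executeBatch()
        buffer.clear();
    }

    int flushCount() { return flushes; }
    int bufferedRows() { return buffer.size(); }
}
```

&lt;p&gt;Aligning flushes with checkpoints means a failure between checkpoints replays at most one batch, and the key set turns that replay into duplicate-free writes.&lt;/p&gt;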

&lt;h3&gt;
  
  
  Multi-Domain Event Schema Design
&lt;/h3&gt;

&lt;p&gt;The platform supports diverse event types across industries through a flexible AVRO schema approach. Here's an example of the IoT sensor data schema used in the platform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"record"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SensorData"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"namespace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"org.apache.pulsar.testclient.avro"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"IoT Sensor Data for Pulsar Performance Testing - Optimized Integer Schema"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sensorId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"int"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Unique sensor identifier (integer for efficiency)"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sensorType"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"int"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Type of sensor (1=temperature, 2=humidity, 3=pressure, 4=motion, 5=light, 6=co2, 7=noise, 8=multisensor)"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"temperature"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"double"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Temperature reading in Celsius"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"humidity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"double"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
      &lt;/span&gt;&lt;span class="nl"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Humidity reading as percentage"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pressure"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"double"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pressure reading in hPa"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"batteryLevel"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"double"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Battery level as percentage"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"int"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Sensor status (1=online, 2=offline, 3=maintenance, 4=error)"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"long"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"logicalType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"timestamp-millis"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Timestamp in milliseconds since epoch"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Schema Design Highlights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance optimized&lt;/strong&gt;: Uses integers for enums and identifiers to minimize serialization overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-sensor support&lt;/strong&gt;: Single schema accommodates 8 different sensor types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive telemetry&lt;/strong&gt;: Captures environmental data, device health, and operational status&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temporal precision&lt;/strong&gt;: Millisecond-level timestamps for accurate event ordering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This design enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema evolution&lt;/strong&gt;: Backward and forward compatibility through AVRO&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type safety&lt;/strong&gt;: Compile-time validation across the pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-domain analytics&lt;/strong&gt;: Unified event processing across business units&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High throughput&lt;/strong&gt;: Optimized data types for maximum serialization performance&lt;/li&gt;
&lt;/ul&gt;
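&lt;p&gt;As a rough illustration of the "optimized data types" point, the sketch below (plain Python, &lt;em&gt;not&lt;/em&gt; the actual Avro wire format; the fixed binary layout is an assumption for illustration) compares a fixed-width packing of one reading against its JSON form:&lt;/p&gt;

```python
import json
import struct

# Field names mirror the schema above; the binary layout is illustrative,
# not real Avro encoding.
reading = {
    "temperature": 21.5,
    "humidity": 48.0,
    "pressure": 1013.2,
    "batteryLevel": 87.0,
    "status": 1,                 # 1=online, per the schema's doc string
    "timestamp": 1730000000000,  # milliseconds since epoch
}

# Four doubles, one 32-bit int, one 64-bit long: 4*8 + 4 + 8 = 44 bytes.
packed = struct.pack(
    ">ddddiq",
    reading["temperature"], reading["humidity"], reading["pressure"],
    reading["batteryLevel"], reading["status"], reading["timestamp"],
)

json_bytes = json.dumps(reading).encode("utf-8")
print(len(packed), len(json_bytes))  # 44 bytes binary vs. a much larger JSON payload
```

&lt;p&gt;At a million messages per second, shaving tens of bytes per event off the wire format compounds into real network and CPU savings, which is why the schema favors ints and doubles over strings.&lt;/p&gt;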

&lt;h2&gt;
  
  
  📊 Performance &amp;amp; Scale
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Benchmark Results
&lt;/h3&gt;

&lt;p&gt;The platform has been benchmarked at each of its deployment scales; the numbers below are from the largest configuration:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise Configuration (1M+ msg/sec):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt;: 1,000,000+ messages per second sustained&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end latency&lt;/strong&gt;: P99 &amp;lt; 100ms
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query performance&lt;/strong&gt;: Sub-second analytical queries on billions of records&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Availability&lt;/strong&gt;: 99.9% uptime with automatic failover&lt;/li&gt;
&lt;/ul&gt;
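&lt;p&gt;For readers new to percentile latency targets: "P99 &amp;lt; 100ms" means 99% of events complete the pipeline within 100ms. A minimal sketch of how a P99 figure is computed from latency samples (synthetic data, standard library only):&lt;/p&gt;

```python
import random
import statistics

# Hypothetical end-to-end latency samples in milliseconds.
random.seed(7)
samples = [random.lognormvariate(3.0, 0.6) for _ in range(100_000)]

# statistics.quantiles with n=100 returns 99 cut points; index 98 is P99.
p99 = statistics.quantiles(samples, n=100)[98]
median = statistics.median(samples)
print(f"median={median:.1f} ms  P99={p99:.1f} ms")
```

&lt;p&gt;Tail percentiles, not averages, are what matter for a latency SLO: a healthy median can hide a long tail that users of a fraud-detection or personalization API will feel.&lt;/p&gt;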

&lt;h3&gt;
  
  
  Technology Stack
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure &amp;amp; Orchestration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes (EKS/Kind) for container orchestration
&lt;/li&gt;
&lt;li&gt;Terraform for infrastructure as code&lt;/li&gt;
&lt;li&gt;Docker for containerization&lt;/li&gt;
&lt;li&gt;Helm for application deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Monitoring &amp;amp; Observability:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grafana dashboards for real-time metrics&lt;/li&gt;
&lt;li&gt;Prometheus for metrics collection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Planned for future work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom alerting for system health&lt;/li&gt;
&lt;li&gt;Distributed tracing for performance debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🛠️ Exploring the Platform
&lt;/h2&gt;

&lt;p&gt;The complete platform is available as an open-source project on GitHub, featuring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive documentation&lt;/strong&gt; for each deployment configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure templates&lt;/strong&gt; using Terraform and Kubernetes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring setup&lt;/strong&gt; with pre-configured Grafana dashboards
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sample applications&lt;/strong&gt; demonstrating multi-domain event processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance tuning guides&lt;/strong&gt; for production optimization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;RealtimeDataPlatform&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Whether you're building a proof of concept or deploying at enterprise scale, the platform provides the foundation for modern real-time analytics infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  📈 Observability &amp;amp; Operations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Comprehensive Monitoring:&lt;/strong&gt;&lt;br&gt;
The platform includes production-ready observability through Grafana dashboards tracking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Message flow metrics&lt;/strong&gt;: Throughput, latency, and backlog across all components&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System health&lt;/strong&gt;: Resource utilization, error rates, and availability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business metrics&lt;/strong&gt;: Event processing rates by domain and event type&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance insights&lt;/strong&gt;: Query execution times and optimization opportunities&lt;/li&gt;
&lt;/ul&gt;
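&lt;p&gt;A minimal sketch of the first metric family (class and method names are illustrative, not the platform's API): sliding-window throughput plus producer/consumer backlog, the two numbers a streaming dashboard watches first:&lt;/p&gt;

```python
from collections import deque

class FlowMetrics:
    """Toy flow-metrics tracker: windowed throughput and backlog."""

    def __init__(self, window_s=60.0):
        self.window_s = window_s
        self.events = deque()  # (timestamp, produced_count) pairs
        self.produced = 0
        self.consumed = 0

    def record(self, ts, produced=0, consumed=0):
        self.produced += produced
        self.consumed += consumed
        self.events.append((ts, produced))
        # Evict samples that fell out of the sliding window.
        while self.events and ts - self.events[0][0] > self.window_s:
            self.events.popleft()

    def throughput(self):
        """Messages/sec over the current window."""
        if len(self.events) < 2:
            return 0.0
        span = self.events[-1][0] - self.events[0][0]
        return sum(c for _, c in self.events) / span

    def backlog(self):
        """Messages produced but not yet consumed."""
        return self.produced - self.consumed

m = FlowMetrics()
for i in range(10):  # one batch per second: 1000 in, 900 out
    m.record(ts=float(i), produced=1000, consumed=900)
print(m.throughput(), m.backlog())
```

&lt;p&gt;A steadily growing backlog at flat throughput is the classic early warning that consumers are underprovisioned, which is exactly what these dashboards are meant to surface.&lt;/p&gt;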

&lt;p&gt;&lt;strong&gt;Operational Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checkpoint-based recovery for zero data loss&lt;/li&gt;
&lt;li&gt;Horizontal scaling based on workload patterns&lt;/li&gt;
&lt;li&gt;Cost tracking and optimization recommendations&lt;/li&gt;
&lt;/ul&gt;
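&lt;p&gt;A minimal sketch of the checkpoint idea (the file path and JSON format are assumptions, not the platform's actual mechanism, which relies on Flink's checkpointing): write the offset to a temporary file and atomically rename it, so a crash can never leave a half-written checkpoint behind:&lt;/p&gt;

```python
import json
import os
import tempfile

def save_checkpoint(path, offset):
    # Write to a temp file in the same directory, then atomically rename.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp, path)  # atomic: readers never observe a partial file

def load_checkpoint(path, default=0):
    # On restart, resume from the last durable offset (or the default).
    try:
        with open(path) as f:
            return json.load(f)["offset"]
    except FileNotFoundError:
        return default

ckpt = os.path.join(tempfile.mkdtemp(), "consumer.ckpt")
save_checkpoint(ckpt, 123456)
print(load_checkpoint(ckpt))
```

&lt;p&gt;Combined with replayable storage on the broker side, resuming from the checkpointed offset reprocesses at most the in-flight events rather than losing them.&lt;/p&gt;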

&lt;h2&gt;
  
  
  🔮 Platform Evolution
&lt;/h2&gt;

&lt;p&gt;The platform continues to evolve with planned enhancements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cloud deployment&lt;/strong&gt; across AWS, GCP, and Azure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream SQL interface&lt;/strong&gt; for simplified data transformations
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ML pipeline integration&lt;/strong&gt; for real-time inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced security&lt;/strong&gt; with end-to-end encryption&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intelligent auto-scaling&lt;/strong&gt; based on workload patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💡 Key Insights
&lt;/h2&gt;

&lt;p&gt;Building this real-time streaming platform highlighted several critical design principles:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Component Synergy&lt;/strong&gt;: The combination of Pulsar's messaging reliability, Flink's processing power, and ClickHouse's analytical performance creates a platform greater than the sum of its parts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Integration Complexity&lt;/strong&gt;: Solving the Flink-ClickHouse integration challenge required custom solutions, but the performance benefits justify the engineering investment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Scalable Architecture&lt;/strong&gt;: Designing for multiple deployment tiers from day one enables organizations to start small and scale without architectural rewrites.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Operational Excellence&lt;/strong&gt;: Production-ready monitoring and automation are essential for managing distributed streaming systems at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Cost Optimization&lt;/strong&gt;: Thoughtful resource allocation and component tuning can achieve enterprise performance at reasonable operational costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  🤝 Community &amp;amp; Future
&lt;/h2&gt;

&lt;p&gt;This platform represents a comprehensive approach to real-time data infrastructure that balances performance, cost, and operational simplicity. By open-sourcing the complete solution, the goal is to accelerate adoption of modern streaming architectures across the industry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interested in real-time streaming platforms?&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⭐ Star the &lt;a href="https://github.com/hyperscaledesignhub/RealtimeDataPlatform" rel="noopener noreferrer"&gt;repository&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;💬 Share your streaming architecture experiences in the comments&lt;/li&gt;
&lt;li&gt;🔗 Connect for discussions about real-time data engineering challenges&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;What's your experience with real-time streaming platforms? Have you tackled similar integration challenges?&lt;/strong&gt; I'd love to hear about your approach and lessons learned!&lt;/p&gt;

&lt;p&gt;#RealTimeAnalytics #ApachePulsar #ApacheFlink #ClickHouse #DataEngineering #StreamProcessing #BigData #EventStreaming&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>dataengineering</category>
      <category>performance</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
