<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: j raja mohan</title>
    <description>The latest articles on Forem by j raja mohan (@j_rajamohan_b845f4d92063).</description>
    <link>https://forem.com/j_rajamohan_b845f4d92063</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3697832%2F4ae22602-634c-44e6-b89e-b12e3256f081.png</url>
      <title>Forem: j raja mohan</title>
      <link>https://forem.com/j_rajamohan_b845f4d92063</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/j_rajamohan_b845f4d92063"/>
    <language>en</language>
    <item>
      <title>Kafka Ingestion &amp; Processing at Scale | Rajamohan Jabbala</title>
      <dc:creator>j raja mohan</dc:creator>
      <pubDate>Wed, 07 Jan 2026 08:04:53 +0000</pubDate>
      <link>https://forem.com/j_rajamohan_b845f4d92063/kafka-ingestion-processing-at-scale-rajamohan-jabbala-23i3</link>
      <guid>https://forem.com/j_rajamohan_b845f4d92063/kafka-ingestion-processing-at-scale-rajamohan-jabbala-23i3</guid>
      <description>&lt;p&gt;Most Kafka failures don’t happen because Kafka can’t scale.&lt;br&gt;
They happen because teams never did the math.&lt;/p&gt;

&lt;p&gt;This post walks through a capacity-driven approach to designing Kafka pipelines that scale predictably—before traffic spikes expose weak assumptions.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What “Good” Looks Like (SLOs First)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A production Kafka pipeline should:&lt;/p&gt;

&lt;p&gt;Handle N msgs/sec per topic with low latency&lt;/p&gt;

&lt;p&gt;Scale linearly via partitions, consumers, brokers&lt;/p&gt;

&lt;p&gt;Guarantee at-least-once (or exactly-once) semantics&lt;/p&gt;

&lt;p&gt;Support fan-out via consumer groups&lt;/p&gt;

&lt;p&gt;Stay within clear lag, throughput, durability SLOs&lt;/p&gt;

&lt;p&gt;Example SLOs&lt;/p&gt;

&lt;p&gt;Produce latency p99 ≤ X ms&lt;/p&gt;

&lt;p&gt;Consumer lag ≤ Y sec (steady state)&lt;/p&gt;

&lt;p&gt;Recovery ≤ Z min after 2× spike&lt;/p&gt;

&lt;p&gt;Availability ≥ 99.9%&lt;/p&gt;

&lt;p&gt;If you don’t define these, Kafka tuning becomes superstition.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Kafka Mechanics (The Parts That Actually Matter)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Topics scale via partitions&lt;/p&gt;

&lt;p&gt;Each partition is read by at most one consumer within a group&lt;/p&gt;

&lt;p&gt;Multiple consumer groups = independent re-reads (fan-out)&lt;/p&gt;

&lt;p&gt;Rule:&lt;/p&gt;

&lt;p&gt;Max useful consumers in one group = number of partitions.&lt;/p&gt;

&lt;p&gt;Consumers beyond that count sit idle, so more consumers ≠ more throughput.&lt;/p&gt;
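
&lt;p&gt;To make the rule concrete, here is a minimal sketch of how partitions spread over the consumers of one group. It assumes a plain round-robin assignment; Kafka's real assignors (range, round-robin, sticky) differ in detail, and the function names are hypothetical, but the ceiling is the same.&lt;/p&gt;

```python
def assign_partitions(num_partitions, num_consumers):
    """Simplified round-robin assignment of partitions to one group's consumers.

    Illustration only: real Kafka assignors are more involved, but none of
    them gives a single partition to two consumers in the same group.
    """
    assignment = {consumer: [] for consumer in range(num_consumers)}
    for partition in range(num_partitions):
        assignment[partition % num_consumers].append(partition)
    return assignment


def idle_consumers(assignment):
    """Consumers that received no partition and therefore do no work."""
    return [consumer for consumer, parts in assignment.items() if not parts]


# 6 partitions, 8 consumers: the group has two more members than partitions.
plan = assign_partitions(num_partitions=6, num_consumers=8)
print("assignment:", plan)
print("idle consumers:", idle_consumers(plan))
```

&lt;p&gt;With 6 partitions and 8 consumers, two consumers end up with nothing to read. Adding partitions is the only way to put them to work.&lt;/p&gt;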

&lt;ol start="3"&gt;
&lt;li&gt;Logical Architecture&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Producers → orders topic (P partitions, RF=3)&lt;/p&gt;

&lt;p&gt;Kafka cluster distributes:&lt;/p&gt;

&lt;p&gt;One leader per partition&lt;/p&gt;

&lt;p&gt;Replicas across brokers&lt;/p&gt;

&lt;p&gt;Downstream:&lt;/p&gt;

&lt;p&gt;Fraud consumer group&lt;/p&gt;

&lt;p&gt;Analytics consumer group&lt;/p&gt;

&lt;p&gt;ML feature consumer group&lt;/p&gt;

&lt;p&gt;Each group owns its own offsets and scales independently.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Capacity Planning (The Non-Negotiable Step)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Inputs&lt;/p&gt;

&lt;p&gt;T = msgs/sec&lt;/p&gt;

&lt;p&gt;S = avg msg size (bytes, compressed)&lt;/p&gt;

&lt;p&gt;R = replication factor&lt;/p&gt;

&lt;p&gt;C = msgs/sec per consumer (measured)&lt;/p&gt;

&lt;p&gt;H = headroom (1.3–2×)&lt;/p&gt;

&lt;p&gt;RetentionDays&lt;/p&gt;

&lt;p&gt;Core formulas&lt;/p&gt;

&lt;p&gt;Partitions&lt;/p&gt;

&lt;p&gt;P = ceil((T / C) × H)&lt;/p&gt;

&lt;p&gt;Ingress (cluster-wide write bandwidth, including replication)&lt;/p&gt;

&lt;p&gt;Ingress = T × S × R&lt;/p&gt;

&lt;p&gt;Egress (per group)&lt;/p&gt;

&lt;p&gt;Egress = T × S&lt;/p&gt;

&lt;p&gt;Storage/day (leaders only)&lt;/p&gt;

&lt;p&gt;StoragePerDay = T × S × 86,400&lt;/p&gt;

&lt;p&gt;Total storage&lt;/p&gt;

&lt;p&gt;TotalStorage = T × S × 86,400 × R × RetentionDays&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;Example (1M msgs/sec)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;T = 1,000,000 msgs/sec&lt;/p&gt;

&lt;p&gt;S = 200 bytes&lt;/p&gt;

&lt;p&gt;C = 25k msgs/sec/consumer&lt;/p&gt;

&lt;p&gt;H = 1.5&lt;/p&gt;

&lt;p&gt;R = 3 (RF=3)&lt;/p&gt;

&lt;p&gt;RetentionDays = 3&lt;/p&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;p&gt;60 partitions&lt;/p&gt;

&lt;p&gt;~600 MB/sec (~572 MiB/sec) ingress, including replication&lt;/p&gt;

&lt;p&gt;~200 MB/sec (~191 MiB/sec) egress per consumer group&lt;/p&gt;

&lt;p&gt;~155 TB (~141 TiB) for 3-day retention (with replicas)&lt;/p&gt;

&lt;p&gt;This is why “just add brokers” isn’t a strategy.&lt;/p&gt;
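
&lt;p&gt;The arithmetic above can be reproduced with a short script. This is a sketch of the post's formulas only; the function name and dictionary keys are made up for illustration, and a real plan would also account for compression ratios, indexes, and per-broker limits.&lt;/p&gt;

```python
import math


def capacity_plan(t, s, r, c, h, retention_days):
    """Apply the sizing formulas from this post.

    t: msgs/sec, s: avg msg size in bytes, r: replication factor,
    c: measured msgs/sec per consumer, h: headroom multiplier.
    """
    storage_per_day = t * s * 86_400  # leaders only, bytes per day
    return {
        "partitions": math.ceil((t / c) * h),
        "ingress_bytes_per_sec": t * s * r,   # writes including replication
        "egress_bytes_per_sec": t * s,        # per consumer group
        "storage_bytes": storage_per_day * r * retention_days,
    }


plan = capacity_plan(t=1_000_000, s=200, r=3, c=25_000, h=1.5, retention_days=3)

MIB = 2 ** 20
print("partitions:", plan["partitions"])
print("ingress MB/s:", plan["ingress_bytes_per_sec"] / 1e6,
      "MiB/s:", round(plan["ingress_bytes_per_sec"] / MIB))
print("egress MB/s:", plan["egress_bytes_per_sec"] / 1e6,
      "MiB/s:", round(plan["egress_bytes_per_sec"] / MIB))
print("storage TB:", plan["storage_bytes"] / 1e12)
```

&lt;p&gt;Note that the ~572 and ~191 figures are binary mebibytes per second, while the ~155 figure is decimal terabytes. Pick one unit system before comparing against NIC or disk specs.&lt;/p&gt;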

&lt;ol start="6"&gt;
&lt;li&gt;Partitioning That Doesn’t Backfire&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Use high-cardinality keys (order_id, not country)&lt;/p&gt;

&lt;p&gt;Monitor skew aggressively&lt;/p&gt;

&lt;p&gt;Slightly over-partition early to avoid re-sharding later&lt;/p&gt;
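
&lt;p&gt;A rough way to see why key cardinality matters: hash the keys and count how many partitions actually receive traffic. In this sketch, crc32 is only a stand-in for Kafka's default murmur2-based partitioner, and the country list and order_id format are invented sample data.&lt;/p&gt;

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    # Assumption: crc32 stands in for Kafka's default key hash (murmur2).
    # The exact placement differs, but the skew behaviour is the same.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_load(keys, num_partitions):
    """Count messages per partition for a stream of keys."""
    return Counter(partition_for(key, num_partitions) for key in keys)


NUM_PARTITIONS = 60
NUM_MESSAGES = 100_000
countries = ["US", "IN", "DE", "BR", "JP"]  # hypothetical low-cardinality key

by_country = partition_load(
    [countries[i % len(countries)] for i in range(NUM_MESSAGES)], NUM_PARTITIONS
)
by_order = partition_load(
    [f"order-{i}" for i in range(NUM_MESSAGES)], NUM_PARTITIONS
)

print("partitions used with country key:", len(by_country))
print("partitions used with order_id key:", len(by_order))
print("hottest partition share (country):",
      max(by_country.values()) / NUM_MESSAGES)
```

&lt;p&gt;With five distinct keys, at most five of the 60 partitions can ever receive data, no matter how many consumers the group runs.&lt;/p&gt;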

&lt;ol start="7"&gt;
&lt;li&gt;Consumer Group Scaling&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Scale consumers up to P&lt;/p&gt;

&lt;p&gt;Use separate groups for separate pipelines&lt;/p&gt;

&lt;p&gt;Autoscale on lag growth (how fast lag is rising), not raw lag: a large backlog that is already shrinking needs no extra consumers&lt;/p&gt;
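
&lt;p&gt;One possible shape for such an autoscaling rule is sketched below. It assumes lag is sampled as (timestamp, total lag) pairs and that the per-consumer rate C has been measured; the function names and the sizing rule are illustrative, not a standard Kafka API.&lt;/p&gt;

```python
import math


def lag_growth_rate(samples):
    """Average lag growth in msgs/sec.

    samples: list of (timestamp_sec, total_lag_msgs), oldest first.
    """
    (t_first, lag_first), (t_last, lag_last) = samples[0], samples[-1]
    return (lag_last - lag_first) / (t_last - t_first)


def desired_consumers(current, partitions, samples, per_consumer_rate):
    """Add enough consumers to absorb the growth rate, capped at P."""
    growth = lag_growth_rate(samples)
    if growth > 0:
        # Lag growing at G msgs/sec means the group is short by G msgs/sec.
        extra = math.ceil(growth / per_consumer_rate)
        return min(partitions, current + extra)
    # Flat or shrinking lag: the group is keeping up, whatever the backlog.
    return current


growing = [(0, 100_000), (60, 400_000)]   # lag rising by 5,000 msgs/sec
steady = [(0, 500_000), (60, 500_000)]    # large backlog, but not growing

print(desired_consumers(10, 60, growing, per_consumer_rate=25_000))
print(desired_consumers(10, 60, steady, per_consumer_rate=25_000))
```

&lt;p&gt;The second case is the point of the rule: a raw-lag threshold would scale up on the 500,000-message backlog even though the group is already keeping pace.&lt;/p&gt;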

&lt;ol start="8"&gt;
&lt;li&gt;Reliability Defaults That Work&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;acks=all&lt;/p&gt;

&lt;p&gt;min.insync.replicas=2 (with RF=3)&lt;/p&gt;

&lt;p&gt;Idempotent producers&lt;/p&gt;

&lt;p&gt;Disable unclean leader election&lt;/p&gt;

&lt;p&gt;Rack/AZ-aware replicas&lt;/p&gt;

&lt;p&gt;Exactly-once only where business semantics demand it.&lt;/p&gt;
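
&lt;p&gt;As a sketch, the defaults above map onto these Kafka configuration keys. They are shown as plain Python dictionaries rather than any particular client's API, so how the values are passed (booleans versus strings) is an assumption that depends on the client in use.&lt;/p&gt;

```python
REPLICATION_FACTOR = 3

# Producer-side settings.
producer_config = {
    "acks": "all",                 # wait for all in-sync replicas
    "enable.idempotence": True,    # idempotent producer
}

# Topic-level (or broker-default) settings.
topic_config = {
    "min.insync.replicas": 2,
    "unclean.leader.election.enable": False,
}

# Rack/AZ awareness is set on each broker through broker.rack;
# the value is deployment-specific, so none is shown here.


def tolerated_broker_failures(replication_factor, min_insync_replicas):
    """Replicas that can be lost while acks=all writes still succeed."""
    return replication_factor - min_insync_replicas


print(tolerated_broker_failures(
    REPLICATION_FACTOR, topic_config["min.insync.replicas"]
))
```

&lt;p&gt;With RF=3 and min.insync.replicas=2, one replica can be down and producers using acks=all can still write. Setting min.insync.replicas equal to RF would block writes on any single failure.&lt;/p&gt;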

&lt;ol start="9"&gt;
&lt;li&gt;Observability &amp;gt; Tuning&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Watch:&lt;/p&gt;

&lt;p&gt;Lag growth per partition&lt;/p&gt;

&lt;p&gt;p95/p99 produce &amp;amp; consume latency&lt;/p&gt;

&lt;p&gt;Under-replicated partitions&lt;/p&gt;

&lt;p&gt;Disk, NIC, controller health&lt;/p&gt;

&lt;p&gt;Scale:&lt;/p&gt;

&lt;p&gt;Consumers → when lag keeps growing&lt;/p&gt;

&lt;p&gt;Partitions → when consumers are saturated (the group is already at P)&lt;/p&gt;

&lt;p&gt;Brokers → when disk or NIC is under pressure&lt;/p&gt;
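
&lt;p&gt;The three scaling triggers can be written as a small decision function. This is a sketch only: the 0.8 utilization threshold, the order of the checks, and all names are assumptions chosen for illustration, not recommendations from a Kafka reference.&lt;/p&gt;

```python
def scaling_action(lag_growing, consumers, partitions,
                   disk_util, nic_util, pressure_threshold=0.8):
    """Map observed signals to the scaling lever described above.

    disk_util and nic_util are fractions between 0.0 and 1.0.
    pressure_threshold is an assumed cut-off, tune it to your SLOs.
    """
    if max(disk_util, nic_util) >= pressure_threshold:
        return "add brokers"
    if lag_growing and consumers >= partitions:
        # The group is already at P, so more consumers would sit idle.
        return "add partitions"
    if lag_growing:
        return "add consumers"
    return "no action"


print(scaling_action(True, consumers=20, partitions=60,
                     disk_util=0.4, nic_util=0.5))
print(scaling_action(True, consumers=60, partitions=60,
                     disk_util=0.4, nic_util=0.5))
print(scaling_action(False, consumers=60, partitions=60,
                     disk_util=0.9, nic_util=0.5))
```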

&lt;p&gt;Final Takeaway&lt;/p&gt;

&lt;p&gt;Kafka doesn’t scale because it’s Kafka.&lt;br&gt;
It scales because you designed it to.&lt;/p&gt;

&lt;p&gt;Math beats hope.&lt;br&gt;
Measurements beat myths.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>architecture</category>
      <category>scalability</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
