<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aleksandr</title>
    <description>The latest articles on Forem by Aleksandr (@writingmuffin).</description>
    <link>https://forem.com/writingmuffin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3821183%2F192edf9b-1e82-4e5e-91c8-448f19b04b2f.jpeg</url>
      <title>Forem: Aleksandr</title>
      <link>https://forem.com/writingmuffin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/writingmuffin"/>
    <language>en</language>
    <item>
      <title>Message Broker Throughput: RabbitMQ vs Kafka vs NATS</title>
      <dc:creator>Aleksandr</dc:creator>
      <pubDate>Wed, 01 Apr 2026 13:22:23 +0000</pubDate>
      <link>https://forem.com/writingmuffin/message-broker-throughput-rabbitmq-vs-kafka-vs-nats-11hd</link>
      <guid>https://forem.com/writingmuffin/message-broker-throughput-rabbitmq-vs-kafka-vs-nats-11hd</guid>
      <description>&lt;p&gt;I started using NATS in one of my projects and was generally happy with it, but I wanted to verify the performance claims for myself. Is it really as fast as people say, or is that just marketing and cherry-picked benchmarks? The best way to find out was to write my own tests and compare NATS against the two most common alternatives: RabbitMQ and Kafka.&lt;/p&gt;

&lt;p&gt;This post covers throughput testing of all three brokers on two messaging patterns: async producer-consumer queue, and request-reply. Request-reply is not the typical use case for message brokers, but NATS supports it natively, so it was worth measuring how the others perform when forced into that pattern.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Interactive results page: &lt;a href="https://petrolmuffin.github.io/BrokersPerformance/" rel="noopener noreferrer"&gt;https://petrolmuffin.github.io/BrokersPerformance/&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/PetrolMuffin/BrokersPerformance" rel="noopener noreferrer"&gt;https://github.com/PetrolMuffin/BrokersPerformance&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Test Environment&lt;/h2&gt;

&lt;p&gt;All three brokers ran in Docker containers on the same host. No custom tuning was applied to any broker: default configurations only.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU:&lt;/strong&gt; AMD Ryzen 7 8845HS, 8 cores / 16 threads, 3.80 GHz&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS:&lt;/strong&gt; Windows 11 (10.0.26200)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime:&lt;/strong&gt; .NET 10.0.4, RyuJIT x86-64-v4&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmarking framework:&lt;/strong&gt; BenchmarkDotNet v0.15.8&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Broker Versions and Configuration&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Broker&lt;/th&gt;
&lt;th&gt;Docker Image&lt;/th&gt;
&lt;th&gt;Client Version&lt;/th&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RabbitMQ&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rabbitmq:4.2-management&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;RabbitMQ.Client&lt;/code&gt; v7.2.1&lt;/td&gt;
&lt;td&gt;Default settings, AMQP 0.9.1, guest/guest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kafka&lt;/td&gt;
&lt;td&gt;&lt;code&gt;apache/kafka:4.2.0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Confluent.Kafka&lt;/code&gt; v2.13.2&lt;/td&gt;
&lt;td&gt;KRaft mode (no ZooKeeper), single node, 1 partition, replication factor = 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NATS&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nats:2.12-alpine&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;NATS.Net&lt;/code&gt; v2.7.3&lt;/td&gt;
&lt;td&gt;JetStream enabled (&lt;code&gt;-js&lt;/code&gt; flag)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;Idle RAM Consumption (Docker, Cold Start)&lt;/h3&gt;

&lt;p&gt;Measured via &lt;code&gt;docker stats&lt;/code&gt; on freshly started containers with no accumulated data or active connections:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Broker&lt;/th&gt;
&lt;th&gt;RAM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NATS&lt;/td&gt;
&lt;td&gt;6 MiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RabbitMQ&lt;/td&gt;
&lt;td&gt;122 MiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kafka&lt;/td&gt;
&lt;td&gt;327 MiB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Kafka's JVM-based architecture is immediately visible: 54x the memory of NATS and 2.7x that of RabbitMQ on cold start. NATS is the lightest at 6 MiB.&lt;/p&gt;

&lt;h3&gt;BenchmarkDotNet Configuration&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;3 warmup iterations, 10 measured iterations per scenario&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;InvocationCount = 1&lt;/code&gt;, &lt;code&gt;UnrollFactor = 1&lt;/code&gt; (each iteration is a single benchmark call)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;RunStrategy = Monitoring&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;GC: non-concurrent, forced collections, non-server mode&lt;/li&gt;
&lt;li&gt;ThreadPool: minimum 100 worker + 100 I/O completion port threads&lt;/li&gt;
&lt;li&gt;Reported metrics: Mean, StdDev, P95, Op/s, Allocated memory&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note on metric choice:&lt;/strong&gt; all result tables below use P95 (95th percentile) rather than Mean. P95 better represents worst-case performance a system will realistically encounter, filtering out warm-up noise while capturing tail latency.&lt;/p&gt;
&lt;/blockquote&gt;
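
&lt;p&gt;For reference, the settings above correspond roughly to the following BenchmarkDotNet configuration. This is an illustrative sketch, not the exact class from the repository; the class name is mine:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using BenchmarkDotNet.Columns;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Engines;
using BenchmarkDotNet.Jobs;

public class BrokerBenchmarkConfig : ManualConfig
{
    public BrokerBenchmarkConfig()
    {
        // Monitoring strategy: each iteration is one full scenario run.
        AddJob(Job.Default
            .WithStrategy(RunStrategy.Monitoring)
            .WithWarmupCount(3)
            .WithIterationCount(10)
            .WithInvocationCount(1)
            .WithUnrollFactor(1)
            .WithGcConcurrent(false)
            .WithGcForce(true)
            .WithGcServer(false));

        AddColumn(StatisticColumn.P95);
        AddDiagnoser(MemoryDiagnoser.Default);

        // The ThreadPool floor is raised once at process start:
        // ThreadPool.SetMinThreads(workerThreads: 100, completionPortThreads: 100);
    }
}
&lt;/code&gt;&lt;/pre&gt;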

&lt;h2&gt;Test Parameters&lt;/h2&gt;

&lt;p&gt;The message counts and payload sizes were chosen to cover two dimensions: the number of concurrent messages the broker must route, and the size of individual payloads. Counts are inversely proportional to payload size to keep total benchmark runtime within a few minutes per scenario while still loading the broker enough to reveal its throughput characteristics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Async queue&lt;/strong&gt; (250 concurrent publishers, 1 consumer):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Messages&lt;/th&gt;
&lt;th&gt;Payload&lt;/th&gt;
&lt;th&gt;Total Volume&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;50,000&lt;/td&gt;
&lt;td&gt;256 B&lt;/td&gt;
&lt;td&gt;12.8 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25,000&lt;/td&gt;
&lt;td&gt;1 KB&lt;/td&gt;
&lt;td&gt;25 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000&lt;/td&gt;
&lt;td&gt;4 KB&lt;/td&gt;
&lt;td&gt;40 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5,000&lt;/td&gt;
&lt;td&gt;64 KB&lt;/td&gt;
&lt;td&gt;327 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2,500&lt;/td&gt;
&lt;td&gt;128 KB&lt;/td&gt;
&lt;td&gt;327 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Request-reply&lt;/strong&gt; (150 concurrent publishers):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Messages&lt;/th&gt;
&lt;th&gt;Payload&lt;/th&gt;
&lt;th&gt;Total Volume&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;25,000&lt;/td&gt;
&lt;td&gt;256 B&lt;/td&gt;
&lt;td&gt;6.4 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000&lt;/td&gt;
&lt;td&gt;1 KB&lt;/td&gt;
&lt;td&gt;10 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5,000&lt;/td&gt;
&lt;td&gt;4 KB&lt;/td&gt;
&lt;td&gt;20 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The async pattern uses more publishers (250 vs 150) and reaches larger payloads because bulk throughput is the primary concern. Request-reply uses fewer messages and smaller payloads, reflecting the typical RPC use case where latency matters more than volume.&lt;/p&gt;

&lt;h2&gt;Implementation Details&lt;/h2&gt;

&lt;h3&gt;Async Queue (Producer-Consumer)&lt;/h3&gt;

&lt;p&gt;All three implementations follow the same structure: N publishers concurrently push messages into a queue/topic/stream, one consumer reads everything. The benchmark measures wall-clock time from the first publish to the last received message.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RabbitMQ:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Persistent messages (&lt;code&gt;DeliveryMode = Persistent&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;QoS: prefetch count = 100&lt;/li&gt;
&lt;li&gt;Manual ACK&lt;/li&gt;
&lt;li&gt;Separate &lt;code&gt;IConnection&lt;/code&gt; for publisher and consumer&lt;/li&gt;
&lt;li&gt;Completion tracked via &lt;code&gt;CounterCompletionSource&lt;/code&gt; (atomic increment + &lt;code&gt;TaskCompletionSource&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
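
&lt;p&gt;The &lt;code&gt;CounterCompletionSource&lt;/code&gt; mentioned above can be sketched as follows; this is a minimal illustration of the idea, not the repository's exact code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Completes a task once the expected number of messages has been observed.
public sealed class CounterCompletionSource
{
    private readonly int _expected;
    private int _count;
    private readonly TaskCompletionSource _tcs =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    public CounterCompletionSource(int expected) =&amp;gt; _expected = expected;

    public Task Completion =&amp;gt; _tcs.Task;

    // Called from the consumer callback for every delivered message.
    public void OnMessage()
    {
        if (Interlocked.Increment(ref _count) == _expected)
            _tcs.TrySetResult();
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The benchmark awaits &lt;code&gt;Completion&lt;/code&gt; after the publishers finish, which marks the moment the consumer has received the last message.&lt;/p&gt;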

&lt;p&gt;&lt;strong&gt;Kafka:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Idempotent producer (&lt;code&gt;EnableIdempotence = true&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Write buffer: &lt;code&gt;QueueBufferingMaxKbytes = 1 GB&lt;/code&gt;, &lt;code&gt;QueueBufferingMaxMessages = 1M&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Manual offset commit, &lt;code&gt;AutoOffsetReset = Earliest&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Single partition, consumer group ID randomized per iteration&lt;/li&gt;
&lt;li&gt;Background consumer task with manual message counting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;NATS JetStream:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File-backed stream, retention = Workqueue&lt;/li&gt;
&lt;li&gt;Async persistence (&lt;code&gt;StorageType = File&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Explicit ACK, &lt;code&gt;MaxDeliver = 10&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Deduplication window: 1 minute&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WriterBufferSize = 1 GB&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
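
&lt;p&gt;In NATS.Net terms, the stream setup roughly corresponds to this sketch (subject and stream names are illustrative; check the repository for the real code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;var js = new NatsJSContext(nats);

await js.CreateStreamAsync(new StreamConfig("bench", new[] { "bench.queue" })
{
    Storage = StreamConfigStorage.File,           // file-backed persistence
    Retention = StreamConfigRetention.Workqueue,  // messages removed once acked
    DuplicateWindow = TimeSpan.FromMinutes(1),    // deduplication window
});
&lt;/code&gt;&lt;/pre&gt;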

&lt;h3&gt;Request-Reply&lt;/h3&gt;

&lt;p&gt;NATS has native request-reply: &lt;code&gt;RequestAsync&lt;/code&gt; sends a message and returns a response in a single call. The broker handles response routing internally.&lt;/p&gt;
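&lt;p&gt;In code, the entire round-trip is one call. A minimal NATS.Net sketch (subject and payload are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;await using var nats = new NatsConnection();

// Responder: echo each request back to its reply subject.
_ = Task.Run(async () =&amp;gt;
{
    await foreach (var msg in nats.SubscribeAsync&amp;lt;byte[]&amp;gt;("bench.echo"))
        await msg.ReplyAsync(msg.Data);
});

// Requester: publish and await the routed reply in a single call.
var reply = await nats.RequestAsync&amp;lt;byte[], byte[]&amp;gt;("bench.echo", new byte[256]);
&lt;/code&gt;&lt;/pre&gt;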

&lt;p&gt;RabbitMQ and Kafka lack this primitive. For both, request-reply was implemented via correlation IDs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Requester generates a UUID, attaches it to the message, stores a &lt;code&gt;TaskCompletionSource&lt;/code&gt; in a &lt;code&gt;ConcurrentDictionary&amp;lt;string, TaskCompletionSource&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Responder receives the message, echoes the correlation ID back on a dedicated reply queue/topic&lt;/li&gt;
&lt;li&gt;Requester's reply listener matches the ID and completes the corresponding &lt;code&gt;TaskCompletionSource&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This means each "request" in RabbitMQ/Kafka involves 4 broker operations (publish request → consume request → publish reply → consume reply) vs 1 round-trip in NATS.&lt;/p&gt;
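&lt;p&gt;The requester side of that emulation looks roughly like this (broker-specific wiring omitted; &lt;code&gt;PublishAsync&lt;/code&gt; and &lt;code&gt;OnReply&lt;/code&gt; are placeholders for the RabbitMQ/Kafka client code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;private readonly ConcurrentDictionary&amp;lt;string, TaskCompletionSource&amp;lt;byte[]&amp;gt;&amp;gt; _pending = new();

public async Task&amp;lt;byte[]&amp;gt; RequestAsync(byte[] payload)
{
    var correlationId = Guid.NewGuid().ToString("N");
    var tcs = new TaskCompletionSource&amp;lt;byte[]&amp;gt;(
        TaskCreationOptions.RunContinuationsAsynchronously);
    _pending[correlationId] = tcs;

    await PublishAsync(payload, correlationId); // broker-specific publish
    return await tcs.Task;                      // completed by the reply listener
}

// Reply listener: match the echoed correlation ID, complete the waiter.
private void OnReply(string correlationId, byte[] body)
{
    if (_pending.TryRemove(correlationId, out var tcs))
        tcs.TrySetResult(body);
}
&lt;/code&gt;&lt;/pre&gt;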

&lt;p&gt;&lt;strong&gt;RabbitMQ:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Separate request/reply queues&lt;/li&gt;
&lt;li&gt;Correlation-ID in AMQP properties&lt;/li&gt;
&lt;li&gt;Persistent messages (&lt;code&gt;DeliveryMode = Persistent&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;QoS: prefetch count = 100&lt;/li&gt;
&lt;li&gt;Manual ACK&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Kafka:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Separate request/reply topics&lt;/li&gt;
&lt;li&gt;Correlation-ID in Kafka headers&lt;/li&gt;
&lt;li&gt;Idempotent producer (&lt;code&gt;EnableIdempotence = true&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Write buffer: &lt;code&gt;QueueBufferingMaxKbytes = 1 GB&lt;/code&gt;, &lt;code&gt;QueueBufferingMaxMessages = 1M&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Manual offset commit, &lt;code&gt;AutoOffsetReset = Earliest&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;NATS:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built-in &lt;code&gt;RequestAsync&lt;/code&gt;/&lt;code&gt;ReplyAsync&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WriterBufferSize = 1 GB&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;RequestTimeout = 10 min&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CommandTimeout = 5 min&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Results: Async Queue&lt;/h2&gt;

&lt;p&gt;All values are P95 (95th percentile) completion time in milliseconds. Lower is better. Ratio columns show time relative to NATS JetStream (baseline).&lt;/p&gt;

&lt;h3&gt;P95 Completion Time&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;RabbitMQ&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;th&gt;NATS JetStream&lt;/th&gt;
&lt;th&gt;RabbitMQ / NATS&lt;/th&gt;
&lt;th&gt;Kafka / NATS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;50K × 256 B&lt;/td&gt;
&lt;td&gt;1,521 ms&lt;/td&gt;
&lt;td&gt;35,856 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;944 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.61&lt;/td&gt;
&lt;td&gt;37.98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25K × 1 KB&lt;/td&gt;
&lt;td&gt;905 ms&lt;/td&gt;
&lt;td&gt;18,629 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;511 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.77&lt;/td&gt;
&lt;td&gt;36.46&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10K × 4 KB&lt;/td&gt;
&lt;td&gt;442 ms&lt;/td&gt;
&lt;td&gt;8,329 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;256 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.73&lt;/td&gt;
&lt;td&gt;32.54&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5K × 64 KB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;534 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7,496 ms&lt;/td&gt;
&lt;td&gt;878 ms&lt;/td&gt;
&lt;td&gt;0.61&lt;/td&gt;
&lt;td&gt;8.54&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2.5K × 128 KB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;690 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7,162 ms&lt;/td&gt;
&lt;td&gt;735 ms&lt;/td&gt;
&lt;td&gt;0.94&lt;/td&gt;
&lt;td&gt;9.74&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;Messages per Second (at P95)&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;RabbitMQ&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;th&gt;NATS JetStream&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;50K × 256 B&lt;/td&gt;
&lt;td&gt;32,873 msg/s&lt;/td&gt;
&lt;td&gt;1,394 msg/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;52,966 msg/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25K × 1 KB&lt;/td&gt;
&lt;td&gt;27,624 msg/s&lt;/td&gt;
&lt;td&gt;1,342 msg/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;48,924 msg/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10K × 4 KB&lt;/td&gt;
&lt;td&gt;22,624 msg/s&lt;/td&gt;
&lt;td&gt;1,201 msg/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;39,063 msg/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5K × 64 KB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9,363 msg/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;667 msg/s&lt;/td&gt;
&lt;td&gt;5,695 msg/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2.5K × 128 KB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3,623 msg/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;349 msg/s&lt;/td&gt;
&lt;td&gt;3,401 msg/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On small to medium payloads (up to 4 KB), NATS JetStream processes messages 1.6-1.8x faster than RabbitMQ at P95. The gap is consistent across payload sizes, suggesting fixed per-message protocol overhead in AMQP 0.9.1 relative to NATS's lighter wire protocol.&lt;/p&gt;

&lt;p&gt;On large payloads (64 KB+), RabbitMQ takes the lead: it processes messages 1.6x faster than NATS at 64 KB and 1.1x faster at 128 KB. RabbitMQ allocates only 7-12 MB of managed memory in these scenarios, while NATS allocates 368-401 MB. AMQP framing handles large contiguous payloads more efficiently.&lt;/p&gt;

&lt;p&gt;Kafka is 9-38x slower than NATS at P95. This is expected: Kafka's commit-log architecture, partition leadership, and replication protocol add overhead that only pays off with horizontal scaling across multiple partitions and nodes.&lt;/p&gt;

&lt;h3&gt;Memory Allocation (Managed Heap)&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;RabbitMQ&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;th&gt;NATS JetStream&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;50K × 256 B&lt;/td&gt;
&lt;td&gt;106 MB&lt;/td&gt;
&lt;td&gt;115 MB&lt;/td&gt;
&lt;td&gt;678 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25K × 1 KB&lt;/td&gt;
&lt;td&gt;54 MB&lt;/td&gt;
&lt;td&gt;76 MB&lt;/td&gt;
&lt;td&gt;342 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10K × 4 KB&lt;/td&gt;
&lt;td&gt;22 MB&lt;/td&gt;
&lt;td&gt;60 MB&lt;/td&gt;
&lt;td&gt;205 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5K × 64 KB&lt;/td&gt;
&lt;td&gt;12 MB&lt;/td&gt;
&lt;td&gt;323 MB&lt;/td&gt;
&lt;td&gt;401 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2.5K × 128 KB&lt;/td&gt;
&lt;td&gt;7 MB&lt;/td&gt;
&lt;td&gt;318 MB&lt;/td&gt;
&lt;td&gt;368 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;RabbitMQ consistently uses the least managed memory. NATS allocates significantly more due to the 1 GB writer buffer configuration. Kafka's allocations spike with large payloads (318-323 MB) due to its own producer buffer configuration (&lt;code&gt;QueueBufferingMaxKbytes = 1 GB&lt;/code&gt;).&lt;/p&gt;

&lt;h2&gt;Results: Request-Reply&lt;/h2&gt;

&lt;p&gt;All values are P95 completion time. Ratio columns show time relative to NATS (baseline).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;RabbitMQ&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;th&gt;NATS&lt;/th&gt;
&lt;th&gt;RabbitMQ / NATS&lt;/th&gt;
&lt;th&gt;Kafka / NATS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;25K × 256 B&lt;/td&gt;
&lt;td&gt;41,450 ms&lt;/td&gt;
&lt;td&gt;36,572 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;397 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;104.41&lt;/td&gt;
&lt;td&gt;92.12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10K × 1 KB&lt;/td&gt;
&lt;td&gt;21,434 ms&lt;/td&gt;
&lt;td&gt;15,113 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;226 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;94.84&lt;/td&gt;
&lt;td&gt;66.87&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5K × 4 KB&lt;/td&gt;
&lt;td&gt;12,231 ms&lt;/td&gt;
&lt;td&gt;7,339 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;159 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;76.92&lt;/td&gt;
&lt;td&gt;46.16&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Messages per second (at P95):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;RabbitMQ&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;th&gt;NATS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;25K × 256 B&lt;/td&gt;
&lt;td&gt;603 msg/s&lt;/td&gt;
&lt;td&gt;684 msg/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;62,972 msg/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10K × 1 KB&lt;/td&gt;
&lt;td&gt;467 msg/s&lt;/td&gt;
&lt;td&gt;662 msg/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;44,248 msg/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5K × 4 KB&lt;/td&gt;
&lt;td&gt;409 msg/s&lt;/td&gt;
&lt;td&gt;681 msg/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;31,447 msg/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;NATS is 46-92x faster than Kafka and 77-104x faster than RabbitMQ at P95. This is the difference between a native protocol primitive (one network round-trip) and an application-level emulation (four broker operations per request).&lt;/p&gt;

&lt;p&gt;RabbitMQ is the slowest in all request-reply scenarios, with P95 degrading linearly: 12.2s for 5K messages, 21.4s for 10K, 41.4s for 25K. The per-message overhead is roughly constant at ~1.7 ms, dominated by the ACK cycle on both request and reply queues.&lt;/p&gt;

&lt;p&gt;Kafka also shows high tail latency: P95 reaches 36.6s on the 25K scenario (Mean is 23.3s), indicating consumer group coordination and offset management overhead amplified in what is effectively a synchronous request pattern.&lt;/p&gt;

&lt;h2&gt;Broker Comparison&lt;/h2&gt;

&lt;h3&gt;RabbitMQ 4.2&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mature AMQP implementation with 15+ years of production usage&lt;/li&gt;
&lt;li&gt;Rich routing model: direct, topic, fanout, and headers exchanges with flexible bindings&lt;/li&gt;
&lt;li&gt;Management UI included (port 15672), exposing queue depths, message rates, connection counts, and consumer status&lt;/li&gt;
&lt;li&gt;Lowest managed memory allocation in benchmarks, particularly with large payloads&lt;/li&gt;
&lt;li&gt;Multi-protocol support: AMQP 0.9.1, AMQP 1.0, MQTT 3.1.1/5.0, STOMP&lt;/li&gt;
&lt;li&gt;Plugin ecosystem: delayed message exchange, federation, shovel, consistent hash exchange&lt;/li&gt;
&lt;li&gt;Broad client library coverage across all major languages&lt;/li&gt;
&lt;li&gt;Quorum queues and streams for HA and replay scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1.6-1.8x slower than NATS on small message async throughput&lt;/li&gt;
&lt;li&gt;No native request-reply; it must be implemented via correlation IDs&lt;/li&gt;
&lt;li&gt;Classic mirrored queues are deprecated; quorum queues improve HA but add latency&lt;/li&gt;
&lt;li&gt;Erlang runtime limits low-level troubleshooting and custom extensions&lt;/li&gt;
&lt;li&gt;Clustering can exhibit split-brain under network partitions (mitigated by peer discovery plugins)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Apache Kafka 4.2&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distributed commit log with configurable retention, allowing consumers to replay from any offset&lt;/li&gt;
&lt;li&gt;Horizontal throughput scaling via partition-based parallelism&lt;/li&gt;
&lt;li&gt;Exactly-once semantics with idempotent producers and transactional API&lt;/li&gt;
&lt;li&gt;Extensive ecosystem: Kafka Connect (200+ connectors), Kafka Streams, ksqlDB, Schema Registry&lt;/li&gt;
&lt;li&gt;Standard for event sourcing, CDC (Debezium), and data pipeline architectures&lt;/li&gt;
&lt;li&gt;KRaft mode (used here) removes ZooKeeper dependency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slowest in every scenario in this benchmark (single-node, single-partition is its worst case)&lt;/li&gt;
&lt;li&gt;327 MiB RAM on cold start (JVM heap), 54x NATS&lt;/li&gt;
&lt;li&gt;High operational complexity: partitions, ISR, consumer group rebalancing, offset management&lt;/li&gt;
&lt;li&gt;Consumer group rebalancing causes consumption pauses (mitigated by cooperative-sticky assignor)&lt;/li&gt;
&lt;li&gt;Optimized for batched throughput, not per-message latency&lt;/li&gt;
&lt;li&gt;No native request-reply&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;NATS 2.12 with JetStream&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fastest in 3 out of 5 async scenarios, and all 3 request-reply scenarios&lt;/li&gt;
&lt;li&gt;Native request-reply at the protocol level, no application-level workarounds needed&lt;/li&gt;
&lt;li&gt;Operationally minimal: single binary, single flag (&lt;code&gt;-js&lt;/code&gt;) enables persistence&lt;/li&gt;
&lt;li&gt;6 MiB RAM on cold start&lt;/li&gt;
&lt;li&gt;JetStream provides persistence, replay, exactly-once delivery, de-duplication, and consumer acknowledgement&lt;/li&gt;
&lt;li&gt;Subject-based routing with hierarchical wildcards (&lt;code&gt;&amp;gt;&lt;/code&gt;, &lt;code&gt;*&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Built-in key-value store and object store&lt;/li&gt;
&lt;li&gt;Service discovery via &lt;code&gt;micro&lt;/code&gt; package&lt;/li&gt;
&lt;li&gt;Leafnode and gateway topologies for multi-cluster deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher managed memory allocation&lt;/li&gt;
&lt;li&gt;Slower than RabbitMQ on large payloads (64 KB+)&lt;/li&gt;
&lt;li&gt;Smaller community and fewer production war stories compared to RabbitMQ/Kafka&lt;/li&gt;
&lt;li&gt;JetStream is far younger than Kafka's commit log; less battle-tested for event streaming at extreme scale&lt;/li&gt;
&lt;li&gt;Monitoring/observability tooling is less mature (no equivalent to Kafka Connect ecosystem)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;For new projects that need a general-purpose message broker, NATS is the most practical starting point.&lt;/p&gt;

&lt;p&gt;It provides a feature set comparable to Kafka: persistence with replay, exactly-once delivery, stream processing primitives, key-value and object stores. At the same time, its throughput on small-to-medium payloads matches or exceeds RabbitMQ, and it handles request-reply 46-104x faster than either alternative at P95 thanks to native protocol support.&lt;/p&gt;

&lt;p&gt;The operational cost is also lower. A single binary with one flag gives you a persistent, JetStream-enabled broker consuming 6 MiB of RAM on cold start. Compare that to Kafka's 327 MiB.&lt;/p&gt;

&lt;p&gt;RabbitMQ remains a strong choice when the workload is primarily large payloads (64 KB+) or when the team has deep AMQP expertise. Kafka is still the right tool for large-scale event streaming, CDC pipelines, and scenarios where partition-based parallelism and the Connect/Streams ecosystem matter.&lt;/p&gt;

&lt;p&gt;But as a default choice for a new distributed system? NATS delivers Kafka-class features at RabbitMQ-class speed, with less operational overhead than either.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Benchmarked with BenchmarkDotNet v0.15.8 on .NET 10.0.4. All brokers ran in Docker on the same machine with default configurations. Single-node results. Production numbers will differ.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>microservices</category>
      <category>dotnet</category>
    </item>
  </channel>
</rss>
