<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Konstantinas Mamonas</title>
    <description>The latest articles on Forem by Konstantinas Mamonas (@konstantinas_mamonas).</description>
    <link>https://forem.com/konstantinas_mamonas</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3067305%2Fa35c9818-12e9-48f4-b339-42e9a0b2a051.jpeg</url>
      <title>Forem: Konstantinas Mamonas</title>
      <link>https://forem.com/konstantinas_mamonas</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/konstantinas_mamonas"/>
    <language>en</language>
    <item>
      <title>What’s Inside gzip, zstd, and Other Lossless Compressors</title>
      <dc:creator>Konstantinas Mamonas</dc:creator>
      <pubDate>Mon, 02 Jun 2025 05:00:23 +0000</pubDate>
      <link>https://forem.com/konstantinas_mamonas/whats-inside-gzip-zstd-and-other-lossless-compressors-1fo5</link>
      <guid>https://forem.com/konstantinas_mamonas/whats-inside-gzip-zstd-and-other-lossless-compressors-1fo5</guid>
      <description>&lt;p&gt;Compression shows up everywhere - logs, Kafka, Parquet, file systems, APIs - but most engineers use it without thinking about what's actually happening. You call .compress(), and fewer bytes come out. But what’s under the hood?&lt;/p&gt;

&lt;p&gt;This post is a follow-up to my previous posts &lt;a href="https://dev.to/konstantinas_mamonas/compression-algorithms-you-probably-inherited-gzip-snappy-lz4-zstd-36h0"&gt;Compression Algorithms you probably inherited&lt;/a&gt; and &lt;a href="https://dev.to/konstantinas_mamonas/which-compression-saves-the-most-storage-gzip-snappy-lz4-zstd-1898"&gt;Which Compression Saves the Most Storage $?&lt;/a&gt;. This one focuses on the techniques used in real-world lossless compressors. If you’re choosing a format or just want to understand your tools better, this will help.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Compression Actually Does
&lt;/h2&gt;

&lt;p&gt;Lossless compression shrinks data without sacrificing any of the original information. It achieves this through two primary steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pattern detection&lt;/strong&gt; - identifying and leveraging recurring structures within the data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encoding&lt;/strong&gt; - representing these structures using fewer bits than the original representation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While different compressors employ various combinations of these techniques, the fundamentals remain largely consistent. You can think of it like finding abbreviations for frequently used words to make a text shorter without losing any meaning.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Techniques
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Run-Length Encoding (RLE)
&lt;/h3&gt;

&lt;p&gt;When a sequence of identical data values occurs consecutively, RLE doesn't store each repetition. Instead, it stores the value once, along with the number of times it repeats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;br&gt;
&lt;code&gt;AAAAABBBB&lt;/code&gt; becomes &lt;code&gt;5A4B&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This technique shines when dealing with data containing long stretches of the same value. Imagine a simple black-and-white image where many consecutive pixels are the same color; RLE would be very effective there. It's less useful for highly variable or random data where repetitions are rare. In such cases, there are few "runs" to encode.&lt;/p&gt;
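&lt;p&gt;A toy RLE encoder and decoder in Python makes the idea concrete (a sketch for illustration, not any production codec - real formats pack counts into bits and handle digits in the input):&lt;/p&gt;

```python
from itertools import groupby

def rle_encode(text):
    # Collapse each run of identical characters into "count then char".
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))

def rle_decode(encoded):
    # Rebuild the original string from "count then char" pairs.
    out, i = [], 0
    while i != len(encoded):
        j = i
        while encoded[j].isdigit():
            j += 1
        out.append(encoded[j] * int(encoded[i:j]))
        i = j + 1
    return "".join(out)

print(rle_encode("AAAAABBBB"))  # 5A4B
```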

&lt;h3&gt;
  
  
  Dictionary Compression (LZ77, LZW)
&lt;/h3&gt;

&lt;p&gt;Instead of repeatedly storing identical sequences of characters or bytes, dictionary compression methods store a single instance of the sequence and then refer back to it whenever it reappears.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LZ77&lt;/strong&gt; maintains a sliding window of recently seen data. When a repeated sequence is found, it's replaced by a pointer indicating the sequence's position and length within the window. Think of it like saying, "the next 5 characters are the same as the 5 characters that appeared 10 positions ago."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LZW&lt;/strong&gt; (Lempel-Ziv-Welch), building upon LZ78, constructs a dictionary of frequently occurring patterns as it processes the data. When a pattern is encountered, it's replaced by its index in the dictionary.&lt;/li&gt;
&lt;/ul&gt;
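&lt;p&gt;A minimal LZW encoder shows the dictionary being built as the data streams through (a toy over raw bytes, not the exact variant used in GIF or Unix compress):&lt;/p&gt;

```python
def lzw_encode(data):
    # Seed the dictionary with all single bytes (codes 0-255).
    table = {bytes([i]): i for i in range(256)}
    w, out = b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc  # keep extending the current match
        else:
            out.append(table[w])    # emit the code for the longest known prefix
            table[wc] = len(table)  # register the new pattern
            w = bytes([byte])
    if w:
        out.append(table[w])
    return out

codes = lzw_encode(b"TOBEORNOTTOBEORTOBEORNOT")
print(len(codes), "codes for 24 input bytes")  # 16 codes for 24 input bytes
```

As the dictionary grows, longer and longer patterns get replaced by single codes, which is where the savings come from.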

&lt;p&gt;These techniques form the bedrock of many popular compression algorithms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DEFLATE&lt;/strong&gt; (used in gzip) combines LZ77 to identify repeated sequences and then uses Huffman coding to efficiently represent the resulting tokens (both literal characters and the length/distance pairs from LZ77).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;zstd&lt;/strong&gt; and &lt;strong&gt;Brotli&lt;/strong&gt; both utilize LZ-style pattern matching as an initial step to reduce redundancy in the data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Entropy Coding
&lt;/h3&gt;

&lt;p&gt;Once the data has been transformed into a more predictable sequence (often the output of pattern matching), entropy coding further reduces its size by assigning shorter bit codes to more frequent symbols. This stage doesn't care about the meaning of the data, only the frequency of the symbols.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Huffman Coding&lt;/strong&gt;: Assigns variable-length bit codes to symbols based on their frequency. More frequent symbols get shorter codes. While fast to decode, it's not perfectly optimal and can sometimes waste up to 1 bit per symbol.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Arithmetic Coding&lt;/strong&gt;: Achieves near-optimal compression by representing the entire input sequence as a single fractional number within the range [0, 1). The length of the fraction's binary representation determines the compressed size. It's generally slower than Huffman coding due to the complex calculations involved in manipulating these ranges.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ANS (Asymmetric Numeral Systems)&lt;/strong&gt;: A more recent family of entropy coding methods that offer compression ratios close to arithmetic coding but with decoding speeds similar to Huffman coding. It achieves this efficiency by encoding symbols into a single large number in a more streamlined way. zstd and Brotli commonly use variants of ANS, such as FSE (Finite State Entropy), which employs tables for faster processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The more effectively the data is preprocessed to reveal patterns, the more efficient this final entropy coding stage becomes. For example, if LZ77 has replaced many long repetitions with short pointers, the distribution of these pointers and remaining literal characters will be more skewed towards certain values, allowing Huffman or ANS to assign shorter codes to them.&lt;/p&gt;
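&lt;p&gt;Here's a sketch of Huffman code construction using a heap (illustrative only; DEFLATE actually uses canonical Huffman codes with code-length limits):&lt;/p&gt;

```python
import heapq
from collections import Counter

def huffman_codes(text):
    # Heap entries: (frequency, tiebreaker, subtree). Leaves are plain symbols.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    n = len(heap)
    while len(heap) != 1:
        # Merge the two least frequent subtrees into one node.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        n += 1
        heapq.heappush(heap, (f1 + f2, n, (left, right)))
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("abracadabra")
print(sorted(codes.items()))  # 'a', the most frequent symbol, gets the shortest code
```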

&lt;h3&gt;
  
  
  Preprocessing Transforms
&lt;/h3&gt;

&lt;p&gt;Sometimes, the inherent structure in data isn't immediately obvious to the compression encoder. Preprocessing transforms reorganize or reformat the data to make these redundancies more apparent.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Burrows–Wheeler Transform (BWT)&lt;/strong&gt;: Rearranges the bytes in a block of data to group identical or similar symbols together. BWT doesn't compress the data on its own, but by creating long runs of identical characters and increasing the frequency of certain symbols, it significantly improves the effectiveness of subsequent RLE and Huffman coding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Move-to-Front (MTF)&lt;/strong&gt;: Maintains a list of recently encountered symbols. When a symbol appears, it's moved to the front of the list, and its position in the list (an integer) is output. This transform helps entropy coding because frequently occurring symbols tend to get small position indices, and a stream skewed toward small integers is easier to compress.&lt;/li&gt;
&lt;/ul&gt;
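&lt;p&gt;Both transforms fit in a few lines of Python (a naive sketch - sorting full rotations makes this BWT far slower than real implementations, which use suffix arrays):&lt;/p&gt;

```python
def bwt(s):
    # Append a sentinel, sort every rotation, keep the last column.
    s = s + "\0"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def mtf(s):
    # Emit each symbol's position in a recency list, then move it to the front.
    alphabet = sorted(set(s))
    out = []
    for ch in s:
        idx = alphabet.index(ch)
        out.append(idx)
        alphabet.insert(0, alphabet.pop(idx))
    return out

transformed = bwt("banana")
print(repr(transformed))  # similar characters end up grouped together
print(mtf(transformed))   # immediate repeats become 0s
```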

&lt;p&gt;These transforms are often used in pipelines like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;raw input -&amp;gt; BWT -&amp;gt; MTF -&amp;gt; RLE -&amp;gt; Huffman&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You'll find this specific combination in the bzip2 compressor.&lt;/p&gt;
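&lt;p&gt;Python's standard library ships this exact pipeline as the &lt;code&gt;bz2&lt;/code&gt; module, so its effect is easy to observe (exact sizes vary with the input):&lt;/p&gt;

```python
import bz2

data = b"ABABABAB" * 4096  # 32 KB of highly repetitive input
packed = bz2.compress(data)
print(len(data), "bytes in,", len(packed), "bytes out")
```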

&lt;h3&gt;
  
  
  Chaining Techniques
&lt;/h3&gt;

&lt;p&gt;Real-world compressors don't rely on a single technique. Instead, they strategically combine multiple stages. One stage restructures the data to expose patterns, and the next stage then efficiently encodes those patterns to achieve a smaller size.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Compressor&lt;/th&gt;
&lt;th&gt;Pattern Matching&lt;/th&gt;
&lt;th&gt;Transforms&lt;/th&gt;
&lt;th&gt;Entropy Coding&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gzip&lt;/td&gt;
&lt;td&gt;LZ77&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;Huffman&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;zstd&lt;/td&gt;
&lt;td&gt;LZ77-style&lt;/td&gt;
&lt;td&gt;Optional (e.g., prefix)&lt;/td&gt;
&lt;td&gt;FSE (ANS variant)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Brotli&lt;/td&gt;
&lt;td&gt;LZ77-style&lt;/td&gt;
&lt;td&gt;MTF, context modeling&lt;/td&gt;
&lt;td&gt;Huffman, ANS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bzip2&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;BWT -&amp;gt; MTF -&amp;gt; RLE&lt;/td&gt;
&lt;td&gt;Huffman&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;gzip is relatively straightforward: it finds repeated substrings using LZ77 and then encodes the resulting stream of literals and (length, distance) pairs with Huffman coding.&lt;/li&gt;
&lt;li&gt;zstd employs fast LZ-style matching and FSE for highly efficient entropy coding, often incorporating optional preprocessing steps to further enhance compression.&lt;/li&gt;
&lt;li&gt;Brotli incorporates transforms like Move-to-Front and sophisticated static/dynamic context models to predict the next symbol, enabling Huffman and ANS to achieve higher compression ratios.&lt;/li&gt;
&lt;li&gt;bzip2 uniquely skips dictionary compression and instead relies on heavy preprocessing (BWT, MTF, RLE) to create highly compressible data for the final Huffman encoding stage.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  A Note on Entropy
&lt;/h2&gt;

&lt;p&gt;In information theory, entropy represents the theoretical minimum number of bits required to represent a unit of data, on average. High-entropy data is characterized by its unpredictability. If there are no discernible patterns, there's little or nothing for a compressor to exploit.&lt;/p&gt;

&lt;p&gt;Random data, such as encrypted information or already highly compressed media, will not compress well and might even increase in size slightly due to the overhead of the compression format's metadata.&lt;/p&gt;
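&lt;p&gt;You can see this with &lt;code&gt;zlib&lt;/code&gt; (DEFLATE, the same algorithm gzip uses): structured input shrinks dramatically, while random bytes come out slightly larger because of format overhead:&lt;/p&gt;

```python
import os
import zlib

structured = b"timestamp=1717304423 level=INFO msg=ok\n" * 1000
random_data = os.urandom(len(structured))

for name, data in [("structured", structured), ("random", random_data)]:
    packed = zlib.compress(data, 6)
    print(f"{name}: {len(data)} bytes compressed to {len(packed)}")
```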




&lt;h2&gt;
  
  
  Practical Rules of Thumb
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sorted data often compresses better&lt;/strong&gt;&lt;br&gt;
Try sorting logs or other datasets before compression, as this tends to create longer sequences of identical or similar values, improving pattern matching.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transforms prep the data&lt;/strong&gt;&lt;br&gt;
Techniques like BWT, MTF, and delta encoding don't directly reduce the size of the data. Instead, they reorganize it in a way that makes it more amenable to the subsequent compression stages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compression is workload-dependent&lt;/strong&gt;&lt;br&gt;
Text, logs, and telemetry data typically exhibit significant redundancy and compress well. In contrast, encrypted data lacks patterns, and data with very high cardinality (many unique values) offers fewer opportunities for compression.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
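&lt;p&gt;The first rule is easy to check with &lt;code&gt;zlib&lt;/code&gt;: the same lines, sorted, usually compress smaller because identical and near-identical lines end up adjacent (a toy demo on synthetic log lines, not a guarantee for every dataset):&lt;/p&gt;

```python
import random
import zlib

random.seed(0)
# Simulated log lines: repeated user IDs in arbitrary arrival order.
lines = [f"user={random.randint(1, 2000)} action=click" for _ in range(5000)]

shuffled = "\n".join(lines).encode()
ordered = "\n".join(sorted(lines)).encode()

print("shuffled:", len(zlib.compress(shuffled)), "bytes")
print("sorted:  ", len(zlib.compress(ordered)), "bytes")
```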




&lt;h2&gt;
  
  
  Why It Matters
&lt;/h2&gt;

&lt;p&gt;Compression's impact extends far beyond saving disk space. It influences many aspects of your data pipelines, including Kafka throughput, Parquet read latency, network transfer times, and CPU utilization. Understanding how compressors work helps you make informed decisions about which algorithms and settings to use, ultimately leading to better performance and resource efficiency.&lt;/p&gt;




&lt;p&gt;Thank you for reading.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>performance</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Parquet? What Parquet?</title>
      <dc:creator>Konstantinas Mamonas</dc:creator>
      <pubDate>Mon, 26 May 2025 07:02:38 +0000</pubDate>
      <link>https://forem.com/konstantinas_mamonas/parquet-what-parquet-5hfc</link>
      <guid>https://forem.com/konstantinas_mamonas/parquet-what-parquet-5hfc</guid>
      <description>&lt;p&gt;If you’re in data, you’re probably using Parquet. It’s not officially the standard, but good luck trying to convince anyone to use something else.&lt;/p&gt;

&lt;p&gt;This post is meant to open the black box that is Parquet to see what exactly makes it so damn good. I’ll give a breakdown of the internals and walk through optimizations that take you from a badly tuned file to a &lt;em&gt;blazingly fast™&lt;/em&gt; one.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Even &lt;em&gt;Is&lt;/em&gt; Parquet?
&lt;/h2&gt;

&lt;p&gt;Parquet is a &lt;strong&gt;columnar storage format&lt;/strong&gt;: it stores data &lt;strong&gt;by column&lt;/strong&gt;, not by row. That’s ideal for analytical queries like &lt;code&gt;SUM(driver_pay)&lt;/code&gt; or &lt;code&gt;WHERE trip_miles &amp;lt; 2&lt;/code&gt;, where you only need a few columns, not the entire row. Engines can skip the rest, making reads faster and more efficient.&lt;/p&gt;

&lt;p&gt;A Parquet file is composed of several layers, each designed to improve performance and storage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Columnar layout&lt;/strong&gt;: Values from the same column are stored together, enabling tight compression and efficient scans.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row groups&lt;/strong&gt;: Horizontal partitions of the dataset. These are the unit of parallelism and skipping; each contains all columns for a batch of rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Column chunks&lt;/strong&gt;: Within each row group, data is organized by column.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pages&lt;/strong&gt;: Each column chunk is divided into pages (typically ~8KB). These are the unit of encoding and compression, allowing for localized reads and decompression.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encodings&lt;/strong&gt;: Within pages, data can be encoded (dictionary, run-length, bit-packing) to reduce size.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compression&lt;/strong&gt;: Pages can be compressed (ZSTD, Snappy, etc.) for storage and I/O efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt;: stores stats and min/max indexes so engines can skip row groups that don’t match the filter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layered structure makes Parquet efficient, but also means you have a lot of tuning knobs and plenty of ways to screw things up if you’re not careful.&lt;/p&gt;
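&lt;p&gt;A toy contrast between row and column layout makes the "skip the rest" point concrete (plain Python with &lt;code&gt;zlib&lt;/code&gt; standing in for the page compressor - nothing here is Parquet-specific):&lt;/p&gt;

```python
import zlib

rows = [(i, "HV0003", round(i * 0.1, 1)) for i in range(10_000)]

# Row layout: all fields of each record stored together.
row_blob = "\n".join(",".join(map(str, r)) for r in rows).encode()

# Column layout: each column stored contiguously.
columns = list(zip(*rows))
col_blobs = ["\n".join(map(str, col)).encode() for col in columns]

# Reading one column now touches only its own blob, and a constant
# column like the license string collapses to almost nothing.
print("row blob compressed:", len(zlib.compress(row_blob)), "bytes")
for name, blob in zip(["id", "license", "miles"], col_blobs):
    print(f"column {name!r} compressed:", len(zlib.compress(blob)), "bytes")
```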




&lt;h2&gt;
  
  
  What I Tested
&lt;/h2&gt;

&lt;p&gt;The original data comes from NYC’s TLC trip records, merged into a 1.6GB uncompressed Parquet file - one of the files I used in &lt;a href="https://dev.to/konstantinas_mamonas/which-compression-saves-the-most-storage-gzip-snappy-lz4-zstd-1898"&gt;Which Compression Saves the Most Storage $? (gzip, Snappy, LZ4, zstd)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I used PyArrow to generate multiple Parquet variants, each applying a small tweak - new row group sizes, better encoding, compression, sorting - to isolate each effect.&lt;/p&gt;

&lt;p&gt;Then I ran DuckDB queries to benchmark performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Environment&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MacBook Pro (2021, M1 Pro, 16GB RAM)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Variant Overview
&lt;/h2&gt;

&lt;p&gt;Here’s a quick snapshot of what changed in each file.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;RGs&lt;/th&gt;
&lt;th&gt;Comp&lt;/th&gt;
&lt;th&gt;Enc&lt;/th&gt;
&lt;th&gt;Dict&lt;/th&gt;
&lt;th&gt;Sorted&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;00_Worst&lt;/td&gt;
&lt;td&gt;7.40 GB&lt;/td&gt;
&lt;td&gt;60,304&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;V1&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;01_Base&lt;/td&gt;
&lt;td&gt;6.83 GB&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;V1&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02a_Dict&lt;/td&gt;
&lt;td&gt;1.67 GB&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;V1&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02b_Comp&lt;/td&gt;
&lt;td&gt;1.82 GB&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;td&gt;ZSTD&lt;/td&gt;
&lt;td&gt;V1&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03_Dict+Comp&lt;/td&gt;
&lt;td&gt;1.37 GB&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;td&gt;ZSTD&lt;/td&gt;
&lt;td&gt;V1&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04a_OptNS&lt;/td&gt;
&lt;td&gt;1.38 GB&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;td&gt;ZSTD&lt;/td&gt;
&lt;td&gt;V2&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04b_SortV1&lt;/td&gt;
&lt;td&gt;1.88 GB&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;td&gt;ZSTD&lt;/td&gt;
&lt;td&gt;V1&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;05b_OptSV2&lt;/td&gt;
&lt;td&gt;1.88 GB&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;td&gt;ZSTD&lt;/td&gt;
&lt;td&gt;V2&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Legend&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Variant&lt;/strong&gt;: Short ID for each Parquet file variant
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size&lt;/strong&gt;: Final file size on disk
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RGs&lt;/strong&gt;: Number of row groups
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comp&lt;/strong&gt;: Compression codec used (&lt;code&gt;ZSTD&lt;/code&gt; or &lt;code&gt;None&lt;/code&gt;)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enc&lt;/strong&gt;: Parquet data page version (&lt;code&gt;V1&lt;/code&gt; or &lt;code&gt;V2&lt;/code&gt;)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dict&lt;/strong&gt;: Whether dictionary encoding was enabled
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sorted&lt;/strong&gt;: Whether the file was sorted by &lt;code&gt;PULocationID&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How come the worst file ballooned to 7.40GB?
&lt;/h2&gt;

&lt;p&gt;In the post about compression, the same uncompressed file was 1.6GB. What happened here?&lt;/p&gt;

&lt;p&gt;I thought, damn, I must have written the uncompressed file wrong last time, invalidating the previous post. But it turned out I wasn't wrong.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;02a_Dict&lt;/code&gt; file confirms that dictionary encoding is what keeps the size down: disabling it is what bloated the file from 1.6GB to 7.4GB, and with it back on, the size drops again even without compression. Interestingly, compression alone (&lt;code&gt;02b_Comp&lt;/code&gt;) doesn't push the file size lower than dictionary encoding alone.&lt;/p&gt;
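&lt;p&gt;A toy version of what dictionary encoding does to a low-cardinality column (illustrative only - Parquet additionally bit-packs and run-length-encodes the indexes):&lt;/p&gt;

```python
# A column with a million rows but only two distinct values.
column = ["HV0003", "HV0005", "HV0003", "HV0003", "HV0005"] * 200_000

# Plain encoding: store every string in full.
plain_size = sum(len(v) for v in column)

# Dictionary encoding: store each distinct value once, plus one
# small index per row (roughly a byte each in this toy model).
dictionary = sorted(set(column))
index_of = {v: i for i, v in enumerate(dictionary)}
indexes = [index_of[v] for v in column]
dict_size = sum(len(v) for v in dictionary) + len(indexes)

print(f"plain: {plain_size} bytes, dictionary-encoded: about {dict_size} bytes")
```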




&lt;h2&gt;
  
  
  How the Internals Affected Query Performance
&lt;/h2&gt;

&lt;p&gt;I ran seven queries on each file variant: filters, projections, aggregations, and full row reads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BENCHMARK_QUERIES = {
    "1. Count with selective filter (predicate pushdown on numeric 'trip_miles')": "SELECT COUNT(*) FROM read_parquet('{file}') WHERE trip_miles &amp;lt; 2.0;",
    "2. Projection of specific columns (column pruning 'pickup_datetime', 'dropoff_datetime')": "SELECT pickup_datetime, dropoff_datetime FROM read_parquet('{file}') LIMIT 1000;",
    "3. Aggregation on filtered data (mixed types, computation)": "SELECT AVG(trip_miles), SUM(base_passenger_fare) FROM read_parquet('{file}') WHERE tips &amp;gt; 0.0 AND trip_time &amp;gt; 600;",
    "4. Filter on low-cardinality string (dictionary encoding potential 'hvfhs_license_num')": "SELECT COUNT(*) FROM read_parquet('{file}') WHERE hvfhs_license_num = 'HV0003';",
    "5. Full scan and sum of one numeric column (I/O and decompression 'driver_pay')": "SELECT SUM(driver_pay) FROM read_parquet('{file}');",
    "6. Count with filter on an integer ID (predicate pushdown 'PULocationID')": "SELECT COUNT(*) FROM read_parquet('{file}') WHERE PULocationID = 148;",
    "7. Full row reconstruction (worst-case for column store, limited)": "SELECT * FROM read_parquet('{file}') LIMIT 10;",
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyiej3iqus0okhg4pzced.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyiej3iqus0okhg4pzced.png" alt="Parquet Query Performance Heatmap" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Row Group Size Changed Everything
&lt;/h3&gt;

&lt;p&gt;Just fixing the row group size and going from 60K tiny chunks to 60 made the biggest difference in performance.&lt;/p&gt;

&lt;p&gt;In the baseline, a simple &lt;code&gt;COUNT(*) WHERE trip_miles &amp;lt; 2&lt;/code&gt; took 3.662s. With improved row groups? 0.124s. Same data, same engine, nearly 30x faster.&lt;/p&gt;

&lt;p&gt;Small row groups mean a ton of metadata scanning and disk seeks. Fewer, larger row groups = more I/O locality and less overhead.&lt;/p&gt;

&lt;h4&gt;
  
  
  Impact of Larger Row Groups (4.37M vs. 1M Rows per RG)
&lt;/h4&gt;

&lt;p&gt;Official Parquet documentation recommends large row groups, &lt;strong&gt;512MB - 1GB&lt;/strong&gt; of uncompressed data. In our case that would have been around 14 row groups. After trying it in practice, though, I noticed something interesting: with everything else kept the same, the file size increased. This seems to be caused by dictionary encoding - larger row groups can contain more unique values per column, which results in larger dictionaries.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Rows per RG&lt;/th&gt;
&lt;th&gt;Row Groups&lt;/th&gt;
&lt;th&gt;File Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;04a_OptNS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~1M (1,018,964)&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;td&gt;1375.28 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;04a_OptNS_LargeRG&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~4.37M (4,370,585)&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;1444.29 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What about performance?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query Type&lt;/th&gt;
&lt;th&gt;&lt;code&gt;04a_OptNS&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;04a_OptNS_LargeRG&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;% Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1. Count with selective filter (numeric)&lt;/td&gt;
&lt;td&gt;0.1633&lt;/td&gt;
&lt;td&gt;0.1664&lt;/td&gt;
&lt;td&gt;+1.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2. Projection of specific columns&lt;/td&gt;
&lt;td&gt;0.0055&lt;/td&gt;
&lt;td&gt;0.0070&lt;/td&gt;
&lt;td&gt;+27.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3. Aggregation on filtered data&lt;/td&gt;
&lt;td&gt;0.3541&lt;/td&gt;
&lt;td&gt;0.3680&lt;/td&gt;
&lt;td&gt;+3.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4. Filter on low-cardinality string&lt;/td&gt;
&lt;td&gt;0.0934&lt;/td&gt;
&lt;td&gt;0.1018&lt;/td&gt;
&lt;td&gt;+9.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5. Full scan and sum of one numeric column&lt;/td&gt;
&lt;td&gt;0.0883&lt;/td&gt;
&lt;td&gt;0.0918&lt;/td&gt;
&lt;td&gt;+4.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6. Count with filter on an integer ID&lt;/td&gt;
&lt;td&gt;0.0718&lt;/td&gt;
&lt;td&gt;0.0788&lt;/td&gt;
&lt;td&gt;+9.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7. Full row reconstruction (worst-case for column store)&lt;/td&gt;
&lt;td&gt;0.0298&lt;/td&gt;
&lt;td&gt;0.0302&lt;/td&gt;
&lt;td&gt;+1.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In my opinion, the slowdown mostly came from the CPU having to do more work to decompress each larger block in one go.&lt;/p&gt;

&lt;p&gt;This highlights a critical point: while a general recommendation for row group size exists, the optimal size is often workload and data dependent. In my case I kept row groups at ~1M rows.&lt;/p&gt;




&lt;h3&gt;
  
  
  Compression Was Worth It
&lt;/h3&gt;

&lt;p&gt;ZSTD reduced the file size from 6.83 GB to 1.82 GB and kept query times low, often under 0.2s for filtered counts and scans. It adds minimal CPU overhead compared to the I/O savings.&lt;/p&gt;

&lt;p&gt;That said, for full scans like &lt;code&gt;SUM(driver_pay)&lt;/code&gt;, neither compression alone nor compression with dictionary encoding outperforms the &lt;code&gt;01_Base&lt;/code&gt; version.&lt;/p&gt;

&lt;p&gt;So the overhead of compression and dictionary encoding slows that query down a bit. The result makes sense, though: compression isn't meant to improve query execution speed but to reduce the size of the data.&lt;/p&gt;




&lt;h3&gt;
  
  
  Encodings and Page Versions Gave Smaller Gains
&lt;/h3&gt;

&lt;p&gt;Switching to better encodings made small additional improvements in file size. Applied without compression, they can be powerful on their own; combined with compression, they truly shine.&lt;/p&gt;

&lt;p&gt;The column &lt;code&gt;hvfhs_license_num&lt;/code&gt; has low cardinality, just a few distinct values. Fixing the tiny row groups brought a filter on it down to &lt;strong&gt;0.085s&lt;/strong&gt;, and turning on dictionary encoding (&lt;code&gt;02a_Dict&lt;/code&gt;) added a small further gain: &lt;strong&gt;0.0846s&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Since in most cases encoding trades a bit of query performance for space savings, I'll take that W.&lt;/p&gt;

&lt;p&gt;Parquet V2 pages gave marginal improvements. They’re worth turning on if your engine supports them, but they won’t move the needle like row groups or compression will.&lt;/p&gt;




&lt;h3&gt;
  
  
  Sorting Provided Targeted Speedups
&lt;/h3&gt;

&lt;p&gt;Sorting the file by &lt;code&gt;PULocationID&lt;/code&gt; had a huge impact on one query: a count filter on that column.&lt;/p&gt;

&lt;p&gt;Unsorted: &lt;strong&gt;0.07s&lt;/strong&gt;&lt;br&gt;
Sorted: &lt;strong&gt;0.007s&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's a 10x improvement from better row group skipping. The rest of the queries didn’t change much, and the file got a bit larger, but it’s a trade-off that’s worth it if you know your access patterns.&lt;/p&gt;
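&lt;p&gt;A toy model of why sorting enables that skipping: engines keep min/max stats per row group and only scan groups whose range can contain the target (hypothetical numbers, not the benchmark file itself):&lt;/p&gt;

```python
import random

random.seed(42)
values = [random.randint(1, 265) for _ in range(60_000)]  # stand-in for PULocationID

def row_groups_to_scan(data, group_size, target):
    groups = [data[i:i + group_size] for i in range(0, len(data), group_size)]
    # A group can be skipped unless the target falls inside its [min, max] range.
    return sum(1 for g in groups if target in range(min(g), max(g) + 1))

print("unsorted:", row_groups_to_scan(values, 1000, 148), "of 60 groups scanned")
print("sorted:  ", row_groups_to_scan(sorted(values), 1000, 148), "of 60 groups scanned")
```

Unsorted, nearly every group's min/max range covers the target, so nothing can be skipped; sorted, the target lives in one contiguous stretch.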




&lt;h3&gt;
  
  
  Even Worst-Case Row Reads Got Faster
&lt;/h3&gt;

&lt;p&gt;Reconstructing full rows (all columns, LIMIT 10) is the worst thing you can do to a columnar format.&lt;/p&gt;

&lt;p&gt;Worst: &lt;strong&gt;2.27s&lt;/strong&gt;&lt;br&gt;
Optimized: &lt;strong&gt;0.023s&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Good structure helps even when Parquet’s working against you.&lt;/p&gt;




&lt;h3&gt;
  
  
  So what about it?
&lt;/h3&gt;

&lt;p&gt;Parquet has sensible defaults, and if you’re not actively trying to sabotage them, things usually work fine. But if you know which knobs to turn, you can make it &lt;em&gt;blazingly fast™&lt;/em&gt;. Kachow.&lt;/p&gt;

&lt;p&gt;The takeaway - the defaults don’t know your workload. You do.&lt;/p&gt;




&lt;p&gt;Thanks for reading!&lt;/p&gt;

</description>
      <category>performance</category>
      <category>python</category>
      <category>programming</category>
      <category>learning</category>
    </item>
    <item>
      <title>Which Compression Saves the Most Storage $? (gzip, Snappy, LZ4, zstd)</title>
      <dc:creator>Konstantinas Mamonas</dc:creator>
      <pubDate>Mon, 19 May 2025 07:07:32 +0000</pubDate>
      <link>https://forem.com/konstantinas_mamonas/which-compression-saves-the-most-storage-gzip-snappy-lz4-zstd-1898</link>
      <guid>https://forem.com/konstantinas_mamonas/which-compression-saves-the-most-storage-gzip-snappy-lz4-zstd-1898</guid>
      <description>&lt;p&gt;Compression settings are set-and-forget in most cases: if it works, there's no reason to change it. I decided to look into it and see whether it would be beneficial to review the defaults and whether it could save money. I covered most of the algorithms discussed in this post previously in &lt;a href="https://dev.to/konstantinas_mamonas/compression-algorithms-you-probably-inherited-gzip-snappy-lz4-zstd-36h0"&gt;Compression Algorithms You Probably Inherited&lt;/a&gt;, where I summarized the info I collected while researching. But I wanted to sanity-check the findings myself, so I ran some benchmarks. This should help me see if compression actually makes a difference for storage costs.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Tested
&lt;/h2&gt;

&lt;p&gt;To keep things real, I used actual data: &lt;a href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page" rel="noopener noreferrer"&gt;NYC TLC trip records&lt;/a&gt;. Each month’s data file was ~500MB. I combined a few to get files at 500MB, 1.6GB, 3.9GB, and 6.6GB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compression algorithms tested:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gzip&lt;/li&gt;
&lt;li&gt;Snappy&lt;/li&gt;
&lt;li&gt;LZ4&lt;/li&gt;
&lt;li&gt;zstd at levels 1, 3, 9, and 19&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Environment:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MacBook Pro (2021, M1 Pro, 16GB RAM)&lt;/li&gt;
&lt;li&gt;Single-threaded runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I couldn’t process the largest file with my setup. SKILL ISSUE. In reality, I didn’t bother trying to fix it; multi-threading and batching the compression should have allowed me to, but I already had the 3 other files to work with.&lt;/p&gt;

&lt;p&gt;To run the benchmarks, I built a small CLI tool: &lt;a href="https://github.com/KonMam/compressbench" rel="noopener noreferrer"&gt;compressbench&lt;/a&gt;. It’s publicly available and it currently supports gzip, snappy, lz4, and zstd (with levels) and outputs compression/decompression benchmarks for Parquet files. I’m planning to add support for custom codecs later, mostly so I can benchmark my own RLE, Huffman, and LZ77 implementations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lorxlebnjx4e3npqyyv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lorxlebnjx4e3npqyyv.png" alt="Compression Ratio/Time/Throughput, Decompression Throughput" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Compression Ratio
&lt;/h3&gt;

&lt;p&gt;If you only care about the smallest file, zstd-19 and gzip come out ahead. But the margin over zstd-3 is tiny, and you pay for it heavily elsewhere. Snappy and LZ4 compress to a ratio of about 1.12 - just enough to make it look like they tried. But if that’s all you have, roughly 11% savings is still better than no savings.&lt;/p&gt;

&lt;p&gt;For most storage use cases, zstd-3 gets close enough to the “best” ratio without turning your CPU into a space heater.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compression Speed
&lt;/h3&gt;

&lt;p&gt;Snappy and LZ4 are fast. zstd-1, 3, and 9 kept up surprisingly well. gzip is predictably slow. zstd-19 made me question my life choices - at one point I thought it had frozen or been silently murdered by the OS. I’m not saying never use it; there are use cases, but they’re likely few and far between.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decompression Speed
&lt;/h3&gt;

&lt;p&gt;Snappy and LZ4 hit over 3.5GB/s. zstd held steady around 1GB/s across all levels. gzip stayed slow.&lt;/p&gt;

&lt;p&gt;If you need to read the same data multiple times, Snappy and LZ4 are faster than gzip or zstd. But zstd isn’t slow enough for that to matter unless your volumes are huge.&lt;/p&gt;




&lt;h3&gt;
  
  
  File Size Scaling
&lt;/h3&gt;

&lt;p&gt;Throughput went down as file size grew. Gzip was slow the whole time. zstd-19 was even slower, and since I didn’t run it for every file size, its throughput may have dropped further still at larger sizes.&lt;/p&gt;

&lt;p&gt;The others held up fairly well. Snappy stayed fastest, but none of them completely fell apart.&lt;/p&gt;

&lt;p&gt;Note: CPU was pinned at 100% during all runs. On a single-threaded, 16GB machine, there was probably some memory pressure too for the larger files. These results match what I’ve seen elsewhere but might be a bit exaggerated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fancdw6yt0v9ipuxiw5mi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fancdw6yt0v9ipuxiw5mi.png" alt="Compression Throughput vs File Size" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Storage Cost (S3)
&lt;/h3&gt;

&lt;p&gt;S3 Standard pricing in eu-central-1 is $0.0235/GB. At 500TB/month, codec choice can have a significant impact on the cost (depending on your budget). But if you’re only storing a few TB, this doesn’t matter much. Even at 100TB, the difference between codecs is maybe a few hundred bucks.&lt;/p&gt;

&lt;p&gt;Snappy/LZ4 would cost around $10.7K/month at 500TB. zstd-3 lands near $9.7K. zstd-19 saves a bit more, but the compute cost and latency make it hard to justify. gzip is in the same ballpark on size, and we’ve already covered its performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkf9gvdnlb94ydp6sd3xa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkf9gvdnlb94ydp6sd3xa.png" alt="Projected Monthly S3 Cost by Codec at Scale" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Pick?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For streaming&lt;/strong&gt;&lt;br&gt;
Snappy or LZ4. Fast compression and decompression. A compression ratio that’s better than nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For batch ETL or periodic jobs&lt;/strong&gt;&lt;br&gt;
zstd-1 or zstd-3. Good balance between speed and size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For archival&lt;/strong&gt;&lt;br&gt;
zstd-9 if you care for small gains. zstd-19 if you’re archiving something you hope nobody ever reads again.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;After my initial post, I assumed the real-life impact of LZ4 and zstd would be more obvious. But it turns out you need quite a bit of scale to feel it. In the future, I won’t be so quick to dismiss Snappy, as it has its place. But it’s not the only viable option.&lt;/p&gt;

&lt;p&gt;I'd also like to benchmark compute cost in the future and see whether using zstd at scale is actually worth it for batch processes or if the additional compute time eats up your storage savings.&lt;/p&gt;

&lt;p&gt;Also keep in mind that your mileage may vary based on your data. Compression is about finding patterns, and if there are none, the result might be a larger file than you began with - so pick accordingly, and maybe run a benchmark yourself.&lt;/p&gt;




&lt;p&gt;Thank you for reading.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>performance</category>
      <category>python</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Compression Algorithms You Probably Inherited: gzip, Snappy, LZ4, zstd</title>
      <dc:creator>Konstantinas Mamonas</dc:creator>
      <pubDate>Mon, 12 May 2025 07:31:25 +0000</pubDate>
      <link>https://forem.com/konstantinas_mamonas/compression-algorithms-you-probably-inherited-gzip-snappy-lz4-zstd-36h0</link>
      <guid>https://forem.com/konstantinas_mamonas/compression-algorithms-you-probably-inherited-gzip-snappy-lz4-zstd-36h0</guid>
      <description>&lt;h2&gt;
  
  
  You Might Be Using The Wrong Compression Algorithm
&lt;/h2&gt;

&lt;p&gt;If you work in data engineering, you’ve probably used &lt;strong&gt;gzip&lt;/strong&gt;, &lt;strong&gt;Snappy&lt;/strong&gt;, &lt;strong&gt;LZ4&lt;/strong&gt;, or &lt;strong&gt;Zstandard (zstd)&lt;/strong&gt;. More likely - you inherited them. Either the person who set these defaults is long gone, there’s never enough time to revisit the choice, or things work well enough and you’d rather not duck around and find out otherwise.&lt;/p&gt;

&lt;p&gt;Most engineers stick with the defaults. Changing them feels risky. And let’s be honest - many don’t really know what these algorithms do or why one was chosen in the first place.&lt;/p&gt;

&lt;p&gt;I’ve been that person myself: &lt;em&gt;"Oh, we’re using Snappy? OK."&lt;/em&gt; Never thinking to ask why or what else we could use.&lt;/p&gt;

&lt;p&gt;This post explains the most common compression algorithms, what makes them different, and when you should actually use each.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Compression Choices Matter
&lt;/h2&gt;

&lt;p&gt;Compression decisions aren’t just about saving space. They directly impact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Storage costs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CPU utilization&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In modern pipelines - Kafka, Parquet, column stores, data lakes - the wrong compression algorithm can degrade all of these.&lt;/p&gt;

&lt;p&gt;Two metrics matter most:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compression ratio&lt;/strong&gt;: How much smaller the data gets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt;: How quickly data can be compressed and decompressed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your workload - and whether you prioritize CPU, latency, or bandwidth - determines which trade-offs are acceptable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Main Culprits
&lt;/h2&gt;

&lt;h3&gt;
  
  
  gzip
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is&lt;/strong&gt;: Uses the DEFLATE algorithm (LZ77 + Huffman coding).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Good compression ratio. Compatibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt;: Slow to compress. Moderate decompression speed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strength&lt;/strong&gt;: Ubiquitous. Supported everywhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weakness&lt;/strong&gt;: Outclassed in both speed and compression ratio by newer algorithms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When to use&lt;/strong&gt;: Archival, compatibility with legacy tools. Otherwise, avoid.&lt;/li&gt;
&lt;/ul&gt;
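&lt;p&gt;For a quick feel of gzip in practice, Python ships it in the standard library. A minimal round trip using the real &lt;code&gt;gzip&lt;/code&gt; module API:&lt;/p&gt;

```python
import gzip

payload = b"log line: request handled in 12ms\n" * 1000

# compresslevel ranges from 1 (fastest) to 9 (smallest); 9 is the default.
small = gzip.compress(payload, compresslevel=9)
fast = gzip.compress(payload, compresslevel=1)

assert gzip.decompress(small) == payload
print(len(payload), len(small), len(fast))
```

&lt;p&gt;&lt;code&gt;compresslevel&lt;/code&gt; is the same speed-vs-size lever that zstd exposes, just with far less range.&lt;/p&gt;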

&lt;h3&gt;
  
  
  Snappy
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is&lt;/strong&gt;: Developed by Google. Based on LZ77 without entropy coding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Maximize speed, not compression ratio.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt;: Very fast compression and decompression.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strength&lt;/strong&gt;: Low CPU overhead. Stable. Production-proven at Google scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weakness&lt;/strong&gt;: Larger compressed size than other options.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When to use&lt;/strong&gt;: Real-time, low-CPU systems where latency matters more than storage. Or if you're stuck with it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  LZ4
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is&lt;/strong&gt;: LZ77-based. Prioritizes speed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Fast compression and decompression with moderate compression ratio.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt;: &amp;gt; 500 MB/s compression. GB/s decompression.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strength&lt;/strong&gt;: Extremely fast. Low CPU usage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weakness&lt;/strong&gt;: Compression ratio lower than gzip or zstd.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When to use&lt;/strong&gt;: High-throughput, low-latency systems. Datacenter transfers. OLAP engines (DuckDB, Cassandra).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  zstd (Zstandard)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is&lt;/strong&gt;: Developed by Facebook. Combines LZ77, Huffman coding, and FSE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: High compression ratio with fast speed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt;: Compression 500+ MB/s. Decompression 1500+ MB/s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strength&lt;/strong&gt;: Tunable. Balances speed and compression. Strong performance across data types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weakness&lt;/strong&gt;: Slightly more CPU than LZ4/Snappy at default settings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When to use&lt;/strong&gt;: General-purpose. Parquet files. Kafka. Data transfers. Usually the best all-around choice.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Strengths and Weaknesses (At a Glance)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Algorithm&lt;/th&gt;
&lt;th&gt;Compression Ratio&lt;/th&gt;
&lt;th&gt;Compression Speed&lt;/th&gt;
&lt;th&gt;Decompression Speed&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gzip&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Slow&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Archival, web content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snappy&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Very Fast&lt;/td&gt;
&lt;td&gt;Very Fast&lt;/td&gt;
&lt;td&gt;Real-time, low-CPU systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LZ4&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Extremely Fast&lt;/td&gt;
&lt;td&gt;Extremely Fast&lt;/td&gt;
&lt;td&gt;High-throughput, low-latency systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;zstd&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;General-purpose, Parquet, Kafka, data transfers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Real-World Scenarios: When to Use What
&lt;/h2&gt;

&lt;h3&gt;
  
  
  High-throughput streaming (Kafka)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use&lt;/strong&gt;: zstd or LZ4&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why&lt;/strong&gt;: zstd gives better compression with good speed. LZ4 if latency is critical and CPU is limited. Snappy is acceptable if inherited, but usually not optimal anymore.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Long-term storage (Parquet, S3)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use&lt;/strong&gt;: zstd&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why&lt;/strong&gt;: Best compression ratio reduces storage cost and IO. Slight CPU trade-off is acceptable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Low-latency querying (DuckDB, Cassandra)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use&lt;/strong&gt;: LZ4&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why&lt;/strong&gt;: Prioritize decompression speed for fast queries. LZ4 is the common choice in OLAP engines.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CPU/memory constrained environments
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use&lt;/strong&gt;: Snappy or LZ4&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why&lt;/strong&gt;: Low CPU overhead is more important than compression ratio. zstd can still be used at low compression levels if needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Fast network, low compression benefit (datacenter file transfer)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use&lt;/strong&gt;: LZ4&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why&lt;/strong&gt;: Minimal compression overhead. On fast networks, speed beats smaller file sizes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Slow network or internet transfers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use&lt;/strong&gt;: zstd&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why&lt;/strong&gt;: Better compression reduces transfer time despite slightly higher CPU cost.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What to Remember
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;No algorithm is best for every workload.&lt;/li&gt;
&lt;li&gt;zstd has become the Swiss Army knife of compression. Unless you have a good reason not to, it’s a smart pick.&lt;/li&gt;
&lt;li&gt;LZ4 is unbeatable when speed matters more than compression.&lt;/li&gt;
&lt;li&gt;Snappy is still acceptable in latency-sensitive, CPU-constrained setups but is generally being replaced.&lt;/li&gt;
&lt;li&gt;gzip remains for legacy systems or when maximum compatibility is required.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What's Underneath The Hood
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LZ77&lt;/strong&gt; - Replaces repeated sequences of data with references to earlier copies in the stream (sliding window). &lt;a href="https://en.wikipedia.org/wiki/LZ77_and_LZ78" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Huffman Coding&lt;/strong&gt; - A method of assigning shorter codes to more frequent data patterns to save space. &lt;a href="https://en.wikipedia.org/wiki/Huffman_coding" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;FSE (Finite State Entropy)&lt;/strong&gt; - An advanced entropy coding method that efficiently compresses sequences by balancing speed and compression ratio. &lt;a href="https://facebook.github.io/zstd/" rel="noopener noreferrer"&gt;Facebook’s zstd Manual&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why it matters&lt;/strong&gt;&lt;br&gt;
Most compression algorithms combine finding patterns (LZ77) with efficient encoding (Huffman, FSE) to shrink data without losing information.&lt;/p&gt;
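&lt;p&gt;To make the pattern-finding half concrete, here’s a toy LZ77 in Python. It’s illustrative only - real DEFLATE uses a far smarter matcher, a compact bit-level format, and an entropy-coding stage on top:&lt;/p&gt;

```python
def lz77_compress(data, window=255):
    """Toy LZ77: emit (offset, length, next_byte) tuples over a sliding window."""
    out = []
    i = 0
    while len(data) > i:
        start = max(0, i - window)
        best_off, best_len = 0, 0
        # Look for the longest earlier copy of the bytes starting at i.
        for off in range(1, i - start + 1):
            length = 0
            while (len(data) > i + length
                   and data[i + length] == data[i - off + length]
                   and window > length):
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        nxt = data[i + best_len] if len(data) > i + best_len else None
        out.append((best_off, best_len, nxt))
        i += best_len + 1
    return out

def lz77_decompress(tokens):
    buf = bytearray()
    for off, length, nxt in tokens:
        for _ in range(length):
            buf.append(buf[-off])  # copy from earlier in the output
        if nxt is not None:
            buf.append(nxt)
    return bytes(buf)

sample = b"abcabcabcabcx"
tokens = lz77_compress(sample)
assert lz77_decompress(tokens) == sample
print(tokens)
```

&lt;p&gt;The repeated &lt;code&gt;abc&lt;/code&gt; run collapses into a single back-reference; the entropy coder would then squeeze the tokens themselves.&lt;/p&gt;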




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Compression choices tend to stick around. There’s rarely time to revisit legacy pipelines, and if something works, it’s easy to assume it’s good enough. But if you can make the time, you’re now better equipped to review your defaults (I know I am) and see if a different choice might better fit your needs.&lt;/p&gt;




&lt;p&gt;Thank you for reading.&lt;/p&gt;

</description>
      <category>database</category>
      <category>sql</category>
      <category>programming</category>
      <category>performance</category>
    </item>
    <item>
      <title>DuckDB: When You Don’t Need Spark (But Still Need SQL)</title>
      <dc:creator>Konstantinas Mamonas</dc:creator>
      <pubDate>Sat, 03 May 2025 13:48:38 +0000</pubDate>
      <link>https://forem.com/konstantinas_mamonas/duckdb-when-you-dont-need-spark-but-still-need-sql-1p9c</link>
      <guid>https://forem.com/konstantinas_mamonas/duckdb-when-you-dont-need-spark-but-still-need-sql-1p9c</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Too often, data engineering tasks that should be simple end up requiring heavyweight tools. Something breaks, or I need to explore a new dataset, and suddenly I’m firing up Spark or connecting to a cloud warehouse - even though the data easily fits on my laptop. That adds extra steps, slows things down, and costs more than it should. I wanted something simpler for local analytics that could still handle serious queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is DuckDB?
&lt;/h2&gt;

&lt;p&gt;DuckDB is an open-source, in-process SQL OLAP database designed for analytics.&lt;/p&gt;

&lt;p&gt;It runs embedded inside applications, similar to SQLite, but optimized for analytical queries like joins, aggregations, and large scans.&lt;/p&gt;

&lt;p&gt;In short, it goes fast without adding the complexity of distributed systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  How DuckDB Achieves High Performance
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Columnar Storage:&lt;/strong&gt;&lt;br&gt;
Data is stored by columns, not rows. This lets queries scan only the data they need, cutting down IO.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vectorized Execution:&lt;/strong&gt;&lt;br&gt;
Processes data in vectors (2048 values at a time by default) to leverage CPU caching and SIMD instructions, reducing per-row overhead.&lt;/p&gt;

&lt;p&gt;These two design choices allow DuckDB to handle complex analytical queries efficiently on a single machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Handling Large Datasets
&lt;/h2&gt;

&lt;p&gt;DuckDB dynamically manages memory and disk usage based on workload size:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;In-Memory Mode:&lt;/strong&gt; Keeps everything in RAM if possible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Out-of-Core Mode:&lt;/strong&gt; Spills to disk if data exceeds memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Execution:&lt;/strong&gt; Switches between modes automatically based on workload.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent Storage:&lt;/strong&gt; Can save results in &lt;code&gt;.duckdb&lt;/code&gt; files for reuse.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No manual configuration. No crashing on out-of-memory errors (Hi Pandas!).&lt;/p&gt;

&lt;h2&gt;
  
  
  Extensibility &amp;amp; Concurrency
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Single-writer, multiple-reader concurrency (MVCC).&lt;/li&gt;
&lt;li&gt;Growing ecosystem of extensions: Parquet, CSV, S3, HTTP endpoints, geospatial analytics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Trade-Offs: DuckDB vs Specialized Engines
&lt;/h2&gt;

&lt;p&gt;DuckDB is flexible and fast, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQL Parsing Overhead:&lt;/strong&gt; Engines like Polars can be faster for simple dataframe operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;General Purpose Design:&lt;/strong&gt; Flexibility trades off some raw speed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That said, for most data engineering tasks, the trade-off is worth it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where DuckDB Shines
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Local dataset exploration (when Pandas hits limits).&lt;/li&gt;
&lt;li&gt;CI and pipeline testing without Spark.&lt;/li&gt;
&lt;li&gt;Batch transformations on Parquet, CSV, and other formats.&lt;/li&gt;
&lt;li&gt;Lightweight production workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Limits to Keep in Mind
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Single-machine only - limited by your hardware.&lt;/li&gt;
&lt;li&gt;Not built for transactional workloads.&lt;/li&gt;
&lt;li&gt;SQL pipelines can get messy if not managed well.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Reflection: Why This Matters
&lt;/h2&gt;

&lt;p&gt;DuckDB helps bridge the gap between dataset size and engineering overhead. It’s not about replacing big tools, but avoiding them when you don’t need them.&lt;/p&gt;

&lt;p&gt;For tasks that outgrow Pandas or require complex queries, it’s a practical alternative to heavier tools.&lt;/p&gt;




&lt;p&gt;Thanks for reading.&lt;/p&gt;

</description>
      <category>database</category>
      <category>sql</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>kafka-replay-cli: A Lightweight Kafka Replay &amp; Debugging Tool</title>
      <dc:creator>Konstantinas Mamonas</dc:creator>
      <pubDate>Sat, 03 May 2025 07:12:22 +0000</pubDate>
      <link>https://forem.com/konstantinas_mamonas/kafka-replay-cli-a-lightweight-kafka-replay-debugging-tool-19h5</link>
      <guid>https://forem.com/konstantinas_mamonas/kafka-replay-cli-a-lightweight-kafka-replay-debugging-tool-19h5</guid>
      <description>&lt;h2&gt;
  
  
  Project Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/KonMam/kafka-replay-cli" rel="noopener noreferrer"&gt;github.com/KonMam/kafka-replay-cli&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/kafka-replay-cli/" rel="noopener noreferrer"&gt;pypi.org/project/kafka-replay-cli&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;I wanted more hands-on &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Kafka&lt;/a&gt; experience - that's the gist of it. Before this, I’d dealt with a few producers/consumers here and there, read the docs, and studied Kafka’s architectural design principles (very insightful read if you are interested in that sort of thing: &lt;a href="https://kafka.apache.org/documentation/" rel="noopener noreferrer"&gt;https://kafka.apache.org/documentation/&lt;/a&gt;).&lt;br&gt;
But there’s only so much you can learn with limited exposure and just reading, so I decided to spend some time tinkering and learning by doing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Goals
&lt;/h2&gt;

&lt;p&gt;There were a few things I wanted to achieve with this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get more Kafka experience - main goal.&lt;/li&gt;
&lt;li&gt;Integrate &lt;a href="https://duckdb.org/" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt; - for the past year, I have seen a lot of hype around it and have started using it for some ad-hoc analysis. I enjoy using it, so I wanted to find a place for it.&lt;/li&gt;
&lt;li&gt;Have something to show at the end of it - meaning, find a real issue that people using Kafka might have and develop something around it, applying good practices.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Problem &amp;amp; MVP
&lt;/h2&gt;

&lt;p&gt;I needed to find a problem I could so-called 'solve,' even if it had been done before. After some careful Googling and ChatGPT-ing, &lt;strong&gt;Kafka message replay&lt;/strong&gt; came up as something people either struggle with or need heavy tools to handle. The tool should be useful for someone who needs to reprocess events with filters or transformations, debugging, or migrating data between topics.&lt;/p&gt;

&lt;p&gt;The initial MVP I scoped was simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Basic replay of messages with filters.&lt;/li&gt;
&lt;li&gt;Ability to dump Kafka topic data.&lt;/li&gt;
&lt;li&gt;Query dumped data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wanted it lightweight, scriptable, and easy to use - no streaming engine, web UI, or over-engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The first decision I had to make was whether to use Python or Golang.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Arguments for Python&lt;/strong&gt; - I have the most experience with it and expected it would be easier and faster to develop.&lt;br&gt;
&lt;strong&gt;Arguments for Golang&lt;/strong&gt; - In the long run, it would most likely be more performant. I would get more familiar with Golang.&lt;/p&gt;

&lt;p&gt;Due to my decision to have something tangible in a few days, I went with Python. Since it is a small tool and I didn’t know how much use it would get, I preferred not to worry about making it as performant as possible - premature optimization is the root of all evil, after all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools used for this project:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka - the core thing I wanted to learn. Using the &lt;code&gt;confluent_kafka&lt;/code&gt; Python package, as it had all the features I needed.&lt;/li&gt;
&lt;li&gt;DuckDB - see above.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://typer.tiangolo.com/" rel="noopener noreferrer"&gt;Typer&lt;/a&gt; - a library for building CLI applications. I had never used it before but liked the look and ergonomics it offered.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arrow.apache.org/docs/python/parquet.html" rel="noopener noreferrer"&gt;PyArrow for Parquet&lt;/a&gt; - efficient storage; I’m used to working with it, and DuckDB can read from it. For alternatives could have used JSON or Avro, but JSON is inefficient for larger data volumes. Avro - might add support in the future.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Dump Kafka topics into Parquet files&lt;/li&gt;
&lt;li&gt;Replay messages from Parquet back into Kafka&lt;/li&gt;
&lt;li&gt;Filter replays by timestamp range and key&lt;/li&gt;
&lt;li&gt;Optional throttling during replay&lt;/li&gt;
&lt;li&gt;Apply custom transform hooks to modify or skip messages&lt;/li&gt;
&lt;li&gt;Preview replays without sending messages using &lt;code&gt;--dry-run&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Control output verbosity with &lt;code&gt;--verbose&lt;/code&gt; and &lt;code&gt;--quiet&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Query message dumps with DuckDB SQL&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Kafka&lt;/strong&gt; - Not as intimidating as expected and quite enjoyable. Both the official Kafka CLI tools and the Python integrations are mature.&lt;br&gt;
&lt;strong&gt;DuckDB&lt;/strong&gt; - Currently limited use in the project, but good for what it does. I might add more use for it in the future or remove it to reduce bloat if it isn’t utilized.&lt;br&gt;
&lt;strong&gt;Typer&lt;/strong&gt; - Enjoyed working with it a lot. Super easy to get a CLI tool going.&lt;br&gt;
&lt;strong&gt;Testing&lt;/strong&gt; - Used &lt;code&gt;pytest&lt;/code&gt;. For unit tests, I didn’t want Kafka running for each test, so I used &lt;code&gt;MagicMock&lt;/code&gt; and &lt;code&gt;monkeypatch&lt;/code&gt; to simulate real objects - techniques I’ll keep in my pocket for the future. For integration testing, I spun up a Docker container with a Kafka broker and tested real usage of the CLI via &lt;code&gt;subprocess&lt;/code&gt;.&lt;/p&gt;
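&lt;p&gt;As a sketch of that unit-testing approach: &lt;code&gt;MagicMock&lt;/code&gt; (standard library) stands in for a real &lt;code&gt;confluent_kafka.Producer&lt;/code&gt;, so no broker is needed. The &lt;code&gt;replay&lt;/code&gt; helper here is hypothetical, not the tool’s actual code:&lt;/p&gt;

```python
from unittest.mock import MagicMock

# Hypothetical replay helper: sends each (key, value) pair via producer.produce().
def replay(producer, messages, topic):
    for key, value in messages:
        producer.produce(topic, key=key, value=value)
    producer.flush()

# MagicMock records every call, so we can assert on behavior without Kafka.
fake_producer = MagicMock()
replay(fake_producer, [(b"k1", b"v1"), (b"k2", b"v2")], topic="events")

assert fake_producer.produce.call_count == 2
fake_producer.produce.assert_called_with("events", key=b"k2", value=b"v2")
fake_producer.flush.assert_called_once()
```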

&lt;p&gt;&lt;strong&gt;Main takeaway:&lt;/strong&gt;&lt;br&gt;
It’s important to figure out your goals and think about the architecture before you start mashing on the keyboard. Deciding the project scope and dependencies early let me focus on the main features. It’s always a balancing act: what’s core, what’s nice to have, and how much time you want to spend.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outcome &amp;amp; Reflection
&lt;/h2&gt;

&lt;p&gt;Did I get more Kafka experience? Yes.&lt;br&gt;
Does the tool do what I set out to make it do? Yes.&lt;br&gt;
Is it the best thing since sliced bread? Highly unlikely.&lt;br&gt;
Are there better tools for this use case? Probably.&lt;/p&gt;

&lt;p&gt;At the end of the day, this was a learning experience and I had fun. If someone uses it - great. If no one does - also great, it just means that I didn't spend enough time researching real usage problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation &amp;amp; Usage
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;kafka-replay-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kafka-replay-cli dump &lt;span class="nt"&gt;--help&lt;/span&gt;
kafka-replay-cli replay &lt;span class="nt"&gt;--help&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;Thank you for reading.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Kafka Producers Explained: Partitioning, Batching, and Reliability</title>
      <dc:creator>Konstantinas Mamonas</dc:creator>
      <pubDate>Mon, 28 Apr 2025 11:06:21 +0000</pubDate>
      <link>https://forem.com/konstantinas_mamonas/kafka-producers-explained-partitioning-batching-and-reliability-4bm8</link>
      <guid>https://forem.com/konstantinas_mamonas/kafka-producers-explained-partitioning-batching-and-reliability-4bm8</guid>
      <description>&lt;p&gt;A Kafka producer is the entry point for all data written to Kafka. It sends records to specific topic partitions, defines batching behavior, and controls how reliably data is delivered.&lt;/p&gt;

&lt;p&gt;This post covers the behaviors and configurations that influence the producer: partitioning, batching, delivery guarantees, and message structure.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Does a Kafka Producer Do?
&lt;/h2&gt;

&lt;p&gt;A Kafka producer is a client library integrated into applications to write messages to Kafka topics. When a message is sent, the producer determines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which partition the message should go to
&lt;/li&gt;
&lt;li&gt;How to serialize the message for Kafka
&lt;/li&gt;
&lt;li&gt;Whether to batch it with others
&lt;/li&gt;
&lt;li&gt;How many acknowledgments are required before the message is considered delivered
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Producers are designed to balance speed, reliability, ordering, and throughput. Optimizing for one might require compromising another.&lt;/p&gt;




&lt;h2&gt;
  
  
  Partitioning Strategies: Routing Messages to Partitions
&lt;/h2&gt;

&lt;p&gt;Kafka topics are split into partitions. Every message sent by a producer is written to one partition. This decision is made by a partitioner function.&lt;/p&gt;

&lt;h3&gt;
  
  
  With a Key
&lt;/h3&gt;

&lt;p&gt;If a message has a key, Kafka hashes it using the Murmur2 algorithm and assigns the message to a partition using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;partition = hash(key) % number_of_partitions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures all messages with the same key go to the same partition. Kafka guarantees message order within a partition, so key-based partitioning is how per-key ordering is maintained.&lt;/p&gt;
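&lt;p&gt;A stand-in implementation shows the shape of the mapping. Kafka’s Java client uses Murmur2, not md5 - the point here is only that a deterministic hash makes the key-to-partition mapping stable:&lt;/p&gt;

```python
import hashlib

def pick_partition(key, num_partitions):
    """Stable key-to-partition mapping. md5 is a deterministic stand-in
    for Kafka's actual Murmur2 hash, used here for illustration only."""
    digest = hashlib.md5(key).digest()
    bucket = int.from_bytes(digest[:4], "big")
    return bucket % num_partitions

# The same key always lands on the same partition.
assert pick_partition(b"user-42", 6) == pick_partition(b"user-42", 6)
print(pick_partition(b"user-42", 6), pick_partition(b"user-7", 6))
```

&lt;p&gt;Note the flip side of the modulo: changing the partition count remaps existing keys, which is why adding partitions breaks per-key ordering guarantees for old data.&lt;/p&gt;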

&lt;h3&gt;
  
  
  Without a Key
&lt;/h3&gt;

&lt;p&gt;If the key is null, Kafka uses one of two strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Round-robin&lt;/strong&gt;: messages cycle through partitions in order. Used in older clients
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sticky partitioning&lt;/strong&gt;: the producer sends all messages to the same partition until the batch is sent, then picks a new one. Default in modern clients
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sticky partitioning improves batching efficiency while maintaining fair distribution over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Message Format: Structure and Serialization
&lt;/h2&gt;

&lt;p&gt;Kafka treats every message as a set of bytes. Each record includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Key (optional): used for partitioning. Serialized to bytes
&lt;/li&gt;
&lt;li&gt;Value: the actual data payload. Serialized to bytes
&lt;/li&gt;
&lt;li&gt;Headers (optional): metadata as key-value pairs
&lt;/li&gt;
&lt;li&gt;Timestamp: assigned by the client or broker
&lt;/li&gt;
&lt;li&gt;Partition + Offset: assigned by the broker after the message is stored
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka does not interpret or modify message content; it just stores and transmits byte arrays. Producers are responsible for serializing messages before sending them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example (Python):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value_serializer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Efficient serialization improves throughput and reduces broker load. Avoid inefficient formats like uncompressed JSON unless specifically required by system constraints.&lt;/p&gt;




&lt;h2&gt;
  
  
  Batching and Compression: Optimizing Throughput
&lt;/h2&gt;

&lt;p&gt;Sending one message per request is inefficient. Kafka producers batch multiple records together per partition before sending them to the broker.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Configuration Options
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;batch.size&lt;/code&gt;: maximum size in bytes for a batch. Larger batches improve compression and throughput, but increase memory usage
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;linger.ms&lt;/code&gt;: how long to wait before sending a batch, even if it is not full. Increases batching opportunities at the cost of latency
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;compression.type&lt;/code&gt;: compresses full batches. Options include &lt;code&gt;gzip&lt;/code&gt;, &lt;code&gt;lz4&lt;/code&gt;, &lt;code&gt;snappy&lt;/code&gt;, &lt;code&gt;zstd&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;send()&lt;/code&gt; method is non-blocking. It queues the record in memory and returns immediately. The background sender thread flushes batches when &lt;code&gt;batch.size&lt;/code&gt; is reached or &lt;code&gt;linger.ms&lt;/code&gt; expires.&lt;/p&gt;

&lt;p&gt;Batching operates per partition. As a result, applications that produce to a large number of partitions may see reduced batching efficiency unless message flow is concentrated on fewer partitions.&lt;/p&gt;
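&lt;p&gt;The flush triggers can be sketched as a toy buffer that flushes on size or time, mirroring &lt;code&gt;batch.size&lt;/code&gt; and &lt;code&gt;linger.ms&lt;/code&gt; (a hypothetical class, not client code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

class BatchBuffer:
    # Accumulate records for one partition; flush when the buffer is
    # full (batch.size) or has waited long enough (linger.ms).
    def __init__(self, batch_size: int, linger_ms: int):
        self.batch_size = batch_size
        self.linger_s = linger_ms / 1000.0
        self.records = []
        self.bytes = 0
        self.opened_at = None

    def append(self, record: bytes) -&gt; None:
        if self.opened_at is None:
            self.opened_at = time.monotonic()
        self.records.append(record)
        self.bytes += len(record)

    def ready(self) -&gt; bool:
        if not self.records:
            return False
        full = self.bytes &gt;= self.batch_size
        expired = time.monotonic() - self.opened_at &gt;= self.linger_s
        return full or expired

    def flush(self) -&gt; list:
        out, self.records, self.bytes, self.opened_at = self.records, [], 0, None
        return out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;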




&lt;h2&gt;
  
  
  Delivery Guarantees: Configuring Reliability and Ordering
&lt;/h2&gt;

&lt;p&gt;Kafka producers can trade reliability for speed using the &lt;code&gt;acks&lt;/code&gt; configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;acks=0&lt;/code&gt;: fire and forget. Fastest, but data may be lost
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;acks=1&lt;/code&gt;: wait for leader. Reasonable balance for many use cases
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;acks=all&lt;/code&gt;: wait for all in-sync replicas. Safest, with higher latency
&lt;/li&gt;
&lt;/ul&gt;
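&lt;p&gt;For example, a durability-focused setup might look like this (the key names follow the standard producer configuration names; exact spelling varies between client libraries, and the broker address is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Biased toward durability: wait for all in-sync replicas, retry
# transient failures.
durable_producer_config = {
    "bootstrap.servers": "localhost:9092",  # placeholder address
    "acks": "all",
    "retries": 5,
}

# Biased toward speed: fire and forget, data may be lost.
fast_producer_config = {
    "bootstrap.servers": "localhost:9092",
    "acks": "0",
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;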

&lt;h3&gt;
  
  
  Ordering and Retries
&lt;/h3&gt;

&lt;p&gt;Kafka guarantees ordering within a single partition. To maintain strict ordering, ensure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All related messages share the same key
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max.in.flight.requests.per.connection &amp;lt;= 1&lt;/code&gt; when retries are enabled (to prevent out-of-order writes during retries)
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Idempotence and Transactions
&lt;/h2&gt;

&lt;p&gt;By default, producers use at-least-once semantics, meaning retries may cause duplicate messages. Kafka provides stronger guarantees where needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Idempotent Producer
&lt;/h3&gt;

&lt;p&gt;Enable with &lt;code&gt;enable.idempotence=true&lt;/code&gt;. This prevents duplicates during retries by assigning each producer a unique ID and tracking sequence numbers per partition.&lt;/p&gt;

&lt;p&gt;This guarantees exactly-once delivery per partition, assuming the producer does not crash and restart with a new ID.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use this when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Downstream systems cannot deduplicate
&lt;/li&gt;
&lt;li&gt;Every message must be uniquely written (for example, financial systems)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Avoid using high &lt;code&gt;max.in.flight&lt;/code&gt; values with idempotence if ordering matters.&lt;/p&gt;
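&lt;p&gt;Conceptually, the broker-side check works like this toy model: each producer ID carries a per-partition sequence number, and anything that is not the next expected sequence is rejected rather than re-written:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class BrokerPartition:
    # Toy view of idempotence: accept a write only if its
    # (producer_id, sequence) pair is the next expected one.
    def __init__(self):
        self.messages = []
        self.next_seq = {}  # producer_id to next expected sequence

    def append(self, producer_id: str, seq: int, payload: bytes) -&gt; bool:
        expected = self.next_seq.get(producer_id, 0)
        if seq != expected:
            return False  # duplicate retry (or gap): rejected
        self.messages.append(payload)
        self.next_seq[producer_id] = seq + 1
        return True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;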

&lt;h3&gt;
  
  
  Transactional Producer
&lt;/h3&gt;

&lt;p&gt;Transactional producers enable atomic writes across multiple partitions or topics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Requires:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A configured &lt;code&gt;transactional.id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Use of API methods: &lt;code&gt;begin_transaction()&lt;/code&gt;, &lt;code&gt;send()&lt;/code&gt;, &lt;code&gt;commit_transaction()&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is critical for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exactly-once event processing pipelines
&lt;/li&gt;
&lt;li&gt;Kafka Streams applications
&lt;/li&gt;
&lt;li&gt;Coordinating multiple topic writes as a single atomic unit
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Transactions ensure no duplicates, no partial writes, and consistent failure handling.&lt;/p&gt;
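&lt;p&gt;An in-memory toy illustrating the atomicity (the real protocol writes commit markers to the partitions and relies on read-committed consumers, but the visible effect is the same: all writes land together or not at all):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class ToyTransactionalProducer:
    # Not the real client API; mirrors its begin/commit/abort shape.
    def __init__(self, logs: dict):
        self.logs = logs      # partition name to list of messages
        self.pending = None

    def begin_transaction(self):
        self.pending = []

    def send(self, partition: str, payload: bytes):
        self.pending.append((partition, payload))  # buffered, not visible

    def commit_transaction(self):
        for partition, payload in self.pending:    # all land together
            self.logs[partition].append(payload)
        self.pending = None

    def abort_transaction(self):
        self.pending = None                        # nothing becomes visible
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;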




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;A well-tuned Kafka producer is critical to balancing throughput, reliability, and resource efficiency. Understand your delivery requirements and system constraints before leaning into aggressive batching or strong guarantees, since each trades throughput against latency or durability.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>architecture</category>
      <category>performance</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Kafka Consumers Explained: Pull, Offsets, and Parallelism</title>
      <dc:creator>Konstantinas Mamonas</dc:creator>
      <pubDate>Wed, 23 Apr 2025 09:39:35 +0000</pubDate>
      <link>https://forem.com/konstantinas_mamonas/kafka-consumers-explained-pull-offsets-and-parallelism-iki</link>
      <guid>https://forem.com/konstantinas_mamonas/kafka-consumers-explained-pull-offsets-and-parallelism-iki</guid>
      <description>&lt;p&gt;Kafka is built for high throughput, scalability, and fault tolerance. At the core of this is its consumer model. Unlike traditional messaging systems, Kafka gives consumers full control over how they read data. This post explains how Kafka consumers work by focusing on three things: how they pull data, how offsets work, and how parallelism is achieved with consumer groups.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pulling Data from Kafka
&lt;/h2&gt;

&lt;p&gt;Kafka producers push data to brokers. Consumers pull data from brokers. This setup is intentional. It gives consumers control over how fast they process data.&lt;/p&gt;

&lt;p&gt;In push-based systems, if the producer is faster than the consumer, the consumer can get overwhelmed or crash. Kafka avoids this problem by letting consumers decide when to fetch data. This helps with backpressure and makes the system more reliable.&lt;/p&gt;

&lt;p&gt;Pulling also helps with batching. A consumer can fetch many messages in a single request. This reduces the number of network calls. In contrast, push systems must send each message one by one or hold back messages without knowing if the consumer is ready.&lt;/p&gt;

&lt;p&gt;One downside of pull-based systems is wasteful polling. A consumer might keep asking for data even if nothing is available. Kafka avoids this by letting the consumer wait until enough data is ready before responding. This keeps CPU usage low and throughput high.&lt;/p&gt;
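&lt;p&gt;This waiting behavior (&lt;code&gt;fetch.min.bytes&lt;/code&gt; plus &lt;code&gt;fetch.max.wait.ms&lt;/code&gt;) can be sketched as a simple loop, simplified here to record counts rather than bytes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

def fetch(log, start, min_records, max_wait_s):
    # Return once enough records exist past `start`, or the wait expires.
    deadline = time.monotonic() + max_wait_s
    while not (len(log) - start &gt;= min_records or time.monotonic() &gt;= deadline):
        time.sleep(0.001)  # a real broker uses an event wait, not a spin
    return log[start:]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;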

&lt;p&gt;Kafka also avoids a model where brokers pull data from producers. That design would need every producer to store its own data. It would require more coordination and increase the risk of disk failure. Instead, Kafka stores data on the broker, where it can be managed and replicated more easily.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Kafka Offsets Work
&lt;/h2&gt;

&lt;p&gt;Kafka splits topics into partitions. Each message in a partition has a number called an offset. The offset marks the position of the message in the log.&lt;/p&gt;

&lt;p&gt;Offsets give consumers control. A consumer chooses where to start reading and can track what has already been processed. If a consumer crashes, it can pick up where it left off by using its last committed offset.&lt;/p&gt;

&lt;p&gt;Kafka does not track this progress for the consumer. The consumer is responsible for managing its own offsets. This is part of what makes Kafka scalable and efficient.&lt;/p&gt;

&lt;p&gt;When a consumer has no committed offset, the &lt;code&gt;auto.offset.reset&lt;/code&gt; setting decides where it starts: &lt;code&gt;earliest&lt;/code&gt; reads all messages that are still available, while &lt;code&gt;latest&lt;/code&gt; (the default in most clients) reads only new data.&lt;/p&gt;

&lt;p&gt;Kafka only keeps data for a limited time. If a consumer tries to read from an offset that is too old, Kafka will return an error. In that case, the consumer must reset to the earliest or latest available offset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Terms
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Offset&lt;/strong&gt;: The position of a message in a partition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log End Offset&lt;/strong&gt;: The offset where the next message will be written.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Committed Offset&lt;/strong&gt;: The consumer's saved position; conventionally the offset of the next message to read, one past the last message fully processed.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Delivery Options
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;At-most-once&lt;/strong&gt;: The consumer commits the offset before processing. If it crashes during processing, the message is lost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;At-least-once&lt;/strong&gt;: The consumer commits the offset after processing. If it crashes before committing, the message may be processed again.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exactly-once&lt;/strong&gt;: This uses Kafka transactions. The message and its offset are written together. If anything fails, nothing is committed. This guarantees no duplication and no loss.&lt;/li&gt;
&lt;/ul&gt;
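&lt;p&gt;The difference between the first two comes down to where the commit happens relative to processing. A toy sketch, with &lt;code&gt;committed&lt;/code&gt; standing in for the committed offset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def consume_at_least_once(log, committed, process):
    # Commit only after processing: a crash mid-batch replays the
    # uncommitted records, so a message may be processed twice.
    for offset in range(committed, len(log)):
        process(log[offset])
        committed = offset + 1   # commit after processing
    return committed

def consume_at_most_once(log, committed, process):
    # Commit first: a crash during processing loses that record.
    for offset in range(committed, len(log)):
        committed = offset + 1   # commit before processing
        process(log[offset])
    return committed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;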




&lt;h2&gt;
  
  
  Parallelism with Consumer Groups
&lt;/h2&gt;

&lt;p&gt;Kafka uses consumer groups to scale out processing. A consumer group is a set of consumers working together to read from a topic.&lt;/p&gt;

&lt;p&gt;Kafka assigns each partition to only one consumer in the group. This avoids duplication and ensures order within each partition.&lt;/p&gt;

&lt;p&gt;When the group changes (for example, when consumers are added or removed), Kafka reassigns partitions to the available consumers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Examples
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;100 partitions and 100 consumers: each consumer handles one partition.&lt;/li&gt;
&lt;li&gt;100 partitions and 50 consumers: each consumer handles two partitions.&lt;/li&gt;
&lt;li&gt;50 partitions and 100 consumers: only 50 consumers do work; the rest sit idle.&lt;/li&gt;
&lt;/ul&gt;
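&lt;p&gt;A minimal round-robin assignment sketch reproduces those numbers (real rebalancing is negotiated through a group coordinator, but the arithmetic is the same):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def assign_partitions(num_partitions, consumers):
    # Each partition goes to exactly one consumer in the group;
    # consumers beyond the partition count end up with nothing.
    assignment = {c: [] for c in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;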

&lt;p&gt;Kafka does not let multiple consumers read from the same partition in the same group. This would require the broker to manage shared state, which adds complexity. Instead, Kafka puts the responsibility on the consumer to track offsets. This makes the broker faster and simpler.&lt;/p&gt;

&lt;p&gt;The number of partitions controls how much you can scale out. More partitions allow for more parallelism. Choosing the right number of partitions is important for performance and resource usage.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Kafka gives consumers control over how they pull data, where they start, and how they scale. Pull-based reads avoid overload. Offsets make it easy to recover from failure. Consumer groups allow you to scale out processing.&lt;/p&gt;

&lt;p&gt;This design makes Kafka fast, reliable, and efficient at any scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix: Quick Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partition&lt;/strong&gt;: A subset of a topic. Used for parallel processing and message ordering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offset&lt;/strong&gt;: A number showing a message’s position in a partition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer Group&lt;/strong&gt;: A set of consumers that share the work of reading from a topic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rebalancing&lt;/strong&gt;: The process where Kafka reassigns partitions when consumers join or leave a group.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delivery Types&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;At-most-once&lt;/em&gt;: Fast, but may lose messages.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;At-least-once&lt;/em&gt;: Reliable, but may duplicate messages.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Exactly-once&lt;/em&gt;: Most accurate, but needs Kafka transactions.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>eventdriven</category>
      <category>architecture</category>
      <category>streaming</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>How Kafka Achieves High Throughput: A Breakdown of Its Log-Centric Architecture</title>
      <dc:creator>Konstantinas Mamonas</dc:creator>
      <pubDate>Sun, 20 Apr 2025 05:14:20 +0000</pubDate>
      <link>https://forem.com/konstantinas_mamonas/how-kafka-achieves-high-throughput-a-breakdown-of-its-log-centric-architecture-3i7k</link>
      <guid>https://forem.com/konstantinas_mamonas/how-kafka-achieves-high-throughput-a-breakdown-of-its-log-centric-architecture-3i7k</guid>
      <description>&lt;p&gt;Kafka routinely handles millions of messages per second on commodity hardware. This performance isn't accidental. It stems from deliberate architectural choices centered around log-based storage, OS-level optimizations, and minimal coordination between readers and writers.&lt;/p&gt;

&lt;p&gt;This post breaks down the core mechanisms that enable Kafka's high-throughput design.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Append-Only Log Storage
&lt;/h2&gt;

&lt;p&gt;Each Kafka topic is split into partitions, and each partition is an append-only log. It is essentially a durable, ordered sequence of messages that are immutable once written.&lt;/p&gt;

&lt;p&gt;To manage growing data size efficiently, Kafka breaks each partition’s log into multiple segment files. A segment is a file on disk that stores a contiguous range of messages. New messages are always written to the active segment using low-level system calls like &lt;code&gt;write()&lt;/code&gt;. The write lands in the OS page cache; it is not flushed to disk immediately.&lt;/p&gt;

&lt;p&gt;Kafka delays calling &lt;code&gt;fsync()&lt;/code&gt; to flush data to disk, relying instead on configurable flush policies (based on time or size). This reduces disk I/O and improves performance, at the cost of brief durability gaps. Kafka mitigates this through replication across brokers.&lt;/p&gt;

&lt;p&gt;Over time, when a segment reaches a size threshold, it is closed and a new one is created. Older segments become read-only and are subject to log retention, compaction, or deletion based on topic settings.&lt;/p&gt;

&lt;p&gt;By aligning its write path with sequential disk I/O, Kafka avoids random seeks entirely. This makes reads and writes fast and predictable, even on spinning disks, and scales well with data volume.&lt;/p&gt;
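&lt;p&gt;The segment-rolling behavior can be sketched in a few lines (a toy in-memory model; the threshold plays the role of &lt;code&gt;segment.bytes&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class SegmentedLog:
    # Append-only partition log that rolls to a new segment once the
    # active one would pass a size threshold.
    def __init__(self, segment_bytes: int):
        self.segment_bytes = segment_bytes
        self.segments = [[]]   # each segment is a list of messages
        self.active_size = 0

    def append(self, message: bytes) -&gt; None:
        if self.active_size + len(message) &gt; self.segment_bytes and self.active_size &gt; 0:
            self.segments.append([])   # close the old segment, start fresh
            self.active_size = 0
        self.segments[-1].append(message)
        self.active_size += len(message)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;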




&lt;h2&gt;
  
  
  2. Outperforming Traditional Queues with Sequential I/O
&lt;/h2&gt;

&lt;p&gt;Traditional messaging systems often manage message delivery using per-consumer tracking and persistence mechanisms that can result in random disk access, especially during acknowledgment, redelivery, or crash recovery. While these systems are efficient in memory, random I/O patterns on disk introduce performance bottlenecks. For spinning disks, a single seek can take around 10 milliseconds, and disks can only perform one seek at a time.&lt;/p&gt;

&lt;p&gt;Kafka sidesteps this entirely by relying on sequential I/O. Writes are appended, and reads proceed in order. This design significantly improves disk efficiency, especially under load.&lt;/p&gt;

&lt;p&gt;By decoupling performance from data volume and enabling concurrent read/write access, Kafka makes efficient use of low-cost storage hardware, such as commodity SATA drives, without sacrificing performance.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Speeding Up Seeks with Lightweight Indexing
&lt;/h2&gt;

&lt;p&gt;Each segment file is accompanied by lightweight offset and timestamp indexes. These allow consumers to seek directly to specific message positions without scanning entire files, ensuring fast lookup even on large datasets.&lt;/p&gt;

&lt;p&gt;Since Kafka consumers track their own offsets and messages are immutable, there is no need to update shared state for acknowledgments or deletions. This eliminates coordination between readers and writers, reducing contention and enabling true parallelism.&lt;/p&gt;
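&lt;p&gt;The offset index is sparse: only some offsets get an entry, and a binary search finds the nearest indexed position to scan forward from. A sketch with made-up index values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import bisect

# Hypothetical sparse index for one segment: every indexed message
# offset maps to a byte position in the segment file.
index_offsets = [0, 100, 200, 300]
index_positions = [0, 4096, 8192, 12288]

def seek_position(target_offset: int) -&gt; int:
    # Find the greatest indexed offset not past the target; the reader
    # then scans forward from that byte position.
    i = bisect.bisect_right(index_offsets, target_offset) - 1
    return index_positions[i]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;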




&lt;h2&gt;
  
  
  4. Batching to Maximize I/O Efficiency
&lt;/h2&gt;

&lt;p&gt;High-throughput systems must avoid the overhead of processing one message at a time. Kafka uses a message set abstraction to batch messages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Producers group messages before sending them.&lt;/li&gt;
&lt;li&gt;Brokers perform a single disk write per batch.&lt;/li&gt;
&lt;li&gt;Consumers fetch large batches with a single network call.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This batching reduces system calls, disk seeks, and protocol overhead. As a result, throughput improves significantly.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Zero-Copy Data Transfer with &lt;code&gt;sendfile()&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Conventional data transfer involves multiple memory copies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Disk to kernel space (page cache)
&lt;/li&gt;
&lt;li&gt;Kernel to user space (application buffer)
&lt;/li&gt;
&lt;li&gt;User space back to kernel (socket buffer)
&lt;/li&gt;
&lt;li&gt;Kernel to NIC buffer (for network)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Kafka avoids this overhead using the &lt;code&gt;sendfile()&lt;/code&gt; system call. This enables zero-copy data transfer from the page cache directly to the network stack, bypassing user space entirely.&lt;/p&gt;

&lt;p&gt;This reduces CPU usage and memory pressure, allowing near wire-speed data transfer even under heavy load.&lt;/p&gt;
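&lt;p&gt;Python exposes the same syscall as &lt;code&gt;os.sendfile&lt;/code&gt;, which makes the idea easy to demonstrate (Kafka’s broker does this via Java’s &lt;code&gt;FileChannel.transferTo&lt;/code&gt;; which destination descriptors are allowed varies by platform):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import socket
import tempfile

# Bytes go from the page cache straight to the socket, never entering
# this process's buffers.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"log segment bytes")
    f.flush()
    out_sock, in_sock = socket.socketpair()  # stands in for a consumer connection
    with open(f.name, "rb") as segment:
        sent = os.sendfile(out_sock.fileno(), segment.fileno(), 0, 17)
    received = in_sock.recv(64)
    out_sock.close()
    in_sock.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;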




&lt;h2&gt;
  
  
  6. Long-Term Retention Without Performance Loss
&lt;/h2&gt;

&lt;p&gt;Kafka’s append-only log model enables long-term message retention, even for days or weeks, without degrading performance. Because reads and writes are decoupled, and messages are not mutated post-write, old data remains accessible without impacting current workloads.&lt;/p&gt;

&lt;p&gt;This supports powerful use cases like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replaying messages for state recovery
&lt;/li&gt;
&lt;li&gt;Late-arriving consumer processing
&lt;/li&gt;
&lt;li&gt;Time-travel debugging and auditing&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Kafka’s high throughput is the result of system and architectural decisions that work together by design. Its log-centric model avoids random I/O, minimizes coordination, and takes full advantage of OS-level features like the page cache and zero-copy transfers.&lt;/p&gt;

&lt;p&gt;The result: Kafka handles massive data volumes not through abstract complexity, but by working with the OS instead of against it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix: Key Terms
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;write()&lt;/code&gt;: A system call that transfers data from user space to the OS page cache.
&lt;/li&gt;
&lt;li&gt;Page cache: A memory buffer managed by the kernel.
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fsync()&lt;/code&gt;: Forces data in the page cache to be flushed to disk.
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sendfile()&lt;/code&gt;: A system call that sends data from a file directly to the network without copying to user space.
&lt;/li&gt;
&lt;li&gt;Sequential I/O: Reading or writing data in a linear order. Much faster than random I/O, especially on HDDs.
&lt;/li&gt;
&lt;li&gt;Random I/O: Accessing data at non-contiguous disk locations. This causes performance degradation due to disk seeks.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Further Reading:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kafka.apache.org/documentation/#design" rel="noopener noreferrer"&gt;Kafka Official Design Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>kafka</category>
      <category>architecture</category>
      <category>performance</category>
      <category>distributedsystems</category>
    </item>
  </channel>
</rss>
