alexey.zh

Posted on May 31

Testing PostgreSQL WAL Streamers for Byte-Level Fidelity

#backup #go #postgres

Verifying that WAL streamers preserve exact database state — bit by bit.

🧭 Context

In the previous post, we explored the motivations behind building pgrwl, a PostgreSQL WAL receiver designed for zero data loss (RPO=0) scenarios in containerized environments. We covered its architecture, features like compression/encryption, and its suitability for Kubernetes-based disaster recovery.

This follow-up post focuses on testing: specifically, validating that pgrwl produces WAL archives that are byte-for-byte identical to those written by PostgreSQL’s official tool (pg_receivewal), and that it supports full PITR (Point-in-Time Recovery) after abrupt system crashes.

🚀 Intro

Write-Ahead Logs (WALs) are at the heart of PostgreSQL’s crash recovery and replication capabilities. But what happens when we replace the native WAL receiver (pg_receivewal) with a third-party tool like pgrwl? Can we trust it to preserve data integrity byte-for-byte?

This post dives into a golden test designed to answer that question — by simulating real-world PostgreSQL workloads, abrupt crashes, and full recovery workflows.

Note: All Bash scripts shown here are simplified examples to illustrate the core logic. The full implementation with deep technical details and automation scripts is available in the pgrwl GitHub repository. This post focuses on explaining the primary test goal, rather than every integration nuance.

Integration Test Source Code

✅ Goal

To verify that:

pgrwl can reliably stream WALs during active writes.
The restored database is identical to its pre-crash state.
WAL files produced by pgrwl match those produced by pg_receivewal bit-for-bit.

🛠️ Tools Used

PostgreSQL 16+
pg_receivewal — the official WAL receiver.
pgrwl — WAL receiver with encryption/compression/backends.
pg_dumpall, pgbench
Bash for orchestration

🧪 Test Procedure: Step-by-Step

We simulate a live system, insert tons of data, kill everything mid-flight, and then recover from base backup + WALs.

1. Start PostgreSQL

Initialize a clean cluster:

initdb -D /tmp/pgdata
pg_ctl -D /tmp/pgdata -l logfile start

2. Launch WAL Receivers

Run both in parallel (in background):

pg_receivewal --slot=test_slot -D /tmp/pgwal_pg ...
pgrwl --mode=receive -c config.yml ...

3. Take a Base Backup

pg_basebackup \
  --pgdata="/tmp/base_backup" \
  --wal-method=none \
  --checkpoint=fast \
  --progress \
  --no-password \
  --verbose

4. Simulate Real Workload

Insert timestamps every second:

psql -c 'CREATE TABLE ticks(ts TIMESTAMPTZ DEFAULT now());'
while true; do psql -c 'INSERT INTO ticks DEFAULT VALUES;'; sleep 1; done &

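Initialize pgbench at scale factor 10 to add bulk write load (-i only populates the pgbench tables; it does not run the benchmark itself):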

pgbench -i -s 10

Create 100 tables in parallel:

for i in $(seq 1 100); do
  psql -c "CREATE TABLE t_$i AS SELECT * FROM generate_series(1, 10000) AS g(id);" &
done
wait

5. Capture Golden Snapshot

pg_dumpall > /tmp/before.sql

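Stop the ticks inserter (the background loop from step 4). Any row it adds after the dump would show up as a difference in step 9.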
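Stopping the inserter is easiest if its PID was saved when the loop was started. A minimal sketch of that pattern follows; the function names and the INSERTER_PID variable are made up for illustration, and the loop body is a stand-in for the psql INSERT so the snippet runs without a database.

```shell
#!/bin/sh
# Sketch: start a background loop, remember its PID, and stop it later.

start_inserter() {
  # Stand-in for the real loop from step 4:
  #   while true; do psql -c 'INSERT INTO ticks DEFAULT VALUES;'; sleep 1; done
  while true; do sleep 1; done &
  INSERTER_PID=$!   # PID of the background subshell running the loop
}

stop_inserter() {
  # Terminate the loop and reap it. A command already running inside
  # the loop may still finish on its own after the loop is gone.
  kill "$INSERTER_PID" 2>/dev/null || true
  wait "$INSERTER_PID" 2>/dev/null || true
}

start_inserter
stop_inserter
```

Using the saved PID avoids pattern-based tools like pkill, which could also match unrelated processes.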

6. Simulate Crash

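# Hard-kill every postgres process to simulate a crash.
pkill -9 postgres || true
# Best effort: the server is normally gone already, so ignore a failure here.
pg_ctl -D /tmp/pgdata -m immediate stop || true
rm -rf /tmp/pgdata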

7. Restore from Base + WALs

cp -r /tmp/base_backup /tmp/pgdata
touch /tmp/pgdata/recovery.signal
echo "restore_command = 'pgrwl restore-command --serve-addr=127.0.0.1:7070 %f %p'" >> /tmp/pgdata/postgresql.conf

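💡 Rename all *.partial WALs to their final names before restart. The segment that was still being streamed at crash time keeps the .partial suffix, and it holds the most recent transactions.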
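The rename can be scripted. This is a rough sketch rather than the code from the repository; the function name finalize_partial_wals is an assumption, and it only strips the suffix from files in the directory you pass in.

```shell
#!/bin/sh
# Sketch: strip the .partial suffix from every WAL segment in a directory.

finalize_partial_wals() {
  dir=$1
  for f in "$dir"/*.partial; do
    # If the glob matched nothing, the literal pattern is returned; skip it.
    [ -e "$f" ] || continue
    mv -- "$f" "${f%.partial}"
  done
}
```

For example, `finalize_partial_wals /tmp/pgwal_pgrwl` would handle the pgrwl directory used in step 10, assuming that is where the segments live in your setup.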

8. Restart PostgreSQL

pg_ctl -D /tmp/pgdata -l logfile start

Wait for recovery to complete.
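One way to wait is to poll pg_is_in_recovery() until it returns false. The sketch below is an illustration, not the repository's script: wait_until and recovery_finished are hypothetical helper names, and the psql call assumes the default connection settings work for your cluster.

```shell
#!/bin/sh
# Sketch: retry a command once per second until it succeeds or a timeout hits.

wait_until() {
  timeout_s=$1
  shift
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1          # gave up: the command never succeeded
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

recovery_finished() {
  # pg_is_in_recovery() prints 'f' once the server has left recovery.
  # -X skips psqlrc, -A/-t give bare output, -c runs the query.
  [ "$(psql -XAtc 'SELECT pg_is_in_recovery()' 2>/dev/null)" = "f" ]
}
```

In the test this could be called as `wait_until 120 recovery_finished`; the 120-second limit is an arbitrary choice.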

9. Validate Database Consistency

pg_dumpall > /tmp/after.sql
diff -u /tmp/before.sql /tmp/after.sql

✅ Expect: No differences.

Also check that the ticks table contains the latest inserted row, which confirms that no data was lost.

10. Compare WAL Files

diff -r /tmp/pgwal_pg /tmp/pgwal_pgrwl

✅ Expect: Identical content and filenames.
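If a segment ever does differ, it helps to know where. The sketch below narrows a mismatch down to a WAL page; first_diff_page is a hypothetical helper, and it assumes the default 8 KiB WAL block size.

```shell
#!/bin/sh
# Sketch: print the 0-based index of the first 8 KiB page in which two
# files differ. Prints nothing when no differing byte is found.

first_diff_page() {
  # cmp -l lists each differing byte as "<1-based offset> <octal> <octal>";
  # only the first offset is needed.
  off=$(cmp -l "$1" "$2" 2>/dev/null | head -n 1 | awk '{print $1}')
  [ -n "$off" ] || return 0
  echo $(( (off - 1) / 8192 ))
}
```

Calling it on the same segment from both directories, for instance `first_diff_page /tmp/pgwal_pg/SEGMENT /tmp/pgwal_pgrwl/SEGMENT` with SEGMENT as a placeholder name, gives a page number to inspect further.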

📉 Post-Crash: Retest on New Timeline

Recovery ends with a promotion, so the server now runs on a new timeline. Restart both WAL streamers and verify that they pick up streaming on it.

Then rerun the diff.
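A quick way to confirm that the streamers moved to the new timeline is to look at the segment names: the first 8 hex digits of a WAL file name are the timeline ID. A small sketch, with wal_timeline as a made-up helper name:

```shell
#!/bin/sh
# Sketch: print the timeline ID (decimal) encoded in a WAL segment name.

wal_timeline() {
  name=$(basename "$1")
  # Characters 1-8 of the name are the timeline ID in hexadecimal.
  tli_hex=$(printf '%s' "$name" | cut -c1-8)
  printf '%d\n' "0x$tli_hex"
}
```

Segments received after the restore should report a higher timeline than the ones streamed before the crash.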

🧠 What This Test Proves

WALs received by pgrwl are valid and byte-identical to official ones.
PostgreSQL can recover from pgrwl's archived WALs to the latest committed transaction.

🔬 Bonus: Add Compression and Encryption

Add this to the config:

compression:
  algo: gzip
encryption:
  algo: aesgcm
  pass: "${PGRWL_ENCRYPT_PASS}"

💡 WALs will no longer match byte-for-byte (they’re transformed), but recovery should still work identically.

✅ Conclusion

Testing WAL archiving isn’t just about receiving files — it’s about trust.
This golden test validates pgrwl as a reliable WAL receiver with byte-level fidelity and advanced features like encryption and compression.

📦 Check out the code: github.com/hashmap-kz/pgrwl

🙌 Get Involved

pgrwl is an open-source project built for the PostgreSQL community — and your feedback matters!

🐞 Found a bug? Open an issue
💡 Have an idea or feature request? We'd love to hear it.
🧪 Want to improve WAL testing coverage? Run the integration tests or add your own cases.
🔧 Found a rough edge or an unclear doc? Contributions are always welcome.

Start by starring ⭐ the repo, trying it out in your own cluster, and sharing what you learn.

Let’s build better PostgreSQL backup tooling — together.

Top comments (2)

Jessica Brown • Jun 6

This is a thorough approach to byte-level verification, but is it always necessary for practical disaster recovery that WAL files are completely bit-for-bit identical, or could there be cases where logical equivalence suffices?

alexey.zh • Jun 7

Hello, the very first MVP I made showed poor performance because of a message processing loop that was missing one small but significant detail here: github.com/hashmap-kz/pgrwl/blob/m.... This caused a lot of unnecessary confirmation requests.
Additionally, I later optimized the fsync functions with syscalls, which provided further performance improvements.
So this test is not only for byte-level verification, but also for timing. I rebuilt the pg_receivewal binary from source, injecting log messages and timing, and did the same for pgrwl, to confirm that the flow is identical.
The very first implementation lagged behind pg_receivewal, whereas the current version performs similarly.
