<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Dmitriy</title>
    <description>The latest articles on Forem by Dmitriy (@islamov).</description>
    <link>https://forem.com/islamov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3643873%2F4e3e6523-08af-466f-acb5-4e4a867f20ea.jpg</url>
      <title>Forem: Dmitriy</title>
      <link>https://forem.com/islamov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/islamov"/>
    <language>en</language>
    <item>
      <title>Alena Rybakina: “The path to becoming a PostgreSQL committer starts with your first review”</title>
      <dc:creator>Dmitriy</dc:creator>
      <pubDate>Wed, 11 Mar 2026 16:59:06 +0000</pubDate>
      <link>https://forem.com/islamov/alena-rybakina-the-path-to-becoming-a-postgresql-committer-starts-with-your-first-review-56n3</link>
      <guid>https://forem.com/islamov/alena-rybakina-the-path-to-becoming-a-postgresql-committer-starts-with-your-first-review-56n3</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4e7ej6fb9bguk271p4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4e7ej6fb9bguk271p4t.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’re starting a series of articles about Postgres Professional engineers who received awards for their contributions to the PostgreSQL community. In this interview, core developer Alena Rybakina explains how to start contributing and where it can take you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;— Alena, what did you receive the medal for?&lt;/strong&gt;&lt;br&gt;
— For reviewing a large number of patches. I would pick them up during CommitFests, test how they behaved, and leave feedback. Sometimes it was just about adding tests. In other cases, I found edge cases, regressions, or scenarios that hadn’t been considered.&lt;/p&gt;

&lt;p&gt;Later, I started working on a major patch of my own — OR-ANY Transformation. Let me explain what it does.&lt;/p&gt;

&lt;p&gt;The problem it addressed was inefficient planner behavior when handling multiple OR conditions on the same column. Before this patch, the PostgreSQL optimizer would typically execute a separate Bitmap Index Scan for each OR clause. That meant scanning the same index pages multiple times instead of walking them once.&lt;/p&gt;

&lt;p&gt;My patch transforms such queries into a more efficient Index Scan, or Index Only Scan, that traverses the index pages just once and filters the required values during the scan itself. The work took about a year, but eventually the patch was accepted and committed to PostgreSQL 17.&lt;/p&gt;
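
&lt;p&gt;To make this concrete, here is the kind of rewrite the transformation performs. The query below is a hypothetical illustration, not an example taken from the patch itself:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Before: the planner would typically build a separate Bitmap Index Scan
-- for each OR arm, visiting the same index pages several times.
SELECT * FROM orders WHERE status = 1 OR status = 2 OR status = 3;

-- After the OR-ANY Transformation (PostgreSQL 17), the planner can treat
-- the query as the equivalent form below and use a single Index Scan or
-- Index Only Scan, filtering the values during one pass over the index.
SELECT * FROM orders WHERE status = ANY (ARRAY[1, 2, 3]);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
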

&lt;p&gt;&lt;strong&gt;— Alena, how did the optimizer patch come about? Did it grow out of a review, or was it triggered by a real-world issue?&lt;/strong&gt;&lt;br&gt;
— It actually started with a customer case. Their framework didn’t support IN operators, so queries with multiple conditions on the same column were generated as long chains of OR clauses.&lt;br&gt;
In principle, this should have been handled with IN. The real challenge, though, was the scale. The framework was generating around 50,000 OR conditions for a single column. During planning, such a query could consume about 1.5 GB of memory and eventually be killed by the operating system.&lt;/p&gt;

&lt;p&gt;At first, it felt like a somewhat artificial problem. The root cause was clearly in the framework rather than in PostgreSQL itself. The customer eventually resolved the issue by switching frameworks and didn’t wait for the patch. Still, we found the problem interesting enough to continue working on it. While discussing the patch on the mailing list, I received a reply from Peter Geoghegan, who shared a related project he had been working on. That turned out to be closely connected to my approach. We eventually completed the patch, and the community accepted it. So in a sense, the problem appeared almost by accident, but it led to a meaningful improvement in the planner.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;— Why did you decide to start reviewing patches in the first place?&lt;/strong&gt;&lt;br&gt;
— There wasn’t any special initiative behind it. I knew I needed to do reviews to improve my skills.&lt;br&gt;
What really pushed me was my former colleague Andrey Lepikhov, who is an active PostgreSQL contributor. He believes that one of the best ways to grow as an engineer is to study how other developers write and structure their patches.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What Andrey Lepikhov was recognized for&lt;/strong&gt;&lt;br&gt;
Like Alena, Andrey received a medal for his contributions to the community. He identified a bug in the PostgreSQL codebase while working with his colleague Nikolay Shaplov to test hypotheses around fuzzing autonomous transactions. The issue appeared to have existed for quite some time, but for various reasons had not been addressed. Thanks to this finding, Tom Lane was able to close one of the vulnerabilities in PostgreSQL.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At first, it was really difficult. I didn’t know what to look for, what to comment on, or even where to start. But after a while, things began to fall into place. Now I try to do reviews as often as I can, although my main job doesn’t always leave enough time for it. It’s a great way to improve your skills and make your name known in the community.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;— In your view, what makes a “good” patch that gets through review quickly?&lt;/strong&gt;&lt;br&gt;
— It really depends on the type of patch. Broadly speaking, there are three main scenarios:&lt;/p&gt;

&lt;p&gt;1) A bug fix. Usually the fastest path. The key points are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make sure the code style is correct. Comments should be properly placed and clearly explain why a particular line was added, modified, or removed. The commit message should describe the problem clearly and explain how the patch resolves it.&lt;/li&gt;
&lt;li&gt;Add regression tests that demonstrate the issue. Ideally, the tests should fail without the patch and pass with it. This helps prevent others from reintroducing the same bug later. In some cases, adding a test is not feasible, for example if the bug is related to abnormal memory consumption. Reviewers understand that. But if a test can be written, it should be included.&lt;/li&gt;
&lt;li&gt;Ensure the patch does not introduce regressions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2) A new feature. This is the most complex and time-consuming path, as it was with my OR-ANY Transformation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to prove that your feature does not break anything. That means extensive testing. You should demonstrate that it behaves correctly with complex plans, multiple JOINs, user-defined operators, and that it does not interfere with existing optimizations. Query results must remain identical with and without the patch.&lt;/li&gt;
&lt;li&gt;Be prepared for a long review cycle. Reviewers will deliberately try to break your patch and uncover edge cases you may have missed. That is a good thing. The more people look at your code, the lower the risk of post-commit issues. Even in our case, after the patch was committed, we still discovered an unpleasant bug that had slipped through.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3) Documentation. This is the simplest path. You spot an issue in the documentation and propose a fix. There is no code involved, so the review process is usually straightforward and relatively quick. Reviewers may suggest wording changes or clarifications, but those are minor compared to code reviews.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;— Alena, what would you advise developers who want to start reviewing patches?&lt;/strong&gt;&lt;br&gt;
— My advice is based entirely on personal experience:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dive deep into a specific area. I focused on my domain — the query optimizer — and looked for every available resource to understand it thoroughly.&lt;/li&gt;
&lt;li&gt;Just start. I knew I wasn’t 100% ready, but I still picked a patch that interested me and began digging into it: understanding the problem and evaluating the proposed solution.&lt;/li&gt;
&lt;li&gt;Start small. Documentation patches are often the easiest entry point. They help you understand how things are structured internally, and reviewing them is already a meaningful contribution.&lt;/li&gt;
&lt;li&gt;Don’t be afraid of small steps. Even if your feedback is just an email to the mailing list suggesting an improvement to a code comment, that is still a valid contribution.&lt;/li&gt;
&lt;li&gt;Be patient. A finished patch may wait some time before being committed. Committers can be busy with higher-priority issues, such as critical production bugs. That is a normal part of the process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most important thing is to pick a patch, sit down with it, understand what it does, and send an email with your thoughts.&lt;/p&gt;

&lt;p&gt;As Lao Tzu said, a journey of a thousand miles begins with a single step. The path to becoming a PostgreSQL committer begins with your first review.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>postgres</category>
      <category>postgresql</category>
      <category>development</category>
    </item>
    <item>
      <title>Message queues in Postgres Pro: ditching external brokers for true transactional reliability</title>
      <dc:creator>Dmitriy</dc:creator>
      <pubDate>Wed, 04 Mar 2026 15:22:04 +0000</pubDate>
      <link>https://forem.com/islamov/message-queues-in-postgres-pro-ditching-external-brokers-for-true-transactional-reliability-2bmm</link>
      <guid>https://forem.com/islamov/message-queues-in-postgres-pro-ditching-external-brokers-for-true-transactional-reliability-2bmm</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncjxv7zpvzi602jql34v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncjxv7zpvzi602jql34v.png" alt=" " width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the age of distributed systems — where every component must be not just fast, but predictable — the reliability of data exchange becomes mission‑critical. Picture this: a user clicks “Generate Report”, and instantly a dozen processes must fall into sync — from creating the document to emailing it. But what if the mail server is temporarily down? Or the task processor crashes mid‑operation?&lt;/p&gt;

&lt;p&gt;That’s exactly where message queues step in: they turn a chaotic storm of requests into a controlled stream, ensuring no task goes missing along the way.&lt;/p&gt;

&lt;p&gt;The story behind creating built‑in queues for PostgreSQL started with a familiar pain: external brokers like RabbitMQ or Kafka — while powerful — introduce complexity. They need dedicated servers, clusters, monitoring, backups… the whole zoo. In enterprise environments with thousands of deployments, every additional component increases operational risk and administrative load.&lt;/p&gt;

&lt;p&gt;So the question naturally arises: why bolt on a separate broker when the queue can live inside the database itself? Not only does this save time, it eliminates consistency headaches: if a transaction rolls back, the message rolls back with it — no extra code, no magic glue.&lt;/p&gt;
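
&lt;p&gt;In SQL terms, the promise looks like this sketch (the insert_message call form is an illustrative stand-in for the pgpro_queue API described later; the exact signature may differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEGIN;
INSERT INTO reports (user_id, kind) VALUES (42, 'monthly');
SELECT insert_message('report_tasks', '{"user_id": 42}');  -- illustrative call form
ROLLBACK;
-- The row and the queued message vanish together. With an external
-- broker, the message would already be on its way.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
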

&lt;h2&gt;
  
  
  Two approaches to queues: log‑based vs AMQP/JMS
&lt;/h2&gt;

&lt;p&gt;Distributed systems have converged on two main models for message processing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log‑based queues (e.g., Kafka). These operate as an append‑only event log. Data is written strictly sequentially, and consumers read it in the same order. Great for data synchronization between microservices or for database replication. But their biggest strength — linearity — becomes a weakness when flexibility is needed. You can’t easily pick messages by priority or filter them on the fly.&lt;/li&gt;
&lt;li&gt;AMQP/JMS brokers (e.g., RabbitMQ). These let you manage message lifecycles: priorities, filters, error handling. You can emulate asynchronous RPC and reliably retry failed operations. Their weakness is not the protocol — it’s the architecture: no external broker can guarantee full transactional consistency with a database.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The problems with external brokers: transactions &amp;amp; complexity
&lt;/h2&gt;

&lt;p&gt;Using Kafka or RabbitMQ in enterprise setups often leads to subtle but very real problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data mismatch&lt;/strong&gt;. Imagine this: the application sends a message to RabbitMQ and starts a DB transaction. The message is already in the broker, but the DB transaction fails on commit. Result? Inconsistency: the task executes, but its data isn’t saved. You can fix this with two‑phase commits (2PC), but that adds another coordinator, slows everything down, and complicates the architecture.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational overhead&lt;/strong&gt;. External brokers need their own installation, tuning, monitoring, backups. They add new failure points (e.g., network splits between the app and the broker). Support becomes harder — especially when different teams manage the DB and the broker.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Postgres Pro Enterprise Queues: transactional, native, automated
&lt;/h2&gt;

&lt;p&gt;The new &lt;a href="https://postgrespro.ru/docs/enterprise/17/pgpro-queue?lang=en" rel="noopener noreferrer"&gt;pgpro_queue&lt;/a&gt; extension is a direct response to developers tired of juggling consistency issues. By integrating queues inside the database, it removes the need for external components entirely.&lt;/p&gt;

&lt;p&gt;Messages are stored in regular tables, replicated via standard PostgreSQL mechanisms, and take part in the same transactions as your application logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation &amp;amp; setup
&lt;/h2&gt;

&lt;p&gt;Setting up &lt;a href="https://postgrespro.ru/docs/enterprise/17/pgpro-queue?lang=en" rel="noopener noreferrer"&gt;pgpro_queue&lt;/a&gt; is pleasantly straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add the library to shared_preload_libraries in postgresql.conf:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;shared_preload_libraries = 'pgpro_queue'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Create the extension:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE EXTENSION pgpro_queue;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Initialize internal objects:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT queue_initialize();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a dedicated schema, pgpro_queue_data, ensuring smooth pg_dump and replication.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key features &amp;amp; how they work
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Retry‑on‑rollback (automatic retries)&lt;/strong&gt;. This is pgpro_queue’s secret sauce. If a transaction that processed a message rolls back (for example, because an external service was unavailable), the message isn’t lost. It automatically returns to the queue.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Configuration&lt;/strong&gt;. Set retry policy per queue using CREATE_QUEUE with q_retries and q_retrydelay.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Important&lt;/strong&gt;. To enable retries, you must set your database name in postgresql.conf:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pgpro_queue.database_with_managed_retry = 'your_database_name'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without it, messages are deleted on rollback.&lt;/p&gt;
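
&lt;p&gt;As a sketch, creating a queue with a retry policy might look like this. The call form is illustrative: q_retries and q_retrydelay are the parameters named above, but check the pgpro_queue documentation for the exact CREATE_QUEUE signature:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT create_queue('report_tasks',       -- hypothetical queue name
                    q_retries =&amp;gt; 5,       -- redelivery attempts after rollback
                    q_retrydelay =&amp;gt; 60);  -- delay between attempts (unit assumed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
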

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Filtering &amp;amp; Priorities&lt;/strong&gt;. pgpro_queue lets you shape the message stream:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Priorities&lt;/strong&gt;. When inserting a message via INSERT_MESSAGE, you can set q_msg_priority. The lower the value, the higher the priority. Higher-priority messages are processed first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filtering&lt;/strong&gt;. READ_MESSAGE and READ_MESSAGE_XML accept q_msg_hfilter and q_msg_pfilter parameters, letting you fetch messages based on the contents of their headers or properties.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;JSON &amp;amp; XML support&lt;/strong&gt;. Separate APIs simplify integration with external systems:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;JSON: INSERT_MESSAGE, READ_MESSAGE;&lt;/li&gt;
&lt;li&gt;XML: INSERT_MESSAGE_XML and READ_MESSAGE_XML;&lt;/li&gt;
&lt;li&gt;Universal: READ_MESSAGE_ANY.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Delayed execution&lt;/strong&gt;. Use q_msg_enable_time to defer when a message becomes available.&lt;/li&gt;
&lt;/ol&gt;
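
&lt;p&gt;Putting these features together, producing and consuming a message might look like the following sketch. The function and parameter names come from the list above, but the exact signatures and the filter syntax are illustrative assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Enqueue a JSON message with a priority and a deferred availability time
SELECT insert_message('report_tasks',
                      '{"region": "EU", "report": "daily"}',
                      q_msg_priority =&amp;gt; 1,                             -- lower value = higher priority
                      q_msg_enable_time =&amp;gt; now() + interval '10 min'); -- delayed execution

-- Later, fetch only the messages whose properties match a filter
SELECT read_message('report_tasks', q_msg_pfilter =&amp;gt; 'region=EU');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
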

&lt;h2&gt;
  
  
  Under the hood
&lt;/h2&gt;

&lt;p&gt;pgpro_queue is built on a simple principle: messages live in normal PostgreSQL tables. This enables WAL logging, crash‑safe recovery, and streaming replication. Every message is written to WAL, ensuring recovery even after power loss. The crown jewel is the retry‑on‑rollback mechanism tightly integrated with PostgreSQL’s transaction engine. If processing fails, the message is transparently returned for another attempt — no custom logic needed.&lt;/p&gt;

&lt;p&gt;Another strength is its synergy with the built‑in Postgres Pro Enterprise scheduler. This internal cron‑like system can launch periodic jobs that themselves generate sub‑tasks and push them into queues. Example: generating thousands of regional reports. The scheduler creates regional tasks, which enqueue messages considering local time zones.&lt;/p&gt;

&lt;p&gt;External brokers like Kafka are indeed robust, but only in isolation. In real systems, the weak points are the integration layers: networks, configs, and transaction boundaries. With pgpro_queue, queues live in the same universe as your data. Messages survive crashes thanks to WAL and replication. Admins no longer need separate dashboards, everything is managed through the familiar PostgreSQL toolkit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Important technical notes
&lt;/h2&gt;

&lt;p&gt;To use pgpro_queue effectively, keep in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Isolation level&lt;/strong&gt;. Queue operations are supported only in transactions running at the READ COMMITTED isolation level.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prepared transactions (2PC)&lt;/strong&gt;. If a transaction with READ_MESSAGE() is prepared via PREPARE TRANSACTION, the message remains locked until COMMIT PREPARED or ROLLBACK PREPARED. Note: ROLLBACK PREPARED unlocks the message but does not trigger retries.&lt;/li&gt;
&lt;/ul&gt;
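
&lt;p&gt;The two-phase commit behavior can be shown with PostgreSQL's standard prepared-transaction commands (the read_message call form is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEGIN;
SELECT read_message('report_tasks');  -- the message is now locked
PREPARE TRANSACTION 'job_42';
-- The message stays locked until one of the following:
COMMIT PREPARED 'job_42';             -- the message is consumed
-- ROLLBACK PREPARED 'job_42';        -- the message is unlocked, but no retry is triggered
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
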

&lt;h2&gt;
  
  
  Roadmap: what's next for built‑in queues
&lt;/h2&gt;

&lt;p&gt;Upcoming features include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pub/Sub subscription system&lt;/strong&gt;. Similar to RabbitMQ Exchanges: producers will publish to topics, and all subscribers get a copy. This enables full event‑driven architectures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Callback notifications&lt;/strong&gt;. The database will be able to send an HTTP callback when a message arrives — no more polling loops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dead Letter Queue (DLQ)&lt;/strong&gt;. Planned for later releases: a safe place for poisoned messages.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;For enterprise systems, the main benefit is predictability. Every message is processed within database transactions, eliminating inconsistencies. Infrastructure becomes simpler: instead of a jungle of services, you get one cohesive environment where queues, business logic, and scheduling are all first‑class citizens.&lt;/p&gt;

&lt;p&gt;This approach is ideal for domains where errors are costly—banking, healthcare, government registries. If your application depends on atomic operations and low operational overhead, PostgreSQL’s built‑in queues aren’t just convenient — they’re essential. They don’t just deliver messages — they deliver order to the chaos of distributed systems.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>distributedsystems</category>
      <category>postgres</category>
    </item>
    <item>
      <title>A year-long hunt for a Linux kernel bug, or the unexpected zeros from XFS</title>
      <dc:creator>Dmitriy</dc:creator>
      <pubDate>Mon, 16 Feb 2026 14:57:02 +0000</pubDate>
      <link>https://forem.com/islamov/a-year-long-hunt-for-a-linux-kernel-bug-or-the-unexpected-zeros-from-xfs-2h3j</link>
      <guid>https://forem.com/islamov/a-year-long-hunt-for-a-linux-kernel-bug-or-the-unexpected-zeros-from-xfs-2h3j</guid>
      <description>&lt;p&gt;You’ve probably had this happen too: a service runs smoothly, keeps users happy with its stability and performance, and your monitoring stays reassuringly green. Then, the next moment — boom, it’s gone. You panic, dive into the error logs, and find either a vague segfault or nothing at all. What to do is unclear, and production needs saving, so you bring it back up — and everything works just like before. You try to investigate what happened, but over time you switch to other tasks, and the incident fades into the background or gets forgotten entirely.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fty56n50tariydb2rn2oq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fty56n50tariydb2rn2oq.png" alt=" " width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s all fine when you’re on your own. But once you have many customers, sooner or later you start feeling that something isn’t right, and that you need to dig into these spikes of entropy to find the root cause of such incidents.&lt;/p&gt;

&lt;p&gt;This article describes our year-long investigation. You’ll learn why PostgreSQL (and any other application) can crash because of a bug in the Linux kernel, what XFS has to do with it, and why memory reclamation might not be as helpful as you thought.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it started
&lt;/h2&gt;

&lt;p&gt;Of course we cannot tell you all the internal details because of strict NDAs, but trust us: the crashes that started this story went off with a bang.&lt;/p&gt;

&lt;p&gt;We would occasionally get reports from clients (and not only them): "It worked, then crashed for no visible reason. We brought it back up, it keeps working, but we want to understand what that was". Not mass reports, not frequent, just sometimes. Nobody really understood what broke, because there were no obvious reasons. It was fine until enough of these reports piled up and it became clear we had to solve this mystery.&lt;/p&gt;

&lt;p&gt;So what did we have at time zero after a typical report from not-so-happy customers?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clusters (and standalone installs) of different PostgreSQL versions, vanilla and not, crashing loudly. Some with segfaults, some with just a vanished process&lt;/li&gt;
&lt;li&gt;Servers running different Linux kernel versions&lt;/li&gt;
&lt;li&gt;A "zoo" of installed software with very different settings, fields, and flags, including kernel parameters&lt;/li&gt;
&lt;li&gt;Logs with errors about invalid memory addresses&lt;/li&gt;
&lt;li&gt;Attempts to jump into protected memory segments&lt;/li&gt;
&lt;li&gt;Attempts to execute illegal instructions&lt;/li&gt;
&lt;li&gt;Crashes at random moments without obvious causes you could systematize. You could almost start looking for correlations with moon phases or planetary alignment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There was one more important commonality: dump analysis and diagnostics pointed to anything except PostgreSQL. And while we are database developers, professional curiosity won, so we decided to go all the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking for clues
&lt;/h2&gt;

&lt;p&gt;So we had a classic heisenbug, impossible to catch or reproduce in an obvious way. The worst part: real production systems were affected, not lab environments. We had to do a deep, exhaustive analysis of everything, testing even the wildest hypotheses.&lt;/p&gt;

&lt;p&gt;The starting point was an observation that in some cases the system journal showed fragments of a write-protected code segment that contained only zeros instead of real instructions. This indirectly pointed to hardware or OS-level problems: we were reading what should be valid data from memory and getting unexpected zeros.&lt;/p&gt;

&lt;p&gt;As we chased this version, slowly but surely we found the full instruction sequence that led to the error. It helped that similar errors appeared in logs of other applications that also crashed on the same systems. They were just less critical than the database, so either nobody noticed the crashes or a nearby service auto-restarted them.&lt;/p&gt;

&lt;p&gt;On the other hand, the problem appeared on servers with very different configurations, so we shifted focus from hardware issues to OS behavior during crashes. That meant fun things like digging into memory management code and driver analysis. And it was not just hard; we ran into another factor that limited our analysis. Until that moment all diagnostics were collected at the user-space process level, and the bug did not reproduce frequently enough to quickly get fresh data for each new hypothesis. Also, collecting detailed kernel-level info is not enabled on every production system. Although, if you look at it through an infrastructure admin's eyes, these constraints are not that bad.&lt;/p&gt;

&lt;p&gt;Further investigation revealed another pattern: the higher the memory pressure, the higher the probability of the bug manifesting and crashing the system. For example, running several large memory tests in parallel increased the probability dramatically. With a partner who owns a large testing hardware pool, we were able to create the first scripts to reproduce the situation artificially.&lt;/p&gt;

&lt;h2&gt;
  
  
  So what actually happened?
&lt;/h2&gt;

&lt;p&gt;We will talk about concrete reproduction scenarios below, but first let's understand what happened. We confirmed that the root cause is reading incorrect zeroed data from memory pages that map files (mmap). It does not matter whether we read data or execute code. So the bug was indeed at the OS kernel level.&lt;/p&gt;

&lt;p&gt;Further research showed that three conditions must hold. The file mapped into memory must be on the XFS file system. Zeroing of memory pages on free must be enabled. And the memory subsystem must be under heavy pressure, so that the reclaim mechanism actively frees pages.&lt;/p&gt;

&lt;p&gt;mmap is a mechanism for mapping files into memory. Its story started in BSD systems in the 1980s, and it was implemented in the Linux kernel in the early 1990s, so this is core Linux functionality, not something new. That is why mmap is used heavily both in the kernel and in applications. Even if our application does not call it explicitly, it can still hit the bug, because the OS uses mmap to load executable code into memory. If executable code or a dynamic library (even plain libc) resides on the affected filesystem, then at the moment of reproduction the CPU can read zeros instead of real instructions, crashing the application.&lt;/p&gt;

&lt;p&gt;We tested different hypotheses and found the issue reproduces with different mapping parameters: Read-Only and Read-Write, Shared and Private. In short, every combination you could wish for.&lt;/p&gt;

&lt;p&gt;Reproduction also requires the mechanism that zeroes freed memory pages to be enabled. Normally, physical pages are cleared when they are allocated to an application; if they are merely freed, their content is preserved. Zeroing on free is required in systems with strict protection against data leakage, something regulators around the world love. Starting with kernel version 5.3, this behavior is controlled by the init_on_free parameter, whose default is set by the kernel build option CONFIG_INIT_ON_FREE_DEFAULT_ON. Pro tip: if you do not have bootloader access, you can confirm the mechanism is active by searching the logs for mem auto-init:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mem auto-init: stack:off, heap alloc:off, heap free:on
mem auto-init: clearing system memory may take some time…
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now about the XFS mentioned in the title and why this starts with the filesystem. It is not simple: we could not find a single definitive cause. The most likely reason the bug triggers when files are on XFS is the page folio, the successor to struct page. If you have not run into it, it is a memory management abstraction designed to simplify work with huge pages and compound pages. In the affected kernel versions it was not yet supported by other filesystems, and the commits below are tied to the page folio implementation.&lt;/p&gt;

&lt;p&gt;Finally, the last factor that leads to the crash is heavy memory subsystem load. When the system actively allocates free memory, cached pages can be evicted. We reproduced this state by allocating memory in small blocks until free space was exhausted. After that it does not matter whether the OOM Killer triggers or the process is simply stopped. This proved that the OOM Killer itself is not to blame.&lt;/p&gt;

&lt;p&gt;So yes, the bug is complex and OS behavior is unpredictable. Any application can crash when it tries to modify a protected memory area or execute code from a segment without execute permission.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our path to reproduction
&lt;/h2&gt;

&lt;p&gt;Because the bug was unstable, we decided to focus on one synthetic scenario that would test the maximum number of hypotheses with an unambiguous result. The test intentionally includes logic that leads to process termination due to memory exhaustion, but it is not guaranteed that the bug only appears under these conditions.&lt;/p&gt;

&lt;p&gt;Our test environment is Debian 11/12 with gcc and xfsprogs installed. The VM gets 4 CPU cores, 4 GB of RAM, and 20 GB of disk. The system partition is formatted as ext4. The bootloader is GRUB with the options init_on_free=1 and transparent_hugepage=never (to exclude the impact of THP).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~# grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet init_on_free=1 transparent_hugepage=never"
~# update-grub
&amp;lt;restart&amp;gt;
~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.1.114 root=UUID=56af771d-b0a2-45af-bc53-66f4ca792577 ro quiet init_on_free=1 transparent_hugepage=never
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the test we mount a 384 MB XFS test partition on top of a file on the system partition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~# dd if=/dev/zero of=xfs.file bs=1M count=384
~# mkfs.xfs -f xfs.file
~# mkdir xfs.mnt
~# mount -t xfs xfs.file xfs.mnt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We decided to reproduce the bug in two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Show that an application can get incorrect data from a memory-mapped file.&lt;/li&gt;
&lt;li&gt;Get a segfault in a correct application due to the CPU reading incorrect values from the code segment.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In both cases the trigger is running a test program in parallel that continuously allocates memory in 1 KB chunks until the OOM Killer kills it.&lt;/p&gt;

&lt;p&gt;The code is as straightforward as it gets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;include &amp;lt;stdlib.h&amp;gt;
include &amp;lt;stdio.h&amp;gt;
int main(int argc, char* argv[]) {
for (;;) {
if (malloc(1024) == NULL) {
printf("Could not allocate memory\n");
return 1;
}
}
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also need to map a file located on XFS into memory. We prepare a sample of 100 × 4096 bytes simply because that size is convenient:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~# dd if=/dev/zero bs=4096 count=100 | tr '\0' '\1' &amp;gt; xfs.mnt/test_file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here is a simple program that maps it into memory and, in an endless loop, reads the first byte of each 4 KB page, checking that it still equals 1.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;include &amp;lt;fcntl.h&amp;gt;
include &amp;lt;stdio.h&amp;gt;
include &amp;lt;stdlib.h&amp;gt;
include &amp;lt;sys/mman.h&amp;gt;
include &amp;lt;unistd.h&amp;gt;

int main(int argc, char* argv[]) {
char* filename="xfs.mnt/test_file";
int fd = open(filename, O_RDONLY);
if (fd == -1) {
printf("Could not open file %s\n", filename);
exit(EXIT_FAILURE);
}
char* map = mmap(NULL, 4096 * 100, PROT_READ, MAP_PRIVATE, fd, 0);
if (map == MAP_FAILED) {
close(fd);
printf("Could not map file content %s\n", filename);
exit(EXIT_FAILURE);
}

printf("Starting test...\n");
/* Test body */
for(;;) {
for (int i = 0; i &amp;lt; 100; i++) {
char c = map[i*4096];
if (c != 1) {
printf("Got invalid value on page %i = %i\n", i, c);
}
}
}

munmap(map, 4096 * 100);
close(fd);
return EXIT_SUCCESS;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we run all the programs until the test starts returning zeros instead of ones. That is the first goal achieved; the next task is to get a segfault. For that we write one more test that makes a flood of calls to "empty" functions and checks their return values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;include &amp;lt;stdio.h&amp;gt;
include &amp;lt;stdlib.h&amp;gt;

define FUNC_HERE(FUNC_NAME, RET_VAL) int FUNC_NAME() {\
return RET_VAL;\
}

define CHECK_FUNC(FUNC_NAME, RET_VAL) { \
int check_val = FUNC_NAME();\
if (check_val != RET_VAL) {\
printf("Erroneous value from " #FUNC_NAME ", expected %d, received %d\n", RET_VAL, check_val);\
exit(1);\
}\
}

FUNC_HERE(func0,   0)
FUNC_HERE(func1,   1)
FUNC_HERE(func2,   2)
FUNC_HERE(func3,   3)
FUNC_HERE(func4,   4)
FUNC_HERE(func5,   5)
FUNC_HERE(func6,   6)
FUNC_HERE(func7,   7)
FUNC_HERE(func8,   8)
FUNC_HERE(func9,   9)
FUNC_HERE(func10, 10)
FUNC_HERE(func11, 11)
FUNC_HERE(func12, 12)
FUNC_HERE(func13, 13)
FUNC_HERE(func14, 14)
FUNC_HERE(func15, 15)
FUNC_HERE(func16, 16)
FUNC_HERE(func17, 17)
FUNC_HERE(func18, 18)
FUNC_HERE(func19, 19)
FUNC_HERE(func20, 20)
FUNC_HERE(func21, 21)
FUNC_HERE(func22, 22)
FUNC_HERE(func23, 23)
FUNC_HERE(func24, 24)
FUNC_HERE(func25, 25)
FUNC_HERE(func26, 26)
FUNC_HERE(func27, 27)
FUNC_HERE(func28, 28)
FUNC_HERE(func29, 29)

int main(int argc, char* argv[]) {
    printf("Program start\n");

    for (;;) {
        // Check functions
        CHECK_FUNC(func0,   0)
        CHECK_FUNC(func1,   1)
        CHECK_FUNC(func2,   2)
        CHECK_FUNC(func3,   3)
        CHECK_FUNC(func4,   4)
        CHECK_FUNC(func5,   5)
        CHECK_FUNC(func6,   6)
        CHECK_FUNC(func7,   7)
        CHECK_FUNC(func8,   8)
        CHECK_FUNC(func9,   9)
        CHECK_FUNC(func10, 10)
        CHECK_FUNC(func11, 11)
        CHECK_FUNC(func12, 12)
        CHECK_FUNC(func13, 13)
        CHECK_FUNC(func14, 14)
        CHECK_FUNC(func15, 15)
        CHECK_FUNC(func16, 16)
        CHECK_FUNC(func17, 17)
        CHECK_FUNC(func18, 18)
        CHECK_FUNC(func19, 19)
        CHECK_FUNC(func20, 20)
        CHECK_FUNC(func21, 21)
        CHECK_FUNC(func22, 22)
        CHECK_FUNC(func23, 23)
        CHECK_FUNC(func24, 24)
        CHECK_FUNC(func25, 25)
        CHECK_FUNC(func26, 26)
        CHECK_FUNC(func27, 27)
        CHECK_FUNC(func28, 28)
        CHECK_FUNC(func29, 29)
    }

    printf("Program end\n");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The program is compiled with &lt;code&gt;-fno-inline -falign-functions=4096&lt;/code&gt; so that each function is aligned to a page boundary and sits on its own memory page. The resulting binary is placed on the XFS volume alongside the source files. If our theory is right, after a few runs we get a segfault, and the system journal shows an entry like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~# dmesg
...
[ 1160.680986] test2[590]: segfault at 0 ip 00005645b312e000 sp 00007ffdb8a95758 error 6 in test2[5645b312b000+21000] likely on CPU 1 (core 1, socket 0)
[ 1160.681007] Code: Unable to access opcode bytes at 0x5645b312dfd6.
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
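For reference, a build along these lines should reproduce the layout described above (all file and symbol names here are illustrative, not taken from the original experiment):

```shell
# Hypothetical mini-demo: -falign-functions=4096 puts every function at the
# start of its own 4 KiB page, just like the test binary described above.
printf 'int func0(void){return 0;}\nint func1(void){return 1;}\nint main(void){return func0()+func1();}\n' \
  > align_demo.c
gcc -fno-inline -falign-functions=4096 -o align_demo align_demo.c
# Every listed function address should be page-aligned (end in 000):
nm align_demo | grep ' T func'
```

With the alignment confirmed, the binary can be copied to the XFS mount and run in a loop next to the memory-hog program.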



&lt;p&gt;If everything is done correctly, the result arrives quickly and you can go drink champagne (or reach for sleeping pills after what you have seen). You have just confirmed a Linux kernel bug that leads to data loss.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your tests are nice, but how do I diagnose my system?
&lt;/h2&gt;

&lt;p&gt;A very fair question! Because the problem can appear anywhere in code, the symptoms can be almost anything, and deep memory-dump analysis is not available to every engineer. And, frankly, not everyone has sufficient logging enabled. But don't despair: there is a set of typical symptoms that can make life easier.&lt;/p&gt;

&lt;p&gt;First, dmesg will likely show a clear segfault:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~# dmesg
...
[ 1160.680986] test2[590]: segfault at 0 ip 00005645b312e000 sp 00007ffdb8a95758 error 6 in test2[5645b312b000+21000] likely on CPU 1 (core 1, socket 0)
[ 1160.681007] Code: Unable to access opcode bytes at 0x5645b312dfd6.
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In some cases the error may not be a segmentation error but an attempt to execute an invalid instruction — invalid opcode.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;traps: postgres[1923471] trap invalid opcode ip:7f71444c0003 sp:7ffcb3ec4320 error:0 in liblz4.so.1.9.3[7f71444b7000+1b000]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another characteristic sign is zero values in the code segment after a segfault:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[&amp;lt;timestamp&amp;gt;] postgres[24237]: segfault at cb1 ip 00005579a1505130 sp 00007ffe4f32d008 error 6 in postgres[5579a1170000+559000] likely on CPU 49 (core 17, socket 1)
[&amp;lt;date&amp;gt;] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 &amp;lt;00&amp;gt; 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember, though, that it does not always manifest this way: the process exit code can look perfectly normal. So do not jump to rash conclusions.&lt;/p&gt;

&lt;h2&gt;
  
  
  How bad is this globally?
&lt;/h2&gt;

&lt;p&gt;Our checks showed that the bug exists starting from Linux kernel 5.18 and stops reproducing from 6.9. That hints at a fairly wide spread.&lt;/p&gt;

&lt;p&gt;The point of breakage led us to this &lt;a href="https://github.com/torvalds/linux/commit/56a4d67c264e37014b8392cba9869c7fe904ed1e" rel="noopener noreferrer"&gt;commit&lt;/a&gt; ("mm/readahead: Switch to page_cache_ra_order"), and the fix arrived around this &lt;a href="https://lore.kernel.org/all/20240227174254.710559-12-willy@infradead.org/T/" rel="noopener noreferrer"&gt;patch discussion&lt;/a&gt; for 6.9, specifically with this &lt;a href="https://github.com/torvalds/linux/commit/bc2ff4cbc3294c01f29449405c42ee26ee0e1f59" rel="noopener noreferrer"&gt;commit&lt;/a&gt; ("mm: free folios in a batch in shrink_folio_list()").&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Let's wrap up with a classic self-checklist and a few useful tips to sleep a little calmer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Affected kernels are Linux 5.18 to 6.8&lt;/li&gt;
&lt;li&gt;The issue manifests if init_on_free is enabled in the kernel&lt;/li&gt;
&lt;li&gt;The issue manifests if the filesystem is XFS&lt;/li&gt;
&lt;/ul&gt;
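The three bullets above can be checked in a minute; a rough triage script (assuming standard Linux paths, and entirely non-destructive) might look like this:

```shell
# Rough self-check for the three conditions above.
uname -r                                  # affected range: 5.18 up to 6.8
grep -o 'init_on_free=[01]' /proc/cmdline || echo 'init_on_free not set on the kernel command line'
mount -t xfs                              # prints mounted XFS volumes, if any
```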

&lt;p&gt;What if you have all three? The corporate answer is "contact your OS vendor," because realistically only they can help you. It is possible your vendor already fixed it in your specific kernel and you do not need to do anything.&lt;/p&gt;

&lt;p&gt;If you are not so lucky or you are waiting for a response or a patch, one option is to migrate to another filesystem (hello ext4), or set init_on_free=0 in Linux boot arguments. But of course, do it at your own risk.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>postgressql</category>
      <category>linux</category>
      <category>bug</category>
    </item>
    <item>
      <title>File handling in PostgreSQL: barriers and ways around them</title>
      <dc:creator>Dmitriy</dc:creator>
      <pubDate>Tue, 23 Dec 2025 10:30:07 +0000</pubDate>
      <link>https://forem.com/islamov/file-handling-in-postgresql-barriers-and-ways-around-them-1g1n</link>
      <guid>https://forem.com/islamov/file-handling-in-postgresql-barriers-and-ways-around-them-1g1n</guid>
      <description>&lt;p&gt;In an ideal world, you'd store files separately from your database. Keep the DB lean, stick files in object storage or a file server, and call it a day. But reality? Not so simple.&lt;/p&gt;

&lt;p&gt;Some developers find it easier to work with files directly in the database. Others need to meet strict compliance requirements around storing original documents and digital signatures. And some teams just don't want the hassle of maintaining yet another storage system. So despite best practices, many teams end up storing files in their DBMS, but often without fully understanding the implications.&lt;/p&gt;

&lt;h2&gt;
  
  
  File storage options in PostgreSQL
&lt;/h2&gt;

&lt;p&gt;PostgreSQL gives you a few ways to store files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;regular tables;&lt;/li&gt;
&lt;li&gt;pg_largeobject — a catalog for large objects.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's dig into each.&lt;/p&gt;

&lt;h2&gt;
  
  
  Regular tables
&lt;/h2&gt;

&lt;p&gt;Standard PostgreSQL tables can handle text, bytea, and JSONB data types. There's also a basic file API with functions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pg_ls_dir — list files and directories;&lt;/li&gt;
&lt;li&gt;pg_stat_file — get file metadata (size, modification time, permissions, etc.);&lt;/li&gt;
&lt;li&gt;pg_read_file — read a text file and return its contents as a string;&lt;/li&gt;
&lt;li&gt;pg_read_binary_file — read a file (including binary data) and return bytea format.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note: You can find the full function list &lt;a href="https://postgrespro.ru/docs/enterprise/16/functions-admin?lang=en" rel="noopener noreferrer"&gt;here&lt;/a&gt;, but these aren't meant for user data management.&lt;/p&gt;
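For illustration only, a superuser (or a member of the pg_read_server_files role) could call these functions roughly like this; the file names are examples, and relative paths are resolved against the data directory:

```sql
-- List entries in the data directory
SELECT * FROM pg_ls_dir('.');

-- File metadata: size, timestamps, is-directory flag
SELECT size, modification FROM pg_stat_file('postgresql.conf');

-- Read file contents: offset 0, first 100 bytes, returned as text
SELECT pg_read_file('postgresql.conf', 0, 100);

-- Same idea, but returned as bytea (works for binary files)
SELECT pg_read_binary_file('PG_VERSION');
```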

&lt;p&gt;That said, regular tables have two big limitations when it comes to files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First limitation: the 1 GB wall&lt;/strong&gt;&lt;br&gt;
By default, regular tables can't handle files (or any data) larger than 1 GB. Try to insert something bigger, and you'll hit an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create table user_data (
    id bigint,
    data_bytea bytea
);

# insert into user_data (id, data_bytea)
  values (8, pg_read_binary_file('/tmp/file.1gb.bytea'));
ERROR: file length too large
Time: 669.132 ms

# select pg_read_binary_file('/tmp/file.1gb.bytea');
ERROR: file length too large
Time: 666.684 ms

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can work around this by chunking your data. Split a 5 GB file into 100 MB pieces and store each as a separate row. This bypasses the field size limit, but it complicates your application architecture — you'll need extra tables to track and reassemble the fragments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;insert into user_data (id, data_bytea)
values (81, pg_read_binary_file('/tmp/file.1gb.bytea', 0, 1024*1024*100));

insert into user_data (id, data_bytea)
values (82, pg_read_binary_file('/tmp/file.1gb.bytea', (1024*1024*100)+1, 1024*1024*100));

insert into user_data (id, data_bytea)
values (83, pg_read_binary_file('/tmp/file.1gb.bytea', (1024*1024*100)+1, 1024*1024*100));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
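To get the original file back, the fragments have to be concatenated in order. A reassembly sketch using the bytea variant of string_agg (ids 81..83 from the inserts above) could look like this:

```sql
-- Reassemble the three 100 MB fragments in id order
SELECT string_agg(data_bytea, ''::bytea ORDER BY id) AS whole_file
FROM user_data
WHERE id IN (81, 82, 83);
```

In practice, the concatenation often happens in application code instead, streaming one chunk at a time to avoid building a multi-gigabyte value in a single query.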



&lt;p&gt;&lt;strong&gt;Second limitation: inconsistent insert performance&lt;/strong&gt;&lt;br&gt;
The other issue? Insert speeds can become wildly unpredictable. Your storage might be humming along fine, transactions cruising through, and then suddenly — parallel inserts start crawling or hanging completely.&lt;/p&gt;

&lt;p&gt;Let's simulate this: fire off multiple inserts of heavy data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;insert into user_data values (777, repeat('b', 10000)::bytea);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To understand what's happening, check pg_stat_activity to see what active processes are waiting for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT pid,
       extract(epoch from now() - state_change) as sec_state_change,
       state_change,
       wait_event,
       wait_event_type,
       state,
       query
FROM pg_stat_activity
WHERE query LIKE 'insert into user_data%'
  AND state = 'active';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pid     | sec_state_change | state_change           | wait_event     | wait_event_type | state  | query
--------+------------------+------------------------+----------------+-----------------+--------+-------
1013836 | 196.148285       | 17:58:03.280158+03     | ClientRead     | Client          | idle   | SHOW TRANSACTION ISOLATION LEVEL
1013866 | 403.011387       | 17:54:36.417056+03     | BufferIO       | IPC             | active | insert into user_data (id,data) values (777...
1013867 | 403.011375       | 17:54:36.417068+03     |                |                 | active | insert into user_data (id,data) values (777...
1013868 | 403.011407       | 17:54:36.417036+03     | BufferIO       | IPC             | active | insert into user_data (id,data) values (777...
1013869 | 403.011383       | 17:54:36.41706+03      | BufferIO       | IPC             | active | insert into user_data (id,data) values (777...
1013870 | 403.011375       | 17:54:36.417068+03     |                |                 | active | insert into user_data (id,data) values (777...
1013871 | 403.011384       | 17:54:36.417059+03     | BufferIO       | IPC             | active | insert into user_data (id,data) values (777...
1013872 | 403.011389       | 17:54:36.417054+03     | BufferIO       | IPC             | active | insert into user_data (id,data) values (777...
1013873 | 403.011077       | 17:54:36.417366+03     | BufferMapping  | LWLock          | active | insert into user_data (id,data) values (777...
1013874 | 403.011070       | 17:54:36.417373+03     |                |                 | active | insert into user_data (id,data) values (777...
1013875 | 403.011138       | 17:54:36.417305+03     |                |                 | active | insert into user_data (id,data) values (777...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see mass wait events for BufferMapping and BufferIO.&lt;/p&gt;

&lt;p&gt;Here's what's happening: when inserting large values, PostgreSQL splits data into TOAST chunks on the fly, compresses them, and hunts for free space to write them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;BufferMapping means processes are competing to allocate new buffer cache pages for this data.&lt;/li&gt;
&lt;li&gt;BufferIO indicates the system is bottlenecked on disk I/O performance while writing those pages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is all due to complex processes running under the hood when writing large objects. You can read about the intricacies of large value storage in the official docs and detailed articles. The key takeaway: under high load, this mechanism becomes a bottleneck.&lt;/p&gt;

&lt;p&gt;For each user table, PostgreSQL creates just one TOAST table, regardless of how many columns contain large data. Values from all columns end up in this single storage, which has hard limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maximum size: 32 TB.&lt;/li&gt;
&lt;li&gt;Maximum number of rows (chunks): ~4.2 billion (int4 limitation for chunk ID).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Important: The 4.2 billion record limit is the total for all columns combined, not per column.&lt;/p&gt;

&lt;p&gt;TOAST supports four storage strategies for columns on disk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PLAIN — for fixed-length values like integers;&lt;/li&gt;
&lt;li&gt;EXTERNAL — for external storage of uncompressed data;&lt;/li&gt;
&lt;li&gt;EXTENDED — for external storage of compressed data;&lt;/li&gt;
&lt;li&gt;MAIN — for inline storage of compressed data.&lt;/li&gt;
&lt;/ul&gt;
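The strategy is a per-column setting; a hypothetical tuning session on the user_data table from earlier could look like this:

```sql
-- EXTERNAL skips compression: sensible for data that is already compressed
-- (images, video, archives), and it speeds up substring access.
ALTER TABLE user_data ALTER COLUMN data_bytea SET STORAGE EXTERNAL;

-- Inspect current strategies: attstorage is p/e/x/m
-- for PLAIN/EXTERNAL/EXTENDED/MAIN respectively.
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'user_data'::regclass AND attnum > 0;
```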

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The TOAST mechanism kicks in when a row value exceeds TOAST_TUPLE_THRESHOLD (usually 2KB). TOAST will compress and/or move field values outside the table until the row value drops below 2KB or can't be compressed further.&lt;/p&gt;

&lt;p&gt;So the distribution works like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If data fits in the original table (under 2 KB), write it directly to the column.&lt;/li&gt;
&lt;li&gt;If data exceeds 2 KB, store it in the TOAST table and put a reference in the main table.&lt;/li&gt;
&lt;/ul&gt;
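You can observe this distribution for yourself; assuming the user_data table from earlier, a quick inspection might be:

```sql
-- Which TOAST table backs user_data?
SELECT reltoastrelid::regclass AS toast_table
FROM pg_class
WHERE relname = 'user_data';

-- Stored (possibly compressed) size vs. logical size of each value:
SELECT id,
       pg_column_size(data_bytea) AS stored_bytes,
       octet_length(data_bytea)   AS logical_bytes
FROM user_data
LIMIT 3;
```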

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2nhw9m38nzua03y8j9e.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2nhw9m38nzua03y8j9e.jpg" alt=" " width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If your system actively works with files, this limit can sneak up on you—and once you hit it, inserting new data becomes impossible.&lt;/p&gt;

&lt;p&gt;You can overcome this barrier through partitioning. Each partition gets its own TOAST table, pushing back the overall limit. But this requires significant design and maintenance effort from your DBA.&lt;/p&gt;
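A minimal sketch of that approach, assuming the same user_data shape (table names here are hypothetical): hash partitioning spreads rows, and therefore TOAST chunks, across partitions, each with its own TOAST table and its own ~4.2 billion chunk budget.

```sql
CREATE TABLE user_data_part (
    id         bigint,
    data_bytea bytea
) PARTITION BY HASH (id);

-- Each partition gets its own TOAST table:
CREATE TABLE user_data_part_0 PARTITION OF user_data_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE user_data_part_1 PARTITION OF user_data_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE user_data_part_2 PARTITION OF user_data_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE user_data_part_3 PARTITION OF user_data_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);
```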

&lt;p&gt;&lt;strong&gt;Large object storage&lt;/strong&gt;&lt;br&gt;
PostgreSQL also provides the pg_largeobject system table for storing large objects. Each file is identified by an OID assigned at creation.&lt;/p&gt;

&lt;p&gt;Note: Since pg_largeobject is a system table, it can't be partitioned.&lt;/p&gt;

&lt;p&gt;The API for pg_largeobject resembles standard file operations: create, read, modify, and delete objects. Like TOAST, there are limits: maximum object size is 32 TB, and you can store about 4.2 billion objects (rows).&lt;/p&gt;

&lt;p&gt;Let's see how this works. Important: the actual file data isn't stored in your user table — only the reference (OID).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- 1. Create a table to store only the file identifier (OID)
CREATE TABLE lo (
    id bigint PRIMARY KEY,
    lo_oid oid
);

-- 2. Load a file into storage and record the OID in our table
-- (Assumes '/tmp/file.2gb.bytea' exists on the server)
INSERT INTO lo (id, lo_oid) VALUES (1, lo_import('/tmp/file.2gb.bytea'));

-- 3. Check data in the user table
SELECT * FROM lo;
-- Result:
-- id | lo_oid
-- ---+------------
-- 1  | 2032353971 &amp;lt;-- This is the reference to the object

-- 4. Check the pg_largeobject system table where data is physically stored (split into pages)
SELECT loid, pageno, data FROM pg_largeobject WHERE loid = 2032353971 LIMIT 2;
-- Result:
-- loid       | pageno | data
-- -----------+--------+-------
-- 2032353971 | 0      | \x58546f...
-- 2032353971 | 1      | \x81234b...

-- 5. Deleting an object
-- IMPORTANT: Deleting a row from the 'lo' table DOES NOT delete the file from pg_largeobject.
-- You must explicitly delete it using lo_unlink:
SELECT lo_unlink(2032353971);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, this approach has quirks. With intensive inserts, you might encounter OID generation contention or exhaust the OID limit. In monitoring, this shows up as OidGen lock waits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select pid, wait_event, wait_event_type, state, left(query,15) 
from pg_stat_activity 
where backend_type = 'client backend';

pid    | wait_event    | wait_event_type | state  | left
-------+---------------+-----------------+--------+-----------------
977260 | ClientRead    | Client          | idle   | SHOW TRANSACTIO
977309 |               |                 | active | INSERT INTO lo
977310 | OidGen        | LWLock          | active | INSERT INTO lo
977311 |               |                 | active | INSERT INTO lo
977312 |               |                 | active | INSERT INTO lo
977313 |               |                 | active | INSERT INTO lo
977314 | BufferIO      | IPC             | active | INSERT INTO lo
977315 |               |                 | active | INSERT INTO lo
977316 | BufferMapping | LWLock          | active | INSERT INTO lo
977317 |               |                 | active | INSERT INTO lo
977318 | BufferIO      | IPC             | active | INSERT INTO lo
977426 |               |                 | active | select pid, wai
(12 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So with pg_largeobject:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can store up to 32 TB and 4.2 billion objects.&lt;/li&gt;
&lt;li&gt;No partitioning support.&lt;/li&gt;
&lt;li&gt;Deleting a row (OID) from your user table leaves data in pg_largeobject untouched.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are also significant operational downsides. Migrating objects to another storage system is cumbersome due to the lack of built-in convenient mechanisms. All this makes pg_largeobject less than ideal for fully-featured large file handling in high-load systems.&lt;/p&gt;

&lt;p&gt;Note: Besides these options, you could roll your own implementation in PostgreSQL for large object storage, but custom solutions don't guarantee full compatibility or reliable operation with the RDBMS.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Postgres Pro Enterprise brings to the table
&lt;/h2&gt;

&lt;p&gt;Postgres Pro Enterprise (starting from version 16) includes specialized mechanisms that lift the standard PostgreSQL limitations and ease migration from Oracle. For database file storage, you can use three extensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;pgpro_sfile&lt;/strong&gt; — extension for storing large objects inside the database;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pgpro_bfile&lt;/strong&gt; — extension for working with files stored on disk;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dbms_lob&lt;/strong&gt; — an Oracle DBMS_LOB package analogue that uses pgpro_sfile and pgpro_bfile.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's explore each one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pgpro_sfile module
&lt;/h2&gt;

&lt;p&gt;The pgpro_sfile module lets you store numerous large objects (called sfiles in the extension's terminology) and provides functionality similar to Oracle's BLOB.&lt;/p&gt;

&lt;p&gt;Key features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Removing limits&lt;/strong&gt;. Total storage volume and object count are limited only by disk space and the bigint type (2^63 – 1). This solves the 32 TB and 4 billion object problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage structure&lt;/strong&gt;. Data is stored in a set of special tables (in the pgpro_sfile_data schema), managed by the module. Each object is split into blocks, and blocks into pages of ~8KB. User tables only store sfile object identifiers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt;. Supports parallel read and write operations. For example, you can write to one large object from multiple sessions simultaneously, as different blocks can be written to different storage partitions in parallel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Management&lt;/strong&gt;. Implements functions for creating (sf_create), reading (sf_read), writing (sf_write), trimming (sf_trim), and deleting (sf_delete) objects.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Detailed function and architecture descriptions are available in the &lt;a href="https://postgrespro.ru/docs/enterprise/current/pgpro-sfile?lang=en" rel="noopener noreferrer"&gt;module documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pgpro_bfile module
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://postgrespro.ru/docs/enterprise/17/pgpro-bfile?lang=en" rel="noopener noreferrer"&gt;pgpro_bfile&lt;/a&gt; is a Postgres Pro Enterprise extension designed for working with external files stored on the server's disk. The module provides capabilities for managing references to files outside the database, similar to Oracle's BFILE type.&lt;/p&gt;

&lt;p&gt;The bfile type contains two fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;dir_id&lt;/strong&gt; — directory ID in the bfile_directories table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;file_name&lt;/strong&gt; — external file name.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;. Access is through directory objects. An admin creates a "directory" in the DB (bfile_directory_create), linked to a filesystem path, and grants read/write permissions to specific users (bfile_grant_directory).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data access&lt;/strong&gt;. The module provides functions for reading (bfile_read), writing (bfile_write), copying, and retrieving file properties.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Virtualization&lt;/strong&gt;. File access goes through Postgres virtual file functions, so standard RDBMS restrictions and settings apply.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, the pgpro_bfile module is an optimal solution for working with files of any size on external file storage when you don't need to store the data inside the database.&lt;/p&gt;

&lt;h2&gt;
  
  
  The dbms_lob module
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://postgrespro.ru/docs/enterprise/17/dbms-lob?lang=en" rel="noopener noreferrer"&gt;dbms_lob&lt;/a&gt; is an extension that emulates the Oracle DBMS_LOB package interface. It serves as a "wrapper" over pgpro_sfile and pgpro_bfile and is critical for migration projects, allowing application code to be ported with minimal changes.&lt;/p&gt;

&lt;p&gt;The module operates with these data types:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLOB&lt;/strong&gt; — type for storing large volumes of text data (like long documents, XML/JSON files). It's analogous to Oracle CLOB and stores data as the standard text type.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BLOB&lt;/strong&gt; — type for storing binary data (images, videos, PDFs) in pgpro_sfile extension tables. It replaces Oracle BLOB and allows storing files of virtually unlimited size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BFILE&lt;/strong&gt; — data type for referencing a file in the server's filesystem, provided by the pgpro_bfile extension. It's analogous to Oracle's BFILE.&lt;/p&gt;

&lt;p&gt;dbms_lob provides an extensive API for interacting with large objects, including functions for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Creation and conversion&lt;/strong&gt;: empty_blob(), empty_clob(), to_blob(), to_clob().&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opening and closing&lt;/strong&gt;: open(), close(), isopen().&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reading&lt;/strong&gt;: getlength(), read(), substr(), instr().&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modification&lt;/strong&gt;: write(), writeappend(), erase(), trim().&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-LOB operations&lt;/strong&gt;: compare(), append(), copy().&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the dbms_lob module for Postgres Pro Enterprise not only fully replicates Oracle's DBMS_LOB functionality but also allows storing data both inside and outside the database. This makes it the best solution when migrating applications from Oracle.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;Storing files in a database is often considered suboptimal: it increases DB load and complicates operations. But when business requirements demand it, PostgreSQL and Postgres Pro Enterprise users have solid tools at their disposal.&lt;/p&gt;

&lt;p&gt;While standard PostgreSQL mechanisms (TOAST, Large Objects) impose limits on volume (32 TB) and object count, specialized Postgres Pro Enterprise modules (pgpro_sfile, dbms_lob) let you break through these barriers, enabling petabyte-scale data storage and providing convenient interfaces for development and Oracle migration.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>postgres</category>
      <category>programming</category>
      <category>postgressql</category>
    </item>
    <item>
      <title>Breaking data for fun</title>
      <dc:creator>Dmitriy</dc:creator>
      <pubDate>Tue, 16 Dec 2025 10:50:48 +0000</pubDate>
      <link>https://forem.com/islamov/breaking-data-for-fun-3a2</link>
      <guid>https://forem.com/islamov/breaking-data-for-fun-3a2</guid>
      <description>&lt;p&gt;Every day, thousands of engineers do their best to protect data quality and integrity: making sure every transaction stays intact. But sometimes you need the exact opposite — to deliberately break data or even invent it from scratch. And that’s not so simple either.&lt;/p&gt;

&lt;p&gt;In this article, I’ll explain what fake (synthetic, test — call them what you like) data is, why you might need it, how to generate it, and how to use it. The article is based on my talk at PGConf.Russia.&lt;/p&gt;

&lt;h2&gt;
  
  
  A real-life case
&lt;/h2&gt;

&lt;p&gt;Imagine you’re building a new information system. To test basic functionality — like registration and login — you need test data: usernames, full names, emails, phone numbers, and so on. The system will check them for validity and uniqueness: is the username free, is the email correct, does the address match the format.&lt;/p&gt;

&lt;p&gt;Using real data for this is a bad idea for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Legality. In many countries it’s prohibited or strictly limited by data protection laws (GDPR in Europe, Federal Law 152-FZ in Russia).&lt;/li&gt;
&lt;li&gt;Security. Test environments are usually less protected than production. Using real data risks leaks — sensitive information could end up in insecure environments or in the wrong hands.&lt;/li&gt;
&lt;li&gt;Scalability. You might simply not have enough real data. What if you need tens of thousands of accounts with different roles and privileges to test complex scenarios?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The simplest approach is to use placeholders like “Ivan Ivanov”, “Vasya Pupkin”, “John Doe”. But that list and your patience will run out fast, leaving you typing random nonsense. And if you need to test load with thousands of users, your testbed turns into a junkyard of meaningless, invalid records. Many of us have seen this sad picture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qznwxfjiyog7t2jya9a.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qznwxfjiyog7t2jya9a.jpg" alt=" " width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not only does it look ugly, it creates real problems as data grows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validation. Apps and databases check login lengths, email and phone formats. Random junk usually fails those checks.&lt;/li&gt;
&lt;li&gt;Uniqueness. Logins, emails, internal IDs must be unique within a table or group (department, company).&lt;/li&gt;
&lt;li&gt;Business logic. Data should reflect real business scenarios. An HR employee must have an “HR” role, an order can’t be delivered before it’s created.&lt;/li&gt;
&lt;/ul&gt;
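&lt;p&gt;These checks are straightforward to express in code. Here is a minimal Python sketch (the field names, length limits, and simplified email pattern are illustrative assumptions, not rules from any particular system):&lt;/p&gt;

```python
import re

# Deliberately simplified email pattern -- illustrative only
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

def validate(record, seen_logins):
    """Return a list of problems found in a generated record."""
    problems = []
    if not (3 <= len(record["login"]) <= 32):   # length check
        problems.append("login length out of range")
    if record["login"] in seen_logins:          # uniqueness check
        problems.append("duplicate login")
    if not EMAIL_RE.match(record["email"]):     # format check
        problems.append("invalid email format")
    seen_logins.add(record["login"])
    return problems

seen = set()
print(validate({"login": "ivanov", "email": "ivanov@example.com"}, seen))  # []
print(validate({"login": "ivanov", "email": "not-an-email"}, seen))
# ['duplicate login', 'invalid email format']
```

&lt;p&gt;A real system would layer many more rules on top, but even this much catches the junk records described above.&lt;/p&gt;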

&lt;p&gt;Luckily, there are plenty of ways to generate high-quality test data that not only looks real but also respects all your system’s rules. Ideally, your data should look something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzhgjmj0zcbctze88hdv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzhgjmj0zcbctze88hdv.jpg" alt=" " width="800" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then you get the same fields (name, position, address) — but looking far more solid and solving the issues above:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Safe. Fully synthetic: names, addresses, positions are made up, so no risk even if leaked.&lt;/li&gt;
&lt;li&gt;Plausible. Looks real, which is critical for testing UI/UX and logic.&lt;/li&gt;
&lt;li&gt;Valid. Ready for checks from simple email format to complex business rules.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s go through how to achieve this and what pitfalls to avoid.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why “fake” data matters
&lt;/h2&gt;

&lt;p&gt;Synthetic data has far more uses than you might think. Key scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Functional testing. Generate endless cases — valid and edge — to see how the system reacts. Will all triggers fire? Will it crash on invalid formats? Fake data is essential here.&lt;/li&gt;
&lt;li&gt;Load testing. What happens when user numbers jump 100x? To find out, you need data that simulates that load. Generation lets you create datasets bigger than production and find bottlenecks before real users do.&lt;/li&gt;
&lt;li&gt;Demos and pilots. Showing a prototype with a client’s real data is a big no-no (and often a breach of NDA). With synthetic data — go ahead: looks good, safe, and won’t expose anyone’s account.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can hand such data to contractors, use it in conference talks, or share it with partners — no damage if it leaks. It’s also perfect for screenshots in docs and articles, without masking tricks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four ways to get “fake” data
&lt;/h2&gt;

&lt;p&gt;Depending on your goals and existing data, you can use one of four methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Masking&lt;/strong&gt;. Ideal for creating a testbed from a real production DB. Personal data is anonymized — values replaced or encrypted. Key advantage: structure, volume, statistical properties, and relational integrity are preserved. You get a full prod copy, risk-free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation from scratch&lt;/strong&gt;. Best for a clean setup — just the schema, no data. Data is generated artificially, based on formats, validation rules, and size requirements. Great for predictable datasets tailored to specific tests or initializing new projects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extrapolation&lt;/strong&gt;. Useful when you have historical data and want to forecast future loads. You can “grow” existing data to simulate scenarios like doubling user numbers or a year of system growth. This helps reveal future bottlenecks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Combined approach&lt;/strong&gt;. The most practical real-world scenario. Typically you’ll combine methods: masking for user data, extrapolation for transactions to simulate load, generation for new features not yet in prod.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Tools: from scripts to data factories
&lt;/h2&gt;

&lt;p&gt;To tackle data masking and generation, there’s a wide range of tools out there. Broadly speaking, they fall into three categories, depending on how complex and comprehensive they are.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;DIY scripts&lt;/strong&gt;. You can always roll your own with SQL or any general-purpose language, using libraries like the classic &lt;a href="https://github.com/joke2k/faker" rel="noopener noreferrer"&gt;Python Faker&lt;/a&gt; or its PostgreSQL-specific wrappers such as pg_faker. The upside: you have full control and it’s easy to get started. The downside: maintaining these scripts quickly becomes a pain if your schema changes often. That’s why this approach is best for one-off jobs or when you’re working with static schemas that rarely change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source tools&lt;/strong&gt;. Projects like &lt;a href="https://postgresql-anonymizer.readthedocs.io/en/stable/" rel="noopener noreferrer"&gt;PostgreSQL Anonymizer&lt;/a&gt; or &lt;a href="https://github.com/GreenmaskIO/greenmask" rel="noopener noreferrer"&gt;GreenMask&lt;/a&gt; can scan the database and apply declarative configs (YAML, JSON) that define masking and generation rules across the entire DB. Their main strength is automation and built-in strategies that preserve data integrity. The tradeoff is that you’ll need to invest some time learning and setting them up. These are a great fit when you need regular test copies of production databases that keep relational consistency intact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test Data Management (TDM) platforms&lt;/strong&gt;. Think &lt;a href="https://www.tonic.ai/" rel="noopener noreferrer"&gt;Tonic&lt;/a&gt;, &lt;a href="https://gretel.ai/" rel="noopener noreferrer"&gt;Gretel&lt;/a&gt;, and similar products. These are more than just utilities — they’re full-fledged data factories. They can work not only with databases but also with other data sources, spinning up entire test environments in the cloud on demand. Their strengths: maximum automation, scalability, and support for highly complex environments. Their weaknesses: high cost and steep implementation effort. These platforms are usually found in large enterprises where managing the full lifecycle of test data is a must.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Despite the variety of options, most of these tools are powered by the same set of fundamental ideas — the so-called fake data patterns. Let’s take a closer look at them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Masking patterns
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Static masking&lt;/strong&gt; is one of the simplest and most reliable approaches. The idea is to replace sensitive data with fake values that partially or fully preserve the format of the original data. For instance, you could mask an email by keeping its structure (the part before @ and the domain) while altering the actual characters.&lt;/p&gt;

&lt;p&gt;The main advantage of static masking is how easy it is to implement. Even standard SQL functions are enough to whip up something usable on the spot. But the method has important limitations: such data can’t be used as primary or foreign keys, since that would break referential integrity. On top of that, masked values might fail business validation if the application checks them against real-world patterns.&lt;/p&gt;
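&lt;p&gt;The idea is easy to sketch in Python (the masking scheme here — replacing every alphanumeric character with “x” while keeping the structural characters — is made up purely for illustration):&lt;/p&gt;

```python
import re

def mask_email(email: str) -> str:
    """Replace alphanumerics with 'x', keeping the @ and the dot structure."""
    local, _, domain = email.partition("@")
    return re.sub(r"\w", "x", local) + "@" + re.sub(r"\w", "x", domain)

print(mask_email("alice.smith@example.com"))  # xxxxx.xxxxx@xxxxxxx.xxx
```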

&lt;p&gt;&lt;strong&gt;Hashing&lt;/strong&gt; is another widely used method that anonymizes data while preserving uniqueness. This is done with hash functions — both standard and custom. Some tools even support advanced options like format-preserving hashing, which allows you to mask an email so that it remains structurally valid without exposing the original. The key advantage here is determinism: identical input values always produce the same hash. This makes it possible to maintain relationships across tables, which is particularly useful when testing complex systems with interconnected data.&lt;/p&gt;
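&lt;p&gt;A deterministic variant can be sketched with a standard hash function (the salt and the output format below are illustrative choices, not the behavior of any specific tool):&lt;/p&gt;

```python
import hashlib

SALT = b"test-env-salt"  # illustrative; a real setup would manage this secret properly

def pseudonymize_email(email: str) -> str:
    """Same input always yields the same fake address, so joins across tables survive."""
    digest = hashlib.sha256(SALT + email.lower().encode()).hexdigest()[:12]
    return f"user_{digest}@example.com"

a = pseudonymize_email("Alice@Corp.com")
b = pseudonymize_email("alice@corp.com")
print(a == b)  # True: determinism survives case differences
```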

&lt;p&gt;There are also several other important data masking patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Noising&lt;/strong&gt;. Adding random noise to numeric values. This keeps statistical characteristics intact but makes individual records unrecognizable. For example, you could add a random deviation of ±10% to employee salaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generalization&lt;/strong&gt;. Replacing exact values with broader ranges or categories. Instead of a specific age (say, 28), you store an interval like 20–40. Instead of precise coordinates, you keep only the city or district. This reduces detail while preserving analytical value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregation&lt;/strong&gt;. Replacing source data with aggregate statistics. Instead of showing each employee’s salary, you display the department average or company median. This is common in reporting and analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rule-based masking&lt;/strong&gt;. Applying business logic to masking. For instance, hiding movie titles with an 18+ rating, masking credit card numbers according to a pattern, or applying different masking rules depending on the user’s role.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shuffling&lt;/strong&gt;. Randomly swapping values within a column. This keeps the dataset composition but breaks record-level consistency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured data masking&lt;/strong&gt;. Special methods for handling complex formats. For JSON or XML, this may mean masking values by keys, working with nested structures, or even altering document structure. For binary data, specific algorithms ensure file formats remain valid.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Randomized masking&lt;/strong&gt;. Replacing values with randomly generated ones that match the required format. For example, generating a random but valid-looking email, phone number, or identifier.&lt;/li&gt;
&lt;/ul&gt;
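&lt;p&gt;Three of these patterns fit in a minimal Python sketch (the column values, the ±10% noise level, and the 20-year age buckets are arbitrary examples):&lt;/p&gt;

```python
import random

random.seed(42)  # reproducible for the example

salaries = [50_000, 62_000, 75_000, 90_000]

# Noising: add up to ±10% random deviation to each value
noised = [round(s * random.uniform(0.9, 1.1)) for s in salaries]

# Generalization: replace an exact age with a coarse bucket
def age_bucket(age: int) -> str:
    low = (age // 20) * 20
    return f"{low}-{low + 19}"

# Shuffling: permute values within a column, breaking row-level linkage
shuffled = salaries[:]
random.shuffle(shuffled)

print(noised)                                # statistics roughly preserved
print(age_bucket(28))                        # 20-39
print(sorted(shuffled) == sorted(salaries))  # True: same multiset of values
```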

&lt;p&gt;Each of these approaches has its strengths and limitations, and in practice the best results usually come from combining several methods depending on the type of data and the use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generation patterns
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Generating rows&lt;/strong&gt;&lt;br&gt;
The very first step in testing or bootstrapping an application is filling empty tables with data — whether it’s just a few rows or millions of records. Specialized tools make this trivial: you simply specify the desired number of rows (absolute or relative) in a GUI or config file, and the data magically appears.&lt;/p&gt;

&lt;p&gt;If you stick to standard tools, our good old SQL comes to the rescue. Since SQL:1999, the standard has supported generating sequences of rows using recursive queries. It’s a powerful, though not always intuitive, method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with recursive tmp (r) as (
  select 0
  union all
  select r+1
  from tmp
  where r &amp;lt; 100500)
select r from tmp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL makes this even easier with the built-in generate_series function, which can create numeric, time-based, and even text sequences in just a couple of lines. It’s a perfect entry point for quickly generating test data right at the database level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select generate_series
from generate_series(1, 100500)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you’re curious, I highly recommend Timescale’s fantastic series of articles on generate_series.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generating values&lt;/strong&gt;&lt;br&gt;
Once the “skeleton” of rows is in place, it’s time to fill them with actual data. For small projects and quick checks, the DBMS itself has enough built-in capabilities. This approach works great for rapid prototyping, validating basic functionality, or building reports where realism isn’t critical. Here’s an example in PostgreSQL using generate_series() along with random value generators:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select
  md5(random()::text) as name,
  floor(random() * 99)::int as years_of_experience,
  md5(random()::text) || '@gmail.com' as email,
  now() - (random() * (interval '90 days')) as order_date,
  case when random() &amp;gt; 0.5 then true else false end as is_student
from generate_series(1, 1000);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But real-world projects need much more realistic data. Fortunately, most modern data generation tools can produce lifelike primitives: names, addresses, emails, document numbers, phone numbers — all customizable and localized. For example, many tools support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Demographic data (full names) adapted for specific regions (Cyrillic for Russia, Latin for Europe, etc.).&lt;/li&gt;
&lt;li&gt;Addresses that match real postal formats for different countries.&lt;/li&gt;
&lt;li&gt;Valid document numbers (passports, driver’s licenses) with proper checksums, plus phone numbers formatted by operator masks.&lt;/li&gt;
&lt;li&gt;Emails bound to actual-looking domains.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes the data not only look convincing but also pass formal validation checks, which is critical for testing business logic.&lt;/p&gt;
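&lt;p&gt;The core mechanism behind such generators can be sketched in a few lines: pick parts from curated, locale-specific lists and combine them. Real libraries like Python Faker ship thousands of entries per locale; the tiny lists and the phone mask below are purely illustrative:&lt;/p&gt;

```python
import random

random.seed(0)

FIRST = ["Anna", "Boris", "Clara", "Dmitry"]
LAST = ["Ivanov", "Petrova", "Smirnov", "Kuznetsova"]
DOMAINS = ["example.com", "example.org"]

def fake_person():
    """Combine locale-specific parts into a plausible, internally consistent record."""
    first, last = random.choice(FIRST), random.choice(LAST)
    email = f"{first}.{last}@{random.choice(DOMAINS)}".lower()
    phone = "+7-9%02d-%03d-%04d" % (random.randint(0, 99),
                                    random.randint(0, 999),
                                    random.randint(0, 9999))
    return {"name": f"{first} {last}", "email": email, "phone": phone}

print(fake_person())
```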

&lt;p&gt;&lt;strong&gt;Other key data generation patterns&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generating related values — producing logically connected data across fields and tables. For instance, generating an email based on a user’s first and last name, or maintaining consistency of region codes across different tables. This ensures both realism and data integrity.&lt;/li&gt;
&lt;li&gt;Generation from parent tables — respecting foreign keys and relationships. Data in child tables should correspond to records in parent tables.&lt;/li&gt;
&lt;li&gt;Recursive relationships — handling self-referential or cyclic structures (like an employees table where each record can point to a manager in the same table). This requires:
&lt;ul&gt;
&lt;li&gt;analyzing the business logic of the recursion,&lt;/li&gt;
&lt;li&gt;defining a safe nesting depth,&lt;/li&gt;
&lt;li&gt;preventing circular references,&lt;/li&gt;
&lt;li&gt;and using iterative or recursive generation algorithms.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these patterns demands a deep understanding of the database schema and the application’s business logic. In practice, you’ll often mix specialized tools with custom SQL queries or scripts to get the job done.&lt;/p&gt;
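&lt;p&gt;Two of these patterns can be combined in a short sketch: emails derived from names (related values), and a self-referential manager column filled without cycles by only pointing at already-created rows (the ids and the table layout are illustrative):&lt;/p&gt;

```python
import random

random.seed(1)

NAMES = ["anna", "boris", "clara", "dmitry", "elena"]

employees = []
for emp_id, name in enumerate(NAMES, start=1):
    # Related value: the email is derived from the name, so the two stay consistent
    email = f"{name}@example.com"
    # Recursive relationship: a manager is chosen only among earlier rows,
    # which guarantees the hierarchy is acyclic by construction
    manager_id = random.choice(employees)["id"] if employees else None
    employees.append({"id": emp_id, "name": name, "email": email,
                      "manager_id": manager_id})

for e in employees:
    print(e)
```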

&lt;h2&gt;
  
  
  Where to start
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Study your DB schema in detail — you won’t get good results without it. Don’t forget triggers and stored procedures: they can silently change data during generation or masking.&lt;/li&gt;
&lt;li&gt;Choose the right tool for your tasks and tech stack. Be ready to find workarounds. Make sure the tool or script can selectively process specific tables or parts of tables.&lt;/li&gt;
&lt;li&gt;Build a testbed. Generate data, apply basic masking, and review the results together with your colleagues.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>data</category>
      <category>database</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
