<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jake Lazarus</title>
    <description>The latest articles on Forem by Jake Lazarus (@jakelaz).</description>
    <link>https://forem.com/jakelaz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3821271%2F86e1f840-dd57-4a1b-8df6-6274201606fb.png</url>
      <title>Forem: Jake Lazarus</title>
      <link>https://forem.com/jakelaz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jakelaz"/>
    <language>en</language>
    <item>
      <title>Database Subsetting for PostgreSQL: A Practical Guide (2026)</title>
      <dc:creator>Jake Lazarus</dc:creator>
      <pubDate>Thu, 16 Apr 2026 14:35:00 +0000</pubDate>
      <link>https://forem.com/jakelaz/database-subsetting-for-postgresql-a-practical-guide-2026-4j5m</link>
      <guid>https://forem.com/jakelaz/database-subsetting-for-postgresql-a-practical-guide-2026-4j5m</guid>
      <description>&lt;p&gt;Every team that has tried to copy production data into a dev environment has hit the same wall: production is too big, full of PII, and growing. The fix is not a bigger laptop or a faster &lt;code&gt;pg_dump&lt;/code&gt;. It is &lt;strong&gt;database subsetting&lt;/strong&gt; — extracting a small, self-contained slice of the database instead of all of it.&lt;/p&gt;

&lt;p&gt;Subsetting is the workflow underneath almost every modern dev-data tool. It is what makes "restore production data locally" actually viable. But the term gets thrown around loosely, and the difference between a real FK-aware subset and a glorified &lt;code&gt;SELECT ... LIMIT&lt;/code&gt; is the difference between a working dev environment and a database full of orphaned rows.&lt;/p&gt;

&lt;p&gt;This guide covers the whole picture: what subsetting is, how it works at the foreign-key level, the strategies that fit common PostgreSQL schemas, and an honest 2026 look at the tools that do it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Database subsetting extracts a referentially complete slice of a PostgreSQL database by traversing foreign keys from one or more root tables. Done well, it produces a dataset 10–1000× smaller than production that still behaves like production. Done badly, it produces broken referential integrity and silent test failures. This post covers how to do it well, the strategies that fit common schemas, and the tools that handle it in 2026.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What is database subsetting?
&lt;/h2&gt;

&lt;p&gt;Database subsetting is the process of extracting a representative, self-contained slice of a database instead of copying the whole thing. You start from one or more &lt;strong&gt;root tables&lt;/strong&gt; — usually entities like &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;accounts&lt;/code&gt;, or &lt;code&gt;tenants&lt;/code&gt; — apply filters and row limits, and then traverse foreign key relationships to pull in every related row that the slice depends on.&lt;/p&gt;

&lt;p&gt;The output is a smaller database that preserves the same schema and the same relational structure as production. Every foreign key still resolves. Every join still returns rows. The dataset behaves like production, just at 1% (or 0.1%, or 0.01%) of the size.&lt;/p&gt;

&lt;p&gt;The point of subsetting is &lt;strong&gt;not&lt;/strong&gt; to make a backup. It is to produce a dataset small enough to be useful for development, CI, and staging while still being realistic enough to surface the bugs that fake data hides.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why teams need database subsetting
&lt;/h2&gt;

&lt;p&gt;The motivation is almost always the same: &lt;strong&gt;&lt;code&gt;pg_dump&lt;/code&gt; does not scale, and seed scripts do not survive contact with reality.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A full &lt;code&gt;pg_dump&lt;/code&gt; of a mature production database is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Too big.&lt;/strong&gt; A 200GB database is fine on production hardware. It is unusable on a laptop, painful in CI, and tedious to refresh weekly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full of PII.&lt;/strong&gt; Real emails, real names, real billing addresses end up on dev machines, in CI logs, and in artifacts. That is a compliance problem in any environment that handles customer data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow to restore.&lt;/strong&gt; A restore that takes 90 seconds today takes 8 minutes in 18 months as production grows. That cost compounds across every CI run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We covered the full breakdown in &lt;a href="https://basecut.dev/vs/pg-dump" rel="noopener noreferrer"&gt;pg_dump vs database snapshots&lt;/a&gt; — the headline is that &lt;code&gt;pg_dump&lt;/code&gt; is the right tool for backups and disaster recovery, and the wrong tool for dev data.&lt;/p&gt;

&lt;p&gt;Seed scripts have the opposite problem. They start small and stay small, so size is not an issue. But every schema migration is a chance for the seed script to break, drift, or silently produce stale data. The shapes of real data — Unicode, NULLs, accounts with hundreds of related rows — never appear in hand-written fixtures. We covered why that matters in &lt;a href="https://basecut.dev/blog/why-fake-postgresql-test-data-misses-bugs" rel="noopener noreferrer"&gt;Why Fake PostgreSQL Test Data Misses Real Bugs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Subsetting solves both problems at once. The output is small enough to be useful and real enough to be representative.&lt;/p&gt;

&lt;h2&gt;
  
  
  How FK-aware subsetting actually works
&lt;/h2&gt;

&lt;p&gt;Most of the difference between "good subsetting" and "bad subsetting" comes down to whether the extractor understands foreign keys. Here is what FK-aware extraction is doing under the hood.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root tables and traversal
&lt;/h3&gt;

&lt;p&gt;You pick one or more root tables. These are the entities the subset is "about" — usually &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;accounts&lt;/code&gt;, &lt;code&gt;tenants&lt;/code&gt;, or &lt;code&gt;organizations&lt;/code&gt;. The extractor reads the rows that match your filter from the root tables and then walks the foreign key graph to pull in everything that depends on those rows.&lt;/p&gt;

&lt;p&gt;If your root is &lt;code&gt;users&lt;/code&gt; and you select 1000 rows, the extractor follows every foreign key that points at &lt;code&gt;users&lt;/code&gt; and pulls in the matching rows from &lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;subscriptions&lt;/code&gt;, &lt;code&gt;audit_logs&lt;/code&gt;, and any other dependent table. It then walks one level deeper: &lt;code&gt;line_items&lt;/code&gt; and &lt;code&gt;payments&lt;/code&gt; have foreign keys pointing at &lt;code&gt;orders&lt;/code&gt;, so those come along. And so on, until the closure is complete.&lt;/p&gt;

&lt;p&gt;This is the part that matters. Without traversal, you end up with 1000 users and zero orders, because nothing told the extractor to follow &lt;code&gt;orders.user_id&lt;/code&gt;. With traversal, you get a connected subgraph: 1000 users, all of their orders, all of those orders' line items, all of the related payments.&lt;/p&gt;
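&lt;p&gt;The closure computation itself is simple to sketch. The following toy Python model (hypothetical table and column names, an in-memory stand-in for real SQL) shows the fixpoint loop an FK-aware extractor runs: keep pulling in child rows whose foreign keys point at already-selected rows until nothing new appears.&lt;/p&gt;

```python
# Toy in-memory schema: tables are dicts of primary key to row,
# FKS maps (child_table, fk_column) to the parent table.
# All names here are hypothetical, for illustration only.
TABLES = {
    "users":      {1: {}, 2: {}, 3: {}},
    "orders":     {10: {"user_id": 1}, 11: {"user_id": 3}},
    "line_items": {100: {"order_id": 10}, 101: {"order_id": 11}},
}
FKS = [("orders", "user_id", "users"), ("line_items", "order_id", "orders")]

def subset(root_table, root_ids):
    """Return {table: set_of_pks} reachable from the root rows via FKs."""
    picked = {t: set() for t in TABLES}
    picked[root_table] = set(root_ids)
    changed = True
    while changed:  # iterate to a fixpoint: the referentially complete closure
        changed = False
        for child, col, parent in FKS:
            for pk, row in TABLES[child].items():
                # pull in every child row whose FK points at a picked parent
                if row[col] in picked[parent] and pk not in picked[child]:
                    picked[child].add(pk)
                    changed = True
    return picked

# selecting only user 1 drags along order 10 and line item 100;
# users 2 and 3 and their dependents stay behind
print(subset("users", [1]))
```

&lt;p&gt;A real extractor does this with batched SQL rather than an in-memory loop, and it also follows foreign keys upward so that a selected child's required parents come along, but the fixpoint idea is the same.&lt;/p&gt;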

&lt;h3&gt;
  
  
  Filters and row limits
&lt;/h3&gt;

&lt;p&gt;Filters narrow the slice before traversal runs. They look like SQL &lt;code&gt;WHERE&lt;/code&gt; clauses applied to the root tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;AND&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;plan&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:plan'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-01-01'&lt;/span&gt;
      &lt;span class="na"&gt;plan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;team'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Row limits cap the slice in case the filter still pulls in too much. A &lt;code&gt;per_table&lt;/code&gt; limit prevents one accidentally-huge table from blowing up the snapshot, and a &lt;code&gt;total&lt;/code&gt; limit caps the whole extract:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;per_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5000&lt;/span&gt;
    &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Filters and limits together are the levers that make subsetting work for any size of production database. The same config that works on 10GB of data works on 10TB — only the filter changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why referential completeness matters
&lt;/h3&gt;

&lt;p&gt;A subset is &lt;strong&gt;referentially complete&lt;/strong&gt; when every foreign key in the extract resolves to a row that is also in the extract. If &lt;code&gt;orders.user_id = 42&lt;/code&gt; is in the extract but &lt;code&gt;users.id = 42&lt;/code&gt; is not, the subset is broken and the restore will fail with a constraint violation — or worse, succeed with constraints disabled and produce a database your app cannot read correctly.&lt;/p&gt;

&lt;p&gt;This is the failure mode of "naive" subsetting (run &lt;code&gt;SELECT * FROM users LIMIT 1000&lt;/code&gt; and call it done). The extracted users table has 1000 rows. The orders table has rows pointing at user IDs that no longer exist. The restore either errors out or silently corrupts the dataset.&lt;/p&gt;

&lt;p&gt;A real FK-aware extractor guarantees referential completeness by construction: every row in the output is reachable from a root row by following foreign keys, and every foreign key in every row resolves inside the output. There are no orphans.&lt;/p&gt;

&lt;p&gt;This is the property that makes subsetting actually useful. Without it, you do not have a working database — you have a pile of disconnected rows.&lt;/p&gt;
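&lt;p&gt;The property is also cheap to verify after the fact. Here is a sketch of an orphan check over the same kind of toy in-memory extract (hypothetical names; against a live database the equivalent is a &lt;code&gt;LEFT JOIN&lt;/code&gt; per foreign key that looks for unmatched parent rows).&lt;/p&gt;

```python
def find_orphans(tables, fks):
    """Return (child_table, pk) pairs whose FK does not resolve in the extract."""
    return [
        (child, pk)
        for child, col, parent in fks
        for pk, row in tables[child].items()
        if row[col] is not None and row[col] not in tables[parent]
    ]

# a broken "naive" subset: order 11 points at user 3, who was never extracted
tables = {
    "users":  {1: {}},
    "orders": {10: {"user_id": 1}, 11: {"user_id": 3}},
}
print(find_orphans(tables, [("orders", "user_id", "users")]))
# [('orders', 11)]
```

&lt;p&gt;An FK-aware extractor makes this check return an empty list by construction; a naive &lt;code&gt;LIMIT&lt;/code&gt;-based extract almost never does.&lt;/p&gt;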

&lt;h2&gt;
  
  
  Subsetting and anonymization
&lt;/h2&gt;

&lt;p&gt;A referentially complete subset still contains real PII unless something explicitly removes it. The two operations — subsetting and anonymization — are usually run together at extraction time, before the data ever leaves production.&lt;/p&gt;

&lt;p&gt;The reason to anonymize &lt;strong&gt;during&lt;/strong&gt; extraction (not after restore) is that any post-restore approach lets real PII travel through your pipeline before the masking script runs. Real emails appear in restore logs. Real names sit on disk for the few seconds it takes the script to start. New columns added since the script was last updated never get masked at all.&lt;/p&gt;

&lt;p&gt;The fix is to anonymize as part of the extract step, deterministically, so the same source value maps to the same fake value across every related table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*.email'&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deterministic_email&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;users.full_name'&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deterministic_name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deterministic masking matters because joins still need to work. If &lt;code&gt;jane@company.com&lt;/code&gt; becomes &lt;code&gt;lmitchell@example.com&lt;/code&gt; in &lt;code&gt;users&lt;/code&gt; but &lt;code&gt;kpark@example.com&lt;/code&gt; in &lt;code&gt;audit_logs&lt;/code&gt;, queries that join across tables stop returning the right rows. Determinism preserves the relationships even though every value has been replaced.&lt;/p&gt;
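&lt;p&gt;One common way to get determinism is a keyed hash. This is an illustrative sketch, not any particular tool's algorithm: HMAC the source value with a per-project secret and build the fake value from the digest, so the mapping is stable but not reversible without the key.&lt;/p&gt;

```python
import hashlib
import hmac

SECRET = b"per-project-secret"  # hypothetical key; must never ship with the snapshot

def mask_email(email: str) -> str:
    """Deterministically map an email to a fake one: same input, same output."""
    digest = hmac.new(SECRET, email.lower().encode(), hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}@example.com"

# the same source value gets the same replacement in every table,
# so joins on the masked column still line up
assert mask_email("Jane@Company.com") == mask_email("jane@company.com")
assert mask_email("jane@company.com") != mask_email("kim@company.com")
```

&lt;p&gt;Because the mapping is a pure function of the input, every table that stores the same email gets the same replacement, and cross-table joins keep working. Rotate the secret and the entire mapping changes, which is useful when a snapshot needs to be invalidated.&lt;/p&gt;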

&lt;p&gt;We covered the full mechanics — including the difference between masking and anonymization, what GDPR considers anonymized, and why automatic detection matters — in &lt;a href="https://basecut.dev/blog/how-to-anonymize-pii-in-postgresql-for-development" rel="noopener noreferrer"&gt;How to Anonymize PII in PostgreSQL for Development&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Subsetting strategies for common PostgreSQL schemas
&lt;/h2&gt;

&lt;p&gt;Most production schemas fit one of three patterns. Each one has a different "right way" to subset, and getting the strategy right matters more than which tool you use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-tenant SaaS: filter by tenant_id
&lt;/h3&gt;

&lt;p&gt;The most common shape. Every business object has a &lt;code&gt;tenant_id&lt;/code&gt; (or &lt;code&gt;org_id&lt;/code&gt;, or &lt;code&gt;account_id&lt;/code&gt;) column, and every row in the system belongs to exactly one tenant. The natural subset is "give me all the data for one tenant" or "give me all the data for the 50 most recent tenants."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tenants&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-01-01'&lt;/span&gt;
    &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The extractor then follows foreign keys from &lt;code&gt;tenants&lt;/code&gt; to &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;projects&lt;/code&gt;, &lt;code&gt;documents&lt;/code&gt;, &lt;code&gt;audit_logs&lt;/code&gt;, and everything else, naturally producing a dataset that is "the last 50 tenants and everything they own." Restore time scales with tenant size, not database size.&lt;/p&gt;

&lt;p&gt;The anti-pattern here is filtering on a child table directly (&lt;code&gt;SELECT * FROM documents WHERE tenant_id IN (...)&lt;/code&gt;). You end up with documents whose owning users were not pulled in, and the joins break.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time-windowed: last N days of activity
&lt;/h3&gt;

&lt;p&gt;When you do not have a clean tenant boundary (or you want to capture cross-tenant traffic), filter by recency. Pick a root table that represents activity — &lt;code&gt;events&lt;/code&gt;, &lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;sessions&lt;/code&gt; — and grab the last 30 days:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;orders&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-03-08'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Traversal pulls in the related users, products, and line items. The result is a snapshot that captures whatever shapes of data are flowing through the system right now, including recent edge cases like new payment methods or feature flags that were only enabled in the last week.&lt;/p&gt;

&lt;p&gt;This strategy is particularly good for catching regressions, because the subset always reflects the latest production patterns. Refresh weekly and you are testing against last week's data shapes, not last quarter's.&lt;/p&gt;

&lt;h3&gt;
  
  
  Customer-scoped: one specific account for repro
&lt;/h3&gt;

&lt;p&gt;When a customer reports a bug you cannot reproduce, the fastest path to a fix is usually a snapshot of just their account. Pick the customer as a root and traverse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;accounts&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:account_id'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;account_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;acct_01HXYZ...'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output is a tiny, FK-complete database containing exactly that customer's data — anonymized so you can share it with the team, restore it locally, and reproduce the bug in seconds. This is the workflow that justifies subsetting on its own for most teams: a 30-second restore beats half a day of "can you give me your steps again?"&lt;/p&gt;

&lt;p&gt;We dig into the broader workflow in &lt;a href="https://basecut.dev/use-cases/local-development" rel="noopener noreferrer"&gt;the local development use case&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools that do PostgreSQL subsetting in 2026
&lt;/h2&gt;

&lt;p&gt;The market has shaken out a bit since 2024. Here is the honest 2026 picture.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Basecut&lt;/th&gt;
&lt;th&gt;Tonic.ai&lt;/th&gt;
&lt;th&gt;Delphix&lt;/th&gt;
&lt;th&gt;OSS Snaplet fork&lt;/th&gt;
&lt;th&gt;Hand-rolled SQL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FK-aware traversal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;DIY&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anonymize at extract time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (post-restore)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Referential completeness guaranteed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;DIY&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auto-detects common PII&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Config format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;GUI + config&lt;/td&gt;
&lt;td&gt;GUI + agents&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;SQL / shell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hosted option&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (free tier)&lt;/td&gt;
&lt;td&gt;Self-host + hosted&lt;/td&gt;
&lt;td&gt;Enterprise self-host&lt;/td&gt;
&lt;td&gt;Self-host only&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Actively maintained&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No active upstream&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Teams wanting CLI + YAML, free tier&lt;/td&gt;
&lt;td&gt;Enterprise procurement&lt;/td&gt;
&lt;td&gt;Large enterprise&lt;/td&gt;
&lt;td&gt;Self-hosters with bandwidth&lt;/td&gt;
&lt;td&gt;Tiny schemas, stopgaps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few notes that the table cannot capture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basecut&lt;/strong&gt; is the actively maintained CLI-first option. YAML config, FK-aware traversal, deterministic masking, and a free tier that covers small teams. Built for the same workflow Snaplet pioneered. We cover the details on &lt;a href="https://basecut.dev/blog/snaplet-alternative" rel="noopener noreferrer"&gt;the Snaplet alternative page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tonic.ai&lt;/strong&gt; is the heavyweight commercial option. Strong for enterprise procurement and SOC 2 paperwork, heavier than most teams want for "just give me dev data." &lt;a href="https://basecut.dev/vs/tonic" rel="noopener noreferrer"&gt;Full Tonic comparison&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delphix&lt;/strong&gt; is the legacy enterprise player. Powerful, but the operational model assumes a dedicated platform team. &lt;a href="https://basecut.dev/vs/delphix" rel="noopener noreferrer"&gt;Full Delphix comparison&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The open-source Snaplet fork&lt;/strong&gt; is on GitHub and viable if you have engineering bandwidth to self-host and own maintenance indefinitely. There is no active upstream. &lt;a href="https://basecut.dev/blog/snaplet-alternative" rel="noopener noreferrer"&gt;Context here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hand-rolled &lt;code&gt;pg_dump&lt;/code&gt; plus SQL scripts&lt;/strong&gt; is what most teams default to before they have evaluated anything. It works for small schemas and breaks quietly as they grow. The full breakdown is in &lt;a href="https://basecut.dev/vs/pg-dump" rel="noopener noreferrer"&gt;pg_dump vs database snapshots&lt;/a&gt;, and the broader "stop writing seed scripts" argument is in &lt;a href="https://basecut.dev/blog/replace-seed-scripts-with-production-snapshots" rel="noopener noreferrer"&gt;Replace Seed Scripts with Production Snapshots&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you are evaluating from scratch, the honest order of operations is: pick the simplest tool that will work for your schema today, and make sure it handles referential completeness and at-extract anonymization. Everything else is detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to subset a PostgreSQL database
&lt;/h2&gt;

&lt;p&gt;Here is the minimum viable workflow with Basecut, end to end. The same five steps apply to any FK-aware tool — the syntax is just different.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Pick your root tables
&lt;/h3&gt;

&lt;p&gt;Identify the entities your subset is "about." For multi-tenant SaaS this is usually &lt;code&gt;tenants&lt;/code&gt; or &lt;code&gt;accounts&lt;/code&gt;. For a marketplace it might be &lt;code&gt;users&lt;/code&gt; or &lt;code&gt;listings&lt;/code&gt;. For an event-driven system it might be &lt;code&gt;events&lt;/code&gt; or &lt;code&gt;orders&lt;/code&gt;. Pick the table whose rows naturally pull in everything else through foreign keys.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Write a config
&lt;/h3&gt;

&lt;p&gt;A Basecut config defines roots, filters, limits, and anonymization rules in YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dev-snapshot'&lt;/span&gt;

&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tenants&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-01-01'&lt;/span&gt;
    &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;

&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;per_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5000&lt;/span&gt;
    &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100000&lt;/span&gt;

&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;mode: auto&lt;/code&gt; handles common PII columns (emails, names, phones, addresses) without explicit rules. Add explicit rules later if you have unusual fields like JSONB blobs or free-text notes.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Create a snapshot
&lt;/h3&gt;

&lt;p&gt;Run the create command against a production read replica:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; basecut.yml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PRODUCTION_READ_REPLICA_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basecut traverses foreign keys from your root tables, pulls in every dependent row, anonymizes PII inline, and writes a versioned, referentially complete snapshot. Real PII never leaves production.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Restore wherever you need it
&lt;/h3&gt;

&lt;p&gt;Same snapshot, any target:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Local dev&lt;/span&gt;
basecut snapshot restore dev-snapshot:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOCAL_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Staging&lt;/span&gt;
basecut snapshot restore dev-snapshot:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$STAGING_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# CI runner&lt;/span&gt;
basecut snapshot restore dev-snapshot:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"postgresql://postgres:postgres@localhost:5432/test_db"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The restore is fast because the subset is small, and safe because the data is already anonymized. We cover the CI flavor specifically in &lt;a href="https://basecut.dev/blog/postgresql-test-database-github-actions" rel="noopener noreferrer"&gt;PostgreSQL test database in GitHub Actions&lt;/a&gt; and the staging flavor in &lt;a href="https://basecut.dev/blog/how-to-set-up-a-staging-database-from-production-postgresql" rel="noopener noreferrer"&gt;setting up a staging database&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Refresh on a schedule
&lt;/h3&gt;

&lt;p&gt;A snapshot from three months ago is only as good as the data shapes from three months ago. Schedule weekly refreshes so &lt;code&gt;:latest&lt;/code&gt; always points at fresh data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot create &lt;span class="nt"&gt;--config&lt;/span&gt; basecut.yml &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PRODUCTION_READ_REPLICA_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it from a cron job, a CI workflow, or — on the team plan — a Basecut agent that handles scheduling for you. Existing restore commands keep working unchanged because they reference &lt;code&gt;:latest&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is the whole loop. Most teams get a working subset config in an afternoon and roll it out across local, CI, and staging over the following sprint.&lt;/p&gt;

&lt;h2&gt;
  
  
  When subsetting is not the right answer
&lt;/h2&gt;

&lt;p&gt;Subsetting is the right default for development data, but it is not the right tool for every job. Skip it when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You need an exact copy of production for migration rehearsal.&lt;/strong&gt; Use &lt;code&gt;pg_dump&lt;/code&gt; or a logical replication snapshot. Subsetting is a slice, not a forensic copy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your schema is genuinely tiny.&lt;/strong&gt; If you have five tables and 10MB of data, subsetting is overkill. A &lt;code&gt;pg_dump&lt;/code&gt; plus a quick masking script is fine until the schema grows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You have a strict requirement for full row-level fidelity.&lt;/strong&gt; Some compliance scenarios mandate a full copy with controlled access rather than a representative subset. Subsetting is a fit for the development workflow, not for forensic or audit use cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For everything else — local dev, CI test data, staging refreshes, customer repros, and onboarding — subsetting is the workflow worth investing in. It is the difference between saying "we have realistic dev data" and having a database you can actually develop against.&lt;/p&gt;




&lt;p&gt;If you want to see whether subsetting fits your schema, the Basecut free tier covers most small teams. Install the CLI, point it at a read replica, and you can have a first FK-complete, anonymized snapshot in a few minutes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Try Basecut free →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or read more: &lt;a href="https://basecut.dev/vs/pg-dump" rel="noopener noreferrer"&gt;pg_dump comparison&lt;/a&gt; · &lt;a href="https://basecut.dev/blog/how-to-anonymize-pii-in-postgresql-for-development" rel="noopener noreferrer"&gt;How to anonymize PII in PostgreSQL&lt;/a&gt; · &lt;a href="https://basecut.dev/blog/replace-seed-scripts-with-production-snapshots" rel="noopener noreferrer"&gt;Replace seed scripts with snapshots&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>testing</category>
      <category>devops</category>
    </item>
    <item>
      <title>Snaplet Alternative in 2026: What to Use After Snaplet Shut Down</title>
      <dc:creator>Jake Lazarus</dc:creator>
      <pubDate>Tue, 14 Apr 2026 14:30:00 +0000</pubDate>
      <link>https://forem.com/jakelaz/snaplet-alternative-in-2026-what-to-use-after-snaplet-shut-down-fh5</link>
      <guid>https://forem.com/jakelaz/snaplet-alternative-in-2026-what-to-use-after-snaplet-shut-down-fh5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://basecut.dev/blog/snaplet-alternative" rel="noopener noreferrer"&gt;basecut.dev&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you used Snaplet, you know exactly what it felt like when it worked: one command to pull a slice of production, anonymized, ready to restore. When they shut down in 2024, that workflow went with them.&lt;/p&gt;

&lt;p&gt;Snaplet had a real following because it solved a real problem. It made "get anonymized production data into your dev environment" feel like a first-class workflow instead of a weekend project. Teams built their local dev setup and CI pipelines around it. When the shutdown happened, those workflows broke overnight.&lt;/p&gt;

&lt;p&gt;Losing it meant teams had to find a maintained alternative, fork the open-source code themselves, or fall back to scripts they had avoided writing for good reason. Two years on, there are a few legitimate paths. They are not all equivalent.&lt;/p&gt;

&lt;p&gt;This post is an honest 2026 survey of where things stand: what you can actually use, who each option fits, and what the trade-offs are.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; The best actively maintained Snaplet alternative in 2026 is &lt;strong&gt;Basecut&lt;/strong&gt; — same subset → anonymize → restore workflow, YAML config instead of TypeScript, free tier for small teams. The &lt;strong&gt;open-source Snaplet fork&lt;/strong&gt; is viable if you have bandwidth to self-host and maintain it indefinitely. &lt;strong&gt;Tonic.ai&lt;/strong&gt; is a heavier enterprise option. &lt;strong&gt;pg_dump plus anonymization scripts&lt;/strong&gt; works for tiny schemas and breaks quietly as they grow.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What is the best Snaplet alternative in 2026?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Basecut&lt;/strong&gt; is the most actively maintained alternative to Snaplet for teams that need anonymized PostgreSQL snapshots. It covers the same core workflow — subset, anonymize, restore — with a YAML config instead of TypeScript and a broader set of anonymization strategies. The open-source Snaplet fork is also viable if your team has the bandwidth to self-host and maintain it long-term. Tonic.ai is available if you need an enterprise-grade commercial alternative with heavier procurement. And for the simplest possible cases, &lt;code&gt;pg_dump&lt;/code&gt; plus some post-restore scripts will do the job, though that approach has a well-documented ceiling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Snaplet alternatives compared
&lt;/h2&gt;

&lt;p&gt;Before going deep on each option, here is how the four mainstream paths line up against the workflow Snaplet established.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Basecut&lt;/th&gt;
&lt;th&gt;OSS Snaplet fork&lt;/th&gt;
&lt;th&gt;Tonic.ai&lt;/th&gt;
&lt;th&gt;pg_dump + scripts&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Actively maintained&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No active upstream&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FK-aware subsetting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;DIY with &lt;code&gt;WHERE&lt;/code&gt; clauses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anonymize before data leaves prod&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (masking runs post-restore)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auto-detects common PII&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deterministic masking across tables&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Config format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;GUI + config&lt;/td&gt;
&lt;td&gt;SQL / shell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hosted option&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (free tier)&lt;/td&gt;
&lt;td&gt;Self-host only&lt;/td&gt;
&lt;td&gt;Self-host + hosted&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Team / org-level policies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (paid)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free for small teams&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (self-host)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Teams wanting a maintained Snaplet replacement&lt;/td&gt;
&lt;td&gt;Teams with bandwidth to own the fork&lt;/td&gt;
&lt;td&gt;Enterprise with procurement cycles&lt;/td&gt;
&lt;td&gt;Tiny schemas, temporary stopgaps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The rest of this post walks through each option in detail, who it fits, and where it falls down.&lt;/p&gt;

&lt;h2&gt;
  
  
  What teams actually miss about Snaplet
&lt;/h2&gt;

&lt;p&gt;Before evaluating options, it helps to be specific about what Snaplet actually gave people. Not everything about it was special — but a few things were.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The one-command workflow.&lt;/strong&gt; &lt;code&gt;snaplet snapshot capture&lt;/code&gt; followed by &lt;code&gt;snaplet snapshot restore&lt;/code&gt;. That was the whole loop. New developer joins the team — two commands and they have a working database. CI needs realistic test data — restore the snapshot. Debugging a production incident locally — restore the snapshot. The value was not the technology; it was the fact that the workflow required no thinking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A config that understood your schema.&lt;/strong&gt; Snaplet's TypeScript config let you write transforms close to the data model. When it worked, it felt like the tool was aware of your schema rather than applying dumb column-name heuristics. You could write logic that said "for this table, transform this column this way" and have it behave exactly as intended.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automatic PII masking.&lt;/strong&gt; Snaplet would detect and mask common PII fields without requiring you to enumerate every sensitive column manually. For teams that had not yet thought carefully about their anonymization strategy, that auto-detection was the difference between using the tool and not using it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subsetting — not full dumps.&lt;/strong&gt; This is the one that gets underappreciated. Snaplet did not copy your entire database. It followed foreign keys and extracted a connected slice: recent users, their related orders, their related records. The result was small enough to restore locally in seconds and referentially intact because the relationships were traversed rather than copied blindly. Full &lt;code&gt;pg_dump&lt;/code&gt; gives you everything; Snaplet gave you the right slice. That difference matters for restore speed, local disk usage, and CI job time. It is also the reason a &lt;code&gt;pg_dump&lt;/code&gt;-based fallback feels so much worse by comparison — once you have had subsetting, going back to full dumps is painful.&lt;/p&gt;
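&lt;p&gt;The traversal described above can be sketched in a few lines of Python. This is a toy illustration of FK-aware subsetting over an in-memory "schema", not Snaplet's actual implementation:&lt;/p&gt;

```python
from collections import deque

# Toy data: three tables and the FK edges connecting them. This is an
# illustrative sketch of FK-aware subsetting, not any tool's real code.
rows = {
    "users":  [{"id": 1}, {"id": 2}, {"id": 3}],
    "orders": [{"id": 10, "user_id": 1}, {"id": 11, "user_id": 3}],
    "items":  [{"id": 100, "order_id": 10}, {"id": 101, "order_id": 11}],
}
fk_edges = [  # (child_table, fk_column, parent_table)
    ("orders", "user_id", "users"),
    ("items", "order_id", "orders"),
]

def subset(root_table, root_ids):
    """Collect the slice of related rows reachable from the root rows."""
    keep = {t: set() for t in rows}
    keep[root_table] = set(root_ids)
    queue = deque([root_table])
    while queue:
        parent = queue.popleft()
        for child, col, p in fk_edges:
            if p != parent:
                continue
            # Pull in every child row that references a kept parent row.
            added = {r["id"] for r in rows[child] if r[col] in keep[parent]}
            if added - keep[child]:
                keep[child] |= added
                queue.append(child)
    return keep

# Subsetting to user 1 drags in order 10 and item 100, and nothing else.
print(subset("users", [1]))
```

&lt;p&gt;Real tools do this walk against &lt;code&gt;information_schema&lt;/code&gt; metadata with batched queries, but the shape of the traversal is the same: start from roots, follow relationships, keep only what is connected.&lt;/p&gt;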

&lt;h2&gt;
  
  
  The main options after Snaplet
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The open-source Snaplet fork
&lt;/h3&gt;

&lt;p&gt;When Snaplet shut down, they open-sourced their code. It is on GitHub and available for anyone to run. For teams that depended heavily on Snaplet's TypeScript config and want to keep as much of their existing setup intact as possible, the fork is worth a look.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it fits:&lt;/strong&gt; you have a small team with the engineering bandwidth to set it up, self-host the necessary infrastructure, and handle any issues that come up. If you are comfortable owning a dependency that has no active upstream, this can be a workable path — especially if your existing &lt;code&gt;snaplet.config.ts&lt;/code&gt; is already dialed in and you do not want to migrate anything.&lt;/p&gt;

&lt;p&gt;The honest catch is that there are no active maintainers, no bug fixes, and no support channel. When PostgreSQL releases a new version and something breaks, that is your problem to debug. When your schema evolves and an edge case triggers unexpected behavior, you are reading source code to figure out why. That is not a knock on Snaplet — open-sourcing was a generous move by the team. It just means you are taking on maintenance ownership in full.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; viable if you are prepared to own it indefinitely. Not a good fit if "set it and forget it" is a requirement.&lt;/p&gt;

&lt;h3&gt;
  
  
  pg_dump plus anonymization scripts
&lt;/h3&gt;

&lt;p&gt;The path many teams take immediately after losing a tool like Snaplet: wire together &lt;code&gt;pg_dump&lt;/code&gt;, restore the dump to a scratch database, run a series of &lt;code&gt;UPDATE&lt;/code&gt; statements to mask sensitive fields, and call it done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it fits:&lt;/strong&gt; your schema is simple, your compliance requirements are minimal, and you just need a basic way to move data around. If you have five tables, no complicated FK relationships, and PII is confined to a handful of columns you can enumerate manually, this works fine.&lt;/p&gt;

&lt;p&gt;The catch is that this is exactly the pattern that breaks as the schema grows — and it breaks quietly. &lt;code&gt;UPDATE&lt;/code&gt;-based anonymization runs after the data is already in the target database, which means real PII is sitting on a dev machine or CI runner for however long the masking script takes. Subsetting is not built in, so you are either restoring the full production dump or writing custom &lt;code&gt;WHERE&lt;/code&gt; clauses for every table. And when a new sensitive column gets added, nothing reminds you to update the masking script.&lt;/p&gt;
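&lt;p&gt;One cheap guard against that quiet drift is a coverage check: compare the columns your masking script knows about against the live column list (in practice pulled from &lt;code&gt;information_schema.columns&lt;/code&gt;) and fail CI on anything unclassified. A minimal sketch, with hypothetical table and column names:&lt;/p&gt;

```python
# Drift guard for a hand-rolled masking script: fail loudly when a
# column appears that the script has never classified. The column
# names here are hypothetical.
MASKED = {("users", "email"), ("users", "full_name")}
REVIEWED_SAFE = {("users", "id"), ("users", "created_at")}

def unclassified(live_columns):
    """Return columns that are neither masked nor explicitly marked safe."""
    return sorted(set(live_columns) - MASKED - REVIEWED_SAFE)

# A migration adds users.phone; nothing in the masking script knows it.
live = [("users", "id"), ("users", "email"),
        ("users", "full_name"), ("users", "created_at"),
        ("users", "phone")]
print(unclassified(live))  # the new column surfaces for review
```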

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; fine for simple cases. Expect to outgrow it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tonic.ai
&lt;/h3&gt;

&lt;p&gt;Tonic is the heaviest commercial option in this space. It predates Snaplet, targets enterprise buyers, and covers the core workflow — subsetting, anonymization, restore — with a GUI-first experience and a set of higher-end features like synthetic data generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it fits:&lt;/strong&gt; you are at an org where procurement, SOC 2 paperwork, and a dedicated platform team are normal parts of buying a tool. Tonic is built for that world. If your team can spend a quarter evaluating it and your compliance team wants sign-off from a vendor with a full trust center, it is a reasonable pick.&lt;/p&gt;

&lt;p&gt;The honest catch for most Snaplet refugees is that Tonic is not really a like-for-like replacement. The pricing, onboarding, and operational model are different — it is not a CLI you install in an afternoon. Smaller teams that liked Snaplet for its "two commands and a YAML" ergonomics usually find Tonic heavier than what they were looking for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; reasonable if you are already an enterprise buyer. Heavier than most Snaplet users want.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basecut
&lt;/h3&gt;

&lt;p&gt;Basecut is built around the same workflow Snaplet established: define what to extract, run a CLI command, restore anywhere.&lt;/p&gt;

&lt;p&gt;The config is YAML instead of TypeScript, which trades flexibility for simplicity. For most teams — especially those whose Snaplet configs were mostly just masking rules and row limits — the YAML is easier to read, easier to review, and easier for non-JS teams to maintain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dev-snapshot'&lt;/span&gt;

&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-01-01'&lt;/span&gt;

&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;per_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
    &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;

&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basecut has 30+ anonymization strategies and auto-detects common PII fields — names, emails, phone numbers, addresses — without requiring you to enumerate them. It also supports deterministic masking, which matters when the same source value needs to map to the same fake value across related tables. If &lt;code&gt;jane@company.com&lt;/code&gt; turns into two different fake emails across &lt;code&gt;users&lt;/code&gt; and &lt;code&gt;audit_logs&lt;/code&gt;, your data stops behaving like the real system.&lt;/p&gt;
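&lt;p&gt;The deterministic property is easy to picture with a keyed hash: the same source value always produces the same fake value, so cross-table joins survive masking. A sketch of the general technique, not Basecut's actual implementation:&lt;/p&gt;

```python
import hashlib
import hmac

SECRET = b"per-project-masking-key"  # hypothetical key, kept out of the repo

def mask_email(real_email: str) -> str:
    """Deterministically map a real email to a stable fake one.

    Same input always yields the same output, so joins across tables
    (users.email vs audit_logs.email) still line up after masking.
    """
    digest = hmac.new(SECRET, real_email.lower().encode(), hashlib.sha256)
    return f"user-{digest.hexdigest()[:10]}@example.com"

# The same source value masks identically wherever it appears...
assert mask_email("jane@company.com") == mask_email("jane@company.com")
# ...and different source values stay distinct.
assert mask_email("jane@company.com") != mask_email("joe@company.com")
```

&lt;p&gt;Swap in per-type output formats and per-project keys and you have the core of deterministic masking; the important property is only that the function is stable, keyed, and one-way.&lt;/p&gt;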

&lt;p&gt;For teams with compliance requirements, Basecut adds org-level anonymization policies — rules that apply across all snapshots in a workspace without relying on individual contributors to remember to set them. You can enforce that certain columns are always masked, regardless of who runs the snapshot or what config file they used.&lt;/p&gt;

&lt;p&gt;Snapshots can be stored locally or in cloud storage, and the CLI supports both interactive use and async agent execution for teams that want to run snapshot creation on a schedule without leaving a terminal open. The free tier covers small teams. Paid plans add team features and higher snapshot volumes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; the closest like-for-like replacement for Snaplet, actively maintained, with a free tier that covers most small teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Post-Snaplet migration checklist
&lt;/h2&gt;

&lt;p&gt;If your workflow just broke and you are staring at a Snaplet-shaped hole in your dev setup, here is the minimum viable path to get back to a working state without picking the wrong tool under time pressure.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;List the workflows that actually depended on Snaplet.&lt;/strong&gt; Usually there are two or three: local dev onboarding, CI test data, and staging refreshes. Write them down. You do not need to replace every Snaplet feature — you need to restore these workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identify the data each workflow needs.&lt;/strong&gt; How many rows of which root tables? What recency window? What is genuinely PII? This is the information that will become your replacement tool's config, regardless of which tool you pick.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inventory the anonymization rules you actually rely on.&lt;/strong&gt; Pull up your old &lt;code&gt;snaplet.config.ts&lt;/code&gt; and list the columns that had explicit transforms. Most teams find that 80% of the rules are "mask emails, names, phone numbers" — which any serious replacement handles automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick one workflow to migrate first.&lt;/strong&gt; Local dev onboarding is usually the right starting point: it is the lowest stakes, has the fastest feedback loop, and exercises the full subset → anonymize → restore loop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run the new tool alongside the old script for one sprint.&lt;/strong&gt; Do not delete anything yet. Let one workflow prove itself before you rip out the fallback scripts that kept you running after the Snaplet shutdown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expand to CI once local dev is stable.&lt;/strong&gt; CI/CD test data is where the time savings compound fastest — a snapshot restore that takes seconds saves real engineer time on every PR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate snapshot refresh on a schedule.&lt;/strong&gt; Weekly is a reasonable default. Without a refresh schedule, any replacement tool is just a different kind of stale dataset six months from now.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delete the fallback scripts.&lt;/strong&gt; Once the replacement has been running unattended for a few weeks, delete the &lt;code&gt;pg_dump&lt;/code&gt;-and-&lt;code&gt;UPDATE&lt;/code&gt; scripts you wired up in the panic. Leaving them around means they get used again eventually, and now you have two parallel systems drifting apart.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The whole migration typically takes an afternoon for the CLI swap and one or two sprints to expand across local, CI, and staging. The hard part is usually the decision, not the execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migrating from Snaplet to Basecut
&lt;/h2&gt;

&lt;p&gt;The CLI migration is straightforward. If your team ran:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;snaplet snapshot capture
snaplet snapshot restore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Basecut equivalent is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot create &lt;span class="nt"&gt;--config&lt;/span&gt; basecut.yml
basecut snapshot restore my-snapshot:latest &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="nv"&gt;$DEV_DB_URL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The conceptual mapping is close enough that most teams can get a working Basecut config in an afternoon. The main translation is from Snaplet's TypeScript transform functions to anonymize rules in YAML — and in most cases, &lt;code&gt;anonymize: mode: auto&lt;/code&gt; handles the common fields automatically, so the config ends up shorter than what you had before.&lt;/p&gt;

&lt;p&gt;One thing worth knowing: Basecut uses snapshot names with a version tag (&lt;code&gt;my-snapshot:latest&lt;/code&gt;) rather than Snaplet's path-based restore syntax. The restore command takes a &lt;code&gt;--target&lt;/code&gt; flag pointing at the destination database URL, which keeps source and destination separate and makes it explicit which database is being written to.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is Snaplet still available in 2026?&lt;/strong&gt;&lt;br&gt;
No. Snaplet shut down in 2024 and open-sourced their codebase. The hosted product is gone, the team no longer maintains it, and there is no support channel. The code is still on GitHub for anyone who wants to self-host, but there are no active maintainers or bug fixes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the best alternative to Snaplet in 2026?&lt;/strong&gt;&lt;br&gt;
Basecut is the most actively maintained like-for-like alternative: same subset → anonymize → restore workflow, YAML config instead of TypeScript, and a free tier that covers small teams. The open-source Snaplet fork is viable if you can self-host and own the maintenance. Tonic.ai is a heavier enterprise option. &lt;code&gt;pg_dump&lt;/code&gt; plus anonymization scripts is a stopgap that breaks quietly as schemas grow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I migrate my Snaplet config directly to Basecut?&lt;/strong&gt;&lt;br&gt;
There is no automatic translator, but the conceptual mapping is close. Snaplet transform functions become Basecut &lt;code&gt;anonymize&lt;/code&gt; rules in YAML. For most teams, &lt;code&gt;anonymize: mode: auto&lt;/code&gt; handles the common PII columns (emails, names, phones, addresses) without explicit rules, which makes Basecut configs shorter than the Snaplet equivalents they replace. Most teams get a working Basecut config in an afternoon.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is the open-source Snaplet fork still maintained?&lt;/strong&gt;&lt;br&gt;
The code is available, but there is no active upstream. New PostgreSQL versions, edge cases, and schema quirks are your problem to debug. It is a reasonable option if you have engineering bandwidth and want to keep your existing &lt;code&gt;snaplet.config.ts&lt;/code&gt; intact. It is a poor fit if "set it and forget it" is a requirement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How long does a Snaplet → Basecut migration usually take?&lt;/strong&gt;&lt;br&gt;
The CLI swap itself takes an afternoon: write a &lt;code&gt;basecut.yml&lt;/code&gt;, run &lt;code&gt;basecut snapshot create&lt;/code&gt;, run &lt;code&gt;basecut snapshot restore&lt;/code&gt;, and verify the result against your app. Rolling it out across local dev, CI, and staging typically takes one or two sprints, mostly because each workflow needs to be validated independently before the old fallback scripts can be deleted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need to self-host Basecut the way I would self-host the Snaplet fork?&lt;/strong&gt;&lt;br&gt;
No. Basecut is a hosted product with a free tier for small teams, and snapshots can be stored either locally on your own machine or in cloud storage managed by Basecut. You install the CLI, point it at a read replica, and create your first snapshot without provisioning any infrastructure.&lt;/p&gt;




&lt;p&gt;If you want to see whether Basecut fits your workflow before committing to a migration, the free tier is a reasonable place to start. No infrastructure to set up — install the CLI, point it at a read replica, and you can have a first snapshot in a few minutes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Try Basecut free →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://basecut.dev/alternatives/snaplet" rel="noopener noreferrer"&gt;See the full Snaplet → Basecut migration guide →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>devops</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Set Up a PostgreSQL Test Database in GitHub Actions (Without pg_dump)</title>
      <dc:creator>Jake Lazarus</dc:creator>
      <pubDate>Thu, 09 Apr 2026 14:35:00 +0000</pubDate>
      <link>https://forem.com/jakelaz/how-to-set-up-a-postgresql-test-database-in-github-actions-without-pgdump-1m0b</link>
      <guid>https://forem.com/jakelaz/how-to-set-up-a-postgresql-test-database-in-github-actions-without-pgdump-1m0b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://basecut.dev/blog/postgresql-test-database-github-actions" rel="noopener noreferrer"&gt;basecut.dev&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The standard GitHub Actions job for PostgreSQL tests looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up test database&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;psql "$TEST_DB_URL" -f scripts/seed.sql&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or worse, a &lt;code&gt;pg_dump&lt;/code&gt; from production piped straight into the test container. It runs. The tests pass. Nobody questions it.&lt;/p&gt;

&lt;p&gt;Until a developer checks the CI logs and finds real customer emails in the output. Or a dump that used to take two minutes now takes fifteen. Or the seed script breaks on a migration and half the CI matrix fails with &lt;code&gt;ERROR: column "created_by" of relation "orders" does not exist&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This guide covers a better pattern: restore a pre-built, anonymized database snapshot instead. For small-to-medium snapshots, restores typically take seconds rather than minutes, the data contains no PII, and the process never breaks on schema changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you set up a PostgreSQL test database in GitHub Actions?
&lt;/h2&gt;

&lt;p&gt;The cleanest way to set up a PostgreSQL test database in GitHub Actions is to run a &lt;code&gt;postgres&lt;/code&gt; service container and restore a pre-built snapshot before your test step runs. This is faster than restoring a &lt;code&gt;pg_dump&lt;/code&gt;, safer than fixture SQL files, and does not require maintaining seed data by hand.&lt;/p&gt;

&lt;p&gt;The three-step pattern is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spin up a &lt;code&gt;postgres&lt;/code&gt; service container in the workflow.&lt;/li&gt;
&lt;li&gt;Install a CLI that can restore a versioned snapshot.&lt;/li&gt;
&lt;li&gt;Restore the snapshot before your test suite runs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The rest of this guide shows how to do that with Basecut, with full YAML you can copy directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the pg_dump approach breaks down in CI
&lt;/h2&gt;

&lt;p&gt;The obvious approach — dump production, restore in CI — causes three problems that compound over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed.&lt;/strong&gt; A &lt;code&gt;pg_dump&lt;/code&gt; restore is full-size by default. Production databases grow. A restore that takes 90 seconds today takes 8 minutes in two years. That is a lot of developer time spent waiting before tests even start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PII.&lt;/strong&gt; A full dump copies real emails, real names, real addresses. Those land in CI artifacts, appear in test output, and end up in logs. That is a compliance problem even if nobody reads the logs. The &lt;a href="https://www.thoughtworks.com/en-us/radar/techniques/production-data-in-test-environments" rel="noopener noreferrer"&gt;Thoughtworks Technology Radar&lt;/a&gt; explicitly calls out raw production data in test pipelines as a risk to address.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fragility.&lt;/strong&gt; Fixture SQL files and seed scripts break the moment a migration adds a &lt;code&gt;NOT NULL&lt;/code&gt; column they do not know about. Someone commits a quick fix, the fix produces slightly different data than everyone else's local setup, and now "works on my machine" is a data problem, not a code problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The snapshot approach
&lt;/h2&gt;

&lt;p&gt;Instead of restoring a raw dump or running a seed script, you restore a named, versioned snapshot — a small, FK-complete, pre-anonymized subset of production.&lt;/p&gt;

&lt;p&gt;The key properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pre-built.&lt;/strong&gt; The snapshot is created once (or on a schedule), not on every CI run. For most team-sized datasets, restore time is measured in seconds rather than minutes — though this depends on snapshot size and network.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Already anonymized.&lt;/strong&gt; PII is masked during snapshot creation, before any data leaves production. Nothing sensitive ever travels to CI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Referentially complete.&lt;/strong&gt; Every foreign key resolves. The snapshot is a consistent subgraph of production, not a random sample of disconnected rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versioned.&lt;/strong&gt; &lt;code&gt;dev-snapshot:latest&lt;/code&gt; always points to the most recent run. You can also pin to &lt;code&gt;dev-snapshot:v12&lt;/code&gt; to keep tests stable while multiple PRs are in flight.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Full GitHub Actions setup
&lt;/h2&gt;

&lt;p&gt;Here is a complete workflow you can adapt directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Test&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

    &lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:16&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test_db&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;5432:5432&lt;/span&gt;
        &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;-&lt;/span&gt;
          &lt;span class="s"&gt;--health-cmd pg_isready&lt;/span&gt;
          &lt;span class="s"&gt;--health-interval 10s&lt;/span&gt;
          &lt;span class="s"&gt;--health-timeout 5s&lt;/span&gt;
          &lt;span class="s"&gt;--health-retries 5&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Basecut CLI&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -fsSL https://basecut.dev/install.sh | sh&lt;/span&gt;
          &lt;span class="s"&gt;echo "$HOME/.local/bin" &amp;gt;&amp;gt; $GITHUB_PATH&lt;/span&gt;
          &lt;span class="s"&gt;# If the installer uses a different path, add that path instead&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Restore test snapshot&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;BASECUT_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.BASECUT_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;basecut snapshot restore dev-snapshot:latest \&lt;/span&gt;
            &lt;span class="s"&gt;--target "postgresql://postgres:postgres@localhost:5432/test_db"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth noting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--health-cmd pg_isready&lt;/code&gt; ensures the Postgres container is accepting connections before the restore step runs. Without this, you get intermittent failures on container startup.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dev-snapshot:latest&lt;/code&gt; is the snapshot name. Change this to match whatever you named your snapshot in the Basecut config. Tag a specific version (&lt;code&gt;dev-snapshot:v3&lt;/code&gt;) if you want tests to run against a pinned dataset across multiple branches.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;BASECUT_API_KEY&lt;/code&gt; goes in GitHub repo secrets (Settings → Secrets and variables → Actions), not in the workflow YAML.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  GitLab CI equivalent
&lt;/h2&gt;

&lt;p&gt;The same pattern works in GitLab CI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node:20&lt;/span&gt;

  &lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:16&lt;/span&gt;
      &lt;span class="na"&gt;alias&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
      &lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
        &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
        &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test_db&lt;/span&gt;

  &lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;PGHOST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
    &lt;span class="na"&gt;PGUSER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
    &lt;span class="na"&gt;PGPASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;

  &lt;span class="na"&gt;before_script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;curl -fsSL https://basecut.dev/install.sh | sh&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;export PATH="$HOME/.local/bin:$PATH"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;basecut snapshot restore dev-snapshot:latest \&lt;/span&gt;
        &lt;span class="s"&gt;--target "postgresql://postgres:postgres@postgres:5432/test_db"&lt;/span&gt;

  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set &lt;code&gt;BASECUT_API_KEY&lt;/code&gt; as a CI/CD variable in GitLab (Settings → CI/CD → Variables) with "Masked" enabled so it does not appear in job logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keeping snapshots fresh without breaking CI
&lt;/h2&gt;

&lt;p&gt;A snapshot that is three months old starts drifting from production. New columns, new relationships, new edge cases — none of them exist in it yet. The fix depends on how stable your CI needs to be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Refresh on a schedule&lt;/strong&gt; (recommended for most teams):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Refresh test snapshot&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt; &lt;span class="c1"&gt;# Every Monday at 3am&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;refresh&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Basecut CLI&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -fsSL https://basecut.dev/install.sh | sh&lt;/span&gt;
          &lt;span class="s"&gt;echo "$HOME/.local/bin" &amp;gt;&amp;gt; $GITHUB_PATH&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create fresh snapshot&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;BASECUT_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.BASECUT_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;basecut snapshot create \&lt;/span&gt;
            &lt;span class="s"&gt;--config basecut.yml \&lt;/span&gt;
            &lt;span class="s"&gt;--source "${{ secrets.PRODUCTION_DATABASE_URL }}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps &lt;code&gt;dev-snapshot:latest&lt;/code&gt; current week-to-week. Existing test workflows keep working — &lt;code&gt;latest&lt;/code&gt; only advances when the refresh job runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pin for stability, bump when needed:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your test suite is sensitive to data changes, pin the snapshot version in the test workflow (&lt;code&gt;dev-snapshot:v3&lt;/code&gt;) and bump the pin manually when you want the new data. This gives you explicit control over when CI picks up fresh data shapes.&lt;/p&gt;

&lt;h2&gt;
  
  
  PII in CI: the part most teams skip
&lt;/h2&gt;

&lt;p&gt;If your pipeline is restoring anything from production — a full dump, a partial export, a seed script that grabs real rows — real customer data is in your CI artifacts. That is the kind of thing commonly discussed in the context of GDPR Article 25 (data protection by design) and SOC 2 data handling controls, and most teams have not thought through whether their CI pipeline is in scope.&lt;/p&gt;

&lt;p&gt;The snapshot approach sidesteps this entirely because anonymization happens at extraction time, before data leaves production. The snapshot that arrives in CI already has fake emails, fake names, and fake phone numbers — not real data with a cleanup script someone may or may not have run.&lt;/p&gt;

&lt;p&gt;Deterministic masking ensures the same source value maps to the same fake value across all tables, so joins still work and the app behaves like it does in production.&lt;/p&gt;
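&lt;p&gt;As a sketch of the technique (not any particular tool's implementation), deterministic masking can be as simple as hashing the source value with a fixed salt:&lt;/p&gt;

```shell
# Sketch of deterministic masking: hashing the real value with a fixed salt
# yields the same fake value on every run, so cross-table joins still line up.
# Illustrative only -- real tools use proper keyed pseudonymization.
SALT="snapshot-2026-04"

mask_email() {
  token=$(printf '%s%s' "$SALT" "$1" | sha256sum | cut -c1-10)
  printf 'user_%s@example.com\n' "$token"
}

mask_email "alice@example.com"   # same token every run
mask_email "alice@example.com"   # identical to the line above
mask_email "bob@example.com"     # a different, equally stable token
```

&lt;p&gt;Because the mapping is stable, a masked &lt;code&gt;users.email&lt;/code&gt; matches the same masked value wherever it appears in other tables.&lt;/p&gt;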

&lt;h2&gt;
  
  
  When this pattern matters most
&lt;/h2&gt;

&lt;p&gt;This is overkill for a three-table CRUD app. It becomes the right default when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your test suite has data-dependent integration tests&lt;/li&gt;
&lt;li&gt;Your schema has more than a handful of related tables&lt;/li&gt;
&lt;li&gt;CI job time is starting to affect developer feedback loops&lt;/li&gt;
&lt;li&gt;You handle PII and need to demonstrate anonymization in CI for compliance audits&lt;/li&gt;
&lt;li&gt;Onboarding means "restore the snapshot and start working" instead of "ask around for a dump"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your CI pipeline currently has a slow &lt;code&gt;pg_dump&lt;/code&gt; step or a seed script that breaks every few sprints, this is the thing worth replacing first.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How do I set up a PostgreSQL test database in GitHub Actions?&lt;/strong&gt;&lt;br&gt;
Spin up a &lt;code&gt;postgres&lt;/code&gt; service container in the workflow, install a CLI that can restore a versioned snapshot, then restore a pre-built anonymized snapshot before your test step runs. This is faster than restoring a &lt;code&gt;pg_dump&lt;/code&gt;, safer than fixture SQL files, and does not require maintaining seed data by hand. The snapshot is built once from production with PII masked at extraction time, then reused across every CI run until the next refresh.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is pg_dump slow in GitHub Actions CI?&lt;/strong&gt;&lt;br&gt;
A full &lt;code&gt;pg_dump&lt;/code&gt; restore scales with the size of production and runs in full on every CI job. A restore that takes 90 seconds today takes 8 minutes in two years as production grows. Because the dump is not subset or anonymized, it also copies real PII into CI logs and artifacts. Pre-built snapshots replace this with a small, FK-complete, pre-anonymized dataset that typically restores in seconds for team-sized schemas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I cache my PostgreSQL test database between GitHub Actions runs?&lt;/strong&gt;&lt;br&gt;
Caching a live database across runs is fragile — the &lt;code&gt;postgres&lt;/code&gt; service container is fresh every job, and seeding state into caches tends to produce flaky tests. A better pattern is to cache the snapshot artifact itself (small, static, versioned) and restore it into a fresh container each run. That gives you fast startup without carrying mutable state between jobs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I keep real customer PII out of CI logs?&lt;/strong&gt;&lt;br&gt;
Anonymize at extraction time, not after restore. If a &lt;code&gt;pg_dump&lt;/code&gt; or seed script ever runs against production data in CI, real emails and names will appear in logs, artifacts, and error output. Pre-built snapshots with deterministic masking applied during creation guarantee that no raw PII ever travels through the CI pipeline. Any snapshot a developer can restore is already safe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does this pattern work with GitLab CI or CircleCI?&lt;/strong&gt;&lt;br&gt;
Yes. The pattern — service container for PostgreSQL, CLI install step, snapshot restore before tests — translates directly to GitLab CI, CircleCI, Buildkite, and Jenkins. Only the YAML syntax changes. The post includes a GitLab CI example; the same three steps apply to any runner that can start a PostgreSQL service and run a shell command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I keep the test snapshot from going stale?&lt;/strong&gt;&lt;br&gt;
Refresh it on a schedule with a separate workflow — typically a weekly cron job that runs &lt;code&gt;basecut snapshot create&lt;/code&gt; against production or a read replica. The &lt;code&gt;:latest&lt;/code&gt; tag always points at the most recent run, so existing test workflows pick up fresh data automatically. If your test suite is sensitive to data changes, pin a specific version tag (e.g. &lt;code&gt;dev-snapshot:v12&lt;/code&gt;) and bump it manually when you want to adopt new data shapes.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Get started with Basecut.&lt;/strong&gt; The CLI is free for small teams — create your first snapshot and restore it in CI in under 10 minutes. &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Try Basecut free →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>githubactions</category>
      <category>testing</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to Set Up a Staging Database from Production PostgreSQL (2026 Guide)</title>
      <dc:creator>Jake Lazarus</dc:creator>
      <pubDate>Tue, 07 Apr 2026 14:30:00 +0000</pubDate>
      <link>https://forem.com/jakelaz/how-to-set-up-a-staging-database-from-production-postgresql-2026-guide-1ckg</link>
      <guid>https://forem.com/jakelaz/how-to-set-up-a-staging-database-from-production-postgresql-2026-guide-1ckg</guid>
      <description>&lt;p&gt;Setting up a PostgreSQL staging database that actually reflects production is one of those tasks that sounds simple and turns into a day of cleanup. The obvious approach — &lt;code&gt;pg_dump&lt;/code&gt; production to staging — breaks down immediately: the dump is too large, it contains real customer PII, and it produces a shared environment nobody wants to touch.&lt;/p&gt;

&lt;p&gt;This guide walks through a better pattern: extract a connected, anonymized subset of production and restore it as your staging database. You get a realistic, production-like environment without the size, privacy risk, or shared-state headaches of a full copy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll learn:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why &lt;code&gt;pg_dump&lt;/code&gt; fails as a staging database solution&lt;/li&gt;
&lt;li&gt;How FK-aware subsetting produces a referentially complete extract&lt;/li&gt;
&lt;li&gt;How to anonymize PII before it ever leaves production&lt;/li&gt;
&lt;li&gt;How to automate staging refreshes so it never goes stale&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How do you create a staging database from production PostgreSQL safely?
&lt;/h2&gt;

&lt;p&gt;Create a staging database by extracting a referentially complete subset of production, anonymizing PII during extraction, and restoring that snapshot into staging. This keeps the environment realistic without copying the entire database, leaking customer data, or forcing developers to maintain manual cleanup scripts after every refresh.&lt;/p&gt;

&lt;p&gt;The shortest version of the workflow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Choose a representative slice of production&lt;/li&gt;
&lt;li&gt;Follow foreign keys so the extract stays complete&lt;/li&gt;
&lt;li&gt;Mask sensitive fields before data leaves production&lt;/li&gt;
&lt;li&gt;Restore the snapshot into staging&lt;/li&gt;
&lt;li&gt;Refresh it on a schedule so it stays current&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Why pg_dump Fails for Staging Environments
&lt;/h2&gt;

&lt;p&gt;The first instinct is usually &lt;code&gt;pg_dump&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pg_dump &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PRODUCTION_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | psql &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$STAGING_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works in the sense that it runs. The issues come later:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Size.&lt;/strong&gt; Production databases grow. A dump that takes 5 minutes today takes 30 minutes in a year. Restores slow down CI, slow down onboarding, and make "refresh staging" a thing people avoid.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PII.&lt;/strong&gt; A full dump copies everything — real emails, real names, real addresses, real payment details. That data is now in your staging environment, which means it is probably on developer laptops, in logs, and reachable by anyone with staging credentials. &lt;a href="https://www.thoughtworks.com/en-us/radar/techniques/production-data-in-test-environments" rel="noopener noreferrer"&gt;Thoughtworks' Technology Radar on production data in test&lt;/a&gt; calls out exactly these privacy and security tradeoffs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared state.&lt;/strong&gt; One staging environment with real-ish data usually becomes a place where everyone makes changes at once. It gets out of sync with production constantly. Nobody owns keeping it clean.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We wrote a &lt;a href="https://basecut.dev/vs/pg-dump" rel="noopener noreferrer"&gt;full comparison of pg_dump vs Basecut&lt;/a&gt; if you want the detailed breakdown. The short version: &lt;code&gt;pg_dump&lt;/code&gt; is the right tool for backups and disaster recovery. For dev and staging environments, you usually want something smaller and safer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Right Pattern: Subset, Anonymize, Restore
&lt;/h2&gt;

&lt;p&gt;Instead of copying the whole database, the better approach is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start from one or more root tables (usually &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;accounts&lt;/code&gt;, or whatever your primary entities are).&lt;/li&gt;
&lt;li&gt;Filter to a representative slice — recent signups, a specific account tier, a date range.&lt;/li&gt;
&lt;li&gt;Follow foreign keys to pull in all the related data those rows depend on.&lt;/li&gt;
&lt;li&gt;Anonymize sensitive fields during extraction, before anything leaves the production environment.&lt;/li&gt;
&lt;li&gt;Save the result as a named snapshot.&lt;/li&gt;
&lt;li&gt;Restore that snapshot to staging.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What you end up with is a self-contained, realistic subset of production — with real relationships, real data shapes, real edge cases — but no raw PII and a manageable size.&lt;/p&gt;

&lt;p&gt;This is what the industry calls &lt;strong&gt;database subsetting&lt;/strong&gt;, and it is the same pattern that powers &lt;a href="https://basecut.dev/use-cases/local-development" rel="noopener noreferrer"&gt;local dev environments&lt;/a&gt; and &lt;a href="https://basecut.dev/use-cases/ci-cd-test-data" rel="noopener noreferrer"&gt;CI test data pipelines&lt;/a&gt;. Staging is just another restore target.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Define What "Representative" Means for Your Database
&lt;/h2&gt;

&lt;p&gt;Before you can extract anything, you need to decide what data to include.&lt;/p&gt;

&lt;p&gt;For most applications this means picking a root table and a sensible filter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recent users (past 30–90 days)&lt;/li&gt;
&lt;li&gt;A specific cohort (paid accounts, a particular plan tier)&lt;/li&gt;
&lt;li&gt;Accounts associated with specific test scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You do not need the whole database. You need enough data that staging behaves like production — correct relationships, realistic distributions, enough rows to surface data-dependent bugs.&lt;/p&gt;

&lt;p&gt;A starting point for a typical SaaS PostgreSQL database:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;500–2,000 user accounts&lt;/li&gt;
&lt;li&gt;All their related records (orders, subscriptions, events, etc.)&lt;/li&gt;
&lt;li&gt;Enough to make the app behave realistically, small enough to restore in under a minute&lt;/li&gt;
&lt;/ul&gt;
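&lt;p&gt;A quick count against production (or a read replica) is enough to sanity-check a candidate filter before wiring it into any config; the table and column names here are illustrative:&lt;/p&gt;

```sql
-- Rough size check for a candidate slice (illustrative schema).
SELECT count(*) AS candidate_users
FROM users
WHERE created_at BETWEEN now() - interval '90 days' AND now();
```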

&lt;h2&gt;
  
  
  Step 2: Extract with Foreign Key Awareness
&lt;/h2&gt;

&lt;p&gt;The mistake most DIY approaches make is sampling rows naively — take 500 rows from &lt;code&gt;users&lt;/code&gt;, take 500 rows from &lt;code&gt;orders&lt;/code&gt;, call it done. Then you restore it and get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;orders pointing to users who are not in the snapshot&lt;/li&gt;
&lt;li&gt;line items pointing to products that were not included&lt;/li&gt;
&lt;li&gt;foreign key violations on restore&lt;/li&gt;
&lt;li&gt;an app that half-works or fails in strange ways&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A useful staging database has to be &lt;strong&gt;referentially complete&lt;/strong&gt;. Every foreign key must resolve. Every parent row must exist before its children.&lt;/p&gt;

&lt;p&gt;This is why FK-aware extraction matters. The extraction process traverses the schema — if you include an order, you also need the user who placed it, the products on the order, the shipping address, and whatever else your schema requires. The result is a subgraph that can be restored into an empty database without broken references.&lt;/p&gt;

&lt;p&gt;See the &lt;a href="https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-FK" rel="noopener noreferrer"&gt;official PostgreSQL docs on foreign key constraints&lt;/a&gt; for the underlying mechanics if you want to understand what the extractor is navigating.&lt;/p&gt;
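&lt;p&gt;You can see the graph the extractor has to walk by listing every foreign-key edge straight from the system catalogs:&lt;/p&gt;

```sql
-- Every foreign-key edge in the schema, child table to parent table.
-- This is the dependency graph an FK-aware extractor traverses.
SELECT con.conrelid::regclass  AS child_table,
       con.confrelid::regclass AS parent_table,
       con.conname             AS constraint_name
FROM pg_constraint con
WHERE con.contype = 'f'
ORDER BY 1, 2;
```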

&lt;h2&gt;
  
  
  Step 3: Anonymize PII Before It Leaves Production
&lt;/h2&gt;

&lt;p&gt;The natural next question is: what do you do about PII?&lt;/p&gt;

&lt;p&gt;The common answer is a cleanup script that runs after restore:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt;
  &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'user'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;first_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Test'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;last_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'User'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This has two problems. First, someone has to remember to run it. Second, when a new PII column gets added to the schema, someone has to update the script — and they usually forget.&lt;/p&gt;
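&lt;p&gt;One way to catch that drift is a periodic audit query that flags columns whose names look like PII, compared against what the cleanup script actually covers. The name patterns here are only a starting point, not an exhaustive detector:&lt;/p&gt;

```sql
-- Columns whose names suggest PII (patterns are illustrative, not exhaustive).
-- Compare this list against what the cleanup script actually masks.
SELECT table_name, column_name
FROM information_schema.columns
WHERE table_schema = 'public'
  AND (column_name ILIKE '%email%'
    OR column_name ILIKE '%phone%'
    OR column_name ILIKE '%address%'
    OR column_name ILIKE '%name%');
```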

&lt;p&gt;More importantly, it is already too late by the time this runs. The data traveled through your restore pipeline with real values in it.&lt;/p&gt;

&lt;p&gt;The better approach is to anonymize &lt;strong&gt;at extraction time&lt;/strong&gt;, before the data leaves production. The snapshot that gets created already has fake emails, fake names, and fake addresses. Nothing sensitive ever travels to staging.&lt;/p&gt;

&lt;p&gt;For columns that need specific handling — free-text fields, external IDs, JSONB blobs — you add explicit rules. Everything else gets auto-detected by column name and type.&lt;/p&gt;

&lt;p&gt;We go deep on this in &lt;a href="https://basecut.dev/blog/how-to-anonymize-pii-in-postgresql-for-development" rel="noopener noreferrer"&gt;How to Anonymize PII in PostgreSQL for Development&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Full Example with Basecut
&lt;/h2&gt;

&lt;p&gt;Here is a complete staging database setup using &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Basecut&lt;/a&gt;. Create a &lt;code&gt;basecut.yml&lt;/code&gt; at the root of your repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;staging'&lt;/span&gt;

&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;AND&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;plan&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;!=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:plan'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2025-10-01'&lt;/span&gt;
      &lt;span class="na"&gt;plan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;free'&lt;/span&gt;

&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;per_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2000&lt;/span&gt;
    &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100000&lt;/span&gt;

&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;users&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;notes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
      &lt;span class="na"&gt;stripe_customer_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hash&lt;/span&gt;
    &lt;span class="na"&gt;audit_logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;ip_address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fake_ip&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create the snapshot from production (or a read replica):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; basecut.yml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PRODUCTION_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And restore it to staging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot restore staging:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$STAGING_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire workflow. The snapshot is named and versioned — &lt;code&gt;staging:latest&lt;/code&gt; always points to the most recent one, and you can also restore a specific tagged version like &lt;code&gt;staging:v2&lt;/code&gt; if you need to pin a particular snapshot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Keep Your Staging Database Fresh
&lt;/h2&gt;

&lt;p&gt;A staging database that is three months old is almost as bad as one that does not exist. Edge cases you care about — new billing flows, new user types, new schema columns — are not in there yet.&lt;/p&gt;

&lt;p&gt;The simplest way to keep staging fresh is to trigger a snapshot refresh on a schedule. Wire it into an existing cron job, a scheduled GitHub Actions workflow, or your platform's scheduler. For example, to refresh every Monday at 2am:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# crontab entry (or a scheduled CI job)&lt;/span&gt;
0 2 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; 1 basecut snapshot create &lt;span class="nt"&gt;--config&lt;/span&gt; basecut.yml &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PRODUCTION_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
          basecut snapshot restore staging:latest &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$STAGING_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or trigger it as part of your CI pipeline whenever you cut a release branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Refresh staging database&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;BASECUT_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.BASECUT_API_KEY }}&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;basecut snapshot create --config basecut.yml --source "$PRODUCTION_DATABASE_URL"&lt;/span&gt;
    &lt;span class="s"&gt;basecut snapshot restore staging:latest --target "$STAGING_DATABASE_URL"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Weekly refreshes are usually enough. For teams doing frequent releases, a refresh on every release branch works well too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shared Staging vs Per-Developer Environments
&lt;/h2&gt;

&lt;p&gt;Staging environments fall into two patterns, and this approach works for both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared staging&lt;/strong&gt; (one environment, multiple developers): the workflow above applies directly. Refresh it on a schedule. Everyone gets the same anonymized, realistic baseline. When someone corrupts state for testing, you restore again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-developer environments&lt;/strong&gt; (each developer has their own): this is actually easier. Each developer restores the same snapshot to their own local PostgreSQL instance. They can make whatever changes they need without affecting anyone else. When they want a fresh start, one command resets it. We cover this more in the &lt;a href="https://basecut.dev/use-cases/local-development" rel="noopener noreferrer"&gt;local development guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The main advantage of per-developer environments is independence — nobody is waiting for staging to settle down before they can test. The tradeoff is that each developer needs somewhere to run their own PostgreSQL instance, which is easy locally but less obvious for teams that rely entirely on remote environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Staging Database Setup Checklist
&lt;/h2&gt;

&lt;p&gt;Before you call it done, verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Snapshot restores cleanly (no FK violations, no missing extension errors)&lt;/li&gt;
&lt;li&gt;[ ] Application runs against it without crashing&lt;/li&gt;
&lt;li&gt;[ ] No real PII visible in common queries (&lt;code&gt;SELECT email FROM users LIMIT 10&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;[ ] Referential integrity intact (spot-check a few joined queries)&lt;/li&gt;
&lt;li&gt;[ ] Row counts are reasonable (not 12 rows, not 5 million rows)&lt;/li&gt;
&lt;li&gt;[ ] Refresh process is automated and nobody is doing it manually&lt;/li&gt;
&lt;/ul&gt;
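
&lt;p&gt;The integrity and size checks can be scripted rather than eyeballed. A sketch, assuming the &lt;code&gt;users&lt;/code&gt; and &lt;code&gt;orders&lt;/code&gt; tables from the examples above; adapt the names to your schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Orphaned-row check: should return 0 if FK traversal worked
SELECT count(*) AS orphaned_orders
FROM orders o
LEFT JOIN users u ON u.id = o.user_id
WHERE u.id IS NULL;

-- Rough size check: n_live_tup is an estimate, which is fine here.
-- The largest tables should be in the hundreds or thousands of rows.
SELECT relname, n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;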

&lt;h2&gt;
  
  
  When a Full pg_dump Is Still the Right Answer
&lt;/h2&gt;

&lt;p&gt;To be fair: there are cases where a full dump is the right approach.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to test schema migrations against production-exact data before running them.&lt;/li&gt;
&lt;li&gt;You are debugging a specific production incident and need the exact rows to reproduce it.&lt;/li&gt;
&lt;li&gt;Your compliance requirements demand a production-identical environment for specific tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In those cases, the &lt;a href="https://www.postgresql.org/docs/current/app-pgdump.html" rel="noopener noreferrer"&gt;official &lt;code&gt;pg_dump&lt;/code&gt; reference&lt;/a&gt; is the right place to understand what a logical dump includes. The anonymization and subsetting workflow described here is for everything else: the far more common staging use cases where you want something fast, safe, and realistic, not a forensic copy of production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Staging databases are usually either out of date, full of PII, or both. The reason is that setting them up properly has never been easy enough for teams to actually do it.&lt;/p&gt;

&lt;p&gt;A scripted subset + anonymize + restore workflow fixes all of this at once. The result is a staging environment that is fast to restore, safe to share, realistic enough to catch real bugs, and easy to keep fresh.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://basecut.dev/blog/how-to-set-up-a-staging-database-from-production-postgresql" rel="noopener noreferrer"&gt;basecut.dev&lt;/a&gt;. If you found this useful, &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Basecut&lt;/a&gt; is the tool described here — free for small teams.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>devops</category>
      <category>database</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Replace Seed Scripts with Production Snapshots</title>
      <dc:creator>Jake Lazarus</dc:creator>
      <pubDate>Thu, 26 Mar 2026 14:35:00 +0000</pubDate>
      <link>https://forem.com/jakelaz/how-to-replace-seed-scripts-with-production-snapshots-40ng</link>
      <guid>https://forem.com/jakelaz/how-to-replace-seed-scripts-with-production-snapshots-40ng</guid>
      <description>&lt;p&gt;Seed scripts are technical debt that nobody tracks.&lt;/p&gt;

&lt;p&gt;They start as a convenience — a few &lt;code&gt;INSERT&lt;/code&gt; statements so the app boots locally — and they end up as a 400-line file that touches 30 tables, breaks on every third migration, and produces data that looks nothing like production. Everyone knows the seed script is bad. Nobody wants to fix it, because fixing it means rewriting it, and it will just rot again.&lt;/p&gt;

&lt;p&gt;The usual response is to invest in a better seed script — more tables, better relationships, more realistic values. But the underlying issue is not the quality of the script. It is that hand-crafting test data stops scaling as schema complexity grows.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Define what data you need in a YAML config, extract a subset from production with PII anonymized, and restore it anywhere. No INSERT statements to maintain — the snapshot stays current automatically.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why database seed scripts break as projects grow
&lt;/h2&gt;

&lt;p&gt;Seed scripts have real advantages early on: they are version-controlled, deterministic, and easy to understand. But those advantages erode as the schema grows, and three problems start compounding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Schema drift
&lt;/h3&gt;

&lt;p&gt;Every migration is a chance for the seed script to break. A new &lt;code&gt;NOT NULL&lt;/code&gt; column, a renamed FK, a dropped table — each one needs a corresponding update to the seed file. Those updates happen late or not at all. The person writing the migration is thinking about the migration, not about whether &lt;code&gt;seed.sql&lt;/code&gt; still runs.&lt;/p&gt;

&lt;p&gt;The script does not fail loudly. It just produces increasingly stale data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Manual referential integrity
&lt;/h3&gt;

&lt;p&gt;In production, an order belongs to a user, references line items, connects to shipments and payments. In a seed script, you maintain all of those relationships by hand. Every ID, every FK, every cross-table reference. Miss one and you get constraint violations or, worse, data that loads fine but makes the app behave in ways it never would in production.&lt;/p&gt;
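
&lt;p&gt;The ordering problem is easy to demonstrate. With a hypothetical &lt;code&gt;orders&lt;/code&gt; table whose &lt;code&gt;user_id&lt;/code&gt; references &lt;code&gt;users&lt;/code&gt;, a seed script that inserts a child row before its parent fails immediately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Seed scripts must insert parents before children by hand.
-- If user 42 has not been inserted yet, this fails:
INSERT INTO orders (id, user_id, total) VALUES (1, 42, 99.00);
-- ERROR:  insert or update on table "orders" violates foreign key
--         constraint "orders_user_id_fkey"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;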

&lt;h3&gt;
  
  
  Flat data
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Test User 1&lt;/code&gt; with &lt;code&gt;test@example.com&lt;/code&gt; and two orders is structurally valid. It is not useful. Real users have Unicode in their names. Real accounts have nullable fields that are actually null. Real customers have 47 orders accumulated over two years, with edge cases nobody thought to fabricate.&lt;/p&gt;

&lt;p&gt;The bugs that reach production are usually triggered by data shapes that did not exist in the seed script, because nobody anticipated them. We covered this in more detail in &lt;a href="https://basecut.dev/blog/why-fake-postgresql-test-data-misses-bugs" rel="noopener noreferrer"&gt;Why Fake PostgreSQL Test Data Misses Real Bugs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production database snapshots: the seed script alternative
&lt;/h2&gt;

&lt;p&gt;Instead of fabricating data, extract it.&lt;/p&gt;

&lt;p&gt;Not with &lt;code&gt;pg_dump&lt;/code&gt; — that copies everything, including all the PII you should not have in dev and all the volume you do not need. The &lt;a href="https://www.thoughtworks.com/en-us/radar/techniques/production-data-in-test-environments" rel="noopener noreferrer"&gt;Thoughtworks Technology Radar&lt;/a&gt; explicitly recommends against using raw production data in test environments for exactly this reason — the privacy and security risks outweigh the convenience. Instead, extract a &lt;strong&gt;subset&lt;/strong&gt;: a small, connected slice of production data with sensitive fields anonymized during extraction.&lt;/p&gt;

&lt;p&gt;What you get is a snapshot that reflects the real schema, has valid relationships because they were followed rather than hand-coded, and contains real data shapes because they came from production. It also requires no maintenance, because the next snapshot picks up schema changes automatically.&lt;/p&gt;

&lt;p&gt;This is the workflow &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Basecut&lt;/a&gt; was built for. Define what to extract, run one command, restore to any database.&lt;/p&gt;

&lt;h2&gt;
  
  
  How database subsetting works in practice
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Define what to extract
&lt;/h3&gt;

&lt;p&gt;Instead of INSERT statements, you describe the shape of the data you want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dev-data'&lt;/span&gt;

&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-01-01'&lt;/span&gt;

&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;per_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
    &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;

&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start from recent users, follow FKs to collect related data, cap the size, auto-detect and anonymize PII. You can add explicit anonymization rules when you need them — we cover that in &lt;a href="https://basecut.dev/blog/how-to-anonymize-pii-in-postgresql-for-development" rel="noopener noreferrer"&gt;How to Anonymize PII in PostgreSQL for Development&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The important difference from a seed script: this config describes &lt;em&gt;what to extract&lt;/em&gt;, not &lt;em&gt;what to insert&lt;/em&gt;. New columns, new tables, and new relationships get picked up on the next snapshot without touching the config.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Create a snapshot
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot create &lt;span class="nt"&gt;--config&lt;/span&gt; basecut.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basecut connects to your database (or a read replica), traverses relationships, anonymizes PII inline, and writes the result. No real PII ever leaves production.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Restore locally
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot restore dev-data:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOCAL_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the local dev setup. A new developer joining the team runs two commands and has a working database full of realistic, already-anonymized data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Basecut handles all of this in one CLI command.&lt;/strong&gt; &lt;a href="https://docs.basecut.dev/getting-started/quick-start" rel="noopener noreferrer"&gt;See the quickstart →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  4. Share it
&lt;/h3&gt;

&lt;p&gt;Once a snapshot exists, anyone on the team can restore it. Everyone works against the same fixture data, which means bugs are reproducible across machines and "works on my machine" stops being about data differences.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Seed script&lt;/th&gt;
&lt;th&gt;Production snapshot&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Schema changes&lt;/td&gt;
&lt;td&gt;Breaks until someone updates it&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FK integrity&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Followed from the database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data realism&lt;/td&gt;
&lt;td&gt;Fabricated&lt;/td&gt;
&lt;td&gt;Real, anonymized&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PII risk&lt;/td&gt;
&lt;td&gt;None (but no realism either)&lt;/td&gt;
&lt;td&gt;Handled at extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maintenance&lt;/td&gt;
&lt;td&gt;Grows with the schema&lt;/td&gt;
&lt;td&gt;Near zero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Onboarding&lt;/td&gt;
&lt;td&gt;Run, debug, ask for help&lt;/td&gt;
&lt;td&gt;Restore, start working&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edge cases&lt;/td&gt;
&lt;td&gt;Only what someone added&lt;/td&gt;
&lt;td&gt;Whatever production has&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;More detail in our &lt;a href="https://basecut.dev/vs/seed-scripts" rel="noopener noreferrer"&gt;seed scripts vs Basecut comparison&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using production snapshots in CI/CD pipelines
&lt;/h2&gt;

&lt;p&gt;The same snapshot works in CI. Replace the seed script step with a restore:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Test&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:15&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test_db&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;5432:5432&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Basecut CLI&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -fsSL https://basecut.dev/install.sh | sh&lt;/span&gt;
          &lt;span class="s"&gt;echo "$HOME/.local/bin" &amp;gt;&amp;gt; $GITHUB_PATH&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Restore snapshot&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;BASECUT_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.BASECUT_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;basecut snapshot restore dev-data:latest \&lt;/span&gt;
            &lt;span class="s"&gt;--target "postgresql://postgres:postgres@localhost:5432/test_db"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;New tables and columns from migrations show up in the next snapshot. No CI config changes needed.&lt;/p&gt;

&lt;p&gt;More on this in our &lt;a href="https://basecut.dev/use-cases/ci-cd-test-data" rel="noopener noreferrer"&gt;CI/CD test data guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  When seed scripts still make sense
&lt;/h2&gt;

&lt;p&gt;Seed scripts are the right tool in some situations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pre-launch projects&lt;/strong&gt; with no production data yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intentionally fictional demos&lt;/strong&gt; where you need a specific scenario.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unit tests&lt;/strong&gt; that need three rows in one table. A snapshot is overkill there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small schemas&lt;/strong&gt; where the maintenance cost is genuinely low.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your schema has fewer than ten tables and no PII, a seed script is probably the right choice. The crossover point is usually obvious — it is when maintaining the seed file takes more effort than it saves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migrating off the seed script
&lt;/h2&gt;

&lt;p&gt;You do not have to rip it out in one PR.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with one workflow.&lt;/strong&gt; Pick the place where the seed script hurts most — usually dev environment setup or CI. Set up a snapshot and run it alongside the seed script.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compare.&lt;/strong&gt; Run the app against both datasets. The snapshot will usually expose things the seed script missed: edge cases, data shapes that only exist in production, relationships that only worked because the script inserted rows in a specific order.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Switch gradually.&lt;/strong&gt; Replace the seed script where the snapshot is better. Keep it for unit tests or demos if it still makes sense.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Let it go.&lt;/strong&gt; Once the snapshot covers your main workflows, stop maintaining the seed script. Do not delete it if people reference it — just stop investing in keeping it current.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
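
&lt;p&gt;For the comparison step, a per-table row count run against both databases makes coverage gaps obvious. A sketch using PostgreSQL's statistics views (the counts are estimates, which is enough for a side-by-side comparison):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Run against the seeded database and the restored snapshot,
-- then diff the two outputs to see what the seed script never covered
SELECT relname AS table_name, n_live_tup AS approx_rows
FROM pg_stat_user_tables
ORDER BY relname;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;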

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;The seed script is one of those things that works well enough to never get fixed. It seeds the database. The app boots. Nobody wants to touch it.&lt;/p&gt;

&lt;p&gt;The problem is that "well enough" slowly gets worse. The schema changes, the data drifts, the edge cases multiply, and the gap between what the seed script produces and what production looks like gets wider every quarter.&lt;/p&gt;

&lt;p&gt;Production snapshots close that gap by removing the maintenance entirely. The data stays current because it comes from the real database. The relationships stay valid because they are followed, not written by hand. And anonymization is part of the process rather than a separate step someone has to remember.&lt;/p&gt;

&lt;p&gt;If your seed script is the file nobody wants to own, maybe the right move is making sure nobody has to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Get started in minutes.&lt;/strong&gt; Basecut extracts FK-aware, anonymized snapshots from PostgreSQL with one CLI command. Free for small teams. &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Try Basecut free →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or explore first: &lt;a href="https://docs.basecut.dev/getting-started/quick-start" rel="noopener noreferrer"&gt;quickstart guide&lt;/a&gt; · &lt;a href="https://docs.basecut.dev/configuration/snapshots" rel="noopener noreferrer"&gt;snapshot config reference&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>testing</category>
      <category>database</category>
      <category>devtools</category>
    </item>
    <item>
      <title>How to Anonymize PII in PostgreSQL for Development</title>
      <dc:creator>Jake Lazarus</dc:creator>
      <pubDate>Tue, 24 Mar 2026 14:43:00 +0000</pubDate>
      <link>https://forem.com/jakelaz/how-to-anonymize-pii-in-postgresql-for-development-hb2</link>
      <guid>https://forem.com/jakelaz/how-to-anonymize-pii-in-postgresql-for-development-hb2</guid>
      <description>&lt;p&gt;Ask any developer whether their local database has real customer data in it, and most will say no.&lt;/p&gt;

&lt;p&gt;Ask them to check, and most will find that it does.&lt;/p&gt;

&lt;p&gt;Real emails in &lt;code&gt;users&lt;/code&gt;. Real names in &lt;code&gt;profiles&lt;/code&gt;. Real billing addresses in &lt;code&gt;payments&lt;/code&gt;. Real IP addresses in &lt;code&gt;audit_logs&lt;/code&gt;. Data that landed in production, got copied somewhere for debugging, and has been sitting in local databases and CI pipelines ever since.&lt;/p&gt;

&lt;p&gt;This is not a hypothetical compliance problem. It is a real one, and it gets messier the longer it goes unaddressed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What counts as PII in a PostgreSQL database
&lt;/h2&gt;

&lt;p&gt;PII is broader than most developers expect. The obvious fields are easy to spot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;email&lt;/code&gt;, &lt;code&gt;email_address&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;first_name&lt;/code&gt;, &lt;code&gt;last_name&lt;/code&gt;, &lt;code&gt;full_name&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;phone&lt;/code&gt;, &lt;code&gt;phone_number&lt;/code&gt;, &lt;code&gt;mobile&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;address&lt;/code&gt;, &lt;code&gt;street_address&lt;/code&gt;, &lt;code&gt;city&lt;/code&gt;, &lt;code&gt;postal_code&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;date_of_birth&lt;/code&gt;, &lt;code&gt;dob&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ssn&lt;/code&gt;, &lt;code&gt;national_id&lt;/code&gt;, &lt;code&gt;tax_id&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But in real production schemas, PII hides in less obvious places:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;free-text fields like &lt;code&gt;notes&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, &lt;code&gt;bio&lt;/code&gt; that users fill in&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ip_address&lt;/code&gt; columns in event logs and audit tables&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stripe_customer_id&lt;/code&gt;, &lt;code&gt;paypal_email&lt;/code&gt; — identifiers that link back to real people&lt;/li&gt;
&lt;li&gt;JSONB columns that store user-submitted form data&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;metadata&lt;/code&gt; fields that accumulate whatever the app was logging at the time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have been copying your production database to dev environments without systematically anonymizing those fields, that data is on developer laptops, in CI logs, and probably in Slack at some point.&lt;/p&gt;
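
&lt;p&gt;A first pass at an inventory can come straight from the catalog. The sketch below flags columns whose names match common PII patterns; free-text and JSONB columns still need manual review, because the column name alone tells you nothing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- List likely PII columns by name pattern (a starting point, not a guarantee)
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
  AND column_name ~* '(email|name|phone|address|dob|birth|ssn|tax|ip_address)'
ORDER BY table_name, column_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;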

&lt;h2&gt;
  
  
  Why this matters beyond just being careful
&lt;/h2&gt;

&lt;p&gt;GDPR, CCPA, HIPAA, and most other data protection frameworks have something in common: they do not distinguish between production and non-production environments. If you are processing personal data in a development environment without appropriate controls, you are in scope.&lt;/p&gt;

&lt;p&gt;In practice, the consequences are usually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GDPR Article 25&lt;/strong&gt;: "data protection by design and by default" — development tooling is explicitly in scope&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SOC 2 Type II&lt;/strong&gt;: data handling controls are audited across environments, not just production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HIPAA minimum necessary rule&lt;/strong&gt;: uses and disclosures of PHI must be limited to the minimum necessary for the intended purpose, a bar that development environments rarely meet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Beyond compliance, there is a simpler reason: real customer data in dev environments is one of the most common sources of accidental exposure. A developer shares a failing test case on Slack. A CI artifact gets retained with real names in it. A staging database backup ends up in a public S3 bucket.&lt;/p&gt;

&lt;p&gt;The fix is not more policies. It is removing the real data from the environments where it should not be.&lt;/p&gt;

&lt;h2&gt;
  
  
  The naive approach: UPDATE statements after restore
&lt;/h2&gt;

&lt;p&gt;The most common first attempt at anonymization looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Run after restoring a pg_dump to dev&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt;
  &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'user'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;first_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Test'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;last_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'User'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt;
  &lt;span class="n"&gt;shipping_address&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'123 Test St'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works well enough until it does not.&lt;/p&gt;

&lt;p&gt;The problems start to accumulate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Someone forgets to run the script, and real data ends up in a dev environment anyway.&lt;/li&gt;
&lt;li&gt;The script is not versioned with the schema, so it breaks when new PII columns are added.&lt;/li&gt;
&lt;li&gt;It replaces data inconsistently — the same customer gets a different fake email in &lt;code&gt;users&lt;/code&gt; than in &lt;code&gt;audit_logs&lt;/code&gt;, breaking join-based queries.&lt;/li&gt;
&lt;li&gt;It runs after the fact, which means real data has already traveled through the restore pipeline.&lt;/li&gt;
&lt;li&gt;It has no automated detection — every new PII column has to be added manually.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the pattern that eventually lands teams in trouble. It feels like a solution because it works most of the time. It fails when someone does not run it, or when a new field gets added and nobody updates the script. It shares the same fundamental problem as &lt;a href="https://basecut.dev/vs/seed-scripts" rel="noopener noreferrer"&gt;seed scripts&lt;/a&gt; — manual upkeep that silently falls behind.&lt;/p&gt;

&lt;h2&gt;
  
  
  The better approach: anonymize at extraction time
&lt;/h2&gt;

&lt;p&gt;The more reliable pattern is to anonymize the data before it ever leaves the production environment, not after it arrives in dev. This applies whether you are setting up a &lt;a href="https://basecut.dev/use-cases/local-development" rel="noopener noreferrer"&gt;local development environment&lt;/a&gt; or populating CI databases.&lt;/p&gt;

&lt;p&gt;That means the anonymization step is baked into the snapshot process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Connect to production (or a read replica).&lt;/li&gt;
&lt;li&gt;Extract the rows you need.&lt;/li&gt;
&lt;li&gt;Anonymize sensitive fields inline, during extraction.&lt;/li&gt;
&lt;li&gt;Write the already-anonymized snapshot to wherever it will be stored.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result is that no real PII ever travels to dev environments. What gets restored is already masked.&lt;/p&gt;

&lt;p&gt;This matters because it removes the "forget to run the script" failure mode entirely. There is no post-restore step to forget.&lt;/p&gt;
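&lt;p&gt;The core of the pattern fits in a few lines. Here is a toy sketch (not Basecut's implementation — the columns and masking rules are illustrative) showing masking applied inside the extraction loop itself, so unmasked values never reach the output file:&lt;/p&gt;

```python
import csv
import hashlib
import io

def mask_email(value: str) -> str:
    # Derive a stable fake address from a hash of the real one,
    # so the same input always maps to the same output.
    digest = hashlib.sha256(value.encode()).hexdigest()[:10]
    return f"user_{digest}@example.com"

MASKS = {"email": mask_email}  # column -> masking function (illustrative)

def extract(rows, columns, out):
    # Masking happens inline, during extraction: the snapshot
    # only ever contains anonymized values.
    writer = csv.writer(out)
    writer.writerow(columns)
    for row in rows:
        masked = [MASKS.get(col, lambda v: v)(val)
                  for col, val in zip(columns, row)]
        writer.writerow(masked)

rows = [(1, "jane@company.com"), (2, "bob@company.com")]
buf = io.StringIO()
extract(rows, ["id", "email"], buf)
print(buf.getvalue())
```

&lt;p&gt;Because the raw value is replaced before it is written anywhere, there is nothing downstream to forget to clean up.&lt;/p&gt;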

&lt;h2&gt;
  
  
  What good anonymization actually requires
&lt;/h2&gt;

&lt;p&gt;Replacing real values with fake ones is straightforward. Making the fake values behave like real data is harder.&lt;/p&gt;

&lt;p&gt;A few requirements come up in practice:&lt;/p&gt;

&lt;h3&gt;
  
  
  Realistic fake values
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;email = 'test@example.com'&lt;/code&gt; is easy to write and easy to spot. It does not behave like real email data in filtering, search, or display.&lt;/p&gt;

&lt;p&gt;Better: generate realistic-looking fake emails that follow the same structure. &lt;code&gt;lmitchell@example.com&lt;/code&gt; is harder to accidentally mistake for test data, and it exercises the same code paths as real emails. We cover why this realism matters for catching bugs in &lt;a href="https://basecut.dev/blog/why-fake-postgresql-test-data-misses-bugs" rel="noopener noreferrer"&gt;Why Fake PostgreSQL Test Data Misses Real Bugs&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deterministic masking
&lt;/h3&gt;

&lt;p&gt;If &lt;code&gt;jane@company.com&lt;/code&gt; maps to &lt;code&gt;lmitchell@example.com&lt;/code&gt; in &lt;code&gt;users&lt;/code&gt;, it should map to the same fake email everywhere it appears — in &lt;code&gt;audit_logs&lt;/code&gt;, &lt;code&gt;notifications&lt;/code&gt;, &lt;code&gt;email_events&lt;/code&gt;, and anywhere else it is referenced.&lt;/p&gt;

&lt;p&gt;Without deterministic masking, the same source value produces different fake values in different tables. Queries that join across tables start returning mismatched or missing results. The data looks restored but does not behave like the real system.&lt;/p&gt;
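&lt;p&gt;One simple way to get this property — a sketch, not Basecut's actual algorithm — is to derive the fake value from a keyed hash of the source value, so every table that contains &lt;code&gt;jane@company.com&lt;/code&gt; receives the same replacement:&lt;/p&gt;

```python
import hmac
import hashlib

# Stable per-project seed (assumption: kept constant across snapshot runs)
SEED = b"per-project-secret-seed"

NAMES = ["alice", "lmitchell", "jsmith", "kpatel"]

def deterministic_email(real_email: str) -> str:
    # Keyed hash: same input + same seed -> same output,
    # but the mapping cannot be reversed without the seed.
    mac = hmac.new(SEED, real_email.encode(), hashlib.sha256).digest()
    name = NAMES[mac[0] % len(NAMES)]
    return f"{name}.{mac.hex()[:6]}@example.com"

# The same source value masks identically wherever it appears:
print(deterministic_email("jane@company.com"))
print(deterministic_email("jane@company.com"))
```

&lt;p&gt;Joins across &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;audit_logs&lt;/code&gt;, and &lt;code&gt;email_events&lt;/code&gt; keep working because the substitution is a pure function of the source value.&lt;/p&gt;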

&lt;h3&gt;
  
  
  FK-aware scope
&lt;/h3&gt;

&lt;p&gt;Anonymization cannot happen table-by-table in isolation. If &lt;code&gt;order_id = 1001&lt;/code&gt; belongs to a user whose real name you are masking, the anonymization needs to be consistent across &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;billing_addresses&lt;/code&gt;, and everything else connected to that user.&lt;/p&gt;

&lt;h3&gt;
  
  
  Coverage of unknown columns
&lt;/h3&gt;

&lt;p&gt;Manually listing every PII column to anonymize only works until someone adds a new one and forgets to update the anonymization config. Auto-detection — pattern matching on column names, types, and values — catches fields that have not been explicitly listed yet.&lt;/p&gt;
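&lt;p&gt;A minimal version of name-based detection is just pattern matching over the schema's column names (the pattern list below is illustrative — real detection also inspects types and sampled values):&lt;/p&gt;

```python
import re

# Illustrative column-name patterns that commonly indicate PII
PII_PATTERNS = [
    r"email", r"(first|last|full)?_?name$", r"phone",
    r"address", r"ssn", r"birth", r"ip_address",
]

def detect_pii_columns(columns):
    # Flag any (table, column) pair whose name matches a known pattern.
    flagged = []
    for table, column in columns:
        if any(re.search(p, column, re.IGNORECASE) for p in PII_PATTERNS):
            flagged.append((table, column))
    return flagged

schema = [
    ("users", "id"),
    ("users", "email"),
    ("users", "last_name"),
    ("orders", "shipping_address"),
    ("orders", "total_cents"),
]
print(detect_pii_columns(schema))
```

&lt;p&gt;Even this crude version catches a newly added &lt;code&gt;shipping_address&lt;/code&gt; column that nobody thought to list in the config.&lt;/p&gt;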

&lt;h2&gt;
  
  
  What Basecut's anonymization config looks like
&lt;/h2&gt;

&lt;p&gt;In Basecut, anonymization is part of the snapshot config. You can enable automatic detection, add explicit rules for specific columns, or mix both.&lt;/p&gt;

&lt;p&gt;The simplest version uses auto-detection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dev-snapshot'&lt;/span&gt;

&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-01-01'&lt;/span&gt;

&lt;span class="na"&gt;traverse&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;parents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;children&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;per_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
    &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;

&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;mode: auto&lt;/code&gt;, Basecut scans column names and data patterns to detect likely PII fields and applies sensible defaults. Emails become realistic fake emails. Names become realistic fake names. Phone numbers become valid-format fake phone numbers.&lt;/p&gt;

&lt;p&gt;For fields that need explicit control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users.notes&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clear&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users.profile_image_url&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clear&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments.card_last_four&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;preserve&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;audit_logs.ip_address&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fake_ip&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users.external_id&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hash&lt;/span&gt;
      &lt;span class="na"&gt;deterministic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The strategies map to real behaviors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;fake_*&lt;/code&gt; — generate realistic-looking fake values (email, name, phone, address, IP)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;clear&lt;/code&gt; — replace with &lt;code&gt;NULL&lt;/code&gt; or empty string&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;preserve&lt;/code&gt; — keep as-is (for fields that are not sensitive but look like they might be)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hash&lt;/code&gt; — consistent one-way hash, useful for IDs that need to be consistent but not reversible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deterministic masking is on by default for most strategies. Basecut uses a stable seed so the same source value always produces the same output, which keeps join queries working correctly across tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Org-wide policies
&lt;/h2&gt;

&lt;p&gt;One problem with per-snapshot anonymization configs is that each team or developer can configure their own rules — which means someone will configure them wrong.&lt;/p&gt;

&lt;p&gt;For teams with compliance requirements, Basecut supports org-wide anonymization policies: rules defined once at the organization level that apply to every snapshot, regardless of who creates it. An individual snapshot config can add rules but cannot remove or override org-level ones.&lt;/p&gt;

&lt;p&gt;This is how you get consistent anonymization without relying on every developer to configure it correctly every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this looks like in CI
&lt;/h2&gt;

&lt;p&gt;The same anonymization config that runs locally runs in CI. If you are restoring a snapshot before your test suite, the data arriving in your CI pipeline is already masked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Test&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:15&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test_db&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;5432:5432&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Basecut CLI&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -fsSL https://basecut.dev/install.sh | sh&lt;/span&gt;
          &lt;span class="s"&gt;echo "$HOME/.local/bin" &amp;gt;&amp;gt; $GITHUB_PATH&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Restore anonymized snapshot&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;BASECUT_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.BASECUT_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;basecut snapshot restore dev-snapshot:latest \&lt;/span&gt;
            &lt;span class="s"&gt;--target "postgresql://postgres:postgres@localhost:5432/test_db"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CI pipeline gets realistic data shapes and relationships without ever touching real PII. That is also worth noting for audits: you can show that your CI environment runs against masked data by design, not by policy.&lt;/p&gt;

&lt;p&gt;We go deeper on the CI setup in our &lt;a href="https://basecut.dev/use-cases/ci-cd-test-data" rel="noopener noreferrer"&gt;CI/CD test data guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use pg_anonymizer (and when not to)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://postgresql-anonymizer.readthedocs.io/" rel="noopener noreferrer"&gt;postgresql_anonymizer&lt;/a&gt; is a PostgreSQL extension worth knowing about. It provides declarative masking rules at the database level, which is useful in some situations — particularly if you want to expose a masked view of production data to specific roles.&lt;/p&gt;

&lt;p&gt;A few places where it fits less well for dev workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It requires installing an extension in production, which many teams are not able or willing to do.&lt;/li&gt;
&lt;li&gt;It anonymizes in-place or via views, not during extraction — the data still travels to dev before masking.&lt;/li&gt;
&lt;li&gt;It does not handle subsetting, so you are still copying the full database.&lt;/li&gt;
&lt;li&gt;It relies on explicit per-column masking rules; its detection of new, unlisted columns is limited compared to pattern-based scanning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For teams that want anonymization as part of a broader snapshot + subset + restore workflow, extraction-time masking is usually cleaner. You can see more in our &lt;a href="https://basecut.dev/vs/pg-dump" rel="noopener noreferrer"&gt;comparison with pg_dump&lt;/a&gt;. For a look at how Basecut compares to commercial alternatives like &lt;a href="https://basecut.dev/vs/tonic" rel="noopener noreferrer"&gt;Tonic&lt;/a&gt; and &lt;a href="https://basecut.dev/vs/delphix" rel="noopener noreferrer"&gt;Delphix&lt;/a&gt;, see our comparison pages.&lt;/p&gt;

&lt;h2&gt;
  
  
  A practical rollout
&lt;/h2&gt;

&lt;p&gt;If you have never done systematic anonymization before, the cleanest path is a gradual rollout.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Audit first.&lt;/strong&gt;&lt;br&gt;
Run a query across your schema to find columns that look like PII: &lt;code&gt;email&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;phone&lt;/code&gt;, &lt;code&gt;address&lt;/code&gt;, any JSONB column with user-submitted data. You will find more than you expect.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with auto-detection.&lt;/strong&gt;&lt;br&gt;
Let the tooling make a first pass at detection. Review what it finds, add explicit rules for anything it missed or got wrong.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run a test snapshot.&lt;/strong&gt;&lt;br&gt;
Restore it to a local dev database and check: does the data look anonymized? Do join queries still work? Does the app behave normally with it?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Commit the config.&lt;/strong&gt;&lt;br&gt;
Your anonymization rules should live in version control alongside your schema. When someone adds a new PII column, the config update is part of the same PR.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enforce at the org level.&lt;/strong&gt;&lt;br&gt;
Once your rules are stable, promote the critical ones to org-wide policies so they apply regardless of who runs the snapshot.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
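&lt;p&gt;For step 1, a catalog query gets you most of the way. This is a starting point for PostgreSQL — the name patterns are illustrative, so extend them for your schema:&lt;/p&gt;

```sql
-- Find columns whose names suggest PII, plus JSONB columns
-- that may hold user-submitted data (pattern list is illustrative)
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
  AND (column_name ~* 'email|name|phone|address|ssn|birth|ip'
       OR data_type = 'jsonb')
ORDER BY table_name, column_name;
```

&lt;p&gt;Treat the result as a review list, not a verdict — some matches will be false positives worth a &lt;code&gt;preserve&lt;/code&gt; rule.&lt;/p&gt;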

&lt;p&gt;The common mistake is treating anonymization as a post-restore step rather than part of the data pipeline. Once it is baked into snapshot creation, the risk of it being skipped drops to near zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;The reason most dev databases have real PII in them is not malice. It is that anonymization was not built into the default workflow — it was an afterthought, a script someone ran sometimes, a step in a doc nobody updated.&lt;/p&gt;

&lt;p&gt;The fix is simple in principle: make anonymization happen at extraction time, not after the fact. Every snapshot that gets restored to dev, CI, or staging should arrive already masked.&lt;/p&gt;

&lt;p&gt;If your team is still relying on manual cleanup scripts, or skipping anonymization entirely, that is the thing worth fixing first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Get started with Basecut's anonymization.&lt;/strong&gt; The CLI auto-detects common PII fields and applies masking during snapshot creation. Free for small teams. &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Try Basecut free →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or dig into the details first: &lt;a href="https://docs.basecut.dev/configuration/anonymization" rel="noopener noreferrer"&gt;anonymization config reference&lt;/a&gt; · &lt;a href="https://docs.basecut.dev/core-concepts/how-it-works" rel="noopener noreferrer"&gt;how FK-aware extraction works&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>security</category>
      <category>devops</category>
      <category>database</category>
    </item>
    <item>
      <title>Why Fake PostgreSQL Test Data Misses Real Bugs</title>
      <dc:creator>Jake Lazarus</dc:creator>
      <pubDate>Wed, 18 Mar 2026 05:00:00 +0000</pubDate>
      <link>https://forem.com/jakelaz/why-fake-postgresql-test-data-misses-real-bugs-3jco</link>
      <guid>https://forem.com/jakelaz/why-fake-postgresql-test-data-misses-real-bugs-3jco</guid>
      <description>&lt;p&gt;Most teams do not have a testing problem. They have a &lt;strong&gt;test data realism&lt;/strong&gt; problem.&lt;/p&gt;

&lt;p&gt;Locally, the app runs against &lt;code&gt;test@example.com&lt;/code&gt;, &lt;code&gt;User 1&lt;/code&gt;, and a seed script nobody wants to maintain. In CI, fixtures slowly drift away from reality. Then the bugs show up after deploy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a customer name has an apostrophe or accent&lt;/li&gt;
&lt;li&gt;a field is &lt;code&gt;NULL&lt;/code&gt; where your code assumed a string&lt;/li&gt;
&lt;li&gt;an account has 47 related records instead of 2&lt;/li&gt;
&lt;li&gt;a query that worked on 20 rows falls over on 20,000&lt;/li&gt;
&lt;li&gt;shared staging data gets mutated by three people at once&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If that sounds familiar, the answer usually is not "write more tests." It is "stop testing against fake data."&lt;/p&gt;

&lt;h2&gt;
  
  
  What teams actually want
&lt;/h2&gt;

&lt;p&gt;What most teams actually want is not a full copy of production, not a giant &lt;code&gt;pg_dump&lt;/code&gt;, and not another 400-line seed script. They want &lt;strong&gt;production-like data&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;realistic enough to expose bugs&lt;/li&gt;
&lt;li&gt;small enough to restore locally and in CI&lt;/li&gt;
&lt;li&gt;safe enough to use outside production&lt;/li&gt;
&lt;li&gt;reproducible enough that every developer can get the same result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the gap most dev workflows never solve cleanly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the usual approaches break down
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Seed scripts rot
&lt;/h3&gt;

&lt;p&gt;Seed scripts are fine when your app has five tables. They get painful when your schema grows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;every migration breaks something&lt;/li&gt;
&lt;li&gt;relationships get harder to maintain&lt;/li&gt;
&lt;li&gt;the data gets less realistic over time&lt;/li&gt;
&lt;li&gt;nobody wants to own the script anymore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You end up with a setup that is reproducible, but not especially useful. We wrote a deeper &lt;a href="https://basecut.dev/vs/seed-scripts" rel="noopener noreferrer"&gt;comparison of seed scripts vs production snapshots&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;pg_dump&lt;/code&gt; is great for backups, not dev environments
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;pg_dump&lt;/code&gt; solves a different problem. It copies everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;all rows&lt;/li&gt;
&lt;li&gt;all tables&lt;/li&gt;
&lt;li&gt;all PII&lt;/li&gt;
&lt;li&gt;all the size and baggage of production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is useful for backup and recovery. It is usually overkill for local development and CI.&lt;/p&gt;

&lt;p&gt;For dev workflows, full dumps create new problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;slow restores&lt;/li&gt;
&lt;li&gt;bloated local databases&lt;/li&gt;
&lt;li&gt;longer CI jobs&lt;/li&gt;
&lt;li&gt;sensitive data showing up in places it should not&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most of the time, you do not need the entire database. You need the &lt;strong&gt;right slice&lt;/strong&gt; of it. We wrote a &lt;a href="https://basecut.dev/vs/pg-dump" rel="noopener noreferrer"&gt;full comparison of pg_dump vs Basecut&lt;/a&gt; if you want the details.&lt;/p&gt;

&lt;h2&gt;
  
  
  The better pattern: subset, anonymize, restore
&lt;/h2&gt;

&lt;p&gt;The workflow that makes sense looks more like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start from one or more root tables.&lt;/li&gt;
&lt;li&gt;Filter to the rows you actually care about.&lt;/li&gt;
&lt;li&gt;Follow foreign keys to pull in the connected data.&lt;/li&gt;
&lt;li&gt;Anonymize sensitive fields inline.&lt;/li&gt;
&lt;li&gt;Save the snapshot.&lt;/li&gt;
&lt;li&gt;Restore it anywhere you need it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That gives you a connected, realistic, privacy-safe subset of production instead of a raw copy. This is the workflow we built &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Basecut&lt;/a&gt; around for PostgreSQL: FK-aware extraction, automatic PII anonymization, and one-command restores for local dev, CI, and debugging.&lt;/p&gt;

&lt;p&gt;The reason this approach works is simple: it treats test data as a &lt;strong&gt;repeatable snapshot problem&lt;/strong&gt;, not a hand-crafted fixture problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "production-like" should actually mean
&lt;/h2&gt;

&lt;p&gt;The phrase gets used loosely. In practice, production-like data should have four properties.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Realistic structure
&lt;/h3&gt;

&lt;p&gt;It should reflect the real relationships, optional fields, and edge cases in your schema.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Referential integrity
&lt;/h3&gt;

&lt;p&gt;If you copy one row from &lt;code&gt;orders&lt;/code&gt;, you usually also need related rows from &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;line_items&lt;/code&gt;, &lt;code&gt;shipments&lt;/code&gt;, and whatever else your app expects to exist together.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Privacy safety
&lt;/h3&gt;

&lt;p&gt;Emails, names, phone numbers, addresses, and other sensitive fields need to be anonymized before the data lands on laptops, CI runners, or logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Repeatability
&lt;/h3&gt;

&lt;p&gt;Developers need a predictable way to recreate the same kind of dataset without asking someone to send them a dump.&lt;/p&gt;

&lt;p&gt;If any one of those is missing, the workflow gets shaky fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why FK-aware extraction matters
&lt;/h2&gt;

&lt;p&gt;This is the part many DIY approaches get wrong. Randomly sampling rows from each table sounds easy until you restore them. Then you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;orders pointing to missing users&lt;/li&gt;
&lt;li&gt;line items pointing to missing products&lt;/li&gt;
&lt;li&gt;child rows without their parents&lt;/li&gt;
&lt;li&gt;failed restores or strange app behavior after restore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A useful snapshot has to behave like a self-contained mini-version of production.&lt;/p&gt;

&lt;p&gt;That is why FK-aware extraction matters. In Basecut, snapshots are built by following foreign keys in both directions and collecting a connected subgraph of your data. The result is something you can restore into an empty database without ending up with broken references.&lt;/p&gt;
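&lt;p&gt;The idea can be sketched as a graph walk. This is a toy in-memory version — real tools derive the edges from catalog metadata and run batched queries — but the shape of the algorithm is the same:&lt;/p&gt;

```python
from collections import deque

# Toy data: rows keyed by (table, id); edges are FK references.
# These tables and rows are illustrative.
EDGES = {
    ("orders", 1001): [("users", 7), ("line_items", 1), ("line_items", 2)],
    ("line_items", 1): [("products", 42)],
    ("line_items", 2): [("products", 43)],
    ("users", 7): [],
    ("products", 42): [],
    ("products", 43): [],
}

def collect_subgraph(roots):
    # BFS from the root rows, following FK edges so the result
    # is a self-contained set with no dangling references.
    seen = set(roots)
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for neighbor in EDGES.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

subset = collect_subgraph([("orders", 1001)])
print(sorted(subset))
```

&lt;p&gt;Starting from one order, the walk pulls in its user, line items, and products — everything the row needs to be restorable on its own.&lt;/p&gt;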

&lt;p&gt;That matters more than people think. It is the difference between "the data loaded" and "the app actually behaves like it does in production".&lt;/p&gt;

&lt;h2&gt;
  
  
  What the workflow looks like in practice
&lt;/h2&gt;

&lt;p&gt;The nice part is that this can stay simple. Basecut starts with a small YAML config that tells it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;where to start&lt;/li&gt;
&lt;li&gt;how far to traverse relationships&lt;/li&gt;
&lt;li&gt;how much data to include&lt;/li&gt;
&lt;li&gt;how anonymization should work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dev-snapshot'&lt;/span&gt;

&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-01-01'&lt;/span&gt;

&lt;span class="na"&gt;traverse&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;parents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;children&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;per_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
    &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;

&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then the workflow becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; basecut.yml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"dev-snapshot"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

basecut snapshot restore dev-snapshot:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOCAL_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the whole loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;inspect schema&lt;/li&gt;
&lt;li&gt;create snapshot&lt;/li&gt;
&lt;li&gt;restore anywhere&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, most teams can get from install to first snapshot in a few minutes. You can try this workflow with &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Basecut&lt;/a&gt; — the CLI is free for small teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  The privacy part is not optional
&lt;/h2&gt;

&lt;p&gt;One more requirement: privacy. If you are moving production-like data into dev and CI, PII handling cannot be a manual cleanup step.&lt;/p&gt;

&lt;p&gt;At minimum, your workflow should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;detect common PII automatically&lt;/li&gt;
&lt;li&gt;allow explicit masking rules&lt;/li&gt;
&lt;li&gt;preserve join integrity where needed&lt;/li&gt;
&lt;li&gt;anonymize during extraction, not afterward&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Basecut handles this with automatic PII detection plus 30+ anonymization strategies. It also supports deterministic masking, which matters when the same source value needs to map to the same fake value across related tables.&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;jane@company.com&lt;/code&gt; turns into one fake email in &lt;code&gt;users&lt;/code&gt; and a different fake email somewhere else, your data stops behaving like the real system. That is exactly the sort of detail that makes fake dev data feel fine right up until it is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this works well in CI too
&lt;/h2&gt;

&lt;p&gt;This pattern is just as useful in CI as it is locally. Instead of checking brittle fixtures into the repo, you restore a realistic snapshot before the test suite runs.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Test&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:15&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test_db&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;5432:5432&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Basecut CLI&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -fsSL https://basecut.dev/install.sh | sh&lt;/span&gt;
          &lt;span class="s"&gt;echo "$HOME/.local/bin" &amp;gt;&amp;gt; $GITHUB_PATH&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Restore snapshot&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;BASECUT_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.BASECUT_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;basecut snapshot restore test-data:latest \&lt;/span&gt;
            &lt;span class="s"&gt;--target "postgresql://postgres:postgres@localhost:5432/test_db"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gives your pipeline real data shapes, real relationships, and realistic edge cases without restoring an entire production dump on every run. It also keeps snapshots small enough that restores stay fast. We go deeper in our &lt;a href="https://basecut.dev/use-cases/ci-cd-test-data" rel="noopener noreferrer"&gt;CI/CD test data guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;When this is worth doing&lt;/h2&gt;

&lt;p&gt;You probably want production-like snapshots if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;your app has more than a handful of tables&lt;/li&gt;
&lt;li&gt;your bugs are often data-dependent&lt;/li&gt;
&lt;li&gt;you need realistic data in local dev&lt;/li&gt;
&lt;li&gt;your CI pipeline should test against something closer to reality&lt;/li&gt;
&lt;li&gt;you handle meaningful PII&lt;/li&gt;
&lt;li&gt;your team is tired of maintaining fixtures or seed scripts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You might not need it yet if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the product is brand new&lt;/li&gt;
&lt;li&gt;you do not have real production data yet&lt;/li&gt;
&lt;li&gt;the schema is tiny&lt;/li&gt;
&lt;li&gt;completely fictional demo data is the goal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not about replacing every fixture in your test suite. Unit tests still benefit from tiny, explicit test data. The value here is in &lt;strong&gt;integration tests, &lt;a href="https://basecut.dev/use-cases/local-development" rel="noopener noreferrer"&gt;local development&lt;/a&gt;, CI, onboarding, and debugging&lt;/strong&gt; where data shape matters.&lt;/p&gt;

&lt;h2&gt;A practical rollout&lt;/h2&gt;

&lt;p&gt;If you want to adopt this without overcomplicating it, start small.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Pick one painful workflow.&lt;br&gt;
Usually local dev onboarding, shared staging, or CI integration tests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Define a small snapshot.&lt;br&gt;
Keep the restore fast. Start with a few root tables and sensible row limits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Turn on anonymization from day one.&lt;br&gt;
Do not leave this for later.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Restore it somewhere useful immediately.&lt;br&gt;
Local dev DB or CI test DB is usually enough to prove the value.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Expand gradually.&lt;br&gt;
Add more tables, better filters, and refresh automation once the loop is working.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
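&lt;p&gt;Put together, a minimal first loop might look like the following. The install and restore commands mirror the CI example above; the snapshot name and database name are placeholders for whatever you define.&lt;/p&gt;

```shell
# Hypothetical first loop for local dev, reusing the commands shown earlier.
# "dev-data" and "app_dev" are placeholder names.

# Install the CLI (same installer as in the CI example)
curl -fsSL https://basecut.dev/install.sh | sh

# Restore a small, anonymized snapshot into a local dev database
basecut snapshot restore dev-data:latest \
  --target "postgresql://postgres:postgres@localhost:5432/app_dev"
```

Once that restore works end to end, the remaining steps are iteration: widen the table set, tighten filters, and schedule refreshes.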

&lt;p&gt;That gets you to useful production-like data quickly without turning the whole thing into a platform project.&lt;/p&gt;

&lt;h2&gt;Final thought&lt;/h2&gt;

&lt;p&gt;Most teams are not under-testing. They are testing against data that makes them feel safe. That is not the same thing.&lt;/p&gt;

&lt;p&gt;If your local environments and CI pipelines run against tiny, stale, or fake data, they will keep giving you false confidence. Production-like snapshots are one of the highest-leverage ways to make development and testing feel closer to the real system without dragging raw production data everywhere.&lt;/p&gt;

&lt;p&gt;If you want to try this with PostgreSQL, &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Basecut&lt;/a&gt; is free for small teams. Or dig into the &lt;a href="https://docs.basecut.dev/getting-started/quick-start" rel="noopener noreferrer"&gt;quickstart guide&lt;/a&gt; and &lt;a href="https://docs.basecut.dev/core-concepts/how-it-works" rel="noopener noreferrer"&gt;how FK-aware extraction works&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>testing</category>
      <category>devops</category>
      <category>database</category>
    </item>
  </channel>
</rss>
