<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jake Lazarus</title>
    <description>The latest articles on Forem by Jake Lazarus (@jakelaz).</description>
    <link>https://forem.com/jakelaz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3821271%2F86e1f840-dd57-4a1b-8df6-6274201606fb.png</url>
      <title>Forem: Jake Lazarus</title>
      <link>https://forem.com/jakelaz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jakelaz"/>
    <language>en</language>
    <item>
      <title>Database Subsetting for PostgreSQL: A Practical Guide (2026)</title>
      <dc:creator>Jake Lazarus</dc:creator>
      <pubDate>Thu, 16 Apr 2026 14:35:00 +0000</pubDate>
      <link>https://forem.com/jakelaz/database-subsetting-for-postgresql-a-practical-guide-2026-4j5m</link>
      <guid>https://forem.com/jakelaz/database-subsetting-for-postgresql-a-practical-guide-2026-4j5m</guid>
      <description>&lt;p&gt;Every team that has tried to copy production data into a dev environment has hit the same wall: production is too big, full of PII, and growing. The fix is not a bigger laptop or a faster &lt;code&gt;pg_dump&lt;/code&gt;. It is &lt;strong&gt;database subsetting&lt;/strong&gt; — extracting a small, self-contained slice of the database instead of all of it.&lt;/p&gt;

&lt;p&gt;Subsetting is the workflow underneath almost every modern dev-data tool. It is what makes "restore production data locally" actually viable. But the term gets thrown around loosely, and the difference between a real FK-aware subset and a glorified &lt;code&gt;SELECT ... LIMIT&lt;/code&gt; is the difference between a working dev environment and a database full of orphaned rows.&lt;/p&gt;

&lt;p&gt;This guide covers the whole picture: what subsetting is, how it works at the foreign-key level, the strategies that fit common PostgreSQL schemas, and an honest 2026 look at the tools that do it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Database subsetting extracts a referentially complete slice of a PostgreSQL database by traversing foreign keys from one or more root tables. Done well, it produces a dataset 10–1000× smaller than production that still behaves like production. Done badly, it produces broken referential integrity and silent test failures. This post covers how to do it well, the strategies that fit common schemas, and the tools that handle it in 2026.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What is database subsetting?
&lt;/h2&gt;

&lt;p&gt;Database subsetting is the process of extracting a representative, self-contained slice of a database instead of copying the whole thing. You start from one or more &lt;strong&gt;root tables&lt;/strong&gt; — usually entities like &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;accounts&lt;/code&gt;, or &lt;code&gt;tenants&lt;/code&gt; — apply filters and row limits, and then traverse foreign key relationships to pull in every related row that the slice depends on.&lt;/p&gt;

&lt;p&gt;The output is a smaller database that preserves the same schema and the same relational structure as production. Every foreign key still resolves. Every join still returns rows. The dataset behaves like production, just at 1% (or 0.1%, or 0.01%) of the size.&lt;/p&gt;

&lt;p&gt;The point of subsetting is &lt;strong&gt;not&lt;/strong&gt; to make a backup. It is to produce a dataset small enough to be useful for development, CI, and staging while still being realistic enough to surface the bugs that fake data hides.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why teams need database subsetting
&lt;/h2&gt;

&lt;p&gt;The motivation is almost always the same: &lt;strong&gt;&lt;code&gt;pg_dump&lt;/code&gt; does not scale, and seed scripts do not survive contact with reality.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A full &lt;code&gt;pg_dump&lt;/code&gt; of a mature production database is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Too big.&lt;/strong&gt; A 200GB database is fine on production hardware. It is unusable on a laptop, painful in CI, and tedious to refresh weekly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full of PII.&lt;/strong&gt; Real emails, real names, real billing addresses end up on dev machines, in CI logs, and in artifacts. That is a compliance problem in any environment that handles customer data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow to restore.&lt;/strong&gt; A restore that takes 90 seconds today takes 8 minutes in 18 months as production grows. That cost compounds across every CI run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We covered the full breakdown in &lt;a href="https://basecut.dev/vs/pg-dump" rel="noopener noreferrer"&gt;pg_dump vs database snapshots&lt;/a&gt; — the headline is that &lt;code&gt;pg_dump&lt;/code&gt; is the right tool for backups and disaster recovery, and the wrong tool for dev data.&lt;/p&gt;

&lt;p&gt;Seed scripts have the opposite problem. They start small and stay small, so size is not an issue. But every schema migration is a chance for the seed script to break, drift, or silently produce stale data. The shapes of real data — Unicode, NULLs, accounts with hundreds of related rows — never appear in hand-written fixtures. We covered why that matters in &lt;a href="https://basecut.dev/blog/why-fake-postgresql-test-data-misses-bugs" rel="noopener noreferrer"&gt;Why Fake PostgreSQL Test Data Misses Real Bugs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Subsetting solves both problems at once. The output is small enough to be useful and real enough to be representative.&lt;/p&gt;

&lt;h2&gt;
  
  
  How FK-aware subsetting actually works
&lt;/h2&gt;

&lt;p&gt;Most of the difference between "good subsetting" and "bad subsetting" comes down to whether the extractor understands foreign keys. Here is what FK-aware extraction is doing under the hood.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root tables and traversal
&lt;/h3&gt;

&lt;p&gt;You pick one or more root tables. These are the entities the subset is "about" — usually &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;accounts&lt;/code&gt;, &lt;code&gt;tenants&lt;/code&gt;, or &lt;code&gt;organizations&lt;/code&gt;. The extractor reads the rows that match your filter from the root tables and then walks the foreign key graph to pull in everything that depends on those rows.&lt;/p&gt;

&lt;p&gt;If your root is &lt;code&gt;users&lt;/code&gt; and you select 1000 rows, the extractor follows every foreign key that points at &lt;code&gt;users&lt;/code&gt; and pulls in the matching rows from &lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;subscriptions&lt;/code&gt;, &lt;code&gt;audit_logs&lt;/code&gt;, and any other dependent table. It then walks one level deeper: &lt;code&gt;line_items&lt;/code&gt; and &lt;code&gt;payments&lt;/code&gt; have foreign keys pointing at &lt;code&gt;orders&lt;/code&gt;, so those come along. And so on, until the closure is complete.&lt;/p&gt;

&lt;p&gt;This is the part that matters. Without traversal, you end up with 1000 users and zero orders, because nothing told the extractor to follow &lt;code&gt;orders.user_id&lt;/code&gt;. With traversal, you get a connected subgraph: 1000 users, all of their orders, all of those orders' line items, all of the related payments.&lt;/p&gt;
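&lt;p&gt;The closure computation itself is simple to sketch. The following toy Python model (hypothetical table and column names, an in-memory stand-in for real SQL) shows the fixpoint loop an FK-aware extractor runs: keep pulling in child rows whose foreign keys point at already-selected rows until nothing new appears.&lt;/p&gt;

```python
# Toy in-memory schema: tables are dicts of primary key to row,
# FKS maps (child_table, fk_column) to the parent table.
# All names here are hypothetical, for illustration only.
TABLES = {
    "users":      {1: {}, 2: {}, 3: {}},
    "orders":     {10: {"user_id": 1}, 11: {"user_id": 3}},
    "line_items": {100: {"order_id": 10}, 101: {"order_id": 11}},
}
FKS = [("orders", "user_id", "users"), ("line_items", "order_id", "orders")]

def subset(root_table, root_ids):
    """Return {table: set_of_pks} reachable from the root rows via FKs."""
    picked = {t: set() for t in TABLES}
    picked[root_table] = set(root_ids)
    changed = True
    while changed:  # iterate to a fixpoint: the referentially complete closure
        changed = False
        for child, col, parent in FKS:
            for pk, row in TABLES[child].items():
                # pull in every child row whose FK points at a picked parent
                if row[col] in picked[parent] and pk not in picked[child]:
                    picked[child].add(pk)
                    changed = True
    return picked

# selecting only user 1 drags along order 10 and line item 100;
# users 2 and 3 and their dependents stay behind
print(subset("users", [1]))
```

&lt;p&gt;A real extractor does this with batched SQL rather than an in-memory loop, and it also follows foreign keys upward so that a selected child's required parents come along, but the fixpoint idea is the same.&lt;/p&gt;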

&lt;h3&gt;
  
  
  Filters and row limits
&lt;/h3&gt;

&lt;p&gt;Filters narrow the slice before traversal runs. They look like SQL &lt;code&gt;WHERE&lt;/code&gt; clauses applied to the root tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;AND&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;plan&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:plan'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-01-01'&lt;/span&gt;
      &lt;span class="na"&gt;plan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;team'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Row limits cap the slice in case the filter still pulls in too much. A &lt;code&gt;per_table&lt;/code&gt; limit prevents one accidentally-huge table from blowing up the snapshot, and a &lt;code&gt;total&lt;/code&gt; limit caps the whole extract:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;per_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5000&lt;/span&gt;
    &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Filters and limits together are the levers that make subsetting work for any size of production database. The same config that works on 10GB of data works on 10TB — only the filter changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why referential completeness matters
&lt;/h3&gt;

&lt;p&gt;A subset is &lt;strong&gt;referentially complete&lt;/strong&gt; when every foreign key in the extract resolves to a row that is also in the extract. If &lt;code&gt;orders.user_id = 42&lt;/code&gt; is in the extract but &lt;code&gt;users.id = 42&lt;/code&gt; is not, the subset is broken and the restore will fail with a constraint violation — or worse, succeed with constraints disabled and produce a database your app cannot read correctly.&lt;/p&gt;

&lt;p&gt;This is the failure mode of "naive" subsetting (run &lt;code&gt;SELECT * FROM users LIMIT 1000&lt;/code&gt; and call it done). The extracted users table has 1000 rows. The orders table has rows pointing at user IDs that no longer exist. The restore either errors out or silently corrupts the dataset.&lt;/p&gt;

&lt;p&gt;A real FK-aware extractor guarantees referential completeness by construction: every row in the output is reachable from a root row by following foreign keys, and every foreign key in every row resolves inside the output. There are no orphans.&lt;/p&gt;

&lt;p&gt;This is the property that makes subsetting actually useful. Without it, you do not have a working database — you have a pile of disconnected rows.&lt;/p&gt;
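&lt;p&gt;The property is also cheap to verify after the fact. Here is a sketch of an orphan check over the same kind of toy in-memory extract (hypothetical names; against a live database the equivalent is a &lt;code&gt;LEFT JOIN&lt;/code&gt; per foreign key that looks for unmatched parent rows).&lt;/p&gt;

```python
def find_orphans(tables, fks):
    """Return (child_table, pk) pairs whose FK does not resolve in the extract."""
    return [
        (child, pk)
        for child, col, parent in fks
        for pk, row in tables[child].items()
        if row[col] is not None and row[col] not in tables[parent]
    ]

# a broken "naive" subset: order 11 points at user 3, who was never extracted
tables = {
    "users":  {1: {}},
    "orders": {10: {"user_id": 1}, 11: {"user_id": 3}},
}
print(find_orphans(tables, [("orders", "user_id", "users")]))
# [('orders', 11)]
```

&lt;p&gt;An FK-aware extractor makes this check return an empty list by construction; a naive &lt;code&gt;LIMIT&lt;/code&gt;-based extract almost never does.&lt;/p&gt;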

&lt;h2&gt;
  
  
  Subsetting and anonymization
&lt;/h2&gt;

&lt;p&gt;A referentially complete subset still contains real PII unless something explicitly removes it. The two operations — subsetting and anonymization — are usually run together at extraction time, before the data ever leaves production.&lt;/p&gt;

&lt;p&gt;The reason to anonymize &lt;strong&gt;during&lt;/strong&gt; extraction (not after restore) is that any post-restore approach lets real PII travel through your pipeline before the masking script runs. Real emails appear in restore logs. Real names sit on disk for the few seconds it takes the script to start. New columns added since the script was last updated never get masked at all.&lt;/p&gt;

&lt;p&gt;The fix is to anonymize as part of the extract step, deterministically, so the same source value maps to the same fake value across every related table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*.email'&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deterministic_email&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;users.full_name'&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deterministic_name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deterministic masking matters because joins still need to work. If &lt;code&gt;jane@company.com&lt;/code&gt; becomes &lt;code&gt;lmitchell@example.com&lt;/code&gt; in &lt;code&gt;users&lt;/code&gt; but &lt;code&gt;kpark@example.com&lt;/code&gt; in &lt;code&gt;audit_logs&lt;/code&gt;, queries that join across tables stop returning the right rows. Determinism preserves the relationships even though every value has been replaced.&lt;/p&gt;
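&lt;p&gt;One common way to get determinism is a keyed hash. This is an illustrative sketch, not any particular tool's algorithm: HMAC the source value with a per-project secret and build the fake value from the digest, so the mapping is stable but not reversible without the key.&lt;/p&gt;

```python
import hashlib
import hmac

SECRET = b"per-project-secret"  # hypothetical key; must never ship with the snapshot

def mask_email(email: str) -> str:
    """Deterministically map an email to a fake one: same input, same output."""
    digest = hmac.new(SECRET, email.lower().encode(), hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}@example.com"

# the same source value gets the same replacement in every table,
# so joins on the masked column still line up
assert mask_email("Jane@Company.com") == mask_email("jane@company.com")
assert mask_email("jane@company.com") != mask_email("kim@company.com")
```

&lt;p&gt;Because the mapping is a pure function of the input, every table that stores the same email gets the same replacement, and cross-table joins keep working. Rotate the secret and the entire mapping changes, which is useful when a snapshot needs to be invalidated.&lt;/p&gt;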

&lt;p&gt;We covered the full mechanics — including the difference between masking and anonymization, what GDPR considers anonymized, and why automatic detection matters — in &lt;a href="https://basecut.dev/blog/how-to-anonymize-pii-in-postgresql-for-development" rel="noopener noreferrer"&gt;How to Anonymize PII in PostgreSQL for Development&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Subsetting strategies for common PostgreSQL schemas
&lt;/h2&gt;

&lt;p&gt;Most production schemas fit one of three patterns. Each one has a different "right way" to subset, and getting the strategy right matters more than which tool you use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-tenant SaaS: filter by tenant_id
&lt;/h3&gt;

&lt;p&gt;The most common shape. Every business object has a &lt;code&gt;tenant_id&lt;/code&gt; (or &lt;code&gt;org_id&lt;/code&gt;, or &lt;code&gt;account_id&lt;/code&gt;) column, and every row in the system belongs to exactly one tenant. The natural subset is "give me all the data for one tenant" or "give me all the data for the 50 most recent tenants."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tenants&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-01-01'&lt;/span&gt;
    &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The extractor then follows foreign keys from &lt;code&gt;tenants&lt;/code&gt; to &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;projects&lt;/code&gt;, &lt;code&gt;documents&lt;/code&gt;, &lt;code&gt;audit_logs&lt;/code&gt;, and everything else, naturally producing a dataset that is "the last 50 tenants and everything they own." Restore time scales with tenant size, not database size.&lt;/p&gt;

&lt;p&gt;The anti-pattern here is filtering on a child table directly (&lt;code&gt;SELECT * FROM documents WHERE tenant_id IN (...)&lt;/code&gt;). You end up with documents whose owning users were not pulled in, and the joins break.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time-windowed: last N days of activity
&lt;/h3&gt;

&lt;p&gt;When you do not have a clean tenant boundary (or you want to capture cross-tenant traffic), filter by recency. Pick a root table that represents activity — &lt;code&gt;events&lt;/code&gt;, &lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;sessions&lt;/code&gt; — and grab the last 30 days:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;orders&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-03-08'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Traversal pulls in the related users, products, and line items. The result is a snapshot that captures whatever shapes of data are flowing through the system right now, including recent edge cases like new payment methods or feature flags that were only enabled in the last week.&lt;/p&gt;

&lt;p&gt;This strategy is particularly good for catching regressions, because the subset always reflects the latest production patterns. Refresh weekly and you are testing against last week's data shapes, not last quarter's.&lt;/p&gt;

&lt;h3&gt;
  
  
  Customer-scoped: one specific account for repro
&lt;/h3&gt;

&lt;p&gt;When a customer reports a bug you cannot reproduce, the fastest path to a fix is usually a snapshot of just their account. Pick the customer as a root and traverse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;accounts&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:account_id'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;account_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;acct_01HXYZ...'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output is a tiny, FK-complete database containing exactly that customer's data — anonymized so you can share it with the team, restore it locally, and reproduce the bug in seconds. This is the workflow that justifies subsetting on its own for most teams: a 30-second restore beats half a day of "can you give me your steps again?"&lt;/p&gt;

&lt;p&gt;We dig into the broader workflow in &lt;a href="https://basecut.dev/use-cases/local-development" rel="noopener noreferrer"&gt;the local development use case&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools that do PostgreSQL subsetting in 2026
&lt;/h2&gt;

&lt;p&gt;The market has shaken out a bit since 2024. Here is the honest 2026 picture.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Basecut&lt;/th&gt;
&lt;th&gt;Tonic.ai&lt;/th&gt;
&lt;th&gt;Delphix&lt;/th&gt;
&lt;th&gt;OSS Snaplet fork&lt;/th&gt;
&lt;th&gt;Hand-rolled SQL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FK-aware traversal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;DIY&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anonymize at extract time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (post-restore)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Referential completeness guaranteed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;DIY&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auto-detects common PII&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Config format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;GUI + config&lt;/td&gt;
&lt;td&gt;GUI + agents&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;SQL / shell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hosted option&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (free tier)&lt;/td&gt;
&lt;td&gt;Self-host + hosted&lt;/td&gt;
&lt;td&gt;Enterprise self-host&lt;/td&gt;
&lt;td&gt;Self-host only&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Actively maintained&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No active upstream&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Teams wanting CLI + YAML, free tier&lt;/td&gt;
&lt;td&gt;Enterprise procurement&lt;/td&gt;
&lt;td&gt;Large enterprise&lt;/td&gt;
&lt;td&gt;Self-hosters with bandwidth&lt;/td&gt;
&lt;td&gt;Tiny schemas, stopgaps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few notes that the table cannot capture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basecut&lt;/strong&gt; is the actively maintained CLI-first option. YAML config, FK-aware traversal, deterministic masking, and a free tier that covers small teams. Built for the same workflow Snaplet pioneered. We cover the details on &lt;a href="https://basecut.dev/blog/snaplet-alternative" rel="noopener noreferrer"&gt;the Snaplet alternative page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tonic.ai&lt;/strong&gt; is the heavyweight commercial option. Strong for enterprise procurement and SOC 2 paperwork, heavier than most teams want for "just give me dev data." &lt;a href="https://basecut.dev/vs/tonic" rel="noopener noreferrer"&gt;Full Tonic comparison&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delphix&lt;/strong&gt; is the legacy enterprise player. Powerful, but the operational model assumes a dedicated platform team. &lt;a href="https://basecut.dev/vs/delphix" rel="noopener noreferrer"&gt;Full Delphix comparison&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The open-source Snaplet fork&lt;/strong&gt; is on GitHub and viable if you have engineering bandwidth to self-host and own maintenance indefinitely. There is no active upstream. &lt;a href="https://basecut.dev/blog/snaplet-alternative" rel="noopener noreferrer"&gt;Context here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hand-rolled &lt;code&gt;pg_dump&lt;/code&gt; plus SQL scripts&lt;/strong&gt; is what most teams default to before they have evaluated anything. It works for small schemas and breaks quietly as they grow. The full breakdown is in &lt;a href="https://basecut.dev/vs/pg-dump" rel="noopener noreferrer"&gt;pg_dump vs database snapshots&lt;/a&gt;, and the broader "stop writing seed scripts" argument is in &lt;a href="https://basecut.dev/blog/replace-seed-scripts-with-production-snapshots" rel="noopener noreferrer"&gt;Replace Seed Scripts with Production Snapshots&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you are evaluating from scratch, the honest order of operations is: pick the simplest tool that will work for your schema today, and make sure it handles referential completeness and at-extract anonymization. Everything else is detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to subset a PostgreSQL database
&lt;/h2&gt;

&lt;p&gt;Here is the minimum viable workflow with Basecut, end to end. The same five steps apply to any FK-aware tool — the syntax is just different.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Pick your root tables
&lt;/h3&gt;

&lt;p&gt;Identify the entities your subset is "about." For multi-tenant SaaS this is usually &lt;code&gt;tenants&lt;/code&gt; or &lt;code&gt;accounts&lt;/code&gt;. For a marketplace it might be &lt;code&gt;users&lt;/code&gt; or &lt;code&gt;listings&lt;/code&gt;. For an event-driven system it might be &lt;code&gt;events&lt;/code&gt; or &lt;code&gt;orders&lt;/code&gt;. Pick the table whose rows naturally pull in everything else through foreign keys.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Write a config
&lt;/h3&gt;

&lt;p&gt;A Basecut config defines roots, filters, limits, and anonymization rules in YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dev-snapshot'&lt;/span&gt;

&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tenants&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-01-01'&lt;/span&gt;
    &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;

&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;per_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5000&lt;/span&gt;
    &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100000&lt;/span&gt;

&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;mode: auto&lt;/code&gt; handles common PII columns (emails, names, phones, addresses) without explicit rules. Add explicit rules later if you have unusual fields like JSONB blobs or free-text notes.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Create a snapshot
&lt;/h3&gt;

&lt;p&gt;Run the create command against a production read replica:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; basecut.yml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PRODUCTION_READ_REPLICA_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basecut traverses foreign keys from your root tables, pulls in every dependent row, anonymizes PII inline, and writes a versioned, referentially complete snapshot. Real PII never leaves production.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Restore wherever you need it
&lt;/h3&gt;

&lt;p&gt;Same snapshot, any target:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Local dev&lt;/span&gt;
basecut snapshot restore dev-snapshot:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOCAL_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Staging&lt;/span&gt;
basecut snapshot restore dev-snapshot:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$STAGING_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# CI runner&lt;/span&gt;
basecut snapshot restore dev-snapshot:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"postgresql://postgres:postgres@localhost:5432/test_db"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The restore is fast because the subset is small, and safe because the data is already anonymized. We cover the CI flavor specifically in &lt;a href="https://basecut.dev/blog/postgresql-test-database-github-actions" rel="noopener noreferrer"&gt;PostgreSQL test database in GitHub Actions&lt;/a&gt; and the staging flavor in &lt;a href="https://basecut.dev/blog/how-to-set-up-a-staging-database-from-production-postgresql" rel="noopener noreferrer"&gt;setting up a staging database&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Refresh on a schedule
&lt;/h3&gt;

&lt;p&gt;A snapshot from three months ago is only as good as the data shapes from three months ago. Schedule weekly refreshes so &lt;code&gt;:latest&lt;/code&gt; always points at fresh data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot create &lt;span class="nt"&gt;--config&lt;/span&gt; basecut.yml &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PRODUCTION_READ_REPLICA_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it from a cron job, a CI workflow, or — on the team plan — a Basecut agent that handles scheduling for you. Existing restore commands keep working unchanged because they reference &lt;code&gt;:latest&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is the whole loop. Most teams get a working subset config in an afternoon and roll it out across local, CI, and staging over the following sprint.&lt;/p&gt;

&lt;h2&gt;
  
  
  When subsetting is not the right answer
&lt;/h2&gt;

&lt;p&gt;Subsetting is the right default for development data, but it is not the right tool for every job. Skip it when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You need an exact copy of production for migration rehearsal.&lt;/strong&gt; Use &lt;code&gt;pg_dump&lt;/code&gt; or a logical replication snapshot. Subsetting is a slice, not a forensic copy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your schema is genuinely tiny.&lt;/strong&gt; If you have five tables and 10MB of data, subsetting is overkill. A &lt;code&gt;pg_dump&lt;/code&gt; plus a quick masking script is fine until the schema grows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You have a strict requirement for full row-level fidelity.&lt;/strong&gt; Some compliance scenarios mandate a full copy with controlled access rather than a representative subset. Subsetting is a fit for the development workflow, not for forensic or audit use cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For everything else — local dev, CI test data, staging refreshes, customer repros, and onboarding — subsetting is the workflow worth investing in. It is the difference between saying "we have realistic dev data" and having a database you can actually develop against.&lt;/p&gt;




&lt;p&gt;If you want to see whether subsetting fits your schema, the Basecut free tier covers most small teams. Install the CLI, point it at a read replica, and you can have a first FK-complete, anonymized snapshot in a few minutes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Try Basecut free →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or read more: &lt;a href="https://basecut.dev/vs/pg-dump" rel="noopener noreferrer"&gt;pg_dump comparison&lt;/a&gt; · &lt;a href="https://basecut.dev/blog/how-to-anonymize-pii-in-postgresql-for-development" rel="noopener noreferrer"&gt;How to anonymize PII in PostgreSQL&lt;/a&gt; · &lt;a href="https://basecut.dev/blog/replace-seed-scripts-with-production-snapshots" rel="noopener noreferrer"&gt;Replace seed scripts with snapshots&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>testing</category>
      <category>devops</category>
    </item>
    <item>
      <title>Snaplet Alternative in 2026: What to Use After Snaplet Shut Down</title>
      <dc:creator>Jake Lazarus</dc:creator>
      <pubDate>Tue, 14 Apr 2026 14:30:00 +0000</pubDate>
      <link>https://forem.com/jakelaz/snaplet-alternative-in-2026-what-to-use-after-snaplet-shut-down-fh5</link>
      <guid>https://forem.com/jakelaz/snaplet-alternative-in-2026-what-to-use-after-snaplet-shut-down-fh5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://basecut.dev/blog/snaplet-alternative" rel="noopener noreferrer"&gt;basecut.dev&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you used Snaplet, you know exactly what it felt like when it worked: one command to pull a slice of production, anonymized, ready to restore. When they shut down in 2024, that workflow went with them.&lt;/p&gt;

&lt;p&gt;Snaplet had a real following because it solved a real problem. It made "get anonymized production data into your dev environment" feel like a first-class workflow instead of a weekend project. Teams built their local dev setup and CI pipelines around it. When the shutdown happened, those workflows broke overnight.&lt;/p&gt;

&lt;p&gt;Losing it meant teams had to find a maintained alternative, fork the open-source code themselves, or fall back to scripts they had avoided writing for good reason. Two years on, there are a few legitimate paths. They are not all equivalent.&lt;/p&gt;

&lt;p&gt;This post is an honest 2026 survey of where things stand: what you can actually use, who each option fits, and what the trade-offs are.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; The best actively maintained Snaplet alternative in 2026 is &lt;strong&gt;Basecut&lt;/strong&gt; — same subset → anonymize → restore workflow, YAML config instead of TypeScript, free tier for small teams. The &lt;strong&gt;open-source Snaplet fork&lt;/strong&gt; is viable if you have bandwidth to self-host and maintain it indefinitely. &lt;strong&gt;Tonic.ai&lt;/strong&gt; is a heavier enterprise option. &lt;strong&gt;pg_dump plus anonymization scripts&lt;/strong&gt; works for tiny schemas and breaks quietly as they grow.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What is the best Snaplet alternative in 2026?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Basecut&lt;/strong&gt; is the most actively maintained alternative to Snaplet for teams that need anonymized PostgreSQL snapshots. It covers the same core workflow — subset, anonymize, restore — with a YAML config instead of TypeScript and a broader set of anonymization strategies. The open-source Snaplet fork is also viable if your team has the bandwidth to self-host and maintain it long-term. Tonic.ai is available if you need an enterprise-grade commercial alternative with heavier procurement. And for the simplest possible cases, &lt;code&gt;pg_dump&lt;/code&gt; plus some post-restore scripts will do the job, though that approach has a well-documented ceiling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Snaplet alternatives compared
&lt;/h2&gt;

&lt;p&gt;Before going deep on each option, here is how the four mainstream paths line up against the workflow Snaplet established.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Basecut&lt;/th&gt;
&lt;th&gt;OSS Snaplet fork&lt;/th&gt;
&lt;th&gt;Tonic.ai&lt;/th&gt;
&lt;th&gt;pg_dump + scripts&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Actively maintained&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No active upstream&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FK-aware subsetting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;DIY with &lt;code&gt;WHERE&lt;/code&gt; clauses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anonymize before data leaves prod&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (masking runs post-restore)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auto-detects common PII&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deterministic masking across tables&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Config format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;GUI + config&lt;/td&gt;
&lt;td&gt;SQL / shell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hosted option&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (free tier)&lt;/td&gt;
&lt;td&gt;Self-host only&lt;/td&gt;
&lt;td&gt;Self-host + hosted&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Team / org-level policies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (paid)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free for small teams&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (self-host)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Teams wanting a maintained Snaplet replacement&lt;/td&gt;
&lt;td&gt;Teams with bandwidth to own the fork&lt;/td&gt;
&lt;td&gt;Enterprise with procurement cycles&lt;/td&gt;
&lt;td&gt;Tiny schemas, temporary stopgaps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The rest of this post walks through each option in detail, who it fits, and where it falls down.&lt;/p&gt;

&lt;h2&gt;
  
  
  What teams actually miss about Snaplet
&lt;/h2&gt;

&lt;p&gt;Before evaluating options, it helps to be specific about what Snaplet actually gave people. Not everything about it was special — but a few things were.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The one-command workflow.&lt;/strong&gt; &lt;code&gt;snaplet snapshot capture&lt;/code&gt; followed by &lt;code&gt;snaplet snapshot restore&lt;/code&gt;. That was the whole loop. New developer joins the team — two commands and they have a working database. CI needs realistic test data — restore the snapshot. Debugging a production incident locally — restore the snapshot. The value was not the technology; it was the fact that the workflow required no thinking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A config that understood your schema.&lt;/strong&gt; Snaplet's TypeScript config let you write transforms close to the data model. When it worked, it felt like the tool was aware of your schema rather than applying dumb column-name heuristics. You could write logic that said "for this table, transform this column this way" and have it behave exactly as intended.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automatic PII masking.&lt;/strong&gt; Snaplet would detect and mask common PII fields without requiring you to enumerate every sensitive column manually. For teams that had not yet thought carefully about their anonymization strategy, that auto-detection was the difference between using the tool and not using it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subsetting — not full dumps.&lt;/strong&gt; This is the one that gets underappreciated. Snaplet did not copy your entire database. It followed foreign keys and extracted a connected slice: recent users, their related orders, their related records. The result was small enough to restore locally in seconds and referentially intact because the relationships were traversed rather than copied blindly. Full &lt;code&gt;pg_dump&lt;/code&gt; gives you everything; Snaplet gave you the right slice. That difference matters for restore speed, local disk usage, and CI job time. It is also the reason a &lt;code&gt;pg_dump&lt;/code&gt;-based fallback feels so much worse by comparison — once you have had subsetting, going back to full dumps is painful.&lt;/p&gt;
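&lt;p&gt;The traversal described above can be sketched in a few lines of Python. This is a toy illustration of FK-aware subsetting over an in-memory "schema", not Snaplet's actual implementation:&lt;/p&gt;

```python
from collections import deque

# Toy data: three tables and the FK edges connecting them. This is an
# illustrative sketch of FK-aware subsetting, not any tool's real code.
rows = {
    "users":  [{"id": 1}, {"id": 2}, {"id": 3}],
    "orders": [{"id": 10, "user_id": 1}, {"id": 11, "user_id": 3}],
    "items":  [{"id": 100, "order_id": 10}, {"id": 101, "order_id": 11}],
}
fk_edges = [  # (child_table, fk_column, parent_table)
    ("orders", "user_id", "users"),
    ("items", "order_id", "orders"),
]

def subset(root_table, root_ids):
    """Collect the slice of related rows reachable from the root rows."""
    keep = {t: set() for t in rows}
    keep[root_table] = set(root_ids)
    queue = deque([root_table])
    while queue:
        parent = queue.popleft()
        for child, col, p in fk_edges:
            if p != parent:
                continue
            # Pull in every child row that references a kept parent row.
            added = {r["id"] for r in rows[child] if r[col] in keep[parent]}
            if added - keep[child]:
                keep[child] |= added
                queue.append(child)
    return keep

# Subsetting to user 1 drags in order 10 and item 100, and nothing else.
print(subset("users", [1]))
```

&lt;p&gt;Real tools do this walk against &lt;code&gt;information_schema&lt;/code&gt; metadata with batched queries, but the shape of the traversal is the same: start from roots, follow relationships, keep only what is connected.&lt;/p&gt;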

&lt;h2&gt;
  
  
  The main options after Snaplet
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The open-source Snaplet fork
&lt;/h3&gt;

&lt;p&gt;When Snaplet shut down, they open-sourced their code. It is on GitHub and available for anyone to run. For teams that depended heavily on Snaplet's TypeScript config and want to keep as much of their existing setup intact as possible, the fork is worth a look.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it fits:&lt;/strong&gt; you have a small team with the engineering bandwidth to set it up, self-host the necessary infrastructure, and handle any issues that come up. If you are comfortable owning a dependency that has no active upstream, this can be a workable path — especially if your existing &lt;code&gt;snaplet.config.ts&lt;/code&gt; is already dialed in and you do not want to migrate anything.&lt;/p&gt;

&lt;p&gt;The honest catch is that there are no active maintainers, no bug fixes, and no support channel. When PostgreSQL releases a new version and something breaks, that is your problem to debug. When your schema evolves and an edge case triggers unexpected behavior, you are reading source code to figure out why. That is not a knock on Snaplet — open-sourcing was a generous move by the team. It just means you are taking on maintenance ownership in full.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; viable if you are prepared to own it indefinitely. Not a good fit if "set it and forget it" is a requirement.&lt;/p&gt;

&lt;h3&gt;
  
  
  pg_dump plus anonymization scripts
&lt;/h3&gt;

&lt;p&gt;The path many teams take immediately after losing a tool like Snaplet: wire together &lt;code&gt;pg_dump&lt;/code&gt;, restore the dump to a scratch database, run a series of &lt;code&gt;UPDATE&lt;/code&gt; statements to mask sensitive fields, and call it done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it fits:&lt;/strong&gt; your schema is simple, your compliance requirements are minimal, and you just need a basic way to move data around. If you have five tables, no complicated FK relationships, and PII is confined to a handful of columns you can enumerate manually, this works fine.&lt;/p&gt;

&lt;p&gt;The catch is that this is exactly the pattern that breaks as the schema grows — and it breaks quietly. &lt;code&gt;UPDATE&lt;/code&gt;-based anonymization runs after the data is already in the target database, which means real PII is sitting on a dev machine or CI runner for however long the masking script takes. Subsetting is not built in, so you are either restoring the full production dump or writing custom &lt;code&gt;WHERE&lt;/code&gt; clauses for every table. And when a new sensitive column gets added, nothing reminds you to update the masking script.&lt;/p&gt;
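&lt;p&gt;One cheap guard against that quiet drift is a coverage check: compare the columns your masking script knows about against the live column list (in practice pulled from &lt;code&gt;information_schema.columns&lt;/code&gt;) and fail CI on anything unclassified. A minimal sketch, with hypothetical table and column names:&lt;/p&gt;

```python
# Drift guard for a hand-rolled masking script: fail loudly when a
# column appears that the script has never classified. The column
# names here are hypothetical.
MASKED = {("users", "email"), ("users", "full_name")}
REVIEWED_SAFE = {("users", "id"), ("users", "created_at")}

def unclassified(live_columns):
    """Return columns that are neither masked nor explicitly marked safe."""
    return sorted(set(live_columns) - MASKED - REVIEWED_SAFE)

# A migration adds users.phone; nothing in the masking script knows it.
live = [("users", "id"), ("users", "email"),
        ("users", "full_name"), ("users", "created_at"),
        ("users", "phone")]
print(unclassified(live))  # the new column surfaces for review
```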

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; fine for simple cases. Expect to outgrow it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tonic.ai
&lt;/h3&gt;

&lt;p&gt;Tonic is the heaviest commercial option in this space. It predates Snaplet, targets enterprise buyers, and covers the core workflow — subsetting, anonymization, restore — with a GUI-first experience and a set of higher-end features like synthetic data generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it fits:&lt;/strong&gt; you are at an org where procurement, SOC 2 paperwork, and a dedicated platform team are normal parts of buying a tool. Tonic is built for that world. If your team can spend a quarter evaluating it and your compliance team wants sign-off from a vendor with a full trust center, it is a reasonable pick.&lt;/p&gt;

&lt;p&gt;The honest catch for most Snaplet refugees is that Tonic is not really a like-for-like replacement. The pricing, onboarding, and operational model are different — it is not a CLI you install in an afternoon. Smaller teams that liked Snaplet for its "two commands and a YAML" ergonomics usually find Tonic heavier than what they were looking for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; reasonable if you are already an enterprise buyer. Heavier than most Snaplet users want.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basecut
&lt;/h3&gt;

&lt;p&gt;Basecut is built around the same workflow Snaplet established: define what to extract, run a CLI command, restore anywhere.&lt;/p&gt;

&lt;p&gt;The config is YAML instead of TypeScript, which trades flexibility for simplicity. For most teams — especially those whose Snaplet configs were mostly just masking rules and row limits — the YAML is easier to read, easier to review, and easier for non-JS teams to maintain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dev-snapshot'&lt;/span&gt;

&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-01-01'&lt;/span&gt;

&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;per_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
    &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;

&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basecut has 30+ anonymization strategies and auto-detects common PII fields — names, emails, phone numbers, addresses — without requiring you to enumerate them. It also supports deterministic masking, which matters when the same source value needs to map to the same fake value across related tables. If &lt;code&gt;jane@company.com&lt;/code&gt; turns into two different fake emails across &lt;code&gt;users&lt;/code&gt; and &lt;code&gt;audit_logs&lt;/code&gt;, your data stops behaving like the real system.&lt;/p&gt;
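&lt;p&gt;The deterministic property is easy to picture with a keyed hash: the same source value always produces the same fake value, so cross-table joins survive masking. A sketch of the general technique, not Basecut's actual implementation:&lt;/p&gt;

```python
import hashlib
import hmac

SECRET = b"per-project-masking-key"  # hypothetical key, kept out of the repo

def mask_email(real_email: str) -> str:
    """Deterministically map a real email to a stable fake one.

    Same input always yields the same output, so joins across tables
    (users.email vs audit_logs.email) still line up after masking.
    """
    digest = hmac.new(SECRET, real_email.lower().encode(), hashlib.sha256)
    return f"user-{digest.hexdigest()[:10]}@example.com"

# The same source value masks identically wherever it appears...
assert mask_email("jane@company.com") == mask_email("jane@company.com")
# ...and different source values stay distinct.
assert mask_email("jane@company.com") != mask_email("joe@company.com")
```

&lt;p&gt;Swap in per-type output formats and per-project keys and you have the core of deterministic masking; the important property is only that the function is stable, keyed, and one-way.&lt;/p&gt;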

&lt;p&gt;For teams with compliance requirements, Basecut adds org-level anonymization policies — rules that apply across all snapshots in a workspace without relying on individual contributors to remember to set them. You can enforce that certain columns are always masked, regardless of who runs the snapshot or what config file they used.&lt;/p&gt;

&lt;p&gt;Snapshots can be stored locally or in cloud storage, and the CLI supports both interactive use and async agent execution for teams that want to run snapshot creation on a schedule without leaving a terminal open. The free tier covers small teams. Paid plans add team features and higher snapshot volumes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; the closest like-for-like replacement for Snaplet, actively maintained, with a free tier that covers most small teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Post-Snaplet migration checklist
&lt;/h2&gt;

&lt;p&gt;If your workflow just broke and you are staring at a Snaplet-shaped hole in your dev setup, here is the minimum viable path to get back to a working state without picking the wrong tool under time pressure.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;List the workflows that actually depended on Snaplet.&lt;/strong&gt; Usually there are two or three: local dev onboarding, CI test data, and staging refreshes. Write them down. You do not need to replace every Snaplet feature — you need to restore these workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identify the data each workflow needs.&lt;/strong&gt; How many rows of which root tables? What recency window? What is genuinely PII? This is the information that will become your replacement tool's config, regardless of which tool you pick.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inventory the anonymization rules you actually rely on.&lt;/strong&gt; Pull up your old &lt;code&gt;snaplet.config.ts&lt;/code&gt; and list the columns that had explicit transforms. Most teams find that 80% of the rules are "mask emails, names, phone numbers" — which any serious replacement handles automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick one workflow to migrate first.&lt;/strong&gt; Local dev onboarding is usually the right starting point: it is the lowest stakes, has the fastest feedback loop, and exercises the full subset → anonymize → restore loop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run the new tool alongside the old script for one sprint.&lt;/strong&gt; Do not delete anything yet. Let one workflow prove itself before you rip out the fallback scripts that kept you running after the Snaplet shutdown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expand to CI once local dev is stable.&lt;/strong&gt; CI/CD test data is where the time savings compound fastest — a snapshot restore that takes seconds saves real engineer time on every PR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate snapshot refresh on a schedule.&lt;/strong&gt; Weekly is a reasonable default. Without a refresh schedule, any replacement tool is just a different kind of stale dataset six months from now.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delete the fallback scripts.&lt;/strong&gt; Once the replacement has been running unattended for a few weeks, delete the &lt;code&gt;pg_dump&lt;/code&gt;-and-&lt;code&gt;UPDATE&lt;/code&gt; scripts you wired up in the panic. Leaving them around means they get used again eventually, and now you have two parallel systems drifting apart.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The whole migration typically takes an afternoon for the CLI swap and one or two sprints to expand across local, CI, and staging. The hard part is usually the decision, not the execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migrating from Snaplet to Basecut
&lt;/h2&gt;

&lt;p&gt;The CLI migration is straightforward. If your team ran:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;snaplet snapshot capture
snaplet snapshot restore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Basecut equivalent is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot create &lt;span class="nt"&gt;--config&lt;/span&gt; basecut.yml
basecut snapshot restore my-snapshot:latest &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="nv"&gt;$DEV_DB_URL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The conceptual mapping is close enough that most teams can get a working Basecut config in an afternoon. The main translation is from Snaplet's TypeScript transform functions to anonymize rules in YAML — and in most cases, &lt;code&gt;anonymize: mode: auto&lt;/code&gt; handles the common fields automatically, so the config ends up shorter than what you had before.&lt;/p&gt;

&lt;p&gt;One thing worth knowing: Basecut uses snapshot names with a version tag (&lt;code&gt;my-snapshot:latest&lt;/code&gt;) rather than Snaplet's path-based restore syntax. The restore command takes a &lt;code&gt;--target&lt;/code&gt; flag pointing at the destination database URL, which keeps source and destination separate and makes it explicit which database is being written to.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is Snaplet still available in 2026?&lt;/strong&gt;&lt;br&gt;
No. Snaplet shut down in 2024 and open-sourced their codebase. The hosted product is gone, the team no longer maintains it, and there is no support channel. The code is still on GitHub for anyone who wants to self-host, but there are no active maintainers or bug fixes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the best alternative to Snaplet in 2026?&lt;/strong&gt;&lt;br&gt;
Basecut is the most actively maintained like-for-like alternative: same subset → anonymize → restore workflow, YAML config instead of TypeScript, and a free tier that covers small teams. The open-source Snaplet fork is viable if you can self-host and own the maintenance. Tonic.ai is a heavier enterprise option. &lt;code&gt;pg_dump&lt;/code&gt; plus anonymization scripts is a stopgap that breaks quietly as schemas grow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I migrate my Snaplet config directly to Basecut?&lt;/strong&gt;&lt;br&gt;
There is no automatic translator, but the conceptual mapping is close. Snaplet transform functions become Basecut &lt;code&gt;anonymize&lt;/code&gt; rules in YAML. For most teams, &lt;code&gt;anonymize: mode: auto&lt;/code&gt; handles the common PII columns (emails, names, phones, addresses) without explicit rules, which makes Basecut configs shorter than the Snaplet equivalents they replace. Most teams get a working Basecut config in an afternoon.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is the open-source Snaplet fork still maintained?&lt;/strong&gt;&lt;br&gt;
The code is available, but there is no active upstream. New PostgreSQL versions, edge cases, and schema quirks are your problem to debug. It is a reasonable option if you have engineering bandwidth and want to keep your existing &lt;code&gt;snaplet.config.ts&lt;/code&gt; intact. It is a poor fit if "set it and forget it" is a requirement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How long does a Snaplet → Basecut migration usually take?&lt;/strong&gt;&lt;br&gt;
The CLI swap itself takes an afternoon: write a &lt;code&gt;basecut.yml&lt;/code&gt;, run &lt;code&gt;basecut snapshot create&lt;/code&gt;, run &lt;code&gt;basecut snapshot restore&lt;/code&gt;, and verify the result against your app. Rolling it out across local dev, CI, and staging typically takes one or two sprints, mostly because each workflow needs to be validated independently before the old fallback scripts can be deleted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need to self-host Basecut the way I would self-host the Snaplet fork?&lt;/strong&gt;&lt;br&gt;
No. Basecut is a hosted product with a free tier for small teams, and snapshots can be stored either locally on your own machine or in cloud storage managed by Basecut. You install the CLI, point it at a read replica, and create your first snapshot without provisioning any infrastructure.&lt;/p&gt;




&lt;p&gt;If you want to see whether Basecut fits your workflow before committing to a migration, the free tier is a reasonable place to start. No infrastructure to set up — install the CLI, point it at a read replica, and you can have a first snapshot in a few minutes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Try Basecut free →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://basecut.dev/alternatives/snaplet" rel="noopener noreferrer"&gt;See the full Snaplet → Basecut migration guide →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>devops</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Set Up a PostgreSQL Test Database in GitHub Actions (Without pg_dump)</title>
      <dc:creator>Jake Lazarus</dc:creator>
      <pubDate>Thu, 09 Apr 2026 14:35:00 +0000</pubDate>
      <link>https://forem.com/jakelaz/how-to-set-up-a-postgresql-test-database-in-github-actions-without-pgdump-1m0b</link>
      <guid>https://forem.com/jakelaz/how-to-set-up-a-postgresql-test-database-in-github-actions-without-pgdump-1m0b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://basecut.dev/blog/postgresql-test-database-github-actions" rel="noopener noreferrer"&gt;basecut.dev&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The standard GitHub Actions job for PostgreSQL tests looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up test database&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;psql "$TEST_DB_URL" -f scripts/seed.sql&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or worse, a &lt;code&gt;pg_dump&lt;/code&gt; from production piped straight into the test container. It runs. The tests pass. Nobody questions it.&lt;/p&gt;

&lt;p&gt;Until a developer checks the CI logs and finds real customer emails in the output. Or a dump that used to take two minutes now takes fifteen. Or the seed script breaks on a migration and half the CI matrix fails with &lt;code&gt;ERROR: column "created_by" of relation "orders" does not exist&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This guide covers a better pattern: restore a pre-built, anonymized database snapshot instead. For small-to-medium snapshots, restores typically take seconds rather than minutes, the data contains no PII, and the process never breaks on schema changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you set up a PostgreSQL test database in GitHub Actions?
&lt;/h2&gt;

&lt;p&gt;The cleanest way to set up a PostgreSQL test database in GitHub Actions is to run a &lt;code&gt;postgres&lt;/code&gt; service container and restore a pre-built snapshot before your test step runs. This is faster than restoring a &lt;code&gt;pg_dump&lt;/code&gt;, safer than fixture SQL files, and does not require maintaining seed data by hand.&lt;/p&gt;

&lt;p&gt;The three-step pattern is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spin up a &lt;code&gt;postgres&lt;/code&gt; service container in the workflow.&lt;/li&gt;
&lt;li&gt;Install a CLI that can restore a versioned snapshot.&lt;/li&gt;
&lt;li&gt;Restore the snapshot before your test suite runs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The rest of this guide shows how to do that with Basecut, with full YAML you can copy directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the pg_dump approach breaks down in CI
&lt;/h2&gt;

&lt;p&gt;The obvious approach — dump production, restore in CI — causes three problems that compound over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed.&lt;/strong&gt; A &lt;code&gt;pg_dump&lt;/code&gt; restore is full-size by default. Production databases grow. A restore that takes 90 seconds today takes 8 minutes in two years. That is a lot of developer time spent waiting before tests even start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PII.&lt;/strong&gt; A full dump copies real emails, real names, real addresses. Those land in CI artifacts, appear in test output, and end up in logs. That is a compliance problem even if nobody reads the logs. The &lt;a href="https://www.thoughtworks.com/en-us/radar/techniques/production-data-in-test-environments" rel="noopener noreferrer"&gt;Thoughtworks Technology Radar&lt;/a&gt; explicitly calls out raw production data in test pipelines as a risk to address.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fragility.&lt;/strong&gt; Fixture SQL files and seed scripts break the moment a migration adds a &lt;code&gt;NOT NULL&lt;/code&gt; column they do not know about. Someone commits a quick fix, the fix produces slightly different data than everyone else's local setup, and now "works on my machine" is a data problem, not a code problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The snapshot approach
&lt;/h2&gt;

&lt;p&gt;Instead of restoring a raw dump or running a seed script, you restore a named, versioned snapshot — a small, FK-complete, pre-anonymized subset of production.&lt;/p&gt;

&lt;p&gt;The key properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pre-built.&lt;/strong&gt; The snapshot is created once (or on a schedule), not on every CI run. For most team-sized datasets, restore time is measured in seconds rather than minutes — though this depends on snapshot size and network.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Already anonymized.&lt;/strong&gt; PII is masked during snapshot creation, before any data leaves production. Nothing sensitive ever travels to CI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Referentially complete.&lt;/strong&gt; Every foreign key resolves. The snapshot is a consistent subgraph of production, not a random sample of disconnected rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versioned.&lt;/strong&gt; &lt;code&gt;dev-snapshot:latest&lt;/code&gt; always points to the most recent run. You can also pin to &lt;code&gt;dev-snapshot:v12&lt;/code&gt; to keep tests stable while multiple PRs are in flight.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Full GitHub Actions setup
&lt;/h2&gt;

&lt;p&gt;Here is a complete workflow you can adapt directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Test&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

    &lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:16&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test_db&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;5432:5432&lt;/span&gt;
        &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;-&lt;/span&gt;
          &lt;span class="s"&gt;--health-cmd pg_isready&lt;/span&gt;
          &lt;span class="s"&gt;--health-interval 10s&lt;/span&gt;
          &lt;span class="s"&gt;--health-timeout 5s&lt;/span&gt;
          &lt;span class="s"&gt;--health-retries 5&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Basecut CLI&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -fsSL https://basecut.dev/install.sh | sh&lt;/span&gt;
          &lt;span class="s"&gt;echo "$HOME/.local/bin" &amp;gt;&amp;gt; $GITHUB_PATH&lt;/span&gt;
          &lt;span class="s"&gt;# If the installer uses a different path, add that path instead&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Restore test snapshot&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;BASECUT_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.BASECUT_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;basecut snapshot restore dev-snapshot:latest \&lt;/span&gt;
            &lt;span class="s"&gt;--target "postgresql://postgres:postgres@localhost:5432/test_db"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth noting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--health-cmd pg_isready&lt;/code&gt; ensures the Postgres container is accepting connections before the restore step runs. Without this, you get intermittent failures on container startup.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dev-snapshot:latest&lt;/code&gt; is the snapshot name. Change this to match whatever you named your snapshot in the Basecut config. Tag a specific version (&lt;code&gt;dev-snapshot:v3&lt;/code&gt;) if you want tests to run against a pinned dataset across multiple branches.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;BASECUT_API_KEY&lt;/code&gt; goes in GitHub repo secrets (Settings → Secrets and variables → Actions), not in the workflow YAML.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  GitLab CI equivalent
&lt;/h2&gt;

&lt;p&gt;The same pattern works in GitLab CI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node:20&lt;/span&gt;

  &lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:16&lt;/span&gt;
      &lt;span class="na"&gt;alias&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
      &lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
        &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
        &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test_db&lt;/span&gt;

  &lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;PGHOST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
    &lt;span class="na"&gt;PGUSER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
    &lt;span class="na"&gt;PGPASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;

  &lt;span class="na"&gt;before_script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;curl -fsSL https://basecut.dev/install.sh | sh&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;export PATH="$HOME/.local/bin:$PATH"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;basecut snapshot restore dev-snapshot:latest \&lt;/span&gt;
        &lt;span class="s"&gt;--target "postgresql://postgres:postgres@postgres:5432/test_db"&lt;/span&gt;

  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set &lt;code&gt;BASECUT_API_KEY&lt;/code&gt; as a CI/CD variable in GitLab (Settings → CI/CD → Variables) with "Masked" enabled so it does not appear in job logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keeping snapshots fresh without breaking CI
&lt;/h2&gt;

&lt;p&gt;A snapshot that is three months old starts drifting from production. New columns, new relationships, new edge cases — none of them exist in it yet. The fix depends on how stable your CI needs to be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Refresh on a schedule&lt;/strong&gt; (recommended for most teams):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Refresh test snapshot&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt; &lt;span class="c1"&gt;# Every Monday at 3am&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;refresh&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Basecut CLI&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -fsSL https://basecut.dev/install.sh | sh&lt;/span&gt;
          &lt;span class="s"&gt;echo "$HOME/.local/bin" &amp;gt;&amp;gt; $GITHUB_PATH&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create fresh snapshot&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;BASECUT_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.BASECUT_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;basecut snapshot create \&lt;/span&gt;
            &lt;span class="s"&gt;--config basecut.yml \&lt;/span&gt;
            &lt;span class="s"&gt;--source "${{ secrets.PRODUCTION_DATABASE_URL }}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps &lt;code&gt;dev-snapshot:latest&lt;/code&gt; current week-to-week. Existing test workflows keep working — &lt;code&gt;latest&lt;/code&gt; only advances when the refresh job runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pin for stability, bump when needed:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your test suite is sensitive to data changes, pin the snapshot version in the test workflow (&lt;code&gt;dev-snapshot:v3&lt;/code&gt;) and bump the pin manually when you want the new data. This gives you explicit control over when CI picks up fresh data shapes.&lt;/p&gt;

&lt;h2&gt;
  
  
  PII in CI: the part most teams skip
&lt;/h2&gt;

&lt;p&gt;If your pipeline is restoring anything from production — a full dump, a partial export, a seed script that grabs real rows — real customer data is in your CI artifacts. That is the kind of thing commonly discussed in the context of GDPR Article 25 (data protection by design) and SOC 2 data handling controls, and most teams have not thought through whether their CI pipeline is in scope.&lt;/p&gt;

&lt;p&gt;The snapshot approach sidesteps this entirely because anonymization happens at extraction time, before data leaves production. The snapshot that arrives in CI already has fake emails, fake names, and fake phone numbers — not real data with a cleanup script someone may or may not have run.&lt;/p&gt;

&lt;p&gt;Deterministic masking ensures the same source value maps to the same fake value across all tables, so joins still work and the app behaves like it does in production.&lt;/p&gt;
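&lt;p&gt;As a sketch of the technique (not any particular tool's implementation), deterministic masking can be as simple as hashing the source value with a fixed salt:&lt;/p&gt;

```shell
# Sketch of deterministic masking: hashing the real value with a fixed salt
# yields the same fake value on every run, so cross-table joins still line up.
# Illustrative only -- real tools use proper keyed pseudonymization.
SALT="snapshot-2026-04"

mask_email() {
  token=$(printf '%s%s' "$SALT" "$1" | sha256sum | cut -c1-10)
  printf 'user_%s@example.com\n' "$token"
}

mask_email "alice@example.com"   # same token every run
mask_email "alice@example.com"   # identical to the line above
mask_email "bob@example.com"     # a different, equally stable token
```

&lt;p&gt;Because the mapping is stable, a masked &lt;code&gt;users.email&lt;/code&gt; matches the same masked value wherever it appears in other tables.&lt;/p&gt;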

&lt;h2&gt;
  
  
  When this pattern matters most
&lt;/h2&gt;

&lt;p&gt;This is overkill for a three-table CRUD app. It becomes the right default when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your test suite has data-dependent integration tests&lt;/li&gt;
&lt;li&gt;Your schema has more than a handful of related tables&lt;/li&gt;
&lt;li&gt;CI job time is starting to affect developer feedback loops&lt;/li&gt;
&lt;li&gt;You handle PII and need to demonstrate anonymization in CI for compliance audits&lt;/li&gt;
&lt;li&gt;Onboarding means "restore the snapshot and start working" instead of "ask around for a dump"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your CI pipeline currently has a slow &lt;code&gt;pg_dump&lt;/code&gt; step or a seed script that breaks every few sprints, this is the thing worth replacing first.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How do I set up a PostgreSQL test database in GitHub Actions?&lt;/strong&gt;&lt;br&gt;
Spin up a &lt;code&gt;postgres&lt;/code&gt; service container in the workflow, install a CLI that can restore a versioned snapshot, then restore a pre-built anonymized snapshot before your test step runs. This is faster than restoring a &lt;code&gt;pg_dump&lt;/code&gt;, safer than fixture SQL files, and does not require maintaining seed data by hand. The snapshot is built once from production with PII masked at extraction time, then reused across every CI run until the next refresh.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is pg_dump slow in GitHub Actions CI?&lt;/strong&gt;&lt;br&gt;
A full &lt;code&gt;pg_dump&lt;/code&gt; restore scales with the size of production and runs in full on every CI job. A restore that takes 90 seconds today takes 8 minutes in two years as production grows. Because the dump is not subset or anonymized, it also copies real PII into CI logs and artifacts. Pre-built snapshots replace this with a small, FK-complete, pre-anonymized dataset that typically restores in seconds for team-sized schemas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I cache my PostgreSQL test database between GitHub Actions runs?&lt;/strong&gt;&lt;br&gt;
Caching a live database across runs is fragile — the &lt;code&gt;postgres&lt;/code&gt; service container is fresh every job, and seeding state into caches tends to produce flaky tests. A better pattern is to cache the snapshot artifact itself (small, static, versioned) and restore it into a fresh container each run. That gives you fast startup without carrying mutable state between jobs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I keep real customer PII out of CI logs?&lt;/strong&gt;&lt;br&gt;
Anonymize at extraction time, not after restore. If a &lt;code&gt;pg_dump&lt;/code&gt; or seed script ever runs against production data in CI, real emails and names will appear in logs, artifacts, and error output. Pre-built snapshots with deterministic masking applied during creation guarantee that no raw PII ever travels through the CI pipeline. Any snapshot a developer can restore is already safe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does this pattern work with GitLab CI or CircleCI?&lt;/strong&gt;&lt;br&gt;
Yes. The pattern — service container for PostgreSQL, CLI install step, snapshot restore before tests — translates directly to GitLab CI, CircleCI, Buildkite, and Jenkins. Only the YAML syntax changes. The post includes a GitLab CI example; the same three steps apply to any runner that can start a PostgreSQL service and run a shell command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I keep the test snapshot from going stale?&lt;/strong&gt;&lt;br&gt;
Refresh it on a schedule with a separate workflow — typically a weekly cron job that runs &lt;code&gt;basecut snapshot create&lt;/code&gt; against production or a read replica. The &lt;code&gt;:latest&lt;/code&gt; tag always points at the most recent run, so existing test workflows pick up fresh data automatically. If your test suite is sensitive to data changes, pin a specific version tag (e.g. &lt;code&gt;dev-snapshot:v12&lt;/code&gt;) and bump it manually when you want to adopt new data shapes.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Get started with Basecut.&lt;/strong&gt; The CLI is free for small teams — create your first snapshot and restore it in CI in under 10 minutes. &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Try Basecut free →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>githubactions</category>
      <category>testing</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to Set Up a Staging Database from Production PostgreSQL (2026 Guide)</title>
      <dc:creator>Jake Lazarus</dc:creator>
      <pubDate>Tue, 07 Apr 2026 14:30:00 +0000</pubDate>
      <link>https://forem.com/jakelaz/how-to-set-up-a-staging-database-from-production-postgresql-2026-guide-1ckg</link>
      <guid>https://forem.com/jakelaz/how-to-set-up-a-staging-database-from-production-postgresql-2026-guide-1ckg</guid>
      <description>&lt;p&gt;Setting up a PostgreSQL staging database that actually reflects production is one of those tasks that sounds simple and turns into a day of cleanup. The obvious approach — &lt;code&gt;pg_dump&lt;/code&gt; production to staging — breaks down immediately: the dump is too large, it contains real customer PII, and it produces a shared environment nobody wants to touch.&lt;/p&gt;

&lt;p&gt;This guide walks through a better pattern: extract a connected, anonymized subset of production and restore it as your staging database. You get a realistic, production-like environment without the size, privacy risk, or shared-state headaches of a full copy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll learn:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why &lt;code&gt;pg_dump&lt;/code&gt; fails as a staging database solution&lt;/li&gt;
&lt;li&gt;How FK-aware subsetting produces a referentially complete extract&lt;/li&gt;
&lt;li&gt;How to anonymize PII before it ever leaves production&lt;/li&gt;
&lt;li&gt;How to automate staging refreshes so it never goes stale&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How do you create a staging database from production PostgreSQL safely?
&lt;/h2&gt;

&lt;p&gt;Create a staging database by extracting a referentially complete subset of production, anonymizing PII during extraction, and restoring that snapshot into staging. This keeps the environment realistic without copying the entire database, leaking customer data, or forcing developers to maintain manual cleanup scripts after every refresh.&lt;/p&gt;

&lt;p&gt;The shortest version of the workflow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Choose a representative slice of production&lt;/li&gt;
&lt;li&gt;Follow foreign keys so the extract stays complete&lt;/li&gt;
&lt;li&gt;Mask sensitive fields before data leaves production&lt;/li&gt;
&lt;li&gt;Restore the snapshot into staging&lt;/li&gt;
&lt;li&gt;Refresh it on a schedule so it stays current&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Why pg_dump Fails for Staging Environments
&lt;/h2&gt;

&lt;p&gt;The first instinct is usually &lt;code&gt;pg_dump&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pg_dump &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PRODUCTION_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | psql &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$STAGING_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works in the sense that it runs. The issues come later:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Size.&lt;/strong&gt; Production databases grow. A dump that takes 5 minutes today takes 30 minutes in a year. Restores slow down CI, slow down onboarding, and make "refresh staging" a thing people avoid.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PII.&lt;/strong&gt; A full dump copies everything — real emails, real names, real addresses, real payment details. That data is now in your staging environment, which means it is probably on developer laptops, in logs, and reachable by anyone with staging credentials. &lt;a href="https://www.thoughtworks.com/en-us/radar/techniques/production-data-in-test-environments" rel="noopener noreferrer"&gt;Thoughtworks' Technology Radar on production data in test&lt;/a&gt; calls out exactly these privacy and security tradeoffs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared state.&lt;/strong&gt; One staging environment with real-ish data usually becomes a place where everyone makes changes at once. It gets out of sync with production constantly. Nobody owns keeping it clean.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We wrote a &lt;a href="https://basecut.dev/vs/pg-dump" rel="noopener noreferrer"&gt;full comparison of pg_dump vs Basecut&lt;/a&gt; if you want the detailed breakdown. The short version: &lt;code&gt;pg_dump&lt;/code&gt; is the right tool for backups and disaster recovery. For dev and staging environments, you usually want something smaller and safer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Right Pattern: Subset, Anonymize, Restore
&lt;/h2&gt;

&lt;p&gt;Instead of copying the whole database, the better approach is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start from one or more root tables (usually &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;accounts&lt;/code&gt;, or whatever your primary entities are).&lt;/li&gt;
&lt;li&gt;Filter to a representative slice — recent signups, a specific account tier, a date range.&lt;/li&gt;
&lt;li&gt;Follow foreign keys to pull in all the related data those rows depend on.&lt;/li&gt;
&lt;li&gt;Anonymize sensitive fields during extraction, before anything leaves the production environment.&lt;/li&gt;
&lt;li&gt;Save the result as a named snapshot.&lt;/li&gt;
&lt;li&gt;Restore that snapshot to staging.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What you end up with is a self-contained, realistic subset of production — with real relationships, real data shapes, real edge cases — but no raw PII and a manageable size.&lt;/p&gt;

&lt;p&gt;This is what the industry calls &lt;strong&gt;database subsetting&lt;/strong&gt;, and it is the same pattern that powers &lt;a href="https://basecut.dev/use-cases/local-development" rel="noopener noreferrer"&gt;local dev environments&lt;/a&gt; and &lt;a href="https://basecut.dev/use-cases/ci-cd-test-data" rel="noopener noreferrer"&gt;CI test data pipelines&lt;/a&gt;. Staging is just another restore target.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Define What "Representative" Means for Your Database
&lt;/h2&gt;

&lt;p&gt;Before you can extract anything, you need to decide what data to include.&lt;/p&gt;

&lt;p&gt;For most applications this means picking a root table and a sensible filter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recent users (past 30–90 days)&lt;/li&gt;
&lt;li&gt;A specific cohort (paid accounts, a particular plan tier)&lt;/li&gt;
&lt;li&gt;Accounts associated with specific test scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You do not need the whole database. You need enough data that staging behaves like production — correct relationships, realistic distributions, enough rows to surface data-dependent bugs.&lt;/p&gt;

&lt;p&gt;A starting point for a typical SaaS PostgreSQL database:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;500–2,000 user accounts&lt;/li&gt;
&lt;li&gt;All their related records (orders, subscriptions, events, etc.)&lt;/li&gt;
&lt;li&gt;Enough to make the app behave realistically, small enough to restore in under a minute&lt;/li&gt;
&lt;/ul&gt;
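&lt;p&gt;A quick count against production (or a read replica) is enough to sanity-check a candidate filter before wiring it into any config; the table and column names here are illustrative:&lt;/p&gt;

```sql
-- Rough size check for a candidate slice (illustrative schema).
SELECT count(*) AS candidate_users
FROM users
WHERE created_at BETWEEN now() - interval '90 days' AND now();
```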

&lt;h2&gt;
  
  
  Step 2: Extract with Foreign Key Awareness
&lt;/h2&gt;

&lt;p&gt;The mistake most DIY approaches make is sampling rows naively — take 500 rows from &lt;code&gt;users&lt;/code&gt;, take 500 rows from &lt;code&gt;orders&lt;/code&gt;, call it done. Then you restore it and get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;orders pointing to users who are not in the snapshot&lt;/li&gt;
&lt;li&gt;line items pointing to products that were not included&lt;/li&gt;
&lt;li&gt;foreign key violations on restore&lt;/li&gt;
&lt;li&gt;an app that half-works or fails in strange ways&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A useful staging database has to be &lt;strong&gt;referentially complete&lt;/strong&gt;. Every foreign key must resolve. Every parent row must exist before its children.&lt;/p&gt;

&lt;p&gt;This is why FK-aware extraction matters. The extraction process traverses the schema — if you include an order, you also need the user who placed it, the products on the order, the shipping address, and whatever else your schema requires. The result is a subgraph that can be restored into an empty database without broken references.&lt;/p&gt;

&lt;p&gt;See the &lt;a href="https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-FK" rel="noopener noreferrer"&gt;official PostgreSQL docs on foreign key constraints&lt;/a&gt; for the underlying mechanics if you want to understand what the extractor is navigating.&lt;/p&gt;
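&lt;p&gt;You can see the graph the extractor has to walk by listing every foreign-key edge straight from the system catalogs:&lt;/p&gt;

```sql
-- Every foreign-key edge in the schema, child table to parent table.
-- This is the dependency graph an FK-aware extractor traverses.
SELECT con.conrelid::regclass  AS child_table,
       con.confrelid::regclass AS parent_table,
       con.conname             AS constraint_name
FROM pg_constraint con
WHERE con.contype = 'f'
ORDER BY 1, 2;
```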

&lt;h2&gt;
  
  
  Step 3: Anonymize PII Before It Leaves Production
&lt;/h2&gt;

&lt;p&gt;The natural next question is: what do you do about PII?&lt;/p&gt;

&lt;p&gt;The common answer is a cleanup script that runs after restore:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt;
  &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'user'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;first_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Test'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;last_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'User'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This has two problems. First, someone has to remember to run it. Second, when a new PII column gets added to the schema, someone has to update the script — and they usually forget.&lt;/p&gt;
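&lt;p&gt;One way to catch that drift is a periodic audit query that flags columns whose names look like PII, compared against what the cleanup script actually covers. The name patterns here are only a starting point, not an exhaustive detector:&lt;/p&gt;

```sql
-- Columns whose names suggest PII (patterns are illustrative, not exhaustive).
-- Compare this list against what the cleanup script actually masks.
SELECT table_name, column_name
FROM information_schema.columns
WHERE table_schema = 'public'
  AND (column_name ILIKE '%email%'
    OR column_name ILIKE '%phone%'
    OR column_name ILIKE '%address%'
    OR column_name ILIKE '%name%');
```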

&lt;p&gt;More importantly, it is already too late by the time this runs. The data traveled through your restore pipeline with real values in it.&lt;/p&gt;

&lt;p&gt;The better approach is to anonymize &lt;strong&gt;at extraction time&lt;/strong&gt;, before the data leaves production. The snapshot that gets created already has fake emails, fake names, and fake addresses. Nothing sensitive ever travels to staging.&lt;/p&gt;

&lt;p&gt;For columns that need specific handling — free-text fields, external IDs, JSONB blobs — you add explicit rules. Everything else gets auto-detected by column name and type.&lt;/p&gt;

&lt;p&gt;We go deep on this in &lt;a href="https://basecut.dev/blog/how-to-anonymize-pii-in-postgresql-for-development" rel="noopener noreferrer"&gt;How to Anonymize PII in PostgreSQL for Development&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Full Example with Basecut
&lt;/h2&gt;

&lt;p&gt;Here is a complete staging database setup using &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Basecut&lt;/a&gt;. Create a &lt;code&gt;basecut.yml&lt;/code&gt; at the root of your repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;staging'&lt;/span&gt;

&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;AND&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;plan&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;!=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:plan'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2025-10-01'&lt;/span&gt;
      &lt;span class="na"&gt;plan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;free'&lt;/span&gt;

&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;per_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2000&lt;/span&gt;
    &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100000&lt;/span&gt;

&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;users&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;notes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
      &lt;span class="na"&gt;stripe_customer_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hash&lt;/span&gt;
    &lt;span class="na"&gt;audit_logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;ip_address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fake_ip&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create the snapshot from production (or a read replica):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; basecut.yml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PRODUCTION_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And restore it to staging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot restore staging:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$STAGING_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire workflow. The snapshot is named and versioned — &lt;code&gt;staging:latest&lt;/code&gt; always points to the most recent one, and you can also restore a specific tagged version like &lt;code&gt;staging:v2&lt;/code&gt; if you need to pin a particular snapshot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Keep Your Staging Database Fresh
&lt;/h2&gt;

&lt;p&gt;A staging database that is three months old is almost as bad as one that does not exist. Edge cases you care about — new billing flows, new user types, new schema columns — are not in there yet.&lt;/p&gt;

&lt;p&gt;The simplest way to keep staging fresh is to trigger a snapshot refresh on a schedule. Wire it into an existing cron job, a scheduled GitHub Actions workflow, or your platform's scheduler. For example, to refresh every Monday at 2am:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# crontab entry (or a scheduled CI job)&lt;/span&gt;
0 2 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; 1 basecut snapshot create &lt;span class="nt"&gt;--config&lt;/span&gt; basecut.yml &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PRODUCTION_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
          basecut snapshot restore staging:latest &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$STAGING_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or trigger it as part of your CI pipeline whenever you cut a release branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Refresh staging database&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;BASECUT_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.BASECUT_API_KEY }}&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;basecut snapshot create --config basecut.yml --source "$PRODUCTION_DATABASE_URL"&lt;/span&gt;
    &lt;span class="s"&gt;basecut snapshot restore staging:latest --target "$STAGING_DATABASE_URL"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Weekly refreshes are usually enough. For teams doing frequent releases, a refresh on every release branch works well too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shared Staging vs Per-Developer Environments
&lt;/h2&gt;

&lt;p&gt;Staging environments fall into two patterns, and this approach works for both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared staging&lt;/strong&gt; (one environment, multiple developers): the workflow above applies directly. Refresh it on a schedule. Everyone gets the same anonymized, realistic baseline. When someone corrupts state for testing, you restore again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-developer environments&lt;/strong&gt; (each developer has their own): this is actually easier. Each developer restores the same snapshot to their own local PostgreSQL instance. They can make whatever changes they need without affecting anyone else. When they want a fresh start, one command resets it. We cover this more in the &lt;a href="https://basecut.dev/use-cases/local-development" rel="noopener noreferrer"&gt;local development guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The main advantage of per-developer environments is independence — nobody is waiting for staging to settle down before they can test. The tradeoff is that each developer needs somewhere to run their own PostgreSQL instance, which is easy locally but less obvious for teams that rely entirely on remote environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Staging Database Setup Checklist
&lt;/h2&gt;

&lt;p&gt;Before you call it done, verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Snapshot restores cleanly (no FK violations, no missing extension errors)&lt;/li&gt;
&lt;li&gt;[ ] Application runs against it without crashing&lt;/li&gt;
&lt;li&gt;[ ] No real PII visible in common queries (&lt;code&gt;SELECT email FROM users LIMIT 10&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;[ ] Referential integrity intact (spot-check a few joined queries)&lt;/li&gt;
&lt;li&gt;[ ] Row counts are reasonable (not 12 rows, not 5 million rows)&lt;/li&gt;
&lt;li&gt;[ ] Refresh process is automated and nobody is doing it manually&lt;/li&gt;
&lt;/ul&gt;
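
&lt;p&gt;The integrity and size checks can be scripted rather than eyeballed. A sketch, assuming the &lt;code&gt;users&lt;/code&gt; and &lt;code&gt;orders&lt;/code&gt; tables from the examples above; adapt the names to your schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Orphaned-row check: should return 0 if FK traversal worked
SELECT count(*) AS orphaned_orders
FROM orders o
LEFT JOIN users u ON u.id = o.user_id
WHERE u.id IS NULL;

-- Rough size check: n_live_tup is an estimate, which is fine here.
-- The largest tables should be in the hundreds or thousands of rows.
SELECT relname, n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;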

&lt;h2&gt;
  
  
  When a Full pg_dump Is Still the Right Answer
&lt;/h2&gt;

&lt;p&gt;To be fair: there are cases where a full dump is the right approach.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to test schema migrations against production-exact data before running them.&lt;/li&gt;
&lt;li&gt;You are debugging a specific production incident and need the exact rows to reproduce it.&lt;/li&gt;
&lt;li&gt;Your compliance requirements demand a production-identical environment for specific tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In those cases, the &lt;a href="https://www.postgresql.org/docs/current/app-pgdump.html" rel="noopener noreferrer"&gt;official &lt;code&gt;pg_dump&lt;/code&gt; reference&lt;/a&gt; is the right place to understand what a logical dump includes. The anonymization and subsetting workflow described here is for everything else: the far more common staging use cases where you want something fast, safe, and realistic, not a forensic copy of production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Staging databases are usually either out of date, full of PII, or both. The reason is that setting them up properly has never been easy enough for teams to actually do it.&lt;/p&gt;

&lt;p&gt;A scripted subset + anonymize + restore workflow fixes all of this at once. The result is a staging environment that is fast to restore, safe to share, realistic enough to catch real bugs, and easy to keep fresh.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://basecut.dev/blog/how-to-set-up-a-staging-database-from-production-postgresql" rel="noopener noreferrer"&gt;basecut.dev&lt;/a&gt;. If you found this useful, &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Basecut&lt;/a&gt; is the tool described here — free for small teams.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>devops</category>
      <category>database</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Replace Seed Scripts with Production Snapshots</title>
      <dc:creator>Jake Lazarus</dc:creator>
      <pubDate>Thu, 26 Mar 2026 14:35:00 +0000</pubDate>
      <link>https://forem.com/jakelaz/how-to-replace-seed-scripts-with-production-snapshots-40ng</link>
      <guid>https://forem.com/jakelaz/how-to-replace-seed-scripts-with-production-snapshots-40ng</guid>
      <description>&lt;p&gt;Seed scripts are technical debt that nobody tracks.&lt;/p&gt;

&lt;p&gt;They start as a convenience — a few &lt;code&gt;INSERT&lt;/code&gt; statements so the app boots locally — and they end up as a 400-line file that touches 30 tables, breaks on every third migration, and produces data that looks nothing like production. Everyone knows the seed script is bad. Nobody wants to fix it, because fixing it means rewriting it, and it will just rot again.&lt;/p&gt;

&lt;p&gt;The usual response is to invest in a better seed script — more tables, better relationships, more realistic values. But the underlying issue is not the quality of the script. It is that hand-crafting test data stops scaling as schema complexity grows.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Define what data you need in a YAML config, extract a subset from production with PII anonymized, and restore it anywhere. No INSERT statements to maintain — the snapshot stays current automatically.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why database seed scripts break as projects grow
&lt;/h2&gt;

&lt;p&gt;Seed scripts have real advantages early on: they are version-controlled, deterministic, and easy to understand. But those advantages erode as the schema grows, and three problems start compounding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Schema drift
&lt;/h3&gt;

&lt;p&gt;Every migration is a chance for the seed script to break. A new &lt;code&gt;NOT NULL&lt;/code&gt; column, a renamed FK, a dropped table — each one needs a corresponding update to the seed file. Those updates happen late or not at all. The person writing the migration is thinking about the migration, not about whether &lt;code&gt;seed.sql&lt;/code&gt; still runs.&lt;/p&gt;

&lt;p&gt;The script does not fail loudly. It just produces increasingly stale data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Manual referential integrity
&lt;/h3&gt;

&lt;p&gt;In production, an order belongs to a user, references line items, connects to shipments and payments. In a seed script, you maintain all of those relationships by hand. Every ID, every FK, every cross-table reference. Miss one and you get constraint violations or, worse, data that loads fine but makes the app behave in ways it never would in production.&lt;/p&gt;
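
&lt;p&gt;The ordering problem is easy to demonstrate. With a hypothetical &lt;code&gt;orders&lt;/code&gt; table whose &lt;code&gt;user_id&lt;/code&gt; references &lt;code&gt;users&lt;/code&gt;, a seed script that inserts a child row before its parent fails immediately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Seed scripts must insert parents before children by hand.
-- If user 42 has not been inserted yet, this fails:
INSERT INTO orders (id, user_id, total) VALUES (1, 42, 99.00);
-- ERROR:  insert or update on table "orders" violates foreign key
--         constraint "orders_user_id_fkey"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;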

&lt;h3&gt;
  
  
  Flat data
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Test User 1&lt;/code&gt; with &lt;code&gt;test@example.com&lt;/code&gt; and two orders is structurally valid. It is not useful. Real users have Unicode in their names. Real accounts have nullable fields that are actually null. Real customers have 47 orders accumulated over two years, with edge cases nobody thought to fabricate.&lt;/p&gt;

&lt;p&gt;The bugs that reach production are usually triggered by data shapes that did not exist in the seed script, because nobody anticipated them. We covered this in more detail in &lt;a href="https://basecut.dev/blog/why-fake-postgresql-test-data-misses-bugs" rel="noopener noreferrer"&gt;Why Fake PostgreSQL Test Data Misses Real Bugs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production database snapshots: the seed script alternative
&lt;/h2&gt;

&lt;p&gt;Instead of fabricating data, extract it.&lt;/p&gt;

&lt;p&gt;Not with &lt;code&gt;pg_dump&lt;/code&gt; — that copies everything, including all the PII you should not have in dev and all the volume you do not need. The &lt;a href="https://www.thoughtworks.com/en-us/radar/techniques/production-data-in-test-environments" rel="noopener noreferrer"&gt;Thoughtworks Technology Radar&lt;/a&gt; explicitly recommends against using raw production data in test environments for exactly this reason — the privacy and security risks outweigh the convenience. Instead, extract a &lt;strong&gt;subset&lt;/strong&gt;: a small, connected slice of production data with sensitive fields anonymized during extraction.&lt;/p&gt;

&lt;p&gt;What you get is a snapshot that reflects the real schema, has valid relationships because they were followed rather than hand-coded, and contains real data shapes because they came from production. It also requires no maintenance, because the next snapshot picks up schema changes automatically.&lt;/p&gt;

&lt;p&gt;This is the workflow &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Basecut&lt;/a&gt; was built for. Define what to extract, run one command, restore to any database.&lt;/p&gt;

&lt;h2&gt;
  
  
  How database subsetting works in practice
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Define what to extract
&lt;/h3&gt;

&lt;p&gt;Instead of INSERT statements, you describe the shape of the data you want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dev-data'&lt;/span&gt;

&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-01-01'&lt;/span&gt;

&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;per_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
    &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;

&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start from recent users, follow FKs to collect related data, cap the size, auto-detect and anonymize PII. You can add explicit anonymization rules when you need them — we cover that in &lt;a href="https://basecut.dev/blog/how-to-anonymize-pii-in-postgresql-for-development" rel="noopener noreferrer"&gt;How to Anonymize PII in PostgreSQL for Development&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The important difference from a seed script: this config describes &lt;em&gt;what to extract&lt;/em&gt;, not &lt;em&gt;what to insert&lt;/em&gt;. New columns, new tables, and new relationships get picked up on the next snapshot without touching the config.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Create a snapshot
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot create &lt;span class="nt"&gt;--config&lt;/span&gt; basecut.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basecut connects to your database (or a read replica), traverses relationships, anonymizes PII inline, and writes the result. No real PII ever leaves production.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Restore locally
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot restore dev-data:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOCAL_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the local dev setup. A new developer joining the team runs two commands and has a working database full of realistic, already-anonymized data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Basecut handles all of this in one CLI command.&lt;/strong&gt; &lt;a href="https://docs.basecut.dev/getting-started/quick-start" rel="noopener noreferrer"&gt;See the quickstart →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  4. Share it
&lt;/h3&gt;

&lt;p&gt;Once a snapshot exists, anyone on the team can restore it. Everyone works against the same fixture data, which means bugs are reproducible across machines and "works on my machine" stops being about data differences.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Seed script&lt;/th&gt;
&lt;th&gt;Production snapshot&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Schema changes&lt;/td&gt;
&lt;td&gt;Breaks until someone updates it&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FK integrity&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Followed from the database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data realism&lt;/td&gt;
&lt;td&gt;Fabricated&lt;/td&gt;
&lt;td&gt;Real, anonymized&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PII risk&lt;/td&gt;
&lt;td&gt;None (but no realism either)&lt;/td&gt;
&lt;td&gt;Handled at extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maintenance&lt;/td&gt;
&lt;td&gt;Grows with the schema&lt;/td&gt;
&lt;td&gt;Near zero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Onboarding&lt;/td&gt;
&lt;td&gt;Run, debug, ask for help&lt;/td&gt;
&lt;td&gt;Restore, start working&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edge cases&lt;/td&gt;
&lt;td&gt;Only what someone added&lt;/td&gt;
&lt;td&gt;Whatever production has&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;More detail in our &lt;a href="https://basecut.dev/vs/seed-scripts" rel="noopener noreferrer"&gt;seed scripts vs Basecut comparison&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using production snapshots in CI/CD pipelines
&lt;/h2&gt;

&lt;p&gt;The same snapshot works in CI. Replace the seed script step with a restore:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Test&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:15&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test_db&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;5432:5432&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Basecut CLI&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -fsSL https://basecut.dev/install.sh | sh&lt;/span&gt;
          &lt;span class="s"&gt;echo "$HOME/.local/bin" &amp;gt;&amp;gt; $GITHUB_PATH&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Restore snapshot&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;BASECUT_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.BASECUT_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;basecut snapshot restore dev-data:latest \&lt;/span&gt;
            &lt;span class="s"&gt;--target "postgresql://postgres:postgres@localhost:5432/test_db"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;New tables and columns from migrations show up in the next snapshot. No CI config changes needed.&lt;/p&gt;

&lt;p&gt;More on this in our &lt;a href="https://basecut.dev/use-cases/ci-cd-test-data" rel="noopener noreferrer"&gt;CI/CD test data guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  When seed scripts still make sense
&lt;/h2&gt;

&lt;p&gt;Seed scripts are the right tool in some situations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pre-launch projects&lt;/strong&gt; with no production data yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intentionally fictional demos&lt;/strong&gt; where you need a specific scenario.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unit tests&lt;/strong&gt; that need three rows in one table. A snapshot is overkill there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small schemas&lt;/strong&gt; where the maintenance cost is genuinely low.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your schema has fewer than ten tables and no PII, a seed script is probably the right choice. The crossover point is usually obvious — it is when maintaining the seed file takes more effort than it saves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migrating off the seed script
&lt;/h2&gt;

&lt;p&gt;You do not have to rip it out in one PR.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with one workflow.&lt;/strong&gt; Pick the place where the seed script hurts most — usually dev environment setup or CI. Set up a snapshot and run it alongside the seed script.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compare.&lt;/strong&gt; Run the app against both datasets. The snapshot will usually expose things the seed script missed: edge cases, data shapes that only exist in production, relationships that only worked because the script inserted rows in a specific order.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Switch gradually.&lt;/strong&gt; Replace the seed script where the snapshot is better. Keep it for unit tests or demos if it still makes sense.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Let it go.&lt;/strong&gt; Once the snapshot covers your main workflows, stop maintaining the seed script. Do not delete it if people reference it — just stop investing in keeping it current.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
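
&lt;p&gt;For the comparison step, a per-table row count run against both databases makes coverage gaps obvious. A sketch using PostgreSQL's statistics views (the counts are estimates, which is enough for a side-by-side comparison):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Run against the seeded database and the restored snapshot,
-- then diff the two outputs to see what the seed script never covered
SELECT relname AS table_name, n_live_tup AS approx_rows
FROM pg_stat_user_tables
ORDER BY relname;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;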

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;The seed script is one of those things that works well enough to never get fixed. It seeds the database. The app boots. Nobody wants to touch it.&lt;/p&gt;

&lt;p&gt;The problem is that "well enough" slowly gets worse. The schema changes, the data drifts, the edge cases multiply, and the gap between what the seed script produces and what production looks like gets wider every quarter.&lt;/p&gt;

&lt;p&gt;Production snapshots close that gap by removing the maintenance entirely. The data stays current because it comes from the real database. The relationships stay valid because they are followed, not written by hand. And anonymization is part of the process rather than a separate step someone has to remember.&lt;/p&gt;

&lt;p&gt;If your seed script is the file nobody wants to own, maybe the right move is making sure nobody has to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Get started in minutes.&lt;/strong&gt; Basecut extracts FK-aware, anonymized snapshots from PostgreSQL with one CLI command. Free for small teams. &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Try Basecut free →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or explore first: &lt;a href="https://docs.basecut.dev/getting-started/quick-start" rel="noopener noreferrer"&gt;quickstart guide&lt;/a&gt; · &lt;a href="https://docs.basecut.dev/configuration/snapshots" rel="noopener noreferrer"&gt;snapshot config reference&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>testing</category>
      <category>database</category>
      <category>devtools</category>
    </item>
    <item>
      <title>How to Anonymize PII in PostgreSQL for Development</title>
      <dc:creator>Jake Lazarus</dc:creator>
      <pubDate>Tue, 24 Mar 2026 14:43:00 +0000</pubDate>
      <link>https://forem.com/jakelaz/how-to-anonymize-pii-in-postgresql-for-development-hb2</link>
      <guid>https://forem.com/jakelaz/how-to-anonymize-pii-in-postgresql-for-development-hb2</guid>
      <description>&lt;p&gt;Ask any developer whether their local database has real customer data in it, and most will say no.&lt;/p&gt;

&lt;p&gt;Ask them to check, and most will find that it does.&lt;/p&gt;

&lt;p&gt;Real emails in &lt;code&gt;users&lt;/code&gt;. Real names in &lt;code&gt;profiles&lt;/code&gt;. Real billing addresses in &lt;code&gt;payments&lt;/code&gt;. Real IP addresses in &lt;code&gt;audit_logs&lt;/code&gt;. Data that landed in production, got copied somewhere for debugging, and has been sitting in local databases and CI pipelines ever since.&lt;/p&gt;

&lt;p&gt;This is not a hypothetical compliance problem. It is a real one, and it gets messier the longer it goes unaddressed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What counts as PII in a PostgreSQL database
&lt;/h2&gt;

&lt;p&gt;PII is broader than most developers expect. The obvious fields are easy to spot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;email&lt;/code&gt;, &lt;code&gt;email_address&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;first_name&lt;/code&gt;, &lt;code&gt;last_name&lt;/code&gt;, &lt;code&gt;full_name&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;phone&lt;/code&gt;, &lt;code&gt;phone_number&lt;/code&gt;, &lt;code&gt;mobile&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;address&lt;/code&gt;, &lt;code&gt;street_address&lt;/code&gt;, &lt;code&gt;city&lt;/code&gt;, &lt;code&gt;postal_code&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;date_of_birth&lt;/code&gt;, &lt;code&gt;dob&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ssn&lt;/code&gt;, &lt;code&gt;national_id&lt;/code&gt;, &lt;code&gt;tax_id&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But in real production schemas, PII hides in less obvious places:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;free-text fields like &lt;code&gt;notes&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, &lt;code&gt;bio&lt;/code&gt; that users fill in&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ip_address&lt;/code&gt; columns in event logs and audit tables&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stripe_customer_id&lt;/code&gt;, &lt;code&gt;paypal_email&lt;/code&gt; — identifiers that link back to real people&lt;/li&gt;
&lt;li&gt;JSONB columns that store user-submitted form data&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;metadata&lt;/code&gt; fields that accumulate whatever the app was logging at the time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have been copying your production database to dev environments without systematically anonymizing those fields, that data is on developer laptops, in CI logs, and probably in Slack at some point.&lt;/p&gt;
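
&lt;p&gt;A first pass at an inventory can come straight from the catalog. The sketch below flags columns whose names match common PII patterns; free-text and JSONB columns still need manual review, because the column name alone tells you nothing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- List likely PII columns by name pattern (a starting point, not a guarantee)
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
  AND column_name ~* '(email|name|phone|address|dob|birth|ssn|tax|ip_address)'
ORDER BY table_name, column_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;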

&lt;h2&gt;
  
  
  Why this matters beyond just being careful
&lt;/h2&gt;

&lt;p&gt;GDPR, CCPA, HIPAA, and most other data protection frameworks have something in common: they do not distinguish between production and non-production environments. If you are processing personal data in a development environment without appropriate controls, you are in scope.&lt;/p&gt;

&lt;p&gt;In practice, the consequences are usually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GDPR Article 25&lt;/strong&gt;: "data protection by design and by default" — development tooling is explicitly in scope&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SOC 2 Type II&lt;/strong&gt;: data handling controls are audited across environments, not just production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HIPAA minimum necessary rule&lt;/strong&gt;: uses and disclosures of PHI must be limited to the minimum necessary for the intended purpose, a bar that development environments rarely meet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Beyond compliance, there is a simpler reason: real customer data in dev environments is one of the most common sources of accidental exposure. A developer shares a failing test case on Slack. A CI artifact gets retained with real names in it. A staging database backup ends up in a public S3 bucket.&lt;/p&gt;

&lt;p&gt;The fix is not more policies. It is removing the real data from the environments where it should not be.&lt;/p&gt;

&lt;h2&gt;
  
  
  The naive approach: UPDATE statements after restore
&lt;/h2&gt;

&lt;p&gt;The most common first attempt at anonymization looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Run after restoring a pg_dump to dev&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt;
  &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'user'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;first_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Test'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;last_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'User'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt;
  &lt;span class="n"&gt;shipping_address&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'123 Test St'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works well enough until it does not.&lt;/p&gt;

&lt;p&gt;The problems start to accumulate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Someone forgets to run the script, and real data ends up in a dev environment anyway.&lt;/li&gt;
&lt;li&gt;The script is not versioned with the schema, so it breaks when new PII columns are added.&lt;/li&gt;
&lt;li&gt;It replaces data inconsistently — the same customer gets a different fake email in &lt;code&gt;users&lt;/code&gt; than in &lt;code&gt;audit_logs&lt;/code&gt;, breaking join-based queries.&lt;/li&gt;
&lt;li&gt;It runs after the fact, which means real data has already traveled through the restore pipeline.&lt;/li&gt;
&lt;li&gt;It has no automated detection — every new PII column has to be added manually.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the pattern that eventually lands teams in trouble. It feels like a solution because it works most of the time. It fails when someone does not run it, or when a new field gets added and nobody updates the script. It shares the same fundamental problem as &lt;a href="https://basecut.dev/vs/seed-scripts" rel="noopener noreferrer"&gt;seed scripts&lt;/a&gt; — manual upkeep that silently falls behind.&lt;/p&gt;

&lt;h2&gt;
  
  
  The better approach: anonymize at extraction time
&lt;/h2&gt;

&lt;p&gt;The more reliable pattern is to anonymize the data before it ever leaves the production environment, not after it arrives in dev. This applies whether you are setting up a &lt;a href="https://basecut.dev/use-cases/local-development" rel="noopener noreferrer"&gt;local development environment&lt;/a&gt; or populating CI databases.&lt;/p&gt;

&lt;p&gt;That means the anonymization step is baked into the snapshot process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Connect to production (or a read replica).&lt;/li&gt;
&lt;li&gt;Extract the rows you need.&lt;/li&gt;
&lt;li&gt;Anonymize sensitive fields inline, during extraction.&lt;/li&gt;
&lt;li&gt;Write the already-anonymized snapshot to wherever it will be stored.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result is that no real PII ever travels to dev environments. What gets restored is already masked.&lt;/p&gt;

&lt;p&gt;This matters because it removes the "forget to run the script" failure mode entirely. There is no post-restore step to forget.&lt;/p&gt;
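&lt;p&gt;The core of the pattern fits in a few lines. Here is a toy sketch (not Basecut's implementation — the columns and masking rules are illustrative) showing masking applied inside the extraction loop itself, so unmasked values never reach the output file:&lt;/p&gt;

```python
import csv
import hashlib
import io

def mask_email(value: str) -> str:
    # Derive a stable fake address from a hash of the real one,
    # so the same input always maps to the same output.
    digest = hashlib.sha256(value.encode()).hexdigest()[:10]
    return f"user_{digest}@example.com"

MASKS = {"email": mask_email}  # column -> masking function (illustrative)

def extract(rows, columns, out):
    # Masking happens inline, during extraction: the snapshot
    # only ever contains anonymized values.
    writer = csv.writer(out)
    writer.writerow(columns)
    for row in rows:
        masked = [MASKS.get(col, lambda v: v)(val)
                  for col, val in zip(columns, row)]
        writer.writerow(masked)

rows = [(1, "jane@company.com"), (2, "bob@company.com")]
buf = io.StringIO()
extract(rows, ["id", "email"], buf)
print(buf.getvalue())
```

&lt;p&gt;Because the raw value is replaced before it is written anywhere, there is nothing downstream to forget to clean up.&lt;/p&gt;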

&lt;h2&gt;
  
  
  What good anonymization actually requires
&lt;/h2&gt;

&lt;p&gt;Replacing real values with fake ones is straightforward. Making the fake values behave like real data is harder.&lt;/p&gt;

&lt;p&gt;A few requirements come up in practice:&lt;/p&gt;

&lt;h3&gt;
  
  
  Realistic fake values
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;email = 'test@example.com'&lt;/code&gt; is easy to write and easy to spot. It does not behave like real email data in filtering, search, or display.&lt;/p&gt;

&lt;p&gt;Better: generate realistic-looking fake emails that follow the same structure. &lt;code&gt;lmitchell@example.com&lt;/code&gt; is harder to accidentally mistake for test data, and it exercises the same code paths as real emails. We cover why this realism matters for catching bugs in &lt;a href="https://basecut.dev/blog/why-fake-postgresql-test-data-misses-bugs" rel="noopener noreferrer"&gt;Why Fake PostgreSQL Test Data Misses Real Bugs&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deterministic masking
&lt;/h3&gt;

&lt;p&gt;If &lt;code&gt;jane@company.com&lt;/code&gt; maps to &lt;code&gt;lmitchell@example.com&lt;/code&gt; in &lt;code&gt;users&lt;/code&gt;, it should map to the same fake email everywhere it appears — in &lt;code&gt;audit_logs&lt;/code&gt;, &lt;code&gt;notifications&lt;/code&gt;, &lt;code&gt;email_events&lt;/code&gt;, and anywhere else it is referenced.&lt;/p&gt;

&lt;p&gt;Without deterministic masking, the same source value produces different fake values in different tables. Queries that join across tables start returning mismatched or missing results. The data looks restored but does not behave like the real system.&lt;/p&gt;
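&lt;p&gt;One simple way to get this property — a sketch, not Basecut's actual algorithm — is to derive the fake value from a keyed hash of the source value, so every table that contains &lt;code&gt;jane@company.com&lt;/code&gt; receives the same replacement:&lt;/p&gt;

```python
import hmac
import hashlib

# Stable per-project seed (assumption: kept constant across snapshot runs)
SEED = b"per-project-secret-seed"

NAMES = ["alice", "lmitchell", "jsmith", "kpatel"]

def deterministic_email(real_email: str) -> str:
    # Keyed hash: same input + same seed -> same output,
    # but the mapping cannot be reversed without the seed.
    mac = hmac.new(SEED, real_email.encode(), hashlib.sha256).digest()
    name = NAMES[mac[0] % len(NAMES)]
    return f"{name}.{mac.hex()[:6]}@example.com"

# The same source value masks identically wherever it appears:
print(deterministic_email("jane@company.com"))
print(deterministic_email("jane@company.com"))
```

&lt;p&gt;Joins across &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;audit_logs&lt;/code&gt;, and &lt;code&gt;email_events&lt;/code&gt; keep working because the substitution is a pure function of the source value.&lt;/p&gt;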

&lt;h3&gt;
  
  
  FK-aware scope
&lt;/h3&gt;

&lt;p&gt;Anonymization cannot happen table-by-table in isolation. If &lt;code&gt;order_id = 1001&lt;/code&gt; belongs to a user whose real name you are masking, the anonymization needs to be consistent across &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;billing_addresses&lt;/code&gt;, and everything else connected to that user.&lt;/p&gt;

&lt;h3&gt;
  
  
  Coverage of unknown columns
&lt;/h3&gt;

&lt;p&gt;Manually listing every PII column to anonymize only works until someone adds a new one and forgets to update the anonymization config. Auto-detection — pattern matching on column names, types, and values — catches fields that have not been explicitly listed yet.&lt;/p&gt;
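&lt;p&gt;A minimal version of name-based detection is just pattern matching over the schema's column names (the pattern list below is illustrative — real detection also inspects types and sampled values):&lt;/p&gt;

```python
import re

# Illustrative column-name patterns that commonly indicate PII
PII_PATTERNS = [
    r"email", r"(first|last|full)?_?name$", r"phone",
    r"address", r"ssn", r"birth", r"ip_address",
]

def detect_pii_columns(columns):
    # Flag any (table, column) pair whose name matches a known pattern.
    flagged = []
    for table, column in columns:
        if any(re.search(p, column, re.IGNORECASE) for p in PII_PATTERNS):
            flagged.append((table, column))
    return flagged

schema = [
    ("users", "id"),
    ("users", "email"),
    ("users", "last_name"),
    ("orders", "shipping_address"),
    ("orders", "total_cents"),
]
print(detect_pii_columns(schema))
```

&lt;p&gt;Even this crude version catches a newly added &lt;code&gt;shipping_address&lt;/code&gt; column that nobody thought to list in the config.&lt;/p&gt;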

&lt;h2&gt;
  
  
  What Basecut's anonymization config looks like
&lt;/h2&gt;

&lt;p&gt;In Basecut, anonymization is part of the snapshot config. You can enable automatic detection, add explicit rules for specific columns, or mix both.&lt;/p&gt;

&lt;p&gt;The simplest version uses auto-detection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dev-snapshot'&lt;/span&gt;

&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-01-01'&lt;/span&gt;

&lt;span class="na"&gt;traverse&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;parents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;children&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;per_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
    &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;

&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;mode: auto&lt;/code&gt;, Basecut scans column names and data patterns to detect likely PII fields and applies sensible defaults. Emails become realistic fake emails. Names become realistic fake names. Phone numbers become valid-format fake phone numbers.&lt;/p&gt;

&lt;p&gt;For fields that need explicit control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users.notes&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clear&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users.profile_image_url&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clear&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments.card_last_four&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;preserve&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;audit_logs.ip_address&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fake_ip&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users.external_id&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hash&lt;/span&gt;
      &lt;span class="na"&gt;deterministic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The strategies map to real behaviors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;fake_*&lt;/code&gt; — generate realistic-looking fake values (email, name, phone, address, IP)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;clear&lt;/code&gt; — replace with &lt;code&gt;NULL&lt;/code&gt; or empty string&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;preserve&lt;/code&gt; — keep as-is (for fields that are not sensitive but look like they might be)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hash&lt;/code&gt; — consistent one-way hash, useful for IDs that need to be consistent but not reversible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deterministic masking is on by default for most strategies. Basecut uses a stable seed so the same source value always produces the same output, which keeps join queries working correctly across tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Org-wide policies
&lt;/h2&gt;

&lt;p&gt;One problem with per-snapshot anonymization configs is that each team or developer can configure their own rules — which means someone will configure them wrong.&lt;/p&gt;

&lt;p&gt;For teams with compliance requirements, Basecut supports org-wide anonymization policies: rules defined once at the organization level that apply to every snapshot, regardless of who creates it. An individual snapshot config can add rules but cannot remove or override org-level ones.&lt;/p&gt;

&lt;p&gt;This is how you get consistent anonymization without relying on every developer to configure it correctly every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this looks like in CI
&lt;/h2&gt;

&lt;p&gt;The same anonymization config that runs locally runs in CI. If you are restoring a snapshot before your test suite, the data arriving in your CI pipeline is already masked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Test&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:15&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test_db&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;5432:5432&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Basecut CLI&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -fsSL https://basecut.dev/install.sh | sh&lt;/span&gt;
          &lt;span class="s"&gt;echo "$HOME/.local/bin" &amp;gt;&amp;gt; $GITHUB_PATH&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Restore anonymized snapshot&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;BASECUT_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.BASECUT_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;basecut snapshot restore dev-snapshot:latest \&lt;/span&gt;
            &lt;span class="s"&gt;--target "postgresql://postgres:postgres@localhost:5432/test_db"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CI pipeline gets realistic data shapes and relationships without ever touching real PII. That is also worth noting for audits: you can show that your CI environment runs against masked data by design, not by policy.&lt;/p&gt;

&lt;p&gt;We go deeper on the CI setup in our &lt;a href="https://basecut.dev/use-cases/ci-cd-test-data" rel="noopener noreferrer"&gt;CI/CD test data guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use pg_anonymizer (and when not to)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://postgresql-anonymizer.readthedocs.io/" rel="noopener noreferrer"&gt;postgresql_anonymizer&lt;/a&gt; is a PostgreSQL extension worth knowing about. It provides declarative masking rules at the database level, which is useful in some situations — particularly if you want to expose a masked view of production data to specific roles.&lt;/p&gt;

&lt;p&gt;A few places where it fits less well for dev workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It requires installing an extension in production, which many teams are not able or willing to do.&lt;/li&gt;
&lt;li&gt;It anonymizes in-place or via views, not during extraction — the data still travels to dev before masking.&lt;/li&gt;
&lt;li&gt;It does not handle subsetting, so you are still copying the full database.&lt;/li&gt;
&lt;li&gt;It relies on explicit per-column masking rules; its detection of new, unlisted columns is limited compared to pattern-based scanning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For teams that want anonymization as part of a broader snapshot + subset + restore workflow, extraction-time masking is usually cleaner. You can see more in our &lt;a href="https://basecut.dev/vs/pg-dump" rel="noopener noreferrer"&gt;comparison with pg_dump&lt;/a&gt;. For a look at how Basecut compares to commercial alternatives like &lt;a href="https://basecut.dev/vs/tonic" rel="noopener noreferrer"&gt;Tonic&lt;/a&gt; and &lt;a href="https://basecut.dev/vs/delphix" rel="noopener noreferrer"&gt;Delphix&lt;/a&gt;, see our comparison pages.&lt;/p&gt;

&lt;h2&gt;
  
  
  A practical rollout
&lt;/h2&gt;

&lt;p&gt;If you have never done systematic anonymization before, the cleanest path is a gradual rollout.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Audit first.&lt;/strong&gt;&lt;br&gt;
Run a query across your schema to find columns that look like PII: &lt;code&gt;email&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;phone&lt;/code&gt;, &lt;code&gt;address&lt;/code&gt;, any JSONB column with user-submitted data. You will find more than you expect.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with auto-detection.&lt;/strong&gt;&lt;br&gt;
Let the tooling make a first pass at detection. Review what it finds, add explicit rules for anything it missed or got wrong.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run a test snapshot.&lt;/strong&gt;&lt;br&gt;
Restore it to a local dev database and check: does the data look anonymized? Do join queries still work? Does the app behave normally with it?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Commit the config.&lt;/strong&gt;&lt;br&gt;
Your anonymization rules should live in version control alongside your schema. When someone adds a new PII column, the config update is part of the same PR.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enforce at the org level.&lt;/strong&gt;&lt;br&gt;
Once your rules are stable, promote the critical ones to org-wide policies so they apply regardless of who runs the snapshot.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
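&lt;p&gt;For step 1, a catalog query gets you most of the way. This is a starting point for PostgreSQL — the name patterns are illustrative, so extend them for your schema:&lt;/p&gt;

```sql
-- Find columns whose names suggest PII, plus JSONB columns
-- that may hold user-submitted data (pattern list is illustrative)
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
  AND (column_name ~* 'email|name|phone|address|ssn|birth|ip'
       OR data_type = 'jsonb')
ORDER BY table_name, column_name;
```

&lt;p&gt;Treat the result as a review list, not a verdict — some matches will be false positives worth a &lt;code&gt;preserve&lt;/code&gt; rule.&lt;/p&gt;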

&lt;p&gt;The common mistake is treating anonymization as a post-restore step rather than part of the data pipeline. Once it is baked into snapshot creation, the risk of it being skipped drops to near zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;The reason most dev databases have real PII in them is not malice. It is that anonymization was not built into the default workflow — it was an afterthought, a script someone ran sometimes, a step in a doc nobody updated.&lt;/p&gt;

&lt;p&gt;The fix is simple in principle: make anonymization happen at extraction time, not after the fact. Every snapshot that gets restored to dev, CI, or staging should arrive already masked.&lt;/p&gt;

&lt;p&gt;If your team is still relying on manual cleanup scripts, or skipping anonymization entirely, that is the thing worth fixing first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Get started with Basecut's anonymization.&lt;/strong&gt; The CLI auto-detects common PII fields and applies masking during snapshot creation. Free for small teams. &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Try Basecut free →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or dig into the details first: &lt;a href="https://docs.basecut.dev/configuration/anonymization" rel="noopener noreferrer"&gt;anonymization config reference&lt;/a&gt; · &lt;a href="https://docs.basecut.dev/core-concepts/how-it-works" rel="noopener noreferrer"&gt;how FK-aware extraction works&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>security</category>
      <category>devops</category>
      <category>database</category>
    </item>
    <item>
      <title>Why Fake PostgreSQL Test Data Misses Real Bugs</title>
      <dc:creator>Jake Lazarus</dc:creator>
      <pubDate>Wed, 18 Mar 2026 05:00:00 +0000</pubDate>
      <link>https://forem.com/jakelaz/why-fake-postgresql-test-data-misses-real-bugs-3jco</link>
      <guid>https://forem.com/jakelaz/why-fake-postgresql-test-data-misses-real-bugs-3jco</guid>
      <description>&lt;p&gt;Most teams do not have a testing problem. They have a &lt;strong&gt;test data realism&lt;/strong&gt; problem.&lt;/p&gt;

&lt;p&gt;Locally, the app runs against &lt;code&gt;test@example.com&lt;/code&gt;, &lt;code&gt;User 1&lt;/code&gt;, and a seed script nobody wants to maintain. In CI, fixtures slowly drift away from reality. Then the bugs show up after deploy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a customer name has an apostrophe or accent&lt;/li&gt;
&lt;li&gt;a field is &lt;code&gt;NULL&lt;/code&gt; where your code assumed a string&lt;/li&gt;
&lt;li&gt;an account has 47 related records instead of 2&lt;/li&gt;
&lt;li&gt;a query that worked on 20 rows falls over on 20,000&lt;/li&gt;
&lt;li&gt;shared staging data gets mutated by three people at once&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If that sounds familiar, the answer usually is not "write more tests." It is "stop testing against fake data."&lt;/p&gt;

&lt;h2&gt;
  
  
  What teams actually want
&lt;/h2&gt;

&lt;p&gt;What most teams actually want is not a full copy of production, not a giant &lt;code&gt;pg_dump&lt;/code&gt;, and not another 400-line seed script. They want &lt;strong&gt;production-like data&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;realistic enough to expose bugs&lt;/li&gt;
&lt;li&gt;small enough to restore locally and in CI&lt;/li&gt;
&lt;li&gt;safe enough to use outside production&lt;/li&gt;
&lt;li&gt;reproducible enough that every developer can get the same result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the gap most dev workflows never solve cleanly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the usual approaches break down
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Seed scripts rot
&lt;/h3&gt;

&lt;p&gt;Seed scripts are fine when your app has five tables. They get painful when your schema grows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;every migration breaks something&lt;/li&gt;
&lt;li&gt;relationships get harder to maintain&lt;/li&gt;
&lt;li&gt;the data gets less realistic over time&lt;/li&gt;
&lt;li&gt;nobody wants to own the script anymore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You end up with a setup that is reproducible, but not especially useful. We wrote a deeper &lt;a href="https://basecut.dev/vs/seed-scripts" rel="noopener noreferrer"&gt;comparison of seed scripts vs production snapshots&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;pg_dump&lt;/code&gt; is great for backups, not dev environments
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;pg_dump&lt;/code&gt; solves a different problem. It copies everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;all rows&lt;/li&gt;
&lt;li&gt;all tables&lt;/li&gt;
&lt;li&gt;all PII&lt;/li&gt;
&lt;li&gt;all the size and baggage of production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is useful for backup and recovery. It is usually overkill for local development and CI.&lt;/p&gt;

&lt;p&gt;For dev workflows, full dumps create new problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;slow restores&lt;/li&gt;
&lt;li&gt;bloated local databases&lt;/li&gt;
&lt;li&gt;longer CI jobs&lt;/li&gt;
&lt;li&gt;sensitive data showing up in places it should not&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most of the time, you do not need the entire database. You need the &lt;strong&gt;right slice&lt;/strong&gt; of it. We wrote a &lt;a href="https://basecut.dev/vs/pg-dump" rel="noopener noreferrer"&gt;full comparison of pg_dump vs Basecut&lt;/a&gt; if you want the details.&lt;/p&gt;

&lt;h2&gt;
  
  
  The better pattern: subset, anonymize, restore
&lt;/h2&gt;

&lt;p&gt;The workflow that makes sense looks more like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start from one or more root tables.&lt;/li&gt;
&lt;li&gt;Filter to the rows you actually care about.&lt;/li&gt;
&lt;li&gt;Follow foreign keys to pull in the connected data.&lt;/li&gt;
&lt;li&gt;Anonymize sensitive fields inline.&lt;/li&gt;
&lt;li&gt;Save the snapshot.&lt;/li&gt;
&lt;li&gt;Restore it anywhere you need it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That gives you a connected, realistic, privacy-safe subset of production instead of a raw copy. This is the workflow we built &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Basecut&lt;/a&gt; around for PostgreSQL: FK-aware extraction, automatic PII anonymization, and one-command restores for local dev, CI, and debugging.&lt;/p&gt;

&lt;p&gt;The reason this approach works is simple: it treats test data as a &lt;strong&gt;repeatable snapshot problem&lt;/strong&gt;, not a hand-crafted fixture problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "production-like" should actually mean
&lt;/h2&gt;

&lt;p&gt;The phrase gets used loosely. In practice, production-like data should have four properties.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Realistic structure
&lt;/h3&gt;

&lt;p&gt;It should reflect the real relationships, optional fields, and edge cases in your schema.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Referential integrity
&lt;/h3&gt;

&lt;p&gt;If you copy one row from &lt;code&gt;orders&lt;/code&gt;, you usually also need related rows from &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;line_items&lt;/code&gt;, &lt;code&gt;shipments&lt;/code&gt;, and whatever else your app expects to exist together.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Privacy safety
&lt;/h3&gt;

&lt;p&gt;Emails, names, phone numbers, addresses, and other sensitive fields need to be anonymized before the data lands on laptops, CI runners, or logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Repeatability
&lt;/h3&gt;

&lt;p&gt;Developers need a predictable way to recreate the same kind of dataset without asking someone to send them a dump.&lt;/p&gt;

&lt;p&gt;If any one of those is missing, the workflow gets shaky fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why FK-aware extraction matters
&lt;/h2&gt;

&lt;p&gt;This is the part many DIY approaches get wrong. Randomly sampling rows from each table sounds easy until you restore them. Then you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;orders pointing to missing users&lt;/li&gt;
&lt;li&gt;line items pointing to missing products&lt;/li&gt;
&lt;li&gt;child rows without their parents&lt;/li&gt;
&lt;li&gt;failed restores or strange app behavior after restore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A useful snapshot has to behave like a self-contained mini-version of production.&lt;/p&gt;

&lt;p&gt;That is why FK-aware extraction matters. In Basecut, snapshots are built by following foreign keys in both directions and collecting a connected subgraph of your data. The result is something you can restore into an empty database without ending up with broken references.&lt;/p&gt;
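&lt;p&gt;The idea can be sketched as a graph walk. This is a toy in-memory version — real tools derive the edges from catalog metadata and run batched queries — but the shape of the algorithm is the same:&lt;/p&gt;

```python
from collections import deque

# Toy data: rows keyed by (table, id); edges are FK references.
# These tables and rows are illustrative.
EDGES = {
    ("orders", 1001): [("users", 7), ("line_items", 1), ("line_items", 2)],
    ("line_items", 1): [("products", 42)],
    ("line_items", 2): [("products", 43)],
    ("users", 7): [],
    ("products", 42): [],
    ("products", 43): [],
}

def collect_subgraph(roots):
    # BFS from the root rows, following FK edges so the result
    # is a self-contained set with no dangling references.
    seen = set(roots)
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for neighbor in EDGES.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

subset = collect_subgraph([("orders", 1001)])
print(sorted(subset))
```

&lt;p&gt;Starting from one order, the walk pulls in its user, line items, and products — everything the row needs to be restorable on its own.&lt;/p&gt;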

&lt;p&gt;That matters more than people think. It is the difference between "the data loaded" and "the app actually behaves like it does in production".&lt;/p&gt;

&lt;h2&gt;
  
  
  What the workflow looks like in practice
&lt;/h2&gt;

&lt;p&gt;The nice part is that this can stay simple. Basecut starts with a small YAML config that tells it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;where to start&lt;/li&gt;
&lt;li&gt;how far to traverse relationships&lt;/li&gt;
&lt;li&gt;how much data to include&lt;/li&gt;
&lt;li&gt;how anonymization should work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dev-snapshot'&lt;/span&gt;

&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-01-01'&lt;/span&gt;

&lt;span class="na"&gt;traverse&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;parents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;children&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;per_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
    &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;

&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then the workflow becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; basecut.yml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"dev-snapshot"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

basecut snapshot restore dev-snapshot:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOCAL_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the whole loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;inspect schema&lt;/li&gt;
&lt;li&gt;create snapshot&lt;/li&gt;
&lt;li&gt;restore anywhere&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, most teams can get from install to first snapshot in a few minutes. You can try this workflow with &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Basecut&lt;/a&gt; — the CLI is free for small teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  The privacy part is not optional
&lt;/h2&gt;

&lt;p&gt;One more requirement: privacy. If you are moving production-like data into dev and CI, PII handling cannot be a manual cleanup step.&lt;/p&gt;

&lt;p&gt;At minimum, your workflow should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;detect common PII automatically&lt;/li&gt;
&lt;li&gt;allow explicit masking rules&lt;/li&gt;
&lt;li&gt;preserve join integrity where needed&lt;/li&gt;
&lt;li&gt;anonymize during extraction, not afterward&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Basecut handles this with automatic PII detection plus 30+ anonymization strategies. It also supports deterministic masking, which matters when the same source value needs to map to the same fake value across related tables.&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;jane@company.com&lt;/code&gt; turns into one fake email in &lt;code&gt;users&lt;/code&gt; and a different fake email somewhere else, your data stops behaving like the real system. That is exactly the sort of detail that makes fake dev data feel fine right up until it is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this works well in CI too
&lt;/h2&gt;

&lt;p&gt;This pattern is just as useful in CI as it is locally. Instead of checking brittle fixtures into the repo, you restore a realistic snapshot before the test suite runs.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Test&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:15&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test_db&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;5432:5432&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Basecut CLI&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -fsSL https://basecut.dev/install.sh | sh&lt;/span&gt;
          &lt;span class="s"&gt;echo "$HOME/.local/bin" &amp;gt;&amp;gt; $GITHUB_PATH&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Restore snapshot&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;BASECUT_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.BASECUT_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;basecut snapshot restore test-data:latest \&lt;/span&gt;
            &lt;span class="s"&gt;--target "postgresql://postgres:postgres@localhost:5432/test_db"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gives your pipeline real data shapes, real relationships, and realistic edge cases without restoring an entire production dump on every run. It also keeps snapshots small enough that restores stay fast. We go deeper in our &lt;a href="https://basecut.dev/use-cases/ci-cd-test-data" rel="noopener noreferrer"&gt;CI/CD test data guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;When this is worth doing&lt;/h2&gt;

&lt;p&gt;You probably want production-like snapshots if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;your app has more than a handful of tables&lt;/li&gt;
&lt;li&gt;your bugs are often data-dependent&lt;/li&gt;
&lt;li&gt;you need realistic data in local dev&lt;/li&gt;
&lt;li&gt;your CI pipeline should test against something closer to reality&lt;/li&gt;
&lt;li&gt;you handle meaningful PII&lt;/li&gt;
&lt;li&gt;your team is tired of maintaining fixtures or seed scripts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You might not need it yet if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the product is brand new&lt;/li&gt;
&lt;li&gt;you do not have real production data yet&lt;/li&gt;
&lt;li&gt;the schema is tiny&lt;/li&gt;
&lt;li&gt;completely fictional demo data is the goal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not about replacing every fixture in your test suite. Unit tests still benefit from tiny, explicit test data. The value here is in &lt;strong&gt;integration tests, &lt;a href="https://basecut.dev/use-cases/local-development" rel="noopener noreferrer"&gt;local development&lt;/a&gt;, CI, onboarding, and debugging&lt;/strong&gt; where data shape matters.&lt;/p&gt;

&lt;h2&gt;A practical rollout&lt;/h2&gt;

&lt;p&gt;If you want to adopt this without overcomplicating it, start small.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Pick one painful workflow.&lt;br&gt;
Usually local dev onboarding, shared staging, or CI integration tests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Define a small snapshot.&lt;br&gt;
Keep the restore fast. Start with a few root tables and sensible row limits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Turn on anonymization from day one.&lt;br&gt;
Do not leave this for later.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Restore it somewhere useful immediately.&lt;br&gt;
Local dev DB or CI test DB is usually enough to prove the value.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Expand gradually.&lt;br&gt;
Add more tables, better filters, and refresh automation once the loop is working.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
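&lt;p&gt;Put together, a minimal first loop might look like the following. The install and restore commands mirror the CI example above; the snapshot name and database name are placeholders for whatever you define.&lt;/p&gt;

```shell
# Hypothetical first loop for local dev, reusing the commands shown earlier.
# "dev-data" and "app_dev" are placeholder names.

# Install the CLI (same installer as in the CI example)
curl -fsSL https://basecut.dev/install.sh | sh

# Restore a small, anonymized snapshot into a local dev database
basecut snapshot restore dev-data:latest \
  --target "postgresql://postgres:postgres@localhost:5432/app_dev"
```

Once that restore works end to end, the remaining steps are iteration: widen the table set, tighten filters, and schedule refreshes.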

&lt;p&gt;That gets you to useful production-like data quickly without turning the whole thing into a platform project.&lt;/p&gt;

&lt;h2&gt;Final thought&lt;/h2&gt;

&lt;p&gt;Most teams are not under-testing. They are testing against data that makes them feel safe. That is not the same thing.&lt;/p&gt;

&lt;p&gt;If your local environments and CI pipelines run against tiny, stale, or fake data, they will keep giving you false confidence. Production-like snapshots are one of the highest-leverage ways to make development and testing feel closer to the real system without dragging raw production data everywhere.&lt;/p&gt;

&lt;p&gt;If you want to try this with PostgreSQL, &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Basecut&lt;/a&gt; is free for small teams. Or dig into the &lt;a href="https://docs.basecut.dev/getting-started/quick-start" rel="noopener noreferrer"&gt;quickstart guide&lt;/a&gt; and &lt;a href="https://docs.basecut.dev/core-concepts/how-it-works" rel="noopener noreferrer"&gt;how FK-aware extraction works&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>testing</category>
      <category>devops</category>
      <category>database</category>
    </item>
  </channel>
</rss>
