<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Finny Collins</title>
    <description>The latest articles on Forem by Finny Collins (@finny_collins).</description>
    <link>https://forem.com/finny_collins</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3778295%2F09068bb4-3299-4502-91a2-603a3c2fc684.png</url>
      <title>Forem: Finny Collins</title>
      <link>https://forem.com/finny_collins</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/finny_collins"/>
    <language>en</language>
    <item>
      <title>How to backup MySQL in Docker — 5 strategies that actually work</title>
      <dc:creator>Finny Collins</dc:creator>
      <pubDate>Fri, 03 Apr 2026 07:58:02 +0000</pubDate>
      <link>https://forem.com/finny_collins/how-to-backup-mysql-in-docker-5-strategies-that-actually-work-38nc</link>
      <guid>https://forem.com/finny_collins/how-to-backup-mysql-in-docker-5-strategies-that-actually-work-38nc</guid>
      <description>&lt;p&gt;Running MySQL in Docker is easy to set up. Backing it up properly is where most people stumble. Containers are ephemeral by design, and a &lt;code&gt;docker rm&lt;/code&gt; on the wrong container can wipe your data if you don't have a backup strategy in place. The default Docker setup doesn't do anything to protect your MySQL data beyond a named volume.&lt;/p&gt;

&lt;p&gt;This article walks through five strategies for backing up MySQL in Docker. They range from quick manual dumps to fully automated solutions with remote storage and monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. mysqldump via docker exec
&lt;/h2&gt;

&lt;p&gt;The most common way to back up MySQL in Docker is running &lt;code&gt;mysqldump&lt;/code&gt; inside the container itself. You don't need to expose any ports or install MySQL tools on the host. Docker gives you everything you need with &lt;code&gt;docker exec&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ddfbvt0oohpz3xj0efw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ddfbvt0oohpz3xj0efw.png" alt="MySQL backup in Docker" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the basic command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;mysql-container mysqldump &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="s1"&gt;'yourpassword'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--single-transaction&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--routines&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--triggers&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  mydatabase &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; backup_&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d_%H%M%S&lt;span class="si"&gt;)&lt;/span&gt;.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--single-transaction&lt;/code&gt; flag is critical for InnoDB tables. It takes a consistent snapshot without locking tables, so your application keeps running normally during the backup. The &lt;code&gt;--routines&lt;/code&gt; and &lt;code&gt;--triggers&lt;/code&gt; flags capture stored procedures and triggers that &lt;code&gt;mysqldump&lt;/code&gt; skips by default.&lt;/p&gt;

&lt;p&gt;To back up all databases at once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;mysql-container mysqldump &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="s1"&gt;'yourpassword'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--single-transaction&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--all-databases&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; full_backup_&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d_%H%M%S&lt;span class="si"&gt;)&lt;/span&gt;.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restoring is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; mysql-container mysql &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="s1"&gt;'yourpassword'&lt;/span&gt; mydatabase &amp;lt; backup_20260403_040000.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works well for development and small databases where you're running backups by hand. It's simple, needs no extra setup and gives you a portable SQL file. But it's entirely manual: there's no scheduling, no compression and no remote storage.&lt;/p&gt;
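&lt;p&gt;Since nothing checks the dump for you, it's worth a quick sanity check before trusting a manual backup. One cheap signal: &lt;code&gt;mysqldump&lt;/code&gt; appends a &lt;code&gt;-- Dump completed&lt;/code&gt; trailer line when it finishes cleanly, so a truncated dump is easy to spot. A minimal sketch (the &lt;code&gt;verify_dump&lt;/code&gt; helper name is mine, not a standard tool):&lt;/p&gt;

```shell
# Sanity-check a mysqldump output file: non-empty and ending with the
# "-- Dump completed" trailer that mysqldump writes on a clean exit.
verify_dump() {
  [ -s "$1" ] || { echo "empty or missing: $1"; return 1; }
  tail -n 1 "$1" | grep -q -- '-- Dump completed' || { echo "no completion trailer, dump may be truncated: $1"; return 1; }
  echo "looks complete: $1"
}

# Demo on a stand-in dump file (use your real backup_*.sql instead)
printf 'CREATE TABLE t (id INT);\n-- Dump completed on 2026-04-03\n' > demo.sql
verify_dump demo.sql
```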

&lt;h2&gt;
  
  
  2. mysqldump from the host machine
&lt;/h2&gt;

&lt;p&gt;If your MySQL container exposes its port to the host, you can run &lt;code&gt;mysqldump&lt;/code&gt; from the host machine instead of going through &lt;code&gt;docker exec&lt;/code&gt;. This requires a MySQL client installed locally and a port mapping in your container configuration. It's essentially the same dump operation, just initiated from outside the container.&lt;/p&gt;

&lt;p&gt;Your Docker Compose file needs to map the port:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mysql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mysql:8&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3306:3306"&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;MYSQL_ROOT_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yourpassword&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mysql-data:/var/lib/mysql&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run &lt;code&gt;mysqldump&lt;/code&gt; from the host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mysqldump &lt;span class="nt"&gt;-h&lt;/span&gt; 127.0.0.1 &lt;span class="nt"&gt;-P&lt;/span&gt; 3306 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="s1"&gt;'yourpassword'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--single-transaction&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  mydatabase &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; backup.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach is useful when the host has a different &lt;code&gt;mysqldump&lt;/code&gt; version than the container. Some &lt;code&gt;mysqldump&lt;/code&gt; flags and behaviors change between MySQL versions, and using the host binary lets you control exactly which version runs. It also integrates more naturally with existing backup scripts that already run on the host.&lt;/p&gt;

&lt;p&gt;The tradeoff is port exposure. In development, that's not a concern. In production, make sure port 3306 is bound to localhost only or sits behind a firewall.&lt;/p&gt;
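&lt;p&gt;For example, binding the published port to the loopback interface keeps it reachable from backup scripts on the host but not from other machines:&lt;/p&gt;

```yaml
services:
  mysql:
    ports:
      # Only processes on this host can reach 3306; remote hosts cannot.
      - "127.0.0.1:3306:3306"
```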

&lt;h2&gt;
  
  
  3. Backing up Docker volumes directly
&lt;/h2&gt;

&lt;p&gt;Instead of dumping SQL, you can copy the raw MySQL data files from the Docker volume. This is a file-level (physical) backup. For large databases it can be faster than &lt;code&gt;mysqldump&lt;/code&gt; because you're copying binary files instead of serializing rows into SQL text.&lt;/p&gt;

&lt;p&gt;The critical requirement is that MySQL must be stopped for a consistent copy. Running a file-level backup against a live MySQL instance will almost certainly produce corrupted files.&lt;/p&gt;

&lt;p&gt;Stop the container, copy the volume, then start it again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker stop mysql-container

docker volume inspect mysql-data &lt;span class="nt"&gt;--format&lt;/span&gt; &lt;span class="s1"&gt;'{{ .Mountpoint }}'&lt;/span&gt;

&lt;span class="nb"&gt;sudo cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; /var/lib/docker/volumes/mysql-data/_data ./mysql-volume-backup

docker start mysql-container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're using bind mounts instead of named volumes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker stop mysql-container
&lt;span class="nb"&gt;tar &lt;/span&gt;czf mysql-backup-&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d&lt;span class="si"&gt;)&lt;/span&gt;.tar.gz ./mysql-data/
docker start mysql-container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This copies everything — all databases, user accounts, binary logs and server configuration. Restore means copying files back to the volume and starting the container. It's fast and complete. But the required downtime, even if brief, makes it impractical for production systems that can't afford interruptions.&lt;/p&gt;
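&lt;p&gt;If you'd rather not touch &lt;code&gt;/var/lib/docker&lt;/code&gt; directly (the path varies by setup, and rootless Docker keeps volumes elsewhere), a common alternative is archiving the named volume through a throwaway container. A sketch of that pattern, reusing the container and volume names from the examples above:&lt;/p&gt;

```shell
docker stop mysql-container

# Mount the named volume read-only into a disposable Alpine container
# and tar its contents to the current directory on the host.
docker run --rm \
  -v mysql-data:/volume:ro \
  -v "$(pwd)":/backup \
  alpine tar czf "/backup/mysql-volume-$(date +%Y%m%d).tar.gz" -C /volume .

docker start mysql-container
```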

&lt;h2&gt;
  
  
  4. Cron-based automated mysqldump
&lt;/h2&gt;

&lt;p&gt;The three strategies above are all manual. Someone has to remember to run the command. For production, you need backups running automatically on a schedule without human intervention.&lt;/p&gt;

&lt;p&gt;The classic approach is wrapping &lt;code&gt;mysqldump&lt;/code&gt; in a shell script and scheduling it with cron. Here's a script that handles compression, timestamps and basic retention:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;BACKUP_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/opt/backups/mysql"&lt;/span&gt;
&lt;span class="nv"&gt;CONTAINER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"mysql-container"&lt;/span&gt;
&lt;span class="nv"&gt;DB_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"root"&lt;/span&gt;
&lt;span class="nv"&gt;DB_PASS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"yourpassword"&lt;/span&gt;
&lt;span class="nv"&gt;DATABASE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"mydatabase"&lt;/span&gt;
&lt;span class="nv"&gt;RETENTION_DAYS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;7

&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="nv"&gt;FILENAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_DIR&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DATABASE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d_%H%M%S&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;.sql.gz"&lt;/span&gt;

docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CONTAINER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; mysqldump &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DB_USER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DB_PASS&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--single-transaction&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--routines&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--triggers&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DATABASE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;gzip&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$FILENAME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;$?&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Backup completed: &lt;/span&gt;&lt;span class="nv"&gt;$FILENAME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;else
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Backup failed!"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&amp;amp;2
  &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi

&lt;/span&gt;find &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s2"&gt;"*.sql.gz"&lt;/span&gt; &lt;span class="nt"&gt;-mtime&lt;/span&gt; +&lt;span class="nv"&gt;$RETENTION_DAYS&lt;/span&gt; &lt;span class="nt"&gt;-delete&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Schedule it with cron to run daily at 4 AM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;0 4 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; /opt/scripts/mysql-backup.sh &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /var/log/mysql-backup.log 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gets the job done for a single server with a single database. But it has real limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No alerting when backups fail silently — you won't know unless you check logs&lt;/li&gt;
&lt;li&gt;No built-in remote storage — backups live and die with the server&lt;/li&gt;
&lt;li&gt;Managing multiple databases means duplicating and maintaining separate scripts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a small side project, this might be enough. For anything you'd lose sleep over, the gaps start to matter.&lt;/p&gt;
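&lt;p&gt;One way to build confidence in the retention step before cron runs it unattended: exercise the same &lt;code&gt;find&lt;/code&gt; expression against throwaway files, with &lt;code&gt;-print&lt;/code&gt; in place of &lt;code&gt;-delete&lt;/code&gt; first. A small self-contained demo (paths are examples):&lt;/p&gt;

```shell
# Simulate a backup dir with one stale and one fresh archive.
BACKUP_DIR=$(mktemp -d)
RETENTION_DAYS=7
touch -d "10 days ago" "$BACKUP_DIR/old.sql.gz"   # past retention
touch "$BACKUP_DIR/fresh.sql.gz"                  # within retention

# Dry run: -print shows what -delete would remove.
find "$BACKUP_DIR" -name "*.sql.gz" -mtime +"$RETENTION_DAYS" -print

# The real cleanup, same expression with -delete.
find "$BACKUP_DIR" -name "*.sql.gz" -mtime +"$RETENTION_DAYS" -delete
```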

&lt;h2&gt;
  
  
  5. Automated backup with Databasus
&lt;/h2&gt;

&lt;p&gt;Databasus is a dedicated &lt;a href="https://databasus.com/mysql-backup" rel="noopener noreferrer"&gt;MySQL backup&lt;/a&gt; tool built for exactly this job. It handles scheduling, compression, remote storage, encryption and monitoring through a web interface. No shell scripts to maintain, no cron jobs to debug.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install Databasus
&lt;/h3&gt;

&lt;p&gt;With Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; databasus &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 4005:4005 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ./databasus-data:/databasus-data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restart&lt;/span&gt; unless-stopped &lt;span class="se"&gt;\&lt;/span&gt;
  databasus/databasus:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with Docker Compose. Create a &lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;databasus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;databasus&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;databasus/databasus:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4005:4005"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./databasus-data:/databasus-data&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Create your first backup
&lt;/h3&gt;

&lt;p&gt;Open &lt;code&gt;http://localhost:4005&lt;/code&gt; in your browser and follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Add your database.&lt;/strong&gt; Click "New Database" and enter your MySQL connection details — host, port, username and password. Databasus validates the connection before saving.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select storage.&lt;/strong&gt; Choose where backups should go. Databasus supports local disk, S3, Cloudflare R2, Google Drive, SFTP and other targets through Rclone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select schedule.&lt;/strong&gt; Pick a backup frequency — hourly, daily, weekly, monthly or a custom cron expression. Set the exact time you want backups to run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Click "Create backup."&lt;/strong&gt; Databasus validates your configuration and starts running backups on the schedule you defined. You'll get notifications through Slack, Telegram, email or Discord if something goes wrong.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Databasus also supports retention policies including time-based, count-based and GFS (Grandfather-Father-Son) for layered long-term history. Backup files are encrypted with AES-256-GCM. For teams and enterprise users, there are workspaces with role-based access control and audit logging to track who did what across your backup infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparing the 5 strategies
&lt;/h2&gt;

&lt;p&gt;Each strategy fits a different situation. Here's how they stack up across the features that matter most when your data is on the line:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Setup effort&lt;/th&gt;
&lt;th&gt;Automated&lt;/th&gt;
&lt;th&gt;Compression&lt;/th&gt;
&lt;th&gt;Remote storage&lt;/th&gt;
&lt;th&gt;Monitoring&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;mysqldump via docker exec&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mysqldump from host&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker volume backup&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cron + mysqldump script&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Script-based&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Databasus&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first three strategies are good for manual, one-off backups during development or emergencies. Strategy 4 adds scheduling but leaves you responsible for everything else. Strategy 5 covers the full picture without custom scripting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common mistakes when backing up MySQL in Docker
&lt;/h2&gt;

&lt;p&gt;Even with a solid strategy in place, there are recurring mistakes that catch people off guard. These aren't edge cases. They show up in production incidents regularly and they're all preventable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Skipping &lt;code&gt;--single-transaction&lt;/code&gt;.&lt;/strong&gt; Without it, &lt;code&gt;mysqldump&lt;/code&gt; acquires table-level locks during the dump. Your application stalls while the backup runs. For InnoDB tables this flag gives you a consistent snapshot without blocking writes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Never testing restores.&lt;/strong&gt; A backup you've never restored is a backup you can't trust. Schedule periodic test restores on a throwaway environment. It takes 10 minutes and can save you hours during a real incident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keeping backups only on the database server.&lt;/strong&gt; If the server goes down, backups go with it. Always store at least one copy on remote storage — S3, a second VPS, anything off the same machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Running file-level copies on a live MySQL instance.&lt;/strong&gt; Copying data files while MySQL is running almost always produces corrupted backups. Stop the container first or use a dump-based approach instead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storing database credentials in plain text.&lt;/strong&gt; Backup scripts often contain passwords in the clear. Use environment variables, Docker secrets or a credentials file with restricted permissions instead.&lt;/li&gt;
&lt;/ul&gt;
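&lt;p&gt;On that last point, a lightweight pattern is keeping the password in a file only the backup user can read, and having the script refuse to run otherwise. A sketch; the file name is an example, and &lt;code&gt;stat -c&lt;/code&gt; is the GNU coreutils form:&lt;/p&gt;

```shell
# Create the credentials file with owner-only permissions (umask 077 makes it 600).
CRED_FILE="./mysql-backup.pass"
umask 077
printf '%s\n' 'yourpassword' > "$CRED_FILE"

# Refuse to use a credentials file that other users can read.
PERMS=$(stat -c '%a' "$CRED_FILE")
if [ "$PERMS" != "600" ]; then
  echo "credentials file $CRED_FILE has mode $PERMS, expected 600"
  exit 1
fi

DB_PASS=$(cat "$CRED_FILE")
echo "credentials loaded"
```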

&lt;h2&gt;
  
  
  Which strategy should you pick?
&lt;/h2&gt;

&lt;p&gt;The right approach depends on what you're protecting and how much maintenance you're willing to take on. Here's a rough guide:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Recommended strategy&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local development&lt;/td&gt;
&lt;td&gt;mysqldump via docker exec&lt;/td&gt;
&lt;td&gt;Quick, no setup overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Staging environment&lt;/td&gt;
&lt;td&gt;Cron + mysqldump&lt;/td&gt;
&lt;td&gt;Basic automation, acceptable risk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Small production database&lt;/td&gt;
&lt;td&gt;Databasus&lt;/td&gt;
&lt;td&gt;Monitoring and remote storage matter once data matters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large production database&lt;/td&gt;
&lt;td&gt;Databasus&lt;/td&gt;
&lt;td&gt;Built-in compression and storage integration at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Team or enterprise&lt;/td&gt;
&lt;td&gt;Databasus&lt;/td&gt;
&lt;td&gt;Access management, audit logs and role-based permissions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For anything you'd actually need to recover from, automate your backups and store them somewhere other than the database server. That's the principle that matters most, regardless of which specific tool you choose.&lt;/p&gt;

</description>
      <category>database</category>
      <category>mysql</category>
    </item>
    <item>
      <title>7 PostgreSQL extensions that will supercharge your database in 2026</title>
      <dc:creator>Finny Collins</dc:creator>
      <pubDate>Thu, 02 Apr 2026 05:58:50 +0000</pubDate>
      <link>https://forem.com/finny_collins/7-postgresql-extensions-that-will-supercharge-your-database-in-2026-1ab6</link>
      <guid>https://forem.com/finny_collins/7-postgresql-extensions-that-will-supercharge-your-database-in-2026-1ab6</guid>
      <description>&lt;p&gt;PostgreSQL ships with a solid set of features out of the box. But where it really pulls ahead of other databases is extensibility. You can bolt on entirely new data types, index methods, background workers and query planners without switching to a different database engine. The extension ecosystem has grown a lot over the past few years, and some of the options available today are genuinely impressive.&lt;/p&gt;

&lt;p&gt;Here are seven extensions worth knowing about in 2026 — whether you're running a side project or managing production infrastructure at scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7v3tpy1q9na2fi4tagt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7v3tpy1q9na2fi4tagt.png" alt="PostgreSQL extensions" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  pgvector — vector similarity search
&lt;/h2&gt;

&lt;p&gt;If you've done any work with embeddings, recommendations or semantic search, you've probably run into the question of where to store and query vectors. A lot of teams reach for a dedicated vector database. But if your data already lives in PostgreSQL, adding a separate system creates sync headaches and operational overhead that you probably don't need.&lt;/p&gt;

&lt;p&gt;pgvector adds native vector column types and similarity search operators directly to PostgreSQL. You store embeddings alongside your relational data and query them with standard SQL. No extra infrastructure, no data synchronization pipelines.&lt;/p&gt;

&lt;p&gt;The extension supports multiple distance functions and index types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;L2 (Euclidean) distance for spatial and numeric similarity&lt;/li&gt;
&lt;li&gt;Cosine distance for text embeddings and NLP&lt;/li&gt;
&lt;li&gt;Inner product for recommendation systems&lt;/li&gt;
&lt;li&gt;HNSW and IVFFlat indexes for fast approximate nearest neighbor search&lt;/li&gt;
&lt;li&gt;Exact nearest neighbor search when precision matters more than speed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The typical workflow looks like this. You create a table with a &lt;code&gt;vector&lt;/code&gt; column, insert your embeddings from whatever model you use, then query with &lt;code&gt;ORDER BY embedding &amp;lt;=&amp;gt; query_vector LIMIT 10&lt;/code&gt;. It feels like regular SQL because it is regular SQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_cosine_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'[0.1, 0.2, ...]'&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Performance is solid for most workloads. For millions of vectors, HNSW indexes handle queries in single-digit milliseconds. It won't replace a dedicated vector database for billion-scale datasets, but the vast majority of applications don't operate at that scale anyway.&lt;/p&gt;

&lt;p&gt;pgvector has become the go-to choice for teams that want vector search without the operational cost of running a separate system.&lt;/p&gt;

&lt;h2&gt;
  
  
  TimescaleDB — time-series data at scale
&lt;/h2&gt;

&lt;p&gt;Time-series data shows up everywhere. Server metrics, IoT sensor readings, financial ticks, application events. The volume tends to grow fast, and the query patterns are different from typical OLTP workloads — you're usually aggregating over time windows, downsampling or running continuous computations.&lt;/p&gt;

&lt;p&gt;TimescaleDB extends PostgreSQL with hypertables that automatically partition data by time. You interact with them through normal SQL, but under the hood the extension handles chunking, compression and retention policies. Inserts stay fast even as the table grows to billions of rows because each chunk is a manageable size.&lt;/p&gt;

&lt;p&gt;Compression is one of the standout features. TimescaleDB can compress time-series data by 90-95%, which makes a real difference when you're storing months or years of high-frequency data. Compressed chunks are still queryable — you don't have to decompress them first.&lt;/p&gt;

&lt;p&gt;Continuous aggregates let you precompute rollups (hourly averages, daily maximums) that refresh automatically as new data arrives. This saves you from writing and maintaining materialized view refresh logic yourself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt; &lt;span class="nb"&gt;PRECISION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;humidity&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt; &lt;span class="nb"&gt;PRECISION&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;create_hypertable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'metrics'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;by_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'time'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;add_compression_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'metrics'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'7 days'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;add_retention_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'metrics'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 year'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
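&lt;p&gt;A continuous aggregate on the &lt;code&gt;metrics&lt;/code&gt; hypertable above might look like this — the view name, bucket width and refresh offsets are illustrative, not prescriptive:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- hourly rollup that refreshes incrementally as new rows arrive
CREATE MATERIALIZED VIEW metrics_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       device_id,
       avg(temperature) AS avg_temp
FROM metrics
GROUP BY bucket, device_id;

-- keep the rollup up to date on a schedule
SELECT add_continuous_aggregate_policy('metrics_hourly',
    start_offset =&amp;gt; INTERVAL '3 hours',
    end_offset =&amp;gt; INTERVAL '1 hour',
    schedule_interval =&amp;gt; INTERVAL '1 hour');
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;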



&lt;p&gt;If you're currently shoehorning time-series data into regular PostgreSQL tables and struggling with query performance or storage costs, TimescaleDB is probably the first thing to try.&lt;/p&gt;

&lt;h2&gt;
  
  
  PostGIS — geospatial data
&lt;/h2&gt;

&lt;p&gt;PostGIS has been around for over two decades, and it's still the most capable open-source geospatial database extension available. It turns PostgreSQL into a full-featured geographic information system with support for geometry, geography, raster data and topology.&lt;/p&gt;

&lt;p&gt;The practical applications are broad. Store and query locations, calculate distances, find points within polygons, route between coordinates, analyze spatial relationships. If your application deals with maps, addresses, delivery zones, geofencing or any kind of location data, PostGIS handles it.&lt;/p&gt;

&lt;p&gt;What sets PostGIS apart from simpler spatial solutions is the depth of its spatial function library. Over 300 functions cover everything from basic distance calculations to complex geometric operations, spatial joins and 3D analysis. It implements the OGC Simple Features standard and integrates with tools like QGIS, GeoServer and MapServer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;stores&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;location&lt;/span&gt; &lt;span class="n"&gt;GEOGRAPHY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;POINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4326&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;stores&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;GIST&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;location&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- find stores within 5 km&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ST_Distance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ST_MakePoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;73&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;73&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;geography&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;distance_m&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;stores&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ST_DWithin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ST_MakePoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;73&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;73&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;geography&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;distance_m&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostGIS is mature, well-documented and widely used in production. Governments, logistics companies and mapping platforms rely on it daily. If you need spatial capabilities, there's really nothing else in the PostgreSQL ecosystem that comes close.&lt;/p&gt;

&lt;h2&gt;
  
  
  pg_cron — in-database job scheduling
&lt;/h2&gt;

&lt;p&gt;There's a common pattern where you need to run periodic database tasks — purging old records, refreshing materialized views, computing aggregates, vacuuming specific tables. The typical approach is to set up an external cron job or a separate scheduler service that connects to the database and runs the query.&lt;/p&gt;

&lt;p&gt;pg_cron lets you schedule these jobs directly inside PostgreSQL using familiar cron syntax. No external scheduler needed. Jobs run as background workers within the database server itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- refresh a materialized view every hour&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;cron&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'refresh-dashboard'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'0 * * * *'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'REFRESH MATERIALIZED VIEW CONCURRENTLY dashboard_stats'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- purge old log entries every night at 3 AM&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;cron&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'purge-logs'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'0 3 * * *'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'DELETE FROM application_logs WHERE created_at &amp;lt; now() - interval &lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;30 days&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- check scheduled jobs&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cron&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's simple and it works. You define a schedule, point it at a SQL statement, and pg_cron runs it. You can list jobs, check execution history and unschedule tasks with straightforward function calls.&lt;/p&gt;
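&lt;p&gt;A couple of those management calls, using the job names scheduled above (&lt;code&gt;cron.job_run_details&lt;/code&gt; is available in recent pg_cron versions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- remove a job by name
SELECT cron.unschedule('purge-logs');

-- inspect recent executions and their outcomes
SELECT jobid, status, start_time, end_time
FROM cron.job_run_details
ORDER BY start_time DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;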

&lt;p&gt;One thing to keep in mind — pg_cron runs at most one instance of a given job at a time, so a long-running job delays its next scheduled run. For heavy ETL workloads you might still want an external orchestrator. But for routine maintenance tasks, it removes a piece of infrastructure that you'd otherwise have to manage separately.&lt;/p&gt;

&lt;p&gt;For teams that want their &lt;a href="https://databasus.com" rel="noopener noreferrer"&gt;PostgreSQL backup&lt;/a&gt; process managed with the same simplicity — scheduled, compressed and sent to remote storage automatically — Databasus handles logical, physical and incremental backups through a web interface, with no scripts to maintain.&lt;/p&gt;

&lt;h2&gt;
  
  
  pg_stat_statements — query performance monitoring
&lt;/h2&gt;

&lt;p&gt;If you're running PostgreSQL in production and not using pg_stat_statements, you're flying blind. This extension tracks execution statistics for every SQL statement that runs against your database. It's bundled with PostgreSQL itself, so there's nothing extra to install — you just need to enable it.&lt;/p&gt;

&lt;p&gt;Once active, it records how many times each query ran, total and average execution time, rows returned, buffer hits versus disk reads, and more. This data is invaluable for identifying slow queries, spotting regressions after deployments and understanding your actual workload patterns.&lt;/p&gt;

&lt;p&gt;The key metrics it tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total and mean execution time per query&lt;/li&gt;
&lt;li&gt;Number of calls (how often each query runs)&lt;/li&gt;
&lt;li&gt;Rows returned per execution&lt;/li&gt;
&lt;li&gt;Shared buffer hits vs reads (cache effectiveness)&lt;/li&gt;
&lt;li&gt;WAL generation per query (write impact)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- top 10 slowest queries by total time&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_exec_time&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_time_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean_exec_time&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_time_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_exec_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A common workflow is to reset statistics after a deployment (&lt;code&gt;SELECT pg_stat_statements_reset()&lt;/code&gt;) and then check back after a few hours to see if any new query patterns emerged. It's also useful for capacity planning — if your top query's call count doubled last month, you know what's driving load growth.&lt;/p&gt;

&lt;p&gt;The extension normalizes queries by replacing literal values with placeholders, so &lt;code&gt;SELECT * FROM users WHERE id = 5&lt;/code&gt; and &lt;code&gt;SELECT * FROM users WHERE id = 42&lt;/code&gt; show up as a single entry. This gives you a clean view of query patterns rather than millions of individual executions.&lt;/p&gt;

&lt;p&gt;Enabling it requires adding &lt;code&gt;pg_stat_statements&lt;/code&gt; to &lt;code&gt;shared_preload_libraries&lt;/code&gt; in &lt;code&gt;postgresql.conf&lt;/code&gt; and restarting the server. Worth the 30-second setup.&lt;/p&gt;
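&lt;p&gt;Concretely, that's one line in &lt;code&gt;postgresql.conf&lt;/code&gt;, a restart, and one SQL statement (if other libraries are already preloaded, append to the existing comma-separated list rather than replacing it):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;shared_preload_libraries = 'pg_stat_statements'
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- after the restart, in the database you want to monitor
CREATE EXTENSION pg_stat_statements;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;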

&lt;h2&gt;
  
  
  pg_partman — automatic table partitioning
&lt;/h2&gt;

&lt;p&gt;PostgreSQL has had native declarative partitioning since version 10. But managing partitions by hand gets tedious fast. You need to create new partitions ahead of time, detach old ones when they age out and make sure there are always enough future partitions ready. Miss a partition creation, and inserts start failing.&lt;/p&gt;

&lt;p&gt;pg_partman automates all of this. You tell it how you want the table partitioned — by time range, by integer range or by a list of values — and it handles partition creation, maintenance and optional cleanup on a schedule.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="n"&gt;JSONB&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;RANGE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;partman&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_parent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;p_parent_table&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'public.events'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p_control&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'created_at'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p_interval&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'daily'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p_premake&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates daily partitions and keeps 7 future partitions ready at all times. The background worker takes care of creating new partitions and optionally dropping old ones based on your retention settings.&lt;/p&gt;
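&lt;p&gt;Retention is configured per parent table in &lt;code&gt;partman.part_config&lt;/code&gt;. A sketch, assuming the &lt;code&gt;events&lt;/code&gt; table above and a 90-day retention window:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- drop partitions older than 90 days during maintenance runs
UPDATE partman.part_config
SET retention = '90 days',
    retention_keep_table = false
WHERE parent_table = 'public.events';

-- normally run by the background worker; can be invoked manually
SELECT partman.run_maintenance();
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;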

&lt;p&gt;The practical benefits show up at scale. Queries that filter by the partition key skip irrelevant partitions entirely — a query for yesterday's events doesn't touch last month's data. Maintenance operations like VACUUM and REINDEX run per-partition instead of locking the whole table. And dropping old data is instant because you're detaching and dropping a partition rather than deleting millions of rows.&lt;/p&gt;

&lt;p&gt;If you have tables with tens of millions of rows that grow over time, pg_partman is one of those extensions that pays for itself quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Citus — horizontal scaling
&lt;/h2&gt;

&lt;p&gt;At some point, a single PostgreSQL server hits its limits. The dataset outgrows available RAM, write throughput maxes out, or analytical queries on large tables take too long even with good indexes. Citus distributes your PostgreSQL database across multiple nodes while keeping the SQL interface you already know.&lt;/p&gt;

&lt;p&gt;The core idea is sharding. You pick a distribution column (usually a tenant ID or some natural partition key), and Citus spreads the data across worker nodes. Queries that include the distribution column get routed to the right shard. Aggregation queries run in parallel across all nodes and results get merged.&lt;/p&gt;

&lt;p&gt;There are a few scenarios where Citus makes sense over single-node PostgreSQL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-tenant SaaS applications where each tenant's data is independent&lt;/li&gt;
&lt;li&gt;Real-time analytics dashboards that aggregate across large datasets&lt;/li&gt;
&lt;li&gt;High write throughput workloads that exceed single-node IOPS&lt;/li&gt;
&lt;li&gt;Large reference tables that need to be joined with distributed data
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- distribute a table by tenant&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;create_distributed_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'events'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'tenant_id'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- queries with tenant_id are routed to a single shard&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;tenant_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- aggregations run in parallel across all nodes&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'1 hour'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
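&lt;p&gt;Reference tables, mentioned in the list above, get their own call: the table is copied in full to every worker, so joins with distributed tables stay local. The &lt;code&gt;plans&lt;/code&gt; table and &lt;code&gt;plan_id&lt;/code&gt; column here are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- replicate a small lookup table to every worker node
SELECT create_reference_table('plans');

-- joins between distributed and reference tables run locally on each worker
SELECT e.tenant_id, p.name, count(*)
FROM events e JOIN plans p ON p.id = e.plan_id
GROUP BY e.tenant_id, p.name;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;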



&lt;p&gt;Citus isn't the right choice for every workload. Cross-shard joins on non-distribution columns can be expensive. Schema changes require coordination across nodes. And the operational complexity increases compared to a single server. But when your data genuinely outgrows one machine, it lets you scale horizontally without rewriting your application for a different database.&lt;/p&gt;

&lt;h2&gt;
  
  
  How these extensions compare
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Extension&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Primary use case&lt;/th&gt;
&lt;th&gt;Installation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;pgvector&lt;/td&gt;
&lt;td&gt;AI/ML&lt;/td&gt;
&lt;td&gt;Vector similarity search and embeddings&lt;/td&gt;
&lt;td&gt;CREATE EXTENSION&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TimescaleDB&lt;/td&gt;
&lt;td&gt;Time-series&lt;/td&gt;
&lt;td&gt;High-volume time-stamped data&lt;/td&gt;
&lt;td&gt;Separate package&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostGIS&lt;/td&gt;
&lt;td&gt;Geospatial&lt;/td&gt;
&lt;td&gt;Location data and spatial queries&lt;/td&gt;
&lt;td&gt;CREATE EXTENSION&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pg_cron&lt;/td&gt;
&lt;td&gt;Scheduling&lt;/td&gt;
&lt;td&gt;In-database job scheduling&lt;/td&gt;
&lt;td&gt;shared_preload_libraries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pg_stat_statements&lt;/td&gt;
&lt;td&gt;Monitoring&lt;/td&gt;
&lt;td&gt;Query performance tracking&lt;/td&gt;
&lt;td&gt;shared_preload_libraries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pg_partman&lt;/td&gt;
&lt;td&gt;Partitioning&lt;/td&gt;
&lt;td&gt;Automatic table partition management&lt;/td&gt;
&lt;td&gt;CREATE EXTENSION&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Citus&lt;/td&gt;
&lt;td&gt;Scaling&lt;/td&gt;
&lt;td&gt;Horizontal sharding across nodes&lt;/td&gt;
&lt;td&gt;Separate package&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Installation method&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;th&gt;Restart needed&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CREATE EXTENSION&lt;/td&gt;
&lt;td&gt;Installed via SQL, loads on demand&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;pgvector, PostGIS, pg_partman&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;shared_preload_libraries&lt;/td&gt;
&lt;td&gt;Must be added to postgresql.conf&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;pg_stat_statements, pg_cron&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Separate package&lt;/td&gt;
&lt;td&gt;Requires its own package or Docker image&lt;/td&gt;
&lt;td&gt;Depends on setup&lt;/td&gt;
&lt;td&gt;TimescaleDB, Citus&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Picking the right extensions for your stack
&lt;/h2&gt;

&lt;p&gt;Not every project needs all seven of these. A small web application might only benefit from pg_stat_statements and maybe pg_cron. A data-heavy SaaS product might need TimescaleDB or Citus. The right set depends on your actual problems, not on what sounds impressive.&lt;/p&gt;

&lt;p&gt;Start with the ones that address a pain point you already have. If your queries are slow and you don't know why, enable pg_stat_statements first. If you're building anything with location data, PostGIS is a no-brainer. If your AI features currently call out to a separate vector store, try pgvector and see if you can simplify.&lt;/p&gt;

&lt;p&gt;The nice thing about PostgreSQL extensions is that most of them play well together. You can run pgvector and TimescaleDB and PostGIS in the same database. They operate on different data types and don't step on each other's toes.&lt;/p&gt;

&lt;p&gt;Whatever extensions you end up using, make sure your monitoring and backup strategy keeps up with the added complexity. Extensions that add new data types or storage engines sometimes need specific handling during backup and restore. Getting that right early saves you from unpleasant surprises later.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
    </item>
    <item>
      <title>PostgreSQL logical replication — 5 steps to set up real-time data sync across servers</title>
      <dc:creator>Finny Collins</dc:creator>
      <pubDate>Wed, 01 Apr 2026 10:43:57 +0000</pubDate>
      <link>https://forem.com/finny_collins/postgresql-logical-replication-5-steps-to-set-up-real-time-data-sync-across-servers-29l3</link>
      <guid>https://forem.com/finny_collins/postgresql-logical-replication-5-steps-to-set-up-real-time-data-sync-across-servers-29l3</guid>
      <description>&lt;p&gt;PostgreSQL has had logical replication built in since version 10, and it remains one of the most practical ways to keep data in sync between two or more servers. Unlike physical replication, which copies the entire database cluster byte-for-byte, logical replication works at the row level. You pick which tables to replicate, and PostgreSQL streams the changes in real time.&lt;/p&gt;

&lt;p&gt;This article walks through the full setup in five steps. By the end you'll have a working publisher-subscriber pair and know how to monitor it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxh2m7kqk21b2ux403ph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxh2m7kqk21b2ux403ph.png" alt="PostgreSQL logical replication"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is logical replication and when do you need it
&lt;/h2&gt;

&lt;p&gt;Logical replication uses a publish-subscribe model. The source database (publisher) defines a publication — a set of tables whose changes should be broadcast. The target database (subscriber) creates a subscription that connects to the publisher, pulls the initial data snapshot and then receives a continuous stream of INSERT, UPDATE and DELETE operations.&lt;/p&gt;

&lt;p&gt;The key difference from physical (streaming) replication is granularity. Physical replication mirrors the entire cluster, operates at the WAL byte level and requires identical PostgreSQL major versions on both sides. Logical replication works per-table, decodes WAL into logical change events and allows different major versions or even different schemas on the subscriber.&lt;/p&gt;

&lt;p&gt;This makes logical replication useful in a range of scenarios that physical replication simply cannot handle.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Logical replication&lt;/th&gt;
&lt;th&gt;Physical replication&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Granularity&lt;/td&gt;
&lt;td&gt;Per-table&lt;/td&gt;
&lt;td&gt;Entire cluster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-version support&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (same major version required)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subscriber writes&lt;/td&gt;
&lt;td&gt;Allowed&lt;/td&gt;
&lt;td&gt;Read-only standby&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema differences&lt;/td&gt;
&lt;td&gt;Allowed (column subset)&lt;/td&gt;
&lt;td&gt;Must be identical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DDL replication&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (WAL-level)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical use case&lt;/td&gt;
&lt;td&gt;Selective sync, migrations, consolidation&lt;/td&gt;
&lt;td&gt;High availability, failover&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Use logical replication when you need to replicate a subset of tables, consolidate data from multiple sources into one reporting database, migrate between PostgreSQL major versions with minimal downtime or feed changes into a data warehouse alongside normal writes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before you start, make sure you have the following in place:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Two PostgreSQL instances running version 10 or later (publisher and subscriber). They can be on different major versions&lt;/li&gt;
&lt;li&gt;Network connectivity between the two servers on the PostgreSQL port (default 5432)&lt;/li&gt;
&lt;li&gt;A superuser or a user with the &lt;code&gt;REPLICATION&lt;/code&gt; role on the publisher&lt;/li&gt;
&lt;li&gt;The tables you want to replicate must have a primary key or a &lt;code&gt;REPLICA IDENTITY&lt;/code&gt; set. Without one, UPDATE and DELETE operations on the published table will fail on the publisher&lt;/li&gt;
&lt;/ul&gt;
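&lt;p&gt;If a table genuinely has no primary key or unique index, you can fall back to a full replica identity (the table name below is illustrative). Be aware that this writes the entire old row to the WAL on every update and delete, so a primary key remains the better option:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- last resort for tables without a primary key or unique index
ALTER TABLE audit_log REPLICA IDENTITY FULL;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;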

&lt;p&gt;If both instances are on the same machine for testing, just use different ports. The rest of the setup is identical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 — configure the publisher database
&lt;/h2&gt;

&lt;p&gt;Logical replication requires the WAL level to be set to &lt;code&gt;logical&lt;/code&gt;. By default PostgreSQL uses &lt;code&gt;replica&lt;/code&gt;, so you need to change this in &lt;code&gt;postgresql.conf&lt;/code&gt; on the publisher.&lt;/p&gt;

&lt;p&gt;Open the config file and set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;wal_level&lt;/span&gt; = &lt;span class="n"&gt;logical&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You also need at least one replication slot available. Check that &lt;code&gt;max_replication_slots&lt;/code&gt; is high enough. The default of 10 is fine for most setups, but if you already use slots for physical replication or other subscribers, increase it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;max_replication_slots&lt;/span&gt; = &lt;span class="m"&gt;10&lt;/span&gt;
&lt;span class="n"&gt;max_wal_senders&lt;/span&gt; = &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, update &lt;code&gt;pg_hba.conf&lt;/code&gt; to allow the subscriber to connect with replication privileges. Add a line like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;host&lt;/span&gt;    &lt;span class="n"&gt;all&lt;/span&gt;    &lt;span class="n"&gt;replication_user&lt;/span&gt;    &lt;span class="m"&gt;192&lt;/span&gt;.&lt;span class="m"&gt;168&lt;/span&gt;.&lt;span class="m"&gt;1&lt;/span&gt;.&lt;span class="m"&gt;0&lt;/span&gt;/&lt;span class="m"&gt;24&lt;/span&gt;    &lt;span class="n"&gt;md5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace the IP range with your subscriber's actual address. After editing both files, restart PostgreSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart postgresql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can verify the WAL level is active by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;wal_level&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It should return &lt;code&gt;logical&lt;/code&gt;. If it still shows &lt;code&gt;replica&lt;/code&gt;, the restart didn't pick up the config change. Double-check the file path and try again.&lt;/p&gt;
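
&lt;p&gt;The &lt;code&gt;pg_hba.conf&lt;/code&gt; entry above assumes a &lt;code&gt;replication_user&lt;/code&gt; role already exists. If it doesn't, one way to create it on the publisher (role name and password here are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE ROLE replication_user WITH REPLICATION LOGIN PASSWORD 'secret';
GRANT SELECT ON users, orders TO replication_user;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;SELECT&lt;/code&gt; grant lets the role read the published tables during the initial data copy.&lt;/p&gt;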

&lt;h2&gt;
  
  
  Step 2 — create a publication
&lt;/h2&gt;

&lt;p&gt;On the publisher database, create a publication for the tables you want to replicate. Connect to the database as a superuser or the replication user and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;PUBLICATION&lt;/span&gt; &lt;span class="n"&gt;my_publication&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This publishes changes from the &lt;code&gt;users&lt;/code&gt; and &lt;code&gt;orders&lt;/code&gt; tables. If you want to publish all tables in the database, use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;PUBLICATION&lt;/span&gt; &lt;span class="n"&gt;my_publication&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also control which operations are published. By default all of them (INSERT, UPDATE, DELETE, TRUNCATE) are included. To limit it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;PUBLICATION&lt;/span&gt; &lt;span class="n"&gt;my_publication&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;publish&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'insert,update'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To check what publications exist:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_publication&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And to see which tables are in a publication:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_publication_tables&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Publications are lightweight metadata objects. Creating one doesn't start any replication by itself. That happens when a subscriber connects.&lt;/p&gt;
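
&lt;p&gt;Publications can also be changed after creation. For example, to add a table (using a hypothetical &lt;code&gt;products&lt;/code&gt; table) or remove one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;ALTER PUBLICATION my_publication ADD TABLE products;
ALTER PUBLICATION my_publication DROP TABLE orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Existing subscribers don't pick up membership changes automatically. Run &lt;code&gt;ALTER SUBSCRIPTION my_subscription REFRESH PUBLICATION;&lt;/code&gt; on each subscriber afterwards.&lt;/p&gt;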

&lt;h2&gt;
  
  
  Step 3 — prepare the subscriber database
&lt;/h2&gt;

&lt;p&gt;The subscriber needs to have the same table structure as the publisher. Logical replication does not copy DDL, so you must create the tables manually before starting.&lt;/p&gt;

&lt;p&gt;The easiest way is to dump the schema from the publisher and restore it on the subscriber:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pg_dump &lt;span class="nt"&gt;-h&lt;/span&gt; publisher_host &lt;span class="nt"&gt;-U&lt;/span&gt; postgres &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="nb"&gt;users&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; orders mydb &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; schema.sql
psql &lt;span class="nt"&gt;-h&lt;/span&gt; subscriber_host &lt;span class="nt"&gt;-U&lt;/span&gt; postgres &lt;span class="nt"&gt;-d&lt;/span&gt; mydb &lt;span class="nt"&gt;-f&lt;/span&gt; schema.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;-s&lt;/code&gt; flag dumps schema only, no data. The initial data will come through the replication snapshot when the subscription is created.&lt;/p&gt;

&lt;p&gt;Make sure the target tables are empty. If they already contain data, you'll get duplicate key errors during the initial sync. Either truncate them or create the subscription with &lt;code&gt;copy_data = false&lt;/code&gt; (more on that in the next step).&lt;/p&gt;

&lt;p&gt;The subscriber user needs permissions to write to these tables. If you're using the same superuser, that's already covered.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4 — create a subscription
&lt;/h2&gt;

&lt;p&gt;On the subscriber database, create a subscription that points to the publisher:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;SUBSCRIPTION&lt;/span&gt; &lt;span class="n"&gt;my_subscription&lt;/span&gt;
    &lt;span class="k"&gt;CONNECTION&lt;/span&gt; &lt;span class="s1"&gt;'host=publisher_host port=5432 dbname=mydb user=replication_user password=secret'&lt;/span&gt;
    &lt;span class="n"&gt;PUBLICATION&lt;/span&gt; &lt;span class="n"&gt;my_publication&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As soon as you run this, PostgreSQL will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Connect to the publisher&lt;/li&gt;
&lt;li&gt;Create a replication slot on the publisher&lt;/li&gt;
&lt;li&gt;Copy the initial table data (snapshot)&lt;/li&gt;
&lt;li&gt;Start streaming live changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the tables on the subscriber already have data and you only want to start streaming from now, skip the initial copy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;SUBSCRIPTION&lt;/span&gt; &lt;span class="n"&gt;my_subscription&lt;/span&gt;
    &lt;span class="k"&gt;CONNECTION&lt;/span&gt; &lt;span class="s1"&gt;'host=publisher_host port=5432 dbname=mydb user=replication_user password=secret'&lt;/span&gt;
    &lt;span class="n"&gt;PUBLICATION&lt;/span&gt; &lt;span class="n"&gt;my_publication&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;copy_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To check the subscription status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_subscription&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see a row with a non-null &lt;code&gt;pid&lt;/code&gt;, which means the subscription worker is running and connected.&lt;/p&gt;
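
&lt;p&gt;It's also worth checking the &lt;code&gt;pg_subscription&lt;/code&gt; catalog, since a subscription can exist but be disabled, in which case no worker runs at all:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT subname, subenabled FROM pg_subscription;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;subenabled&lt;/code&gt; is false, start it with &lt;code&gt;ALTER SUBSCRIPTION my_subscription ENABLE;&lt;/code&gt;.&lt;/p&gt;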

&lt;h2&gt;
  
  
  Step 5 — verify and monitor replication
&lt;/h2&gt;

&lt;p&gt;After creating the subscription, verify that data is actually flowing. Insert a row on the publisher and check if it appears on the subscriber:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- On publisher&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'test_user'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'test@example.com'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- On subscriber&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'test_user'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the row shows up, replication is working. For ongoing monitoring, these views are your main tools.&lt;/p&gt;

&lt;p&gt;On the &lt;strong&gt;publisher&lt;/strong&gt;, check replication slot status and lag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;slot_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;restart_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;confirmed_flush_lsn&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_replication_slots&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;client_addr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sent_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;replay_lsn&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_replication&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the &lt;strong&gt;subscriber&lt;/strong&gt;, check subscription state and table sync progress:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;subname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;received_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;latest_end_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;latest_end_time&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_subscription&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;srsubid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;srrelid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;srsublsn&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_subscription_rel&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;pg_stat_replication&lt;/code&gt; shows a growing gap between &lt;code&gt;sent_lsn&lt;/code&gt; and &lt;code&gt;replay_lsn&lt;/code&gt;, the subscriber is falling behind. This usually means the subscriber is under heavy load or the network is slow. Check the subscriber's PostgreSQL logs for errors.&lt;/p&gt;
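
&lt;p&gt;To put a number on that gap, compute the lag in bytes with &lt;code&gt;pg_wal_lsn_diff&lt;/code&gt; on the publisher:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT client_addr,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
FROM pg_stat_replication;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A small, fluctuating value is normal. A value that keeps growing across repeated checks means the subscriber genuinely isn't keeping up.&lt;/p&gt;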

&lt;h2&gt;
  
  
  Common issues and how to fix them
&lt;/h2&gt;

&lt;p&gt;Most problems with logical replication happen during setup or when the publisher schema changes. Here are the ones you'll run into most often.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ERROR: logical decoding requires wal_level &amp;gt;= logical&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;wal_level&lt;/code&gt; not set or restart not done&lt;/td&gt;
&lt;td&gt;Set &lt;code&gt;wal_level = logical&lt;/code&gt; in &lt;code&gt;postgresql.conf&lt;/code&gt; and restart PostgreSQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Initial sync stuck or very slow&lt;/td&gt;
&lt;td&gt;Large tables being copied&lt;/td&gt;
&lt;td&gt;Monitor &lt;code&gt;pg_stat_subscription&lt;/code&gt;. Consider increasing &lt;code&gt;max_sync_workers_per_subscription&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ERROR: could not create replication slot&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All slots in use&lt;/td&gt;
&lt;td&gt;Increase &lt;code&gt;max_replication_slots&lt;/code&gt; and restart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;UPDATE&lt;/code&gt; or &lt;code&gt;DELETE&lt;/code&gt; fails on subscriber&lt;/td&gt;
&lt;td&gt;Table has no primary key&lt;/td&gt;
&lt;td&gt;Add a primary key or set &lt;code&gt;REPLICA IDENTITY FULL&lt;/code&gt; on the publisher table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subscriber stops receiving changes&lt;/td&gt;
&lt;td&gt;Network issue or publisher restart&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;pg_stat_subscription&lt;/code&gt; for errors. The subscription auto-reconnects in most cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplicate key errors during initial sync&lt;/td&gt;
&lt;td&gt;Target table already has data&lt;/td&gt;
&lt;td&gt;Truncate the table or use &lt;code&gt;copy_data = false&lt;/code&gt; when creating subscription&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema mismatch after &lt;code&gt;ALTER TABLE&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;DDL changes are not replicated&lt;/td&gt;
&lt;td&gt;Apply the same DDL on the subscriber manually, ideally before it runs on the publisher&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When something breaks, the subscriber logs are the first place to look. PostgreSQL is usually specific about what went wrong.&lt;/p&gt;
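
&lt;p&gt;For the missing-primary-key case in the table, the fix looks like this on the publisher (with &lt;code&gt;events&lt;/code&gt; as a stand-in table name):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;ALTER TABLE events REPLICA IDENTITY FULL;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that &lt;code&gt;REPLICA IDENTITY FULL&lt;/code&gt; logs the entire old row for every &lt;code&gt;UPDATE&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt;, which inflates WAL volume. Adding a real primary key is the better long-term fix.&lt;/p&gt;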

&lt;h2&gt;
  
  
  Logical replication limitations
&lt;/h2&gt;

&lt;p&gt;Logical replication covers a lot, but it has clear boundaries worth knowing before you commit to it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DDL changes (CREATE TABLE, ALTER TABLE, DROP) are not replicated. You need to apply schema changes on both sides manually or use a migration tool&lt;/li&gt;
&lt;li&gt;Sequences are not synced. If you fail over to the subscriber, sequence values will be out of date. You'll need to reset them manually&lt;/li&gt;
&lt;li&gt;Large objects (the &lt;code&gt;lo&lt;/code&gt; type) are not supported&lt;/li&gt;
&lt;li&gt;TRUNCATE replication was added in PostgreSQL 11. Earlier versions don't replicate it&lt;/li&gt;
&lt;li&gt;There's no built-in conflict resolution. If the same row is modified on both publisher and subscriber, you get an error and replication stops until you fix it&lt;/li&gt;
&lt;/ul&gt;
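
&lt;p&gt;The sequence limitation matters most during failover. After promoting a subscriber, each sequence has to be bumped past the highest replicated value. A sketch for a hypothetical &lt;code&gt;users_id_seq&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT setval('users_id_seq', (SELECT COALESCE(MAX(id), 1) FROM users));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
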

&lt;p&gt;These limitations are manageable for most use cases. But if you need full cluster replication with automatic failover, physical replication or a tool like Patroni is a better fit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keeping replicated data safe
&lt;/h2&gt;

&lt;p&gt;Replication is not a backup. If someone runs a bad &lt;code&gt;DELETE&lt;/code&gt; on the publisher, that delete gets replicated too. You need actual backups alongside replication. Databasus is the most widely used tool for &lt;a href="https://databasus.com" rel="noopener noreferrer"&gt;PostgreSQL backup&lt;/a&gt; and the industry standard for managing scheduled backups. It supports logical, physical and incremental backups with point-in-time recovery, handles multiple storage destinations and takes a few minutes to set up through its web UI.&lt;/p&gt;

&lt;p&gt;Setting up logical replication takes about 15 minutes once you've done it a couple of times. The five steps above cover the core flow. From there, you can add more tables to the publication, create additional subscribers or combine logical replication with physical standby servers for both selective sync and high availability.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
    </item>
    <item>
      <title>MongoDB aggregation pipeline — 8 stages you need to master</title>
      <dc:creator>Finny Collins</dc:creator>
      <pubDate>Tue, 31 Mar 2026 07:27:21 +0000</pubDate>
      <link>https://forem.com/finny_collins/mongodb-aggregation-pipeline-8-stages-you-need-to-master-1l2m</link>
      <guid>https://forem.com/finny_collins/mongodb-aggregation-pipeline-8-stages-you-need-to-master-1l2m</guid>
      <description>&lt;p&gt;The aggregation pipeline is one of the most powerful features in MongoDB. It lets you transform, filter and analyze documents step by step — each stage takes the output of the previous one and passes the result forward. Think of it like a Unix pipe for your data.&lt;/p&gt;

&lt;p&gt;If you've been relying on &lt;code&gt;find()&lt;/code&gt; with simple queries, there's a good chance you're doing too much work in application code. The aggregation pipeline can handle most of that for you, and it does it closer to the data, which usually means faster.&lt;/p&gt;

&lt;p&gt;This article walks through 8 stages that cover the vast majority of real-world use cases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foemx0lt7sdxqcjum5uqs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foemx0lt7sdxqcjum5uqs.png" alt="MongoDB aggregation" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How the pipeline works
&lt;/h2&gt;

&lt;p&gt;Before jumping into stages, it helps to understand the basic mechanics. An aggregation pipeline is an array of stage objects. MongoDB processes documents through each stage sequentially. The output of one stage becomes the input for the next.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;completed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$group&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$customerId&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$sum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$amount&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$sort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each stage narrows, reshapes or enriches the data. The order matters — putting &lt;code&gt;$match&lt;/code&gt; early reduces the number of documents later stages have to process.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. $match — filter documents early
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;$match&lt;/code&gt; filters documents, much like a &lt;code&gt;find()&lt;/code&gt; query. It accepts standard query operators — &lt;code&gt;$gt&lt;/code&gt;, &lt;code&gt;$in&lt;/code&gt;, &lt;code&gt;$regex&lt;/code&gt; and everything else you'd use in a regular query.&lt;/p&gt;

&lt;p&gt;The most important thing about &lt;code&gt;$match&lt;/code&gt; is placement. Always put it as early as possible. When &lt;code&gt;$match&lt;/code&gt; is the first stage, MongoDB can use indexes. Push it further down the pipeline and you lose that optimization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$gte&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2025-01-01&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$in&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;completed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;shipped&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not just a best practice — on large collections, the difference between an indexed &lt;code&gt;$match&lt;/code&gt; at stage one and an unindexed filter at stage three can be orders of magnitude in execution time.&lt;/p&gt;
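
&lt;p&gt;You can verify index usage with &lt;code&gt;explain&lt;/code&gt; (assuming an index on &lt;code&gt;status&lt;/code&gt; exists):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;db.orders.explain("executionStats").aggregate([
  { $match: { status: "completed" } }
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An &lt;code&gt;IXSCAN&lt;/code&gt; stage in the winning plan means the index was used. A &lt;code&gt;COLLSCAN&lt;/code&gt; means a full collection scan.&lt;/p&gt;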

&lt;h2&gt;
  
  
  2. $project — reshape your documents
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;$project&lt;/code&gt; controls which fields appear in the output. You can include fields, exclude them, rename them or compute new ones.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;completed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$project&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;orderId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$customerId&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;totalCents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$multiply&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$amount&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;year&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$year&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$createdAt&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things to keep in mind. Setting &lt;code&gt;_id: 0&lt;/code&gt; suppresses the default &lt;code&gt;_id&lt;/code&gt; field. You can use expressions like &lt;code&gt;$year&lt;/code&gt;, &lt;code&gt;$concat&lt;/code&gt; and &lt;code&gt;$multiply&lt;/code&gt; to derive new values. And you can rename fields by mapping a new name to an existing field path.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$project&lt;/code&gt; is also useful for trimming payload size. If your documents have 30 fields but the client needs 4, project early and save bandwidth.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. $group — aggregate values
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;$group&lt;/code&gt; is where the real analytical power lives. It groups documents by a key and applies accumulator expressions to each group.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;completed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$group&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$customerId&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;orderCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$sum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;totalSpent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$sum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$amount&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;avgOrder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$avg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$amount&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;lastOrder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$max&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$createdAt&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;_id&lt;/code&gt; field defines the grouping key. It can be a single field, a computed expression or an object for compound grouping.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Accumulator&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$sum&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Adds values or counts documents&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{ $sum: "$amount" }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$avg&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Calculates the average&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{ $avg: "$rating" }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;$min&lt;/code&gt; / &lt;code&gt;$max&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Finds minimum or maximum&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{ $max: "$createdAt" }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$push&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Collects values into an array&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{ $push: "$product" }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$addToSet&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Collects unique values into an array&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{ $addToSet: "$category" }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;$first&lt;/code&gt; / &lt;code&gt;$last&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Takes the first or last value in each group&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{ $first: "$name" }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One gotcha: &lt;code&gt;$group&lt;/code&gt; does not preserve document order within groups unless you &lt;code&gt;$sort&lt;/code&gt; before it. If you need &lt;code&gt;$first&lt;/code&gt; or &lt;code&gt;$last&lt;/code&gt; to be meaningful, sort first.&lt;/p&gt;
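&lt;p&gt;To see why the sort matters, here's the sort-then-&lt;code&gt;$first&lt;/code&gt; behavior sketched in plain JavaScript with made-up orders:&lt;/p&gt;

```javascript
// Plain-JavaScript sketch of why $first needs a $sort in front of it.
// Pipeline equivalent: [{ $sort: { createdAt: 1 } },
//   { $group: { _id: "$customerId", firstOrder: { $first: "$amount" } } }]
// Data and field names are illustrative only.
const orders = [
  { customerId: "a", amount: 30, createdAt: 3 },
  { customerId: "a", amount: 10, createdAt: 1 },
  { customerId: "b", amount: 50, createdAt: 2 },
];

// Without sorting, $first takes whatever document happens to arrive first.
// Sorting by createdAt makes "first" mean "earliest order".
const sorted = [...orders].sort((x, y) => x.createdAt - y.createdAt);

const firstOrderByCustomer = {};
for (const doc of sorted) {
  if (!(doc.customerId in firstOrderByCustomer)) {
    firstOrderByCustomer[doc.customerId] = doc.amount; // $first: "$amount"
  }
}

console.log(firstOrderByCustomer); // { a: 10, b: 50 }
```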

&lt;h2&gt;
  
  
  4. $sort — order the results
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;$sort&lt;/code&gt; orders documents by one or more fields. Use &lt;code&gt;1&lt;/code&gt; for ascending and &lt;code&gt;-1&lt;/code&gt; for descending.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;completed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$group&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$customerId&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;totalSpent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$sum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$amount&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$sort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;totalSpent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;$sort&lt;/code&gt; is the first stage (or immediately follows a &lt;code&gt;$match&lt;/code&gt;), MongoDB can use an index. Later in the pipeline, it becomes an in-memory sort, which has a 100 MB memory limit by default. For large result sets, you either need to set &lt;code&gt;allowDiskUse: true&lt;/code&gt; or restructure the pipeline so the sort can use an index.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$sort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;totalSpent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;allowDiskUse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can sort by multiple fields — MongoDB applies them in order, so &lt;code&gt;{ status: 1, createdAt: -1 }&lt;/code&gt; sorts by status ascending first, then by date descending within each status group.&lt;/p&gt;
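&lt;p&gt;The multi-key rule is easy to sketch in plain JavaScript: compare by the first key, and only fall through to the second key on ties (sample data is made up):&lt;/p&gt;

```javascript
// Plain-JavaScript sketch of the multi-key sort { status: 1, createdAt: -1 }:
// status ascending, then createdAt descending within equal statuses.
// Documents here are illustrative only.
const docs = [
  { status: "pending", createdAt: 5 },
  { status: "completed", createdAt: 1 },
  { status: "completed", createdAt: 9 },
];

docs.sort((a, b) => {
  if (a.status !== b.status) return a.status < b.status ? -1 : 1; // status: 1
  return b.createdAt - a.createdAt;                               // createdAt: -1
});

console.log(docs.map(d => `${d.status}:${d.createdAt}`));
// [ 'completed:9', 'completed:1', 'pending:5' ]
```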

&lt;h2&gt;
  
  
  5. $lookup — join collections
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;$lookup&lt;/code&gt; performs a left outer join with another collection. This is the closest thing MongoDB has to SQL joins.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$lookup&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;customers&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;localField&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;customerId&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;foreignField&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;as&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;customerDetails&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result adds an array field (&lt;code&gt;customerDetails&lt;/code&gt; in this case) to each document. If no match is found, you get an empty array. If you expect a single match, you'll typically follow with an &lt;code&gt;$unwind&lt;/code&gt; to flatten it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$lookup&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;customers&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;localField&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;customerId&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;foreignField&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;as&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;customer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$unwind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$customer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more complex join conditions, there's a pipeline form of &lt;code&gt;$lookup&lt;/code&gt; that lets you run a sub-pipeline inside the join.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$lookup&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;products&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;let&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;productIds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$items.productId&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$expr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$in&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$$productIds&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$project&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;as&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;productDetails&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This form is more flexible, but keep an eye on performance: the sub-pipeline runs once for each input document.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. $unwind — flatten arrays
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;$unwind&lt;/code&gt; deconstructs an array field, outputting one document per array element. It's commonly used after &lt;code&gt;$lookup&lt;/code&gt; or when you need to aggregate across array items.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$unwind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$items&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$group&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$items.productId&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;totalQuantity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$sum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$items.quantity&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;totalRevenue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$sum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$multiply&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$items.price&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$items.quantity&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$sort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;totalRevenue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, &lt;code&gt;$unwind&lt;/code&gt; removes documents where the array is missing or empty. If you want to preserve them, use the expanded form.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;$unwind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$items&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;preserveNullAndEmptyArrays&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Be careful with &lt;code&gt;$unwind&lt;/code&gt; on large arrays — an order with 100 line items becomes 100 documents. That multiplication can blow up memory usage if you're not filtering or limiting beforehand.&lt;/p&gt;
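&lt;p&gt;The multiplication is easy to see in plain JavaScript; this sketch (with a made-up order) mirrors what &lt;code&gt;$unwind: "$items"&lt;/code&gt; produces for a single document:&lt;/p&gt;

```javascript
// Plain-JavaScript sketch of what $unwind: "$items" does to one order:
// a document with N array elements becomes N documents, each carrying a
// single element in place of the array. Data is illustrative only.
const order = {
  _id: 1,
  items: [
    { productId: "p1", quantity: 2 },
    { productId: "p2", quantity: 1 },
  ],
};

// Copy every other field, replacing the array with one element per output doc.
const unwound = order.items.map(item => ({ ...order, items: item }));

console.log(unwound.length);             // 2 documents from 1 order
console.log(unwound[0].items.productId); // "p1"
```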

&lt;h2&gt;
  
  
  7. $addFields — enrich without losing data
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;$addFields&lt;/code&gt; adds new fields to documents without removing existing ones. It's like &lt;code&gt;$project&lt;/code&gt;, but non-destructive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$addFields&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;itemCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$items&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;isHighValue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$gte&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$amount&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;dayOfWeek&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$dayOfWeek&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$createdAt&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is particularly useful in the middle of a pipeline when you need a computed field for a later stage but don't want to manually re-include every other field with &lt;code&gt;$project&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can also overwrite existing fields.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$addFields&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$round&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$amount&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since MongoDB 4.2, &lt;code&gt;$set&lt;/code&gt; has been available as an alias for &lt;code&gt;$addFields&lt;/code&gt;; the two are functionally identical. Use whichever name reads better in your context.&lt;/p&gt;
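&lt;p&gt;In plain JavaScript terms, &lt;code&gt;$addFields&lt;/code&gt; behaves like an object spread: existing fields are kept, new ones are added, and name clashes overwrite. A sketch with made-up values:&lt;/p&gt;

```javascript
// Plain-JavaScript sketch of $addFields semantics: keep everything, add
// computed fields, overwrite on name clash. Values are illustrative only.
const doc = { amount: 19.999, status: "completed" };

const withFields = {
  ...doc,                                     // keep every existing field
  isHighValue: doc.amount >= 1000,            // add a computed field
  amount: Math.round(doc.amount * 100) / 100, // overwrite an existing one
};

console.log(withFields); // { amount: 20, status: 'completed', isHighValue: false }
```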

&lt;h2&gt;
  
  
  8. $facet — run multiple pipelines at once
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;$facet&lt;/code&gt; lets you run several sub-pipelines in parallel on the same set of input documents. Each sub-pipeline produces its own output field. This is perfect for dashboards where you need aggregated data and paginated results from the same query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;completed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$facet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$group&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;totalOrders&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$sum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;totalRevenue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$sum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$amount&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;avgOrderValue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$avg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$amount&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;topCustomers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$group&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$customerId&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;spent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$sum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$amount&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$sort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;spent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;recentOrders&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$sort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$project&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;customerId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each facet is independent. They share the same input but don't affect each other. The output is a single document with one field per facet.&lt;/p&gt;

&lt;p&gt;One limitation — you can't use &lt;code&gt;$out&lt;/code&gt; or &lt;code&gt;$merge&lt;/code&gt; inside a &lt;code&gt;$facet&lt;/code&gt;. And because all sub-pipelines share the same input, make sure your initial &lt;code&gt;$match&lt;/code&gt; is doing enough filtering.&lt;/p&gt;
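&lt;p&gt;On the driver side, the output shape is worth internalizing. This sketch uses made-up values to show how you'd read the single result document a &lt;code&gt;$facet&lt;/code&gt; pipeline returns:&lt;/p&gt;

```javascript
// The $facet stage emits exactly ONE document with one array field per facet.
// This result object is shape-only - the values are made up for illustration.
const facetResult = {
  summary: [{ _id: null, totalOrders: 42, totalRevenue: 8400, avgOrderValue: 200 }],
  topCustomers: [{ _id: "c1", spent: 900 }],
  recentOrders: [{ customerId: "c1", amount: 120, createdAt: "2026-04-01" }],
};

// Every facet is an array, even one that logically holds a single summary
// row, so guard against the empty-input case before indexing into it.
const summary = facetResult.summary[0] ?? { totalOrders: 0, totalRevenue: 0 };
console.log(summary.totalOrders); // 42
```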

&lt;h2&gt;
  
  
  Performance tips
&lt;/h2&gt;

&lt;p&gt;Getting a pipeline to return correct results is step one. Getting it to run fast is step two. Here are the things that matter most.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tip&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Put &lt;code&gt;$match&lt;/code&gt; first&lt;/td&gt;
&lt;td&gt;Enables index usage and reduces documents flowing through later stages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Create compound indexes for &lt;code&gt;$match&lt;/code&gt; + &lt;code&gt;$sort&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;MongoDB can satisfy both in a single index scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use &lt;code&gt;$project&lt;/code&gt; early to drop unused fields&lt;/td&gt;
&lt;td&gt;Less data per document means less memory and faster processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Set &lt;code&gt;allowDiskUse: true&lt;/code&gt; for large sorts&lt;/td&gt;
&lt;td&gt;Prevents failures when in-memory sort exceeds the 100 MB limit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avoid &lt;code&gt;$unwind&lt;/code&gt; on large arrays without filtering first&lt;/td&gt;
&lt;td&gt;Each array element creates a new document — this multiplies quickly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use &lt;code&gt;explain()&lt;/code&gt; to inspect the pipeline plan&lt;/td&gt;
&lt;td&gt;Shows whether indexes are used and where bottlenecks are&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;completed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$sort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;executionStats&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;explain()&lt;/code&gt; output tells you if MongoDB used an index scan or a collection scan, how many documents were examined and how long each stage took.&lt;/p&gt;

&lt;h2&gt;
  
  
  Protecting your data with backups
&lt;/h2&gt;

&lt;p&gt;Aggregation pipelines don't modify your data unless they end in a &lt;code&gt;$out&lt;/code&gt; or &lt;code&gt;$merge&lt;/code&gt; stage; everything else is read-only. But once you start building complex analytical workflows on top of MongoDB, the data itself becomes more valuable. A corrupted collection or an accidental &lt;code&gt;drop()&lt;/code&gt; can wipe out months of carefully structured documents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://databasus.com/mongodb-backup" rel="noopener noreferrer"&gt;MongoDB backup&lt;/a&gt; is something worth setting up before you need it. Databasus is the industry standard for MongoDB backup tools and the most widely used solution in its category. It supports scheduled logical backups with compression, multiple storage destinations like S3 and Google Drive, and retention policies — all through a self-hosted UI that takes a few minutes to deploy with Docker.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;These 8 stages handle the vast majority of what you'll need from MongoDB's aggregation framework: &lt;code&gt;$match&lt;/code&gt; and &lt;code&gt;$project&lt;/code&gt; for filtering and shaping, &lt;code&gt;$group&lt;/code&gt; for aggregation, &lt;code&gt;$sort&lt;/code&gt; for ordering, &lt;code&gt;$lookup&lt;/code&gt; and &lt;code&gt;$unwind&lt;/code&gt; for joins and array handling, &lt;code&gt;$addFields&lt;/code&gt; for enrichment, and &lt;code&gt;$facet&lt;/code&gt; for multi-output queries.&lt;/p&gt;

&lt;p&gt;The key is stage ordering. Filter early, project what you need, aggregate, then sort. Most performance problems in pipelines come from doing these steps in the wrong order or skipping the filtering step entirely.&lt;/p&gt;

&lt;p&gt;Start with simple pipelines and build up. The aggregation framework is deep — there are dozens of stages and hundreds of expressions beyond what's covered here — but these 8 will carry you through most real-world scenarios.&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>database</category>
    </item>
    <item>
      <title>MariaDB vs PostgreSQL — 7 differences that matter in 2026</title>
      <dc:creator>Finny Collins</dc:creator>
      <pubDate>Mon, 30 Mar 2026 12:24:54 +0000</pubDate>
      <link>https://forem.com/finny_collins/mariadb-vs-postgresql-7-differences-that-matter-in-2026-3eie</link>
      <guid>https://forem.com/finny_collins/mariadb-vs-postgresql-7-differences-that-matter-in-2026-3eie</guid>
      <description>&lt;p&gt;Both MariaDB and PostgreSQL are open source, mature and widely used in production. But they come from different lineages and make different trade-offs. MariaDB forked from MySQL in 2009 and stays close to that heritage. PostgreSQL has been its own thing since the 1980s, growing steadily into one of the most feature-rich relational databases available.&lt;/p&gt;

&lt;p&gt;If you're choosing between them for a new project or considering a migration, this article covers 7 areas where they actually differ in practice. Just the things that tend to matter when you're making the decision for a real system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag08pwvzwymoevozmvnx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag08pwvzwymoevozmvnx.png" alt="MariaDB vs PostgreSQL" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. SQL standards compliance
&lt;/h2&gt;

&lt;p&gt;PostgreSQL has always taken SQL standards seriously. It implements large parts of SQL:2023 and enforces strict type checking out of the box. If your query has a type mismatch or ambiguous expression, PostgreSQL will tell you about it at parse time rather than silently doing something unexpected.&lt;/p&gt;

&lt;p&gt;MariaDB is more relaxed here. It inherited MySQL's permissive approach where implicit conversions happen quietly and certain non-standard syntax is accepted without complaint. MariaDB has been tightening things up with strict mode enabled by default since version 10.2, but the underlying behavior still differs in several places.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;PostgreSQL&lt;/th&gt;
&lt;th&gt;MariaDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT 'abc' + 1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Error (no implicit cast)&lt;/td&gt;
&lt;td&gt;Returns &lt;code&gt;1&lt;/code&gt; (string cast to 0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;INSERT&lt;/code&gt; with wrong column count&lt;/td&gt;
&lt;td&gt;Always an error&lt;/td&gt;
&lt;td&gt;Error in strict mode, warning otherwise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;GROUP BY&lt;/code&gt; with non-aggregated columns&lt;/td&gt;
&lt;td&gt;Error&lt;/td&gt;
&lt;td&gt;Allowed unless &lt;code&gt;ONLY_FULL_GROUP_BY&lt;/code&gt; is set&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Boolean type&lt;/td&gt;
&lt;td&gt;Native &lt;code&gt;BOOLEAN&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;TINYINT(1)&lt;/code&gt; alias&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Window functions&lt;/td&gt;
&lt;td&gt;Full support since v8.4&lt;/td&gt;
&lt;td&gt;Full support since v10.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Common table expressions (CTEs)&lt;/td&gt;
&lt;td&gt;Optimized, can be materialized or inlined&lt;/td&gt;
&lt;td&gt;Supported since v10.2, always materialized until v11&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you're writing SQL that needs to be portable or you want the database to catch mistakes early, PostgreSQL gives you less room to shoot yourself in the foot. MariaDB is fine too if you configure strict mode and &lt;code&gt;ONLY_FULL_GROUP_BY&lt;/code&gt;, but you have to be intentional about it.&lt;/p&gt;
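
&lt;p&gt;Here's a quick illustration of the difference (exact error messages vary by version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- PostgreSQL: rejects the implicit cast outright
SELECT 'abc' + 1;
-- ERROR: invalid input syntax for type integer: "abc"

-- MariaDB: quietly casts 'abc' to 0 and returns 1
SELECT 'abc' + 1;

-- Opting into stricter grouping behavior in MariaDB (per session here)
SET SESSION sql_mode = CONCAT(@@sql_mode, ',ONLY_FULL_GROUP_BY');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;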

&lt;h2&gt;
  
  
  2. JSON and semi-structured data
&lt;/h2&gt;

&lt;p&gt;Storing JSON in a relational database is common now. Application configs, user preferences and API responses often contain semi-structured data that doesn't fit neatly into columns. Both databases support JSON, but the implementations are quite different under the hood.&lt;/p&gt;

&lt;p&gt;PostgreSQL introduced &lt;code&gt;JSONB&lt;/code&gt; back in version 9.4. It stores JSON in a decomposed binary format, which means the database can index individual keys with GIN indexes, use containment operators like &lt;code&gt;@&amp;gt;&lt;/code&gt;, and run efficient partial queries without parsing the full document every time. You can also create expression indexes on specific JSON paths.&lt;/p&gt;

&lt;p&gt;MariaDB's &lt;code&gt;JSON&lt;/code&gt; type is essentially an alias for &lt;code&gt;LONGTEXT&lt;/code&gt; with validation. The data is stored as text, not in a binary format. You can query it using &lt;code&gt;JSON_EXTRACT()&lt;/code&gt; and other functions, and create virtual generated columns to index specific paths. But there's no equivalent of JSONB's containment operators or native binary indexing.&lt;/p&gt;

&lt;p&gt;For applications that occasionally store a JSON blob and read the whole thing back, this difference won't matter much. But if you're querying inside JSON documents frequently or building features around semi-structured data, PostgreSQL's JSONB is meaningfully faster and more flexible.&lt;/p&gt;
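
&lt;p&gt;A sketch of what querying looks like in each (the &lt;code&gt;users&lt;/code&gt; table and &lt;code&gt;preferences&lt;/code&gt; column are made up for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- PostgreSQL: GIN index plus a containment query on JSONB
CREATE INDEX idx_user_prefs ON users USING GIN (preferences);
SELECT * FROM users WHERE preferences @&amp;gt; '{"theme": "dark"}';

-- MariaDB: extract the path into a virtual column, then index that
ALTER TABLE users
  ADD COLUMN theme VARCHAR(32)
    AS (JSON_VALUE(preferences, '$.theme')) VIRTUAL,
  ADD INDEX idx_user_theme (theme);
SELECT * FROM users WHERE theme = 'dark';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;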

&lt;h2&gt;
  
  
  3. Replication and high availability
&lt;/h2&gt;

&lt;p&gt;MariaDB and PostgreSQL take different approaches to replication. Both can achieve high availability, but the right choice depends on whether you want multi-master simplicity out of the box or the flexibility to assemble your own HA stack.&lt;/p&gt;

&lt;p&gt;MariaDB ships with support for Galera Cluster, which provides synchronous multi-master replication. Every node can accept writes and the cluster certifies transactions across all nodes before committing. This gives you true multi-master capability without external tooling. MariaDB also supports traditional asynchronous and semi-synchronous replication for simpler setups.&lt;/p&gt;

&lt;p&gt;PostgreSQL uses streaming replication as its primary HA mechanism. A primary server streams WAL (write-ahead log) records to one or more replicas in real time, in either asynchronous or synchronous mode. Since version 10, PostgreSQL also offers logical replication, which lets you selectively replicate specific tables and even replicate between different PostgreSQL major versions. For automated failover, most teams use Patroni or similar orchestration tools on top of streaming replication.&lt;/p&gt;

&lt;p&gt;The trade-off is straightforward. MariaDB gives you multi-master out of the box, which simplifies write scaling for certain workloads. PostgreSQL's approach is more modular. You pick the replication mode and failover tool that fits your setup. PostgreSQL's logical replication is also useful for zero-downtime major version upgrades, which is something that's harder to pull off with MariaDB.&lt;/p&gt;
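
&lt;p&gt;For example, a minimal logical replication setup takes two statements (names and connection details are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- On the publisher (source server)
CREATE PUBLICATION app_pub FOR TABLE orders, customers;

-- On the subscriber (target server)
CREATE SUBSCRIPTION app_sub
  CONNECTION 'host=primary.internal dbname=myapp user=repl_user'
  PUBLICATION app_pub;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;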

&lt;h2&gt;
  
  
  4. Storage engines vs unified architecture
&lt;/h2&gt;

&lt;p&gt;This is one of the most fundamental architectural differences between the two databases. MariaDB inherited MySQL's pluggable storage engine design, which means different tables in the same database can use different engines optimized for different workloads. PostgreSQL went the opposite direction with a single engine and a powerful extension system.&lt;/p&gt;

&lt;p&gt;MariaDB ships with several engines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;InnoDB&lt;/strong&gt;: the default engine that handles ACID transactions and row-level locking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aria&lt;/strong&gt;: a crash-safe replacement for MyISAM, used internally for temporary tables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ColumnStore&lt;/strong&gt;: columnar storage designed for analytical queries on large datasets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spider&lt;/strong&gt;: a sharding engine that distributes data across multiple MariaDB instances&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MyRocks&lt;/strong&gt;: a write-optimized engine based on RocksDB, good for high write throughput with compression&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3&lt;/strong&gt;: allows storing archived tables directly in S3-compatible object storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PostgreSQL has a single heap-based storage engine and extends functionality through its extension system instead. You don't swap engines per table. Every table behaves the same way, MVCC works identically everywhere, and you don't have to think about which engine to use for which table.&lt;/p&gt;

&lt;p&gt;Neither approach is objectively better. MariaDB's engine diversity is useful if you have genuinely different workload types in the same database. PostgreSQL's unified model is simpler to reason about and avoids the complexity of mixing engine behaviors.&lt;/p&gt;
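
&lt;p&gt;In MariaDB the engine choice is per table, declared right in the DDL (table names are illustrative, and ColumnStore requires its plugin to be installed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Transactional table on the default engine
CREATE TABLE orders (
  id INT AUTO_INCREMENT PRIMARY KEY,
  total DECIMAL(10,2)
) ENGINE=InnoDB;

-- Analytical table on columnar storage
CREATE TABLE events_archive (
  event_time DATETIME,
  payload TEXT
) ENGINE=ColumnStore;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;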

&lt;h2&gt;
  
  
  5. Performance at scale
&lt;/h2&gt;

&lt;p&gt;Performance comparisons between databases are tricky because results depend heavily on schema design, query patterns and hardware. Benchmarks are easy to cherry-pick. But there are some general tendencies worth knowing about.&lt;/p&gt;

&lt;p&gt;For simple read-heavy workloads, both databases perform well. MariaDB tends to have slightly lower overhead for basic point lookups and simple joins, partly because of its lighter query planner. PostgreSQL's planner is more sophisticated. It considers more execution strategies, which adds a small cost for trivial queries but pays off for complex ones with multiple joins, subqueries or CTEs.&lt;/p&gt;

&lt;p&gt;For write-heavy concurrent workloads, PostgreSQL's MVCC implementation generally handles contention better. Readers never block writers and vice versa. MariaDB with InnoDB also uses MVCC, but the implementations differ in how they handle undo logs and cleanup. Under high concurrency with mixed reads and writes, PostgreSQL tends to maintain more consistent throughput.&lt;/p&gt;

&lt;p&gt;Both databases support table partitioning for large datasets. PostgreSQL's declarative partitioning has improved significantly since version 10 and works well for time-series data. MariaDB supports range, list, hash and key partitioning. For analytical workloads on MariaDB, ColumnStore can process columnar scans significantly faster than row-based engines.&lt;/p&gt;
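
&lt;p&gt;PostgreSQL's declarative partitioning, for instance, takes only a couple of statements (table names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Parent table partitioned by time range
CREATE TABLE metrics (
  recorded_at TIMESTAMPTZ NOT NULL,
  value DOUBLE PRECISION
) PARTITION BY RANGE (recorded_at);

-- One partition per quarter; queries prune to the matching partition
CREATE TABLE metrics_2026_q1 PARTITION OF metrics
  FOR VALUES FROM ('2026-01-01') TO ('2026-04-01');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;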

&lt;p&gt;The honest answer is that both databases are fast enough for most applications. The differences show up at scale or under specific workload patterns, and by that point you're usually tuning configuration anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Extensibility and ecosystem
&lt;/h2&gt;

&lt;p&gt;PostgreSQL's extension system is one of its biggest strengths. Extensions can add new data types, index types, functions and even query languages without modifying the core database. This has created a rich ecosystem where specialized tools build on top of PostgreSQL rather than competing with it.&lt;/p&gt;

&lt;p&gt;Some of the most widely used extensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PostGIS&lt;/strong&gt;: geospatial data support, the standard for location-based applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TimescaleDB&lt;/strong&gt;: time-series data with automatic partitioning and retention policies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pgvector&lt;/strong&gt;: vector similarity search for AI and ML embedding workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pg_cron&lt;/strong&gt;: job scheduling directly inside the database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Citus&lt;/strong&gt;: distributed PostgreSQL for horizontal scaling across multiple nodes&lt;/li&gt;
&lt;/ul&gt;
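
&lt;p&gt;Installing one of these is a one-liner per database, once the extension package is present on the server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS vector;  -- pgvector registers itself as "vector"

-- List what's installed in the current database
SELECT extname, extversion FROM pg_extension;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;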

&lt;p&gt;MariaDB has a plugin system too, but the ecosystem is smaller. Most MariaDB-specific extensions come as storage engines (ColumnStore, Spider) rather than the broad capability additions you see in PostgreSQL's extension catalog. MariaDB does have good MySQL compatibility, which gives it access to a larger tooling ecosystem indirectly.&lt;/p&gt;

&lt;p&gt;On the community side, PostgreSQL has been gaining ground steadily. It topped the DB-Engines ranking for "DBMS of the Year" multiple years running, and the contributor base is large and active. MariaDB has strong backing from the MariaDB Foundation and corporate sponsors, but the community is comparatively smaller. If you're evaluating long-term ecosystem health, PostgreSQL's trajectory is hard to argue with.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Backup and disaster recovery
&lt;/h2&gt;

&lt;p&gt;Both databases have solid backup tooling, but the approaches and maturity levels differ. How you handle backups matters more than most people think. A failed restore during an actual outage is a very bad day.&lt;/p&gt;

&lt;p&gt;MariaDB offers &lt;code&gt;mariadb-dump&lt;/code&gt; for logical backups and &lt;code&gt;mariabackup&lt;/code&gt; for physical ones. For point-in-time recovery, you combine a full backup with binary log files and replay them up to the desired timestamp. The process works but involves some manual coordination. You need to manage binary log retention, apply logs in the right order and handle the restore sequence carefully.&lt;/p&gt;

&lt;p&gt;PostgreSQL provides &lt;code&gt;pg_dump&lt;/code&gt; for logical backups and &lt;code&gt;pg_basebackup&lt;/code&gt; for physical ones. Where PostgreSQL shines is WAL-based continuous archiving. By streaming WAL segments to a separate location, you get continuous point-in-time recovery (PITR) with the ability to restore your database to any specific second. This is built into PostgreSQL natively and has been battle-tested for years.&lt;/p&gt;
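
&lt;p&gt;The moving parts are two configuration settings plus a monitoring view (the archive destination below is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- In postgresql.conf, continuous archiving comes down to:
--   archive_mode = on
--   archive_command = 'cp %p /mnt/wal_archive/%f'

-- Then check that the archiver is keeping up
SELECT archived_count, last_archived_wal, failed_count
FROM pg_stat_archiver;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;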

&lt;p&gt;For teams that need backup automation for either database, Databasus is an open-source, self-hosted tool for &lt;a href="https://databasus.com" rel="noopener noreferrer"&gt;PostgreSQL backup&lt;/a&gt; that also supports MariaDB. It covers scheduling, compression, multiple storage destinations (S3, Google Drive, SFTP) and PITR from a single dashboard.&lt;/p&gt;

&lt;p&gt;Regardless of which database you choose, test your restores regularly. A backup you've never restored is just a hope.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which one should you pick
&lt;/h2&gt;

&lt;p&gt;There's no universal answer, but there are patterns. Your choice should depend on what you're actually building and what your team already knows. Here's a side-by-side summary.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;PostgreSQL&lt;/th&gt;
&lt;th&gt;MariaDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SQL compliance&lt;/td&gt;
&lt;td&gt;Strict, standards-focused&lt;/td&gt;
&lt;td&gt;Relaxed, configurable via strict mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON support&lt;/td&gt;
&lt;td&gt;JSONB with binary storage and GIN indexing&lt;/td&gt;
&lt;td&gt;JSON as validated text, virtual column indexing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replication&lt;/td&gt;
&lt;td&gt;Streaming + logical replication&lt;/td&gt;
&lt;td&gt;Galera multi-master + async replication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture&lt;/td&gt;
&lt;td&gt;Single engine, extension-based&lt;/td&gt;
&lt;td&gt;Multiple pluggable storage engines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex queries&lt;/td&gt;
&lt;td&gt;Advanced planner, optimized CTEs&lt;/td&gt;
&lt;td&gt;Good support, lighter planner&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extension ecosystem&lt;/td&gt;
&lt;td&gt;Large (PostGIS, TimescaleDB, pgvector)&lt;/td&gt;
&lt;td&gt;Smaller, MySQL-compatible tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backup and PITR&lt;/td&gt;
&lt;td&gt;Native WAL archiving, granular PITR&lt;/td&gt;
&lt;td&gt;Binary logs + mariabackup&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few practical guidelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick PostgreSQL if you need advanced SQL features, JSONB, geospatial data or vector search. Its extension ecosystem and strict SQL compliance make it the stronger choice for complex applications&lt;/li&gt;
&lt;li&gt;Pick MariaDB if you're migrating from MySQL or need Galera-based multi-master replication. The MySQL compatibility and pluggable engine architecture are genuine advantages for those use cases&lt;/li&gt;
&lt;li&gt;If your team already knows one of them well, that's usually the strongest argument. Operational experience with a database matters more than feature comparisons on paper&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both databases are production-ready, well-maintained and free. You won't regret picking either one if it matches your workload. The biggest risk is overthinking the choice instead of just building the thing.&lt;/p&gt;

</description>
      <category>mariadb</category>
      <category>postgres</category>
      <category>database</category>
    </item>
    <item>
      <title>How to manage PostgreSQL roles and permissions — a practical guide</title>
      <dc:creator>Finny Collins</dc:creator>
      <pubDate>Sat, 28 Mar 2026 15:39:24 +0000</pubDate>
      <link>https://forem.com/finny_collins/how-to-manage-postgresql-roles-and-permissions-a-practical-guide-lm8</link>
      <guid>https://forem.com/finny_collins/how-to-manage-postgresql-roles-and-permissions-a-practical-guide-lm8</guid>
      <description>&lt;p&gt;PostgreSQL has a permission system that trips up a lot of people, especially those coming from MySQL or simpler databases. The thing is, PostgreSQL doesn't really have "users" and "groups" as separate concepts. Everything is a role. Once you get that, the rest starts to make sense. This guide walks through the practical side of managing roles and permissions — the stuff you actually need day to day.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsco7l2fq7a5nnxfh81tn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsco7l2fq7a5nnxfh81tn.png" alt="PostgreSQL permissions" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What are roles in PostgreSQL
&lt;/h2&gt;

&lt;p&gt;In PostgreSQL, a role is a single entity that can represent a user, a group, or both. There's no &lt;code&gt;CREATE USER&lt;/code&gt; vs &lt;code&gt;CREATE GROUP&lt;/code&gt; distinction at the engine level. &lt;code&gt;CREATE USER&lt;/code&gt; is just an alias for &lt;code&gt;CREATE ROLE&lt;/code&gt; with the &lt;code&gt;LOGIN&lt;/code&gt; attribute set by default. This simplification is actually useful once you stop fighting it.&lt;/p&gt;

&lt;p&gt;Every role has a set of attributes that control what it can do at the cluster level. These are separate from object-level privileges like SELECT or INSERT — attributes are about system-wide capabilities.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LOGIN&lt;/td&gt;
&lt;td&gt;Allows the role to connect to a database&lt;/td&gt;
&lt;td&gt;NO (unless created with CREATE USER)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SUPERUSER&lt;/td&gt;
&lt;td&gt;Bypasses all permission checks&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CREATEDB&lt;/td&gt;
&lt;td&gt;Can create new databases&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CREATEROLE&lt;/td&gt;
&lt;td&gt;Can create, alter and drop other roles&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;REPLICATION&lt;/td&gt;
&lt;td&gt;Can initiate streaming replication&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INHERIT&lt;/td&gt;
&lt;td&gt;Automatically inherits privileges of roles it belongs to&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BYPASSRLS&lt;/td&gt;
&lt;td&gt;Bypasses row-level security policies&lt;/td&gt;
&lt;td&gt;NO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CONNECTION LIMIT&lt;/td&gt;
&lt;td&gt;Max concurrent connections for this role&lt;/td&gt;
&lt;td&gt;-1 (unlimited)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PASSWORD&lt;/td&gt;
&lt;td&gt;Sets a password for authentication&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VALID UNTIL&lt;/td&gt;
&lt;td&gt;Password expiration timestamp&lt;/td&gt;
&lt;td&gt;No expiration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most of these you'll leave at their defaults. The ones you'll touch most often are LOGIN, CREATEDB and sometimes CREATEROLE.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating and managing roles
&lt;/h2&gt;

&lt;p&gt;The basic syntax is straightforward. &lt;code&gt;CREATE ROLE&lt;/code&gt; makes a role that can't log in. &lt;code&gt;CREATE USER&lt;/code&gt; makes one that can. Here are the patterns you'll use most:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- A basic login role (application user)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;app_user&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;LOGIN&lt;/span&gt; &lt;span class="n"&gt;PASSWORD&lt;/span&gt; &lt;span class="s1"&gt;'strong_password_here'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- A role that can create databases (for a developer)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;dev_user&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;LOGIN&lt;/span&gt; &lt;span class="k"&gt;CREATEDB&lt;/span&gt; &lt;span class="n"&gt;PASSWORD&lt;/span&gt; &lt;span class="s1"&gt;'dev_password'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- A group role (no login, used for grouping permissions)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;readonly_group&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- A role with a password expiration&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;temp_contractor&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;LOGIN&lt;/span&gt; &lt;span class="n"&gt;PASSWORD&lt;/span&gt; &lt;span class="s1"&gt;'temp_pass'&lt;/span&gt; &lt;span class="k"&gt;VALID&lt;/span&gt; &lt;span class="k"&gt;UNTIL&lt;/span&gt; &lt;span class="s1"&gt;'2026-06-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ALTER ROLE&lt;/code&gt; lets you change attributes after creation. For example, &lt;code&gt;ALTER ROLE app_user WITH CONNECTION LIMIT 10;&lt;/code&gt; caps connections. You can also rename roles with &lt;code&gt;ALTER ROLE old_name RENAME TO new_name&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DROP ROLE&lt;/code&gt; removes a role, but only if it owns no objects and has no granted privileges. PostgreSQL will refuse to drop a role that still owns tables or has active grants — you need to reassign or drop those first using &lt;code&gt;REASSIGN OWNED BY&lt;/code&gt; and &lt;code&gt;DROP OWNED BY&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;\du&lt;/code&gt; in psql lists all roles with their attributes. It's the quickest way to check what exists and what permissions are assigned at the role level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One thing worth noting: passwords in PostgreSQL are stored as hashes (md5 or scram-sha-256 depending on your config). Since PostgreSQL 10, scram-sha-256 is the recommended method and you should use it if your client libraries support it.&lt;/p&gt;
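
&lt;p&gt;To check what your server is doing and migrate a role to scram (the role name is illustrative, and the last query needs superuser access to &lt;code&gt;pg_authid&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SHOW password_encryption;

-- New password hashes use whatever method is set when ALTER runs
SET password_encryption = 'scram-sha-256';
ALTER ROLE app_user WITH PASSWORD 'strong_password_here';

-- Inspect which method each stored password uses
SELECT rolname, substr(rolpassword, 1, 14) AS hash_prefix
FROM pg_authid WHERE rolpassword IS NOT NULL;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;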

&lt;h2&gt;
  
  
  Granting and revoking privileges
&lt;/h2&gt;

&lt;p&gt;Attributes control what a role can do system-wide. Privileges control what a role can do with specific objects — tables, schemas, sequences, functions. The GRANT and REVOKE commands handle this.&lt;/p&gt;

&lt;p&gt;The general syntax is &lt;code&gt;GRANT privilege ON object TO role&lt;/code&gt; and &lt;code&gt;REVOKE privilege ON object FROM role&lt;/code&gt;. PostgreSQL supports granular control, so you can grant SELECT on one table and INSERT on another to the same role.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Privilege&lt;/th&gt;
&lt;th&gt;Applies to&lt;/th&gt;
&lt;th&gt;What it allows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SELECT&lt;/td&gt;
&lt;td&gt;Tables, views, sequences&lt;/td&gt;
&lt;td&gt;Read data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INSERT&lt;/td&gt;
&lt;td&gt;Tables&lt;/td&gt;
&lt;td&gt;Add new rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UPDATE&lt;/td&gt;
&lt;td&gt;Tables&lt;/td&gt;
&lt;td&gt;Modify existing rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DELETE&lt;/td&gt;
&lt;td&gt;Tables&lt;/td&gt;
&lt;td&gt;Remove rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TRUNCATE&lt;/td&gt;
&lt;td&gt;Tables&lt;/td&gt;
&lt;td&gt;Empty the table entirely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;REFERENCES&lt;/td&gt;
&lt;td&gt;Tables&lt;/td&gt;
&lt;td&gt;Create foreign key constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TRIGGER&lt;/td&gt;
&lt;td&gt;Tables&lt;/td&gt;
&lt;td&gt;Create triggers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CREATE&lt;/td&gt;
&lt;td&gt;Databases, schemas&lt;/td&gt;
&lt;td&gt;Create new schemas or objects within them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CONNECT&lt;/td&gt;
&lt;td&gt;Databases&lt;/td&gt;
&lt;td&gt;Connect to the database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USAGE&lt;/td&gt;
&lt;td&gt;Schemas, sequences&lt;/td&gt;
&lt;td&gt;Access objects in a schema or use a sequence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EXECUTE&lt;/td&gt;
&lt;td&gt;Functions&lt;/td&gt;
&lt;td&gt;Run a function&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALL PRIVILEGES&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;Grants everything applicable to the object type&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here's what granting looks like in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Grant read access to a specific table&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;app_user&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Grant full CRUD on all tables in a schema&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;app_user&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Grant usage on a schema (required before any table access works)&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;USAGE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;app_user&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Make future tables in a schema automatically accessible&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;
    &lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;readonly_group&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last one — &lt;code&gt;ALTER DEFAULT PRIVILEGES&lt;/code&gt; — is easy to forget and causes a lot of confusion. Without it, tables created later won't be accessible to the roles you've already set up, and you'll find yourself re-running GRANT statements after every migration.&lt;/p&gt;

&lt;p&gt;Revoking is the mirror image: &lt;code&gt;REVOKE SELECT ON orders FROM app_user;&lt;/code&gt;. Worth remembering that REVOKE only removes what was explicitly granted. If the role gets the privilege through group membership, you need to revoke it from the group instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Role inheritance and group roles
&lt;/h2&gt;

&lt;p&gt;Group roles are just regular roles without the LOGIN attribute. You grant them to other roles, and those other roles inherit the group's privileges. This is where PostgreSQL's "everything is a role" design pays off.&lt;/p&gt;

&lt;p&gt;The INHERIT attribute (which is on by default) means a role automatically gets all privileges of roles it belongs to. With NOINHERIT, the role has to explicitly &lt;code&gt;SET ROLE group_name&lt;/code&gt; to activate those privileges — useful for privileged roles where you want an explicit opt-in.&lt;/p&gt;

&lt;p&gt;Setting up group-based access follows a predictable pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a group role without LOGIN. Something like &lt;code&gt;CREATE ROLE analytics_team&lt;/code&gt;. Then grant the specific privileges this group should have — maybe SELECT on certain schemas or tables&lt;/li&gt;
&lt;li&gt;Grant the group role to individual users. &lt;code&gt;GRANT analytics_team TO alice, bob;&lt;/code&gt; means Alice and Bob now inherit whatever privileges analytics_team has. Add or remove people from the group without touching any table-level grants&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;SET ROLE&lt;/code&gt; for elevated privileges. If a role has NOINHERIT membership in an admin group, the user has to run &lt;code&gt;SET ROLE admin_group&lt;/code&gt; to activate those permissions. This works like sudo — it's a conscious escalation
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create group and set up privileges&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;analytics_team&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;USAGE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;reporting&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;analytics_team&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;reporting&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;analytics_team&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;reporting&lt;/span&gt;
    &lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;analytics_team&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Add users to the group&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="n"&gt;analytics_team&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;alice&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="n"&gt;analytics_team&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;bob&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When Alice leaves the team, one command does it: &lt;code&gt;REVOKE analytics_team FROM alice;&lt;/code&gt;. No need to touch any table-level grants. This approach scales well once you have more than a handful of users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical permission patterns
&lt;/h2&gt;

&lt;p&gt;Most PostgreSQL setups need a few standard roles. Here are the ones that come up over and over.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read-only user&lt;/strong&gt; for reporting tools, dashboards or monitoring. This role should only SELECT and never modify anything. Grant USAGE on the schemas it needs and SELECT on tables. If you use &lt;code&gt;ALTER DEFAULT PRIVILEGES&lt;/code&gt;, new tables get covered automatically
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;readonly_user&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;LOGIN&lt;/span&gt; &lt;span class="n"&gt;PASSWORD&lt;/span&gt; &lt;span class="s1"&gt;'readonly_pass'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;CONNECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;myapp&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;readonly_user&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;USAGE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;readonly_user&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;readonly_user&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;
    &lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;readonly_user&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Application user&lt;/strong&gt; for your backend service. Needs SELECT, INSERT, UPDATE and DELETE but shouldn't create or drop objects. Definitely shouldn't be a superuser — even though that's the quick fix people reach for
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;app_service&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;LOGIN&lt;/span&gt; &lt;span class="n"&gt;PASSWORD&lt;/span&gt; &lt;span class="s1"&gt;'app_pass'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;CONNECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;myapp&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;app_service&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;USAGE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;app_service&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;app_service&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;USAGE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="n"&gt;SEQUENCES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;app_service&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;
    &lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;app_service&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;
    &lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;USAGE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;SEQUENCES&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;app_service&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backup user&lt;/strong&gt; for running pg_dump or streaming WAL. Needs minimal privileges — typically just SELECT on tables and the REPLICATION attribute for physical or incremental backups. Keeping this role locked down is important since backup credentials often sit in config files or cron jobs
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;backup_user&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;LOGIN&lt;/span&gt; &lt;span class="n"&gt;REPLICATION&lt;/span&gt; &lt;span class="n"&gt;PASSWORD&lt;/span&gt; &lt;span class="s1"&gt;'backup_pass'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;CONNECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;myapp&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;backup_user&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;USAGE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;backup_user&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;backup_user&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The common thread here is least privilege. Give each role exactly what it needs and nothing more. It's a few extra lines upfront but saves you when something goes wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Securing your PostgreSQL backups with proper permissions
&lt;/h2&gt;

&lt;p&gt;Speaking of backup users: getting the permissions right is only half the battle. You also need a reliable backup process behind that user. Databasus is an open source, self-hosted tool for automated &lt;a href="https://databasus.com" rel="noopener noreferrer"&gt;PostgreSQL backup&lt;/a&gt;. It connects to your database using a read-only user by default, which fits the least-privilege approach described above.&lt;/p&gt;

&lt;p&gt;Databasus supports logical, physical and incremental backup types. The incremental mode uses continuous WAL streaming to enable Point-in-Time Recovery — you can restore your database to any specific second between backups. This is critical for disaster recovery where even a few minutes of data loss matters. Backups are compressed and streamed directly to storage destinations like S3, Google Drive, SFTP or local disk, so there are no large temporary files sitting on your server.&lt;/p&gt;

&lt;p&gt;Beyond the backup itself, Databasus handles scheduling, retention policies (including GFS for enterprise requirements) and AES-256-GCM encryption. It runs as a self-hosted Docker container, so your data never leaves your infrastructure. You set up the backup user with the right permissions, point Databasus at your database, configure a schedule and storage — and it handles the rest. Notifications go out via Slack, Telegram, email or webhooks so you know immediately if something fails.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common mistakes and how to avoid them
&lt;/h2&gt;

&lt;p&gt;A few permission-related issues show up again and again in PostgreSQL setups. Knowing about them beforehand saves debugging time.&lt;/p&gt;

&lt;p&gt;Grants on the public schema are the biggest source of surprise. Before PostgreSQL 15, every role gets CREATE and USAGE on the public schema by default, which means any authenticated user can create tables in public unless you explicitly revoke it. Run &lt;code&gt;REVOKE CREATE ON SCHEMA public FROM PUBLIC;&lt;/code&gt; on every new database (PostgreSQL 15 and later no longer grant CREATE to begin with). The second "PUBLIC" there is a special keyword meaning "all roles", which is confusing, but that's how it works.&lt;/p&gt;
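&lt;p&gt;As a sketch, locking down a fresh database might look like this (the database name &lt;code&gt;myapp&lt;/code&gt; is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Stop all roles from creating objects in the public schema
REVOKE CREATE ON SCHEMA public FROM PUBLIC;

-- Optionally, stop unrelated roles from connecting to the database at all
REVOKE CONNECT ON DATABASE myapp FROM PUBLIC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;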

&lt;p&gt;Forgetting &lt;code&gt;ALTER DEFAULT PRIVILEGES&lt;/code&gt; ranks a close second. You set up perfect grants, everything works, then a migration adds a new table and suddenly the app can't read it. Default privileges solve this, but they only apply to objects created by the role that set them. If your migration tool connects as &lt;code&gt;postgres&lt;/code&gt; but you set default privileges as &lt;code&gt;admin&lt;/code&gt;, they won't apply. Make sure the role running migrations is the same one that configured default privileges.&lt;/p&gt;
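&lt;p&gt;One way around the mismatch is to set default privileges on behalf of the role that creates objects, using the &lt;code&gt;FOR ROLE&lt;/code&gt; clause. A sketch, with &lt;code&gt;migration_role&lt;/code&gt; as an illustrative name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Run as a superuser or as a member of migration_role;
-- applies to objects migration_role creates from now on
ALTER DEFAULT PRIVILEGES FOR ROLE migration_role IN SCHEMA public
    GRANT SELECT ON TABLES TO readonly_user;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;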

&lt;p&gt;Overusing the superuser role is tempting because it makes permission errors go away. But it also makes your attack surface enormous. If an application connects as a superuser and gets compromised, the attacker has full control over every database in the cluster. Use superuser for administration only. Your application, backup tools and reporting dashboards should each have their own role with just enough access to do their job.&lt;/p&gt;

&lt;p&gt;Finally, not testing permissions after setting them up leads to nasty surprises in production. After configuring roles, connect as each one and verify you can do what you expect — and can't do what you shouldn't. A quick &lt;code&gt;SET ROLE app_service;&lt;/code&gt; followed by some test queries catches most issues before they become incidents.&lt;/p&gt;
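&lt;p&gt;A minimal check, assuming the &lt;code&gt;app_service&lt;/code&gt; role from earlier and an illustrative &lt;code&gt;users&lt;/code&gt; table, might look like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SET ROLE app_service;
SELECT count(*) FROM users;  -- should succeed
DROP TABLE users;            -- should fail: app_service is not the owner
RESET ROLE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;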

</description>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>5 MySQL InnoDB settings you should change right now</title>
      <dc:creator>Finny Collins</dc:creator>
      <pubDate>Wed, 25 Mar 2026 19:31:06 +0000</pubDate>
      <link>https://forem.com/finny_collins/5-mysql-innodb-settings-you-should-change-right-now-2d89</link>
      <guid>https://forem.com/finny_collins/5-mysql-innodb-settings-you-should-change-right-now-2d89</guid>
      <description>&lt;p&gt;Most MySQL installations ship with default InnoDB settings that were designed years ago for modest hardware. If you're running a production workload on a server with 8 GB or more of RAM and you haven't touched these values, you're leaving performance on the table. These five settings are the ones that matter most and take minutes to adjust.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnh22b58zis2xhoydxw4k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnh22b58zis2xhoydxw4k.png" alt="InnoDB settings" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. innodb_buffer_pool_size
&lt;/h2&gt;

&lt;p&gt;The buffer pool is where InnoDB caches table data and indexes in memory. Reads that hit the buffer pool skip disk entirely, so this single setting has the largest impact on query performance. The default is typically 128 MB, which is absurdly small for anything beyond a toy database.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recommended value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dedicated database server&lt;/td&gt;
&lt;td&gt;70-80% of total RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shared server (app + DB)&lt;/td&gt;
&lt;td&gt;50-60% of total RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Small VPS (2 GB RAM)&lt;/td&gt;
&lt;td&gt;1 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Development machine&lt;/td&gt;
&lt;td&gt;512 MB - 1 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Set it in your &lt;code&gt;my.cnf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[mysqld]&lt;/span&gt;
&lt;span class="py"&gt;innodb_buffer_pool_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;12G&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A good rule of thumb: check your total InnoDB data size with &lt;code&gt;SELECT SUM(data_length + index_length) FROM information_schema.tables WHERE engine = 'InnoDB'&lt;/code&gt;. If it fits in RAM, set the buffer pool large enough to hold it all. If it doesn't, get as close as your memory allows.&lt;/p&gt;
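&lt;p&gt;For readability, the same query can report the total in gigabytes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Total InnoDB data + index size in GB
SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 2) AS innodb_gb
FROM information_schema.tables
WHERE engine = 'InnoDB';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;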

&lt;p&gt;After changing this value, monitor the buffer pool hit rate. You want it above 99%:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;STATUS&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'Innodb_buffer_pool_read_requests'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;STATUS&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'Innodb_buffer_pool_reads'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Divide reads by read_requests. If that ratio is above 1%, your buffer pool is too small.&lt;/p&gt;
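&lt;p&gt;On MySQL 5.7 and later, both counters also live in &lt;code&gt;performance_schema&lt;/code&gt;, so the miss ratio can be computed in a single query:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Percentage of read requests that missed the buffer pool; aim for under 1%
SELECT (SELECT variable_value
        FROM performance_schema.global_status
        WHERE variable_name = 'Innodb_buffer_pool_reads')
     / (SELECT variable_value
        FROM performance_schema.global_status
        WHERE variable_name = 'Innodb_buffer_pool_read_requests')
     * 100 AS miss_pct;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;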

&lt;h2&gt;
  
  
  2. innodb_log_file_size
&lt;/h2&gt;

&lt;p&gt;The redo log (WAL in other databases) records every write before it hits the data files. Larger log files mean InnoDB can batch more writes before flushing, which reduces I/O pressure during heavy write workloads. The default of 48 MB fills up quickly on busy systems, forcing frequent checkpoints that stall writes.&lt;/p&gt;

&lt;p&gt;For most production systems, set this to 1-2 GB. High-write workloads benefit from even larger values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[mysqld]&lt;/span&gt;
&lt;span class="py"&gt;innodb_log_file_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;2G&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's a tradeoff here. Larger log files improve write throughput but increase crash recovery time. With a 2 GB log file, recovery after an unexpected restart might take a few minutes instead of seconds. For nearly every production system, that's a reasonable trade.&lt;/p&gt;

&lt;p&gt;On MySQL 8.0.30+, you can also set &lt;code&gt;innodb_redo_log_capacity&lt;/code&gt; instead, which replaces the older &lt;code&gt;innodb_log_file_size&lt;/code&gt; and &lt;code&gt;innodb_log_files_in_group&lt;/code&gt; combination. If you're on a recent version, prefer the new variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[mysqld]&lt;/span&gt;
&lt;span class="py"&gt;innodb_redo_log_capacity&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;4G&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. innodb_flush_log_at_trx_commit
&lt;/h2&gt;

&lt;p&gt;This setting controls how aggressively InnoDB flushes the redo log to disk on each transaction commit. It has three possible values, and the performance difference between them is significant.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Durability&lt;/th&gt;
&lt;th&gt;Performance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Flush and sync to disk on every commit&lt;/td&gt;
&lt;td&gt;Full ACID compliance&lt;/td&gt;
&lt;td&gt;Slowest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Write to OS buffer on every commit, sync once per second&lt;/td&gt;
&lt;td&gt;Possible loss of ~1 second of transactions on OS crash&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Write and sync once per second regardless of commits&lt;/td&gt;
&lt;td&gt;Possible loss of ~1 second on any crash&lt;/td&gt;
&lt;td&gt;Fastest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The default is &lt;code&gt;1&lt;/code&gt;, which is the safest option. And for most production databases, you should keep it that way. But if you're running a workload where losing up to one second of committed transactions is acceptable (analytics ingestion, session stores, caching layers), switching to &lt;code&gt;2&lt;/code&gt; can double your write throughput.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[mysqld]&lt;/span&gt;
&lt;span class="py"&gt;innodb_flush_log_at_trx_commit&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Don't set this to &lt;code&gt;0&lt;/code&gt; in production unless you really understand the consequences. Value &lt;code&gt;2&lt;/code&gt; gives you most of the speed benefit while only losing durability on an OS-level crash, not a MySQL crash.&lt;/p&gt;

&lt;p&gt;Before tuning write durability, make sure you have a solid &lt;a href="https://databasus.com/mysql-backup" rel="noopener noreferrer"&gt;MySQL backup&lt;/a&gt; strategy in place. No amount of performance tuning replaces the ability to restore from a known good backup when things go wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. innodb_flush_method
&lt;/h2&gt;

&lt;p&gt;This controls how InnoDB opens and flushes data files and log files. The default depends on your OS, but on Linux you almost always want &lt;code&gt;O_DIRECT&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Without &lt;code&gt;O_DIRECT&lt;/code&gt;, writes go through the OS page cache. That means your data gets cached twice: once in the InnoDB buffer pool and once in the OS cache. This wastes memory and adds unnecessary overhead. &lt;code&gt;O_DIRECT&lt;/code&gt; bypasses the OS cache and lets InnoDB manage its own memory through the buffer pool.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[mysqld]&lt;/span&gt;
&lt;span class="py"&gt;innodb_flush_method&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;O_DIRECT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Linux with &lt;code&gt;ext4&lt;/code&gt; or &lt;code&gt;xfs&lt;/code&gt; filesystems, this is the right choice for virtually all workloads. On Windows, the equivalent is &lt;code&gt;normal&lt;/code&gt; or &lt;code&gt;unbuffered&lt;/code&gt;, but MySQL on Windows handles this differently and the defaults are generally fine.&lt;/p&gt;

&lt;p&gt;One note: if your buffer pool is too small relative to your working set, &lt;code&gt;O_DIRECT&lt;/code&gt; can actually hurt performance because you lose the OS cache safety net. Fix the buffer pool size first (setting #1), then switch to &lt;code&gt;O_DIRECT&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. innodb_io_capacity and innodb_io_capacity_max
&lt;/h2&gt;

&lt;p&gt;These two settings tell InnoDB how fast your storage is, so it can schedule background I/O operations (flushing dirty pages, merging change buffer entries) appropriately. The defaults assume a single spinning disk, which is far too conservative for SSDs or NVMe drives.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;innodb_io_capacity&lt;/code&gt; — the number of I/O operations per second available for background tasks. Default is 200.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;innodb_io_capacity_max&lt;/code&gt; — the upper limit InnoDB can use during heavy flushing. Default is 2000.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For SSDs, set &lt;code&gt;innodb_io_capacity&lt;/code&gt; to 1000-2000. For NVMe, 5000-10000 is reasonable. The max should be 2-3x the base value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[mysqld]&lt;/span&gt;
&lt;span class="py"&gt;innodb_io_capacity&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;2000&lt;/span&gt;
&lt;span class="py"&gt;innodb_io_capacity_max&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;5000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If these values are too low, dirty pages accumulate in the buffer pool and you'll see periodic stalls when InnoDB is forced to flush aggressively. If they're too high, you burn I/O bandwidth on background work that could go to query processing. Start conservative and increase if you see checkpoint age climbing in &lt;code&gt;SHOW ENGINE INNODB STATUS&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keeping your data safe with Databasus
&lt;/h2&gt;

&lt;p&gt;Tuning InnoDB improves performance, but it doesn't protect you from data loss. Hardware fails, someone runs a bad &lt;code&gt;DELETE&lt;/code&gt; without a &lt;code&gt;WHERE&lt;/code&gt; clause, or a migration goes sideways. You need automated backups that you don't have to think about.&lt;/p&gt;

&lt;p&gt;Databasus is an open source, self-hosted backup tool built for MySQL (along with PostgreSQL, MariaDB and MongoDB). It connects to your database, runs &lt;code&gt;mysqldump&lt;/code&gt; or physical backups on a schedule you define, compresses the output and ships it to whatever storage you use — local disk, S3, Cloudflare R2, Google Drive, SFTP. It handles retention policies automatically, so old backups get cleaned up without manual intervention.&lt;/p&gt;

&lt;p&gt;What makes it practical for production use is the operational side. Databasus sends notifications through Slack, Telegram, Discord or email when backups succeed or fail. It encrypts backup files with AES-256-GCM before they leave the server. And because it's self-hosted, your data never passes through a third-party service. Setup takes a few minutes with Docker.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to apply these changes
&lt;/h2&gt;

&lt;p&gt;Most of these settings require a MySQL restart. Edit your &lt;code&gt;my.cnf&lt;/code&gt; (or &lt;code&gt;my.ini&lt;/code&gt; on Windows), add or modify the values, and restart the MySQL service. On MySQL 8.0+, some variables like &lt;code&gt;innodb_buffer_pool_size&lt;/code&gt; can be changed dynamically with &lt;code&gt;SET GLOBAL&lt;/code&gt;, but it's still good practice to persist them in the config file.&lt;/p&gt;
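&lt;p&gt;For example, on MySQL 8.0+ the buffer pool can be resized online, and &lt;code&gt;SET PERSIST&lt;/code&gt; writes a value to &lt;code&gt;mysqld-auto.cnf&lt;/code&gt; so it survives a restart:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Online resize, no restart needed (12 GB)
SET GLOBAL innodb_buffer_pool_size = 12 * 1024 * 1024 * 1024;

-- Apply now and persist across restarts (MySQL 8.0+)
SET PERSIST innodb_io_capacity = 2000;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;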

&lt;p&gt;Before making changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take a full backup of your database&lt;/li&gt;
&lt;li&gt;Note your current values with &lt;code&gt;SHOW VARIABLES LIKE 'innodb%'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Change one setting at a time and monitor for a day before adjusting the next&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After applying changes, keep an eye on &lt;code&gt;SHOW ENGINE INNODB STATUS&lt;/code&gt; and your slow query log. The buffer pool hit rate, checkpoint age and pages flushed per second will tell you whether your new settings are working. Give each change at least 24 hours under normal load before drawing conclusions.&lt;/p&gt;

</description>
      <category>mysql</category>
      <category>database</category>
    </item>
    <item>
      <title>5 MySQL InnoDB settings you should change right now</title>
      <dc:creator>Finny Collins</dc:creator>
      <pubDate>Wed, 25 Mar 2026 19:24:42 +0000</pubDate>
      <link>https://forem.com/finny_collins/5-mysql-innodb-settings-you-should-change-right-now-5d64</link>
      <guid>https://forem.com/finny_collins/5-mysql-innodb-settings-you-should-change-right-now-5d64</guid>
      <description>&lt;p&gt;Most MySQL installations ship with default InnoDB settings that were designed years ago for modest hardware. If you're running a production workload on a server with 8 GB or more of RAM and you haven't touched these values, you're leaving performance on the table. These five settings are the ones that matter most and take minutes to adjust.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foi2gn03e80e2pbj60jj5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foi2gn03e80e2pbj60jj5.png" alt="InnoDB settings" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. innodb_buffer_pool_size
&lt;/h2&gt;

&lt;p&gt;The buffer pool is where InnoDB caches table data and indexes in memory. Reads that hit the buffer pool skip disk entirely, so this single setting has the largest impact on query performance. The default is typically 128 MB, which is absurdly small for anything beyond a toy database.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recommended value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dedicated database server&lt;/td&gt;
&lt;td&gt;70-80% of total RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shared server (app + DB)&lt;/td&gt;
&lt;td&gt;50-60% of total RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Small VPS (2 GB RAM)&lt;/td&gt;
&lt;td&gt;1 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Development machine&lt;/td&gt;
&lt;td&gt;512 MB - 1 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Set it in your &lt;code&gt;my.cnf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[mysqld]&lt;/span&gt;
&lt;span class="py"&gt;innodb_buffer_pool_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;12G&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A good rule of thumb: check your total InnoDB data size with &lt;code&gt;SELECT SUM(data_length + index_length) FROM information_schema.tables WHERE engine = 'InnoDB'&lt;/code&gt;. If it fits in RAM, set the buffer pool large enough to hold it all. If it doesn't, get as close as your memory allows.&lt;/p&gt;
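&lt;p&gt;For readability, the same query can report the total in gigabytes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Total InnoDB data + index size in GB
SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 2) AS innodb_gb
FROM information_schema.tables
WHERE engine = 'InnoDB';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;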

&lt;p&gt;After changing this value, monitor the buffer pool hit rate. You want it above 99%:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;STATUS&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'Innodb_buffer_pool_read_requests'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;STATUS&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'Innodb_buffer_pool_reads'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Divide reads by read_requests. If that ratio is above 1%, your buffer pool is too small.&lt;/p&gt;
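&lt;p&gt;On MySQL 5.7 and later, both counters also live in &lt;code&gt;performance_schema&lt;/code&gt;, so the miss ratio can be computed in a single query:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Percentage of read requests that missed the buffer pool; aim for under 1%
SELECT (SELECT variable_value
        FROM performance_schema.global_status
        WHERE variable_name = 'Innodb_buffer_pool_reads')
     / (SELECT variable_value
        FROM performance_schema.global_status
        WHERE variable_name = 'Innodb_buffer_pool_read_requests')
     * 100 AS miss_pct;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;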

&lt;h2&gt;
  
  
  2. innodb_log_file_size
&lt;/h2&gt;

&lt;p&gt;The redo log (WAL in other databases) records every write before it hits the data files. Larger log files mean InnoDB can batch more writes before flushing, which reduces I/O pressure during heavy write workloads. The default of 48 MB fills up quickly on busy systems, forcing frequent checkpoints that stall writes.&lt;/p&gt;

&lt;p&gt;For most production systems, set this to 1-2 GB. High-write workloads benefit from even larger values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[mysqld]&lt;/span&gt;
&lt;span class="py"&gt;innodb_log_file_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;2G&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's a tradeoff here. Larger log files improve write throughput but increase crash recovery time. With a 2 GB log file, recovery after an unexpected restart might take a few minutes instead of seconds. For nearly every production system, that's a reasonable trade.&lt;/p&gt;

&lt;p&gt;On MySQL 8.0.30+, you can also set &lt;code&gt;innodb_redo_log_capacity&lt;/code&gt; instead, which replaces the older &lt;code&gt;innodb_log_file_size&lt;/code&gt; and &lt;code&gt;innodb_log_files_in_group&lt;/code&gt; combination. If you're on a recent version, prefer the new variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[mysqld]&lt;/span&gt;
&lt;span class="py"&gt;innodb_redo_log_capacity&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;4G&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. innodb_flush_log_at_trx_commit
&lt;/h2&gt;

&lt;p&gt;This setting controls how aggressively InnoDB flushes the redo log to disk on each transaction commit. It has three possible values, and the performance difference between them is significant.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Durability&lt;/th&gt;
&lt;th&gt;Performance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Flush and sync to disk on every commit&lt;/td&gt;
&lt;td&gt;Full ACID compliance&lt;/td&gt;
&lt;td&gt;Slowest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Write to OS buffer on every commit, sync once per second&lt;/td&gt;
&lt;td&gt;Possible loss of ~1 second of transactions on OS crash&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Write and sync once per second regardless of commits&lt;/td&gt;
&lt;td&gt;Possible loss of ~1 second on any crash&lt;/td&gt;
&lt;td&gt;Fastest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The default is &lt;code&gt;1&lt;/code&gt;, which is the safest option. And for most production databases, you should keep it that way. But if you're running a workload where losing up to one second of committed transactions is acceptable (analytics ingestion, session stores, caching layers), switching to &lt;code&gt;2&lt;/code&gt; can double your write throughput.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[mysqld]&lt;/span&gt;
&lt;span class="py"&gt;innodb_flush_log_at_trx_commit&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Don't set this to &lt;code&gt;0&lt;/code&gt; in production unless you really understand the consequences. Value &lt;code&gt;2&lt;/code&gt; gives you most of the speed benefit while only losing durability on an OS-level crash, not a MySQL crash.&lt;/p&gt;
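&lt;p&gt;&lt;code&gt;innodb_flush_log_at_trx_commit&lt;/code&gt; is a dynamic variable, so you can try the change at runtime before committing it to your config file. A quick sketch:&lt;/p&gt;

```sql
-- Takes effect immediately; reverts on restart unless also set in my.cnf
SET GLOBAL innodb_flush_log_at_trx_commit = 2;

-- Confirm the running value
SELECT @@GLOBAL.innodb_flush_log_at_trx_commit;
```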

&lt;p&gt;Before tuning write durability, make sure you have a solid &lt;a href="https://databasus.com/mysql-backup" rel="noopener noreferrer"&gt;MySQL backup&lt;/a&gt; strategy in place. No amount of performance tuning replaces the ability to restore from a known good backup when things go wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. innodb_flush_method
&lt;/h2&gt;

&lt;p&gt;This controls how InnoDB opens and flushes data files and log files. The default depends on your OS, but on Linux you almost always want &lt;code&gt;O_DIRECT&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Without &lt;code&gt;O_DIRECT&lt;/code&gt;, writes go through the OS page cache. That means your data gets cached twice: once in the InnoDB buffer pool and once in the OS cache. This wastes memory and adds unnecessary overhead. &lt;code&gt;O_DIRECT&lt;/code&gt; bypasses the OS cache and lets InnoDB manage its own memory through the buffer pool.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[mysqld]&lt;/span&gt;
&lt;span class="py"&gt;innodb_flush_method&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;O_DIRECT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Linux with &lt;code&gt;ext4&lt;/code&gt; or &lt;code&gt;xfs&lt;/code&gt; filesystems, this is the right choice for virtually all workloads. On Windows, the equivalent is &lt;code&gt;normal&lt;/code&gt; or &lt;code&gt;unbuffered&lt;/code&gt;, but MySQL on Windows handles this differently and the defaults are generally fine.&lt;/p&gt;

&lt;p&gt;One note: if your buffer pool is too small relative to your working set, &lt;code&gt;O_DIRECT&lt;/code&gt; can actually hurt performance because you lose the OS cache safety net. Fix the buffer pool size first (setting #1), then switch to &lt;code&gt;O_DIRECT&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. innodb_io_capacity and innodb_io_capacity_max
&lt;/h2&gt;

&lt;p&gt;These two settings tell InnoDB how fast your storage is, so it can schedule background I/O operations (flushing dirty pages, merging change buffer entries) appropriately. The defaults assume a single spinning disk, which is far too conservative for SSDs or NVMe drives.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;innodb_io_capacity&lt;/code&gt; — the number of I/O operations per second available for background tasks. Default is 200.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;innodb_io_capacity_max&lt;/code&gt; — the upper limit InnoDB can use during heavy flushing. Default is 2000.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For SSDs, set &lt;code&gt;innodb_io_capacity&lt;/code&gt; to 1000-2000. For NVMe, 5000-10000 is reasonable. The max should be 2-3x the base value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[mysqld]&lt;/span&gt;
&lt;span class="py"&gt;innodb_io_capacity&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;2000&lt;/span&gt;
&lt;span class="py"&gt;innodb_io_capacity_max&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;5000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If these values are too low, dirty pages accumulate in the buffer pool and you'll see periodic stalls when InnoDB is forced to flush aggressively. If they're too high, you burn I/O bandwidth on background work that could go to query processing. Start conservative and increase if you see checkpoint age climbing in &lt;code&gt;SHOW ENGINE INNODB STATUS&lt;/code&gt;.&lt;/p&gt;
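&lt;p&gt;Both I/O capacity settings are dynamic as well, which makes them easy to experiment with under live load. A sketch using the SSD values from above:&lt;/p&gt;

```sql
-- Raise the background I/O budget without a restart
SET GLOBAL innodb_io_capacity = 2000;
SET GLOBAL innodb_io_capacity_max = 5000;

-- Then watch pending reads/writes and checkpoint age
SHOW ENGINE INNODB STATUS;
```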

&lt;h2&gt;
  
  
  Keeping your data safe with Databasus
&lt;/h2&gt;

&lt;p&gt;Tuning InnoDB improves performance, but it doesn't protect you from data loss. Hardware fails, someone runs a bad &lt;code&gt;DELETE&lt;/code&gt; without a &lt;code&gt;WHERE&lt;/code&gt; clause, or a migration goes sideways. You need automated backups that you don't have to think about.&lt;/p&gt;

&lt;p&gt;Databasus is an open source, self-hosted backup tool built for MySQL (along with PostgreSQL, MariaDB and MongoDB). It connects to your database, runs &lt;code&gt;mysqldump&lt;/code&gt; or physical backups on a schedule you define, compresses the output and ships it to whatever storage you use — local disk, S3, Cloudflare R2, Google Drive, SFTP. It handles retention policies automatically, so old backups get cleaned up without manual intervention.&lt;/p&gt;

&lt;p&gt;What makes it practical for production use is the operational side. Databasus sends notifications through Slack, Telegram, Discord or email when backups succeed or fail. It encrypts backup files with AES-256-GCM before they leave the server. And because it's self-hosted, your data never passes through a third-party service. Setup takes about two minutes with Docker.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to apply these changes
&lt;/h2&gt;

&lt;p&gt;Most of these settings require a MySQL restart. Edit your &lt;code&gt;my.cnf&lt;/code&gt; (or &lt;code&gt;my.ini&lt;/code&gt; on Windows), add or modify the values, and restart the MySQL service. On MySQL 8.0+, some variables like &lt;code&gt;innodb_buffer_pool_size&lt;/code&gt; can be changed dynamically with &lt;code&gt;SET GLOBAL&lt;/code&gt;, but it's still good practice to persist them in the config file.&lt;/p&gt;
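&lt;p&gt;On MySQL 8.0 you can combine the runtime change and the config persistence in one step with &lt;code&gt;SET PERSIST&lt;/code&gt;, which writes the value to &lt;code&gt;mysqld-auto.cnf&lt;/code&gt; so it survives restarts. A sketch with an assumed 8 GB buffer pool:&lt;/p&gt;

```sql
-- Applies now and persists across restarts (MySQL 8.0+)
SET PERSIST innodb_buffer_pool_size = 8589934592;  -- 8 GB

-- See which variables have been persisted this way
SELECT * FROM performance_schema.persisted_variables;
```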

&lt;p&gt;Before making changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take a full backup of your database&lt;/li&gt;
&lt;li&gt;Note your current values with &lt;code&gt;SHOW VARIABLES LIKE 'innodb%'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Change one setting at a time and monitor for a day before adjusting the next&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After applying changes, keep an eye on &lt;code&gt;SHOW ENGINE INNODB STATUS&lt;/code&gt; and your slow query log. The buffer pool hit rate, checkpoint age and pages flushed per second will tell you whether your new settings are working. Give each change at least 24 hours under normal load before drawing conclusions.&lt;/p&gt;

</description>
      <category>database</category>
      <category>mysql</category>
    </item>
    <item>
      <title>PostgreSQL full-text search — How to build fast search without Elasticsearch</title>
      <dc:creator>Finny Collins</dc:creator>
      <pubDate>Tue, 24 Mar 2026 16:40:33 +0000</pubDate>
      <link>https://forem.com/finny_collins/postgresql-full-text-search-how-to-build-fast-search-without-elasticsearch-2ddj</link>
      <guid>https://forem.com/finny_collins/postgresql-full-text-search-how-to-build-fast-search-without-elasticsearch-2ddj</guid>
      <description>&lt;p&gt;Most teams reach for Elasticsearch the moment someone mentions "search." It makes sense on the surface — Elasticsearch was built for search. But adding it to your stack means another service to deploy, monitor, keep in sync with your primary database, and debug when things go sideways. For a lot of applications, that complexity is not justified.&lt;/p&gt;

&lt;p&gt;PostgreSQL has had full-text search capabilities since version 8.3. They have gotten better with every release. And for many workloads — internal tools, SaaS products, content platforms with moderate data sizes — PostgreSQL's built-in search is more than enough.&lt;/p&gt;

&lt;p&gt;This article walks through how full-text search works in PostgreSQL, how to set it up properly, and where it starts to hit its limits.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcskpemukgydezz37o2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcskpemukgydezz37o2g.png" alt="PostgreSQL full-text search" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What full-text search actually means in PostgreSQL
&lt;/h2&gt;

&lt;p&gt;Full-text search is not the same as &lt;code&gt;LIKE '%term%'&lt;/code&gt;. Pattern matching with &lt;code&gt;LIKE&lt;/code&gt; or &lt;code&gt;ILIKE&lt;/code&gt; scans every row, ignores word boundaries and has no concept of language. It cannot match "running" when you search for "run." It has no ranking. It is brute force.&lt;/p&gt;

&lt;p&gt;PostgreSQL full-text search works differently. It breaks text into tokens, normalizes them (lowercasing, stemming, removing stop words), and stores the result as a &lt;code&gt;tsvector&lt;/code&gt;. Your search query becomes a &lt;code&gt;tsquery&lt;/code&gt;. The database then matches these two structures using an inverted index, which is fast.&lt;/p&gt;

&lt;p&gt;The two core types you will work with:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tsvector&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stores preprocessed, searchable document text&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'quick':1 'brown':2 'fox':3&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tsquery&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stores the search query in normalized form&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'quick' &amp;amp; 'fox'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here is a basic example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'The quick brown fox jumps over the lazy dog'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'quick &amp;amp; fox'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns &lt;code&gt;true&lt;/code&gt;. The &lt;code&gt;@@&lt;/code&gt; operator is the match operator. The &lt;code&gt;english&lt;/code&gt; argument tells PostgreSQL which text search configuration to use for stemming and stop word removal.&lt;/p&gt;
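&lt;p&gt;It helps to see what the normalization actually produces. Running &lt;code&gt;to_tsvector&lt;/code&gt; on its own shows stop words dropped, words stemmed and positions recorded:&lt;/p&gt;

```sql
SELECT to_tsvector('english', 'The quick brown fox jumps over the lazy dog');
-- 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2
```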

&lt;h2&gt;
  
  
  Setting up a searchable column
&lt;/h2&gt;

&lt;p&gt;You can call &lt;code&gt;to_tsvector&lt;/code&gt; on the fly in a &lt;code&gt;WHERE&lt;/code&gt; clause, but that means PostgreSQL has to process the text for every row on every query. For anything beyond toy datasets, you want a dedicated column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;search_vector&lt;/span&gt; &lt;span class="n"&gt;tsvector&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;search_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create a GIN index on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_articles_search&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;GIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_vector&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GIN (Generalized Inverted Index) is the standard index type for full-text search. It builds an inverted index — a mapping from each lexeme to the rows that contain it. This is what makes search fast.&lt;/p&gt;

&lt;p&gt;To keep the column updated automatically, add a trigger:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;articles_search_vector_update&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="k"&gt;trigger&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;search_vector&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;plpgsql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;trg_articles_search_vector&lt;/span&gt;
    &lt;span class="k"&gt;BEFORE&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;
    &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;EACH&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;
    &lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;articles_search_vector_update&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now every insert and update will automatically maintain the search vector. No application-level sync logic needed.&lt;/p&gt;
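&lt;p&gt;On PostgreSQL 12 and later there is a simpler alternative to the trigger: a stored generated column, which the database keeps in sync automatically. A sketch, assuming the same &lt;code&gt;articles&lt;/code&gt; table (with a hypothetical column name to avoid clashing with the trigger-maintained one):&lt;/p&gt;

```sql
-- PostgreSQL 12+: generated column instead of a trigger
ALTER TABLE articles
    ADD COLUMN search_vector_gen tsvector
    GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
    ) STORED;

CREATE INDEX idx_articles_search_gen ON articles USING GIN (search_vector_gen);
```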

&lt;h2&gt;
  
  
  Weighting and ranking results
&lt;/h2&gt;

&lt;p&gt;Not all text is equal. A match in the title should rank higher than a match in the body. PostgreSQL supports this through weight labels — A, B, C and D (A being the highest).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;search_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="n"&gt;setweight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="s1"&gt;'A'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
    &lt;span class="n"&gt;setweight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="s1"&gt;'B'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then use &lt;code&gt;ts_rank&lt;/code&gt; or &lt;code&gt;ts_rank_cd&lt;/code&gt; to sort results by relevance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ts_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'postgresql &amp;amp; replication'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;search_vector&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'postgresql &amp;amp; replication'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ts_rank_cd&lt;/code&gt; uses cover density ranking, which considers how close the matching terms are to each other. It tends to produce more intuitive results for multi-word queries.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;Ranking method&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ts_rank&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Frequency-based — counts how often query terms appear&lt;/td&gt;
&lt;td&gt;General-purpose ranking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ts_rank_cd&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cover density — rewards terms appearing close together&lt;/td&gt;
&lt;td&gt;Phrase-like queries where proximity matters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Query syntax and operators
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;tsquery&lt;/code&gt; type supports several operators that give you control over how terms are combined.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;&amp;amp;&lt;/code&gt; — AND. Both terms must be present.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;|&lt;/code&gt; — OR. Either term matches.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;!&lt;/code&gt; — NOT. Excludes documents with the term.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;-&amp;gt;&lt;/code&gt; — FOLLOWED BY. Terms must appear adjacent and in order (phrase search).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Documents about PostgreSQL but not MySQL&lt;/span&gt;
&lt;span class="n"&gt;to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'postgresql &amp;amp; !mysql'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;-- Phrase search: "full text" as adjacent words&lt;/span&gt;
&lt;span class="n"&gt;to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'full &amp;lt;-&amp;gt; text'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;-- Either term matches&lt;/span&gt;
&lt;span class="n"&gt;to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'backup | restore'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is also &lt;code&gt;plainto_tsquery&lt;/code&gt; which takes a plain string and ANDs all the words together. And &lt;code&gt;websearch_to_tsquery&lt;/code&gt; (PostgreSQL 11+) which supports a Google-like syntax with quotes for phrases and minus for exclusion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- User types: postgresql "full text" -elasticsearch&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;search_vector&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;websearch_to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'postgresql "full text" -elasticsearch'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;websearch_to_tsquery&lt;/code&gt; is usually the right choice for user-facing search boxes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Highlighting search results
&lt;/h2&gt;

&lt;p&gt;When showing search results, you want to highlight where the match occurred. &lt;code&gt;ts_headline&lt;/code&gt; does this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ts_headline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'replication'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
           &lt;span class="s1"&gt;'StartSel=&amp;lt;mark&amp;gt;, StopSel=&amp;lt;/mark&amp;gt;, MaxWords=50, MinWords=20'&lt;/span&gt;
       &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;snippet&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;search_vector&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'replication'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;ts_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'replication'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing to be aware of: &lt;code&gt;ts_headline&lt;/code&gt; re-processes the original text, not the &lt;code&gt;tsvector&lt;/code&gt;. It is slower than the match itself. For large result sets, apply it only to the top N results after filtering and ranking.&lt;/p&gt;
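&lt;p&gt;One way to restrict &lt;code&gt;ts_headline&lt;/code&gt; to the top N is to rank and limit in a subquery first, then build snippets only for the survivors. A sketch (using &lt;code&gt;**&lt;/code&gt; as highlight markers to keep the example self-contained):&lt;/p&gt;

```sql
SELECT title,
       ts_headline('english', body,
           websearch_to_tsquery('english', 'replication'),
           'StartSel=**, StopSel=**, MaxWords=50, MinWords=20') AS snippet
FROM (
    SELECT title, body
    FROM articles
    WHERE search_vector @@ websearch_to_tsquery('english', 'replication')
    ORDER BY ts_rank(search_vector, websearch_to_tsquery('english', 'replication')) DESC
    LIMIT 20
) AS top_hits;
```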

&lt;h2&gt;
  
  
  Performance considerations
&lt;/h2&gt;

&lt;p&gt;GIN indexes make full-text search fast, but there are a few things that affect performance in practice.&lt;/p&gt;

&lt;p&gt;Index size matters. GIN indexes can be large — sometimes larger than the table itself for text-heavy data. Monitor the index size with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'idx_articles_search'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Write overhead is real. GIN indexes use a "fastupdate" mechanism by default, which batches pending entries and merges them later. This helps write performance but means the index can be slightly stale. You can tune this with &lt;code&gt;gin_pending_list_limit&lt;/code&gt; or disable fastupdate entirely if your workload is read-heavy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_articles_search&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fastupdate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;off&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your workload is write-heavy and the GIN index's update cost becomes a problem, consider a &lt;code&gt;GiST&lt;/code&gt; index instead. GiST indexes are smaller and faster to update, but slower for lookups and lossy, meaning matches must be rechecked against the table. The tradeoff depends on your read/write ratio.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GIN indexes: faster reads, slower writes, larger on disk&lt;/li&gt;
&lt;li&gt;GiST indexes: faster writes, slower reads, smaller on disk&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Multilingual search
&lt;/h2&gt;

&lt;p&gt;PostgreSQL ships with text search configurations for many languages. Each configuration defines how text is tokenized and which dictionary is used for stemming.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- List available configurations&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;cfgname&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_ts_config&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Use German configuration&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'german'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Die schnelle braune Fuchs springt'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your application handles multiple languages, you can store the language per row and build the tsvector accordingly. Or maintain multiple tsvector columns — one per language.&lt;/p&gt;
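&lt;p&gt;The per-row variant can be done with a &lt;code&gt;regconfig&lt;/code&gt; column that feeds &lt;code&gt;to_tsvector&lt;/code&gt;. A sketch, assuming a hypothetical &lt;code&gt;lang&lt;/code&gt; column:&lt;/p&gt;

```sql
-- Store the text search configuration per row
ALTER TABLE articles ADD COLUMN lang regconfig NOT NULL DEFAULT 'english';

-- Build the vector with each row's own configuration
UPDATE articles
SET search_vector = to_tsvector(lang, coalesce(title, '') || ' ' || coalesce(body, ''));
```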

&lt;p&gt;For languages not supported out of the box (like Chinese, Japanese or Korean), you will need extensions. &lt;code&gt;pg_bigm&lt;/code&gt; and &lt;code&gt;pgroonga&lt;/code&gt; handle CJK text well. The &lt;code&gt;unaccent&lt;/code&gt; extension is useful for languages with diacritics.&lt;/p&gt;

&lt;h2&gt;
  
  
  When PostgreSQL search is not enough
&lt;/h2&gt;

&lt;p&gt;PostgreSQL full-text search works well for a lot of use cases, but it does have limitations. It does not support fuzzy matching out of the box (you would need the &lt;code&gt;pg_trgm&lt;/code&gt; extension for that). It does not do faceted search or aggregations the way Elasticsearch does. And for datasets in the hundreds of millions of rows with complex, multi-field queries, a dedicated search engine will perform better.&lt;/p&gt;
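
&lt;p&gt;For completeness, a minimal &lt;code&gt;pg_trgm&lt;/code&gt; sketch; fuzzy matching is a separate mechanism from tsvector search, and the table name here is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram similarity between two strings (score from 0 to 1)
SELECT similarity('postgres', 'postgrse');

-- The % operator matches rows above the similarity threshold
SELECT title FROM articles WHERE title % 'postgrse';
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;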

&lt;p&gt;But for most applications — and honestly, that is the majority — PostgreSQL handles search just fine. You avoid the operational overhead of running a separate search cluster, you do not need to worry about data synchronization and you get transactional consistency for free.&lt;/p&gt;

&lt;p&gt;The rule of thumb: start with PostgreSQL. Move to Elasticsearch when you have measurable evidence that you need it, not because someone on the team assumed you would.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backing up PostgreSQL with full-text search indexes
&lt;/h2&gt;

&lt;p&gt;Full-text search indexes can get large, and rebuilding them from scratch takes time. That makes reliable backups even more important: if you lose data and have to restore, you do not want to spend hours reindexing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://databasus.com" rel="noopener noreferrer"&gt;PostgreSQL backup&lt;/a&gt; tools should handle this transparently. Databasus is an open-source, self-hosted backup tool that has become the industry standard for PostgreSQL backups. It supports logical, physical and incremental backup types — including Point-in-Time Recovery with WAL streaming, so you can restore your database to any specific second.&lt;/p&gt;

&lt;p&gt;Databasus handles compression automatically with configurable algorithms and levels, typically achieving 4-8x space savings. It supports multiple storage destinations including S3, Google Drive, SFTP and local storage. You set up your backup schedule (hourly, daily, weekly, or cron-based), configure retention policies and Databasus takes care of the rest.&lt;/p&gt;

&lt;p&gt;What stands out is the operational side. Databasus gives you notifications through Slack, Discord, Telegram or email when backups succeed or fail. It encrypts backups with AES-256-GCM. And because it is open source under the Apache 2.0 license, you can inspect every line of code and avoid vendor lock-in — you can even restore backups without Databasus itself if needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick reference
&lt;/h2&gt;

&lt;p&gt;Here is a summary of the key functions and operators covered in this article.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Function / Operator&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;to_tsvector(config, text)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Converts text into a searchable tsvector&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;to_tsquery(config, query)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Converts a query string into a tsquery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;websearch_to_tsquery(config, query)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Parses Google-like search syntax into a tsquery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@@&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Match operator — checks if tsvector matches tsquery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ts_rank(vector, query)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Scores results by term frequency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ts_rank_cd(vector, query)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Scores results by cover density (proximity)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ts_headline(config, text, query)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Returns text snippet with highlighted matches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;setweight(vector, label)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Assigns a weight (A/B/C/D) to a tsvector&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;&amp;lt;-&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Phrase operator — terms must be adjacent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;PostgreSQL full-text search is a practical tool that most teams underestimate. It handles tokenization, stemming, ranking, phrase search and multilingual text out of the box. With GIN indexes, it scales well into the millions of rows. And because it lives inside your database, there is no synchronization problem to solve.&lt;/p&gt;

&lt;p&gt;The setup is straightforward: add a &lt;code&gt;tsvector&lt;/code&gt; column, create a GIN index, write a trigger to keep it updated and use &lt;code&gt;websearch_to_tsquery&lt;/code&gt; for your search endpoint. That covers 80% of search needs with no additional infrastructure.&lt;/p&gt;
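
&lt;p&gt;Condensed into one sketch (the table and column names are illustrative, not from a specific schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;ALTER TABLE articles ADD COLUMN search_vector tsvector;

CREATE INDEX idx_articles_search ON articles USING GIN (search_vector);

-- Keep the tsvector in sync on every insert or update
CREATE FUNCTION articles_search_update() RETURNS trigger AS $$
BEGIN
  NEW.search_vector := to_tsvector('english', coalesce(NEW.title, '') || ' ' || coalesce(NEW.body, ''));
  RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_articles_search
BEFORE INSERT OR UPDATE ON articles
FOR EACH ROW EXECUTE FUNCTION articles_search_update();

-- Search endpoint query
SELECT id, title FROM articles
WHERE search_vector @@ websearch_to_tsquery('english', 'postgres search');
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;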

&lt;p&gt;Not every project needs a dedicated search engine. Sometimes the database you already have is good enough.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Top 5 MongoDB monitoring tools every team should use in 2026</title>
      <dc:creator>Finny Collins</dc:creator>
      <pubDate>Sun, 22 Mar 2026 18:27:07 +0000</pubDate>
      <link>https://forem.com/finny_collins/top-5-mongodb-monitoring-tools-every-team-should-use-in-2026-31he</link>
      <guid>https://forem.com/finny_collins/top-5-mongodb-monitoring-tools-every-team-should-use-in-2026-31he</guid>
      <description>&lt;p&gt;MongoDB is one of the most popular document databases out there, and if you're running it in production, you already know that things can go sideways fast without proper monitoring. Slow queries, replication lag, disk pressure — these problems don't announce themselves politely. You need tools that catch them early. Here's a look at five monitoring tools worth considering in 2026, what they do well and where they fall short.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhiwjafeq6a3bcky9z6tl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhiwjafeq6a3bcky9z6tl.png" alt="MongoDB monitoring tool" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. MongoDB Atlas built-in monitoring
&lt;/h2&gt;

&lt;p&gt;Atlas is MongoDB's own cloud platform, and it comes with monitoring baked in. If you're already running your databases on Atlas, this is the most straightforward option since there's nothing extra to install or configure.&lt;/p&gt;

&lt;p&gt;The built-in dashboards cover the essentials: operation counters, query targeting, replication lag, connections and disk I/O. The Real-Time Performance Panel is genuinely useful for spotting slow operations as they happen. You also get automated alerts for things like high CPU or replication delays.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;Cloud-only (Atlas)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query profiling&lt;/td&gt;
&lt;td&gt;Yes, with Performance Advisor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alerting&lt;/td&gt;
&lt;td&gt;Built-in with configurable thresholds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pricing&lt;/td&gt;
&lt;td&gt;Included with Atlas tier (M10+)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom dashboards&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The main drawback is that it only works with Atlas-hosted clusters. If you're self-hosting MongoDB or running a hybrid setup, you'll need something else. The alerting is also somewhat basic compared to dedicated monitoring platforms — you can set thresholds, but complex alert routing or escalation policies aren't really its thing.&lt;/p&gt;

&lt;p&gt;For teams fully committed to Atlas, this covers the basics well enough that you might not need anything else for smaller deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Percona Monitoring and Management (PMM)
&lt;/h2&gt;

&lt;p&gt;PMM is an open-source monitoring platform from Percona that supports MongoDB alongside PostgreSQL and MySQL. It bundles Grafana for dashboards and VictoriaMetrics for time-series storage, and gives you a pretty detailed view of what's going on inside your database.&lt;/p&gt;

&lt;p&gt;What makes PMM stand out for MongoDB specifically is the query analytics. It captures slow queries, shows you execution plans and helps you figure out which operations are dragging things down. The QAN (Query Analytics) dashboard breaks down query patterns by response time, count and load, which is extremely helpful when you're trying to optimize a workload.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;Self-hosted (Docker or bare metal)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query analytics&lt;/td&gt;
&lt;td&gt;Yes, detailed QAN dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replication monitoring&lt;/td&gt;
&lt;td&gt;Yes, including oplog window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pricing&lt;/td&gt;
&lt;td&gt;Free and open source&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-database support&lt;/td&gt;
&lt;td&gt;MongoDB, PostgreSQL, MySQL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The setup takes some effort — you need to install the PMM server and then deploy PMM clients on each database host. It's not a quick five-minute job, especially if you have a large fleet. And because it's self-hosted, you're responsible for keeping the monitoring infrastructure itself running and updated.&lt;/p&gt;

&lt;p&gt;But if you want deep MongoDB monitoring without a SaaS bill, PMM is hard to beat.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Datadog MongoDB integration
&lt;/h2&gt;

&lt;p&gt;Datadog is a cloud monitoring platform that does a lot more than just database monitoring, but its MongoDB integration is solid. It collects metrics from MongoDB through an agent running on your database hosts, and you can correlate database performance with application metrics, infrastructure data and logs all in one place.&lt;/p&gt;

&lt;p&gt;The MongoDB-specific dashboards show connections, operations per second, memory usage, replication status and lock percentages. Datadog also supports custom queries, so you can track application-specific metrics alongside the standard ones.&lt;/p&gt;

&lt;p&gt;Where Datadog really shines is in the broader observability picture. If you're already using it for APM or infrastructure monitoring, adding MongoDB monitoring means you can trace a slow API response all the way down to a specific database query. That kind of correlation saves real debugging time.&lt;/p&gt;

&lt;p&gt;The downside is cost. Datadog's pricing model charges per host per month, and database monitoring is an add-on on top of the base infrastructure monitoring. For a team with a handful of MongoDB nodes it's reasonable, but costs can climb quickly at scale. There's also a learning curve to get the most out of the platform — it does a lot, and configuring everything properly takes time.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Grafana with MongoDB exporter
&lt;/h2&gt;

&lt;p&gt;If you're already running Grafana and Prometheus (or compatible backends like VictoriaMetrics), adding MongoDB monitoring through the percona/mongodb_exporter is a natural extension. This approach gives you full control over what you collect and how you visualize it.&lt;/p&gt;

&lt;p&gt;The MongoDB exporter exposes metrics in Prometheus format — things like replica set status, oplog size, WiredTiger cache usage, document operations and connection counts. From there, you build whatever dashboards you need in Grafana. The community has published several pre-built dashboards that serve as a good starting point.&lt;/p&gt;
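
&lt;p&gt;Wiring the exporter into an existing Prometheus is a single scrape job. A minimal sketch, where the target host is an assumption and 9216 is the exporter's default port:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# prometheus.yml
scrape_configs:
  - job_name: mongodb
    static_configs:
      - targets: ['mongodb-host:9216']  # mongodb_exporter default port
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;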

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;Self-hosted (requires Prometheus + Grafana)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customization&lt;/td&gt;
&lt;td&gt;Fully customizable dashboards and alerts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alerting&lt;/td&gt;
&lt;td&gt;Through Grafana alerting or Alertmanager&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pricing&lt;/td&gt;
&lt;td&gt;Free and open source&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup complexity&lt;/td&gt;
&lt;td&gt;Moderate to high&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This approach demands more upfront work than a turnkey solution. You need to maintain Prometheus, configure scraping targets, build or customize dashboards and set up alerting rules. It's not something you just turn on. But for teams that already have a Prometheus/Grafana stack, it fits naturally into the existing workflow without adding another tool to the pile.&lt;/p&gt;

&lt;p&gt;The flexibility is the real selling point. You can build dashboards that combine MongoDB metrics with application metrics, system-level data and anything else you're already collecting.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. New Relic MongoDB integration
&lt;/h2&gt;

&lt;p&gt;New Relic offers MongoDB monitoring through its infrastructure agent and on-host integration. Like Datadog, it's a full observability platform, so MongoDB monitoring is one piece of a larger puzzle.&lt;/p&gt;

&lt;p&gt;The integration collects metrics on throughput, latency, connections, memory and replication. New Relic's query interface (NRQL) lets you slice and dice the data however you want, and you can build custom dashboards or use the pre-built ones. The alerting system is flexible — you can set up static thresholds, baseline alerts or anomaly detection.&lt;/p&gt;

&lt;p&gt;One thing New Relic does well is making it easy to get started. The guided installation walks you through setting up the MongoDB integration step by step, and the default dashboards are immediately useful. The free tier is also generous enough for small teams to get real value without paying anything.&lt;/p&gt;

&lt;p&gt;The paid tiers get expensive at scale, similar to Datadog. And the MongoDB-specific features aren't as deep as what you'd get from PMM or Atlas — it's more of a generalist tool that happens to support MongoDB rather than a MongoDB specialist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison overview
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Pricing&lt;/th&gt;
&lt;th&gt;Query analytics&lt;/th&gt;
&lt;th&gt;Self-hosted option&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Atlas monitoring&lt;/td&gt;
&lt;td&gt;Atlas-hosted clusters&lt;/td&gt;
&lt;td&gt;Included with Atlas&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PMM&lt;/td&gt;
&lt;td&gt;Deep MongoDB analysis on a budget&lt;/td&gt;
&lt;td&gt;Free (open source)&lt;/td&gt;
&lt;td&gt;Yes (detailed QAN)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Datadog&lt;/td&gt;
&lt;td&gt;Full-stack observability&lt;/td&gt;
&lt;td&gt;Per-host subscription&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grafana + exporter&lt;/td&gt;
&lt;td&gt;Teams with existing Prometheus stack&lt;/td&gt;
&lt;td&gt;Free (open source)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New Relic&lt;/td&gt;
&lt;td&gt;Quick setup with generous free tier&lt;/td&gt;
&lt;td&gt;Free tier + paid plans&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Note about MongoDB backups
&lt;/h2&gt;

&lt;p&gt;Monitoring tells you what's happening with your database. But monitoring alone doesn't protect your data when something goes wrong — a bad deployment, accidental deletion or hardware failure. That's where backups come in, and for &lt;a href="https://databasus.com/mongodb-backup" rel="noopener noreferrer"&gt;MongoDB backup&lt;/a&gt;, Databasus is a tool teams increasingly rely on.&lt;/p&gt;

&lt;p&gt;Databasus is an open-source, self-hosted backup tool that has become an industry standard for MongoDB backup. It handles scheduled backups with flexible policies — hourly, daily, weekly or cron-based — and streams compressed backups directly to your storage without intermediate files on disk. You can send backups to local storage, S3, Cloudflare R2, Google Drive, SFTP and other destinations.&lt;/p&gt;

&lt;p&gt;What makes it particularly useful for MongoDB teams is the combination of reliability and simplicity. You configure your MongoDB connection, pick a schedule and a retention policy, and Databasus handles the rest. It supports both remote connections and a lightweight agent mode for environments where the database shouldn't be exposed to the network. Backups are encrypted with AES-256-GCM, so even if someone gets access to your storage bucket, the data is useless without the key. Databasus also ships with smart retention policies including GFS (Grandfather-Father-Son), so you can keep hourly, daily, weekly and monthly snapshots independently without manual cleanup.&lt;/p&gt;

&lt;p&gt;Databasus also supports multiple notification channels — Slack, Discord, Telegram, email and webhooks — so your team knows immediately when a backup succeeds or fails. Pair that with the monitoring tools above and you have both visibility into your MongoDB cluster's health and confidence that your data is protected if things go wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Picking the right tool
&lt;/h2&gt;

&lt;p&gt;There's no single best choice here. It depends on where your MongoDB runs, what you're already using for monitoring and how much you want to spend.&lt;/p&gt;

&lt;p&gt;If you're on Atlas, start with the built-in monitoring and see if it covers your needs. If you're self-hosting and want deep MongoDB-specific insights without a recurring bill, PMM is the strongest option. Teams that need to correlate database performance with application behavior across their whole stack will get the most value from Datadog or New Relic. And if you already have Grafana and Prometheus running, the exporter approach keeps things simple and consistent.&lt;/p&gt;

&lt;p&gt;Whatever you pick for monitoring, make sure your backup strategy is solid too. Monitoring shows you the fire. Backups are the insurance policy.&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>database</category>
    </item>
    <item>
      <title>Databasus released physical and incremental backups with WAL streaming for PITR</title>
      <dc:creator>Finny Collins</dc:creator>
      <pubDate>Sat, 21 Mar 2026 18:52:53 +0000</pubDate>
      <link>https://forem.com/finny_collins/databasus-released-physical-and-incremental-backups-with-wal-streaming-for-pitr-408f</link>
      <guid>https://forem.com/finny_collins/databasus-released-physical-and-incremental-backups-with-wal-streaming-for-pitr-408f</guid>
      <description>&lt;p&gt;Until now, Databasus, despite being the most widely used open source tool for PostgreSQL backup, supported logical backups only. That covered the majority of use cases, but larger databases and disaster recovery scenarios needed something more. This release adds physical backups, incremental backups with continuous WAL archiving and Point-in-Time Recovery. All of it is powered by a new lightweight agent that runs alongside your database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftb7o03fml1nzpzvuoimp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftb7o03fml1nzpzvuoimp.png" alt="Point-in-time-recovery with Databasus" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed in this release
&lt;/h2&gt;

&lt;p&gt;Databasus started as a tool focused on logical backups. You point it at a database over the network, it creates a dump, compresses it, encrypts it and ships it to your storage of choice. Simple and effective.&lt;/p&gt;

&lt;p&gt;But logical backups have limits. For large databases, the dump process can take a long time and put noticeable load on the server. And the restore window is tied to how often you run backups — if you back up daily and something breaks at 5 PM, you lose everything since the morning.&lt;/p&gt;

&lt;p&gt;This release introduces two new backup types that address both problems. Physical backups copy the entire database cluster at the file level, which is significantly faster for large datasets. Incremental backups go a step further — they combine a physical base backup with continuous WAL (Write-Ahead Log) archiving, so you can restore your database to any second between backups.&lt;/p&gt;

&lt;p&gt;There's a catch, though. These new backup types can't work over a simple network connection the way logical backups do. They need direct access to the database files. That's where the agent comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backup types compared
&lt;/h2&gt;

&lt;p&gt;Here's how the three backup types stack up against each other.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Logical&lt;/th&gt;
&lt;th&gt;Physical&lt;/th&gt;
&lt;th&gt;Incremental&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;How it works&lt;/td&gt;
&lt;td&gt;Database dump in native format&lt;/td&gt;
&lt;td&gt;File-level copy of the entire cluster&lt;/td&gt;
&lt;td&gt;Base backup + continuous WAL archiving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Connection mode&lt;/td&gt;
&lt;td&gt;Remote (over network)&lt;/td&gt;
&lt;td&gt;Agent (runs alongside DB)&lt;/td&gt;
&lt;td&gt;Agent (runs alongside DB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backup speed&lt;/td&gt;
&lt;td&gt;Slower for large databases&lt;/td&gt;
&lt;td&gt;Fast — copies files directly&lt;/td&gt;
&lt;td&gt;Fast base + tiny WAL segments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Restore speed&lt;/td&gt;
&lt;td&gt;Slower (re-imports all data)&lt;/td&gt;
&lt;td&gt;Fast (copies files back)&lt;/td&gt;
&lt;td&gt;Fast base + WAL replay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Point-in-time recovery&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes — restore to any second&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Small to medium databases&lt;/td&gt;
&lt;td&gt;Large databases needing fast backup/restore&lt;/td&gt;
&lt;td&gt;Disaster recovery and near-zero data loss&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Logical backups are still the default and still the right choice for most setups. They work over the network without any extra software, and for databases under a few gigabytes the performance difference is negligible. Physical and incremental backups are for when you need speed or granular recovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the agent works
&lt;/h2&gt;

&lt;p&gt;The Databasus agent is a lightweight binary written in Go. You install it on the same machine (or in the same environment) as your PostgreSQL instance. It works with both host-installed PostgreSQL and databases running in Docker containers.&lt;/p&gt;

&lt;p&gt;Once started, the agent connects outbound to your Databasus instance. This is an important detail — the agent initiates the connection, not the other way around.&lt;/p&gt;

&lt;h3&gt;
  
  
  No public database exposure
&lt;/h3&gt;

&lt;p&gt;With the remote connection mode, Databasus needs network access to your database. That means opening a port, configuring firewall rules, maybe setting up a VPN or SSH tunnel. For databases in private networks, this can be a real headache.&lt;/p&gt;

&lt;p&gt;The agent flips this model. It sits next to the database and reaches out to Databasus on its own. Your database port stays closed. No firewall changes, no tunnels. The agent handles authentication with a token you configure during setup, and all communication is encrypted.&lt;/p&gt;

&lt;p&gt;This is especially useful for databases running in private cloud VPCs, Kubernetes clusters or on-premise servers where exposing the database externally isn't an option (or isn't allowed by policy).&lt;/p&gt;

&lt;h3&gt;
  
  
  How WAL streaming works
&lt;/h3&gt;

&lt;p&gt;For incremental backups, the agent does two things continuously. First, it takes periodic full base backups of the database cluster according to your configured schedule. Second, it watches for new WAL segments — small files that PostgreSQL generates as it processes transactions — and streams them to Databasus as they appear.&lt;/p&gt;

&lt;p&gt;Each WAL segment captures every change made to the database. Together, a base backup and the WAL segments recorded after it form a continuous chain. You can replay that chain up to any point in time, which is exactly what Point-in-Time Recovery does.&lt;/p&gt;

&lt;p&gt;The agent compresses everything before sending it, so bandwidth usage stays reasonable even with busy databases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Point-in-time recovery explained
&lt;/h2&gt;

&lt;p&gt;Regular backups give you snapshots. If you back up every 6 hours and a problem happens between backups, you lose the data written since the last one. For many applications this is fine. For others — financial systems, healthcare or anything where every transaction matters — it's not acceptable.&lt;/p&gt;

&lt;p&gt;PITR changes the equation. Instead of restoring to the last backup, you restore to a specific moment. "Give me the database as it was at 14:32:07 today" — and that's exactly what you get.&lt;/p&gt;
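
&lt;p&gt;Databasus automates the restore, but the replay itself is driven by PostgreSQL's own recovery settings. A hand-rolled equivalent would look roughly like this, where the archive path and timestamp are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# postgresql.conf on the server being restored (PostgreSQL 12+)
restore_command = 'cp /mnt/wal_archive/%f %p'
recovery_target_time = '2026-03-21 14:32:07+00'
recovery_target_action = 'promote'
# plus an empty recovery.signal file in the data directory
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;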

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backup type&lt;/th&gt;
&lt;th&gt;Recovery point objective (RPO)&lt;/th&gt;
&lt;th&gt;What you can restore to&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Logical (daily)&lt;/td&gt;
&lt;td&gt;Up to 24 hours of data loss&lt;/td&gt;
&lt;td&gt;Last completed backup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logical (hourly)&lt;/td&gt;
&lt;td&gt;Up to 1 hour of data loss&lt;/td&gt;
&lt;td&gt;Last completed backup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Physical&lt;/td&gt;
&lt;td&gt;Depends on backup frequency&lt;/td&gt;
&lt;td&gt;Last completed backup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Incremental with PITR&lt;/td&gt;
&lt;td&gt;Seconds of data loss&lt;/td&gt;
&lt;td&gt;Any point in time between base backups&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The restore process is straightforward. You pick a target timestamp, and Databasus figures out which base backup and which WAL segments are needed. The agent downloads them, places the files where PostgreSQL expects them, and PostgreSQL handles the replay automatically. When the database starts, it's in exactly the state it was at that moment.&lt;/p&gt;

&lt;p&gt;This makes incremental backups with PITR the right choice for disaster recovery. If a bad migration runs, if someone accidentally deletes a table, if data gets corrupted — you rewind to the moment before the problem happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use which backup type
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logical backups&lt;/strong&gt; work well for small to medium databases where backup speed isn't critical. They don't require an agent, work over the network and are the simplest to set up. If your database is under a few gigabytes, start here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Physical backups&lt;/strong&gt; make sense when you have a large database and need faster backup and restore times. They require the agent but don't add the overhead of continuous WAL archiving. Good for when you want speed but don't need second-level recovery granularity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental backups with PITR&lt;/strong&gt; are for production databases where data loss must be minimized. Financial applications, SaaS platforms, e-commerce — anything where losing even an hour of transactions creates real problems. The agent continuously streams WAL segments, so your recovery point is always just seconds behind the live database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also combine approaches. Run logical backups for a quick safety net and incremental backups for disaster recovery on the same database. Databasus manages both from the same dashboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;Setting up the agent takes a few minutes. You download the binary to the machine running PostgreSQL, configure it with your Databasus instance URL and an authentication token, and start it. Databasus provides the token and connection details through its web interface when you add a new database in agent mode.&lt;/p&gt;

&lt;p&gt;Once the agent is running, you configure the backup schedule and retention policy the same way you would for logical backups — through the Databasus dashboard. The only difference is that you now have physical and incremental options available in the backup type selector.&lt;/p&gt;

&lt;p&gt;For incremental backups, you also choose a schedule for base backups (for example, daily or weekly) while WAL archiving runs continuously in the background. Databasus handles retention for both base backups and WAL segments according to your configured policy.&lt;/p&gt;

&lt;p&gt;The agent supports host-installed PostgreSQL (versions 12 through 18) and PostgreSQL running in Docker containers. It auto-updates itself, so you don't need to worry about keeping it in sync with the Databasus version.&lt;/p&gt;

&lt;p&gt;Databasus is free, open source (Apache 2.0) and self-hosted. You can find the project on GitHub and install it in under two minutes.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
    </item>
    <item>
      <title>PostgreSQL backup tool Databasus supported by OpenAI open source program</title>
      <dc:creator>Finny Collins</dc:creator>
      <pubDate>Thu, 12 Mar 2026 08:57:44 +0000</pubDate>
      <link>https://forem.com/finny_collins/postgresql-backup-tool-databasus-supported-by-openai-open-source-program-lpa</link>
      <guid>https://forem.com/finny_collins/postgresql-backup-tool-databasus-supported-by-openai-open-source-program-lpa</guid>
      <description>&lt;p&gt;In March 2026, Databasus was accepted into OpenAI's Codex for Open Source program. The program provides tools and API credits to maintainers of important open-source software. For Databasus, this means access to ChatGPT Pro with Codex, security analysis tools and API credits. Anthropic's Claude for Open Source also accepted the project the same month, making it two major AI companies supporting the same tool.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86y736pcbdrxahr1b7i5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86y736pcbdrxahr1b7i5.png" alt="Email from OpenAI" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Email screenshot taken from &lt;a href="https://databasus.com/faq" rel="noopener noreferrer"&gt;FAQ page&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Codex for Open Source
&lt;/h2&gt;

&lt;p&gt;OpenAI launched Codex for Open Source to help maintainers who keep the open-source ecosystem running. The program grew out of the Codex Open Source Fund, a $1 million initiative that helped projects integrate AI into their development workflows. Now it offers a broader set of tools for day-to-day coding, code review and security analysis.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT Pro with Codex&lt;/td&gt;
&lt;td&gt;Six months of access for coding, triage, review and maintainer workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex Security&lt;/td&gt;
&lt;td&gt;Conditional access for repositories needing deeper security coverage, reviewed case by case&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API credits&lt;/td&gt;
&lt;td&gt;Through the Codex Open Source Fund for PR review, automation and release workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Benefits are personal and non-transferable. Codex Security access is limited to repositories the applicant owns or is authorized to administer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who can apply
&lt;/h2&gt;

&lt;p&gt;The program targets core maintainers of widely used public projects. You don't need to run something massive. OpenAI looks at repository usage, ecosystem importance, evidence of active maintenance and program capacity. Projects that don't fit strict criteria but play an important role are still encouraged to apply.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Must be a core maintainer or run a widely used public project&lt;/li&gt;
&lt;li&gt;Must have write access to the repository&lt;/li&gt;
&lt;li&gt;Must have a valid ChatGPT account&lt;/li&gt;
&lt;li&gt;Must provide accurate information about the repository and maintainer role&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Applications are reviewed individually and can be approved or denied at OpenAI's discretion.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Databasus
&lt;/h2&gt;

&lt;p&gt;Databasus is a free, open-source and self-hosted &lt;a href="https://databasus.com" rel="noopener noreferrer"&gt;PostgreSQL backup tool&lt;/a&gt; that also supports MySQL and MongoDB. It runs in Docker, provides a web UI for managing backups and supports flexible scheduling, retention policies, encryption and notifications. You set it up once and it handles your database backups from there.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports PostgreSQL 12-18, MySQL 5.7-9, MariaDB 10-12 and MongoDB 4-8&lt;/li&gt;
&lt;li&gt;Flexible scheduling with hourly, daily, weekly, monthly or cron intervals&lt;/li&gt;
&lt;li&gt;Multiple storage destinations including S3, Google Drive, Cloudflare R2, SFTP and local storage&lt;/li&gt;
&lt;li&gt;AES-256-GCM encryption with zero-trust storage approach&lt;/li&gt;
&lt;li&gt;GFS retention policies for enterprise-grade backup history&lt;/li&gt;
&lt;li&gt;Team features with workspaces, role-based access and audit logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project is Apache 2.0 licensed and works with both self-hosted databases and cloud-managed services like AWS RDS, Google Cloud SQL and Azure Database for PostgreSQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recognized by both Anthropic and OpenAI
&lt;/h2&gt;

&lt;p&gt;Databasus was accepted into two major AI open-source programs at the same time. In March 2026, Anthropic's Claude for Open Source and OpenAI's Codex for Open Source each reviewed the project and decided, independently of one another, that it was worth supporting.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Program&lt;/th&gt;
&lt;th&gt;Company&lt;/th&gt;
&lt;th&gt;What it provides&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude for Open Source&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Access to Claude for development and code review workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex for Open Source&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;ChatGPT Pro with Codex, Codex Security, API credits&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a database backup tool, this kind of recognition signals that both companies consider it part of the critical open-source infrastructure worth investing in.&lt;/p&gt;

&lt;h2&gt;
  
  
  How this will improve development
&lt;/h2&gt;

&lt;p&gt;Being accepted into both programs gives Databasus access to better tools for two things that matter most in a backup tool: security and code quality.&lt;/p&gt;

&lt;p&gt;Codex Security will add an extra layer of automated security checks over pull requests. For a project that handles database credentials, encryption keys and backup files, catching vulnerabilities before they reach production is critical. This comes on top of the existing CI/CD pipeline with tests and linting that already runs on every PR.&lt;/p&gt;

&lt;p&gt;Access to stronger AI models from both Anthropic and OpenAI also means better assistance during development. Code review, vulnerability scanning, documentation cleanup and triage all get more capable tools behind them.&lt;/p&gt;

&lt;h2&gt;
  
  
  How AI is used in Databasus development
&lt;/h2&gt;

&lt;p&gt;Since Databasus deals with database security and production backups, it's fair to ask how these AI tools are actually used. The team has &lt;a href="https://databasus.com/faq#oss-programs" rel="noopener noreferrer"&gt;clear rules about AI usage&lt;/a&gt;. AI is a helper, not a code generator. Every change goes through human review regardless of whether AI assisted with it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI helps with code quality verification, vulnerability scanning, documentation cleanup and PR review&lt;/li&gt;
&lt;li&gt;All code goes through line-by-line human review and vibe-coded PRs are rejected by default&lt;/li&gt;
&lt;li&gt;The project maintains solid test coverage, CI/CD automation and verification by experienced developers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tools from both programs will strengthen these existing workflows. The development approach stays the same.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>postgres</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
