<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Piter Adyson</title>
    <description>The latest articles on Forem by Piter Adyson (@piteradyson).</description>
    <link>https://forem.com/piteradyson</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3695030%2Ff4b91be7-ecc2-40b8-a468-e8051784554b.png</url>
      <title>Forem: Piter Adyson</title>
      <link>https://forem.com/piteradyson</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/piteradyson"/>
    <language>en</language>
    <item>
      <title>MongoDB vs PostgreSQL — 6 factors to consider when choosing your database</title>
      <dc:creator>Piter Adyson</dc:creator>
      <pubDate>Fri, 13 Feb 2026 17:38:20 +0000</pubDate>
      <link>https://forem.com/piteradyson/mongodb-vs-postgresql-6-factors-to-consider-when-choosing-your-database-35mj</link>
      <guid>https://forem.com/piteradyson/mongodb-vs-postgresql-6-factors-to-consider-when-choosing-your-database-35mj</guid>
      <description>&lt;p&gt;Choosing between MongoDB and PostgreSQL is one of the most important decisions you'll make for your project. Both databases are mature, reliable and widely used. But they're fundamentally different in how they store, query and scale data. This choice affects your development speed, operational costs and how easily your system can grow.&lt;/p&gt;

&lt;p&gt;Many developers pick a database based on what's familiar or what's trending. That's fine for small projects. But if you're building something that needs to scale or handle complex data relationships, you need to understand the real differences. This article breaks down six key factors to help you make an informed decision: data model, query complexity, scalability, consistency, performance and backup strategies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9qyym521ma10c79mr7y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9qyym521ma10c79mr7y.png" alt="PostgreSQL vs MongoDB" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Data model and schema flexibility
&lt;/h2&gt;

&lt;p&gt;The data model is probably the biggest difference between these two databases. PostgreSQL is a relational database that uses tables with strict schemas. You define columns, types and relationships upfront. MongoDB is a document database that stores JSON-like documents with flexible schemas. Each document can have different fields, and you can change the structure on the fly.&lt;/p&gt;

&lt;p&gt;PostgreSQL's structured approach works great when your data has clear relationships and you know the schema in advance. Think user accounts, orders, inventory or financial records. The strict schema catches errors early and ensures data integrity. But changing the schema later requires migrations, which can be painful on large datasets.&lt;/p&gt;

&lt;p&gt;MongoDB's flexibility is useful when you're building something new and your data model is still evolving. Or when you're dealing with semi-structured data like logs, events or user-generated content. You can store different document shapes in the same collection. No migrations needed. But that flexibility comes at a cost: unless you opt into MongoDB's optional schema validation (JSON Schema rules attached to a collection), you have to handle data validation in your application code instead of relying on the database.&lt;/p&gt;
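
&lt;p&gt;To make the trade-off concrete, here's a sketch of the PostgreSQL side (the table and column names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- PostgreSQL: structure and constraints are declared up front
CREATE TABLE users (
    id         BIGSERIAL PRIMARY KEY,
    email      TEXT NOT NULL UNIQUE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- This fails at the database level: email violates NOT NULL
-- INSERT INTO users (id) VALUES (1);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A MongoDB collection would accept documents with or without any of these fields unless you opt into validation.&lt;/p&gt;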

&lt;p&gt;Here's a quick comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;PostgreSQL&lt;/th&gt;
&lt;th&gt;MongoDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Schema&lt;/td&gt;
&lt;td&gt;Strict, predefined table structure&lt;/td&gt;
&lt;td&gt;Flexible, documents can vary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data relationships&lt;/td&gt;
&lt;td&gt;Built-in foreign keys and joins&lt;/td&gt;
&lt;td&gt;Manual references or embedding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema changes&lt;/td&gt;
&lt;td&gt;Requires migrations&lt;/td&gt;
&lt;td&gt;No migrations needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data validation&lt;/td&gt;
&lt;td&gt;Enforced at database level&lt;/td&gt;
&lt;td&gt;Application level (optional validators)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Well-defined, relational data&lt;/td&gt;
&lt;td&gt;Evolving, semi-structured data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The reality is most applications have structured data that benefits from PostgreSQL's relational model. User profiles, orders, products and analytics usually have predictable relationships. MongoDB makes sense when you're prototyping quickly or dealing with truly flexible data structures.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Query capabilities and complexity
&lt;/h2&gt;

&lt;p&gt;PostgreSQL has one of the most powerful query engines available. It supports complex joins, subqueries, window functions, common table expressions and full SQL. You can express almost any data relationship or transformation in a single query. Need to join five tables, aggregate by multiple dimensions and filter with complex conditions? PostgreSQL handles it.&lt;/p&gt;
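
&lt;p&gt;As an illustrative sketch (the &lt;code&gt;customers&lt;/code&gt; and &lt;code&gt;orders&lt;/code&gt; tables are hypothetical), a report that joins, aggregates and ranks is a single statement:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Join, aggregate and rank customers by paid revenue in one query
SELECT c.name,
       SUM(o.total) AS revenue,
       RANK() OVER (ORDER BY SUM(o.total) DESC) AS revenue_rank
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.status = 'paid'
GROUP BY c.name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;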

&lt;p&gt;MongoDB's query language is simpler and more limited. It's based on JSON-like syntax and works well for basic queries on single collections. But once you need to join data across collections or perform complex aggregations, things get awkward. MongoDB added a $lookup operator for joins, but it's slower and less flexible than SQL joins. You often end up making multiple queries or denormalizing your data to avoid joins entirely.&lt;/p&gt;

&lt;p&gt;For most business applications, query complexity matters. You'll need reports, analytics and ad-hoc queries. PostgreSQL makes this easy. MongoDB makes it painful unless you carefully structure your data to avoid joins.&lt;/p&gt;

&lt;p&gt;Here's a comparison of common query patterns:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query type&lt;/th&gt;
&lt;th&gt;PostgreSQL&lt;/th&gt;
&lt;th&gt;MongoDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Simple filters&lt;/td&gt;
&lt;td&gt;Fast and straightforward&lt;/td&gt;
&lt;td&gt;Fast and straightforward&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-table joins&lt;/td&gt;
&lt;td&gt;Native and efficient&lt;/td&gt;
&lt;td&gt;Limited, slower with $lookup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggregations&lt;/td&gt;
&lt;td&gt;Full SQL power&lt;/td&gt;
&lt;td&gt;Aggregation pipeline (good but limited)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full-text search&lt;/td&gt;
&lt;td&gt;Built-in with GIN indexes&lt;/td&gt;
&lt;td&gt;Text indexes available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex analytics&lt;/td&gt;
&lt;td&gt;Excellent with window functions&lt;/td&gt;
&lt;td&gt;Requires careful data modeling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ad-hoc queries&lt;/td&gt;
&lt;td&gt;Easy with standard SQL&lt;/td&gt;
&lt;td&gt;More difficult without joins&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If your application needs complex queries or you're not sure what queries you'll need later, PostgreSQL gives you more flexibility. MongoDB works if your access patterns are known upfront and you can structure your data accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Scalability and sharding
&lt;/h2&gt;

&lt;p&gt;MongoDB was built for horizontal scaling from the start. It has native sharding support that distributes data across multiple servers automatically. You can add more machines to handle more data or traffic. You choose a shard key, and MongoDB handles data distribution and routing queries to the right shards. This makes it easier to scale out without changing application code.&lt;/p&gt;

&lt;p&gt;PostgreSQL's strength is vertical scaling. You get better performance by upgrading to a bigger server with more CPU, RAM and faster storage. PostgreSQL can handle massive datasets on a single node if you have enough hardware. Horizontal scaling is possible through manual sharding or tools like Citus, but it's more complex and less mature than MongoDB's built-in approach.&lt;/p&gt;

&lt;p&gt;For most projects, vertical scaling is simpler and cheaper than horizontal scaling. Modern servers are powerful. A single PostgreSQL instance can handle millions of records and thousands of queries per second. You only need horizontal scaling if you're dealing with truly massive datasets (hundreds of terabytes) or extreme traffic levels (hundreds of thousands of concurrent users).&lt;/p&gt;

&lt;p&gt;The key questions to ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How much data will you have in 1-2 years?&lt;/li&gt;
&lt;li&gt;What's your traffic growth projection?&lt;/li&gt;
&lt;li&gt;Can you predict your bottlenecks?&lt;/li&gt;
&lt;li&gt;Do you have the expertise to manage sharded databases?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building a typical web application or SaaS product, PostgreSQL's vertical scaling will probably be enough. MongoDB's sharding makes sense if you're building something that needs to scale globally from day one or you know you'll hit horizontal scaling limits quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Consistency and ACID guarantees
&lt;/h2&gt;

&lt;p&gt;PostgreSQL provides full ACID guarantees for all transactions. Atomicity, Consistency, Isolation and Durability are guaranteed out of the box. Transactions spanning many rows and tables work correctly. If something fails, everything rolls back. Your data stays consistent even under high load or system failures.&lt;/p&gt;
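
&lt;p&gt;A minimal sketch, assuming a hypothetical &lt;code&gt;accounts&lt;/code&gt; table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Both updates commit together or not at all
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If either statement fails, a rollback (explicit or automatic on error) leaves both balances untouched.&lt;/p&gt;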

&lt;p&gt;MongoDB added multi-document ACID transactions in version 4.0 and extended them to sharded clusters in 4.2, but they're slower and more limited than PostgreSQL's transactions. Single-document operations in MongoDB are atomic, but cross-document consistency requires explicit transactions. In practice, many MongoDB users avoid transactions entirely by denormalizing data into single documents.&lt;/p&gt;

&lt;p&gt;For financial applications, e-commerce or anything where data integrity is critical, PostgreSQL's consistency guarantees are valuable. You can trust that your data won't end up in an inconsistent state. MongoDB works fine for use cases where eventual consistency is acceptable, like logging, caching or analytics pipelines.&lt;/p&gt;

&lt;p&gt;MongoDB does offer tunable consistency levels (write concerns and read concerns), which gives you flexibility. But that flexibility also means you need to think carefully about consistency trade-offs and configure them correctly. PostgreSQL just works consistently by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Performance characteristics
&lt;/h2&gt;

&lt;p&gt;Performance depends heavily on your use case and access patterns. MongoDB is generally faster for simple reads and writes on single documents. If you're doing lots of inserts or updates on independent records, MongoDB can outperform PostgreSQL. Document databases avoid join overhead by storing related data together.&lt;/p&gt;

&lt;p&gt;PostgreSQL is faster when you need complex queries, joins or aggregations. Its query planner is extremely sophisticated and can optimize complicated queries that would be slow or impossible in MongoDB. PostgreSQL also has better support for indexes, including partial indexes, expression indexes and various index types (B-tree, hash, GIN, GiST).&lt;/p&gt;
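
&lt;p&gt;For instance (table and column names here are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Partial index: only indexes rows the hot-path query touches
CREATE INDEX idx_orders_pending ON orders (created_at)
WHERE status = 'pending';

-- Expression index: serves case-insensitive email lookups
CREATE INDEX idx_users_email_lower ON users (lower(email));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;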

&lt;p&gt;Both databases can be fast if you design your schema and indexes properly. But they optimize for different workloads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MongoDB: fast single-document operations, high write throughput&lt;/li&gt;
&lt;li&gt;PostgreSQL: fast complex queries, efficient joins, flexible indexing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your application is read-heavy with complex queries, PostgreSQL will likely be faster. If you're write-heavy with simple access patterns, MongoDB might have an edge. In practice, most bottlenecks come from poor schema design or missing indexes, not the database choice itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Backup strategies and operational complexity
&lt;/h2&gt;

&lt;p&gt;Backups are critical for production databases. PostgreSQL has mature backup tools like pg_dump for logical backups and pg_basebackup for physical backups. Point-in-time recovery is available through WAL archiving. Most managed PostgreSQL services (AWS RDS, Google Cloud SQL, Azure Database) include automated backups with easy restore.&lt;/p&gt;
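
&lt;p&gt;As a quick sketch (the database name and paths are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Logical backup in custom format (restore with pg_restore)
pg_dump -Fc -d myapp -f /backups/myapp.dump

# Physical base backup; pairs with WAL archiving for point-in-time recovery
pg_basebackup -D /backups/base -X stream -P
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;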

&lt;p&gt;MongoDB has mongodump for logical backups and filesystem snapshots for physical backups. Backups are straightforward for single-node deployments. But backing up sharded MongoDB clusters requires careful coordination to ensure consistency across shards. You need to stop the balancer and take snapshots at the same time on all shards.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://databasus.com/mongodb-backup" rel="noopener noreferrer"&gt;Databasus&lt;/a&gt; is an industry standard backup tool that supports both PostgreSQL and MongoDB. It handles scheduled backups, multiple storage destinations (S3, Google Drive, local storage) and notifications across Slack, Discord and email. Whether you're running PostgreSQL or MongoDB, Databasus simplifies backup management with a clean interface and reliable scheduling.&lt;/p&gt;

&lt;p&gt;Operational complexity also differs between these databases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt;: simpler operations, well-understood tooling, extensive documentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MongoDB&lt;/strong&gt;: more complex operations with sharding, requires specialized knowledge for production deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PostgreSQL has been around since 1996 and has decades of operational experience built into its tooling and documentation. MongoDB is newer (2009) and still evolving. If you don't have dedicated database administrators, PostgreSQL is easier to operate reliably.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which one should you choose?
&lt;/h2&gt;

&lt;p&gt;After comparing these six factors, here's a practical decision framework:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose PostgreSQL if you:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have structured, relational data with clear relationships&lt;/li&gt;
&lt;li&gt;Need complex queries, joins or analytical workloads&lt;/li&gt;
&lt;li&gt;Want full ACID guarantees and strong consistency&lt;/li&gt;
&lt;li&gt;Prefer simpler operations and well-established tooling&lt;/li&gt;
&lt;li&gt;Can scale vertically and don't need global horizontal scaling immediately&lt;/li&gt;
&lt;li&gt;Are building typical web applications, SaaS products or business software&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose MongoDB if you:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have flexible, semi-structured data with evolving schemas&lt;/li&gt;
&lt;li&gt;Need high write throughput with simple access patterns&lt;/li&gt;
&lt;li&gt;Know your access patterns upfront and can denormalize data&lt;/li&gt;
&lt;li&gt;Need native horizontal sharding for massive scale&lt;/li&gt;
&lt;li&gt;Are building applications like content management, catalogs or logging systems&lt;/li&gt;
&lt;li&gt;Have expertise to manage distributed database operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most developers building standard applications, PostgreSQL is the safer choice. It's more flexible for queries, easier to operate and handles most workloads efficiently. MongoDB makes sense for specific use cases where its strengths (flexibility, sharding, document model) align with your requirements.&lt;/p&gt;

&lt;p&gt;The good news is both databases are excellent. Either choice can work if you design your schema properly and use the database's strengths. But understanding these differences helps you pick the right tool and avoid fighting against your database later.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>mongodb</category>
    </item>
    <item>
      <title>8 MySQL security mistakes that expose your database to attackers</title>
      <dc:creator>Piter Adyson</dc:creator>
      <pubDate>Wed, 11 Feb 2026 19:55:32 +0000</pubDate>
      <link>https://forem.com/piteradyson/8-mysql-security-mistakes-that-expose-your-database-to-attackers-d3a</link>
      <guid>https://forem.com/piteradyson/8-mysql-security-mistakes-that-expose-your-database-to-attackers-d3a</guid>
      <description>&lt;p&gt;MySQL is one of the most deployed databases in the world, which also makes it one of the most targeted. A lot of MySQL installations in the wild are running with default settings, overly permissive user accounts and no encryption. Some of these are dev setups that accidentally went to production. Others are production systems that nobody ever hardened because "it's behind a firewall."&lt;/p&gt;

&lt;p&gt;This article covers eight real security mistakes that leave MySQL databases exposed. Not abstract threat models, but concrete misconfigurations that attackers actually look for and exploit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fph0ywdz2adykebe6jrin.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fph0ywdz2adykebe6jrin.png" alt="MySQL security mistakes" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Running with default credentials and the root account
&lt;/h2&gt;

&lt;p&gt;This sounds obvious, but it still happens constantly. Fresh MySQL installations often ship with a root account that has no password or a well-known default password. Automated scanners specifically look for MySQL instances on port 3306 with empty root passwords. It takes seconds to find and exploit.&lt;/p&gt;

&lt;p&gt;The root account in MySQL has unrestricted access to everything: all databases, all tables, all administrative commands. Using it for application connections means your app has full control over the server, including the ability to drop databases, create users and modify grants.&lt;/p&gt;

&lt;p&gt;Fix the root password immediately after installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="s1"&gt;'root'&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="s1"&gt;'localhost'&lt;/span&gt; &lt;span class="n"&gt;IDENTIFIED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="s1"&gt;'a-strong-random-password-here'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;FLUSH&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create separate accounts for each application with only the privileges it needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="s1"&gt;'app_user'&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="s1"&gt;'10.0.1.%'&lt;/span&gt; &lt;span class="n"&gt;IDENTIFIED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="s1"&gt;'another-strong-password'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;myapp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="s1"&gt;'app_user'&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="s1"&gt;'10.0.1.%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;FLUSH&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;10.0.1.%&lt;/code&gt; host restriction means this user can only connect from your application subnet. If someone steals the credentials, they can't use them from an arbitrary machine.&lt;/p&gt;

&lt;p&gt;Run &lt;code&gt;mysql_secure_installation&lt;/code&gt; on every new MySQL instance. It removes anonymous users, disables remote root login and drops the test database. This takes thirty seconds and closes the most common attack vectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Granting excessive privileges to application users
&lt;/h2&gt;

&lt;p&gt;Most MySQL applications need SELECT, INSERT, UPDATE and DELETE on specific databases. That's it. Yet it's common to see application accounts with &lt;code&gt;GRANT ALL PRIVILEGES ON *.*&lt;/code&gt; because someone copied a Stack Overflow answer during initial setup and never revisited it.&lt;/p&gt;

&lt;p&gt;The damage from excessive privileges scales with the access level. An application account with &lt;code&gt;FILE&lt;/code&gt; privilege can read any file the MySQL process can access on the server filesystem. &lt;code&gt;PROCESS&lt;/code&gt; lets it see all running queries, including those from other users. &lt;code&gt;SUPER&lt;/code&gt; lets it kill connections and change global variables.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Privilege&lt;/th&gt;
&lt;th&gt;What it allows&lt;/th&gt;
&lt;th&gt;Risk if compromised&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ALL PRIVILEGES ON *.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full administrative access&lt;/td&gt;
&lt;td&gt;Complete server takeover&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;FILE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Read/write server filesystem&lt;/td&gt;
&lt;td&gt;Credential theft, data exfiltration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;PROCESS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;View all running queries&lt;/td&gt;
&lt;td&gt;Exposure of sensitive queries and data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SUPER&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Kill connections, change configs&lt;/td&gt;
&lt;td&gt;Denial of service, configuration tampering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT, INSERT, UPDATE, DELETE ON app.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Standard CRUD on one database&lt;/td&gt;
&lt;td&gt;Limited to application data only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Audit your current grants to see what's actually assigned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Super_priv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;File_priv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Process_priv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Grant_priv&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;user&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;Super_priv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Y'&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;File_priv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Y'&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;Process_priv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Y'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your application user shows up in this list, something is wrong. Revoke what it doesn't need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;REVOKE&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'app_user'&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="s1"&gt;'10.0.1.%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;myapp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="s1"&gt;'app_user'&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="s1"&gt;'10.0.1.%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;FLUSH&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A good rule of thumb: if you can't explain why an account needs a specific privilege, it shouldn't have it.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Exposing MySQL to the internet without network restrictions
&lt;/h2&gt;

&lt;p&gt;By default, MySQL listens on all network interfaces. That means if your server has a public IP address, MySQL is reachable from the entire internet. Combined with weak credentials (mistake #1), this is how most MySQL breaches happen.&lt;/p&gt;

&lt;p&gt;Check your current binding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;VARIABLES&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'bind_address'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it shows &lt;code&gt;0.0.0.0&lt;/code&gt; or &lt;code&gt;*&lt;/code&gt;, MySQL is accepting connections from everywhere.&lt;/p&gt;

&lt;p&gt;Restrict it in &lt;code&gt;my.cnf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[mysqld]&lt;/span&gt;
&lt;span class="py"&gt;bind-address&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;127.0.0.1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This limits MySQL to local connections only. If your application runs on a different server, bind to the private network interface instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[mysqld]&lt;/span&gt;
&lt;span class="py"&gt;bind-address&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;10.0.1.5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But network binding alone isn't enough. Add firewall rules to restrict port 3306:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# iptables example: only allow MySQL connections from app server&lt;/span&gt;
iptables &lt;span class="nt"&gt;-A&lt;/span&gt; INPUT &lt;span class="nt"&gt;-p&lt;/span&gt; tcp &lt;span class="nt"&gt;--dport&lt;/span&gt; 3306 &lt;span class="nt"&gt;-s&lt;/span&gt; 10.0.1.10 &lt;span class="nt"&gt;-j&lt;/span&gt; ACCEPT
iptables &lt;span class="nt"&gt;-A&lt;/span&gt; INPUT &lt;span class="nt"&gt;-p&lt;/span&gt; tcp &lt;span class="nt"&gt;--dport&lt;/span&gt; 3306 &lt;span class="nt"&gt;-j&lt;/span&gt; DROP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also consider two related options. &lt;code&gt;skip-networking&lt;/code&gt; disables TCP connections entirely (only socket connections work), which is fine if the application is on the same host. &lt;code&gt;skip-name-resolve&lt;/code&gt; prevents DNS lookups for connecting hosts, which speeds up connections and removes DNS spoofing as an attack vector.&lt;/p&gt;

&lt;p&gt;If your application must reach MySQL over the internet, use an SSH tunnel or VPN instead of opening port 3306 directly. Never expose MySQL to the public internet, even with strong passwords.&lt;/p&gt;
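
&lt;p&gt;For example, a local port forward over SSH (hostnames and users are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Forward local port 3307 to MySQL on the database host over SSH
ssh -N -L 3307:127.0.0.1:3306 deploy@db.example.com

# Then connect locally; traffic to the server is encrypted by SSH
mysql -h 127.0.0.1 -P 3307 -u app_user -p
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;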

&lt;h2&gt;
  
  
  4. Not encrypting connections with TLS
&lt;/h2&gt;

&lt;p&gt;MySQL connections transmit data in plaintext by default. This includes queries, result sets, usernames and passwords. Anyone who can capture network traffic between your application and MySQL can read everything.&lt;/p&gt;

&lt;p&gt;This isn't just a theoretical concern. On shared hosting, cloud VPCs with misconfigured security groups and corporate networks, packet sniffing is a real threat. Even "private" networks aren't always as isolated as you think.&lt;/p&gt;

&lt;p&gt;Check if TLS is currently enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;VARIABLES&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%ssl%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To enable TLS, generate or obtain certificates and configure MySQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[mysqld]&lt;/span&gt;
&lt;span class="py"&gt;ssl-ca&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;/etc/mysql/ssl/ca-cert.pem&lt;/span&gt;
&lt;span class="py"&gt;ssl-cert&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;/etc/mysql/ssl/server-cert.pem&lt;/span&gt;
&lt;span class="py"&gt;ssl-key&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;/etc/mysql/ssl/server-key.pem&lt;/span&gt;
&lt;span class="py"&gt;require_secure_transport&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;ON&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;require_secure_transport = ON&lt;/code&gt; setting forces all connections to use TLS. Without it, clients can still connect unencrypted.&lt;/p&gt;

&lt;p&gt;You can also enforce TLS on a per-user basis, which is useful for a gradual rollout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="s1"&gt;'app_user'&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="s1"&gt;'10.0.1.%'&lt;/span&gt; &lt;span class="n"&gt;REQUIRE&lt;/span&gt; &lt;span class="n"&gt;SSL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;FLUSH&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify that connections are actually encrypted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ssl_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ssl_cipher&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;user&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'app_user'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And from the client side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;STATUS&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'Ssl_cipher'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;Ssl_cipher&lt;/code&gt; returns an empty string, the connection is unencrypted.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Leaving the binary log and data directory unprotected
&lt;/h2&gt;

&lt;p&gt;MySQL's binary log contains every data-modifying statement that runs against the database. If an attacker gains access to the filesystem, they can read the binary log and reconstruct your entire data history: every insert, update and delete.&lt;/p&gt;

&lt;p&gt;The data directory itself contains the actual table files. Depending on the storage engine, these might be readable with basic tools. InnoDB files can be parsed with specialized utilities to extract raw data, bypassing MySQL authentication entirely.&lt;/p&gt;

&lt;p&gt;Check your current file permissions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-la&lt;/span&gt; /var/lib/mysql/
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-la&lt;/span&gt; /var/log/mysql/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The MySQL data directory and log directory should be owned by the &lt;code&gt;mysql&lt;/code&gt; user and group, with no world-readable permissions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; mysql:mysql /var/lib/mysql
&lt;span class="nb"&gt;chmod &lt;/span&gt;750 /var/lib/mysql
&lt;span class="nb"&gt;chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; mysql:mysql /var/log/mysql
&lt;span class="nb"&gt;chmod &lt;/span&gt;750 /var/log/mysql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also protect the MySQL configuration file, which may contain passwords:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod &lt;/span&gt;600 /etc/mysql/my.cnf
&lt;span class="nb"&gt;chown &lt;/span&gt;root:root /etc/mysql/my.cnf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're running MySQL in Docker, make sure the volume mounts for data and logs aren't world-readable on the host filesystem. Default Docker volume permissions can be more permissive than you expect.&lt;/p&gt;
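
&lt;p&gt;A minimal sketch of tightening that up (the host path is an example; the official Debian-based &lt;code&gt;mysql&lt;/code&gt; image runs as UID 999, but verify this for the image you use):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Create the host directory with restrictive permissions before first start
# (999 is the mysql UID in the official image; check with: docker run --rm mysql:8.0 id mysql)
install -d -m 750 -o 999 -g 999 /srv/mysql-data

# Pass the root password via a Docker secret instead of a plaintext env var
docker run -d --name db -v /srv/mysql-data:/var/lib/mysql \
  -e MYSQL_ROOT_PASSWORD_FILE=/run/secrets/mysql_root mysql:8.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;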

&lt;p&gt;For the binary log specifically, consider encrypting it. MySQL 8.0+ supports binary log encryption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[mysqld]&lt;/span&gt;
&lt;span class="py"&gt;binlog_encryption&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;ON&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This encrypts the binary log files at rest. Even if someone copies the files, they can't read the contents without the encryption key.&lt;/p&gt;
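
&lt;p&gt;After restarting, confirm that encryption is actually active. Since MySQL 8.0.14, &lt;code&gt;SHOW BINARY LOGS&lt;/code&gt; includes an &lt;code&gt;Encrypted&lt;/code&gt; column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SHOW VARIABLES LIKE 'binlog_encryption';
SHOW BINARY LOGS;  -- the Encrypted column should say 'Yes' for new log files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;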

&lt;h2&gt;
  
  
  6. Ignoring SQL injection in application code
&lt;/h2&gt;

&lt;p&gt;SQL injection has been the number one database attack vector for over two decades, and it still works because developers keep building queries by concatenating user input directly into SQL strings. MySQL doesn't have a built-in defense against this. The protection has to come from application code.&lt;/p&gt;

&lt;p&gt;An injectable query looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Vulnerable: user input directly in the query string
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM users WHERE email = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;
&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;user_input&lt;/code&gt; is &lt;code&gt;' OR '1'='1' -- &lt;/code&gt; (the trailing space matters: MySQL treats &lt;code&gt;--&lt;/code&gt; as a comment only when it's followed by whitespace), the query becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="s1"&gt;'1'&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'1'&lt;/span&gt; &lt;span class="c1"&gt;--'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns every row in the users table. More destructive payloads can drop tables, read files from disk (if the MySQL user has &lt;code&gt;FILE&lt;/code&gt; privilege) or create new admin accounts.&lt;/p&gt;

&lt;p&gt;The fix is parameterized queries. Every database library supports them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Safe: parameterized query
&lt;/span&gt;&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM users WHERE email = %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Node.js with mysql2&lt;/span&gt;
&lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SELECT * FROM users WHERE email = ?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;userInput&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Go with database/sql&lt;/span&gt;
&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SELECT * FROM users WHERE email = ?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userInput&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parameterized queries separate the SQL structure from the data. The database engine knows that the parameter is a value, not SQL code, regardless of what it contains.&lt;/p&gt;

&lt;p&gt;On the MySQL side, you can reduce the blast radius by removing the &lt;code&gt;FILE&lt;/code&gt; privilege from application accounts (see mistake #2) and by running MySQL with &lt;code&gt;--local-infile=0&lt;/code&gt; to disable &lt;code&gt;LOAD DATA LOCAL INFILE&lt;/code&gt;, which attackers use for file reading through SQL injection.&lt;/p&gt;
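
&lt;p&gt;Auditing and revoking that privilege is quick (the account matches the earlier example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Find accounts that can read and write server-side files
SELECT user, host FROM mysql.user WHERE File_priv = 'Y';

-- Remove the privilege from application accounts
REVOKE FILE ON *.* FROM 'app_user'@'10.0.1.%';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;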

&lt;h2&gt;
  
  
  7. Not auditing or monitoring database access
&lt;/h2&gt;

&lt;p&gt;If someone is accessing your MySQL database in ways they shouldn't, how quickly would you know? Most MySQL installations have no audit logging enabled. An attacker could be reading sensitive tables for weeks before anyone notices.&lt;/p&gt;

&lt;p&gt;MySQL Enterprise Edition includes an audit plugin, but the community edition requires other approaches. The general query log is one option, though it captures everything and creates enormous log files on busy servers.&lt;/p&gt;

&lt;p&gt;A more practical approach for the community edition is to enable specific logging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[mysqld]&lt;/span&gt;
&lt;span class="py"&gt;log_error&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;/var/log/mysql/error.log&lt;/span&gt;
&lt;span class="py"&gt;slow_query_log&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;1&lt;/span&gt;
&lt;span class="py"&gt;slow_query_log_file&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;/var/log/mysql/slow.log&lt;/span&gt;
&lt;span class="py"&gt;log_queries_not_using_indexes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For connection monitoring, regularly check who is connected and what they're doing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processlist&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;user&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'system user'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'event_scheduler'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Track failed login attempts by checking the error log. Repeated failed logins from the same IP usually mean a brute force attack is underway.&lt;/p&gt;
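
&lt;p&gt;A quick way to summarize failed logins per source host (the log path is the Debian/Ubuntu default; note that MySQL 8.0 writes access-denied messages to the error log only when &lt;code&gt;log_error_verbosity&lt;/code&gt; is 3):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Count "Access denied" entries grouped by client host
grep "Access denied" /var/log/mysql/error.log | \
  grep -oE "@'[^']+'" | sort | uniq -c | sort -rn | head
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;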

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Monitoring area&lt;/th&gt;
&lt;th&gt;What to watch for&lt;/th&gt;
&lt;th&gt;How to check&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Failed logins&lt;/td&gt;
&lt;td&gt;Brute force attempts&lt;/td&gt;
&lt;td&gt;Error log entries with "Access denied"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unusual connections&lt;/td&gt;
&lt;td&gt;Unknown hosts or users&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SHOW PROCESSLIST&lt;/code&gt; or processlist table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema changes&lt;/td&gt;
&lt;td&gt;Unauthorized DDL&lt;/td&gt;
&lt;td&gt;General log or trigger-based auditing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privilege escalation&lt;/td&gt;
&lt;td&gt;New grants or users&lt;/td&gt;
&lt;td&gt;Periodic diff of &lt;code&gt;mysql.user&lt;/code&gt; table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large data reads&lt;/td&gt;
&lt;td&gt;Bulk exfiltration&lt;/td&gt;
&lt;td&gt;Slow query log, network monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For production systems, consider deploying a third-party audit plugin like &lt;code&gt;audit_log&lt;/code&gt; from Percona or MariaDB's audit plugin (which works with MySQL forks). These provide structured, filterable audit trails without the overhead of the general query log.&lt;/p&gt;

&lt;p&gt;Set up alerts for critical events: new user creation, privilege changes, connections from unexpected hosts and queries against sensitive tables. The goal is to detect unusual activity before it becomes a full breach.&lt;/p&gt;
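
&lt;p&gt;The "periodic diff" idea from the table can be a simple cron job (file paths are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Snapshot the account list daily, then diff against yesterday's copy
mysql -N -e "SELECT user, host FROM mysql.user ORDER BY user, host" &amp;gt; /var/backups/mysql-users.today
diff /var/backups/mysql-users.yesterday /var/backups/mysql-users.today \
  || echo "ALERT: MySQL accounts changed"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;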

&lt;h2&gt;
  
  
  8. Skipping backups or storing them insecurely
&lt;/h2&gt;

&lt;p&gt;Security isn't just about preventing unauthorized access. It's also about recovery. Ransomware attacks against MySQL databases are real: attackers gain access, drop all tables and leave a ransom note. Without backups, you're negotiating with criminals.&lt;/p&gt;

&lt;p&gt;But having backups isn't enough if they're stored insecurely. Unencrypted backup files sitting on the same server as MySQL are useless in a ransomware scenario because the attacker deletes them too. Backups on an S3 bucket with public read access are just a different kind of data breach.&lt;/p&gt;

&lt;p&gt;A secure backup strategy covers three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Encryption&lt;/strong&gt; — Backup files should be encrypted at rest so they're useless if stolen&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offsite storage&lt;/strong&gt; — At least one copy should be on a separate system or cloud storage that the MySQL server doesn't have delete access to&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regular testing&lt;/strong&gt; — A backup you've never restored is a backup you hope works&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For MySQL, &lt;code&gt;mysqldump&lt;/code&gt; is the basic tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mysqldump &lt;span class="nt"&gt;--single-transaction&lt;/span&gt; &lt;span class="nt"&gt;--routines&lt;/span&gt; &lt;span class="nt"&gt;--triggers&lt;/span&gt; myapp | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;gzip&lt;/span&gt; | openssl enc &lt;span class="nt"&gt;-aes-256-cbc&lt;/span&gt; &lt;span class="nt"&gt;-salt&lt;/span&gt; &lt;span class="nt"&gt;-pbkdf2&lt;/span&gt; &lt;span class="nt"&gt;-out&lt;/span&gt; /backup/myapp_&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%F&lt;span class="si"&gt;)&lt;/span&gt;.sql.gz.enc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a compressed, encrypted backup. But managing encryption keys, scheduling and offsite storage manually is tedious and error-prone.&lt;/p&gt;
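
&lt;p&gt;Whichever tool produces the backup, exercise the restore path regularly. For the dump above, a scratch-database restore looks like this (the filename and database name are placeholders; the &lt;code&gt;openssl&lt;/code&gt; flags mirror the encryption command):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Decrypt and restore into a throwaway database to prove the backup works
mysql -e "CREATE DATABASE myapp_restore_test"
openssl enc -d -aes-256-cbc -pbkdf2 -in /backup/myapp_YYYY-MM-DD.sql.gz.enc | \
  gunzip | mysql myapp_restore_test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;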

&lt;p&gt;&lt;a href="https://databasus.com/mysql-backup" rel="noopener noreferrer"&gt;MySQL backup&lt;/a&gt; tools like Databasus automate the entire process. It's an industry standard for MySQL backup tools that handles scheduling, compression, AES-256-GCM encryption and storage to multiple destinations like S3, Google Drive and SFTP. It's suitable for individual developers and enterprise teams, with workspace-based access management and audit logs.&lt;/p&gt;

&lt;p&gt;Whatever approach you choose, make sure your backups are not accessible from the MySQL server with the same credentials. If the database server is compromised, the attacker shouldn't be able to delete your backups.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern behind these mistakes
&lt;/h2&gt;

&lt;p&gt;Looking at these eight mistakes together, a pattern emerges. Most MySQL security failures come from defaults that were never changed, permissions that were never reviewed and monitoring that was never set up. None of these fixes are complex. They don't require expensive tools or deep security expertise.&lt;/p&gt;

&lt;p&gt;Start with the basics: strong credentials, minimal privileges, network restrictions and encrypted connections. Then add monitoring so you know when something unusual happens. And keep tested, encrypted backups so you can recover when prevention fails.&lt;/p&gt;

&lt;p&gt;The best time to secure your MySQL database was when you first set it up. The second best time is now.&lt;/p&gt;

</description>
      <category>database</category>
      <category>mysql</category>
    </item>
    <item>
      <title>PostgreSQL slow queries — 7 ways to find and fix performance bottlenecks</title>
      <dc:creator>Piter Adyson</dc:creator>
      <pubDate>Tue, 10 Feb 2026 19:52:03 +0000</pubDate>
      <link>https://forem.com/piteradyson/postgresql-slow-queries-7-ways-to-find-and-fix-performance-bottlenecks-2app</link>
      <guid>https://forem.com/piteradyson/postgresql-slow-queries-7-ways-to-find-and-fix-performance-bottlenecks-2app</guid>
      <description>&lt;p&gt;Every PostgreSQL database eventually develops slow queries. It might start small: a dashboard that takes a bit longer to load, an API endpoint that times out during peak traffic, a report that used to run in seconds and now takes minutes. The tricky part is that slow queries rarely announce themselves. They creep in as data grows, schemas change and new features pile on.&lt;/p&gt;

&lt;p&gt;This article covers seven practical ways to find the queries that are hurting your database and fix them. Not theoretical advice, but actual tools and techniques you can apply to a running PostgreSQL instance today.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy4cg9sjdsyiw01qzgqb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy4cg9sjdsyiw01qzgqb.png" alt="PostgreSQL query optimization" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Enable pg_stat_statements to find your worst offenders
&lt;/h2&gt;

&lt;p&gt;The single most useful extension for tracking slow queries in PostgreSQL is &lt;code&gt;pg_stat_statements&lt;/code&gt;. It records execution statistics for every query that runs against your database, including how many times it ran, total execution time, rows returned and more.&lt;/p&gt;

&lt;p&gt;Most performance problems come from a handful of queries. pg_stat_statements lets you find them without guessing.&lt;/p&gt;

&lt;p&gt;To enable it, add the extension to your &lt;code&gt;postgresql.conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After restarting PostgreSQL, create the extension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then query it to find the most time-consuming queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_exec_time&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_time_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean_exec_time&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_time_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;total_exec_time&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_exec_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;())::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;percent_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_exec_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows you which queries consume the most cumulative time. A query that runs 50,000 times a day at 100ms each (roughly 83 minutes of cumulative database time) is a bigger problem than a query that runs once at 5 seconds. The &lt;code&gt;percent_total&lt;/code&gt; column makes this obvious.&lt;/p&gt;

&lt;p&gt;You can also find queries with the highest average execution time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean_exec_time&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_time_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_exec_time&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;max_time_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;mean_exec_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;WHERE calls &amp;gt; 10&lt;/code&gt; filter avoids one-off admin queries that would distort the results.&lt;/p&gt;

&lt;p&gt;Reset statistics periodically to keep the data relevant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements_reset&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;pg_stat_statements is the starting point. Everything else in this article builds on knowing which queries to focus on.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Use EXPLAIN ANALYZE to understand what's actually happening
&lt;/h2&gt;

&lt;p&gt;Once you know which queries are slow, &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; tells you why. It runs the query and shows the execution plan PostgreSQL actually used, including the time spent at each step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;ANALYZE&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'completed'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hash Join  (cost=12.50..345.00 rows=150 actual time=0.82..15.43 rows=143 loops=1)
  Hash Cond: (o.customer_id = c.id)
  -&amp;gt;  Seq Scan on orders o  (cost=0.00..310.00 rows=150 actual time=0.02..14.20 rows=143 loops=1)
        Filter: ((created_at &amp;gt; '2026-01-01') AND (status = 'completed'))
        Rows Removed by Filter: 99857
  -&amp;gt;  Hash  (cost=10.00..10.00 rows=200 actual time=0.45..0.45 rows=200 loops=1)
        -&amp;gt;  Seq Scan on customers c  (cost=0.00..10.00 rows=200 actual time=0.01..0.20 rows=200 loops=1)
Planning Time: 0.15 ms
Execution Time: 15.60 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important things to look for:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Warning sign&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Seq Scan&lt;/code&gt; on a large table&lt;/td&gt;
&lt;td&gt;No index is being used, every row is read&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Rows Removed by Filter: 99857&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The scan reads far more rows than it returns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;actual rows&lt;/code&gt; much higher than &lt;code&gt;rows&lt;/code&gt; (estimate)&lt;/td&gt;
&lt;td&gt;Statistics are stale, run &lt;code&gt;ANALYZE&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Nested Loop&lt;/code&gt; with high &lt;code&gt;loops&lt;/code&gt; count&lt;/td&gt;
&lt;td&gt;The inner side runs thousands of times&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Sort&lt;/code&gt; with &lt;code&gt;external merge&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Not enough &lt;code&gt;work_mem&lt;/code&gt;, sorting spills to disk&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;Seq Scan&lt;/code&gt; on orders above is the bottleneck. It reads 100,000 rows to return 143. An index on &lt;code&gt;(status, created_at)&lt;/code&gt; would fix this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_orders_status_created&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After creating the index, run &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; again. You should see an &lt;code&gt;Index Scan&lt;/code&gt; or &lt;code&gt;Bitmap Index Scan&lt;/code&gt; replacing the sequential scan, and the execution time dropping significantly.&lt;/p&gt;

&lt;p&gt;One thing people miss: &lt;code&gt;EXPLAIN&lt;/code&gt; without &lt;code&gt;ANALYZE&lt;/code&gt; shows the plan but doesn't execute the query. It gives you estimates, not actual numbers. Always use &lt;code&gt;ANALYZE&lt;/code&gt; when debugging performance, unless the query modifies data (in that case, wrap it in a transaction and roll back).&lt;/p&gt;
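
&lt;p&gt;For a data-modifying statement, that transaction-and-rollback pattern looks like this (the table and column values are illustrative, reusing the schema from the example above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;
EXPLAIN ANALYZE
UPDATE orders SET status = 'archived' WHERE created_at &amp;lt; '2025-01-01';
ROLLBACK;  -- the UPDATE really executed, but its changes are discarded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;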

&lt;h2&gt;
  
  
  3. Configure the slow query log
&lt;/h2&gt;

&lt;p&gt;pg_stat_statements gives you aggregate data, but sometimes you need to see individual slow queries as they happen. PostgreSQL's built-in slow query log captures every query that exceeds a time threshold.&lt;/p&gt;

&lt;p&gt;Add these settings to &lt;code&gt;postgresql.conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;log_min_duration_statement = 500
log_statement = 'none'
log_duration = off
log_line_prefix = '%t [%p] %u@%d '
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This logs any query that takes longer than 500 milliseconds. The &lt;code&gt;log_line_prefix&lt;/code&gt; adds the timestamp, process ID, username and database name to each log entry, which is essential for debugging.&lt;/p&gt;

&lt;p&gt;Setting &lt;code&gt;log_min_duration_statement = 0&lt;/code&gt; logs every query. This is useful for short debugging sessions but generates enormous log files on busy databases. For production, start with 500ms or 1000ms and lower it as you fix the worst offenders.&lt;/p&gt;
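&lt;p&gt;You can also adjust the threshold at runtime without editing the file or restarting, which makes the lower-it-gradually approach painless:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();  -- applies the change to running sessions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;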

&lt;p&gt;The log entries look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026-02-10 14:23:45 UTC [12345] app_user@mydb LOG: duration: 2345.678 ms  statement: 
    SELECT u.*, p.* FROM users u JOIN purchases p ON p.user_id = u.id 
    WHERE u.region = 'eu' ORDER BY p.created_at DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more structured analysis, tools like pgBadger can parse these logs and generate reports showing the slowest queries, most frequent queries and query patterns over time. But the raw log is often enough to spot problems.&lt;/p&gt;

&lt;p&gt;A practical approach: enable the slow query log in production at 1000ms, review it weekly, fix the top offenders, then lower the threshold to 500ms. Repeat until the log is mostly quiet.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Fix missing and misused indexes
&lt;/h2&gt;

&lt;p&gt;Missing indexes are the most common cause of slow queries in PostgreSQL. But "add more indexes" isn't always the answer. Sometimes existing indexes aren't being used, or the wrong type of index was created.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finding missing indexes.&lt;/strong&gt; Start with the query from pg_stat_statements, then check if the tables involved have appropriate indexes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;schemaname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;seq_scan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;seq_tup_read&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;idx_scan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_live_tup&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;row_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;seq_scan&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;seq_tup_read&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tables with a high &lt;code&gt;seq_tup_read&lt;/code&gt; and low &lt;code&gt;idx_scan&lt;/code&gt; are being scanned sequentially when they probably shouldn't be. A table with 10 million rows and zero index scans is almost certainly missing an index.&lt;/p&gt;
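&lt;p&gt;Once a suspect table surfaces, check which indexes it already has before creating new ones (swap in your own table name):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'orders';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;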

&lt;p&gt;&lt;strong&gt;Finding unused indexes.&lt;/strong&gt; Indexes you never use still cost write performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;indexrelname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;idx_scan&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;times_used&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_size&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;idx_scan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;indexrelid&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;indexrelid&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_index&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;indisprimary&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows indexes that have never been scanned (excluding primary keys). If an index is 500 MB and has zero scans, it's slowing down every write for nothing. Drop it.&lt;/p&gt;
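&lt;p&gt;On a live system, drop it with &lt;code&gt;CONCURRENTLY&lt;/code&gt; so the operation doesn't block concurrent queries (note that it can't run inside a transaction block; the index name here is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;DROP INDEX CONCURRENTLY idx_orders_legacy_status;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;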

&lt;p&gt;&lt;strong&gt;Common indexing mistakes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Indexing a column with very low cardinality (like a boolean &lt;code&gt;is_active&lt;/code&gt; column with 99% true values). The planner often prefers a sequential scan because the index doesn't filter out enough rows.&lt;/li&gt;
&lt;li&gt;Creating single-column indexes when your queries filter on multiple columns. A composite index on &lt;code&gt;(status, created_at)&lt;/code&gt; is much better than separate indexes on &lt;code&gt;status&lt;/code&gt; and &lt;code&gt;created_at&lt;/code&gt; when your &lt;code&gt;WHERE&lt;/code&gt; clause uses both.&lt;/li&gt;
&lt;li&gt;Forgetting partial indexes. If 95% of queries filter for active records, create a partial index:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_orders_active&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This index is smaller, faster to scan and faster to maintain than a full index.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Tune PostgreSQL memory and planner settings
&lt;/h2&gt;

&lt;p&gt;Default PostgreSQL configuration is deliberately conservative. It assumes the server has 128 MB of RAM and a single spinning disk. If you're running on a modern server with 16 GB of RAM and SSDs, the defaults are leaving performance on the table.&lt;/p&gt;

&lt;p&gt;The key settings that affect query performance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Recommended starting point&lt;/th&gt;
&lt;th&gt;What it controls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;shared_buffers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;128 MB&lt;/td&gt;
&lt;td&gt;25% of total RAM&lt;/td&gt;
&lt;td&gt;PostgreSQL's shared memory cache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;work_mem&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4 MB&lt;/td&gt;
&lt;td&gt;64-256 MB&lt;/td&gt;
&lt;td&gt;Memory for sorts and hash operations per query&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;effective_cache_size&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4 GB&lt;/td&gt;
&lt;td&gt;50-75% of total RAM&lt;/td&gt;
&lt;td&gt;Planner's estimate of available OS cache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;random_page_cost&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4.0&lt;/td&gt;
&lt;td&gt;1.1 for SSD, 2.0 for HDD&lt;/td&gt;
&lt;td&gt;Cost of random disk reads (affects index usage)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;effective_io_concurrency&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;200 for SSD&lt;/td&gt;
&lt;td&gt;Number of concurrent disk I/O operations&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
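
&lt;p&gt;Before changing anything, check what your server is actually running with; &lt;code&gt;pg_settings&lt;/code&gt; shows the live values and their units:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'effective_cache_size',
               'random_page_cost', 'effective_io_concurrency');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;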

&lt;p&gt;&lt;strong&gt;shared_buffers&lt;/strong&gt; is the most important one. PostgreSQL uses this as its primary data cache. Too low and it constantly re-reads data from disk. Too high and it competes with the OS page cache. 25% of total RAM is a good starting point for most workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;work_mem&lt;/strong&gt; is tricky because it's per-operation, not per-query. A complex query with five sort operations and three hash joins could allocate up to 8x &lt;code&gt;work_mem&lt;/code&gt;. Setting it to 256 MB sounds reasonable until 50 concurrent connections each allocate multiple chunks. Start with 64 MB and monitor.&lt;/p&gt;
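&lt;p&gt;For the occasional heavy report, you can raise &lt;code&gt;work_mem&lt;/code&gt; for a single transaction instead of globally (the &lt;code&gt;SELECT&lt;/code&gt; here is a stand-in for your expensive query):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;
SET LOCAL work_mem = '256MB';  -- reverts automatically at COMMIT/ROLLBACK
SELECT region, sum(total) FROM orders GROUP BY region ORDER BY 2 DESC;
COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;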

&lt;p&gt;&lt;strong&gt;random_page_cost&lt;/strong&gt; is the one that catches most people. The default of 4.0 tells the planner that random disk reads are four times more expensive than sequential reads. That was true for spinning disks. On SSDs, random and sequential reads are nearly identical. Lowering this to 1.1 makes the planner much more willing to use indexes, which is usually what you want on SSD storage.&lt;/p&gt;

&lt;p&gt;You can change these without a restart (except &lt;code&gt;shared_buffers&lt;/code&gt;) using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;work_mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'64MB'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;random_page_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;effective_cache_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'12GB'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_reload_conf&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After changing settings, test with &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; on your slow queries. You should see different plan choices, especially more index scans and in-memory sorts.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Rewrite problematic query patterns
&lt;/h2&gt;

&lt;p&gt;Sometimes the query itself is the problem. No amount of indexing or tuning will fix a fundamentally inefficient query. Here are patterns that consistently cause performance issues and how to fix them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SELECT * when you only need a few columns.&lt;/strong&gt; This forces PostgreSQL to read and transfer every column, including large text or JSONB fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Slow: reads everything, including a 10 KB description column&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'electronics'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Better: only fetches what you need&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'electronics'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matters more than people think, especially with TOAST (The Oversized Attribute Storage Technique). Large columns are stored separately, and fetching them requires additional disk reads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correlated subqueries that run once per row.&lt;/strong&gt; The planner sometimes can't flatten these:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Slow: subquery executes for each order row&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Better: explicit JOIN&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Using OFFSET for pagination on large datasets.&lt;/strong&gt; &lt;code&gt;OFFSET 100000&lt;/code&gt; means PostgreSQL fetches and discards 100,000 rows before returning results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Slow: scans and discards 100,000 rows&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="k"&gt;OFFSET&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Better: keyset pagination using the last seen value&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-15T10:30:00Z'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keyset pagination is consistently fast regardless of how deep into the result set you go. It requires an index on the column you're paginating by.&lt;/p&gt;
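&lt;p&gt;One caveat: if &lt;code&gt;created_at&lt;/code&gt; isn't unique, duplicate timestamps can make pages skip or repeat rows. Adding a unique tie-breaker (here, &lt;code&gt;id&lt;/code&gt;; the literal values are illustrative) with a row comparison fixes that, backed by a matching composite index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE INDEX idx_events_created_id ON events (created_at DESC, id DESC);

SELECT * FROM events
WHERE (created_at, id) &amp;lt; ('2026-01-15T10:30:00Z', 98765)
ORDER BY created_at DESC, id DESC
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;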

&lt;p&gt;&lt;strong&gt;Unnecessary DISTINCT or GROUP BY.&lt;/strong&gt; If you're adding &lt;code&gt;DISTINCT&lt;/code&gt; because a JOIN produces duplicates, the JOIN is probably wrong. Fix the JOIN condition instead of papering over it with &lt;code&gt;DISTINCT&lt;/code&gt;.&lt;/p&gt;
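&lt;p&gt;When the join only exists to check that related rows exist, &lt;code&gt;EXISTS&lt;/code&gt; avoids producing duplicates in the first place (table names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Instead of: SELECT DISTINCT c.* FROM customers c JOIN orders o ON o.customer_id = c.id
SELECT c.* FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;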

&lt;p&gt;&lt;strong&gt;Functions in WHERE clauses that prevent index usage:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Index on created_at won't be used&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2026&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Rewrite to use the index&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'2027-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  7. Keep statistics up to date with ANALYZE and VACUUM
&lt;/h2&gt;

&lt;p&gt;PostgreSQL's query planner relies on table statistics to make decisions. How many rows does a table have? What's the distribution of values in each column? How many distinct values are there? If these statistics are wrong, the planner makes bad choices.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ANALYZE&lt;/code&gt; collects fresh statistics about table contents. &lt;code&gt;VACUUM&lt;/code&gt; reclaims space from deleted or updated rows (dead tuples) that PostgreSQL can't reuse. Both are essential for sustained query performance.&lt;/p&gt;

&lt;p&gt;Autovacuum handles this automatically by default, but it doesn't always keep up. Large batch operations, bulk deletes and rapidly growing tables can outpace the default autovacuum settings.&lt;/p&gt;
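&lt;p&gt;You can watch what autovacuum is doing at this moment through &lt;code&gt;pg_stat_progress_vacuum&lt;/code&gt; (available since PostgreSQL 9.6); an empty result means no vacuum is currently running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT pid, relid::regclass AS table_name, phase,
       heap_blks_scanned, heap_blks_total
FROM pg_stat_progress_vacuum;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;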

&lt;p&gt;Check if your statistics are stale:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_analyze&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_autoanalyze&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_live_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_live_tup&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dead_pct&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tables with a high percentage of dead tuples need vacuuming. Tables that haven't been analyzed recently may have stale statistics.&lt;/p&gt;
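&lt;p&gt;For a table autovacuum has fallen behind on, a manual vacuum reclaims the dead tuples online; unlike &lt;code&gt;VACUUM FULL&lt;/code&gt;, it doesn't take an exclusive lock:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;VACUUM (ANALYZE, VERBOSE) orders;  -- cleans up dead tuples and refreshes statistics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;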

&lt;p&gt;If a table had 10,000 rows when statistics were collected but now has 10 million, the planner might choose a sequential scan based on the old row count when an index scan would be far more efficient. Running &lt;code&gt;ANALYZE&lt;/code&gt; fixes this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For autovacuum tuning, the defaults are cautious. On busy databases, consider adjusting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.02
autovacuum_vacuum_cost_delay = 2ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scale factors control when autovacuum kicks in. The default &lt;code&gt;vacuum_scale_factor&lt;/code&gt; of 0.2 means autovacuum runs after 20% of rows have been modified. On a 100 million row table, that's 20 million dead tuples before cleanup starts. Lowering it to 0.05 (5%) keeps things cleaner.&lt;/p&gt;
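&lt;p&gt;The trigger condition, per the PostgreSQL documentation, combines a flat threshold with the scale factor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vacuum runs when:  dead tuples &amp;gt; autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * row count
with defaults:     dead tuples &amp;gt; 50 + 0.2 * row count
at 100M rows:      dead tuples &amp;gt; 50 + 20,000,000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;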

&lt;p&gt;For large tables with specific requirements, you can set per-table autovacuum settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;autovacuum_vacuum_scale_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;autovacuum_analyze_scale_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;005&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Keeping your data safe while you optimize
&lt;/h2&gt;

&lt;p&gt;Tuning queries and tweaking PostgreSQL configuration is relatively safe work. But mistakes happen. A dropped index on a production table during peak hours, a configuration change that causes out-of-memory crashes, a &lt;code&gt;VACUUM FULL&lt;/code&gt; on a massive table that takes an exclusive lock at the wrong moment.&lt;/p&gt;

&lt;p&gt;Having reliable backups means you can optimize with confidence. &lt;a href="https://databasus.com" rel="noopener noreferrer"&gt;PostgreSQL backup&lt;/a&gt; tools like Databasus handle automated scheduled backups with compression, encryption and multiple storage destinations, and work for individual developers and enterprise teams alike.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;p&gt;Fixing slow queries in PostgreSQL isn't a one-time task. It's a cycle: identify the slow queries with pg_stat_statements, understand why they're slow with EXPLAIN ANALYZE, fix the root cause (missing index, bad query pattern, stale statistics or wrong configuration) and then monitor to make sure the fix holds.&lt;/p&gt;

&lt;p&gt;Start with pg_stat_statements if you haven't already. It takes five minutes to set up and immediately shows you where your database is spending its time. From there, work through the list: check your indexes, review your configuration settings, look for problematic query patterns and make sure autovacuum is keeping up.&lt;/p&gt;

&lt;p&gt;Most PostgreSQL performance problems have straightforward solutions. The hard part is knowing where to look.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>PostgreSQL indexing explained — 5 index types and when to use each</title>
      <dc:creator>Piter Adyson</dc:creator>
      <pubDate>Mon, 09 Feb 2026 13:56:40 +0000</pubDate>
      <link>https://forem.com/piteradyson/postgresql-indexing-explained-5-index-types-and-when-to-use-each-45ae</link>
      <guid>https://forem.com/piteradyson/postgresql-indexing-explained-5-index-types-and-when-to-use-each-45ae</guid>
      <description>&lt;p&gt;Indexes are one of those things that everybody knows they should use, but few people actually understand beyond the basics. You create an index, the query gets faster, done. Except when it doesn't. Or when the wrong index makes things slower. Or when you're running five indexes on a table and none of them are being used.&lt;/p&gt;

&lt;p&gt;PostgreSQL ships with five distinct index types, each designed for different access patterns. Picking the right one is the difference between a query that takes 2 milliseconds and one that takes 20 seconds. This article covers all five, when they actually help and when they're a waste of disk space.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1j315g9tavuvx15etrae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1j315g9tavuvx15etrae.png" alt="PostgreSQL indexes" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How PostgreSQL indexes work under the hood
&lt;/h2&gt;

&lt;p&gt;Before jumping into specific types, it helps to understand what an index actually does. A PostgreSQL index is a separate data structure that maps column values to the physical location of rows on disk. When you run a query with a &lt;code&gt;WHERE&lt;/code&gt; clause, the planner checks whether an index exists that can narrow down the search instead of scanning every row.&lt;/p&gt;

&lt;p&gt;Without an index, PostgreSQL performs a sequential scan. It reads the entire table, row by row, checking each one against your filter. For a table with 100 rows, that's fine. For a table with 100 million rows, it's a problem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Without an index, this scans the entire table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'abc-123'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- With an index on customer_id, PostgreSQL jumps directly to matching rows&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_orders_customer_id&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Indexes aren't free though. Every index takes disk space and slows down &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt; operations because PostgreSQL has to maintain the index alongside the table data. A table with ten indexes means every write operation updates ten additional data structures.&lt;/p&gt;
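&lt;p&gt;To see what that overhead looks like on disk, compare a table's heap size with the combined size of its indexes (swap in your own table name):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT pg_size_pretty(pg_relation_size('orders')) AS table_size,
       pg_size_pretty(pg_indexes_size('orders')) AS total_index_size;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;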

&lt;p&gt;The goal is to have the right indexes for your query patterns and nothing more.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. B-tree — the default workhorse
&lt;/h2&gt;

&lt;p&gt;B-tree is the default index type in PostgreSQL. If you run &lt;code&gt;CREATE INDEX&lt;/code&gt; without specifying a type, you get a B-tree. It handles equality and range queries on sortable data, which covers the vast majority of real-world use cases.&lt;/p&gt;

&lt;p&gt;B-tree indexes store data in a balanced tree structure. Each node contains sorted keys and pointers to child nodes, allowing PostgreSQL to find any value in O(log n) time. They support &lt;code&gt;=&lt;/code&gt;, &lt;code&gt;&amp;lt;&lt;/code&gt;, &lt;code&gt;&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;=&lt;/code&gt;, &lt;code&gt;&amp;gt;=&lt;/code&gt;, &lt;code&gt;BETWEEN&lt;/code&gt; and &lt;code&gt;IS NULL&lt;/code&gt; operators efficiently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- All of these use B-tree indexes effectively&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_orders_created_at&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2026-02-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-02-08'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Multi-column B-tree indexes&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_orders_customer_date&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- This uses the index (leftmost prefix rule)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'abc-123'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- This also uses the index (first column matches)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'abc-123'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- This does NOT use the index efficiently (skips the first column)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The column order in multi-column B-tree indexes matters a lot. PostgreSQL uses the index most effectively when your query filters on a leftmost prefix of the indexed columns. If your query only filters on the second column, the index likely won't help.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;B-tree works?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Exact match (&lt;code&gt;WHERE status = 'active'&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Range queries (&lt;code&gt;WHERE price &amp;gt; 100&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sorting (&lt;code&gt;ORDER BY created_at DESC&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pattern matching (&lt;code&gt;WHERE name LIKE 'John%'&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Yes (prefix only; requires the C collation or a &lt;code&gt;text_pattern_ops&lt;/code&gt; index)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pattern matching (&lt;code&gt;WHERE name LIKE '%John%'&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Array or JSON containment&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;B-tree is the right choice for primary keys, foreign keys, timestamp columns used in range filters and any column you frequently sort on. If you're unsure which index type to use, B-tree is almost always a safe starting point.&lt;/p&gt;
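&lt;p&gt;Sorting is worth spelling out: because a B-tree stores keys in order, the planner can walk the index instead of running a separate sort step. A sketch, again assuming a hypothetical &lt;code&gt;orders&lt;/code&gt; table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- A descending index matches the common "newest first" query pattern
CREATE INDEX idx_orders_created_at_desc ON orders (created_at DESC);

-- The planner can read the index in order and stop after 20 rows,
-- avoiding a sort over the whole table
SELECT * FROM orders ORDER BY created_at DESC LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;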

&lt;h2&gt;
  
  
  2. Hash — fast equality lookups
&lt;/h2&gt;

&lt;p&gt;Hash indexes build a hash table mapping each value to the row locations that contain it. They only support equality comparisons (&lt;code&gt;=&lt;/code&gt;), but they do it in O(1) average time rather than the O(log n) of a B-tree.&lt;/p&gt;

&lt;p&gt;Before PostgreSQL 10, hash indexes were not crash-safe because they weren't WAL-logged. That made them basically unusable in production. Since PostgreSQL 10, they're fully crash-safe and a reasonable option for specific workloads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_sessions_token&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sessions&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- This uses the hash index&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sessions&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'a1b2c3d4e5f6'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- This does NOT use the hash index (not an equality check)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sessions&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'a1b2c3d4e5f6'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hash indexes are smaller than B-tree indexes for the same data, which can matter for large tables with high-cardinality columns. If you have a table with 50 million rows and you only ever look up by an exact session token or API key, a hash index uses less memory and disk.&lt;/p&gt;

&lt;p&gt;In practice, the difference is often marginal. B-tree handles equality just fine, and it also supports range queries as a bonus. Most PostgreSQL users never create a hash index. But if you're optimizing a high-throughput lookup table where every byte of index size matters, it's worth benchmarking.&lt;/p&gt;

&lt;p&gt;When to use hash over B-tree:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exact match queries only, no range scans&lt;/li&gt;
&lt;li&gt;Very high cardinality columns (UUIDs, tokens, hashes)&lt;/li&gt;
&lt;li&gt;You want the smallest possible index size&lt;/li&gt;
&lt;li&gt;You've benchmarked and confirmed it outperforms B-tree for your workload&lt;/li&gt;
&lt;/ul&gt;
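&lt;p&gt;You can check the size claim for yourself by building both index types on the same column and comparing. A sketch assuming the hypothetical &lt;code&gt;sessions&lt;/code&gt; table from above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE INDEX idx_sessions_token_btree ON sessions (token);
CREATE INDEX idx_sessions_token_hash  ON sessions USING hash (token);

-- Compare the on-disk size of the two indexes
SELECT relname, pg_size_pretty(pg_relation_size(oid))
FROM pg_class
WHERE relname IN ('idx_sessions_token_btree', 'idx_sessions_token_hash');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;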

&lt;h2&gt;
  
  
  3. GIN — for full-text search, arrays and JSONB
&lt;/h2&gt;

&lt;p&gt;GIN stands for Generalized Inverted Index. It's designed for values that contain multiple elements, like arrays, JSONB documents and full-text search vectors. Where a B-tree maps one value to one row, a GIN index maps each element inside a composite value to the rows that contain it.&lt;/p&gt;

&lt;p&gt;Think of it like a book index at the back of a textbook. You look up a word and it tells you all the pages where that word appears. GIN does the same thing for array elements, JSON keys and text lexemes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Full-text search&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_articles_search&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'postgresql &amp;amp; indexing'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- JSONB containment&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_events_data&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'{"source": "api", "version": 2}'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Array containment&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_products_tags&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="o"&gt;@&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'electronics'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'wireless'&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GIN indexes are slower to build and update than B-tree indexes. Every insert potentially needs to update many entries in the inverted index. For write-heavy tables, this can be a noticeable overhead. PostgreSQL mitigates this with "fastupdate", which batches new entries in a pending list. Query results stay correct, because lookups also scan the unmerged pending entries, but that extra scan means lookups can slow down during heavy writes until the list is merged.&lt;/p&gt;
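&lt;p&gt;If the pending-list behavior matters for your workload, both knobs are adjustable per index. These are real GIN storage parameters; the index names reuse the examples above and the limit value is illustrative only:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Disable the pending list entirely: each write pays the full index
-- update cost, but lookups never have extra pending entries to scan
CREATE INDEX idx_events_data_nofast ON events USING gin (metadata)
WITH (fastupdate = off);

-- Or keep fastupdate but cap the pending list size (in kB) for one index
ALTER INDEX idx_events_data SET (gin_pending_list_limit = 512);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;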

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;B-tree&lt;/th&gt;
&lt;th&gt;GIN&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Equality and range queries&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full-text search (&lt;code&gt;@@&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Array containment (&lt;code&gt;@&amp;gt;&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSONB containment (&lt;code&gt;@&amp;gt;&lt;/code&gt;, &lt;code&gt;?&lt;/code&gt;, &lt;code&gt;?&amp;amp;&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Index build speed&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Slow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write overhead&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium to high&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Index size&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;GIN is the correct choice whenever you need to search within composite values. If you're running &lt;code&gt;WHERE tags @&amp;gt; ...&lt;/code&gt;, &lt;code&gt;WHERE metadata @&amp;gt; ...&lt;/code&gt; or &lt;code&gt;WHERE tsvector @@ tsquery&lt;/code&gt;, a GIN index is what you want. Just be aware that it comes with higher write costs and larger disk usage compared to B-tree.&lt;/p&gt;
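&lt;p&gt;For JSONB specifically, there's also a narrower operator class worth knowing. &lt;code&gt;jsonb_path_ops&lt;/code&gt; supports only the containment operator, but the resulting index is typically smaller and faster for that one operation than the default &lt;code&gt;jsonb_ops&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Supports only @&amp;gt;, in exchange for a smaller, faster index
CREATE INDEX idx_events_data_path ON events USING gin (metadata jsonb_path_ops);

SELECT * FROM events
WHERE metadata @&amp;gt; '{"source": "api"}';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;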

&lt;h2&gt;
  
  
  4. GiST — for geometric, range and proximity queries
&lt;/h2&gt;

&lt;p&gt;GiST stands for Generalized Search Tree. It's a framework for building custom index types, but in practice it's mostly used for geometric data (points, polygons, circles), range types (date ranges, integer ranges) and full-text search (as an alternative to GIN).&lt;/p&gt;

&lt;p&gt;GiST indexes work by recursively partitioning the search space. For geometric data, imagine dividing a map into progressively smaller regions. To find all restaurants within 500 meters, the index eliminates entire regions that are too far away without checking individual rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- PostGIS spatial queries&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_locations_geo&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gist&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coordinates&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ST_DWithin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coordinates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ST_MakePoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;73&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;985&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;748&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;geography&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Range overlap queries&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_reservations_period&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;reservations&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gist&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;during&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;reservations&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;during&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;daterange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2026-02-01'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2026-02-15'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Nearest-neighbor search&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ST_Distance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coordinates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ST_MakePoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;73&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;985&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;748&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;geography&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;coordinates&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ST_MakePoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;73&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;985&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;748&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;geography&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GiST also supports full-text search, but with different trade-offs compared to GIN. GiST full-text indexes are faster to build and smaller on disk, but slower for queries, especially when a search term appears in many documents. GIN is generally preferred for full-text search unless you're combining it with other GiST-supported operations.&lt;/p&gt;

&lt;p&gt;When to use GiST:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostGIS and geographic data (finding nearby points, intersecting polygons)&lt;/li&gt;
&lt;li&gt;Range type operations (overlapping date ranges, integer ranges)&lt;/li&gt;
&lt;li&gt;Nearest-neighbor queries (&lt;code&gt;ORDER BY ... &amp;lt;-&amp;gt;&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Exclusion constraints (preventing overlapping ranges in a table)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Exclusion constraint using GiST&lt;/span&gt;
&lt;span class="c1"&gt;-- Prevents overlapping room reservations&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;room_bookings&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;gen_random_uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;room_id&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;during&lt;/span&gt; &lt;span class="n"&gt;TSTZRANGE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;EXCLUDE&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gist&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;room_id&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;during&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The exclusion constraint example is particularly useful. It guarantees at the database level that no two bookings for the same room can overlap. This is something you can't do with B-tree indexes.&lt;/p&gt;
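&lt;p&gt;To see it in action, try inserting two overlapping bookings for the same room. This sketch assumes the &lt;code&gt;room_bookings&lt;/code&gt; table above exists:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;INSERT INTO room_bookings (room_id, during)
VALUES (1, tstzrange('2026-03-01 09:00+00', '2026-03-01 11:00+00'));

-- Overlaps the first booking for room 1, so PostgreSQL rejects it with
-- a "conflicting key value violates exclusion constraint" error
INSERT INTO room_bookings (room_id, during)
VALUES (1, tstzrange('2026-03-01 10:00+00', '2026-03-01 12:00+00'));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;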

&lt;h2&gt;
  
  
  5. BRIN — for large, naturally ordered tables
&lt;/h2&gt;

&lt;p&gt;BRIN stands for Block Range Index. It's the most space-efficient index type PostgreSQL offers, but it only works well under a specific condition: the physical order of rows on disk must correlate with the column values.&lt;/p&gt;

&lt;p&gt;Instead of indexing every row, BRIN indexes store summary information (min and max values) for each block range, which is a group of consecutive physical pages. When PostgreSQL scans for rows, it checks the block summaries and skips entire ranges that can't contain matching data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Perfect for a time-series table where rows are inserted in chronological order&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_logs_created_at&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;access_logs&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;brin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- This can skip huge portions of the table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;access_logs&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2026-02-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2026-02-02'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The size difference is dramatic. A B-tree index on a 100 GB table might be 2 GB. A BRIN index on the same table could be 100 KB. That's not a typo. BRIN indexes are orders of magnitude smaller because they store one summary per block range instead of one entry per row.&lt;/p&gt;

&lt;p&gt;But this efficiency has a hard prerequisite. If the data isn't physically ordered on disk by the indexed column, BRIN is useless. If you insert rows with random timestamps, the min/max summaries for each block range will span the entire value space, and PostgreSQL won't be able to skip anything.&lt;/p&gt;

&lt;p&gt;Good candidates for BRIN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Append-only tables with timestamp columns (logs, events, audit trails)&lt;/li&gt;
&lt;li&gt;Tables where rows are inserted in natural order of some column&lt;/li&gt;
&lt;li&gt;Very large tables (millions or billions of rows) where B-tree index size is a concern&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bad candidates for BRIN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tables with frequent updates that change the indexed column&lt;/li&gt;
&lt;li&gt;Tables where rows are inserted in random order&lt;/li&gt;
&lt;li&gt;Small tables (B-tree is more efficient for small datasets)&lt;/li&gt;
&lt;/ul&gt;
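&lt;p&gt;When BRIN does fit, its main tuning knob is how many pages each summary covers. &lt;code&gt;pages_per_range&lt;/code&gt; is a real BRIN storage parameter (default 128); the value here is illustrative. Smaller ranges mean finer-grained skipping at the cost of a somewhat larger index:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Smaller block ranges = more precise min/max summaries
CREATE INDEX idx_logs_created_at_fine ON access_logs
USING brin (created_at) WITH (pages_per_range = 32);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;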

&lt;p&gt;BRIN is a specialized tool. When it fits, it's incredible. When it doesn't, it won't help at all. Check the correlation between physical row order and column values using &lt;code&gt;pg_stats&lt;/code&gt; before deciding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;tablename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;correlation&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stats&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;tablename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'access_logs'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;attname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'created_at'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A correlation value close to 1 or -1 means BRIN will work well. Values near 0 mean the data is randomly distributed and BRIN won't help.&lt;/p&gt;
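&lt;p&gt;If correlation has degraded, you can physically reorder the table once with &lt;code&gt;CLUSTER&lt;/code&gt;. Be aware that it needs a B-tree index to sort by, takes an exclusive lock while it runs, and the correlation decays again as new out-of-order writes arrive:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One-time physical reorder; requires a B-tree index on the column
CREATE INDEX idx_logs_created_at_btree ON access_logs (created_at);
CLUSTER access_logs USING idx_logs_created_at_btree;
ANALYZE access_logs;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;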

&lt;h2&gt;
  
  
  Practical indexing tips
&lt;/h2&gt;

&lt;p&gt;Knowing which index types exist is half the story. The other half is using them effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check if your indexes are actually being used.&lt;/strong&gt; PostgreSQL tracks index usage statistics. If an index hasn't been scanned in months, it's costing you write performance for no benefit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;indexrelname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;idx_scan&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;times_used&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_size&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;schemaname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'public'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;idx_scan&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Use &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; before and after creating indexes.&lt;/strong&gt; Don't assume an index will help. Verify it. Sometimes the planner chooses a sequential scan because the table is small enough that the index adds no value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'abc-123'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Consider partial indexes for filtered queries.&lt;/strong&gt; If you only ever query active orders, index only the active rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_orders_active&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This index is smaller and faster than indexing all orders because it only covers rows matching the condition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't forget about covering indexes.&lt;/strong&gt; If a query only needs columns that are all in the index, PostgreSQL can answer it entirely from the index without touching the table. This is called an index-only scan.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_orders_covering&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;INCLUDE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- This can be served entirely from the index&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'abc-123'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
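&lt;p&gt;Index-only scans also depend on the visibility map, which &lt;code&gt;VACUUM&lt;/code&gt; maintains: if many pages aren't marked all-visible, PostgreSQL still has to fetch heap tuples to check row visibility. It's worth confirming the plan actually says "Index Only Scan":&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;VACUUM ANALYZE orders;

-- Look for "Index Only Scan" and a low "Heap Fetches" count in the output
EXPLAIN ANALYZE
SELECT total, created_at FROM orders WHERE customer_id = 'abc-123';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;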



&lt;h2&gt;
  
  
  Keeping your data safe while you optimize
&lt;/h2&gt;

&lt;p&gt;Experimenting with indexes is relatively low-risk since you can always drop an index and try again. But schema changes, large data migrations and production experiments can go wrong in ways that are harder to undo.&lt;/p&gt;

&lt;p&gt;Having a reliable &lt;a href="https://databasus.com" rel="noopener noreferrer"&gt;PostgreSQL backup&lt;/a&gt; strategy means you can experiment with confidence. Databasus is a dedicated PostgreSQL backup tool that handles automated scheduled backups with compression, encryption and multiple storage destinations, and it suits individual developers and enterprise teams alike.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the right index for your workload
&lt;/h2&gt;

&lt;p&gt;There's no universal "best" index type. The right choice depends entirely on your data and your queries. B-tree covers most common scenarios. GIN handles arrays, JSONB and full-text data. GiST solves geometric and range problems. Hash optimizes pure equality lookups. BRIN saves substantial disk space on naturally ordered data.&lt;/p&gt;

&lt;p&gt;Start with &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; on your slowest queries, identify what kind of operations they perform and match those operations to the appropriate index type. One well-chosen index beats five poorly chosen ones every time.&lt;/p&gt;
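&lt;p&gt;A minimal version of that workflow, reusing the &lt;code&gt;orders&lt;/code&gt; table from the earlier examples, might look like this:&lt;/p&gt;

```sql
-- Step 1: check how the slow query actually executes.
-- A "Seq Scan" node in the plan usually means no suitable index exists.
EXPLAIN ANALYZE
SELECT total, created_at
FROM orders
WHERE customer_id = 'abc-123' AND status = 'active';

-- Step 2: add an index matched to the predicate, then re-run
-- EXPLAIN ANALYZE and compare "Execution Time" before and after.
CREATE INDEX idx_orders_customer_status ON orders (customer_id, status);
```

&lt;p&gt;If the plan still shows a sequential scan, check that the query's operators actually match what the index type supports before adding more indexes.&lt;/p&gt;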

</description>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>MongoDB schema design — 6 patterns every developer should master</title>
      <dc:creator>Piter Adyson</dc:creator>
      <pubDate>Sun, 08 Feb 2026 19:35:01 +0000</pubDate>
      <link>https://forem.com/piteradyson/mongodb-schema-design-6-patterns-every-developer-should-master-1dha</link>
      <guid>https://forem.com/piteradyson/mongodb-schema-design-6-patterns-every-developer-should-master-1dha</guid>
      <description>&lt;p&gt;MongoDB gives you flexibility that relational databases don't. No rigid tables, no mandatory schemas, no upfront column definitions. You just throw documents into a collection and go. That freedom is exactly what makes schema design in MongoDB so important and so easy to get wrong.&lt;/p&gt;

&lt;p&gt;The problem is that "schemaless" doesn't mean "no design needed." Without a good schema strategy, you end up with slow queries, bloated documents and data that's hard to work with as your application grows. These six patterns solve the most common problems developers hit when designing MongoDB schemas.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzwaumim1fc47scrwws5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzwaumim1fc47scrwws5.png" alt="MongoDB schema" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Embedding vs referencing
&lt;/h2&gt;

&lt;p&gt;This is the first decision you'll make for every relationship in your data model. Should related data live inside the same document or in a separate collection with a reference? The answer depends on how you read and write the data.&lt;/p&gt;

&lt;p&gt;Embedding means nesting related data directly within a document. If you have a blog post with comments, embedding puts the comments array inside the post document. One read gets everything. No joins needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Embedded comments inside a blog post&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;MongoDB schema tips&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;author&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Jane&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;comments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Bob&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Great article!&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-01&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Alice&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Very helpful&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-02&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Referencing stores related data in a separate collection and links the documents with an ObjectId. You fetch the post first, then the comments in a second query (or use &lt;code&gt;$lookup&lt;/code&gt; for a server-side join).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Post document&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;post1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;MongoDB schema tips&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;author&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Jane&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Separate comment documents&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;c1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="na"&gt;postId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;post1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Bob&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Great article!&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;c2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="na"&gt;postId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;post1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Alice&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Very helpful&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
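&lt;p&gt;To make the two-step read concrete, here is a plain JavaScript sketch of the application-side join. The in-memory arrays stand in for the posts and comments collections; against MongoDB itself this would be two &lt;code&gt;find()&lt;/code&gt; calls or a single &lt;code&gt;$lookup&lt;/code&gt; stage.&lt;/p&gt;

```javascript
// In-memory stand-ins for the two collections from the example above.
const posts = [
  { _id: "post1", title: "MongoDB schema tips", author: "Jane" }
];
const comments = [
  { _id: "c1", postId: "post1", user: "Bob", text: "Great article!" },
  { _id: "c2", postId: "post1", user: "Alice", text: "Very helpful" }
];

// Application-side join: fetch the post, then its comments by postId.
function getPostWithComments(postId) {
  const post = posts.find(p => p._id === postId);
  if (!post) return null;
  const postComments = comments.filter(c => c.postId === postId);
  return { ...post, comments: postComments };
}

const result = getPostWithComments("post1");
console.log(result.comments.length); // 2
```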



&lt;p&gt;When to embed vs reference:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Embed&lt;/th&gt;
&lt;th&gt;Reference&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Read pattern&lt;/td&gt;
&lt;td&gt;Data is always read together&lt;/td&gt;
&lt;td&gt;Data is read independently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Array growth&lt;/td&gt;
&lt;td&gt;Bounded (won't grow indefinitely)&lt;/td&gt;
&lt;td&gt;Unbounded (could grow to thousands)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document size&lt;/td&gt;
&lt;td&gt;Stays well under 16 MB limit&lt;/td&gt;
&lt;td&gt;Would approach size limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Update frequency&lt;/td&gt;
&lt;td&gt;Nested data rarely changes&lt;/td&gt;
&lt;td&gt;Nested data changes frequently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data reuse&lt;/td&gt;
&lt;td&gt;Used only in this context&lt;/td&gt;
&lt;td&gt;Shared across multiple documents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Embedding works well for one-to-few relationships where the nested data is tightly coupled to the parent. Think user profiles with addresses, products with a small list of variants or orders with line items. Referencing is better when the related data grows without bound, gets accessed independently or is shared across multiple parent documents.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The subset pattern
&lt;/h2&gt;

&lt;p&gt;Documents in MongoDB have a 16 MB size limit, but you'll hit performance problems long before that. Loading a 2 MB document when you only need a few fields from it wastes network bandwidth and memory. The subset pattern solves this by keeping the most-accessed data in the main document and moving the rest to a secondary collection.&lt;/p&gt;

&lt;p&gt;A common example is an e-commerce product page. The product listing shows the name, price, main image and the three most recent reviews. But the product might have 500 reviews total. Loading all 500 reviews every time someone views the product page is wasteful.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Main product document (fast reads for product listings)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;prod1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Wireless Headphones&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;79.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;headphones-main.jpg&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;recentReviews&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Alex&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Sound quality is excellent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-05&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Sam&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Comfortable for long use&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-03&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Jordan&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Best in this price range&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-01-28&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="nx"&gt;reviewCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;487&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;averageRating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;4.3&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Full reviews in a separate collection (loaded only on "See all reviews")&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rev1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;productId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;prod1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Alex&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Sound quality is excellent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-05&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trade-off is data duplication. The three recent reviews exist in both the product document and the reviews collection, and you need to keep them in sync when reviews are added. The read performance gain is worth it when the vast majority of your traffic only needs the subset.&lt;/p&gt;
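&lt;p&gt;One way to handle the sync is to maintain the embedded subset at write time. MongoDB can do this atomically with &lt;code&gt;$push&lt;/code&gt; using the &lt;code&gt;$each&lt;/code&gt;, &lt;code&gt;$sort&lt;/code&gt; and &lt;code&gt;$slice&lt;/code&gt; modifiers; the plain JavaScript sketch below shows the same logic (field names follow the product example above):&lt;/p&gt;

```javascript
// Keep only the 3 newest reviews embedded in the product document.
// MongoDB itself can do this atomically in a single update:
//   db.products.updateOne({ _id: productId }, {
//     $push: { recentReviews: { $each: [newReview],
//                               $sort: { date: -1 }, $slice: 3 } },
//     $inc: { reviewCount: 1 }
//   })
function addReview(product, review) {
  const updated = [...product.recentReviews, review]
    .sort((a, b) => b.date - a.date) // newest first
    .slice(0, 3);                    // keep the 3 most recent
  return {
    ...product,
    recentReviews: updated,
    reviewCount: product.reviewCount + 1
  };
}

const product = {
  name: "Wireless Headphones",
  reviewCount: 487,
  recentReviews: [
    { user: "Alex", rating: 5, date: new Date("2026-02-05") },
    { user: "Sam", rating: 4, date: new Date("2026-02-03") },
    { user: "Jordan", rating: 5, date: new Date("2026-01-28") }
  ]
};

const next = addReview(product, { user: "Kim", rating: 4, date: new Date("2026-02-07") });
console.log(next.recentReviews.map(r => r.user)); // Kim, Alex, Sam
```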

&lt;p&gt;This pattern applies anywhere you have a one-to-many relationship where most reads only need a small portion of the "many" side. User activity feeds, article comments and notification lists all benefit from it.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. The bucket pattern
&lt;/h2&gt;

&lt;p&gt;Time-series and event data can generate enormous numbers of documents. If your IoT sensors send readings every second, that's 86,400 documents per sensor per day. Storing each reading as an individual document creates index bloat and makes range queries slower than they need to be.&lt;/p&gt;

&lt;p&gt;The bucket pattern groups multiple data points into a single document based on a time range. Instead of one document per reading, you store one document per hour (or per minute, depending on your granularity).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Without bucket pattern: one document per reading&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;temp-01&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;22.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-08T10:00:00Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;temp-01&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;22.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-08T10:00:01Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;temp-01&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;22.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-08T10:00:02Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// ... 86,397 more documents for this sensor today&lt;/span&gt;

&lt;span class="c1"&gt;// With bucket pattern: one document per hour&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;sensorId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;temp-01&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;startDate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-08T10:00:00Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;endDate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-08T10:59:59Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;22.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-08T10:00:00Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;22.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-08T10:00:01Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;22.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-08T10:00:02Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// ... 3597 more readings&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;22.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;min&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;21.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;23.1&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Benefits of the bucket pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fewer documents means smaller indexes and faster queries&lt;/li&gt;
&lt;li&gt;Pre-computed summaries (avg, min, max) avoid full scans for common aggregations&lt;/li&gt;
&lt;li&gt;Range queries only touch a handful of bucket documents instead of thousands of individual ones&lt;/li&gt;
&lt;li&gt;Deleting old data is simpler since you drop entire bucket documents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bucket size depends on your access pattern. If most queries ask for hourly summaries, use hourly buckets. If users typically look at daily dashboards, daily buckets work better. The key is to match bucket granularity to how the data gets consumed.&lt;/p&gt;
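&lt;p&gt;As a sketch of the write-side logic, the following plain JavaScript groups per-second readings into hourly buckets with a pre-computed summary. In MongoDB itself each reading would typically be an upserted &lt;code&gt;updateOne&lt;/code&gt; that pushes the reading and maintains &lt;code&gt;count&lt;/code&gt; and the summary via &lt;code&gt;$inc&lt;/code&gt;, &lt;code&gt;$min&lt;/code&gt; and &lt;code&gt;$max&lt;/code&gt;:&lt;/p&gt;

```javascript
// Group per-second readings into hourly bucket documents.
function bucketReadings(readings) {
  const buckets = new Map();
  for (const r of readings) {
    // Truncate the timestamp to the start of the hour.
    const hour = new Date(r.timestamp);
    hour.setUTCMinutes(0, 0, 0);
    const key = r.sensorId + "|" + hour.toISOString();
    if (!buckets.has(key)) {
      buckets.set(key, {
        sensorId: r.sensorId,
        startDate: hour,
        count: 0,
        readings: [],
        summary: { min: Infinity, max: -Infinity, sum: 0 }
      });
    }
    const b = buckets.get(key);
    b.readings.push({ value: r.value, timestamp: r.timestamp });
    b.count += 1;
    b.summary.min = Math.min(b.summary.min, r.value);
    b.summary.max = Math.max(b.summary.max, r.value);
    b.summary.sum += r.value;
  }
  // Derive the average once per bucket instead of recomputing per read.
  for (const b of buckets.values()) {
    b.summary.avg = b.summary.sum / b.count;
  }
  return [...buckets.values()];
}

const out = bucketReadings([
  { sensorId: "temp-01", value: 22.5, timestamp: new Date("2026-02-08T10:00:00Z") },
  { sensorId: "temp-01", value: 22.6, timestamp: new Date("2026-02-08T10:00:01Z") },
  { sensorId: "temp-01", value: 22.4, timestamp: new Date("2026-02-08T11:00:00Z") }
]);
console.log(out.length); // 2 buckets: one for 10:00, one for 11:00
```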

&lt;p&gt;Note that MongoDB 5.0+ introduced native time series collections, which handle much of this bucketing automatically. The bucket pattern is still useful for custom aggregations and for storing pre-computed summaries alongside the raw data.&lt;/p&gt;
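&lt;p&gt;For reference, a native time series collection is created by passing a &lt;code&gt;timeseries&lt;/code&gt; option to &lt;code&gt;db.createCollection&lt;/code&gt;. The snippet below only builds the options object; the TTL is an illustrative 30 days:&lt;/p&gt;

```javascript
// Options for db.createCollection("readings", options) in mongosh.
// MongoDB 5.0+ manages bucketing internally for such collections.
const options = {
  timeseries: {
    timeField: "timestamp",   // required: when each measurement was taken
    metaField: "sensorId",    // identifies the source of each measurement
    granularity: "seconds"    // hint for the server's internal bucket size
  },
  expireAfterSeconds: 2592000 // optional TTL: 30 days, illustrative value
};
console.log(Object.keys(options.timeseries)); // timeField, metaField, granularity
```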

&lt;h2&gt;
  
  
  4. The polymorphic pattern
&lt;/h2&gt;

&lt;p&gt;Not every document in a collection needs to look the same. The polymorphic pattern handles entities that share some common fields but differ in their details. Instead of creating separate collections for each variation, you store them all in one collection with a &lt;code&gt;type&lt;/code&gt; field.&lt;/p&gt;

&lt;p&gt;A content management system is a good example. You might have articles, videos, podcasts and image galleries. They all share a title, author, publish date and tags, but an article has a body field, a video has a duration and URL, and a podcast has an audio file and an episode number.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Article&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;article&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Getting started with MongoDB&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;author&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Jane&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;publishDate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-01&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;mongodb&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tutorial&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;MongoDB is a document database...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;wordCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1500&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Video&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;video&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;MongoDB schema design workshop&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;author&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Jane&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;publishDate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-05&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;mongodb&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;schema&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;videoUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://example.com/videos/mongo-schema&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;resolution&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1080p&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Podcast&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;podcast&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Database trends in 2026&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;author&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Bob&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;publishDate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-07&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;databases&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;trends&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;audioUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://example.com/podcasts/db-trends&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;episodeNumber&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1800&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The advantage is that queries across all content types are simple. Want all content by Jane sorted by date? One query on one collection. Want only videos? Add a filter on the type field. The shared fields make indexing straightforward, and you can create partial indexes for type-specific fields.&lt;br&gt;
&lt;/p&gt;
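&lt;p&gt;For example, both of those queries are one-liners against the single collection (mongosh sketch; collection and field names follow the documents shown above, and these need a live database to run):&lt;/p&gt;

```javascript
// All content by Jane, newest first -- one query, one collection
db.content.find({ author: "Jane" }).sort({ publishDate: -1 })

// Only Jane's videos -- the same query plus a filter on the shared type field
db.content.find({ author: "Jane", type: "video" })
```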

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Index for type-specific queries&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createIndex&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;publishDate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// Partial index only for videos&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;partialFilterExpression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;video&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern works when the entities share enough common fields to justify a single collection and when you frequently query across types. If different types are always queried separately and share almost nothing, separate collections might be cleaner.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. The extended reference pattern
&lt;/h2&gt;

&lt;p&gt;When you reference data in another collection, sometimes you need a few fields from that referenced document on almost every read. The extended reference pattern copies those frequently-needed fields into the referencing document to avoid a second lookup.&lt;/p&gt;

&lt;p&gt;Consider an order system. Every order references a customer. When you display the order list, you need the customer name and email. Without the extended reference, every order list page requires a $lookup or a second query to the customers collection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Instead of just storing customerId&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;order1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nx"&gt;customerId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cust1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Widget&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;9.99&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="nx"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;29.97&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;orderDate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-08&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Store frequently-needed customer fields directly in the order&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;order1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cust1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Alice Johnson&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;alice@example.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Widget&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;9.99&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;29.97&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;orderDate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-08&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trade-off is data staleness. If Alice changes her email, the orders still show the old one until you update them. For many use cases this is acceptable. An order should probably reflect the customer information at the time it was placed anyway.&lt;/p&gt;

&lt;p&gt;When to use the extended reference pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The referenced fields are read frequently but updated rarely&lt;/li&gt;
&lt;li&gt;Join operations ($lookup) are causing performance issues&lt;/li&gt;
&lt;li&gt;The copied fields are small relative to the document size&lt;/li&gt;
&lt;li&gt;Slight staleness in the copied data is acceptable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This pattern is different from full embedding. You're not copying the entire customer document into every order. You're selectively copying only the fields that the most common queries need. The full customer record still lives in its own collection for detailed views and updates.&lt;/p&gt;
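&lt;p&gt;When one of the copied fields does change, a single &lt;code&gt;updateMany&lt;/code&gt; brings the copies back in sync (mongosh sketch; &lt;code&gt;custId&lt;/code&gt; and &lt;code&gt;newEmail&lt;/code&gt; are placeholders, and this needs a live database to run):&lt;/p&gt;

```javascript
// Update the source of truth first
db.customers.updateOne({ _id: custId }, { $set: { email: newEmail } })

// Then propagate to the denormalized copies; a short lag here is
// exactly the staleness trade-off the pattern accepts
db.orders.updateMany(
  { "customer._id": custId },
  { $set: { "customer.email": newEmail } }
)
```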

&lt;h2&gt;
  
  
  6. The computed pattern
&lt;/h2&gt;

&lt;p&gt;Some values are expensive to calculate on the fly. If you're counting the number of views on a video, computing the average rating from thousands of reviews or aggregating daily sales totals, doing that calculation on every read is wasteful.&lt;/p&gt;

&lt;p&gt;The computed pattern pre-calculates these values and stores them in the document. You update them when the underlying data changes, not when someone reads the result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Product with pre-computed statistics&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;prod1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Wireless Headphones&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;79.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;totalReviews&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;487&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;averageRating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;4.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;ratingDistribution&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;5&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;203&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;4&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;156&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;3&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="nx"&gt;totalSold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2341&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;lastPurchaseDate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ISODate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-02-08T14:30:00Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a new review comes in, you update the stats using atomic operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;products&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;updateOne&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;prod1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;$inc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;stats.totalReviews&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;stats.ratingDistribution.4&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;$set&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;stats.averageRating&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;4.28&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
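&lt;p&gt;The 4.28 written by &lt;code&gt;$set&lt;/code&gt; stands in for a value the application computes before issuing the update. The new average follows from the old average, the old count and the incoming rating, so no re-aggregation over all reviews is needed (plain JavaScript sketch):&lt;/p&gt;

```javascript
// Incrementally fold one new rating into a running average.
// oldCount is stats.totalReviews before the $inc; oldAvg is stats.averageRating.
function updatedAverage(oldAvg, oldCount, newRating) {
  const newAvg = (oldAvg * oldCount + newRating) / (oldCount + 1);
  return Math.round(newAvg * 100) / 100; // keep two decimals, as stored
}
```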



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Read cost&lt;/th&gt;
&lt;th&gt;Write cost&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Calculate on read&lt;/td&gt;
&lt;td&gt;High (aggregation every time)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Always current&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Computed pattern&lt;/td&gt;
&lt;td&gt;Low (pre-stored value)&lt;/td&gt;
&lt;td&gt;Low (incremental update)&lt;/td&gt;
&lt;td&gt;Eventually consistent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Background job&lt;/td&gt;
&lt;td&gt;Low (pre-stored value)&lt;/td&gt;
&lt;td&gt;Batch update on schedule&lt;/td&gt;
&lt;td&gt;Delayed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The computed pattern is the right choice when reads vastly outnumber writes and the computation is non-trivial. Product ratings, follower counts, dashboard metrics and leaderboards are all good candidates.&lt;/p&gt;

&lt;p&gt;For background computation jobs, you need reliable scheduling. If the computation updates stall because a cron job dies silently, your users see stale data indefinitely. Monitoring and alerting on these jobs matters.&lt;/p&gt;
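&lt;p&gt;A batch job of that kind can rebuild the whole &lt;code&gt;stats&lt;/code&gt; object from the raw reviews and write it back with a single &lt;code&gt;$set&lt;/code&gt;, which also self-heals any drift left by missed incremental updates (plain JavaScript sketch; the field names follow the product document above):&lt;/p&gt;

```javascript
// Recompute review statistics from scratch, as a scheduled job would
// before writing the result back to the product document.
function computeStats(reviews) {
  const ratingDistribution = { "5": 0, "4": 0, "3": 0, "2": 0, "1": 0 };
  let sum = 0;
  for (const review of reviews) {
    ratingDistribution[String(review.rating)] += 1;
    sum += review.rating;
  }
  const totalReviews = reviews.length;
  const averageRating =
    totalReviews === 0 ? 0 : Math.round((sum / totalReviews) * 100) / 100;
  return { totalReviews, averageRating, ratingDistribution };
}
```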

&lt;h2&gt;
  
  
  Combining patterns in practice
&lt;/h2&gt;

&lt;p&gt;Real applications rarely use a single pattern in isolation. A product catalog might use the subset pattern for reviews, the computed pattern for aggregate statistics, embedding for product variants and the extended reference pattern for category information. The patterns compose well.&lt;/p&gt;

&lt;p&gt;The key principle behind all of them is the same: design your schema around your queries, not around your entities. In relational databases, you normalize first and optimize later. In MongoDB, you start by listing your most frequent queries and design the schema to serve those queries efficiently.&lt;/p&gt;

&lt;p&gt;Here are a few practical guidelines for combining patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start simple.&lt;/strong&gt; Embed first. Only introduce references and patterns when you hit a specific problem like document size, update complexity or query performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Know your read-to-write ratio.&lt;/strong&gt; High-read workloads benefit from denormalization (embedding, computed, extended reference). High-write workloads favor normalization (referencing) to avoid updating data in multiple places.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor document growth.&lt;/strong&gt; If a document's embedded array keeps growing, apply the subset or bucket pattern before it becomes a problem.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As your MongoDB deployment grows, a reliable &lt;a href="https://databasus.com/mongodb-backup" rel="noopener noreferrer"&gt;MongoDB backup&lt;/a&gt; strategy becomes critical. Schema changes and data migrations can go wrong, and recovering from a bad migration without a backup means data loss. Tools such as Databasus automate scheduled backups with compression, encryption and multiple storage destinations, for solo developers and enterprise teams alike.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the right pattern
&lt;/h2&gt;

&lt;p&gt;There's no single correct schema for any application. The right choice depends on your query patterns, data volume, update frequency and consistency requirements. These six patterns cover the scenarios that come up most often in practice.&lt;/p&gt;

&lt;p&gt;Start with the simplest design that works. Add complexity only when you have evidence that the simple approach isn't performing. Profile your queries, watch your document sizes and pay attention to how your data grows over time. The best schema is the one that makes your most common operations fast and your least common operations possible.&lt;/p&gt;

</description>
      <category>database</category>
      <category>mongodb</category>
    </item>
    <item>
      <title>MariaDB vs MySQL — 8 reasons developers are switching in 2026</title>
      <dc:creator>Piter Adyson</dc:creator>
      <pubDate>Fri, 06 Feb 2026 21:03:49 +0000</pubDate>
      <link>https://forem.com/piteradyson/mariadb-vs-mysql-8-reasons-developers-are-switching-in-2026-19cg</link>
      <guid>https://forem.com/piteradyson/mariadb-vs-mysql-8-reasons-developers-are-switching-in-2026-19cg</guid>
      <description>&lt;p&gt;MariaDB started as a fork of MySQL back in 2009 when Oracle acquired Sun Microsystems. At the time, people weren't sure if the fork would survive long-term or just become another abandoned open source project. Fast forward to 2026 and MariaDB has become a serious alternative that many developers now prefer over the original. This article looks at why.&lt;/p&gt;

&lt;p&gt;The split wasn't just a copy. MariaDB took a different path on storage engines, performance optimization, licensing and community governance. Some of those decisions are paying off now, especially for teams that care about open source principles and technical independence.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6w8otqtpjxqjziib80r3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6w8otqtpjxqjziib80r3.png" alt="MySQL vs MariaDB" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Truly open source, no asterisks
&lt;/h2&gt;

&lt;p&gt;The biggest reason developers switch to MariaDB is licensing clarity. MySQL uses a dual licensing model under Oracle. The Community Edition is GPL, but Oracle reserves certain features for MySQL Enterprise Edition, which requires a commercial license. Thread pool, audit plugins, advanced security features and some backup tools are locked behind that paywall.&lt;/p&gt;

&lt;p&gt;MariaDB is fully open source under GPL. Every feature ships in a single edition. There's no "enterprise only" tier hiding the good stuff. What you download is what everyone gets.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;MariaDB&lt;/th&gt;
&lt;th&gt;MySQL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;GPL v2 (fully open)&lt;/td&gt;
&lt;td&gt;GPL + Commercial dual license&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise features&lt;/td&gt;
&lt;td&gt;All included in one edition&lt;/td&gt;
&lt;td&gt;Some locked behind Enterprise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Corporate owner&lt;/td&gt;
&lt;td&gt;MariaDB Foundation (non-profit)&lt;/td&gt;
&lt;td&gt;Oracle Corporation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature restrictions&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Thread pool, audit log, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For companies doing compliance reviews or avoiding vendor lock-in, this difference alone can drive the decision. You don't need to worry about Oracle changing terms or restricting features in a future release.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Better storage engine options
&lt;/h2&gt;

&lt;p&gt;MariaDB ships with storage engines that MySQL either doesn't have or charges extra for. The most notable one is Aria, a crash-safe alternative to MyISAM. But the real story is about ColumnStore and the overall engine diversity.&lt;/p&gt;

&lt;p&gt;MariaDB ColumnStore provides columnar storage for analytical workloads. If you need to run reports or aggregations over large datasets alongside your transactional workload, ColumnStore handles that without requiring a separate analytical database. MySQL doesn't have a built-in columnar engine.&lt;/p&gt;

&lt;p&gt;The default storage engine for both is InnoDB (or MariaDB's fork of it), so basic compatibility isn't an issue. But MariaDB's InnoDB fork includes optimizations that aren't in upstream MySQL InnoDB, particularly around buffer pool management and compression.&lt;/p&gt;

&lt;p&gt;MariaDB also includes the S3 storage engine, which lets you archive old tables directly to S3-compatible object storage. That's useful for keeping historical data accessible without eating local disk space. Try doing that natively with MySQL.&lt;/p&gt;

&lt;p&gt;For teams running mixed workloads or managing large datasets, MariaDB's engine diversity is a practical advantage that saves you from bolting on third-party tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Thread pool that doesn't cost extra
&lt;/h2&gt;

&lt;p&gt;MySQL's built-in thread pool is an Enterprise-only feature. The Community Edition uses a one-thread-per-connection model. Under heavy concurrency this causes performance degradation because the operating system spends more time context-switching between threads than doing actual work.&lt;/p&gt;

&lt;p&gt;MariaDB includes thread pooling in its open source edition. It handles thousands of concurrent connections efficiently by grouping them into a pool and processing them in batches. The performance difference shows up clearly when you have hundreds or thousands of simultaneous connections.&lt;/p&gt;

&lt;p&gt;This matters in practice. Web applications behind load balancers, microservice architectures with many small services connecting to the same database and serverless environments that create connections rapidly all benefit from thread pooling. With MySQL Community, you either accept the performance hit or pay for Enterprise.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MariaDB: Thread pool included, configurable, production-ready&lt;/li&gt;
&lt;li&gt;MySQL Community: One-thread-per-connection, no built-in pool&lt;/li&gt;
&lt;li&gt;MySQL Enterprise: Thread pool available, requires commercial license&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For high-concurrency environments, this is not a minor difference. It directly affects response times and database stability under load.&lt;/p&gt;
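&lt;p&gt;Turning the pool on is a small server configuration change (a sketch; the sizing values are illustrative and depend on your hardware and workload):&lt;/p&gt;

```ini
# my.cnf -- enable MariaDB's thread pool
[mysqld]
thread_handling = pool-of-threads
thread_pool_size = 8            # typically the number of CPU cores
thread_pool_max_threads = 1000  # upper bound on worker threads
```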

&lt;h2&gt;
  
  
  4. Oracle-free governance
&lt;/h2&gt;

&lt;p&gt;MySQL development happens primarily inside Oracle. The roadmap is set internally, feature priorities are decided behind closed doors and external contributors have limited influence on the project's direction. You can submit patches, but whether they get reviewed or merged depends on Oracle's priorities.&lt;/p&gt;

&lt;p&gt;MariaDB is governed by the MariaDB Foundation, a non-profit organization. Development happens in the open with public discussions, accessible roadmaps and meaningful community input. Multiple companies contribute to MariaDB, and no single entity controls its future.&lt;/p&gt;

&lt;p&gt;This isn't just philosophical. Oracle has a track record of deprioritizing open source projects after acquisition: OpenSolaris was discontinued, Hudson's community walked away to the Jenkins fork and Java's open source trajectory shifted. MySQL hasn't been abandoned, but features increasingly land in Enterprise Edition rather than Community.&lt;/p&gt;

&lt;p&gt;Developers who've been burned by corporate stewardship issues tend to prefer MariaDB's governance model. It's the same reason many prefer PostgreSQL over MySQL in general: community-driven projects are more predictable long-term.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Faster query optimizer
&lt;/h2&gt;

&lt;p&gt;MariaDB's query optimizer has diverged significantly from MySQL's. Several optimizations that MariaDB implements are either absent from MySQL or arrived years later.&lt;/p&gt;

&lt;p&gt;Key optimizer improvements in MariaDB include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Subquery optimizations&lt;/strong&gt;: MariaDB converts subqueries to joins more aggressively, which often dramatically improves query performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table elimination&lt;/strong&gt;: If a joined table doesn't contribute to the result, MariaDB removes it from the execution plan automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hash joins&lt;/strong&gt;: MariaDB supported hash joins before MySQL added them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Condition pushdown&lt;/strong&gt;: Pushes WHERE conditions closer to the data access layer for earlier filtering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't benchmarketing tricks. They affect real queries that developers write every day. A complex reporting query with subqueries and multiple joins can run significantly faster on MariaDB without any query rewriting.&lt;/p&gt;

&lt;p&gt;That said, MySQL has been closing the gap. MySQL 8.0+ added hash joins and improved its optimizer. But MariaDB still tends to handle complex query patterns more efficiently, particularly when subqueries are involved.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Smoother replication features
&lt;/h2&gt;

&lt;p&gt;Both databases support replication, but MariaDB has added features that make replication management easier in production environments.&lt;/p&gt;

&lt;p&gt;MariaDB's Global Transaction ID (GTID) implementation is simpler to work with than MySQL's. Switching a replica to follow a different primary is straightforward with MariaDB GTIDs, while MySQL's implementation works but has quirks around purged transactions that can cause headaches during failover.&lt;/p&gt;

&lt;p&gt;MariaDB also supports parallel replication with more granular control. You can configure how transactions are parallelized on replicas, which helps replicas keep up with high-write primaries. MySQL has parallel replication too, but MariaDB's implementation gives operators more knobs to tune.&lt;/p&gt;
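&lt;p&gt;For example, repointing a replica and tuning parallel apply are both short operations in MariaDB (sketch to run on the replica; the hostname and thread count are placeholders):&lt;/p&gt;

```sql
-- Follow a new primary using MariaDB's domain-based GTIDs
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'new-primary.example.com',
  MASTER_USE_GTID = slave_pos;

-- Tune how aggressively the replica applies transactions in parallel
SET GLOBAL slave_parallel_threads = 4;
SET GLOBAL slave_parallel_mode = 'optimistic';

START SLAVE;
```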

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;MariaDB&lt;/th&gt;
&lt;th&gt;MySQL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GTID format&lt;/td&gt;
&lt;td&gt;Domain-based, simpler failover&lt;/td&gt;
&lt;td&gt;UUID-based, purge complications&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel replication&lt;/td&gt;
&lt;td&gt;Group commit based, configurable&lt;/td&gt;
&lt;td&gt;Logical clock based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-source replication&lt;/td&gt;
&lt;td&gt;Supported since MariaDB 10.0&lt;/td&gt;
&lt;td&gt;Added in MySQL 5.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Delayed replication&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replication filters&lt;/td&gt;
&lt;td&gt;More flexible&lt;/td&gt;
&lt;td&gt;More limited&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For teams managing replicated setups across multiple datacenters or running read replicas at scale, MariaDB's replication features reduce operational friction. The difference is most noticeable during failovers and topology changes.&lt;/p&gt;

&lt;p&gt;Reliable backups are essential when running replicated databases. If a replication chain breaks or data gets corrupted, your last good backup is what saves you. Automated &lt;a href="https://databasus.com/mysql-backup" rel="noopener noreferrer"&gt;MariaDB backup&lt;/a&gt; tools like Databasus provide scheduled backups with encryption and multiple storage destinations, covering the core requirements of MariaDB backup management.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Temporal tables built in
&lt;/h2&gt;

&lt;p&gt;MariaDB supports system-versioned temporal tables natively. The database automatically tracks the history of every row: when it was inserted, updated or deleted. You can query the state of any table at any point in time without writing audit triggers or maintaining history tables yourself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create a system-versioned table in MariaDB&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="n"&gt;VERSIONING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Query historical state&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_TIME&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-15 10:00:00'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;MySQL doesn't have this feature. If you need historical data tracking in MySQL, you build it yourself with triggers, shadow tables and application logic. It works, but it's tedious and error-prone.&lt;/p&gt;

&lt;p&gt;Temporal tables are useful for audit requirements, regulatory compliance and debugging production issues. Being able to ask "what did this row look like yesterday at 3 PM?" without any application changes is genuinely powerful. Financial applications, healthcare systems and any application subject to regulatory audits benefit from this.&lt;/p&gt;
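&lt;p&gt;Building on the example above, a few more temporal queries MariaDB supports out of the box (timestamps and the row id are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Full change history of one row, including versions that were later deleted
SELECT * FROM products FOR SYSTEM_TIME ALL WHERE id = 42;

-- Every version that was current at some point within a time window
SELECT * FROM products
FOR SYSTEM_TIME BETWEEN '2026-01-01 00:00:00' AND '2026-01-31 23:59:59'
WHERE id = 42;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;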

&lt;h2&gt;
  
  
  8. Backward compatibility with MySQL
&lt;/h2&gt;

&lt;p&gt;Here's the practical part that makes switching feasible. MariaDB maintains wire protocol compatibility with MySQL. Most MySQL client libraries, ORMs and tools work with MariaDB without changes. Your application code and database drivers typically work as-is, and connection strings need at most minor adjustments.&lt;/p&gt;

&lt;p&gt;MariaDB can read MySQL data files for migration. The SQL syntax is almost entirely compatible. Stored procedures, views, triggers and most SQL features work identically. The differences are mostly in newer features that MariaDB added and MySQL either doesn't have or implements differently.&lt;/p&gt;

&lt;p&gt;Migration is not zero-effort, but it's close for most applications. The typical process is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dump your MySQL database with mysqldump&lt;/li&gt;
&lt;li&gt;Import into MariaDB&lt;/li&gt;
&lt;li&gt;Test your application against the new database&lt;/li&gt;
&lt;li&gt;Adjust any MySQL-specific syntax that doesn't have a MariaDB equivalent (rare)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The compatibility means you're not starting over. You're swapping the engine rather than rebuilding the car, which is exactly how a database fork should work.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to stay with MySQL
&lt;/h2&gt;

&lt;p&gt;MariaDB isn't universally better. There are valid reasons to stick with MySQL.&lt;/p&gt;

&lt;p&gt;If your team already has deep MySQL expertise and established operational procedures, the switching cost might not be worth it. If you're using MySQL-specific features like MySQL Shell, MySQL Router or Group Replication heavily, the MariaDB equivalents may not be drop-in replacements. Some cloud providers offer better managed MySQL support than MariaDB support, particularly AWS RDS where MySQL gets more attention.&lt;/p&gt;

&lt;p&gt;And if you're running a simple application that works fine on MySQL Community, switching databases for theoretical benefits doesn't make much sense. Solve real problems, not hypothetical ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making the switch
&lt;/h2&gt;

&lt;p&gt;The trend is clear: MariaDB keeps adding features and maintaining openness while MySQL's open source edition gets more constrained relative to Enterprise. For new projects, MariaDB is worth serious consideration. For existing MySQL deployments, switching makes sense when you're hitting limitations that MariaDB addresses, whether that's thread pooling, temporal tables, optimizer performance or licensing concerns.&lt;/p&gt;

&lt;p&gt;Both databases will continue to work for most applications. The question is which trajectory you'd rather be on. MariaDB is betting on open source and community-driven development. MySQL's direction depends on Oracle's priorities. For many developers, that distinction is enough.&lt;/p&gt;

</description>
      <category>database</category>
      <category>mysql</category>
      <category>mariadb</category>
    </item>
    <item>
      <title>MySQL vs PostgreSQL in 2026 — 7 key differences you should know before choosing</title>
      <dc:creator>Piter Adyson</dc:creator>
      <pubDate>Thu, 05 Feb 2026 15:07:09 +0000</pubDate>
      <link>https://forem.com/piteradyson/mysql-vs-postgresql-in-2026-7-key-differences-you-should-know-before-choosing-4l9d</link>
      <guid>https://forem.com/piteradyson/mysql-vs-postgresql-in-2026-7-key-differences-you-should-know-before-choosing-4l9d</guid>
      <description>&lt;p&gt;Choosing between MySQL and PostgreSQL isn't straightforward. Both are mature, production-ready databases used by companies of all sizes. But they solve problems differently and each has strengths that matter depending on your use case. This article breaks down the actual differences that affect day-to-day development and operations in 2026.&lt;/p&gt;

&lt;p&gt;The comparison covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQL standards compliance and data integrity&lt;/li&gt;
&lt;li&gt;JSON and document handling&lt;/li&gt;
&lt;li&gt;Replication approaches&lt;/li&gt;
&lt;li&gt;Licensing and ownership&lt;/li&gt;
&lt;li&gt;Performance characteristics&lt;/li&gt;
&lt;li&gt;Extension ecosystems&lt;/li&gt;
&lt;li&gt;Community and tooling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foc0plowng83hoicore8z.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foc0plowng83hoicore8z.jpg" alt="MySQL vs PostgreSQL" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. SQL standards compliance and data integrity
&lt;/h2&gt;

&lt;p&gt;PostgreSQL follows the SQL standard more strictly than MySQL. This matters more than it might seem at first. When PostgreSQL says a transaction is ACID compliant, it means it. MySQL has improved significantly over the years, but some default behaviors still surprise developers coming from other databases.&lt;/p&gt;

&lt;p&gt;PostgreSQL enforces data types strictly. If you try to insert a string into an integer column, it fails. MySQL historically performed silent type conversions, which could lead to data corruption. Strict mode has been the default since MySQL 5.7, but many existing installations still run with older, more permissive settings.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;PostgreSQL&lt;/th&gt;
&lt;th&gt;MySQL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Default SQL mode&lt;/td&gt;
&lt;td&gt;Strict, standards-compliant&lt;/td&gt;
&lt;td&gt;Strict since 5.7, permissive in older versions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Silent type conversions&lt;/td&gt;
&lt;td&gt;Never&lt;/td&gt;
&lt;td&gt;Depends on SQL mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CHECK constraints&lt;/td&gt;
&lt;td&gt;Fully enforced&lt;/td&gt;
&lt;td&gt;Enforced since MySQL 8.0.16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Foreign key enforcement&lt;/td&gt;
&lt;td&gt;Always enforced&lt;/td&gt;
&lt;td&gt;Only with InnoDB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TRUNCATE in transactions&lt;/td&gt;
&lt;td&gt;Transactional&lt;/td&gt;
&lt;td&gt;Not transactional&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For applications where data integrity is critical, PostgreSQL provides stronger guarantees out of the box. MySQL can be configured to behave similarly, but you need to verify your settings and understand which features require InnoDB specifically.&lt;/p&gt;
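&lt;p&gt;Verifying MySQL's behavior takes one query, and tightening it takes one more (the mode list below is an example; tailor it to your application):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Check which SQL modes are active
SELECT @@GLOBAL.sql_mode;

-- Enable strict behavior if it's missing (lasts until restart;
-- set it in my.cnf to make it permanent)
SET GLOBAL sql_mode = 'STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;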

&lt;h2&gt;
  
  
  2. JSON and document handling
&lt;/h2&gt;

&lt;p&gt;Both databases support JSON, but they approach it differently. PostgreSQL has a native JSONB type that stores JSON in a binary format with full indexing support. MySQL added JSON support in version 5.7 and has improved it since, but the implementation has limitations.&lt;/p&gt;

&lt;p&gt;PostgreSQL's JSONB allows you to create indexes on specific JSON paths, query nested structures efficiently and use JSON in complex queries alongside relational data. You can also use operators like &lt;code&gt;@&amp;gt;&lt;/code&gt; (contains) and &lt;code&gt;?&lt;/code&gt; (key exists) that make JSON queries concise and readable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- PostgreSQL: Create index on JSON field&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_users_metadata_country&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;GIN&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'country'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;-- PostgreSQL: Query with containment operator&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'{"role": "admin"}'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;MySQL's JSON functions work but feel more verbose. Indexing JSON in MySQL requires generated columns, which adds complexity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- MySQL: Requires generated column for indexing&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
  &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;JSON_UNQUOTE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'$.country'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="n"&gt;STORED&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_country&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your application heavily uses semi-structured data or you're building something that mixes relational and document patterns, PostgreSQL handles this more elegantly. MySQL works fine for basic JSON storage and retrieval, but advanced querying gets awkward.&lt;/p&gt;
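&lt;p&gt;For comparison, the closest MySQL counterpart to PostgreSQL's containment query above uses JSON_CONTAINS. It works, but it reads less naturally:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- MySQL: find users whose metadata contains {"role": "admin"}
SELECT * FROM users WHERE JSON_CONTAINS(metadata, '{"role": "admin"}');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;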

&lt;h2&gt;
  
  
  3. Replication and high availability
&lt;/h2&gt;

&lt;p&gt;Both databases support replication, but they use fundamentally different approaches. Understanding these differences matters when planning for high availability and read scaling.&lt;/p&gt;

&lt;p&gt;MySQL uses binary log replication. The primary server writes changes to a binary log, and replicas read from it. This approach is well-understood and has been battle-tested for decades. MySQL also supports Group Replication for multi-primary setups, though it comes with trade-offs around consistency.&lt;/p&gt;

&lt;p&gt;PostgreSQL uses Write-Ahead Log (WAL) streaming replication. It's conceptually similar but operates at the storage level rather than the query level. PostgreSQL's logical replication (added in version 10) allows selective table replication and cross-version replication.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MySQL binary replication is simpler to set up initially&lt;/li&gt;
&lt;li&gt;PostgreSQL logical replication offers more flexibility for complex topologies&lt;/li&gt;
&lt;li&gt;MySQL Group Replication provides multi-primary but with consistency caveats&lt;/li&gt;
&lt;li&gt;PostgreSQL synchronous replication guarantees zero data loss at the cost of latency&lt;/li&gt;
&lt;/ul&gt;
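&lt;p&gt;PostgreSQL's logical replication is set up entirely in SQL. A minimal sketch (table name and connection details are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- On the publisher: replicate just one table
CREATE PUBLICATION orders_pub FOR TABLE orders;

-- On the subscriber: connect and start streaming
CREATE SUBSCRIPTION orders_sub
    CONNECTION 'host=primary.example.com dbname=shop user=repl password=secret'
    PUBLICATION orders_pub;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;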

&lt;p&gt;For most applications, both approaches work well. MySQL's tooling ecosystem for replication is more mature, with tools like Orchestrator and ProxySQL being widely used. PostgreSQL's tooling has caught up significantly with Patroni, PgBouncer and others.&lt;/p&gt;

&lt;p&gt;Backup strategies differ too. Both support logical and physical backups, but the tools and workflows vary. For automated database backups with scheduling, encryption and multiple storage destinations, &lt;a href="https://databasus.com/mysql-backup" rel="noopener noreferrer"&gt;MySQL backup&lt;/a&gt; tools like Databasus handle both MySQL and PostgreSQL, providing a unified approach regardless of which database you choose.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Licensing and corporate ownership
&lt;/h2&gt;

&lt;p&gt;This is often overlooked but increasingly important. MySQL is owned by Oracle. PostgreSQL is a community project with no single corporate owner.&lt;/p&gt;

&lt;p&gt;MySQL uses a dual licensing model. The Community Edition is GPL-licensed, which means if you modify MySQL and distribute it, you must release your changes. Oracle also sells commercial licenses for those who want to avoid GPL obligations. Some MySQL forks exist (MariaDB, Percona Server) partly because of licensing and governance concerns.&lt;/p&gt;

&lt;p&gt;PostgreSQL uses the PostgreSQL License, which is similar to BSD/MIT. You can do essentially anything with it, including building proprietary products without releasing source code. There's no commercial entity that could change the terms or create uncertainty about the project's direction.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;PostgreSQL&lt;/th&gt;
&lt;th&gt;MySQL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;PostgreSQL License (BSD-like)&lt;/td&gt;
&lt;td&gt;GPL + Commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Owner&lt;/td&gt;
&lt;td&gt;Community project&lt;/td&gt;
&lt;td&gt;Oracle Corporation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Major forks&lt;/td&gt;
&lt;td&gt;None needed&lt;/td&gt;
&lt;td&gt;MariaDB, Percona Server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature restrictions&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Some features in Enterprise only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For companies evaluating long-term risk, PostgreSQL's licensing and governance model provides more predictability. Oracle has added features to MySQL Enterprise that aren't in the Community Edition, and this trend could continue.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Performance characteristics
&lt;/h2&gt;

&lt;p&gt;Performance comparisons are tricky because results depend heavily on workload type, hardware configuration and tuning. Both databases can handle millions of transactions per day when properly configured. But their performance profiles differ in important ways.&lt;/p&gt;

&lt;p&gt;PostgreSQL historically performed better on complex queries with many joins, subqueries and analytical operations. Its query planner is sophisticated and handles complicated query patterns well. The cost-based optimizer has decades of refinement.&lt;/p&gt;

&lt;p&gt;MySQL traditionally excelled at simple read-heavy workloads with straightforward queries. If your application does mostly primary key lookups and simple filters, MySQL can be extremely fast. The InnoDB storage engine is highly optimized for these patterns.&lt;/p&gt;

&lt;p&gt;In 2026, both databases have narrowed these gaps. PostgreSQL 17 and 18 have improved simple query performance. MySQL 8.x has better handling of complex queries than earlier versions. The differences are less dramatic than they were five years ago.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read-heavy OLTP workloads: Both perform well, slight edge to MySQL&lt;/li&gt;
&lt;li&gt;Complex analytical queries: PostgreSQL generally faster&lt;/li&gt;
&lt;li&gt;Write-heavy workloads: Depends on transaction patterns and indexing&lt;/li&gt;
&lt;li&gt;Mixed workloads: PostgreSQL handles variety better&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real performance factors are usually configuration, indexing and query design rather than the database engine choice. Both require tuning for production workloads. Neither works optimally with default settings.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Extension ecosystem
&lt;/h2&gt;

&lt;p&gt;PostgreSQL's extension system is one of its biggest advantages. Extensions can add new data types, index types, functions and even modify core behavior. The ecosystem is rich and actively maintained.&lt;/p&gt;

&lt;p&gt;Popular PostgreSQL extensions include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PostGIS&lt;/strong&gt; — Spatial and geographic data support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pg_stat_statements&lt;/strong&gt; — Query performance monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TimescaleDB&lt;/strong&gt; — Time-series data optimization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Citus&lt;/strong&gt; — Distributed PostgreSQL for horizontal scaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pgvector&lt;/strong&gt; — Vector similarity search for AI applications&lt;/li&gt;
&lt;/ul&gt;
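&lt;p&gt;Installing an extension, once its package is present on the server, is a single statement:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Enable query statistics (ships with PostgreSQL; also requires
-- shared_preload_libraries = 'pg_stat_statements' and a restart)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Enable vector search after installing the pgvector package
CREATE EXTENSION IF NOT EXISTS vector;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;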

&lt;p&gt;MySQL doesn't have an equivalent extension system. Feature additions require either MySQL updates or forking the source code. Some functionality exists through plugins (like authentication plugins) but the scope is limited compared to PostgreSQL.&lt;/p&gt;

&lt;p&gt;This extensibility matters more than it might seem. If you need geographic queries, PostgreSQL with PostGIS is significantly better than trying to work around MySQL's limited spatial support. If you're building AI features that need vector search, pgvector is a mature solution while MySQL has no comparable option.&lt;/p&gt;

&lt;p&gt;The extension ecosystem also means PostgreSQL can adapt to new use cases without waiting for core team priorities. When vector databases became important for AI applications, the community built pgvector. MySQL users had to wait for Oracle's roadmap or use a separate database.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Community and tooling
&lt;/h2&gt;

&lt;p&gt;Both databases have active communities, but they feel different. MySQL's community is larger in raw numbers but fragmented across MySQL, MariaDB and Percona variants. PostgreSQL's community is more unified around a single codebase.&lt;/p&gt;

&lt;p&gt;PostgreSQL's development is transparent. Mailing lists are public, design discussions happen in the open and anyone can propose patches. The code review process is rigorous and the community has high standards for what gets merged. Release cycles are predictable: one major version per year.&lt;/p&gt;

&lt;p&gt;MySQL development is primarily done inside Oracle. While the code is open source, the roadmap and priorities are set internally. External contributions exist but aren't as central to the project's direction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool availability:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GUI clients: Both have excellent options (pgAdmin, DBeaver, TablePlus for PostgreSQL; MySQL Workbench, DBeaver for MySQL)&lt;/li&gt;
&lt;li&gt;ORMs and drivers: Comprehensive support for both in all major languages&lt;/li&gt;
&lt;li&gt;Cloud offerings: Both available as managed services (RDS, Cloud SQL, Azure Database)&lt;/li&gt;
&lt;li&gt;Monitoring tools: Mature options for both&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For backup and disaster recovery, the tooling landscape varies. Both support native dump tools (pg_dump, mysqldump), but their capabilities differ. For automated backup management, tools such as Databasus support both PostgreSQL and MySQL with unified scheduling, encryption and storage options.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making the choice
&lt;/h2&gt;

&lt;p&gt;There's no universally correct answer. Both databases power successful applications at every scale. But certain patterns emerge when you look at typical use cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose PostgreSQL when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need advanced data types (JSONB, arrays, custom types)&lt;/li&gt;
&lt;li&gt;Your queries are complex with many joins and subqueries&lt;/li&gt;
&lt;li&gt;Data integrity is non-negotiable&lt;/li&gt;
&lt;li&gt;You want extensibility (PostGIS, pgvector, TimescaleDB)&lt;/li&gt;
&lt;li&gt;Licensing simplicity matters to your organization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose MySQL when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your workload is primarily simple CRUD operations&lt;/li&gt;
&lt;li&gt;You need maximum compatibility with existing tools and hosting&lt;/li&gt;
&lt;li&gt;Your team already has MySQL expertise&lt;/li&gt;
&lt;li&gt;You're building something that will run on shared hosting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither choice is wrong for most applications. The differences matter most at the extremes: very complex analytical workloads, very high write volumes or specialized data types. For a typical web application, both will work fine with proper setup.&lt;/p&gt;

&lt;p&gt;What actually matters is understanding whichever database you choose deeply enough to configure it properly, design your schema correctly and troubleshoot problems when they occur. A well-tuned MySQL installation will outperform a misconfigured PostgreSQL one, and vice versa.&lt;/p&gt;

&lt;p&gt;Start with whichever one you or your team knows better. Switch if you hit real limitations, not theoretical ones. Both databases have been solving real problems for decades and both will continue improving.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>mysql</category>
    </item>
    <item>
      <title>10 PostgreSQL performance tuning tips that actually work in production</title>
      <dc:creator>Piter Adyson</dc:creator>
      <pubDate>Wed, 04 Feb 2026 19:50:29 +0000</pubDate>
      <link>https://forem.com/piteradyson/10-postgresql-performance-tuning-tips-that-actually-work-in-production-4996</link>
      <guid>https://forem.com/piteradyson/10-postgresql-performance-tuning-tips-that-actually-work-in-production-4996</guid>
      <description>&lt;p&gt;Performance tuning isn't about following a checklist. It's about understanding what's actually slowing down your database and fixing those specific problems. These are techniques that consistently deliver real improvements in production environments. Some of them are obvious but frequently misconfigured. Others are less known but surprisingly effective.&lt;/p&gt;

&lt;p&gt;The tips in this article cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory configuration (shared_buffers, work_mem)&lt;/li&gt;
&lt;li&gt;Index strategy and maintenance&lt;/li&gt;
&lt;li&gt;Connection management&lt;/li&gt;
&lt;li&gt;Vacuum and maintenance tuning&lt;/li&gt;
&lt;li&gt;Query optimization techniques&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xkm34tpcvbpf0zzqwiz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xkm34tpcvbpf0zzqwiz.png" alt="PostgreSQL tuning" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Configure shared_buffers properly
&lt;/h2&gt;

&lt;p&gt;PostgreSQL uses shared_buffers to cache frequently accessed data in memory. The default setting is usually way too low for production workloads. Setting this value correctly can dramatically reduce disk I/O and improve query performance.&lt;/p&gt;

&lt;p&gt;The general recommendation is to set shared_buffers to about 25% of your total system RAM. If you have 16 GB of RAM, start with 4 GB. If you're on a dedicated database server with lots of memory, you can go higher, but there are diminishing returns above 8-10 GB because PostgreSQL also relies on the operating system's file cache.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- In postgresql.conf&lt;/span&gt;
&lt;span class="n"&gt;shared_buffers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;GB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After changing this setting, you need to restart PostgreSQL. Monitor your cache hit ratio to see if the change helped. A cache hit ratio above 99% is good. You can check it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;heap_blks_read&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;heap_read&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;heap_blks_hit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;heap_hit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;heap_blks_hit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;heap_blks_hit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;heap_blks_read&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ratio&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_statio_user_tables&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Tune work_mem for complex queries
&lt;/h2&gt;

&lt;p&gt;The work_mem setting controls how much memory PostgreSQL can use for internal sort operations and hash tables before it has to write to disk. If you're running complex queries with sorts, joins or aggregations, increasing work_mem can prevent expensive disk operations.&lt;/p&gt;

&lt;p&gt;Be careful though. work_mem is allocated per operation, not per query. A complex query with multiple sorts can use work_mem several times over. If you set it too high and have many concurrent queries, you can run out of memory.&lt;/p&gt;

&lt;p&gt;Start conservative. The default is usually 4 MB. Try 16-64 MB for analytical workloads. For specific heavy queries, you can increase it temporarily in the session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;work_mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'256MB'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;large_table&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;some_column&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;RESET&lt;/span&gt; &lt;span class="n"&gt;work_mem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use pg_stat_statements to spot queries that write temporary files (high temp_blks_written), then run EXPLAIN ANALYZE on them: a "Sort Method: external merge" line means the sort spilled to disk. Those queries are candidates for work_mem tuning.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Add the right indexes
&lt;/h2&gt;

&lt;p&gt;Indexes speed up reads but slow down writes. The trick is finding the right balance. Start by identifying slow queries using pg_stat_statements or your query logs. Look at queries with high execution time or high call counts.&lt;/p&gt;

&lt;p&gt;For most cases, B-tree indexes work well. Create indexes on columns used in WHERE clauses, JOIN conditions and ORDER BY statements. But don't go overboard. Every index adds overhead during INSERTs and UPDATEs.&lt;/p&gt;
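
&lt;p&gt;As a sketch, with hypothetical table and column names, that looks like this. CREATE INDEX CONCURRENTLY builds the index without blocking writes, which matters on busy tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Single-column index for queries filtering on customer_id
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);

-- Composite index for WHERE status = ... ORDER BY created_at
CREATE INDEX CONCURRENTLY idx_orders_status_created ON orders (status, created_at);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;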

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Index Type&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;When to Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;B-tree&lt;/td&gt;
&lt;td&gt;General purpose, equality and range queries&lt;/td&gt;
&lt;td&gt;Most common scenarios, default choice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GIN&lt;/td&gt;
&lt;td&gt;Full-text search, JSONB, arrays&lt;/td&gt;
&lt;td&gt;Searching within complex data types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GiST&lt;/td&gt;
&lt;td&gt;Geometric data, full-text search&lt;/td&gt;
&lt;td&gt;Spatial queries, complex searches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BRIN&lt;/td&gt;
&lt;td&gt;Very large tables with natural ordering&lt;/td&gt;
&lt;td&gt;Time-series data, append-only tables&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Use EXPLAIN ANALYZE to verify your indexes are actually being used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'test@example.com'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see a Seq Scan when you expected an Index Scan, something's wrong. Maybe the index doesn't exist, or PostgreSQL thinks it's not worth using (which happens on small tables or when selecting most of the table).&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Use connection pooling
&lt;/h2&gt;

&lt;p&gt;Every PostgreSQL connection has overhead. Opening and closing connections repeatedly wastes resources. If your application creates a new database connection for each request, you're probably experiencing unnecessary latency and resource consumption.&lt;/p&gt;

&lt;p&gt;Connection poolers like PgBouncer sit between your application and PostgreSQL. They maintain a pool of connections and reuse them across multiple client requests. This reduces connection overhead significantly.&lt;/p&gt;

&lt;p&gt;PgBouncer supports three pooling modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Session pooling keeps a connection for the entire client session&lt;/li&gt;
&lt;li&gt;Transaction pooling releases connections after each transaction (more efficient for web apps)&lt;/li&gt;
&lt;li&gt;Statement pooling releases after each statement (use with caution, limited functionality)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most web applications, transaction pooling works well. Install PgBouncer, point your application to it instead of directly to PostgreSQL and configure the pool size based on your workload. A good starting point is 2-3 connections per CPU core on your database server.&lt;/p&gt;
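
&lt;p&gt;A minimal pgbouncer.ini sketch for transaction pooling might look like this (the host, database name and pool size are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[databases]
myapp_production = host=127.0.0.1 port=5432 dbname=myapp_production

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Your application then connects to port 6432 instead of 5432, with no other code changes.&lt;/p&gt;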

&lt;h2&gt;
  
  
  5. Analyze and vacuum regularly
&lt;/h2&gt;

&lt;p&gt;PostgreSQL uses MVCC (Multi-Version Concurrency Control) which creates row versions when you update or delete data. Over time, dead rows accumulate. VACUUM removes these dead rows and frees up space. ANALYZE updates statistics that the query planner uses to make decisions.&lt;/p&gt;

&lt;p&gt;Modern PostgreSQL versions have autovacuum enabled by default, but it might not be aggressive enough for high-write workloads. If you're seeing table bloat or degraded query performance over time, you probably need to tune autovacuum settings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- In postgresql.conf&lt;/span&gt;
&lt;span class="n"&gt;autovacuum_vacuum_scale_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;-- Vacuum when 10% of table is dead rows&lt;/span&gt;
&lt;span class="n"&gt;autovacuum_analyze_scale_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;  &lt;span class="c1"&gt;-- Analyze when 5% has changed&lt;/span&gt;
&lt;span class="n"&gt;autovacuum_naptime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;  &lt;span class="c1"&gt;-- Check for work every 30 seconds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For very active tables, you can also set table-specific settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;your_busy_table&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;autovacuum_vacuum_scale_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check for bloat using queries from pg_stat_user_tables. If you see tables with high n_dead_tup, autovacuum isn't keeping up.&lt;/p&gt;
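
&lt;p&gt;A quick way to spot tables where dead rows are piling up, using only the built-in statistics views:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;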

&lt;h2&gt;
  
  
  6. Optimize your queries
&lt;/h2&gt;

&lt;p&gt;Sometimes the database configuration is fine, but the queries themselves are inefficient. Use EXPLAIN ANALYZE to understand query execution plans. Look for sequential scans on large tables, nested loops with high costs or sorts that spill to disk.&lt;/p&gt;

&lt;p&gt;Common query optimizations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adding WHERE clauses to filter data early&lt;/li&gt;
&lt;li&gt;Using JOIN instead of subqueries when appropriate&lt;/li&gt;
&lt;li&gt;Avoiding SELECT * and only fetching columns you need&lt;/li&gt;
&lt;li&gt;Using LIMIT when you don't need all results&lt;/li&gt;
&lt;li&gt;Avoiding functions on indexed columns in WHERE clauses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's an example of a problematic query pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Bad: Function on indexed column prevents index usage&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2026&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Good: Range comparison allows index usage&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'2027-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also consider using prepared statements. They're parsed and planned once, then executed multiple times with different parameters. This reduces overhead for frequently executed queries.&lt;/p&gt;
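
&lt;p&gt;A minimal sketch using the users table from the earlier example (the id column is assumed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;PREPARE get_user (text) AS
    SELECT id, email FROM users WHERE email = $1;

EXECUTE get_user('test@example.com');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Most client libraries do this automatically when you use parameterized queries, so explicit PREPARE is mainly useful in SQL scripts and for testing.&lt;/p&gt;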

&lt;h2&gt;
  
  
  7. Partition large tables
&lt;/h2&gt;

&lt;p&gt;If you have tables with millions or billions of rows, partitioning can improve performance and manageability. PostgreSQL's declarative partitioning splits a large table into smaller physical pieces based on ranges, lists or hash values.&lt;/p&gt;

&lt;p&gt;Time-based partitioning is common for logs or event data. You create partitions by month or year, and older partitions can be archived or dropped easily. Queries that filter by the partition key only scan relevant partitions, not the entire table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="n"&gt;JSONB&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;RANGE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events_2026_01&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
    &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2026-02-01'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events_2026_02&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
    &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2026-02-01'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2026-03-01'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Partitioning also makes backups more manageable. Instead of backing up one massive table, you can backup or restore individual partitions. Tools like &lt;a href="https://databasus.com" rel="noopener noreferrer"&gt;PostgreSQL backup&lt;/a&gt; handle partitioned tables automatically, treating each partition appropriately during backup and restore operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Enable query logging for slow queries
&lt;/h2&gt;

&lt;p&gt;You can't optimize what you can't measure. PostgreSQL's slow query log captures queries that exceed a specified duration. This helps you identify problematic queries in production without impacting performance significantly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- In postgresql.conf&lt;/span&gt;
&lt;span class="n"&gt;log_min_duration_statement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;  &lt;span class="c1"&gt;-- Log queries taking more than 1 second&lt;/span&gt;
&lt;span class="n"&gt;log_line_prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The log will show you the full query text, execution time and context. Combine this with pg_stat_statements for aggregated statistics across all queries. You'll quickly see which queries are consuming the most resources.&lt;/p&gt;

&lt;p&gt;For production systems, start with a higher threshold (1-5 seconds) to avoid excessive logging. Once you've optimized the obvious slow queries, you can lower it to catch smaller issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Use read replicas for reporting workloads
&lt;/h2&gt;

&lt;p&gt;If you're running heavy analytical queries or reports on your primary database, they can interfere with transactional workloads. Read replicas solve this by offloading read-only queries to separate servers.&lt;/p&gt;

&lt;p&gt;PostgreSQL's streaming replication creates one or more standby servers that continuously apply changes from the primary. Your application can send SELECT queries to these replicas, distributing the load.&lt;/p&gt;

&lt;p&gt;Setting up replication requires some configuration, but it's straightforward:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Primary Server&lt;/th&gt;
&lt;th&gt;Replica Server&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;wal_level&lt;/td&gt;
&lt;td&gt;replica or logical&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;max_wal_senders&lt;/td&gt;
&lt;td&gt;Number of replicas + 1&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hot_standby&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
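
&lt;p&gt;On the primary, you also need a role with replication privileges. A minimal sketch (the role name and password are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'choose_a_strong_password';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The replica itself is typically initialized with pg_basebackup -R, which copies the data directory and writes the standby connection settings for you.&lt;/p&gt;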

&lt;p&gt;The replica will lag slightly behind the primary (typically milliseconds to seconds). If your application can tolerate this, replicas are a cheap way to scale read capacity.&lt;/p&gt;

&lt;p&gt;You can also use replicas for backup purposes. Taking backups from a replica instead of the primary reduces load on your production database.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Monitor and adjust autovacuum costs
&lt;/h2&gt;

&lt;p&gt;Autovacuum runs in the background to clean up dead rows, but it can consume I/O and CPU resources. If autovacuum runs too aggressively, it can slow down your application queries. If it doesn't run enough, tables bloat and performance degrades.&lt;/p&gt;

&lt;p&gt;The cost-based vacuum delay system controls how aggressively autovacuum uses resources. By default, it's fairly conservative. On modern hardware with SSDs, you can usually make it more aggressive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- In postgresql.conf&lt;/span&gt;
&lt;span class="n"&gt;autovacuum_vacuum_cost_delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;  &lt;span class="c1"&gt;-- Lower = faster vacuum&lt;/span&gt;
&lt;span class="n"&gt;autovacuum_vacuum_cost_limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;  &lt;span class="c1"&gt;-- Higher = more work per cycle&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For specific high-write tables, you might need custom settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;busy_table&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;autovacuum_vacuum_cost_delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting cost_delay to 0 removes throttling entirely for that table. Use this carefully and monitor I/O.&lt;/p&gt;

&lt;p&gt;Watch the pg_stat_all_tables view for tables where autovacuum is falling behind (last_autovacuum is old and n_dead_tup is high). Those tables need tuning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;p&gt;Performance tuning is iterative. Start by measuring your current state with pg_stat_statements and query logs. Identify the biggest bottlenecks first. A few slow queries might account for 80% of your database load.&lt;/p&gt;

&lt;p&gt;Apply one change at a time and measure the results. What works for one workload might not work for another. OLTP systems (lots of small transactions) need different tuning than OLAP systems (complex analytical queries).&lt;/p&gt;

&lt;p&gt;Before making any changes, establish a baseline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current query response times (p50, p95, p99)&lt;/li&gt;
&lt;li&gt;Cache hit ratio and buffer usage&lt;/li&gt;
&lt;li&gt;Connection counts and wait times&lt;/li&gt;
&lt;li&gt;Disk I/O and CPU utilization&lt;/li&gt;
&lt;/ul&gt;
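
&lt;p&gt;With pg_stat_statements enabled, the query-timing part of that baseline is one query away (the columns are named total_exec_time and mean_exec_time in PostgreSQL 13 and later; older versions use total_time and mean_time):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Queries consuming the most total execution time
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;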

&lt;p&gt;Keep your PostgreSQL version updated. Each release includes performance improvements and better defaults. PostgreSQL 17 and 18 have significantly better query planning and execution than older versions.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tuning Area&lt;/th&gt;
&lt;th&gt;Impact&lt;/th&gt;
&lt;th&gt;Difficulty&lt;/th&gt;
&lt;th&gt;When to Do It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Indexes&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Early, based on query patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;shared_buffers&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;During initial setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Connection pooling&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;When connections become bottleneck&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partitioning&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;When tables exceed 50-100 million rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Autovacuum tuning&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;When seeing table bloat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read replicas&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;When reads exceed write capacity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And remember: backups don't fix performance problems, but they let you experiment safely. Before making major changes, ensure you have reliable backups. Databasus is a dedicated PostgreSQL backup tool that offers automated backups with flexible scheduling and multiple storage options, for both small projects and large enterprises.&lt;/p&gt;

&lt;p&gt;These tuning techniques work because they address real bottlenecks: memory usage, disk I/O, connection overhead and query efficiency. Apply them based on your specific bottlenecks, not just because they're on a list.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>PostgreSQL automated backups — How to set up automated PostgreSQL backup schedules</title>
      <dc:creator>Piter Adyson</dc:creator>
      <pubDate>Tue, 03 Feb 2026 18:39:41 +0000</pubDate>
      <link>https://forem.com/piteradyson/postgresql-automated-backups-how-to-set-up-automated-postgresql-backup-schedules-4d0k</link>
      <guid>https://forem.com/piteradyson/postgresql-automated-backups-how-to-set-up-automated-postgresql-backup-schedules-4d0k</guid>
      <description>&lt;p&gt;Losing data hurts. Whether it's a corrupted disk, accidental deletion, or a bad deployment that wipes your production database, recovery without backups means starting from scratch. Automated PostgreSQL backups remove the human factor from the equation. You set them up once, and they run reliably while you focus on other things.&lt;/p&gt;

&lt;p&gt;This guide covers practical approaches to scheduling PostgreSQL backups, from simple cron jobs to dedicated backup tools. We'll look at what actually matters for different scenarios and how to avoid common mistakes that make backups useless when you need them most.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1dle53499mwkfzfvdoa.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1dle53499mwkfzfvdoa.jpg" alt="PostgreSQL scheduled backups" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why automate PostgreSQL backups
&lt;/h2&gt;

&lt;p&gt;Manual backups work until they don't. Someone forgets, someone's on vacation, someone assumes the other person did it. Automation eliminates these failure modes.&lt;/p&gt;

&lt;h3&gt;
  
  
  The cost of manual backup processes
&lt;/h3&gt;

&lt;p&gt;Manual processes introduce variability. One day you run the backup at 2 AM, the next week at 6 PM. Sometimes you compress the output, sometimes you don't. The backup script lives on someone's laptop instead of version control. When disaster strikes, you discover the last backup was three weeks ago and nobody noticed.&lt;/p&gt;

&lt;p&gt;Automated backups run consistently. Same time, same configuration, same destination. They either succeed or they alert you immediately. There's no ambiguity about whether yesterday's backup happened.&lt;/p&gt;

&lt;h3&gt;
  
  
  What good backup automation looks like
&lt;/h3&gt;

&lt;p&gt;Reliable backup automation has a few key characteristics. It runs without intervention once configured. It stores backups in locations separate from the source database. It notifies you of failures immediately. And it maintains enough historical backups to recover from problems you discover days or weeks later.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Characteristic&lt;/th&gt;
&lt;th&gt;Manual process&lt;/th&gt;
&lt;th&gt;Automated process&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Consistency&lt;/td&gt;
&lt;td&gt;Varies by person&lt;/td&gt;
&lt;td&gt;Same every time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coverage&lt;/td&gt;
&lt;td&gt;Often gaps&lt;/td&gt;
&lt;td&gt;Continuous&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure detection&lt;/td&gt;
&lt;td&gt;Often delayed&lt;/td&gt;
&lt;td&gt;Immediate alerts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Documentation&lt;/td&gt;
&lt;td&gt;Usually missing&lt;/td&gt;
&lt;td&gt;Built into config&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Good automation also handles retention. You don't want unlimited backups consuming storage forever, but you do want enough history to recover from slow-developing problems like data corruption that goes unnoticed for a week.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using pg_dump with cron
&lt;/h2&gt;

&lt;p&gt;The simplest automation approach combines PostgreSQL's native &lt;code&gt;pg_dump&lt;/code&gt; utility with cron scheduling. This works for small to medium databases where backup windows aren't tight.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic pg_dump script
&lt;/h3&gt;

&lt;p&gt;Create a backup script that handles the actual dump process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d_%H%M%S&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;BACKUP_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/var/backups/postgresql"&lt;/span&gt;
&lt;span class="nv"&gt;DATABASE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"myapp_production"&lt;/span&gt;
&lt;span class="nv"&gt;BACKUP_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;BACKUP_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DATABASE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.sql.gz"&lt;/span&gt;

&lt;span class="c"&gt;# Create backup directory if it doesn't exist&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Run pg_dump with compression&lt;/span&gt;
pg_dump &lt;span class="nt"&gt;-h&lt;/span&gt; localhost &lt;span class="nt"&gt;-U&lt;/span&gt; postgres &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DATABASE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;gzip&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Check if backup succeeded&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;$?&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Backup completed: &lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;else
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Backup failed!"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&amp;amp;2
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Remove backups older than 7 days&lt;/span&gt;
find &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s2"&gt;"*.sql.gz"&lt;/span&gt; &lt;span class="nt"&gt;-mtime&lt;/span&gt; +7 &lt;span class="nt"&gt;-delete&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save this as &lt;code&gt;/usr/local/bin/pg-backup.sh&lt;/code&gt; and make it executable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x /usr/local/bin/pg-backup.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script creates timestamped, compressed backups and removes old ones automatically. The &lt;code&gt;gzip&lt;/code&gt; compression typically reduces the size of a plain-text dump by 80-90%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up cron schedules
&lt;/h3&gt;

&lt;p&gt;Add a cron entry to run the backup at your preferred time. Edit the crontab:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;crontab &lt;span class="nt"&gt;-e&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add a line for daily backups at 3 AM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0 3 * * * /usr/local/bin/pg-backup.sh &amp;gt;&amp;gt; /var/log/pg-backup.log 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For hourly backups during business hours:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0 9-18 * * 1-5 /usr/local/bin/pg-backup.sh &amp;gt;&amp;gt; /var/log/pg-backup.log 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The log redirect captures both stdout and stderr, so you can troubleshoot failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling authentication
&lt;/h3&gt;

&lt;p&gt;Avoid putting passwords in scripts. Use a &lt;code&gt;.pgpass&lt;/code&gt; file instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"localhost:5432:myapp_production:postgres:yourpassword"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.pgpass
&lt;span class="nb"&gt;chmod &lt;/span&gt;600 ~/.pgpass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL reads credentials from this file automatically when the connection parameters match. The strict permissions (600) are required; PostgreSQL ignores the file if others can read it.&lt;/p&gt;

&lt;p&gt;Cron jobs run with a minimal environment rather than your full shell setup, so use absolute paths in scripts and crontab entries. This basic approach works, but you'll want monitoring to know when backups fail.&lt;/p&gt;
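
&lt;p&gt;Because cron provides almost no environment, commands your script relies on may not be on its PATH. Setting it explicitly at the top of the crontab avoids surprises:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PATH=/usr/local/bin:/usr/bin:/bin
0 3 * * * /usr/local/bin/pg-backup.sh &amp;gt;&amp;gt; /var/log/pg-backup.log 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;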

&lt;h2&gt;
  
  
  Adding monitoring and alerts
&lt;/h2&gt;

&lt;p&gt;A backup that fails silently is worse than no backup at all. You think you're protected, but you're not. Add monitoring to catch problems early.&lt;/p&gt;

&lt;h3&gt;
  
  
  Email notifications
&lt;/h3&gt;

&lt;p&gt;Modify the backup script to send email on failure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d_%H%M%S&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;BACKUP_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/var/backups/postgresql"&lt;/span&gt;
&lt;span class="nv"&gt;DATABASE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"myapp_production"&lt;/span&gt;
&lt;span class="nv"&gt;BACKUP_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;BACKUP_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DATABASE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.sql.gz"&lt;/span&gt;
&lt;span class="nv"&gt;ADMIN_EMAIL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"admin@example.com"&lt;/span&gt;

&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

pg_dump &lt;span class="nt"&gt;-h&lt;/span&gt; localhost &lt;span class="nt"&gt;-U&lt;/span&gt; postgres &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DATABASE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;gzip&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;$?&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Backup completed: &lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;else
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"PostgreSQL backup failed at &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | mail &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s2"&gt;"ALERT: Database backup failed"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ADMIN_EMAIL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi

&lt;/span&gt;find &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s2"&gt;"*.sql.gz"&lt;/span&gt; &lt;span class="nt"&gt;-mtime&lt;/span&gt; +7 &lt;span class="nt"&gt;-delete&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sends an email when the dump pipeline exits with a non-zero status. Because &lt;code&gt;$?&lt;/code&gt; after a plain pipeline reflects only the last command, make sure the script enables &lt;code&gt;set -o pipefail&lt;/code&gt; so a &lt;code&gt;pg_dump&lt;/code&gt; failure isn't masked by a successful &lt;code&gt;gzip&lt;/code&gt;. You might also want success notifications for critical databases, just to confirm everything's working.&lt;/p&gt;

&lt;h3&gt;
  
  
  Webhook integration
&lt;/h3&gt;

&lt;p&gt;For team chat notifications, curl to a webhook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;send_notification&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;webhook_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://hooks.slack.com/services/YOUR/WEBHOOK/URL"&lt;/span&gt;

    curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;$message&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$webhook_url&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# After the pg_dump pipeline in the backup script:&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;$?&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;send_notification &lt;span class="s2"&gt;"PostgreSQL backup completed: &lt;/span&gt;&lt;span class="nv"&gt;$DATABASE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;else
    &lt;/span&gt;send_notification &lt;span class="s2"&gt;"ALERT: PostgreSQL backup failed for &lt;/span&gt;&lt;span class="nv"&gt;$DATABASE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace the webhook URL with your Slack, Discord, or other service endpoint. Slack accepts this JSON shape directly; Discord expects a &lt;code&gt;content&lt;/code&gt; field instead of &lt;code&gt;text&lt;/code&gt;, and other platforms have similar small variations, so check your service's webhook documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verifying backup integrity
&lt;/h3&gt;

&lt;p&gt;A backup file existing doesn't mean it's usable. Add verification steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check file size (should be at least some minimum)&lt;/span&gt;
&lt;span class="nv"&gt;MIN_SIZE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1000
&lt;span class="nv"&gt;FILE_SIZE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;stat&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt;%z &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;stat&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt;%s &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$FILE_SIZE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-lt&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MIN_SIZE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;send_notification &lt;span class="s2"&gt;"WARNING: Backup file suspiciously small (&lt;/span&gt;&lt;span class="nv"&gt;$FILE_SIZE&lt;/span&gt;&lt;span class="s2"&gt; bytes)"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Verify gzip integrity&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nb"&gt;gzip&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;/dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;send_notification &lt;span class="s2"&gt;"ALERT: Backup file appears corrupted"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The size check catches cases where the database connection failed but the script didn't error properly. The gzip test verifies the compression is intact.&lt;/p&gt;

&lt;h2&gt;
  
  
  Remote storage for backups
&lt;/h2&gt;

&lt;p&gt;Backups stored on the same server as the database don't protect against disk failures, server compromises, or datacenter issues. Store copies remotely.&lt;/p&gt;

&lt;h3&gt;
  
  
  S3 and compatible storage
&lt;/h3&gt;

&lt;p&gt;Add S3 upload to your backup script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;BUCKET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"s3://my-backup-bucket/postgresql"&lt;/span&gt;

&lt;span class="c"&gt;# Upload to S3&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BUCKET&lt;/span&gt;&lt;span class="s2"&gt;/"&lt;/span&gt; &lt;span class="nt"&gt;--storage-class&lt;/span&gt; STANDARD_IA

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;$?&lt;/span&gt; &lt;span class="nt"&gt;-ne&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;send_notification &lt;span class="s2"&gt;"ALERT: S3 upload failed for &lt;/span&gt;&lt;span class="nv"&gt;$DATABASE&lt;/span&gt;&lt;span class="s2"&gt; backup"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Optionally remove local file after successful upload&lt;/span&gt;
&lt;span class="c"&gt;# rm "$BACKUP_FILE"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;STANDARD_IA&lt;/code&gt; storage class costs less for infrequently accessed files like backups. Configure the AWS CLI with &lt;code&gt;aws configure&lt;/code&gt; before running the script.&lt;/p&gt;
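
&lt;p&gt;As an alternative to &lt;code&gt;aws configure&lt;/code&gt;, the CLI also reads credentials from standard environment variables, which is convenient in cron jobs and CI. The values below are placeholders:&lt;/p&gt;

```shell
# Standard AWS CLI environment variables; values here are placeholders
export AWS_ACCESS_KEY_ID="AKIAEXAMPLEKEY"
export AWS_SECRET_ACCESS_KEY="examplesecretkey"
export AWS_DEFAULT_REGION="us-east-1"
```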

&lt;p&gt;For S3-compatible services like Cloudflare R2 or MinIO, add the endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BUCKET&lt;/span&gt;&lt;span class="s2"&gt;/"&lt;/span&gt; &lt;span class="nt"&gt;--endpoint-url&lt;/span&gt; https://your-endpoint.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Retention policies
&lt;/h3&gt;

&lt;p&gt;Remote storage should have its own retention rules. S3 lifecycle policies can automatically expire old backups:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Rules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ID"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ExpireOldBackups"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Enabled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"postgresql/"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Expiration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Days"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3api put-bucket-lifecycle-configuration &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--bucket&lt;/span&gt; my-backup-bucket &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--lifecycle-configuration&lt;/span&gt; file://lifecycle.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This automatically deletes backups older than 30 days. Adjust based on your recovery requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automated backups with Databasus
&lt;/h2&gt;

&lt;p&gt;Writing and maintaining backup scripts takes time. Monitoring, remote storage integration, retention management, and team notifications all add complexity. Databasus, a dedicated &lt;a href="https://databasus.com" rel="noopener noreferrer"&gt;PostgreSQL backup&lt;/a&gt; tool, handles this out of the box with a web interface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;

&lt;p&gt;Run Databasus using Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; databasus &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 4005:4005 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ./databasus-data:/databasus-data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restart&lt;/span&gt; unless-stopped &lt;span class="se"&gt;\&lt;/span&gt;
  databasus/databasus:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with Docker Compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;databasus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;databasus/databasus:latest&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;databasus&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4005:4005"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;databasus-data:/databasus-data&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;databasus-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuration steps
&lt;/h3&gt;

&lt;p&gt;Access the web interface at &lt;code&gt;http://your-server:4005&lt;/code&gt;, then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Add your database&lt;/strong&gt; — Click "New Database", select PostgreSQL, and enter your connection details (host, port, database name, credentials)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select storage&lt;/strong&gt; — Choose where backups should go: local storage, S3, Google Drive, SFTP, or other supported destinations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select schedule&lt;/strong&gt; — Pick a backup frequency: hourly, daily, weekly, monthly, or define a custom cron expression&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Click "Create backup"&lt;/strong&gt; — Databasus validates the configuration and starts the backup schedule&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Databasus handles compression automatically, supports multiple notification channels (Slack, Discord, Telegram, email), and provides a dashboard showing backup history and status. It works for both self-hosted PostgreSQL and cloud-managed databases like AWS RDS and Google Cloud SQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing backup frequency
&lt;/h2&gt;

&lt;p&gt;How often you back up depends on how much data you can afford to lose. This is your Recovery Point Objective (RPO).&lt;/p&gt;

&lt;h3&gt;
  
  
  Matching frequency to requirements
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Acceptable data loss&lt;/th&gt;
&lt;th&gt;Recommended frequency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Development database&lt;/td&gt;
&lt;td&gt;Days&lt;/td&gt;
&lt;td&gt;Weekly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal tools&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;td&gt;Daily&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customer-facing app&lt;/td&gt;
&lt;td&gt;Minutes to an hour&lt;/td&gt;
&lt;td&gt;Hourly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Financial/compliance&lt;/td&gt;
&lt;td&gt;Near zero&lt;/td&gt;
&lt;td&gt;Continuous (WAL archiving)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For most applications, daily backups at off-peak hours work well. Hourly backups suit applications with frequent writes where losing an hour of data would be painful.&lt;/p&gt;
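
&lt;p&gt;The table rows map to simple cron expressions. A sketch, where &lt;code&gt;backup.sh&lt;/code&gt; is a placeholder for your backup script:&lt;/p&gt;

```shell
# Weekly: Sunday at 03:00
0 3 * * 0 /usr/local/bin/backup.sh
# Daily: every day at 03:00
0 3 * * * /usr/local/bin/backup.sh
# Hourly: on the hour
0 * * * * /usr/local/bin/backup.sh
```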

&lt;h3&gt;
  
  
  Timing considerations
&lt;/h3&gt;

&lt;p&gt;Schedule backups during low-traffic periods. &lt;code&gt;pg_dump&lt;/code&gt; reads from a consistent snapshot and doesn't block normal reads and writes, but it still generates load. A large dump during peak hours can slow down your application.&lt;/p&gt;

&lt;p&gt;Consider time zones. If your users are mostly in one region, schedule backups when they're sleeping. For global applications, find the least-busy period in your analytics.&lt;/p&gt;

&lt;p&gt;Database size matters too. A 100 GB database might take 30 minutes to dump. If you want hourly backups, you need that process to complete well within the hour.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing your recovery process
&lt;/h2&gt;

&lt;p&gt;Backups you've never tested are assumptions, not guarantees. Regular restore tests catch problems before they matter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Restore verification steps
&lt;/h3&gt;

&lt;p&gt;Create a test environment and restore periodically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a test database&lt;/span&gt;
createdb &lt;span class="nt"&gt;-h&lt;/span&gt; localhost &lt;span class="nt"&gt;-U&lt;/span&gt; postgres myapp_restore_test

&lt;span class="c"&gt;# Restore the backup&lt;/span&gt;
&lt;span class="nb"&gt;gunzip&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; /var/backups/postgresql/myapp_production_20240115_030000.sql.gz | &lt;span class="se"&gt;\&lt;/span&gt;
    psql &lt;span class="nt"&gt;-h&lt;/span&gt; localhost &lt;span class="nt"&gt;-U&lt;/span&gt; postgres &lt;span class="nt"&gt;-d&lt;/span&gt; myapp_restore_test

&lt;span class="c"&gt;# Run basic validation&lt;/span&gt;
psql &lt;span class="nt"&gt;-h&lt;/span&gt; localhost &lt;span class="nt"&gt;-U&lt;/span&gt; postgres &lt;span class="nt"&gt;-d&lt;/span&gt; myapp_restore_test &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"SELECT count(*) FROM users;"&lt;/span&gt;

&lt;span class="c"&gt;# Clean up&lt;/span&gt;
dropdb &lt;span class="nt"&gt;-h&lt;/span&gt; localhost &lt;span class="nt"&gt;-U&lt;/span&gt; postgres myapp_restore_test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Automate this as a weekly job and alert on failures. A backup that can't be restored is worthless.&lt;/p&gt;

&lt;h3&gt;
  
  
  Documenting recovery procedures
&lt;/h3&gt;

&lt;p&gt;Write down the exact steps to recover. Include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where backups are stored (all locations)&lt;/li&gt;
&lt;li&gt;How to access storage credentials&lt;/li&gt;
&lt;li&gt;Commands to restore&lt;/li&gt;
&lt;li&gt;Expected recovery time&lt;/li&gt;
&lt;li&gt;Who to contact if issues arise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Test the documentation by having someone unfamiliar with the system follow it. Gaps become obvious quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common automation mistakes
&lt;/h2&gt;

&lt;p&gt;Even well-intentioned backup automation fails in predictable ways.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage on the same disk
&lt;/h3&gt;

&lt;p&gt;Backing up to the same physical disk as the database protects against accidental deletion but not hardware failure. Always include remote storage.&lt;/p&gt;

&lt;h3&gt;
  
  
  No retention limits
&lt;/h3&gt;

&lt;p&gt;Unlimited backup retention eventually fills your storage. Set explicit retention policies and monitor disk usage.&lt;/p&gt;
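
&lt;p&gt;A minimal disk-usage check you could run alongside the backup job; the 90% threshold and default directory are assumptions to adjust for your setup:&lt;/p&gt;

```shell
#!/bin/bash
# Warn when the filesystem holding backups passes a usage threshold
BACKUP_DIR="${1:-/var/backups}"   # pass your backup directory as the first argument
THRESHOLD=90
# df --output=pcent prints e.g. " 42%"; strip everything but the digits
USAGE=$(df --output=pcent "$BACKUP_DIR" 2>/dev/null | tail -1 | tr -dc '0-9')
if [ -n "$USAGE" ]; then
    if [ "$USAGE" -ge "$THRESHOLD" ]; then
        echo "WARNING: ${BACKUP_DIR} is ${USAGE}% full"
    fi
fi
```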

&lt;h3&gt;
  
  
  Ignoring backup duration
&lt;/h3&gt;

&lt;p&gt;A backup that takes 4 hours can't run hourly. Monitor how long your backups take and adjust schedules accordingly. Alert when duration exceeds thresholds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardcoded credentials
&lt;/h3&gt;

&lt;p&gt;Passwords in scripts end up in version control, logs, and process listings. Use &lt;code&gt;.pgpass&lt;/code&gt; files, environment variables, or secrets management.&lt;/p&gt;
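
&lt;p&gt;A &lt;code&gt;.pgpass&lt;/code&gt; file keeps the password out of the command line and process listings; each line is &lt;code&gt;hostname:port:database:username:password&lt;/code&gt;. The password below is a placeholder:&lt;/p&gt;

```shell
# ~/.pgpass — must be chmod 600, or PostgreSQL ignores it
localhost:5432:myapp_production:postgres:yourpassword
```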

&lt;h3&gt;
  
  
  Missing failure notifications
&lt;/h3&gt;

&lt;p&gt;By default, cron only emails a job's output to the crontab owner, and only when a mail agent is configured. Failures that exit silently go unnoticed. Always add explicit failure handling and notifications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Automated PostgreSQL backups prevent the kind of data loss that damages businesses and ruins weekends. Start with cron and &lt;code&gt;pg_dump&lt;/code&gt; for simple setups, add monitoring and remote storage as your requirements grow, or use a dedicated tool like Databasus to handle the complexity. Whatever approach you choose, test your restores regularly. A backup strategy is only as good as your ability to recover from it.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>MongoDB Docker setup — Running MongoDB in Docker containers complete guide</title>
      <dc:creator>Piter Adyson</dc:creator>
      <pubDate>Mon, 02 Feb 2026 12:15:55 +0000</pubDate>
      <link>https://forem.com/piteradyson/mongodb-docker-setup-running-mongodb-in-docker-containers-complete-guide-3p7a</link>
      <guid>https://forem.com/piteradyson/mongodb-docker-setup-running-mongodb-in-docker-containers-complete-guide-3p7a</guid>
      <description>&lt;p&gt;Running MongoDB in Docker simplifies deployment and makes environments reproducible across development, testing and production. You can spin up a database in seconds without dealing with complex installation procedures. This guide covers everything from basic container setup to production configurations with replica sets, persistence, custom settings and proper backup strategies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74ra8385d7w8d3kwy9f1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74ra8385d7w8d3kwy9f1.png" alt="MongoDB in Docker" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why run MongoDB in Docker
&lt;/h2&gt;

&lt;p&gt;Traditional MongoDB installation requires adding repositories, managing versions, and cleaning up when things break. Docker containers provide isolation and consistency that native installations struggle to match.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits of containerized MongoDB
&lt;/h3&gt;

&lt;p&gt;Docker containers bundle MongoDB with all dependencies into a single package. You get identical behavior on your laptop, CI pipeline, and production servers. The classic "works on my machine" problem disappears.&lt;/p&gt;

&lt;p&gt;Containers start fast. Launching a fresh MongoDB instance takes about 5-10 seconds versus several minutes for traditional installation. This matters for integration tests and rapid development cycles.&lt;/p&gt;

&lt;p&gt;Cleanup is simple. Delete the container and it's gone completely. No leftover config files, no orphaned data directories cluttering your system.&lt;/p&gt;
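
&lt;p&gt;Cleanup is a two-command sequence; if you attached a named volume, remove it separately with &lt;code&gt;docker volume rm&lt;/code&gt;:&lt;/p&gt;

```shell
# Stop the running container, then remove it and its writable layer
docker stop mongodb
docker rm mongodb
```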

&lt;h3&gt;
  
  
  When Docker makes sense for MongoDB
&lt;/h3&gt;

&lt;p&gt;Docker works well for development environments where quick setup and teardown matter. It's also solid for microservices architectures where each service might need its own database instance. CI/CD pipelines benefit significantly from reproducible database containers.&lt;/p&gt;

&lt;p&gt;For production use, Docker adds a bit of complexity but provides consistency across environments. The performance overhead is typically 1-3% for database workloads, which most applications can easily absorb.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick start with Docker run
&lt;/h2&gt;

&lt;p&gt;The fastest way to get MongoDB running is a single Docker command. This approach works for testing and development scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic container setup
&lt;/h3&gt;

&lt;p&gt;Start MongoDB with minimal configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; mongodb &lt;span class="se"&gt;\&lt;/span&gt;
  mongo:8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This starts MongoDB 8 in detached mode. The container runs until you stop it explicitly.&lt;/p&gt;

&lt;p&gt;Check if it's running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Connect to the database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; mongodb mongosh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Environment variables for initial setup
&lt;/h3&gt;

&lt;p&gt;MongoDB's Docker image supports environment variables for first-run configuration (they only take effect when the data directory is empty, and the root username and password must be set together):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Required&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MONGO_INITDB_ROOT_USERNAME&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Admin username&lt;/td&gt;
&lt;td&gt;Optional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MONGO_INITDB_ROOT_PASSWORD&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Admin password&lt;/td&gt;
&lt;td&gt;Optional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MONGO_INITDB_DATABASE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Initial database name&lt;/td&gt;
&lt;td&gt;Optional&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Create an admin user on startup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; mongodb &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MONGO_INITDB_ROOT_USERNAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MONGO_INITDB_ROOT_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;secretpassword &lt;span class="se"&gt;\&lt;/span&gt;
  mongo:8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Connect with authentication:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; mongodb mongosh &lt;span class="nt"&gt;-u&lt;/span&gt; admin &lt;span class="nt"&gt;-p&lt;/span&gt; secretpassword &lt;span class="nt"&gt;--authenticationDatabase&lt;/span&gt; admin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Exposing ports
&lt;/h3&gt;

&lt;p&gt;MongoDB runs on port 27017 inside the container by default. Map it to your host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; mongodb &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 27017:27017 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MONGO_INITDB_ROOT_USERNAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MONGO_INITDB_ROOT_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;secretpassword &lt;span class="se"&gt;\&lt;/span&gt;
  mongo:8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can connect from your host machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mongosh &lt;span class="s2"&gt;"mongodb://admin:secretpassword@127.0.0.1:27017/admin"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use a different host port if 27017 is already in use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; mongodb &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 27018:27017 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MONGO_INITDB_ROOT_USERNAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MONGO_INITDB_ROOT_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;secretpassword &lt;span class="se"&gt;\&lt;/span&gt;
  mongo:8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Data persistence with volumes
&lt;/h2&gt;

&lt;p&gt;Without volumes, your data lives in the container's writable layer and vanishes when the container is removed. That's acceptable for throwaway test databases, but anything beyond that needs persistence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Named volumes
&lt;/h3&gt;

&lt;p&gt;Docker named volumes are the simplest approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; mongodb &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; mongodb-data:/data/db &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MONGO_INITDB_ROOT_USERNAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MONGO_INITDB_ROOT_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;secretpassword &lt;span class="se"&gt;\&lt;/span&gt;
  mongo:8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The volume &lt;code&gt;mongodb-data&lt;/code&gt; persists even after you delete the container. List your volumes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker volume &lt;span class="nb"&gt;ls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inspect volume details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker volume inspect mongodb-data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Bind mounts
&lt;/h3&gt;

&lt;p&gt;Bind mounts map a host directory directly into the container. This is useful when you need direct access to data files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; mongodb &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /path/to/data:/data/db &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MONGO_INITDB_ROOT_USERNAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MONGO_INITDB_ROOT_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;secretpassword &lt;span class="se"&gt;\&lt;/span&gt;
  mongo:8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure the directory exists and has proper permissions. On Linux, the MongoDB user inside the container needs write access:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /path/to/data
&lt;span class="nb"&gt;chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; 999:999 /path/to/data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The UID 999 corresponds to the MongoDB user inside the container.&lt;/p&gt;
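&lt;p&gt;If you want to confirm the UID on the image you're running (it can vary across image variants), check from inside the container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run --rm mongo:8 id mongodb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;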

&lt;h3&gt;
  
  
  Volume backup
&lt;/h3&gt;

&lt;p&gt;Back up a named volume by running a temporary container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; mongodb-data:/source:ro &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:/backup &lt;span class="se"&gt;\&lt;/span&gt;
  alpine &lt;span class="nb"&gt;tar &lt;/span&gt;czf /backup/mongodb-backup.tar.gz &lt;span class="nt"&gt;-C&lt;/span&gt; /source &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a compressed archive of the data directory. For proper database backups, use &lt;code&gt;mongodump&lt;/code&gt; instead, which we'll cover later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Docker Compose for MongoDB
&lt;/h2&gt;

&lt;p&gt;Docker Compose makes multi-container setups manageable and keeps configurations under version control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic compose file
&lt;/h3&gt;

&lt;p&gt;Create a &lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongodb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongo:8&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongodb&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;MONGO_INITDB_ROOT_USERNAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
      &lt;span class="na"&gt;MONGO_INITDB_ROOT_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secretpassword&lt;/span&gt;
      &lt;span class="na"&gt;MONGO_INITDB_DATABASE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;27017:27017"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongodb-data:/data/db&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongodb-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stop and remove:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remove including volumes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose down &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Application with MongoDB
&lt;/h3&gt;

&lt;p&gt;A typical setup includes your application and MongoDB together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;MONGODB_URI&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongodb://appuser:apppassword@mongodb:27017/myapp?authSource=admin&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;mongodb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_healthy&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8080:8080"&lt;/span&gt;

  &lt;span class="na"&gt;mongodb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongo:8&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongodb&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;MONGO_INITDB_ROOT_USERNAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
      &lt;span class="na"&gt;MONGO_INITDB_ROOT_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secretpassword&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongodb-data:/data/db&lt;/span&gt;
    &lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongosh"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--eval"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db.adminCommand('ping')"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;start_period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongodb-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;depends_on&lt;/code&gt; with &lt;code&gt;condition: service_healthy&lt;/code&gt; ensures your application waits for MongoDB to be ready before starting.&lt;/p&gt;
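&lt;p&gt;Note that &lt;code&gt;service_healthy&lt;/code&gt; only gates the initial startup; if MongoDB restarts later, your application still needs its own retry logic. Here's a minimal sketch with the official Node.js driver (the retry counts and the &lt;code&gt;MONGODB_URI&lt;/code&gt; lookup are illustrative, not part of the compose setup above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const { MongoClient } = require('mongodb');

// Retry the initial connection a few times before giving up.
async function connectWithRetry(uri, attempts = 5, delayMs = 2000) {
  for (let i = 1; i &lt;= attempts; i++) {
    try {
      const client = new MongoClient(uri);
      await client.connect();
      return client;
    } catch (err) {
      if (i === attempts) throw err;
      console.log(`Attempt ${i} failed, retrying in ${delayMs} ms`);
      await new Promise((resolve) =&gt; setTimeout(resolve, delayMs));
    }
  }
}

connectWithRetry(process.env.MONGODB_URI)
  .then(() =&gt; console.log('Connected to MongoDB'));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;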

&lt;h2&gt;
  
  
  Custom configuration
&lt;/h2&gt;

&lt;p&gt;Default settings work for development, but production workloads often need tuning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration file mount
&lt;/h3&gt;

&lt;p&gt;Create a custom configuration file &lt;code&gt;mongod.conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dbPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/data/db&lt;/span&gt;
  &lt;span class="na"&gt;journal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;wiredTiger&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;engineConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cacheSizeGB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

&lt;span class="na"&gt;systemLog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;file&lt;/span&gt;
  &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/log/mongodb/mongod.log&lt;/span&gt;
  &lt;span class="na"&gt;logAppend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;net&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;27017&lt;/span&gt;
  &lt;span class="na"&gt;bindIp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;

&lt;span class="na"&gt;security&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;authorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;enabled&lt;/span&gt;

&lt;span class="na"&gt;operationProfiling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;slowOpThresholdMs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slowOp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mount it into the container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; mongodb &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ./mongod.conf:/etc/mongod.conf:ro &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; mongodb-data:/data/db &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; mongodb-logs:/var/log/mongodb &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MONGO_INITDB_ROOT_USERNAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MONGO_INITDB_ROOT_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;secretpassword &lt;span class="se"&gt;\&lt;/span&gt;
  mongo:8 &lt;span class="nt"&gt;--config&lt;/span&gt; /etc/mongod.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Docker Compose with custom config
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongodb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongo:8&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongodb&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--config"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/etc/mongod.conf"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;MONGO_INITDB_ROOT_USERNAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
      &lt;span class="na"&gt;MONGO_INITDB_ROOT_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secretpassword&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongodb-data:/data/db&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongodb-logs:/var/log/mongodb&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./mongod.conf:/etc/mongod.conf:ro&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;27017:27017"&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongodb-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongodb-logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Common configuration options
&lt;/h3&gt;

&lt;p&gt;Key settings to consider for production:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Production recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;storage.wiredTiger.engineConfig.cacheSizeGB&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;50% of (RAM - 1 GB), minimum 256 MB&lt;/td&gt;
&lt;td&gt;Set explicitly based on available memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;operationProfiling.slowOpThresholdMs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;Tune based on your performance requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;net.maxIncomingConnections&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;65536&lt;/td&gt;
&lt;td&gt;Set based on expected concurrent connections&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;security.authorization&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;disabled&lt;/td&gt;
&lt;td&gt;Always enable in production&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Verify your configuration is applied:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;mongodb mongosh &lt;span class="nt"&gt;-u&lt;/span&gt; admin &lt;span class="nt"&gt;-p&lt;/span&gt; secretpassword &lt;span class="nt"&gt;--authenticationDatabase&lt;/span&gt; admin &lt;span class="nt"&gt;--eval&lt;/span&gt; &lt;span class="s2"&gt;"db.adminCommand({getParameter: '*'})"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Initialization scripts
&lt;/h2&gt;

&lt;p&gt;The MongoDB Docker image can run scripts on first startup. This is useful for creating users, collections, and seed data.&lt;/p&gt;

&lt;h3&gt;
  
  
  JavaScript initialization
&lt;/h3&gt;

&lt;p&gt;Place &lt;code&gt;.js&lt;/code&gt; or &lt;code&gt;.sh&lt;/code&gt; files in &lt;code&gt;/docker-entrypoint-initdb.d/&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;init/01-create-users.js&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getSiblingDB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;myapp&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createUser&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;appuser&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;pwd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;apppassword&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;roles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;readWrite&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;myapp&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createUser&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;readonly&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;pwd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;readonlypassword&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;roles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;read&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;myapp&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create &lt;code&gt;init/02-create-collections.js&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getSiblingDB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;myapp&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;users&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;validator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;$jsonSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;bsonType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;email&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;createdAt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;bsonType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;must be a string and is required&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;bsonType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;date&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;must be a date and is required&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createIndex&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mount the init directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongodb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongo:8&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;MONGO_INITDB_ROOT_USERNAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
      &lt;span class="na"&gt;MONGO_INITDB_ROOT_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secretpassword&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongodb-data:/data/db&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./init:/docker-entrypoint-initdb.d:ro&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongodb-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scripts run in alphabetical order, only on first container start when the data directory is empty.&lt;/p&gt;
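&lt;p&gt;A consequence is that editing an init script has no effect on an existing deployment. To re-run the scripts in development, remove the container together with its data volume first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose down -v   # warning: this deletes the data volume
docker compose up -d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;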

&lt;h3&gt;
  
  
  Shell script initialization
&lt;/h3&gt;

&lt;p&gt;For more complex setup, use shell scripts:&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;init/00-setup.sh&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;

mongosh &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
use admin
db.auth('&lt;/span&gt;&lt;span class="nv"&gt;$MONGO_INITDB_ROOT_USERNAME&lt;/span&gt;&lt;span class="sh"&gt;', '&lt;/span&gt;&lt;span class="nv"&gt;$MONGO_INITDB_ROOT_PASSWORD&lt;/span&gt;&lt;span class="sh"&gt;')

use myapp
db.createCollection('config')
db.config.insertOne({
  key: 'version',
  value: '1.0.0',
  createdAt: new Date()
})
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make it executable (the entrypoint runs executable &lt;code&gt;.sh&lt;/code&gt; files and sources non-executable ones):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x init/00-setup.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Networking
&lt;/h2&gt;

&lt;p&gt;Docker networking controls how containers communicate with each other and the outside world.&lt;/p&gt;

&lt;h3&gt;
  
  
  Default bridge network
&lt;/h3&gt;

&lt;p&gt;Containers on the default bridge network can communicate via IP address but not hostname. For basic development this works fine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; mongodb &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MONGO_INITDB_ROOT_USERNAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MONGO_INITDB_ROOT_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pw mongo:8
docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; mongo:8 mongosh &lt;span class="s2"&gt;"mongodb://admin:pw@&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;docker inspect &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s1"&gt;'{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}'&lt;/span&gt; mongodb&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;:27017/admin"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Custom networks
&lt;/h3&gt;

&lt;p&gt;Custom networks allow hostname-based communication:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker network create myapp-network

docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; mongodb &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt; myapp-network &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MONGO_INITDB_ROOT_USERNAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MONGO_INITDB_ROOT_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pw &lt;span class="se"&gt;\&lt;/span&gt;
  mongo:8

docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt; myapp-network &lt;span class="se"&gt;\&lt;/span&gt;
  mongo:8 &lt;span class="se"&gt;\&lt;/span&gt;
  mongosh &lt;span class="s2"&gt;"mongodb://admin:pw@mongodb:27017/admin"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second container can reach MongoDB using hostname &lt;code&gt;mongodb&lt;/code&gt;.&lt;/p&gt;
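&lt;p&gt;You can confirm that name resolution works on the custom network with any small image (&lt;code&gt;busybox&lt;/code&gt; here is just an example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run --rm --network myapp-network busybox nslookup mongodb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;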

&lt;h3&gt;
  
  
  Compose networking
&lt;/h3&gt;

&lt;p&gt;Docker Compose creates a network automatically. Services communicate by service name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;MONGODB_URI&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongodb://admin:pw@mongodb:27017/myapp?authSource=admin&lt;/span&gt;

  &lt;span class="na"&gt;mongodb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongo:8&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;MONGO_INITDB_ROOT_USERNAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
      &lt;span class="na"&gt;MONGO_INITDB_ROOT_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pw&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Health checks and monitoring
&lt;/h2&gt;

&lt;p&gt;Proper health checks ensure containers are actually ready to serve traffic, not just running.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic health check
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongodb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongo:8&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;MONGO_INITDB_ROOT_USERNAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
      &lt;span class="na"&gt;MONGO_INITDB_ROOT_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secretpassword&lt;/span&gt;
    &lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongosh"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--eval"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db.adminCommand('ping')"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;start_period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check health status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker inspect &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{{.State.Health.Status}}'&lt;/span&gt; mongodb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
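
&lt;p&gt;In deployment scripts it helps to block until the container actually reports healthy, not just running. A minimal polling sketch (the function name, 30-attempt default and 2-second interval are arbitrary choices, not part of Docker):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Poll a container's health status until it reports "healthy".
# Usage: wait_healthy mongodb
wait_healthy() {
  container="$1"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(docker inspect --format='{{.State.Health.Status}}' "$container")
    if [ "$status" = "healthy" ]; then
      echo "$container is healthy"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "$container did not become healthy"
  return 1
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Call &lt;code&gt;wait_healthy mongodb&lt;/code&gt; before running migrations or smoke tests.&lt;/p&gt;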



&lt;h3&gt;
  
  
  Health check with authentication
&lt;/h3&gt;

&lt;p&gt;When authentication is enabled, include credentials in the health check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongosh"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-u"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-p"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secretpassword"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--authenticationDatabase"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--eval"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db.adminCommand('ping')"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
  &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
  &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;start_period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Monitoring with logs
&lt;/h3&gt;

&lt;p&gt;View container logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs mongodb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Follow logs in real-time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs &lt;span class="nt"&gt;-f&lt;/span&gt; mongodb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Limit output to recent entries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs &lt;span class="nt"&gt;--tail&lt;/span&gt; 100 mongodb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enable profiling in your configuration to catch slow operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;operationProfiling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;slowOpThresholdMs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slowOp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mount a volume for logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongodb-data:/data/db&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongodb-logs:/var/log/mongodb&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
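
&lt;p&gt;Mounting the log volume alone isn't enough: the official image logs to stdout by default, so point &lt;code&gt;mongod&lt;/code&gt; at a file under that path in its configuration (a sketch):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that with file logging enabled, &lt;code&gt;docker logs&lt;/code&gt; will no longer show &lt;code&gt;mongod&lt;/code&gt; output.&lt;/p&gt;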



&lt;h2&gt;
  
  
  Backup strategies for Docker MongoDB
&lt;/h2&gt;

&lt;p&gt;Data in containers needs the same backup discipline as a traditional installation. Docker adds a few considerations, but the fundamentals remain the same.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using mongodump in Docker
&lt;/h3&gt;

&lt;p&gt;Run &lt;code&gt;mongodump&lt;/code&gt; inside the container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;mongodb mongodump &lt;span class="nt"&gt;-u&lt;/span&gt; admin &lt;span class="nt"&gt;-p&lt;/span&gt; secretpassword &lt;span class="nt"&gt;--authenticationDatabase&lt;/span&gt; admin &lt;span class="nt"&gt;--out&lt;/span&gt; /dump
docker &lt;span class="nb"&gt;cp &lt;/span&gt;mongodb:/dump ./backup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a specific database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;mongodb mongodump &lt;span class="nt"&gt;-u&lt;/span&gt; admin &lt;span class="nt"&gt;-p&lt;/span&gt; secretpassword &lt;span class="nt"&gt;--authenticationDatabase&lt;/span&gt; admin &lt;span class="nt"&gt;--db&lt;/span&gt; myapp &lt;span class="nt"&gt;--out&lt;/span&gt; /dump
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compressed backup directly to host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;mongodb mongodump &lt;span class="nt"&gt;-u&lt;/span&gt; admin &lt;span class="nt"&gt;-p&lt;/span&gt; secretpassword &lt;span class="nt"&gt;--authenticationDatabase&lt;/span&gt; admin &lt;span class="nt"&gt;--archive&lt;/span&gt; &lt;span class="nt"&gt;--gzip&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; backup.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
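
&lt;p&gt;A backup only counts once a restore has been tested. The archive can be streamed back through &lt;code&gt;mongorestore&lt;/code&gt;; this sketch wraps the command in a function and reuses the example credentials from above. The &lt;code&gt;--drop&lt;/code&gt; flag replaces existing collections, so use it deliberately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Stream a gzip archive from the host back into the container.
# Usage: restore_backup ./backup.gz
restore_backup() {
  cat "$1" | docker exec -i mongodb mongorestore \
    -u admin -p secretpassword \
    --authenticationDatabase admin \
    --archive --gzip --drop
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;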



&lt;h3&gt;
  
  
  Scheduled backups with cron
&lt;/h3&gt;

&lt;p&gt;Create a backup script. It reads the root password from the &lt;code&gt;MONGO_ROOT_PASSWORD&lt;/code&gt; environment variable; since cron runs with a minimal environment, set that variable in the crontab or inside the script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d_%H%M%S&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;BACKUP_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/backups"&lt;/span&gt;
&lt;span class="nv"&gt;CONTAINER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"mongodb"&lt;/span&gt;

docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nv"&gt;$CONTAINER&lt;/span&gt; mongodump &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-u&lt;/span&gt; admin &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MONGO_ROOT_PASSWORD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--authenticationDatabase&lt;/span&gt; admin &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--archive&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gzip&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BACKUP_DIR&lt;/span&gt;&lt;span class="s2"&gt;/mongodb_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.gz"&lt;/span&gt;

&lt;span class="c"&gt;# Keep only last 7 days&lt;/span&gt;
find &lt;span class="nv"&gt;$BACKUP_DIR&lt;/span&gt; &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s2"&gt;"mongodb_*.gz"&lt;/span&gt; &lt;span class="nt"&gt;-mtime&lt;/span&gt; +7 &lt;span class="nt"&gt;-delete&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add to crontab for daily 3 AM backups:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0 3 * * * /usr/local/bin/mongodb-backup.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
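
&lt;p&gt;Retention scripts tend to rot quietly, so it's worth sanity-checking that the newest archive is at least a valid gzip file. A small sketch (the function name is a choice; the file layout matches the script above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Verify the newest archive in a backup directory is valid gzip.
# Usage: check_latest_backup /backups
check_latest_backup() {
  latest=$(ls -t "$1"/mongodb_*.gz | head -n 1)
  if gzip -t "$latest"; then
    echo "backup OK: $latest"
  else
    echo "backup CORRUPT: $latest"
    return 1
  fi
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This only proves the archive is readable, not that its contents are complete; periodic restore tests remain the real verification.&lt;/p&gt;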



&lt;h3&gt;
  
  
  Using Databasus for automated backups
&lt;/h3&gt;

&lt;p&gt;Manual backup scripts work, but they require ongoing maintenance and lack built-in monitoring. Databasus (a dedicated tool for &lt;a href="https://databasus.com/mongodb-backup" rel="noopener noreferrer"&gt;MongoDB backup&lt;/a&gt;) provides automated backups with a web interface, scheduling and notifications.&lt;/p&gt;

&lt;p&gt;Install Databasus on a separate server using Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; databasus &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 4005:4005 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ./databasus-data:/databasus-data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restart&lt;/span&gt; unless-stopped &lt;span class="se"&gt;\&lt;/span&gt;
  databasus/databasus:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with Docker Compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;databasus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;databasus/databasus:latest&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;databasus&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4005:4005"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;databasus-data:/databasus-data&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;databasus-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Access the web interface at &lt;code&gt;http://your-databasus-server:4005&lt;/code&gt;, then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Add your database&lt;/strong&gt; — Click "New Database", select MongoDB, enter your MongoDB server's connection details (host, port, credentials)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select storage&lt;/strong&gt; — Choose local storage, S3, Google Cloud Storage, or other supported destinations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select schedule&lt;/strong&gt; — Set backup frequency: hourly, daily, weekly, or custom cron expression&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Click "Create backup"&lt;/strong&gt; — Databasus handles backup execution, compression, retention and notifications&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Databasus supports multiple notification channels including Slack, Discord, Telegram and email, so you know immediately when backups succeed or fail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Replica sets in Docker
&lt;/h2&gt;

&lt;p&gt;For production environments, running MongoDB as a replica set provides high availability and data redundancy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Single-node replica set
&lt;/h3&gt;

&lt;p&gt;Even a single-node replica set is useful because it enables change streams and transactions. Note that when access control is enabled, &lt;code&gt;mongod&lt;/code&gt; requires a keyfile even for a single-member replica set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongodb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongo:8&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongodb&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--replSet"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rs0"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--bind_ip_all"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;MONGO_INITDB_ROOT_USERNAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
      &lt;span class="na"&gt;MONGO_INITDB_ROOT_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secretpassword&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;27017:27017"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongodb-data:/data/db&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongodb-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Initialize the replica set after starting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; mongodb mongosh &lt;span class="nt"&gt;-u&lt;/span&gt; admin &lt;span class="nt"&gt;-p&lt;/span&gt; secretpassword &lt;span class="nt"&gt;--authenticationDatabase&lt;/span&gt; admin &lt;span class="nt"&gt;--eval&lt;/span&gt; &lt;span class="s2"&gt;"rs.initiate()"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
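
&lt;p&gt;One caveat: &lt;code&gt;rs.initiate()&lt;/code&gt; records the container's hostname as the member address, so a driver connecting from the host may fail during topology discovery. Connecting with &lt;code&gt;directConnection=true&lt;/code&gt; bypasses discovery:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mongodb://admin:secretpassword@localhost:27017/admin?directConnection=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;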



&lt;h3&gt;
  
  
  Three-node replica set
&lt;/h3&gt;

&lt;p&gt;For actual high availability, run three nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongodb-primary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongo:8&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongodb-primary&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--replSet"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rs0"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--bind_ip_all"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--keyFile"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/etc/mongodb/keyfile"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongodb-primary-data:/data/db&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./keyfile:/etc/mongodb/keyfile:ro&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongodb-network&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;

  &lt;span class="na"&gt;mongodb-secondary1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongo:8&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongodb-secondary1&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--replSet"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rs0"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--bind_ip_all"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--keyFile"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/etc/mongodb/keyfile"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongodb-secondary1-data:/data/db&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./keyfile:/etc/mongodb/keyfile:ro&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongodb-network&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongodb-primary&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;

  &lt;span class="na"&gt;mongodb-secondary2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongo:8&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongodb-secondary2&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--replSet"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rs0"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--bind_ip_all"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--keyFile"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/etc/mongodb/keyfile"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongodb-secondary2-data:/data/db&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./keyfile:/etc/mongodb/keyfile:ro&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongodb-network&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongodb-primary&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;

&lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongodb-network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongodb-primary-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongodb-secondary1-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongodb-secondary2-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate the keyfile for internal authentication. Inside the official image, &lt;code&gt;mongod&lt;/code&gt; runs as UID 999, so the file must be owned by that user as well as locked down to mode 400:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openssl rand &lt;span class="nt"&gt;-base64&lt;/span&gt; 756 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; keyfile
&lt;span class="nb"&gt;chmod &lt;/span&gt;400 keyfile
&lt;span class="nb"&gt;sudo chown&lt;/span&gt; 999:999 keyfile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Initialize the replica set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; mongodb-primary mongosh &lt;span class="nt"&gt;--eval&lt;/span&gt; &lt;span class="s2"&gt;"
rs.initiate({
  _id: 'rs0',
  members: [
    { _id: 0, host: 'mongodb-primary:27017', priority: 2 },
    { _id: 1, host: 'mongodb-secondary1:27017', priority: 1 },
    { _id: 2, host: 'mongodb-secondary2:27017', priority: 1 }
  ]
})
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
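
&lt;p&gt;Applications on the same Docker network should connect with a multi-host URI so the driver can discover the replica set and follow failovers (&lt;code&gt;myapp&lt;/code&gt; is an example database name; add credentials once you have created users):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mongodb://mongodb-primary:27017,mongodb-secondary1:27017,mongodb-secondary2:27017/myapp?replicaSet=rs0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;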



&lt;h2&gt;
  
  
  Security considerations
&lt;/h2&gt;

&lt;p&gt;Running databases in containers doesn't reduce security requirements. If anything, you need more attention to configuration details.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enable authentication
&lt;/h3&gt;

&lt;p&gt;Never run MongoDB without authentication in any environment beyond local development:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongodb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongo:8&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;MONGO_INITDB_ROOT_USERNAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
      &lt;span class="na"&gt;MONGO_INITDB_ROOT_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secretpassword&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Secure passwords with secrets
&lt;/h3&gt;

&lt;p&gt;Use environment variables from secrets management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongodb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongo:8&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;MONGO_INITDB_ROOT_USERNAME_FILE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/run/secrets/mongo_username&lt;/span&gt;
      &lt;span class="na"&gt;MONGO_INITDB_ROOT_PASSWORD_FILE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/run/secrets/mongo_password&lt;/span&gt;
    &lt;span class="na"&gt;secrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongo_username&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongo_password&lt;/span&gt;

&lt;span class="na"&gt;secrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongo_username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./secrets/mongo_username.txt&lt;/span&gt;
  &lt;span class="na"&gt;mongo_password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./secrets/mongo_password.txt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Network isolation
&lt;/h3&gt;

&lt;p&gt;Don't expose database ports to the public internet. Use internal Docker networks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;

  &lt;span class="na"&gt;mongodb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;

&lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;frontend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;internal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Resource limits
&lt;/h3&gt;

&lt;p&gt;Prevent runaway containers from consuming all system resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mongodb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongo:8&lt;/span&gt;
    &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4G&lt;/span&gt;
        &lt;span class="na"&gt;reservations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2G&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
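
&lt;p&gt;When capping memory, consider pinning the WiredTiger cache explicitly rather than relying on the size &lt;code&gt;mongod&lt;/code&gt; detects. Roughly half the container limit is a common starting point (a sketch, not a tuned value):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  mongodb:
    image: mongo:8
    command: ["--wiredTigerCacheSizeGB", "2"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;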



&lt;h2&gt;
  
  
  Production checklist
&lt;/h2&gt;

&lt;p&gt;Before running MongoDB Docker containers in production, verify these items:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data persistence configured with volumes&lt;/li&gt;
&lt;li&gt;Authentication enabled with strong passwords&lt;/li&gt;
&lt;li&gt;Custom configuration tuned for workload&lt;/li&gt;
&lt;li&gt;Health checks enabled&lt;/li&gt;
&lt;li&gt;Automated backup strategy in place&lt;/li&gt;
&lt;li&gt;Secrets managed securely (not hardcoded)&lt;/li&gt;
&lt;li&gt;Network properly isolated&lt;/li&gt;
&lt;li&gt;Resource limits set&lt;/li&gt;
&lt;li&gt;Monitoring and alerting configured&lt;/li&gt;
&lt;li&gt;Restart policy set to &lt;code&gt;unless-stopped&lt;/code&gt; or &lt;code&gt;always&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Container image version pinned (not using &lt;code&gt;latest&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Troubleshooting common issues
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Container exits immediately
&lt;/h3&gt;

&lt;p&gt;Check logs for errors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs mongodb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common causes: permission issues on mounted volumes, corrupt data directory, or invalid configuration file syntax.&lt;/p&gt;

&lt;h3&gt;
  
  
  Permission denied on bind mount
&lt;/h3&gt;

&lt;p&gt;Ensure the host directory has correct ownership:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; 999:999 /path/to/data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or run MongoDB with your user ID:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--user&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Can't connect from host
&lt;/h3&gt;

&lt;p&gt;Verify port mapping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker port mongodb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check if MongoDB is listening on all interfaces. The default bind address should be &lt;code&gt;0.0.0.0&lt;/code&gt; in Docker, but verify with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;mongodb mongosh &lt;span class="nt"&gt;--eval&lt;/span&gt; &lt;span class="s2"&gt;"db.adminCommand({getCmdLineOpts: 1})"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Replica set won't initialize
&lt;/h3&gt;

&lt;p&gt;Ensure all nodes can resolve each other's hostnames. When using Docker Compose, services communicate by service name. If using custom hostnames, add them to &lt;code&gt;/etc/hosts&lt;/code&gt; or use Docker's &lt;code&gt;--add-host&lt;/code&gt; option.&lt;/p&gt;
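&lt;p&gt;As a sketch, Compose's &lt;code&gt;extra_hosts&lt;/code&gt; key (the Compose equivalent of &lt;code&gt;--add-host&lt;/code&gt;) can pin custom replica set hostnames to addresses; the hostnames and IPs below are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  mongo1:
    image: mongo:8
    # Hypothetical names -- use the hostnames that appear in your
    # replica set configuration (rs.initiate members).
    extra_hosts:
      - "mongo2.internal:192.0.2.12"
      - "mongo3.internal:192.0.2.13"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;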

&lt;h3&gt;
  
  
  Slow performance
&lt;/h3&gt;

&lt;p&gt;Check if WiredTiger cache is sized correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;mongodb mongosh &lt;span class="nt"&gt;--eval&lt;/span&gt; &lt;span class="s2"&gt;"db.serverStatus().wiredTiger.cache"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Docker Desktop on macOS and Windows, file system performance through volumes can be slow. Use named volumes instead of bind mounts for better performance.&lt;/p&gt;
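&lt;p&gt;A minimal sketch of that switch, assuming an illustrative volume name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  mongodb:
    image: mongo:8
    volumes:
      # Named volume: data stays inside Docker's VM-backed storage,
      # avoiding the slow host file-sharing layer on macOS/Windows.
      - mongo-data:/data/db

volumes:
  mongo-data:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;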

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Running MongoDB in Docker provides consistent environments across development, testing and production. Start with simple &lt;code&gt;docker run&lt;/code&gt; commands for quick setups, then move to Docker Compose for more complex configurations. Always configure data persistence with volumes, set up proper health checks, and implement automated backups. The overhead of containerization is minimal compared to the operational benefits of reproducibility and isolation. Pin your image versions, tune your configuration for your workload, and treat container security with the same rigor as traditional deployments.&lt;/p&gt;

</description>
      <category>database</category>
      <category>mongodb</category>
    </item>
    <item>
      <title>MariaDB Docker setup — Running MariaDB in Docker containers complete guide</title>
      <dc:creator>Piter Adyson</dc:creator>
      <pubDate>Sun, 01 Feb 2026 18:34:39 +0000</pubDate>
      <link>https://forem.com/piteradyson/mariadb-docker-setup-running-mariadb-in-docker-containers-complete-guide-17ba</link>
      <guid>https://forem.com/piteradyson/mariadb-docker-setup-running-mariadb-in-docker-containers-complete-guide-17ba</guid>
      <description>&lt;p&gt;Running MariaDB in Docker simplifies deployment, makes environments reproducible, and allows you to spin up databases in seconds. Whether you need a quick dev environment or a production-ready setup, Docker handles the complexity of installation and configuration. This guide covers everything from basic container setup to production configurations with persistence, custom settings and proper backup strategies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4n485bf59umkeyf4hbdr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4n485bf59umkeyf4hbdr.jpg" alt="MariaDB in Docker" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Why run MariaDB in Docker&lt;/h2&gt;

&lt;p&gt;Traditional MariaDB installation means dealing with package managers, version conflicts, and cleanup when things go wrong. Docker containers provide isolation and consistency that bare-metal installations can't match.&lt;/p&gt;

&lt;h3&gt;Benefits of containerized databases&lt;/h3&gt;

&lt;p&gt;Docker containers package MariaDB with all its dependencies. You get the same behavior on your laptop, CI server, and production environment. No more "works on my machine" problems.&lt;/p&gt;

&lt;p&gt;Containers start in seconds. Spinning up a fresh MariaDB instance takes about 5-10 seconds compared to minutes for traditional installation. This matters when running integration tests or switching between projects.&lt;/p&gt;

&lt;p&gt;Cleanup is trivial. Delete the container and it's gone. No leftover configuration files, no orphaned data directories, no package conflicts with other software.&lt;/p&gt;

&lt;h3&gt;When Docker makes sense&lt;/h3&gt;

&lt;p&gt;Docker works well for development environments where you need quick setup and teardown. It's also suitable for microservices architectures where each service gets its own database. Testing and CI pipelines benefit from reproducible database instances.&lt;/p&gt;

&lt;p&gt;For production, Docker adds complexity but provides consistency across environments. You trade some raw performance for operational benefits. The overhead is typically 1-3% for database workloads, which is acceptable for most applications.&lt;/p&gt;

&lt;h2&gt;Quick start with Docker run&lt;/h2&gt;

&lt;p&gt;The fastest way to get MariaDB running is a single Docker command. This works for testing and development.&lt;/p&gt;

&lt;h3&gt;Basic container setup&lt;/h3&gt;

&lt;p&gt;Start MariaDB with minimal configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run -d \
  --name mariadb \
  -e MYSQL_ROOT_PASSWORD=my-secret-pw \
  mariadb:11
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This starts MariaDB 11 in the background with the root password set. The container runs until you stop it.&lt;/p&gt;

&lt;p&gt;Check if it's running:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Connect to the database:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker exec -it mariadb mariadb -u root -p
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Environment variables for initial setup&lt;/h3&gt;

&lt;p&gt;MariaDB's Docker image supports several environment variables for first-run configuration:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Required&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;MYSQL_ROOT_PASSWORD&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Root user password&lt;/td&gt;&lt;td&gt;Yes (or use one of the alternatives)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;MYSQL_ALLOW_EMPTY_PASSWORD&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Allow empty root password&lt;/td&gt;&lt;td&gt;No&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;MYSQL_RANDOM_ROOT_PASSWORD&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Generate random root password&lt;/td&gt;&lt;td&gt;No&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;MYSQL_DATABASE&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Create database on startup&lt;/td&gt;&lt;td&gt;No&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;MYSQL_USER&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Create non-root user&lt;/td&gt;&lt;td&gt;No&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;MYSQL_PASSWORD&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Password for non-root user&lt;/td&gt;&lt;td&gt;No&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Create a database and user on startup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run -d \
  --name mariadb \
  -e MYSQL_ROOT_PASSWORD=rootpassword \
  -e MYSQL_DATABASE=myapp \
  -e MYSQL_USER=appuser \
  -e MYSQL_PASSWORD=apppassword \
  mariadb:11
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Exposing ports&lt;/h3&gt;

&lt;p&gt;By default, MariaDB runs on port 3306 inside the container. Map it to your host:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run -d \
  --name mariadb \
  -p 3306:3306 \
  -e MYSQL_ROOT_PASSWORD=my-secret-pw \
  mariadb:11
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now you can connect from your host machine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mariadb -h 127.0.0.1 -u root -p
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Use a different host port if 3306 is already in use:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run -d \
  --name mariadb \
  -p 3307:3306 \
  -e MYSQL_ROOT_PASSWORD=my-secret-pw \
  mariadb:11
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;Data persistence with volumes&lt;/h2&gt;

&lt;p&gt;Without volumes, your data disappears when the container is removed. That's fine for throwaway test databases, but production needs persistence.&lt;/p&gt;

&lt;h3&gt;Named volumes&lt;/h3&gt;

&lt;p&gt;Docker named volumes are the simplest approach:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run -d \
  --name mariadb \
  -v mariadb-data:/var/lib/mysql \
  -e MYSQL_ROOT_PASSWORD=my-secret-pw \
  mariadb:11
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The volume &lt;code&gt;mariadb-data&lt;/code&gt; persists even after container deletion. List your volumes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker volume ls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Inspect volume details:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker volume inspect mariadb-data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Bind mounts&lt;/h3&gt;

&lt;p&gt;Bind mounts map a host directory into the container. Useful when you need direct access to data files:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run -d \
  --name mariadb \
  -v /path/to/data:/var/lib/mysql \
  -e MYSQL_ROOT_PASSWORD=my-secret-pw \
  mariadb:11
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Make sure the directory exists and has proper permissions. On Linux, the MySQL user inside the container needs write access:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mkdir -p /path/to/data
chown -R 999:999 /path/to/data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The UID 999 is the MySQL user inside the MariaDB container.&lt;/p&gt;

&lt;h3&gt;Volume backup&lt;/h3&gt;

&lt;p&gt;Back up a named volume by running a temporary container:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run --rm \
  -v mariadb-data:/source:ro \
  -v $(pwd):/backup \
  alpine tar czf /backup/mariadb-backup.tar.gz -C /source .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This creates a compressed archive of the entire data directory. Stop the container first if you need a consistent file-level copy of a busy database. For proper database backups, use mysqldump or mariabackup instead, which we'll cover later.&lt;/p&gt;

&lt;h2&gt;Docker Compose for MariaDB&lt;/h2&gt;

&lt;p&gt;Docker Compose makes multi-container setups manageable and configurations version-controlled.&lt;/p&gt;

&lt;h3&gt;Basic compose file&lt;/h3&gt;

&lt;p&gt;Create a &lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  mariadb:
    image: mariadb:11
    container_name: mariadb
    environment:
      MYSQL_ROOT_PASSWORD: my-secret-pw
      MYSQL_DATABASE: myapp
      MYSQL_USER: appuser
      MYSQL_PASSWORD: apppassword
    ports:
      - "3306:3306"
    volumes:
      - mariadb-data:/var/lib/mysql
    restart: unless-stopped

volumes:
  mariadb-data:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Start the service:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up -d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Stop and remove:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Remove including volumes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose down -v
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Application with MariaDB&lt;/h3&gt;

&lt;p&gt;A typical setup includes your application and MariaDB together:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  app:
    build: .
    environment:
      DATABASE_URL: mysql://appuser:apppassword@mariadb:3306/myapp
    depends_on:
      mariadb:
        condition: service_healthy
    ports:
      - "8080:8080"

  mariadb:
    image: mariadb:11
    container_name: mariadb
    environment:
      MYSQL_ROOT_PASSWORD: rootpassword
      MYSQL_DATABASE: myapp
      MYSQL_USER: appuser
      MYSQL_PASSWORD: apppassword
    volumes:
      - mariadb-data:/var/lib/mysql
    healthcheck:
      test: ["CMD", "healthcheck.sh", "--connect", "--innodb_initialized"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

volumes:
  mariadb-data:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;depends_on&lt;/code&gt; with &lt;code&gt;condition: service_healthy&lt;/code&gt; ensures your app waits for MariaDB to be ready before starting.&lt;/p&gt;

&lt;h2&gt;Custom configuration&lt;/h2&gt;

&lt;p&gt;Default settings work for development but production often needs tuning.&lt;/p&gt;

&lt;h3&gt;Configuration file mount&lt;/h3&gt;

&lt;p&gt;Create a custom configuration file &lt;code&gt;my.cnf&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;[mysqld]
# InnoDB settings
innodb_buffer_pool_size = 1G
innodb_log_file_size = 256M
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT

# Connection settings
max_connections = 200
wait_timeout = 600

# Query cache (disabled in MariaDB 10.1.7+)
query_cache_type = 0

# Logging
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 2

# Character set
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Mount it into the container:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run -d \
  --name mariadb \
  -v ./my.cnf:/etc/mysql/conf.d/custom.cnf:ro \
  -v mariadb-data:/var/lib/mysql \
  -e MYSQL_ROOT_PASSWORD=my-secret-pw \
  mariadb:11
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Files in &lt;code&gt;/etc/mysql/conf.d/&lt;/code&gt; are read automatically.&lt;/p&gt;

&lt;h3&gt;Docker Compose with custom config&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  mariadb:
    image: mariadb:11
    container_name: mariadb
    environment:
      MYSQL_ROOT_PASSWORD: my-secret-pw
    volumes:
      - mariadb-data:/var/lib/mysql
      - ./my.cnf:/etc/mysql/conf.d/custom.cnf:ro
    ports:
      - "3306:3306"
    restart: unless-stopped

volumes:
  mariadb-data:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Common configuration options&lt;/h3&gt;

&lt;p&gt;Key settings to consider for production:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Production recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;innodb_buffer_pool_size&lt;/code&gt;&lt;/td&gt;&lt;td&gt;128M&lt;/td&gt;&lt;td&gt;50-70% of available RAM&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;innodb_log_file_size&lt;/code&gt;&lt;/td&gt;&lt;td&gt;48M&lt;/td&gt;&lt;td&gt;256M-1G depending on write load&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;max_connections&lt;/code&gt;&lt;/td&gt;&lt;td&gt;151&lt;/td&gt;&lt;td&gt;Based on expected concurrent connections&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;innodb_flush_log_at_trx_commit&lt;/code&gt;&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1 for durability, 2 for performance&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Verify your configuration is applied:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker exec mariadb mariadb -u root -p -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;Initialization scripts&lt;/h2&gt;

&lt;p&gt;The MariaDB Docker image can run SQL scripts on first startup. This is useful for schema creation and seed data.&lt;/p&gt;

&lt;h3&gt;SQL initialization&lt;/h3&gt;

&lt;p&gt;Place &lt;code&gt;.sql&lt;/code&gt;, &lt;code&gt;.sql.gz&lt;/code&gt;, or &lt;code&gt;.sql.xz&lt;/code&gt; files in &lt;code&gt;/docker-entrypoint-initdb.d/&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;init/01-schema.sql&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE IF NOT EXISTS users (
    id INT AUTO_INCREMENT PRIMARY KEY,
    email VARCHAR(255) NOT NULL UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS posts (
    id INT AUTO_INCREMENT PRIMARY KEY,
    user_id INT NOT NULL,
    title VARCHAR(255) NOT NULL,
    content TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES users(id)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Create &lt;code&gt;init/02-seed.sql&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;INSERT INTO users (email) VALUES ('admin@example.com');
INSERT INTO users (email) VALUES ('user@example.com');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Mount the init directory:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  mariadb:
    image: mariadb:11
    environment:
      MYSQL_ROOT_PASSWORD: my-secret-pw
      MYSQL_DATABASE: myapp
    volumes:
      - mariadb-data:/var/lib/mysql
      - ./init:/docker-entrypoint-initdb.d:ro
    restart: unless-stopped

volumes:
  mariadb-data:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Scripts run in alphabetical order, only on first container start (when the data directory is empty).&lt;/p&gt;
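&lt;p&gt;Numeric prefixes make that order explicit. The init directory above executes as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;init/
├── 01-schema.sql   # runs first
└── 02-seed.sql     # runs second
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;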

&lt;h3&gt;Shell script initialization&lt;/h3&gt;

&lt;p&gt;For more complex setup, use shell scripts:&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;init/00-setup.sh&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/bash
set -e

mariadb -u root -p"$MYSQL_ROOT_PASSWORD" &amp;lt;&amp;lt;-EOSQL
    CREATE USER IF NOT EXISTS 'readonly'@'%' IDENTIFIED BY 'readonlypassword';
    GRANT SELECT ON myapp.* TO 'readonly'@'%';
    FLUSH PRIVILEGES;
EOSQL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Make it executable:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;chmod +x init/00-setup.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;Networking&lt;/h2&gt;

&lt;p&gt;Docker networking controls how containers communicate with each other and the outside world.&lt;/p&gt;

&lt;h3&gt;Default bridge network&lt;/h3&gt;

&lt;p&gt;Containers on the default bridge network can communicate via IP address but not hostname. For development this usually works fine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run -d --name mariadb -e MYSQL_ROOT_PASSWORD=pw mariadb:11
docker run -it --rm mariadb:11 mariadb -h $(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' mariadb) -u root -p
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Custom networks&lt;/h3&gt;

&lt;p&gt;Custom networks allow hostname-based communication:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker network create myapp-network

docker run -d \
  --name mariadb \
  --network myapp-network \
  -e MYSQL_ROOT_PASSWORD=pw \
  mariadb:11

docker run -it --rm \
  --network myapp-network \
  mariadb:11 \
  mariadb -h mariadb -u root -p
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The second container can reach MariaDB using the hostname &lt;code&gt;mariadb&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;Compose networking&lt;/h3&gt;

&lt;p&gt;Docker Compose creates a network automatically. Services communicate by service name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  app:
    image: myapp
    environment:
      DB_HOST: mariadb # Use service name as hostname
      DB_PORT: 3306

  mariadb:
    image: mariadb:11
    environment:
      MYSQL_ROOT_PASSWORD: pw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;Health checks and monitoring&lt;/h2&gt;

&lt;p&gt;Proper health checks ensure containers are actually ready to serve traffic, not just running.&lt;/p&gt;

&lt;h3&gt;Built-in health check&lt;/h3&gt;

&lt;p&gt;MariaDB's Docker image includes a health check script:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  mariadb:
    image: mariadb:11
    environment:
      MYSQL_ROOT_PASSWORD: my-secret-pw
    healthcheck:
      test: ["CMD", "healthcheck.sh", "--connect", "--innodb_initialized"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Check health status:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker inspect --format='{{.State.Health.Status}}' mariadb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Custom health check&lt;/h3&gt;

&lt;p&gt;For specific requirements, write custom checks. Use &lt;code&gt;CMD-SHELL&lt;/code&gt; so the password variable expands inside the container (the exec form of &lt;code&gt;CMD&lt;/code&gt; does not run a shell):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;healthcheck:
  # $$ escapes the dollar sign in Compose; the container's shell
  # then expands $MYSQL_ROOT_PASSWORD at check time.
  test: ["CMD-SHELL", "mariadb -u root -p$$MYSQL_ROOT_PASSWORD -e 'SELECT 1'"]
  interval: 10s
  timeout: 5s
  retries: 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Monitoring with logs&lt;/h3&gt;

&lt;p&gt;View container logs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs mariadb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Follow logs in real-time:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs -f mariadb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Limit output to recent entries:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs --tail 100 mariadb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Enable slow query logging in your configuration to catch performance issues:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Mount a volume for logs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;volumes:
  - mariadb-data:/var/lib/mysql
  - mariadb-logs:/var/log/mysql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;Backup strategies for Docker MariaDB&lt;/h2&gt;

&lt;p&gt;Data in containers needs the same backup discipline as traditional installations. Docker adds some nuances but the fundamentals remain.&lt;/p&gt;

&lt;h3&gt;Using mysqldump in Docker&lt;/h3&gt;

&lt;p&gt;Run mysqldump inside the container:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker exec mariadb mysqldump -u root -p"$MYSQL_ROOT_PASSWORD" myapp &amp;gt; backup.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For all databases:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker exec mariadb mysqldump -u root -p"$MYSQL_ROOT_PASSWORD" --all-databases &amp;gt; all_databases.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Compressed backup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker exec mariadb mysqldump -u root -p"$MYSQL_ROOT_PASSWORD" myapp | gzip &amp;gt; backup.sql.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Scheduled backups with cron&lt;/h3&gt;

&lt;p&gt;Create a backup script:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/bash

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups"
CONTAINER="mariadb"

docker exec $CONTAINER mysqldump -u root -p"$MYSQL_ROOT_PASSWORD" --all-databases | \
  gzip &amp;gt; "$BACKUP_DIR/mariadb_${TIMESTAMP}.sql.gz"

# Keep only last 7 days
find $BACKUP_DIR -name "mariadb_*.sql.gz" -mtime +7 -delete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Add to crontab for daily 3 AM backups:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;0 3 * * * /usr/local/bin/mariadb-backup.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Using Databasus for automated backups&lt;/h3&gt;

&lt;p&gt;Manual backup scripts work but require maintenance and lack monitoring. Databasus provides automated MariaDB backups with a web interface, scheduling, and notifications.&lt;/p&gt;

&lt;p&gt;Install Databasus on a separate server using Docker:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run -d \
  --name databasus \
  -p 4005:4005 \
  -v ./databasus-data:/databasus-data \
  --restart unless-stopped \
  databasus/databasus:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Or with Docker Compose:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  databasus:
    image: databasus/databasus:latest
    container_name: databasus
    ports:
      - "4005:4005"
    volumes:
      - databasus-data:/databasus-data
    restart: unless-stopped

volumes:
  databasus-data:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Access the web interface at &lt;a href="http://localhost:4005" rel="noopener noreferrer"&gt;http://localhost:4005&lt;/a&gt;, then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add your database — Click "New Database", select MariaDB, enter your MariaDB server's connection details (host, port, credentials)&lt;/li&gt;
&lt;li&gt;Select storage — Choose local storage, S3, Google Cloud Storage, or other supported destinations&lt;/li&gt;
&lt;li&gt;Select schedule — Set backup frequency: hourly, daily, weekly, or custom cron expression&lt;/li&gt;
&lt;li&gt;Click "Create backup" — Databasus handles backup execution, compression, retention and notifications&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Databasus supports multiple notification channels including Slack, Discord, Telegram and email, so you know immediately when backups succeed or fail.&lt;/p&gt;

&lt;h2&gt;Security considerations&lt;/h2&gt;

&lt;p&gt;Running databases in containers doesn't change security requirements. If anything, you need to be more careful about configuration.&lt;/p&gt;

&lt;h3&gt;Secure root password&lt;/h3&gt;

&lt;p&gt;Never use default or weak passwords. Use environment variables from secrets management:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  mariadb:
    image: mariadb:11
    environment:
      MYSQL_ROOT_PASSWORD_FILE: /run/secrets/mariadb_root_password
    secrets:
      - mariadb_root_password

secrets:
  mariadb_root_password:
    file: ./secrets/mariadb_root_password.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For Docker Swarm, use proper Docker secrets:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;echo "my-secret-pw" | docker secret create mariadb_root_password -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Network isolation&lt;/h3&gt;

&lt;p&gt;Don't expose database ports to the public internet. Use internal Docker networks:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  app:
    networks:
      - frontend
      - backend

  mariadb:
    networks:
      - backend # Only accessible from backend network

networks:
  frontend:
  backend:
    internal: true # No external access
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Read-only root filesystem&lt;/h3&gt;

&lt;p&gt;For extra security, run with a read-only root filesystem:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  mariadb:
    image: mariadb:11
    read_only: true
    tmpfs:
      - /tmp
      - /run/mysqld
    volumes:
      - mariadb-data:/var/lib/mysql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Resource limits&lt;/h3&gt;

&lt;p&gt;Prevent runaway containers from consuming all resources:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  mariadb:
    image: mariadb:11
    deploy:
      resources:
        limits:
          cpus: "2"
          memory: 4G
        reservations:
          cpus: "1"
          memory: 2G
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;Production checklist&lt;/h2&gt;

&lt;p&gt;Before running MariaDB Docker containers in production, verify these items:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data persistence configured with volumes&lt;/li&gt;
&lt;li&gt;Custom configuration tuned for workload&lt;/li&gt;
&lt;li&gt;Health checks enabled&lt;/li&gt;
&lt;li&gt;Automated backup strategy in place&lt;/li&gt;
&lt;li&gt;Secrets managed securely (not hardcoded)&lt;/li&gt;
&lt;li&gt;Network properly isolated&lt;/li&gt;
&lt;li&gt;Resource limits set&lt;/li&gt;
&lt;li&gt;Monitoring and alerting configured&lt;/li&gt;
&lt;li&gt;Restart policy set to &lt;code&gt;unless-stopped&lt;/code&gt; or &lt;code&gt;always&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Container image version pinned (not using &lt;code&gt;latest&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Troubleshooting common issues&lt;/h2&gt;

&lt;h3&gt;Container exits immediately&lt;/h3&gt;

&lt;p&gt;Check logs for errors:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs mariadb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Common causes: missing &lt;code&gt;MYSQL_ROOT_PASSWORD&lt;/code&gt;, corrupt data directory, or permission issues on mounted volumes.&lt;/p&gt;

&lt;h3&gt;Permission denied on bind mount&lt;/h3&gt;

&lt;p&gt;Ensure the host directory has correct ownership:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sudo chown -R 999:999 /path/to/data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Or run MariaDB with your user ID:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run -d --user $(id -u):$(id -g) ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Can't connect from host&lt;/h3&gt;

&lt;p&gt;Verify port mapping:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker port mariadb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Check if MariaDB is listening on all interfaces. By default it should, but verify with:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker exec mariadb mariadb -u root -p -e "SHOW VARIABLES LIKE 'bind_address';"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Slow performance&lt;/h3&gt;

&lt;p&gt;Check if the InnoDB buffer pool is sized correctly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker exec mariadb mariadb -u root -p -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For Docker Desktop on macOS/Windows, file system performance through volumes can be slow. Use named volumes instead of bind mounts for better performance.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Running MariaDB in Docker provides consistent environments across development, testing and production. Start with simple &lt;code&gt;docker run&lt;/code&gt; commands for quick setups, then move to Docker Compose for more complex configurations. Always configure data persistence with volumes, set up proper health checks, and implement automated backups. The overhead of containerization is minimal compared to the operational benefits of reproducibility and isolation. Pin your image versions, tune your configuration for your workload, and treat container security with the same rigor as traditional deployments.&lt;/p&gt;

</description>
      <category>database</category>
      <category>mariadb</category>
    </item>
    <item>
      <title>MySQL backup to cloud — Backing up MySQL databases to AWS S3 and Google Cloud</title>
      <dc:creator>Piter Adyson</dc:creator>
      <pubDate>Thu, 29 Jan 2026 08:24:10 +0000</pubDate>
      <link>https://forem.com/piteradyson/mysql-backup-to-cloud-backing-up-mysql-databases-to-aws-s3-and-google-cloud-4fhc</link>
      <guid>https://forem.com/piteradyson/mysql-backup-to-cloud-backing-up-mysql-databases-to-aws-s3-and-google-cloud-4fhc</guid>
      <description>&lt;p&gt;Storing MySQL backups on the same server as your database is asking for trouble. If the server fails, you lose both your data and your backups. Cloud storage solves this by keeping backups offsite, automatically replicated across multiple data centers. This guide covers how to back up MySQL databases to AWS S3 and Google Cloud Storage, from manual uploads to fully automated pipelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv8u3wrhgvtg4447rs1w.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv8u3wrhgvtg4447rs1w.jpg" alt="MySQL cloud backup" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why cloud storage for MySQL backups
&lt;/h2&gt;

&lt;p&gt;Local backups work until they don't. A disk failure, ransomware attack, or datacenter issue can wipe out everything on one machine. Cloud storage provides geographic redundancy and durability that local storage can't match.&lt;/p&gt;

&lt;h3&gt;
  
  
  Durability and availability
&lt;/h3&gt;

&lt;p&gt;AWS S3 offers 99.999999999% (11 nines) durability. Google Cloud Storage provides similar guarantees. That means if you store 10 million objects, you'd statistically lose one every 10,000 years. Compare that to a single hard drive with 1-3% annual failure rate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost efficiency
&lt;/h3&gt;

&lt;p&gt;Cloud storage costs less than maintaining redundant local infrastructure. A terabyte on S3 Standard costs about $23/month. Infrequent access tiers drop to $12.50/month. Glacier deep archive goes as low as $0.99/month. For backups you rarely access, cold storage is remarkably cheap.&lt;/p&gt;

&lt;h3&gt;
  
  
  Operational simplicity
&lt;/h3&gt;

&lt;p&gt;No hardware to manage, no capacity planning, no failed disks to replace. Upload your backups and the cloud provider handles replication, durability and availability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating MySQL backups with mysqldump
&lt;/h2&gt;

&lt;p&gt;Before uploading to cloud storage, you need a backup file. &lt;code&gt;mysqldump&lt;/code&gt; is the standard tool for MySQL logical backups.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic mysqldump usage
&lt;/h3&gt;

&lt;p&gt;Create a full database backup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mysqldump &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;--single-transaction&lt;/span&gt; &lt;span class="nt"&gt;--databases&lt;/span&gt; mydb &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; mydb_backup.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For InnoDB tables, the &lt;code&gt;--single-transaction&lt;/code&gt; flag takes a consistent snapshot without locking them; the guarantee does not extend to non-transactional engines such as MyISAM.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compressed backups
&lt;/h3&gt;

&lt;p&gt;Compress the backup to reduce upload time and storage costs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mysqldump &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;--single-transaction&lt;/span&gt; &lt;span class="nt"&gt;--databases&lt;/span&gt; mydb | &lt;span class="nb"&gt;gzip&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; mydb_backup.sql.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compression typically reduces MySQL dump files by 70-90%, depending on data content. A 10GB dump might compress to 1-2GB.&lt;/p&gt;
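You can get a feel for the ratio on synthetic data; real-world ratios depend heavily on how repetitive your rows are (the table and file names below are made up for the demonstration):

```shell
# Generate a few MB of repetitive INSERT statements and compare sizes after gzip.
seq 1 50000 | sed 's/^/INSERT INTO orders VALUES (/; s/$/, "pending", "2024-01-01");/' > sample.sql
gzip -c sample.sql > sample.sql.gz
orig=$(wc -c < sample.sql)
comp=$(wc -c < sample.sql.gz)
echo "original: ${orig} bytes, compressed: ${comp} bytes"
rm sample.sql sample.sql.gz
```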

&lt;h3&gt;
  
  
  All databases backup
&lt;/h3&gt;

&lt;p&gt;Back up all databases on the server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mysqldump &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;--single-transaction&lt;/span&gt; &lt;span class="nt"&gt;--all-databases&lt;/span&gt; | &lt;span class="nb"&gt;gzip&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; all_databases.sql.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Backup with timestamp
&lt;/h3&gt;

&lt;p&gt;Include timestamps in filenames for easier management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d_%H%M%S&lt;span class="si"&gt;)&lt;/span&gt;
mysqldump &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;--single-transaction&lt;/span&gt; &lt;span class="nt"&gt;--databases&lt;/span&gt; mydb | &lt;span class="nb"&gt;gzip&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; mydb_&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;.sql.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Uploading to AWS S3
&lt;/h2&gt;

&lt;p&gt;AWS S3 is the most widely used object storage service. Getting backups there requires the AWS CLI and proper credentials.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up AWS CLI
&lt;/h3&gt;

&lt;p&gt;Install the AWS CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Linux/macOS&lt;/span&gt;
curl &lt;span class="s2"&gt;"https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"awscliv2.zip"&lt;/span&gt;
unzip awscliv2.zip
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./aws/install

&lt;span class="c"&gt;# Verify installation&lt;/span&gt;
aws &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure credentials:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws configure
&lt;span class="c"&gt;# Enter your AWS Access Key ID&lt;/span&gt;
&lt;span class="c"&gt;# Enter your AWS Secret Access Key&lt;/span&gt;
&lt;span class="c"&gt;# Enter default region (e.g., us-east-1)&lt;/span&gt;
&lt;span class="c"&gt;# Enter output format (json)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Creating an S3 bucket
&lt;/h3&gt;

&lt;p&gt;Create a bucket for your backups:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 mb s3://my-mysql-backups &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enable versioning to protect against accidental deletions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3api put-bucket-versioning &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--bucket&lt;/span&gt; my-mysql-backups &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--versioning-configuration&lt;/span&gt; &lt;span class="nv"&gt;Status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Uploading backup files
&lt;/h3&gt;

&lt;p&gt;Upload a single backup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;mydb_backup.sql.gz s3://my-mysql-backups/daily/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Upload with metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;mydb_backup.sql.gz s3://my-mysql-backups/daily/ &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--metadata&lt;/span&gt; &lt;span class="s2"&gt;"database=mydb,created=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-Iseconds&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multipart uploads for large files
&lt;/h3&gt;

&lt;p&gt;S3 caps single PUT uploads at 5GB, so larger backups must go through multipart upload, which also lets failed parts be retried individually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;large_backup.sql.gz s3://my-mysql-backups/ &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--expected-size&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;stat&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt;%z large_backup.sql.gz&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AWS CLI switches to multipart uploads automatically for large files. The &lt;code&gt;--expected-size&lt;/code&gt; hint matters mainly when streaming from stdin, where the CLI cannot determine the size in advance.&lt;/p&gt;
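One portability note: `stat -f%z` is the BSD/macOS syntax and fails on GNU/Linux, where the equivalent is `stat -c%s`. A form that works on both (the filename is illustrative):

```shell
# GNU stat uses -c%s, BSD/macOS stat uses -f%z; try GNU first, fall back to BSD.
size=$(stat -c%s large_backup.sql.gz 2>/dev/null || stat -f%z large_backup.sql.gz)
echo "$size"
```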

&lt;h3&gt;
  
  
  Combined backup and upload script
&lt;/h3&gt;

&lt;p&gt;Backup and upload in one operation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d_%H%M%S&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;BUCKET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"s3://my-mysql-backups"&lt;/span&gt;
&lt;span class="nv"&gt;DB_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"mydb"&lt;/span&gt;

&lt;span class="c"&gt;# Backup and compress&lt;/span&gt;
mysqldump &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MYSQL_PASSWORD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--single-transaction&lt;/span&gt; &lt;span class="nt"&gt;--databases&lt;/span&gt; &lt;span class="nv"&gt;$DB_NAME&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nb"&gt;gzip&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
    aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; - &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BUCKET&lt;/span&gt;&lt;span class="s2"&gt;/daily/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DB_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.sql.gz"&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Backup uploaded to &lt;/span&gt;&lt;span class="nv"&gt;$BUCKET&lt;/span&gt;&lt;span class="s2"&gt;/daily/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DB_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.sql.gz"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;-&lt;/code&gt; passed to &lt;code&gt;aws s3 cp&lt;/code&gt; tells it to read from stdin, so the dump is piped straight from mysqldump to S3 without first writing a local file. This saves both disk space and time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Uploading to Google Cloud Storage
&lt;/h2&gt;

&lt;p&gt;Google Cloud Storage (GCS) offers similar capabilities with different tooling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up gsutil
&lt;/h3&gt;

&lt;p&gt;Install the Google Cloud SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Linux/macOS&lt;/span&gt;
curl https://sdk.cloud.google.com | bash
&lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;$SHELL&lt;/span&gt;
gcloud init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Authenticate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud auth login
gcloud config &lt;span class="nb"&gt;set &lt;/span&gt;project your-project-id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Creating a GCS bucket
&lt;/h3&gt;

&lt;p&gt;Create a bucket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gsutil mb &lt;span class="nt"&gt;-l&lt;/span&gt; us-central1 gs://my-mysql-backups
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enable versioning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gsutil versioning &lt;span class="nb"&gt;set &lt;/span&gt;on gs://my-mysql-backups
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Uploading backup files
&lt;/h3&gt;

&lt;p&gt;Upload a backup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gsutil &lt;span class="nb"&gt;cp &lt;/span&gt;mydb_backup.sql.gz gs://my-mysql-backups/daily/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Upload with parallel composite uploads for large files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gsutil &lt;span class="nt"&gt;-o&lt;/span&gt; GSUtil:parallel_composite_upload_threshold&lt;span class="o"&gt;=&lt;/span&gt;100M &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nb"&gt;cp &lt;/span&gt;large_backup.sql.gz gs://my-mysql-backups/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Streaming upload to GCS
&lt;/h3&gt;

&lt;p&gt;Stream directly from mysqldump to GCS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d_%H%M%S&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;BUCKET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gs://my-mysql-backups"&lt;/span&gt;
&lt;span class="nv"&gt;DB_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"mydb"&lt;/span&gt;

mysqldump &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MYSQL_PASSWORD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--single-transaction&lt;/span&gt; &lt;span class="nt"&gt;--databases&lt;/span&gt; &lt;span class="nv"&gt;$DB_NAME&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nb"&gt;gzip&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
    gsutil &lt;span class="nb"&gt;cp&lt;/span&gt; - &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BUCKET&lt;/span&gt;&lt;span class="s2"&gt;/daily/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DB_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.sql.gz"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Storage classes and cost optimization
&lt;/h2&gt;

&lt;p&gt;Both AWS and GCS offer multiple storage classes at different price points. Choosing the right class can significantly reduce costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS S3 storage classes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Storage class&lt;/th&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Price per GB/month&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;S3 Standard&lt;/td&gt;
&lt;td&gt;Frequently accessed backups&lt;/td&gt;
&lt;td&gt;$0.023&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 Standard-IA&lt;/td&gt;
&lt;td&gt;Backups accessed monthly&lt;/td&gt;
&lt;td&gt;$0.0125&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 One Zone-IA&lt;/td&gt;
&lt;td&gt;Non-critical backups&lt;/td&gt;
&lt;td&gt;$0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 Glacier Instant Retrieval&lt;/td&gt;
&lt;td&gt;Archive with quick access&lt;/td&gt;
&lt;td&gt;$0.004&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 Glacier Deep Archive&lt;/td&gt;
&lt;td&gt;Long-term archive&lt;/td&gt;
&lt;td&gt;$0.00099&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For most backup use cases, S3 Standard-IA provides a good balance: you get immediate access when needed but pay less for storage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Cloud Storage classes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Storage class&lt;/th&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Price per GB/month&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard&lt;/td&gt;
&lt;td&gt;Frequently accessed&lt;/td&gt;
&lt;td&gt;$0.020&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nearline&lt;/td&gt;
&lt;td&gt;Accessed once per month&lt;/td&gt;
&lt;td&gt;$0.010&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coldline&lt;/td&gt;
&lt;td&gt;Accessed once per quarter&lt;/td&gt;
&lt;td&gt;$0.004&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Archive&lt;/td&gt;
&lt;td&gt;Accessed once per year&lt;/td&gt;
&lt;td&gt;$0.0012&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Nearline works well for regular backup retention. Archive suits compliance requirements where you keep backups for years but rarely restore.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting storage class on upload
&lt;/h3&gt;

&lt;p&gt;Upload directly to a specific storage class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# AWS S3&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;backup.sql.gz s3://my-mysql-backups/ &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--storage-class&lt;/span&gt; STANDARD_IA

&lt;span class="c"&gt;# Google Cloud Storage&lt;/span&gt;
gsutil &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"GSUtil:default_storage_class=NEARLINE"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nb"&gt;cp &lt;/span&gt;backup.sql.gz gs://my-mysql-backups/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Automating backups with cron
&lt;/h2&gt;

&lt;p&gt;Manual backups get forgotten. Cron automation ensures consistent execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic cron backup script
&lt;/h3&gt;

&lt;p&gt;Create a backup script at &lt;code&gt;/usr/local/bin/mysql-backup-to-s3.sh&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;

&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d_%H%M%S&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;BUCKET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"s3://my-mysql-backups"&lt;/span&gt;
&lt;span class="nv"&gt;DB_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"production"&lt;/span&gt;
&lt;span class="nv"&gt;LOG_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/var/log/mysql-backup.log"&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;: Starting backup of &lt;/span&gt;&lt;span class="nv"&gt;$DB_NAME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$LOG_FILE&lt;/span&gt;

mysqldump &lt;span class="nt"&gt;-u&lt;/span&gt; backup_user &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MYSQL_BACKUP_PASSWORD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--single-transaction&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--routines&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--triggers&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--databases&lt;/span&gt; &lt;span class="nv"&gt;$DB_NAME&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nb"&gt;gzip&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
    aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; - &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BUCKET&lt;/span&gt;&lt;span class="s2"&gt;/daily/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DB_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.sql.gz"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--storage-class&lt;/span&gt; STANDARD_IA

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;: Backup completed"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$LOG_FILE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make it executable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x /usr/local/bin/mysql-backup-to-s3.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cron schedule
&lt;/h3&gt;

&lt;p&gt;Add to crontab:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;crontab &lt;span class="nt"&gt;-e&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run daily at 3 AM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0 3 * * * MYSQL_BACKUP_PASSWORD='yourpassword' /usr/local/bin/mysql-backup-to-s3.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For hourly backups:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0 * * * * MYSQL_BACKUP_PASSWORD='yourpassword' /usr/local/bin/mysql-backup-to-s3.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Handling credentials securely
&lt;/h3&gt;

&lt;p&gt;Avoid putting passwords in crontab. Use a MySQL option file instead:&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;~/.my.cnf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[mysqldump]&lt;/span&gt;
&lt;span class="py"&gt;user&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;backup_user&lt;/span&gt;
&lt;span class="py"&gt;password&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;yourpassword&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restrict permissions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod &lt;/span&gt;600 ~/.my.cnf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then remove the password from the mysqldump command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mysqldump &lt;span class="nt"&gt;--single-transaction&lt;/span&gt; &lt;span class="nt"&gt;--databases&lt;/span&gt; &lt;span class="nv"&gt;$DB_NAME&lt;/span&gt; | &lt;span class="nb"&gt;gzip&lt;/span&gt; | aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; - ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Lifecycle policies for automatic cleanup
&lt;/h2&gt;

&lt;p&gt;Without cleanup, backup storage grows forever. Cloud lifecycle policies automate deletion of old backups.&lt;/p&gt;

&lt;h3&gt;
  
  
  S3 lifecycle policy
&lt;/h3&gt;

&lt;p&gt;Save a lifecycle policy as &lt;code&gt;lifecycle.json&lt;/code&gt;. This one deletes daily backups after 30 days, and transitions monthly backups to Glacier after 7 days before expiring them after a year:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Rules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ID"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Delete old MySQL backups"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Enabled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"daily/"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Expiration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Days"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ID"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Move to Glacier after 7 days"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Enabled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"monthly/"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Transitions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"Days"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"StorageClass"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GLACIER"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Expiration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Days"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;365&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3api put-bucket-lifecycle-configuration &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--bucket&lt;/span&gt; my-mysql-backups &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--lifecycle-configuration&lt;/span&gt; file://lifecycle.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
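If `put-bucket-lifecycle-configuration` rejects the file, a JSON syntax error is the usual cause; a quick local validity check before applying:

```shell
# Validate lifecycle.json before applying it to the bucket.
python3 -m json.tool lifecycle.json > /dev/null && echo "lifecycle.json is valid JSON"
```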



&lt;h3&gt;
  
  
  GCS lifecycle policy
&lt;/h3&gt;

&lt;p&gt;Create a similar lifecycle configuration for GCS and save it as &lt;code&gt;lifecycle.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"lifecycle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"rule"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Delete"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"matchesPrefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"daily/"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SetStorageClass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"storageClass"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"COLDLINE"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"matchesPrefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"monthly/"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the policy with gsutil; you can confirm it took effect afterwards with &lt;code&gt;gsutil lifecycle get gs://my-mysql-backups&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gsutil lifecycle &lt;span class="nb"&gt;set &lt;/span&gt;lifecycle.json gs://my-mysql-backups
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Using Databasus for automated cloud backups
&lt;/h2&gt;

&lt;p&gt;Manual scripts work, but they need ongoing maintenance: cron jobs fail silently, credential management gets complicated, and monitoring requires extra setup. Databasus (a dedicated tool for &lt;a href="https://databasus.com/mysql-backup" rel="noopener noreferrer"&gt;MySQL backup&lt;/a&gt;) handles all of this with a web interface for configuration and monitoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing Databasus
&lt;/h3&gt;

&lt;p&gt;Using Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; databasus &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 4005:4005 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ./databasus-data:/databasus-data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restart&lt;/span&gt; unless-stopped &lt;span class="se"&gt;\&lt;/span&gt;
  databasus/databasus:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with Docker Compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;databasus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;databasus&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;databasus/databasus:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4005:4005"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./databasus-data:/databasus-data&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuring MySQL backup to S3 or GCS
&lt;/h3&gt;

&lt;p&gt;Access the web interface at &lt;code&gt;http://localhost:4005&lt;/code&gt; and create your account, then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Add your database&lt;/strong&gt; — Click "New Database", select MySQL, and enter your connection details (host, port, username, password, database name)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select storage&lt;/strong&gt; — Choose AWS S3 or Google Cloud Storage. Enter your bucket name and credentials. Databasus supports both IAM roles and access keys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select schedule&lt;/strong&gt; — Set the backup frequency: hourly, daily, weekly, or a custom cron expression&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Click "Create backup"&lt;/strong&gt; — Databasus handles backup execution, compression, upload, retention and notifications automatically&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Databasus also provides email, Slack, Telegram and Discord notifications for backup success and failure, eliminating the need for separate monitoring scripts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Restoring from cloud backups
&lt;/h2&gt;

&lt;p&gt;Backups are worthless if you can't restore them. Practice restoration before you need it in an emergency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Downloading from S3
&lt;/h3&gt;

&lt;p&gt;List available backups:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 &lt;span class="nb"&gt;ls &lt;/span&gt;s3://my-mysql-backups/daily/ &lt;span class="nt"&gt;--human-readable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Download a specific backup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;s3://my-mysql-backups/daily/mydb_20240115_030000.sql.gz ./
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
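&lt;p&gt;Before piping a downloaded archive into mysql, it's worth checking that it isn't truncated or corrupt; &lt;code&gt;gunzip -t&lt;/code&gt; does this without writing anything to disk. A small self-contained sketch (the stand-in file here substitutes for the archive you actually pulled from S3 or GCS):&lt;/p&gt;

```shell
# Stand-in for the downloaded backup; in practice, skip this line and
# run the check on the real mydb_20240115_030000.sql.gz from S3/GCS.
printf 'CREATE TABLE t (id INT);\n' | gzip > mydb_20240115_030000.sql.gz

# -t tests archive integrity without decompressing to disk
if gunzip -t mydb_20240115_030000.sql.gz; then
  echo "archive OK"
else
  echo "archive corrupt, re-download before restoring" >&2
fi
```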



&lt;h3&gt;
  
  
  Downloading from GCS
&lt;/h3&gt;

&lt;p&gt;List backups:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gsutil &lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; gs://my-mysql-backups/daily/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Download:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gsutil &lt;span class="nb"&gt;cp &lt;/span&gt;gs://my-mysql-backups/daily/mydb_20240115_030000.sql.gz ./
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Restoring the backup
&lt;/h3&gt;

&lt;p&gt;Decompress and restore:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;gunzip&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; mydb_20240115_030000.sql.gz | mysql &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-p&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or in one command directly from S3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;s3://my-mysql-backups/daily/mydb_20240115_030000.sql.gz - | &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nb"&gt;gunzip&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
    mysql &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-p&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Testing restores regularly
&lt;/h3&gt;

&lt;p&gt;Create a test restore script that runs monthly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# Get the latest backup&lt;/span&gt;
&lt;span class="nv"&gt;LATEST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws s3 &lt;span class="nb"&gt;ls &lt;/span&gt;s3://my-mysql-backups/daily/ | &lt;span class="nb"&gt;sort&lt;/span&gt; | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 1 | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $4}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Create test database&lt;/span&gt;
mysql &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"CREATE DATABASE restore_test;"&lt;/span&gt;

&lt;span class="c"&gt;# Restore&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="s2"&gt;"s3://my-mysql-backups/daily/&lt;/span&gt;&lt;span class="nv"&gt;$LATEST&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; - | &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nb"&gt;gunzip&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
    mysql &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-p&lt;/span&gt; restore_test

&lt;span class="c"&gt;# Verify (check row count on a known table)&lt;/span&gt;
&lt;span class="nv"&gt;ROWS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;mysql &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;-N&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"SELECT COUNT(*) FROM restore_test.users;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Restored &lt;/span&gt;&lt;span class="nv"&gt;$ROWS&lt;/span&gt;&lt;span class="s2"&gt; rows from users table"&lt;/span&gt;

&lt;span class="c"&gt;# Cleanup&lt;/span&gt;
mysql &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"DROP DATABASE restore_test;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
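&lt;p&gt;One caveat: every &lt;code&gt;mysql&lt;/code&gt; invocation above uses &lt;code&gt;-p&lt;/code&gt;, which prompts for a password interactively, so the script will hang when run unattended from cron. A common fix is a client option file; a sketch (written to a local &lt;code&gt;my.cnf&lt;/code&gt; here for demonstration; on the backup host it would normally live at &lt;code&gt;~/.my.cnf&lt;/code&gt;, and the credentials are placeholders):&lt;/p&gt;

```shell
# Create a client option file so mysql/mysqldump run without prompting.
# backup_user / change-me are placeholders for your real credentials.
cat > my.cnf <<'EOF'
[client]
user=backup_user
password=change-me
EOF

# Keep it readable only by the owner, since it contains a password.
chmod 600 my.cnf

# The script's calls then become e.g.:
#   mysql --defaults-extra-file=my.cnf -e "CREATE DATABASE restore_test;"
echo "wrote my.cnf with permissions $(stat -c %a my.cnf)"
```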



&lt;h2&gt;
  
  
  Security considerations
&lt;/h2&gt;

&lt;p&gt;Cloud backups require careful security configuration to avoid exposing your data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Encryption at rest
&lt;/h3&gt;

&lt;p&gt;Both S3 and GCS encrypt data at rest by default. For additional control, request an encryption mode explicitly, including server-side encryption with customer-managed KMS keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# S3 with SSE-S3 (Amazon-managed keys)&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;backup.sql.gz s3://my-mysql-backups/ &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--sse&lt;/span&gt; AES256

&lt;span class="c"&gt;# S3 with SSE-KMS (customer-managed keys)&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;backup.sql.gz s3://my-mysql-backups/ &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--sse&lt;/span&gt; aws:kms &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--sse-kms-key-id&lt;/span&gt; &lt;span class="nb"&gt;alias&lt;/span&gt;/my-backup-key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Client-side encryption
&lt;/h3&gt;

&lt;p&gt;For the strongest guarantee, encrypt before uploading so the plaintext never reaches the cloud provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Encrypt with gpg&lt;/span&gt;
mysqldump &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;--single-transaction&lt;/span&gt; &lt;span class="nt"&gt;--databases&lt;/span&gt; mydb | &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nb"&gt;gzip&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
    gpg &lt;span class="nt"&gt;--symmetric&lt;/span&gt; &lt;span class="nt"&gt;--cipher-algo&lt;/span&gt; AES256 &lt;span class="nt"&gt;-o&lt;/span&gt; mydb_backup.sql.gz.gpg

&lt;span class="c"&gt;# Upload encrypted file&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;mydb_backup.sql.gz.gpg s3://my-mysql-backups/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
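&lt;p&gt;The flip side is the restore path: the &lt;code&gt;.gpg&lt;/code&gt; file has to be decrypted before it can be piped into mysql (&lt;code&gt;gpg --decrypt mydb_backup.sql.gz.gpg | gunzip | mysql -u root -p mydb&lt;/code&gt;). A self-contained round-trip sketch using a batch passphrase; in real use, gpg would prompt for the passphrase rather than take it on the command line:&lt;/p&gt;

```shell
# Encrypt a tiny stand-in dump with a batch passphrase, then decrypt it back.
# "demo" stands in for your real passphrase; never pass real ones on the CLI.
printf 'CREATE TABLE t (id INT);\n' | gzip | \
    gpg --batch --yes --pinentry-mode loopback --passphrase demo \
        --symmetric --cipher-algo AES256 -o demo.sql.gz.gpg

# Decrypt and decompress; in practice, pipe this into mysql to restore.
gpg --batch --quiet --pinentry-mode loopback --passphrase demo \
    --decrypt demo.sql.gz.gpg | gunzip
```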



&lt;h3&gt;
  
  
  IAM policies
&lt;/h3&gt;

&lt;p&gt;Restrict backup credentials to minimum required permissions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:PutObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3:ListBucket"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::my-mysql-backups"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::my-mysql-backups/*"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Don't use root credentials or overly permissive policies for backup operations. Note that the policy above deliberately omits &lt;code&gt;s3:DeleteObject&lt;/code&gt;, so leaked backup credentials can't be used to wipe your backup history. On GCS, the equivalent is granting the backup service account only &lt;code&gt;roles/storage.objectCreator&lt;/code&gt; and &lt;code&gt;roles/storage.objectViewer&lt;/code&gt; on the bucket.&lt;/p&gt;
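&lt;p&gt;A hand-edited policy file with a stray comma produces a confusing error from the AWS CLI, so it's worth validating the JSON locally before attaching it. A sketch using Python's built-in validator (the file name is an assumption):&lt;/p&gt;

```shell
# Write out the backup policy from above, then validate it locally
# before attaching it with the AWS CLI.
cat > backup-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-mysql-backups",
        "arn:aws:s3:::my-mysql-backups/*"
      ]
    }
  ]
}
EOF

# json.tool exits nonzero on any syntax error
python3 -m json.tool backup-policy.json > /dev/null && echo "policy JSON is valid"
```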

&lt;h3&gt;
  
  
  Network security
&lt;/h3&gt;

&lt;p&gt;Transfer backups over encrypted connections only. Both AWS CLI and gsutil use HTTPS by default. If you're backing up from within AWS or GCP, use VPC endpoints to keep traffic off the public internet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create S3 VPC endpoint (via AWS Console or CLI)&lt;/span&gt;
aws ec2 create-vpc-endpoint &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--vpc-id&lt;/span&gt; vpc-abc123 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--service-name&lt;/span&gt; com.amazonaws.us-east-1.s3 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--route-table-ids&lt;/span&gt; rtb-abc123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Cloud storage transforms MySQL backups from local files vulnerable to single points of failure into durable, geographically distributed archives. Start with basic scripts using mysqldump and the AWS CLI or gsutil for simple setups. Add cron scheduling for automation and lifecycle policies for retention management. For production systems, consider dedicated backup tools like Databasus that handle scheduling, monitoring and notifications in one package. Whatever approach you choose, test restores regularly. Backups that can't be restored provide no protection.&lt;/p&gt;

</description>
      <category>database</category>
      <category>mysql</category>
    </item>
  </channel>
</rss>
