<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Vasilii Petrushin</title>
    <description>The latest articles on Forem by Vasilii Petrushin (@redlineeyes).</description>
    <link>https://forem.com/redlineeyes</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2249895%2Fe5b84743-e050-4e84-be7f-63030512c101.png</url>
      <title>Forem: Vasilii Petrushin</title>
      <link>https://forem.com/redlineeyes</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/redlineeyes"/>
    <language>en</language>
    <item>
      <title>Applications in prod. How to handle skyrocket growth with caching.</title>
      <dc:creator>Vasilii Petrushin</dc:creator>
      <pubDate>Fri, 28 Mar 2025 14:54:37 +0000</pubDate>
      <link>https://forem.com/redlineeyes/applications-in-prod-how-to-handle-skyrocket-growth-with-caching-6i2</link>
      <guid>https://forem.com/redlineeyes/applications-in-prod-how-to-handle-skyrocket-growth-with-caching-6i2</guid>
      <description>&lt;p&gt;We are lucky, and the company we work at has skyrocketed. In the previous steps, &lt;a href="https://dev.to/redlineeyes/postgres-in-prod-part-1-setup-and-startup-4fl1"&gt;we set up the Postgres database(s) well&lt;/a&gt;, set up monitoring and alerting, &lt;a href="https://dev.to/redlineeyes/postgres-in-prod-part-2-problem-solving-and-optimizations-heading-to-10x-growth-2fko"&gt;optimized our data and queries&lt;/a&gt;, and even switched the application to read from read replicas. And now we have so high request rate that our app and Postgres can’t handle it. There are many tricks to increase Postgres productivity. But in the end, to apply a long-term fix, we have to fight the root cause of our problems. We have only one option: to decrease the QPS (Query Per Second) on our Postgres instances and decrease the database size. Real life with a highly loaded system is complex and full of pain, and we don’t want to get lost with it. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2nh8ehw8amxyn2m9vds.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2nh8ehw8amxyn2m9vds.png" alt="Great success!" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
We have 6 basic strategies to fight QPS and table sizes as a long-term solution:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;caching,&lt;/li&gt;
&lt;li&gt;rebuilding business processes to decrease the number of SQL queries required to process a transaction,&lt;/li&gt;
&lt;li&gt;splitting the application into microservices and moving each microservice’s tables to other instances or database technologies, such as NoSQL databases,&lt;/li&gt;
&lt;li&gt;switching to asynchronous transaction processing,&lt;/li&gt;
&lt;li&gt;table partitioning,&lt;/li&gt;
&lt;li&gt;sharding the database, i.e., splitting transaction processing from one database cluster across multiple clusters.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Following only one strategy will not lead us to success; in real life, it is always a mix.&lt;/p&gt;

&lt;p&gt;Let’s discuss how best to apply these strategies. It is a huge topic, so in this article, we start with caching.&lt;/p&gt;

&lt;h2&gt;
  
  
  Caching
&lt;/h2&gt;

&lt;p&gt;Caching is a brilliant strategy to heavily decrease both QPS and API latency. It is applicable when the data and content produced by the application and/or the results of API calls are the same for a group of users over some period of time and, therefore, can be cached. The main things you need are clear caching rules (why, when, and what to cache), a TTL for cached content, and a fallback algorithm. There are 3 main approaches to caching:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;in-app caching, like cachetools for Python or Caffeine Cache for Java/Spring Boot,&lt;/li&gt;
&lt;li&gt;a key-value in-memory database like Redis or Memcached,&lt;/li&gt;
&lt;li&gt;an external caching app, such as a reverse proxy - Nginx, Traefik, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes terms like ‘cache tiering’ are used, but I don’t like them; the most successful high-load applications typically use all the ‘tiers’ at the same time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The benefits of in-app caching:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;cached data can be shared with and controlled by all classes and functions in the worker process,&lt;/li&gt;
&lt;li&gt;no dependency on any third-party infrastructure application like an external Nginx cache, Redis, or Memcached,&lt;/li&gt;
&lt;li&gt;the FASTEST cache solution for apps ever, providing the maximum LATENCY-DECREASING effect.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Weaknesses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the data in the cache is NOT PERSISTENT (*), any restart of the worker process will completely destroy the cache,&lt;/li&gt;
&lt;li&gt;requires additional memory in the application worker process, which is always limited,&lt;/li&gt;
&lt;li&gt;dedicated to a single worker process and cannot be shared between multiple processes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Taking the benefits and weaknesses into account, in-app caching best suits short-lived, frequently accessed, small-sized data, like session details, tokens, feature flags, and counters - for throttling, accounting, metrics, user/session limits, etc. Some weaknesses can be mitigated; for example, during the worker process start sequence, the local in-app cache can be preheated with useful data.&lt;/p&gt;
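&lt;p&gt;The idea behind in-app caches like cachetools can be sketched in a few dozen lines of standard-library Python. This is an illustration, not any library’s API - the class name, sizes, and eviction strategy are all assumptions for the sketch:&lt;/p&gt;

```python
import time

class TinyTTLCache:
    """A minimal in-app TTL cache: a plain dict living inside one worker process."""

    def __init__(self, maxsize=1024, ttl=60.0):
        self.maxsize = maxsize
        self.ttl = ttl
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # stale entry: evict on read
            return None
        return value

    def set(self, key, value):
        if key not in self._store and len(self._store) >= self.maxsize:
            # naive eviction: drop the entry that expires soonest
            soonest = min(self._store, key=lambda k: self._store[k][0])
            del self._store[soonest]
        self._store[key] = (time.monotonic() + self.ttl, value)

cache = TinyTTLCache(maxsize=128, ttl=0.05)
cache.set("feature_flags", {"dark_mode": True})
print(cache.get("feature_flags"))  # hit while the TTL is alive
time.sleep(0.1)
print(cache.get("feature_flags"))  # None: TTL expired, entry evicted
```

&lt;p&gt;Note the defining properties from the list above: it is just process memory (fastest possible, no network hop), it dies with the worker, and it cannot be shared between processes.&lt;/p&gt;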

&lt;h2&gt;
  
  
  The benefits of key-value databases for caching - Redis and its clones, Memcached, etc:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;data can be distributed between application worker processes&lt;/li&gt;
&lt;li&gt;atomic operations, operations like INCR/DECR without race conditions&lt;/li&gt;
&lt;li&gt;some PERSISTENCE features&lt;/li&gt;
&lt;li&gt;sharding and high-availability features&lt;/li&gt;
&lt;li&gt;data structures like Strings, Lists, Sets, Sorted Sets, Hashes, and more&lt;/li&gt;
&lt;li&gt;housekeeping policies like LRU, LFU, and TTL-based eviction&lt;/li&gt;
&lt;li&gt;Pub/Sub &amp;amp; Streams: can be used for messaging.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Weaknesses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;even with persistence features enabled, you may LOSE some DATA on Redis restarts,&lt;/li&gt;
&lt;li&gt;it is an in-memory database, and the total cache size is still limited.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, key-value databases are suitable for distributed caching and messaging, real-time analytics, distributed locking, leaderboards, session storage, etc.&lt;/p&gt;
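&lt;p&gt;As an illustration of why atomic INCR matters for throttling, here is the classic fixed-window rate-limit pattern as it would run against Redis (INCR + EXPIRE). To keep the sketch self-contained and runnable, it uses a tiny in-process stand-in for the two Redis commands; with redis-py the &lt;code&gt;incr&lt;/code&gt;/&lt;code&gt;expire&lt;/code&gt; calls would look the same, and the key naming is an assumption of mine:&lt;/p&gt;

```python
import time

class FakeRedis:
    """In-process stand-in for the two Redis commands this pattern needs."""
    def __init__(self):
        self._data = {}  # key -> counter
        self._exp = {}   # key -> expires_at

    def _evict(self, key):
        if key in self._exp and time.monotonic() >= self._exp[key]:
            self._data.pop(key, None)
            self._exp.pop(key, None)

    def incr(self, key):
        self._evict(key)
        self._data[key] = self._data.get(key, 0) + 1
        return self._data[key]

    def expire(self, key, seconds):
        self._exp[key] = time.monotonic() + seconds

def allow_request(r, user_id, limit=5, window_s=60):
    """Fixed-window rate limit: at most `limit` calls per time window."""
    key = f"throttle:{user_id}:{int(time.time() // window_s)}"
    count = r.incr(key)          # atomic in real Redis: no race between workers
    if count == 1:
        r.expire(key, window_s)  # let the counter die with its window
    return count <= limit

r = FakeRedis()
print([allow_request(r, "u1", limit=3) for _ in range(5)])
# the first 3 calls in the window are allowed, the rest are rejected
```

&lt;p&gt;The point of doing this in Redis rather than in-app is exactly the first benefit above: the counter is shared by all worker processes, so the limit holds across the whole fleet.&lt;/p&gt;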

&lt;h2&gt;
  
  
  External caching proxies
&lt;/h2&gt;

&lt;p&gt;With external caching proxies, responses can be served to the client before they ever reach the application server. The benefits of caching on proxies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;offloads large amounts of requests from the application workers and serves them with very low latency,&lt;/li&gt;
&lt;li&gt;reduces bandwidth to your servers if some kind of CDN is being used, like CloudFlare, AWS CloudFront, Google Cloud CDN, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Weaknesses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;supports only HTTP/HTTPS protocols&lt;/li&gt;
&lt;li&gt;less flexibility: it can cache only entire responses, and not all data can be cached, for security and other reasons,&lt;/li&gt;
&lt;li&gt;cache keys are hard to maintain, and proxy configs can get very complicated.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  It is very important to understand the caching rules.
&lt;/h2&gt;

&lt;p&gt;Here are some basic questions to answer to build the caching rules and choose the caching strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What data to cache, what are the data types, what is the size, etc?&lt;/li&gt;
&lt;li&gt;What will the outcome of caching be - how much QPS will we save, how much faster will the API call become, etc.?&lt;/li&gt;
&lt;li&gt;Security - what if the data is exposed to another worker process or another class/function in your app? In the case of Redis, what if the data is accessible from a terminal with redis-cli?&lt;/li&gt;
&lt;li&gt;What should cache keys look like?&lt;/li&gt;
&lt;li&gt;What group of users/API calls/business processes/whatever grouping we have will be affected by the caching algorithm?&lt;/li&gt;
&lt;li&gt;What about sharing the cache between processes?&lt;/li&gt;
&lt;li&gt;etc&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  TTL
&lt;/h2&gt;

&lt;p&gt;Just believe me and never use cached data without a TTL! It is the most basic mechanism and the easiest way to refresh stale data and free cache space when the data is not needed anymore. Also, if the cache supports eviction policies like LRU or LFU, a policy should be set, too. TTLs and eviction policies keep your cache operational and automate the housekeeping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fallback.
&lt;/h2&gt;

&lt;p&gt;Fallbacks are the most important part of building a caching system. A fallback should handle any caching exception: refresh the cache if the data is stale and its TTL has expired, recalculate and update the cache if no data is available for the key, and do the calculations directly if the cache is not available (any Redis error, a network issue between the worker process and Redis, the in-app cache being disabled, etc.). In other words, fallbacks allow the application to work even if caches are not available at all. IMHO, a well-designed system should keep working and at least handle the median workload without 503/502 or other errors, even if caches become unavailable through an incident or misconfiguration. It may work with higher latencies or go asynchronous, but it keeps working and serving clients in any case. Admittedly, that happens only in a perfect world.&lt;/p&gt;
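&lt;p&gt;The fallback logic above can be sketched as a small wrapper: try the cache first, and on a miss or any cache error fall back to the real computation. All names here are illustrative, not a real API:&lt;/p&gt;

```python
import logging

def cached_call(cache_get, cache_set, key, compute):
    """Serve from cache when possible; always fall back to `compute`.

    `cache_get`/`cache_set` may raise (Redis down, network issue, cache
    disabled) - the caller must still get an answer in every case.
    """
    try:
        value = cache_get(key)
        if value is not None:
            return value            # cache hit
    except Exception:
        logging.warning("cache read failed, falling back", exc_info=True)

    value = compute()               # miss, stale, or cache unavailable
    try:
        cache_set(key, value)       # best effort: ignore write failures too
    except Exception:
        logging.warning("cache write failed", exc_info=True)
    return value

# With a completely broken cache the application still answers, only slower:
def broken_get(key): raise ConnectionError("redis unreachable")
def broken_set(key, value): raise ConnectionError("redis unreachable")
print(cached_call(broken_get, broken_set, "user:42", lambda: "computed"))
# prints "computed"
```

&lt;p&gt;The design choice worth noting: cache errors are logged but never propagated, so an unavailable cache degrades latency, not availability.&lt;/p&gt;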

&lt;h2&gt;
  
  
  Cache preheat.
&lt;/h2&gt;

&lt;p&gt;In theory, your app should work without caches, but real life is more interesting. Usually, applications can’t work and/or can’t handle the workload without data in caches. You will definitely lose the data in in-app caches on restarts, and you could lose data in Redis or proxy caches, which will affect application performance. To avoid performance degradation after data loss, you need to fill the caches with data. This process is called cache preheating. Preheating routines usually run on application restart, but sometimes preheating runs periodically, cron-style, to refresh the data in the caches.&lt;/p&gt;
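&lt;p&gt;A preheat routine can be as simple as looping over the hottest keys on worker start and filling the cache before the process accepts traffic. A minimal sketch - the loader, key list, and function names are all assumptions for illustration:&lt;/p&gt;

```python
def preheat(cache_set, loader, keys):
    """Fill the cache with the hottest entries before serving traffic."""
    warmed = 0
    for key in keys:
        value = loader(key)          # e.g. a SELECT against Postgres
        if value is not None:
            cache_set(key, value)
            warmed += 1
    return warmed

# On worker start: warm feature flags and limits before accepting requests.
cache = {}
hot_keys = ["feature_flags", "rate_limits"]
fake_db = {"feature_flags": {"dark_mode": True}, "rate_limits": {"free": 100}}
print(preheat(cache.__setitem__, fake_db.get, hot_keys))  # prints 2
```

&lt;p&gt;The same routine can be rerun on a schedule to refresh the entries, as described above.&lt;/p&gt;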

&lt;p&gt;And, of course, your Grafana or an Application Performance Monitoring (APM) tool must have enough metrics and graphs to see the outcome and efficiency of your efforts to decrease QPS and latencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Caching is a great strategy to decrease QPS and latency.&lt;/li&gt;
&lt;li&gt;We have in-app, key-value, and proxy caching solutions; all of them have their own benefits and weaknesses, which should be considered when applying them.&lt;/li&gt;
&lt;li&gt;Always keep clear caching rules and policies - what data and how to cache.&lt;/li&gt;
&lt;li&gt;TTLs and fallbacks should always be applied; they are a hygiene factor.&lt;/li&gt;
&lt;li&gt;To prevent performance degradation when cached data is lost, consider cache preheating.&lt;/li&gt;
&lt;li&gt;Always use the graphs, metrics, and numbers before and after applying caching to evaluate your efforts and make your clients and bosses happy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PS: I use AI a lot as a learning partner or an advisor but not in production. No AI was used to write this article.&lt;/p&gt;

&lt;p&gt;(*) For clarity: you can persist the in-app cache, as Nginx does, for example. But it is a very complicated topic with a lot of factors to take into account, like security, IOPS, housekeeping, surviving restarts, etc. Since under the currently most popular backend design approach the application worker processes are stateless, we consider the in-app cache not persistent.&lt;/p&gt;

</description>
      <category>highload</category>
      <category>devops</category>
      <category>database</category>
    </item>
    <item>
      <title>How to review Postgres indexes</title>
      <dc:creator>Vasilii Petrushin</dc:creator>
      <pubDate>Tue, 25 Feb 2025 15:02:54 +0000</pubDate>
      <link>https://forem.com/redlineeyes/postgres-indexes-analysis-26pa</link>
      <guid>https://forem.com/redlineeyes/postgres-indexes-analysis-26pa</guid>
      <description>&lt;p&gt;Hey, in addition to &lt;a href="https://dev.to/redlineeyes/postgres-in-prod-part-2-problem-solving-and-optimizations-heading-to-10x-growth-2fko"&gt;Part 2&lt;/a&gt;, I recently started to review indexes in one of my prod databases. I would say, it turned out to be very educational.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6smain8xer8pncmo00w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6smain8xer8pncmo00w.jpeg" alt="Work a bit smarter" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The database has been developed for a few years, and thousands of new features and migrations have been applied. I was surprised at how many indexes had been created and never used, or had not been used for a long time.&lt;/p&gt;

&lt;p&gt;Why is index management so important? Indexes are costly! The costs of indexing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Indexes eat storage, and you need high-speed, highly available storage for that.&lt;/li&gt;
&lt;li&gt;Indexing takes CPU and some IOPS on each INSERT/UPDATE/DELETE query, and adds latency to these queries.&lt;/li&gt;
&lt;li&gt;Indexes take time to rebuild on database restore during incident recovery.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Any index is a trade-off. On one side it accelerates queries and reduces IOPS; on the other, it creates an additional workload.&lt;/p&gt;

&lt;p&gt;At the last database review, I found 50GB of unused indexes in a 600GB database. How did I find them?&lt;br&gt;
How to find unused indexes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Unused indexes
SELECT 
    relname AS table_name, 
    indexrelname AS index_name, 
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0  -- No scans
ORDER BY pg_relation_size(indexrelid) DESC;
-- Total size of unused indexes
SELECT 
    pg_size_pretty(SUM(pg_relation_size(indexrelid))) AS indexes_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The unused indexes are sorted by size, and the largest ones are the priority to fix first. The total size of unused indexes shows how bad the whole situation is. Keep in mind that idx_scan counts accumulate since the last statistics reset, so make sure the stats cover a representative period before dropping anything.&lt;/p&gt;

&lt;p&gt;How to find indexes that cover the same columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Duplicate indexes cover the same columns
SELECT 
    indrelid::regclass AS table_name,
    array_agg(indexrelid::regclass) AS duplicate_indexes
FROM pg_index
GROUP BY indrelid, indkey
HAVING COUNT(*) &amp;gt; 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes, it is very interesting and a good reason to review the indexes.&lt;/p&gt;

&lt;p&gt;A table sorted by the percentage of index scans, idx_scan/(seq_scan + idx_scan), for each table - a good occasion to review the index coverage of your tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Index usage vs seq scan
SELECT
    relname AS table_name,
    seq_scan, seq_tup_read,
    idx_scan, idx_tup_fetch,
    round(100 * idx_scan::numeric / NULLIF(seq_scan + idx_scan, 0), 2) AS index_usage_percent
FROM pg_stat_user_tables
ORDER BY index_usage_percent DESC NULLS LAST;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An example migration query that drops an unused index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE table_name DROP INDEX IF EXISTS index_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reasons why this index "crisis" happened:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a lot of new features were added and &lt;em&gt;removed&lt;/em&gt; in the application; indexes were created for them, but we forgot to review and remove the indexes;&lt;/li&gt;
&lt;li&gt;a lot of indexes were created because it seemed obvious - developers planned to filter or join by these columns - but in real life the Postgres planner never wants to use them;&lt;/li&gt;
&lt;li&gt;the indexes have never been reviewed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conclusions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Indexes are required to run queries efficiently, but they are costly for the CPU on INSERTs/UPDATEs/DELETEs and take extra storage space.&lt;/li&gt;
&lt;li&gt;An index review should be included in the regular database audit and maintenance procedures.&lt;/li&gt;
&lt;li&gt;The queries above are extremely useful for getting an overview of the current index situation in your Postgres database.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;PS: I use AI a lot as a learning partner or an advisor, but not in production. No AI was used to write this article.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>indexes</category>
      <category>database</category>
      <category>devops</category>
    </item>
    <item>
      <title>Postgres in prod part 2. Problem-solving and optimizations, heading to 10x growth.</title>
      <dc:creator>Vasilii Petrushin</dc:creator>
      <pubDate>Fri, 31 Jan 2025 12:41:45 +0000</pubDate>
      <link>https://forem.com/redlineeyes/postgres-in-prod-part-2-problem-solving-and-optimizations-heading-to-10x-growth-2fko</link>
      <guid>https://forem.com/redlineeyes/postgres-in-prod-part-2-problem-solving-and-optimizations-heading-to-10x-growth-2fko</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/redlineeyes/postgres-in-prod-part-1-setup-and-startup-4fl1"&gt;part 1.&lt;/a&gt; we discussed how to build the Postgres cluster and supplementary infra to start the project in production. I assume, we were lucky with the startup and it began to grow. In my experience, at this point the baby’s problems and stupid bugs are solved, the database size is more than 350GB of data, we have a couple of tables of more than 50-80 GB, QPS (Queries Per Second) is 3000 - 20 000 queries per second at the peak times, and so on. The users/clients count increased by 2x-10x per year or more, and growth and workloads become more or less predictable and follow our marketing activities. Now we have other, more complicated challenges.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvelix0r5d7cl4duyprte.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvelix0r5d7cl4duyprte.jpg" alt="Pain and gain!" width="651" height="383"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Solving the issues
&lt;/h2&gt;

&lt;p&gt;Now our app has started to crash at peak times, and we see 5xx errors - HTTP timeouts and server errors - or 499s caused by client timeouts. The many possible causes manifest in different ways. We started digging through the logs with the why-why-why method, and at the next “why” found that in most cases the root cause is increased Postgres latencies.&lt;/p&gt;

&lt;p&gt;The “why” path to the root cause can be indirect. For example, you see in the logs that the API throws timeouts because it can’t obtain a database connection from its own pool. Why? Because all connections are busy serving earlier requests, and all older requests are waiting in the queue. Why is that? Because average SQL query times have become much longer than before. A similarly indirect case: the HTTP worker pool could be busy, HTTP requests could wait in your application queue until the client times out, and you would get a 499 in the log. In the easiest case, we see the API endpoint with a long-running query, which leads to a 502 timeout or 503 server error. All of these cases produce different visible effects on the application, logging, monitoring, and alerting behavior, but all of them are caused by increased Postgres times. The bad news - proven by my own experience - is that increasing HTTP or database connection pool sizes in the application config will not help and can make things worse. Nor will adding CPU/memory/IOPS for Postgres. Why?...&lt;/p&gt;

&lt;p&gt;But how do we diagnose and fix the issues? First of all, you need a set of monitoring tools, as described in &lt;a href="https://dev.to/redlineeyes/postgres-in-prod-part-1-setup-and-startup-4fl1"&gt;Part 1&lt;/a&gt;.&lt;br&gt;
In the easiest case, when the situation is clear, you can extract the timed-out query from the logs or the monitoring system and go debug it.&lt;br&gt;
In harder cases, I start the diagnosis process like this.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Dashboards - the starting point
&lt;/h2&gt;

&lt;p&gt;Check the Postgres dashboard in Grafana for CPU load and QPS - they should be at higher levels than usual, assuming we have no other (e.g., network) issues.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Digging logs
&lt;/h2&gt;

&lt;p&gt;Ensure slow-query logging is enabled in Postgres:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;log_min_duration_statement: "100ms"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I would recommend starting with a &lt;a href="https://grafana.com/oss/loki/" rel="noopener noreferrer"&gt;Loki&lt;/a&gt; query for queries running longer than 1 second, to exclude the more efficient queries and focus on the heaviest ones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{namespace="production-db",pod="postgres-0"} |~ "duration: \\d{4,}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For all long-running queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{namespace="production-db",pod="postgres-0"} |~ "duration: "
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Blocking and blocked queries
&lt;/h2&gt;

&lt;p&gt;If you get a lot of queries - usually COMMIT, UPDATE …, or SELECT … FOR UPDATE - running for more than 30 seconds, it is a symptom of locking. In that case, let’s check the blocked queries in the psql console.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- blocked queries
select pid, username, pg_blocking_pids(pid) as blocked_by, 
query as blocked_query
from pg_stat_activity
where cardinality(pg_blocking_pids(pid)) &amp;gt; 0;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Resolving locks
&lt;/h2&gt;

&lt;p&gt;If locks are found, resolve them by killing the locking queries, then check whether the locking and blocked queries are optimal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT pg_cancel_backend(pid) -- for graceful stop
SELECT pg_terminate_backend(pid) -- forced query termination
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use this with caution. In general, it is safe to kill blocked queries because they are waiting for execution permission and are doing nothing at the moment of termination. If the query is inside a transaction block, it is also safe: the transaction will just be rolled back.&lt;/p&gt;

&lt;p&gt;To analyze &lt;code&gt;SELECT … FOR UPDATE&lt;/code&gt;, run &lt;code&gt;EXPLAIN ANALYZE SELECT …&lt;/code&gt; without &lt;code&gt;FOR UPDATE&lt;/code&gt; and ensure that the query uses indexes and filters without sequential scans. If it runs with seq scans, create additional indexes.&lt;br&gt;
How do you analyze an UPDATE against production data?&lt;br&gt;
It is scary to run an UPDATE on a production database, but what do you do if you fall into an incident and have to fix it ASAP? In some cases, there is a safe way to debug UPDATE or INSERT queries by using transactions. Be careful: if the UPDATE runs on a huge table and changes a lot of records, it could run for hours and lock the table for writing for a long time, performing the UPDATE as-is and only then the ROLLBACK.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- easy way if we're lucky
BEGIN;
EXPLAIN ANALYZE UPDATE …;
ROLLBACK;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we have a huge table and our UPDATE changes a lot of records, then we can analyze SELECT with the same WHERE part.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- safest way is the table is huge and UPDATE changes a lot of rows
EXPLAIN ANALYZE SELECT * FROM table WHERE … here are the conditions from the UPDATE query.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If EXPLAIN reports that the query runs with seqscans - create additional indexes.&lt;/p&gt;

&lt;p&gt;If the queries are already good but you are still having issues - congrats, you and your developers must fix the application logic to avoid locking. There are three major ways to go: change the application logic to use an append-only approach, make such update requests asynchronous, or decrease the overall QPS to let the Postgres instance focus on the activities that cause the locks. The third way may be cheaper at the moment of the incident but will not solve the root cause of the locking. I want to write a separate article on these approaches and how they help survive high loads. The good news is that resolving the locks manually often brings your application back up immediately, until the next deadlock, so your team has some time to develop and apply a hotfix.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Temporary files
&lt;/h2&gt;

&lt;p&gt;If there are no locks, and/or most long-running queries stay below 30 seconds while CPU usage is high, these are symptoms of nonoptimal queries, nonoptimal data structures, or heavy temporary file usage. We need to dig deeper. Ensure that temp_files logging is enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;log_temp_files: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are two ways to check whether queries use temporary files. The first is Loki queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{app="postgres", log~="temporary file"}
{namespace="production-db", pod=~"postgres.*"} |= "temporary"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second is to check temp_blks_read and temp_blks_written:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM pg_stat_statements WHERE 
temp_blks_read &amp;gt; 0 OR temp_blks_written &amp;gt; 0;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Additionally, you can check Grafana or the Postgres pod directly to see whether it has high I/O and high iowait. If you use remote data storage like an AWS Elastic Block Store volume, it has limited IOPS; the absolute IOPS numbers can be relatively small, yet the Postgres container will still show high CPU iowait!&lt;/p&gt;

&lt;p&gt;If high temporary file usage is confirmed, increase the &lt;code&gt;work_mem&lt;/code&gt; and &lt;code&gt;maintenance_work_mem&lt;/code&gt; Postgres parameters and the memory requests/limits in the Kubernetes manifest. Be careful: changing the memory requests/limits restarts the Postgres pod (the &lt;code&gt;work_mem&lt;/code&gt; parameters themselves can be changed with a configuration reload). If it is unclear from the logs or our &lt;code&gt;temp_blks_written&lt;/code&gt; analysis how much &lt;code&gt;work_mem&lt;/code&gt; to add, I recommend upgrading in big steps, like 2x from the current value.&lt;/p&gt;
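&lt;p&gt;For example, if the current setting is 32MB, a first step in the same config style as above could look like this. The values are illustrative, not a recommendation; remember that &lt;code&gt;work_mem&lt;/code&gt; is allocated per sort/hash operation, per backend, so the worst-case total grows with the connection count:&lt;/p&gt;

```
work_mem: "64MB"              # was 32MB; per sort/hash node, per backend
maintenance_work_mem: "1GB"   # used by VACUUM, CREATE INDEX, etc.
```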

&lt;h2&gt;
  
  
  6. Nonoptimal queries and data.
&lt;/h2&gt;

&lt;p&gt;The most common cause of long-running SELECT queries is a sequential scan, when Postgres applies filters or joins by scanning the entire table. Imagine how long that takes if your table has 100+GB of data. The query could take minutes with significant I/O consumption. How do we detect it?&lt;/p&gt;

&lt;p&gt;The symptoms are, together or separately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;high IOPSes in Grafana or iotop output, high iowait&lt;/li&gt;
&lt;li&gt;the query appears in the Postgres log with a duration of more than 100ms (we enabled the log of long-running queries in Part 1.),&lt;/li&gt;
&lt;li&gt;in Grafana query analysis this query appears with high time consumption,&lt;/li&gt;
&lt;li&gt;502/503 errors in your app API calls.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Analyzing the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM wallet_transaction WHERE currency_id = 10;
Seq Scan on wallet_transaction  (cost=0.00..25000000.00 rows=10000 width=100) (actual time=0.100..7500.500 rows=10000 loops=1)
  Filter: (currency_id = 10)
  Buffers: shared hit=1571 read=60955 dirtied=739
  I/O Timings: shared read=6878.111
  Rows Removed by Filter: 5000000
Planning Time: 0.200 ms
Execution Time: 7502.300 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is a very primitive query, far from real life, but it shows the problems of sequential scans: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;unpredictable execution cost (the cost estimate ranges from zero to high numbers and does not match the actual running time),&lt;/li&gt;
&lt;li&gt;high buffer usage,&lt;/li&gt;
&lt;li&gt;high I/O consumption.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It means we cannot predict the actual load generated by this query; the cache and memory usage is high and could be ineffective, and we could be abusing the cache, which is limited, at the expense of other queries. We are also abusing disk I/O, which is critical and limited in clouds and their managed database solutions. High disk I/O also leads to high CPU iowait utilization; it simply steals CPU from other tasks. Happily, in most cases the solution is simple: create the right index for your table. With an index, Postgres will perform an index scan, and you can save from 100x to even 10 000x of I/O and buffers. In general, to prevent or reduce such issues, set a rule for your project to run EXPLAIN on all SELECT queries at the development stage, and keep this problem in mind when designing the database and table structure. In my experience, a sequential scan is acceptable and even more effective for small tables of up to a few thousand rows. For bigger tables, indexes and index scans are necessary.&lt;/p&gt;
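&lt;p&gt;For the primitive example above, the fix is an index on the filtered column; after creating it, the same EXPLAIN should show an Index Scan instead of a Seq Scan. The index name here is illustrative:&lt;/p&gt;

```
-- CONCURRENTLY builds the index without blocking writes (run outside a transaction)
CREATE INDEX CONCURRENTLY idx_wallet_transaction_currency_id
    ON wallet_transaction (currency_id);
```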

&lt;p&gt;In some cases, when you have complex filters, the Postgres planner performs sequential scans even if you have proper indexes. In those cases, consider creating indexes with more complex expressions, ordering, and other parameters, matching the WHERE expressions of the SELECT query, or create composite indexes over 2-3 columns if that is closer to the WHERE expressions. Also, check the &lt;code&gt;random_page_cost&lt;/code&gt; parameter and consider lowering it; for SSDs it should be between 1.1 and 1.5.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preventive actions
&lt;/h2&gt;

&lt;p&gt;Earlier we discussed a lot of pain, but the good news is that we can avoid most of it. We can identify and fix problematic queries peacefully, during working hours, with proper testing, and without any impact on clients. In my routine, I simply set up a recurring calendar event on the second Monday of each month to reserve time for Postgres analysis. Once a month I look at the dashboards, dig through the logs, and play with the Postgres console to find, identify, and analyze potential problems, queries, and application behavior. The result is a set of tickets for our developers and my team. For complex cases I prepare the information to discuss the potential issues and preventive actions with the team and with the boss.&lt;/p&gt;

&lt;p&gt;Depending on the database change rate and the release cycle of your app, you can schedule the Postgres analysis sessions every 2 to 4 weeks.&lt;/p&gt;

&lt;p&gt;It is very helpful for preventing issues to adopt the EXPLAIN rule in your organization. According to this rule, when your backend developers develop something, they must check &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; for all new or modified SELECT, SELECT … FOR UPDATE, and UPDATE queries and ensure that all of them run efficiently.&lt;/p&gt;

&lt;p&gt;With these simple steps, in a couple of months you will optimize your database and prevent a lot of issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  You did a lot of optimizations, but it did not help
&lt;/h2&gt;

&lt;p&gt;You did a lot of optimizations: increased IOPS, added CPU and memory resources, fixed non-optimal queries, but it did not help. Maybe it is time to switch application reads to read replicas? There are a couple of problems with that, but it surely helps a lot to focus the CPU, memory, and IOPS of the cluster primary on processing transactions and move all the read workload and analytics to replicas. The pain points of read replicas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a read replica needs near-real-time data synchronization with the primary,&lt;/li&gt;
&lt;li&gt;each read replica requires the same resources as the primary: a cluster with a primary and 2 read replicas takes 3x CPU, 3x RAM, and 3x disk space,&lt;/li&gt;
&lt;li&gt;your application has to support reads from another database, so you need application reconfiguration or fixes to use read replicas.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;IMHO, the ideal architecture with read replicas is 1 primary and 2 replicas with a small &lt;code&gt;max_standby_archive_delay&lt;/code&gt; (I use 15 to 30 seconds in production), plus 1 replica outside the cluster with a huge &lt;code&gt;max_standby_archive_delay&lt;/code&gt; of up to 30-90 minutes.&lt;br&gt;
Why is that?&lt;br&gt;
For application reads and to prevent data loss, we need a replica with data as fresh as possible, and a small &lt;code&gt;max_standby_archive_delay&lt;/code&gt; helps us achieve that: if the replication lag reaches &lt;code&gt;max_standby_archive_delay&lt;/code&gt;, the replica cancels running queries so it can catch up with the primary.&lt;br&gt;
For high availability, we need two replicas behind the pooler so the application can keep reading during maintenance or reconfiguration.&lt;br&gt;
We also know we have analytics queries returning huge datasets, which can prevent WAL logs from being applied on the replica and can lead to a big replica lag. For those cases, we need a read replica with a high &lt;code&gt;max_standby_archive_delay&lt;/code&gt;. Usually, such a workload does not require high availability, so we can run just one lag-tolerant read replica to save resources.&lt;br&gt;
With read replicas, you also have to watch the replication lag carefully. An example query to get the current replica lag in seconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT EXTRACT(EPOCH FROM NOW() - 
pg_last_xact_replay_timestamp()) AS replication_lag_seconds;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here are links to the Postgres manifests for the Zalando operator: &lt;a href="https://github.com/petrushinvs/databases-in-prod/blob/main/postgres/kubernetes/pg-prod-grow10x.yaml" rel="noopener noreferrer"&gt;1 primary, 2 replicas with increased resources&lt;/a&gt;,&lt;br&gt;
and a &lt;a href="https://github.com/petrushinvs/databases-in-prod/blob/main/postgres/kubernetes/pg-single-read-replica.yaml" rel="noopener noreferrer"&gt;single-replica cluster for analytics with a high max_standby_archive_delay&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Debugging long-running queries on read replicas
&lt;/h2&gt;

&lt;p&gt;Queries returning huge numbers of rows can cause problems on read replicas. Imagine a query with no or minimal filters running against a huge 100GB+ table. The problem is not only the query itself or its CPU and I/O usage: it also takes a lot of time for the application to retrieve the data. On the server it abuses the buffers, and on a read replica it causes locks and delays in applying WAL logs. You will also see this query in the Postgres logs as long-running. As I said before, if the lag reaches &lt;code&gt;max_standby_archive_delay&lt;/code&gt;, the read replica will cancel all queries until it catches up with the primary, with the error &lt;code&gt;ERROR:  canceling statement due to conflict with recovery&lt;/code&gt;. To prevent such behavior and protect your read replicas, all SELECTs returning a lot of data must be paginated and must have ordering and limits. An example of a bad query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM wallet_transaction 
WHERE currency_id = 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, you have to check for an index on currency_id and create one if it is missing. But even so, the query will run for a long time and return a huge amount of data if the wallet_transaction table is big. On a read replica, WAL application will be blocked while the client is consuming the query result.&lt;br&gt;
&lt;/p&gt;
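&lt;p&gt;A sketch of the index check and creation (the index name is my own choice):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- list existing indexes on the table
SELECT indexname, indexdef FROM pg_indexes
WHERE tablename = 'wallet_transaction';

-- create the missing index without blocking writes
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_wallet_transaction_currency_id
ON wallet_transaction (currency_id);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;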

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM wallet_transaction 
WHERE currency_id = 10 
ORDER BY created_at DESC
OFFSET 10100 LIMIT 100;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query is better, but it is still bad. It limits the returned dataset, but Postgres will scan the full table and/or indexes anyway, consuming CPU, memory, and IOPS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM wallet_transaction
WHERE currency_id = 10 
AND created_at BETWEEN NOW() - INTERVAL '1 month' AND NOW()
ORDER BY created_at DESC
OFFSET 10100 LIMIT 100;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the best query. Now the result is limited, it takes little time to deliver the data to the client, and the index/table scans are bounded by date. It is also paginated, so your app can fetch and show all the data to a user.&lt;/p&gt;

&lt;p&gt;In real life, in my experience, the first query could take from 10 seconds to minutes and could lead to query-cancelling errors; the second query could take 1-15 seconds, and the third 5-500 milliseconds. So, for big tables, always use ORDER BY/OFFSET/LIMIT to limit the results and time intervals to limit the data selection and scans.&lt;/p&gt;
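&lt;p&gt;As a further step beyond OFFSET/LIMIT (my own suggestion, not shown above), keyset pagination avoids rescanning the skipped rows entirely: instead of an offset, the next page starts from the last &lt;code&gt;created_at&lt;/code&gt; value the client has already seen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM wallet_transaction
WHERE currency_id = 10
AND created_at &amp;gt;= NOW() - INTERVAL '1 month'
AND created_at &amp;lt; '2025-03-01 12:00:00'  -- last created_at from the previous page
ORDER BY created_at DESC
LIMIT 100;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;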

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;In this second part, we discussed the second stage of Postgres database growth problems. The main root causes are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;high temporary files usage,&lt;/li&gt;
&lt;li&gt;locks and deadlocks in the database,&lt;/li&gt;
&lt;li&gt;non-optimal table structure and lack of indexing, which causes sequential scans,&lt;/li&gt;
&lt;li&gt;huge result sets from SELECT queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I estimate this growth stage in the following numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;database size 350GB to 2TB,&lt;/li&gt;
&lt;li&gt;there are big tables from 50GB to 1TB,&lt;/li&gt;
&lt;li&gt;there are tables with 100+ million to billions of records,&lt;/li&gt;
&lt;li&gt;QPS (queries per second) from 3,000 to 10,000-20,000 and more,&lt;/li&gt;
&lt;li&gt;the number of clients/orders/views and/or other business metrics are growing 2x to 10x per 6-12 months.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this stage, you are going to face new problems and challenges, and I showed the debugging process from query/database optimizations to making decisions about and debugging read replicas.&lt;br&gt;
I also provided some examples of useful queries to detect problems, and Postgres cluster manifests for databases. Please check my GitHub repo: &lt;a href="https://github.com/petrushinvs/databases-in-prod" rel="noopener noreferrer"&gt;https://github.com/petrushinvs/databases-in-prod&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Good luck.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/redlineeyes/postgres-indexes-analysis-26pa"&gt;Postgres indexes review&lt;/a&gt; has been added.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>devops</category>
    </item>
    <item>
      <title>Postgres in prod. Part 1 - setup and startup for growth.</title>
      <dc:creator>Vasilii Petrushin</dc:creator>
      <pubDate>Thu, 30 Jan 2025 11:26:59 +0000</pubDate>
      <link>https://forem.com/redlineeyes/postgres-in-prod-part-1-setup-and-startup-4fl1</link>
      <guid>https://forem.com/redlineeyes/postgres-in-prod-part-1-setup-and-startup-4fl1</guid>
      <description>&lt;p&gt;In this article, I wanted to share my experience gained with the pain and challenges of business growth and other database-related problems in production use. When your system starts with an empty database and the number of users and requests is low, everything is easy, and it's a good time to prepare for future high loads, to prepare for great success in business terms. My approach to systems and infrastructure engineering is to focus on being ready to support the business growth and success, not only on current operations. You start with your business from some small numbers, then grow 2x, 4x, 10x, and at some point, the business growth skyrockets, and you have to handle this or lose. It’s a challenge and no one wants to lose.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67vflrnher0y12dyxlbz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67vflrnher0y12dyxlbz.png" alt="Great success!" width="800" height="761"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is my experience building and maintaining databases to support business growth. There will be 3 major parts: the beginning - set up to grow, from 2x to 10x - the basic steps of optimizations, and from 10x to skyrocket - complex optimizations and architecture for a fast-growing business.&lt;/p&gt;

&lt;p&gt;Every big cloud like AWS, Google, or Azure has its own perfectly managed solution for Postgres databases, with a lot of nice docs on how to implement it according to the rules of a well-architected framework; you just pay for it ;) and live with its restrictions. The key benefit of running Postgres in Kubernetes on-premise is nearly unlimited IOPS: current SSDs and NVMe drives provide 100,000-120,000 IOPS, while AWS RDS limits you to 16,000 per instance. They have dedicated IO too, but it is costly, with a price starting from the monthly salary of a senior DevOps ;). The second great benefit: Postgres on dedicated hardware is 5 to 10 times cheaper than a managed solution. So, I vote for Postgres and Kubernetes to save some cash and invest it in business growth and/or in my pocket.&lt;/p&gt;

&lt;p&gt;To build a good and reliable solution there is nothing to invent; we are just going to follow best practices. Best practices are pretty much the same across the IT industry and suitable for any cloud platform or on-premise deployment. Examples are the &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/welcome.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework&lt;/a&gt; or the &lt;a href="https://cloud.google.com/architecture/framework" rel="noopener noreferrer"&gt;Google Cloud Architecture Framework&lt;/a&gt;; just read them. At the start, we are going to run Postgres at a small/medium instance size, but in a safe, reliable, and ready-to-grow configuration.&lt;/p&gt;

&lt;p&gt;There are two major ways to deploy Postgres to Kubernetes. The first is using Helm charts like &lt;a href="https://github.com/bitnami/charts/tree/main/bitnami/postgresql-ha" rel="noopener noreferrer"&gt;bitnami/postgresql&lt;/a&gt; or &lt;a href="https://github.com/sergelogvinov/helm-charts/tree/main/charts/postgresql-single" rel="noopener noreferrer"&gt;Serge’s postgresql-single&lt;/a&gt;, which has some benefits over the bitnami chart. The second way is to use a Kubernetes operator. There are a couple of them; here we will talk about the &lt;a href="https://github.com/zalando/postgres-operator" rel="noopener noreferrer"&gt;Zalando Postgres operator&lt;/a&gt;. The Helm way is better if you do not plan to run more than one or two Postgres clusters per Kubernetes cluster. If you realize you will run more, then the operator way is for you, with all its automation and management advantages.&lt;br&gt;
Here is an example of PostgreSQL database manifest for Zalando Postgres Operator &lt;a href="https://github.com/petrushinvs/databases-in-prod/blob/main/postgres/kubernetes/pg-prod.yaml" rel="noopener noreferrer"&gt;pg-prod.yaml&lt;/a&gt;.&lt;br&gt;
This example was tested for intensive transaction processing with query rates up to 3000-4000 per second and database sizes up to 350GB (data and indexes). That is enough to start most projects with thousands of active users: medium-volume webshops, news and community websites, medium-sized gaming or gambling, etc. We assume that we have already set up the operator and an S3 bucket with encryption and access control for backups and WAL-log storage. &lt;/p&gt;
&lt;h2&gt;
  
  
  Let’s review the config.
&lt;/h2&gt;

&lt;p&gt;Please, open this &lt;a href="https://github.com/petrushinvs/databases-in-prod/blob/main/postgres/kubernetes/pg-prod.yaml" rel="noopener noreferrer"&gt;manifest&lt;/a&gt; in the new tab or window and let's have a look.&lt;/p&gt;

&lt;p&gt;We want maximum availability with minimum downtime, so &lt;code&gt;numberOfInstances: 2&lt;/code&gt;. It will create a classic Postgres primary-secondary cluster pair managed by the Zalando Postgres operator and Patroni under the hood.&lt;/p&gt;
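&lt;p&gt;A minimal sketch of how this looks in a Zalando operator manifest (the name and sizes here are illustrative; see the full manifest linked above for the real values):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: pg-prod
spec:
  teamId: "pg"
  numberOfInstances: 2
  volume:
    size: 100Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;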

&lt;p&gt;The top lines in the config are about name, version, etc. Let's look at the Postgres config part. &lt;br&gt;
First, I prefer autovacuum enabled for small and medium-sized deployments. On the one hand, autovacuum keeps tables in good condition for efficient data searches and manipulation; on the other, it does not affect overall Postgres performance much. For huge, high-load deployments you can disable autovacuum; I am going to discuss that in future articles.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      autovacuum_analyze_scale_factor: "0.1"
      autovacuum_vacuum_scale_factor: "0.2"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second, very important part is logging. With these settings we will control slow queries, lock waits, and temporary files, which heavily affect overall Postgres performance, so all issues can be detected and fixed quickly. I will give detailed explanations of each type of issue and how to fix it in the next articles. For detailed migration control, data protection, and security reasons, we also want to log all DDL (data definition language) queries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      log_destination: "stderr"
      log_connections: "off"
      log_disconnections: "off"
      log_min_duration_statement: "100ms"
      log_statement: "ddl"
      log_lock_waits: "on"
      log_temp_files: "0"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next part is connection limits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      max_connections: "150"
      superuser_reserved_connections: "5"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It’s simple: Postgres spends a lot of resources on managing connections. For production use, we must limit the number of connections to bound CPU/memory consumption, and we preserve a small connection pool for the superuser for maintenance, recovery, and problem-solving.&lt;br&gt;
&lt;/p&gt;
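&lt;p&gt;To see how close you are to the limit, you can count the current connections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT count(*) AS total,
       count(*) FILTER (WHERE state = 'active') AS active
FROM pg_stat_activity;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;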

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      max_standby_archive_delay: "900s"
      max_standby_streaming_delay: "900s"
      wal_level: "logical"
      max_wal_senders: "4"
      max_replication_slots: "4"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our cluster is built on the physical replication protocol; we set &lt;code&gt;wal_level&lt;/code&gt; to &lt;code&gt;logical&lt;/code&gt;, which allows physical replication and keeps logical replication available for the future. The max_wal_senders and max_replication_slots account for 1 or 2 standby replicas, base backups (if you use them), and 1 or 2 reserved slots for future physical replication, for example, if you want replicas running in another region or for data analytics. We must also set limits for the delay between secondary and primary. Some queries can cause significant replication delays, but we want the replica to hold the most recent data. With these parameters set, the secondary will cancel running queries, decreasing its workload, to let itself catch up with the primary. There is also a Prometheus metric to control the replica lag: &lt;code&gt;pg_replication_lag&lt;/code&gt;. Here is an example of a Grafana expression to track the replica lag; adding a threshold comparison to it will give you the alert expression.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by(kubernetes_pod_name) (pg_replication_lag{source="$cluster",kubernetes_pod_name="$instance"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The parallelism settings below work more or less well for instances with 2 to 10 CPU cores; for the actual values, refer to the resources section of the manifest.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      max_wal_senders: "4"
      max_replication_slots: "4"
      max_worker_processes: "16"
      max_parallel_workers: "8"
      max_parallel_workers_per_gather: "2"
      max_parallel_maintenance_workers: "2"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Memory and other configuration options are as follows. The work_mem and maintenance_work_mem values are good for orders/financial data processing, where most rows are numbers or relatively small varchars, text, or JSON. If your workload and most of your data are big texts, JSONs, or blobs, consider increasing these values, and increase the memory requests/limits in the resources section too.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      max_wal_size: "4GB"
      min_wal_size: "2GB"
      wal_keep_size: "2GB"
      effective_cache_size: "4GB"
      shared_buffers: "2GB"
      work_mem: "64MB"
      maintenance_work_mem: "256MB"
      # we're on ssd
      effective_io_concurrency: "100"
      random_page_cost: "1.1"
      # enable the extensions, pg_stat_statements is a must-have for production
      shared_preload_libraries: "pg_stat_statements,pg_cron,pg_trgm,pgcrypto,pg_stat_kcache"
      track_io_timing: "on"
      pg_stat_statements.max: "1000"
      pg_stat_statements.track: "all"
      cron.database_name: "postgres"
      synchronous_commit: "local"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These podAnnotations are very important: they protect our Postgres pods from eviction and enable monitoring.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  podAnnotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    prometheus.io/port: "9187"
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To protect production from connection storms and improve connection management, enable the connection pooler (pgbouncer).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  enableConnectionPooler: true
  enableReplicaConnectionPooler: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For backups I chose the logical backup method supported by the operator: I wanted the operator to take care of backups with its own supplied methods; otherwise, you have to set up your own backup solution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  enableLogicalBackup: true
  logicalBackupSchedule: "30 10 * * *"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now check the &lt;code&gt;additionalVolumes&lt;/code&gt; and sidecars sections. We are going to enable promtail as a sidecar to ship the logs to our Loki service, and postgres-exporter for advanced monitoring. The configs are in the configmaps supplied in the manifest. The promtail config is pretty straightforward. Let’s check the &lt;code&gt;postgres-monitoring-queries&lt;/code&gt; &lt;a href="https://github.com/petrushinvs/databases-in-prod/blob/main/postgres/kubernetes/pg-prod.yaml#L205-L275" rel="noopener noreferrer"&gt;configmap&lt;/a&gt;. We collect the pg_stat_statements set of metrics for the query analysis dashboard to manage query efficiency and to get the basic info for future query optimizations. The pg_replication_lag metric is also here; it is very important for managing data quality on replicas. The third metric, pg_postmaster_start_time_seconds, is good for Postgres uptime/restart management and alerting.&lt;/p&gt;
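&lt;p&gt;As an illustration of what pg_stat_statements gives us (these column names are for Postgres 13+; older versions use total_time/mean_time), the top time-consuming queries can be fetched directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       left(query, 60) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;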

&lt;p&gt;Under the hood, the Zalando Postgres operator takes care of many things and helps us a lot with routine maintenance. In operator configuration, we set up S3 storage for backups and WAL-log archive, backup options, healthchecks, timeouts, etc. After deployment, we will have WAL-logs and backups shipped to S3, and the operator will take care of its retention. Also operator will gracefully restart Postgres instances and poolers for reconfiguration, or in case of failures.&lt;/p&gt;

&lt;p&gt;Now a few words about monitoring and alerting. With Prometheus or VictoriaMetrics installed, our Postgres metrics will be scraped. For Grafana I would recommend setting up the &lt;a href="https://github.com/petrushinvs/databases-in-prod/tree/main/infra/grafana" rel="noopener noreferrer"&gt;kubernetes-pods, pg-monitoring, pgbouncer, and pg-replication-lag dashboards&lt;/a&gt;. It would also be great to set up pg-query-overview.json and pg-query-drilldown.json. These dashboards were originally built by Percona for its PMM and then quickly ported by me to pure Grafana and VictoriaMetrics/Prometheus, so their configuration could be tricky and take time. But you will get a cool picture of pg_stat_statements_calls, the most time-consuming queries, the most called queries, etc., and insights from query analytics: timings, buffers, reads, et cetera. And you will not need to set up any third-party Postgres monitoring tools like pgbadger, SolarWinds Postgres analytics, and others.&lt;br&gt;
The alert set we use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;alert that PostgreSQL is running: fire if &lt;code&gt;pg_up != 1&lt;/code&gt; or on &lt;code&gt;no-data&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;alert on &lt;code&gt;pg_replication_lag &amp;gt; max_standby_archive_delay&lt;/code&gt; or any number you consider reasonable&lt;/li&gt;
&lt;li&gt;alerts like &lt;code&gt;PersistentVolume is filling up&lt;/code&gt; from default alerts set&lt;/li&gt;
&lt;li&gt;alerts like &lt;code&gt;High CPU iowait&lt;/code&gt; from node-exporter to control IO&lt;/li&gt;
&lt;li&gt;a set of basic Kubernetes alerts like &lt;code&gt;PodCPULimit&lt;/code&gt;, &lt;code&gt;ContainerCrashLooping&lt;/code&gt;, &lt;code&gt;KubePodNotReady&lt;/code&gt;, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition to alerting, I would recommend referring to the &lt;a href="https://github.com/samber/awesome-prometheus-alerts" rel="noopener noreferrer"&gt;Awesome Prometheus Alerts&lt;/a&gt; project to get ideas and solutions on how to monitor your infrastructure.&lt;/p&gt;
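&lt;p&gt;The replication lag alert can be sketched as a Prometheus rule (assuming the pg_replication_lag metric from postgres-exporter; the threshold and labels are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;groups:
  - name: postgres
    rules:
      - alert: PostgresReplicationLagHigh
        expr: pg_replication_lag &amp;gt; 900
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replica {{ $labels.kubernetes_pod_name }} lag exceeds max_standby_archive_delay"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;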

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;So, in the end, after applying the PostgreSQL manifest we have up and running:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the highly available Postgres cluster with primary and standby nodes,&lt;/li&gt;
&lt;li&gt;the encrypted WAL-logs storage and backup storage in the S3 bucket with auth and access control,&lt;/li&gt;
&lt;li&gt;pg-bouncer for primary with SSL encryption,&lt;/li&gt;
&lt;li&gt;pg-bouncer for replica with SSL encryption,&lt;/li&gt;
&lt;li&gt;Kubernetes service for direct connection to primary with SSL (cluster-name.namespace.svc),&lt;/li&gt;
&lt;li&gt;Kubernetes service for direct connection to replica with SSL (cluster-name-repl.namespace.svc),&lt;/li&gt;
&lt;li&gt;provisioned primary database,&lt;/li&gt;
&lt;li&gt;provisioned application users with creds in secrets,&lt;/li&gt;
&lt;li&gt;all stuff runs on special nodes (node groups), defined by nodeAffinity,&lt;/li&gt;
&lt;li&gt;logging subsystem is set up and allows control of slow queries, temporary files, data definition queries, and default Postgres logs,&lt;/li&gt;
&lt;li&gt;all basic Postgres metrics provided by postgres-exporter, with enhanced query and replication lag monitoring.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With VictoriaMetrics/Prometheus and Grafana, we also have a full monitoring and alerting solution to manage and maintain our databases. As your business grows, the workload will grow, and this setup provides you with the information to support that growth, detect failures, and tune your Postgres instance, database, tables, and queries. By most parameters, this Postgres and monitoring setup complies with the best practices of ‘well-architected frameworks’ and is ready to grow to 1-3 million users.&lt;br&gt;
Check the Postgres manifests for the Zalando operator in my GitHub repo: &lt;a href="https://github.com/petrushinvs/databases-in-prod/tree/main/postgres/kubernetes" rel="noopener noreferrer"&gt;https://github.com/petrushinvs/databases-in-prod/tree/main/postgres/kubernetes&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How to solve slow application responses caused by slow queries, Postgres performance degradation, replica lags, and other issues - see the next article, Part 2 - 2x to 10x growth, basic steps of query and database optimizations.&lt;/p&gt;

&lt;p&gt;Good luck.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/redlineeyes/postgres-in-prod-part-2-problem-solving-and-optimizations-heading-to-10x-growth-2fko"&gt;Part 2.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>devops</category>
      <category>database</category>
    </item>
  </channel>
</rss>
