<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: mayur</title>
    <description>The latest articles on Forem by mayur (@mayur555b).</description>
    <link>https://forem.com/mayur555b</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2082348%2Fef46910f-c6fc-4a0b-864a-cb1600da9f7d.jpg</url>
      <title>Forem: mayur</title>
      <link>https://forem.com/mayur555b</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mayur555b"/>
    <language>en</language>
    <item>
      <title>Oracle to Postgres : Happily Ever After edition</title>
      <dc:creator>mayur</dc:creator>
      <pubDate>Sat, 19 Oct 2024 23:17:38 +0000</pubDate>
      <link>https://forem.com/mayur555b/oracle-to-postgres-happily-ever-after-edition-5766</link>
      <guid>https://forem.com/mayur555b/oracle-to-postgres-happily-ever-after-edition-5766</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2qctwv6t9p23m5jb52d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2qctwv6t9p23m5jb52d.png" alt="Image description" width="500" height="665"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prelude :&lt;br&gt;
The launch of the Oracle Autonomous Database ignited a lively debate about the future of the database administrator (DBA) role.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvke5g5srq0cnl4eq134h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvke5g5srq0cnl4eq134h.png" alt="Image description" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Oracle envisioned a world where the DBA would become obsolete with a snap of their fingers, just like Thanos.&lt;/p&gt;

&lt;p&gt;However, Postgres said to Oracle “Hold my beer”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqjdcyi5hguok2rz3435.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqjdcyi5hguok2rz3435.png" alt="Image description" width="720" height="696"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few years before that, I made the leap from the Oracle world to Postgres, seeing Postgres as the Galactus of databases (the Eater of database worlds), ready to devour the competition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flx8b7zljy8rwbkr8u81w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flx8b7zljy8rwbkr8u81w.png" alt="Image description" width="720" height="378"&gt;&lt;/a&gt;&lt;br&gt;
PG : Eater of the database worlds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa7x0l9lr0gmv3og1b1b0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa7x0l9lr0gmv3og1b1b0.png" alt="Image description" width="720" height="720"&gt;&lt;/a&gt;&lt;br&gt;
Postgres Ecosystem : Credits to Vonng (Author of PIGSTY)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qkgpf1yo8liwvf4naml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qkgpf1yo8liwvf4naml.png" alt="Image description" width="720" height="405"&gt;&lt;/a&gt;&lt;br&gt;
Credits : @dotnetschizo&lt;/p&gt;

&lt;p&gt;The signs were everywhere. The two communities couldn’t have been more different. Oracle is flashy and glamorous, driven by the latest marketing buzz. In contrast, the Postgres community is a bubbling cauldron of innovation, with contributions flowing in from passionate developers around the globe. It was obvious to me that Postgres’s extreme extensibility would outshine any single company’s vision.&lt;/p&gt;

&lt;p&gt;The “Ever After” :&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrmowm53rcd3gkyrs991.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrmowm53rcd3gkyrs991.png" alt="Image description" width="480" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There’s been plenty of chatter about migrating from Oracle to Postgres, but most of the focus is on the planning and execution phases. Rarely do we hear about what happens afterward, like the happily-ever-after in a Disney movie.&lt;/p&gt;

&lt;p&gt;Having been involved in numerous Oracle-to-Postgres migrations, I often find myself called in to address performance issues or to fix critical elements overlooked during the migration. So, I thought, why not have a blog or a talk dedicated to the “ever after” of these migrations? Let’s explore what happens once the dust settles!&lt;/p&gt;

&lt;h2&gt;
  
  
  Watch out for those Materialized Views
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhprcp7p41uqy7mxox15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhprcp7p41uqy7mxox15.png" alt="Image description" width="429" height="720"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Oracle’s optimizer (planner) boasts a magical query rewrite option with materialized views. If the optimizer detects that any part of a SQL query matches a materialized view definition, it can seamlessly replace that portion of the SQL text with the materialized view. This clever trick saves significant time and resources typically spent on repetitive complex joins and aggregates over large tables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd4smsf2kxmixm0p4lqvy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd4smsf2kxmixm0p4lqvy.png" alt="Image description" width="719" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, Postgres doesn’t have an equivalent to this Mview rewrite feature in its planner. This means that while migration tools can help convert code and schema, they can’t detect or provide alternatives to such internal rewrites happening within the database engine.&lt;/p&gt;

&lt;p&gt;Imagine you’re celebrating a smooth Oracle to Postgres migration over the weekend, only to have your pager go off due to a critical incident in the weekly production batch. Oracle developers often leverage this shortcut for query optimization in batch processes or weekend and monthly jobs, where retrieving the most recent data isn’t a priority. Although daily OLTP loads might not benefit significantly from Mview query rewrites, less frequent or off-peak tasks often do. This is one reason why such issues can be hard to detect during pre-go-live smoke tests.&lt;/p&gt;

&lt;p&gt;It’s always better to identify query rewrites while still on Oracle, before the migration, rather than scrambling to optimize a heavy batch process over the weekend post-migration to Postgres. You can easily enable or disable Mview rewrite, allowing you to establish a baseline during load testing that covers all application aspects with the rewrite on, and then conduct another test with the Mview rewrite disabled. This process will highlight any outliers, helping you identify which modules require performance optimization on Postgres. These optimizations can then be tackled in a stress-test environment, ensuring a smoother transition and fewer surprises post-migration.&lt;/p&gt;
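
&lt;p&gt;On the Oracle side, query rewrite can be toggled per session, which makes the with/without-rewrite comparison easy to script during load testing. A minimal sketch in Oracle SQL (the surrounding load-test harness is up to you):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Baseline run: rewrite against materialized views enabled (the default)
ALTER SESSION SET query_rewrite_enabled = TRUE;

-- Comparison run: disable rewrite to see the "true" cost of each SQL,
-- roughly what Postgres will execute after migration
ALTER SESSION SET query_rewrite_enabled = FALSE;

-- Queries whose elapsed time explodes between the two runs are the
-- modules that will need optimization work on Postgres before go-live.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;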

&lt;h2&gt;
  
  
  Multi-node RAC &amp;lt;&amp;gt; Multi-node Patroni cluster
&lt;/h2&gt;

&lt;p&gt;Migration tooling and planning often overlook this critical aspect of database architecture. A three-node Oracle RAC setup doesn’t directly translate to a three-node Postgres Patroni cluster, typically used as a high-availability (HA) solution. The main reason is that all three nodes in an Oracle RAC handle write traffic, whereas Postgres replicas are read-only.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftw3n41e3tnz1ko6r2zf5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftw3n41e3tnz1ko6r2zf5.png" alt="Image description" width="714" height="720"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This becomes a headache if your application heavily relies on global temporary tables, as Postgres replicas can’t drop, create, or truncate these tables. The PGTT extension can help reduce code conversion effort by mimicking Oracle’s global temporary table implementation in Postgres. However, it still creates a temporary table on the fly the first time it is used in a session, so even with PGTT, code that uses temp tables must run on the primary. Apart from being a bottleneck when offloading some of the traffic to replicas, frequent drop-create of temp tables causes the Postgres catalog tables to bloat.&lt;/p&gt;
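
&lt;p&gt;For reference, a minimal sketch of the usual PGTT pattern (assuming the pgtt extension is installed; the table definition is illustrative, and you should check the extension’s README for the exact syntax your version supports):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- load the extension in the session (or via session_preload_libraries)
CREATE EXTENSION IF NOT EXISTS pgtt;
LOAD 'pgtt';

-- pgtt registers a persistent definition; the underlying temp table is
-- still created on the fly the first time a session touches it, which
-- is why this only works on the primary
CREATE /*GLOBAL*/ TEMPORARY TABLE batch_scratch (
    id      bigint,
    payload text
) ON COMMIT PRESERVE ROWS;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;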

&lt;p&gt;Oracle RAC implementations are usually designed to minimize global cache fusion (internode communication), meaning the application is logically modularized into multiple services, each interacting with a dedicated node. This setup is not just for high availability but also for load balancing.&lt;/p&gt;

&lt;p&gt;Even if we employ load balancers on the Postgres side, there’s always a limit to how much load can be offloaded to replicas due to their read-only nature. Therefore, when converting a multi-node RAC to a Postgres HA cluster, it’s crucial to anticipate the additional load on the primary node and upgrade the hardware accordingly. This might mean better CPUs, a higher-throughput instance, and higher limits in the connection pool. Moreover, concurrency-related issues might arise, necessitating thorough load testing before going live. So, brace for the load, test rigorously, and ensure your primary node is up to the challenge!&lt;/p&gt;

&lt;p&gt;During the PG-Ibiza conference, someone asked why we don’t use logical replication to create a multi-active, multi-node cluster as a replacement for Oracle RAC. While it’s possible to use a flavor of Bi-Directional Replication (EDB-BDR, PgEdge, AWS Pg_Active) or a custom setup, you would need to manage conflicts, split-brain scenarios, logical replication bugs, and resynchronization issues. &lt;a href="https://www.depesz.com/2023/06/04/a-tale-about-incomplete-upgrade-from-postgresql-12-to-14/" rel="noopener noreferrer"&gt;This blog&lt;/a&gt; by depesz highlights the pain of logical replication at scale. This means transitioning from the relatively low-maintenance environment of Oracle RAC to a multi-active setup that requires significantly more attention and care. For obvious reasons, this approach is unlikely to receive immediate approval from CTOs or infrastructure managers at large companies. However, if you have a single cluster to manage, it is feasible to pursue this solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scalar Subquery Caching
&lt;/h2&gt;

&lt;p&gt;Oracle introduced the fascinating feature of scalar subquery caching starting with version 11g. This nifty feature helps mitigate developer mistakes by optimizing slow subqueries in production. As a result, suboptimal code in Oracle could go unnoticed for ages. But once you migrate to Postgres, it can hit you like a wrecking ball. If these issues arise in parts of the code triggered by specific datasets, they can be tough to detect during load testing.&lt;/p&gt;

&lt;p&gt;Let me share an example. We had a tiny application with a small 150 MB database, running a single query most of the time. Despite its small size, it managed to bring an AWS c5.2xlarge instance (8 vCPUs, 16 GB RAM, SSD io1 storage) to its knees. There were frequent CPU spikes, the application encountered numerous timeout errors, and the server appeared overloaded. Developers blamed Postgres, pointing out that the same application had run smoothly on Oracle for months, even though the tables involved were only a few MBs in size.&lt;/p&gt;

&lt;p&gt;Upon examining the problematic query, I found it to be a simple SELECT statement with a few Common Table Expressions (CTEs) referenced multiple times as scalar subqueries. The explain plan revealed that one particular CTE was the root cause. Thanks to Depesz’s tool, it was clear that most of the time was spent on the CTE scan. This happened because Postgres executed the CTE scan for every row produced by the main query. In Oracle, the same CTE scan was cached via scalar subquery caching, allowing it to narrowly escape detection for a long time.&lt;br&gt;
Since this issue occurred on Postgres 13, we couldn’t use the enable_memoize option available from Postgres 14 onward, which provides a similar caching mechanism. That said, even with enable_memoize we’ve noticed performance degradation on other databases rather than significant benefits in production systems.&lt;/p&gt;

&lt;p&gt;The solution was pretty simple: break the large SQL into smaller parts using temp tables, and index the temp table built from the problematic CTE. There was a further option to put this in an immutable function call and utilize caching in Postgres, but the developers seemed happy with just temp-table indexing.&lt;/p&gt;
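
&lt;p&gt;A minimal sketch of the rewrite (the real schema is obfuscated, so the table and column names below are purely illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- 1. Materialize the problematic CTE once, instead of letting it be
--    re-executed as a scalar subquery for every output row
CREATE TEMP TABLE tmp_policy_status AS
SELECT policy_id, max(status_date) AS last_status_date
FROM   policy_status_history
GROUP  BY policy_id;

-- 2. Index and analyze so the per-row lookups become index scans
CREATE INDEX ON tmp_policy_status (policy_id);
ANALYZE tmp_policy_status;

-- 3. The main query now probes the small indexed temp table
SELECT p.*,
       (SELECT t.last_status_date
          FROM tmp_policy_status t
         WHERE t.policy_id = p.policy_id) AS last_status_date
FROM   policies p;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;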

&lt;p&gt;I had to obfuscate the plan, tables, and query due to company security policy, so the pictures below are messy.&lt;/p&gt;

&lt;p&gt;The offending query was a SELECT on a view; here is a partial view definition:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakurhkdi0ho1vtyhlxh6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakurhkdi0ho1vtyhlxh6.png" alt="Image description" width="720" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see cte_policy_status being used in multiple scalar subqueries above.&lt;/p&gt;

&lt;p&gt;The plan is big (300+ lines), so I will just copy the relevant part:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rwjpi7f7za0qwfko9rb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rwjpi7f7za0qwfko9rb.png" alt="Image description" width="720" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pj9p73mkup3t8m7tulj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pj9p73mkup3t8m7tulj.png" alt="Image description" width="720" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problematic query took around 3–4 minutes per execution, and there were multiple sessions executing it concurrently. Thanks to &lt;a href="https://explain.depesz.com/" rel="noopener noreferrer"&gt;depesz’s tool&lt;/a&gt;, you can clearly see that cte_policy_status is the main bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foq0gwpkmkpkrpmbkrpeq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foq0gwpkmkpkrpmbkrpeq.png" alt="Image description" width="710" height="720"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lx3ksav0tnklmuabi4w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lx3ksav0tnklmuabi4w.png" alt="Image description" width="720" height="334"&gt;&lt;/a&gt;&lt;br&gt;
Happily Ever After Version&lt;/p&gt;

&lt;p&gt;The problematic query finished in milliseconds after splitting it up with temp tables and indexing/analyzing them; logical I/O dropped significantly, and everyone lived happily ever after.&lt;/p&gt;

&lt;h2&gt;
  
  
  SQL Plan Baselines and Hints
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3shucks4heel20q2wdn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3shucks4heel20q2wdn.png" alt="Image description" width="720" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Oracle’s dbms_spm guarantees SQL execution plan stability, a feature we lack in Postgres for locking down a specific execution plan for a SQL statement. Postgres does have pg_hint_plan, but it’s not quite the same as dbms_spm. After migration, SQLs with SPM baselines can surprise you. However, most of these baselines are created to work around missing statistics or incorrect timing of stats collection, which can be resolved by reviewing the process flow and running ANALYZE on tables after any bulk data loads. For the remaining minor cases, pg_hint_plan can be a lifesaver. One example is the transactional outbox pattern implementation in microservice architectures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuftwyb1nlq78o0hfyd8u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuftwyb1nlq78o0hfyd8u.png" alt="Image description" width="720" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this pattern, every transaction includes a corresponding insert or update in the outbox. The data volume continuously fluctuates due to deletions based on polling frequency, and the table is constantly read by a message relay. This is a perfect recipe for disaster as it generates many dead tuples. Autoanalyze may collect stats when the outbox table is empty, resulting in subsequent full scans. Tiny cloud instances provisioned for microservices can quickly hit I/O, throughput, and CPU limits. Increasing autovacuum frequency helps but at the cost of instance throughput and CPU capacity. Even so, with many such full scan queries running on a bloated outbox table, the database can become overloaded.&lt;/p&gt;

&lt;p&gt;A simple index hint via pg_hint_plan on the outbox table’s queries can save DBAs from many sleepless nights. By proactively managing these potential pitfalls, you can ensure smoother operations and keep your database running efficiently.&lt;/p&gt;
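
&lt;p&gt;A minimal sketch of such a hint (assuming pg_hint_plan is installed and loaded; the outbox table, index, and relay query are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- The hint comment must lead the statement. Here we pin the relay's
-- polling query to an index scan even when autoanalyze last saw the
-- outbox empty and the planner would otherwise choose a seq scan.
/*+ IndexScan(outbox outbox_unprocessed_idx) */
SELECT id, event_type, payload
FROM   outbox
WHERE  processed = false
ORDER  BY id
LIMIT  100;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;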

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79h454cwmuh6bjo63qcb.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79h454cwmuh6bjo63qcb.gif" alt="Image description" width="853" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wxdp389ptx34dl9dg0f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wxdp389ptx34dl9dg0f.png" alt="Image description" width="720" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  To Err is human, To exception divine.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F12e5a3b4wbedrekgulvu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F12e5a3b4wbedrekgulvu.png" alt="Image description" width="720" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The reason for this distinction is evident in the PostgreSQL documentation, showcased in all its glory.&lt;/p&gt;

&lt;p&gt;Oracle and PostgreSQL behave differently in their approach to exceptions. In Oracle, one can perform various mental gymnastics within the exception handling section, utilizing statement-level rollbacks and commits. While I won’t delve into the functional differences here, it’s worth noting that when developers attempt to replicate Oracle’s behavior in PostgreSQL, they often resort to savepoints or unnecessarily complicate the exception handling logic. In PL/pgSQL, every exception block is internally implemented via a savepoint (a subtransaction).&lt;/p&gt;

&lt;p&gt;Using savepoints to mimic exception handling code can lead to catastrophic subtransaction SLRU lock wait events, potentially bringing your application to a complete halt.&lt;/p&gt;

&lt;p&gt;The famous GitLab incident…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rac1547vna0s5zbacjf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rac1547vna0s5zbacjf.png" alt="Image description" width="719" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Their Ruby code roughly translates to the flow below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgupib3tv4ccv6tvjgspw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgupib3tv4ccv6tvjgspw.png" alt="Image description" width="720" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here you can see the performance cliff caused by SLRU waits.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka399g95howwoatwqtd8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka399g95howwoatwqtd8.png" alt="Image description" width="800" height="177"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The good news is that starting with PostgreSQL 17, you can increase cache sizes related to SLRU, such as subtransaction_buffers, multixact_member_buffers, and notify_buffers, which helps mitigate the risk of such performance cliffs.&lt;/p&gt;
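
&lt;p&gt;A minimal sketch of those knobs (PostgreSQL 17+; the sizes are illustrative starting points, not recommendations, and they can only be set at server start):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# postgresql.conf
subtransaction_buffers   = '64MB'   # SLRU cache for subtransaction status
multixact_member_buffers = '32MB'   # SLRU cache for multixact members
notify_buffers           = '16MB'   # SLRU cache for LISTEN/NOTIFY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;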

&lt;p&gt;Additionally, you can use clever techniques to avoid exceptions altogether.&lt;br&gt;
For example: instead of an exception section for PK violations, you can use the ON CONFLICT clause.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudy17839zpjus4nd7wnx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudy17839zpjus4nd7wnx.png" alt="Image description" width="800" height="128"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An added benefit: ON CONFLICT ... DO NOTHING reduces dead tuples.&lt;/p&gt;
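
&lt;p&gt;A minimal sketch of the idea (table and columns illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- PL/pgSQL style to avoid: each EXCEPTION block opens a subtransaction,
-- and every rejected insert still leaves a dead tuple behind.
--   BEGIN
--       INSERT INTO accounts (id, name) VALUES (1, 'alice');
--   EXCEPTION WHEN unique_violation THEN
--       NULL;  -- swallow the duplicate
--   END;

-- Plain SQL alternative: no subtransaction, no dead tuple on duplicates
INSERT INTO accounts (id, name)
VALUES (1, 'alice')
ON CONFLICT (id) DO NOTHING;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;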

&lt;h2&gt;
  
  
  Last but not the least
&lt;/h2&gt;

&lt;p&gt;At PGConf-EU 2023, I gave a lightning talk about the hidden objects that can cause portability and potential licensing issues when using cloud vendor-provided tools for Oracle (self-hosted) to Postgres (cloud) migration. I was surprised to find that many DBAs were unaware of these issues in cloud environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1r69cvz4bumcwnwmmgmp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1r69cvz4bumcwnwmmgmp.png" alt="Image description" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3gq04tep1cdl11ijs83k.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3gq04tep1cdl11ijs83k.gif" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In just a year, the community has already come up with a solution. &lt;a href="https://github.com/dcgadmin/extensionmigrationassistant" rel="noopener noreferrer"&gt;ExtensionMigrationAssistant&lt;/a&gt; is an open-source tool designed to detect cloud vendor-specific objects and functions in a migrated Postgres schema. Additionally, the creators have established a company, DataCloudGaze, to assist with end-to-end Oracle to Postgres migrations. This rapid response is a testament to Postgres’s vibrant and innovative community.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqw76w4q0pv2slkcbm38.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqw76w4q0pv2slkcbm38.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, despite all the challenges moving out of Oracle, rest assured that Postgres’s dynamic community has got your back, constantly evolving and coming up with solutions to make your transition smoother and more efficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References :&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blogs.oracle.com/infrastructure/post/how-the-autonomous-database-will-change-the-dba-rolefor-the-better" rel="noopener noreferrer"&gt;https://blogs.oracle.com/infrastructure/post/how-the-autonomous-database-will-change-the-dba-rolefor-the-better&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.oracle.com/en/database/oracle/oracle-database/19/racad/introduction-to-oracle-rac.html#GUID-D04AA2A7-2E68-4C5C-BD6E-36C62427B98E" rel="noopener noreferrer"&gt;https://docs.oracle.com/en/database/oracle/oracle-database/19/racad/introduction-to-oracle-rac.html#GUID-D04AA2A7-2E68-4C5C-BD6E-36C62427B98E&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://explain.depesz.com/" rel="noopener noreferrer"&gt;https://explain.depesz.com/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datacloudgaze.com/post/extension-migration-assistance" rel="noopener noreferrer"&gt;https://www.datacloudgaze.com/post/extension-migration-assistance&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://about.gitlab.com/blog/2021/09/29/why-we-spent-the-last-month-eliminating-postgresql-subtransactions/" rel="noopener noreferrer"&gt;https://about.gitlab.com/blog/2021/09/29/why-we-spent-the-last-month-eliminating-postgresql-subtransactions/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://microservices.io/patterns/data/transactional-outbox.html?source=post_page-----11a3c27c011f--------------------------------" rel="noopener noreferrer"&gt;https://microservices.io/patterns/data/transactional-outbox.html?source=post_page-----11a3c27c011f--------------------------------&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.depesz.com/2023/06/04/a-tale-about-incomplete-upgrade-from-postgresql-12-to-14/" rel="noopener noreferrer"&gt;https://www.depesz.com/2023/06/04/a-tale-about-incomplete-upgrade-from-postgresql-12-to-14/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>oracle</category>
      <category>postgres</category>
      <category>dbmigration</category>
    </item>
    <item>
      <title>Schrödinger’s Backups: A Lesson in Uncertainty</title>
      <dc:creator>mayur</dc:creator>
      <pubDate>Thu, 17 Oct 2024 14:46:49 +0000</pubDate>
      <link>https://forem.com/mayur555b/schrodingers-backups-a-lesson-in-uncertainty-di9</link>
      <guid>https://forem.com/mayur555b/schrodingers-backups-a-lesson-in-uncertainty-di9</guid>
      <description>&lt;p&gt;Backups are a lot like Schrödinger’s cat. You think you have them, but when it’s time to check, they might not be what you expect! Even with the best tools in your arsenal, like pgBackRest, things can still go hilariously wrong. It’s not that pgBackRest falters; instead, the chaos is caused by human miscommunication and policy missteps. It’s a reminder that no matter how sharp the tool, it’s all about how you wield it!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pxmbutw0n3a5lpwt8xm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pxmbutw0n3a5lpwt8xm.png" alt="Image description" width="739" height="724"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incident 1 :&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, here’s the deal. A big miscommunication-mismanagement-misalignment went down among the top-level decision-makers behind the backup retention policy at one of my previous companies. Result? BOOM. Backup retention was set to 60 days in the pgBackRest stanza, instead of the seven years written in some obscure corner of Confluence (and it was a single stanza, no typical hot/cold separation, omg!!). One fine day, I went hunting for an old backup after a developer’s code mistake deleted last quarter’s data (hence the need to recover from an old backup)… and there was nothing! Zero, zilch, zip, nada, nothing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujzfod0j1tqvjrmcso62.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujzfod0j1tqvjrmcso62.gif" alt="Image description" width="360" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pgstef.github.io/2020/05/25/pgbackrest_preview_tour_of_retention_policy_options.html" rel="noopener noreferrer"&gt;This blog&lt;/a&gt; by pgstef is an excellent overview of retention policy options for those new to Postgres. We had weekly automated backup-restore tests configured, but they were ultimately futile in safeguarding against such a significant management mistake in defining the retention policy.&lt;/p&gt;
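
&lt;p&gt;For context, this is roughly the kind of stanza setting that bit us; a minimal pgbackrest.conf sketch (the values are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[global]
repo1-retention-full=60            # with daily fulls, roughly 60 days, then gone
repo1-retention-full-type=count    # retention counted in backups, not time

# a seven-year policy really wants a separate cold repo/stanza, e.g. with
# repo1-retention-full-type=time and a far larger retention window
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;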

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8w7cengeddd0rg0qepo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8w7cengeddd0rg0qepo.png" alt="Image description" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incident 2:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But wait, it gets better (or worse). The company had a second backup server with a different pgBackRest stanza for a few more databases, where retention was set to seven years as expected. Unfortunately, they had lumped cold and hot backups into a single stanza. What did this lead to? WAL archives going back to the dawn of time itself — like, 3.2 billion years ago. The storage ballooned to petabytes. &lt;strong&gt;PETABYTES&lt;/strong&gt;. Yes, you read that right. It was as if the company was trying to archive the entire universe. Talk about overkill.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhq27y5ug4kixvjl19yeo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhq27y5ug4kixvjl19yeo.png" alt="Image description" width="800" height="592"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 3:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Of course, we eventually separated the cold and hot backup stanzas for each database cluster. You’d think that would resolve the issue, right? Well, the quantum curse wasn’t entirely lifted. A single 32TB database took over an entire day to back up! Imagine the Recovery Time Objective (RTO) for a point-in-time recovery (PITR). 😭 Just contemplating it was enough to break our spirits. But as they say, we don’t bow to the god of death just yet.&lt;br&gt;
We upgraded pgBackRest from v1.1 to v2.46, increased process-max to 12, switched compression to lz4, upgraded the instance to r5b.16xlarge (64 vCPUs, 512GB RAM) and changed the storage to io2 Block Express (64K IOPS). This brought the full backup time down to 4 hours, with incrementals taking about 1 hour. Nonetheless, there’s still a looming threat: if the database growth rate isn’t curbed, the RTO will continue to rise.&lt;/p&gt;
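
&lt;p&gt;The pgBackRest side of that tuning boils down to a couple of lines; a minimal sketch (these values worked for us and are not universal recommendations):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[global]
process-max=12        # parallel processes for backup/restore
compress-type=lz4     # much lighter on CPU than gzip, so throughput goes up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;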

&lt;h2&gt;
  
  
  &lt;strong&gt;Enter PostgreSQL 17 to Save the Day:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Enter PostgreSQL 17 (I wrote this article when PG17 RC1 was released and used it for the demo; you should use the GA version). With this, we could do a full backup once and then rely on incremental backups going forward. No more waiting for days on end. For month-end and year-end backups? Just restore them on ZFS with high compression, and shut down the database when not in use. No more insane delays. Just streamlined, efficient recovery and peace of mind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Performing the Initial Full Backup&lt;/strong&gt;&lt;br&gt;
To set up your backup strategy, start with a full backup of your database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Creating Incremental Backups&lt;/strong&gt;&lt;br&gt;
After completing the full backup, you can perform incremental backups to capture only the changes made since the last backup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Using pg_combinebackup to Merge and Propagate Full Backups&lt;/strong&gt;&lt;br&gt;
PostgreSQL 17 introduces the pg_combinebackup utility to manage and combine incremental backups efficiently. It lets you merge incremental backups with a previous full backup, creating a new, up-to-date full backup without having to perform a full backup from scratch.&lt;/p&gt;

&lt;p&gt;Below are the steps demonstrating the above concept, tested on Postgres v17rc1:&lt;/p&gt;

&lt;p&gt;A) Make sure WAL summarization is turned on before you take the first full backup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7pziau4mpknh5b1f2vo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7pziau4mpknh5b1f2vo.png" alt="Image description" width="800" height="149"&gt;&lt;/a&gt;&lt;/p&gt;
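
&lt;p&gt;In case the screenshot is hard to read, this is roughly what that step looks like (run as superuser):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- summarize_wal must be on before the full backup, otherwise later
-- incremental backups have no WAL summaries to work from
ALTER SYSTEM SET summarize_wal = on;
SELECT pg_reload_conf();   -- takes effect without a restart
SHOW summarize_wal;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;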

&lt;p&gt;B) Sample database and dummy data generation for the tests (note: I will add more data between the incremental backups).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfg4b8ijgym4naavoq3p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfg4b8ijgym4naavoq3p.png" alt="Image description" width="800" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;C) Create the first full backup using pg_basebackup in the usual way&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff28jwjrtqlv6ztxqdt9v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff28jwjrtqlv6ztxqdt9v.png" alt="Image description" width="800" height="53"&gt;&lt;/a&gt;&lt;/p&gt;
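
&lt;p&gt;Roughly, as a sketch (paths and connection options illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# a plain full base backup; the backup_manifest it writes is what the
# incremental backups will later be based on
pg_basebackup -D /backups/full -c fast
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;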

&lt;p&gt;D) Create next incrementals and change data for tests&lt;/p&gt;

&lt;p&gt;The 1st incremental uses the backup_manifest of the first full backup; for the 2nd incremental we can use either the 1st incremental or the first full as the base. Here we use the 1st incremental as the base for the 2nd incremental.&lt;/p&gt;

&lt;p&gt;1st incremental&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0mg7vg0vgjhmiwlkwlh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0mg7vg0vgjhmiwlkwlh.png" alt="Image description" width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2nd incremental&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fae10iotjmlbk2euohnh0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fae10iotjmlbk2euohnh0.png" alt="Image description" width="800" height="144"&gt;&lt;/a&gt;&lt;/p&gt;
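
&lt;p&gt;The commands behind those two screenshots look roughly like this (paths illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 1st incremental: based on the full backup's manifest
pg_basebackup -D /backups/incr1 --incremental=/backups/full/backup_manifest

# 2nd incremental: chained off the 1st incremental's manifest
pg_basebackup -D /backups/incr2 --incremental=/backups/incr1/backup_manifest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;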

&lt;p&gt;E) Use pg_combinebackup to shift the full backup forward in time without taking another full backup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr02gou1f2fit2w3hwaky.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr02gou1f2fit2w3hwaky.png" alt="Image description" width="800" height="65"&gt;&lt;/a&gt;&lt;/p&gt;
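
&lt;p&gt;Again as a sketch (the backups are listed oldest first; paths illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# synthesize a new full backup from the chain full, incr1, incr2
pg_combinebackup /backups/full /backups/incr1 /backups/incr2 -o /backups/combined
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;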

&lt;p&gt;This is a very powerful concept: we are moving the full backup ahead in time, bringing it closer to the most recent incremental. Because of this, we not only save on resources but also reduce recovery time significantly.&lt;/p&gt;

&lt;p&gt;F) Restore new combined full backup and compare data&lt;/p&gt;

&lt;p&gt;I will restore it on another port (6666) for the sake of testing and easy comparison in blog format; you should always restore on a different host to minimize risk.&lt;/p&gt;

&lt;p&gt;Restoring database on port 6666:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmeeaeg64vq22oijp5x1l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmeeaeg64vq22oijp5x1l.png" alt="Image description" width="800" height="186"&gt;&lt;/a&gt;&lt;/p&gt;
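
&lt;p&gt;Roughly, as a sketch (the combined output is a regular data directory; paths, port, and the table being checked are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# fix permissions, then start a second instance off the combined backup
chmod 700 /backups/combined
pg_ctl -D /backups/combined -o "-p 6666" start
psql -p 6666 -c "SELECT count(*) FROM accounts;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;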

&lt;p&gt;Verifying data by comparing restored vs original&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkr1zn2nccpmzahfkcdy4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkr1zn2nccpmzahfkcdy4.png" alt="Image description" width="800" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL 17’s native incremental backup may not address human-induced challenges by default, but it certainly simplifies key aspects of backup management. It enables easier implementation of retention policies, facilitates hot and cold data separation, and allows for a single full backup followed by incremental backups indefinitely, significantly improving recovery time objectives (RTO).&lt;/p&gt;

&lt;p&gt;Finally, we have defeated Schrödinger and kept our backups alive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6c6x880g14zoe4oyatp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6c6x880g14zoe4oyatp2.png" alt="Image description" width="448" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PS: Quantum Gravity: Optimizing Cold Storage with ZFS&lt;/strong&gt;&lt;br&gt;
It’s a wild idea, not yet used in production (that I know of). Try ZFS for high compression and effective storage management. You can restore backups on ZFS, apply high compression (we were able to achieve a ~6x compression ratio with zstd), keep the restored DB shut down, isolate the machine from external connections completely, and you have a very efficient cold backup with a 100% guaranteed-restorable Postgres.&lt;/p&gt;
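
&lt;p&gt;A minimal sketch of the ZFS side (pool and dataset names illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# dataset tuned for a cold, compressed Postgres restore
zfs create -o compression=zstd tank/pg_cold
# ...restore the backup into /tank/pg_cold, verify it starts, then shut it down
zfs get compressratio tank/pg_cold   # we saw roughly 6x on our data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;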

</description>
      <category>database</category>
      <category>postgressql</category>
      <category>postgres</category>
      <category>dba</category>
    </item>
    <item>
      <title>Autovacuum Tuning in The Cloud Era</title>
      <dc:creator>mayur</dc:creator>
      <pubDate>Mon, 14 Oct 2024 09:30:03 +0000</pubDate>
      <link>https://forem.com/mayur555b/autovacuum-tuning-in-the-cloud-era-17lf</link>
      <guid>https://forem.com/mayur555b/autovacuum-tuning-in-the-cloud-era-17lf</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fby1ue4w9204olbfgoeug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fby1ue4w9204olbfgoeug.png" alt="Image description" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How often do you hear DBAs vent, "Autovacuum used to fly on bare-metal, but in the cloud, it feels like it's dragging forever!" Or they might say, "I've doubled the max_autovacuum_workers, yet dead tuples just keep stacking up," or "Autovacuum isn't getting the throughput our cloud provider promised!" And then there's the developer chiming in, "Every time Autovacuum kicks in on that big table, our application queries start timing out!" These frustrations are all too common in cloud environments, where expectations meet the harsh reality of "you get what you pay for."&lt;/p&gt;

&lt;p&gt;In the cloud era, PostgreSQL's Autovacuum tuning can feel like a tightrope walk - juggling application performance, costs, and autovacuum efficiency due to cloud-imposed limitations. The cloud puts strict caps on I/O and throughput. Push Autovacuum too hard, and it'll devour your application's resources, tanking performance. But dial it down too much, and bloat creeps in, eventually strangling the application's efficiency. It's all about finding that sweet spot.&lt;/p&gt;

&lt;p&gt;Often, DBAs increase autovacuum_max_workers and maintenance_work_mem, or adjust autovacuum_vacuum_scale_factor, but still wonder why autovacuum fails to clean the database effectively. This inefficiency arises because autovacuum throughput is constrained by autovacuum_vacuum_cost_limit and further limited by host restrictions in the cloud. Before we delve into tuning autovacuum, we need some metrics at hand. Autovacuum is known for being I/O-intensive, but just how much strain does it place on your system? Let's explore.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# First, let's see how much IO to expect.
# At most (best case: every page is a buffer hit, costed at vacuum_cost_page_hit),
# an autovacuum can do IO as shown below:
Max Autovacuum throughput = (1000/autovacuum_vacuum_cost_delay) * (autovacuum_vacuum_cost_limit/vacuum_cost_page_hit) * 8 KB (default block_size)

# For PG17 default settings (cost_delay = 2ms, cost_limit = 200, page_hit = 1):
Max Autovacuum throughput = (1000/2) * (200/1) * 8 KB = 800MB per second
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
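
&lt;p&gt;The real ceiling is lower once page misses and dirtied pages enter the cost accounting. You can pull the relevant knobs for your own instance with a quick query:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT name, setting, unit
FROM   pg_settings
WHERE  name IN ('autovacuum_vacuum_cost_delay',
                'autovacuum_vacuum_cost_limit',
                'vacuum_cost_limit',
                'vacuum_cost_page_hit',
                'vacuum_cost_page_miss',
                'vacuum_cost_page_dirty');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;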



&lt;p&gt;Now let's see what the I/O throughput limits are on different clouds, and at what cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Starting with AWS&lt;/strong&gt; (since it's the cloud most familiar to me):&lt;br&gt;
The answer below is old, but I just wanted to show that the throughput limit in AWS depends on both the instance class and the storage class.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0uinpkrrnqgydx69d03m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0uinpkrrnqgydx69d03m.png" alt="Image description" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Burstable instance class: its throughput is too small and generally not suitable for performance-demanding production workloads, so I'm ignoring it.&lt;/p&gt;

&lt;p&gt;Standard instance class: general-purpose standard instances of the latest Graviton series offer better throughput, up to 2.5GBps.&lt;br&gt;
Gbps/Mbps here is in bits, hence the division by 8 to get GB/s or MB/s.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F935mnz90rlvpiebsdmtw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F935mnz90rlvpiebsdmtw.png" alt="Image description" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Up to"&lt;/strong&gt; is a very vague term. I found another piece of the puzzle in the AWS documentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foajisauou9bjh3inxlpj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foajisauou9bjh3inxlpj.png" alt="Image description" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since maximum performance is guaranteed for only 30 minutes per day, we will focus only on baseline values.&lt;/p&gt;

&lt;p&gt;Next, storage classes and limits:&lt;br&gt;
We will only consider SSDs and the latest storage classes available.&lt;br&gt;
GP3: GP3 offers 15MB/s for tiny (below 400GB) databases and a customizable range of 62.5MB/s to 500MB/s for larger databases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dysxgtk0gq8l26svilb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dysxgtk0gq8l26svilb.png" alt="Image description" width="800" height="653"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;io2 :&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fov4lxevpwua0vezr7c26.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fov4lxevpwua0vezr7c26.png" alt="Image description" width="800" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For io2, we can reach up to 500MB/s as we crank up the IOPS knob.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9e7gqtmv0n1xc7qszvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9e7gqtmv0n1xc7qszvh.png" alt="Image description" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Cost : Addressing the Elephant in the room&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4fa4phn4qsdrl35riib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4fa4phn4qsdrl35riib.png" alt="Image description" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's check in the AWS pricing calculator how much we would need to spend to reach 500MB/s throughput capacity on AWS (in the cheapest region) for a 1TB RDS Postgres database. To simplify calculations and comparisons we will consider single-AZ RDS; you should always use multi-AZ in production.&lt;br&gt;
Instance class m7g.4xlarge (baseline is 625MB/s), storage io2 with 256000 IOPS&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42i56rrihkwpuqansimw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42i56rrihkwpuqansimw.png" alt="Image description" width="800" height="136"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02o3ckk8tkyq90mub074.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02o3ckk8tkyq90mub074.png" alt="Image description" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Holy IOPS!! 26K for 1TB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zs97vkgsqh1naqw3gnp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zs97vkgsqh1naqw3gnp.png" alt="Image description" width="500" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Don't worry, gp3 saves the day.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flp8an1xekr8h1i8sdy54.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flp8an1xekr8h1i8sdy54.png" alt="Image description" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxvigzmb6e5jtmzxd882.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxvigzmb6e5jtmzxd882.png" alt="Image description" width="800" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That seems like a reasonable cost if you have a high throughput requirement on RDS.&lt;/p&gt;

&lt;p&gt;Next, we will look at the pricing and capacity needed to achieve the same 500 MB/s throughput on Google Cloud SQL for PostgreSQL and Azure Database for PostgreSQL Flexible Server, the two closest counterparts of Amazon RDS for Postgres.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google :&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0nqg96fx3blcli7aqlk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0nqg96fx3blcli7aqlk.png" alt="Image description" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, on GCP, we will need a zonal extreme persistent disk and an N2 VM with 64 vCPUs to get 500 MB/s.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7taafrxjapb25a4y41wn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7taafrxjapb25a4y41wn.png" alt="Image description" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Google pricing calculator is quite limited: it does not allow selecting a zonal extreme persistent disk and only offers a generalized SSD cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure Database for PostgreSQL — Flexible Server&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We will need at least 16 vCores (for example, the D16ds_v5 instance), which can provide up to 600 MB/s of throughput.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffei0tdllsyc2rzghdhm8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffei0tdllsyc2rzghdhm8.png" alt="Image description" width="800" height="573"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Premium SSD v2 is still in preview as of this writing.&lt;/p&gt;

&lt;p&gt;So the estimated monthly cost would be around $1,400.00, slightly lower than AWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ug03gf8qju9o4olbhu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ug03gf8qju9o4olbhu3.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why did I dive into the cost of compute and storage? It's not just about convincing the cloud FinOps team that we need a throughput buffer (though that's part of it). As a DBA, you might find yourself in a production incident where a small capacity bump costs your company only a few hundred dollars but prevents a potential disaster worth thousands! Sometimes the difference between smooth sailing and costly downtime is simply increasing capacity proactively before things go south. 🚀&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgw01nmaz1ospp0yjt2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgw01nmaz1ospp0yjt2m.png" alt="Image description" width="500" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Juggling Cost, Autovacuum I/O, and Application Performance&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbb1fomvx6g3yp5v2hry2.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbb1fomvx6g3yp5v2hry2.jpeg" alt="Image description" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Autovacuum tuning: consider the following two scenarios:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Your “Pet” Database:&lt;br&gt;
If you only have a single beloved database to care for, with ample throughput capacity, you can fine-tune autovacuum_vacuum_cost_limit and autovacuum_max_workers to keep the vacuum backlog low. Regularly monitor dead tuples using pg_stat_all_tables to identify potential issues (see the sketch after this list). For very large tables that require special attention, adjust the table-level Autovacuum parameters while maintaining conservative settings for the rest.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Microservices Architecture:&lt;br&gt;
In a company operating a microservices architecture with hundreds of independent RDS PostgreSQL databases, cloud FinOps experts aim to optimize costs by selecting RDS instance classes and storage that align with the application's throughput and latency requirements, a practice known as right-sizing. However, this approach often overlooks Autovacuum. On an undersized instance, merely increasing autovacuum_vacuum_cost_limit, reducing autovacuum_vacuum_cost_delay, or adding more workers won't suffice.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
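
&lt;p&gt;A minimal sketch of the “pet database” routine. The thresholds and values are illustrative starting points, not recommendations, and big_hot_table is a hypothetical table name:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Tables with the largest dead-tuple backlog (threshold is illustrative)
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2) AS dead_pct,
       last_autovacuum
FROM pg_stat_all_tables
WHERE n_dead_tup &gt; 10000
ORDER BY n_dead_tup DESC
LIMIT 20;

-- Table-level overrides for a very large, frequently updated table,
-- keeping the global defaults conservative for everything else.
-- big_hot_table is a hypothetical name.
ALTER TABLE big_hot_table SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- vacuum after ~1% dead rows
    autovacuum_vacuum_cost_limit   = 2000,  -- allow more work per cycle
    autovacuum_vacuum_cost_delay   = 1      -- shorter naps between cycles (ms)
);
&lt;/code&gt;&lt;/pre&gt;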

&lt;p&gt;As a DBA, you must think creatively :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consider partitioning problematic tables so Autovacuum can operate on smaller chunks. With indexes local to partitions, index sizes shrink, which helps because Autovacuum vacuums every index of a table on each run.&lt;/li&gt;
&lt;li&gt;Detect and drop unused indexes; the fewer indexes, the quicker Autovacuum can finish (see the queries after this list).&lt;/li&gt;
&lt;li&gt;Minimize long-running transactions to prevent them from holding back the xmin horizon and rendering some Autovacuum runs ineffective.&lt;/li&gt;
&lt;li&gt;Keep autovacuum_work_mem large, around 1 GB.&lt;/li&gt;
&lt;li&gt;If Autovacuum on large tables still hogs throughput and impacts application performance, you may need to throttle it by lowering autovacuum_vacuum_cost_limit and raising autovacuum_vacuum_cost_delay.&lt;/li&gt;
&lt;li&gt;However, don’t go too conservative on cost limit and delay; this risks TXID wraparound, which can bring down your entire database. Always implement an Early Warning System for TXID wraparound (AWS offers &lt;a href="https://aws.amazon.com/blogs/database/implement-an-early-warning-system-for-transaction-id-wraparound-in-amazon-rds-for-postgresql/" rel="noopener noreferrer"&gt;a detailed guide&lt;/a&gt; on this).&lt;/li&gt;
&lt;/ul&gt;
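
&lt;p&gt;A few helper queries for the points above, sketched against the standard statistics views. Validate that idx_scan stays at 0 across a representative workload period (and across stats resets) before actually dropping anything:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Candidate unused indexes (never scanned since the last stats reset).
-- Unique indexes are excluded: they may enforce constraints.
SELECT s.schemaname,
       s.relname,
       s.indexrelname,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique
ORDER BY pg_relation_size(s.indexrelid) DESC;

-- Long-running transactions that can hold back the xmin horizon
SELECT pid,
       state,
       xact_start,
       now() - xact_start AS xact_age,
       backend_xmin,
       left(query, 60) AS query_snippet
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xact_start
LIMIT 10;

-- Raise autovacuum_work_mem instance-wide (picked up on config reload)
ALTER SYSTEM SET autovacuum_work_mem = '1GB';
SELECT pg_reload_conf();
&lt;/code&gt;&lt;/pre&gt;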

&lt;p&gt;Additionally, log these warnings and observe dead-tuple growth trends in your most critical databases. Share these findings with FinOps, advocating funding for more throughput capacity at both the instance and storage level.&lt;/p&gt;
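
&lt;p&gt;On the early-warning side, here is a minimal wraparound check you can schedule and alert on. The percentage is measured against the 2^31 - 1 hard limit; pick your own alert threshold well below it (autovacuum_freeze_max_age defaults to 200 million):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Transaction ID age per database; alert long before wraparound
SELECT datname,
       age(datfrozenxid) AS xid_age,
       round(100.0 * age(datfrozenxid) / 2147483647, 2) AS pct_towards_wraparound
FROM pg_database
ORDER BY xid_age DESC;
&lt;/code&gt;&lt;/pre&gt;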

&lt;p&gt;A picture is worth a thousand words:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fesp9sqxdlt945uqdd1m6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fesp9sqxdlt945uqdd1m6.png" alt="Image description" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>aws</category>
      <category>azure</category>
    </item>
  </channel>
</rss>
