Forem: Kris Iyer

Demystifying the AWS Advanced JDBC Driver: Pools, Plugins, and the Traps I Hit

Kris Iyer — Wed, 29 Apr 2026 19:17:33 +0000

Demystifying the AWS Advanced JDBC Driver: Pools, Plugins, and the Traps I Hit

Date: 2026-04-29
Status: Published

TL;DR

The AWS Advanced JDBC Driver wraps your database driver with a plugin chain that handles failover, read/write splitting, and connection monitoring. The critical gotcha: it can create internal connection pools separate from your application's HikariCP. If you're on v2.x with the F0 profile, you're hitting a hardcoded 30-connection ceiling regardless of your external pool config. The fix: upgrade to v3.x and use connectionPoolType=hikari with cp-MaximumPoolSize properties, or drop profiles entirely and configure plugins manually.

Key invariant: cp-MaximumPoolSize >= external maximumPoolSize to avoid the internal pool becoming your bottleneck.

Quick wins:

Check your driver version: v3.3.0+ recommended
If using F0 profile on v2.x, upgrade immediately
Set exception-override-class-name: software.amazon.jdbc.util.HikariCPSQLException
Keep socketTimeout=0 and let efm2 handle liveness detection
Mark read-only transactions with @Transactional(readOnly=true) to benefit from read/write splitting

Why I'm writing this

I spent a few hours chasing a performance regression that had no business existing. The service had a HikariCP pool configured for 50 connections per pod. I'd checked the Spring Boot YAML. The property names were right. The values were right. The configuration was loading at startup — I'd watched Hikari log it.

And yet, under load, the pool count plateaued at exactly 30. Not 50. Not 45. Thirty. Every time. Across every pod. Tomcat threads piled up behind a 10-second wait, connection creation time sat at 10,000 ms, and our p99 latency went vertical.

The answer, when I found it, was about two layers below where I'd been looking — inside a hardcoded lambda in a specific version of the AWS JDBC driver. I'd been tuning the wrong pool.

This post is what I wish I'd had at the start of that investigation. If you're running Spring Boot against Aurora PostgreSQL or MySQL through software.amazon.jdbc.Driver, there are a handful of things about how this driver actually works that aren't obvious from the README. Get them wrong and you get slow requests, or failed failovers, or both. Let me save you the trouble.

What the AWS Advanced JDBC Driver actually is

The docs call it a "wrapper," and that's literal — it's a thin java.sql.Driver that sits between your app and the underlying org.postgresql.Driver (or MySQL equivalent). Your URL ends up looking like this:

jdbc:aws-wrapper:postgresql://<endpoint>:5432/<db>?wrapperProfileName=F0

Everything after jdbc:aws-wrapper: is a conventional JDBC URL the wrapper passes down. What the wrapper adds is a plugin chain:

your application
  -> HikariCP (external, app-managed)
    -> aws-advanced-jdbc-wrapper
      -> [plugin 1] -> [plugin 2] -> ... -> [terminal plugin]
        -> org.postgresql.Driver
          -> Aurora instance (writer or reader)

Each plugin intercepts JDBC calls — getConnection, prepareStatement, execute — and can rewrite, retry, monitor, or split them. The plugins are why you're using this driver in the first place. They're what give you fast failover, read/write splitting, and enhanced failure monitoring. Everything else about driver configuration exists to serve the plugin chain.

Configuration profiles: convenience with teeth

The driver ships with named configuration profiles — presets that bundle a plugin list and a set of timeouts. The best-known is F0, which you turn on with wrapperProfileName=F0. F0 bundles "fast failover" — the recommended plugin set for Aurora.

Profiles are handy because they let an app team ship one URL parameter instead of a dozen properties. They're also the single biggest source of "how is this even possible?" incidents I've seen, because a profile can silently set properties you can't override from outside.

The F0 gotcha: a few hours I won't get back

Before v3.1.0, the F0 profile eagerly constructed a second, internal HikariCP pool — separate from your application's — with properties baked into a lambda at profile-load time. I didn't find this in the docs. I found it by decompiling the JAR:

// From DriverConfigurationProfiles.class in aws-advanced-jdbc-wrapper-2.6.8.jar
// (I verified this via bytecode decompilation after running out of other theories)
config.setMaximumPoolSize(30);                       // HARD CEILING
config.setConnectionTimeout(SECONDS.toMillis(10));   // 10-second wait on exhaustion

There is no property you can set to override these. The external pool config is ignored by the internal pool. The cp- property prefix (I'll get to it below) doesn't exist in v2.6.8 at all — the string "cp-" literally doesn't appear anywhere in the JAR.

Here's what was actually happening in the service at runtime:

App borrowed a logical connection from the external HikariCP (configured max = 50).
External HikariCP asked the wrapper for a physical connection.
The wrapper routed through its internal HikariCP (hardcoded max = 30).
Under load, the internal pool saturated at 30. Attempts 31–50 waited up to 10 seconds and then failed.
From my dashboards: external hikaricp.connections capped at 30, connections.pending climbed to about 170, and connections.creation.avg sat at 10,000 ms.

From the outside, this looks like a pool-sizing bug. I lost a few hours to it before the pieces clicked. The fix is a driver version bump.

v3.x: `cp-` properties and `connectionPoolType=hikari`

In v3.1.0 the driver added (PR #1658) a new URL parameter (documented under the read/write splitting plugin's internal connection pooling section):

?connectionPoolType=hikari

When that's set, the internal pool is built via HikariPooledConnectionProvider's no-arg constructor, which reads properties prefixed with cp- and forwards them to the internal Hikari config:

data-source-properties:
  cp-MaximumPoolSize: "50"
  cp-MinimumIdle: "5"
  cp-ConnectionTimeout: "30000"

The catch I hit next: cp- properties are silently ignored when wrapperProfileName=F0 is also active. The F0 preset supplies its own HikariPoolConfigurator lambda that takes precedence and still hardcodes maxPoolSize=30. F0 and cp-MaximumPoolSize cannot coexist. Pick one.

For Aurora with read/write splitting and proper pool sizing on v3.x, I dropped the profile and assembled the plugin list by hand:

spring:
  datasource:
    url: jdbc:aws-wrapper:postgresql://${database_endpoint}:5432/${db}?connectionPoolType=hikari&readerHostSelectorStrategy=roundRobin
    driver-class-name: software.amazon.jdbc.Driver
    hikari:
      connection-timeout: 60000
      maximum-pool-size: 50
      minimum-idle: 10
      data-source-properties:
        wrapperPlugins: readWriteSplitting,auroraConnectionTracker,failover,efm2
        cp-MaximumPoolSize: "50"
        cp-MinimumIdle: "5"
        cp-ConnectionTimeout: "30000"
        connectTimeout: "10000"
        loginTimeout: "10000"
        socketTimeout: "0"
        failureDetectionTime: "60000"
        failureDetectionCount: "5"
        failureDetectionInterval: "15000"
        monitoring-connectTimeout: "10000"
        monitoring-socketTimeout: "5000"
        monitoring-loginTimeout: "10000"
      exception-override-class-name: software.amazon.jdbc.util.HikariCPSQLException

This replaces what F0 was giving me (the plugin set and timeouts) while keeping cp-* effective.

When to use presets vs manual configuration

This is a gap in the official docs — there's no guidance on when presets are the right choice vs when you should go manual. Having dug through the source code and the preset codes, here's how I think about it.

The preset families:

Family	Pool type	Presets	What they're for
A / B / C	No pool	A0, A1, A2, B, C0, C1	Failover + monitoring only. No internal connection pooling. You bring your own (external) pool or don't pool at all.
D / E / F	Internal pool	D0, D1, E, F0, F1	Failover + monitoring + internal HikariCP pool (managed by the wrapper). `F0` is the most commonly referenced.
G / H / I	External pool	G0, G1, H, I0, I1	Designed for apps that manage their own pool externally. The wrapper does not create internal pools.
SF_ prefix	(matches base)	SF_D0, SF_D1, SF_E, SF_F0, SF_F1	Spring Framework variants — same as their base preset but with `readWriteSplitting` disabled (Spring handles routing via separate DataSource beans).

The number suffix indicates failure-detection sensitivity: 0 = normal, 1 = easy/less sensitive (or aggressive, depending on the family), 2 = aggressive.

The problem with pool presets (D/E/F families): every preset that creates an internal pool hardcodes the same HikariCP values in a lambda with no override mechanism:

Property	Hardcoded value	Overridable via `cp-*`?
`maxPoolSize`	30	No — preset lambda takes precedence
`connectionTimeout`	10 seconds	No
`minimumIdle`	2	No
`idleTimeout`	15 minutes	No
`keepaliveTime`	3 minutes	No
`validationTimeout`	1 second	No
`maxLifetime`	1 day	No
`initializationFailTimeout`	-1	No

This applies to D0, D1, E, F0, F1 and their SF_ variants — all of them hardcode maxPoolSize=30. The cp-* properties (like cp-MaximumPoolSize) are silently ignored when any of these presets are active, because the preset's HikariPoolConfigurator lambda overrides the HikariPooledConnectionProvider's property-reading path.

When to use a preset:

You're prototyping, running a small service, or don't have specific pool-sizing requirements.
maxPoolSize=30 and connectionTimeout=10s are acceptable for your workload.
You want a known-good plugin + timeout combination without thinking about individual settings.
You're using a no-pool preset (A/B/C family) and bringing your own external pool — these have no hardcoded pool values to collide with.

When to go manual (drop the preset):

You need to control maxPoolSize, connectionTimeout, or any other pool property — which is most production deployments. This is what I had to do.
You're running at non-trivial throughput where 30 connections per internal pool is a ceiling (this was my exact situation).
You want cp-* properties to actually take effect.
You're combining readWriteSplitting with @Transactional(readOnly=true) in Spring and need internal pools with custom sizing.

The manual approach means specifying connectionPoolType=hikari + wrapperPlugins=... + cp-* properties explicitly, instead of wrapperProfileName=F0. You lose the convenience of a single preset name, but you gain control over every property. For reference, the Configuration Presets docs list what each preset bundles, so you can replicate the plugin list and timeouts manually while overriding only the pool properties you need.

External pooling vs internal pooling — what each layer is actually doing

This is something most folks will need to pay attention to. These two layers are not redundant. They do different jobs.

External pool (my application's HikariCP, managed by Spring Boot)

Scope: one pool per Spring DataSource bean, typically one per pod.
Holds: logical connections — the java.sql.Connection objects my code calls .prepareStatement on.
Gates: how many threads can hold a connection concurrently. If this is 50, request #51 waits or times out.
Maps to: how many Tomcat threads can simultaneously sit inside a DB-touching request.

Internal pool (managed by the wrapper, one per Aurora instance)

Scope: with readWriteSplitting + connectionPoolType=hikari, one internal pool per Aurora instance — a writer pool, and one pool per reader. The wrapper routes logical connections to the right instance based on read-only hints (setReadOnly(true) or @Transactional(readOnly=true) in Spring).
Holds: physical connections — TCP/TLS sessions to a specific Aurora node.
Gates: how many physical sockets stay open to each instance.
Maps to: Aurora's per-instance max_connections. The default formula is LEAST({DBInstanceClassMemory/9531392}, 5000), so memory-rich instances like db.r7i.4xlarge (128 GiB) hit the 5,000 hard cap rather than scale further.

Why both are needed — and the official caveat

The external pool's logical connections are cheap — Java objects wrapping references into the internal pool. The internal pool's physical connections are expensive — TLS handshake, auth, wire protocol. The wrapper hands out a single logical connection from the external pool while keeping the physical session pinned to the correct instance (writer for writes, reader-N for reads).

Without the internal pool layer, every getConnection() from the external pool would open a fresh physical connection to some instance. That undoes HikariCP's entire point.

Important caveat from the AWS docs: the ReadWriteSplitting plugin documentation explicitly states:

"Using internal and external pools at the same time has not been tested and may result in problematic behaviour."

The docs go further and recommend disabling external connection pools entirely when using internal pooling:

"If you want to use the driver's internal connection pooling, we recommend that you explicitly disable external connection pools (provided by Spring). You need to check the spring.datasource.type property to ensure that any external connection pooling is disabled."

Here's the thing that's easy to miss: if your Spring Boot app has spring.datasource.hikari.* properties and connectionPoolType=hikari in the JDBC URL, you're running double pools whether you intended to or not. connectionPoolType=hikari only controls the wrapper's internal pool — it doesn't replace or disable the external one. Spring Boot independently auto-detects HikariCP on the classpath and creates the external HikariDataSource bean. Unless you explicitly set spring.datasource.type=org.springframework.jdbc.datasource.SimpleDriverDataSource, both pools are active. This is almost certainly the configuration most Spring Boot teams end up with.

In practice, I've run both pools together under sustained load without issues — but that's my workload, not a guarantee. The double-pool architecture works when you treat the external pool as a concurrency gate and the internal pools as physical-session caches, and keep cp-MaximumPoolSize >= maximumPoolSize so the internal layer never becomes the bottleneck. But if you're hitting edge cases — connections leaking, intermittent stale-connection errors after failover, or pool metrics that don't add up — this officially-untested interaction is the first thing to suspect.

So how do you actually disable the external pool?

This is the part I want to make crystal clear, because it's easy to think you've solved double-pooling when you haven't.

Why you're probably running double pools right now: Spring Boot auto-detects HikariCP on your classpath (it's pulled in by spring-boot-starter-data-jpa or spring-boot-starter-jdbc) and creates a HikariDataSource bean automatically. Setting connectionPoolType=hikari in the wrapper URL does not turn this off — that only tells the wrapper to create its own internal pools. These are two independent systems that don't know about each other.

If your application.yaml looks like this, you have two pools:

# THIS IS DOUBLE-POOLING — both pools are active
spring:
  datasource:
    url: jdbc:aws-wrapper:postgresql://...?connectionPoolType=hikari&readerHostSelectorStrategy=roundRobin
    driver-class-name: software.amazon.jdbc.Driver
    hikari:                          # ← Spring Boot sees this and creates external HikariCP
      maximum-pool-size: 50
      minimum-idle: 10
      data-source-properties:
        cp-MaximumPoolSize: "50"     # ← wrapper sees this and creates internal HikariCP
        cp-MinimumIdle: "5"

To run single-pool (internal only), set spring.datasource.type to a non-pooling DataSource implementation. This tells Spring Boot to skip HikariCP auto-detection. The catch: without the hikari: section, there's no data-source-properties: block to put your cp-* and wrapper properties in. You have two options.

Option A — pass everything as URL parameters. Reliable but the URL gets long:

# SINGLE-POOL (internal only) — cp-* and plugin config in the URL
spring:
  datasource:
    type: org.springframework.jdbc.datasource.SimpleDriverDataSource   # ← disables external HikariCP
    url: >-
      jdbc:aws-wrapper:postgresql://${database_endpoint}:5432/${database_name}
      ?connectionPoolType=hikari
      &readerHostSelectorStrategy=roundRobin
      &wrapperPlugins=readWriteSplitting,auroraConnectionTracker,failover,efm2
      &cp-MaximumPoolSize=50
      &cp-MinimumIdle=5
      &cp-ConnectionTimeout=30000
      &connectTimeout=10000
      &loginTimeout=10000
      &socketTimeout=0
      &failureDetectionTime=60000
      &failureDetectionCount=5
      &failureDetectionInterval=15000
      &monitoring-connectTimeout=10000
      &monitoring-socketTimeout=5000
      &monitoring-loginTimeout=10000
    driver-class-name: software.amazon.jdbc.Driver
    # No hikari: section — Spring won't create an external pool

Option B — use the wrapper's own DataSource class. The wrapper provides AwsWrapperDataSource which accepts properties directly, keeping the YAML clean:

# SINGLE-POOL (internal only) — using AwsWrapperDataSource
spring:
  datasource:
    type: software.amazon.jdbc.ds.AwsWrapperDataSource
    url: jdbc:postgresql://${database_endpoint}:5432/${database_name}   # ← note: no aws-wrapper: prefix
    driver-class-name: org.postgresql.Driver                            # ← the underlying driver, not the wrapper
    connection-properties:
      wrapperPlugins: readWriteSplitting,auroraConnectionTracker,failover,efm2
      connectionPoolType: hikari
      readerHostSelectorStrategy: roundRobin
      cp-MaximumPoolSize: "50"
      cp-MinimumIdle: "5"
      cp-ConnectionTimeout: "30000"
      connectTimeout: "10000"
      loginTimeout: "10000"
      socketTimeout: "0"
      failureDetectionTime: "60000"
      failureDetectionCount: "5"
      failureDetectionInterval: "15000"
      monitoring-connectTimeout: "10000"
      monitoring-socketTimeout: "5000"
      monitoring-loginTimeout: "10000"

Note the differences with AwsWrapperDataSource: the URL drops the jdbc:aws-wrapper: prefix (it's a plain jdbc:postgresql: URL since the wrapper IS the DataSource), and driver-class-name points to the underlying driver, not the wrapper. See the DataSource configuration docs for details.

To run single-pool (external only), remove connectionPoolType=hikari from the URL. The wrapper won't create internal pools, and every getConnection() from the external HikariCP opens a physical connection through the wrapper on-demand:

# SINGLE-POOL — only the external HikariCP is active
spring:
  datasource:
    url: jdbc:aws-wrapper:postgresql://...?readerHostSelectorStrategy=roundRobin
    driver-class-name: software.amazon.jdbc.Driver
    hikari:
      maximum-pool-size: 50
      minimum-idle: 10
      # No cp-* properties needed — no internal pool exists

Trade-offs at a glance

Configuration	External pool	Internal pool	What you get	What you lose
Double pool (most Spring Boot apps)	Spring HikariCP (`hikari:` section)	Wrapper HikariCP (`connectionPoolType=hikari` + `cp-*`)	Full Spring metrics, health checks, familiar config surface. Physical connections cached per Aurora instance.	Running an officially-untested combination. Two pools to reason about. Higher DB connection count than expected.
Internal only via `SimpleDriverDataSource` (`spring.datasource.type=SimpleDriverDataSource`)	Disabled	Wrapper HikariCP	The configuration AWS actually tests against. Clean single-pool model.	No `hikaricp.` Micrometer metrics from Spring. No HikariCP health indicator in `/actuator/health`. `cp-` properties must go in the URL — gets unwieldy with many parameters.
Internal only via `AwsWrapperDataSource` (`spring.datasource.type=software.amazon.jdbc.ds.AwsWrapperDataSource`)	Disabled	Wrapper HikariCP	AWS-tested single-pool model. Clean YAML via `connection-properties` block — no URL stuffing.	Same observability trade-offs as `SimpleDriverDataSource` (no Spring Hikari metrics/health). Different URL format (`jdbc:postgresql:` not `jdbc:aws-wrapper:postgresql:`) and `driver-class-name` points to the underlying driver. See DataSource docs.
External only (no `connectionPoolType` in URL)	Spring HikariCP	None	Familiar Spring config. Full metrics.	No per-instance physical connection caching. `@Transactional(readOnly=true)` with `readWriteSplitting` triggers a full connection switch per call (see Spring Boot limitation below).

Where I am with this

I've been experimenting with the double-pool setup and so far it's been working without problems under sustained load across multiple pods. The external pool gives you the Micrometer metrics that make diagnosing issues possible — the hikaricp.connections.pending signal is how I caught the F0 ceiling issue — and the internal pool gives you efficient physical-connection reuse across reader/writer instances. The key invariant is cp-MaximumPoolSize >= maximumPoolSize so the internal layer never becomes the bottleneck.

The one tangible downside I've observed: you use more database connections than you'd expect. The external pool holds logical connections while the internal pools independently hold physical connections per Aurora instance. In practice the connection count on Aurora ends up higher than what the external pool size alone would suggest, because the internal pools maintain their own minimum-idle and maximum-size independently. For a fleet of pods, this adds up — make sure your Aurora instance's max_connections has headroom for pods × cp-MaximumPoolSize × (1 + number_of_readers), not just pods × maximumPoolSize.

If you're hitting unexplained edge cases — connections leaking, intermittent stale-connection errors after failover, or pool metrics that don't add up — the officially-untested double-pool interaction is the first thing to suspect. Switching to internal-only (spring.datasource.type=SimpleDriverDataSource or AwsWrapperDataSource) is the cleanest way to eliminate it as a variable.

It's also worth noting that you can use the wrapper without HikariCP entirely — the internal pool with connectionPoolType=hikari is a self-contained HikariCP instance managed by the wrapper. If you're building a non-Spring app or a lightweight service, running only the internal pool is the cleaner architecture and avoids the double-pool question altogether.

F0 vs SF_F0: should Spring Boot apps use `readWriteSplitting`?

This is one of the more confusing areas in the docs, and it matters because it determines your entire read/write routing architecture.

From the source code:

Preset	Plugins	Internal pool
F0	`auroraInitialConnectionStrategy`, `auroraConnectionTracker`, `readWriteSplitting`, `failover`, `efm2`	Yes (maxPoolSize=30)
SF_F0	`auroraInitialConnectionStrategy`, `auroraConnectionTracker`, `failover`, `efm2`	Yes (maxPoolSize=30)

The only difference: SF_F0 drops readWriteSplitting. Both have the same internal pool. The SF_ prefix stands for "Spring Framework" — these variants are meant for Spring apps.

Why does the Spring variant disable read/write splitting?

The Spring Boot limitations section of the ReadWriteSplitting plugin docs explains:

The use of read/write splitting with the annotation @Transactional(readOnly=true) is **only* recommended for configurations using an internal connection pool.*

When Spring encounters @Transactional(readOnly=true), it calls conn.setReadOnly(true) before the method and conn.setReadOnly(false) after. The readWriteSplitting plugin responds by switching from writer→reader→writer on every annotated method call. Without an internal pool, each switch is a full TCP/TLS reconnect — the docs call this "substantial performance degradation." The SF_ presets sidestep this by disabling the plugin entirely and recommending two separate Spring DataSource beans instead (one for the writer cluster endpoint, one for the reader endpoint), letting Spring handle routing.

The contradiction: SF_F0 has internal pools — exactly the prerequisite the docs say makes readWriteSplitting safe. With internal pools, the setReadOnly toggle reuses cached physical connections from the per-instance pools (writer pool, reader pool), making the switch a cheap object swap rather than a TCP reconnect. So SF_F0 disables a plugin that should work fine with the internal pools it already provides.

My read: the SF_ presets were likely created before connectionPoolType=hikari made the internal-pool + readWriteSplitting combination clean and testable. The docs haven't fully reconciled this — they warn about the overhead, correctly note that internal pools mitigate it, but then the SF_ presets still disable it out of caution.

Three paths for Spring Boot read/write splitting:

Approach	readWriteSplitting plugin	How reads route to readers	Trade-off
Plugin with internal pools (what we use)	Enabled	`@Transactional(readOnly=true)` triggers `setReadOnly(true)` → plugin routes to reader via cached internal pool	Single DataSource bean. Clean. Requires internal pools for acceptable switching overhead.
Two DataSource beans (what SF_ presets assume)	Disabled	Spring's `AbstractRoutingDataSource` or `@Qualifier` annotations route to a writer or reader DataSource at the service layer	No plugin overhead. More application-level wiring. Each DataSource can independently use the wrapper for failover/monitoring.
Plugin without internal pools (don't do this)	Enabled	`setReadOnly` triggers a full physical connection switch per call	Substantial overhead. The docs explicitly warn against this.

If you're already on manual config with connectionPoolType=hikari and cp-* properties (which you need anyway for pool sizing), enabling readWriteSplitting works — the internal pools handle the switching cost. If you prefer the two-DataSource approach, use a no-readWriteSplitting configuration (like SF_F0's plugin list, but with manual pool sizing since the preset hardcodes maxPoolSize=30).

Either way, don't mix the two: having readWriteSplitting enabled while also routing via separate DataSources would result in double routing logic that's hard to reason about.

HikariCP and virtual threads: a known compatibility issue

If you're running on JDK 21+ and considering Spring Boot's spring.threads.virtual.enabled=true, there is an open HikariCP bug (#2398) to be aware of. The issue is filed against HikariCP 7.0.2: the ConcurrentBag.requite() method uses a yield-spin loop (Thread.yield() 255 times for every parkNanos) that saturates all carrier threads under virtual-thread load. The result is CPU throttling at the pod level and potential liveness-probe failures — the exact kind of silent performance regression that's hard to diagnose without knowing about this issue.

As of this writing, the proposed fix in PR #2399 has not been merged. Spring Boot 3.5.7's BOM pins HikariCP 6.3.3 by default rather than 7.x, and the bug report doesn't reproduce against the 6.x line — so check your effective HikariCP version before assuming you're affected. The workaround if you are is to disable virtual threads (-Dspring.threads.virtual.enabled=false). If you're running the AWS JDBC wrapper with HikariCP as your external pool and enabling virtual threads on a 7.x version, this is the interaction to watch — it's not a wrapper bug, but it surfaces at the same layer (connection pool) and looks similar in dashboards to the internal-pool ceiling problem I described earlier.

Sizing rule

For P pods, external pool size E, and R readers in the Aurora cluster, the physical connection footprint is:

Writer instance:   up to P * cp-MaximumPoolSize physical connections
Per reader:        up to P * cp-MaximumPoolSize physical connections
Total:             P * cp-MaximumPoolSize * (1 + R)

If cp-MaximumPoolSize is the bottleneck, logical getConnection() calls sit in the internal pool's wait queue — which is exactly the v2.6.8 failure mode, just on a newer version where you technically can fix it. The invariant to hold: cp-MaximumPoolSize >= external pool size so the internal layer never becomes the bottleneck. Going higher is fine as long as the total stays under Aurora's max_connections per instance with ~20% headroom.

Life of a single SELECT

When I was first onboarding someone to this, the thing that actually landed was walking through one request end-to-end:

Tomcat thread calls userRepository.findById(42).
Spring Data borrows a logical connection from external HikariCP (external pool count goes up by 1).
Transaction manager begins a tx. Say it's @Transactional(readOnly=true) — the read-only hint is set on the logical connection.
First real statement flows through the plugin chain. readWriteSplitting sees the read-only flag, picks reader-1 (round-robin), and routes to reader-1's internal pool.
Reader-1's internal pool hands over a physical session; the wrapper binds it to the logical connection for the rest of the tx.
Query executes on reader-1.
Tx commits. Physical session returns to reader-1's internal pool; logical connection returns to external Hikari.

The plugin catalog, and when I use which

Plugins are a comma-separated list on wrapperPlugins. Order matters. The driver applies them outside-in.

What I always run for Aurora

failover — detects Aurora writer/reader failover events via topology awareness, invalidates broken connections, reroutes to the current writer. Without this, a writer failover leaves the driver holding a dead TCP session until OS-level timeouts fire (minutes). (There's also a newer failover2 plugin worth evaluating for new deployments.)
auroraConnectionTracker — maintains the map of live connections per instance. failover needs it to know which connections to invalidate.
efm2 — Enhanced Failure Monitor v2. A background thread per connection probes the socket at failureDetectionInterval; if failureDetectionCount consecutive probes fail within failureDetectionTime, the connection is marked bad and failover kicks in. v2 is current; v1 / efm is deprecated and should not be used in new configs.

What I enable conditionally

readWriteSplitting — routes read-only transactions to readers, writes to the writer. Enable when you have one or more readers and your code marks read transactions properly (@Transactional(readOnly=true)). Without the hint, the plugin sends everything to the writer and you get no benefit. I've seen more than one team enable it and then wonder why their readers sit idle.
iamAuth — IAM-based auth instead of password. Enable if you're doing IAM to Aurora; otherwise skip.
awsSecretsManager — pulls creds from Secrets Manager at connection time. Overlaps with external secret-rotation workflows; I enable only if I'm not rotating through Kubernetes secrets.
federatedAuth / okta — SSO-style auth; niche in my experience.
dev / logQueryPlansWhenNeeded — debugging only, never prod.

My default stack for Aurora PG + HikariCP

wrapperPlugins: readWriteSplitting,auroraConnectionTracker,failover,efm2

I put readWriteSplitting first so routing happens before failover/topology logic — that way failover can reroute a connection to the "current" writer regardless of who it was bound to. efm2 is last because it's terminal: it wraps the underlying connection with monitoring.

Aurora with multiple readers: the configuration I'm shipping

This is what I'm running now against a 1 writer + 2 reader Aurora cluster. It's not the only sensible config, but I've run it in anger through a few load tests and it's the one I trust.

url: jdbc:aws-wrapper:postgresql://${endpoint}:5432/${db}?connectionPoolType=hikari&readerHostSelectorStrategy=roundRobin
hikari:
  connection-timeout: 60000
  maximum-pool-size: 50
  minimum-idle: 10
  data-source-properties:
    wrapperPlugins: readWriteSplitting,auroraConnectionTracker,failover,efm2
    cp-MaximumPoolSize: "50"
    cp-MinimumIdle: "5"
    cp-ConnectionTimeout: "30000"
    # I let efm2 handle liveness. TCP timeout is intentionally 0.
    connectTimeout: "10000"
    loginTimeout: "10000"
    socketTimeout: "0"
    # efm2 tuning — see "failover budget" below
    failureDetectionTime: "60000"        # grace period before monitoring starts
    failureDetectionInterval: "15000"    # 15s between probes
    failureDetectionCount: "5"           # 5 failed probes = dead
    monitoring-connectTimeout: "10000"
    monitoring-socketTimeout: "5000"
    monitoring-loginTimeout: "10000"
  exception-override-class-name: software.amazon.jdbc.util.HikariCPSQLException

Reader host selection

readerHostSelectorStrategy controls how readWriteSplitting picks a reader:

roundRobin — distributes reads evenly. My default.
random — statistically even but variable in any given second.
leastConnections — picks the reader with the fewest active physical connections. Worth it when readers have meaningfully different workloads, but adds a small lookup cost per acquisition.
fastestResponse — picks the reader with the lowest observed response latency. Useful when readers have asymmetric hardware or load.

For a homogeneous reader fleet, roundRobin is the cleanest and cheapest. I've only ever needed leastConnections once, for an asymmetric deployment.

The exception-translation line I almost missed

exception-override-class-name: software.amazon.jdbc.util.HikariCPSQLException is easy to skip over (see the Spring Boot + HikariCP example where it's buried at the bottom of the YAML). Without it, HikariCP sees failover-triggered SQLExceptions as "normal" and tries to hand out connections the wrapper has already invalidated. Pool stays confused, latency stays bad, and the ordinary failover recovery path never fully completes. Not optional if you're on HikariCP + failover. Set it once and never think about it again.

Performance aspects

Where time actually goes

Under steady load, the wrapper's overhead breaks down into three categories:

Plugin chain traversal — every JDBC call walks through the chain. For N plugins and M statements per transaction, you pay N×M method-dispatch overhead. On v3.x it's low single-digit microseconds — not zero, but invisible unless you're chasing the last 1% of p99. The rule I follow: don't enable plugins you aren't using.
Physical connection creation — TLS handshake + auth + wire setup. One-time per internal pool slot; amortized, it's invisible unless the pool is cold or under-sized and the driver is creating sessions continuously.
Monitoring traffic — efm2 sends lightweight probes per connection. At failureDetectionInterval=15000 the volume is tiny.

Metrics I always watch

Metric	What it tells me
`hikaricp.connections` (total)	External pool size. Should grow to `maximumPoolSize` under load. If it plateaus below the configured max, I'm hitting the internal pool ceiling — that's exactly how I finally caught the v2.6.8 F0 issue.
`hikaricp.connections.active`	Currently in-use logical connections. Near the max = contention.
`hikaricp.connections.pending`	Threads waiting to borrow. Steady-state non-zero = bottleneck. I alert on this.
`hikaricp.connections.creation` (ms)	Time to acquire a physical connection through the wrapper. Single-digit ms is normal; 10,000 ms means an internal-pool wait timed out. This is the specific signal that said "the problem isn't the external pool."
`hikaricp.connections.timeout`	Borrow timeouts. Always zero when healthy.
Aurora `DatabaseConnections`	Physical conns per instance. Should roughly equal `sum over pods of (active internal-pool conns to this role)`. Cross-reference with `cp-MaximumPoolSize`.
Aurora `Deadlocks`, `CommitLatency`	Independent of the driver but often regress together if pool sizing forces serialization at the app layer.

My sizing calculator

For P pods, E external pool size, R_n reader count, target Aurora M max_connections per instance with 20% headroom:

cp-MaximumPoolSize = E                                     # invariant; no internal-pool wait
Writer physical at peak      = P * cp-MaximumPoolSize
Per-reader physical at peak  = P * cp-MaximumPoolSize       (round-robin balances across readers)
Sanity: P * cp-MaximumPoolSize <= 0.8 * M

Plug in your own numbers: P * cp-MaximumPoolSize per role. Check this against the max_connections for your Aurora instance class and leave ~20% headroom for maintenance connections and other clients.

Failover — what happens under the hood

Aurora failover — writer restart, reader promotion, or AZ failover — is the specific scenario the wrapper's plugins were built to survive. The first time I watched a failover in production with this stack, I actually wanted to know what was happening step by step. Here's what I worked out.

Sequence during a writer failover

Writer instance goes unresponsive. TCP sockets from my pods to that writer stop returning packets.
efm2's monitor thread hits failureDetectionCount consecutive probe failures within failureDetectionTime. The underlying connection is marked bad.
My app's next statement on that connection throws a SQLException tagged with a failover-relevant SQLState.
failover catches it, queries Aurora topology (via the RDS DNS or the cluster's topology endpoint), identifies the new writer, and reconnects transparently.
If configured (failoverMode=reader-or-writer), the reconnect can fall back to a reader for the brief window where no writer is available. Default is writer.
auroraConnectionTracker walks its table of open connections to the dead instance and invalidates them.
External HikariCP sees the invalidation through HikariCPSQLException (this is the moment exception-override-class-name matters) and evicts the bad logical connections.
New logical connections open against fresh internal-pool slots bound to the new writer.

End-to-end with default timers: detection ~75 seconds (failureDetectionTime=60000 + up to failureDetectionCount=5 × failureDetectionInterval=15000), reconnect ~5-15 seconds (Aurora DNS propagation + fresh handshake). My app's p99 takes a visible bump during that window; business recovers within ~90 seconds.

Tuning the detection budget

Aggressive (~15-30 s to detect): failureDetectionTime=15000, failureDetectionInterval=5000, failureDetectionCount=3. More probe traffic; more false positives on transient network blips.
Default (~75 s, what's in the YAML above): what I run by default. Good for most apps.
Lax (~3+ min): raise failureDetectionTime past 120000. Only use this if you have independent health-signal paths and don't want efm2 to chatter.

One thing I stopped doing: don't set socketTimeout small on the main connection (socketTimeout=5000 and friends) hoping to catch failures faster. That fires on every slow query — including legitimate long-running reports — and turns every transient spike into connection churn. Let efm2 own liveness detection. Keep socketTimeout=0. I learned this the hard way after a 12-minute query triggered a pool-wide connection churn event.

Resilience patterns worth knowing

`failoverMode`

Controls fallback when no writer is reachable:

strict-writer — only reconnect to a writer. Default when connecting via the cluster writer endpoint. During a prolonged failover, connections stall until a new writer is up.
reader-or-writer — fall back to a reader for reads if no writer is available. Default when connecting via the read-only cluster endpoint. Useful for read-heavy apps that can tolerate writes being rejected; writes still fail until the writer is back.
strict-reader — never connect to the writer. Dedicated read-replica deployments only.

My default is strict-writer (which matches the implicit default for cluster-writer-endpoint connections). I've only ever overridden it for a reporting workload where read availability mattered more than write availability.

Connection churn during failover (don't panic)

The immediate aftermath of a failover event looks rough on dashboards: connections.creation spikes to seconds (new TLS handshakes), connections.timeout briefly non-zero, p99 climbs. All expected. The key is the spike ends, typically within ~30 seconds of the new writer being healthy. If you see a sustained elevated connections.creation after the event, check whether exception-override-class-name is configured — without it, HikariCP keeps handing out invalidated connections and the churn doesn't stop on its own.

Read-only traffic during failover

Readers are unaffected by writer failover. readWriteSplitting + correctly-marked read-only transactions means read traffic keeps flowing while writes pause for ~30-60 seconds. For read-heavy apps, marking transactions readOnly=true turns out to be both a performance win and an availability one. Do it for both reasons.

Blue/green deployments

If you're doing Aurora blue/green (RDS Blue/Green), the switchover is a writer-failover-like event from the driver's perspective. The plugins cover it with no extra config, but the same detection-budget trade-offs apply: faster detection = faster cutover = more false-positive risk during normal ops.

RDS Proxy: when, and how it interacts with this driver

If you've read this far, you're either using or considering RDS Proxy. The two layers — RDS Proxy in front of Aurora, the AWS JDBC driver inside your app — solve overlapping but not identical problems, and the AWS guidance you'd want to read together is scattered across the proxy planning page, the wrapper README, and a plugin doc most people miss.

When AWS recommends RDS Proxy

The planning page lists the canonical cases: "too many connections" pressure, T2/T3 instances where connection-setup CPU is significant, Lambda / serverless workloads, apps without a built-in pool, centralized IAM auth or Secrets Manager rotation, failover speedup (advertised at "up to 66%", typically <35 s for Multi-AZ Aurora), and Blue/Green deployments. For a long-lived Spring Boot pod with a well-tuned HikariCP, only the last three are particularly compelling — the multiplexing benefit is mostly theoretical when your external pool is sized correctly.

How RDS Proxy actually routes

The thing that catches teams out is the assumption that the proxy "splits reads and writes intelligently." It doesn't. From the endpoints docs, the proxy exposes two endpoints — a read/write endpoint that sends every request to the current writer, and a read-only endpoint that sends every request to some reader (with proxy-level rebalancing if a reader fails). There's no SQL inspection. The proxy routes where you point it, not what you send through it. SQL-aware splitting still requires application-side logic — either two DataSource beans in your app or the srw plugin described below.

Plugin compatibility behind RDS Proxy

The wrapper README's RDS Proxy section is unambiguous:

"Functionality like Failover, Enhanced Host Monitoring, and Read/Write Splitting is not compatible since the driver relies on cluster topology and RDS Proxy handles this automatically. The driver remains useful with RDS Proxy for authentication workflows, such as IAM authentication and AWS Secrets Manager integration."

Translated:

Plugin	Behind RDS Proxy
`failover`, `failover2`	Drop. Proxy handles writer failover; topology lookups conflict with the hidden pool.
`efm2`	Drop. Per-connection probes don't see the underlying Aurora node.
`readWriteSplitting`	Drop. Relies on topology that's invisible behind the proxy.
`iamAuth`	Keep if you want JDBC-layer IAM (alternative to configuring it on the proxy).
`awsSecretsManager`	Optional — overlaps with proxy auth. Usually skip.
`srw` (Simple R/W Splitting)	Keep — purpose-built for this combination.

The `srw` plugin — SQL-aware splitting through RDS Proxy

Available since v3.0.0 and documented here. Unlike readWriteSplitting, srw doesn't query the cluster for topology. You give it two explicit endpoints — srwWriteEndpoint (your read/write proxy endpoint) and srwReadEndpoint (your read-only proxy endpoint) — and it switches between them on Connection#setReadOnly(true/false). With Spring's @Transactional(readOnly=true), you keep the same single-DataSource ergonomics you'd have with readWriteSplitting against direct Aurora.

Two gotchas. Role verification (verifyNewSrwConnections=true by default) runs SELECT pg_catalog.pg_is_in_recovery() after switching, with up to a 60-second retry budget, to defend against DNS-cache staleness right after failover. Useful on paper; it conflicts with autocommit=false because the verification query opens a transaction. Either set setReadOnly before disabling autocommit, or set verifyNewSrwConnections=false. Mutual exclusion: don't combine srw with readWriteSplitting or gdbReadWriteSplitting on the same connection. They're alternatives, not layers.

Decision tree

Setup	Plugins	Read/write split mechanism
Direct to Aurora, no proxy	`readWriteSplitting`, `auroraConnectionTracker`, `failover`, `efm2` (+ `cp-*`)	Wrapper plugin, one DataSource, `@Transactional(readOnly=true)` routes via topology.
RDS Proxy + wrapper, SQL-aware split	`srw` (+ `iamAuth` if needed)	`srw` switches between two proxy endpoints on `setReadOnly`. One DataSource.
RDS Proxy + plain `org.postgresql.Driver`	n/a	Two DataSource beans (one per proxy endpoint). App routes manually.
Lambda / serverless	n/a	RDS Proxy + plain driver. The wrapper's value is amortized warm-pool benefits — irrelevant for cold invocations.

Pinning — the multiplexing trap

RDS Proxy multiplexes by handing one backend session to multiple client connections, but only when the session is resettable. The pinning rules for Aurora PostgreSQL disable multiplexing on SET, PREPARE/DEALLOCATE/EXECUTE, temporary tables, declared cursors, LISTEN, advisory locks, and any statement >16 KB. Hibernate with server-side prepared statements pins on every session. There are real teams (Aggarwal's 12-hour revert is the most-cited public postmortem) that hit ~100% pinning under load and pulled the proxy out the same day. The diagnostic is the DatabaseConnections.PinnedConnections CloudWatch metric — if pinned connections approach total, you're paying for a proxy that isn't actually multiplexing.

My take

RDS Proxy and the AWS JDBC driver aren't usually a "pick one" decision — they solve different concerns and can layer cleanly if you pick the right plugins. Three rules I'd hold:

Failover ownership belongs to one layer. Don't run failover + efm2 behind a proxy. The proxy already does it; you're paying twice and risking conflicting reactions to transient errors.
Read/write splitting needs an explicit choice. Two DataSource beans, or srw, or readWriteSplitting (no proxy). Pick one — never two.
The wrapper still earns its keep behind a proxy if you're using IAM auth or srw. Otherwise plain org.postgresql.Driver is simpler and the wrapper's plugin chain is mostly cosmetic.

If your motivation for either layer is "make the app faster," neither is the answer — that's a query / index / cache problem.

The checklist I run through before shipping

Before I put the wrapper in front of production traffic, I go through this list. Nothing on it is optional.

Driver version ≥ 3.3.0. cp-* properties landed in v3.1.0 and efm2 has been available since v2.4.0, so this is not the floor for those features. The reason I draw the line at 3.3.0: it includes the readWriteSplitting + failover plugin-ordering fix and removes a 5-second sleep from the failover recovery path. If you're below 3.1.0, cp-* won't work at all.
F0 profile not in use unless version-aware — on v2.x, F0 hardcodes maxPoolSize=30. I've been burned.
cp-MaximumPoolSize ≥ maximumPoolSize on the external pool.
exception-override-class-name set to software.amazon.jdbc.util.HikariCPSQLException.
socketTimeout=0 — liveness belongs to efm2.
Read-only transactions annotated — otherwise readWriteSplitting is decorative.
Aurora max_connections supports pods × cp-max × (1 + readers) with 20% headroom.
Topology endpoint reachable from every pod (cluster and per-instance DNS resolve via VPC DNS).
Plugin list ordered: readWriteSplitting,auroraConnectionTracker,failover,efm2.
Observability wired — hikaricp.connections.pending alert on non-zero steady state.

Where this leaves me

The AWS JDBC Driver is one of those libraries where the defaults are opinionated but not obvious, the configuration surface is large, and the version-to-version behavior has shifted in ways that invalidate older docs you'll find on the internet. The cases where I've seen teams get into trouble all look the same: they adopted a profile without reading what was inside it, or they moved from v2.x to v3.x without re-checking whether the properties they'd set still did anything.

If I could boil this post down to one practical habit: don't trust the external pool metrics alone. The wrapper adds a whole second layer of pooling between your hikaricp.connections count and the actual network. When the external pool metrics look fine but your requests are slow, look inside. And if you're still on v2.x with F0, upgrade — there is no property you can set to make it behave.

I lost a few hours to this. You shouldn't have to lose any.

References

AWS Advanced JDBC Wrapper — driver docs

Using the JDBC Driver — full parameter reference including wrapperPlugins, wrapperProfileName, wrapperDialect, and all connection properties
Configuration Presets — what F0, F1, SF0, etc. actually configure (plugins, pool settings, timeouts)
Host Selection Strategies — roundRobin, random, leastConnections, highestWeight
Failover Configuration Guide — failoverMode, detection tuning, transactional behavior during failover
Framework Integration — notes on Spring Boot, Hibernate, and other framework specifics
DataSource Configuration — alternative to driver-mode configuration via AwsWrapperDataSource
Compatibility — supported databases, JDBC versions, known limitations

AWS Advanced JDBC Wrapper — plugin docs

readWriteSplitting — reader routing, internal connection pooling with connectionPoolType=hikari, cp-* properties, readerHostSelectorStrategy
failover — classic failover plugin; topology detection, connection invalidation
failover2 — newer failover implementation (v2); recommended for new deployments
efm2 (Host Monitoring) — failureDetectionTime, failureDetectionInterval, failureDetectionCount, monitoring timeouts
auroraConnectionTracker — connection-to-instance mapping for failover invalidation

AWS Advanced JDBC Wrapper — examples and changelog

Spring Boot + HikariCP example — working YAML with exception-override-class-name and HikariCP data-source properties
Spring + Hibernate example — Hibernate-specific session factory integration
Spring Transaction Failover example — handling transactional rollback during failover
PR #1658 — configurable internal pool — the change (v3.1.0) that made cp-* properties work outside of profiles
Changelog — version-to-version migration notes

RDS Proxy

Planning where to use Amazon RDS Proxy — the canonical use-case list (Lambda, T2/T3, IAM auth, Blue/Green, failover speedup)
RDS Proxy endpoints — read/write vs read-only endpoints; "the proxy routes where you point it" semantics
Avoiding pinning — full list of session-state operations that disable multiplexing per engine
Wrapper README — RDS Proxy section — the official statement that failover, efm2, and readWriteSplitting are incompatible with RDS Proxy
Simple Read/Write Splitting Plugin (srw) — the topology-agnostic plugin purpose-built for use behind RDS Proxy (since v3.0.0)
RDS Proxy pricing — per vCPU-hour for provisioned, per ACU-hour for Serverless

External

HikariCP — About Pool Sizing
Aurora PostgreSQL — Performance and scaling for Amazon Aurora PostgreSQL (source for the max_connections default formula and the 5,000-connection cap)
Aggarwal — "Experience with AWS RDS Proxy in production, and why we had to revert it in 12 hours" (cited in the pinning section)

This PHD from AWS Might Save Your Weekend!

Kris Iyer — Mon, 10 Nov 2025 23:48:42 +0000

There are some lessons you only learn the hard way in cloud operations — and this was one of them.

A few weeks ago, one of our Amazon RDS databases restarted itself in the middle of the night.

No deploys. No CloudWatch alarms. Just… downtime.

When I checked the console later, I saw the engine version had been upgraded — even though Auto Minor Version Upgrade was disabled.

That’s not supposed to happen, right?

The Mystery

My first reaction: “We must’ve messed up the configuration.”

But after a deep dive, I realized AWS had forced the upgrade because the version we were running had reached end-of-support.

Apparently, this is expected behavior.

When your RDS or Aurora version goes out of support, AWS reserves the right to automatically upgrade it to a supported version — even with auto-upgrade turned off.

That’s when I learned my biggest oversight:

The AWS Personal Health Dashboard (PHD) had already warned me about the change.

I just hadn’t been looking.

What Really Happens During a Forced Upgrade

Here’s where AWS is very clear about what happens — when an Aurora or RDS version reaches End of Standard Support (EOSS), Aurora performs an automatic upgrade to keep the cluster compliant.

And even if Auto Minor Version Upgrade is disabled, the process still triggers restarts across the cluster.

AWS describes it like this:

“When a version reaches EOSS, Aurora performs an automatic upgrade to keep the cluster compliant with supported versions, even if Auto Minor Version Upgrade is disabled.

During this process, cluster nodes are restarted sequentially, and DNS endpoints are briefly remapped to the new hosts, which can cause temporary connection errors such as: connection refused or host not resolving.”

That short sentence hides a lot of operational pain.

The Connection Pooling Trap

If your application uses connection pooling (as most do), this restart can leave behind stale or dead connections that linger long after the cluster is back up.

Here’s what I saw in logs:

connection refused errors when the DNS endpoint switched hosts.
Application threads stuck waiting on sockets that would never recover.
Connection pools holding references to the old host until they were recycled.

Interestingly, HikariCP handled it gracefully — it dropped the bad connections and automatically re-established new ones.

But some other clients we had running didn’t recover cleanly; their pools held onto dead connections until we performed an application rolling restart to clear them.

It was a subtle but painful reminder that even small differences in connection management can turn a “brief restart” into a customer-facing outage.

The fix?

Implement aggressive connection validation and retry logic in your database clients.
For pooled connections, ensure you’re using health checks like connectionTestQuery or validationTimeout.
Consider shorter max lifetime settings in your pool so old connections are recycled faster after restarts.
And of course — know when AWS is about to restart your cluster (that’s what PHD is for).

The Bigger Picture — It’s Not Just RDS

While my story centers around RDS, the AWS Personal Health Dashboard covers far more than databases.

Nearly every critical service you run in production can appear here — and if you’re not watching, you can miss events that matter.

A few common examples:

Amazon EC2: Instance retirements, hardware maintenance, or networking reconfigurations.
Amazon Aurora & RDS: Engine upgrades, SSL/TLS certificate rotations, or storage maintenance.
Amazon SQS & SNS: Service endpoint updates, feature deprecations, and regional throttling events.
AWS Lambda & EventBridge: Deprecation of older runtimes, event delivery changes, or region-specific service modifications.
Amazon EKS & ECS: Control plane upgrades, patching windows, or underlying node retirements.

Deprecation notices are one of the most valuable (and easily missed) categories in PHD. They often appear weeks or months in advance, giving teams the time to plan migrations or version upgrades — before a breaking change occurs.

Each of these can show up in the Personal Health Dashboard before they impact you — but only if you’re subscribed, alerting, and paying attention.

What the Personal Health Dashboard Actually Does

Most teams glance at the Personal Health Dashboard once and forget it exists.

But if you dig in, you’ll realize it’s a quiet goldmine of visibility into what AWS is doing to your infrastructure.

💡 Proactive warnings about maintenance, deprecations, and upcoming service changes.
🔔 Integration options via EventBridge or SNS for automated alerts (to Slack, email, etc.).
🧾 Historical logs of past events so you can correlate AWS maintenance with your own incidents.
🔍 Account-level insights, not just global AWS status updates.

It’s basically AWS’s way of saying:

“Hey, we’re about to touch something you own. Just so you know.”

What I Changed After That

After that forced upgrade, I decided I’d never let another PHD event go unnoticed.

Here’s what I plan (and what I’d recommend others do too):

Set up PHD notifications using SNS → EventBridge → Slack/Teams.
Create a weekly check for open/scheduled events in your ops stand-up.
Add version lifecycle tracking for all RDS, Aurora, EC2, and other managed services.
Update your incident runbook to check PHD during investigations.
Tag PHD alerts in PagerDuty with “AWS Provider Change” so you can separate them from internal incidents.
Watch for deprecation notices — these are often the first sign a version, runtime, or API will be retired.

Now, when AWS schedules something, we know before production does.

Looking Ahead — AI, Automation, and MCP

The next step for me has been making all this smarter.

AWS gives you the data through PHD, but there’s still a lot of noise.

AI and automation can help make sense of it.

With tools like Amazon Q, QuickSight, and Bedrock, you can summarize or query health events in natural language — “What’s changing next week in us-east-1?” or “Which clusters are nearing end-of-support?”

And if you want to take it a step further, AWS Labs has released the MCP Servers project, which defines a standard interface that allows AI assistants or bots to securely access AWS data (like the Health API) and answer those same questions automatically.

It’s early days, but it’s easy to see how something like this could become an Ops Copilot — a chat assistant that not only reports on AWS health events but also suggests who owns the impacted resource, what runbook applies, and what to do next.

Why It Matters

Cloud automation is great — until it surprises you.

The Personal Health Dashboard is your early warning system for when AWS changes something under the hood.

If you’re running production workloads, it’s not optional.

It’s as essential as CloudWatch or your favorite APM.

Here’s the mindset shift that helped me:

Automation doesn’t remove responsibility.

It just changes where you have to pay attention.

TL;DR

AWS can still upgrade your RDS/Aurora cluster when versions hit end-of-support.
The Personal Health Dashboard will warn you — if you’re watching.
Connection pools can fail during DNS endpoint remapping; some (like HikariCP) recover automatically, others may need a restart.
PHD isn’t just for RDS — it covers EC2, EventBridge, SQS, SNS, Lambda, EKS, and more.
Deprecation notices are critical early warnings — act on them before they become incidents.
Automate those alerts via SNS/EventBridge to avoid surprises.
Tools like AI and MCP can make this even smarter — turning AWS health data into insight.
Treat PHD as a core monitoring tool, not an afterthought.

Since that incident, I’ve made the PHD part of our daily ops hygiene.

It’s not flashy, but it’s saved us from more than one nasty surprise.

If you’ve been ignoring it (like I did), maybe it’s time to give it another look.

Have you ever been surprised by an AWS upgrade, deprecation, or hidden maintenance event? I’d love to hear how you handled it — drop a comment below.

Kubernetes Features for operating resilient workloads on Amazon EKS

Kris Iyer — Fri, 14 Mar 2025 13:02:31 +0000

Photo by Yomex Owo on Unsplash

Introduction

In Kubernetes, managing highly available applications is critical for maintaining service reliability and resiliency. A scalable and resilient architecture keeps your applications and services running without disruptions, which keeps your customers and users happy! Thankfully, there are several configuration options to meet these important NFRs for your k8s workloads.

Control Plane

Amazon Elastic Kubernetes Service (EKS) is a managed Kubernetes service that makes it easy to run Kubernetes on AWS without installing, operating, and maintaining your own Kubernetes control plane or worker nodes. The EKS architecture is designed to eliminate any single points of failure that may compromise the availability and durability of the Kubernetes control plane and offers an SLA of 99.95% for API server endpoint availability.

Last December at re:Invent 2024, AWS announced a new mode for managing Amazon EKS clusters: Amazon EKS Auto Mode promises simplified cluster operations, improved application performance, availability, and security, and continuously optimizes compute costs. With EKS Auto Mode, you can focus on running your applications without worrying about the underlying infrastructure and resilience of the control plane.

AWS EKS Auto mode

It is designed to be highly available, fault-tolerant, and scalable. Following the recommendations of the Well-Architected Framework, Amazon EKS runs the Kubernetes control plane across multiple AWS Availability Zones (AZ) to ensure high availability. The cluster control plane auto-scales based on the load and any unhealthy control plane instances are replaced automatically. The availability of EC2 instances attached to an EKS Cluster is covered under the Amazon Compute SLA.

In a nutshell, as far as the resilience of the control plane is concerned, AWS EKS auto mode handles all of that for you allowing engineering teams to focus on building applications and business logic rather than managing the infra!

Data Plane

Multi-Region Kubernetes Clusters

If you are one of those with stringent availability requirements you may choose to operate across multiple AWS Regions. This approach protects against larger-scale disasters or regional outages. However, the cost and complexity of implementing such a setup can be significantly higher. This architecture pattern is typically reserved for disaster recovery and business continuity. Therefore understanding the Spectrum of Resilience Strategies is very important:

We can categorize resilience strategies based on their:

Recovery Time Objective (RTO): How quickly do you need to restore service?
Recovery Point Objective (RPO): How much data loss you can tolerate?
Cost: The expense of implementing and maintaining the strategy.
Complexity: The difficulty of setup and ongoing management.

Here’s a breakdown of the strategies, moving from “cold” to “hot” standby:

1. Backup and Restore (Very Cold Passive):

Description: Regularly backing up data and infrastructure configurations to a separate region or storage location. If a disaster occurs, you restore the backups to a new environment.
RTO: Very high (hours to days).
RPO: High (potential for significant data loss, depending on backup frequency).
Cost: Lowest (storage costs for backups, minimal infrastructure costs).
Complexity: Low (relatively simple backup and restore procedures).
Use Cases: Suitable for non-critical workloads with relaxed RTO/RPO requirements, where cost is the primary concern.
Shades:
- Frequency of backup. Daily, Hourly, etc.
Location of backup. S3, Glacier, another region.
Automation of restore. Manual vs Fully automated.

2. Pilot Light (Active/Passive — Cold Standby):

Description: Maintaining a minimal “pilot light” environment in a secondary region, including core infrastructure components. When a disaster occurs, you scale up the pilot light to full capacity.
RTO: Moderate (minutes to hours).
RPO: Low (data replication is typically used, minimizing data loss).
Cost: Low to moderate (cost of the pilot light environment, data replication).
Complexity: Moderate (setup and testing of failover procedures).
Use Cases: Suitable for workloads that require faster recovery than backup and restore, but can tolerate some downtime.
Shades:
- Amount of infrastructure kept running. Minimal core services, or almost a full duplicate.
Automation of scaling. Manual, semi, or fully automated.
Data replication type. Async, or sync.

3. Warm Standby (Active/Passive — Warm Standby):

Description: Maintaining a fully scaled-down, but functional, environment in a secondary region. Data is continuously replicated. When a disaster occurs, you switch traffic to the warm standby.
RTO: Low (minutes).
RPO: Very low (near-zero data loss).
Cost: Moderate to high (cost of the warm standby environment, data replication).
Complexity: High (complex failover and failback procedures).
Use Cases: Suitable for critical workloads that require minimal downtime.
Shades:
Amount of traffic the warm standby receives. No traffic, or a small amount of test traffic.
Testing of failover. Frequent, or infrequent.
Data replication consistency. Strong or eventual consistency.

4. Active/Active (Hot Standby):

Description: Running identical environments in multiple regions simultaneously, with traffic distributed across them. If one region fails, traffic is automatically routed to the remaining regions.
RTO: Very low (seconds).
RPO: Very low (near-zero data loss).
Cost: Highest (cost of running full environments in multiple regions).
Complexity: Highest (complex traffic management, data synchronization, and application design).
Use Cases: Suitable for mission-critical workloads that require continuous availability and cannot tolerate any downtime.
Shades:
Traffic distribution. Weighted, or even distribution.
Data replication. Synchronous, or asynchronous.
Application design. Region-aware applications, or not.

Expanding on the Cost-Benefit Trade-Off:

Cost of Downtime: The “cost of being down” is not just financial. It includes reputational damage, customer churn, and lost productivity.
Workload Characteristics: The nature of your workload influences the appropriate strategy. Real-time applications require lower RTO/RPO than batch processing.
Compliance Requirements: Regulatory requirements may dictate specific RTO/RPO targets.
Testing and Validation: Regularly testing failover procedures is crucial to ensure they work as expected.
Geographic Distribution: Active/active can also improve performance by serving users from the closest region.

By carefully evaluating these pillars, you can choose the resilience strategy that best balances cost, complexity, and risk for your specific needs for your Multi-regional deployments.

Multi-cluster

This is a popular pattern where we deploy workloads across multiple Amazon EKS clusters to eliminate a Kubernetes cluster from being a single point of failure. Multi-cluster architecture also provides opportunities for testing, maintenance, and upgrades without disrupting production environments. By diverting traffic or workloads to a set of clusters during planned maintenance activities, one can ensure continuous service availability and achieve near-zero downtime.

It does mean we use Application Load Balancer (ALB) or Network Load Balancer (NLB) to distribute traffic to replicas of services running inside a cluster or even load balance traffic across multiple clusters. When using ALB, we can create dedicated target groups for each cluster. Using weighted target groups, we can then control the percentage of traffic each cluster gets. For workloads that use an NLB, we can use AWS Global Accelerator to distribute traffic across multiple clusters.

Note: Avoid potential pitfalls with configuration drifts across clusters. When adopting a multi-cluster architecture for resiliency, it is essential to reduce the operational overhead of managing clusters individually. The idea is to treat clusters as a unit. Should an issue arise in the deployment, it’s easier to fix if all clusters share the same workload version and configuration.

Compute Resources (Nodes/Node Groups)

Kubernetes Cluster AutoScaler (CA)

The Kubernetes Cluster Autoscaler is a popular Cluster Autoscaling solution maintained by SIG Autoscaling. It is responsible for ensuring that your cluster has enough nodes to schedule your pods without wasting resources. It watches for pods that fail to schedule and for underutilized nodes. It then simulates the addition or removal of nodes before applying the change to your cluster. The AWS Cloud Provider implementation within CA controls the .DesiredReplicas field of your EC2 Auto Scaling Groups.

Automated Node Scaling: The CA automatically adjusts the number of nodes in your cluster based on the resource requests of pending pods. If pods are waiting to be scheduled and there aren’t enough resources, the CA will provision new nodes. Conversely, if nodes are underutilized, the CA can scale down the cluster, removing unnecessary nodes.
Improved Resource Utilization: By dynamically scaling the number of nodes, the CA helps to optimize resource utilization. You avoid over-provisioning nodes, which can lead to wasted resources and increased costs. The cluster more closely matches the resource demands.
Cost Optimization: Scaling down the number of nodes when they are not needed can significantly reduce your cloud computing costs. You only pay for the resources you use.
Simplified Cluster Management: The CA automates the process of adding and removing nodes, reducing the manual effort required to manage your Kubernetes infrastructure. This frees up your operations team for other tasks.
Increased Application Availability: By ensuring that there are enough resources available to run your applications, the CA can help to improve application availability and prevent resource starvation. Pods are more likely to be scheduled quickly.
Support for Diverse Environments: The CA is designed to work with various cloud providers (AWS, Azure, GCP) and even on-premises Kubernetes clusters. This makes it a versatile solution for managing Kubernetes in different environments.
Integration with Kubernetes: The CA is a core Kubernetes component and integrates seamlessly with other Kubernetes features, such as the scheduler and the HPA.
Configuration Flexibility: The CA offers a range of configuration options, allowing you to customize its behavior to meet your specific needs. You can control the minimum and maximum number of nodes, the types of instances to use, and other parameters.
Node Draining: When scaling down, the CA gracefully drains nodes by evicting pods before terminating the instance. This prevents application disruptions and ensures a smooth scaling process.
Scaling Faster: There are a couple of things you can do to make sure your data plane scales faster and recovers quickly from failures in individual Nodes and make CA more effecient: — Over-provision capacity at the Node level — Reduce Node startup time by using Bottlerocket or vanillaAmazon EKS optimized Linux

Karpenter

Karpenter was introduced in this space as an AWS alternative to the CA which allowed for greater flexibility with fine-grained control over cluster and node management like never before for EKS.

Faster and More Efficient scaling by directly provisions EC2 instances based on the needs of pending pods and thus eliminates the overhead of managing node groups and significantly speeds up the scaling process, especially for workloads with fluctuating demands.
Fine-grained Control control over the types of instances it provisions. You can specify instance types, availability zones, architectures (e.g., ARM64), and other instance properties through “provisioners.” This allows you to optimize resource utilization and cost efficiency for different workloads. You can tailor the compute resources to the workload. Karpenter can efficiently provision worker nodes for a wide variety of workloads, including those with specialized hardware requirements (like GPUs) or architectural needs.
Simplified Cluster Management by dynamically provisioning nodes. You don’t need to pre-configure and manage multiple node groups with varying instance types. This reduces operational overhead and makes it easier to manage your EKS cluster. Less configuration and management are required.
Improved Cost Optimization: Karpenter integrates well with AWS cost-saving features. It can automatically provision nodes using Spot Instances, Savings Plans, or other cost-effective options, helping you minimize your EKS spending.
Better Integration with EKS: As an AWS-developed tool, Karpenter seamlessly integrates with EKS and other AWS services. This leads to a smoother experience and allows Karpenter to leverage AWS-specific features and best practices.
Right-Sized Instances: Karpenter provisions instances that precisely match the resource requests of pending pods. This avoids over-provisioning and improves resource utilization. You don’t get instances that are too large or too small.
Automated Node Draining: When scaling down, Karpenter gracefully drains nodes by evicting pods before terminating the instance. This prevents application disruptions and ensures a smooth scaling process.
Declarative Configuration: Karpenter uses declarative configuration through provisioners, making it easy to manage and version your worker node configurations. With Karpenter, you can define NodePools with constraints on node provisioning like taints, labels, requirements (instance types, zones, etc.), and limits on total provisioned resources. When deploying workloads, you can specify various scheduling constraints in the pod specifications like resource requests/limits, node selectors, node/pod affinities, tolerations, and topology spread constraints. Karpenter will then provision right-sized nodes based on these specifications.

Here is a Quick high-level comparison:

EKS Auto Mode

As discussed under the control plane section above, EKS Auto Mode extends into the data plane with its powerful features that simplify worker node management by leveraging Karpenter under the hood. So in essence what you get is Managed Karpenter where AWS manages the Karpenter installation and configuration for you. EKS Auto Mode provides a superior experience compared to manually configuring the Cluster Autoscaler.

For the vast majority of EKS users, especially those starting new clusters or looking for a simplified solution, EKS Auto Mode is the recommended approach for autoscaling worker nodes.

Let’s break down some examples of situations where EKS Auto Mode might not be sufficient and you might need more direct control via Karpenter or (less commonly) the Cluster Autoscaler:

A decision tree for EKS Auto Mode

1. Fine-Grained Instance Type Control:

Specific Instance Families: Auto Mode lets you choose instance types, but you might need a very specific generation (e.g., c5.xlarge vs. c7g.xlarge for Graviton) or a particular instance family due to workload requirements (e.g., memory-optimized instances). Auto Mode's selection might not always align perfectly with your needs and complex computing use cases.
Heterogeneous Instance Types within a Provisioner: While Karpenter can provision nodes with different instance types, Auto Mode simplifies this and may not offer the same level of granular control. You might want a mix of instances within a single provisioner based on cost or performance needs.
Custom AMIs: You might require a custom Amazon Machine Image (AMI) with specific software pre-installed or security hardening applied. Auto Mode typically uses standard AMIs, so you’d need more control for custom AMIs.

2. Taints and Tolerations:

Node Taints: Taints are used to repel pods from specific nodes. You might need to taint nodes for specialized workloads (e.g., GPU nodes) and then use tolerations in your pod specifications to allow those pods to run on the tainted nodes. Auto Mode might not offer the fine-grained control to apply specific taints during node provisioning.
Complex Tolerations: While Auto Mode does consider tolerations in pod specs if you have a complex set of tolerations, sometimes direct Karpenter configuration is better to ensure the right nodes are provisioned.

3. Node Labels:

Application-Specific Labels: Labels are used to organize and select nodes. You might need to apply specific labels to nodes for application-specific purposes (e.g., environment, team, or application name). Auto Mode’s labeling might not be flexible enough for all cases.
Node Pool Management: If you want to create and manage distinct sets of nodes (node pools) with different labels and configurations, you might need more direct control than Auto Mode provides.

4. Advanced Karpenter Features:

Provisioner Prioritization: If you want to prioritize certain provisioners (sets of instance types and configurations) over others, you would need to configure Karpenter directly.
Custom Scheduling Logic: For very specialized scheduling needs beyond what Kubernetes natively provides, you might need to use advanced Karpenter features.
Integration with other tools: If you have existing infrastructure-as-code (IaC) or configuration management tools that directly manage Karpenter, switching to Auto Mode could require significant changes.

5. Node Lifecycle Management:

Node Draining Customization: While Auto Mode handles node draining, you might have specific requirements for how nodes are drained (e.g., specific pod eviction policies or pre-shutdown scripts).
Node Replacement Strategies: You might need to define custom node replacement strategies based on your application’s requirements.

In summary , Auto Mode is excellent for most common scenarios. However, if your worker node requirements involve very specific instance types, complex taints, and tolerations, application-specific labels, advanced Karpenter features, or customized node lifecycle management, then configuring Karpenter directly gives you the necessary fine-grained control. You’ll know you need it when the Auto Mode configuration options simply don’t offer the knobs you need to turn.

Note: It’s important to keep in mind that EKS Auto Mode is still under active development, and AWS is continuously adding new features and improvements. So, it’s always a good idea to check the latest AWS documentation to see if Auto Mode meets your specific needs.

Deployment Strategies

In Kubernetes, several options for deploying applications are available, each suited for different needs and scenarios. Some of the popular techniques include blue-green deployments, canary deployments, and rolling updates. Each method is unique in what it has to offer for managing updates and minimizing downtime.

Rolling Updates

Kubernetes Rolling Updates are a critical feature for deploying and updating applications with zero downtime. They work by gradually replacing old pods with new ones, ensuring a smooth transition and minimizing disruption to users.

Choosing the Right Values is critical. The optimal values for maxSurge and maxUnavailable depend on your application's specific requirements and resource constraints. Here are some factors to consider:

Application Sensitivity: If your application is critical and cannot tolerate any downtime or performance degradation, you should use conservative values for maxSurge and maxUnavailable (e.g., 10-20%).
Resource Availability: If your cluster has limited resources, you should use lower values for maxSurge to avoid resource exhaustion.
Update Speed: If you need to deploy updates quickly, you can use higher values for maxSurge and maxUnavailable to speed up the process.

When you’re first setting up rolling updates, it’s best to start with conservative values for maxSurge and maxUnavailable and gradually increase them as you gain confidence, monitor the application, and fine-tune it as per your needs.

2. Canary Deployments

A Canary deployment strategy is a gradual rollout where you deploy a small percentage of the new version (the “canary”) alongside the existing version of your service. It is typically used when deploying new features or updates to a subset of users or servers to test them in a live environment and is often used for applications that require frequent updates. This strategy allows for the testing of new features with minimal impact on the production environment and can help us identify issues before they affect the entire system. A service mesh provides advanced traffic management features, including fine-grained traffic splitting, header-based routing, and more. This is highly recommended for more sophisticated canary deployments. For instance, Istio offers a VirtualService that defines how traffic is routed to your services where you you may choose to route most traffic (e.g., 90%) to v1 of your service and a small percentage (e.g., 10%) to v2 (the canary). If there were no service mesh involved you have to back this with an ingress controller strategy or a load balancer to do the same. For instance, Ambassador a popular ingress controller for Kubernetes offers canary releases based on weights.

3. Blue-Green Deployments

With the Blue-Green deployment strategy, you have two complete environments (blue and green). One environment (e.g., blue) is live, serving all traffic. You deploy the new version to the other environment (green). After testing in green, you switch all traffic from blue to green. Blue/green is about faster deployments and simplified rollbacks. The new release candidate is tested before being switched to the production environment, allowing for a smooth transition without any downtime or errors. This approach also heavily depends on your load balancer or ingress control and the ability to switch traffic from blue to green or vice-versa in the event of a rollback.

In summary, choosing a deployment strategy depends on the specific requirements and characteristics of the application or service being deployed. Canary deployments are great for services that go with frequent updates and testing, Rolling deployments are a great choice for zero-downtime deployments, and Blue-Green deployments are ideal for minimizing downtime during deployments.

Topology Spread Constraints (TSC)

Kubernetes Topology Spread Constraints (TSC) ensure pods are spread across zones during scale-up. However, they don’t guarantee balanced distribution during scale-down. The Kubernetes descheduler can be used to address this imbalance.

apiVersion: apps/v1
kind: Deployment # Or StatefulSet, etc.
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: <number> # Maximum difference in pods across topologies
          topologyKey: <string> # The topology domain (e.g., zone, node)
          whenUnsatisfied: <string> # How to handle if constraints can't be met
          labelSelector: # Selects the pods to which this applies
            matchLabels:
              <label-key>: <label-value>
          # Optional: minDomains: <number> # Minimum number of domains to spread across

Let’s break down the key parameters:

maxSkew: This is the most crucial parameter. It defines the maximum difference in the number of pods between any two topologies (e.g., zones). A maxSkew of 1 means that the difference in pod count between any two zones should be no more than 1.
topologyKey: This specifies the topology domain. Common values:
kubernetes.io/hostname: Spreads pods across nodes.
topology.kubernetes.io/zone: Spreads pods across availability zones.
topology.kubernetes.io/region: Spreads pods across regions. (Less common)
Custom labels: You can use any label as a topology key, giving you very flexible control.
whenUnsatisfied: This dictates what Kubernetes should do if the constraint cannot be satisfied when a pod is scheduled:
DoNotSchedule: (Recommended in most cases) Prevents the pod from being scheduled if the constraint cannot be met. This ensures the spread is maintained.
ScheduleAnyway: Allows the pod to be scheduled even if the constraint is violated. This is generally not recommended as it defeats the purpose of the constraint.
labelSelector: This uses standard Kubernetes label selectors to specify which pods the constraint applies to. This is essential to target the constraint.
minDomains (Optional): Specifies the minimum number of topology domains (like zones) that pods should be spread across. This is useful for high availability, ensuring your application runs in a minimum number of zones.

This ensures that no two zones have more than one pod difference for pods labeled app: my-app

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfied: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app

Likewise, you could spread by nodes to limit the difference in the number of app: my-app pods on any two nodes to a maximum of 2.

topologySpreadConstraints:
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfied: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app

Important Considerations:

DoNotSchedule is Crucial: In almost all cases, you should use whenUnsatisfied: DoNotSchedule. Otherwise, the constraint becomes meaningless.
Label Selectors are Essential: Without alabelSelector, the constraint will apply to all pods, which is rarely what you want.
maxSkew and Replica Count: maxSkew interacts with the number of replicas. If you have fewer replicas than topology domains, perfect spreading might not be possible.
Planning: Think about your failure domains (zones, nodes) and how you want your application to behave in the event of a failure. This will help you determine the appropriate maxSkew and topologyKey.

Descheduler

The Descheduler is a valuable tool for managing and optimizing Kubernetes clusters. By evicting the appropriate pods, the Descheduler can help improve resource utilization, maintain the desired state of the cluster, and enhance the scalability and security of the cluster in the face of node failures and vulnerabilities. Becomes very useful to maintain the balance and spread across your zones specifically during scale-down events.

Topology Aware Routing on Amazon EKS

Topology Aware Routing (Also referred to as Topology Aware Hints or TAH before v1.27) provides a mechanism to help keep network traffic within the zone where it originated. Preferring same-zone traffic between Pods in your cluster can help with reliability, performance (network latency and throughput), or cost.

Kubernetes clusters are increasingly deployed in multi-zone environments. Topology Aware Routing provides a mechanism to help keep traffic within the zone it originated from. When calculating the endpoints for a Service, the EndpointSlice controller considers the topology (region and zone) of each endpoint and populates the hints field to allocate it to a zone. Cluster components such as kube-proxy can then consume those hints, and use them to influence how the traffic is routed (favoring topologically closer endpoints).

You can enable Topology Aware Routing for a Service by setting the service.kubernetes.io/topology-mode annotation to Auto. When there are enough endpoints available in each zone, Topology Hints will be populated on EndpointSlices to allocate individual endpoints to specific zones, resulting in traffic being routed closer to where it originated from.

When using Horizontal Pod Autoscaler, topology spread constraints ensure newly created pods are spread among AZs during scaling out. However, when scaling in, the deployment controller won’t consider AZ balance, and instead randomly terminates pods. This may cause the endpoints in each AZ to be disproportionate and disable Topology Aware Routing. The descheduler tool can help you re-balance pods by evicting improperly placed pods so that the Kubernetes scheduler can reschedule them with the appropriate constraints in effect.

Topology Aware Routing and Node Affinity

While node affinity is not required for basic topology-aware routing, it becomes very useful in the following scenarios:

More Control Over Pod Placement: Node affinity gives you fine-grained control over where pods are initially scheduled. While basic topology-aware routing ensures traffic is preferentially routed to the same zone, it doesn’t guarantee that pods will be evenly distributed across zones. Node affinity allows you to express preferences or requirements for pod placement.
Taints and Tolerations: If you use taints to restrict pods to specific nodes (e.g., GPU nodes), you’ll need tolerations in your pods to allow them to run on those nodes. Node affinity can then be used to further refine placement within those tainted nodes.

Let's consider the following scenario : Regional Redundancy with Specialized Hardware for your ML Training job:

Regional Redundancy: Your application must be resilient to zone failures. You want pods to be spread across at least three availability zones in a region.
Specialized Hardware: Your training jobs require GPUs. You have a set of nodes in each zone equipped with GPUs.
Data Locality (Performance): For performance reasons, you want training jobs to run on GPU nodes in the same zone where the training data resides (assuming data locality is a factor in your application’s architecture).

In this scenario, basic Topology Aware Routing is not sufficient as you also need to combine this with special node types such as matching node labels as shown below:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training-app
spec:
  replicas: 9 # 3 replicas per zone (adjust as needed)
  selector:
    matchLabels:
      app: ml-training
  template:
    metadata:
      labels:
        app: ml-training
    spec:
      affinity:
        nodeAffinity: # Ensure pods are on GPU nodes
          requiredDuringSchedulingIgnoredDuringExecution: # Must be on a GPU node
            nodeSelectorTerms:
            - matchExpressions:
              - key: gpu # Label on your GPU nodes (e.g., nvidia.com/gpu)
                operator: Exists
          preferredDuringSchedulingIgnoredDuringExecution: # Preference for same zone
          - weight: 100
            preference:
              matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - $(ZONE) # Placeholder for zone label
      topologySpreadConstraints: # Ensure spread across zones
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfied: DoNotSchedule
        labelSelector:
          matchLabels:
            app: ml-training
      containers:
      # ... (your container definition)
      env:
      # ... (environment variable injection for ZONE as before)
      initContainers:
      # ... (init container for ZONE injection as before)

Pod Disruption Budget (PDB)

A pod disruption budget (PDB) is a Kubernetes policy that helps ensure the high availability of your applications running on the platform. It defines the minimum number of pods from a specific deployment that must be available at any given time. This ensures that even during maintenance operations or unforeseen disruptions, your application remains functional with minimal downtime.

Here’s a breakdown of how PDBs work:

Minimum Available Pods: You define the minimum number of pods that your application needs to function properly. This minimum acceptable number is specified in the PDB configuration.
Voluntary Disruptions: PDBs primarily target voluntary disruptions, which are planned events initiated by the cluster administrator or automated processes. These disruptions could include:
Node Drain : Taking a node out of service for maintenance requires draining the pods running on it. A PDB can prevent evictions from exceeding a safe limit to ensure application functionality.
Rolling Updates : Upgrading deployments often involve rolling restarts where new pods are introduced while old ones are terminated. A PDB can pace this rollout to avoid overwhelming the system.
Not for Involuntary Disruptions: PDBs don’t have control over involuntary disruptions caused by hardware failures, network issues, or software crashes. These events can still cause your application to become unavailable.

Benefits of using Pod Disruption Budgets:

High Availability: PDBs minimize downtime during planned maintenance or upgrades by preventing accidental pod evictions beyond a safe threshold.
Operational Efficiency: They provide a safety net for cluster administrators, allowing them to perform maintenance tasks with confidence.
Resource Optimization: PDBs can help prevent unnecessary pod evictions, leading to more efficient resource utilization within the cluster.

Here are some additional points to remember:

PDB Definition: PDBs are defined as YAML or JSON objects and applied using the kubectl apply command.
Disruption Budget: The budget can be specified as an absolute number of pods or a percentage of the total replicas in the deployment.
Always Allow Unhealthy Pod Eviction: It’s recommended to set the alwaysAllowUnhealthyPodEviction policy to true in your PDB. This allows evicting misbehaving pods during a node drain to proceed without waiting for them to become healthy.
Testing & Simulation: Test PDB configurations thoroughly to ensure they align with your application’s availability requirements and architecture. Simulate disruptions and verify that the desired number of Pods remains available.
Monitor and Alert: Implement monitoring and alerting mechanisms to detect PDB violations. This enables proactive management and ensures timely intervention in case of availability issues.
Graceful Shutdowns: Configure your applications to handle graceful shutdowns when evicted by a PDB. This allows them to complete ongoing tasks, release resources, and avoid data loss or corruption.

By implementing pod disruption budgets, you can enhance the resilience and availability of your applications running on Kubernetes clusters.

Example Pod Disruption Budget (PDB) in Kubernetes:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: hello-world
spec:
  selector:
    matchLabels:
      app: hello-world
  minAvailable: 30%

Pod Disruption Budgets (PDBs) work alongside deployment strategies like rolling updates with maxSurge and maxUnavailable settings to manage application availability during upgrades.

Pod Disruption Budget (PDB) and Rolling Update Strategy

You don’t necessarily need both a pod disruption budget (PDB) and a rolling update strategy, but they can work together effectively to achieve different goals for application availability in Kubernetes. They can be complementary for robust application availability during both maintenance and upgrades:

PDB sets a safety floor: It defines the minimum number of pods that must be available even during a rolling update with maxUnavailable.
Rolling update controls the rollout: It manages the pace of pod replacement within the boundaries set by the PDB’s minimum availability.
While maxSurge and maxUnavailable define the deployment's rollout strategy, a PDB sets a hard minimum on the number of available pods. This minimum should ideally be greater than or equal to maxUnavailable to avoid conflicts.
Scenario: Let’s say your deployment has 5 replicas, maxSurge is set to 1 (allowing 1 extra pod), and maxUnavailable is set to 2 (allowing 2 pods to be unavailable).
- Without PDB: The deployment could potentially terminate 2 pods and create 1 new one, leaving only 2 pods available (5 total - 2 unavailable + 1 surge). This might violate application requirements for minimum availability.
- With PDB: If a PDB is defined with a minimum of 3 available pods, the deployment can only terminate pods down to 3 running pods (5 total - 3 minimum available). This ensures your application remains functional with at least 3 pods even during the update.
- In case of a conflict between maxUnavailable and the PDB's minimum available pods, the stricter setting takes precedence. This ensures the PDB's minimum availability requirement is met.
Setting the PDB’s minimum available pods to a value greater than or equal to maxUnavailable in your deployment strategy ensures the PDB doesn’t accidentally restrict the deployment’s ability to perform rolling updates within the defined boundaries.
Using percentages for both PDB’s minimum available and deployment’s maxUnavailable allows for flexibility as you scale your application.

Pod Priority and Preemption

Kubernetes supports prioritizing your pods when it comes to scheduling. A critical workload/job could be marked as a higher priority pod compared to some other non-critical pods or jobs thereby increasing the chances of getting scheduling ahead of some other lower priority pods.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 10000
globalDefault: false # Not the default priority class
description: "This priority class is for high-priority pods."

apiVersion: v1
kind: Pod
metadata:
  name: my-high-priority-app
spec:
  priorityClassName: high-priority # Assign the priority class
  containers:
  - name: my-container
    image: my-image
    # ... other pod specifications

Important Factors
R esource quotas can interact with pod priorities. A higher-priority pod might still be subject to resource quotas.
PodDisruptionBudgets can protect even low-priority pods from being preempted if they are part of a critical application.

Quality of Service (QoS)

QoS classes define how Kubernetes handles resource requests and limits for pods. They influence how Kubernetes schedules pods and how it handles resource contention. Kubernetes automatically assigns a QoS class to a pod based on its resource requests and limits.

QoS Classes:

Guaranteed: Pods that have both resource requests and limits specified for all containers, and the requests are equal to the limits. These pods are given the highest priority for resources. They are less likely to be evicted.
Burstable: Pods that have resource requests and limits specified, but the requests are less than the limits. They can "burst" up to their limits if resources are available. They are more likely to be evicted than Guaranteed pods.
BestEffort: Pods that do not have resource requests or limits specified. They are given the lowest priority for resources and are the most likely to be evicted.

Qos and Pod Priority

While Scheduling Kubernetes first considers pod priority when scheduling pods. Higher-priority pods are scheduled first. Within a priority level, Kubernetes uses QoS classes to determine how to allocate resources. Guaranteed pods are given preference. When resources are scarce, Kubernetes uses both pod priority and QoS to decide which pods to evict. Lower-priority pods are evicted first. Within a priority level, BestEffort pods are evicted before Burstable pods, and Burstable pods are evicted before Guaranteed pods.

As a starting point, make sure to set appropriate resource requests and limits for your pods. This is crucial for QoS and for ensuring that your applications have the resources they need. Use the Guaranteed QoS class for critical system pods or applications that require predictable performance. Burstable is often a good balance for most applications, allowing them to burst up to their limits when resources are available. However, only use BestEffort for non-critical pods or background tasks that can tolerate resource scarcity and potential eviction.

Pod Affinity and Node Affinity

Pod affinity and node affinity are Kubernetes features that allow you to control how pods are scheduled onto nodes. They are essential tools for building resilient applications on EKS (or any Kubernetes cluster).

Pod affinity allows you to specify rules about where a pod should be scheduled based on other pods that are already running in the cluster. You can use it to attract pods to the same node or zone as other pods (co-location) or to repel pods from the same node or zone (anti-affinity).

Co-location (Attraction):

Placing related pods (e.g., a web server and its database) on the same node to reduce latency.
Ensuring that pods that communicate frequently are located close to each other.

Anti-affinity (Repulsion):

Spreading replicas of a pod across different nodes (or zones) to increase availability and fault tolerance. If one node fails, the other replicas will continue to run.
Preventing pods that consume a lot of resources from being scheduled on the same node and avoiding the bad-neighbor effect.

Types of Pod Affinity:

requiredDuringSchedulingIgnoredDuringExecution: The rule must be satisfied during scheduling. If the rule cannot be met, the pod will not be scheduled. IgnoredDuringExecution means that if the affinity rule becomes violated after the pod is scheduled (e.g., the other pod is terminated), the pod will continue to run but will not be rescheduled if it is evicted.
preferredDuringSchedulingIgnoredDuringExecution: The rule is preferred but not required. Kubernetes will try to satisfy the rule, but if it cannot, the pod will still be scheduled (potentially on a different node). IgnoredDuringExecution has the same meaning as above.
requiredDuringSchedulingRequiredDuringExecution: The rule must be satisfied during scheduling and must continue to be satisfied during execution. If the rule becomes violated after the pod is scheduled, the pod will be evicted. This is the strongest form of affinity.
preferredDuringSchedulingRequiredDuringExecution: The rule is preferred during scheduling and must be satisfied during execution. If the rule is violated after the pod is scheduled, the pod will be evicted.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        topologyKey: kubernetes.io/hostname # Spread across nodes
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - my-app
  # ... other pod specifications

Node Affinity

Node affinity allows you to specify rules about which nodes a pod should be scheduled on based on labels that are attached to the nodes. For instance scheduling pods on nodes with specific hardware (e.g., GPUs).

Types of Node Affinity: Similar to pod affinity, you have requiredDuringSchedulingIgnoredDuringExecution, preferredDuringSchedulingIgnoredDuringExecution, requiredDuringSchedulingRequiredDuringExecution, and preferredDuringSchedulingRequiredDuringExecution.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: Exists # Node must have the 'gpu' label
  # ... other pod specifications

Anti-affinity (both pod and node) is crucial for high availability. By spreading your pods across nodes and availability zones, you ensure that your application can survive node or zone failures. Makes it fault-tolerant in the event a node fails, the pods running on other nodes will continue to serve traffic. Affinity can also help you optimize resource utilization by co-locating related pods and ensuring that pods are scheduled on nodes with the appropriate resources. Co-locating pods that communicate frequently can reduce latency and improve performance. If you use node affinity, you might also need to use tolerations to allow pods to be scheduled on nodes with taints.

Note: Consider preferred Over required Unless it's essential, use preferred affinity rules over required rules. required rules can make it difficult to schedule pods if the constraints cannot be met.

Horizontal Pod Auto Scaling (HPA)

HPA is one of the best knobs to control in the Data Plane that allows scaling your deployments horizontally based on metrics. You have out-of-the-box metrics like CPU and Memory available that you could scale on or even combine metrics for the scaler. In addition custom metrics relevant to your applications such as connection_pool, requests_per_second etc. could be configured that may provide a better trigger for your applications to inform the scaler.

Fine-tune behavior for HPA

HPA supports scaleUp and ScaleDown behaviors to be configured per your needs through scaling policies. One or more scaling policies can be specified in the behavior section of the spec. When multiple policies are specified the policy that allows the highest amount of change is the policy that is selected by default. The stabilizationWindowSeconds is used to restrict the flapping of replica count when the metrics used for scaling keep fluctuating and is important to get it right for your workloads.

Note: When HPA is enabled, it is recommended that the value of spec.replicas of the Deployment and / or StatefulSet be removed from their manifest(s). If this isn't done, any time a change to that object is applied, this will instruct Kubernetes to scale the current number of Pods to the value of the spec.replicas key. This may not be desired and could be troublesome when an HPA is active, resulting in thrashing or flapping behavior.

Vertical Pod Auto Scaling (VPA)

The Vertical Pod Autoscaler (VPA) is a Kubernetes component that automatically adjusts the CPU and memory requests and limits of your pods. It analyzes the historical resource consumption of your pods such as peak usage, average usage, and resource usage trends, and then recommends or optionally can apply new resource settings to optimize resource utilization and improve overall cluster efficiency.

Pair this with the Pod Disruption Budget as VPA’s Updater component could apply recommendations that require pod restarts to take effect.
Starting with Recommender mode only, learning more about the recommendations and carefully monitoring your application are highly recommended best practices.
A good starting point would be with either HPA or VPA, depending on your needs. If you need to scale based on traffic, use HPA. If you need to optimize per-pod resources, use VPA.
In some specific use cases combining both may be required but note that contradicting recommendations for the scalers could make it complex to manage and troubleshoot. For instance, If both HPA and VPA are set to scale based on CPU or memory usage, they might contradict each other, leading to inefficient resource allocation. When using both HPA and VPA, consider using custom metrics for HPA to avoid conflicts with VPA’s resource adjustments.

Multi-Dimensional Pod Auto Scaler (Feature Request for EKS)

This is a feature request into EKS and would be great to support this in a future release to ease EKS users from having to juggle between HPA and VPA with a single AutoScaler that would eliminate contradicting scaling decisions.

Kubernetes Event-driven Autoscaling

KEDA (Kubernetes Event-driven Autoscaling) plays a crucial role in enhancing resilience on EKS (and Kubernetes in general) by enabling your applications to automatically scale based on various event triggers, rather than just CPU or memory usage. This event-driven autoscaling is key to building more responsive, robust, and cost-effective systems. You could certainly argue that custom metrics may come in handy and may suffice. However, when considering Event-Driven architectures, KEDA specifically shines in this space as it can scale based on a variety of scalers (e.g., RabbitMQ, Kafka, AWS SQS, PostgreSQL, Datadog, and many more), ensuring that your application can process messages as they arrive at scale. Combined with HPA, KEDA could prove to be a powerful tool in your kit to build resilient workloads!

Managing Computational Resources

Running resilient workloads would also mean thinking about your current computational resource utilization, requests, limits, quotas (QOS), and more. Please do check out Dive into managing Kubernetes computational resources which we published on this topic earlier. It dives into a lot more details and helps you gain insights into computational resources.

Self Healing with Kubernetes Probes

Kubernetes Probes allows Kubernetes to monitor the health and readiness of your pods and take action when issues arise, ensuring your application remains available and responsive. Needless to say how important probes are for running resilient and self-healing workloads. It is a good practice to configure these to strike a balance between speed and reliability, as you don't want to configure thresholds too small that it takes several restarts to start one, nor do you want to bump thresholds up too much that delays traffic being routed to the pod that has been ready for a while. Highly recommend reading Dive into Kubernetes Healthchecks (2-part series) published earlier that will help you gain a solid understanding of the various probes and their impact on your workloads.

Run Lean or Distroless images

Running lean images or distroless images plays a significant role in EKS resiliency by improving security, reducing resource consumption, and speeding up deployments.

Lean images and distroless images contain only the essential components needed to run your application. They eliminate unnecessary libraries, tools, and system utilities that might be present in traditional base images. This significantly reduces the attack surface, minimizing potential vulnerabilities that attackers could exploit. Lean images make it easier to audit your container images and ensure that they comply with security policies and go well with your
Lean images and distroless images are significantly smaller in size compared to traditional base images and hence can be pulled more quickly from container registries, reducing the time it takes to deploy your application. This is especially important for scaling and rolling updates. Lean images often have a smaller memory footprint, allowing your applications to run more efficiently and potentially allowing you to run more pods on the same node.
Smaller images lead to faster image pulls, which speeds up the deployment process and often results in faster startup times because they have fewer components to initialize. This can be crucial for applications that need to scale quickly or even for microservices that are frequently deployed or even in optimizing HPA.

Use a Service Mesh/Service Network

Service meshes like Istio, Consul, LinkerD, or a Service Network like VPC Lattice enable service-to-service communication and increase the observability and resiliency of your microservices network. Most service mesh products work by having a small network proxy run alongside each service that intercepts and inspects the application’s network traffic. You can place your application in a mesh without modifying your application. Using the service proxy’s built-in features, you can have it generate network statistics, create access logs, and add HTTP headers to outbound requests for distributed tracing, enable automatic request retries, timeouts, circuit-breaking, rate-limiting as well as improve security patterns.

Service Networks such as VPC Lattice have a different architecture that does not require one to configure proxies or sidecars. Instead, VPC Lattice provides a managed control plane and data plane, eliminating the need for additional components within your Pods.

Circuit Breaking, Retry, and Backoff

Circuit breakers are a powerful technique to improve resilience. By preventing additional connections or requests to an overloaded service, circuit breakers limit the “blast radius” of an overloaded service. The circuit-breaker pattern could be applied within your application that communicates with various upstream services, and or at the ingress controller or the service mesh if one was supported. This should also be paired with an appropriate Retry/Backoff mechanism where Retry/Backoff attempts to recover from temporary issues, and the circuit breaker steps in to prevent further attempts when the problem is more persistent. This combination makes your system more resilient to failures.

AWS Well-Architected Framework

Highly recommend that we consider the Well-Architected Framework AWS Well-Architected Framework which provides a structured approach to designing and operating resilient services, building and operating secure, high-performing, resilient, and efficient infrastructure for your applications and workloads in the AWS cloud. By adhering to its principles, you can build systems that are better equipped to withstand failures, recover quickly from disruptions, and provide consistent availability to your users. It’s about designing for failure, not just hoping it won’t happen. Please note that it's not a managed service itself, but rather a conceptual model that provides a consistent approach to evaluating and improving your cloud architecture.

Resilience Assessment and Design For Failure

Photo by Francisco De Legarreta C. on Unsplash

It is best to assume that anything that can go wrong will go wrong

The National Academy of Sciences defines resilience as “the ability to prepare and plan for, absorb, recover from, or more successfully adapt to actual or potential events”.

It's always a good practice to test and see how your design choices hold when failures are induced. Chaos engineering can be used to validate the effectiveness of design choices made with “ designing for failure ” in mind. Chaos could be induced into your workloads and clusters in various ways, ranging from inducing node restarts, pod restarts, triggering failover, etc. It is also a good idea to use something like AWS Fault Injection Service in combination with Behavior Driven Development (BDD) or similar for managing your experiments and most importantly orchestrating the expected behavior.

AWS Resilience Hub

Leverage AWS Resilience Hub to manage and improve the resilience posture of your applications on AWS. AWS Resilience Hub enables you to define your resilience goals, assess your resilience posture against those goals, and implement recommendations for improvement based on the AWS Well-Architected Framework. Within AWS Resilience Hub, you can also create and run AWS Fault Injection Service (AWS FIS) experiments, which mimic real-life disruptions to your application to help you better understand dependencies and uncover potential weaknesses. AWS Resilience Hub provides you with the services and tooling you need to continuously strengthen your resilience posture, all in a single place.

Secure your traffic with AWS WAF and AWS Shield

WAFs are a critical component of resilient workloads. They protect against application-level attacks, enhance availability, support incident response, enable safe deployments, and help organizations meet compliance requirements. While not a replacement for other security measures, a properly configured and maintained WAF provides a valuable layer of defense for any publicly accessible web application on EKS.

While AWS Shield provides DDoS protection at the network layer, AWS WAF can protect against application-layer DDoS attacks. Combining these AWS solutions with EKS can help harden your services, and improve your availability and resilience.

Golden Signals

The Golden Signals are high-level indicators of service health and performance.

Photo by Mikail McVerry on Unsplash

Data plane observability provides the underlying data that feeds these signals. By monitoring the Golden Signals, you can gain a quick understanding of the state of your services and take action to ensure their reliability and resilience. Some key aspects for operating and managing production workloads:

Setting SLOs (Service Level Objectives): Define target values for your Golden Signals (e.g., “99.9% of requests should have a latency of less than 200ms”). These SLOs become your key performance indicators (KPIs) for service reliability.
Alerting: Set up alerts based on your SLOs. If a Golden Signal deviates significantly from its target, you’ll be notified so you can investigate.
Troubleshooting: When an issue occurs, use the Golden Signals to quickly understand the impact. For example, if you see a spike in latency, you can investigate further using traces and logs.
Capacity Planning: Use traffic and saturation metrics to understand your capacity needs and plan for future growth.
Performance Optimization: Identify bottlenecks by analyzing latency and saturation metrics.
Tooling: Use appropriate tooling such as APM to provide end-to-end observability and traceability for your infrastructure. AWS X-Ray is a good choice and an invaluable tool for enhancing observability in your EKS environment. It provides detailed distributed tracing, improves your understanding of Golden Signals, and ultimately contributes to building more resilient and performant applications. By enabling faster root cause analysis, proactive issue detection, and performance optimization, X-Ray empowers you to operate your workloads more effectively and ensure a better user experience.

Summary

As you may have figured out by now, operating resilient workloads requires quite a bit of planning and effort. Of course, EKS makes lives easier as described with EKS auto mode and Karpenter, but one has to still focus on optimizing and securing their data planes. I hope this article leaves you with a reasonable understanding of the knobs you could turn to make your critical services highly available and resilient! Lastly leaving a bunch of reference materials that were an inspiration to this article and may come in handy along your journey for operating resilient workloads on Kubernetes! Cheers!

References

Thanks to Jonathan Dawson for the feedback on this article!

GP2 to GP3 for AWS RDS Postgres

Kris Iyer — Fri, 21 Feb 2025 05:01:56 +0000

AWS GP3

Overview

Recently I had an opportunity to work on a migration for RDS Postgres storage types from GP2 to GP3 for a large database. The migration was mostly motivated by potential performance improvements, getting past throughput limits on GP2, better IOPS, and of course, getting on a more modern storage architecture offered by AWS at a lower cost. This post mostly highlights some challenges you may encounter during a migration process which is not very obvious until you run into them.

Migration Options

Depending on the size of the database and/or migration windows/strategy you may choose one of the following:

Modify Storage Type This is the simplest approach. You can directly modify the storage type of your primary instance to GP3 through the AWS Management Console, CLI, or SDK. It’s generally non-disruptive, with minimal performance impact during the conversion. Use this option when:
You need a simple and non-disruptive upgrade: This method is the easiest way to convert your storage and offers minimal downtime. It’s generally suitable for most GP2 to GP3 migration scenarios.
Downtime tolerance is low: The modification process is typically non-disruptive and in place, with only a brief period of potential performance impact while the underlying storage configuration is adjusted (Storage Optimization).
- Your database size is manageable: Modifying storage type works well for databases of various sizes. However, for very large databases, a snapshot/restore approach might offer more control over the migration process.
Database Migration Service (DMS): Use DMS to migrate your data from the GP2 instance to a newly created GP3 instance. This is flexible for complex migrations and minimizes downtime, but it might require more configuration and potentially incur additional costs and could have an impact on the primary depending on the configuration.
Replication task scheduling: Schedule the initial data load for off-peak hours when the writer's workload is lighter.
Bandwidth throttling: Limit the amount of data DMS reads from the source per unit of time to minimize performance impact.
CDC (Change Data Capture): For ongoing replication, DMS uses CDC techniques to capture only the changes made to the source data, reducing the load compared to full table scans.
pglogical: Can be used as a plugin for Postgres Logical Replication for DMS thereby reducing the impact on the writer further.
Snapshot and Restore: If you have a very strict window for downtime during the migration and modifying storage type directly isn’t feasible due to the brief outage it can cause, Snapshot and Restore can offer more control. You can create a snapshot of the GP2 volume during a maintenance window and then restore it to a newly provisioned GP3 instance during another window. This approach allows you to minimize downtime on the primary instance by performing the restoration on a separate instance.
Replica with GP3 storage: While keeping your primary instance on GP2 adding a replica with GP3 storage can be a potential strategy for transitioning to GP3. Some Considerations:
Cost: This could be a factor for this solution as this solution does require additional instance(s).
Rehydration : Status could be a big factor that requires one to either work with some tools to speed up rehydration and/or collaborate with AWS over support tickets to understand the rehydration the status from S3 to EBS.

Some scenarios and comparisons

Hydration after restore

For Amazon RDS instances that are restored from snapshots (automated and manual), the instances are made available as soon as the needed infrastructure is provisioned. However, there is an ongoing process that continues to copy the storage blocks from Amazon S3 to the EBS volume; this is called lazy loading. While lazy loading is in progress, I/O operations might need to wait for the blocks being accessed to be first read from Amazon S3. This causes increased I/O latency, which doesn’t always have an impact on applications using the Amazon RDS instance. If you want to reduce any slowness due to hydration, read all the data blocks as soon as the restore is complete.

Mitigating Effects of Lazy Loading

There are a few strategies available that help mitigate or minimize the impact or in other words also help speed up lazy loading. For Amazon RDS for PostgreSQL, the following options are available:

Use the pg_prewarm shared library module to read through all the tables
pg_prewarm doesn’t pre-fetch the following:
Toast tables [RDS limitation] — No workaround
Indexes [pg_prewarm limitation] — No workaround
DB Objects owned by other users [RDS limitation]. The workaround here is to re-run the SQL once as each DB User (that owns any table) Useful script can be found below: https://github.com/robins/PrewarmRDSPostgres/blob/master/singledb.sql https://github.com/robins/PrewarmRDSPostgres/blob/master/toast.sql
Use the pg_dump utility with jobs and data-only parameters to perform an export of all application schemas
Perform an explicit select on all the large and heavily used tables individually with parallelism.
For large tables, you may be able to split the query into ranges based on the primary key. For example, the below query gives 4 ranges of equal number of rows (primary key col1)

select nt,max(col1),count(*) 
from (SELECT col1, Ntile(4) over(ORDER BY col1) nt FROM testuser.test_table)st 
group by nt 
order by nt;

        NT MAX(COL1) COUNT(*)
---------- ---------- ----------
         1 125000 125000
         2 250000 125000
         3 375000 125000
         4 500000 125000

NOTE: test the query before using in production, if you decide to use this query.

DMS however acts as a conduit for transferring data between databases. During a GP2 to GP3 migration using DMS, the data is directly transferred from the source GP2 instance to the target GP3 instance. DMS doesn’t store the entire migrated dataset in S3 as an intermediate step thus eliminating the need to rehydrate from S3.

PostgreSQL Error Conflict Recovery

ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
CONTEXT: SQL statement "select pg_prewarm(temprow.tablename)"
PL/pgSQL function inline_code_block line 6 at SQL statement

Conflict Recovery Error might occur due to the lack of visibility from the primary instance over the activity that’s happening on the read replica. The conflict with recovery occurs when WAL information can’t be applied on the read replica because the changes might obstruct an activity that’s happening on the read replica.
Query conflict might happen when a transaction on the read replica is reading tuples that are set for deletion on the primary instance. The deletion of tuples followed by vacuuming on the primary instance causes a conflict with the SELECT query that’s still running on the replica. In this case, the SELECT query on the replica is terminated with the following error message:

ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.

max_standby_archive_delay

max_standby_archive_delay is a configuration parameter on a PostgreSQL replica instance in RDS that controls how long the replica waits for conflicting queries to finish before canceling its attempt to apply a specific Write-Ahead Log (WAL) segment. In other words, allows you to manage the trade-off between data consistency and accommodating long-running queries on your RDS PostgreSQL replica.

If WAL data is read from the archive location in Amazon Simple Storage Service (Amazon S3), then use the max_standby_archive_delay parameter.

max_standby_streaming_delay

max_standby_streaming_delay on the replica primarily affects its behavior and data consistency with the primary. However, in high-load scenarios or during failovers, it can have indirect consequences for the writer (primary) due to potential replication lag. For instance, If the replica falls significantly behind due to frequent pauses or cancellations caused by, the primary might experience increased write load as it needs to buffer more WAL segments before they can be applied on the replica.

If you are increasing max_standby_archive_delay to avoid canceling queries that conflict with reading WAL archive entries, then consider increasing max_standby_streaming_delay as well to avoid cancelations linked to conflict with streaming WAL entries.

If WAL data is read from streaming replication, then use the max_standby_streaming_delay parameter.

vacuum_defer_cleanup_age

NOTE: Applies to Postgres versions < 17.

With vacuum_defer_cleanup_age, you could specify a time delay (in seconds) for how long the replica would defer cleaning up certain types of data during auto vacuum. The purpose of this deferral was to potentially avoid conflicts between ongoing queries on the replica and the auto vacuum process cleaning up data that those queries might still be accessing.

For versions ≥ 17

The combination of the statement_timeout parameter and the hot_standby_feedback feature can achieve a similar outcome to vacuum_defer_cleanup_age .

Lazy Loading and Multi-AZ

When you change your Single-AZ instance to Multi-AZ, Amazon RDS creates a snapshot of the instance’s volumes. The snapshot is used to create new volumes in another Availability Zone. Although these new volumes are immediately available for use, you might experience a performance impact. This impact occurs because the new volume’s data is still loading from Amazon Simple Storage Service (Amazon S3). Meanwhile, the DB instance continues to load data in the background. This process might lead to elevated write latency and a performance impact during and after the modification process.

a. Initiate a failover on your Reader instance to be sure that the new AZ is the primary AZ.

b. Perform read operations in your Reader instance — Perform an explicit select on all the large and heavily used tables individually with parallelism or — Use the pg_prewarm shared library module to read through all the tables

c. Confirm that the write latency has returned to normal levels by reviewing the WriteLatency metric in Amazon CloudWatch.

For more information, refer to “What’s the impact of modifying my Single-AZ Amazon RDS instance to a Multi-AZ instance and vice versa?”

Cascading Replicas and Rollback Scenarios

Note: This only applies to Postgres versions ≥14.1.

While you make your transition from GP2 to GP3, certainly plan on a fallback or rollback strategy as best practice. As you promote your GP3-based replica to the primary (standalone) make sure to add a replica which is GP2 just in case. You can achieve this with cascading replicas which could be set up ahead of time along with the GP3 replica. Rehydration and lazy-loading apply the same as any others including any Multi-AZ Standby instances.

With cascading read replicas, RDS for PostgreSQL DB instance sends WAL data to the first read replica in the chain. That read replica then sends WAL data to the second replica in the chain, and so on. The end result is that all read replicas in the chain have the changes from the RDS for PostgreSQL DB instance, but without the overhead solely on the source DB instance.

Conclusion

Choosing the right strategy depends on how critical minimizing downtime is, cost, and scalability among other things. Choose wisely, consult AWS support, and engage your AWS TAM and AWS Solution Architects to discuss the impact of migration on your databases and SLAs while considering lazy loading and/or full initialization of RDS GP3 volume as you plan yours!

Analyzing Amazon Load Balancer Access Logs

Kris Iyer — Mon, 05 Feb 2024 22:11:19 +0000

Overview

Analyzing access logs may be required for several reasons and is a great practice, in general, to stay on top of your access logs to understand traffic, distribution, user agents, URI classification(client's IP address, latencies, request paths, and server responses). Overall you can use these access logs to analyze traffic patterns and troubleshoot issues.

Access logs are not activated by default. Once enabled access logs are shipped to Amazon S3. Before you enable them on yours please read the below carefully(including s3 costs):

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html

It's important to also note that the access log files are compressed. If you open the files using the Amazon S3 console, they are uncompressed and the information is displayed. If you download the files, you must uncompress them to view the information. Depending on your use cases your access logs could be running into gigabytes of data and processing and analyzing could be challenging.

There are several ways you could approach analyzing access logs. Below is a summary of some of our options not listed in any particular order.

AWS Based Log Analyzers

Log Analytics with Amazon Athena

Since Load Balancer access Logs are shipped to S3, you may use the power of Athena to query from S3. You can then slice this data based on various dimensions using plain old SQL which works great and is effective.

https://repost.aws/knowledge-center/athena-analyze-access-logs
https://repost.aws/knowledge-center/analyze-logs-athena

You may further choose to combine this with Amazon QuickSight to build powerful dashboards for BI use cases.

Amazon OpenSearch Service

Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) operates OpenSearch and open-source Elasticsearch, making it easy to search, visualize, and analyze your data across multiple use cases such as:

Fast, Scalable Full-Text Search
Application and Infrastructure Monitoring
Security and Event Information Management
Operational Health Tracking

CloudWatch Log Insights

This does need some extra work on our part before transform access logs from s3 to JSON format to Cloudwatch. However once in CloudWatch, we could use CloudWatch Insights and its capabilities to analyze this data. Optionally you can also use natural language (with the AI assistant) to create CloudWatch Logs Insights queries that may otherwise be challenging to build.

External Log Analyzers

Several Enterprise Solutions exist in the market that allow you to ingest and analyze logs (not limited to access logs).

Some popular integrations for your review:

Opensource Log Analyzers

elb-log-analyzer (py)

elb-log-analyzer is a Python-based utility that lets you connect to your origin (s3) and analyze logs. In addition, it does have several features including, downloading logs from s3, analyzing logs, streamlit integration for dashboards, slack integration for setting up anomaly alerts, docker integration, and more!

elb-rebar (RUST)

elb-rebar is a parallel AWS Elastic Load Balancing log analyzer for quick statistics on web requests. This is a RUST-based utility that is easy to install and run!

elb-log-analyzer (NPM)

elb-log-analyzer is an NPM-based utility that lets you quickly install and be up and running by parsing your logs with various dimensions. I find it very flexible regarding usage and its ability to sort, limit, or even filter our search by prefix (this is extremely useful when there is a high volume of unique URIs to track due to request parameters or similar).

Anomaly Detection

ML-backed Anomaly detection with access logs could come in handy in some use cases. There are a few that already offer this capability.

Conclusion

Hopefully, this article leaves you with a bunch of options under AWS, enterprise, and open-source solutions out there that help one deal with a common but challenging problem space with ever-growing data patterns, logs, and analytics requirements! Lastly, make sure to stay on top of your access logs in addition to metrics and application logs for increased Reliability, Security, Stability, and Scalability of your applications and/or services!!!

AWS Pricing Calculator needs hardening

Kris Iyer — Thu, 03 Aug 2023 17:09:08 +0000

Background

AWS Pricing calculator allows you to Configure a cost estimate that fits your unique business or personal needs with AWS products and services. We have had the Simple Monthly Calculator (SMC) (Retired as of 3/31/23) that provided estimates in the past which was then replaced with the AWS Pricing Calculator

Why should you use AWS Pricing Calculator?

The AWS Pricing Calculator has a simplified web interface, which now supports cost estimates of more than 150 AWS services. It also enables cost estimates at scale, like bulk import for EC2 instances. AWS Pricing Calculator is accessible to all users, prospects, or AWS customers, without an AWS account. It provides cost estimates for your workloads, using the public AWS prices. Also the best source to get the most updated and comprehensive pricing estimates in one place.

Areas on AWS Pricing Calculator that needs Hardening

User Experience Related

This one has been out there for a while but its easy to configure your estimate and do all the hard work to get to the point where you have an estimate only to click outside of the primary form and find out you have just lost all your great work.

Make sure to not click anywhere outside of the primary overlay or form for the calculator without saving your work.

Pricing Related

This one in particular has been even more frustrating in many ways. Noticed that the form fields reset automatically when you change around a few options under your service configuration which could lead to a price that is not accurate.

Here is an example where i would try to compare my price for the two Aurora PostgreSQL offerings (Aurora Standard vs Aurora I/O-Optimized). I start out by selecting Aurora Standard and gather the monthly price ($894.98 USD/m for a db.r7g.2xlarge)

All good until this point. Next I update from Aurora Standard to I/O-Optimized to compare and the cost bump really caught me (and a few others) by surprise.

The pricing calculator resets the instance type and picks an alternate type which is not what you expect. Easily goes unnoticed when your focus was a cost comparison and you are looking at the delta. Certainly caught a bunch of us by surprise. Will drop in some support cases with AWS next to see if we can resolve these discrepancies and improve the user experience!

Hope this saves some of you some time!

How to build an AI chatbot with Openfire and OpenAI Chat Completion

Kris Iyer — Fri, 24 Mar 2023 11:02:53 +0000

Photo by Eric Krull on Unsplash

Responsible use of artificial intelligence (AI) and ML technologies is key to fostering continued innovation.

AI chatbots are here, there, and everywhere! Ever since the introduction of ChatGPT in December 2022, tech companies of all sizes have been racing to build AI-powered tools and solutions.

This article describes some high-level building blocks to help you build a chatbot powered by Openfire that could be wired up to OpenAI APIs to provide chat completions!

High-level Architecture for building a chatbot with Openfire.

OpenAI API

The recently released OpenAI APIs are a game changer for both small and large enterprises to provide In-App chatGPT-like applications powered by the same APIs and models that power chatGPT. Check out the API reference and the API playground for more details on the Chat Completions API and more.

The early adopters of the OpenAI APIs such as My AI on Snapchat, Instacart, shop by Shopify, and Speak are great references for building the next generation of AI tools and solutions. For edtech, virtual tutors and learning assistants offered by Quizlet, DuoLingo, and Khan Academy are creative examples of GTP 3 and 4 capabilities!

To make lives easier the APIs come with a variety of SDKs which includes community-based projects that are well-documented, and easy to integrate with your applications!

Openfire

Openfire is a real-time collaboration (RTC) server licensed under the Open Source Apache License. It uses the only widely adopted open protocol for instant messaging, XMPP (also called Jabber). Openfire is incredibly easy to set up and administer but offers rock-solid security and performance.

The project was originated by Jive Software around 2002, and continues to thrive under a community model, as part of the Ignite Realtime Foundation that does a fantastic job!

Some of the core features (but not limited to):

XMPP server written in Java and licensed under the Apache License 2.0
User-friendly web-based installation and administration panel
Shared groups for easy roster deploying
Plugin interface
SSL/TLS support
Offline Messages support
Server-to-Server connectivity
Database connectivity for storing messages and user details (including the embedded HSQL database and support for MySQL, PostgreSQL, etc.)
LDAP integration
Platform independent (with the installers for different platforms)
Connection manager for load balancing
Clustering support (hazelcast)
Message archiving-logging
Content filtering, packet rules
Pluggable Roster Module
Custom Authentication Provider
Support for BOSH as well as WebSockets.
File Sharing

To get started visit working with Openfire to get your chat server up in a few mins!

An admin dashboard for managing openfire!

XMPP Clients

Openfire server could be used along with a Javascript XMPP client of your choice. Some popular projects and plugins that can get you started are as follows:

ConverseJS is a popular Javascript XMPP client that implements a full range of XMPP extensions. Also available as a plugin for openfire — inverse-openfire-plugin and can be installed on openfire with a few clicks.
StanzaJS is a JavaScript/TypeScript library for using modern XMPP, and it does that by exposing everything as JSON. Unless you insist, you have no need to ever see or touch any XML when using StanzaJS.
JSXC is a Javascript XMPP client that is also available as a jsxc openfire plugin that could be installed with a few clicks.

The plugin page in openfire with inverse and jsxc installed.

Plugins

Plugins are a great way to extend or customize capabilities on your Openfire server.

There are a no of community-based plugins that you could readily install and use on your Openfire server. Adding a new plugin for your custom needs requires setting up a jar/war per the Openfire plugin specification and deploying them to your Openfire server.

You are going to need one for a chatBot!

Botz

The Botz library adds to the already rich and extensible Openfire with the ability to create internal user bots. With the Botz library, programmers may choose to develop a virtual user or a chatbot as a plugin. Although Openfire does not really distinguish this virtual user from the real users, one could intercept messages to the chatBot from your users, and be able to respond per your needs. An example would be to integrate the Botz library with the relatively new OpenAI Java SDK to provide an AI chatBot or a ChatGPT experience for your Openfire users.

Setup BOTZ within your Openfire plugin

https://medium.com/media/1d6aea2fff779da1bf9c68905cf6bb02/href

Botz version 1.2.0 was released recently and can be used alongside Openfire 4.7.4. Thanks to @guusdk !!!

OpenAI API integration

Add the OpenAI java dependency to your project. The below example uses gpt-3.

<dependency>
 <groupId>com.theokanning.openai-gpt3-java</groupId><!-- use latest as needed -->
 <artifactId>service</artifactId>
 <version>0.11.0</version><!-- use latest as needed -->
</dependency>

Add a helper to wrap any customizations for your service.

https://medium.com/media/730ea9ec8030560d8d1030e6097baf74/href

At this point, packaging your plugin code and deploying it to the Openfire server should enable your BOT user and set up the presence.

Auto Display Bot for your users

Now that we have a way to add a virtual BOT user, we will need a mechanism to have the BOT user appear under the contact lists for your real chat users. In order to do that:

Create a group in Openfire’s admin console
Make the bot a member of that group
Enable contact list group sharing, to make the group appear on the contact list of every user in the system

Openfire group and contact listing for bot user.

Wrap up

chat completions with openAI with converse (inverse openfire plugin)

What you see above is a real user chatting with the virtual user myaibot.
Since we made the BOT a member of a new group and enabled contact list group sharing for all users we see the BOT listed under the user's contact list along with the online presence that was set up using Botz.

I hope some of these building blocks help you build the next cool AI BOT for your application powered by Openfire, BOTZ, and openAI!

At Houghton Mifflin Harcourt we’ve only just begun to explore ideas on embedding AI and chatGPT-like tools into our platform and products that could benefit learners and educators as well as help find efficiencies with internal workflows. Julie Miles (SVP — Learning Sciences), myself, and the rest of my colleagues really think of this as the tip of the literal iceberg. Some of the ideas we have been working on across some broad buckets of research are listed below:

Reduce teacher effort in lesson planning
Use assessment data to group students or differentiate assignments to assist teachers
Provide feedback on areas to improve to both the student and the teacher on the student’s work, recommendations, etc
Provide intelligent tutoring to students (e.g., encouragement, hints, feedback on essays, etc.) while they’re working through a lesson
Create efficiencies in functional workflows so that we free up more time for thought leadership which leads to more innovation. Some areas we have already experimented with include:
Writing first drafts of content or assessment items
Translating content into other languages
Inserting explanatory comments into existing code
Asking how to fix code that doesn’t work
Checking code for bugs and auto-completion
Drafting marketing names for new products and many more areas.

With open AI leading the way with GPT 3 and GPT-4, the future looks promising as well as exciting! Looking forward to learning and experimenting with building safe, accountable, and trustworthy AI solutions!

Helpful Resources

There is a plethora of open-source/no-code/low-code solutions out there that you could play with to spin up your AI BOT. I want to leave you with some helpful resources:

RocketChat recently came out with an OpenAI chat completion app, and browser-based no-code app builder
flutter.io announced easy additions for OpenAI chat completions to their projects.
sms-chatbot with nodejs and twillio
Your Personal Michelin Star Chef with OpenAI’s GPT-3 Engine, python and twillio
Learn more about Botz and Openfire

That’s a wrap for this article. Good Luck and Stay Safe!

Thanks to Julie Miles and Tom Holt for their contributions to this post!

High CPU and zombie threads on Amazon Aurora Mysql 5.6

Kris Iyer — Thu, 22 Sep 2022 21:12:38 +0000

Recently noticed some high avg CPU utilization on an Amazon Aurora Mysql Databases running Mysql 5.6 (oscar:5.6.mysql_aurora.1.22.2). Something that was noticed that I thought was interesting to share were zombie threads or threads that were running for a long period of time and never finished as well as threads that were not possible to be killed.

These were simple DDL statements that were triggered by a little reporting engine that created a bunch of temporary tables to gather some aggregations.

A quick look up on the process list tells us that there are some DDL statements stuck for 4 days as shown below:

mysql> show full processlist;
| Id       | User            | Host                | db         | Command | Time   | State                     | Info                
| 77569519 | app        | x.x.x.x:yyyyy | test | Query   | 404949 | init                      | DROP TEMPORARY TABLE IF EXISTS temp1
:::::

TRX status for the same:

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_TRX where trx_mysql_thread_id = 77569519 \G
*************************** 1. row ***************************
                    trx_id: 124803462108
                 trx_state: RUNNING
               trx_started: 2022-09-17 21:01:45
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 33614
       trx_mysql_thread_id: 77569519
                 trx_query: DROP TEMPORARY TABLE IF EXISTS temp1
       trx_operation_state: NULL
         trx_tables_in_use: 0
         trx_tables_locked: 0
          trx_lock_structs: 14
     trx_lock_memory_bytes: 376
           trx_rows_locked: 0
         trx_rows_modified: 33600
   trx_concurrency_tickets: 0
       trx_isolation_level: READ COMMITTED
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.00 sec)

The initial suspect was disk issues causing these long running queries but was ruled out as metrics seemed ok and the database appeared to have plenty of Local Storage to deal with temporary tables. The next attempt to recover from these were to kill the long query to free up the CPU cycles.

mysql> call mysql.rds_kill(77569519);
Query OK, 0 rows affected (0.00 sec)

mysql> call mysql.rds_kill_query(77569519);
Query OK, 0 rows affected (0.00 sec)

No luck despite attempts to kill the query and even the connection. While rds_kill_query did not change anything rds_kill did change the command status from Query to killed. Neither of these were helpful in this case and the trx_state continued to be RUNNING.

mysql> show full processlist;
| Id       | User            | Host                | db         | Command | Time   | State                     | Info                
| 77569519 | app        | x.x.x.x:yyyyy | test | Killed  | 422937 | init     | DROP TEMPORARY TABLE IF EXISTS temp1
:::::

Next up was to seek some help from AWS Support and thus gathered the below recommendations:

Reboot the Amazon Aurora Cluster (or trigger a failover).
Upgrade From Amazon Aurora 1.x to Amazon Aurora to 2.x. Particularly 2.07.8 which has some fixes from the community edition for stability around temporary tables.

Note that Aurora 2.x would mean an upgrade to Mysql 5.7.x from a compatibility standpoint.

Hope this helps!

Tracking down high CPU Utilization on Amazon Aurora PostgreSQL

Kris Iyer — Tue, 30 Aug 2022 16:10:12 +0000

High CPU utilization on Aurora RDS

In one of my previous articles, I discuss some interesting ways we can troubleshoot high local storage utilization on Amazon Aurora PostgreSQL. In this article, I share some thoughts on troubleshooting high CPU utilization as well as some best practices for Amazon Aurora PostgreSQL.

Keep it Simple

PostgreSQL built-in extensions pg_stat_statements, pg_stat_activity, and pg_stat_user_tables are great starting points and can quickly help you gather insights around your top SQL, missing indexes, gather insights into locking and identify blocked queries along with blocking PID's/queries.

https://medium.com/media/2f24ca9c7a72fe833d90317710c48f82/href

Slow Query Logging

For heavy and concurrent workloads, slow query logging could provide you with some great insights. Go ahead and turn on your slow logs but make sure to set up some reasonable thresholds just so that you catch enough. Note that logging all statements could have a huge impact on performance as well as result in high resource utilization. Logging lock waits could also be a useful addition to see if lock waits were contributing to your performance issues.

# Example that logs statements executing for > 500ms.
log_min_duration_statement=500

# Useful in determining if lock waits are causing poor performance 
log_lock_waits=1

Follow up with EXPLAIN ANALYZE and look for improvements.

Some pointers for you to look for:

Look for a difference in estimated vs actual rows.
No index, wrong index (cardinality)
Large no of buffers read (check on the working-set and if it fits under shared_buffers)
A large no of rows filtered by a post join predicate.
Reading more data than necessary (pruning, clustering, index-only)
Look for slow nodes in your plans (SORT[AGG], NOT IN, OR, large seq_scans, CTEs, COUNT, function usage in filters, etc.)

Note that sequential scans in some cases may be faster than an index scan specifically when the SELECT returns more than approximately 5–10% of all rows in the table. This is because an index scan requires several IO operations for each row whereas a sequential scan only requires a single IO for each row or even less because a block (page) on the disk contains more than one row, so more than one row can be fetched with a single IO operation. More bounded queries are better to inform the optimizer and can help pick the right scan strategy.

Analyzing the plan could be overwhelming sometimes and could use tools such as dalibo and depesz that help visualize your explain plans (Make sure to read the data retention policies on these tools and ideally anonymize your queries for security reasons before you upload your plans)!

Performance Insights

Turning on Performance Insights on your Aurora PostgreSQL cluster is another great way to get detailed insights into your performance and resource utilization. With Performance Insights, you have a quick way to slice your queries by top SQL, top wait, etc, and can come in handy to continually monitor your production workloads.

Performance Insights showing high CPU utilization sliced by top SQL.

Another great metric dimension to look at would be to look at waits and identify where your database may be spending the most time. Metrics are broken down below by Top SQL and sorted by the top wait.

Performance Insights showing metrics sliced by top SQL and wait.

If you need a good overview and understanding of Performance Insights I highly recommend watching the talk on Performance Insights at AWS re:Invent.

Configuration

shared_buffers

One of the common pitfalls with setting shared_buffers very large is that the memory is nailed down for page caching, and can’t be used for other purposes, such as temporary memory for sorts, hashing, and materialization (work_mem) or vacuuming and index build (maintenance_work_mem).

If you can’t fit your entire workload within shared_buffers, then there are a number of reasons to keep it relatively small. If the working set is larger than shared_buffers, most buffer accesses will miss the database buffer cache and fault a page in from the OS; clearly, it makes no sense to allocate a large amount of memory to a cache with a low hit rate.

https://medium.com/media/1d635554846d95a294535ffb62d847dd/href

wal_buffers

PostgreSQL backend processes initially write their write-ahead log records into these buffers, and then the buffers are flushed to the disk. Once the contents of any given 8kB buffer are durably on disk, the buffer can be reused. Since insertions and writes are both sequential, the WAL buffers are in effect a ring buffer, with insertions filling the buffer and WAL flushes draining it. Performance suffers when the buffer fills up and no more WAL can be inserted until the current flush is complete. The effects are mitigated by the fact that, when synchronous_commit is not turned off, every transaction commit waits for its WAL record to be flushed to disk; thus, with small transactions at low concurrency levels, a large buffer is not critical. With PostgreSQL 14 you can now get more insights into your wal_buffers with pg_stat_wal.

Below you can see a high CPU but also high WAL:write which could provide us some hints to tune the database further such as setting aside some extra memory for the wal_buffers.

Performance Insights showing metrics sliced by top waits.

random_page_cost

Defaults to 4. Storage that has a low random read cost relative to sequential, such as solid-state drives could be better modeled with a lower value for random_page_cost, e.g. 1.0. Best to configure Aurora databases with 1.0 and measure improvements.

work_mem, max_parallel_workers_per_gather

For a complex query, several sort or hash operations may be running in parallel. Also, several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem and should be tuned appropriately. Setting in too low or too high could have an impact on performance. The default(4MB) for OLTP workloads is a good starting point. It can be increased to a much higher value for non-OLTP workloads.

Similarly, a parallel query using 4 workers may use up to 5 times as much CPU time, memory, I/O bandwidth, and so forth. Defaults to 2. Recommended configuration formax_parallel_workers_per_gather for highly concurrent OLTP workloads spanning several connections is to set to 0 or turn off. For low concurrency, the defaults may suffice. May want to increase slowly and evaluate performance for non-OLTP workloads.

Prevent The Bloat

The importance of removing dead tuples is twofold. Dead tuples not only decrease space utilization but can also lead to database performance issues. When a table has a large number of dead tuples, its size grows much more than it actually needs — usually called bloat. A bloat results in a cascading effect on your database such as a sequential scan on a bloated table has more pages to go through, costing additional I/O and taking longer, a bloated index results in more unnecessary I/O fetches, thus slowing down index lookup and scanning, etc.

For databases that have high volumes of write operations, the growth rate of dead tuples can be high. In addition, the default configuration of autovacuum_max_workers is 3. Recommend monitoring the bloat on the database by inspecting dead tuples across your tables that deal with high concurrency.

-- monitor dead tuples
SELECT relname, n_dead_tup FROM pg_stat_user_tables;

-- monitor auto vacuum
SELECT relname, last_vacuum, last_autovacuum FROM pg_stat_user_tables;

While increasing autovacuum_max_workers maybe needed in some cases it also means increasing resource utilization. Careful tuning might result in overall performance improvement by cleaning up dead tuples faster and being able to keep up.

Write Amplification, fillfactor, and HOT Updates

“ fillfactor for a table is a percentage between 10 and 100. 100 (complete packing) is the default. When a smaller fillfactor is specified, INSERT operations pack table pages only to the indicated percentage; the remaining space on each page is reserved for updating rows on that page”

In a scenario when a row is updated and at least one of the indexed columns is part of the updated columns, PostgreSQL needs to update all indexes on the table to point to the latest version of the row. This phenomenon is called write amplification (also one of the biggest architectural challenges with PostgreSQL to not used clustered indexes that affect performance).

Heap-only Tuples ( HOT ) updates are an efficient way to prevent write amplification within PostgreSQL. Lowering the fillfactor can have a positive impact by increasing the percentage of HOT updates. A lower fillfactor can stimulate more HOT updates i.e. fewer write operations. Since we write less we also generate fewer WAL writes. Another benefit of HOT updates is that they ease the maintenance tasks on the table. After performing a HOT update, the old and new versions of the row are on the same page. This makes the single page cleanup more efficient and the vacuum operation has less work to do.

HOT updates help to limit table and index bloat. Since HOT updates do not update the index at all, we don’t add any bloat to the indexes.

-- review your fillFactor
SELECT 
 pc.relname AS ObjectName,
    pc.reloptions AS ObjectOptions
FROM pg_class AS pc
INNER JOIN pg_namespace AS pns 
 ON pns.oid = pc.relnamespace
WHERE pns.nspname = 'public';

HOT comes with some performance trade-offs that affect read performance for index scans. So carefully reduce the fillfactor (such as 85%) on tables that get a lot of updates and measure performance differences!

Hoping something like the zheap storage engine initiative should help us get past these bottlenecks in the future. Until then we may not be able to prevent the bloat but could certainly minimize the impact.

Enhanced Monitoring

Turning on Enhanced Monitoring can provide you with useful insights at the database host level as well as the process list. Specifically useful if you have to track down a specific process (PID) consuming a lot of CPU and map that to pg_stat_activity for more details on the query. Also gives you a great metric dimension for comparing Read/Write IOPS, memory, etc. to CPU utilization.

Query Plan Management (QPM)

A major cause of response time variability is query plan instability. There are various factors that can unexpectedly change the execution plan of queries. For example:

Change in optimizer statistics (manually or automatically)
Changes to the query planner configuration parameters
Changes to the schema, such as adding a new index
Changes to the bind variables used in the query
Minor version or major version upgrades to the PostgreSQL database version. (Analyze operation isn’t performed after the upgrade to refresh the pg_statistic table?)
Planner configuration options such as default_statistics_target, from_collapse_limit, join_collapse_limit etc.

QPM is a great feature that allows us to manage query plans, prevent plan regressions and improve plan stability. QPM collects plan statistics and allows us the controls needed to approve plans that have a lower cost estimate and/or let Aurora adapt to run the plan with minimal cost automatically.

QPM is available on Amazon Aurora PostgreSQL version 10.5-compatible (Aurora 2.1.0) and later and can be enabled in production (minimal overhead) and or enabled/disabled against your test working-sets with tools such as sysbench. I highly recommend turning this on under your test environments and also practice plan evolution (reviewing and approving plans) before applying QPM in production. Once applied to production a periodic review will be necessary to see if the optimizer has found better plans with a lower cost estimate that needs to be approved.

Anomaly Detection

CPU or resource utilization on your database server that is predictable and can be repeatable for testing purposes are usually easier to deal with and be tuned for performance. The issues that happen once in a while or are not repeatable will need a more careful inspection of metrics and some tools to troubleshoot.

AWS DevOps Guru is one of the alternatives that is ML-based and used to identify anomalies such as increased latency, error rates, and resource constraints and then send alerts with a description and actionable recommendations for remediation. From a database perspective, you could for example alert on high-load wait events (based on the db_load metric) and CPU capacity exceeded. In addition, DevOps Guru can catch anomalies in logs which is a useful feature to have. For example, you could now alert on any abnormal error rate seen in PostgreSQL logs.

Amazon Aurora on PostgreSQL 14

Amazon Aurora announced support for PostgreSQL major version 14 (14.3) recently! PostgreSQL 14 includes performance improvements for parallel queries, heavily-concurrent workloads, partitioned tables, logical replication, and vacuuming. In addition, this release includes enhancements to observability, developer experience, and security.

For workloads that use many connections, the PostgreSQL 14 upgrade has achieved an improvement of 2x on some benchmarks. Heavy workloads, and workloads with many small write operations, also benefit from the new ability to pipeline queries to a database, which can boost performance over high-latency connections. This client-side feature can be used with any modern PostgreSQL database with the version 14 client or a client driver built with version 14 of libpq.

Another big plus in PostgreSQL 14 is that dead tuples are automatically detected and removed even between vacuums, allowing for a reduced number of page splits, which in turn reduces index bloat.

For distributed workloads, the use of logical replication can stream in-progress transactions to subscribers, with performance benefits for applying large transactions.

Please refer to Amazon Aurora PostgreSQL updates for more information.

Right Sizing and Alternative Architecture Patterns

In some cases, you may have a reasonably optimized database and queries but are still looking to find that extra bit of performance improvement. You have a number of choices you could make to better manage your workloads, and improve performance and resource utilization. Often times this means Right Sizing your database, database upgrades, or even choosing a better architecture for your applications. Check out my previous post on Right Sizing for some recommendations on this and more!

Conclusion

Try to keep things simple and start with the built-in tools and extensions available with your database engine on the cloud to quickly pinpoint and fix resource utilization issues. Continue to monitor your databases with slow logs, Cloudwatch, Performance Insights, Enhanced Monitoring, DevOps Guru, or an APM of your choice. Last but not least reduce bloat to have a bigger impact on performance overall. While this post is mostly centered around Aurora PostgreSQL you could achieve similar insights on Aurora Mysql as well as RDS. Lastly, I want to leave some useful references for you to read up on this topic. Good Luck!

References

Thank you to my co-workers Sasha Mandich and Francislainy Campos for their feedback on this post!

Testcontainers for Hashicorp Consul and Vault

Kris Iyer — Wed, 16 Feb 2022 15:59:39 +0000

Testcontainers for Vault, Consul and LocalStack with DIND.

In my earlier post, I touched on some interesting architectural patterns for Configuration and Secret Management for your microservices on k8s. I highly recommend reading if you haven’t already. This article builds on the same idea and highlights the importance of integration tests of your service and how we could leverage Testcontainers to ease out some of these challenges such as divergent test environments and configuration or even mock components used in your tests that may not be enough to catch issues early on in your continuous integration pipelines and provide adequeate code coverage.

Testcontainers

Testcontainers is a JVM library that allows users to run and manage Docker images with Java code and frameworks such as JUnit. The most common use-cases for Testcontainers include integration tests against microservices with external dependencies such as Vault, Consul, AWS, Databases, Cache Frameworks, and more. With an API-based approach, it is easy to manage the lifecycle of a container along with the configuration that may be required for your services.

Integration tests with Testcontainers

Some obvious benefits for integration testing with Testcontainers:

TestContainers can help your tests mirror managed environments and components as closely as possible.
- For sourcing secrets and configuration from Vault (includes the various secret backends)and consul.
- Integration with cloud providers such as AWS (most commonly used services if not all), GCloud (incubating), Azure (incubating).
- Databases such as PostgreSQL or MySQL etc.
- Kafka, RabbitMQ for distributed messaging
- The list goes on. Check out the full list of Testcontainer modules.
Testing compatibility and tech stack upgrades for client libraries and dependencies such as spring cloud (Vault, Consul, AWS), AWS SDK, etc.

Why Vault and Consul?

Secret Management in microservices needs to be high on your priority list to build secure and scalable microservices. Secrets can be sensitive , dynamic, and time-bound. They require proper access control models with audit logs and encryption. We also need to support unique life-cycle policies and rotations for microservices. While there are a few options that may work for your needs, Hashicorp Vault is certainly the most popular and comprehensive solution in this space.

Configuration Management also presents a set of challenges with microservices. Support for static and dynamic configuration, externalized configurations, watching for changes, and updating application configuration without any service disruption are some of the key features we would expect from microservices. In addition, the microservice ecosystem in mature organizations often leads to a web with many microservices deployed but also inter-connected where Service Discovery or even a Service Mesh becomes important in a cloud infrastructure to manage endpoint configuration, load balance, security, etc. Hashicorp Consulis a great fit for meeting these requirements.

Testcontainers with Vault

The Hashicorp Vault Testcontainer module aims to solve your app’s integration testing with Vault. You can use it to source static and dynamic credentials for your application as well as test how your application may behave with Vault by writing different test scenarios in Junit such as corner cases like lease rotations, lease expiry, exception handling, etc.

<dependency>
    <groupId>org.testcontainers</groupId>
    <artifactId>vault</artifactId>
    <version>${testcontainers.version}</version>
    <scope>test</scope>
</dependency>

Testcontainers for Consul

One of the challenges we hit was that there wasn’t a Testcontainer module for consul. Based on some discussions with the test container teams (#4680) we decided to fork off an existing project, polished it a bit, and published this artifact for the OSS community as part of Houghton Mifflin Harcourt. This also turns out to be a tiny milestone in some ways as it is our first (out of many more to come) OSS artifact on Sonatype!!!

<dependency>
 <groupId>com.hmhco</groupId>
 <artifactId>testcontainers-consul</artifactId>
 <version>0.0.4</version>
 <scope>test</scope>
</dependency>

The module supports legacy and newer versions of Consul, ACL, Clustering, and more. The project can be found on Github and we welcome contributions via pull request and/or the discussion forum for any issues or improvements! Eventually, the hope would be to get this module added to testcontainers-java and be supported alongside the rest of the modules.

Some useful pointers when working with Testcontainers

There may be test scenarios that require a Testcontainer to be able to talk to another in the scope of a test. TestContainers networking support makes it easy for you to do that with a generic host address “ host.testcontainers.internal:{port} ” when looking up containers and ports that may be exposed.
An example would be to write a test when integrating Vault and Consul KV via ACL or even Vault integrating with other secret backends such as AWS (localstack) or Database backends (PostgreSQL or similar.)
Supports Junit 4 and Jupiter/Junit 5 and Spock. Choose what’s best for your test framework.
Manage Testcontainer life-cyclesappropriately.
Often you may want to re-use a Testcontainer across your tests. This may also help speed up your test phase.
Use containers for your tests and pipelines. Various patterns including DIND are available.
Running an image for every test method, image per class, or even running one image for all integration test executions. When sharing an image we need to pay close attention to test data and rollback to clean up the state after test execution.
Configure testcontainers.properties when working with private docker registries.
hub.image.name.prefix ={your_private_registry}
Also see image dependencies for what you may need on your private registry for Testcontainers and getting around any Docker Rate Limiting.
Configuring logback to see Testcontainer logs is useful at development time and troubleshooting any issues. You could also stream container logs if you choose to.
Use container labels and image pull policy as appropriate to make sure you benefit from any caching for images that are not changing often. Please see advanced options for more details on this.
Configure wait_timeouts for containers. Also not a bad idea to Assert the container is up before you kick off your tests.

Assert.assertTrue(yourContainer.isRunning()));

Follow the FIRST principle for your tests as defined in the book Clean Code: A Handbook of Agile Software Craftsmanship written by Robert C. Martin. It’s a great read for coding best practices and I highly recommend it!

Getting Started

A quick start for a spring-boot service with Spring Cloud Vault and Consul has been provided.
This example includes integration tests for Consul KV, Vault Secret Backends (KV, Consul, AWS).
It also includes a docker-compose recipe for running all of these integrations on your local development environment.

Summary

The addition of consul as an independent Testcontainer module allows you to integrate Vault and Consul for your test pipelines and add a lot more coverage for your code! I also hope the quickstart examples provided above serve as a good starting point for adding integration tests to your microservice and standardizing your development and test pipelines with best practices for configuration and secret management.

Also, tune in to HashiTalks 2022 on Feb 17/18, 2022 if you are interested in learning more about HashiCorp Vault, Consul, and many more services for your cloud infrastructure.

Speaker card for Hashitalks 2022.

If you would like to learn more about Spring Cloud and integration with Vault and Consul for your microservice please join me at the HashiTalks on Feb 17, 12:05–12:35 GMT. We have a great lineup for speakers and topics this year and looking forward to speaking at the event as well as learning a lot more from the Hashicorp user group!

See you there!

Amazon Aurora and Local Storage

Kris Iyer — Tue, 25 May 2021 08:12:07 +0000

Amazon Aurora RDS Metrics.

“ERROR: could not write block `n` of temporary file: No space left on device.”

Sounds familiar? “No space left on device” is certainly not common when it comes to Amazon Aurora as storage scales automatically up to 128TB and you’re less likely to reach the limit when you scale up your application on a single Amazon Aurora database cluster. No need to delete data or to split the database across multiple instances for storage purposes which is great. What’s going on with Local Storage then? We could be still be left with low local storage or no space leading to failover if we are running databases that are generally used for OLTP but has the need to periodically run fewer but large jobs that push the local storage limits.

This post attempts to shed some more light on Amazon Aurora, the Local Storage architecture as well as some options to improve performance and utilize local storage, lower IO costs as well as not run into local storage limits. For simplicity, I will be focusing on Amazon Aurora for PostgreSQL for engine-specific examples and references on architecture for storage, memory, and optimizations.

Amazon Aurora Storage Architecture and IO

Amazon Aurora is backed by a robust, scalable, and distributed storage architecture. One of the big advantages of Amazon Aurora has been elastic storage that scales with your data eliminating the need for provisioning large storage capacity and utilize some percentage of that. For a while when you deleted data from Aurora clusters, such as by dropping a table or partition, the overall allocated storage space remained the same. Since October 2020, the storage space allocated to your Amazon Aurora database cluster decreases dynamically when you delete data from the cluster. The storage space already automatically increases up to a maximum size of 128 tebibytes (TiB), and will now automatically decrease when data is deleted. Please refer to Dynamic Resizing of Database Storage Space for more details.

Amazon Aurora Storage Architecture.

Storage Types

Amazon Aurora clusters have two types of storage:

Storage used for persistent data (shared cluster storage). For more information, see What the cluster volume contains.
Storage used for temporary data and logs (local storage). All DB files (for example, error logs, temporary files, and temporary tables) are stored in the DB instance local storage. This includes sorting operations, hash tables, grouping operations that SQL queries require, storage that the error logs occupy, and temporary tables that form. Simply put, the Temp space uses the local “ephemeral” volume on the instance.

In general, the components which contribute towards the local storage depend on the engine. For example, on Amazon Aurora for PostgreSQL :

Any temp tables created by PostgreSQL transactions (includes implicit and explicit user-defined temp tables)
Data files
WAL logs
DB logs

Storage architecture for PostgreSQL on RDS differs from Amazon Aurora where temporary tables use persistent storage as opposed to local instance storage. This could potentially be seen as a big change dependent on the nature of your application and workloads. However, ephemeral storage is faster and cheaper (Amazon Aurora does not charge for IO against local storage) than permanent storage, which makes the queries run faster at less cost.

On the flip side, there is also a vertical limit based on the instance size to what can be processed in memory and on local storage on Amazon Aurora. Each Aurora DB instance contains a limited amount of local storage that is determined by the DB instance class. Typically, the amount of local storage is 2X the amount of RAM for your instance class. An example for db.r5 instance classes:

db.r5.large ~ 31 GiB
db.r5.xlarge ~ 62 GiB
db.r5.2xlarge ~ 124 GiB
db.r5.4xlarge ~ 249 GiB
db.r5.8xlarge ~ 499 GiB
db.r5.12xlarge ~ 748 GiB
db.r5.24xlarge ~ 1500 GiB

Free Local Storage metric from an example running sort operations against a large table against default database settings on Amazon Aurora on PostgreSQL.

In a traditional PostgreSQL installation, one could overcome this limitation by creating tablespaces and configuring this to the appropriate storage volume. Aurora PostgreSQL doesn’t have a filesystem for its tablespace. It has a block store and does not allow using the primary data volume that is elastic. While tablespaces are still allowed to be created on Amazon Aurora (and could configure temp_tablespaces) it is mostly there for compatibility.

This is a huge shortcoming in terms of the storage architecture, which could have Amazon Aurora users needing to upgrade the instance class just for this purpose leading to additional costs (Note: IO against local storage is not charged) and left with unused capacity. The preferred solution is to optimize our workloads and database configuration to minimize or completely avoid spillover to disk along with having multiple instances in the cluster for failover in the event where you run low on local disk space.

I hope local instance storage on Amazon Aurora be considered for auto-scaling and/or be made configurable in the future!

IO

As folks migrate to Amazon Aurora it is really useful to understand the IO subsystem from a performance as well as a cost perspective.

I/Os are input/output operations performed by the Aurora database engine against its SSD-based virtualized storage layer. Every database page read operation counts as one I/O. The Aurora database engine issues reads against the storage layer in order to fetch database pages not present in memory in the cache. If your query traffic can be totally served from memory or the cache, you will not be charged for retrieving any data pages from memory. If your query traffic cannot be served entirely from memory, you will be charged for any data pages that need to be retrieved from storage. Each database page is 8KB in Aurora PostgreSQL.

Amazon Aurora was designed to eliminate unnecessary I/O operations in order to reduce costs and to ensure resources are available for serving read/write traffic. Write I/Os are only consumed when persisting write-ahead log records in Aurora PostgreSQL to the storage layer for the purpose of making writes durable. Write I/Os are counted in 4KB units. For example, a log record that is 1024 bytes will count as one write I/O operation. However, if the log record is larger than 4KB, more than one write I/O operation will be needed to persist it. Concurrent write operations whose log records are less than 4KB may be batched together by the Aurora database engine in order to optimize I/O consumption if they are persisted on the same storage protection groups. It is also important to note that, unlike traditional database engines, Aurora never flushes dirty data pages to storage.

PostgreSQL Memory Architecture

It is important to have a good understanding of the PostgreSQL memory components and architecture as we work through SQL optimizations related to performance and IO. Let us take a look at the big pieces next!

A simplified representation of the PostgreSQL memory architecture.

Work_Mem

Probably the most important of all of the buffers in PostgreSQL and requires to be tuned in most cases when it comes to reducing IO or improving query performance. The work_mem value defaults to 4MB in PostgreSQL which means that per Postgres activity (each join, some sorts, etc.) can consume 4MB before it starts overflowing to disk. Honestly, that’s a bit low for many modern use-cases. When Postgres starts writing temp files to disk, obviously things will be much slower than in memory. One size does not fit all and it’s always tough to get the right value for work_mem perfect. A lot depends on your workloads such as fewer connections with large queries (good to tune), or a lot of connections with a lot of smaller queries (defaults will work), or a combination of both (probably a combination of tuning and/or defaults with session overrides). Often a sane default (> PostgreSQL defaults) can be figured out with appropriate testing and effective monitoring.

Note that starting PostgreSQL 10 parallel execution is enabled and could make a significant difference in query processing and resource utilization. While adjusting work_mem we need to factor in the amount of memory that will be used by your process overall. WORK_MEM limits the memory usage of each process! Not just for queries: work_mem * processes * joins could lead to significant memory usage. A few additional parameters to tune along work_mem are listed below:

max_parallel_workers_per_gather — no of workers an executor will use for the parallel execution of a planner node
max_worker_processes — the total number of workers to the number of CPU cores installed on a server
max_parallel_workers — Sets the maximum number of workers that the system can support for parallel queries.

Maintenance_Work_Mem

Defaults to 64MB and is used by maintenance operations, such as VACUUM, CREATE INDEX, CREATE MATERIALIZED VIEW, and ALTER TABLE ADD FOREIGN KEY. Note that the actual memory used depends on the no of auto vacuum workers (Defaults to 3). For better control of memory available for VACUUM operations, we could configure autovacuum_work_mem which defaults to -1 . (uses maintenance_work_mem when set to defaults)

Temp_Buffer

The default is 8MB assuming a BLCKSZ of 8KB. These are session-local buffers used only for access to temporary tables. Note that this value can only be changed before the first use of temporary tables in the session. Any subsequent attempt will have no effect. For data-heavy queries using temporary tables configuring this to a higher value could lead to performance improvements as well as minimizing overflow to disk.

The temp buffers are only used for access to temporary tables in a user session. There is no relation between temp buffers in memory and the temporary files that are created under the pgsql_tmp directory during large sort and hash table operations.

Shared Buffer

The shared_buffer defines how much dedicated system memory PostgreSQL will use for the cache. There is also a difference compared to PostgreSQL on RDS. This is because Aurora PostgreSQL eliminates double buffering and doesn’t utilize file system cache. As a result, Aurora PostgreSQL can increase shared_buffers to improve performance. It’s a best practice to use the default value of 75% (SUM({DBInstanceClassMemory/12038},-50003)) for the shared_buffers DB parameter when using Aurora PostgreSQL. A smaller value can degrade performance by reducing the available memory to the data pages while also increasing I/O on the Aurora storage subsystem. Tuning the shared buffer may be required in some cases (Such as lowering to 50% of available memory) where we run fewer connections that require a larger work_mem configuration on the session for optimal performance.

WAL Buffer

WAL buffers are used to hold write-ahead log (WAL) records that aren’t yet written to storage. The size of the WAL buffer cache is controlled by the wal_buffers setting. Aurora uses a log-based storage engine and changes are sent to storage nodes for persistence. Given the difference in how writes are handled by the Aurora storage engine, this parameter should be left unchanged (Defaults to 16MB)when using Aurora PostgreSQL.

CLOG Buffers

CLOG (commit log) buffers are an area in operating system RAM dedicated to holding commit log pages. The commit log pages contain a log of transaction metadata and differ from the WAL data. The commit logs have the commit status of all transactions and indicate whether or not a transaction has been completed (committed). There is no specific parameter to control this area of memory. This is automatically managed by the database engine in tiny amounts. This is a shared memory component, which is accessible to all the background server and user processes of a PostgreSQL database.

Memory for Locks / Lock Space

This memory space is used to store all heavyweight locks used by the PostgreSQL instance. These locks are shared across all the background server and user processes connecting to the database. A non-default larger setting of two database parameters namely max_locks_per_transaction (Defaults to 64) and max_pred_locks_per_transaction (Defaults to 64) in a way influences the size of this memory component.

Monitoring Queries and disk usage

This is an interesting space and almost all Aurora engines (Mysql/PostgreSQL etc.) have a rich set of capabilities to monitor and provision adequate memory buffers for optimal processing. I’ll be mostly using PostgreSQL as an example for the rest of my post, which IMHO is a great example of comprehensive in-built monitoring and statistics capabilities.

pg_stat_statements

postgres=> \d pg_stat_statements;
                    View "public.pg_stat_statements"
       Column | Type | Collation | Nullable | Default 
--------------------------+------------------+-----------+----------+---------
 userid | oid | | | 
 dbid | oid | | | 
 queryid | bigint | | | 
 query | text | | | 
 calls | bigint | | | 
 total_time | double precision | | | 
 min_time | double precision | | | 
 max_time | double precision | | | 
 mean_time | double precision | | | 
 stddev_time | double precision | | | 
 rows | bigint | | | 
 shared_blks_hit | bigint | | | 
 shared_blks_read | bigint | | | 
 shared_blks_dirtied | bigint | | | 
 shared_blks_written | bigint | | | 
 local_blks_hit | bigint | | | 
 local_blks_read | bigint | | | 
 local_blks_dirtied | bigint | | | 
 local_blks_written | bigint | | | 
 temp_blks_read | bigint | | | 
 temp_blks_written | bigint | | | 
 blk_read_time | double precision | | | 
 blk_write_time | double precision | | |

shared blocks contain data from regular tables and indexes.
local blocks contain data from temporary tables and indexes.
temp blocks contain short-term working data used in sorts, hashes, Materialize plan nodes, and similar cases.

The Other tried and tested formula is to enable log_temp_files on your database server which will log any queries that create any temporary file and also their sizes. Sample Log:

2021-04-10 09:30:15 UTC:xx.xx.x.xx(xxxxx):postgres@example:[xxxxx]: **LOG** : 00000: **temporary file** : path "base/pgsql_tmp/pgsql_tmp20362.65", size **1073741824**  
2021-04-10 09:30:15 UTC:xx.xx.x.xx(xxxxx):postgres@example:[xxxxx]: **CONTEXT** : SQL statement xxxx

Additionally, we could combine these metrics with native tools for a deep dive on log analysis for your engine such as pgbadger for PostgreSQL.

pg_statio

The pg_statio_ views are primarily useful to determine the effectiveness of the buffer cache. When the number of actual disk reads is much smaller than the number of buffer hits, then the cache is satisfying most read requests without invoking a kernel call. However, these statistics do not give the entire story: due to the way in which PostgreSQL handles disk I/O, data that is not in the PostgreSQL buffer cache might still reside in the kernel's I/O cache, and might therefore still be fetched without requiring a physical read.

Useful Tips

Enable auto_explain with auto_explain.log_nested_statements = onallows you to see the duration and the execution plans of the SQL statements inside the function in the PostgreSQL log file.
Enable pg_stat_statements and set the parameter pg_stat_statements.track = all. pg_stat_statements will track information for the SQL statements inside a function. That way you can see metrics at the SQL statement level within your stored procedure or function. In addition, consider configuring pg_stat_statements.max to a higher value than defaults (5000) if you would like to keep a larger dataset to compare. Otherwise, information about the least-executed statements is discarded.
Track queries and cache hit ratios. It is important to get a good hit ratio such as > 90% in most cases.
Track top queries and slice by no of executions, execution time avg, CPU, etc. Often CPU-heavy queries are an indicator of a performance problem. It may or may not be a root cause but could be a symptom of an IO-related issue. We can easily get to see these using pg_stat_statements.
Use pg_statio_all_tables and pg_statio_all_indexes to track IO metrics for tables along with indexes.
It is a common practice to create temporary tables, insert data and then add an index. However, depending on our data set and key length, we could end up overflowing to disk.
A longer Log Retention also means more local storage consumption. Workloads and log retention should be adjusted appropriately. Note that Amazon Aurora does compress (gzip) older logs to optimize when storage is low, and will also delete if it's necessary or if storage is critically low. See PostgreSQL database log files for more details on this. For longer periods of retention, consider offloading logs to cloudwatch and/or your choice of monitoring/log analyzer tool.
Tune conservatively, to begin with, setting shared_buffers and work_mem to some reasonable values based on database size and workload, and setting maintenance_work_mem, effective_cache_size, effective_io_concurrency , temp_buffers and random_page_cost according to your instance size. Note that Amazon Aurora defaults are often based on instance class (memory, CPU, etc). Review carefully and back these configuration updates with a good performance test representative of your data sets and workloads, benchmark results, and iterate.
Eliminate order by clauses where possible. Sorts against queries spanning several joins and/or large tables are one of the most common bottlenecks for disk overflow that end up creating temporary tables. These are mostly query operations that cannot fit under configured memory buffers such as work_mem, temp_buffers, etc. Please see PostgreSQL Memory Architecture for more details. In many cases, applications can deal with unsorted query results (or just do not care) and can help reduce memory utilization and save us from overflowing to disk on the database.
Watch out for the Aggregate strategy as well as the sort method on your explain plan surrounding your group by and order by clause. Shown below is an explain plan for a query against a large table with a group by that uses GroupAggregate with external merge disk for its sort method:

**Finalize GroupAggregate** (cost=92782922.93..126232488.05 rows=116059400 width=119) (actual time=2414955.083..2820682.212 rows=59736353 loops=1)
Group Key: t1.id, t1.name, t1.group
-> **Gather Merge** (cost=92782922.93..122750706.05 rows=232118800 width=119) (actual time=2414955.070..2792088.265 rows=108495098 loops=1)
     Workers Planned: 2
     Workers Launched: 2
    -> **Partial GroupAggregate** (cost=92781922.91..95957437.06 rows=116059400 width=119) (actual time=2417780.397..2660540.817 rows=36173064 loops=3)
      Group Key: t1.id, t1.name, t1.group
        -> Sort (cost=92781922.91..93184906.94 rows=161193612 width=119) (actual time=2417780.369..2607915.169 rows=264507496 loops=3)
          Sort Key: t1.id, t1.name, t1.group
          **Sort Method: external merge Disk: 33520120kB**
          **Worker 0: Sort Method: external merge Disk: 33514064kB  
 Worker 1: Sort Method: external merge Disk: 33903440kB**  
           -> Hash Join (cost=31.67..50973458.94 rows=161193612 width=119) (actual time=123.618..260171.993 rows=264507496 loops=3)
           :::
 Planning Time: 0.413 ms
 **Execution Time: 2977412.203 ms**

Shown below is an explain plan with optimal configuration for work_mem . We can see a huge improvement as this plan usesHashAggregate instead of the GroupAggregate and processes in-memory, thus reduce IO and also improve performance on Amazon Aurora.

**HashAggregate** (cost=79383833.03..80544427.03 rows=116059400 width=119) (actual time=1057250.962..1086043.902 rows=59736353 loops=1)
    Group Key: t1.id, t1.name, t1.group
    -> **Hash Join** (cost=31.67..75515186.35 rows=386864668 width=119) (actual time=272.742..755349.784 rows=793522488 loops=1)
    :::
Planning Time: 0.402 ms
**Execution Time: 1237705.369 ms**

Performance Insights

Amazon Aurora makes many of the key engine-specific metrics available as dashboards. Performance Insights is currently available for Amazon Aurora PostgreSQL compatible Edition, MySQL compatible Edition, PostgreSQL, MySQL, Oracle, SQL Server, and MariaDB.

Performance Insights for Aurora PostgreSQL.

For PostgreSQL stats frompg_stat_statements as well as the active processes and live queries from pg_stat_activity are available on the performance insights dashboard.

Performance Insights can only collect statistics for queries in pg_stat_activity that aren't truncated. By default, PostgreSQL databases truncate queries longer than 1,024 bytes. To increase the query size, change the track_activity_query_size parameter in the DB parameter group associated with your DB instance. When you change this parameter, a DB instance reboot is required.

Testing Query Optimizations and Caching

We always need a good way to test our optimizations and it's quite common that we get different results depending on a test against a cold start vs a test against a warmed up database using cache effectively. This certainly makes it harder to test your optimizations and improvements. However, it is important to note that in runtime you will benefit from caching and should be tested with cache in effect.

There are some good reasons though for testing without cache, and the capabilities to allow that depend on your engine. In Amazon Aurora for Mysql, you could run RESET QUERY CACHE ; between your tests for comparable tests.

In Amazon Aurora for PostgreSQL, we do not deal with any os level cache. I/O is handled by the Aurora storage driver and there is no file system or secondary level of caching for tables or indexes (also another reason to increase your shared_buffer). However, we could use an EXPLAIN (ANALYZE,BUFFERS) to gain insights into your query plan as well as estimate execution times.

postgres=> **EXPLAIN (ANALYZE,BUFFERS) SELECT \* FROM foo;**

QUERY PLAN
------------------------------------------------------------------------ **Seq Scan** on foo ( **cost** =0.00..715.04 **rows** =25004 width=172) (actual time=0.011..3.037 **rows** =25000 **loops** =1)

**Buffers** : shared hit=465
**Planning Time** : 0.038 ms
**Execution Time** : 4.252 ms

We could see how our buffers work for a query and how that translates to caching. For clearing the session cache and the query plan cache, you can use DISCARD PLAN or DISCARD ALL if you want to clear everything. Please see SQL-DISCARD for more information on options.

I hope this post provides you with some insights on Amazon Aurora, the PostgreSQL architecture, challenges with local storage, as well as some useful tips to help you along this journey! Next up, I plan on looking at architecture patterns for transparent read/write splitting on Amazon Aurora as well as benchmarking AWS-provided JDBC driver for PostgreSQL that supports fast failover!

Stay Safe!

Useful References

Thank you Attila Vágó, Francislâiny Campos, and Andrew Brand for your feedback on this post!

AWS STS with Spring Cloud Vault

Kris Iyer — Tue, 30 Mar 2021 08:31:25 +0000

AWS STS with Spring Cloud Vault

In my last post “Spring Boot Configuration and Secret Management Patterns on Kubernetes” I touched on some integration patterns for secret management with Spring Cloud Vault. Along with that I also highlighted that one of the issues I was working on was about enabling AWS STS for S_pring Cloud Vault_. This is now available with spring cloud 2020.0.2!!!

<dependency>
 <groupId>org.springframework.cloud</groupId>
 <artifactId>spring-cloud-dependencies</artifactId>
 <version>2020.0.2</version>
 <type>pom</type>
 <scope>import</scope>
</dependency>

Notice the new Release Train versioning naming convention!

AWS Security Token Service (AWS STS)

AWS Security Token Service (AWS STS) is a web service that enables you to request temporary, limited-privilege credentials for AWS Identity and Access Management (IAM) users or for users that you authenticate (federated users). The key purpose of AWS STS is to allow a user or an application to assume a role and obtain access to AWS services or resources. For more information, see Temporary Security Credentials in the IAM User Guide.

IAM user assuming role via STS.

For applications it's no different, and we could have the application assume the role and request temporary credentials to AWS resources such as EC2, S3, etc. This is where Spring Cloud Vault combined with AWS Secrets backend on Vault provides the capability for a Spring Boot application to use dynamic credentials.

Vault AWS Secret Backend

The AWS secrets engine generates AWS access credentials dynamically based on IAM policies. The AWS IAM credentials are time-based and are automatically revoked when the Vault lease expires.

Vault supports three different types of credentials to retrieve from AWS:

iam_user: Vault will create an IAM user for each lease, attach the managed and inline IAM policies as specified in the role to the user, and if a permissions boundary is specified on the role, the permissions boundary will also be attached. Vault will then generate an access key and secret key for the IAM user and return them to the caller. IAM users have no session tokens and so no session token will be returned. Vault will delete the IAM user upon reaching the TTL expiration.
assumed_role: Vault will call sts:AssumeRole and return the access key, secret key, and session token to the caller.
federation_token: Vault will call sts:GetFederationToken passing in the supplied AWS policy document and return the access key, secret key, and session token to the caller.

More details on the setup can be found under AWS Secrets Engine.

Spring Cloud Vault

The AWS secret engine can be enabled by adding the spring-cloud-vault-config-aws

<dependencies>
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-vault-config-aws</artifactId>
        <version>3.0.2</version>
    </dependency>
</dependencies>

The AWS secret integration now supports the notion of a credential-type which defaults to iam_user for backward compatibility.

Sample iam_user configuration:

spring.cloud.vault:
  aws:
    enabled: true
    role: readonly
    backend: aws
    access-key-property: cloud.aws.credentials.accessKey
    secret-key-property: cloud.aws.credentials.secretKey

enabled setting this value to true enables the AWS backend config usage
role sets the role name of the AWS role definition
backend sets the path of the AWS mount to use
access-key-property sets the property name in which the AWS access key is stored
secret-key-property sets the property name in which the AWS secret key is stored

For AWS STS supported values forcredential-type are assumed_role or federation_token.

Sample assume_role configuration:

spring.cloud.vault:
  aws:
    enabled: true
      role: sts-vault-role
      backend: aws
      credential-type: assumed_role
      access-key-property: cloud.aws.credentials.accessKey
      secret-key-property: cloud.aws.credentials.secretKey
      session-token-key-property: cloud.aws.credentials.sessionToken
      ttl: 3600s
      role-arn: arn:aws:iam::${AWS_ACCOUNT}:role/sts-app-role

New additions STS configuration:

session-token-key-property sets the property name in which the AWS STS security token is stored
credential-type sets the AWS credential type to use for this backend. Defaults to iam_user
ttl sets the TTL for the STS token when using assumed_role or federation_token. Defaults to the TTL specified by the vault role. Min/Max values are also limited to what AWS would support for STS.
role-arn sets the IAM role to assume if more than one is configured for the vault role when using assumed_role

Please read Spring Cloud Vault AWS Backend for more details on this integration.

Lease Rotation and Property Sources

STS credentials default to a TTL of 60 mins. You can adjust the TTL based on your requirement. It's important to note that min/max TTL values allowed are as per what AWS STS would allow for configuration.

For assumed_role we could set that between a minimum of 900 seconds (15 minutes) up to the maximum session duration setting for the role which could be anywhere between 3,600s (1 hour) and 43,200s (12 hours). The default expiration period for federation_token is substantially longer (12 hours instead of one hour compared to assumed_role) and we could specify the duration between 900 seconds (15 minutes) to 129,600 seconds (36 hours).

Spring Cloud Vault managed leases can either be RENEWED (if they are renewable) or ROTATED based on the vault lifecycle configuration.

Sample Vault lifecycle:

vault:
  enabled: true
  host: 127.0.0.1
  port: 8200
  scheme: http
  uri: [http://127.0.0.1:8200/](http://127.0.0.1:8200/)
  config:
    lifecycle:
      min-renewal: 1m
      expiry-threshold: 5m

min-renewal makes sure that the leases are not renewed/rotated too frequently and at least stick around for the configured duration. expiry-threshold is the configured duration before the lease expiry that vault will renew/rotate a lease.

Spring Cloud Vault and the LeaseContainer will make sure the property sources are updated with the new set of credentials upon a Lease Expiry. However, it is the responsibility of the application to make sure any properties updated under the property sources and environment are propagated through any spring beans initialized with credentials.

Let us assume a spring boot application that managed AWS creds through a ConfigurationProperties class such as below:

Let's also assume that there was another Refresh Scope bean that has an autowired dependency for AwsConfigurationProperties.

In this scenario, it becomes important to listen to SecretLeaseCreatedEvent and rebind/refresh the respective configuration properties and any other refresh scoped beans within the application that may need updated properties, such as AWS credentials injected. Let us review how we can achieve this next.

VaultAwsConfiguration shown above registers a lease listener during postConstruct()
Rebinds any ConfigurationProperties (AwsConfigurationProperties) using a ConfigurationPropertiesRebinder
Refreshes any refresh scoped beans (AwsConfiguration, basicAWSCredentials, amazonS3Client) on the ApplicationContext upon receiving a SecretLeaseCreatedEvent.

Wondering why not just use spring cloud aws? If you are, you are absolutely right! At the moment, Spring Cloud AWS v2.3.0 only supports AWS access key and secrets. It also does not integrate with vault and lease events at the moment. I do have an issue logged for supporting STS Session Token and it will be a nice addition for Spring Cloud AWS to integrate with Spring Cloud Vault for its credential manager implementation.

Graceful Shutdown

Spring Cloud Vault performs a revoke on any active lease as the application container shuts down. The application will need to configure appropriate permissions on the vault role to perform sys/leases/revoke so that spring cloud vault could revoke leases.

Something I ran across with Spring Boot 2.4 and legacy bootstrap is that /actuator/refresh ends up closing the context and thus also triggers destroy() on the LeaseContainer resulting in a revoke. There isn’t a fix or a workaround for this yet under legacy bootstrap, but the recommendation is to cut over to using Config Data API. Note that spring config imports are processed in reverse order. For instance, if we were using multiple sources such as vault and consul (with ACL) for imports, and would like vault secrets to be resolved and imported before others, they will have to be set up as:

spring:
  config:
    import: consul://,vault://

Unrelated to STS and vault I have a spring boot issue raised for ordered dependency resolution with config data API where we need property sources updated to be honored before we process imports (Such as the consul ACL token from the vault consul backend).

Known Issues

With Spring cloud v2020.0.2 there is a known issue (java.lang.NoSuchMethodError) that stems from spring-cloud-configdue to an incorrect dependency resolution for spring-vault-core. See Vault core dependency resolution causing java.lang.NoSuchMethodError for more details. This will be corrected in a subsequent release but in the meanwhile, you could implement a workaround to override spring-vault-core version such as:

<dependency>
 <groupId>org.springframework.vault</groupId>
 <artifactId>spring-vault-core</artifactId>
 <version>2.3.2</version>
</dependency>

I hope you enjoyed the read and this helps you on your journey to build secured cloud applications using temporary credentials with AWS STS!

Thank you and stay safe!

Thanks to Attila Vágó, Darragh Grace, and Francislâiny Campos for their feedback on this post!

Forem: Kris Iyer

Demystifying the AWS Advanced JDBC Driver: Pools, Plugins, and the Traps I Hit

Demystifying the AWS Advanced JDBC Driver: Pools, Plugins, and the Traps I Hit

TL;DR

Why I'm writing this

What the AWS Advanced JDBC Driver actually is

Configuration profiles: convenience with teeth

The F0 gotcha: a few hours I won't get back

v3.x: cp- properties and connectionPoolType=hikari

When to use presets vs manual configuration

External pooling vs internal pooling — what each layer is actually doing

External pool (my application's HikariCP, managed by Spring Boot)

Internal pool (managed by the wrapper, one per Aurora instance)

Why both are needed — and the official caveat

So how do you actually disable the external pool?

Trade-offs at a glance

Where I am with this

F0 vs SF_F0: should Spring Boot apps use readWriteSplitting?

HikariCP and virtual threads: a known compatibility issue

Sizing rule

Life of a single SELECT

The plugin catalog, and when I use which

What I always run for Aurora

What I enable conditionally

My default stack for Aurora PG + HikariCP

Aurora with multiple readers: the configuration I'm shipping

Reader host selection

The exception-translation line I almost missed

Performance aspects

Where time actually goes

Metrics I always watch

My sizing calculator

Failover — what happens under the hood

Sequence during a writer failover

Tuning the detection budget

Resilience patterns worth knowing

failoverMode

Connection churn during failover (don't panic)

Read-only traffic during failover

Blue/green deployments

RDS Proxy: when, and how it interacts with this driver

When AWS recommends RDS Proxy

How RDS Proxy actually routes

Plugin compatibility behind RDS Proxy

The srw plugin — SQL-aware splitting through RDS Proxy

Decision tree

Pinning — the multiplexing trap

My take

The checklist I run through before shipping

Where this leaves me

References

AWS Advanced JDBC Wrapper — driver docs

AWS Advanced JDBC Wrapper — plugin docs

AWS Advanced JDBC Wrapper — examples and changelog

RDS Proxy

External

This PHD from AWS Might Save Your Weekend!

The Mystery

What Really Happens During a Forced Upgrade

The Connection Pooling Trap

The Bigger Picture — It’s Not Just RDS

What the Personal Health Dashboard Actually Does

What I Changed After That

Looking Ahead — AI, Automation, and MCP

Why It Matters

TL;DR

Kubernetes Features for operating resilient workloads on Amazon EKS

Introduction

Control Plane

Data Plane

Multi-Region Kubernetes Clusters

Multi-cluster

Compute Resources (Nodes/Node Groups)

Deployment Strategies

Topology Spread Constraints (TSC)

Descheduler

Topology Aware Routing on Amazon EKS

Pod Disruption Budget (PDB)

Pod Disruption Budget (PDB) and Rolling Update Strategy

Pod Priority and Preemption

v3.x: `cp-` properties and `connectionPoolType=hikari`

F0 vs SF_F0: should Spring Boot apps use `readWriteSplitting`?

`failoverMode`

The `srw` plugin — SQL-aware splitting through RDS Proxy